Senior Systems Engineer – High-Performance AI And Networking Applications

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Grafana @ 4, Kubernetes @ 3, Prometheus @ 4, R @ 4, Communication @ 4, Networking @ 4, Debugging @ 7, PyTorch @ 4

Details

Join the NVIDIA Deep Learning Frameworks Infrastructure team as a Senior Systems Engineer focusing on High-Performance AI & Networking Applications. This position offers a distinctive opportunity to engage in the latest technology advancements, collaborating closely with elite teams to elevate NVIDIA's impactful innovations.

Responsibilities

  • Collaborate with networking teams to plan, implement, and evaluate performance benchmarks on NVLink, NVSwitch, and InfiniBand-powered infrastructures.
  • Assess findings and work closely with framework, hardware, and support teams to improve system performance across various deep learning workloads.
  • Act as a primary resource for fixing networking and hardware integration issues, focusing on scalable multi-node systems.
  • Maintain high communication standards across multiple engineering, support, and R&D teams, ensuring technical and performance goals are met.
  • Offer technical mentorship and documentation for internal teams and external partners on standard methodologies in HPC networking deployments.
  • Share insights on improving networking strategies for substantial AI and deep learning infrastructure.

Requirements

  • BS/MS or PhD in Computer Science, Engineering, or related field, or equivalent experience.
  • 8+ years of proven experience in AI/HPC infrastructure.
  • Familiarity with AI/HPC job schedulers and orchestrators such as Slurm, Kubernetes (K8s), or LSF. Practical exposure to AI/HPC workflows employing MPI and NCCL.
  • Familiarity with high-speed networking pertaining to HPC including InfiniBand, RDMA, RoCE, and Amazon EFA.
  • Understanding of PyTorch, Megatron-LM, and deep learning inference frameworks such as vLLM and SGLang is essential.
  • Proven experience with InfiniBand, NVLink, and high-speed networking technologies in HPC or large-scale datacenter environments.
  • Experience investigating and evaluating performance in multi-node systems, especially in deep learning or scientific computing tasks.
  • Strong analytical, debugging, and technical communication skills.
  • Comfortable working in collaborative, multi-faceted teams.

Ways To Stand Out

  • Mastery of deep learning frameworks or distributed training systems.
  • Familiarity with datacenter automation, advanced network protocols, and supporting large HPC or AI clusters in production environments.
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
  • Experience with networking and communications libraries such as NCCL, NIXL, NVSHMEM, and UCX.
  • Experience developing or maintaining cluster management and monitoring tools, for example Ansible for infrastructure automation and Prometheus and Grafana for monitoring.

Compensation & Benefits

  • Base salary range (determined by location, experience, and pay of similar roles):
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits.

Additional Information

  • Applications for this job will be accepted at least until November 14, 2025.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.