Senior AI Infrastructure Engineer - DGX Cloud

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Go @ 6, Kubernetes @ 4, Linux @ 4, IaC @ 4, Terraform @ 4, Python @ 6, Java @ 6, Distributed Systems @ 4, Communication @ 7, Mathematics @ 4, Networking @ 4, OpenStack @ 4, SRE @ 4, GPU @ 4

Details

NVIDIA is looking for an outstanding, passionate, and talented Senior AI Infrastructure Engineer to join the DGX Cloud group. This engineering role will design, build, and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. The role requires cross-domain knowledge in systems, networking, coding, databases, capacity management, continuous delivery/deployment, and open-source cloud enabling technologies such as Kubernetes and OpenStack.

DGX Cloud SRE at NVIDIA ensures that internal and external-facing GPU cloud services run with maximum reliability and uptime while enabling changes through careful preparation and capacity/performance management. The organization values diversity, intellectual curiosity, problem solving, and openness, and promotes collaboration, mentorship, and autonomous ownership of meaningful projects.

Responsibilities

  • Design, build, deploy, and run internal tooling for large-scale AI training and inferencing platforms built on top of cloud infrastructure.
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they go live via system design consulting, developing software tools/platforms/frameworks, capacity management, and launch reviews.
  • Maintain services once live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably using automation, and evolve systems by driving changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Participate in on-call rotation to support production systems.

Requirements

  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 6+ years of relevant experience.
  • Proven ability to initiate projects, collaborate with others, and contribute to projects initiated by others.
  • Experience with infrastructure automation and distributed systems design; building tools for running large-scale private or public cloud systems in production.
  • Proficiency in one or more programming languages such as Python, Go, C/C++, or Java.
  • In-depth knowledge of one or more of the following: Linux, networking, storage, and container technologies.
  • Experience with public cloud and Infrastructure as Code (IaC), including Terraform.
  • Demonstrated distributed systems experience.

Ways to stand out

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach, strong communication skills, sense of ownership and drive.
  • Ability to debug and optimize code and automate routine tasks.
  • Experience using or running large private and public cloud systems based on Kubernetes or Slurm.

Benefits and Additional Information

  • Base salary ranges by level:
    • Level 4: 184,000 USD - 287,500 USD per year
    • Level 5: 224,000 USD - 356,500 USD per year
  • Eligible for equity and benefits (see NVIDIA benefits).
  • Applications accepted at least until August 13, 2025.
  • NVIDIA is an equal opportunity employer and fosters a diverse work environment.