Senior AI Infrastructure Engineer - DGX Cloud

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

  • Go @ 4
  • Kubernetes @ 4
  • Linux @ 4
  • IaC @ 4
  • Terraform @ 4
  • Python @ 4
  • Java @ 4
  • Distributed Systems @ 4
  • Communication @ 7
  • Networking @ 4
  • GPU @ 4
  • AI @ 4
  • Slurm @ 4

Details

NVIDIA's DGX Cloud group is seeking a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production GPU cloud services. The role focuses on building reliable, highly available systems for AI training and inference using a combination of software and systems engineering practices across infrastructure, networking, capacity management, and cloud technologies.

Responsibilities

  • Design, build, deploy, and run internal tooling for a large-scale AI training and inference platform built on cloud infrastructure.
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Participate in the full service lifecycle: inception, design, deployment, operation, and refinement.
  • Support systems before launch through system design consulting; development of software tools, platforms, and frameworks; capacity management; and launch reviews.
  • Maintain services in production by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through automation and advocate for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems; participate in on-call rotations to support production systems.

Requirements

  • BS in Computer Science or related technical field (or equivalent experience).
  • 5+ years of relevant experience.
  • Background in infrastructure automation and distributed systems architecture for managing large-scale private or public cloud platforms in production.
  • Experience with one or more languages: Python, Go, C/C++, Java.
  • Comprehensive understanding of one or more of: Linux, networking, storage, container technologies.
  • Experience with Public Cloud, Infrastructure as Code (IaC) and Terraform.
  • Distributed systems experience, including operating large private or public cloud systems (for example, systems based on Kubernetes or Slurm).
  • Strong balance of independent initiative and collaboration; strong communication and systematic problem-solving skills.

Ways to Stand Out

  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Capability to identify issues and improve code performance while automating routine tasks.
  • Experience operating large private/public cloud systems based on Kubernetes or Slurm.

Compensation & Benefits

  • Base salary range (Level 3): 152,000 USD - 241,500 USD.
  • Base salary range (Level 4): 184,000 USD - 287,500 USD.
  • Eligible for equity and benefits.

Other Information

  • Applications accepted at least until June 8, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity and inclusion.