Senior AI Infrastructure Engineer - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The proficiency levels are:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know the pitfalls and tricks;

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 4, Kubernetes @ 4, Linux @ 4, IaC @ 4, Terraform @ 4, Python @ 4, Java @ 4, Distributed Systems @ 4, Communication @ 7, Networking @ 4, GPU @ 4, AI @ 4, Slurm @ 4

Details

NVIDIA's DGX Cloud group is seeking a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production GPU cloud services. The role focuses on building reliable, highly available systems for AI training and inference using a combination of software and systems engineering practices across infrastructure, networking, capacity management, and cloud technologies.

Responsibilities

Design, build, deploy, and run internal tooling for a large-scale AI training and inference platform built on cloud infrastructure.
Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
Participate in the full service lifecycle: inception, design, deployment, operation, and refinement.
Support systems before launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews.
Maintain services in production by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through automation and advocate for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems; participate in on-call rotations to support production systems.

Requirements

BS in Computer Science or related technical field (or equivalent experience).
5+ years of relevant experience.
Background in infrastructure automation and distributed systems architecture for managing large-scale private or public cloud platforms in production.
Experience with one or more languages: Python, Go, C/C++, Java.
Comprehensive understanding of one or more of: Linux, networking, storage, and container technologies.
Experience with Public Cloud, Infrastructure as Code (IaC) and Terraform.
Experience with distributed systems and with operating large private and public cloud systems (for example, those based on Kubernetes or Slurm).
Strong balance of independent initiative and collaboration; strong communication and systematic problem-solving skills.

Ways to Stand Out

Interest in crafting, analyzing, and fixing large-scale distributed systems.
Capability to identify issues and improve code performance while automating routine tasks.
Experience operating large private/public cloud systems based on Kubernetes or Slurm.

Compensation & Benefits

Base salary range (Level 3): 152,000 USD - 241,500 USD.
Base salary range (Level 4): 184,000 USD - 287,500 USD.
Eligible for equity and benefits.

Other Information

Applications accepted at least until June 8, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity and inclusion.