Senior AI Infrastructure Engineer - DGX Cloud
Required Skills & Competences
Go @ 6, Kubernetes @ 4, Linux @ 4, IaC @ 4, Terraform @ 4, Python @ 6, Java @ 6, Distributed Systems @ 4, Communication @ 7, Mathematics @ 4, Networking @ 4, OpenStack @ 4, SRE @ 4, GPU @ 4
Details
NVIDIA is looking for an outstanding, passionate, and talented Senior AI Infrastructure Engineer to join the DGX Cloud group. In this engineering role, you will design, build, and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. The role requires cross-domain knowledge in systems, networking, coding, databases, capacity management, continuous delivery/deployment, and open-source cloud-enabling technologies such as Kubernetes and OpenStack.
DGX Cloud SRE at NVIDIA ensures that internal and external-facing GPU cloud services run with maximum reliability and uptime while enabling changes through careful preparation and capacity/performance management. The organization values diversity, intellectual curiosity, problem solving, and openness, and promotes collaboration, mentorship, and autonomous ownership of meaningful projects.
Responsibilities
- Design, build, deploy, and run internal tooling for large-scale AI training and inferencing platforms built on top of cloud infrastructure.
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Support services before they go live via system design consulting, developing software tools/platforms/frameworks, capacity management, and launch reviews.
- Maintain services once live by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably using automation, and evolve systems by driving changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Participate in the on-call rotation to support production systems.
Requirements
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
- 6+ years of relevant experience.
- Proven ability to initiate projects, collaborate with others, and contribute to projects initiated by others.
- Experience with infrastructure automation and distributed systems design; building tools for running large-scale private or public cloud systems in production.
- Proficiency in one or more programming languages such as Python, Go, C/C++, or Java.
- In-depth knowledge of one or more of: Linux, networking, storage, and container technologies.
- Experience with public cloud and Infrastructure as Code (IaC), including Terraform.
- Demonstrated distributed systems experience.
Ways to stand out
- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Systematic problem-solving approach, strong communication skills, sense of ownership and drive.
- Ability to debug and optimize code and automate routine tasks.
- Experience using or running large private and public cloud systems based on Kubernetes or Slurm.
Benefits and Additional Information
- Base salary ranges by level:
  - Level 4: 184,000 USD - 287,500 USD per year
  - Level 5: 224,000 USD - 356,500 USD per year
- Eligible for equity and benefits (see NVIDIA benefits).
- Applications accepted at least until August 13, 2025.
- NVIDIA is an equal opportunity employer and fosters a diverse work environment.