Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 4
Kubernetes @ 4
Linux @ 4
IaC @ 4
Terraform @ 4
Python @ 4
Java @ 4
Distributed Systems @ 4
Communication @ 7
Networking @ 4
GPU @ 4
AI @ 4
Slurm @ 4
Details
NVIDIA's DGX Cloud group is seeking a Senior AI Infrastructure Engineer to design, build, and maintain large-scale production GPU cloud services. The role focuses on building reliable, highly available systems for AI training and inference using a combination of software and systems engineering practices across infrastructure, networking, capacity management, and cloud technologies.
Responsibilities
- Design, build, deploy, and run internal tooling for a large-scale AI training and inferencing platform built on cloud infrastructure.
- Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
- Participate in the full service lifecycle: inception, design, deployment, operation, and refinement.
- Support systems before launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews.
- Maintain services in production by measuring and monitoring availability, latency, and overall system health.
- Scale systems sustainably through automation and advocate for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems; participate in on-call rotations to support production systems.
Requirements
- BS in Computer Science or related technical field (or equivalent experience).
- 5+ years of relevant experience.
- Background in infrastructure automation and distributed systems architecture for managing large-scale private or public cloud platforms in production.
- Experience with one or more languages: Python, Go, C/C++, Java.
- Comprehensive understanding of one or more of: Linux, networking, storage, container technologies.
- Experience with Public Cloud, Infrastructure as Code (IaC) and Terraform.
- Experience with distributed systems and with operating or handling large private and public cloud systems (for example, Kubernetes or Slurm).
- Strong balance of independent initiative and collaboration; strong communication and systematic problem-solving skills.
Ways to Stand Out
- Interest in crafting, analyzing, and fixing large-scale distributed systems.
- Capability to identify issues and improve code performance while automating routine tasks.
- Experience operating large private/public cloud systems based on Kubernetes or Slurm.
Compensation & Benefits
- Base salary range (Level 3): 152,000 USD - 241,500 USD.
- Base salary range (Level 4): 184,000 USD - 287,500 USD.
- Eligible for equity and benefits.
Other Information
- Applications accepted at least until June 8, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity and inclusion.