Senior AI Infrastructure Software Engineer - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning, GPU

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Kubernetes @ 3, Prometheus @ 3, Python @ 6, Distributed Systems @ 4, TensorFlow @ 3, Communication @ 7, Debugging @ 7, API @ 4, LLM @ 4, PyTorch @ 3, Deep Learning @ 3, Observability @ 3, AI @ 4, InfiniBand @ 7, Agentic AI @ 4, GenAI @ 4, NCCL @ 7, JAX @ 3

Details

Joining NVIDIA's DGX Cloud Lepton Team means contributing to the leading cloud product that powers innovative AI research and supports AI developers. The team builds AI/ML platform components to improve productivity, optimize the efficiency and resiliency of AI workloads, and develop scalable AI infrastructure services globally. This role focuses on designing, building, and maintaining AI platforms that enable large-scale AI training, inference, fine-tuning, and Agentic AI in production.

Responsibilities

Develop platform and tools for large-scale AI, LLM, and GenAI infrastructure.
Develop and optimize tools to improve AI/ML workload efficiency and resiliency.
Triage, analyze, and root-cause failures from the application level to the hardware level.
Enhance infrastructure and products underpinning NVIDIA's AI platforms.
Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform.
Define meaningful and actionable reliability metrics to track and improve system and service reliability.
Apply strong problem-solving, root cause analysis, and optimization skills.

Requirements

Minimum of 8 years of experience in developing software infrastructure for large-scale AI systems.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
Proven track record in building and scaling large-scale distributed systems.
Experience with AI training, inference, and data infrastructure services.
Familiarity with Kubernetes and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
Proficiency in programming languages such as Python, C/C++, and scripting languages.
Excellent communication and collaboration skills; strong culture fit around diversity, intellectual curiosity, problem solving, and openness.

Ways to stand out

Experience working with large-scale AI clusters and cloud-native infrastructure.
Strong understanding of NVIDIA GPUs and network technologies (RDMA, InfiniBand, NCCL).
Familiarity with deep learning frameworks such as PyTorch, TensorFlow, JAX, Dynamo, and Ray.
Experience with root cause analysis of failures at datacenter scale.
Strong background in software design and development.

Compensation & Other Details

Base salary ranges provided by level:
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
You will also be eligible for equity and benefits.
Applications accepted at least until May 16, 2026.
Employment type: Full time.

NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to fostering a diverse work environment.