Senior AI Infrastructure Software Engineer - DGX Cloud

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Machine Learning, GPU

Required Skills & Competences

Kubernetes @ 3, Prometheus @ 3, Python @ 6, Distributed Systems @ 4, TensorFlow @ 3, Communication @ 7, Debugging @ 7, API @ 4, LLM @ 4, PyTorch @ 3, Deep Learning @ 3, Observability @ 3, AI @ 4, InfiniBand @ 7, Agentic AI @ 4, GenAI @ 4, NCCL @ 7, JAX @ 3

Details

Joining NVIDIA's DGX Cloud Lepton Team means contributing to the leading cloud product that powers innovative AI research and development. The team builds AI/ML platform components that improve productivity, optimize the efficiency and resiliency of AI workloads, and deliver scalable AI infrastructure services globally. This role focuses on designing, building, and maintaining AI platforms that enable large-scale AI training, inferencing, fine-tuning, and Agentic AI in production.

Responsibilities

  • Develop platforms and tools for large-scale AI, LLM, and GenAI infrastructure.
  • Develop and optimize tools to improve AI/ML workload efficiency and resiliency.
  • Root-cause, analyze, and triage failures from the application level to the hardware level.
  • Enhance infrastructure and products underpinning NVIDIA's AI platforms.
  • Co-design and implement APIs for integration with NVIDIA's resiliency stacks on the platform.
  • Define meaningful and actionable reliability metrics to track and improve system and service reliability.
  • Apply strong problem-solving, root cause analysis, and optimization skills.

Requirements

  • Minimum of 8 years of experience in developing software infrastructure for large-scale AI systems.
  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
  • Strong debugging skills and experience in analyzing and triaging AI application failures from the application level to the hardware level.
  • Proven track record in building and scaling large-scale distributed systems.
  • Experience with AI training, inferencing, and data infrastructure services.
  • Familiarity with Kubernetes and with operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
  • Proficiency in programming languages such as Python, C/C++, and scripting languages.
  • Excellent communication and collaboration skills; strong culture fit around diversity, intellectual curiosity, problem solving, and openness.

Ways to stand out

  • Experience working with large-scale AI clusters and cloud-native infrastructure.
  • Strong understanding of NVIDIA GPUs and network technologies (RDMA, InfiniBand, NCCL).
  • Familiarity with deep learning frameworks such as PyTorch, TensorFlow, JAX, Dynamo, and Ray.
  • Experience with root cause analysis of failures at datacenter scale.
  • Strong background in software design and development.

Compensation & Other Details

  • Base salary ranges provided by level:
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • You will also be eligible for equity and benefits.
  • Applications accepted at least until May 16, 2026.
  • Employment type: Full time.

NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to fostering a diverse work environment.