Senior Datacenter System Software Architect - DGX Cloud

at NVIDIA
USD 224,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker @ 4, Go @ 4, Kubernetes @ 4, Linux @ 4, Terraform @ 4, Python @ 4, Distributed Systems @ 3, Machine Learning @ 4, TensorFlow @ 7, Hiring @ 4, Communication @ 3, Parallel Programming @ 4, Rust @ 4, Microservices @ 4, Debugging @ 7, PyTorch @ 7, CUDA @ 4, GPU @ 7

Details

NVIDIA is hiring engineers to scale up its AI infrastructure. This role is on the DGX Cloud Software Team and focuses on architecting, designing and implementing next-generation DGX Cloud clusters. The position involves full-stack deployment work across hardware architecture, workload orchestration, software release and application performance tuning for large-scale AI training infrastructure. The team expects a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities.

Responsibilities

  • Lead technical activities for data centers, with a focus on hybrid deployments spanning cloud and on-premises environments.
  • Provide expertise in infrastructure workflows, including hardware, software release, workload orchestration and application tuning.
  • Deliver fast and creative solutions for complex problems and write clear, reliable architecture specifications.
  • Translate requirements into vision, architecture and roadmap.
  • Collaborate with engineering teams across NVIDIA to ensure software integrates from hardware up to AI training applications.

Requirements

  • Master's or PhD in Computer Science, Computer Engineering, Physics or equivalent experience.
  • 12+ years of experience in system software and infrastructure engineering.
  • Coursework or background in Data Science, Deep Learning, or Machine Learning.
  • Ability to seamlessly shift between Linux system environments and Python programming.
  • Programming skills in one or more high-level languages (for example C, C++, Go or Rust).
  • System-level experience spanning both hardware and software.
  • Strong design, coding, analytical, debugging and problem-solving skills.
  • Motivated self-starter with customer-facing communication skills; ability to work concurrently with multiple groups locally and abroad.

Ways to stand out

  • Experience with GPU deep learning and data science; experience using TensorFlow, PyTorch or other deep learning frameworks.
  • Experience with Docker containers, Slurm, Terraform and Kubernetes for orchestration and deployment.
  • CUDA programming and NCCL experience.
  • HPC programming experience including MPI, OpenACC, or other parallel programming tools.
  • Hands-on experience with DGX Cloud, NVIDIA AI Enterprise Software, Base Command Manager, NeMo and NVIDIA Inference Microservices.
  • Interest in crafting, analyzing and fixing large-scale distributed systems; systematic problem-solving approach and strong ownership.

Benefits

  • Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and comparable roles).
  • Eligible for equity and additional benefits (see NVIDIA benefits page).

Additional information

  • Full-time role. Location listed as Santa Clara, CA, United States.
  • Applications accepted at least until August 13, 2025.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.