Senior Datacenter System Software Architect - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 224,000-356,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker @ 4
Go @ 4
Kubernetes @ 4
Linux @ 4
Terraform @ 4
Python @ 4
Distributed Systems @ 3
Machine Learning @ 4
TensorFlow @ 7
Hiring @ 4
Communication @ 3
Parallel Programming @ 4
Rust @ 4
Microservices @ 4
Debugging @ 7
PyTorch @ 7
CUDA @ 4
GPU @ 7

Details

NVIDIA is hiring engineers to scale up its AI infrastructure. This role is on the DGX Cloud Software Team and focuses on architecting, designing and implementing next-generation DGX Cloud clusters. The position involves full-stack deployment work across hardware architecture, workload orchestration, software release and application performance tuning for large-scale AI training infrastructure. The team expects a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities.

Responsibilities

Lead technical activities for data centers with a focus on hybrid deployments between cloud and on-prem.
Provide expertise in infrastructure workflows, including hardware, software release, workload orchestration and application tuning.
Deliver fast and creative solutions for complex problems and write clear, reliable architecture specifications.
Translate requirements into vision, architecture and roadmap.
Collaborate with engineering teams across NVIDIA to ensure software integrates from hardware up to AI training applications.

Requirements

Master's or PhD in Computer Science, Computer Engineering, Physics or equivalent experience.
12+ years of experience in system software and infrastructure engineering.
Coursework or background in Data Sciences, Deep Learning, or Machine Learning.
Ability to seamlessly shift between Linux system environments and Python programming.
Programming skills in one or more high-level languages (for example C, C++, Go or Rust).
System-level experience spanning both hardware and software.
Strong design, coding, analytical, debugging and problem-solving skills.
Motivated self-starter with customer-facing communication skills; ability to work concurrently with multiple groups locally and abroad.

Ways to stand out

Experience with GPU deep learning and data sciences; experience using TensorFlow, PyTorch or other deep learning frameworks.
Experience with Docker containers, Slurm, Terraform and Kubernetes for orchestration and deployment.
CUDA programming and NCCL experience.
HPC programming experience including MPI, OpenACC, or other parallel programming tools.
Hands-on experience with DGX Cloud, NVIDIA AI Enterprise Software, Base Command Manager, NeMo and NVIDIA Inference Microservices.
Interest in crafting, analyzing and fixing large-scale distributed systems; systematic problem-solving approach and strong ownership.

Benefits

Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and comparable roles).
Eligible for equity and additional benefits (see NVIDIA benefits page).

Additional information

Full-time role. Location listed as Santa Clara, CA, United States.
Applications accepted at least until August 13, 2025.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.