Senior Datacenter System Software Architect - DGX Cloud
at NVIDIA
Santa Clara, United States
USD 224,000-356,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Docker @ 4, Go @ 4, Kubernetes @ 4, Linux @ 4, Terraform @ 4, Python @ 4, Distributed Systems @ 3, Machine Learning @ 4, TensorFlow @ 7, Hiring @ 4, Communication @ 3, Parallel Programming @ 4, Rust @ 4, Microservices @ 4, Debugging @ 7, PyTorch @ 7, CUDA @ 4, GPU @ 7
Details
NVIDIA is hiring engineers to scale up its AI infrastructure. This role is on the DGX Cloud Software Team and focuses on architecting, designing and implementing next-generation DGX Cloud clusters. The position involves full-stack deployment work across hardware architecture, workload orchestration, software release and application performance tuning for large-scale AI training infrastructure. The team expects a strong programming background, a deep understanding of distributed systems, familiarity with software testing and deployment, and excellent communication and planning abilities.
Responsibilities
- Lead technical activities for data centers with a focus on hybrid deployments between cloud and on-prem.
- Provide expertise in infrastructure workflows, including hardware, software release, workload orchestration and application tuning.
- Deliver fast and creative solutions for complex problems and write clear, reliable architecture specifications.
- Translate requirements into vision, architecture and roadmap.
- Collaborate with engineering teams across NVIDIA to ensure software integrates from hardware up to AI training applications.
Requirements
- Master's or PhD in Computer Science, Computer Engineering, Physics or equivalent experience.
- 12+ years of experience in system software and infrastructure engineering.
- Coursework or background in Data Sciences, Deep Learning, or Machine Learning.
- Ability to seamlessly shift between Linux system environments and Python programming.
- Programming skills in one or more high-level languages (for example C, C++, Go or Rust).
- System-level experience spanning both hardware and software.
- Strong design, coding, analytical, debugging and problem-solving skills.
- Motivated self-starter with customer-facing communication skills; ability to work concurrently with multiple groups locally and abroad.
Ways to stand out
- Experience with GPU deep learning and data sciences; experience using TensorFlow, PyTorch or other deep learning frameworks.
- Experience with Docker containers, Slurm, Terraform and Kubernetes for orchestration and deployment.
- CUDA programming and NCCL experience.
- HPC programming experience including MPI, OpenACC, or other parallel programming tools.
- Hands-on experience with DGX Cloud, NVIDIA AI Enterprise Software, Base Command Manager, NEMO and NVIDIA Inference Microservices.
- Interest in crafting, analyzing and fixing large-scale distributed systems; systematic problem-solving approach and strong ownership.
Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and comparable roles).
- Eligible for equity and additional benefits (see NVIDIA benefits page).
Additional information
- Full time role. Location listed as Santa Clara, CA, United States.
- Applications accepted at least until August 13, 2025.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.