Senior Datacenter System Software Architect - DGX Cloud
at NVIDIA
USD 184,000-356,500 per year
Required Skills & Competences
Docker @ 4, Go @ 4, Kubernetes @ 4, Linux @ 7, Terraform @ 4, Python @ 7, Distributed Systems @ 4, Machine Learning @ 4, TensorFlow @ 3, Hiring @ 4, Communication @ 7, Parallel Programming @ 4, Rust @ 4, Microservices @ 4, Debugging @ 7, PyTorch @ 3, CUDA @ 4, GPU @ 3
Details
NVIDIA is hiring engineers to scale up its AI infrastructure. The team seeks a highly motivated, creative engineer with strong system software experience to join the DGX Cloud Software Team. You will lead the architecture, design, and implementation of next-generation DGX Cloud clusters, working across hardware architecture, workload orchestration, and application performance tuning. The role carries full-stack deployment responsibilities, from hardware up to AI training applications, including hybrid deployments that span cloud and on-prem.
Responsibilities
- Lead technical activities for data centers with a focus on hybrid cloud and on-prem deployments.
- Provide expertise in infrastructure workflows including hardware, software release, workload orchestration, and application tuning.
- Produce clear, reliable architecture specifications and translate requirements into vision, architecture, and roadmap.
- Work cross-functionally with engineering teams across NVIDIA to ensure software integrates from hardware up to AI training applications.
- Provide fast and creative solutions for complex problems and drive execution.
Requirements
- Master's or PhD in Computer Science, Computer Engineering, Physics, or equivalent experience.
- 9+ years of relevant experience.
- Coursework or background in Data Sciences, Deep Learning, or Machine Learning.
- Strong programming background and ability to move between Linux system environments and Python programming.
- Programming skills in one or more high-level languages, such as C, C++, Go, or Rust.
- System-level experience with both hardware and software.
- Familiarity with software testing and deployment processes.
- Strong design, coding, analytical, debugging, and problem-solving skills.
- Motivated self-starter with strong problem-solving and customer-facing communication skills.
- Passion for continuous learning and ability to collaborate with multiple groups across organizations and geographies.
Ways to stand out
- Experience with GPU deep learning and data science; familiarity with frameworks such as TensorFlow or PyTorch.
- Experience using Docker containers, Slurm, Terraform, and Kubernetes.
- CUDA programming and NCCL experience.
- HPC programming experience including MPI, OpenACC, or other parallel programming tools.
- Hands-on experience with DGX Cloud, NVIDIA AI Enterprise, Base Command Manager, NeMo, and NVIDIA Inference Microservices.
- Interest in analyzing and fixing large-scale distributed systems and a systematic problem-solving approach.
Compensation & Benefits
- Base salary range provided by location and level:
  - Level 4: $184,000 - $287,500 USD
  - Level 5: $224,000 - $356,500 USD
- Eligible for equity and additional benefits (see NVIDIA benefits).
Other details
- Full-time role. Applications accepted at least until August 19, 2025.
- NVIDIA is an equal opportunity employer committed to a diverse work environment.