Senior Systems Software Engineer, Kubernetes Scale - DGX Cloud
at Nvidia
USD 184,000-356,500 per year
Used Tools & Technologies
Go
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others, and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Kubernetes @ 4
Python @ 6
GCP @ 4
CI/CD @ 4
Distributed Systems @ 4
AWS @ 4
Azure @ 4
Communication @ 4
Networking @ 7
Reporting @ 4
GPU @ 4
AI @ 4
Details
The DGX Cloud organization at NVIDIA brings together cutting-edge hardware and software innovation to deliver industry-leading accelerated computing for the world's most adventurous AI workloads. The team focuses on scaling AI infrastructure and optimizing total cost of ownership for large-scale AI workloads.
Responsibilities
- Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving, following issues from orchestration down to the metal.
- Collaborate with AI researchers, developers, and customers to develop innovative, automated tests that simulate real user workloads using custom-built and open-source tools and frameworks.
- Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and the NVIDIA software stack, to identify and resolve root causes.
- Design and develop monitoring, reporting and analysis tools for performance and scale testing across software, GPU and CPU resources.
- Triage, debug and root-cause issues related to operating Kubernetes clusters at ultra-large scale, ensuring reliability and efficiency.
- Build and maintain a high-velocity framework that enables continuous, always-on performance and scale testing via a modern CI/CD pipeline.
- Document research, methodologies and results clearly and concisely, and present findings internally and externally (e.g., KubeCon, GTC).
- Engage with upstream communities including Kubernetes, CNCF and NVIDIA open-source projects to validate performance and scalability of AI workloads and influence design decisions.
Requirements
- 8+ years of experience in computer architecture, networking, storage systems, and accelerators, plus a Bachelor's or Master's degree in Engineering (Electrical Engineering, Computer Engineering, Computer Science) or equivalent experience.
- Expertise in Kubernetes and familiarity with related CNCF projects.
- Background working with large-scale parallel and distributed accelerator-based systems.
- Expertise optimizing performance and AI workloads on large-scale systems.
- Experience with performance modeling and benchmarking at scale.
- Proficiency in Golang and Python.
- Background with the NVIDIA software ecosystem in both training and inference domains.
- Expertise with at least one public cloud provider (for example, GCP, AWS, Azure, or OCI).
Ways to stand out
- Strong operational experience with any Kubernetes distribution.
- Prior experience scaling Kubernetes clusters to ultra-large node and object counts.
- Demonstrated history of working in the open-source community.
- Excellent communication and interpersonal abilities.
- PhD in relevant areas.
Compensation & Benefits
- Base salary ranges (dependent on level and location):
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits (see NVIDIA benefits page).
Additional information
- Employment type: Full time
- Location note: #LI-Hybrid
- Applications accepted at least until June 14, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.