Principal Software Engineer - DGX Cloud Kubernetes Runtime Team
Required Skills & Competences
Security, Go, Kubernetes, Distributed Systems, Helm, API, GPU
Details
Join NVIDIA's DGX Cloud Kubernetes Runtime team and be at the forefront of building the next generation of GPU-accelerated Kubernetes runtime distributions. You will design and build automation systems that enable operators to seamlessly install, upgrade, and manage the cluster runtime packages that power NVIDIA's AI accelerators. The team provides a Kubernetes runtime distribution that can be applied to any cluster using NVIDIA accelerators, giving operators automation-first, self-service tools that minimize manual effort while improving reliability and reproducibility.
Location: US (Washington); Remote (US). Employment type: Full time.
Responsibilities
- Design and implement the runtime controller system that manages the lifecycle of runtime packages across thousands of Kubernetes clusters without manual pipeline intervention.
- Build and maintain the runtime builder that packages, validates, and distributes GPU operators, DRA drivers, network components, and other accelerated compute runtime packages.
- Develop Kubernetes controllers, CustomResourceDefinitions (CRDs), and operators that automate runtime installation, upgrade, and rollback operations with API-driven workflows.
- Create expansion rules and component management systems that enable flexible runtime composition across different cloud providers and GPU architectures.
- Work with internal teams to migrate from GitLab pipeline-based deployments to fully automated, controller-powered runtime management.
Requirements
- Experience building production Kubernetes systems with deep expertise in controllers, operators, and CustomResourceDefinitions.
- Strong proficiency in Go and experience building scalable Go services that manage complex distributed systems.
- Hands-on experience with Helm, Kustomize, and managing Kubernetes manifest packaging and templating at scale.
- Deep understanding of Kubernetes architecture including API machinery, admission controllers, and resource lifecycle management.
- Demonstrated ability to design and implement automation systems that replace manual processes with reliable, self-service tooling.
- Master's and/or PhD in Computer Science, or equivalent experience.
- 15+ years of professional experience, including at least 4 years of Kubernetes development.
Preferred / Ways to stand out
- Experience building multi-tenant platform services with a focus on API design, versioning, and backward compatibility.
- Familiarity with OCI registries, artifact signing, SBOM generation, and supply chain security practices.
- Experience working with GPU operators, device plugins, or other hardware acceleration components in Kubernetes.
- Track record of migrating legacy systems to modern, automated platforms while maintaining zero-downtime operations.
- Contributions to upstream Kubernetes projects or experience extending Kubernetes API machinery.
Compensation & Benefits
- Base salary range: 272,000 USD - 425,500 USD (will be determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits (see NVIDIA benefits).
- Applications accepted at least until November 10, 2025.
Company & Diversity
NVIDIA is a leader in AI, High-Performance Computing and Visualization. NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. The company does not discriminate on the basis of any characteristic protected by law.