Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 – basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 – daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 – you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 – exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Kubernetes @ 3
Distributed Systems @ 3
Communication @ 6
Networking @ 3
SRE @ 3
Prioritization @ 6
GPU @ 3
Observability @ 3
AI @ 3
Details
NVIDIA DGX Cloud is building the operating model for reliable, scalable GPU infrastructure across internal, partner, and on-prem environments. This role leads a team of software and production engineers focused on Kubernetes-based operations, automation, reliability, and cluster lifecycle tooling. The manager will run current production systems while building automation and engineering practices for the next generation of DGX Cloud infrastructure.
Responsibilities
- Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments.
- Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response.
- Help define team priorities, roadmap, staffing, and operational ownership.
- Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness.
- Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes.
- Coach engineers, grow technical leaders, and create clear ownership across ambiguous problem spaces.
Requirements
- 8+ years of overall industry experience, including 2+ years leading or managing engineers.
- Experience building or operating production infrastructure, cloud platforms, Kubernetes environments, or distributed systems.
- Strong understanding of reliability engineering, automation, observability, incident response, and operational excellence.
- Ability to work across teams and influence without direct authority.
- Clear communication, strong prioritization, and sound judgment in fast-moving environments.
- BS/MS in Computer Science or equivalent experience.
Ways to stand out
- Experience leading SRE, production engineering, infrastructure automation, or platform teams.
- Experience with GPU infrastructure, Kubernetes fleet operations, GitOps, BMaaS/VMaaS, managed Kubernetes, or multi-cloud environments.
- Track record of reducing toil, improving SLOs, and turning operational work into software-driven systems.
Compensation & Benefits
- Base salary range: 224,000 USD to 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits (see NVIDIA benefits information referenced in the posting).
Other information
- Applications accepted at least until May 31, 2026. This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and provides a diverse work environment.