Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 7
Kubernetes @ 4
Linux @ 4
Terraform @ 4
Python @ 7
ArgoCD @ 4
Distributed Systems @ 4
Hiring @ 4
Communication @ 4
Networking @ 4
Debugging @ 4
API @ 4
GPU @ 4
Observability @ 4
AI @ 4
Details
NVIDIA DGX Cloud builds and operates large-scale GPU infrastructure for AI research and production workloads. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.
Responsibilities
- Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
- Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
- Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
- Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
- Participate in on-call, incident response, debugging, and durable follow-up work.
- Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.
Requirements
- 8+ years of experience building or operating production infrastructure.
- Strong programming skills in Python, Go, or similar languages.
- Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
- Ability to troubleshoot distributed systems in production.
- Clear communication and ability to work across teams.
- BS/MS in Computer Science or equivalent experience.
Ways to stand out
- Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
- Experience with SLOs, on-call, incident response, observability, and reliability practices.
- Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure.
Compensation
- Base salary range for Level 4: 184,000 USD - 287,500 USD per year.
- Base salary range for Level 5: 224,000 USD - 356,500 USD per year.
- You will also be eligible for equity and benefits.
Additional information
- Applications for this job will be accepted at least until May 22, 2026.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.