Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 7
Kubernetes @ 4
Linux @ 4
Terraform @ 4
Python @ 7
ArgoCD @ 4
Distributed Systems @ 4
Communication @ 4
Networking @ 4
Debugging @ 4
API @ 4
GPU @ 4
Observability @ 4
AI @ 4
Details
NVIDIA DGX Cloud is building and operating large-scale GPU infrastructure for AI research and production workloads. This role is part of a production engineering team focused on Kubernetes-based infrastructure, GPU cluster operations, reliability, automation, GitOps, and Day 2 operability across DGX Cloud environments.
Responsibilities
- Build and operate automation for large-scale GPU clusters across NVIDIA Cloud Partners (NCP) and on-prem environments.
- Develop tools and services for provisioning, validation, upgrades, monitoring, repair, and cluster lifecycle operations.
- Improve Day 0 / Day 1 / Day 2 workflows for cluster bring-up, handoff, and production operations.
- Reduce manual production touches through APIs, GitOps, automation, and agent-assisted workflows.
- Participate in on-call, incident response, debugging, and durable follow-up work.
- Partner with platform, storage, networking, security, and workload teams to make infrastructure production-ready.
Requirements
- 8+ years of experience building or operating production infrastructure.
- Strong programming skills in Python, Go, or similar.
- Experience with Linux, Kubernetes, containers, cloud infrastructure, or infrastructure automation.
- Ability to troubleshoot distributed systems in production.
- Clear communication and ability to work across teams.
- BS/MS in Computer Science or equivalent experience.
Ways to stand out
- Experience with GPU infrastructure, Kubernetes operators, GitOps, Terraform, ArgoCD, or fleet automation.
- Experience with SLOs, on-call, incident response, observability, and reliability practices.
- Exposure to BMaaS, VMaaS, managed Kubernetes, or multi-cloud infrastructure.
Compensation & Benefits
- Base salary ranges provided by location and level:
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits (link provided in original posting).
Other information
- Applications for this job will be accepted at least until June 8, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.