Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 6
Kubernetes @ 4
Terraform @ 4
Python @ 6
GCP @ 4
ArgoCD @ 4
Leadership @ 7
AWS @ 4
Mathematics @ 4
Networking @ 4
Microservices @ 4
API @ 4
OpenShift @ 4
GPU @ 4
AI @ 4
Details
NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. Today the company is focused on the next era of computing powered by AI. This role will lead architectural vision and operationalization for a massive global compute platform that supports frontier-class internal AI inference systems across on-prem and cloud environments.
Responsibilities
- Define platform architecture for a global enterprise compute platform running thousands of nodes and tens of thousands of VMs and containers via OpenShift and KubeVirt, including service tiers, SLAs, and automated cluster lifecycles.
- Operationalize frontier AI infrastructure: develop automated remediation pipelines, hardware watchdogs, and telemetry for pre-release, rack-scale GPU systems (including Blackwell and upcoming architectures).
- Drive strategic capacity and scale: collect and review system data for capacity planning, plan for hardware supply constraints, and implement strategies such as public cloud bursting, hardware dogfooding, and evaluation of alternative compute architectures (e.g., ARM).
- Build self-service "paved road" platforms, APIs, and Terraform/OpenTofu providers to enable autonomous engineering teams to adopt standard platforms.
- Lead complex migrations of massive legacy workloads (including large-scale, long-running VDI environments) into modern Kubernetes orchestration.
Requirements
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
- 15+ years of experience in compute platform engineering, site reliability, or systems architecture, with a heavy focus on automation at massive scale.
- Deep expertise in Kubernetes architecture and designing/deploying virtualization architectures, specifically operating VMs inside Kubernetes (KubeVirt, OpenShift).
- In-depth knowledge of hardware technologies (GPUs, high-speed backplane networking) and experience mitigating hardware-level failures, silent data corruption, and anomalies at large scale.
- Experience running large global environments spanning bare metal, virtualized infrastructure, and cloud with a unified GitOps posture (ArgoCD or similar).
- Proficiency in programming languages such as Go and/or Python; expert-level infrastructure-as-code development (Terraform, configuration management).
- Strong leadership and influencing skills across highly autonomous engineering teams.
Ways To Stand Out From The Crowd
- Hands-on experience managing bleeding-edge, pre-release hardware in production environments.
- Deep understanding of advanced storage migrations and protocols (NFSv4, NVMe/TCP, hyperconverged storage).
- Solid understanding of microservices architecture and multi-cloud deployment strategies (AWS, GCP).
- Proven track record building Day 2 operational maturity: self-service, advanced auto-remediation, and strict SLAs.
Compensation & Benefits
- Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits.
Application deadline: May 17, 2026.