Principal Software Engineer - Compute Infrastructure

at NVIDIA
USD 248,000-391,000 per year
SENIOR
✅ Hybrid

Required Skills & Competences

Go @ 6, Kubernetes @ 4, Terraform @ 4, Python @ 6, GCP @ 4, ArgoCD @ 4, Leadership @ 7, AWS @ 4, Mathematics @ 4, Networking @ 4, Microservices @ 4, API @ 4, OpenShift @ 4, GPU @ 4, AI @ 4

Details

NVIDIA has been reinventing computer graphics, PC gaming, and accelerated computing for 30 years. Today the company is focused on the next era of computing, powered by AI. This role will lead the architectural vision and operationalization of a massive global compute platform that supports frontier-class internal AI inference systems across on-prem and cloud environments.

Responsibilities

  • Define platform architecture for a global enterprise compute platform running thousands of nodes and tens of thousands of VMs and containers via OpenShift and KubeVirt, including service tiers, SLAs, and automated cluster lifecycles.
  • Operationalize frontier AI infrastructure: develop automated remediation pipelines, hardware watchdogs, and telemetry for pre-release, rack-scale GPU systems (including Blackwell and upcoming architectures).
  • Drive strategic capacity and scale: collect and review system data for capacity planning, plan for hardware supply constraints, and implement strategies such as public cloud bursting, hardware dogfooding, and evaluation of alternative compute architectures (e.g., ARM).
  • Build self-service "paved road" platforms, APIs, and Terraform/OpenTofu providers to enable autonomous engineering teams to adopt standard platforms.
  • Lead complex migrations of massive legacy workloads (including large-scale, long-running VDI environments) into modern Kubernetes orchestration.

Requirements

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
  • 15+ years of experience in compute platform engineering, site reliability, or systems architecture, with a heavy focus on automation at massive scale.
  • Deep expertise in Kubernetes architecture and in designing and deploying virtualization architectures, specifically operating VMs inside Kubernetes (KubeVirt, OpenShift).
  • In-depth knowledge of hardware technologies (GPUs, high-speed backplane networking) and experience mitigating hardware-level failures, silent data corruption, and anomalies at large scale.
  • Experience running large global environments spanning bare metal, virtualized infrastructure, and cloud with a unified GitOps posture (ArgoCD or similar).
  • Proficiency in programming languages such as Go and/or Python; expert-level infrastructure-as-code development (Terraform, configuration management).
  • Strong leadership and influencing skills across highly autonomous engineering teams.

Ways To Stand Out From The Crowd

  • Hands-on experience managing bleeding-edge, pre-release hardware in production environments.
  • Deep understanding of advanced storage migrations and protocols (NFSv4, NVMe/TCP, hyperconverged storage).
  • Solid understanding of microservices architecture and multi-cloud deployment strategies (AWS, GCP).
  • Proven track record building Day 2 operational maturity: self-service, advanced auto-remediation, and strict SLAs.

Compensation & Benefits

  • Base salary range: 248,000 USD - 391,000 USD (determined based on location, experience, and pay of employees in similar positions).
  • Eligible for equity and company benefits.

Application deadline: May 17, 2026.
