Vacancy is archived. Applications are no longer accepted.
Required Skills & Competences
Go @ 6, Grafana @ 4, Kubernetes @ 4, Linux @ 4, Prometheus @ 4, Terraform @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, TensorFlow @ 3, AWS @ 4, Azure @ 4, Communication @ 7, Mentoring @ 4, SRE @ 7, KubeFlow @ 3, Rust @ 6, System Architecture @ 4, PyTorch @ 3, GPU @ 4
Details
NVIDIA is a technology leader in AI, accelerated computing, and graphics. This role will architect, lead, and scale globally distributed production systems that support AI/ML, HPC, and critical engineering platforms across hybrid and multi-cloud environments. The position emphasizes automation, reliability, observability, and mentoring/leading global teams to deliver resilient, high-velocity platforms for GPU-heavy AI workloads.
Responsibilities
- Architect, lead, and scale globally distributed production systems for AI/ML, HPC, and engineering platforms across hybrid and multi-cloud environments.
- Design and implement automation frameworks to reduce manual work, improve resilience, and standardize system health, change safety, and release velocity processes.
- Define and evolve platform-wide reliability metrics, capacity forecasting strategies, and uncertainty testing for distributed systems.
- Lead cross-organizational efforts to assess operational maturity, mitigate systemic risks, and establish long-term reliability strategies with engineering, infrastructure, and product teams.
- Influence NVIDIA's AI platform roadmap through co-development with internal partners and external vendors; stay current with academic and industry advances.
- Publish technical insights (papers, patents, whitepapers) and drive production engineering and system design innovation.
- Lead and mentor global teams technically, contribute to recruitment and design reviews, and develop standard methodologies for incident response, observability, and system architecture.
Requirements
- 15+ years of experience in Site Reliability Engineering (SRE), Production Engineering, or Cloud Infrastructure, with a strong track record of leading platform-scale efforts and high-impact programs.
- Deep expertise in Linux/Unix systems engineering and public/private cloud platforms (AWS, GCP, Azure, OCI).
- Expert-level programming in Python and proficiency in one or more of C++, Go, or Rust.
- Demonstrated experience operating Kubernetes at scale, including CPU/GPU scheduling, microservice orchestration, and container lifecycle management in production.
- Hands-on expertise with observability frameworks (Prometheus, Grafana, ELK, Loki, etc.) and Infrastructure as Code tools (Terraform, CDK, Pulumi).
- Proficiency in SRE concepts such as error budgets, SLOs, distributed tracing, and architectural fault tolerance.
- Strong written and verbal communication skills, with the ability to influence cross-functional collaborators and drive technical decisions.
- Proven ability to execute long-term, forward-looking platform strategies.
- Degree in Computer Science or related field, or equivalent experience.
Ways to Stand Out (Preferred / Nice-to-have)
- Hands-on experience building platforms for large-scale AI training, inferencing, and data movement pipelines.
- Familiarity with deep learning frameworks (PyTorch, TensorFlow, JAX) and orchestration frameworks (Ray, Kubeflow).
- Expertise in hardware fleet observability, predictive failure analysis, and power/resource-aware scheduling.
- Experience leading operational readiness and reliability engineering in GPU-heavy environments.
- Track record of improving incident management culture, root cause analysis, and postmortem processes across large teams.
Compensation & Benefits
- Base salary range: 272,000 USD - 425,500 USD (final base depends on location, experience, and internal pay comparisons).
- You will also be eligible for equity and a comprehensive benefits package.
Additional Information
- Applications for this job will be accepted at least until August 3, 2025.
- NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.