Senior Site Reliability Engineer - HPC

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Go @ 6, Kubernetes @ 4, Ruby @ 6, IaC @ 4, Python @ 6, GCP @ 4, CI/CD @ 6, Leadership @ 6, AWS @ 4, Communication @ 7, Mentoring @ 6, Perl @ 6, SRE @ 4, Debugging @ 7, Observability @ 7, AI @ 4, Slurm @ 4, HPC @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. NVIDIA is leading developments in Artificial Intelligence, High-Performance Computing and Visualization. The company is looking for a Senior SRE to join the Compute Farm team to build the next generation of a global services platform and keep critically important systems running.

Responsibilities

  • Own SRE solutions end-to-end, from design and implementation to operation and continuous improvement, ensuring they integrate cleanly with HPC schedulers, storage, and network fabrics.
  • Use Infrastructure-as-Code (IaC) and configuration management to standardize and automate provisioning across all environments.
  • Deliver solutions in a globally distributed, hybrid multi-cloud environment (on-prem, AWS, GCP, OCI).
  • Design for failure with redundancy, failure domains, progressive delivery, and strict change control.
  • Ensure the highest level of uptime and Quality of Service (QoS) for internal customers through operational excellence.
  • Conduct capacity management and planning to meet ongoing operational needs.
  • Detect performance issues and recommend solutions to maintain world-class service quality.
  • Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
  • Participate in on-call rotations and incident reviews, assist in root cause identification, and produce high-quality RCA reports.

Requirements

  • B.S. degree in Computer Science or a related technical field (or equivalent experience) with 5+ years of professional experience building and supporting critical services.
  • Experience supporting large-scale HPC clusters using Slurm, LSF, or Kubernetes, including setup, tuning, and troubleshooting.
  • Proficiency in modern CI/CD techniques and Infrastructure as Code (IaC) for managing services.
  • Strong experience crafting large-scale infrastructure platforms for automated host lifecycle management, fleet reliability/auto-healing, end-to-end observability, or data-driven operations (AIOps/ML-driven signals) that materially reduce manual intervention.
  • Proficient in monitoring, metrics, container management, and log collection tools.
  • 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Perl, or Ruby.
  • Experience mentoring other engineers and influencing technical direction through design reviews, architecture documents, and partnership with product and leadership.
  • Creative problem solver with excellent debugging skills and strong communication and documentation abilities.

Ways to stand out

  • Published technical write-ups or talks (conference presentations, meetups, engineering blogs) that deep-dive into real-world reliability, observability, or large-scale HPC/SRE problems and their solutions.
  • Maintainer or co-maintainer responsibilities for an open source component used in production (plugins, operators, exporters, controllers, or SDKs) at large scale.

Compensation and Other Information

  • Base salary range (location- and level-dependent): 152,000 USD - 241,500 USD for Level 3; 184,000 USD - 287,500 USD for Level 4.
  • You will also be eligible for equity and benefits.
  • Applications accepted at least until March 1, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to fostering a diverse work environment.