Director, Site Reliability and Software Engineering - DGX Cloud

at NVIDIA
USD 320,000-575,000 per year
SENIOR
✅ On-site

Used Tools & Technologies

SRE

Required Skills & Competences

Security @ 6, Linux @ 7, DevOps @ 4, Distributed Systems @ 4, Leadership @ 4, People Management @ 4, Mentoring @ 4, Product Management @ 4, Reporting @ 4, Engineering Management @ 8, Cloud Computing @ 4, GPU @ 4, AI @ 4

Details

NVIDIA is the AI computing company driving modern AI with GPUs. The NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform used by data scientists and researchers to build, train, and deploy neural network models. The DGX Cloud Computing team is looking for a leader to manage software, automation, and operations of multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy.

Responsibilities

  • Manage a team of Software and Site Reliability engineers, including program development, task planning and code reviews.
  • Define team strategy and roadmap; drive adoption of scalable SDLC practices, test infrastructure, and modern engineering practices within NVIDIA’s DGX Cloud Computing environment.
  • Drive technical projects and provide leadership in an innovative and fast-paced environment.
  • Be responsible for the overall planning, tracking and success of technical projects.
  • Work closely with project and product management teams to ensure best-in-class product development.
  • Contribute technically to projects for DGX Cloud Computing Services.
  • Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
  • Drive decision-making, visibility, and operational rigor across business analytics initiatives such as budget reporting and project & portfolio reporting; lead executive reporting, dashboards, and operational CTO metrics focused on continuous improvement.

Requirements

  • 12+ years of overall experience in engineering management; 5+ years of leadership experience.
  • Bachelor’s or Master’s degree in Computer Science or equivalent experience.
  • Experience designing and implementing large-scale distributed systems.
  • Experience with containers, virtualization environments, and cluster solutions; experience managing Technical Support / DevOps teams.
  • Strong knowledge of Unix/Linux.
  • Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.
  • Demonstrated people management and leadership skills with a proven track record of mentoring and coaching team members.
  • Ability to quickly learn and evaluate new technologies and to influence and establish relationships with other software and IT functional groups (development, server, storage, security).

Compensation and Other Details

  • Base salary ranges (determined by location, experience, and peer pay):
    • Level 5: 320,000 USD - 488,750 USD
    • Level 6: 384,000 USD - 575,000 USD
  • You will also be eligible for equity and benefits (link provided in original posting).
  • Applications accepted at least until May 8, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.