Director, Site Reliability and Software Engineering - DGX Cloud

at NVIDIA

📍 Santa Clara, United States

USD 320,000-575,000 per year

SENIOR

✅ On-site

Used Tools & Technologies

SRE

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Security @ 6, Linux @ 7, DevOps @ 4, Distributed Systems @ 4, Leadership @ 4, People Management @ 4, Mentoring @ 4, Product Management @ 4, Reporting @ 4, Engineering Management @ 8, Cloud Computing @ 4, GPU @ 4, AI @ 4

Details

NVIDIA is the AI computing company driving modern AI with GPUs. The NVIDIA GPU Cloud (NGC) is a GPU-accelerated platform used by data scientists and researchers to build, train, and deploy neural network models. The DGX Cloud Computing team is looking for a leader to manage software, automation, and operations of multi-colo distributed NVIDIA GPU cloud clusters and contribute to product strategy.

Responsibilities

Manage a team of software and site reliability engineers, including program development, task planning, and code reviews.
Define team strategy and roadmap; drive adoption of scalable SDLC practices, test infrastructure, and modern practices within NVIDIA’s DGX Cloud Computing environment.
Drive technical projects and provide leadership in an innovative and fast-paced environment.
Be responsible for the overall planning, tracking and success of technical projects.
Work closely with project and product management teams to ensure best-in-class product development.
Contribute technically to projects for DGX Cloud Computing Services.
Interact with key internal stakeholders to provide operational and financial clarity on technical spend.
Drive decision-making, visibility, and operational rigor across business analytics initiatives such as budget reporting and project and portfolio reporting; lead executive reporting, dashboards, and operational CTO metrics focused on continuous improvement.

Requirements

12+ years of overall experience in engineering management; 5+ years of leadership experience.
Bachelor’s or Master’s degree in Computer Science or equivalent experience.
Experience designing and implementing large-scale distributed systems.
Experience with containers, virtualization environments, and cluster solutions; experience managing Technical Support / DevOps teams.
Strong knowledge of Unix/Linux.
Experience implementing tools, processes, internal instrumentation, and methodologies, and resolving blockers.
Demonstrated people management and leadership skills with a proven track record of mentoring and coaching team members.
Ability to quickly learn and evaluate new technologies and to influence and establish relationships with other software and IT functional groups (development, server, storage, security).

Compensation and Other Details

Base salary ranges (determined by location, experience, and peer pay):
- Level 5: 320,000 USD - 488,750 USD
- Level 6: 384,000 USD - 575,000 USD
You will also be eligible for equity and benefits (link provided in original posting).
Applications accepted at least until May 8, 2026.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.