Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others, and you know all the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 7
Grafana @ 6
Prometheus @ 6
Python @ 7
Leadership @ 4
SRE @ 4
OpenTelemetry @ 6
GPU @ 4
Observability @ 6
AI @ 4
HPC @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today we’re tapping into the unlimited potential of AI to define the next era of computing. Join the DGX Cloud team as a Senior Reliability Engineer and help redefine operational excellence for NVIDIA's AI infrastructure.
Responsibilities
- Build an org-wide reliability strategy, guiding how NVIDIA matures its operational practices in a 24/7 environment.
- Stand up and run a rigorous SLO program, defining and upholding high standards across teams.
- Lead incident response for high-severity incidents, ensuring low-drama, high-signal resolution.
- Build and improve production code daily, enhancing the data platform and related tooling.
- Implement chaos engineering, failure injection, and resilience testing to elevate team practices.
- Improve standards by setting an example with hands-on experience and leadership.
Requirements
- Deep, hands-on experience running large-scale production systems with a proven track record.
- Detailed understanding of failure modes in large systems, including cascading dependencies and retry storms.
- Strong software engineering skills with current, hands-on experience in Go, Python, or similar languages.
- Proven experience establishing and maintaining an SLO program with operational rigor.
- Practical experience in reliability fields such as chaos engineering and failure injection.
- Ability to influence across team boundaries through credibility and expertise.
- 10+ years of industry experience with a Bachelor’s or Master’s degree, or equivalent experience operating systems at scale.
Ways to stand out
- Experience within a world-class reliability function like Google SRE or Meta production engineering.
- Expertise in operating GPU, HPC, or AI training infrastructure with its distinctive failure modes.
- Track record of measurable reliability improvements within an organization.
- Proficiency with modern observability and operational tools like Prometheus, OpenTelemetry, Grafana, PagerDuty, and Rootly.
Compensation and benefits
- Base salary ranges (location- and level-dependent):
- Level 4: 168,000 USD - 270,250 USD
- Level 5: 208,000 USD - 333,500 USD
- You will also be eligible for equity and benefits. See NVIDIA benefits information at www.nvidiabenefits.com.
Other information
- Applications for this job will be accepted at least until June 26, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is committed to fostering an inclusive work environment and is an equal opportunity employer. The company does not discriminate on the basis of characteristics protected by law.