Site Reliability Engineer, HPC and LSF

at NVIDIA

📍 Santa Clara, United States

USD 124,000-218,500 per year

MIDDLE SENIOR

✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Ansible @ 3, CentOS @ 5, Docker @ 3, Linux @ 5, Python @ 5, Leadership @ 3, Bash @ 5, Communication @ 3, Networking @ 3, Perl @ 5, SRE @ 3, GPU @ 3

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work.

As a member of the Hardware Infrastructure Farm team, you will provide leadership in the design and implementation of groundbreaking compute clusters that power all silicon development across NVIDIA. We seek an expert to build and operate these clusters with high reliability, efficiency, and performance, and to drive foundational improvements and automation that improve engineers' productivity. As a Site Reliability Engineer, you are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems, and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work. An SRE culture of diversity, intellectual curiosity, problem solving, and openness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences, and perspectives. We encourage collaboration, thinking big, and taking risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

Responsibilities

Troubleshoot incoming support requests in a large-scale HPC environment.
Contribute enhancements to existing deployment automation, configuration management, observability, and operational monitoring; drive day-to-day operations through automation.
Ensure compute servers are running the correct operating system and configuration.
Perform comprehensive troubleshooting from bare metal to application level, ensuring system reliability and efficiency.
Collaborate with specialist teams to drive issues to closure.
Work with domain experts to improve how chip development processes utilize infrastructure.
Directly contribute to the overall quality and improve time to market for next-generation chips.

Requirements

Proficient in administering CentOS / RHEL Linux distributions.
Understanding of container technologies such as Docker.
Proficiency in Python and UNIX scripting languages such as Bash.
Excellent problem-solving skills; ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
Excellent communication and teamwork skills; ability to work effectively with diverse teams and individuals.
BS in Computer Science or a similar degree (or equivalent experience), with 2+ years of relevant post-degree experience.
Solid understanding of cluster configuration management tools such as Ansible.

Preferred / Ways to stand out

Understanding of key Linux technologies such as NFS, automounter, LDAP, DNS, and TCP/IP networking in Red Hat Linux distribution flavors.
Familiarity with job scheduler administration (e.g. IBM Spectrum LSF or Slurm) and experience building and operating large-scale compute infrastructure.
Knowledge of the FlexLM license management system.
Proficiency in Perl for maintaining legacy automation scripts.
Familiarity with high-speed networking (InfiniBand, RDMA, RoCE) and fast, distributed storage systems (Lustre, GPFS).

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 124,000 USD - 195,500 USD for Level 2, and 152,000 USD - 218,500 USD for Level 3. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until January 16, 2026.

Additional information

This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.