Required Skills & Competences
Ansible @ 4 Chef @ 4 Docker @ 4 Go @ 4 Grafana @ 4 Kubernetes @ 4 Linux @ 4 Prometheus @ 4 Terraform @ 4 Python @ 4 GCP @ 4 Java @ 4 Hiring @ 4 AWS @ 4 Azure @ 4 Git @ 4 Networking @ 4 SRE @ 4 Android @ 4 CUDA @ 4
Details
NVIDIA is looking for a Senior Site Reliability Engineer to work in IPP (Infrastructure, Planning and Process). IPP is a global organization within NVIDIA that partners with groups across NVIDIA Software (Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars) to provide and maintain the infrastructure these teams rely on. The cloud services support hundreds of thousands of automated jobs per day across thousands of servers, empowering thousands of NVIDIA software engineers worldwide.
The cloud hosts a heterogeneous mix of machines and devices running various operating systems (Windows, Linux, Android) and hardware platforms including NVIDIA GPUs and Tegra processors. The role focuses on building next-generation cloud services, automating workflows, designing scalable and resilient systems, mining data to uncover issues, and resolving complex infrastructure problems.
Responsibilities
- Develop frameworks and scripts to automate workflows and deployments in the cloud environment.
- Deploy and maintain a large farm of machines using Configuration Management & Infrastructure Automation tools such as Chef, Ansible, and Terraform.
- Build and maintain extensive monitoring systems (Zabbix, Grafana, Prometheus) that provide a fast, reliable, real-time pulse of infrastructure subsystems.
- Participate in on-call and rotational L1 support for round-the-clock monitoring and remediation of infrastructure.
- Solve complex problems related to infrastructure scaling, capacity planning, and resilience. Analyze and debug operating system, networking, configuration, and performance issues.
- Assist in roll-out and deployment of new development features to support the latest NVIDIA hardware and technologies.
- Develop SRE agents to streamline Cost of Business activities, reduce toil, and improve operational efficiency.
Requirements
- Bachelor's or Master's degree in Computer Science, Software Engineering, or equivalent experience.
- 8+ years of operational experience in large-scale enterprise production systems.
- Familiarity with implementing load balancing strategies, disaster recovery planning, business continuity best practices, and designing scalable, resilient systems based on SRE principles.
- Ability to debug and analyze source code to triage, root-cause, and resolve infrastructure issues, and to collaborate closely with development teams to improve build and test systems.
- Hands-on coding experience with Python or Go. Proficiency with Unix shell. Knowledge of Java and C is also valued.
- Experience with version control systems such as Perforce and Git.
Ways to stand out
- Experience with public clouds (AWS, GCP, Azure) and virtualization/container technologies (VMware, KVM, Docker, Kubernetes).
- Background automating bare metal and VM provisioning.
- Experience supporting GPUs, embedded device development, driver development, and CUDA/TensorRT applications.
Compensation & Benefits
- Base salary ranges (determined by location, experience, and comparable roles):
  - Level 4: 168,000 USD - 270,250 USD
  - Level 5: 208,000 USD - 333,500 USD
- Eligible for equity and benefits (see NVIDIA benefits: https://www.nvidia.com/en-us/benefits/).
Additional information
- Applications for this job will be accepted at least until September 14, 2025.
- NVIDIA is an equal opportunity employer and values diversity in its workforce. It does not discriminate in hiring or promotion practices on the basis of protected characteristics.