Used Tools & Technologies
GPU
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert level. You can teach others, and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 7
Ansible @ 7
Docker @ 6
Grafana @ 4
Jenkins @ 7
Kubernetes @ 4
MySQL @ 4
Prometheus @ 4
Kibana @ 4
SQL @ 4
Distributed Systems @ 4
Communication @ 4
Networking @ 4
OpenStack @ 4
SRE @ 4
Planning @ 4
Splunk @ 4
Deep Learning @ 4
AI @ 4
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. Today, NVIDIA is tapping into the unlimited potential of AI to define the next era of computing. As an NVIDIAN, you will be immersed in a diverse, supportive environment working on infrastructure that supports GPUs, Tegra systems, deep learning, AI and driverless cars.
Responsibilities
- Manage NVIDIA's on-prem infrastructure across multiple data centers to maintain uptime, reliability, and readiness of engineering cloud environments.
- Guard service level agreements (SLAs) for critical engineering services by implementing monitoring, alerting, and incident response procedures.
- Perform root cause analysis and post-mortems for incidents and threshold breaches.
- Deploy, configure, and manage applications and services on Kubernetes clusters; ensure high availability, fault tolerance, and disaster recovery for Kubernetes workloads.
- Implement logging, monitoring, and alerting solutions (e.g., Prometheus, Grafana, ELK/EFK).
- Drive automation of monitoring and operations to improve insight into application and system health.
- Support user-reported issues, monitor alerts, participate in war rooms for critical incidents, and assist in capacity planning and optimization efforts.
- Apply AI techniques to extract useful signals from machine and job data where applicable.
Requirements
- 5+ years of demonstrable experience in maintaining cloud infrastructure and highly-available production environments.
- Experience handling and maintaining systems installed in on-premises data centers; hands-on proficiency with BMC interfaces (Redfish), KVM, and IPMI for hardware provisioning, remote access, and troubleshooting.
- Knowledge of OpenStack architecture and services is a plus.
- Experience with databases including relational (SQL/MySQL) and time-series databases (Prometheus); experience in data querying and performance tuning.
- Solid understanding of networking principles and protocols (TCP/IP, DNS, DHCP, VLANs) and diagnosing connectivity in distributed systems.
- Practical experience with data analytics and visualization tools such as Kibana, Grafana, Splunk, or similar platforms for logs and metrics analysis.
- Strong experience with automation tools like Jenkins and/or Temporal and configuration tools like Ansible.
- Proficiency with Kubernetes, Docker, and virtualization technologies for deploying and operating containerized workloads in production.
- Advanced knowledge of standard security methodologies and protocols, including system hardening, access control, vulnerability management, and secure operations.
- Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent experience.
Ways to stand out
- Prior experience with SRE teams managing on-prem infrastructure.
- Experience managing NVIDIA hardware such as GPUs and Tegra devices.
- Ability to thrive in a multi-tasking environment with evolving priorities; excellent interpersonal and communication skills.
Compensation and benefits
- Base salary ranges by level:
  - Level 3: 148,000 USD - 235,750 USD
  - Level 4: 176,000 USD - 276,000 USD
- Eligible for equity and benefits. NVIDIA's benefits page describes the benefits package.
Additional information
- Applications accepted at least until June 13, 2026.
- This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.