Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Ansible @ 4
Grafana @ 6
Jenkins @ 4
Kubernetes @ 4
Linux @ 7
Prometheus @ 6
DevOps @ 4
Python @ 4
CI/CD @ 4
Communication @ 4
Networking @ 4
SRE @ 4
Rust @ 4
API @ 4
GPU @ 4
Observability @ 6
AI @ 4
InfiniBand @ 4
Slurm @ 3
HPC @ 4
NVLink @ 4
Details
NVIDIA's software infrastructure team builds software systems for rack, networking, and datacenter provisioning and management supporting large-scale GPU clusters connected through NVLink and InfiniBand. These clusters run HPC and AI workloads. This role contributes to stable release train architectures and Site Reliability Engineering (SRE) practices.
Responsibilities
- Develop and manage software for hands-off datacenter provisioning and lifecycle management, including rack installation, bare-metal networking configuration, and cluster scaling.
- Build and implement scalable release train architectures that modularize systems and enable independent, reliable release cycles.
- Define, monitor, and enforce Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) for core infrastructure services to ensure high availability and reliability.
- Develop intuitive user interfaces (UIs) and APIs for internal provisioning and management tools to improve cluster operations and visibility.
- Lead technical requirement definition: articulate requirements, inputs, outputs, and quantifiable outcomes for new infrastructure features and improvements.
- Build and maintain CI/CD pipelines supporting fast, reliable integration and deployment across complex systems.
- Build tools and automation workflows to simplify software releases, manage dependencies, and increase reliability.
- Automate software updates and monitor system health to improve reliability and availability.
- Resolve operational issues across distributed infrastructure and manage firmware and software rollouts to minimize downtime and ensure consistency.
- Collaborate with global engineering teams to align infrastructure tools and support project goals.
Requirements
- BS or MS in Computer Science, Computer Engineering, or a related field, or equivalent experience.
- 8+ years of experience managing infrastructure or systems in high-performance or distributed environments.
- Expertise in programming with Python, Rust, C++, and shell scripting, or similar high-level languages.
- Practical experience with modern CI/CD tools and infrastructure-as-code frameworks such as Jenkins, GitLab, Ansible, GitOps, and Kubernetes.
- Ability to use AI coding tools and agents effectively to increase efficiency.
- Strong understanding of Linux, networking, and distributed systems engineering.
- Ability to break down monolithic systems into scalable, loosely coupled components.
- Excellent communication skills and the ability to collaborate with cross-functional teams.
Ways to stand out from the crowd
- Demonstrated experience implementing SRE practices, specifically defining and tracking SLIs, SLOs, and SLAs.
- Proficiency with observability tools such as Prometheus and Grafana for system health monitoring and analysis.
- Experience crafting user-facing components (front-end or CLI) for infrastructure management tools.
- Experience with cluster management tools like Slurm and familiarity with NVIDIA DGX systems and GPU-based clusters (e.g., GB200, GB300, VR-NVL72).
- Consistent track record of leading DevOps process improvements and driving team efficiency.
Compensation & Benefits
- Base salary range: 184,000 USD - 287,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits (link to NVIDIA benefits referenced in the posting).
Additional information
- Applications accepted at least until May 9, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity.