Senior Software Engineer - Datacenter Systems

at NVIDIA
USD 184,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Ansible @ 4, Grafana @ 6, Jenkins @ 4, Kubernetes @ 4, Linux @ 7, Prometheus @ 6, DevOps @ 4, Python @ 4, CI/CD @ 4, Communication @ 4, Networking @ 4, SRE @ 4, Rust @ 4, API @ 4, GPU @ 4, Observability @ 6, AI @ 4, InfiniBand @ 4, Slurm @ 3, HPC @ 4, NVLink @ 4

Details

NVIDIA's software infrastructure team builds software systems for rack, networking, and datacenter provisioning and management supporting large-scale GPU clusters connected through NVLink and InfiniBand. These clusters run HPC and AI workloads. This role contributes to stable release train architectures and Site Reliability Engineering (SRE) practices.

Responsibilities

  • Develop and manage software for hands-off datacenter provisioning and lifecycle management, including rack installation, bare-metal networking configuration, and cluster scaling.
  • Build and implement scalable release train architectures that modularize systems and enable independent, reliable release cycles.
  • Define, monitor, and enforce Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for core infrastructure services to ensure high availability and reliability.
  • Develop intuitive user interfaces (UIs) and APIs for internal provisioning and management tools to improve cluster operations and visibility.
  • Lead technical requirement definition: articulate requirements, inputs, outputs, and quantifiable outcomes for new infrastructure features and improvements.
  • Build and maintain CI/CD pipelines supporting fast, reliable integration and deployment across complex systems.
  • Build tools and automation workflows to simplify software releases, manage dependencies, and increase reliability.
  • Automate software updates and monitor system health to improve reliability and availability.
  • Resolve operational issues across distributed infrastructure and manage firmware and software rollouts to minimize downtime and ensure consistency.
  • Collaborate with global engineering teams to align infrastructure tools and support project goals.

Requirements

  • BS or MS in Computer Science, Computer Engineering, or a related field, or equivalent experience.
  • 8+ years of experience managing infrastructure or systems in high-performance or distributed environments.
  • Expertise in programming with Python, Rust, C++, and shell scripting, or similar high-level languages.
  • Practical experience with modern CI/CD tools, infrastructure-as-code frameworks, and related practices, such as Jenkins, GitLab, Ansible, GitOps, and Kubernetes.
  • Ability to use AI coding tools and agents effectively to increase efficiency.
  • Strong understanding of Linux, networking, and building distributed systems.
  • Ability to break down monolithic systems into scalable, loosely coupled components.
  • Excellent communication and collaboration skills across multi-functional teams.

Ways to stand out from the crowd

  • Demonstrated experience implementing SRE practices, specifically defining and tracking SLIs, SLOs, and SLAs.
  • Proficiency with observability tools such as Prometheus and Grafana for system health monitoring and analysis.
  • Experience crafting user-facing components (front-end or CLI) for infrastructure management tools.
  • Experience with cluster management tools like Slurm and familiarity with NVIDIA DGX systems and GPU-based clusters (e.g., GB200, GB300, VR-NVL72).
  • Consistent track record of leading DevOps process improvements and driving team efficiency.

Compensation & Benefits

  • Base salary range: 184,000 USD - 287,500 USD (determined based on location, experience, and pay of employees in similar positions).
  • Eligible for equity and benefits (link to NVIDIA benefits referenced in the posting).

Additional information

  • Applications accepted at least until May 9, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to diversity.