Site Reliability Engineer - Air Platform Team

at NVIDIA
📍 Durham, United States
USD 148,000-287,500 per year
MIDDLE SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, Hiring @ 3, AWS @ 3, Git @ 3, Networking @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2

Details

NVIDIA is hiring a Site Reliability Engineer to join the team behind NVIDIA AIR, the Digital Twin for Data Center Simulation web application. NVIDIA AIR creates identical replicas of real-world data center infrastructure deployments to enable cloud-scale efficiency.

Responsibilities

  • Design, deploy, and manage IaaS platforms with a focus on high availability and performance.
  • Automate infrastructure operations using tools like Terraform, Ansible, and Python.
  • Automate repetitive workflows to improve operational efficiency.
  • Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.
  • Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.
  • Manage deployment and upgrades for operating systems, Kubernetes (k8s) clusters, and other orchestration tools.
  • Provide day-to-day support for engineering activities with CI/CD tools such as Git and Jenkins.
  • Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

Requirements

  • BS degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
  • 5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.
  • Strong automation and scripting skills (Ansible, Python, Shell scripting).
  • Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.
  • Deep infrastructure engineering experience focused on managing and monitoring highly available production infrastructure.
  • Skilled in observability practices and tools (Prometheus, Grafana, ELK/EFK) and alerting.
  • Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration (iptables or nftables).
  • Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.
  • Proficiency with Kubernetes, Docker, QEMU, and Libvirt.

Preferred / Nice to have

  • Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
  • Proficiency in debugging network issues in both infrastructure and SDN.
  • Experience with performance tuning and benchmarking across storage, compute, or networking.
  • Experience implementing robust metrics collection and alerting infrastructure.
  • Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.

Compensation & Benefits

  • Base salary range (Level 3): 148,000 USD - 235,750 USD.
  • Base salary range (Level 4): 184,000 USD - 287,500 USD.
  • Eligible for equity and benefits (see NVIDIA benefits pages).

Additional information

  • Location: Durham, NC, United States (role listed as full-time).
  • Applications accepted at least until October 16, 2025.
  • NVIDIA is an equal opportunity employer and values diversity in its workforce.