Site Reliability Engineer - Air Platform Team

at NVIDIA

📍 Durham, United States

USD 148,000-287,500 per year

Mid-level / Senior

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, Hiring @ 3, AWS @ 3, Git @ 3, Networking @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2

Details

NVIDIA is hiring a Site Reliability Engineer to join the team behind NVIDIA AIR, a digital-twin web application for data center simulation. NVIDIA AIR creates identical replicas of real-world data center infrastructure deployments to enable cloud-scale efficiency.

Responsibilities

Design, deploy, and manage IaaS platforms with a focus on high availability and performance.
Automate infrastructure operations using tools like Terraform, Ansible, and Python.
Automate repetitive workflows to improve operational efficiency.
Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.
Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.
Manage deployment and upgrades for operating systems, Kubernetes (k8s) clusters, and other orchestration tools.
Provide day-to-day support for engineering activities with version control and CI/CD tools such as Git and Jenkins.
Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

Requirements

BS degree in Computer Science, Software Engineering, or a related field — or equivalent experience.
5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.
Strong automation and scripting skills (Ansible, Python, Shell scripting).
Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.
Deep infrastructure engineering experience focused on managing and monitoring highly available production infrastructure.
Skilled in observability practices and tools (Prometheus, Grafana, ELK/EFK) and alerting.
Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration (iptables or nftables).
Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.
Proficiency with Kubernetes, Docker, QEMU, and Libvirt.

Preferred / Nice to have

Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
Proficiency in debugging network issues in both infrastructure and SDN.
Experience with performance tuning and benchmarking across storage, compute, or networking.
Experience implementing robust metrics collection and alerting infrastructure.
Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.

Compensation & Benefits

Base salary range (Level 3): 148,000 USD - 235,750 USD.
Base salary range (Level 4): 184,000 USD - 287,500 USD.
Eligible for equity and benefits (see NVIDIA benefits pages).

Additional information

Location: Durham, NC, United States. Employment type: full-time.
Applications accepted at least until October 16, 2025.
NVIDIA is an equal opportunity employer and values diversity in its workforce.