Site Reliability Engineer - Air Platform Team
at NVIDIA
Durham, United States
USD 148,000-287,500 per year
Required Skills & Competences
Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, Hiring @ 3, AWS @ 3, Git @ 3, Networking @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2
Details
NVIDIA is hiring a Site Reliability Engineer to join the NVIDIA AIR team. NVIDIA AIR, the Digital Twin for Data Center Simulation web application, creates identical replicas of real-world data center infrastructure deployments to enable cloud-scale efficiency.
Responsibilities
- Design, deploy, and manage IaaS platforms with a focus on high availability and performance.
- Automate infrastructure operations using tools like Terraform, Ansible, and Python.
- Automate repetitive workflows to improve operational efficiency.
- Develop monitoring and observability tooling to detect and prevent outages using Prometheus, Grafana, ELK/EFK, and integrated alerting systems.
- Deploy and troubleshoot non-disruptive cloud operations with an emphasis on secure production infrastructure.
- Manage deployment and upgrades for operating systems, Kubernetes (k8s) clusters, and other orchestration tools.
- Provide day-to-day support for engineering activities with CI/CD tools such as Git and Jenkins.
- Implement and enforce best practices around infrastructure security, access control, and operational efficiency.
Requirements
- BS degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
- 5+ years of experience in a Site Reliability, DevOps, or Systems Engineering role.
- Strong automation and scripting skills (Ansible, Python, Shell scripting).
- Experience in IaaS environments, including deploying, configuring, and administering Linux-based bare metal servers.
- Deep infrastructure engineering experience focused on managing and monitoring highly available production infrastructure.
- Skilled in observability practices and tools (Prometheus, Grafana, ELK/EFK) and alerting.
- Solid grasp of Linux internals and core networking concepts including NAT, DNS, DHCP, routing, and firewall configuration (iptables or nftables).
- Experience with modern deployment architectures for non-disruptive cloud operations, including blue-green and canary rollouts.
- Proficiency with Kubernetes, Docker, QEMU, and Libvirt.
Preferred / Nice to have
- Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
- Proficiency in debugging network issues in both infrastructure and SDN.
- Experience with performance tuning and benchmarking across storage, compute, or networking.
- Experience implementing robust metrics collection and alerting infrastructure.
- Familiarity with compliance standards such as FedRAMP, HIPAA, and SOC 2.
Compensation & Benefits
- Base salary range (Level 3): 148,000 USD - 235,750 USD.
- Base salary range (Level 4): 184,000 USD - 287,500 USD.
- Eligible for equity and benefits (see NVIDIA benefits pages).
Additional information
- Location: Durham, NC, United States (role listed as full time).
- Applications accepted at least until October 16, 2025.
- NVIDIA is an equal opportunity employer and values diversity in its workforce.