SRE Engineer - Air Platform Team

at NVIDIA

📍 Durham, United States

USD 120,000-235,800 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, AWS @ 3, Git @ 3, Networking @ 3, SRE @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2

Details

NVIDIA is seeking a highly motivated SRE Engineer to join the NVIDIA AIR team, which focuses on the Digital Twin for Data Center Simulation web application. NVIDIA AIR creates identical replicas of real-world data center infrastructure deployments, enabling cloud-scale efficiency.

Responsibilities

Design, deploy, and manage IaaS platforms focusing on high availability and performance.
Automate infrastructure operations using Terraform, Ansible, and Python.
Automate repetitive workflows to improve efficiency.
Develop monitoring and observability tools to detect and prevent outages using Prometheus, Grafana, ELK, etc.
Deploy and troubleshoot cloud operations without disrupting service, with an emphasis on secure production infrastructure.
Manage deployment and upgrades for operating systems, Kubernetes clusters, and other orchestration tools.
Provide engineering support with CI/CD tools including Git and Jenkins.
Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

Requirements

BS degree in Computer Science, Software Engineering, or related field (or equivalent experience).
3–5+ years in Site Reliability, DevOps, or Systems Engineering roles.
Strong automation and scripting skills with Ansible, Python, and shell.
Experience deploying, configuring, and administering Linux-based bare metal servers and IaaS environments.
Deep infrastructure engineering experience managing and monitoring highly available production infrastructure.
Skilled in observability practices using Prometheus, Grafana, ELK/EFK, and integrated alerting.
Solid knowledge of Linux internals and core networking: NAT, DNS, DHCP, routing, firewall configuration (iptables/nftables).
Experience with modern deployment architectures, including blue-green and canary rollouts.
Proficient with Kubernetes, Docker, QEMU, and libvirt.

Ways to Stand Out

Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
Proficiency in debugging network issues in infrastructure and SDN.
Experience with performance tuning and benchmarking across storage, compute, or networking.
Experience implementing robust metrics collection and alerting infrastructure.
Familiarity with compliance standards like FedRAMP, HIPAA, and SOC 2.

Benefits

Competitive salaries, equity, and a generous benefits package. NVIDIA is committed to diversity and equal opportunity employment and offers a supportive and innovative working environment.