SRE Engineer - Air Platform Team

at NVIDIA
📍 Durham, United States
USD 120,000-235,800 per year
MIDDLE
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, AWS @ 3, Git @ 3, Networking @ 3, SRE @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2

Details

NVIDIA is seeking a highly motivated SRE Engineer to join the NVIDIA AIR team, which focuses on the Digital Twin for Data Center Simulation web application. NVIDIA AIR creates identical replicas of real-world data center infrastructure deployments, enabling cloud-scale efficiency.

Responsibilities

  • Design, deploy, and manage IaaS platforms focusing on high availability and performance.
  • Automate infrastructure operations using Terraform, Ansible, and Python.
  • Automate repetitive workflows to improve efficiency.
  • Develop monitoring and observability tools to detect and prevent outages using Prometheus, Grafana, ELK, etc.
  • Deploy and troubleshoot cloud operations without disrupting service, with an emphasis on secure production infrastructure.
  • Manage deployment and upgrades for operating systems, Kubernetes clusters, and other orchestration tools.
  • Provide engineering support with CI/CD tools including Git and Jenkins.
  • Implement and enforce best practices around infrastructure security, access control, and operational efficiency.

Requirements

  • BS degree in Computer Science, Software Engineering, or a related field (or equivalent experience).
  • 3–5+ years in Site Reliability, DevOps, or Systems Engineering roles.
  • Strong automation and scripting skills in Ansible, Python, and shell.
  • Experience deploying, configuring, and administering Linux-based bare metal servers and IaaS environments.
  • Deep infrastructure engineering experience managing and monitoring highly available production infrastructure.
  • Skilled in observability practices using Prometheus, Grafana, ELK/EFK, and integrated alerting.
  • Solid knowledge of Linux internals and core networking: NAT, DNS, DHCP, routing, firewall configuration (iptables/nftables).
  • Experience with modern deployment architectures, including blue-green and canary rollouts.
  • Proficient with Kubernetes, Docker, QEMU, and Libvirt.

Ways to Stand Out

  • Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
  • Proficiency in debugging network issues in infrastructure and SDN.
  • Experience with performance tuning and benchmarking across storage, compute, or networking.
  • Track record of implementing robust metrics collection and alerting infrastructure.
  • Familiarity with compliance standards like FedRAMP, HIPAA, and SOC 2.

Benefits

Competitive salaries, equity, and a generous benefits package. NVIDIA is committed to diversity and equal opportunity employment and offers a supportive and innovative working environment.