SRE Engineer - Air Platform Team
at NVIDIA
Durham, United States
USD 120,000-235,800 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Security @ 3, Ansible @ 3, Docker @ 5, Grafana @ 3, Jenkins @ 3, Kubernetes @ 3, Linux @ 3, Prometheus @ 3, DevOps @ 5, Terraform @ 3, Python @ 3, CI/CD @ 3, AWS @ 3, Git @ 3, Networking @ 3, SRE @ 3, IaaS @ 3, Debugging @ 5, Compliance @ 2
Details
NVIDIA is seeking a highly motivated SRE Engineer to join the NVIDIA AIR team, which focuses on the Digital Twin for Data Center Simulation web application. NVIDIA AIR creates identical replicas of real-world data center infrastructure deployments, enabling cloud-scale efficiency.
Responsibilities
- Design, deploy, and manage IaaS platforms focusing on high availability and performance.
- Automate infrastructure operations using Terraform, Ansible, and Python.
- Automate repetitive workflows to improve efficiency.
- Develop monitoring and observability tools to detect and prevent outages using Prometheus, Grafana, ELK, etc.
- Deploy and troubleshoot non-disruptive cloud operations with emphasis on secure production infrastructure.
- Manage deployment and upgrades for Operating Systems, Kubernetes clusters, and other orchestration tools.
- Provide engineering support with CI/CD tools including Git and Jenkins.
- Implement and enforce best practices around infrastructure security, access control, and operational efficiency.
Requirements
- BS degree in Computer Science, Software Engineering, or related field (or equivalent experience).
- 3-5+ years in Site Reliability, DevOps, or Systems Engineering roles.
- Strong automation and scripting skills in Ansible, Python, and Shell scripting.
- Experience deploying, configuring, and administering Linux-based bare metal servers and IaaS environments.
- Deep infrastructure engineering experience managing and monitoring highly available production infrastructure.
- Skilled in observability practices using Prometheus, Grafana, ELK/EFK, and integrated alerting.
- Solid knowledge of Linux internals and core networking: NAT, DNS, DHCP, routing, firewall configuration (iptables/nftables).
- Experience with modern deployment architectures, including blue-green and canary rollouts.
- Proficient with Kubernetes, Docker, QEMU, and Libvirt.
Ways to Stand Out
- Hands-on expertise with AWS for deploying complex, load-balanced, and highly available workloads.
- Proficiency in debugging network issues in infrastructure and SDN.
- Experience with performance tuning and benchmarking across storage, compute, or networking.
- Experience implementing robust metrics collection and alerting infrastructure.
- Familiarity with compliance standards like FedRAMP, HIPAA, and SOC 2.
Benefits
Competitive salaries, equity, and a generous benefits package. NVIDIA is committed to diversity and equal opportunity employment and offers a supportive and innovative working environment.