Site Reliability Engineer - Cloud

at NVIDIA
USD 136,000-212,800 per year
MIDDLE
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences

Marketing @ 3 Kubernetes @ 6 Linux @ 3 Python @ 3 Java @ 3 AWS @ 3 Communication @ 3 SRE @ 3 GPU @ 3

Details

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing, with the GPU acting as the brains of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We're looking to grow our company and build our teams with the smartest people in the world. Join us at the forefront of technological advancement.

Responsibilities

  • Rapidly debug and triage user-reported issues for the Digital Marketing Organization.
  • On-board new applications and services on AWS Infrastructure.
  • Contribute to the overall health, performance, and uptime of our services running on Linux and Windows.
  • Implement monitors, alerts, and SOPs to ensure early detection and accurate response to service-impacting issues.
  • Take ownership of automation, scripting, and tooling, for both new and existing scripts, to help the team achieve 100% automation of daily tasks.

Requirements

  • MS or BS in Computer Science/Engineering or a related field, or equivalent experience.
  • 5+ years of experience supporting technical operations in a live-site production environment with a real passion for automation and tooling.
  • Experience building and running critical production services (packaged or custom, in Python/Java) on Windows or Linux.
  • Strong knowledge of the Kubernetes platform, deployments, and automation.
  • Experience contributing to incident management for early detection, accurate triage, partner communication, impact containment, service restoration, and post-incident follow-up. SRE on-call experience is mandatory.
  • Advanced-level experience with scripting and development in Python, including fully automating multi-step tasks into "one-click" rapid solutions.
  • Strong problem-solving and root-cause-analysis skills, continuously seeking ways to drive optimization and efficiency.
  • Must live in an East Coast time zone.

Ways to Stand Out

  • Strong experience with the AWS cloud platform, Kubernetes as a platform, and SRE on-call work.
  • Excellent communication, presentation, social, and analytical skills; ability to communicate complex concepts clearly across various levels of the organization.

Benefits

NVIDIA offers competitive salaries and a generous benefits package, with eligibility for equity. NVIDIA values diversity and is an equal opportunity employer.