Senior Service Reliability Operations Administrator

at NVIDIA
USD 124,000-224,250 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Security @ 4
  • System Administration @ 4
  • Ansible @ 4
  • DevOps @ 4
  • Python @ 4
  • Communication @ 7
  • Git @ 4
  • Networking @ 7
  • SRE @ 6

Details

NVIDIA's NGC team is looking for highly motivated System Administrators and DevOps engineers to design, develop, and implement a global, dynamic, innovative Service Reliability Operations Center that provides extraordinary levels of support for our Cloud products and services. As a key member of the CIS (Compute Infrastructure Support) team, you will partner with Site Reliability Engineering, the Security Operations Center, DevOps teams, and other partners to help our services achieve near 100% availability. On the rare occasion that an incident occurs, you will be our front line, working to decrease the frequency and duration of any issue. Working in partnership with the development community, the CIS team will develop monitors, alarms, and alerts to help make the service more reliable and improve our customer experience.

Responsibilities

  • Provide 24/7 follow-the-sun support as part of a global team; report to a manager in the United States.
  • Perform CIS shifts that may require working either a Saturday or a Sunday each week; some schedules include an early or late start (10 hours per day × 4 days per week) so that the US and India teams together provide 24/7 coverage.
  • Use alerts and alarms to help prevent issues and incidents; work with developers to build and implement predictive support or diagnostic routines.
  • Perform systems administration tasks, network administration tasks, and security incident monitoring.
  • Work with developers to learn how services work and translate that understanding into runbooks; update runbooks as features/functionality change.
  • Discover incidents and issues, initiate incident management procedures, and involve subject matter experts or service owners as needed.
  • Maintain strong interpersonal communication to keep the team engaged through resolution and ensure positive client interactions.
  • Perform other tasks needed to provide extraordinary service levels for customers.

Requirements

  • 5+ years administering open system servers in a production environment.
  • 3+ years of experience in demanding Internet, Cloud, or Telecommunications environments in Systems Administration, DevOps, SRE, or NOC roles.
  • B.S. in a relevant discipline or equivalent experience.
  • Expertise using monitoring tools and problem ticketing systems.
  • Strong problem-solving, analytical, and troubleshooting abilities.
  • Strong server administration experience: shell scripting, automation, DNS, DHCP, storage concepts, basic networking, iptables, etc.; RHCE or equivalent knowledge level.
  • Experience scripting in Python preferred but not required.
  • Prior experience running virtual machines under open source or commercial hypervisors.
  • Experience operating services on public or private clouds.
  • Knowledge and understanding of application containers and container orchestration systems.
  • Basic understanding of Git.
  • Experience performing system administration tasks using Ansible.
  • Prior experience analyzing system and network performance using monitoring alerts, data, and graphs.
  • Demonstrated ability to master and maintain complicated environments.

Compensation & Benefits

  • Base salary ranges (determined by location, experience, and comparable pay):
    • Level 3: 124,000 USD - 195,500 USD
    • Level 4: 140,000 USD - 224,250 USD
  • Eligible for equity and benefits (see NVIDIA benefits page).

Additional Information

  • Applications accepted at least until September 26, 2025.
  • NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.