Service Reliability Operations Administrator

at Nvidia

📍 Santa Clara, United States

USD 124,000-224,250 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, System Administration @ 3, Ansible @ 3, DevOps @ 3, Python @ 3, Git @ 3, Networking @ 6, SRE @ 3

Details

NVIDIA's NGC team is seeking highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, and innovative Service Reliability Operations Center (Mission Control) to provide extraordinary support for Cloud products and services. The role involves collaborating with Site Reliability Engineering, Security Operations Center, DevOps teams, and other partners to achieve near 100% service availability.

Responsibilities

Provide 24/7 follow-the-sun services across continents.
Work a four-day week of 10-hour days, with a schedule that includes either Saturday or Sunday.
Monitor and manage production compute and storage environments.
Develop and use alerts and alarms to prevent issues and incidents.
Collaborate with developers to develop predictive support and diagnostic routines.
Perform systems administration, network administration, and security incident monitoring.
Create and update runbooks based on service understanding and feature changes.
Initiate incident management procedures and engage subject matter experts as needed.
Maintain positive team engagement and client relations through incident resolution.
Perform other tasks to ensure extraordinary service levels.

Requirements

5+ years administering open system servers in production.
2+ years in demanding Internet, Cloud, or Telecommunications environments in Systems Administration, DevOps, SRE, or NOC roles.
B.S. in relevant discipline or equivalent experience.
Expertise using monitoring tools and ticketing systems.
Strong problem-solving, analytical, and troubleshooting skills.
Strong server administration skills, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables; RHCE certification or equivalent.
Python scripting experience preferred but not required.
Experience running virtual machines on open source or commercial hypervisors.
Experience operating cloud services (public or private).
Knowledge of application containers and container orchestration.
Basic understanding of Git.
Experience using Ansible for system administration and analyzing performance via monitoring data.

Benefits

Base salary range: $124,000 - $224,250 USD, depending on location, experience, and comparable pay.
Equity and benefits eligibility.
Applications are accepted on an ongoing basis.

NVIDIA fosters a diverse workforce and is an equal opportunity employer valuing innovation, excellence, determination, and teamwork.