Service Reliability Operations Engineer

at Nvidia

📍 Pune, India

USD 50,000-120,000 per year

MIDDLE

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3
Ansible @ 3
Linux @ 3
DevOps @ 3
Python @ 3
Communication @ 6
Git @ 3
Networking @ 6
SRE @ 5
GPU @ 3

Details

NVIDIA's NGC (NVIDIA GPU Cloud) team is seeking highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, state-of-the-art Service Reliability Operations Center (Mission Control) that provides extraordinary support for Cloud products and services.

Responsibilities

Provide 24/7 service support in a follow-the-sun environment across continents.
Report directly to a manager in Bangalore.
Work four 10-hour days per week, including either Saturday or Sunday, to ensure 24/7 coverage together with the US and India teams.
Monitor and triage a growing on-premises and Cloud Service Provider (CSP) production compute and storage datacenter environment.
Utilize alerts and alarms to help prevent issues and incidents.
Collaborate with the developer community to develop and execute predictive support or diagnostic routines.
Perform Linux administration tasks, network administration, and security incident monitoring.
Learn how the service works, translate that knowledge into runbooks for team use, and update the runbooks as new features and functionality are added.
Use strong communication and interpersonal skills to keep the team engaged through incident resolution, including initiating incident management procedures.

Requirements

BS/BE degree in Computer Science, Electronics, or equivalent experience.
Minimum 3 years' experience administering open system servers in demanding Internet, Cloud, or Telecommunications environments in a Linux Systems Administrator, DevOps, SRE, or NOC role.
Strong problem-solving, analytical, and troubleshooting abilities on Linux clusters on public or private clouds.
Strong Linux administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables; RHCE certification or equivalent knowledge.
Experience scripting in Python and writing Ansible playbooks is preferred.
Knowledge of application containers, container orchestration systems, and Git workflows.
Experience analyzing system and network performance using monitoring alerts, data, and graphs.
Ability to master and maintain complicated environments.

Benefits

Competitive salaries and a comprehensive benefits package.
Opportunity to work with some of the most forward-thinking and talented engineering teams.
Fast-growing world-class engineering environment.
Autonomy and creative freedom in engineering work.