Service Reliability Operations Engineer

at Nvidia
📍 Pune, India
USD 50,000-120,000 per year
MIDDLE
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Security @ 3, Ansible @ 3, Linux @ 3, DevOps @ 3, Python @ 3, Communication @ 6, Git @ 3, Networking @ 6, SRE @ 5, GPU @ 3

Details

NVIDIA's NGC (NVIDIA GPU Cloud) team is seeking highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, state-of-the-art Service Reliability Operations Center (Mission Control) that provides extraordinary support for Cloud products and services.

Responsibilities

  • Provide 24/7 service support in a follow-the-sun environment across continents.
  • Report directly to a manager in Bangalore.
  • Work four 10-hour days per week, including either Saturday or Sunday, to ensure 24/7 coverage with the US and India teams.
  • Monitor and triage a growing on-premises and Cloud Service Provider (CSP) production compute and storage datacenter environment.
  • Utilize alerts and alarms to help prevent issues and incidents.
  • Collaborate with the developer community to develop and execute predictive support or diagnostic routines.
  • Perform Linux administration tasks, network administration, and security incident monitoring.
  • Learn how the service works and translate that into runbooks for team use, updating runbooks with new features and functionality.
  • Use strong communication and interpersonal skills to keep the team engaged through incident resolution, including initiating incident management procedures.

Requirements

  • BS/BE degree in Computer Science, Electronics, or equivalent experience.
  • Minimum 3 years' experience administering open system servers in demanding Internet, Cloud, or Telecommunications environments in a Linux Systems Administrator, DevOps, SRE, or NOC role.
  • Strong problem-solving, analytical, and troubleshooting abilities on Linux clusters on public or private clouds.
  • Strong Linux administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables; RHCE or equivalent knowledge.
  • Experience scripting in Python and writing Ansible playbooks is preferred.
  • Knowledge of application containers, container orchestration systems, and Git workflows.
  • Experience analyzing system and network performance using monitoring alerts, data, and graphs.
  • Ability to master and maintain complicated environments.

Benefits

  • Competitive salaries and a comprehensive benefits package.
  • Opportunity to work with some of the most forward-thinking and talented engineering teams.
  • Fast-growing world-class engineering environment.
  • Autonomy and creative freedom in engineering work.