Required Skills & Competences
Security @ 3, Ansible @ 3, Linux @ 3, DevOps @ 3, Python @ 3, Communication @ 6, Git @ 3, Networking @ 6, SRE @ 5, GPU @ 3
Details
NVIDIA's NGC (NVIDIA GPU Cloud) team is seeking highly motivated System Administrator/DevOps engineers to design, develop, and implement a global, dynamic, state-of-the-art Service Reliability Operations Center (Mission Control) that provides extraordinary support for Cloud products and services.
Responsibilities
- Provide 24/7 service support in a follow-the-sun environment across continents.
- Report directly to a manager in Bangalore.
- Work four 10-hour days per week, including either Saturday or Sunday, to ensure 24/7 coverage with the US and India teams.
- Monitor and triage a growing on-premises and Cloud Service Provider (CSP) production compute and storage datacenter environment.
- Utilize alerts and alarms to help prevent issues and incidents.
- Collaborate with the developer community to develop and execute predictive support or diagnostic routines.
- Perform Linux administration tasks, network administration, and security incident monitoring.
- Learn how the service works and translate that into runbooks for team use, updating runbooks with new features and functionality.
- Use strong communication and interpersonal skills to keep the team engaged through incident resolution, including initiating incident management procedures.
Requirements
- BS/BE degree in Computer Science, Electronics, or equivalent experience.
- Minimum 3 years' experience administering open system servers in demanding Internet, Cloud, or Telecommunications environments in a Linux Systems Administrator, DevOps, SRE, or NOC role.
- Strong problem-solving, analytical, and troubleshooting abilities on Linux clusters on public or private clouds.
- Strong Linux administration experience, including shell scripting, automation, DNS, DHCP, storage concepts, basic networking, and iptables; RHCE or equivalent knowledge.
- Experience scripting in Python and writing Ansible playbooks is preferred.
- Knowledge of application containers, container orchestration systems, and Git workflows.
- Experience analyzing system and network performance using monitoring alerts, data, and graphs.
- Ability to master and maintain complicated environments.
Benefits
- Competitive salaries and a comprehensive benefits package.
- Opportunity to work with some of the most forward-thinking and talented engineering teams.
- Fast-growing world-class engineering environment.
- Autonomy and creative freedom in engineering work.