Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Docker @ 4
Go @ 4
Kubernetes @ 4
Linux @ 4
Ruby @ 4
Python @ 4
Distributed Systems @ 7
Communication @ 7
Mathematics @ 4
Networking @ 4
OpenStack @ 4
Perl @ 4
SRE @ 4
GPU @ 4
Details
Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline to design, build and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This role focuses on ensuring internal and external GPU cloud services run with maximum reliability and uptime while enabling developers to make changes safely through automation, performance tuning and capacity management.
Responsibilities
- Lead, design, implement and support operational and reliability aspects of large-scale Kubernetes clusters with focus on performance at scale, real-time monitoring, logging and alerting
- Engage in and improve the whole lifecycle of services — from inception and design through deployment, operation and refinement
- Support services before they go live through system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
- Maintain services once live by measuring and monitoring availability, latency and overall system health
- Scale systems sustainably through automation and by driving changes that improve reliability and velocity
- Practice sustainable incident response and blameless postmortems
- Participate in an on-call rotation to support production systems
Requirements
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
- 16+ years of experience with infrastructure automation and distributed systems design; experience designing and developing tools for running large-scale private or public cloud systems in production
- Experience in one or more of: Python, Go, Perl or Ruby
- In-depth knowledge of Linux, networking and containers
- Experience across coding, databases, capacity management, continuous delivery and deployment, and open-source cloud technologies
Ways to stand out
- Interest in crafting, analyzing and fixing large-scale distributed systems
- Systematic problem-solving approach, strong communication skills, sense of ownership and drive
- Ability to debug and optimize code and automate routine tasks
- Experience using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker
Benefits
- Base salary range: 320,000 USD - 488,750 USD
- Eligible for equity and NVIDIA benefits (link to company benefits)