Senior Site Reliability Engineer

at Nvidia

📍 Santa Clara, United States

$184,000-356,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Software Development @ 7, Docker @ 4, Go @ 4, Grafana @ 4, Jenkins @ 7, Kubernetes @ 4, Prometheus @ 4, Python @ 4, GitHub @ 7, CI/CD @ 4, Datadog @ 4, Hiring @ 4, Leadership @ 7, AWS @ 4, Azure @ 4, Communication @ 4, SRE @ 4, Performance Monitoring @ 4, Microservices @ 4, Jira @ 6, Debugging @ 4, API @ 4, HTTP @ 4, Oracle @ 4

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

Responsibilities

As a Senior Site Reliability Engineer, you will work with the development and support teams to provide platform deployment, performance monitoring, and build support; participate in on-call rotations; perform incident triage and system recovery; and hold development teams accountable for their features (quality, usability, telemetry, documentation).

Our mission is to build and host a suite of microservices to empower NVIDIA's AI technology. The NeMo organization is responsible for building and deploying Generative AI services, including large language models. You will apply engineering leadership and deep knowledge of infrastructure and software development at scale to own the operation, adoption, and evolution of these services. You will lead by example, mentor the engineering teams, and establish credibility through quality technical execution, including hands-on contributions to code and automation to keep things running efficiently. The solutions include web portals, REST APIs, and a microservice architecture backend. The SRE technical stack includes:

Cloud Platforms (AWS, Azure, Google, Oracle)
Kubernetes, Docker, various container runtimes
Python, Go, Unix Shells
GitLab (source control, CI/CD)
Datadog, Prometheus, Grafana, ELK
JIRA, Confluence

Requirements

Strong cloud management foundation.
8+ years of relevant experience.
Sysadmin experience in handling large-scale distributed system software deployments.
Proficient with containerization and cluster management technologies like Docker and Kubernetes.
Strong understanding of build/release systems and CI/CD, and experience with solutions like GitLab, GitHub, and Jenkins.
Excellent problem solving and debugging skills.
Scripting and coding experience with Python, Golang, or Shell.
Experience providing effective customer-oriented incident support.
Outstanding teammate who can collaborate and influence in a multifaceted environment.
Excellent interpersonal and written communication skills.
BS degree in Computer Science or a related technical field, or equivalent experience.

Benefits

Widely considered to be one of the technology world’s most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com.
You will also be eligible for equity and benefits. NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.