Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 6
Kubernetes @ 4
Prometheus @ 4
Python @ 6
GCP @ 4
GitHub @ 4
GitHub Actions @ 4
Datadog @ 4
ArgoCD @ 4
AWS @ 4
Azure @ 4
Bash @ 6
Communication @ 4
SRE @ 4
Planning @ 4
Microservices @ 7
Debugging @ 4
LLM @ 4
GPU @ 4
Observability @ 4
AI @ 4
Change Management @ 4
Details
NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its GeForce Now (GFN) team. SRE at NVIDIA ensures that internal and external-facing GPU cloud gaming services meet reliability and uptime commitments while enabling developers to make changes through careful preparation and planning, keeping an eye on capacity, latency, and performance. The SRE is responsible for the big picture of how systems relate to each other and uses a breadth of tools and approaches to tackle complex problems.
Responsibilities
- Drive service response and workflows; develop tools and services to maintain and improve service SLOs.
- Build tools to improve SRE observability.
- Participate in the Kubernetes migration journey with VMI setup and problem solving.
- Rapidly debug and triage incidents and user-reported issues.
- Automate, script, and build tooling to achieve high automation of daily tasks.
- Support services pre-launch through system design consulting, platform/framework development, capacity management, and launch reviews.
- Serve on an on-call rotation to support production systems.
Requirements
- MS or BS in Computer Science/Engineering or a related field, or equivalent experience.
- 8+ years of site reliability engineering experience working on large-scale distributed microservices in production with a strong focus on automation and tooling.
- Very strong Kubernetes background, including understanding complex and highly available VMI setups on Kubernetes.
- Experience leading production improvements: change management, post-mortem reviews, workflow processes, and delivering software automation in various languages.
- Strong problem-solving and root-cause analysis skills with a focus on optimization and efficiency.
- Prior experience with Datadog, Prometheus, Alertmanager, or similar monitoring systems.
- Experience managing multi-region cloud deployments on hyperscalers such as AWS, GCP, or Azure.
- Experience designing and managing deployment pipelines using tools such as GitHub Actions, GitLab CI, or ArgoCD.
- Excellent communication, presentation, social, and analytical skills; able to communicate complex concepts clearly to varied audiences.
- Production-grade coding proficiency in a language such as Go or Python, or robust Bash scripting skills.
- Production on-call experience is required; must have served in a primary production on-call rotation and responded to high-severity infrastructure alerts and service degradations.
Ways to stand out
- Experience with automated anomaly detection, log clustering tools, or LLM-assisted debugging platforms.
- Comfortable using AI on a day-to-day basis as an SRE.
- Prior experience as an SRE or Service Engineer is a plus.
Compensation and benefits
- Base salary range: 168,000 USD - 270,250 USD (final base salary determined by location, experience, and internal pay equity).
- Eligible for equity and company benefits (link to NVIDIA benefits provided in the original posting).
Other information
- Applications for this job will be accepted at least until June 4, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and committed to an inclusive work environment.