Senior System Software Engineer, Cloud Services

at NVIDIA

📍 Santa Clara, United States

USD 184,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Ansible @ 4, Docker @ 6, Go @ 4, Grafana @ 4, Kubernetes @ 6, Prometheus @ 4, Vault @ 4, Terraform @ 4, Python @ 4, GCP @ 6, Java @ 3, Datadog @ 4, AWS @ 6, Azure @ 6, Communication @ 7, Helm @ 4, JavaScript @ 4, Networking @ 4, Next.js @ 4, React @ 4, Cassandra @ 4, Spring Boot @ 3, OpenTelemetry @ 4

Details

Our team builds, operates, and maintains cloud-hosted services that provide user and service authentication/authorization across NVIDIA. Ensuring continuity of operations is critical to our mission. You will ensure the reliability, performance, and scalability of these services and build observability infrastructure to proactively identify and resolve operational issues.

Responsibilities

Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for cloud-based services.
Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams.
Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
Collaborate with software, platform, and networking teams to integrate observability across the application lifecycle, from development to incident response.
Drive automation to reduce manual toil in monitoring, telemetry, and incident-response workflows; build and maintain self-service observability tooling.
Diagnose and resolve performance and reliability issues using root cause analysis, distributed tracing, and log correlation.
Participate in on-call rotations (PagerDuty), contribute to post-incident reviews, document findings, and drive remediation to improve long-term resilience and visibility.
Develop expertise in the team's offerings and assist in managing support channels for other NVIDIA teams.

Requirements

Bachelor’s or Master’s degree in Computer Science, Engineering, or equivalent experience.
8+ years in large-scale systems engineering roles, with experience in live service development, end-to-end deployment, observability, and on-call duty.
Hands-on production experience with modern monitoring and observability systems such as Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, and OpenTelemetry.
Advanced coding skills in Python and/or Go (or similar languages) for building automation and integrating observability solutions.
Comfort with JavaScript frameworks such as React and Next.js; willingness to support front-end admin services is a plus.
Proficiency in cloud platforms (primarily AWS; also GCP and Azure) and containerized environments (Kubernetes, Docker).
Experience with configuration-as-code and deployment tooling such as Terraform, Helm, and Ansible.
Strong communication and collaboration skills; experience working in global, cross-disciplinary teams.
Experience with incident management, postmortem processes, and PagerDuty/on-call operations.
Strong analytical problem-solving approach and high standards for operational excellence and customer satisfaction.

Ways to stand out / Nice to have

Familiarity with the Java Spring Boot framework.
Hands-on experience with Apache Cassandra and HashiCorp Vault.
Previous work on custom React-based admin/front-end services.

Compensation & Benefits

Base salary range: 184,000 USD - 287,500 USD (determined by location, experience, and market comparisons).
Eligible for equity and NVIDIA benefits.

Additional details

Applications accepted at least until September 7, 2025.
NVIDIA is an equal opportunity employer committed to diversity and inclusion.