Staff Site Reliability Engineer - Enterprise Identity And Access

at Nvidia
USD 168,000-258,750 per year
SENIOR
✅ On-site

Required Skills & Competences

Security @ 4, Ansible @ 4, Consul @ 4, Go @ 7, Grafana @ 4, Jenkins @ 7, Kubernetes @ 4, Linux @ 6, Prometheus @ 4, Vault @ 4, DevOps @ 7, IaC @ 4, Terraform @ 4, Python @ 7, GCP @ 4, CI/CD @ 4, AWS @ 4, Azure @ 4, Networking @ 6, SRE @ 4, CloudFormation @ 4, Reporting @ 4, OpenShift @ 7, Audit @ 4, Compliance @ 4, OpenTelemetry @ 6

Details

At NVIDIA, continuous innovation in AI and accelerated computing demands robust, automated, and secure production environments. We are seeking a deeply skilled Staff Site Reliability Engineer (SRE) to advance our enterprise security initiatives around identity and access, delivering zero trust outcomes by implementing, integrating, and scaling innovative technologies across cloud-native and hybrid infrastructures.

This position requires a strong software engineering background, but focuses on reliability, scalability, and operational excellence. A strong candidate excels in crafting secure systems, integrating internal and commercial products, and using sophisticated tools.

Responsibilities

  • Support, operationalize, and scale zero trust identity and access platforms, driving reliability, automation, and secure credential and policy management across on-premise and cloud environments.
  • Integrate and automate the deployment, monitoring, and lifecycle management of existing commercial and open-source products (SPIRE, Teleport, etc.), emphasizing ephemeral certificate-based authentication, mTLS, and SPIFFE protocols.
  • Advocate for operational guidelines for CI/CD, infrastructure as code (IaC), policy as code, and security observability, using tools like Kubernetes, Argo CD, GitLab CI, Terraform, Vault, Prometheus, and Grafana.
  • Apply AI-assisted and data-driven approaches to automate anomaly detection, incident response, and compliance reporting, driving continuous improvement in system uptime and threat mitigation.
  • Collaborate with engineering, DevSecOps, and security teams to minimize manual intervention, limit privileged access, and enforce policy compliance through scalable automation.
  • Support incident management, triaging, and blameless postmortems with security context, ensuring rapid root-cause analysis and recovery.
  • Conduct ongoing risk assessments, proactively address emerging threats and vulnerabilities, and contribute to post-incident reviews focused on reliability and trust boundary breaches.

Requirements

  • Bachelor’s or Master’s degree in Computer Science or related field, or equivalent experience.
  • 6+ years of software engineering/DevOps/SRE experience, with a significant focus on operational security, automation, and identity management.
  • Proficiency in Linux administration, networking concepts, and security protocols.
  • Proven track record of integrating and operating container platforms (Kubernetes, OpenShift, Nomad), with strong emphasis on automation and CI/CD (Argo CD, GitLab CI, Jenkins, Spinnaker, etc.).
  • Hands-on knowledge of zero trust security principles, including SPIFFE/SPIRE, mTLS, X.509 rotation, SSO, OAuth2/OIDC, LDAP, and cloud IAM services.
  • Experience with secrets management (Vault, AWS/Azure/Google Secret Manager, K8s Secrets) and infrastructure as code (Terraform, Pulumi, Ansible, CloudFormation).
  • Proficiency with observability and monitoring tools (Prometheus, Grafana, ELK Stack, OpenTelemetry, or equivalent) and policy automation frameworks.
  • Strong background in automation using Python, Go, or similar languages.
  • Demonstrated ability to lead operational and incident response efforts at scale, developing runbooks and playbooks that leverage both automation and AI tools.

Ways To Stand Out

  • Experience operationalizing service mesh, identity federation, or policy engines in reliability-focused environments (Istio, Linkerd, Consul Connect).
  • Track record of advancing zero trust architecture through automation and minimized human access, including ephemeral credentials and policy enforcement.
  • Background in integrating AI/ML-assisted tools for operational intelligence, anomaly detection, and reliability improvements.
  • Experience driving compliance, audit readiness, and operational security in cloud (AWS/GCP/Azure) and hybrid environments.
  • Relevant security/DevOps/SRE certifications and open-source contributions.

Compensation & Benefits

  • Base salary range: 168,000 USD - 258,750 USD (determined based on location, experience, and pay of employees in similar positions).
  • You will also be eligible for equity and benefits.

Additional Details

  • Location: Santa Clara, CA, United States (full-time).
  • Applications accepted at least until October 3, 2025.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.