Senior DevOps Engineer, AIOps

at NVIDIA
USD 148,000-276,000 per year
SENIOR
✅ On-site

Required Skills & Competences

Elasticsearch @ 4, Grafana @ 4, Kafka @ 4, Kubernetes @ 4, Linux @ 7, Prometheus @ 4, DevOps @ 4, IaC @ 4, Terraform @ 4, Python @ 4, Spark @ 4, CI/CD @ 4, Distributed Systems @ 6, Flink @ 4, Hiring @ 4, Bash @ 4, Helm @ 4, Networking @ 7, SRE @ 6, Microservices @ 7, Debugging @ 7, API @ 4, GPU @ 4, Observability @ 4, AI @ 4, ClickHouse @ 4, Change Management @ 4

Details

NVIDIA is building an AI Data Center AIOps platform that turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets. The team is hiring a DevOps Engineer to operate the platform itself (not the compute cluster), with responsibility for its uptime, performance, data integrity, and safe change management. The role owns SLOs/SLIs, incident response, and postmortems for the telemetry ingestion, processing, storage, and APIs/dashboards that operators depend on, and partners with Software and Systems Engineering to translate platform signals into actionable alerts and automation.

Responsibilities

  • Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability and resource efficiency on track.
  • Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
  • Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
  • Build and maintain runbooks, SOPs, and checklists; push continuous improvement through automation.
  • Manage deployment infrastructure and packaging (Helm, Terraform / IaC) to keep environments scalable, consistent, and reproducible.
  • Contribute to adjacent functional areas to broaden your own skills and support your teammates.

Requirements

  • BS/MS in Computer Science/Computer Engineering (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
  • Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, incident handling, and post-incident evaluations that drive measurable improvements.
  • Deep Kubernetes and container experience (deploying, debugging, scaling) for telemetry-heavy microservices (ingestion, processing, storage, APIs, and UI).
  • Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform and Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.
  • Clear communicator who writes excellent runbooks/docs and can translate ambiguous requirements into concrete operational practices and dependable customer-facing reliability.

Preferred / Ways to stand out

  • Strong Linux and networking fundamentals, distributed systems instincts, and hands-on ops for Kubernetes, services, and streaming stacks.
  • Experience building safe automation that operators trust: canary releases, automated rollback criteria, monitoring-for-the-monitoring (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
  • Experience with distributed/streaming systems operations (Kafka/Pulsar, Flink/Spark), and with storage/analytics systems (ClickHouse, Elasticsearch, TSDBs, object storage).
  • Proven programming experience building automation tools/services (ideally in Python) to simplify operations and scale recurring processes.
  • Experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams/customers, coordinating changes and rollouts with minimal disruption.
  • Hands-on experience with observability tools (dashboards, metrics, logs, traces) such as Prometheus and Grafana.

Benefits

  • Base salary range (location/level dependent): 148,000 USD - 235,750 USD for Level 3; 176,000 USD - 276,000 USD for Level 4.
  • Eligible for equity and additional benefits (link to NVIDIA benefits referenced in posting).

Additional information

  • Applications accepted at least until February 27, 2026.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.