Senior DevOps Engineer, AIOps

at NVIDIA

📍 Santa Clara, United States

USD 148,000-276,000 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag is followed by an "@" symbol and a proficiency level. Proficiency levels:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and know all the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

ElasticSearch @ 4, Grafana @ 4, Kafka @ 4, Kubernetes @ 4, Linux @ 7, Prometheus @ 4, DevOps @ 4, IaC @ 4, Terraform @ 4, Python @ 4, Spark @ 4, CI/CD @ 4, Distributed Systems @ 6, Flink @ 4, Hiring @ 4, Bash @ 4, Helm @ 4, Networking @ 7, SRE @ 6, Microservices @ 7, Debugging @ 7, API @ 4, GPU @ 4, Observability @ 4, AI @ 4, ClickHouse @ 4, Change Management @ 4

Details

NVIDIA is building an AI Data Center AIOps platform that turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets. The team is hiring a DevOps Engineer to operate the platform itself (not the compute cluster): uptime, performance, data integrity, and safe change management. The role owns SLOs/SLIs, incident response, and postmortems for telemetry ingestion, processing, storage, and APIs/dashboards that operators depend on, partnering with Software and Systems Engineering to translate platform signals into actionable alerts and automation.

Responsibilities

Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability and resource efficiency on track.
Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
Build and maintain runbooks, SOPs, and checklists; push continuous improvement through automation.
Manage deployment infrastructure and packaging (Helm, Terraform / IaC) to keep environments scalable, consistent, and reproducible.
Contribute to adjacent functional areas to broaden your skills and support your teammates.

Requirements

BS/MS in Computer Science/Computer Engineering (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, incident handling, and post-incident evaluations that drive measurable improvements.
Deep Kubernetes and container experience (deploying, debugging, scaling) for telemetry-heavy microservices (ingestion, processing, storage, APIs, and UI).
Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform and Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.
Clear communicator who writes excellent runbooks/docs and can translate ambiguous requirements into concrete operational practices and dependable customer-facing reliability.

Preferred / Ways to stand out

Strong Linux and networking fundamentals, distributed systems instincts, and hands-on ops for Kubernetes, services, and streaming stacks.
Experience building safe automation that operators trust: canary releases, automated rollback criteria, monitoring-for-the-monitoring (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
Experience with distributed/streaming systems operations (Kafka/Pulsar, Flink/Spark), and with storage/analytics systems (ClickHouse, Elasticsearch, TSDBs, object storage).
Proven programming experience building automation tools/services (ideally in Python) to simplify operations and scale recurring processes.
Experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams/customers, coordinating changes and rollouts with minimal disruption.
Hands-on experience with observability tools (dashboards, metrics, logs, traces) such as Prometheus and Grafana.

Benefits

Base salary range (location/level dependent): 148,000 USD - 235,750 USD for Level 3; 176,000 USD - 276,000 USD for Level 4.
Eligible for equity and additional benefits (link to NVIDIA benefits referenced in posting).

Additional information

Applications accepted at least until February 27, 2026.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.