Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Elasticsearch @ 4
Grafana @ 4
Kafka @ 4
Kubernetes @ 4
Linux @ 7
Prometheus @ 4
DevOps @ 4
IaC @ 4
Terraform @ 4
Python @ 4
Spark @ 4
CI/CD @ 4
Distributed Systems @ 6
Flink @ 4
Hiring @ 4
Bash @ 4
Helm @ 4
Networking @ 7
SRE @ 6
Microservices @ 7
Debugging @ 7
API @ 4
GPU @ 4
Observability @ 4
AI @ 4
ClickHouse @ 4
Change Management @ 4
Details
NVIDIA is building an AI Data Center AIOps platform that turns raw, high-volume telemetry into reliable, job-centric insights and automation for GPU fleets. The team is hiring a DevOps Engineer to operate the platform itself (not the compute cluster): uptime, performance, data integrity, and safe change management. The role owns SLOs/SLIs, incident response, and postmortems for telemetry ingestion, processing, storage, and APIs/dashboards that operators depend on, partnering with Software and Systems Engineering to translate platform signals into actionable alerts and automation.
Responsibilities
- Continuously monitor platform health via dashboards/logs/metrics, automate recurring checks, and keep reliability and resource efficiency on track.
- Own Kubernetes deployments end-to-end (runbooks, canary checks, post-deploy validation), and lead rollbacks/remediations when needed.
- Lead first-level incident triage: collect diagnostics, identify likely root causes, and hand off clear, actionable findings to engineering.
- Build and maintain runbooks, SOPs, and checklists; push continuous improvement through automation.
- Manage deployment infrastructure and packaging (Helm, Terraform / IaC) to keep environments scalable, consistent, and reproducible.
- Contribute to adjacent functional areas to broaden your skills and support your teammates.
Requirements
- BS/MS in Computer Science/Computer Engineering (or equivalent experience) and 5+ years operating production distributed systems as SRE/DevOps/Platform Ops.
- Proven ownership of reliability for an observability/AIOps platform: SLOs/SLIs, on-call, incident handling, and post-incident evaluations that drive measurable improvements.
- Deep Kubernetes and container experience (deploying, debugging, scaling) for telemetry-heavy microservices (ingestion, processing, storage, APIs, and UI).
- Automation-first approach: solid scripting (Python/Bash), CI/CD, and infrastructure-as-code (Terraform and Helm) to deliver safe rollouts (canaries/rollbacks), reproducible environments, and minimal toil.
- Clear communicator who writes excellent runbooks/docs and can translate ambiguous requirements into concrete operational practices and dependable customer-facing reliability.
Preferred / Ways to stand out
- Strong Linux and networking fundamentals, distributed systems instincts, and hands-on ops for Kubernetes, services, and streaming stacks.
- Experience building safe automation that operators trust: canary releases, automated rollback criteria, monitoring-for-the-monitoring (lag/drop/error budgets), and replay/backfill pipelines with correctness checks.
- Experience with distributed/streaming systems operations (Kafka/Pulsar, Flink/Spark), and with storage/analytics systems (ClickHouse, Elasticsearch, TSDBs, object storage).
- Proven programming experience building automation tools/services (ideally in Python) to simplify operations and scale recurring processes.
- Experience running large-scale production deployments and multiple Kubernetes environments or clusters across teams/customers, coordinating changes and rollouts with minimal disruption.
- Hands-on experience with observability tools (dashboards, metrics, logs, traces) such as Prometheus and Grafana.
Benefits
- Base salary range (location/level dependent): 148,000 USD - 235,750 USD for Level 3; 176,000 USD - 276,000 USD for Level 4.
- Eligible for equity and additional benefits (link to NVIDIA benefits referenced in posting).
Additional information
- Applications accepted at least until February 27, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.