Senior Product Manager - Observability and Resilience
at NVIDIA
Santa Clara, United States
USD 208,000-327,750 per year
Required Skills & Competences
Marketing @ 4, Docker @ 4, Grafana @ 3, Kubernetes @ 7, Prometheus @ 3, Datadog @ 3, Distributed Systems @ 4, MLOps @ 4, Communication @ 4, Networking @ 7, SRE @ 4, Performance Monitoring @ 4, IaaS @ 7, Splunk @ 3, Compliance @ 7, Agile @ 7, OpenTelemetry @ 3, GPU @ 4, LLMOps @ 4
Details
NVIDIA is building foundational tools to ensure resiliency and observability for large-scale accelerated computing platforms. This role leads development of system diagnostics, performance monitoring, and automated recovery tooling to help customers operate AI training and inference workloads with maximum uptime and efficiency.
Responsibilities
- Be a subject-matter expert on resiliency and observability: understand failure modes across GPU hardware, network, and software stacks, the telemetry signals that reveal them, and how they correlate to workload health and SLOs.
- Master modern reliability architectures and keep up to date with industry trends.
- Drive joint project planning with external partners and define concrete milestones, tasks, and deliverables for resiliency and observability initiatives.
- Fuel innovation in reliability tooling: lead ideation sessions and shape proofs-of-concept.
- Bridge development, SRE, and partner teams: facilitate clear communication, triage emergent issues rapidly, and ensure tight feedback loops between engineering and customer operations.
- Coordinate execution across functions: work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into product launches, capacity expansions, and lifecycle transitions.
Requirements
- BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product-management experience in enterprise technology.
- Experience with GPU observability (DCGM, NVML, etc.) and integration into large-scale telemetry systems.
- Deep knowledge of AI/ML infrastructure, high-performance computing (HPC), networking, and cloud technologies (IaaS, PaaS), including containerization, Kubernetes, and automation tools.
- Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch.
- Experience building, and preferably a deep understanding of, secure, compliance-focused telemetry pipelines (SOC 2, FedRAMP).
- Ability to articulate trade-offs among latency, throughput, cost, and reliability to both engineering and executive audiences.
- Data-driven approach: define SLIs/SLOs, manage error budgets, and develop value models.
- Strong cross-functional execution: write clear specs and PRDs, produce GTM collateral, and lead agile processes.
Preferred / Ways to Stand Out
- Master's or PhD, or expertise in distributed systems, performance modeling, or fault-tolerant computing.
- Experience with MLOps and LLMOps ecosystems and their integration with enterprise platforms; a record of delivering ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification.
- Startup or 0-to-1 experience building cloud-native observability or resilience tools; a history of bringing open-source observability products to market and shaping GTM strategy.
- Familiarity with monitoring platforms and integrations such as Splunk, Datadog, and Grafana Cloud.
- Expertise with containerization (Docker, Kubernetes), virtualization, network architecture, and high-performance interconnects (InfiniBand, Ethernet, RoCE).
Compensation & Benefits
- Base salary range: 208,000 USD - 327,750 USD (determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and NVIDIA benefits.
Additional Information
- Position type: Full-time, hybrid.
- Applications accepted at least until August 21, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.