Distinguished Engineer, Observability, Monitoring, and Remediation

at NVIDIA
USD 308,000-471,500 per year
SENIOR
✅ On-site

Required Skills & Competences

  • Security @ 4
  • Kubernetes @ 6
  • Leadership @ 4
  • Technical Proficiency @ 6
  • Communication @ 4
  • IaaS @ 4
  • Technical Leadership @ 4

Details

NVIDIA is seeking a technology leader to own the development of the DGX Cloud strategy for observability, monitoring, and remediation across infrastructure, IaaS, platforms, and applications. The role includes defining the technical strategy for collecting, storing, and analyzing observability data and applying models to it; for issue detection and notification; and for auto-remediation that detects, fixes, validates, and restores components to service across multiple systems and layers. The role requires close cross-functional collaboration with NVIDIA leadership, customers, infrastructure providers, and partners to deliver high availability and operational excellence for accelerated computing infrastructure.

Responsibilities

  • Define and drive the technical implementation for DGX Cloud offerings in the observability, monitoring, and remediation practice.
  • Collaborate across domains to drive awareness and technical strategy into DGX Cloud engineering practices.
  • Guide technical delivery into DGX Cloud systems across enterprise, public cloud, and high-security or isolated deployments.
  • Engage with external stakeholders (customers, infrastructure providers, strategic partners) to ensure solutions meet industry standards for availability and operational excellence.
  • Enable DGX Cloud teams, customers, and partners to achieve operational excellence across environments.
  • Lead full software and system lifecycle activities: ideation, design, development, continuous deployment, operations, and lifecycle management for large technical scopes.

Requirements

  • 18+ years overall in technical roles with a focus on observability and monitoring for cloud infrastructure, platforms, and applications.
  • 5+ years of experience in a lead role.
  • BS, MS, or higher degree, or equivalent experience, in systems/software engineering or a related engineering field.
  • Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, and higher-level abstractions (IaaS, Kubernetes, Slurm), as well as AI/ML platforms and applications.
  • Proven success delivering technically sophisticated solutions that provide transparency into resource utilization, performance, and operational insights.
  • Strong technical leadership: ability to synthesize cross-functional needs into architecture and design while guiding execution across teams.
  • Excellent communication and partnership skills; capable of leading engineering engagement and presenting to peers, partners, and high-performance accelerated computing customers.

Ways to Stand Out

  • Real-world experience applying AI (model development, RAG, MCP, agentic AI) to observability data analytics, issue identification, and remediation.
  • Direct experience designing, developing, delivering, and operating highly available, scaled (up/out) systems in enterprise and cloud environments.
  • History of creating scalable processes and extensible systems that make observability, monitoring, and remediation foundational capabilities for engineers building infrastructure, IaaS, platforms, and applications.
  • Familiarity with open source ecosystems and projects in the observability space and ability to collaborate and influence open source project governance.

Compensation & Benefits

  • Base salary range: 308,000 USD - 471,500 USD (base salary determined by location, experience, and pay of employees in similar positions).
  • Eligible for equity and benefits (see NVIDIA benefits page).
  • Applications accepted at least until August 28, 2025.

Equal Opportunity

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. NVIDIA does not discriminate on the basis of protected characteristics and values diversity among its employees.