Distinguished Engineer, Observability, Monitoring, and Remediation
Required Skills & Competences
- Security @ 4
- Kubernetes @ 6
- Leadership @ 7
- Technical Proficiency @ 6
- Communication @ 7
- IaaS @ 4
- Technical Leadership @ 7
Details
NVIDIA is leading the industry in delivering accelerated computing in cloud and enterprise environments. As a technology leader, you will own the development of the DGX Cloud strategy for observability, monitoring, and remediation across all layers of infrastructure, IaaS, platforms, and applications. You will define and drive the technical strategy for collecting, storing, and analyzing observability data and for developing models from it, define strategies for issue detection and notification, and develop auto-remediation strategies to detect, fix, validate, and restore-to-service components across all layers and systems. You will work cross-functionally with NVIDIA leadership to build accelerated computing infrastructure with high availability and strong operational standards without compromising performance or developer experience.
Base salary range: 308,000 USD - 471,500 USD. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until August 28, 2025.
Responsibilities
- Define and drive the technical implementation and architecture for DGX Cloud offerings in observability, monitoring, and remediation.
- Collaborate across domain disciplines to drive awareness and technical strategy into DGX Cloud engineering practices.
- Guide technical delivery into DGX Cloud systems across enterprise, public cloud, and high-security / isolated deployments.
- Engage with customers, infrastructure providers, and strategic partners to ensure industry standards for availability and operational excellence of accelerated computing infrastructure and platforms.
- Enable operational excellence for DGX Cloud customers and partners across environments.
- Lead full software and system lifecycle activities across a large technical scope: ideation, design, development, continuous deployment, operations, and lifecycle management.
Requirements
- 18+ years overall in technical roles with a focus on observability and monitoring for cloud infrastructure, platforms, and applications.
- 5+ years of lead experience.
- BS/MS or higher or equivalent experience in systems/software engineering or related engineering fields.
- Technical proficiency in multi-tenant data center and cloud-native architectures, including bare metal, virtualization, containerization, and higher-level abstractions (IaaS, Kubernetes, Slurm), and AI/ML platforms and applications.
- Proven success delivering technically sophisticated solutions that provide transparency into resource utilization, performance, and operational insights.
- Strong technical leadership: ability to synthesize multi-functional needs into architecture and design while guiding execution across teams.
- Strong communication and partnership skills to lead engineering engagement and collaborate with peers, partners, and customers.
Ways to Stand Out / Nice to Have
- Real-world experience applying AI for observability analytics and remediation, including model development, RAG, MCP, and agentic AI solutions.
- Direct experience designing, developing, delivering, and operating highly available, scaled systems in enterprise and cloud environments.
- History of creating scalable processes and extensible systems that make observability, monitoring, and remediation foundational capabilities.
- Familiarity with open source ecosystems and projects in observability and monitoring; ability to collaborate and influence open source governance and technical direction.
Benefits
- Eligibility for equity and company benefits (see NVIDIA benefits).
- Opportunity to work on accelerated computing, AI, and high-performance computing initiatives at large scale.
Other Information
- Location: Santa Clara, California, United States.
- Full-time role. Application window open at least until August 28, 2025.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.