Required Skills & Competences
Security @ 4, Docker @ 4, Go @ 6, Grafana @ 7, Kafka @ 4, Kubernetes @ 4, Prometheus @ 7, VictoriaMetrics @ 7, Python @ 6, Java @ 6, Datadog @ 4, Distributed Systems @ 4, Machine Learning @ 4, Networking @ 7, Microservices @ 4, Compliance @ 4, OpenTelemetry @ 7
Details
NVIDIA is looking for a highly skilled Principal Software Engineer to design and develop AIOps & Observability platforms. Internal teams use these platforms to monitor, diagnose, and optimize products and millions of assets and services across cloud, on-prem, data center, supply chain, and edge environments. You will work with engineers, product managers, and partners to define the observability strategy, roadmap, and standards for NVIDIA, and mentor other engineers on observability, machine learning, tools, and techniques.
Responsibilities
- Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
- Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.
- Collaborate with teams and customers to understand observability needs and provide solutions that meet requirements and expectations.
- Establish and implement observability standards, guidelines, and processes across NVIDIA; research, evaluate, and adopt new observability technologies and frameworks.
- Provide peer reviews including feedback on performance, scalability, security, and correctness.
- Work with data scientists to implement ML models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events; handle large volumes of data and ensure data quality, security, and compliance.
- Develop and operate scalable, reliable, distributed systems that handle high traffic and complex workloads.
- Find opportunities to automate remediation of common issues to operate systems reliably and efficiently.
Requirements
- Bachelor’s degree in computer science, engineering, or related field, or equivalent experience.
- 15+ years of experience in product development and full stack engineering, with 5+ years developing and operating observability platforms and solutions, preferably in a cloud-native environment.
- Strong knowledge and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, OpenTelemetry.
- Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, Datadog.
- Experience with Kubernetes, Nomad, Docker, and microservices architectures.
- Experience with streaming services to ingest high volumes of events using NATS, Kafka, etc.
- Proficiency in one or more programming languages such as Go, Python, Java, C#.
- Experience developing observability solutions to monitor on-prem and public cloud environments; experience running large observability platforms on bare-metal infrastructure.
- Experience establishing scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics.
Ways to Stand Out
- Deep understanding of implementing observability solutions at large scale for on-prem infrastructure and networking.
- Hands-on experience managing large-scale observability platforms with LLMs & ML models and building custom services to ingest billions of metrics and logs from a wide range of assets.
- Experience developing a unified cloud observability platform to monitor network, compute, power, storage, operating systems, security, applications, and SaaS platforms.
- Demonstrated experience using machine learning and generative AI for predictive monitoring, incident diagnosis, summarization, and correlation.
- Proficiency in AI/ML systems, generative AI, or agentic AI frameworks.
Benefits
- Eligible for equity and benefits (see NVIDIA benefits page).
- NVIDIA is an equal opportunity employer committed to diversity and inclusion.
Compensation and Logistics
- Base salary range: 248,000 USD - 391,000 USD (final base salary determined by location, experience, and pay of employees in similar positions).
- You will also be eligible for equity and additional benefits.
- Application deadline: applications are accepted at least until December 23, 2025.
- Location: Santa Clara, California, United States (hybrid).