Senior System Software Engineer - Data Platform Observability

at Nvidia

📍 Santa Clara, United States

USD 184,000-287,500 per year

SENIOR

✅ On-site

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage, and able to handle common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know the pitfalls and tricks;

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 7, Ansible @ 4, Go @ 7, Grafana @ 4, Kubernetes @ 4, Prometheus @ 4, Terraform @ 4, Python @ 7, Spark @ 4, Java @ 7, Helm @ 4, JavaScript @ 7, React @ 7, Rust @ 7, Microservices @ 4, API @ 4, Audit @ 4, Compliance @ 4, OpenTelemetry @ 4, Observability @ 4, AI @ 4

Details

NVIDIA’s Hardware Infrastructure organization is seeking a Senior System Software Engineer to lead the evolution of the next-generation Data & Observability Platform. The team serves and collaborates with NVIDIA’s AI, hardware, and software engineering and research teams. The role is a full-stack technical lead responsible for the Observability stack and building a centralized platform used by thousands of NVIDIA engineers to visualize chip telemetry, debug distributed pipelines, and ensure platform reliability.

Responsibilities

Architect high-performance ingestion: design and build centralized telemetry pipelines that handle massive scale, and solve global latency challenges by replacing legacy proxy models with modern, push-based edge collection architectures.
Build policy enforcement systems: design and implement infrastructure for data governance, policy engines, access control enforcement points, secure credential management, and audit logging (building governance controls into a platform, not just administering them).
Focus on user experience: develop a modern web interface and APIs that unify distinct observability signals into a consolidated user experience.
Optimize storage and cost: implement cost-effective tiered storage architectures and define strategies for routing high-volume data to cold storage solutions to reduce costs while maintaining multi-year data retention.
Drive platform automation: architect workflow orchestration systems to automate platform maintenance, data lifecycle management, and complex pipeline operations.
Provide operational and strategic data to empower engineers and researchers to continuously improve quality, workloads, and processes through better observability.

Requirements

BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
8+ years of full-stack software development experience with a focus on Data Platforms or Infrastructure Tools.
Strong full-stack fluency: proficiency in high-performance backend systems programming and in modern frontend web frameworks for building responsive user interfaces (for example, Python, JavaScript, Java, Rust, Go, React, or similar).
Observability expertise: experience with platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and similar open-source tools. Hands-on experience operating and extending the Grafana ecosystem or ELK stack at scale. Understanding of the internals of time-series databases and inverted indexes.
Infrastructure-as-code experience: deploying complex stateful services on Kubernetes using Helm, Terraform, or Ansible.
Familiarity with streaming and storage technologies and modern data lake formats.

Ways to Stand Out

Experience writing custom Grafana data source plugins or backend plugins in Go.
Background migrating legacy monoliths to microservices or Vector-based pipelines.
Experience with OpenTelemetry (OTEL) collector configuration, writing custom processors, or instrumentation SDKs.
Background in data governance, including implementation of Policy-as-Code or compliance frameworks in regulated environments.

Compensation & Benefits

Base salary range: 184,000 USD - 287,500 USD (determined based on location, experience, and pay of employees in similar positions).
Eligible for equity and company benefits (link to benefits referenced in original posting).

Other Information

Location listed: Santa Clara, CA, United States.
Employment type: Full time.
Applications accepted at least until March 1, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.