Senior System Software Engineer - Data Platform Observability
at NVIDIA
USD 184,000-287,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Software Development @ 7
Ansible @ 4
Go @ 7
Grafana @ 4
Kubernetes @ 4
Prometheus @ 4
Terraform @ 4
Python @ 7
Spark @ 4
Java @ 7
Helm @ 4
JavaScript @ 7
React @ 7
Rust @ 7
Microservices @ 4
API @ 4
Audit @ 4
Compliance @ 4
OpenTelemetry @ 4
Observability @ 4
AI @ 4
Details
NVIDIA’s Hardware Infrastructure organization is seeking a Senior System Software Engineer to lead the evolution of the next-generation Data & Observability Platform. The team serves and collaborates with NVIDIA’s AI, hardware, and software engineering and research teams. The role is a full-stack technical lead responsible for the Observability stack and for building a centralized platform that thousands of NVIDIA engineers use to visualize chip telemetry, debug distributed pipelines, and ensure platform reliability.
Responsibilities
- Architect high-performance ingestion: design and build centralized telemetry pipelines that handle massive scale, and solve global latency challenges by replacing legacy proxy models with modern, push-based edge collection architectures.
- Build policy enforcement systems: design and implement infrastructure for data governance, policy engines, access control enforcement points, secure credential management, and audit logging (building governance controls into a platform, not just administering them).
- Focus on user experience: develop a modern web interface and APIs that unify distinct observability signals into a consolidated user experience.
- Optimize storage and cost: implement cost-effective tiered storage architectures and define strategies for routing high-volume data to cold storage solutions to reduce costs while maintaining multi-year data retention.
- Drive platform automation: architect workflow orchestration systems to automate platform maintenance, data lifecycle management, and complex pipeline operations.
- Provide operational and strategic data to empower engineers and researchers to continuously improve quality, workloads, and processes through better observability.
Requirements
- BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience).
- 8+ years of full-stack software development experience with a focus on Data Platforms or Infrastructure Tools.
- Strong full-stack fluency: proficiency in high-performance backend systems programming and in modern frontend web frameworks for building responsive user interfaces (for example, Python, JavaScript, Java, Rust, Go, React, or similar).
- Observability expertise: experience with platforms such as Apache Spark, Elastic/OpenSearch, Grafana, Prometheus, and similar open-source tools. Hands-on experience operating and extending the Grafana ecosystem or ELK stack at scale. Understanding of the internals of time-series databases and inverted indexes.
- Infrastructure-as-code experience: deploying complex stateful services on Kubernetes using Helm, Terraform, or Ansible.
- Familiarity with streaming and storage technologies and modern data lake formats.
Ways to Stand Out
- Experience writing custom Grafana data source plugins or backend plugins in Go.
- Background migrating legacy monoliths to microservices or Vector-based pipelines.
- Experience with OpenTelemetry (OTEL) collector configuration, writing custom processors, or instrumentation SDKs.
- Background in data governance, including implementation of Policy-as-Code or compliance frameworks in regulated environments.
Compensation & Benefits
- Base salary range: 184,000 USD - 287,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits (link to benefits referenced in original posting).
Other Information
- Location listed: Santa Clara, CA, United States.
- Employment type: Full time.
- Applications accepted at least until March 1, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.