Senior Platform Telemetry Engineer

at NVIDIA

📍 Santa Clara, United States

USD 148,000-287,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Grafana @ 7, Prometheus @ 7, Python @ 7, Algorithms @ 7, Machine Learning @ 4, Hiring @ 4, Communication @ 7, Git @ 4, Jira @ 4, Product Management @ 4, Debugging @ 4, API @ 7, Project Management @ 4, QA @ 4, System Architecture @ 4, Customer Support @ 4, GPU @ 4

Details

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We are looking to grow our company and establish teams with the most thoughtful people in the world.

The NVIDIA GH200 superchip provides the performance and productivity required for strong scaling of HPC and generative AI workloads. Scale-out is inherent to the design of this massive superchip. We are looking for expert engineers to help design rack-level solutions for next-generation AI supercomputing platforms at scale.

Join us at the forefront of technological advancement.

Responsibilities

Drive next-generation fleet management solutions for scaling AI infrastructure built on NVIDIA GPUs and Grace solutions. Work with customers, product management, and other architects to narrow down implementation requirements and ensure speed-of-light product development.
Design and clarify the architecture for fleet health monitoring and fault-remediation solutions at scale. Work with customers and other architects to understand their health monitoring requirements, making the best use of available in-band and out-of-band capabilities. Produce detailed architecture and perform proofs of concept (POCs) to validate it.
Educate customers about the product architecture and incorporate their feedback into necessary changes. Write architecture specs and design documents, and own end-to-end delivery of the product by working across teams. Perform code reviews for code produced from architecture specs.
Ensure the product is properly tested by working with development teams to enhance unit testing and create proper test plans.
Drive product life cycles with QA teams to productize the code, and take responsibility as product owner.
Articulate requirements using Jira and bug management tools and work out end-to-end execution plans in collaboration with other managers.
Contribute to all phases of product development: product definition, architecture and design, implementation, debugging, testing and early customer support.

Requirements

BS, MS, or PhD in EE/CS or related field (or equivalent experience).
5+ years hands-on coding experience.
Strong knowledge of time series databases such as InfluxDB and Prometheus.
Strong knowledge of building and consuming REST APIs (Redfish is a big plus).
Strong knowledge of telemetry visualization solutions such as Grafana and Influx.
Strong knowledge of firmware architecture and optimizing firmware for low-latency APIs.
Strong knowledge of analyzing algorithms for time & space complexity and projecting system resource requirements.
Proven record of designing scalable solutions.
Strong and demonstrable skill in C++ and Python.
Experience programming and debugging for server platforms.
Experience with SCM (e.g., Git, Perforce) and project management tools like Jira.
Excellent written and oral communication skills, strong teamwork and work ethic, and commitment to producing high-quality work.
Self-starter who enjoys finding creative solutions to complicated problems and is hands-on with coding.

Ways to Stand Out

Experience building telemetry collection & analysis engines.
Experience with Redfish and notification systems like PagerDuty.
Active Open Compute (OCP) and DMTF contributor in relevant areas.
Hands-on experience with x86 or ARM system architecture.
Familiarity with Confidential Compute.
Experience with machine learning and multi-variable optimization techniques.

Benefits & Compensation

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.
Base salary ranges provided in the posting:
- Level 3: 148,000 USD - 235,750 USD
- Level 4: 184,000 USD - 287,500 USD
You will also be eligible for equity and benefits (see company benefits information).

Applications for this job will be accepted at least until October 23, 2025.

NVIDIA is committed to fostering a diverse work environment and is proud to be an equal opportunity employer. We do not discriminate (including in hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.