Senior Platform Telemetry Engineer

at NVIDIA
USD 148,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Grafana @ 7, Prometheus @ 7, Python @ 7, Algorithms @ 7, Communication @ 7, Git @ 4, Jira @ 4, Product Management @ 4, Debugging @ 4, API @ 1, Project Management @ 4, QA @ 4, System Architecture @ 4, Customer Support @ 4, GPU @ 4

Details

NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, we are increasingly known as “the AI computing company.” We are looking to grow our company and establish teams with the most thoughtful people in the world.

The NVIDIA GH200 superchip provides the performance and productivity required for strong scaling of HPC and generative AI workloads. Scale-out is inherent to the design of this massive superchip. We are looking for expert engineers to help design rack-level solutions for next-generation, scalable AI supercomputing platforms.

Join us at the forefront of technological advancement.

Responsibilities

  • Drive next-generation fleet management solutions for scaling AI infrastructure using GPUs and the Grace solution from NVIDIA. Collaborate with customers, product management and architects to refine implementation requirements and accelerate product development.
  • Define the architecture for fleet health monitoring and fault-remediation solutions at scale (in-band and out-of-band). Produce detailed architecture and run proofs of concept (POCs) to validate approaches.
  • Educate customers about product architecture, collect feedback, and update designs. Write architecture specs and design documents. Own end-to-end product delivery and perform code reviews for code produced from architecture specs.
  • Work with development teams to ensure proper testing, enhance unit tests, and establish comprehensive test plans.
  • Drive product life cycles with QA teams and act as a product owner to productize code.
  • Articulate requirements in Jira and bug management tools; collaborate with managers to build end-to-end execution plans.
  • Contribute across product development phases: product definition, architecture, design, implementation, debugging, testing, and early customer support.

Requirements

  • BS, MS, or PhD in Electrical Engineering, Computer Science, or a related field (or equivalent experience).
  • 5+ years of hands-on coding experience.
  • Strong knowledge of time-series databases such as InfluxDB and Prometheus.
  • Strong experience building and consuming REST APIs (Redfish experience is a big plus).
  • Strong knowledge of telemetry visualization solutions such as Grafana and Influx.
  • Strong knowledge of firmware architecture and optimizing firmware for low-latency APIs.
  • Strong ability to analyze algorithms for time and space complexity and project system resource requirements.
  • Proven record of delivering scalable solutions.
  • Strong and demonstrable skill in C/C++ and Python.
  • Experience programming and debugging server platforms.
  • Experience with SCM tools (e.g., Git, Perforce) and project management tools like Jira.
  • Excellent written and oral communication skills, strong work ethic, teamwork orientation, and commitment to delivering high-quality work daily.
  • Self-starter mindset with hands-on coding and creative problem solving.

Ways to Stand Out

  • Experience building telemetry collection and analysis engines.
  • Experience with Redfish and notification systems like PagerDuty.
  • Active Open Compute (OCP) and DMTF contributions in relevant areas.
  • Hands-on experience with x86 or ARM system architecture.
  • Familiarity with Confidential Compute.
  • Experience with ML and multi-variable optimization techniques.

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 148,000 USD - 235,750 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. You will also be eligible for equity and benefits (see NVIDIA benefits page).

Other

Applications for this job will be accepted at least until August 18, 2025.

NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. We do not discriminate on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.