Systems Software Engineer - NIM Factory Platforms

at NVIDIA

📍 Santa Clara, United States

$180,000-339,200 per year

SENIOR
✅ Hybrid

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Kubernetes @ 4
  • Distributed Systems @ 4
  • Networking @ 4
  • Microservices @ 4
  • Debugging @ 4

Details

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior engineer to design and build factory automation for NVIDIA Inference Microservices (NIMs). The right person for this role brings technical drive and creativity to change the way NVIDIA optimizes and serves performant inferencing for every AI model. Our NIM offerings are easy to use, highly performant, and tested in all deployment scenarios: in the cloud, on customers’ self-hosted infrastructure, and locally on all NVIDIA GPUs. You will apply your deep technical expertise to design an efficient, scalable, and reliable automated factory pipeline that turns AI models into NIMs validated for best-in-class performance and accuracy.

Responsibilities

  • Develop, analyze, and optimize factory infrastructure that takes in an AI model and produces a deployable service validated across cloud, on-prem, and Kubernetes environments. With the team, define and deliver rapid iterations on the group's technical strategies and roadmaps to build and improve the NIM factory. You will develop harnesses, automate hardware acceptance, analyze benchmarks, gather data, and perform statistical analysis of system health and NIM performance.
  • Work with technical leaders to design and develop scalable, reliable factory acceptance and performance tuning of hardware platforms. You will collaborate with multiple AI model teams to understand their requirements and build efficient infrastructure that improves every team’s productivity.
  • Define metrics and drive improvements based on user feedback. You will mentor and collaborate within the team and with other teams to grow your colleagues and yourself. You will have a history of learning and of growing both your own skills and those of the people around you.

Requirements

  • A history of using your advanced programming skills to build tooling and automation for hardware system characterization and benchmarking.
  • Proven experience debugging and analyzing performance of compute applications and systems.
  • Deep technical expertise working with system software and platform layers, including the kernel, device drivers, memory, storage, networking, and PCIe devices.
  • Passion for building platform engineering components and automation of system benchmarking and characterization.
  • Excellent interpersonal skills and the ability to lead multi-functional efforts.
  • Experience working with hardware clusters, distributed systems, networking, GPU interconnects (PCIe, NVLink), and node and cluster interconnects (InfiniBand).
  • BS or MS in Computer Science, Computer Engineering or related field (or equivalent experience).
  • 6+ years of demonstrated experience developing performant microservices, cloud software, and/or tooling.

Benefits

We are widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and creative people in the world working for us. If you're creative and autonomous with a real passion for technology, we want to hear from you.