Senior Software Engineer, Distributed Systems - NIM Factory

at NVIDIA
USD 168,000-322,000 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Docker @ 4 Kafka @ 4 Kubernetes @ 4 Prometheus @ 4 Redis @ 4 CI/CD @ 4 Hiring @ 4 Helm @ 4 Microservices @ 4 Debugging @ 4

Details

NVIDIA is the platform upon which every new AI-powered application is built. We are seeking a senior engineer to design and build factory infrastructure and automation for NVIDIA Inference Microservices (NIMs). The right person for this role brings technical drive and creativity to change the way NVIDIA optimizes and serves performant inferencing for every AI model in heterogeneous cluster environments. Our NIM offerings are easy to use, highly performant and tested in all deployment scenarios: cloud, customers' self-hosted infrastructure and locally on all NVIDIA GPUs. You will apply your deep technical expertise to design an efficient, scalable and reliable automation factory infrastructure that turns AI models into NIMs validated for best-in-class performance and accuracy.

You will harness groundbreaking technologies, and build a highly efficient factory to power how NVIDIA builds and validates NIMs for inferencing all the way through deployment in heterogeneous hardware and software environments. You will influence and drive technical advances in NVIDIA's workflows and build the infrastructure that strives to accelerate the delivery of every AI model on NVIDIA's GPUs anywhere. We are looking for technical talent to design and build our factory capabilities, including the underlying infrastructure, pipelines, backends, Docker build, test harness, metrics, performance engineering, log ingestion, and more.

Responsibilities

  • Develop a factory pipeline that takes an AI model in and produces a deployable service that is validated across cloud, on-prem and Kubernetes environments. With the team, define and deliver rapid iterations on the group's technical strategies and roadmaps to deliver and improve the NIM factory. Design interfaces, data models and schemas, and expand observability over the factory pipeline and its compute infrastructure.
  • Work with technical leaders designing and developing scalable and reliable factory components. Collaborate with multiple AI model teams to understand their requirements to build an efficient infrastructure that improves every team's productivity.
  • Define metrics and drive improvements based on user feedback. Mentor and collaborate throughout the team and with other teams to grow your colleagues and yourself.

Requirements

  • A history of using your advanced programming skills to build distributed and compute systems, backend services, microservices and cloud technologies.
  • Experience working effectively with multi-functional teams, principals and architects, across organizational boundaries.
  • Mentorship experience, growing teams and team members, and the flexibility to adjust direction and expectations given customer needs.
  • Deep technical expertise in distributed containerized applications using technologies such as Docker, Kubernetes, Cloud Endpoints, Helm, and Prometheus.
  • Passion for building rich microservice applications and build/test automation pipelines.
  • Excellent interpersonal skills and the ability to lead multi-functional efforts.
  • Proven experience debugging and analyzing the performance of distributed microservices or cloud systems.
  • BS or MS in Computer Science, Computer Engineering or related field (or equivalent experience).
  • 8+ years of demonstrated experience developing performant microservices, cloud software and/or tooling.

Ways to stand out

  • Experience delivering event-driven applications using services such as Temporal, Kafka, Redis or others, and the ability to discuss the pros and cons of these choices.
  • A history of building and deploying containers for microservices, cloud and on-prem deployments, and their associated CI/CD pipelines.
  • Prior experience working with large-scale full-stack development.

Compensation & Benefits

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 264,500 USD for Level 4, and 200,000 USD - 322,000 USD for Level 5. You will also be eligible for equity and benefits.

Other

Applications for this job will be accepted at least until January 10, 2026.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. We do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.