Senior Software Engineer - AI Research Clusters

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ On-site

Required Skills & Competences

Software Development @ 4, Docker @ 4, Kubernetes @ 4, Linux @ 4, Python @ 7, Distributed Systems @ 6, Machine Learning @ 4, Hiring @ 4, JavaScript @ 6, CSS @ 6, Rust @ 7, API @ 6, GPU @ 4, AI @ 4, Data Modeling @ 6, Agentic AI @ 4, Slurm @ 4

Details

NVIDIA is seeking a Senior Software Engineer to accelerate the next era of machine learning innovation by proposing and implementing engineering solutions that deliver functional, reliable, secure, and performance-optimal GPU clusters for internal researchers. The role focuses on reducing operational disruption and overhead, enabling researcher self-service, and improving reliability, operational excellence, and performance so scientists and engineers can train, fine-tune, and deploy advanced ML models on some of the world’s most powerful GPU systems.

Responsibilities

  • Understand pain points of validating, monitoring, and operating GPU clusters at scale by collaborating with coworkers across the AI Platform organization.
  • Design, develop, and maintain engineering solutions to systematically solve those pain points.
  • Research and apply traditional AIOps and emerging Agentic AI techniques to reduce operational toil.
  • Participate in on-call support for systems and platforms built and owned by the team.

Requirements

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 5+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed systems.
  • Experience in the software development lifecycle on Linux-based platforms.
  • Strong coding skills in languages such as Python, C++ or Rust.
  • Experience with Docker, Kubernetes, GitLab CI, and automated deployments.
  • Experience applying AIOps or Agentic AI successfully in production environments.
  • Participation in on-call support for production systems.

Ways To Stand Out

  • Proficiency with full-stack development: relational data modeling, DB optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service.
  • Passion for building developer-centric platforms with great UX and strong operational reliability.
  • Experience running Slurm or custom scheduling frameworks in production ML environments.
  • Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.

Compensation & Benefits

  • Base salary ranges: $152,000 - $241,500 USD for Level 3, and $184,000 - $287,500 USD for Level 4.
  • Eligible for equity and benefits (see NVIDIA benefits page).

Additional Information

  • Applications for this job will be accepted at least until February 24, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA states that it is an equal opportunity employer and does not discriminate in its hiring and promotion practices.