Senior Software Engineer - AI Research Clusters

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are defined as follows:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know all the pitfalls and tricks.

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 4, Docker @ 4, Kubernetes @ 4, Linux @ 4, Python @ 7, Distributed Systems @ 6, Machine Learning @ 4, Hiring @ 4, JavaScript @ 6, CSS @ 6, Rust @ 7, API @ 6, GPU @ 4, AI @ 4, Data Modeling @ 6, Agentic AI @ 4, Slurm @ 4

Details

NVIDIA is seeking a Senior Software Engineer to accelerate the next era of machine learning innovation by proposing and implementing engineering solutions that deliver functional, reliable, secure, and performance-optimal GPU clusters for internal researchers. The role focuses on reducing operational disruption and overhead, enabling researcher self-service, and improving reliability, operational excellence, and performance so scientists and engineers can train, fine-tune, and deploy advanced ML models on some of the world’s most powerful GPU systems.

Responsibilities

Understand pain points of validating, monitoring, and operating GPU clusters at scale by collaborating with coworkers across the AI Platform organization.
Design, develop, and maintain engineering solutions to systematically solve those pain points.
Research and apply traditional AIOps and emerging Agentic AI to reduce operational toil.
Participate in on-call support for systems and platforms built and owned by the team.

Requirements

BS/MS in Computer Science, Engineering, or equivalent experience.
5+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed systems.
Experience in the software development lifecycle on Linux-based platforms.
Strong coding skills in languages such as Python, C++, or Rust.
Experience with Docker, Kubernetes, GitLab CI, and automated deployments.
Experience with AIOps or Agentic AI and applying it successfully in production environments.
Participation in on-call support for production systems.

Ways To Stand Out

Proficiency with full-stack development: relational data modeling, DB optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service.
Passion for building developer-centric platforms with great UX and strong operational reliability.
Experience running Slurm or custom scheduling frameworks in production ML environments.
Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.

Compensation & Benefits

Base salary ranges: $152,000 - $241,500 USD for Level 3, and $184,000 - $287,500 USD for Level 4.
Eligible for equity and benefits (see NVIDIA benefits page).

Additional Information

Applications for this job will be accepted at least until February 24, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer and states that it does not discriminate in its hiring and promotion practices.