Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others and you know the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Software Development @ 4
Docker @ 4
Kubernetes @ 4
Linux @ 4
Python @ 7
Distributed Systems @ 6
Machine Learning @ 4
Hiring @ 4
JavaScript @ 6
CSS @ 6
Rust @ 7
API @ 6
GPU @ 4
AI @ 4
Data Modeling @ 6
Agentic AI @ 4
Slurm @ 4
Details
NVIDIA is seeking a Senior Software Engineer to accelerate the next era of machine learning innovation by proposing and implementing engineering solutions that deliver functional, reliable, secure, and performance-optimal GPU clusters for internal researchers. The role focuses on reducing operational disruption and overhead, enabling researcher self-service, and improving reliability, operational excellence, and performance so scientists and engineers can train, fine-tune, and deploy advanced ML models on some of the world’s most powerful GPU systems.
Responsibilities
- Understand pain points of validating, monitoring, and operating GPU clusters at scale by collaborating with coworkers across the AI Platform organization.
- Design, develop, and maintain engineering solutions to systematically solve those pain points.
- Research and apply traditional AIOps and emerging Agentic AI to reduce operational toil.
- Participate in on-call support for systems and platforms built and owned by the team.
Requirements
- BS/MS in Computer Science, Engineering, or equivalent experience.
- 5+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed systems.
- Experience in the software development lifecycle on Linux-based platforms.
- Strong coding skills in languages such as Python, C++, or Rust.
- Experience with Docker, Kubernetes, GitLab CI, and automated deployments.
- Experience with AIOps or Agentic AI and applying it successfully in production environments.
- Participation in on-call support for production systems.
Ways To Stand Out
- Proficiency with full-stack development: relational data modeling, DB optimization, REST API semantics, JavaScript, CSS, and providing APIs as a service.
- Passion for building developer-centric platforms with great UX and strong operational reliability.
- Experience running Slurm or custom scheduling frameworks in production ML environments.
- Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.
Compensation & Benefits
- Base salary ranges: $152,000 - $241,500 USD for Level 3, and $184,000 - $287,500 USD for Level 4.
- Eligible for equity and benefits (see NVIDIA benefits page).
Additional Information
- Applications for this job will be accepted at least until February 24, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and states that it does not discriminate in its hiring and promotion practices.