Senior GPU Supercomputer Scheduler Engineer

at NVIDIA

📍 Santa Clara, United States

USD 152,000-287,500 per year

SENIOR

✅ On-site

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels mean:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose.

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology.

7-9: expert. You can teach others and you know the pitfalls and tricks.

10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Software Development @ 4, Docker @ 4, Go @ 4, Kubernetes @ 7, Linux @ 4, Python @ 4, TensorFlow @ 4, Bash @ 4, Communication @ 4, PyTorch @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, Slurm @ 7, HPC @ 4, Performance Analysis @ 4

Details

NVIDIA is a pioneer in accelerated computing, known for inventing the GPU and driving breakthroughs in gaming, computer graphics, high-performance computing, and artificial intelligence. The Managed AI Research Superclusters (MARS) team builds and scales the infrastructure, platforms, and tools that enable researchers and engineers to develop the next generation of AI/ML systems. By joining this team, you'll help design solutions that power some of the world’s most advanced computing workloads.

Responsibilities

Design and develop new scheduling features and add-on services to improve GPU compute clusters across many dimensions, such as resource usage fairness, GPU occupancy, GPU waste, application resilience, application performance and power usage.
Design and develop batch workload management and orchestration services.
Provide support to staff and end users to resolve batch scheduler issues.
Build and improve the ecosystem around GPU-accelerated computing.
Analyze and optimize the performance of deep learning workflows.
Develop large scale automation solutions.
Perform root cause analysis and suggest corrective action for problems at large and small scales.
Proactively identify and address potential problems before they occur.

Requirements

Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience.
5+ years of work experience.
Strong understanding of batch scheduling, preferably with experience in schedulers such as Slurm or Kubernetes batch schedulers (Kueue, Volcano, etc.).
Significant experience in systems programming languages such as C/C++ and Go, as well as scripting languages such as Python and Bash.
Established experience with the Linux operating system, environment and tools.
Experience analyzing and tuning performance for a variety of AI workloads.
In-depth understanding of container technologies such as Docker, Singularity, and Podman.
Flexibility and adaptability for working in a dynamic environment with different frameworks and requirements.
Excellent communication, interpersonal and customer collaboration skills.

Ways to stand out

Knowledge of high-performance computing (HPC).
Open source software contributions.
Experience with deep learning frameworks like PyTorch and TensorFlow.
Passion for software development processes.

Compensation and other details

The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4. Actual base salary will be determined based on location, experience, and internal pay comparisons.
You will also be eligible for equity and benefits.
Applications for this job will be accepted at least until February 24, 2026. This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.