AI And ML Infra Software Engineer, GPU Clusters

at Nvidia

📍 Santa Clara, United States

USD 148,000-287,500 per year

MIDDLE

✅ On-site

Required Skills & Competences

Docker @ 3, Go @ 2, Kubernetes @ 3, DevOps @ 3, Python @ 2, GCP @ 2, AWS @ 2, Azure @ 2, Bash @ 2, Communication @ 3, Data Engineering @ 3, Networking @ 3, PyTorch @ 6, GPU @ 3

Details

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. With a legacy of innovation fueled by great technology and amazing people, NVIDIA is now tapping into the unlimited potential of AI to define the next era of computing. This era envisions GPUs as the brains of computers, robots, and self-driving cars that can understand the world. The company seeks talented individuals to make a lasting impact.

Responsibilities

Collaborate closely with AI and ML research teams to understand their infrastructure needs and obstacles, translating those observations into actionable improvements.
Monitor and optimize infrastructure performance to ensure high availability, scalability, and efficient resource utilization.
Define and improve important measures of AI researcher efficiency, aligning actions with measurable results.
Collaborate with research, data engineering, and DevOps teams to build a seamless AI/ML infrastructure ecosystem.
Stay current with AI/ML technologies, frameworks, and strategies and promote their implementation within NVIDIA.

Requirements

BS or equivalent experience in Computer Science or a related field, with 5+ years of experience in AI/ML and HPC workloads and infrastructure.
Hands-on experience with HPC-grade infrastructure and in-depth knowledge of accelerated computing (GPU, custom silicon), storage (Lustre, GPFS, BeeGFS), scheduling & orchestration systems (Slurm, Kubernetes, LSF), high-speed networking (InfiniBand, RoCE, Amazon EFA), and container technologies (Docker, Enroot).
Expertise in running and optimizing large-scale distributed training workloads using PyTorch (DDP, FSDP), NeMo, or JAX, with a deep understanding of AI/ML workflows including data processing, model training, and inference pipelines.
Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud platforms (AWS, GCP, Azure) and parallel computing frameworks.
Passion for continual learning of new AI/ML infrastructure technologies and approaches.
Excellent communication and collaboration skills to work effectively with diverse teams.

Benefits

NVIDIA offers competitive salaries and a comprehensive benefits package, along with equity eligibility and continuous professional growth in a rapidly expanding engineering team.

The base salary range is 148,000 USD - 287,500 USD, determined by location, experience, and comparable employee pay.