Senior Software Engineer, DGX Cloud AI Infrastructure

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

GenAI

Required Skills & Competences

Python @ 4, Distributed Systems @ 4, Leadership @ 7, Communication @ 4, Networking @ 4, Debugging @ 4, Technical Leadership @ 7, LLM @ 4, PyTorch @ 4, CUDA @ 7, GPU @ 4, Deep Learning @ 4, Generative AI @ 4, AI @ 4, InfiniBand @ 7, Profiling @ 4, NCCL @ 4, TensorRT @ 4, HPC @ 7, NVLink @ 7

Details

NVIDIA is at the forefront of the generative AI revolution, building the software and systems that power the world’s most advanced large language model workloads. This role leads bring-up, triage, benchmarking, analysis, and optimization of distributed training and inference workloads across NVIDIA GPU platforms at the largest scales. The position sets technical direction across communication libraries, model frameworks, and inference/training stacks to ensure state-of-the-art LLM workloads run efficiently and reliably at scale. It is a hands-on senior individual-contributor role operating at the intersection of deep learning systems, GPU performance, distributed computing, and large-scale operations.

Responsibilities

  • Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
  • Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
  • Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
  • Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
  • Own root-cause analysis of complex failures, such as hangs, performance regressions, and topology sensitivity, in large distributed environments.
  • Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
  • Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows for new platforms.
  • Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
  • Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
  • Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

Requirements

  • Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
  • 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
  • Expertise in debugging and triaging AI applications across the full stack, from the application layer down to the hardware.
  • Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
  • Proven track record of architecting, debugging, and scaling large-scale distributed systems.
  • Expert-level Python and C/C++ programming skills.
  • Experience operating workloads in scheduled, containerized cluster environments.
  • Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

Ways to Stand Out

  • Demonstrated experience debugging and optimizing AI workloads at large scale.
  • Deep familiarity with the RDMA software stack (NCCL, IB verbs, UCX, libfabric).
  • Strong knowledge of GPU cluster fabrics and topology, including NVLink, NVSwitch, PCIe, RoCE, and InfiniBand.
  • Experience building acceptance tests, benchmark harnesses, regression gates, or cluster qualification tooling for AI platforms.
  • Experience building resilience, fault-detection, or failure-attribution systems for datacenter-scale infrastructure.

Compensation & Benefits

  • Base salary ranges (determined by location, experience, and comparable pay):
    • Level 4: 184,000 USD - 287,500 USD
    • Level 5: 224,000 USD - 356,500 USD
  • Eligible for equity and benefits.

Additional Information

  • Applications will be accepted at least until June 8, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and committed to fostering an inclusive work environment.