Senior Software Engineer, AI Networking

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Python @ 7, Algorithms @ 4, Distributed Systems @ 4, Machine Learning @ 4, TensorFlow @ 6, Bash @ 7, Communication @ 4, Networking @ 4, System Architecture @ 4, LLM @ 4, PyTorch @ 4, CUDA @ 4, GPU @ 4, Deep Learning @ 4, AI @ 4, Reinforcement Learning @ 4, NCCL @ 4, HPC @ 6, Performance Analysis @ 4, JAX @ 6

Details

NVIDIA seeks a senior software engineer to join the AI Networking co-design and benchmark R&D team. In this role you will build and productize machine learning tools (including ML-based combinatorial optimization and design space exploration techniques) to optimize AI workloads across large GPU and CPU clusters. The work focuses on distributed deep learning within LLM training and inference stacks and involves interaction with hardware (HCAs, switches, CPUs, GPUs, systems) and multiple software layers (LLM applications, ML frameworks, communication and computing libraries). The role includes developing ML-driven performance analysis and optimization tools and methodologies, potentially incorporating learning-based agentic techniques.

Responsibilities

  • Design and implement resource allocation and combinatorial optimization techniques (e.g., reinforcement learning, LLM agents for design space exploration (DSE), Bayesian optimization, and other multi-objective optimization techniques) to optimize LLMs at datacenter scale.
  • Research, develop, and deploy AI/ML techniques to optimize large-scale deep learning (LLM) training and inference on NVIDIA supercomputers and distributed systems, with emphasis on high-performance networking and NVIDIA communication libraries.
  • Build and productionize ML-based tools for performance prediction and optimization, emphasizing networking aspects.
  • Develop and deploy a scalable, reliable data curation pipeline capable of handling complex data types (time series, PyTorch model graphs) to support training of high-performance ML models.
  • Collaborate across hardware and software teams to deliver performance analysis insights.
  • Lead performance test planning, establish performance targets for new technologies and solutions, and drive efforts to meet those goals.

Requirements

  • PhD or Master’s degree in Computer Science, Software Engineering, or equivalent experience.
  • 4+ years of experience applying machine learning techniques to computer architecture and system optimization problems; experience at the intersection of at least two of HPC, networking, and AI applications is desired.
  • Hands-on experience developing and deploying learning algorithms (e.g., reinforcement learning, offline RL, supervised learning) for optimization challenges in computer architecture, system design, or networking.
  • Proficiency building and using ML models with frameworks such as PyTorch, TensorFlow, or JAX.
  • Proven ability to apply GNNs/transformer-based optimization to PyTorch model graphs and Kineto execution traces.
  • Expertise combining knowledge of NVIDIA GPUs, the CUDA library, and deep learning frameworks (TensorFlow/PyTorch) with networking concepts, including collective communication libraries (like NCCL) and protocols (such as RoCE and RDMA).
  • Strong programming capabilities in Python, Bash, and C++.
  • Collaborative team player with effective communication and interpersonal skills.

Ways to stand out from the crowd

  • In-depth knowledge of, and experience with, machine learning and reinforcement learning techniques and frameworks.
  • Comprehensive understanding of computer architecture, system architecture, and networking.
  • Extensive experience applying machine learning techniques such as GNNs or related graph-based models.
  • Knowledge of PyTorch, CUDA, and NCCL.
  • Proven software engineering/development skills.

Compensation

Base salary range:

  • Level 3: 152,000 USD - 241,500 USD per year
  • Level 4: 184,000 USD - 287,500 USD per year

You will also be eligible for equity and benefits.

Additional information

  • Applications for this job will be accepted at least until April 10, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.