Senior Systems Engineer – High-Performance AI And Networking Applications
at NVIDIA
USD 184,000-356,500 per year
Required Skills & Competences
Ansible @ 4, Grafana @ 4, Kubernetes @ 3, Prometheus @ 4, R @ 4, Communication @ 4, Networking @ 4, Debugging @ 7, PyTorch @ 4
Details
Join the NVIDIA Deep Learning Frameworks Infrastructure team as a Senior Systems Engineer focusing on High-Performance AI & Networking Applications. This position offers an opportunity to work on the latest technology advancements and to collaborate closely with top engineering teams on NVIDIA's innovations.
Responsibilities
- Collaborate with networking teams to plan, implement, and evaluate performance benchmarks on infrastructure powered by NVLink, NVSwitch, and InfiniBand.
- Assess findings and work closely with framework, hardware, and support teams to improve system performance across various deep learning workloads.
- Act as a primary resource for fixing networking and hardware integration issues, focusing on scalable multi-node systems.
- Maintain high communication standards across multiple engineering, support, and R&D teams, ensuring technical and performance goals are met.
- Offer technical mentorship and documentation for internal teams and external partners on standard methodologies in HPC networking deployments.
- Share insights on improving networking strategies for large-scale AI and deep learning infrastructure.
Requirements
- BS, MS, or PhD in Computer Science, Engineering, or a related field, or equivalent experience.
- 8+ years of proven experience in AI/HPC infrastructure.
- Familiarity with AI/HPC job schedulers and orchestrators such as Slurm, Kubernetes (K8s), or LSF. Practical exposure to AI/HPC workflows employing MPI and NCCL.
- Familiarity with high-speed networking pertaining to HPC including InfiniBand, RDMA, RoCE, and Amazon EFA.
- An understanding of PyTorch, Megatron-LM, and deep learning inference frameworks such as vLLM and SGLang is essential.
- Proven experience with InfiniBand, NVLink, and other high-speed networking technologies in HPC or large-scale datacenter environments.
- Experience investigating and evaluating performance in multi-node systems, especially in deep learning or scientific computing tasks.
- Strong analytical, debugging, and technical communication skills.
- Comfortable working in collaborative, multi-faceted teams.
Ways To Stand Out
- Mastery of deep learning frameworks or distributed training systems.
- Familiarity with datacenter automation, advanced network protocols, and supporting large HPC or AI clusters in production environments.
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads.
- Experience with networking and communications libraries such as NCCL, NIXL, NVSHMEM, and UCX.
- Experience developing or maintaining cluster management and monitoring tools, for example Ansible for infrastructure automation and Prometheus and Grafana for monitoring.
Compensation & Benefits
- Base salary range (determined by location, experience, and pay of similar roles):
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- You will also be eligible for equity and benefits.
Additional Information
- Applications for this job will be accepted at least until November 14, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.