Senior Cloud Services Software Engineer

at NVIDIA
USD 184,000-356,500 per year
SENIOR
On-site

Required Skills & Competences

Docker @ 4, Go @ 6, Kubernetes @ 4, Python @ 6, GCP @ 4, Distributed Systems @ 4, Machine Learning @ 4, TensorFlow @ 4, AWS @ 4, Azure @ 4, API @ 4, PyTorch @ 4, Cloud Computing @ 4

Details

Joining NVIDIA's DGX Cloud Team means contributing to the infrastructure that powers our innovative AI research. The team focuses on optimizing the efficiency and resiliency of ML workloads and on developing scalable AI infrastructure tools and services. The objective is to deliver a stable, scalable environment for AI researchers, providing them with the resources and scale they need to champion innovation.

Responsibilities

  • Develop solutions at the intersection of machine learning, distributed systems, and high-performance computing to advance AI technologies.
  • Design, develop, and optimize (micro-)services orchestrated by Kubernetes to run large-scale AI training workflows resiliently and efficiently on AI training supercomputers hosted at major cloud service providers (CSPs).
  • Co-design and implement APIs to integrate services vertically with NVIDIA's resiliency stacks (from tier-0 telemetry services to break/fix automation services to checkpoint and execution systems).
  • Craft submission abstractions that enable model engineers and training platforms/frameworks to submit long-running training jobs while hiding the complexity of handling infrastructure failures, managing job lifecycles with auto-restarts, ensuring efficiency, and advising users.
  • Build modular services that can be coordinated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.

Requirements

  • Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
  • 5+ years of hands-on experience in backend development, preferably with Python, Go, C/C++ or similar high-performance languages.
  • Consistent track record of building and scaling large-scale distributed systems.
  • Experience with cloud computing platforms such as AWS, Azure, and GCP.
  • Experience with container technologies like Docker and Kubernetes.
  • Experience with HPC/AI platforms such as Slurm.

Ways to Stand Out

  • Real-world experience with deep learning frameworks and orchestrators (PyTorch, TensorFlow, JAX, Ray).
  • Experience developing a framework plugin architecture that allows a framework to integrate with a cluster scheduler in a user-visible way.
  • Strong understanding of NVIDIA GPUs, network technologies, and their failure patterns.
  • Experience with AI models and AI-based tools.
  • References to code contributions that you can share.

Benefits

  • Base salary will be determined based on location, experience, and pay of employees in similar positions.
  • Base salary range: 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
  • Eligible for equity and benefits (see NVIDIA benefits page).

Additional Information

  • Location: US, CA, Santa Clara.
  • Employment type: Full time.
  • Applications for this job will be accepted at least until August 10, 2025.
  • NVIDIA is an equal opportunity employer and committed to fostering a diverse work environment.