Senior Software Engineer - HPC

at Nvidia
USD 152,000-241,500 per year
SENIOR
✅ Hybrid

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Go @ 7, Kubernetes @ 4, Python @ 7, Scala @ 7, GCP @ 4, Java @ 7, CI/CD @ 6, Distributed Systems @ 4, AWS @ 4, Azure @ 4, Communication @ 4, API @ 4, Elixir @ 7, Observability @ 4, AI @ 4, Slurm @ 4, HPC @ 4

Details

NVIDIA is continuing to improve its HPC infrastructure and is seeking a Senior Software Engineer to build and operate sophisticated infrastructure that enables business-critical services and AI applications. The team focuses on providing better tools to build and manage infrastructure, emphasizing reliable distributed systems and long-term maintenance strategies.

Responsibilities

  • Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
  • Continuously improve infrastructure provisioning and operations with automation, APIs, and self-service platforms.
  • Operate in a globally distributed, hybrid multi-cloud environment (AWS, GCP, on-prem), building cloud-native and location-agnostic systems.
  • Build strong cross-functional relationships and align with collaborators across various business units.
  • Improve uptime and Quality of Service (QoS) through data-driven operations, strong SLOs, and robust incident practices.
  • Participate in the team’s on-call rotation and lead high-impact incident response when needed.

Requirements

  • Strong coding skills in at least two of: Go, Java, C++, Scala, Python, Elixir (focus on backend, systems, or infrastructure engineering).
  • Deep understanding of scalability, consistency, and performance trade-offs in server-side systems; ability to build horizontally scalable, resilient, and low-latency services.
  • Experience owning services end-to-end: architecture, design reviews, implementation, testing, rollout, observability, and iterative improvement.
  • Hands-on experience with at least one major cloud provider (GCP, AWS, or Azure) and cloud-native primitives (managed storage, messaging, compute).
  • Proficiency with modern CI/CD, GitOps workflows, and Infrastructure as Code practices for safe, repeatable changes.
  • Bias for action, strong problem-solving skills, and a track record of simplifying complex systems.
  • B.S. in Computer Science or related field (or equivalent experience), with 5+ years of relevant experience.
  • Clear communication and strong collaboration skills; comfortable guiding technical decisions across teams.

Ways to stand out

  • Prior experience building core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers (e.g., Slurm or Kubernetes).
  • Maintainer or co-maintainer experience with an open-source component (plugin, operator, exporter, controller, or SDK) used in production at large scale.

Compensation & Benefits

  • Base salary range: 152,000 USD - 241,500 USD (final base salary determined by location, experience, and internal pay comparisons).
  • Eligibility for equity and NVIDIA benefits (link to benefits available in original posting).

Additional details

  • #LI-Hybrid (role listed as hybrid).
  • Applications accepted at least until March 13, 2026.
  • This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.