Senior Software Engineer - HPC

at NVIDIA

📍 Santa Clara, United States

USD 152,000-241,500 per year

SENIOR

✅ Hybrid

Used Tools & Technologies

Machine Learning

Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level. The levels are defined as follows:

1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;

3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;

7-9: expert. You can teach others and you know all the pitfalls and tricks;

10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.

Go @ 7, Kubernetes @ 4, Python @ 7, Scala @ 7, GCP @ 4, Java @ 7, CI/CD @ 6, Distributed Systems @ 4, AWS @ 4, Azure @ 4, Communication @ 4, API @ 4, Elixir @ 7, Observability @ 4, AI @ 4, Slurm @ 4, HPC @ 4

Details

NVIDIA is continuing to improve its HPC infrastructure and is seeking a Senior Software Engineer to build and operate sophisticated infrastructure that enables business-critical services and AI applications. The team focuses on providing better tools to build and manage infrastructure, emphasizing reliable distributed systems and long-term maintenance strategies.

Responsibilities

Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
Continuously improve infrastructure provisioning and operations with automation, APIs, and self-service platforms.
Operate in a globally distributed, hybrid multi-cloud environment (AWS, GCP, on-prem), building cloud-native and location-agnostic systems.
Build strong cross-functional relationships and align with collaborators across various business units.
Improve uptime and Quality of Service (QoS) through data-driven operations, strong SLOs, and robust incident practices.
Participate in the team’s on-call rotation and lead high-impact incident response when needed.

Requirements

Strong coding skills in at least two of: Go, Java, C++, Scala, Python, Elixir (focus on backend, systems, or infrastructure engineering).
Deep understanding of scalability, consistency, and performance trade-offs in server-side systems; ability to build horizontally scalable, resilient, and low-latency services.
Experience owning services end-to-end: architecture, build reviews, implementation, testing, rollout, observability, and iterative improvement.
Hands-on experience with at least one major cloud provider (GCP, AWS, or Azure) and cloud-native primitives (managed storage, messaging, compute).
Proficiency with modern CI/CD, GitOps workflows, and Infrastructure as Code practices for safe, repeatable changes.
Bias for action, strong problem-solving skills, and a track record of simplifying complex systems.
B.S. in Computer Science or related field (or equivalent experience), with 5+ years of relevant experience.
Careful communication and collaboration skills; comfortable guiding technical decisions across teams.

Ways to stand out

Prior experience building core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers (e.g., Slurm or Kubernetes).
Maintainer or co-maintainer responsibilities for an open source component used in production (plugins, operators, exporters, controllers, or SDKs) at large scale.

Compensation & Benefits

Base salary range: 152,000 USD - 241,500 USD (final base salary determined by location, experience, and internal pay comparisons).
Eligibility for equity and NVIDIA benefits (link to benefits available in original posting).

Additional details

#LI-Hybrid (role listed as hybrid).
Applications accepted at least until March 13, 2026.
This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.