Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — expert. You can teach others and know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Go @ 7
Kubernetes @ 4
Python @ 7
Scala @ 7
GCP @ 4
Java @ 7
CI/CD @ 6
Distributed Systems @ 4
AWS @ 4
Azure @ 4
Communication @ 4
API @ 4
Elixir @ 7
Observability @ 4
AI @ 4
Slurm @ 4
HPC @ 4
Details
NVIDIA is continuing to improve its HPC infrastructure and is seeking a Senior Software Engineer to build and operate sophisticated infrastructure that enables business-critical services and AI applications. The team focuses on providing better tools to build and manage infrastructure, emphasizing reliable distributed systems and long-term maintenance strategies.
Responsibilities
- Apply modern distributed systems patterns to push the limits of scale, latency, and reliability.
- Continuously improve infrastructure provisioning and operations with automation, APIs, and self-service platforms.
- Operate in a globally distributed, hybrid multi-cloud environment (AWS, GCP, on-prem), building cloud-native and location-agnostic systems.
- Build strong cross-functional relationships and align with collaborators across various business units.
- Improve uptime and Quality of Service (QoS) through data-driven operations, strong SLOs, and robust incident practices.
- Participate in the team’s on-call rotation and lead high-impact incident response when needed.
Requirements
- Strong coding skills in at least two of: Go, Java, C++, Scala, Python, Elixir (focus on backend, systems, or infrastructure engineering).
- Deep understanding of scalability, consistency, and performance trade-offs in server-side systems; ability to build horizontally scalable, resilient, and low-latency services.
- Experience owning services end-to-end: architecture, build reviews, implementation, testing, rollout, observability, and iterative improvement.
- Hands-on experience with at least one major cloud provider (GCP, AWS, or Azure) and cloud-native primitives (managed storage, messaging, compute).
- Proficiency with modern CI/CD, GitOps workflows, and Infrastructure as Code practices for safe, repeatable changes.
- Bias for action, strong problem-solving skills, and a track record of simplifying complex systems.
- B.S. in Computer Science or related field (or equivalent experience), with 5+ years of relevant experience.
- Careful communication and collaboration skills; comfortable guiding technical decisions across teams.
Ways to stand out
- Prior experience building core infrastructure or control planes for HPC clusters, large-scale AI/ML platforms, or systems managed by job schedulers (e.g., Slurm or Kubernetes).
- Maintainer or co-maintainer responsibilities for an open source component used in production (plugins, operators, exporters, controllers, or SDKs) at large scale.
Compensation & Benefits
- Base salary range: 152,000 USD - 241,500 USD (final base salary determined by location, experience, and internal pay comparisons).
- Eligibility for equity and NVIDIA benefits (link to benefits available in original posting).
Additional details
- #LI-Hybrid (role listed as hybrid).
- Applications accepted at least until March 13, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.