Used Tools & Technologies
Go LLM
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 – basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 – daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 – you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 – exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Kubernetes @ 4
TypeScript @ 4
SQL @ 6
Distributed Systems @ 4
Hiring @ 4
Communication @ 7
JavaScript @ 6
PostgreSQL @ 4
React @ 4
GPU @ 4
AI @ 4
Slurm @ 7
Details
NVIDIA is hiring experienced software engineers to help scale up its AI infrastructure. You will help advance NVIDIA's capacity to build and deploy leading infrastructure solutions for a broad range of AI-based applications. Expect to be challenged, to improve, and to evolve. If you are creative, passionate about GPUs, and enjoy working on large-scale systems, please apply.
Responsibilities
- Be part of the DGX Cloud team responsible for production systems that enable large-scale GPU clusters for a variety of AI workloads.
- Design and develop a massively distributed, scalable platform used to identify, diagnose, and remediate non-performant GPU assets.
- Work with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluate system failures and improve services based on a defined incident management process.
- Work across the product stack including frontend and backend technologies: React, Web Components, TypeScript, Golang, PostgreSQL, Temporal, Bazel, Kubernetes.
Requirements
- Significant software engineering experience within a highly technical organization with demonstrable impact.
- Strong communication skills and ability to work successfully with cross-functional teams, principals, and architects across organizational boundaries and geographies.
- 12+ years in a similar role with experience on large-scale production systems.
- BS in Computer Science or Engineering or equivalent experience.
- 6+ years of full-stack engineering experience.
- 3+ years building and shipping consumer-facing products.
- Proficiency in React, TypeScript/JavaScript, and Golang.
- Proficiency with a SQL database such as PostgreSQL.
Ways to stand out
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers. Advanced hands-on experience and deep understanding of cluster management systems (Kubernetes, Slurm, Base Command Manager).
- Empathy for users, attention to detail, and passion for creating world-class user experiences.
- Prior experience in asynchronous workflows and/or event-driven architecture.
- Proven operational excellence in maintaining reliable and performant infrastructure.
- A good understanding of how to use LLMs responsibly and the perils of blindly consuming their output.
Compensation & Benefits
- Base salary range: 224,000 USD - 356,500 USD (determined based on location, experience, and pay of employees in similar positions).
- Eligible for equity and company benefits (link to NVIDIA benefits in original posting).
Other details
- Applications for this job will be accepted at least until May 18, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.