Required Skills & Competences
Go @ 4, Kubernetes @ 4, IaC @ 4, Hiring @ 4, Networking @ 4, API @ 4, GPU @ 4
Details
NVIDIA is on the journey to build the best cloud offering for AI workloads and to bring its latest GPU technology to our clients as a set of managed services under the DGX Cloud umbrella. We want to be able to innovate on behalf of our clients and provide an easy, no-hassle way of using the latest and greatest NVIDIA products through scalable managed self-service APIs. We are looking for a Cloud Platform Engineer to drive the technical design and build foundational elements of our high-performing cloud services for Artificial Intelligence and high-performance computing. This is a unique opportunity to be a founding member of a team building at the intersection of highly scalable fault-tolerant cloud services and AI.
If you are passionate about IaC and can argue why declarative infrastructure is the way to go, can explain a Kubernetes PodDisruptionBudget (PDB) to your family in under 5 minutes, or have always felt that Kubernetes is great but not the ultimate goal and wanted to extend it into the distributed operating system for AI, you are a perfect fit for our team!
Responsibilities
- As part of the service team, design and build platforms for DGX Cloud services.
- Figure out how to take the best from HPC and Kubernetes and help us build a unified platform.
- Work within a team of software engineers and product people, as well as with engineering teams across NVIDIA, on DGX Cloud AI Compute services.
- Write Infrastructure as Code (IaC), work on Kubernetes, and help the team design and implement release pipelines.
- Collaborate to understand how to make the best use of GitOps and pipelines.
 
Requirements
- BS in Computer Science, Information Systems, Computer Engineering or equivalent experience.
- Solid technical foundation in distributed computing and storage, including substantial experience with server systems, storage, I/O, networking, and system software.
- 12+ years of platform engineering experience on large-scale production systems.
- Hands-on engineering expertise in Kubernetes and IaC.
- Ability to understand and communicate complex designs, distributed infrastructure, and requirements to peers, customers, and vendors.
- General knowledge of shared storage systems such as NFS, Lustre, and GlusterFS.
- Familiarity with system-level architecture, such as interconnects, memory hierarchy, interrupts, and memory-mapped I/O.
 
Ways to stand out
- Proven experience in high-performance computing (HPC), deep learning, and/or GPU-accelerated computing domains.
- Experience with large-scale distributed systems, HPC, and ML training using Slurm and Kubernetes.
- Deep knowledge of both software and hardware in HPC and ML infrastructure.
 
Compensation & Other Details
- Base salary range: 224,000 USD - 356,500 USD (base salary determined based on location, experience, and pay of employees in similar positions).
- You will also be eligible for equity and benefits (see https://www.nvidia.com/en-us/benefits/).
- Applications for this job will be accepted at least until September 22, 2025.
 
NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.