Used Tools & Technologies
Go
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable, regular usage; capable of handling common tasks and challenges related to the technology;
- 7-9: expert. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge. Comprehensive understanding of and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Kubernetes @ 3
Python @ 5
GCP @ 3
CI/CD @ 3
Distributed Systems @ 3
AWS @ 3
Azure @ 3
Communication @ 3
Networking @ 3
Reporting @ 3
GPU @ 3
AI @ 3
Details
The DGX Cloud organization at NVIDIA builds hardware and software to deliver accelerated computing for large AI workloads. This role focuses on scaling AI infrastructure, optimizing performance and total cost of ownership across the full stack, from Kubernetes control and data planes to NVIDIA components and distributed inference serving. The team collaborates with AI researchers, developers, customers, and upstream open-source communities to validate and improve performance at scale.
Responsibilities
- Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving.
- Follow issues from orchestration down to the metal; triage, debug and root-cause issues related to operating Kubernetes clusters at ultra-large scale.
- Collaborate with AI researchers, developers and customers to develop automated tests that simulate real user workloads using custom-built and open-source tools and frameworks.
- Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and NVIDIA software components.
- Design and develop monitoring, reporting and analysis tools for performance and scale testing across software, GPU and CPU resources.
- Build and maintain a high-velocity framework that enables continuous, always-on performance and scale testing via a modern CI/CD pipeline.
- Document research, methodologies and results; present findings internally and externally (e.g., KubeCon, GTC).
- Engage with upstream communities (Kubernetes, CNCF, NVIDIA open-source projects) to validate performance and shape design decisions.
Requirements
- 2+ years of experience in computer architecture, networking, storage systems, and accelerators, plus a Bachelor's or Master's degree in Engineering (Electrical Engineering, Computer Engineering, Computer Science) or equivalent experience.
- Expertise in Kubernetes and familiarity with related CNCF projects.
- Experience with large-scale parallel and distributed accelerator-based systems.
- Expertise in optimizing performance and AI workloads on large-scale systems; experience with performance modeling and benchmarking at scale.
- Proficiency in Golang and Python.
- Background with the NVIDIA software ecosystem in both training and inference domains (GPU Operator, device plugins, DCGM, NIM, etc.).
- Expertise with at least one public cloud provider (e.g., GCP, AWS, Azure, OCI).
Ways to stand out
- Strong operational experience with a Kubernetes distribution.
- Prior experience scaling Kubernetes clusters to ultra-large node and object counts.
- Demonstrated history of working in open-source communities.
- Excellent communication and interpersonal skills.
- PhD in relevant areas.
Compensation
- Your base salary will be determined based on your location, experience, and pay of employees in similar positions.
- For Poland: The base salary range is 176,250 PLN - 305,500 PLN for Level 2, and 221,250 PLN - 383,500 PLN for Level 3.
Location & Employment Type
- Location: Germany or Remote.
- Employment type: Full time.