Systems Software Engineer, Kubernetes Scale - DGX Cloud

at Nvidia
πŸ“ World
πŸ“ Germany
PLN 176,250-383,500 per year
MIDDLE
✅ Remote

Used Tools & Technologies

Go

Required Skills & Competences

Kubernetes @ 3, Python @ 5, GCP @ 3, CI/CD @ 3, Distributed Systems @ 3, AWS @ 3, Azure @ 3, Communication @ 3, Networking @ 3, Reporting @ 3, GPU @ 3, AI @ 3

Details

The DGX Cloud organization at NVIDIA builds hardware and software to deliver accelerated computing for large AI workloads. This role focuses on scaling AI infrastructure, optimizing performance and total cost of ownership across the full stack, from Kubernetes control and data planes to NVIDIA components and distributed inference serving. The team collaborates with AI researchers, developers, customers, and upstream open-source communities to validate and improve performance at scale.

Responsibilities

  • Drive end-to-end performance and scale characterization for the NVIDIA DGX Cloud software stack, from Kubernetes control and data planes through NVIDIA components such as GPU Operator, Network Operator, DCGM, NIM, and distributed inference serving.
  • Follow issues from orchestration down to the metal; triage, debug and root-cause issues related to operating Kubernetes clusters at ultra-large scale.
  • Collaborate with AI researchers, developers and customers to develop automated tests that simulate real user workloads using custom-built and open-source tools and frameworks.
  • Deep dive into performance and scale issues in complex distributed systems, including interactions between Kubernetes and NVIDIA software components.
  • Design and develop monitoring, reporting and analysis tools for performance and scale testing across software, GPU and CPU resources.
  • Build and maintain a high-velocity framework that enables continuous, always-on performance and scale testing via a modern CI/CD pipeline.
  • Document research, methodologies and results; present findings internally and externally (e.g., KubeCon, GTC).
  • Engage with upstream communities (Kubernetes, CNCF, NVIDIA open-source projects) to validate performance and shape design decisions.

Requirements

  • 2+ years of experience in computer architecture, networking, storage systems, and accelerators, plus a Bachelor's or Master's degree in Engineering (Electrical Engineering, Computer Engineering, Computer Science), or equivalent experience.
  • Expertise in Kubernetes and familiarity with related CNCF projects.
  • Experience with large-scale parallel and distributed accelerator-based systems.
  • Expertise in optimizing performance and AI workloads on large-scale systems; experience with performance modeling and benchmarking at scale.
  • Proficiency in Golang and Python.
  • Background with the NVIDIA software ecosystem in both training and inference domains (GPU Operator, device plugins, DCGM, NIM, etc.).
  • Expertise with at least one public cloud provider (for example, GCP, AWS, Azure, or OCI).

Ways to stand out

  • Strong operational experience with a Kubernetes distribution.
  • Prior experience scaling Kubernetes clusters to ultra-large node and object counts.
  • Demonstrated history of working in open-source communities.
  • Excellent communication and interpersonal skills.
  • PhD in relevant areas.

Compensation

  • Your base salary will be determined based on your location, experience, and pay of employees in similar positions.
  • For Poland: The base salary range is 176,250 PLN - 305,500 PLN for Level 2, and 221,250 PLN - 383,500 PLN for Level 3.

Location & Employment Type

  • Location: Germany or Remote.
  • Employment type: Full time.