Senior Systems Software Engineer, Kubernetes Node Lifecycle - DGX Cloud

at NVIDIA
USD 184,000-356,500 per year
SENIOR
✅ On-site

Used Tools & Technologies

Go

Required Skills & Competences

Security @ 4 Kubernetes @ 4 Linux @ 7 Python @ 6 GCP @ 6 CI/CD @ 4 AWS @ 6 Azure @ 6 Communication @ 4 Debugging @ 7 API @ 4 Compliance @ 4 GPU @ 4 AI @ 4

Details

At NVIDIA, the DGX Cloud division combines new hardware and software innovations to deliver leading accelerated computing solutions for the most demanding AI workloads worldwide. Our team of skilled engineers is committed to tackling major global challenges, consistently advancing technology, and making a difference in millions of lives around the world!

Responsibilities

  • Lead the development and refinement of Cluster API (CAPI) providers for NVIDIA Kubernetes Engine (NKE), ensuring reliable, consistent, and scalable node provisioning across DGX Cloud and NCP environments.
  • Develop and maintain bring-your-own-node workflows that allow customers to integrate different NVIDIA hardware into NKE clusters while ensuring high operational consistency.
  • Coordinate OS image build, packaging, deployment, and update processes for NKE nodes. Ensure images are optimized for NVIDIA GPU workloads and meet enterprise- and cloud-grade security and compliance requirements.
  • Develop and maintain node image hardening pipelines, incorporating CIS benchmarks, automated CVE remediation, and promotion gates tied to security posture.
  • Develop and maintain automated test suites for node images. These tests verify correctness across Kubernetes versions and NVIDIA hardware configurations before production deployment and support continuous validation through CI/CD pipelines.
  • Manage nodepool lifecycle at scale, including provisioning, upgrades, drain and cordon workflows, and seamless node replacement across very large clusters with diverse NVIDIA hardware.
  • Investigate and resolve node-layer faults in production NKE clusters and determine their root causes (image configuration, driver packaging, kubelet operation, hardware activation), and optimize the node layer for high-scale scenarios.
  • Partner with upstream communities, including Cluster API, Kubernetes, and other CNCF projects, to establish node provisioning and lifecycle standards. Present progress and findings at internal and external events such as KubeCon and GTC.

Requirements

  • 8 years of experience in systems software, cloud infrastructure, or Kubernetes node engineering.
  • Bachelor's or Master's degree in Engineering (Electrical, Computer Engineering, Computer Science) or equivalent experience.
  • Deep expertise in Cluster API (CAPI), including provider development and the full machine lifecycle from provisioning to deletion.
  • Extensive experience with OS image build pipelines, node image packaging, and delivery systems for Kubernetes nodes (for example, image-builder, containerd, cloud-init, Packer).
  • Practical experience with bring-your-own-node models and integrating diverse hardware into live Kubernetes environments, including large-scale nodepool lifecycle management and upgrades.
  • Strong understanding of kubelet configuration, node bootstrap, and the Kubernetes node registration lifecycle.
  • Experience with node image security, including vulnerability scanning, patch automation, and compliance gating as part of image build pipelines.
  • Proficiency in Go and/or Python, and hands-on experience with at least one major public cloud provider (GCP, AWS, Azure, OCI, or equivalent).

Ways to Stand Out

  • Direct experience building or maintaining node image pipelines for a hyperscaler Kubernetes distribution (GKE, EKS, AKS, OKE, or equivalent).
  • Experience with supply chain security and hardening for node images, including image signing, provenance attestation, SBOM generation, CIS benchmark consistency, and automated CVE remediation.
  • Experience with automated node provisioning and right-sizing at scale (for example, Karpenter, GKE NAP), and with how these interact with GPU workload scheduling.
  • Strong operational experience working with immutable OS image distributions (such as Flatcar, Bottlerocket, Azure Linux) and debugging node-layer failures in large Kubernetes clusters.
  • Proven track record of upstream contributions to Cluster API, Kubernetes, or related CNCF projects, combined with excellent communication and interpersonal skills.

Compensation

  • Base salary range: 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.
  • You will also be eligible for equity and benefits (see NVIDIA benefits page).

Applications for this job will be accepted at least until June 14, 2026.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and is an equal opportunity employer. The company does not discriminate on the basis of legally protected characteristics.