Senior System Engineer - DGX Cloud Lepton

at Nvidia

📍 Santa Clara, United States

USD 184,000-356,500 per year

SENIOR

✅ On-site

Required Skills & Competences

Security, Kubernetes, CI/CD, Scoping, Communication, SRE, Prioritization, GPU

Details

Joining NVIDIA's DGX Cloud Lepton team means contributing to the infrastructure that powers AI research. The team focuses on optimizing the efficiency and resiliency of AI workloads and on developing scalable AI and data infrastructure tools and services. The objective is to deliver a stable, scalable environment for AI researchers. DGX Cloud Lepton delivers NVIDIA-managed GPU/Kubernetes capacity for AI workloads.

Responsibilities

Design, build, and operate core services and node/cluster foundations for the Lepton platform; automate deployments, upgrades, and day-2 operations.
Own intake, prioritization, rollout, and rollback rhythms across OS, drivers/firmware, and platform components (vulnerability & patch management).
Define, deliver, and maintain secure-by-default baselines (host hardening, workload isolation, network segmentation, least-privilege access) for AI infrastructure at scale.
Standardize patterns for service identity, role scoping, secrets handling, and certificate hygiene (identity & access stewardship).
Drive change control and release practices to ensure traceability and integrity of production releases (trusted releases).
Establish health signals and SLOs; lead investigations, root cause analysis, and follow-through actions to improve reliability and security (monitoring & incident practice).
Partner with product, SRE, and security stakeholders to assess risks for new features and close gaps with pragmatic controls (risk & readiness).
Publish runbooks and standards; review designs and coach engineers on secure operational practices (documentation & mentorship).

Requirements

7+ years in systems/platform engineering, operating large-scale production environments.
Demonstrated ability to deliver secure, reliable platforms (hardening, access control, isolation, monitoring, and strong operational runbooks).
Experience with containerized/managed cluster environments; familiarity with GPU-accelerated platforms or ability to ramp quickly.
Automation mindset with infrastructure-as-code and CI/CD; disciplined change management.
Clear communication and documentation skills; ability to turn requirements into practical, supportable designs.
Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).

Ways to stand out:

Hands-on engineering experience delivering and driving platform security baselines in multi-tenant environments.
Production Kubernetes experience (EKS/AKS/GKE), especially private clusters and Pod Security Admission (PSA) restricted defaults.
Supply-chain basics at scale: signed images (cosign) enforced via policy-as-code (Kyverno/OPA).
Familiarity with NVIDIA GPU platforms (GPU Operator/device plugin, MIG-aware operations).

Benefits / Compensation

Base salary ranges (determined by location, experience, and comparable roles):
- Level 4: 184,000 USD - 287,500 USD
- Level 5: 224,000 USD - 356,500 USD
Eligible for equity and company benefits (see NVIDIA benefits).
Applications are accepted at least until August 19, 2025.

NVIDIA is an equal opportunity employer committed to fostering a diverse work environment.