Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems
at NVIDIA
USD 144,000-270,250 per year
Required Skills & Competences
Security @ 4, Docker @ 4, Go @ 6, Kubernetes @ 4, Linux @ 4, Python @ 6, Distributed Systems @ 4, Communication @ 4, Mathematics @ 4, Networking @ 4, OpenStack @ 4, SRE @ 4
Details
We are seeking software engineers with experience building and operating private and public clouds at production scale. As part of the DGX Cloud team you'll support customers' AI training and inference journeys by building platforms, tools, and services that defend the operational capacity of bare-metal, accelerated compute infrastructure and codify reliability best practices across the DGX Cloud ecosystem.
Responsibilities
- Design, build, and run cloud infrastructure services to meet business goals, including integrations, migrations, bring-ups, updates, and decommissions.
- Define internal-facing service level objectives (SLOs) and error budgets, and participate in maintaining them, as part of the overall observability strategy.
- Eliminate toil or automate it where the ROI of building and maintaining automation is justified.
- Practice sustainable, blameless incident prevention and response, and participate in an on-call rotation.
- Consult with peer teams on systems design best practices and provide technical guidance.
- Contribute to a values-driven, communicative, and self-organizing engineering culture.
Requirements
- Proficiency in at least one of the following programming languages: Python or Go.
- BS degree in Computer Science or a related technical field (e.g., physics, mathematics) or equivalent experience.
- 5+ years of relevant experience in infrastructure and fleet management engineering.
- Experience with infrastructure automation and distributed systems design for running large-scale private or public cloud systems requiring fully automated management under active customer consumption in production.
- Practical experience defining and operating observability, SLOs, and error-budget driven processes.
- Experience with incident response and being part of on-call rotations.
- In-depth knowledge of one or more of: Linux, Slurm, Kubernetes, local and distributed storage, and systems networking.
- Demonstrated track record of initiating projects, collaborating across teams, and driving projects to completion.
Ways to stand out
- Systematic problem-solving, clear communication, ownership, and experience driving build/reuse/buy decisions.
- Experience with bare metal as a service (BMaaS) systems (vending BMaaS, Slurm running on containers, vending Kubernetes clusters) or multi-cloud infrastructure services.
- Experience teaching reliability engineering (SRE/CRE) or scale-oriented cloud systems practices to peers or other organizations.
- Experience running private or public cloud systems based on Kubernetes, OpenStack, Docker, or Slurm.
- Experience with accelerated compute and communications technologies such as BlueField Networking, InfiniBand topologies, NVMesh, and/or the NVIDIA Collective Communication Library (NCCL).
- Experience collaborating with centralized security teams to prioritize and mitigate security risks.
Compensation & Benefits
- Base salary ranges (location, level, and experience dependent):
  - Level 3: 144,000 USD - 230,000 USD
  - Level 4: 168,000 USD - 270,250 USD
- Eligible for equity and benefits.
Additional details
- Location: Santa Clara, CA, United States (Full time).
- Application deadline: Applications will be accepted at least until August 13, 2025.
- Employer statement: NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer. NVIDIA does not discriminate based on legally protected characteristics.