Senior DGX Cloud Software Engineer - Infrastructure Automation and Distributed Systems
at NVIDIA
USD 168,000-333,500 per year
Required Skills & Competences
Security @ 4, Go @ 6, Kubernetes @ 4, Linux @ 4, Python @ 6, Distributed Systems @ 4, Communication @ 4, Mathematics @ 4, Networking @ 4, SRE @ 4
Details
We are seeking Software Engineers with previous experience building and running private and public clouds at production scale. As part of the DGX Cloud team, you’ll support customer journeys in AI training and inference development by building platforms, tools, and services that defend the operational capacity of bare-metal, accelerated compute infrastructure and codify reliability best practices in the broader DGX Cloud platform ecosystem.
Responsibilities
- Design, build, and run cloud infrastructure services to meet business goals, performing integrations, migrations, bringups, updates, and decommissions as necessary.
- Participate in the definition of internal-facing service level objectives (SLOs) and error budgets as part of an overall observability strategy.
- Eliminate toil by building automation where the return justifies the cost of building and maintaining it.
- Practice sustainable blameless incident prevention and incident response while participating in an on-call rotation.
- Consult with and provide guidance to peer teams on systems design best practices.
- Participate in a supportive culture of values-driven introspection, communication, and self-organization.
Requirements
- Proficiency in one or more of the following programming languages: Python or Go.
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics) or equivalent experience.
- 5+ years of relevant experience in infrastructure and fleet management engineering.
- Experience with infrastructure automation and distributed systems design, including developing tools for running large-scale private or public cloud systems that require fully automated management and are under active customer use in production.
- A track record demonstrating initiative (starting your own projects), persuasion (convincing others to collaborate), and strong collaboration on projects initiated by others.
- In-depth knowledge in one or more of: Linux, Slurm, Kubernetes, local and distributed storage, and systems networking.
Ways to Stand Out
- Demonstrating systematic problem-solving, clear communication, and ownership (for example, driving build/reuse/buy decisions).
- Experience with bare metal as a service (BMaaS) systems (e.g., vending BMaaS, Slurm running on containers, vending Kubernetes clusters), and multi-cloud infrastructure services.
- Experience teaching reliability engineering (e.g., SRE) and/or other scale-oriented cloud systems practices (e.g., CRE).
- Experience with accelerated compute and communications technologies such as BlueField Networking, InfiniBand topologies, NVMesh, and/or the NVIDIA Collective Communication Library (NCCL).
- Experience working with centralized security organizations to prioritize and mitigate security risks.
- Prior experience in an ML/AI-focused role or on an ML/AI-focused team is welcome but not required.
Compensation & Benefits
- Base salary range (determined by location, experience, and similar roles):
  - Level 4: 168,000 USD - 270,250 USD
  - Level 5: 208,000 USD - 333,500 USD
- You will also be eligible for equity and benefits (link to NVIDIA benefits policy referenced in the original posting).
Company
NVIDIA is a leader in AI, high-performance computing, and visualization. The company emphasizes creativity, autonomy, and collaboration, and seeks candidates who share those traits.
Additional Information
- Applications for this job will be accepted at least until December 16, 2025.
- The role includes on-call responsibilities and participation in incident response and SLO/error budget definition.
- The posting highlights working with bare-metal accelerated compute infrastructure and large-scale private/public cloud operations.
- NVIDIA is an equal opportunity employer and values diversity. The posting states non-discrimination across many protected characteristics.