Senior Systems Engineer, NVIDIA Mission Control
at NVIDIA
Santa Clara, United States
USD 184,000-356,500 per year
Required Skills & Competences
Go @ 4, DevOps @ 4, CI/CD @ 4, Distributed Systems @ 4, Leadership @ 4, Networking @ 4, SRE @ 4, Rust @ 4, Technical Leadership @ 4, Agile @ 4, GPU @ 4
Details
NVIDIA DGX Cloud is a fully managed, cloud-based AI supercomputing platform that provides organizations with direct access to NVIDIA's advanced GPU clusters, software, and AI expertise for developing and deploying AI workloads. NVIDIA Mission Control powers every aspect of AI factory operations, from developer workloads to infrastructure to facilities, with the skills of a world-class operations team delivered as software. The team is building and improving a platform that will automate diagnosis and repair of GPU and CPU clusters on public clouds, private clouds, and virtual and physical hardware.
Responsibilities
- Improve the existing cluster automation platform to be more fault-tolerant, agile, hardware/networking aware, and resource-efficient.
- Enable AI capabilities in the platform to enhance user experience and accelerate automation, diagnosis, and remediation of issues.
- Integrate with ecosystem tools to provide a rich, unified end-to-end user experience.
- Collaborate with stakeholders across NVIDIA to understand business context, influence product roadmap, drive adoption of the automation platform, and reduce operational toil.
- Operate critical software services with high availability and reliability.
- Program in systems languages such as Rust and Go.
- Drive engineering best practices, mentor engineers, and foster an inclusive team culture.
Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Keen interest in driving Agent AI projects and enabling AI capabilities in production platforms.
- Approximately 10 years of equivalent professional experience.
- Demonstrated ability to build scalable, agile, and robust distributed systems.
- Experience with successful product rollouts and collaborating with early adopters.
- Technical leadership and ownership of cross-organizational projects; hands-on approach and passion for continuous improvement.
- Experience working with ambiguity and driving clarity in complex technical decisions.
- Experience operating production services with high availability and reliability; familiarity with hardware- and networking-aware systems and cluster automation across public and private clouds.
Ways to Stand Out
- Skilled at using AI to scale team productivity and agility.
- Experience revamping complex systems alongside existing customers to take them to the next level.
- Experience with SRE, DevOps practices, and CI/CD across a variety of platforms.
Compensation & Benefits
- Base salary range (determined by location, experience, and comparable roles):
  - Level 4: 184,000 USD - 287,500 USD
  - Level 5: 224,000 USD - 356,500 USD
- Eligible for equity and NVIDIA benefits (see NVIDIA benefits page).
Additional Information
- Applications accepted at least until September 19, 2025.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment. The company does not discriminate based on protected characteristics.