Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Security @ 4
Docker @ 7
Go @ 6
Grafana @ 4
Kubernetes @ 4
Prometheus @ 4
Python @ 6
GCP @ 7
Java @ 6
Distributed Systems @ 4
Leadership @ 4
AWS @ 7
Azure @ 7
API @ 4
Technical Leadership @ 4
OpenTelemetry @ 4
CUDA @ 3
Cloud Computing @ 4
GPU @ 4
Observability @ 4
AI @ 4
Data Pipelines @ 4
Slurm @ 4
Details
NVIDIA is seeking a Principal Software Engineer to join the DGX Cloud team to build foundational systems that drive NVIDIA's high-performance GPU infrastructure. You will craft scalable automation solutions, integrate diverse systems, and enable seamless workflows across global cloud operations. As a Principal Engineer in DGX Cloud you will provide technical leadership for the platform that supports AI and cloud computing workloads.
Responsibilities
- Lead the build and development of next-generation APIs, state management, and workflow orchestration systems that automate fleet lifecycle operations at massive scale.
- Drive technical alignment across dependent systems and partner teams to ensure cohesive integration, clear interfaces, and reliable end-to-end workflows, with a strong focus on delivery.
- Coach, mentor, and encourage senior engineers; elevate technical standards and guidelines across the organization.
- Maintain strong focus on customer experience and product requirements, translating technical insight into high-impact business solutions.
- Partner with executive and engineering leadership to codify business processes into self-measuring, scalable, and operationally consistent platforms to reduce manual toil.
- Direct the integration strategy for key technologies, including common AI schedulers (e.g., Kubernetes, Slurm) and observability systems (e.g., Prometheus, OpenTelemetry, Grafana).
Requirements
- 16+ years of progressive industry experience.
- Master's or Bachelor's degree, or equivalent experience defining and shipping complex distributed systems.
- Deep, hands-on expertise in establishing, operating, and scaling services in fast-paced, high-reliability environments.
- Ability to thrive in ambiguous, fast-paced environments by rapidly testing ideas, iterating toward working solutions, and hardening winners into reliable, scalable systems.
- Outstanding proficiency in modern systems programming languages such as Go, Java, or Python.
- Proven track record of defining, owning, and evolving the architecture of high-scale distributed systems, including advanced patterns for APIs, control planes, and data pipelines.
- Deep understanding of global cloud infrastructure (AWS, GCP, Azure) and container ecosystems (Docker, Kubernetes).
- Demonstrated ability to drive technical strategy and influence outcomes across organizational boundaries.
- Outstanding ability to communicate complex technical concepts, drive organizational consensus, and mentor high-performing engineers.
Ways to Stand Out from the Crowd
- History of leading development and adoption of organization-wide workflow orchestration systems for petabyte-scale infrastructure.
- Experience in a Principal/Staff+ capacity delivering measurable improvements in operational efficiency, reliability, and security across a large engineering organization.
- Deep familiarity with operational and deployment aspects of the NVIDIA AI/ML software stack (CUDA, cuDNN, containerization).
- Patent contributions or a strong publication record in distributed systems, cloud computing, or infrastructure automation.
Compensation & Benefits
- Base salary range: 272,000 USD - 431,250 USD (final base salary determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits. See: https://www.nvidiabenefits.com/
Other details
- Applications accepted at least until May 3, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and does not discriminate based on protected characteristics.