Used Tools & Technologies
Go, IaC
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Security @ 4
Docker @ 4
Grafana @ 3
Kubernetes @ 4
Prometheus @ 3
TypeScript @ 4
Automated Testing @ 7
Python @ 4
GCP @ 4
CI/CD @ 4
Distributed Systems @ 4
Machine Learning @ 4
TensorFlow @ 3
AWS @ 4
Azure @ 4
Communication @ 7
JavaScript @ 4
Next.js @ 4
React @ 4
Node.js @ 4
Debugging @ 4
API @ 7
PyTorch @ 3
GPU @ 4
Deep Learning @ 3
Observability @ 4
AI @ 4
Robotics @ 4
JAX @ 3
Details
NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 30 years. Today, the company is at the forefront of AI innovation powering breakthroughs in research, autonomous vehicles, robotics, and more. The DGX Cloud team builds and operates the AI infrastructure that fuels this progress.
The AI Hub team within the DGX Cloud AI Infrastructure organization accelerates AI research by ensuring NVIDIA’s AI infrastructure is used efficiently, transparently, and at scale. The team is building a unified, self-service “single pane of glass” portal that enables AI researchers to efficiently manage, monitor, and optimize their use of managed AI research Superclusters.
Responsibilities
- Lead the architecture and delivery of high-scale web products across frontend, backend services, and data layers, with clear availability and latency targets (SLOs/SLAs).
- Own multi-team initiatives end to end: problem discovery, RFCs/design reviews, phased rollouts, and success metrics tied to product and business outcomes.
- Drive reliability, performance, and observability improvements to meet exascale standards.
- Establish engineering standards and reusable platforms/design systems that reduce complexity, sustain performance under load, and pay down long-term tech debt.
- Collaborate with NVIDIA AI Research teams to identify pain points and deliver the next-generation user experience that accelerates their work.
- Mentor and sponsor engineers; improve code quality, testing, security, and observability through reviews, pairing, and coaching.
- Stay ahead of AI/ML infrastructure trends and drive adoption of best practices within the team.
Requirements
- 12+ years of software engineering experience delivering production web systems.
- Bachelor’s degree or higher in Computer Science or a related technical field (or equivalent experience).
- Strong cross-functional collaboration skills, including active listening, translating complex use cases into clear technical requirements, and designing data models aligned with business logic and outcomes.
- Deep cloud expertise (AWS, GCP, or Azure), infrastructure as code, containers, and orchestration (Docker, Kubernetes), along with mature CI/CD and safe deployment practices.
- Full-stack depth: modern SPA frameworks (React/Next.js or Vue/Nuxt), JavaScript/TypeScript, and one or more backend languages (Node.js, Python, and/or Golang).
- Familiarity with observability stacks such as OpenSearch, Prometheus, Grafana, or Loki.
- Proficiency in API design (REST), schema evolution, and integration patterns, with a strong commitment to automated testing.
- Experience building machine learning platforms or self-service internal infrastructure tools focused on efficiency, resiliency, and observability.
- Clear written and verbal communication skills, strong problem-solving ability, and a growth mindset.
- Experience leveraging AI-assisted development tools (e.g., Cursor).
Ways to Stand Out
- Hands-on ML platform depth (MLE experience or strong familiarity with deep learning frameworks such as PyTorch, TensorFlow, JAX; distributed training ecosystems like Ray).
- Datacenter-scale operational experience, including GPU cluster debugging, performance triage, and root-cause analysis across complex distributed systems.
Benefits
- NVIDIA provides competitive salaries, equity eligibility, and a comprehensive benefits package.
Additional Information
- Base salary range (location- and experience-dependent): 224,000 USD - 356,500 USD.
- You will also be eligible for equity and benefits.
- Applications for this job will be accepted at least until May 3, 2026. This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer committed to a diverse work environment.