Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 3
Go @ 6
Kubernetes @ 3
Python @ 6
Algorithms @ 3
Data Structures @ 3
Distributed Systems @ 3
MLOps @ 3
Hiring @ 3
Communication @ 3
SRE @ 7
Rust @ 6
Debugging @ 6
OWASP @ 3
LLM @ 3
Compliance @ 3
Agile @ 3
GPU @ 3
Observability @ 3
AI @ 3
Details
NVIDIA is hiring an AI Platform Engineer to build, support, and maintain the next generation of AI-powered enterprise products that improve engineering efficiency and data security and that power product development. This role collaborates with Cloud and AI/ML teams in a multifaceted, agile environment and focuses on scalable, reliable AI-native infrastructure and tooling.
Responsibilities
- Define and lead AI-native infrastructure roadmaps and cross-organizational initiatives.
- Architect and scale LLM/ML infrastructure across cloud-native clusters and on-premises hardware.
- Design and implement observability for infrastructure health and AI model performance.
- Build LLM-aware monitoring and leverage AI to improve incident response and reduce toil.
- Develop automation and tooling to ensure reliability, scalability, and developer self-service.
- Troubleshoot complex distributed systems, including deep Kubernetes and AI/ML scaling challenges.
- Drive AI-assisted engineering practices and mentor engineers to foster an AI-first culture.
- Partner with product engineering and internal business units to translate AI platform capabilities into reliable, scalable solutions that accelerate product development.
Requirements
- 10+ years of experience in cloud, platform, or SRE roles.
- Bachelor's degree or equivalent experience.
- Strong Python and at least one systems language (C++, Go, or Rust), with proven distributed systems debugging expertise.
- Deep experience building and scaling distributed systems, including Kubernetes and bare-metal infrastructure.
- Strong observability design across infrastructure and AI workloads (metrics, logging, tracing, AI quality signals).
- Hands-on experience operating AI/ML platforms, including MLOps, model serving, and GPU-accelerated environments.
- Experience with infrastructure and application security practices, such as identity/auth, network segmentation, supply chain security, and vulnerability management in cloud-native environments.
- Practical use of AI-assisted development tools and coding agents in daily workflows.
- Solid foundation in data structures, algorithms, and complexity analysis.
- Excellent problem-solving, communication, and collaboration across multiple functions.
Ways to stand out
- Deep experience with AI/ML platforms (e.g., Hugging Face, Weights & Biases, NVIDIA NIM).
- Proven use of AI agents and LLM tooling to enhance observability, incident response, or developer productivity.
- Experience with artifact management, AI supply chain security, or trusted model distribution systems.
- Experience with AI-specific threat models (OWASP Top 10 for LLMs, model poisoning, adversarial inputs), compliance frameworks (FedRAMP, SOC 2), and red-teaming or security evaluation of LLM systems.
- Strong sense of ownership with a structured, automation-first approach.
- Demonstrated impact driving AI-first engineering practices across teams.
Compensation & Benefits
- Base salary ranges (determined by location, experience, and internal pay):
  - Level 4: 168,000 USD - 270,250 USD
  - Level 5: 200,000 USD - 322,000 USD
- Eligible for equity and benefits.
Other information
- Applications for this job will be accepted at least until March 23, 2026.
- This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer and is committed to fostering a diverse work environment.