Used Tools & Technologies
Not specified
Required Skills & Competences
Security (3), System Administration (5), Docker (6), Jenkins (3), Kubernetes (6), Linux (5), DevOps (6), Terraform (3), Python (5), GCP (6), CI/CD (3), Hiring (3), AWS (6), Azure (6), Communication (3), SRE (3)
Details
For over 25 years, NVIDIA has pioneered visual and accelerated computing. The team is seeking a foundational System Software Engineer to ensure 24/7 operation, maintenance, and scaling of a multi-architecture training delivery platform that spans multiple cloud service providers and regions. The role focuses on managing operational expenditure, optimizing cost per learner, preventing compute capacity shortages, and enabling scalable, reliable learning experiences on a purpose-built Learning Management System (LMS) platform.
Responsibilities
- Build systems to support maintenance, scaling, and operation of global compute platforms spanning multiple cloud providers.
- Drive continuous cost optimization for compute resources with a focus on efficiency and expenditure management.
- Design and implement flexible solutions to ensure adequate compute capacity and resource availability to meet fluctuating demands and diverse workload requirements.
- Build, maintain, and optimize orchestration functions by mapping workload requirements to cloud provider capabilities, implementing workers, and refining job queue and scaling systems (see the placement sketch after this list).
- Manage and maintain artifacts to establish a consistent baseline compute capability across supported cloud providers and regions.
- Partner with cross-functional teams (educators, platform engineers) to set standards for scalable, reliable learning experiences and lead technical responses during incidents.
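The orchestration responsibility above boils down to matching a workload's requirements against what each cloud provider can supply in a given region. The sketch below is illustrative only: the capability catalogue, instance names, and the `place` helper are assumptions for this example, not details from the posting.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical capability catalogue. Real data would come from provider APIs
# or an internal inventory service; these entries are placeholders.
CATALOGUE = [
    {"provider": "aws", "instance": "g5.xlarge", "gpu": "ampere",
     "regions": {"us-east-1", "eu-west-1"}},
    {"provider": "gcp", "instance": "g2-standard-8", "gpu": "ada",
     "regions": {"us-central1"}},
    {"provider": "azure", "instance": "Standard_NC24ads_A100_v4", "gpu": "ampere",
     "regions": {"eastus", "westeurope"}},
]

@dataclass
class Workload:
    """One training-lab request: required GPU architecture and preferred region."""
    gpu: str
    region: str

def place(workload: Workload) -> Optional[dict]:
    """Return the first catalogue entry that satisfies the workload, or None."""
    for entry in CATALOGUE:
        if entry["gpu"] == workload.gpu and workload.region in entry["regions"]:
            return entry
    return None

if __name__ == "__main__":
    choice = place(Workload(gpu="ampere", region="eu-west-1"))
    print(choice or "no match; fall back to another region or provider")
```

A production system would add cost-per-learner scoring and live capacity checks, but the shape of the placement decision is the same.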
Requirements
- Bachelor’s degree in Computer Science, a related technical field, or equivalent experience.
- 8+ years of DevOps experience optimizing, deploying, and running heterogeneous containerized applications (Docker, Kubernetes) across trust boundaries on AWS, Azure, and GCP.
- Hands-on experience with EKS, AKS, and GKE.
- Practical experience building scalable, reliable services and distributed system integration topologies.
- Hands-on experience maintaining AWS security groups, IAM roles, and role delegation.
- Proficiency in Python and Linux shell scripting for automation, application development, system administration, and problem resolution (a brief scripting sketch follows this list).
- Validated experience architecting, implementing, and managing cloud infrastructure using Terraform.
- Demonstrated ability as a meticulous problem-solver with strong analytical skills to rapidly diagnose and resolve complex technical challenges.
- Excellent communication, teamwork, and collaboration skills, with the ability to articulate technical concepts to diverse audiences and lead incident responses.
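As a concrete (and hypothetical) example of the Python automation and AWS security-group work listed above, a minimal boto3 check that flags groups exposing SSH to the internet might look like this; the function name and region default are assumptions:

```python
import boto3  # assumes AWS credentials are already configured in the environment

def open_ssh_groups(region: str = "us-east-1") -> list:
    """Return IDs of security groups that allow SSH (port 22) from 0.0.0.0/0."""
    ec2 = boto3.client("ec2", region_name=region)
    flagged = []
    for page in ec2.get_paginator("describe_security_groups").paginate():
        for group in page["SecurityGroups"]:
            for rule in group.get("IpPermissions", []):
                if rule.get("FromPort") == 22 and any(
                    r.get("CidrIp") == "0.0.0.0/0" for r in rule.get("IpRanges", [])
                ):
                    flagged.append(group["GroupId"])
                    break  # one offending rule is enough to flag the group
    return flagged

if __name__ == "__main__":
    print(open_ssh_groups())
```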
Ways to stand out
- Proven experience with event-driven architectures using pub/sub patterns (e.g., AWS SNS/SQS, Google Pub/Sub, Azure Service Bus); see the worker sketch after this list.
- Knowledge of generative AI architectures (LLMs, diffusion models) and concepts such as RAG and vector databases.
- Hands-on experience with the NVIDIA AI stack (NeMo, Triton Inference Server, TensorRT); production experience with NVIDIA NIM is a strong plus.
- Experience building and running CI/CD pipelines (Jenkins, GitLab CI) and applying SRE principles to automate, enhance reliability, and improve performance.
- Familiarity with Python-based Learning Management Systems (LMS) such as Open edX and experience with highly heterogeneous compute deployments.
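To illustrate the pub/sub point above: a queue-driven worker in Python with boto3 and AWS SQS could take roughly the following shape. The queue URL and message fields are placeholders, not details from the posting.

```python
import json
import boto3  # assumes AWS credentials and an existing SQS queue

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/lab-jobs"  # placeholder

def run_worker() -> None:
    """Long-poll the queue, handle each job, then delete the message."""
    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            print("provisioning lab environment for", job.get("learner_id"))
            # ... real provisioning work (spin up compute, register with the LMS) ...
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    run_worker()
```

The same pattern carries over to Google Pub/Sub or Azure Service Bus with their respective client libraries.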
Compensation & Benefits
- Base salary range (location- and level-dependent):
  - Level 4: 168,000 USD - 264,500 USD
  - Level 5: 200,000 USD - 322,000 USD
- Eligible for equity and benefits (link to NVIDIA benefits provided in original posting).
Other details
- Employment type: Full time
- Location: Santa Clara, CA, United States (hybrid).
- Applications accepted at least until September 12, 2025.
- NVIDIA is an equal opportunity employer and values diversity in hiring and promotion practices.