Required Skills & Competences
Go @ 4, Grafana @ 3, Kubernetes @ 4, Prometheus @ 3, DevOps @ 4, Python @ 4, CI/CD @ 3, Azure @ 4, Helm @ 4, Networking @ 4, Microservices @ 4, Debugging @ 4, GPU @ 4
Details
We’re building the infrastructure that powers GR00T, NVIDIA’s general-purpose humanoid robotics platform. This is not a typical DevOps job. You’ll help engineer the cloud-native backend that drives simulation, synthetic data generation, multi-stage model training, and robotic deployment—all at massive scale. Our orchestration system, NVIDIA OSMO, is built to handle real-time robotics workflows in cloud environments across thousands of GPUs. We’re looking for a pragmatic Kubernetes-native backend and infrastructure engineer who excels in solving complex orchestration problems in distributed AI/ML systems.
Responsibilities
- Architect, develop, and deploy backend services supporting NVIDIA GR00T using Kubernetes and cloud-native technologies.
- Collaborate with ML, simulation, and robotics engineers to deploy scalable, reproducible, and observable multi-node training and inference workflows.
- Extend and maintain OSMO’s orchestration layers to support heterogeneous compute backends and robotic data pipelines.
- Develop Helm charts, controllers, CRDs, and service mesh integrations to support secure and fault-tolerant system operation.
- Implement microservices written in Go or Python that power GR00T task execution, metadata tracking, and artifact delivery.
- Optimize job scheduling, storage access, and networking across hybrid and multi-cloud Kubernetes environments (e.g., OCI, Azure, on-prem).
- Build tooling that simplifies deployment, debugging, and scaling of robotics workloads.
Requirements
- BS, MS, or PhD degree in Computer Science, Electrical Engineering, Computer Engineering, or related field (or equivalent experience).
- 5+ years of work experience in DevOps, backend, or cloud infrastructure engineering.
- Hands-on experience building and deploying microservices in Kubernetes-native environments.
- Proficiency in Go or Python, especially for backend systems and operators.
- Experience with Helm or other Kubernetes templating and configuration management tools.
- Familiarity with GitOps workflows, observability stacks (e.g., Prometheus, Grafana), and container CI/CD pipelines.
- Strong understanding of container networking, storage (e.g., PVCs, ephemeral volumes), and scheduling.
Ways to stand out from the crowd
- Experience with ML training workflows and distributed job orchestration (e.g., MPI, Ray, Triton Inference Server).
- Knowledge of robotics frameworks (e.g., ROS2) or simulation tools (e.g., Isaac Sim, Omniverse).
- Background in GPU cluster management and scheduling across cloud providers.
- Contributions to open-source Kubernetes projects or custom operators/controllers.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you are creative and autonomous, we want to hear from you!