Senior Software Engineer, AI Frameworks
at NVIDIA
Santa Clara, United States
USD 152,000-287,500 per year
Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 - basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 - daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 - you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 - exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Docker @ 3
Go @ 7
Kubernetes @ 4
Python @ 7
CI/CD @ 3
Distributed Systems @ 4
Communication @ 4
Helm @ 4
Networking @ 4
Debugging @ 4
API @ 4
LLM @ 4
PyTorch @ 4
CUDA @ 4
GPU @ 4
Observability @ 4
AI @ 4
Profiling @ 4
Details
We are seeking a Senior Software Engineer to drive integration of the NVIDIA Grove project within Dynamo and across a set of leading open-source AI frameworks. In this role, you will build production-grade software that enables seamless adoption, scaling, and operation of Grove capabilities across environments such as Dynamo, llm-d, Ray, PyTorch, and other emerging frameworks in the AI ecosystem. You will collaborate across engineering teams and the open-source community to deliver robust integrations, reference implementations, and developer-focused tooling.
Responsibilities
- Design and implement end-to-end integrations of Grove with open-source AI frameworks (e.g., Dynamo, llm-d, Ray, PyTorch, and related ecosystem projects).
- Build and maintain adapters, plugins, operators, and/or runtime components that enable Grove features to work smoothly across training and inference stacks.
- Partner with framework owners to upstream changes, contribute patches, and ensure long-term maintainability of integrations.
- Develop reference workflows, sample apps, and best-practice guides that accelerate adoption by users and partners.
- Optimize performance, scalability, and reliability for distributed training/inference, including multi-node and multi-GPU environments.
- Improve observability and operational readiness (metrics, logging, tracing, debugging tools) for Kubernetes-based deployments.
- Participate in technical design reviews, define APIs/contracts, and ensure compatibility across versions of frameworks and dependencies.
- Diagnose complex issues spanning containers, networking, scheduling, CUDA/GPU utilization, and framework runtime behavior.
Requirements
- BS/MS/PhD in Computer Science, Electrical Engineering, or related field (or equivalent experience).
- 5+ years of proven experience in a related field.
- Hands-on experience integrating with at least one major AI framework/runtime (e.g., PyTorch, Ray, Triton Inference Server ecosystem, distributed runtimes, model serving stacks).
- Solid understanding of AI workloads: model development basics, training vs. inference tradeoffs, and performance considerations (throughput/latency, batching, memory).
- Experience with distributed systems concepts (RPC, scheduling, fault tolerance, resource management).
- Practical Kubernetes experience: deploying and operating services/jobs, Helm/Kustomize, operators/controllers (nice to have), and debugging clusters.
- Familiarity with containers and cloud-native tooling (Docker, container registries, CI/CD pipelines).
- Strong software engineering experience in Go, C++ and/or Python, with a track record of shipping reliable systems.
- Strong interpersonal skills and ability to collaborate across teams and with open-source communities.
- Exceptional collaboration, communication, and documentation habits.
Ways to stand out
- Open-source contributions to Dynamo, PyTorch, Ray, llm-d, Kubernetes ecosystem, or related ML infrastructure projects.
- Experience with large-scale model serving, distributed inference, or multi-tenant AI platforms.
- Experience building SDKs/APIs or developer tooling that improves integration usability.
- Knowledge of GPU performance profiling and optimization (Nsight tools or similar), and/or kernel-level performance tuning.
- Experience with reproducibility, packaging, versioning, and compatibility testing across fast-moving dependencies.
Compensation & Other
- Base salary ranges (location and level dependent):
  - Level 3: 152,000 USD - 241,500 USD
  - Level 4: 184,000 USD - 287,500 USD
- You will also be eligible for equity and benefits.
- Applications for this job will be accepted at least until April 3, 2026.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer.