ML and Agentic Systems Engineer

at NVIDIA
USD 224,000-431,250 per year
MIDDLE SENIOR
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Software Development @ 7
  • Python @ 3
  • Machine Learning @ 3
  • Debugging @ 3
  • Experimentation @ 3
  • LLM @ 6
  • PyTorch @ 3
  • AI @ 3

Details

At NVIDIA, the Cosmos team is building agentic systems that can reason about, build, evaluate, and improve AI systems themselves. This role focuses on creating the meta-layer of modern ML: agents, tooling, pipelines, and feedback loops that make model development faster, smarter, and increasingly automated. You will build systems in which agents can work with code, data, experiments, and evaluations to accelerate ML development and iteration.

Responsibilities

  • Design and implement agentic workflows across the ML lifecycle, including data generation and curation, evaluation, debugging, training orchestration, and iteration.
  • Build AI-native systems where models and agents interact with codebases, tools, experiments, and environments to improve developer and researcher productivity.
  • Create self-improving loops where agents help generate data, surface failures, evaluate outputs, and drive better decisions across the system.
  • Own and evolve large-scale Python and PyTorch codebases, turning fast-moving ideas into robust, modular, reusable software.
  • Design and scale evaluation platforms combining automated metrics, human feedback, and agent-driven analysis.
  • Build and maintain multimodal ML pipelines spanning data processing, experimentation, benchmarking, and deployment.
  • Integrate open-source and internal components into unified systems that enable rapid experimentation and reliable iteration.
  • Raise the bar on engineering excellence through testing, reproducibility, packaging, code health, and maintainability.

Requirements

  • Significant experience building machine learning systems and software platforms (not just models).
  • Expert-level Python skills, with strong judgment around modularity, abstraction boundaries, and long-term code health.
  • Deep familiarity with PyTorch, including the ability to debug, adapt, and extend model behavior within larger software systems.
  • Experience building pipelines, evaluation systems, developer tooling, or workflow automation for ML at meaningful scale.
  • Strong software engineering fundamentals: system design, testing, packaging, debugging, and collaborative codebase evolution.
  • Strong experience with agentic LLM-based systems (tool use, planning, multi-step workflows, code agents, or automation over data and experiments).
  • Comfort operating in fast-moving environments where ambiguous ideas must be turned into useful systems quickly.
  • BS, MS, or equivalent experience in Computer Science, Engineering, or a related field.
  • 12+ years of relevant software development experience.

Ways to stand out

  • Built agent-based systems that perform coding, evaluation, data generation, triage, experimentation, or orchestration.
  • Contributions to impactful open-source ML, Python, or developer tooling projects.
  • Background in context compression and agent memory techniques.
  • Familiarity with agent safety and agent identity (AuthN, AuthZ, IAM).
  • High bar for software craftsmanship while applying it effectively in research-adjacent environments.

Compensation & Benefits

  • Base salary range: 224,000 USD - 356,500 USD for Level 5; 272,000 USD - 431,250 USD for Level 6.
  • Eligible for equity and company benefits (link to NVIDIA benefits provided in the posting).

Additional information

  • Location: US, CA, Santa Clara.
  • Employment type: Full time.
  • Applications accepted at least until March 24, 2026. This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.