System Software Engineer, Distributed Systems

at NVIDIA
USD 152,000-287,500 per year
MIDDLE
✅ On-site

Used Tools & Technologies

Not specified

Required Skills & Competences

Go @ 3, Kubernetes @ 3, Linux @ 3, Python @ 5, R @ 3, Distributed Systems @ 3, Perl @ 3, Debugging @ 6, LLM @ 6, Observability @ 3, AI @ 3

Details

NVIDIA's VLSI Productivity and Infrastructure team builds tools and platforms that support 1000+ chip design engineers. The team focuses on long shelf-life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization. The core workflow infrastructure runs as userspace software on bare-metal Linux hosts (no sudo, no containers), coordinates shared state and artifacts via NFS, and launches long-running, compute-heavy workflows on IBM LSF. This role is a pragmatic, generalist systems engineering position with emphasis on distributed systems and operational excellence in a "below containers" world: coordination, reliability, performance, and safe evolution of legacy systems (including incremental modernization into Go).

Responsibilities

  • Design, build, and deliver core components of next-generation productivity platforms
  • Develop reliable userspace infrastructure for long-running engineering workflows at scale on bare-metal Linux hosts
  • Build state coordination over NFS (atomicity, idempotency/dedup, partial-write recovery, without privileged ops)
  • Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/backpressure)
  • Convert legacy codebases into modern code (e.g., incremental migration from Perl to Go) with stage gates and parity strategies
  • Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
  • Collaborate with engineering users to turn ambiguous workflows into durable production systems

Requirements

  • B.S. in Computer Science/Electrical Engineering or equivalent experience
  • 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
  • Strong Linux fundamentals: processes, filesystems, permissions, synchronization/locks, concurrency, and debugging
  • Solid distributed-systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
  • Experience building long-runtime automation or services on shared compute clusters (batch schedulers, build systems)
  • Ability to translate high-level goals into safe delivery plans (instrumentation, staged rollout, measurable outcomes)

Ways to stand out

  • Hands-on experience with shared filesystems at scale (NFS) or coordination patterns on eventually-consistent storage
  • Experience with batch job scheduling, shared compute fleets, or build systems
  • Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
  • Experience partitioning/optimizing metadata-heavy systems and reducing I/O or R/W hot spots
  • Strong incident/debug tactics: root-cause analysis, remediation, guardrails, and rapid comprehension/ownership of unfamiliar codebases (including LLM-generated code)

Compensation & Benefits

  • Base salary ranges (determined by location, experience, and comparable roles):
    • Level 3: 152,000 USD - 241,500 USD
    • Level 4: 184,000 USD - 287,500 USD
  • Eligible for equity and benefits (link provided in original posting)

Other details

  • Applications for this job will be accepted at least until February 19, 2026.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer committed to diversity and non-discrimination.