System Software Engineer, Distributed Systems
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 – basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 – daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 – you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 – exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Go @ 3
Kubernetes @ 3
Linux @ 3
Python @ 5
R @ 3
Distributed Systems @ 3
Perl @ 3
Debugging @ 6
LLM @ 6
Observability @ 3
AI @ 3
Details
NVIDIA's VLSI Productivity and Infrastructure team builds tools and platforms that support 1000+ chip design engineers. The team focuses on long shelf-life systems spanning build automation, observability, analytics, automated error detection/remediation, and codebase modernization. The core workflow infrastructure runs as userspace software on bare-metal Linux hosts (no sudo, no containers), coordinates shared state and artifacts via NFS, and launches long-running, compute-heavy workflows on IBM LSF. This role is a pragmatic, generalist systems engineering position with emphasis on distributed systems and operational excellence in a "below containers" world: coordination, reliability, performance, and safe evolution of legacy systems (including incremental modernization into Go).
Responsibilities
- Design, build, and deliver core components of next-generation productivity platforms
- Develop reliable userspace infrastructure for long-running engineering workflows at scale on bare-metal Linux hosts
- Build state coordination over NFS (atomicity, idempotency/dedup, partial-write recovery, without privileged ops)
- Build and improve orchestration around IBM LSF (submission/tracking, retries/cancel, log capture, fairness/backpressure)
- Convert legacy codebases into modern code (e.g., incremental migration from Perl to Go) with stage gates and parity strategies
- Debug and improve performance and reliability across Linux and Kubernetes, including operational tooling
- Collaborate with engineering users to turn ambiguous workflows into durable production systems
Requirements
- B.S. in Computer Science/Electrical Engineering or equivalent experience
- 5+ years developing and operating production software in Go and/or Python, ideally in large codebases
- Strong Linux fundamentals: processes, filesystems, permissions, synchronization/locks, concurrency, and debugging
- Solid distributed-systems thinking: failures, retries/timeouts, backoff, idempotency, and operational rigor
- Experience building long-runtime automation or services on shared compute clusters (batch schedulers, build systems)
- Ability to translate high-level goals into safe delivery plans (instrumentation, staged rollout, measurable outcomes)
Ways to stand out
- Hands-on experience with shared filesystems at scale (NFS) or coordination patterns on eventually-consistent storage
- Experience with batch job scheduling, shared compute fleets, or build systems
- Track record of incremental modernization (tests, shadow runs, canaries, rollback plans)
- Experience partitioning/optimizing metadata-heavy systems and reducing I/O or R/W hot spots
- Strong incident/debug tactics: root-cause analysis, remediation, guardrails, and rapid comprehension/ownership of unfamiliar codebases (including LLM-generated code)
Compensation & Benefits
- Base salary ranges (determined by location, experience, and comparable roles):
- Level 3: 152,000 USD - 241,500 USD
- Level 4: 184,000 USD - 287,500 USD
- Eligible for equity and benefits (link provided in original posting)
Other details
- Applications for this job will be accepted at least until February 19, 2026.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to diversity and non-discrimination.