Principal Software Engineer - Rack Scale Systems Infrastructure
at Nvidia
USD 272,000-431,200 per year
Used Tools & Technologies
GPU
Required Skills & Competences
Each tag name is followed by the "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Security @ 4
Go @ 4
Kubernetes @ 4
Linux @ 4
Distributed Systems @ 8
Communication @ 7
Networking @ 4
Rust @ 4
Debugging @ 4
API @ 4
Observability @ 4
AI @ 4
InfiniBand @ 4
HPC @ 7
NVLink @ 4
Details
NVIDIA is recruiting a Principal Rack Scale Systems Infrastructure Engineer to build and guide the development of software systems that support rack-scale infrastructure products and services. This role sits at the intersection of software and hardware and covers control planes, state machines, orchestration systems, firmware, OS lifecycle, and networking fabrics to convert complex rack-scale hardware into dependable, manageable, and programmable infrastructure for NVIDIA, partners, and cloud and enterprise customers.
Responsibilities
- Define the complete software architecture for rack-scale infrastructure products and services, covering control plane services, infrastructure management, firmware, operating systems, kernel drivers, networking fabrics, accelerator software, and user-mode manageability software.
- Use Kubernetes and cloud-native primitives (controllers, operators, reconciliation loops, open source components) as an infrastructure fabric when appropriate; build components that operate safely at rack and fleet scale and produce open source-friendly libraries, services, controllers, operators, and integration APIs.
- Bridge hardware and software teams across firmware, BMC, BIOS, boot flows, OS images, drivers, networking, NVLink domains, InfiniBand, GPUs, DPUs, CPUs, and system management interfaces; translate infrastructure roadmaps into software requirements, architecture specifications, and execution plans.
- Partner with hyperscalers, CSPs, enterprise customers, internal component leads, vendors, and business partners to align capabilities with deployment and integration needs; establish reliability, security, validation, and shift-left strategies to reduce risk before hardware reaches production.
- Mentor senior engineers and technical leads and raise the engineering bar for large-scale networked systems, foundational software, and rack-scale control plane development.
- Make high-quality technical decisions in ambiguous environments, balancing customer needs, schedule, hardware realities, software maintainability, open source adoption, and long-term infrastructure evolution.
Requirements
- BS or MS in Computer Engineering, Computer Science, Electrical Engineering, or a related field, or equivalent experience; proven experience (15+ years) in systems architecture, system software, distributed systems, infrastructure control planes, or infrastructure engineering.
- Solid architectural knowledge of coordination frameworks, state machines, declarative APIs, reconciliation loops, lifecycle orchestration, failure handling, upgrade and rollback workflows, and distributed systems tradeoffs.
- Practical coding skills in Go, C++, or Rust, with the ability to write, review, and direct production-quality infrastructure software. Experience with Rust is highly valued.
- Experience with Kubernetes or similar orchestration systems, especially for managing infrastructure, hardware resources, or large-scale infrastructure services.
- Experience with Linux-based infrastructure software, OS rollout and image management, kernel or driver interactions, firmware lifecycle, and hardware bring-up workflows.
- Strong understanding of data center networking technologies and protocols such as Ethernet, InfiniBand, RDMA, and fabric-level manageability; experience with accelerator-based systems including GPUs, DPUs, FPGAs, custom silicon, or other HPC systems.
- Expertise in in-band and out-of-band management architectures, including BMCs, Redfish, IPMI, and related system management protocols; ability to work with security experts on tradeoffs for secure boot, attestation, access control, update safety, and serviceability.
- Experience crafting software intended for open source release, including API stability, modularity, documentation, community usability, and separation between shared software and deployment-specific integrations.
- Experience using AI-assisted development tools responsibly as an engineering multiplier for coding, test generation, debugging, build iteration, and documentation.
- Strong written and verbal communication skills and demonstrated ability to specify requirements, guide architecture, and manage delivery across engineering teams and organizations.
Ways To Stand Out
- Built software supporting multiple adoption models (internal services, CSP-integrated offerings, reusable libraries, customer-extensible APIs).
- Strong Rust skills in systems, infrastructure, or hardware-adjacent software.
- Hands-on experience with fleet-scale provisioning, updates, rollback, observability, health, and remediation.
- Led across the full data center product lifecycle (inception, pre- and post-silicon, manufacturing, deployment, operations).
- Deep experience with rack- or cluster-scale systems spanning compute, networking, storage, accelerators, firmware, and infra management as one operational domain.
Additional Information
- Base salary range: 272,000 USD - 431,250 USD (determined by location, experience, and pay of employees in similar positions).
- Eligible for equity and benefits.
- Applications accepted at least until May 19, 2026.
- NVIDIA uses AI tools in its recruiting processes and is an equal opportunity employer.