Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps

at NVIDIA

📍 United States

USD 152,000-287,500 per year

SENIOR

✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value. About proficiency levels:

1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;

3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;

7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;

10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.

Software Development @ 4, Grafana @ 4, Kubernetes @ 4, Communication @ 4, Networking @ 3, OpenStack @ 3, Debugging @ 6, GPU @ 4, Observability @ 4, AI @ 4

Details

NVIDIA is leading the way in Artificial Intelligence, High-Performance Computing and Visualization. The team develops and maintains software facilitating GPU communication and datacenter management. This EngOps role (5+ years of experience) focuses on maintaining high-performance, rack-scale management solutions for datacenter environments and on supporting the deployment and debugging of hardware and the Infrastructure Manager.

Responsibilities

Take ownership of daily cluster failures and issues, troubleshooting promptly to maintain cluster availability and performance.
Manage updates to site controller management nodes.
Manage rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruption.
Work directly with Infrastructure Service software development teams to support deployment and debug hardware and management software.

Requirements

BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
Experience deploying and configuring operating systems, computer networks, and high-performance applications.
Proven ability to work effectively with developers and test engineers across teams and time zones.
Experience deploying services in Kubernetes.
Datacenter or computer architecture experience, including an understanding of server, rack, and network topologies and of hardware/firmware/software interactions.
Background with hardware management interfaces (Redfish, IPMI), baseboard management controllers (BMCs), and firmware update automation.
Experience configuring and debugging complex datacenter networks.
Experience developing scripts to automate recovery actions for management controllers and datacenter systems.

Ways to Stand Out

Direct experience with industry standard alerting tools and emergency response practices; experience with observability tools such as Grafana.
Hands-on experience with GPU-focused hardware and software (e.g., DGX systems, compute clusters).
Proficiency in designing large-scale networks and familiarity with OpenStack and Foreman.

Compensation & Benefits

Base salary ranges (location, level, and experience dependent):
- Level 3: 152,000 USD - 241,500 USD
- Level 4: 184,000 USD - 287,500 USD
Eligible for equity and benefits (link to NVIDIA benefits referenced in posting).

Additional Information

Applications accepted at least until April 3, 2026. This posting is for an existing vacancy.
NVIDIA uses AI tools in its recruiting processes.
NVIDIA is an equal opportunity employer committed to diversity and non-discrimination.