Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps
at NVIDIA
USD 152,000-287,500 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Software Development @ 4
Grafana @ 4
Kubernetes @ 4
Communication @ 4
Networking @ 3
OpenStack @ 3
Debugging @ 6
GPU @ 4
Observability @ 4
AI @ 4
Details
NVIDIA is leading the way in Artificial Intelligence, High-Performance Computing and Visualization. The team develops and maintains software facilitating GPU communication and datacenter management. This EngOps role (5+ years of experience) focuses on maintaining high-performance, rack-scale management solutions for datacenter environments and supporting deployment and debugging of hardware and the Infrastructure Manager.
Responsibilities
- Take ownership of daily cluster failures and issues, troubleshooting promptly to maintain cluster availability and performance.
- Manage updates to site controller management nodes.
- Manage rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruption.
- Work directly with Infrastructure Service software development teams to support deployment and debug hardware and management software.
Requirements
- BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
- 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
- Experience deploying and configuring operating systems, computer networks, and high-performance applications.
- Proven ability to work effectively with developers and test engineers across teams and time zones.
- Experience deploying services in Kubernetes.
- Datacenter or computer architecture experience—understanding of server, rack, and network topologies and hardware/firmware/software interactions.
- Background with hardware management interfaces (Redfish, IPMI), baseboard management controllers (BMCs), and firmware update automation.
- Experience configuring and debugging complex datacenter networks.
- Experience developing scripts to automate recovery actions for management controllers and datacenter systems.
Ways to Stand Out
- Direct experience with industry-standard alerting tools and emergency response practices; experience with observability tools such as Grafana.
- Hands-on experience with GPU-focused hardware and software (e.g., DGX systems, compute clusters).
- Proficiency in designing large-scale networking technologies and familiarity with OpenStack and Foreman.
Compensation & Benefits
- Base salary ranges (location, level, and experience dependent):
  - Level 3: 152,000 USD - 241,500 USD
  - Level 4: 184,000 USD - 287,500 USD
- Eligible for equity and benefits (link to NVIDIA benefits referenced in posting).
Additional Information
- Applications accepted at least until April 3, 2026. This posting is for an existing vacancy.
- NVIDIA uses AI tools in its recruiting processes.
- NVIDIA is an equal opportunity employer committed to diversity and non-discrimination.