Senior Systems Software Engineer, Data Center Infrastructure Management - EngOps

at NVIDIA
USD 152,000-287,500 per year
SENIOR
✅ Remote

Used Tools & Technologies

Not specified

Required Skills & Competences

  • Software Development @ 4
  • Grafana @ 4
  • Kubernetes @ 4
  • Communication @ 4
  • Networking @ 3
  • OpenStack @ 3
  • Debugging @ 6
  • GPU @ 4
  • Observability @ 4
  • AI @ 4

Details

NVIDIA is leading the way in Artificial Intelligence, High-Performance Computing, and Visualization. The team develops and maintains software that facilitates GPU communication and datacenter management. This EngOps role (5+ years of experience) focuses on maintaining high-performance, rack-scale management solutions for datacenter environments, and on supporting the deployment and debugging of hardware and the Infrastructure Manager.

Responsibilities

  • Take ownership of daily cluster failures and issues, troubleshooting promptly to maintain cluster availability and performance.
  • Manage updates to site controller management nodes.
  • Manage rollout and rollback of cluster software and firmware updates, ensuring smooth transitions and minimal disruption.
  • Work directly with Infrastructure Service software development teams to support deployment and debug hardware and management software.

Requirements

  • BS or MS in Computer Science, Computer Engineering, Electrical Engineering, or a related field, or equivalent experience.
  • 5+ years of hands-on experience deploying and administering clusters, servers, switches, and related infrastructure.
  • Experience deploying and configuring operating systems, computer networks, and high-performance applications.
  • Proven ability to work effectively with developers and test engineers across teams and time zones.
  • Experience deploying services in Kubernetes.
  • Datacenter or computer architecture experience: understanding of server, rack, and network topologies and hardware/firmware/software interactions.
  • Background with hardware management protocols and controllers (Redfish, IPMI, BMC) and firmware update automation.
  • Experience configuring and debugging complex datacenter networks.
  • Experience developing scripts to automate recovery actions for management controllers and datacenter systems.

Ways to Stand Out

  • Direct experience with industry-standard alerting tools and emergency response practices; experience with observability tools such as Grafana.
  • Hands-on experience with GPU-focused hardware and software (e.g., DGX systems, compute clusters).
  • Proficiency in designing large-scale networks, and familiarity with OpenStack and Foreman.

Compensation & Benefits

  • Base salary ranges (location, level, and experience dependent):
    • Level 3: 152,000 USD - 241,500 USD
    • Level 4: 184,000 USD - 287,500 USD
  • Eligible for equity and benefits (link to NVIDIA benefits referenced in posting).

Additional Information

  • Applications accepted at least until April 3, 2026. This posting is for an existing vacancy.
  • NVIDIA uses AI tools in its recruiting processes.
  • NVIDIA is an equal opportunity employer committed to diversity and non-discrimination.