Senior Software Engineer, Bare Metal Automation - DGX Cloud
Required Skills & Competences
Go @ 4, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, API @ 4, GPU @ 4
Details
NVIDIA is hiring experienced software engineers with bare metal hardware experience to help scale up its AI infrastructure. The role focuses on automating bare metal hardware management for large GPU clusters, including managing software and firmware versions, writing repeatable tests and benchmarks, and programming against APIs exposed by baseboard management controllers. The team builds production systems for DGX Cloud to enable large, scalable GPU clusters for a variety of AI workloads.
Responsibilities
- Build and maintain production systems that manage fleets of GPU nodes and enable large-scale GPU clusters for AI workloads.
- Implement monitoring and health management capabilities to achieve high reliability, availability, and scalability of GPU assets.
- Harness and process multiple data streams, including GPU hardware diagnostics and cluster/network telemetry.
- Work cross-functionally with teams across NVIDIA to ensure production AI clusters run reliably and with maximum performance.
- Evaluate system failures and improve services based on a defined incident management process.
- Develop repeatable and efficient tests and benchmarks for bare metal systems and firmware/software version management.
Requirements
- Significant software engineering experience (5+ years) in highly technical organizations with demonstrable impact.
- Direct experience automating and managing bare metal hardware and bare metal hardware APIs/frameworks, preferably on GPU servers.
- Technical knowledge of systems programming languages (Go, Python) and a solid understanding of data structures and algorithms.
- Experience working on large-scale production systems and distributed systems.
- Strong communication skills and ability to work with cross-functional teams, principals, and architects across geographies.
- BS in Computer Science, Engineering, Physics, Mathematics, or a comparable field, or equivalent experience.
Ways to stand out
- Advanced hands-on experience and deep understanding of managing large fleets of bare metal hardware independent of cloud providers.
- Proven operational excellence in maintaining reliable and performant AI infrastructure.
Compensation & Benefits
Your base salary will be determined based on your location, your experience, and the pay of employees in similar positions. The base salary ranges are:
- Level 3: 148,000 USD - 235,750 USD
- Level 4: 184,000 USD - 287,500 USD
You will also be eligible for equity and benefits. NVIDIA is an equal opportunity employer and values diversity in its workforce.