Senior Software Engineer, Bare Metal Automation - DGX Cloud
at NVIDIA
USD 148,000-287,500 per year
Required Skills & Competences
Software Development @ 4, Go @ 4, Python @ 4, Algorithms @ 4, Data Structures @ 4, Distributed Systems @ 4, Hiring @ 4, Communication @ 7, Mathematics @ 4, API @ 4, GPU @ 4
Details
NVIDIA is hiring experienced software engineers with bare metal hardware experience to help scale up its AI infrastructure. The role calls for significant software engineering experience in automating bare metal hardware, including managing software and firmware versions, writing repeatable and efficient tests and benchmarks, and programming against the libraries and APIs exposed by baseboard management controllers. Candidates are expected to be creative, passionate about GPU hardware, and comfortable in challenging, evolving work environments.
Responsibilities
- Be part of the DGX Cloud team responsible for production systems that enable large, scalable GPU clusters for AI workloads.
- Work on custom software related to managing fleets of GPU nodes.
- Implement monitoring and health management capabilities to achieve industry-leading reliability, availability, and scalability of GPU assets.
- Harness multiple data streams, including GPU hardware diagnostics and cluster/network telemetry.
- Collaborate with cross-functional teams to ensure reliable, consistent, and high-performance AI cluster operations.
- Evaluate system failures and improve services based on an incident management process.
Requirements
- Proven software engineering experience in a highly technical environment, with demonstrable impact from your work.
- Software development experience with bare metal hardware APIs and frameworks, preferably on GPU servers.
- Strong communication skills and ability to work effectively with multi-functional teams, principals, and architects across organizations and geographies.
- 5+ years in a similar role with experience on large-scale production systems.
- Knowledge of common software engineering principles, tools, and techniques.
- Bachelor’s degree in Computer Science, Engineering, Physics, Mathematics, or equivalent experience.
- Technical knowledge of systems programming languages such as Go and Python, with a solid understanding of data structures and algorithms.
Ways to Stand Out
- Technical competency in managing and automating large-scale distributed systems independent of cloud providers.
- Advanced hands-on experience and deep understanding of managing large fleets of bare metal hardware.
- Proven operational excellence in maintaining reliable and performant AI infrastructure.
Benefits
- Eligibility for equity and benefits.
- NVIDIA is an equal opportunity employer committed to diversity.
The base salary range is 148,000 USD - 287,500 USD, determined by location, experience, and pay of employees in similar positions.