Required Skills & Competences
Software Development @ 8, Python @ 7, Hiring @ 4, Leadership @ 4, Communication @ 7, Git @ 4, Jira @ 4, Debugging @ 4, Technical Leadership @ 4, Project Management @ 4, GPU @ 4
Details
NVIDIA’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI, the next era of computing, with the GPU acting as the brain of computers, robots, and self-driving cars that can perceive and understand the world. Today, NVIDIA is increasingly known as “the AI computing company.”
The NVIDIA GH200 superchip provides the performance and productivity required for strong scaling of HPC and generative AI workloads. Scale-out is inherent to the design of this massive superchip. NVIDIA is looking for expert engineers to help design rack-level solutions for next-generation, scalable AI supercomputing platforms.
Responsibilities
- Drive server management for large clusters and data centers deploying GPUs and Grace solutions from NVIDIA.
- Work with data center architects and cloud customers to narrow down implementation requirements and keep product development moving quickly.
- Collaborate closely with the hardware team to define low-level management requirements and architecture for all data center products.
- Own and deliver firmware for low-level management components, and manage the team to deliver high-quality firmware.
- Work with internal teams to ensure that requirements are designed and implemented correctly in each firmware and software module.
- Collaborate with other leads to design and build data center health management workflows.
- Drive reliability and optimization in firmware architecture from a data center viewpoint.
- Work closely with cluster bring-up teams and resolve issues quickly.
- Own firmware delivered to data centers in terms of quality, reliability, and telemetry performance.
Requirements
- 10+ years of relevant experience working on server firmware (BMC) and platform software development.
- BS, MS, or PhD in Electrical Engineering, Computer Science or related fields, or equivalent experience.
- Hands-on experience with data center health management workflows.
- Proven record of delivering server firmware for large data centers.
- Strong knowledge of data center management, server architecture, and server manageability.
- 4+ years of proven experience managing teams of engineers.
- Strong and demonstrable skills in C/C++ and Python.
- Experience programming and debugging server platforms.
- Experience with SCM tools (e.g., Git, Perforce) and project management tools such as Jira.
- Excellent written and oral communication skills, a strong work ethic, strong teamwork orientation, and commitment to quality work.
- Self-starter with a passion for finding creative solutions to complicated problems and hands-on coding ability.
Ways to Stand Out
- Hands-on experience with data center health management and server manageability.
- Proven technical leadership of 25+ engineers on large, complex problems.
Benefits
- Competitive base salary ranging from $224,000 to $425,500 USD, depending on location and experience.
- Eligibility for equity and other benefits.
- Commitment to diversity and equal opportunity employment in hiring and promotion practices.