Lead Systems HPC Engineer

at Nebius
USD 170,000-300,000 per year
SENIOR
✅ Remote

Used Tools & Technologies

Machine Learning

Required Skills & Competences

Software Development @ 6, Go @ 7, Linux @ 6, Python @ 7, Communication @ 4, Networking @ 4, Performance Optimization @ 6, Cloud Computing @ 4, GPU @ 4, AI @ 4, InfiniBand @ 4, NCCL @ 4, HPC @ 4

Details

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside experienced and innovative leaders and engineers.

We are looking for a Lead Systems HPC Engineer to play a key role in building our hyperscaler platform. You will work across its core components, analyzing and optimizing the performance of large-scale GPU clusters at the intersection of hardware and software. The role spans the full stack, from hardware and system software to networking (InfiniBand/RoCE), virtualization (KVM/QEMU), and distributed communication layers (e.g., MPI, NCCL).

Responsibilities

  • Understand system behavior across multiple layers, identify performance bottlenecks, and drive improvements that shape how clusters are built, operated, tuned, and validated.
  • Investigate and troubleshoot performance issues in GPU clusters under real workloads (training and inference).
  • Evaluate new hardware, system configurations, and tuning approaches, and integrate them across the software stack.
  • Support complex performance-related escalations from internal teams and customers.
  • Work closely with infrastructure, software engineering and hardware vendor teams (e.g., NVIDIA, Mellanox, Intel).
  • Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations.

Requirements

  • 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
  • Experience with networking technologies such as InfiniBand and RoCE, virtualization (KVM/QEMU), and distributed communication layers (MPI, NCCL).

We conduct coding interviews as part of the process.

Benefits

  • Health insurance: 100% company-paid medical, dental and vision coverage for employees and families.
  • 401(k) plan: Up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: Up to $85/month for mobile and internet.
  • Disability & life insurance: Company-paid short-term, long-term and life insurance coverage.

Compensation

We offer competitive salaries ranging from $170k to $300k OTE, plus equity, based on your experience.

What we offer

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.