Used Tools & Technologies
Machine Learning
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2: basic awareness. Minimal hands-on experience and a rudimentary understanding of the technology's purpose;
- 3-6: daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9: expert level. You can teach others and you know all the pitfalls and tricks;
- 10: exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Software Development @ 6
Go @ 7
Linux @ 6
Python @ 7
Communication @ 4
Networking @ 4
Performance Optimization @ 6
Cloud Computing @ 4
GPU @ 4
AI @ 4
InfiniBand @ 4
NCCL @ 4
HPC @ 4
Details
Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside experienced and innovative leaders and engineers.
We are looking for a Lead Systems HPC Engineer to play a key role in building our hyperscaler platform, working across its core components while analyzing and optimizing the performance of large-scale GPU clusters at the intersection of hardware and software. You will operate across the full stack — from hardware and system software to networking (InfiniBand/RoCE), virtualization (KVM/QEMU), and distributed communication layers (e.g., MPI, NCCL).
Responsibilities
- Understand system behavior across multiple layers, identify performance bottlenecks, and drive improvements that shape how clusters are built, operated, tuned, and validated.
- Investigate and troubleshoot performance issues of GPU clusters under real workloads (training and inference).
- Evaluate and integrate new hardware, system configurations, and tuning approaches across the software stack.
- Support complex performance-related escalations from internal teams and customers.
- Work closely with infrastructure, software engineering and hardware vendor teams (e.g., NVIDIA, Mellanox, Intel).
- Contribute to hardware and cluster qualification (acceptance), ensuring systems meet performance expectations.
Requirements
- 5+ years of professional experience in system-level software development focused on performance optimization and low-level programming.
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
- Experience with networking technologies such as InfiniBand and RoCE, virtualization (KVM/QEMU), and distributed communication layers (MPI, NCCL).
We conduct coding interviews as part of the process.
Benefits
- Health insurance: 100% company-paid medical, dental and vision coverage for employees and their families.
- 401(k) plan: Up to 4% company match with immediate vesting.
- Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
- Remote work reimbursement: Up to $85/month for mobile and internet.
- Disability & life insurance: Company-paid short-term, long-term and life insurance coverage.
Compensation
We offer competitive salaries ranging from $170k to $300k OTE, plus equity, based on your experience.
What we offer
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.