Inference Engineering Manager
USD 300,000-385,000 per year
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Kubernetes @ 3
Python @ 3
Machine Learning @ 3
TensorFlow @ 6
Leadership @ 5
Communication @ 6
Rust @ 3
API @ 3
Technical Leadership @ 5
LLM @ 6
PyTorch @ 3
CUDA @ 3
GPU @ 3
Details
We are looking for an Inference Engineering Manager to lead our AI Inference team. This is a unique opportunity to build and scale the infrastructure that powers Perplexity's products and APIs, serving millions of users with state-of-the-art AI capabilities.
You will own the technical direction and execution of our inference systems while building and leading a world-class team of inference engineers. Our current stack includes Python, PyTorch, Rust, C++, and Kubernetes. You will help architect and scale the large-scale deployment of machine learning models behind Perplexity's Comet, Sonar, Search, and Deep Research products.
Why Perplexity?
- Build SOTA systems that are the fastest in the industry with cutting-edge technology
- High-impact work on a smaller team with significant ownership and autonomy
- Opportunity to build 0-to-1 infrastructure from scratch rather than maintaining legacy systems
- Work on the full spectrum: reducing cost, scaling traffic, and pushing the boundaries of inference
- Direct influence on technical roadmap and team culture at a rapidly growing company
Responsibilities
- Lead and grow a high-performing team of AI inference engineers
- Develop APIs for AI inference used by both internal and external customers
- Architect and scale our inference infrastructure for reliability and efficiency
- Benchmark and eliminate bottlenecks throughout our inference stack
- Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
- Push the frontier by building inference systems that support sparse attention, disaggregated prefill/decode serving, and similar techniques
- Improve the reliability and observability of our systems and lead incident response
- Own technical decisions around batching, throughput, latency, and GPU utilization
- Partner with ML research teams on model optimization and deployment
- Recruit, mentor, and develop engineering talent
- Establish team processes, engineering standards, and operational excellence
Qualifications
- 5+ years of engineering experience with 2+ years in a technical leadership or management role
- Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
- Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
- Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
- Familiarity with GPU characteristics, roofline models, and performance analysis
- Experience deploying reliable, distributed, real-time systems at scale
- Track record of building and leading high-performing engineering teams
- Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
- Strong technical communication and cross-functional collaboration skills
Nice to Have
- Experience with CUDA, Triton, or custom kernel development
- Background in training infrastructure and RL workloads
- Experience with Kubernetes and container orchestration at scale
- Published work or contributions to inference optimization research
Benefits
- Full-time U.S. employees: equity, health, dental, vision, retirement, fitness, commuter and dependent care accounts, and more
- Full-time employees outside the U.S.: benefits tailored to region of residence
- USD salary ranges apply only to U.S.-based positions; international salaries are set based on the local market