Used Tools & Technologies
Not specified
Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Go @ 6
Kubernetes @ 7
Terraform @ 7
Python @ 6
CI/CD @ 4
Machine Learning @ 4
MLOps @ 4
Hiring @ 4
Leadership @ 4
AWS @ 7
Communication @ 4
API @ 4
LLM @ 4
LLMOps @ 4
Observability @ 4
AI @ 4
Agentic AI @ 4
RAG @ 4
GenAI @ 4
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Details
Reddit is a community of communities built on shared interests, passion, and trust. Every day, Reddit users submit, vote, and comment on the topics they care most about. With 100,000+ active communities and approximately 121 million daily active unique visitors, Reddit is one of the internet’s largest sources of information.
Team
The Machine Learning Platform team at Reddit owns the infrastructure that powers recommendations, content discovery, and user/content quantification, impacting teams such as Growth, Ads, Feeds, and Core Machine Learning teams.
Responsibilities
- Define and lead the vision, strategy, and roadmap for Reddit’s large-scale GenAI Platform.
- Design, implement, and maintain the LLM Gateway, including unified API endpoints for internal and external LLMs, rate/token limit management, and intelligent failover mechanisms to boost uptime and reliability.
- Set the direction for core platform capabilities such as rate and token limit management, intelligent failover, and production resilience.
- Shape Reddit’s approach to an enterprise-grade RAG (retrieval-augmented generation) system.
- Establish strategic direction for agentic AI workflows and tool-use patterns across the platform.
- Own end-to-end platform strategy from concept through production adoption and long-term evolution.
- Drive MLOps and LLMOps standards across CI/CD, testing, versioning, evaluation, and lifecycle management.
- Define best practices for observability, monitoring, governance, and operational excellence across GenAI systems.
- Partner across engineering, product, and leadership to align platform investments with company priorities and user needs.
- Champion platform thinking with a strong focus on scalability, reliability, performance, and developer experience.
- Influence technical direction across teams by turning emerging AI capabilities into a scalable platform strategy.
Requirements
- 10+ years of experience in ML Engineering, AI Platform Engineering, or Cloud AI Deployment roles.
- Track record of leading technical strategy and delivering AI platforms in cloud-based production environments at scale.
- Strong execution skills: driving complex initiatives end-to-end and delivering high-quality platform outcomes.
- Deep experience operating Kubernetes and other orchestration systems in large-scale production environments.
- Deep experience with cloud-based technologies for ML platforms, including tools like AWS, Google Cloud Storage, and infrastructure-as-code such as Terraform.
- Proficiency with common ML programming languages/frameworks such as Go and Python.
- Excellent communication skills and ability to articulate technical AI concepts to non-technical stakeholders.
- Strong focus on scalability, reliability, performance, and developer experience; advocate for platform users and understand genAI product development lifecycle.
- Strong knowledge of model serving, inference pipelines, monitoring, and observability for AI systems is a plus.
Benefits
- Comprehensive healthcare benefits and income replacement programs
- 401(k) with employer match
- Global benefit programs for workspace, professional development, and caregiving support
- Family planning support and gender-affirming care
- Mental health & coaching benefits
- Flexible vacation & paid volunteer time off
- Generous paid parental leave
Pay Transparency
- Base salary range (U.S.-based): $292,500 - $409,500 USD
- This role is eligible to receive equity in the form of restricted stock units and may be eligible for a commission depending on the position offered.
Additional Information
- In select roles and locations, interviews may be recorded, transcribed, and summarized by AI; candidates may opt out prior to scheduled interviews. The company will collect certain categories of personal information during interviews to evaluate applications and will delete recordings promptly after making a hiring decision.