Used Tools & Technologies
Not specified
Required Skills & Competences
Tag name is followed by "@" symbol and proficiency level value.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Kubernetes @ 7
Terraform @ 7
Python @ 6
Spark @ 4
GCP @ 7
Data Structures @ 4
Machine Learning @ 4
MLOps @ 4
TensorFlow @ 6
Hiring @ 4
Apache Beam @ 4
Communication @ 7
MLFlow @ 4
PyTorch @ 6
GPU @ 4
AI @ 4
Profiling @ 4
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such level.
Details
Reddit's Machine Learning Platform team owns infrastructure that powers recommendations, content discovery, user and content quantification, impacting teams such as Growth, Ads, Feeds, and Core ML teams. This role leads development of a platform for large-scale ML models at Reddit and focuses on designing model lifecycle patterns, building scalable graph ML infrastructure, and optimizing distributed training performance.
Responsibilities
- Design end-to-end model lifecycle patterns (MLOps) to boost velocity for ML engineers, including data preparation, model management, experiment tracking, and related workflows.
- Zero-to-one development and support of a graph ML codebase and platform that abstracts common patterns and enables model scalability and iteration.
- Collaborate with ML engineers on performance tuning, improving model training time, efficiency, and GPU training costs in a large distributed training environment.
- Optimize batch data processing within a data warehouse and with tools such as Apache Beam, Apache Spark, Ray Data, and similar frameworks.
- Architect pipelines to build and maintain massive graph data structures on the order of billions of nodes and tens of billions of edges.
Requirements
- 8+ years of experience in ML infrastructure, including model training and model deployments.
- Hands-on experience with ML optimization, including memory and GPU profiling.
- Deep experience with cloud-based technologies for supporting an ML platform, including tools like GCP BigQuery, Google Cloud Storage, and infrastructure-as-code (Terraform).
- Hands-on experience administering and integrating MLOps tools for experiment tracking, model serving, and model registries (e.g., MLflow or Weights & Biases).
- Proficiency with common ML programming languages and frameworks such as Python, PyTorch, and TensorFlow.
- Deep experience working with distributed training frameworks, including Ray and Kubernetes.
- Strong focus on scalability, reliability, performance, and ease of use; advocate for platform users with deep intuition for the ML development lifecycle.
- Strong organizational and communication skills.
- Experience working with graph databases (Neo4j, JanusGraph, TigerGraph) is a big plus.
- Experience working with graph neural networks (GNNs) and associated graph ML frameworks (PyTorch Geometric, Deep Graph Library) is a big plus.
Pay Transparency
- Base salary range (US): $230,000 - $322,000 USD.
- Role may be eligible for equity in the form of restricted stock units, and depending on the position offered, may also be eligible to receive a commission.
Other Notes
- In select roles and locations, interviews may be recorded, transcribed, and summarized by AI; candidates may opt out prior to interviews.
- During interview, Reddit may collect identifiers, professional/employment-related info, sensory information (audio/video), and other info candidates choose to share; recordings are deleted after hiring decisions.
- Reddit is an equal opportunity employer and provides reasonable accommodations for qualified individuals with disabilities.
Benefits
- Medical, dental, and vision insurance for U.S.-based employees.
- 401(k) program with employer match.
- Generous time off and parental leave.