Used Tools & Technologies
Not specified
Required Skills & Competences ?
Distributed Systems @ 6 Machine Learning @ 3 TensorFlow @ 1 Hiring @ 3 PyTorch @ 1 GPU @ 3Details
Reddit is a community of communities built on shared interests, passion, and trust and is home to open and authentic conversations on the internet. With 100,000+ active communities and approximately 101M+ daily active unique visitors, Reddit is one of the internet’s largest sources of information. The Ads Training Platform pod builds and maintains the distributed training and data processing infrastructure that powers Reddit’s Ads machine learning models, enabling fast, reliable, and scalable model training across large datasets to improve ad targeting, conversion prediction, and advertiser value.
We are looking for an engineer with deep experience in infrastructure, distributed systems, and ML platform operations to help evolve and scale our training systems.
Responsibilities
- Design, build, and maintain large-scale distributed training infrastructure for Ads ML models.
- Develop tools and frameworks on top of the Ray platform.
- Build tools to debug, profile, and tune distributed training jobs for performance and reliability.
- Integrate with object storage systems and improve data access patterns.
- Collaborate with ML engineers to improve model training time, efficiency, and GPU training costs.
- Drive improvements in scheduling, state management, and fault tolerance within the training platform to enhance overall performance.
Requirements
- 3+ years in infrastructure/platform engineering or large-scale distributed systems.
- 2+ years hands-on experience with the Ray platform.
- Strong understanding of distributed computing principles (task scheduling, fault tolerance, state management).
- Experience with distributed storage systems and large-scale data processing.
- Proven ability to debug and profile distributed jobs.
- Experience with deep learning frameworks (PyTorch, TensorFlow) is a big plus.
- Bonus: model optimization for distributed training, Ads ML experience.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401(k) Match
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Reddit Global Days off
- Generous paid Parental Leave
- Paid Volunteer time off
Work Location & Pay
- Remote - United States (Reddit supports remote work in countries where it has a physical presence; employees near offices may come in as desired).
- Base pay range (U.S.-based): $185,800 - $260,100 USD.
Additional Notes
- This posting may span more than one career level. In addition to base salary, the role may be eligible for equity (RSUs) and, depending on position, commission. Reddit offers a wide range of benefits to U.S.-based employees.
- Some interviews may be recorded, transcribed, and summarized by AI; candidates may opt out prior to scheduled interviews. Recorded interviews are deleted promptly after hiring decisions.