Senior Software Engineer - Core Communications Reliability

USD 160,000-240,000 per year
SENIOR
✅ On-site

Required Skills & Competences

  • Grafana @ 4
  • Kafka @ 3
  • Python @ 6
  • Spark @ 4
  • Java @ 3
  • Distributed Systems @ 4
  • Communication @ 7
  • SRE @ 4
  • Observability @ 7

Details

Bloomberg’s Core Communications platforms power real-time messaging across the global financial industry (e.g., Instant Bloomberg and MSG), handling billions of messages daily. This role focuses on improving the reliability of these large-scale, low-latency distributed systems by building tools, improving system design, and ensuring predictable behavior under load and failure.

Responsibilities

  • Work on large-scale distributed systems with high availability and low latency requirements.
  • Build tools and automation to improve how distributed systems are operated and debugged.
  • Define and implement service level objectives (SLOs) that reflect real user impact.
  • Identify and continuously assess reliability risks across services, infrastructure, and workflows; help teams prioritize work based on real impact.
  • Improve development and deployment workflows to drive more consistent and reliable paths to production.
  • Reduce time to recovery and triage effort by improving diagnostics, alerting, and system-level visibility.
  • Design and validate failure scenarios and resilience testing practices to ensure predictable behavior under stress.
  • Collaborate closely with software engineers and product teams to influence how systems are designed, built, and operated.

Requirements

  • 4+ years of experience in software engineering.
  • Proficiency in Python.
  • Experience working with distributed systems.
  • Strong understanding of system reliability, observability, and performance.
  • Familiarity with SLOs, SLIs, and SLAs, and with relating system performance back to client impact.
  • Strong collaboration and communication skills.
  • A degree in Computer Science or Engineering, or equivalent practical experience.

Preferred Qualifications

  • Experience with monitoring or tracing tools such as Grafana, Humio, or distributed tracing systems.
  • Familiarity with Kafka, Java, or large-scale data systems.
  • Experience with chaos engineering, failure injection, or resilience testing frameworks.
  • Exposure to capacity planning and scaling analysis.
  • Contributions to open source or involvement in SRE communities.
  • Experience with big data technologies like Apache Spark and Amazon S3.

Why this role

  • Work on systems operating at very high scale (billions of messages processed daily).
  • Tackle complex distributed systems challenges involving latency, consistency, and failure handling.
  • Build tooling and frameworks used across multiple teams.
  • Have a direct impact on systems relied upon by the global financial industry.

Benefits

The company offers a comprehensive benefits plan that may include merit increases, incentive compensation (exempt roles only), paid holidays, paid time off, medical, dental, and vision coverage, short- and long-term disability, 401(k) with match, life insurance, and wellness programs. Note: benefits are not provided directly to contingent workers, contractors, or interns.

Compensation

Salary range: USD 160,000-240,000 per year, plus benefits and bonus.