Senior Software Engineer - Trade Automation & Execution Reliability
Used Tools & Technologies
Not specified
Required Skills & Competences
Each tag name is followed by an "@" symbol and a proficiency level.
About proficiency levels:
- 1-2 — basic awareness. Minimal hands-on experience, and a rudimentary understanding of the technology's purpose;
- 3-6 — daily use. Comfortable and regular usage, capable of handling common tasks and challenges related to the technology;
- 7-9 — you are an expert, you can teach others, you know all the pitfalls and tricks;
- 10 — exceptional knowledge, comprehensive understanding, and adeptness in all aspects of the technology, including advanced problem-solving. Think twice before claiming or demanding such a level.
Grafana @ 4
Linux @ 6
Prometheus @ 4
Python @ 6
Java @ 6
Distributed Systems @ 4
Load Testing @ 3
Observability @ 6
Details
The Trade Automation & Execution (TRAX) group builds the platforms and services that power modern electronic trading at Bloomberg. We design and operate high-performance, distributed, real-time systems used by financial institutions worldwide to execute trades, automate workflows, and make data-driven decisions. As markets evolve toward automation, scale, and intelligence, ensuring these platforms remain scalable, resilient, and predictable is critical. TRAX Reliability focuses on ensuring these real-time trading systems can scale safely, perform reliably under extreme market conditions, and recover gracefully from failures before issues impact clients.
Our Team
The TRAX Reliability team partners closely with application and infrastructure engineers to embed scalability, resilience, and technical risk management into trading systems from the ground up. Rather than reacting to production incidents, we take a data-driven, proactive approach: we convert telemetry from real production workloads into meaningful insights and run controlled experiments to understand how systems behave under load, how failures propagate, and where bottlenecks emerge. By connecting data on performance and capacity, we help teams plan for growth, traffic spikes, and adverse scenarios with clear scaling strategies that can be executed consistently and quickly to maintain our SLOs.
We also design and build tooling that continuously evaluates system risk and performance. This includes running targeted stress tests, collecting detailed metrics, and surfacing insights through real-time dashboards, so that teams can quickly identify areas for improvement across services, queues, and infrastructure and understand their impact on client experience.
What’s in it for you
- Direct impact on the stability and resilience of execution platforms relied upon by leading buy-side firms
- Work on high-stakes distributed systems with strong performance and reliability requirements
- Develop deep expertise in scaling, failure modes, and technical risk management for real-time trading systems
- Collaborate with engineers across New York, London, and Frankfurt
- Influence system design across application, observability, and infrastructure teams
Responsibilities
- Identify, prioritize, and track scalability and reliability risks across large-scale trading platforms
- Partner with application teams to diagnose and address performance and resilience challenges
- Analyze system behavior under real and simulated load, including latency, throughput, failover, and blast radius
- Design and run chaos engineering experiments and game-day exercises to validate system capacity and resilience
- Build and maintain automation and tooling for early detection and mitigation of production risks
- Communicate technical trade-offs, solutions, and roadmaps to engineering stakeholders
- Plan for traffic growth and peak market events with clear scaling strategies and guardrails
Requirements
- 5+ years of professional experience with a high-level programming language such as Python, Java, or C++, preferably on Unix/Linux
- Solid understanding of Unix/Linux fundamentals
- Hands-on experience contributing to or triaging scaling and reliability issues in production distributed systems
- Experience working with metrics, monitoring, or observability platforms such as Grafana, Prometheus, or log analytics tools
- Strong analytical skills and the ability to reason about complex system behavior and failure modes
Nice-to-have
- Familiarity with chaos engineering, fault injection, or load testing frameworks
- A track record of writing blameless postmortems and leading game-day or incident review exercises
- Curiosity and willingness to learn across all layers of the software and infrastructure stack
Why TRAX Reliability
You’ll work on systems that sit at the heart of global financial markets. Your work will directly influence how trades are executed, how platforms behave during market stress, and how reliably clients can operate at scale. If you’re excited by real-time systems, distributed architectures, and solving hard reliability problems with real-world impact, we want to talk to you.
Compensation
Salary range: 160,000 - 240,000 USD annually, plus benefits and bonus
Benefits
The company offers a comprehensive benefits plan that may include merit increases, incentive compensation (exempt roles only), paid holidays, paid time off, medical, dental, vision, short- and long-term disability benefits, 401(k) with match, life insurance, and various wellness programs. The company does not provide benefits directly to contingent workers/contractors and interns.
Apply
Apply via the company careers site.