Designing a Cloud Based Exchange
Lessons from Alex Xu and Sahn Lam
A few months ago, I read a book by Alex Xu and Sahn Lam, where they discussed the design of a high-throughput, low-latency exchange with high availability. Given that Bitcoin recently surpassed the $100,000 mark, I thought it would be a great opportunity to share some of the insights I gained from the book. In this article, I'll walk you through the design of a low-latency, high-throughput system capable of handling loads similar to those seen by major exchanges like the New York Stock Exchange (NYSE).
The NYSE processes an estimated 1 billion orders daily during trading hours (09:30 EST to 16:00 EST). That equates to around 43,000 requests per second over the 6.5-hour trading window. During periods of high market volatility, the request rate can surge to as much as 129,000 requests per second, roughly three times the normal load. So, how can we design a system that handles 129,000 requests per second while still performing essential risk checks on incoming orders?
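These figures come from a simple back-of-envelope calculation, which can be checked in a few lines of Python:

```python
# Back-of-envelope throughput estimate for an NYSE-scale exchange.
TRADING_HOURS = 6.5             # 09:30 EST to 16:00 EST
ORDERS_PER_DAY = 1_000_000_000  # estimated daily order volume

seconds = TRADING_HOURS * 3600      # 23,400 seconds per session
avg_rps = ORDERS_PER_DAY / seconds  # ~42,735 requests per second
peak_rps = avg_rps * 3              # ~128,205 req/s at a 3x volatility surge

print(f"average: {avg_rps:,.0f} req/s, peak: {peak_rps:,.0f} req/s")
```

Rounding to the nearest thousand gives the 43,000 and 129,000 requests per second quoted above.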
While I am a strong advocate for the microservices architecture, there is an area where this design pattern doesn't quite fit: ultra-low-latency systems. High-performance systems, especially those with sub-millisecond latencies, are typically not distributed by nature. Instead, they are co-located on a single instance. The reasoning is simple: proximity matters. The closer components are to each other, the faster the communication. Why risk network faults or packet loss by communicating over a network?
This doesn’t mean we can't use multiple instances, but the design usually involves a single "leader" instance that processes requests, while replica instances stand by to take over if the leader fails. The replicas are kept in sync and ready to assume the leader role if needed.
Core Components of an Exchange
At a high level, an exchange consists of several core components:
Client Gateway: Handles incoming requests, whether via the Financial Information eXchange (FIX) protocol or a REST API.
Order Manager: Orchestrates and manages the workflow of orders.
Sequencer: Ensures consistency by marking orders and executions with incremental IDs, allowing the system to remain deterministic during replay.
Matching Engine: Matches incoming buy and sell orders.
Data Service: Provides information like candles, order books, trades, and more.
Reporter: Writes trades and executions to a database or disk.
Risk Checker: Ensures that users have not exceeded their trading volume limits.
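To make two of these roles concrete, here is a minimal Python sketch of a Sequencer and a toy price-time Matching Engine. The names and interfaces are my own illustrations, not taken from the book, and a real engine would use far more efficient data structures than sorted lists:

```python
import itertools
from dataclasses import dataclass

@dataclass
class Order:
    side: str   # "buy" or "sell"
    price: int  # price in minor units (e.g. cents)
    qty: int
    seq_id: int = 0

class Sequencer:
    """Stamps orders with incremental IDs so replays are deterministic."""
    def __init__(self):
        self._ids = itertools.count(1)

    def stamp(self, order):
        order.seq_id = next(self._ids)
        return order

class MatchingEngine:
    """Toy price-time priority matcher holding resting orders per side."""
    def __init__(self):
        self.book = {"buy": [], "sell": []}

    def submit(self, order):
        opposite = "sell" if order.side == "buy" else "buy"
        trades = []
        # Cross against resting orders that satisfy the incoming price.
        for resting in list(self.book[opposite]):
            crosses = (resting.price <= order.price if order.side == "buy"
                       else resting.price >= order.price)
            if order.qty == 0 or not crosses:
                break
            fill = min(order.qty, resting.qty)
            trades.append((resting.seq_id, order.seq_id, resting.price, fill))
            order.qty -= fill
            resting.qty -= fill
            if resting.qty == 0:
                self.book[opposite].remove(resting)
        if order.qty > 0:
            self.book[order.side].append(order)
            # Keep best price first: highest bid, lowest ask, oldest first.
            self.book[order.side].sort(
                key=lambda o: (-o.price if o.side == "buy" else o.price, o.seq_id))
        return trades

seq, engine = Sequencer(), MatchingEngine()
engine.submit(seq.stamp(Order("sell", 100_00, 5)))
trades = engine.submit(seq.stamp(Order("buy", 100_00, 3)))
print(trades)  # [(1, 2, 10000, 3)]: a fill of 3 at 100.00
```

The sequence IDs are what make recovery possible later on: replaying the stamped orders in order reproduces the exact same book state.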
The key to achieving low latency is minimising the reliance on remote systems. For ultra-low-latency applications, a monolithic architecture that incorporates the Client Gateway, Order Manager, Sequencer, and Matching Engine in the "hot path" is ideal. The hot path is the critical path where the data is processed in real-time:
Client Gateway → Order Manager → Sequencer → Matching Engine
Should Event-Driven Architecture Still Be Used?
Although event-driven architecture (EDA) is not ideal for the hot path in low-latency systems, it can still be used elsewhere, just without the usual event streaming platforms or event buses such as Kafka, RabbitMQ, or Reflex. I believe event-driven design promotes better software development thanks to its standardised communication and separation of concerns.
mmap(2) and Its Role in Low-Latency Systems
One interesting concept I encountered in the book was mmap(2). This system call allows you to map a file directly into memory, enabling much faster performance compared to traditional disk I/O operations. It also allows efficient interaction with large files without loading them entirely into memory.
Traditional exchanges such as NASDAQ, the NYSE, and the London Stock Exchange (LSE) are believed to use mmap(2) to meet the low-latency requirements of high-frequency trading, although their exact implementations and optimisations are not public. By mapping a file into memory, an exchange can quickly access and modify large datasets, enabling rapid order matching and execution.
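Python exposes the same system call through its mmap module, which makes the idea easy to demonstrate. This is a sketch only; the file layout (a fixed-size record of sequence ID and price) is an assumption for illustration:

```python
import mmap, os, struct, tempfile

# Pre-size the backing file: mmap cannot map an empty file,
# and the mapping length is fixed when it is created.
path = os.path.join(tempfile.mkdtemp(), "orders.log")
SLOT = struct.calcsize("<QQ")       # one (sequence_id, price) record
with open(path, "wb") as f:
    f.truncate(SLOT * 1024)         # room for 1,024 records

with open(path, "r+b") as f:
    buf = mmap.mmap(f.fileno(), 0)  # map the whole file into memory
    # Writing to the mapping is an ordinary memory write; the kernel
    # flushes dirty pages to disk lazily, or we force it with flush().
    buf[0:SLOT] = struct.pack("<QQ", 1, 100_00)
    buf.flush()                     # msync(2): persist the page now
    seq_id, price = struct.unpack("<QQ", buf[0:SLOT])
    print(seq_id, price)            # 1 10000
    buf.close()
```

The key property is that reads and writes after the mapping is set up are plain memory operations, with no read(2)/write(2) system calls on the hot path.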
Combining Event Sourcing with mmap(2)
By combining Event Sourcing (the event-driven paradigm) with mmap(2), it is possible to run an event-driven system at low-latency while ensuring data durability. Of course, no system is immune to failure. Servers can crash, hardware can malfunction, and unexpected issues can arise at any time.
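One way to picture the combination is an append-only event log whose storage is a memory-mapped file: every state change is an immutable event, appending is a memory write, and replaying the log rebuilds state after a crash. The sketch below is my own illustration under that assumption; a production log would add checksums, segment rotation, and recovery markers:

```python
import mmap, os, struct, tempfile

HEADER = struct.Struct("<I")  # 4-byte length prefix per event

class MappedEventLog:
    """Append-only event log backed by a memory-mapped file."""
    def __init__(self, path, capacity=1 << 20):
        with open(path, "wb") as f:
            f.truncate(capacity)           # fixed-size segment
        self._file = open(path, "r+b")
        self._buf = mmap.mmap(self._file.fileno(), 0)
        self._offset = 0

    def append(self, event: bytes):
        end = self._offset + HEADER.size + len(event)
        if end > len(self._buf):
            raise IOError("log segment full")
        HEADER.pack_into(self._buf, self._offset, len(event))
        self._buf[self._offset + HEADER.size:end] = event
        self._offset = end
        self._buf.flush()                  # msync: make the event durable

    def replay(self):
        """Yield events in order: deterministic replay for recovery."""
        pos = 0
        while pos < len(self._buf):
            (length,) = HEADER.unpack_from(self._buf, pos)
            if length == 0:                # zeroed tail marks end of log
                return
            pos += HEADER.size
            yield bytes(self._buf[pos:pos + length])
            pos += length

log = MappedEventLog(os.path.join(tempfile.mkdtemp(), "events.log"))
log.append(b"OrderPlaced:1")
log.append(b"OrderMatched:1")
events = list(log.replay())
print(events)  # [b'OrderPlaced:1', b'OrderMatched:1']
```

Because the sequencer has already given every event a deterministic order, replaying this log from the start reproduces the exchange's state exactly.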
In the monolithic design I described earlier, one instance takes on the "leader" role, processing incoming orders. If the leader crashes, how does the system recover, and how does a new leader take over?
Introducing Raft for Fault Tolerance
This is where the Raft consensus algorithm comes into play. Raft provides leader election and log replication, ensuring consistency and fault tolerance by replicating the event log across instances. While the leader persists its mmap(2)-backed file to disk, follower instances receive updates through the Raft protocol and maintain their own copies of the memory-mapped file. When a new leader is elected, it continues writing to its mmap(2) file, and Raft propagates these updates to all replicas.
If the original leader crashes and later comes back online, it can use Raft to sync with the latest updates, memory-map the file, and wait for the next opportunity to become the leader again.
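The election rules at the heart of Raft are compact enough to sketch. This toy model shows only terms, votes, and the majority rule; real Raft also requires randomised election timeouts, heartbeats, log-completeness checks on votes, and persisted state, and the class and method names here are my own:

```python
class RaftNode:
    """Toy model of Raft leader election: terms, votes, majority."""
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # the other RaftNode instances
        self.term = 0
        self.voted_for = None
        self.state = "follower"

    def request_vote(self, term, candidate_id):
        # Step down on seeing a newer term; grant at most one vote per term.
        if term > self.term:
            self.term, self.voted_for, self.state = term, None, "follower"
        if term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def start_election(self):
        # A follower whose election timer fires becomes a candidate.
        self.term += 1
        self.state = "candidate"
        self.voted_for = self.node_id
        votes = 1 + sum(p.request_vote(self.term, self.node_id)
                        for p in self.peers)
        if votes > (len(self.peers) + 1) // 2:   # strict majority
            self.state = "leader"
        return self.state

nodes = [RaftNode(i, []) for i in range(3)]
for n in nodes:
    n.peers = [p for p in nodes if p is not n]
print(nodes[0].start_election())  # 'leader': wins its own vote plus both peers
```

The one-vote-per-term rule is what guarantees at most one leader per term, which is exactly the property the exchange needs so that only a single instance ever writes to the event log.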
Conclusion
This approach outlines how a cloud-based exchange can achieve low-latency, high-throughput processing while maintaining consistency and durability. By combining efficient memory-mapping techniques, event sourcing, and consensus algorithms like Raft, it's possible to build a fault-tolerant, high-performance system capable of handling massive amounts of data with minimal latency.
Disclaimer:
The design and ideas presented in this article are purely my own and do not represent the architecture or implementation of Luno, the cryptocurrency exchange. The views expressed are based on my personal research, and should not be interpreted as reflecting Luno's current or future technical infrastructure. This article is for educational and informational purposes only, and should not be construed as an official statement or endorsement by Luno.