How a Stock Exchange Works
There is a moment, roughly once every market quarter, when some piece of news hits the wire and millions of traders hit their buy or sell buttons simultaneously. The exchange absorbs that shock. Prices move. Trades match. Confirmations fly back in milliseconds. Nobody on the outside thinks twice about it.
But if you crack open what actually happened in those milliseconds, you find one of the most carefully engineered distributed systems ever built. Stock exchanges are not just websites that match buyers and sellers. They are real-time, deterministic, ultra-low-latency financial infrastructure where a microsecond of delay can represent thousands of dollars of opportunity lost, and where a single bug in the matching engine can destabilize an entire market.

This blog is for engineers who want to understand what is actually happening under the hood. We will start from first principles and work our way through every major subsystem, from order entry to trade settlement, touching on the hardware, software, data structures, and architectural tradeoffs that make modern exchanges tick.
Why This Problem Is Hard
Before jumping into architecture, it helps to understand the constraints.
Modern exchanges process millions of orders per second across thousands of listed instruments. They must maintain strict price-time priority, meaning the first order at the best price must always get filled first, with no exceptions. They must distribute market data to thousands of subscribers in real time. They must do pre-trade risk checks without adding meaningful latency. And they must do all of this while being highly available, because an exchange outage is not just a technical failure, it is a regulatory and reputational catastrophe.
The latency bar is extraordinary. While a typical web application is considered fast if it responds in 100 milliseconds, exchange matching engines target round-trip latencies in the range of 10 to 100 microseconds. Some FPGA-based systems operate in the nanosecond range. The reason is simple: if your system is slower than your competitors, sophisticated traders will exploit the pricing difference between what your system shows and what the real market has moved to, a practice called latency arbitrage.
Electronic exchanges replaced physical trading floors not because computers are faster at paperwork, but because deterministic, auditable, and fair matching at scale is simply not possible with humans.
Core Features of a Stock Exchange
To understand the engineering, you first need a clear picture of what an exchange actually does.
Order placement is the entry point. A trader or algorithm submits an order specifying a financial instrument, a direction (buy or sell), a quantity, and optionally a price. The exchange receives this order and processes it.
Order books maintain the current state of all outstanding orders for each instrument. There is a buy-side book (bids) and a sell-side book (asks). Orders are organized by price level and, within each price level, by arrival time.
Order matching is the core function. When a new buy order arrives at a price that meets or exceeds an existing sell order, a trade occurs. The matching engine executes the trade, removes the matched quantities from the order book, and notifies both parties.
Market data feeds broadcast the state of the order book and trade activity in real time. Subscribers, which include brokers, traders, data vendors, and regulators, consume these feeds to understand current market conditions.
Clearing is the post-trade process of confirming the details of a trade between counterparties and computing net obligations. Settlement is the actual transfer of cash and securities, which typically happens one or two business days after the trade.
Risk management runs both before orders are submitted to the matching engine (pre-trade) and continuously throughout the session (real-time). It enforces position limits, margin requirements, and market-wide safeguards like circuit breakers.
Price discovery and auction systems are used during market open and close to establish fair opening and closing prices through call auction mechanisms rather than continuous matching.
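As a rough illustration of a call auction, the sketch below picks the single price that maximizes matched volume across all collected orders. The function name, the order tuples, and the tie-break (lowest price wins) are assumptions made for this example; real exchanges apply more detailed tie-break rules, such as minimizing the leftover imbalance.

```python
def auction_price(bids, asks):
    """bids/asks: lists of (limit_price, qty). Return (price, volume) maximizing matched volume."""
    candidates = sorted({p for p, _ in bids} | {p for p, _ in asks})
    best = (None, 0)
    for price in candidates:
        demand = sum(q for p, q in bids if p >= price)  # buyers willing to pay at least this price
        supply = sum(q for p, q in asks if p <= price)  # sellers willing to accept at most this price
        volume = min(demand, supply)
        if volume > best[1]:  # strict comparison: ties keep the lower price (an assumption)
            best = (price, volume)
    return best

# Hypothetical orders collected before the open.
bids = [(102, 300), (101, 200), (100, 400)]
asks = [(99, 250), (100, 250), (101, 300)]
opening = auction_price(bids, asks)
```

Every order that can trade does so at the one auction price, which is what distinguishes this from continuous matching.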
High-Level Exchange Architecture
Here is how the major components fit together at a high level.
A trader sends an order through their broker’s Order Management System. The broker’s system routes it to the exchange’s API gateway, which handles authentication, rate limiting, and basic validation. The order then flows into the exchange’s own Order Management System, where it gets validated, sequenced, and handed to the pre-trade risk engine for checks like position limits and credit validation. Only then does it reach the matching engine.
The matching engine is the heart of the exchange. It processes orders sequentially, updates the order book, executes trades, and emits trade events. Those events fan out in multiple directions: to the market data publisher, to the post-trade risk system, to clearing and settlement, and to the audit log.
Every step in this chain is designed to be fast, deterministic, and recoverable.
Order Lifecycle Deep Dive
Understanding how a single order travels through the system reveals most of the interesting engineering decisions.
Authentication and rate limiting happen at the gateway layer. The exchange maintains session tokens and connection state per broker or participant. Rate limits are enforced per participant to prevent any single actor from flooding the system. This layer is typically horizontally scalable because it is stateless relative to the order book.
Validation checks that the order is well-formed: valid instrument symbol, valid order type, price within allowed range, quantity within configured bounds. This is cheap and can happen in parallel for different sessions.
Pre-trade risk checks are where things get interesting. The risk engine must verify that the participant has enough buying power or available securities to support this order. In a fast system, this cannot involve a database round trip. Risk state is maintained in memory, often in a purpose-built in-process cache, and updated atomically as trades execute. A failed risk check returns a rejection immediately.
Sequence assignment is critical for correctness. Every order must receive a monotonically increasing sequence number before it enters the matching engine. This sequence number defines the canonical ordering of all market events and is what allows the system to be replayed deterministically after a failure. The sequencer is usually a single bottleneck by design: you want exactly one system assigning sequence numbers.
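A minimal sketch of the sequencer idea, assuming a single in-process writer. The class names, the payload shape, and the in-memory list standing in for the durable log are all hypothetical.

```python
import itertools
from dataclasses import dataclass

@dataclass(frozen=True)
class SequencedEvent:
    seq: int        # canonical position in the event stream
    payload: dict   # validated order message (hypothetical shape)

class Sequencer:
    """Single writer that stamps each inbound event with the next sequence number."""

    def __init__(self, start: int = 1):
        self._counter = itertools.count(start)
        self.log = []  # stand-in for the durable, replicated event log

    def stamp(self, payload: dict) -> SequencedEvent:
        event = SequencedEvent(next(self._counter), payload)
        self.log.append(event)  # log before handing off to the matching engine
        return event

sequencer = Sequencer()
first = sequencer.stamp({"side": "B", "qty": 100, "price": 10000})
second = sequencer.stamp({"side": "S", "qty": 50, "price": 10000})
```

Because only one component ever calls `stamp`, there is no need for locks or distributed agreement on ordering.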
The matching engine queue is a single-threaded processing loop. The matching engine reads sequenced order events one at a time and processes them in order. This is not a performance limitation by accident; it is a correctness guarantee by design. If two threads could concurrently modify the order book, you would need locking, and locking at that granularity is both complex and a latency risk. Single-threaded processing eliminates a whole class of concurrency bugs.
Execution reports flow back to the submitting participant synchronously (or as fast as possible asynchronously), confirming whether the order was accepted, partially filled, fully filled, or rejected.
The whole path from order submission to execution report, under normal conditions, takes somewhere between 10 and 500 microseconds depending on the exchange and the infrastructure. NASDAQ famously measured their matching engine latency in the range of 70 to 100 microseconds for much of the 2010s. Today’s fastest systems, using FPGA-based matching, operate in the single-digit microsecond or even nanosecond range.
Matching Engine Architecture
The matching engine is where price discovery happens. Let us look at how it actually works.
The Central Limit Order Book (CLOB) is the data structure that holds all outstanding orders. It is organized by instrument, then by side (buy or sell), then by price level, and within each price level by time of arrival.
Price-time priority (also called FIFO matching) is the standard algorithm used by most exchanges. An incoming buy order matches against the lowest-priced sell order first. If there are multiple sell orders at the same price, they are matched in the order they arrived. This is fair in a provable sense: it rewards participants who commit to a price earlier.
Market orders do not specify a price. They match immediately against whatever is available in the book, consuming liquidity at progressively worse prices if needed. If the book does not have enough depth to fill the entire market order, the unfilled remainder is either cancelled or converted into a resting order, depending on exchange rules.
Limit orders specify a maximum buy price or minimum sell price. A buy limit order at $100 will only match against sell orders priced at $100 or below. If no such orders exist, the limit order rests in the book waiting for a matching counterparty.
Iceberg orders are limit orders where only a portion of the total quantity is visible in the order book at any time. When the visible portion is filled, the next tranche becomes visible. This allows large institutions to place large orders without telegraphing their full intention to the market.
Partial fills happen when the available quantity at the best price is less than the incoming order’s quantity. The incoming order gets partially filled, and the residual rests in the book (for limit orders) or continues consuming the next price level (for market orders).
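The rules above (price-time priority, limit orders, partial fills) can be sketched in a few lines. This is an illustrative toy, not a production design: it assumes integer tick prices, uses a plain dict with `min`/`max` to find the best price, and all names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Order:
    order_id: int
    side: str   # "B" or "S"
    price: int  # limit price in ticks
    qty: int

class Book:
    def __init__(self):
        self.bids = {}  # price -> deque of resting orders, oldest first
        self.asks = {}

    def submit(self, order):
        """Match an incoming limit order; return a list of (resting_id, price, qty) fills."""
        fills = []
        opposite = self.asks if order.side == "B" else self.bids
        while order.qty > 0 and opposite:
            best = min(opposite) if order.side == "B" else max(opposite)
            crosses = best <= order.price if order.side == "B" else best >= order.price
            if not crosses:
                break
            queue = opposite[best]
            resting = queue[0]  # earliest arrival at the best price
            traded = min(order.qty, resting.qty)
            fills.append((resting.order_id, best, traded))
            order.qty -= traded
            resting.qty -= traded
            if resting.qty == 0:
                queue.popleft()
                if not queue:
                    del opposite[best]
        if order.qty > 0:  # residual quantity rests in the book
            own = self.bids if order.side == "B" else self.asks
            own.setdefault(order.price, deque()).append(order)
        return fills

book = Book()
book.submit(Order(1, "S", 101, 50))
book.submit(Order(2, "S", 100, 30))
book.submit(Order(3, "S", 100, 40))
fills = book.submit(Order(4, "B", 100, 60))
```

The buy for 60 fully consumes order 2 (which arrived first at price 100) and then partially fills order 3, leaving 10 of it resting. Order 1 at 101 is untouched because it is above the buyer's limit.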
Deterministic execution is non-negotiable. Given the same sequence of input orders, the matching engine must always produce exactly the same output. This is what enables replay-based recovery: if the engine crashes, you can reconstruct its exact state by replaying the input event log from the last checkpoint.
Order Book System
The order book is not just a logical concept; it is a performance-critical data structure that must support fast insertions, deletions, and lookups.
For the price dimension, the book needs to quickly find the best bid and best ask (the top of the book). A sorted structure works, but the choice of data structure matters enormously at the volumes exchanges handle.
Skip lists are popular in matching engines because they offer O(log n) insertion, deletion, and search, but with lower constant factors than balanced trees and simpler lock-free implementations. Some exchanges use them to maintain the sorted list of price levels.
Intrusive linked lists are used within each price level to maintain the time-ordered queue of orders. Because the order objects themselves contain the list pointers, there is no separate node allocation per order, a known order can be unlinked (cancelled) in O(1), and traversal avoids an extra pointer dereference.
Array-based representations work well when the tick size is fixed and the price range is bounded. If you know that prices will only ever fall within a specific range and tick size, you can pre-allocate an array indexed by price tick and store the order queue directly at each slot. Lookup becomes O(1). This is common in futures exchanges where price ranges are more predictable.
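A small sketch of the array-indexed idea for the bid side. To keep it short, each slot stores only the aggregate resting quantity instead of a full order queue, and the class name and tick bounds are assumptions for the example.

```python
class ArrayBook:
    """Bid side of a book over a bounded tick range, indexed by price tick."""

    def __init__(self, min_tick: int, max_tick: int):
        self.min_tick = min_tick
        self.levels = [0] * (max_tick - min_tick + 1)  # resting quantity per tick
        self.best_bid = None  # tick of the highest non-empty level

    def add(self, tick: int, qty: int) -> None:
        self.levels[tick - self.min_tick] += qty  # O(1) slot lookup
        if self.best_bid is None or tick > self.best_bid:
            self.best_bid = tick

    def remove(self, tick: int, qty: int) -> None:
        self.levels[tick - self.min_tick] -= qty
        # If the best level emptied, scan downward for the next non-empty one.
        while self.best_bid is not None and self.levels[self.best_bid - self.min_tick] == 0:
            self.best_bid = self.best_bid - 1 if self.best_bid > self.min_tick else None

bids = ArrayBook(min_tick=9_900, max_tick=10_100)
bids.add(10_000, 500)
bids.add(9_998, 200)
bids.remove(10_000, 500)
```

The downward scan after a level empties is the price paid for O(1) slot access. It is usually short because liquidity clusters near the top of the book.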
The bid-ask spread is the difference between the best buy price and the best sell price. A tight spread indicates a liquid market. The depth of market shows how many orders exist at each price level beyond the best bid and ask. Exchanges publish depth information as part of their market data feeds.
For a heavily traded instrument like Apple stock, the in-memory order book might contain thousands of price levels with tens of thousands of orders. The matching engine needs to insert, cancel, and match these orders at rates exceeding a million operations per second per instrument.
Low-Latency Infrastructure
This is where exchange engineering departs most dramatically from conventional software engineering. The pursuit of nanosecond latency drives decisions at every layer of the stack.
Kernel bypass networking is standard practice for latency-sensitive trading infrastructure. The Linux kernel’s network stack can add tens to hundreds of microseconds of latency, particularly under load, due to context switches, interrupt handling, and buffer copying. Libraries like DPDK (Data Plane Development Kit) allow applications to bypass the kernel entirely and interact with the network card directly from user space. The application polls the NIC rather than waiting for interrupts, which removes interrupt latency at the cost of dedicating a CPU core to busy polling.
RDMA (Remote Direct Memory Access) takes this further. With RDMA, one machine can read from or write to another machine’s memory without involving the remote CPU at all. The data transfer happens entirely in hardware, with latencies measured in microseconds. Exchanges use RDMA for replication between primary and standby matching engines.
FPGA acceleration is used for the most latency-sensitive paths. Field-Programmable Gate Arrays are hardware chips that can be programmed to implement custom logic directly in silicon, executing with fixed, deterministic latency measured in nanoseconds. Some exchanges implement their entire order matching logic in FPGAs, others use them just for network parsing and risk checks. The tradeoff is that FPGAs are orders of magnitude harder to develop and debug than software.
Colocation refers to placing trading firm servers in the same physical data center as the exchange’s matching engine. By being physically close, firms reduce network round-trip times from milliseconds to microseconds. Exchanges charge significant fees for colocation access and must ensure all colocated firms receive physically equal cable lengths to guarantee fairness, a practice called equal cabling or normalized latency.
CPU pinning assigns specific threads to specific CPU cores and prevents the operating system from migrating them. This eliminates scheduler-induced jitter. NUMA awareness ensures that a thread and the memory it accesses are on the same NUMA node, avoiding costly cross-socket memory transfers. Lock-free data structures using atomic operations eliminate mutex contention between threads. Huge pages reduce TLB pressure for memory-intensive workloads.
In software, garbage collection is anathema to latency-sensitive exchange systems. Languages like Java and Go can introduce GC pauses of milliseconds or more, which is catastrophic on a microsecond budget. Exchange matching engines are most often written in C++ because it gives deterministic, manual memory control, although some venues use Java with allocation-free coding on the hot path. Object pooling and slab allocators ensure that the matching engine’s hot path never calls malloc or free.
| Optimization | Latency Reduction | Complexity | Use Case |
|---|---|---|---|
| Kernel bypass (DPDK) | 100-500 microseconds saved | High | All exchange networking |
| RDMA | 1-10 microseconds for replication | Very High | Engine replication, market data |
| FPGA matching | Sub-microsecond matching | Extreme | Ultra-low-latency venues |
| CPU pinning | Eliminates scheduler jitter | Medium | Matching engine threads |
| Lock-free structures | Eliminates mutex contention | High | Shared data structures |
| Object pooling | Eliminates allocator (malloc/free) latency on the hot path | Medium | Order allocation in C++ |
| Huge pages | Reduces TLB misses | Low | Large in-memory order books |
Market Data Distribution System
Once trades happen and the order book changes, that information needs to reach every interested party as quickly as possible. Market data distribution is a different engineering problem from matching: instead of low-latency point-to-point communication, you need low-latency one-to-many broadcasting.
Multicast networking is the primary mechanism for market data distribution. Instead of sending the same data to each subscriber individually (unicast), the exchange sends a single packet that the network infrastructure delivers to all subscribers simultaneously. This makes the publisher’s bandwidth requirement constant regardless of subscriber count.
Feed handlers are software components that receive raw market data from the exchange, parse it, normalize it into a standard format, and distribute it to downstream consumers. Large banks and trading firms run their own feed handlers, often co-located with the exchange, to minimize latency.
Tick data represents the finest granularity of market data: every individual order event, every trade, every quote change. Tick data volumes are enormous. A busy day on a major exchange can produce hundreds of gigabytes of tick data.
Sequencing and gap detection are critical in market data systems. UDP multicast can drop packets. Each market data message includes a sequence number so consumers can detect gaps. When a gap is detected, the consumer requests retransmission through a separate recovery channel. The key design insight is that the primary delivery path is optimized for low latency (unreliable UDP multicast), while a secondary path handles reliability (reliable TCP or multicast retransmission).
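Consumer-side gap detection can be sketched as follows. The class name and status strings are hypothetical, and the sketch only reports the missing range; requesting retransmission over the recovery channel is left out.

```python
class GapDetector:
    """Tracks the next expected sequence number on a feed and reports missing ranges."""

    def __init__(self, first_expected: int = 1):
        self.expected = first_expected

    def on_message(self, seq: int):
        """Return (status, missing), where missing is an inclusive (start, end) range or None."""
        if seq < self.expected:
            return "stale", None  # duplicate or late packet; already accounted for
        if seq == self.expected:
            self.expected += 1
            return "ok", None
        missing = (self.expected, seq - 1)  # range to request on the recovery channel
        self.expected = seq + 1
        return "gap", missing

detector = GapDetector()
results = [detector.on_message(s) for s in (1, 2, 5, 4, 6)]
```

In this run, message 5 arrives while 3 is expected, so the detector reports the gap (3, 4). The late arrival of 4 is then treated as stale because that range has already been flagged for recovery.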
Levels of market data vary in granularity. Level 1 data shows only the best bid and ask. Level 2 (depth of market) shows multiple price levels. Level 3 shows the full order book including individual order IDs. Different subscribers pay for different levels, and each level requires significantly more bandwidth and processing.
The fan-out problem, where one exchange event must reach thousands of subscribers, is solved at the network layer through multicast, but the exchange must also maintain multiple feeds for reliability, redundancy, and different data levels. This results in a complex publishing infrastructure that itself must be highly available and low-latency.
Risk Management Systems
Risk management in an exchange context operates at two levels: per-participant risk and market-wide risk.
Pre-trade risk checks are the first line of defense. Before any order reaches the matching engine, it must pass through risk validation. These checks include position limits (a participant cannot exceed their authorized position size), credit checks (is there enough buying power to support this order), order size limits (no individual order can exceed a maximum size, a defense against fat-finger errors), and price collars (orders priced far away from the current market are rejected to prevent erroneous trades).
The challenge is doing these checks fast enough that they do not become a bottleneck. Well-optimized implementations aim to complete pre-trade risk checks in around a microsecond or less by maintaining all relevant risk state in memory and using optimized, branchless code paths.
Fat finger protection deserves special mention. A fat finger error is when a trader accidentally enters an order with a wrong price or quantity. For example, entering a sell order for 1 million shares instead of 1,000 shares. The exchange protects against this with configurable maximum order size and maximum notional value limits, and by rejecting orders priced more than a configured percentage away from the current market.
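The checks described above reduce to a handful of comparisons against in-memory limits. The sketch below assumes hypothetical limit values and field names, and uses floats for readability where a real system would more likely use fixed-point integers.

```python
from dataclasses import dataclass

@dataclass
class RiskLimits:
    max_order_qty: int
    max_notional: float
    collar_pct: float  # allowed distance from the reference price, e.g. 0.05 = 5%

def check_order(qty: int, price: float, reference_price: float, limits: RiskLimits):
    """Return None if the order passes, else a rejection reason string."""
    if qty > limits.max_order_qty:
        return "order size limit"
    if qty * price > limits.max_notional:
        return "notional limit"
    if abs(price - reference_price) > limits.collar_pct * reference_price:
        return "price collar"
    return None

# Hypothetical limits for one participant.
limits = RiskLimits(max_order_qty=100_000, max_notional=5_000_000.0, collar_pct=0.05)
fat_finger = check_order(1_000_000, 150.0, 150.0, limits)  # 1 million shares instead of 1,000
normal = check_order(1_000, 150.0, 150.0, limits)
```

No I/O happens inside `check_order`, which is what allows checks like this to sit on the order path.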
Circuit breakers are market-wide protections that halt trading when prices move too dramatically in a short period. The US exchanges use a tiered Market-Wide Circuit Breaker (MWCB) system, measured against the S&P 500’s prior-day close: a 7% decline (Level 1) halts trading for 15 minutes, a 13% decline (Level 2) triggers another 15-minute halt, and a 20% decline (Level 3) halts trading for the rest of the day. Level 1 and Level 2 halts apply only if triggered before 3:25 p.m. Eastern Time. Individual stock circuit breakers (called Limit Up-Limit Down in the US) prevent any single stock from moving more than a configured percentage from a recent reference price without triggering a trading pause.
Implementing circuit breakers requires the risk engine to continuously compute rolling statistics (recent price movement, volatility) and inject halt signals into the matching engine’s processing queue. The halt must be processed in strict sequence with normal order events to ensure no trades occur after the halt threshold is crossed.
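The threshold logic for the market-wide breaker can be sketched as a pure function of the reference close and the current index level. This is a simplification: it ignores time-of-day rules and whether a level has already fired that day, and the function name is hypothetical.

```python
def mwcb_level(prior_close: float, current: float) -> int:
    """Return 0 (no halt) or the breaker level 1-3 implied by the decline from prior_close."""
    decline = (prior_close - current) / prior_close
    if decline >= 0.20:
        return 3
    if decline >= 0.13:
        return 2
    if decline >= 0.07:
        return 1
    return 0

# Hypothetical index levels against a prior close of 5000.
levels = [mwcb_level(5000.0, x) for x in (4700.0, 4600.0, 4300.0, 3900.0)]
```

In a real engine, a nonzero result would be turned into a halt event and placed into the same sequenced stream as orders, so that it takes effect at a well-defined point.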
Clearing and Settlement Systems
Trading does not end when the matching engine fires. Clearing and settlement are what transform a matched trade into an actual transfer of securities and cash.
Clearing is the process of confirming trade details and computing what each party owes. When a buy order matches a sell order, the trade goes to a clearinghouse (often a subsidiary of the exchange or a separate regulated entity, like NSCC, a DTCC subsidiary, in the US). The clearinghouse confirms the trade with both counterparties and becomes the legal counterparty to both sides, a process called novation. This eliminates bilateral counterparty risk: neither the buyer nor the seller needs to worry about the other defaulting, because the clearinghouse guarantees both sides.
Netting is an important efficiency in clearing. If a participant buys 1000 shares of AAPL in one trade and sells 700 shares in another, the net obligation is to receive 300 shares and pay the net cash difference. Netting drastically reduces the actual volume of securities and cash that need to physically change hands, improving capital efficiency and reducing settlement risk.
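The netting arithmetic can be made concrete with a short sketch. The participant name and the prices are made up for illustration, and prices are in integer cents to avoid floating-point rounding.

```python
from collections import defaultdict

def net_positions(trades):
    """trades: iterable of (participant, symbol, side, qty, price_cents), side 'B' or 'S'.

    Returns {(participant, symbol): (net_shares, net_cash_cents)}. Positive shares are
    to be received; negative cash is to be paid.
    """
    shares = defaultdict(int)
    cash = defaultdict(int)
    for participant, symbol, side, qty, price_cents in trades:
        sign = 1 if side == "B" else -1
        shares[(participant, symbol)] += sign * qty
        cash[(participant, symbol)] -= sign * qty * price_cents
    return {key: (shares[key], cash[key]) for key in shares}

trades = [
    ("FIRM_A", "AAPL", "B", 1000, 19_000),  # buy 1000 @ $190.00 (hypothetical price)
    ("FIRM_A", "AAPL", "S", 700, 19_100),   # sell 700 @ $191.00 (hypothetical price)
]
netted = net_positions(trades)
```

Two gross trades totalling 1,700 shares collapse into a single obligation: receive 300 shares and pay $56,300.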
Settlement is the final step: actual transfer of cash and securities. In the US equity markets, settlement is currently T+1, meaning it happens one business day after the trade date. Many markets have historically been T+2, and reducing settlement cycles is an active area of regulatory effort because shorter settlement cycles reduce the window during which counterparty risk can materialize.
Margin is the collateral that clearing members must post to the clearinghouse to cover potential losses. The clearinghouse uses sophisticated risk models to compute how much margin each participant must maintain based on their open positions, volatility, and correlation of their portfolio. Margin calls are issued in real time (intraday) when market conditions change rapidly.
Reconciliation happens at every layer: the exchange reconciles its trade records with the clearinghouse, the clearinghouse reconciles with settlement systems, and settlement systems reconcile with custodian banks and depositories. Any discrepancy triggers an investigation. At scale, even tiny error rates produce significant operational overhead, which is why exchanges invest heavily in automated reconciliation and exception management.
Distributed Exchange Infrastructure
A stock exchange cannot afford downtime. Exchanges typically operate under regulatory mandates requiring extremely high availability, and outages like the NASDAQ outage in 2013 or the NYSE outage in 2015 attract intense regulatory scrutiny.
Active-passive replication is the most common approach for the matching engine. A primary matching engine processes all orders and continuously replicates its state to a standby. The standby receives every order event in sequence and maintains a shadow order book in sync with the primary. If the primary fails, the standby can take over in seconds.
The challenge is that the standby must be synchronized closely enough that failover is seamless, but the replication must not add latency to the primary’s processing. RDMA-based replication achieves microsecond-level synchronization without impacting the primary’s hot path.
Deterministic replay is the backup to active-passive replication. Every order event is written to a persistent, sequenced event log (like a Kafka topic, but often a custom low-latency implementation). If both the primary and standby fail, the system can recover by replaying the event log from the last known good checkpoint. This is slower than active-passive failover but provides a last-resort recovery mechanism.
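A toy version of checkpoint-plus-replay recovery, assuming a deliberately tiny state (open order quantities only) and hypothetical event fields. The property it demonstrates is that replaying the log onto a checkpoint yields the same state as processing every event live.

```python
import copy

def apply(state: dict, event: dict) -> None:
    """Deterministic state transition: no clocks, randomness, or I/O."""
    if event["type"] == "new":
        state["open"][event["order_id"]] = event["qty"]
    elif event["type"] == "cancel":
        state["open"].pop(event["order_id"], None)
    state["last_seq"] = event["seq"]

def replay(checkpoint: dict, log: list) -> dict:
    """Rebuild state from a checkpoint plus every logged event after it."""
    state = copy.deepcopy(checkpoint)
    for event in log:
        if event["seq"] > state["last_seq"]:  # skip events already in the checkpoint
            apply(state, event)
    return state

log = [
    {"seq": 1, "type": "new", "order_id": 10, "qty": 100},
    {"seq": 2, "type": "new", "order_id": 11, "qty": 250},
    {"seq": 3, "type": "cancel", "order_id": 10},
]
empty = {"last_seq": 0, "open": {}}
live = replay(empty, log)                    # state after processing everything
checkpoint_at_2 = replay(empty, log[:2])     # snapshot taken after event 2
recovered = replay(checkpoint_at_2, log)     # crash recovery from that snapshot
```

This only works if `apply` never consults anything outside the event itself, which is why matching engines avoid wall-clock reads and other nondeterministic inputs in the processing path.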
Multiple data centers provide geographic redundancy. Exchanges typically operate across two or three data centers in the same metro area (for low replication latency) with a disaster recovery site in a geographically separate location. Trading sessions are normally served from a primary data center, with the secondary data center running a hot standby ready to take over.
Consistency requirements make distributed matching engines very hard. Unlike many web systems where eventual consistency is acceptable, an exchange cannot have two matching engines operating simultaneously on the same order book. That would violate price-time priority and create duplicate or conflicting trades. The exchange must operate with a single active matching engine per instrument at any point in time. This is a form of the single-writer principle applied at a system architecture level.
Database and Storage Design
The exchange needs to store many different types of data with different access patterns and latency requirements.
The matching engine itself is entirely in-memory. The order book is never persisted to disk during normal operation. Speed requires that all order book data be in RAM, accessed at memory bandwidth speeds. The current market state is reconstructable from the event log if needed.
The event log is append-only and replicated. Every order event, every trade event, and every order book state change is written to a durable, sequenced log. This log is the source of truth. It is written to fast NVMe storage with synchronous replication to at least one replica before the engine acknowledges the write.
Trade records are written to a relational database for post-trade processing, reporting, and regulatory purposes. These records need to be queryable, auditable, and immutable. A typical schema:
```sql
CREATE TABLE trades (
    trade_id        BIGINT PRIMARY KEY,
    instrument_id   VARCHAR(20) NOT NULL,
    buy_order_id    BIGINT NOT NULL,
    sell_order_id   BIGINT NOT NULL,
    buyer_id        VARCHAR(50) NOT NULL,
    seller_id       VARCHAR(50) NOT NULL,
    quantity        BIGINT NOT NULL,
    price           NUMERIC(18, 6) NOT NULL,
    trade_time      TIMESTAMP WITH TIME ZONE NOT NULL,
    sequence_number BIGINT NOT NULL UNIQUE,
    session_date    DATE NOT NULL
);
```
Order records capture the full lifecycle of every order:
```sql
CREATE TABLE orders (
    order_id        BIGINT PRIMARY KEY,
    participant_id  VARCHAR(50) NOT NULL,
    instrument_id   VARCHAR(20) NOT NULL,
    side            CHAR(1) NOT NULL CHECK (side IN ('B', 'S')),
    order_type      VARCHAR(10) NOT NULL,
    quantity        BIGINT NOT NULL,
    filled_qty      BIGINT NOT NULL DEFAULT 0,
    price           NUMERIC(18, 6),
    status          VARCHAR(20) NOT NULL,
    submit_time     TIMESTAMP WITH TIME ZONE NOT NULL,
    last_update     TIMESTAMP WITH TIME ZONE NOT NULL,
    sequence_number BIGINT NOT NULL
);
```
Market data is stored in time-series databases like kdb+ (widely used in finance), InfluxDB, or custom columnar stores. Time-series data has extremely regular write patterns (append-only, time-ordered) and requires efficient range queries by time window and instrument.
Position and risk data is maintained in an in-memory store (often Redis or a custom solution) for real-time access during pre-trade checks, with periodic snapshots to a persistent store for recovery.
Event Streaming Architecture
Modern exchanges rely on event streaming infrastructure to decouple their internal subsystems and enable reliable, ordered event processing.
Every order event, trade event, and order book update is published to an internal event stream. Downstream systems, including the market data publisher, the post-trade risk system, the clearing interface, and the audit log, subscribe to this stream and process events independently.
Ordering guarantees are critical. The event stream must preserve the sequence number assigned by the sequencer. No downstream consumer can process event N+1 before it has processed event N for the same instrument. This is why exchanges typically partition their event streams by instrument, ensuring all events for a given symbol flow through a single ordered partition.
Exactly-once delivery is harder than it sounds. In the presence of failures, most messaging systems default to at-least-once delivery, which means consumers must be idempotent. In exchange systems, idempotency is enforced through sequence numbers: a consumer that receives a duplicate event can detect and discard it.
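One way to picture sequence-number idempotency is a consumer that remembers the highest sequence it has applied per instrument and ignores anything at or below it. The class and field names are hypothetical, and the sketch assumes events within an instrument arrive in order.

```python
class PositionConsumer:
    """Applies each trade event at most once, keyed by per-instrument sequence number."""

    def __init__(self):
        self.last_seq = {}  # instrument -> highest sequence number applied
        self.position = {}  # instrument -> net traded quantity

    def handle(self, instrument: str, seq: int, signed_qty: int) -> bool:
        """Apply the event once; return False if it was a duplicate and was skipped."""
        if seq <= self.last_seq.get(instrument, 0):
            return False
        self.position[instrument] = self.position.get(instrument, 0) + signed_qty
        self.last_seq[instrument] = seq
        return True

consumer = PositionConsumer()
outcomes = [
    consumer.handle("AAPL", 1, +100),
    consumer.handle("AAPL", 2, -40),
    consumer.handle("AAPL", 2, -40),  # redelivered after a simulated failure
    consumer.handle("MSFT", 1, +10),
]
```

The redelivered event is dropped, so at-least-once delivery from the transport still produces an effectively-once result in the consumer's state.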
Replay capability is a killer feature of event streaming for exchanges. If a downstream system fails and recovers, it can replay the event stream from the point of failure and bring itself back to a consistent state. This is also how risk systems bootstrap at system startup: they replay all events from the current session to reconstruct current position state.
The internal streaming infrastructure at exchanges often does not use Kafka directly because Kafka’s latency characteristics (typically on the order of milliseconds end to end) are too slow for the hot path. Instead, exchanges often use custom implementations or tools like Aeron, which is designed for ultra-low-latency messaging and supports shared memory transport and UDP unicast and multicast with reliability semantics.
Monitoring and Observability
Operating exchange infrastructure requires visibility at every layer, from hardware performance counters to trade surveillance.
Latency monitoring must be continuous and granular. Exchanges measure the latency of every order at multiple points in the pipeline: time of receipt at the gateway, time of sequence assignment, time of matching, time of execution report delivery. Any spike in any segment triggers alerts. Latency is tracked as percentile distributions (p50, p99, p99.9), because tail latency matters as much as median latency.
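A minimal nearest-rank percentile calculation over a batch of latency samples. The percentile is passed in per-mille (999 means p99.9) so the rank can be computed with integer arithmetic, and the sample data is synthetic. Production systems more commonly use streaming histograms than sorting raw samples.

```python
def percentile(samples, per_mille: int):
    """Nearest-rank percentile; per_mille=500 is the median, 999 is p99.9."""
    ordered = sorted(samples)
    rank = max(1, -(-per_mille * len(ordered) // 1000))  # integer ceiling division
    return ordered[rank - 1]

latencies_us = list(range(1, 1001))  # synthetic samples: 1..1000 microseconds
p50 = percentile(latencies_us, 500)
p99 = percentile(latencies_us, 990)
p999 = percentile(latencies_us, 999)
```

With a distribution this uniform the tail looks tame. On real data a p99.9 several times larger than the median is the kind of signal these dashboards exist to catch.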
Trade surveillance is a regulatory requirement. The exchange must monitor for manipulative trading patterns: spoofing (placing orders with no intent to fill to create a false appearance of supply or demand), layering (a variant of spoofing), wash trading (trading with yourself), and front running. These patterns are detected by analyzing trade data in real time or near-real time using pattern recognition and statistical anomaly detection.
Distributed tracing allows engineers to follow a single order through the entire system. Because the matching engine processes events sequentially with sequence numbers, trace reconstruction is actually easier in exchange systems than in many microservice architectures: the sequence number serves as a correlation ID.
Circuit breaker metrics are watched continuously by market operations teams. The real-time volatility calculations that feed circuit breaker logic are themselves monitored for correctness and availability.
Security and Compliance
Exchanges are regulated financial infrastructure and face a higher security bar than most software systems.
Regulatory compliance requires that every trade be recorded with sufficient detail to reconstruct the full order lifecycle, identify the beneficial owner of each order, and detect manipulative behavior. In the US, this is governed by SEC and FINRA rules. In Europe, MiFID II imposes similar requirements. Regulatory reporting happens on a trade-by-trade basis with finely synchronized timestamps; MiFID II, for example, requires high-frequency trading systems to timestamp at microsecond granularity with clocks held within 100 microseconds of UTC.
Tamper-proof audit logging means that the trade log and order log cannot be altered after the fact. This is implemented through cryptographic chaining (each log entry contains a hash of the previous entry), append-only storage, and air-gapped backup copies sent to regulatory authorities.
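Cryptographic chaining can be sketched with SHA-256 from the standard library. The record fields and the all-zero genesis value are assumptions for the example. A real deployment would also need to anchor the latest hash somewhere the log writer cannot modify.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (an assumption)

def append_entry(chain: list, record: dict) -> None:
    """Append a record whose hash covers both its content and the previous entry's hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash, "hash": digest})

def verify(chain: list) -> bool:
    """Recompute every hash from the start; any edit to an earlier entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

audit_log = []
append_entry(audit_log, {"seq": 1, "trade_id": 9001, "qty": 100})
append_entry(audit_log, {"seq": 2, "trade_id": 9002, "qty": 250})
intact = verify(audit_log)
audit_log[0]["record"]["qty"] = 1  # simulate tampering with an earlier entry
tampered_detected = not verify(audit_log)
```

Changing any historical record changes its hash, which no longer matches what the next entry recorded, so the alteration is detectable without trusting the storage layer.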
Authentication for market participants uses certificate-based mutual TLS at the gateway layer. Session-level authentication tokens are issued after the TLS handshake and carried on every order message. DDoS protection is implemented at the network level, combined with rate limiting at the gateway. Colocated participants connect over private cross-connects rather than the public network, so for them the protection comes mainly from rate limits applied at the logical (session) level.
Scalability Deep Dive
The fundamental scalability challenge for exchanges is that the matching engine is, by design, a sequential single-threaded process. You cannot simply add more CPUs to make it faster. So how do exchanges scale?
Symbol partitioning is the primary horizontal scaling mechanism. Each instrument is assigned to a specific matching engine instance. AAPL runs on engine cluster 1, MSFT on engine cluster 2, and so on. Since orders for different instruments are independent, multiple matching engines can run in parallel, giving the exchange aggregate throughput that scales with the number of engine instances, while any single instrument remains bounded by the speed of one engine.
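A sketch of routing orders to engines by symbol. It uses a stable hash purely for illustration; an exchange would more plausibly keep an explicit assignment table so that heavily traded symbols can be balanced by hand. The function name and engine count are hypothetical.

```python
import zlib

def engine_for(symbol: str, engine_count: int) -> int:
    """Map a symbol to an engine index using a stable hash.

    zlib.crc32 is used because Python's built-in hash() for strings is salted
    per process and would route differently after a restart.
    """
    return zlib.crc32(symbol.encode()) % engine_count

ENGINE_COUNT = 4  # hypothetical number of matching engine instances
routes = {symbol: engine_for(symbol, ENGINE_COUNT) for symbol in ("AAPL", "MSFT", "GOOG")}
```

Whatever the mapping, the invariant that matters is that every order for a given symbol always lands on the same engine, so each order book has exactly one writer.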
The challenge with symbol partitioning is cross-symbol interactions. Index futures, for example, involve baskets of individual stocks. Arbitrage strategies trade correlated instruments simultaneously. Risk management must consider the aggregate portfolio across all instruments. These cross-symbol concerns must be handled by systems that can aggregate across partitions, which introduces complexity.
Market data scaling is a different problem: one exchange produces market data that must reach thousands of subscribers without the publisher becoming a bottleneck. The answer is multicast for the last mile and tiered distribution: the exchange publishes once, regional distribution systems amplify to local subscribers.
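UDP multicast does not guarantee delivery, so subscribers typically track per-channel sequence numbers and ask a recovery service for anything they missed. The sketch below shows only the detection side, under simplifying assumptions: sequence numbers start at 1, the class and status names are invented for the example, and late packets are not buffered or reordered.

```python
class GapDetector:
    """Tracks the next expected sequence number on one multicast channel."""

    def __init__(self, first_seq: int = 1):
        self.expected = first_seq

    def on_packet(self, seq: int):
        """Return (status, missing_range) for a received sequence number."""
        if seq < self.expected:
            # Already seen or arriving late; this sketch simply drops it.
            return "duplicate", None
        if seq > self.expected:
            missing = (self.expected, seq - 1)  # inclusive range to re-request
            self.expected = seq + 1
            return "gap", missing
        self.expected = seq + 1
        return "ok", None


detector = GapDetector()
for seq in [1, 2, 5, 5, 6]:
    print(seq, detector.on_packet(seq))
# 1 ('ok', None)
# 2 ('ok', None)
# 5 ('gap', (3, 4))
# 5 ('duplicate', None)
# 6 ('ok', None)
```

Treating repeated sequence numbers as harmless duplicates is also what makes it cheap to listen to two redundant copies of the same feed and keep whichever packet arrives first.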
Gateway scaling is straightforward horizontal scaling. More gateway servers can be added to handle more inbound connections, since gateways are stateless with respect to the order book.
| Bottleneck | Root Cause | Primary Solution | Tradeoff |
|---|---|---|---|
| Matching engine throughput | Single-threaded by design | Symbol partitioning | Cross-symbol complexity |
| Market data fan-out | One-to-many at scale | UDP multicast | Packet loss requires recovery channels |
| Risk system latency | Must check all orders | In-memory risk state with NUMA awareness | State synchronization complexity |
| Gateway connections | Thousands of brokers | Horizontal scaling | Session state management |
| Event log write throughput | Must be durable and fast | NVMe with async batching | Batch size vs latency tradeoff |
Reliability and Availability
The goal for exchange systems is typically five-nines availability (99.999%), which amounts to about five minutes of allowable downtime per year. Achieving this requires eliminating single points of failure at every layer.
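The downtime budget is simple arithmetic, and it is worth seeing how quickly it shrinks with each extra nine. The snippet below assumes an average year of 365.25 days and measures availability against the whole calendar year; measuring against trading hours only would give a smaller budget.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} minutes/year")
# 99.9% -> 525.96 minutes/year
# 99.99% -> 52.60 minutes/year
# 99.999% -> 5.26 minutes/year
```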
Hot standby matching engines receive every order event in real time and maintain a shadow order book. Failover from primary to standby is designed to complete within seconds. The challenge is that during failover, in-flight orders may be in an ambiguous state, and the exchange must have clear rules for how to handle these.
Session continuity across failover requires that the standby engine knows the state of every active session: which orders are outstanding, which execution reports have been sent. This state must be replicated along with the order book.
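The reason a hot standby can take over at all is that the engine is deterministic: two instances fed the same sequenced events end up in the same state. The following sketch illustrates that property with a deliberately tiny state model. The event fields, the `ShadowBook` class, and the one-report-per-event rule are assumptions for the example, and no actual matching is performed.

```python
class ShadowBook:
    """Minimal engine state: resting orders plus per-session report counters."""

    def __init__(self):
        self.last_seq = 0
        self.open_orders = {}      # order_id -> (session, remaining_qty)
        self.reports_sent = {}     # session -> number of execution reports

    def apply(self, event: dict) -> None:
        """Apply one sequenced event; the same input order gives the same state."""
        if event["seq"] != self.last_seq + 1:
            raise ValueError(f"expected seq {self.last_seq + 1}, got {event['seq']}")
        self.last_seq = event["seq"]
        session = event["session"]
        if event["type"] == "NEW":
            self.open_orders[event["order_id"]] = (session, event["qty"])
        elif event["type"] == "CANCEL":
            self.open_orders.pop(event["order_id"], None)
        # Assumed rule: every accepted event produces one execution report.
        self.reports_sent[session] = self.reports_sent.get(session, 0) + 1

    def snapshot(self):
        return (self.last_seq, dict(self.open_orders), dict(self.reports_sent))


events = [
    {"seq": 1, "type": "NEW", "session": "S1", "order_id": "A1", "qty": 100},
    {"seq": 2, "type": "NEW", "session": "S2", "order_id": "B1", "qty": 50},
    {"seq": 3, "type": "CANCEL", "session": "S1", "order_id": "A1"},
]

primary, standby = ShadowBook(), ShadowBook()
for event in events:
    primary.apply(event)   # primary processes the event
    standby.apply(event)   # standby receives the same event over replication

print(primary.snapshot() == standby.snapshot())  # True
print(standby.open_orders)                       # {'B1': ('S2', 50)}
```

Because session state (here, the report counters) is derived from the same event stream as the order book, it is replicated for free. The ambiguity during failover comes from events the primary had received but the standby had not yet seen, which the sequence check above would surface as a gap.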
Failover procedures are tested regularly. Exchanges run periodic drills where they exercise their failover procedures in production (during non-trading hours) to ensure the standby infrastructure actually works. Paper testing is insufficient: many outages are caused by bugs in failover logic that only manifest under real conditions.
Recovery time objectives (RTO) for exchanges are measured in seconds to minutes, not hours. This influences every infrastructure decision. If a component can only recover in 30 minutes, it cannot be on the critical trading path.
Engineering Tradeoffs
The interesting engineering happens at the decision points where different priorities conflict.
Consistency vs latency: Ensuring that all risk systems have a perfectly consistent view of position state before every order is processed would be correct, but it adds latency. Exchanges accept small windows of inconsistency in risk state, protected by conservative limits and real-time monitoring, to keep pre-trade checks fast.
Throughput vs fairness: Batching orders and processing them together (auction-style) would dramatically improve throughput. But continuous matching with price-time priority is considered fairer and is the market convention. Most equity exchanges do not batch except at open and close.
Distributed vs centralized matching: Distributing the matching engine across multiple nodes would improve fault tolerance and scalability but would make deterministic, fair matching extremely difficult. The single-writer primary remains the dominant architecture because correctness and fairness are non-negotiable.
Hardware optimization vs operational complexity: FPGA-based matching delivers extraordinary latency but makes the system much harder to develop, test, and modify. Exchanges that use FPGAs for matching face much longer development cycles for regulatory or functional changes. Most exchanges accept slightly higher latency in exchange for software flexibility.
Replication synchrony vs latency: Synchronous replication to a standby guarantees no data loss on failover but adds the replication round-trip to every order’s latency. Asynchronous replication reduces latency but risks losing the last few events on failover. Most exchanges use synchronous replication for the matching engine and asynchronous replication for non-critical downstream systems.
Real-World Technology Stack
What do real exchanges actually use?
C++ is the dominant language for matching engines, risk engines, and anything on the critical latency path. Modern C++17 and C++20 provide the control over memory layout, alignment, and cache behavior that ultra-low-latency code requires.
Java is used for back-office systems, clearing interfaces, market surveillance, and operational tooling where latency requirements are less extreme and developer productivity matters more. Java’s mature ecosystem for financial messaging (FIX protocol libraries, risk model libraries) makes it practical for many non-critical paths.
FPGA infrastructure (from vendors like Xilinx/AMD and Intel/Altera) is used by the most latency-focused venues for matching, market data parsing, and network handling. The development toolchain is specialized, and engineers with FPGA expertise command significant premiums.
Aeron is a messaging library designed for ultra-low-latency inter-process and inter-machine communication. It supports a shared memory (IPC) transport for same-machine communication, as well as UDP unicast and UDP multicast. It is widely used in the financial industry for internal event streaming.
kdb+ (from KX Systems) is a time-series database and query language that is ubiquitous in quantitative finance for storing and analyzing tick data and historical market data. Its columnar format and in-memory capabilities make it extremely fast for time-series workloads.
Apache Kafka appears in exchange infrastructure for non-latency-critical pipelines: regulatory reporting feeds, settlement interfaces, market surveillance data, and operational event buses. Kafka’s durability and replay capabilities are valuable even where latency requirements preclude it from the hot path.
Linux kernel tuning is as important as the software. Exchanges run extensively tuned Linux systems: real-time scheduling policies, disabled power management (to eliminate frequency scaling), NUMA-aware memory allocation, CPU isolation (reserving entire cores for critical threads), and huge pages (to reduce TLB misses).
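Of the tuning steps above, CPU pinning is the easiest to show in a few lines. The sketch below uses Python's standard `os.sched_setaffinity`, which is only available on Linux, and picks a CPU arbitrarily. It demonstrates affinity only: true isolation also requires keeping the rest of the system off that core (for example with kernel boot parameters), which is outside what a process can do for itself, and production matching engines would do this from C++ rather than Python.

```python
import os


def pin_to_one_cpu():
    """Pin the calling process to a single allowed CPU; return that CPU or None."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux: affinity calls are unavailable here
    try:
        allowed = sorted(os.sched_getaffinity(0))   # CPUs this process may run on
        target = allowed[-1]                        # arbitrary choice for the sketch
        os.sched_setaffinity(0, {target})           # 0 means "the calling process"
    except OSError:
        return None  # the environment refused the request
    return target


cpu = pin_to_one_cpu()
if cpu is None:
    print("CPU affinity could not be set on this platform")
else:
    print(f"pinned to CPU {cpu}; allowed set is now {os.sched_getaffinity(0)}")
```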
| Component | Technology | Why This Choice |
|---|---|---|
| Matching engine | C++ / FPGA | Deterministic low-latency, manual memory control |
| Internal messaging | Aeron, custom RDMA | Microsecond latency, reliable delivery, replay |
| Market data | UDP multicast | Fan-out at scale without publisher bottleneck |
| Tick data storage | kdb+, InfluxDB | Time-series optimized, columnar, fast range queries |
| Event streaming | Aeron (hot path), Kafka (cold path) | Latency where needed, durability and replay everywhere else |
| Back office | Java, PostgreSQL | Ecosystem maturity, correctness, auditability |
| Networking | DPDK, RDMA, Solarflare NICs | Kernel bypass, hardware-accelerated delivery |
System Design Interview Perspective
Stock exchange system design questions appear in senior and staff-level interviews at trading firms, fintech companies, and sometimes at large tech companies building financial products. Here is how to approach them well.
What interviewers are actually testing: They want to see that you understand the latency, correctness, and scalability constraints that drive exchange architecture. Candidates who treat this like a generic CRUD system miss the point entirely. The interesting engineering is in the non-negotiable constraints: deterministic matching, ultra-low latency, fault tolerance without correctness compromise.
Start with requirements clarification: Is this a retail brokerage (where latency requirements might be more relaxed) or an institutional exchange (where microseconds matter)? Are we designing just the matching engine, or the full trade lifecycle? Clarifying scope shows maturity.
Discuss the matching engine first: This is the architectural heart. Explain why it must be single-threaded for correctness, how you handle the throughput implications through symbol partitioning, and how you achieve fault tolerance through hot standby replication. Most candidates underestimate how much of the interesting design lives here.
Talk about the data structures: Interviewers at financial firms will often probe your knowledge of order book data structures. Know why skip lists or array-indexed price levels are preferable to naively sorted arrays, and why the in-memory constraint drives so much of the design.
Bring up latency at every layer: For each component, discuss the latency target and what engineering decisions achieve it. This shows you understand the real-world constraints rather than just the happy-path flow.
Cover the failure modes: What happens when the matching engine crashes? What happens when replication falls behind? What happens during a circuit breaker halt? Strong candidates proactively discuss failure modes and recovery paths.
Common mistakes: The most common mistake is proposing a horizontally distributed matching engine without acknowledging the correctness and fairness problems this creates. Another common mistake is treating a disk-backed database as the primary store for order book state, which would make the system orders of magnitude too slow. Suggesting a relational database for real-time order matching is a red flag in this context.
Strong answers distinguish between the hot path and cold path: The matching engine and its immediate dependencies must be extremely fast. Downstream systems (clearing, surveillance, reporting) can be slower and can use different technology stacks. Demonstrating awareness of this layering shows architectural sophistication.
Discuss the regulatory dimension: Mentioning audit logging, tamper-proof records, regulatory reporting, and surveillance is differentiating. These are real constraints that shape exchange architecture, and most candidates from non-financial backgrounds overlook them entirely.
The best interview answers feel like a conversation with someone who has thought deeply about the problem, understands the constraints, and can reason about tradeoffs in real time. You do not need to know every implementation detail, but you should understand why each major design decision was made.
Putting It All Together
Stock exchanges are, in many ways, the most demanding distributed systems ever built in production. The combination of ultra-low latency, strict correctness guarantees, high availability requirements, regulatory oversight, and enormous financial consequences creates an engineering environment where every decision is highly consequential.
The single-threaded matching engine, the in-memory order book, kernel bypass networking, symbol partitioning, hot standby replication, append-only event logs, multicast market data distribution, pre-trade risk checks in microseconds: each of these is not an arbitrary choice but a carefully considered answer to a specific hard constraint.
What makes exchange engineering intellectually fascinating is that the most important design decisions often run counter to conventional distributed systems wisdom. Horizontal distribution is usually good; for matching engines, it compromises correctness. Keeping data in memory is usually an optimization; for order book and risk state, it is a hard requirement. Eventual consistency is often acceptable; for order book state, it is not, and even for risk state it is tolerated only within small, tightly bounded windows.
If you are building systems at the intersection of finance and technology, or if you are preparing for system design interviews at firms that care about low-latency infrastructure, understanding exchange architecture at this depth gives you a framework for reasoning about the most demanding classes of production systems that exist. The principles here, determinism, sequencing, in-memory state management, event sourcing, latency-aware design, apply well beyond the world of stock exchanges.
The market opens at 9:30 AM. Somewhere in a data center in New Jersey, a single-threaded loop starts processing events at a rate that no human can perceive. And somewhere in that loop, millions of buy and sell orders find their match, prices are discovered, and capital flows to where it is most needed. It happens in microseconds, it is fair, and it almost never fails. That is the engineering achievement worth understanding.