Over the past few years, financial exchange operators and electronic marketplace providers have increasingly embraced the cloud. There is definite momentum toward migrating these systems to the cloud, with several exchanges already running there and more migrations planned.
As exchanges move to the cloud, the interfacing trading platforms face a choice. Active market participants currently enjoy low and predictable latency to exchanges through co-location and DMA solutions. Sophisticated firms deploy a wide range of performant solutions, ranging from custom ASICs to proprietary operating systems on x86 architectures. Is it even possible to achieve the performance profiles of these systems using cloud-native technologies and infrastructure?
This is what we set out to benchmark.
The Goal
Taking the use case of Tick-to-Trade, our experiments provide answers to the following questions:
- Using specific instance types, how long does it take to process a market data message, make a trade decision, and produce an order message, using a “simplistic” implementation – latency and throughput?
- Is it possible to improve the “simplistic” implementation using advanced networking and programming techniques – if so, by how much?
Tick-To-Trade
This use case simulates the workflow from receipt of a market signal to initiating a trade based on the signal. This workflow involves the following technical steps:
- Decode an incoming binary market data feed (SBE FIX) into a domain object
- Inspect the domain object for the feasibility of a trade
  - Various complex strategies and computations could be used here
  - To eliminate the timing variability of trade decision algorithms, we applied a modulus operator to the message's sequence number
- If the trade rule evaluates to positive, construct an outbound order message in SBE FIX format
- Send the outbound order message
- No-op if the trade rule evaluates to negative
This effectively captures the time taken to receive and process an inbound market signal, apply a simplistic trade decision algorithm, and send an order to an external exchange.
All benchmarking leveraged PCAP files representing actual production market data, sourced from CME Group DataMine for CBOT Globex Equity Futures for the August 28, 2024 trading date. These files contained the 6,840,868 packets used in our analysis.
Test Runs
For our tick-to-trade benchmarking, we conducted a series of tests across different dimensions to evaluate performance on Google Cloud Platform's (GCP) C3 machine series. Our test matrix focused on several key factors: the GCP instance type, the packet ingress mechanism (kernel bypass vs. kernel space), the level of parallelism, and the replay speed of the market data. We also distinguished between in-process packet processing time and the total round-trip duration from the network interface card (NIC) on one instance to the NIC on another.
The benchmarking architecture was straightforward, consisting of two instances: a packet replay instance and a test instance. We ran two distinct modes of operation: kernel-space packet processing, which uses standard POSIX socket interfaces, and kernel-bypass packet processing. For the kernel-bypass tests, we used ring buffers and CPU pinning to enable parallel processing and to better understand the performance gains from an increasing number of processor cores.
Each test run was repeated three times, with outliers removed, and the results presented as an average of the remaining data. The in-process measurements recorded the time from when a packet was read into user space for decoding until the FIX order message was placed on the transmit queue. The round-trip duration was measured from the moment the market data packet left the replay instance’s NIC until the order message packet completed its round trip back to the same NIC. This dual approach allowed us to separate the fixed cost of packet transmission from the in-process software optimizations.
The Results
We had set out to determine whether standard cloud compute instances could support demanding financial services data workflows with an acceptable performance profile, using no specialized compute infrastructure beyond what the stock instance type provides. We confirmed that a GCP-hosted market data processing pipeline can achieve baseline in-process latencies in the range of 0.3 to 2.5 µs, with round-trip latencies of 24 to 26 µs.
These results demonstrate that high-performance, low-latency data processing is feasible on stock cloud compute instances, lowering the barrier to entry and leveling the playing field for a wide variety of firms while eliminating the need for specialized hardware and proprietary processing toolkits.

