Why Racing Beats Load Balancing for Latency-Sensitive RPC
The default redundancy pattern in distributed systems is the load balancer. You have N backends; the load balancer picks one per request; if that backend is healthy, you get an answer; if not, you fail over. The pattern is forty years old, well-understood, and present in essentially every production stack.
It is also wrong for latency-sensitive RPC. Specifically, it produces a long tail that nothing on the load balancer’s side can fix, because the load balancer made its picking decision before it had any information about how the chosen backend was currently performing.
StreamSync uses racing instead. Each query goes to 3-5 operators simultaneously; whoever returns the verified answer first wins; the rest do work for nothing on that query and price their bids accordingly. This post is about why we made that choice, what the math actually says, and what the trade-offs are.
The tail latency problem
Single-backend RPC has a p99 latency curve that looks like a long thin tail with occasional huge spikes. The spikes come from things you can’t predict: a GC pause on the backend, a network microburst, a noisy-neighbor on the host, an upstream Solana RPC having a slow slot. A well-tuned backend gets the spike rate down but never to zero, because the spike causes are not in the backend’s control.
Load balancing with N backends does not fix this. If you pick one backend per request and that backend hits a spike, your request hits the spike. The load balancer health-check rate is much slower than your request rate; it sees the backend as healthy because the next health-check hasn’t fired yet. You pay the tail.
Some load balancers do “retry on slow” — wait for a timeout, then retry against a different backend. This shifts the problem; instead of paying the spike, you pay the timeout plus the second backend’s normal latency. For latency-sensitive workloads the math gets ugly fast.
Racing makes the tail thinner
Racing fires the request to multiple backends from the start. Each backend’s latency is an independent random variable; the first response is the minimum of those variables; the tail of a minimum-of-N is much shorter than the tail of any individual variable.
For a back-of-the-envelope: if a single operator’s p99 latency is 25ms (well outside the 10ms SLA), racing three operators with independent failure modes brings the p99 of the minimum down toward the p87 of an individual operator. That’s a dramatic reduction in tail risk for a 3× compute cost.
In practice the gains are even better than the back-of-the-envelope suggests, because the spike causes aren’t fully independent — but they’re independent enough. When NY4 is having a microburst, only operators in NY4 are affected; the AMS5 racer keeps moving. When one operator’s DuckDB is doing a vacuum, only that operator is affected; the others race past.
What it costs
The honest cost is compute. Racing 3 operators per query means roughly 3× the total network work per successful query. We pay this cost in the operator pool: losing racers earn nothing for that query and price their bids to account for losing most races. The customer-facing cost ends up close to a single-backend price because the market clears at the marginal cost of the marginal operator.
There is also a coordination cost: the gateway has to dispatch to multiple operators, collect responses, verify the winner, and discard the losers. That coordination is the gateway’s job, and we’ve spent a lot of time making it cheap — primarily by using NNG instead of HTTP for operator communication and by parallelizing the verifier hashing against the winner’s response. The coordination overhead is around 1-2ms in the common case, which fits inside the 10ms budget.
Why not just deploy more single-backend replicas?
A reasonable counter: if my single backend has tail issues, deploy ten of them and have the load balancer pick the fastest based on recent observations. Why is racing better than smart routing?
Three reasons.
First, smart routing is reactive. It picks based on past observations of the backend. Racing is concurrent. It picks based on what is happening right now, which is the only thing that actually matters for tail latency. If a backend was fast a second ago but is slow this microsecond, smart routing sends you to it anyway.
Second, smart routing requires the load balancer to know the latency distribution of every backend in real time. That information is expensive to collect, expensive to keep fresh, and never quite accurate. Racing requires no such information: the operators tell you their latency by responding, not by reporting.
Third, smart routing requires you to trust the latency reports. A backend that lies about being fast wins routing decisions it shouldn’t. Racing is unfakeable because the latency is the response time itself.
Why not just deploy more operators per region?
Another counter: if racing 3 operators is good, racing 10 must be better. Why don’t we?
We do, for a subset of queries. The reserved-lane pricing tier races 5 operators per query for cold paths and the gateway can be configured to dispatch to up to 8 in extreme cases. We don’t make 10 the default because the marginal benefit drops sharply after 4-5: most of the tail reduction is captured by the third racer. Past that, you’re spending compute for diminishing returns.
We also can’t race more operators than exist in the target region. For specialized workloads (ZK reconstruction, archive), the network may have only 2-3 operators in a region; racing all of them is the best we can do.
The honest counter-argument
Racing has a failure mode load balancing doesn’t: a query that produces inconsistent results across operators. If two operators return different correct-looking answers, the verifier has to decide which is right. We handle this with deterministic per-slot reads and verifier consensus, but it’s a real engineering cost that the load-balancer pattern doesn’t pay.
For chains where the canonical state is fuzzy (e.g., probabilistic finality), racing is harder to implement than load balancing. Solana’s relatively fast finality makes this manageable; for chains with longer finality windows, racing would need a different verification model.
The pattern as a primitive
The interesting result of building this is that racing has turned out to be a useful primitive beyond just the customer-facing query path. We use it internally for the IDL synthesis — multiple synthesizers race to produce a candidate IDL, the verifiers consensus on the right one — and for shard rebalancing decisions. Anywhere you have a latency-sensitive task with multiple capable workers, racing produces better tail behavior at predictable cost.
We expect the load-balancer pattern will continue to dominate stateful workloads where the consistency model is complicated. For stateless reads against a chain like Solana, racing is the better default. The hosted RPC industry will get there eventually; we built our protocol around it from day one because the latency property only shows up in the contract when the market is structured for it.