Drop the Right Requests First: Priority-Aware Load Shedding Under Overload
Static RPS caps shed the wrong traffic. Concurrency is what saturates a service, not request rate. From my notes after reading the InfoQ piece on overload protection, Uber's January writeup on Cinnamon, and Netflix's QCon SF talk on service-level prioritized load shedding, here is why latency is the right control signal — and how a small priority taxonomy plus an adaptive concurrency limit keep the cheapest traffic shedding first.
Autoscaling and retries get the headlines. Neither helps in the ten seconds between a traffic spike hitting a service and the next instance coming online. In that window the arrival rate has already passed service capacity, the queue is filling, p99 is climbing, and every request inside is on its way to a timeout. The only knob that does anything useful right then is admission control — the service deciding, in line, which requests to accept and which to reject before they reach the slow path.
I went into my notes on this after reading the InfoQ piece arguing overload protection is "the missing pillar of platform engineering," Uber's January 2026 writeup on Cinnamon, and the QCon SF 2025 talk from Netflix on service-level prioritized load shedding. The three converge on the same architecture: an adaptive concurrency limit driven by latency, plus a priority taxonomy that decides who gets shed first. The static rate limit most services still ship with does neither, and that is why it sheds the wrong traffic.
The takeaway I want to land: give every request a priority and shed from the bottom under overload using an adaptive limit, so saturation degrades the cheapest traffic instead of failing everyone uniformly.
Why static rate limits shed the wrong traffic
The textbook overload defense is a token bucket — say, 5,000 RPS per service, enforced at the edge. It is easy to reason about and easy to dashboard. It is also wrong in two specific ways.
The first is that RPS is not the variable that breaks a service. Concurrency is. Little's Law writes this down precisely: average concurrency equals average RPS times average latency. A service that processes 5,000 RPS at 10ms holds 50 in-flight requests. The same service holding 50 in-flight requests at 100ms is doing only 500 RPS. The RPS limit hasn't changed, but every other resource — connections, threads, sockets, downstream connections — is doing ten times the work. A static RPS limit picks the cap at one operating point and lies about every other one.
The second is that requests of similar shape impose wildly different costs. Uber's writeup on Docstore — their MySQL-backed distributed key-value store serving 170M monthly active users — is blunt about this: requests of similar size impose varying CPU, memory, or I/O costs, so a stateless rate limiter can shed healthy traffic while overloaded partitions remain unprotected. A 1KB write against a hot partition is not the same load as a 1KB write against a cold one. The token bucket cannot see the difference.
When the cap is wrong, the failure mode is uniform. Every caller — the live request that keeps the home screen rendering, the batch reconciliation nobody is watching — gets 429s at the same rate. The user-facing call that mattered just got shed at the same probability as the cron job that could have run tomorrow.
Adaptive concurrency: latency as the control signal
Netflix's concurrency-limits library frames this as the same control problem TCP solves. A receiver does not know in advance how much in-flight data the network can hold. It measures: minimum round-trip time when the path is empty, current round-trip time under load, and the ratio between them. When the current RTT climbs above the floor, a queue is building somewhere, and the sender backs off. When the ratio returns to one, the sender opens back up.
Translate this to a service. Sample p99 latency over a short window. Compare it to a smoothed minimum latency over a longer window. The gap is a proxy for queue depth. If the gap is small, raise the concurrency cap by one. If the gap is large, drop the cap aggressively. Reject requests that arrive while at the cap.
Netflix's Vegas implementation estimates queue size with L * (1 - minRTT / sampleRTT), where L is the current limit. The cap increases by one when the estimated queue is below alpha (typically 2 to 3 requests) and decreases by one when it crosses beta (typically 4 to 6). Gradient2 generalizes this by tracking divergence between short- and long-window exponential averages, which smooths the impact of outliers in bursty traffic. The exact constants matter less than the property: the cap moves on its own when something downstream slows down, with no operator changing a config.
This is the lever the static rate limit cannot be. A token bucket configured at the steady-state RPS holds that number through every regime change — a cache cold-start, a noisy neighbor, a slow downstream — and lets in-flight traffic pile up until threads or sockets run out. The adaptive limit collapses on first sight of latency growth and reopens when it goes away. The important part is the timing: this happens in the seconds before autoscaling has a chance to react.
There is a clarifying way to state the goal. Netflix's QCon talk introduced two buffers I have found useful when arguing about capacity: the Success Buffer is the headroom above baseline a service can absorb without p99 degradation; the Failure Buffer is the capacity reserved specifically to reject excess gracefully. Adaptive concurrency exists to keep the service inside the Success Buffer until it must dip into the Failure Buffer, and load shedding exists to keep the Failure Buffer from collapsing into a death spiral.
The picture I keep in my notebook for this is a latency-versus-load curve with two traces past the knee: one for uniform shedding, where every priority class' p99 climbs together; one for priority shedding, where t0 stays inside its latency budget while t4 and t5 are rejected. Past the knee the two curves diverge dramatically — and that divergence is the whole reason to do this work.
Assigning priority without a committee
Once the cap is honest about how many concurrent requests a service can run, the question becomes which requests live inside it. The least useful answer is "every request equally, FIFO." That is what CoDel does at the network layer, and it is what Uber's first iteration of overload protection did in front of Docstore. The InfoQ writeup on Cinnamon names the failure mode directly: CoDel dropped low-priority and user-facing traffic alike, which raised on-call load and pushed retries from clients straight back into the overloaded queue.
The fix is a priority taxonomy attached to the request itself. Uber settled on six tiers, t0 through t5, with t0 reserved for the most latency-sensitive user-facing operations. Netflix's per-service load shedding uses a similar idea — non-critical shedding starts at 60% CPU utilization, critical shedding only kicks in at 80%. Under both schemes the cap is the same; what changes is the order in which requests are admitted into it. When the system is healthy everything gets in. When pressure rises, t5 and t4 get rejected first, then t3, then t2. t0 keeps its latency budget intact as long as there is any capacity to give it.
Two pieces of this taxonomy matter more than the number of tiers.
The first is that the priority must travel with the request, propagated end-to-end. A user-facing request that fans out into five downstream calls needs every hop to know it is t0; otherwise the slowest backend will make admission decisions blindfolded. RPC frameworks carry the tag in metadata or context. The discipline is to do it on every edge, including the ones I would rather not touch.
The second is that the taxonomy must be small enough to stay correct. Six tiers is the upper bound I have seen anyone defend. The Netflix automation around this is the part most adopters underestimate: "configuration involves aggregating utilization metrics, and a per-cluster load-shedding function is generated automatically." Without that automation, a tier assignment is a comment in a wiki, drifting from reality the moment a new endpoint ships.
The smallest version of this I would start with is two tiers — interactive and background — applied service-by-service. That is enough to recover the property the static rate limit lost: when the cap shrinks, the background work goes first.
Where priority shedding backfires
The pattern is not free, and a few of the failure modes are worth naming before adopting it.
The most common is misclassification. Every endpoint owner believes their endpoint is critical. If left to declare its own tier without review, a service ends up with everything at t0 and the taxonomy is meaningless. The Netflix automated platform exists in part to prevent this drift: priority is assigned centrally against business signals, and validated continuously against utilization patterns. Without that loop, the work to maintain the taxonomy decays to nothing within a quarter.
The second is starvation of low-priority traffic under sustained overload. If the system spends an hour above the knee, t4 and t5 get rejected for an hour. That may be correct — the alternative is failing t0 — but downstream effects matter. Batch jobs build up backlogs. Reconciliation lags. A backlog the system intended to clear over a quiet evening can grow large enough to need its own incident. The mitigation is to set explicit floors: even under heavy load, reserve a small percentage of the cap for low-priority traffic so the backlog drains at some rate. Netflix's library expresses this directly with partition percentages — live traffic guaranteed 90% of capacity, batch guaranteed 10%, neither starved when both are present.
The third is the retry storm. Shedding a request returns control to the caller, and the typical caller policy is to retry. If every shed request is retried immediately, the admission queue refills from clients faster than the server can drain it, and the same calls keep arriving with higher amplification. Netflix's solution is to halt all retries when server-side shedding is active and only allow high-priority retries under heavy load. The retry budget — a count of retries allowed per unit of time, expressed as a fraction of original requests — is the standard tool for this, but it only works when the budget is enforced on the client and the server cooperates with backoff hints (a Retry-After header, a structured error payload that carries the priority decision back).
The fourth is the one Uber spent the most time on: where the control loop lives. Earlier iterations placed control in a stateless routing layer in front of Docstore. The routing layer could not see partition-level load in time. By the time it learned that a partition was hot, the partition was already saturated. Uber's fix was to colocate admission control with the storage node itself: every node makes its own decision based on its own latency, queue depth, goroutine count, memory pressure, and I/O signals. The unified control loop combined these as plug-in inputs under what they call Bring Your Own Signal. The architecture matters: a load shedder that runs five hops away from the bottleneck is mostly a dashboard.
The Uber numbers from that redesign make the case concretely: throughput under overload up 80%, p99 latency on upsert operations down 70%, goroutine counts down 93%, peak heap usage down 60%. None of those gains came from adding capacity. They came from running existing capacity correctly under stress.
How I would actually wire this up
Pulling these pieces into something a single service owner could ship next sprint, the order I would take is small and concrete:
- Replace any per-service RPS cap with an in-process adaptive concurrency limit. Vegas or Gradient2 are reasonable defaults; both ship in Netflix's library and have been ported widely (Envoy's adaptive concurrency filter uses a Gradient-style algorithm, Resilience4j has a port, Go services tend to grab one of the open-source AIMD or Vegas implementations).
- Tag every inbound request with a priority. Two tiers is enough to begin: interactive vs background. The hard work is propagating the tag through RPC metadata so downstream services inherit it.
- Set partition floors. Reserve at least 10% of the limit for low-priority traffic so it is not starved over long incidents.
- Return structured rejections. A bare 429 says nothing. Include the priority at which the decision was made, a
Retry-Afterhint, and enough metadata for clients to back off intelligently. Without that, clients will pretend the rejection did not happen. - Track shed-by-priority as a first-class metric. The thing to alert on is not "shedding active" — that is the system working — but "shedding above tier N," because that is the point at which user-facing traffic is starting to lose.
What I would not do on the first pass: build a complex per-tenant quota service, deploy this in front of services rather than inside them, or roll out six tiers on day one. The cheap version of this — adaptive limit plus two priorities plus a partition floor — captures most of the available win. The expensive version is what Netflix and Uber built after years of running the cheap version in production and learning where it bent.
When to reach for this — and when to skip it
Use priority-aware load shedding when the service has heterogeneous traffic where some calls are visibly more important than others: user-facing reads vs background analytics, sync vs async writes, paying customers vs free tier, control plane vs data plane. Use it when retries from clients are already part of the failure mode under stress. Use it when autoscaling is too slow to cover the bursts that actually hurt, which is most autoscaling, most of the time.
Skip it when traffic is uniform — every request is the same kind of customer-facing read, with the same cost — because there is no taxonomy to draw a line through. In that case adaptive concurrency alone, without priority, gives most of the benefit. Skip it when the bottleneck is downstream and there is no way to push priority through, because then the priority is being decided on bad information. And skip it when the system can absorb 30 seconds of degradation without business impact, because the overhead of running and maintaining the taxonomy is real and not every service is worth it.
To recap:
- Static RPS limits cap the wrong variable; concurrency is what saturates a service, and Little's Law makes the relationship between RPS, latency, and concurrency explicit.
- Adaptive concurrency limits driven by latency (Vegas, Gradient2) move the cap on their own and absorb the seconds autoscaling cannot.
- A small priority taxonomy that travels with the request lets the system shed from the bottom under overload instead of failing everyone uniformly.
- Misclassification, low-priority starvation, retry storms, and control-loop placement are the four ways this design backfires in production; each has a known mitigation.
- Start with adaptive limit plus two tiers plus a partition floor. The expensive version is what large fleets ship, but the cheap version captures most of the benefit.
Still here? You might enjoy this.
Nothing close enough — try a different angle?
Related Posts
Actor-per-Entity vs Postgres Optimistic Locking: A Seat-Reservation Bake-off
I ran the same hot-key seat reservation workload two ways: Postgres with a version column and retries, and a single actor per seat. The actor design did not scale better — it moved the hard problem from concurrency control to routing and rebalance correctness, and that trade was the easier one to reason about under hot keys.
Idempotency Is a Protocol, Not a Key
The first time I shipped idempotency as a UUID header and a Redis lookup, a duplicate charge slipped through a week later. These are my notes on treating idempotency as a four-part protocol — dedup, determinism, concurrent safety, downstream propagation — with a minimal Kotlin plus Postgres implementation that holds up under retry.
Cell-Based Architecture Isn't Free: What Slack, DoorDash, and Roblox Actually Paid For It
Cell-based architecture contains blast radius, but it is not free. A look at what Slack, DoorDash, and Roblox actually paid for cells in production — and a checklist for the cheaper fault-isolation patterns most teams should reach for first.