Actor-per-Entity vs Optimistic Locking for Hot Seats


When I see "must not double-book", my reflex is the same one most backend engineers have: a transaction, a version column, retries with jitter. It is the cheapest correct thing on a stack I already run. I wanted to feel the alternative end-to-end before defaulting to it again, so I built the same seat-reservation workload twice and pushed contention at it until both designs squealed. The findings did not match my intuition.

The workload I pinned both designs against

200 concurrent buyers chasing 50 seats. Each buyer fires a single reservation request at a randomly chosen seat. The hot tail is brutal: 80% of requests land on 10 of those seats. The invariant is single-line — a seat has at most one holder — but the contention shape makes that line load-bearing.
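To make the contention shape concrete, here is a minimal sketch of a request generator with that skew. The seat numbering (seats 0 to 9 as the hot ones), the uniform choice inside each group, and the fixed seed are assumptions for illustration, not details of the runs described in this post.

```kotlin
import kotlin.random.Random

val totalSeats = 50   // from the workload description
val hotSeats = 10     // assumption: seats 0..9 are the hot ones
val hotShare = 0.8    // fraction of requests aimed at the hot seats

// Pick a target seat: 80% of the time a hot seat, otherwise one of the remaining 40.
// Uniform choice inside each group is an assumption.
fun pickSeat(rng: Random): Int =
    if (rng.nextDouble() < hotShare) rng.nextInt(hotSeats)
    else hotSeats + rng.nextInt(totalSeats - hotSeats)

// One request per buyer; the seed only makes the sketch repeatable.
fun workload(buyers: Int = 200, rng: Random = Random(42)): List<Int> =
    List(buyers) { pickSeat(rng) }

val perSeat = workload().groupingBy { it }.eachCount()
val onHotSeats = perSeat.filterKeys { it < hotSeats }.values.sum()
println("requests on hot seats: $onHotSeats of 200")
```

With 200 buyers, roughly 160 requests end up on 10 rows, which is where both designs get tested.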

I ran both designs on a laptop with a local Postgres 18, a fixed connection pool of 32, and Kotlin 1.9 on a single JVM. I am not measuring distributed throughput. I am measuring how each design behaves when 200 callers fight over the same row.

Design A: Postgres with a version column

The version-column pattern is well-trodden. Read the seat row and its version, check that holder IS NULL, then write the holder and bump the version in an UPDATE conditioned on the version that was read. Retry when that UPDATE matches no rows, or when the transaction aborts with SQLSTATE 40001 (serialization_failure). The PostgreSQL 18 docs are explicit on the contract: applications running at Repeatable Read or Serializable must be prepared to retry on SQLSTATE 40001, and the database does not offer an automatic retry "since it cannot do so with any guarantee of correctness." The same page also warns that under very high contention, completing a single transaction can require many attempts before one wins.
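The shape of that loop can be sketched without a database. This is an in-memory stand-in, not the code I ran against Postgres: an AtomicReference plays the seat row, and compareAndSet plays the conditional UPDATE. SeatRow, Outcome, reserve, and the table and column names in the comment are all hypothetical.

```kotlin
import java.util.concurrent.atomic.AtomicReference

// In-memory stand-in for one seat row: the holder plus the version column.
data class SeatRow(val holder: String?, val version: Long)

enum class Outcome { RESERVED, TAKEN, GAVE_UP }

// compareAndSet stands in for something like
//   UPDATE seat SET holder = ?, version = version + 1 WHERE id = ? AND version = ?
// and a failed compareAndSet stands in for "zero rows updated" or SQLSTATE 40001.
fun reserve(row: AtomicReference<SeatRow>, buyer: String, maxAttempts: Int = 5): Outcome {
    repeat(maxAttempts) {
        val seen = row.get()                           // read the row and its version
        if (seen.holder != null) return Outcome.TAKEN  // someone already holds the seat
        val next = SeatRow(buyer, seen.version + 1)    // write the holder, bump the version
        if (row.compareAndSet(seen, next)) return Outcome.RESERVED
        // another writer changed the row since we read it: go around again
    }
    return Outcome.GAVE_UP
}

val row = AtomicReference(SeatRow(holder = null, version = 0))
println(reserve(row, "buyer-1"))   // RESERVED
println(reserve(row, "buyer-2"))   // TAKEN
```

The invariant holds because only one writer can move the row from version 0 to version 1. Everyone else either sees the holder or loses the compare and has to re-read.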

Under low contention, this works fine. Under the hot-tail workload above, two things happen.

First, the retry budget gets eaten. Most reservations need two or three attempts; the hottest seats see double-digit retries before any single transaction wins. Optimistic locking degrades when conflicts are frequent because retries multiply work without making progress — the long-standing anti-pattern documented for Postgres read-modify-write cycles.

Second, the work amplifies upstream. Each retry burns a connection from the pool, holds it through a network round-trip, and competes with other retries for the same row. The pool fills, latency climbs, and a few unlucky buyers exhaust my retry cap and bubble a 409 back to the caller. In my throwaway runs, p99 stretched into the hundreds of milliseconds well before throughput stabilised.
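For reference, a capped retry loop with jittered backoff might look like the sketch below. SerializationFailure is a hypothetical stand-in for whatever exception a driver raises for SQLSTATE 40001, and the 5 ms base and 200 ms ceiling are assumed values, not the ones from my runs.

```kotlin
import kotlin.random.Random

// Hypothetical stand-in for a driver error carrying SQLSTATE 40001.
class SerializationFailure : RuntimeException("40001")

// Full jitter: a random delay in [0, min(capMs, baseMs * 2^attempt)].
fun backoffMillis(
    attempt: Int,
    baseMs: Long = 5,
    capMs: Long = 200,
    rng: Random = Random.Default
): Long {
    val ceiling = minOf(capMs, baseMs shl minOf(attempt, 20))
    return rng.nextLong(ceiling + 1)
}

// Runs block up to maxAttempts times. Returns null once the cap is exhausted,
// which a caller could translate into a 409.
fun <T : Any> withRetries(
    maxAttempts: Int,
    sleep: (Long) -> Unit = { ms -> Thread.sleep(ms) },
    block: (attempt: Int) -> T
): T? {
    for (attempt in 0 until maxAttempts) {
        try {
            return block(attempt)
        } catch (e: SerializationFailure) {
            // no point sleeping after the final failed attempt
            if (attempt < maxAttempts - 1) sleep(backoffMillis(attempt))
        }
    }
    return null
}

val result = withRetries(maxAttempts = 4) { attempt ->
    if (attempt < 2) throw SerializationFailure()   // the first two attempts lose the race
    "reserved on attempt ${attempt + 1}"
}
println(result)   // reserved on attempt 3
```

Jitter spreads the retries out, but every attempt still occupies a pooled connection, so the amplification described above does not go away.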

The correctness invariant holds — that is the whole point of the version column. But the cost of holding it scales with conflict rate, not with traffic, and a seat row under a hot tail is pure conflict.

Design B: one actor per seat

The actor-per-entity sketch is what Microsoft Orleans calls a virtual actor and what Akka Cluster Sharding calls a sharded entity. A single in-memory actor owns the seat. Every reservation request lands in its mailbox. The actor processes them one at a time, on a single thread, and the conflict simply does not exist — the second buyer for the same seat reads "already taken" because the first one already mutated state by the time message two is dequeued. Orleans documents this as a single-activation, single-threaded execution guarantee under non-failure conditions.

In my notes, the local version of this idea is just a Kotlin coroutine with a channel. No cluster, no persistence, but the invariant is the same: mailbox order collapses concurrency control into ordinary local mutation.

```kotlin
#!/usr/bin/env kotlin
@file:DependsOn("org.jetbrains.kotlinx:kotlinx-coroutines-core-jvm:1.8.0")

import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Commands the seat actor understands; the reply channel travels with the request.
sealed interface SeatCmd {
    data class Reserve(
        val buyer: String,
        val reply: CompletableDeferred<Boolean>
    ) : SeatCmd
}

// One coroutine owns the seat state and drains the mailbox one message at a time.
fun CoroutineScope.seatActor(): Channel<SeatCmd> {
    val mailbox = Channel<SeatCmd>(Channel.UNLIMITED)
    launch {
        var holder: String? = null  // only ever touched by this coroutine
        for (msg in mailbox) when (msg) {
            is SeatCmd.Reserve -> {
                if (holder == null) {
                    holder = msg.buyer
                    msg.reply.complete(true)
                } else {
                    msg.reply.complete(false)
                }
            }
        }
    }
    return mailbox
}

runBlocking {
    val seat = seatActor()
    // 200 buyers race for the same seat from the default dispatcher's threads.
    val results = (1..200).map { i ->
        async(Dispatchers.Default) {
            val reply = CompletableDeferred<Boolean>()
            seat.send(SeatCmd.Reserve("buyer-$i", reply))
            reply.await()
        }
    }
    val winners = results.awaitAll().count { it }
    check(winners == 1) { "expected one winner, got $winners" }
    println("one buyer won out of 200; mailbox order made the invariant local")
    seat.close()  // closing the mailbox ends the actor's loop so runBlocking can return
}
```

Save it as seat.main.kts and run it with kotlin seat.main.kts.

The line that carries the whole concurrency model is for (msg in mailbox). There is no SELECT FOR UPDATE, no version check, no retry. The coroutine's mailbox serialises every command for the entity, and the holder check is just a local if. In a real system, this actor would be backed by an Akka or Orleans cluster that pins one activation per seat ID across the cluster. The local version is the pedagogical one.

Under the same hot-tail workload, contention does not produce retries. It produces queueing. The actor processes its mailbox in arrival order, so latency at the hot seats becomes a function of mailbox depth, not conflict probability. Tail latency grows roughly linearly with queue depth instead of compounding through retries. Throughput becomes "how fast can one CPU run my reservation function", which for trivial logic is the limit of a single core, not the limit of a contended row.
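A back-of-the-envelope version of that queueing claim, as a sketch: if every message costs the same to handle, the message at position k in the mailbox completes after k handler runs. The 50 microsecond handler cost is an assumed number, not a measurement. The 16 requests per hot seat is the average implied by the workload (80% of 200 requests over 10 seats).

```kotlin
// Completion time of each queued message when one actor handles them in order.
// serviceMicros is an assumed, constant handler cost.
fun completionMicros(depth: Int, serviceMicros: Long): List<Long> =
    (1..depth).map { position -> position * serviceMicros }

// 80% of 200 requests spread over 10 hot seats: 16 requests per hot seat on average.
val perHotSeat = (200 * 80 / 100) / 10
val times = completionMicros(perHotSeat, serviceMicros = 50)
println("depth=$perHotSeat first=${times.first()}us last=${times.last()}us")
// depth=16 first=50us last=800us
```

The unluckiest buyer waits for every message ahead of theirs and no longer; nothing gets redone.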

What changed when a node disappeared

In Design A, the hard problem is concurrency control in the storage layer. The database enforces the invariant; the application copes with conflicts.

In Design B, the hard problem is which node owns this entity right now, and what happens when that answer changes mid-write. Every cluster-sharded actor framework has to solve this. Akka Cluster Sharding's handoff procedure was the most thoroughly documented one I read while digging in: when the coordinator decides to rebalance shard 7 from region A to region B, region A starts buffering inbound messages, sends PoisonPill to all of its entity actors, acks HandoffComplete to the coordinator, and only then does region B activate the entity and drain the buffered messages.
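The ordering of those steps is easier to hold in my head as a toy model. This is not Akka's API: Region, ShardRouter, and their methods are hypothetical names, and the sketch collapses the coordinator and both regions into one process so that only the sequence (buffer, stop, move, activate, drain) is visible.

```kotlin
// Toy model of a shard handoff; all names here are hypothetical, not Akka classes.
class Region(val name: String) {
    var entityRunning = false
    val processed = mutableListOf<String>()

    fun deliver(msg: String) {
        check(entityRunning) { "$name received $msg with no running entity" }
        processed += msg
    }
}

class ShardRouter(private var owner: Region) {
    private var handingOff = false
    private val buffer = ArrayDeque<String>()

    init { owner.entityRunning = true }

    // While a handoff is in progress, inbound messages are parked instead of delivered.
    fun send(msg: String) {
        if (handingOff) buffer.addLast(msg) else owner.deliver(msg)
    }

    fun beginHandoff() {
        handingOff = true              // start buffering inbound messages
        owner.entityRunning = false    // stop the entity on the old region
    }

    fun completeHandoff(next: Region) {
        owner = next                   // ownership moves only after the old entity stopped
        next.entityRunning = true      // the new region activates the entity
        while (buffer.isNotEmpty()) next.deliver(buffer.removeFirst())  // drain in arrival order
        handingOff = false
    }
}

val a = Region("A")
val b = Region("B")
val router = ShardRouter(a)
router.send("m1")
router.beginHandoff()
router.send("m2")
router.send("m3")
router.completeHandoff(b)
router.send("m4")
println("A=${a.processed} B=${b.processed}")   // A=[m1] B=[m2, m3, m4]
```

At no point in this model are both entities running, and the buffered messages reach region B in the order they were sent.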

The diagram below traces that timeline. The property worth noticing is that messages buffer through the entire stop-and-restart, but they do not get reordered or duplicated as long as the framework's invariants hold.

Akka also makes it explicit that entity state is not transferred during handoff. If the seat actor cared about who held the seat after a rebalance, it had to persist that state to a journal and replay it on the new node. The state machine moves; the bytes do not.
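A minimal sketch of that persist-and-replay idea, assuming an event-sourced seat. The Journal here is a hypothetical in-memory list standing in for durable storage, and SeatEntity is an illustration rather than any framework's persistence API.

```kotlin
sealed interface SeatEvent {
    data class Reserved(val buyer: String) : SeatEvent
}

// Hypothetical in-memory journal; a real one would be durable storage.
class Journal {
    private val events = mutableListOf<SeatEvent>()
    fun append(event: SeatEvent) { events += event }
    fun replay(): List<SeatEvent> = events.toList()
}

class SeatEntity(private val journal: Journal) {
    var holder: String? = null
        private set

    // A fresh activation rebuilds its state from the journal before handling anything.
    init { journal.replay().forEach { applyEvent(it) } }

    private fun applyEvent(event: SeatEvent) {
        when (event) {
            is SeatEvent.Reserved -> holder = event.buyer
        }
    }

    fun reserve(buyer: String): Boolean {
        if (holder != null) return false
        val event = SeatEvent.Reserved(buyer)
        journal.append(event)   // persist first
        applyEvent(event)       // then mutate in-memory state
        return true
    }
}

val journal = Journal()
val onNodeA = SeatEntity(journal)
println(onNodeA.reserve("buyer-1"))   // true
val onNodeB = SeatEntity(journal)     // fresh activation after a rebalance
println(onNodeB.holder)               // buyer-1
println(onNodeB.reserve("buyer-2"))   // false
```

The second activation never saw the first one's memory. It only knows who holds the seat because the event was journaled before the reply went out.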

Orleans makes the same point with a sharper edge: under failure-free conditions, an actor has exactly one activation, but the distributed directory is eventually consistent, and during cluster topology changes "multiple activations of a single activation grain may coexist" until the directory converges. This is not a theoretical concern. Akka's Split Brain Resolver exists precisely because, during a network partition, each side can conclude that the other is gone and start its own copy of the same entity. If both copies write to a shared journal, the journal is now corrupt.

So the actor-per-entity invariant rests on three things, all of which I now own:

the routing layer correctly maps seat-7 to one and only one node
the rebalance protocol drains in-flight messages before reallocating
the cluster-membership decision is consistent enough that two halves do not both decide they own seat-7
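The first item in that list can be sketched as a two-step lookup: entity ID to shard, shard to node. ShardTable is a hypothetical name, the shard count of 8 and the node names are assumptions, and a real framework keeps this table in a coordinator rather than in one local object.

```kotlin
// Hypothetical routing table: entity id -> shard -> node.
class ShardTable(private val numberOfShards: Int, nodes: List<String>) {
    // Initial allocation spreads shards round-robin across the nodes.
    private val ownerByShard = MutableList(numberOfShards) { shard -> nodes[shard % nodes.size] }

    // floorMod keeps the shard index non-negative even for negative hash codes.
    fun shardOf(entityId: String): Int = Math.floorMod(entityId.hashCode(), numberOfShards)

    // Because this is a plain function of the table, an entity has exactly one owner at a time.
    fun ownerOf(entityId: String): String = ownerByShard[shardOf(entityId)]

    // A rebalance is a change to the table, not to the hashing.
    fun reassign(shard: Int, node: String) { ownerByShard[shard] = node }
}

val table = ShardTable(numberOfShards = 8, nodes = listOf("node-a", "node-b"))
val before = table.ownerOf("seat-7")
table.reassign(table.shardOf("seat-7"), "node-c")
println("seat-7: $before -> ${table.ownerOf("seat-7")}")
```

Inside one process the "one and only one node" property is trivial. The hard part in a cluster is getting every node to agree on the same table at the same moment, which is what the second and third items are about.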

Optimistic locking has a smaller surface area. It is also a worse fit for hot keys.

What I would actually reach for

For a workload where the invariant is per-entity and the contention is hot-tailed, I now reach for actor-per-entity first. Mailbox order is a more economical concurrency story than retry budgets, and the routing-and-rebalance work is bounded — it lives in the framework, not in every business handler. The InfoQ piece on Durable Objects framed this pattern as a correctness tool rather than a performance tool, and that framing matched what I felt running the two designs side by side.

For workloads where the invariant spans multiple entities — a transfer that debits one account and credits another — the actor design becomes harder, not easier. The stateless-service-plus-database pattern still has the better story there, because the database transaction is the natural place to compose two writes. Pat Helland's "Life Beyond Distributed Transactions" remains the cleanest articulation of why crossing the single-entity boundary in a stateful actor world deserves a long pause.

If I had to compress this into a checklist for tomorrow:

prefer actor-per-entity when the invariant lives inside one entity and the workload has hot keys
prefer Postgres with a version column and retries when invariants span entities or contention is low
in either case, name the new failure mode you just adopted: retry storms in one, routing and rebalance correctness in the other

The takeaway I want to keep is that "scales better" is the wrong frame. Actor-per-entity does not scale better than optimistic locking. It moves the hard problem to a place where, for hot-key transactional workloads, the failure modes are easier to reason about — provided I treat the routing layer as the new invariant I have to defend.

References

PostgreSQL 18 docs — Serialization Failure Handling — https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html
EnterpriseDB — Postgres anti-patterns: read-modify-write cycles — https://www.enterprisedb.com/blog/postgresql-anti-patterns-read-modify-write-cycles
Microsoft Orleans overview — https://learn.microsoft.com/en-us/dotnet/orleans/overview
Akka Cluster Sharding concepts — https://doc.akka.io/libraries/akka-core/current/typed/cluster-sharding-concepts.html
Akka Split Brain Resolver — https://doc.akka.io/libraries/akka-core/current/split-brain-resolver.html
InfoQ — One Cache to Rule Them All: Handling Responses and In-Flight Requests with Durable Objects — https://www.infoq.com/articles/durable-objects-handle-inflight-requests/
Pat Helland — Life Beyond Distributed Transactions — https://www.ics.uci.edu/~cs223/papers/cidr07p15.pdf
