Distributed Systems · 8 min read

Two-Phase Commit on the JVM: The Blocking Problem Nobody Puts in the Diagram

I crashed a Two-Phase Commit coordinator on purpose in a small Kotlin simulation to measure how long participants stay locked when the coordinator vanishes between phases. The result is the part of 2PC the diagrams never show — and the reason I would model most cross-service writes as a saga instead.

The diagrams I keep seeing for Two-Phase Commit show two clean phases. Prepare, vote, commit. Three or four boxes, a couple of arrows, problem solved. What the diagrams never show is what happens at the worst possible moment: the coordinator vanishing in the gap between phase one and phase two. In that moment the participants have already locked rows, written prepared records to their logs, and promised to commit if asked. They cannot abort. They cannot commit. They sit and wait.

Unmesh Joshi's Two-Phase Commit entry in the Patterns of Distributed Systems catalog states the rule plainly: each participant uses Write-Ahead Log-style durability so that, after a crash and a restart, it can still complete the protocol. That sentence reads like a footnote. It is the entire game. The cost of 2PC is not the extra round trip — it is the duration of that wait, and the fact that nobody in the diagram is responsible for ending it.

I wanted to put a number on that wait rather than gesture at it, so I built a deliberately small simulation in Kotlin and crashed the coordinator on purpose.

What I actually wanted to measure

Two questions, both about the prepared state:

  • Once a participant has voted yes, how long does it sit on its locks if nobody ever tells it the outcome?
  • When the coordinator does come back, what does its log actually let it recover?

These are not benchmark questions. They are correctness questions dressed as timing questions. The answer to the first is "as long as the coordinator stays down." The answer to the second is "less than I expected." Both answers are easier to feel after running the failure than after reading about it.

The simulation

A single Kotlin file, two participants, one coordinator, each writing to its own Write-Ahead Log file on disk. The coordinator has a flag that makes it throw between phase one and phase two. Nothing else.

kotlin
import java.io.File
import java.time.Instant

private val walDir = File("./wal").apply { mkdirs() }

class Participant(private val name: String) {
    private val log = File(walDir, "$name.wal")
    private var lockedSince: Instant? = null

    fun prepare(txn: String): Boolean {
        Thread.sleep(20) // pretend to do local work
        lockedSince = Instant.now()
        log.appendText("PREPARED $txn at $lockedSince\n")
        println("[$name] prepared $txn — locks held")
        return true
    }

    fun commit(txn: String) {
        log.appendText("COMMITTED $txn at ${Instant.now()}\n")
        lockedSince = null
        println("[$name] committed $txn")
    }

    fun lockHeldMs(): Long? =
        lockedSince?.let { Instant.now().toEpochMilli() - it.toEpochMilli() }
}

class Coordinator(
    private val participants: List<Participant>,
    var crashAfterPrepare: Boolean = false,
) {
    private val log = File(walDir, "coordinator.wal")

    fun run(txn: String) {
        log.appendText("BEGIN $txn\n")
        val votes = participants.map { it.prepare(txn) }
        log.appendText("PREPARED $txn\n")
        if (crashAfterPrepare) {
            println(">>> coordinator crashed before commit phase <<<")
            throw RuntimeException("simulated crash")
        }
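        if (votes.all { it }) {
            // Log the decision before acting on it, and only when every vote was yes.
            log.appendText("COMMIT $txn\n")
            participants.forEach { it.commit(txn) }
        } else {
            // This demo's participants always vote yes and have no abort(); record the outcome only.
            log.appendText("ABORT $txn\n")
        }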
    }
}

fun main() {
    val p1 = Participant("user-svc")
    val p2 = Participant("billing-svc")
    val coordinator = Coordinator(listOf(p1, p2), crashAfterPrepare = true)

    runCatching { coordinator.run("txn-42") }
        .onFailure { println("coordinator dead: ${it.message}") }

    repeat(5) {
        Thread.sleep(1000)
        println("after ${it + 1}s — user-svc locked ${p1.lockHeldMs()}ms, billing-svc locked ${p2.lockHeldMs()}ms")
    }
    println("\nrecovery would now read coordinator.wal — PREPARED found, no COMMIT.")
    println("participants must wait until the coordinator restarts or a human resolves.")
}

Run it with:

kotlinc TwoPhaseCommitDemo.kt -include-runtime -d demo.jar && java -jar demo.jar

The interesting part is not the I/O. It is the order: write PREPARED to disk before declaring yourself prepared, and write each participant's PREPARED line before the coordinator's COMMIT line. That ordering is what makes recovery possible at all. Without it, a power loss between voting and writing would leave a participant prepared in memory but undecided on disk, which is exactly the inconsistency the WAL is supposed to prevent.
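One caveat on "write to disk": as far as I know, the appendText calls in the simulation go through an ordinary output stream and do not force the bytes to the storage device, so the demo shows the ordering but not true crash durability. A minimal sketch of a stricter append, assuming a hypothetical helper named appendDurably that is not part of the simulation, might look like this:

```kotlin
import java.io.File
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.StandardOpenOption

// Hypothetical helper, not part of the simulation above.
// Appends one record and forces it to the storage device before returning.
fun appendDurably(log: File, record: String) {
    FileChannel.open(
        log.toPath(),
        StandardOpenOption.CREATE,
        StandardOpenOption.WRITE,
        StandardOpenOption.APPEND,
    ).use { channel ->
        val buffer = ByteBuffer.wrap((record + "\n").toByteArray(Charsets.UTF_8))
        // write() may write fewer bytes than requested, so loop until drained.
        while (buffer.hasRemaining()) channel.write(buffer)
        // force(true) asks the OS to flush file content and metadata to the device.
        channel.force(true)
    }
}
```

A participant would call appendDurably(log, "PREPARED $txn") before returning its vote. Opening the channel per record keeps the sketch short; a real log would keep the channel open and pay only for the force.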

The deliberate throw lives between the participants' prepared writes and the coordinator's commit write. That is the only window I cared about. Coordinator crashes elsewhere are recoverable: before prepare, the transaction never happened; after commit, recovery reads the commit and finishes the job. Between the two, both sides have written enough to be stuck and not enough to move.

What the timing actually shows

The diagram below is the picture I wish accompanied every 2PC explainer — the lock-hold window opening at vote-yes and never closing on its own.

Run the file and the prepared state stays. The lockHeldMs() counter keeps climbing for as long as the program runs, because nothing in the simulation ever sets lockedSince back to null. The participants are not broken. They are doing the right thing: refusing to abort a transaction they promised to commit, and refusing to commit a transaction nobody confirmed.

In a real system the wait is bounded by whatever runs the coordinator. If the coordinator process restarts in seconds, the lock window is seconds. If resolution requires a human to spot a stuck transaction in Oracle's DBA_2PC_PENDING view and decide its fate, the window can be hours. Oracle's distributed transactions documentation describes this exact case and ships a recovery process called RECO whose entire job is to chase down these orphans.

That is the shape of the cost: not a constant overhead added to a happy-path latency number, but a long tail that depends entirely on operational mean-time-to-recover for the coordinator.
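To make "spot a stuck transaction" concrete for the simulation's own log format, here is a rough sketch of a scan over a participant WAL. The function name inDoubtTransactions and the ABORTED record type are assumptions for illustration; the simulation only ever writes PREPARED and COMMITTED lines.

```kotlin
// Hypothetical scan over a participant WAL in the format the simulation writes:
//   "PREPARED <txn> at <timestamp>" and "COMMITTED <txn> at <timestamp>".
// Returns the ids of transactions that were prepared but never resolved.
fun inDoubtTransactions(walLines: List<String>): Set<String> {
    val inDoubt = LinkedHashSet<String>()
    for (line in walLines) {
        val parts = line.trim().split(" ")
        if (parts.size < 2) continue // skip blank or malformed lines
        when (parts[0]) {
            "PREPARED" -> inDoubt.add(parts[1])
            // ABORTED is an assumed record type; the demo never writes it.
            "COMMITTED", "ABORTED" -> inDoubt.remove(parts[1])
        }
    }
    return inDoubt
}
```

Calling inDoubtTransactions(File("./wal/user-svc.wal").readLines()) after the crashed run should report the transaction that is still holding locks. This is only the detection half; deciding the outcome still needs the coordinator's log or a human.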

What recovery actually restores

The next part surprised me when I dug into it carefully. A participant's WAL only tells the participant what it promised. It does not tell the participant what was decided. The decision lives in the coordinator's log.

So when the coordinator restarts and reads its WAL, it can do one of two things:

  • If it finds a COMMIT record, it tells participants to commit, and recovery succeeds.
  • If it finds only a PREPARED record and never reached COMMIT, it has to ask the participants to abort — but the participants, on their side, only know they voted yes.
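The two cases above can be sketched as a pass over coordinator.wal. This is a simplified illustration, not a full recovery procedure: the names Decision and recoverDecisions are hypothetical, and it assumes that any transaction without a COMMIT record is aborted.

```kotlin
enum class Decision { COMMIT, ABORT }

// Hypothetical recovery pass over coordinator.wal in the simulation's format:
//   "BEGIN <txn>", "PREPARED <txn>", "COMMIT <txn>".
// Assumption: a transaction with no COMMIT record is aborted.
fun recoverDecisions(coordinatorLog: List<String>): Map<String, Decision> {
    val decisions = LinkedHashMap<String, Decision>()
    for (line in coordinatorLog) {
        val parts = line.trim().split(" ")
        if (parts.size < 2) continue // skip blank or malformed lines
        val (record, txn) = parts
        when (record) {
            // Seen but undecided so far: default to ABORT.
            "BEGIN", "PREPARED" -> decisions.putIfAbsent(txn, Decision.ABORT)
            // A durable COMMIT record overrides the default.
            "COMMIT" -> { decisions[txn] = Decision.COMMIT }
        }
    }
    return decisions
}
```

For the crashed run in the simulation, the log holds BEGIN and PREPARED for txn-42 and nothing else, so the sketch maps it to ABORT. The coordinator would then still have to reach every participant with that outcome before any lock is released.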

Reconciliation depends on the coordinator surviving with its WAL intact. That is a single point of failure dressed as a protocol step. The literature has known this since at least the early 1980s, and Pat Helland's ACM Queue piece "Life beyond Distributed Transactions: an Apostate's Opinion" argued bluntly that distributed transactions across services should be replaced with messaging and idempotent application-level work, not because 2PC is wrong, but because the coordinator is too important and too fragile for the way services are actually deployed.

In a single database with a strong transaction manager — Oracle, PostgreSQL prepared transactions, an XA resource manager — the coordinator's durability story is solid. Spring Boot's JTA support — historically backed by embedded managers like Atomikos or Narayana, and in Spring Boot 4 leaning on a JTA transaction manager exposed via JNDI — writes its coordinator log to a known location alongside the application. The contract holds because the coordinator and the resources are tightly co-located and operationally owned by the same group.

In a distributed system spanning two services with separate runtimes, separate deploys, and separate on-call rotations, that contract starts to leak. The coordinator is one process owned by one group; the locked rows live in databases owned by other groups. When the coordinator process is gone, somebody has to walk into the right database and resolve the in-doubt transaction by hand. That is operational coupling the diagram never shows.

When the blocking window is fine, and when sagas win

The simulation is small enough to be honest about. Two participants. Local disk. No network partition. The lock window is unbounded only because I chose not to restart the coordinator. In a co-located transaction across two database instances run by the same operator, with a tight RTO on the coordinator process, the window is short and the protocol earns its keep.

Across services, the calculus flips. A saga — a chain of local transactions, each followed by a compensating action if a later step fails — never holds a global lock. The cost is real: each step commits in isolation, intermediate states are visible to other readers, and compensations have to be designed and tested rather than handed to a transaction manager. But the failure mode is the one most operators already know how to handle. If a compensation fails, alert and retry, and other transactions keep moving in the meantime. There is no in-doubt state, no orphaned lock, no stuck row waiting for a human.
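For contrast with the simulation, a saga runner can be sketched in a few lines. This is a minimal in-process illustration with hypothetical names (SagaStep, runSaga); it assumes compensations are idempotent and do not throw, and it ignores persistence of saga state, which a real implementation would need.

```kotlin
// One saga step: a local transaction plus the action that undoes it.
// All names here are hypothetical and for illustration only.
class SagaStep(
    val name: String,
    val action: () -> Unit,
    val compensation: () -> Unit,
)

// Runs steps in order. If one fails, compensates the completed steps in
// reverse order and returns false. Returns true if every step succeeded.
// Assumption: compensations are idempotent and do not throw; a real
// system would retry and alert on a failed compensation.
fun runSaga(steps: List<SagaStep>): Boolean {
    val completed = ArrayDeque<SagaStep>()
    for (step in steps) {
        try {
            step.action()
            completed.addFirst(step) // most recent step first
        } catch (e: Exception) {
            completed.forEach { it.compensation() }
            return false
        }
    }
    return true
}
```

Each action commits locally as soon as it finishes, so nothing stays locked while a later step is pending. The price is visible in the signature: every step needs a hand-written compensation.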

For the scenario that prompted this study — a Spring service writing to a user service and a billing service — I would not reach for 2PC. The blocking window I just measured is the reason. It is bounded only by how fast the coordinator comes back, and "fast" is doing a lot of work in that sentence.

Takeaways

  • 2PC's real cost is the time participants spend locked while the coordinator is missing. Measure that window before adopting the protocol.
  • The Write-Ahead Log on each participant makes recovery possible, but only the coordinator's log records the decision. Lose that log and the protocol cannot resolve.
  • For two databases under the same operator, with a strong transaction manager and a short coordinator restart time, 2PC is a reasonable default.
  • For two services with independent runtimes and on-call rotations, model the work as a saga and accept eventual consistency. The blocking window of 2PC is a worse failure mode than the visibility window of a saga.

Use 2PC when

  • All participating resources are co-located and operationally owned together.
  • The transaction is short and contention on the locked rows is low.
  • Coordinator restart time is bounded and well practiced.

Avoid 2PC when

  • Participants live in different services with independent deploy cycles.
  • The coordinator's runtime is not as well protected as the data it locks.
  • Longer compensations and visibility of intermediate states are acceptable in exchange for liveness.
