Event-Log-as-Source-of-Truth Turns Schema Evolution Into a Forever Problem
When the log is the source of truth, every schema change is permanent. A Kotlin/Avro walkthrough of the rename that passed the Schema Registry check and silently corrupted every old event, plus the Protobuf and Avro invariants I now keep pinned above my desk.
I spent the last few weeks moving a toy banking service from "Kafka as a bus" to "Kafka as the authoritative store". The shift looked cosmetic at first. The topics already existed. Consumers already built state from them. What changed was the retention policy: infinite. Every event became a permanent artifact. The InfoQ talk Event-Driven Patterns for Cloud-Native Banking — What Works, What Hurts? kept flagging this cost. I did not appreciate how far-reaching it was until I ran my first "safe" schema change and noticed a replay returning the wrong numbers.
This is a write-up of what I found while digging into Avro resolution rules, the Protobuf wire format, and the Confluent Schema Registry's compatibility modes. The short version: "backward compatible" is not a single property. Once the log is authoritative, I inherit every schema decision I have ever made, not just the last one.
The compatibility matrix I ended up drawing
The problem has four independent axes, not two: producer schema version, consumer schema version, retention window, and archived events older than live retention. The diagram below is the one I keep pinned while I think through a change.

With 7-day retention on a plain Kafka topic and a single producer/consumer pair, the Schema Registry default of BACKWARD feels sufficient. That default only checks that the new schema can read data written by the last schema. BACKWARD_TRANSITIVE checks across every prior version.
When the log is the source of truth, retention is effectively infinite, and archived events are not a separate bucket — they are the system of record. Every change has to be readable by every future consumer, all the way back to v1. The default BACKWARD mode is not enough; the only safe default I found was BACKWARD_TRANSITIVE, or FULL_TRANSITIVE if I also need consumers still running older schemas to keep reading events written with newer ones. Confluent's docs are explicit about the difference: BACKWARD only validates the new schema against the immediately previous one, while BACKWARD_TRANSITIVE validates it against every schema ever registered for the subject (Confluent Schema Evolution).
That single toggle is the most important one in this post.
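For the Confluent Schema Registry, that toggle is per-subject configuration, set through the registry's REST API. Here is a minimal Kotlin sketch of flipping it; the localhost:8081 URL and the tx-value subject name are placeholders for whatever the actual setup uses:
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

val registry = "http://localhost:8081"   // assumed local Schema Registry
val subject = "tx-value"                 // hypothetical subject for the Tx topic

// PUT /config/{subject} changes the compatibility mode for this subject only.
val request = HttpRequest.newBuilder()
    .uri(URI.create("$registry/config/$subject"))
    .header("Content-Type", "application/vnd.schemaregistry.v1+json")
    .PUT(HttpRequest.BodyPublishers.ofString("""{"compatibility": "BACKWARD_TRANSITIVE"}"""))
    .build()

val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
println("${response.statusCode()} ${response.body()}")   // expect 200 and the new compatibility echoed back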
The rename that passed the check and silently corrupted every old record
Here is a safe-looking Avro evolution I tried. I took a Tx record with id: long and amount: int and renamed amount to amount_cents. I gave the new field a default value of 0, because that is what every tutorial does to satisfy BACKWARD.
The registry said COMPATIBLE. The producer deployed. A consumer running the new schema replayed history, and every pre-rename transaction came back with amount_cents = 0.
Here is a single-file Kotlin reproduction against Avro 1.11:
@file:DependsOn("org.apache.avro:avro:1.11.3")
import org.apache.avro.Schema
import org.apache.avro.SchemaCompatibility
fun parse(s: String): Schema = Schema.Parser().parse(s)
val v1 = parse("""
{"type":"record","name":"Tx","fields":[
{"name":"id","type":"long"},
{"name":"amount","type":"int"}
]}
""".trimIndent())
val v2NoAlias = parse("""
{"type":"record","name":"Tx","fields":[
{"name":"id","type":"long"},
{"name":"amount_cents","type":"int","default":0}
]}
""".trimIndent())
val v2WithAlias = parse("""
{"type":"record","name":"Tx","fields":[
{"name":"id","type":"long"},
{"name":"amount_cents","type":"int","default":0,"aliases":["amount"]}
]}
""".trimIndent())
listOf("no alias" to v2NoAlias, "with alias" to v2WithAlias).forEach { (label, reader) ->
val r = SchemaCompatibility.checkReaderWriterCompatibility(reader, v1)
println("rename $label -> ${r.type}")
}
Run it: kotlin rename.main.kts
Both cases print COMPATIBLE. The no-alias version passes because Avro's resolver sees two independent things: the writer has amount and the reader does not — drop it; the reader has amount_cents and the writer does not — fill from default. Nothing errors; nothing is preserved. Add aliases: ["amount"] and the resolver maps the old field into the new one and reads the real value.
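To watch the corruption happen rather than infer it, here is a sketch of the decode path. It assumes the same @file:DependsOn line and the v1, v2NoAlias, and v2WithAlias definitions from the script above, with the extra imports placed next to the existing ones; the 4250-cent amount is just an illustrative value.
import org.apache.avro.generic.GenericData
import org.apache.avro.generic.GenericDatumReader
import org.apache.avro.generic.GenericDatumWriter
import org.apache.avro.generic.GenericRecord
import org.apache.avro.io.DecoderFactory
import org.apache.avro.io.EncoderFactory
import java.io.ByteArrayOutputStream

// Encode one event with the v1 writer schema, the way an old producer would have.
val oldEvent = GenericData.Record(v1).apply { put("id", 1L); put("amount", 4250) }
val out = ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
GenericDatumWriter<GenericRecord>(v1).write(oldEvent, encoder)
encoder.flush()
val bytes = out.toByteArray()

// Decode the same bytes with each v2 reader schema.
listOf("no alias" to v2NoAlias, "with alias" to v2WithAlias).forEach { (label, reader) ->
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val decoded = GenericDatumReader<GenericRecord>(v1, reader).read(null, decoder)
    println("$label -> amount_cents = ${decoded.get("amount_cents")}")
}
// no alias   -> amount_cents = 0    (the real value is silently dropped)
// with alias -> amount_cents = 4250 (the alias maps the old field onto the new one)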
The lesson: the registry's COMPATIBLE verdict is about the wire, not the semantics. For a rename, the alias is doing the real work. Skip it and the check becomes a rubber stamp over silent data loss. In a bus-style pipeline with a 7-day window this would self-heal once old events aged out. In an authoritative log, it is permanent.
Protobuf's invariants look similar but are not the same
Protobuf encodes fields by tag number, not by name, so proto3's Updating A Message Type rules draw a different danger zone.
Renames are wire-safe for free — the tag is what matters, so renaming amount to amount_cents produces bytes that old consumers decode correctly. No alias machinery needed. Source code for generated classes does break, which is a different problem.
Tag reuse is the trap. If I delete a field and a teammate later adds a different field with the same tag number, old events on disk decode into the new field, with the wrong type or meaning. That is exactly what reserved exists to prevent. Once I delete tag 5, I write reserved 5; reserved "amount"; so no future author can reuse either the number or the name. A retention-forever log means the reservation stays in the schema forever too.
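In the .proto file, the whole discipline is a couple of lines. Here is a sketch of what the Tx message might look like after the deletion; the surviving tag numbers are assumptions, and the reserved statements are the point:
syntax = "proto3";

message Tx {
  // The deleted amount field lived at tag 5; neither the number nor the name
  // can ever be reused by a future edit of this message.
  reserved 5;
  reserved "amount";

  int64 id = 1;             // assumed tag
  int64 amount_cents = 6;   // the replacement field gets a fresh tag, never 5
}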
int32 -> int64 is wire-safe in the widening direction, because varint encoding uses the minimum number of bytes for each value. Going the other way truncates silently for values that no longer fit. Widening is a one-way door.
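The truncation is easy to see with protobuf-java's low-level reader and writer, no generated classes needed. A sketch as a standalone script, assuming protobuf-java 3.25.x; the 5,000,000,000 value is just something that does not fit in 32 bits:
@file:DependsOn("com.google.protobuf:protobuf-java:3.25.3")
import com.google.protobuf.CodedInputStream
import com.google.protobuf.CodedOutputStream

// Encode a value that needs more than 32 bits, the way an int64 producer would.
val buf = ByteArray(10)
val out = CodedOutputStream.newInstance(buf)
out.writeInt64NoTag(5_000_000_000L)
out.flush()

// An int64 reader gets the value back; an int32 reader keeps only the low 32 bits.
println(CodedInputStream.newInstance(buf).readInt64())   // 5000000000
println(CodedInputStream.newInstance(buf).readInt32())   // 705032704, silent truncation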
singular -> repeated is listed as compatible in the spec for string, bytes, and message fields: a singular reader handed multiple values keeps only the last one for strings and bytes, and merges the inputs for messages. Numeric scalars are the exception, because proto3 packs repeated numerics by default and a singular reader cannot parse the packed form. For the safe types, mixing modes in the same log works for decoding; older events just come back as one-element lists.
Same failure class as Avro, different invariants. Protobuf's safety net is tag numbers plus reserved. Avro's is aliases plus defaults. Neither covers the semantic layer, and only one offers a rename primitive at all.
What I keep pinned above my desk
- Turn the registry default to BACKWARD_TRANSITIVE (or FULL_TRANSITIVE) on any topic with retention longer than one deploy cycle. The default BACKWARD is a one-deploy guarantee; an authoritative log needs an all-versions guarantee.
- For Avro renames, always add aliases on the new field. If I cannot, it is not a rename — it is a new field plus a migration job.
- For Protobuf, every deleted field gets a reserved entry for both the tag number and the name in the same change. No exceptions.
- Type widening is one-way. Commit to the wider type the first time, or plan a double-write cutover.
- Before any schema change ships, replay a week of real events through a consumer built from the new schema in a throwaway environment and diff the materialized state against the current one. The registry check is necessary, not sufficient.
When to reach for this pattern and when to stay away
An authoritative event log pays off when auditability, replay for debugging, and parallel read models are first-class concerns. Ledgers, payments, and any system whose invariants are naturally phrased as "what happened" fit. Martin Fowler's 2005 Event Sourcing essay was candid about the costs even then — external-system interactions during replay, temporal logic that has to live in the domain model, and the effort required to reverse an event. The schema-evolution burden is a cousin of those.
If the domain is mostly CRUD, a relational table with ordinary migrations stays simpler for years. I only reach for the authoritative log when the properties it uniquely gives are worth the tax, and when I do, I budget for the compatibility matrix up front — not the first time a rename lands in a PR.