AckWait Is a Contract: How a 30-Second Default Took Down My JetStream Consumer
I lost an evening to a NATS JetStream pull consumer that doubled its work in production. The cause was three lines of ConsumerConfig I never wrote. These are my notes on what AckWait actually counts, why MaxDeliver = -1 is the silent footgun, and the 70-line Go contract I now ship on every JetStream consumer.
I lost an evening to a NATS JetStream pull consumer that looked correct in every test and silently doubled its work in production. The pattern was the same every time: the consumer would process a message, the work would take 31 seconds instead of 28, and a second copy of the same message would arrive while the first was still in flight. Within a minute, a single payment-processing message had been delivered four times, twice to the same goroutine.
The cause was three lines of ConsumerConfig that I had not written. JetStream's defaults are aggressive in a way that only matters when work is slow, and I had been treating the defaults as harmless absence rather than as policy.
This post is what I worked out about JetStream's pull-consumer ack model in the modern nats.go/jetstream package: which timer counts what, which ack call resets it, and the smallest set of ConsumerConfig fields I now treat as a non-negotiable contract on every consumer I deploy.
What AckWait actually counts
When the JetStream server delivers a message to a pull consumer, it starts a per-message timer called the AckWait. If no response — an explicit ack, a NAK, a termination, or a progress signal — arrives before that timer expires, the server treats the message as unhandled and redelivers it. The unmodified default is 30 seconds. The matching in-flight cap, `JsDefaultMaxAckPending` on the server side, defaults to 1000. Both come from the server defaults documented in the official Consumer Details page and confirmed in the nats-server source.
The thing the docs do not lead with — but that matters more than the default itself — is what the timer is actually measuring. It is not the time between calls to Fetch(). It is not the time between the message hitting your queue and your handler returning. It is the time between when the server sent the message and when the server saw your ack come back. Anything that happens on your end — TCP buffering, GC pause, a slow downstream HTTP call, a goroutine that got descheduled — all of it counts against the same 30-second budget.
The redelivery is not a retry; it is a duplicate. The original copy is still in your handler. If your handler is not idempotent for that message id, you have already burned the contract.
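The dedup key for that contract is already on the message: `msg.Metadata()` carries the stream sequence, which is stable across redeliveries, and `NumDelivered`, which tells you this copy is not the first. A minimal sketch of the guard — `seen` and `doWork` are hypothetical stand-ins for a durable dedup store and the real handler:

```go
// Hedged sketch: dedup on the stream sequence before doing work.
// seen stands in for a real dedup store (Redis, Postgres, anything
// that survives the process); doWork is the actual handler body.
func handleOnce(msg jetstream.Msg, seen map[uint64]bool) error {
	meta, err := msg.Metadata() // server-side view: sequences, delivery count
	if err != nil {
		return err
	}
	// Sequence.Stream identifies the message; it does not change on redelivery.
	if meta.NumDelivered > 1 && seen[meta.Sequence.Stream] {
		return msg.Ack() // the first copy already did the work; just re-ack
	}
	if err := doWork(msg.Data()); err != nil {
		return err
	}
	seen[meta.Sequence.Stream] = true
	return msg.Ack()
}
```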
Three details from the spec shaped how I think about this:
- `MaxDeliver` defaults to `-1`, which means infinite redelivery. A message that fails forever stays in rotation forever. The poison-pill insurance is off until you turn it on.
- `BackOff` overrides `AckWait` for redelivery timing, but only on AckWait expiry, not on a plain `Nak()`. A naked NAK is an immediate redelivery, which is the opposite of what you want under load.
- `InProgress()` resets the AckWait timer without acknowledging the message. This is the API the docs barely advertise and the one that fixes long-running work.
The redelivery survivor pool has a hook too: when a message exhausts MaxDeliver, the server emits an advisory on `$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<STREAM>.<CONSUMER>` carrying the original `stream_seq`. That is the subject a dead-letter queue subscribes to. I will come back to it.
The storm mode
Here is what actually happens when work crosses AckWait, traced from a few hours of `nats consumer info` output and server logs. Read the timeline as a stopwatch starting from the first delivery.
At t=0 the server delivers message #42 and starts a 30-second AckWait. My handler begins processing. At t=30 the timer expires; the server redelivers #42 to the same durable (it can land on the same goroutine or any other subscriber bound to that consumer). My handler is still on the first copy. At t=31 the original work completes and acks; the second copy is now one second into its own 30-second window. At t=60 that window expires too — the server sees no ack because the second handler is still running. A third copy arrives.
The pile-up is multiplicative because the consumer is now processing duplicates that themselves take longer than AckWait, so each one spawns its own redelivery. Pending acks climb and stay climbed even after the original payload has succeeded. Once `num_ack_pending` hits `MaxAckPending` (default 1000), the consumer stops receiving new messages entirely. From the outside it looks like the consumer hung.
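The climb is visible from the client side too. This is a hedged sketch, not part of the original consumer: it polls `Consumer.Info()` — the same counters `nats consumer info` prints — and assumes a `cons` handle and `ctx` from a setup like the contract below:

```go
// Hedged sketch: poll the consumer's counters as a storm detector.
// cons is a jetstream.Consumer handle, ctx a context.Context.
for range time.Tick(10 * time.Second) {
	info, err := cons.Info(ctx)
	if err != nil {
		log.Printf("consumer info: %v", err)
		continue
	}
	// NumAckPending climbing toward MaxAckPending while NumRedelivered
	// grows is the signature: the consumer is busy with duplicates.
	log.Printf("ack_pending=%d redelivered=%d waiting=%d",
		info.NumAckPending, info.NumRedelivered, info.NumWaiting)
}
```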
The whole sequence runs without a single error log. The server is doing exactly what the spec says.
The contract I now ship
Five fields. Every one of them is set deliberately.
```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

type poisonErr struct{}

func (poisonErr) Error() string { return "unprocessable payload" }

// process simulates work that may exceed the AckWait window.
func process(ctx context.Context, msg jetstream.Msg) error {
	deadline := time.Now().Add(50 * time.Second)
	tick := time.NewTicker(15 * time.Second) // < AckWait / 2
	defer tick.Stop()
	for time.Now().Before(deadline) {
		select {
		case <-tick.C:
			_ = msg.InProgress() // resets the server's AckWait timer
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	stream, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
	})
	if err != nil {
		log.Fatal(err)
	}

	cons, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
		Durable:       "orders-worker",
		AckPolicy:     jetstream.AckExplicitPolicy,
		AckWait:       45 * time.Second,
		MaxAckPending: 64,
		MaxDeliver:    5,
		BackOff: []time.Duration{
			2 * time.Second, 8 * time.Second,
			30 * time.Second, 2 * time.Minute,
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	_, err = cons.Consume(func(msg jetstream.Msg) {
		switch err := process(ctx, msg); {
		case err == nil:
			_ = msg.Ack()
		case errors.As(err, new(poisonErr)):
			_ = msg.Term() // do not redeliver
		default:
			_ = msg.NakWithDelay(5 * time.Second)
		}
	})
	if err != nil {
		log.Fatal(err)
	}
	select {}
}
```

Run it with `go run main.go` against a server started with `nats-server -js`.
What each line is doing:
- `AckWait: 45 * time.Second`. I set this to roughly `p99(work) + 50%`. The window has to cover the slowest realistic single-message processing, not the median. Thirty seconds is too small for any consumer that does a database write, an external HTTP call, and a publish — all of which I have measured at p99 over 22 seconds.
- `MaxAckPending: 64`. Tight, deliberately. The default of 1000 means a single subscriber can have a thousand messages in flight, which is what makes the storm mode look like a hang. With 64, the failure surfaces fast: the consumer goes quiet within seconds of a slow stage instead of buffering for minutes.
- `MaxDeliver: 5`. Bounded retries. Once a message has been delivered five times it stops being redelivered, and the server emits an advisory event carrying the original `stream_seq` on `$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.orders-worker` — the hook a separate DLQ subscriber listens on. That is the point at which a human or a separate worker takes over.
- `BackOff: [...]`. Four durations applied in order on each AckWait expiry. The sequence length must be ≤ `MaxDeliver`. This stops the immediate-redelivery whip when AckWait does fire — which it still will, because no heartbeat strategy is perfect.
- `InProgress()` inside `process()` is the heartbeat. I tick at less than `AckWait / 2` because a single missed tick must not push past the window. At 15-second ticks against a 45-second AckWait, two consecutive misses still leave 15 seconds of slack.
The terminal action distinction matters as much as the configuration. The handler picks one of three:
- `Ack()` for success.
- `Term()` for known-bad payloads. This emits the `MSG_TERMINATED` advisory and removes the message from rotation regardless of `MaxDeliver`. This is what a deserialisation failure should do, never a NAK.
- `NakWithDelay()` for transient failures. A bare `Nak()` is an immediate redelivery; `NakWithDelay` gives the downstream a real chance to recover before the next attempt.
The default Nak() is the third footgun I keep finding in other people's code. Under a thundering herd it does the opposite of what the author intended.
What this configuration actually costs
The trade-offs are visible. A `MaxAckPending` of 64 caps single-consumer throughput. With 45-second AckWait and 64 in-flight, the theoretical ceiling is roughly `64 * (1000ms / mean_processing_ms)` messages per second on a single subscriber. For 200 ms median work, that is around 320 msg/s per subscriber — fine for the workloads I run, not fine for a fanout that needs to chew through millions per minute. The fix there is more subscribers bound to the same durable, not a higher `MaxAckPending`.
MaxDeliver: 5 plus an advisory subscriber means I need a second piece of code somewhere — a DLQ worker, even a tiny one. The Synadia anti-patterns post from January 2025 recommends staying below roughly 100,000 consumers per server and below roughly 300 disjoint subject filters per consumer; a per-stream DLQ subscriber is well under both limits.
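The tiny worker can be a plain NATS subscription on the advisory subject — it does not need to be a JetStream consumer. A hedged sketch, reusing `nc`, `stream`, and `ctx` from the contract above (plus `encoding/json`); the payload fields follow my reading of the max-deliveries advisory schema:

```go
// Hedged sketch of the DLQ worker: listen for the max-deliveries
// advisory, then fetch the dead message by its original sequence.
_, err = nc.Subscribe(
	"$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.orders-worker",
	func(m *nats.Msg) {
		var adv struct {
			StreamSeq  uint64 `json:"stream_seq"`
			Deliveries int    `json:"deliveries"`
		}
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			log.Printf("bad advisory: %v", err)
			return
		}
		// The message is still in the stream; pull it out for a human
		// or republish it to a dead-letter subject.
		raw, err := stream.GetMsg(ctx, adv.StreamSeq)
		if err != nil {
			log.Printf("get msg %d: %v", adv.StreamSeq, err)
			return
		}
		log.Printf("dead letter: seq=%d subject=%s deliveries=%d",
			raw.Sequence, raw.Subject, adv.Deliveries)
	})
if err != nil {
	log.Fatal(err)
}
```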
`InProgress()` heartbeats are not free. Each one is a small publish on a control subject. At a 15-second cadence against 64 in-flight messages, that is about four publishes per second per subscriber — negligible against any real workload, but worth knowing it exists.
When the contract is the wrong tool
Two cases where I would not reach for it.
Long-running work past five minutes. The heartbeat pattern starts to look ridiculous when a single message represents hours of computation. At that point the message is a job handle, not a unit of work, and the right shape is to ack the message immediately, store the job in a durable workflow engine, and let the workflow engine own the retry contract. Temporal and DBOS exist exactly for this case.
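In handler terms that shape is two lines, sketched here against the `cons` and `ctx` from the contract above, with `enqueueJob` as a hypothetical call into Temporal, DBOS, or a plain jobs table:

```go
// Hedged sketch: the message is a job handle, not the work itself.
// Ack as soon as the job is durably recorded; the workflow engine
// owns the retry contract from here. enqueueJob is hypothetical.
_, err = cons.Consume(func(msg jetstream.Msg) {
	if err := enqueueJob(ctx, msg.Data()); err != nil {
		_ = msg.NakWithDelay(5 * time.Second) // job not recorded; let JetStream retry delivery
		return
	}
	_ = msg.Ack() // the hours of computation happen elsewhere
})
```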
Strict ordering across a partition. Pull consumers with MaxAckPending > 1 deliver messages in parallel to the subscriber, and any reordering caused by a redelivery breaks per-key ordering. If the workload is a per-account ledger, the right shape is MaxAckPending: 1 plus a partitioned subject (orders.<accountId>) and one consumer per partition. That is a different design, not a tweak to this contract.
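In config terms that design is one consumer per partition, each strictly serial. A hedged sketch, with the partition subjects as an assumption about how accounts are split and `sanitize` as a hypothetical helper that maps a subject to a legal durable name:

```go
// Hedged sketch: per-partition consumers for strict per-key ordering.
for _, part := range []string{"orders.account-a", "orders.account-b"} {
	_, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
		Durable:       "ledger-" + sanitize(part), // sanitize is hypothetical
		FilterSubject: part,
		AckPolicy:     jetstream.AckExplicitPolicy,
		MaxAckPending: 1, // one in flight: a redelivery cannot reorder the key
	})
	if err != nil {
		log.Fatal(err)
	}
}
```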
Takeaways
- AckWait is a server-side timer that counts your wall-clock processing time, not your queue wait. Set it to `p99(work) + 50%` and never rely on the 30-second default.
- `InProgress()` is the heartbeat that resets the timer. Tick at less than `AckWait / 2`.
- `MaxDeliver: -1` is infinite redelivery. Always set a number, and subscribe to the `MAX_DELIVERIES` advisory to drain the survivors into a DLQ.
- `Nak()` without delay is an immediate redelivery whip. Use `NakWithDelay()` or rely on `BackOff`.
- `Term()` is for messages that will never succeed. It is not a sharper `Nak()`; it is a different shape of acknowledgement.
Reach for this contract on any pull consumer where a single message can do real work. Skip it for fire-and-forget firehoses, jobs that outlive a single AckWait window cleanly, and strictly ordered ledgers — those want different shapes entirely.