Distributed Systems · 9 min read

AckWait Is a Contract: How a 30-Second Default Took Down My JetStream Consumer

I lost an evening to a NATS JetStream pull consumer that doubled its work in production. The cause was three lines of ConsumerConfig I never wrote. These are my notes on what AckWait actually counts, why MaxDeliver = -1 is the silent footgun, and the 70-line Go contract I now ship on every JetStream consumer.

Tiarê Balbi Bonamini · Software Engineer · Vancouver

I lost an evening to a NATS JetStream pull consumer that looked correct in every test and silently doubled its work in production. The pattern was the same every time: the consumer would process a message, the work would take 31 seconds instead of 28, and a second copy of the same message would arrive while the first was still in flight. Within a minute, a single payment-processing message had been delivered four times, twice to the same goroutine.

The cause was three lines of ConsumerConfig that I had not written. JetStream's defaults are aggressive in a way that only matters when work is slow, and I had been treating the defaults as harmless absence rather than as policy.

This post is what I worked out about JetStream's pull-consumer ack model in the modern nats.go/jetstream package: which timer counts what, which ack call resets it, and the smallest set of ConsumerConfig fields I now treat as a non-negotiable contract on every consumer I deploy.

What AckWait actually counts

When the JetStream server delivers a message to a pull consumer, it starts a per-message timer called the AckWait. If no acknowledgement (explicit ack, NAK, or in-progress signal) arrives before that timer expires, the server treats the message as unhandled and redelivers it. The default is 30 seconds; the matching server-side in-flight cap, JsDefaultMaxAckPending, is 1000. Both come from the server defaults documented on the official Consumer Details page and confirmed in the nats-server source.

The thing the docs do not lead with — but that matters more than the default itself — is what the timer is actually measuring. It is not the time between calls to Fetch(). It is not the time between the message hitting your queue and your handler returning. It is the time between when the server sent the message and when the server saw your ack come back. Anything that happens on your end — TCP buffering, GC pause, a slow downstream HTTP call, a goroutine that got descheduled — all of it counts against the same 30-second budget.

The redelivery is not a retry; it is a duplicate. The original copy is still in your handler. If your handler is not idempotent for that message id, you have already burned the contract.
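One minimal shape for that idempotency guard, sketched under an assumption: completed work is recorded against the JetStream stream sequence, which the nats.go/jetstream package exposes via msg.Metadata(). The dedupeStore name and the in-memory map are mine; a real deployment would back this with a durable store so the guard survives a restart.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupeStore records stream sequences whose side effects have committed.
// In production this would be a durable store (SQL, Redis, a KV bucket),
// keyed by the stream sequence from msg.Metadata().
type dedupeStore struct {
	mu   sync.Mutex
	done map[uint64]bool
}

func newDedupeStore() *dedupeStore {
	return &dedupeStore{done: map[uint64]bool{}}
}

// MarkDone runs after the side effect commits, before Ack().
func (s *dedupeStore) MarkDone(seq uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.done[seq] = true
}

// AlreadyDone lets a redelivered duplicate ack immediately and skip the work.
func (s *dedupeStore) AlreadyDone(seq uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.done[seq]
}

func main() {
	store := newDedupeStore()
	store.MarkDone(42)
	fmt.Println(store.AlreadyDone(42)) // duplicate: ack and skip
	fmt.Println(store.AlreadyDone(43)) // first delivery: do the work
}
```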

Three details from the spec shaped how I think about this:

  • MaxDeliver defaults to -1, which means infinite redelivery. A message that fails forever stays in rotation forever. The poison-pill insurance is off until you turn it on.
  • BackOff overrides AckWait for redelivery timing, but only on AckWait expiry, not on a plain Nak(). A naked NAK is an immediate redelivery, which is the opposite of what you want under load.
  • InProgress() resets the AckWait timer without acknowledging the message. This is the API the docs barely advertise and the one that fixes long-running work.

The redelivery survivor pool has a hook too: when a message exhausts MaxDeliver, the server emits an advisory on $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<STREAM>.<CONSUMER> carrying the original stream_seq. That is the subject a dead-letter queue subscribes to. I will come back to it.

The storm mode

Here is what actually happens when work crosses AckWait, traced from a few hours of nats consumer info output and server logs. Read the timeline below as a stopwatch starting from the first delivery.

At t=0 the server delivers message #42 and starts a 30-second AckWait. My handler begins processing. At t=30 the timer expires; the server redelivers #42 to the same durable (it can land on the same goroutine or any other subscriber bound to that consumer). My handler is still on the first copy. At t=31 the original work completes and acks; the second copy is now one second into its own 30-second window. At t=60 that window expires too — the server sees no ack because the second handler is still running. A third copy arrives.

The pile-up is multiplicative because the consumer is now processing duplicates that themselves take longer than AckWait, so each one spawns its own redelivery. Pending acks climb and stay elevated even after the original payload has succeeded. Once num_ack_pending hits MaxAckPending (default 1000), the consumer stops receiving new messages entirely. From the outside it looks like the consumer hung.

The whole sequence runs without a single error log. The server is doing exactly what the spec says.
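The arithmetic of that timeline is easy to model. This is a toy simplification (my code, not the server's scheduler): every unacked copy triggers one redelivery AckWait seconds after it was delivered, so work that runs just over the window produces a fresh copy every AckWait seconds.

```go
package main

import "fmt"

// redeliveries returns the delivery timestamps (in seconds) for one message
// when every attempt takes workSec to ack and the server redelivers
// ackWaitSec after each unacked delivery, observed up to horizonSec.
func redeliveries(ackWaitSec, workSec, horizonSec int) []int {
	var times []int
	for t := 0; t <= horizonSec; t += ackWaitSec {
		times = append(times, t)
		if workSec <= ackWaitSec {
			break // the copy acks inside the window; no redelivery
		}
	}
	return times
}

func main() {
	// 31s of work against the 30s default: a new copy every 30 seconds.
	fmt.Println(redeliveries(30, 31, 120)) // [0 30 60 90 120]
	// The same work against a 45s AckWait: one delivery, no storm.
	fmt.Println(redeliveries(45, 31, 120)) // [0]
}
```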

The contract I now ship

Five fields. Every one of them is set deliberately.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

type poisonErr struct{}

func (poisonErr) Error() string { return "unprocessable payload" }

// process simulates work that may exceed the AckWait window.
func process(ctx context.Context, msg jetstream.Msg) error {
	deadline := time.Now().Add(50 * time.Second)
	tick := time.NewTicker(15 * time.Second) // < AckWait / 2
	defer tick.Stop()
	for time.Now().Before(deadline) {
		select {
		case <-tick.C:
			_ = msg.InProgress() // resets the server's AckWait timer
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	stream, err := js.CreateOrUpdateStream(ctx, jetstream.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
	})
	if err != nil {
		log.Fatal(err)
	}

	cons, err := stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
		Durable:       "orders-worker",
		AckPolicy:     jetstream.AckExplicitPolicy,
		AckWait:       45 * time.Second,
		MaxAckPending: 64,
		MaxDeliver:    5,
		BackOff: []time.Duration{
			2 * time.Second, 8 * time.Second,
			30 * time.Second, 2 * time.Minute,
		},
	})
	if err != nil {
		log.Fatal(err)
	}

	_, err = cons.Consume(func(msg jetstream.Msg) {
		switch err := process(ctx, msg); {
		case err == nil:
			_ = msg.Ack()
		case errors.As(err, new(poisonErr)):
			_ = msg.Term() // do not redeliver
		default:
			_ = msg.NakWithDelay(5 * time.Second)
		}
	})
	if err != nil {
		log.Fatal(err)
	}
	select {}
}
```

Run it with go run main.go against a server started with nats-server -js.

What each line is doing:

  • AckWait: 45 * time.Second. I set this to roughly p99(work) + 50%. The window has to cover the slowest realistic single-message processing, not the median. Thirty seconds is too small for any consumer that does a database write, an external HTTP call, and a publish — all of which I have measured at p99 over 22 seconds.
  • MaxAckPending: 64. Tight, deliberately. The default of 1000 means a single subscriber can have a thousand messages in flight, which is what makes the storm mode look like a hang. With 64, the failure surfaces fast: the consumer goes quiet within seconds of a slow stage instead of buffering for minutes.
  • MaxDeliver: 5. Bounded retries. Once a message has been delivered five times, the server stops redelivering it and emits an advisory on $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.ORDERS.orders-worker carrying the original stream_seq. That advisory is the hook a separate DLQ subscriber listens on, and the point at which a human or a separate worker takes over.
  • BackOff: [...]. Four durations applied in order on each AckWait expiry. The sequence length must be ≤ MaxDeliver. This stops the immediate-redelivery whip when AckWait does fire — which it still will, because no heartbeat strategy is perfect.
  • InProgress() inside process() is the heartbeat. I tick at less than AckWait / 2 because a single missed tick must not push past the window. At 15-second ticks against a 45-second AckWait, two consecutive misses still leave 15 seconds of slack.

The terminal action distinction matters as much as the configuration. The handler picks one of three:

  • Ack() for success.
  • Term() for known-bad payloads. This emits the MSG_TERMINATED advisory and removes the message from rotation regardless of MaxDeliver. This is what a deserialisation failure should do, never a NAK.
  • NakWithDelay() for transient failures. A bare Nak() is an immediate redelivery; NakWithDelay gives the downstream a real chance to recover before the next attempt.

The default Nak() is the third footgun I keep finding in other people's code. Under a thundering herd it does the opposite of what the author intended.

What this configuration actually costs

The trade-offs are visible. A MaxAckPending of 64 caps single-consumer throughput. With 45-second AckWait and 64 in-flight, the theoretical ceiling is roughly 64 * (1000ms / mean_processing_ms) messages per second on a single subscriber. For 200 ms median work, that is around 320 msg/s per subscriber — fine for the workloads I run, not fine for a fanout that needs to chew through millions per minute. The fix there is more subscribers bound to the same durable, not a higher MaxAckPending.
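That ceiling is worth sanity-checking for your own latencies. A back-of-the-envelope helper (my naming, not from the post's code):

```go
package main

import "fmt"

// ceiling estimates single-subscriber throughput in msg/s: with
// maxAckPending slots in flight, each held for meanMs on average,
// slots turn over 1000/meanMs times per second.
func ceiling(maxAckPending int, meanMs float64) float64 {
	return float64(maxAckPending) * (1000.0 / meanMs)
}

func main() {
	for _, ms := range []float64{50, 200, 1000} {
		fmt.Printf("mean %v ms -> ~%.0f msg/s\n", ms, ceiling(64, ms))
	}
}
```

For 200 ms work this reproduces the ~320 msg/s figure above; at a full second per message the same 64 slots cap out at 64 msg/s, which is when you add subscribers rather than raise MaxAckPending.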

MaxDeliver: 5 plus an advisory subscriber means I need a second piece of code somewhere — a DLQ worker, even a tiny one. The Synadia anti-patterns post from January 2025 recommends staying below roughly 100,000 consumers per server and below roughly 300 disjoint subject filters per consumer; a per-stream DLQ subscriber is well under both limits.

InProgress() heartbeats are not free. Each one is a small publish on a control subject. At 15-second cadence against 64 in-flight messages, that is about four publishes per second per subscriber — negligible against any real workload, but worth knowing about.

When the contract is the wrong tool

Two cases where I would not reach for it.

Long-running work past five minutes. The heartbeat pattern starts to look ridiculous when a single message represents hours of computation. At that point the message is a job handle, not a unit of work, and the right shape is to ack the message immediately, store the job in a durable workflow engine, and let the workflow engine own the retry contract. Temporal and DBOS exist exactly for this case.

Strict ordering across a partition. Pull consumers with MaxAckPending > 1 deliver messages in parallel to the subscriber, and any reordering caused by a redelivery breaks per-key ordering. If the workload is a per-account ledger, the right shape is MaxAckPending: 1 plus a partitioned subject (orders.<accountId>) and one consumer per partition. That is a different design, not a tweak to this contract.

Takeaways

  • AckWait is a server-side timer that counts your wall-clock processing time, not your queue wait. Set it to p99(work) + 50% and never rely on the 30-second default.
  • InProgress() is the heartbeat that resets the timer. Tick at less than AckWait / 2.
  • MaxDeliver: -1 is infinite redelivery. Always set a number, and subscribe to the MAX_DELIVERIES advisory to drain the survivors into a DLQ.
  • Nak() without delay is an immediate redelivery whip. Use NakWithDelay() or rely on BackOff.
  • Term() is for messages that will never succeed. It is not a sharper Nak(); it is a different shape of acknowledgement.

Reach for this contract on any pull consumer where a single message can do real work. Skip it for fire-and-forget firehoses, jobs that outlive a single AckWait window cleanly, and strictly ordered ledgers — those want different shapes entirely.
