Preregistering Experiment #1: JSON Extraction Prompt Fragility

Last week I posted "AI Prompts: How Good and How Bad They Are," the opening shot of a research line on moving language models from "best effort" to specifiable, measurable precision. That post laid out five research questions. This post starts answering the first one, and it does so in a specific way that I want to defend before I run a single API call:

This is a preregistration. Hypotheses, metrics, sample sizes, stopping rules — all committed before the data is collected. The companion code is already in the open. If the results contradict the hypotheses, that gets published exactly the same way the confirmations would.

I am doing this deliberately. The parent post argued that benchmark culture papers over fragility by reporting one number per model per task. The honest counter-move is to bind myself to the methodology before I know the answer. Preregistration is how the rest of empirical science learned to stop fooling itself, and I don't see a reason prompt engineering should be exempt.

This post does not contain results. The results come second, in a separate field report. If you want to follow along — or argue with the design — now is the time.

What this experiment will and will not measure

Of the five questions in the parent post, this is Q1: how fragile, in practice, are the prompts that actually ship? The smallest honest version of that question is a structured extraction task — single-call, fixed schema, objectively scoreable. I picked it for three reasons.

First, extraction is where production teams are actually losing reliability right now. A 2026 ExtractBench study found that frontier models drop to 0% valid output on a 369-field financial reporting schema — the failure is not exotic, it shows up at modest scale. PromptPort's cross-model work makes the more uncomfortable point: a prompt that returns clean JSON on one model returns fenced, prose-wrapped, or malformed output on another, and strict parsers reject otherwise correct extractions. The economic pain is real.

Second, structured outputs are exactly where the field claims it has already won. The parent post celebrated constrained decoding — XGrammar at <40µs per token, default backend for vLLM/SGLang/TensorRT-LLM since March 2026, XGrammar-2 in May — as a genuine syntactic-level win. If even that surface area is fragile when you stop staring at one variant, the field has a problem deeper than it has admitted.

Third, it is cheap to score honestly. No LLM-as-judge, no rubric drift. Schema validity is binary. Field accuracy is exact match against ground truth. F1 on nested arrays is a textbook calculation. The whole measurement stack runs offline against committed fixtures.

What this experiment will not do:

It will not test multiple models. One frontier model, one set of variants. Multi-model contrast is Experiment #2, where the question becomes "is fragility a model property or a universal property."
It will not measure RAG, agentic flows, or open-ended generation. Those have their own experiments queued.
It will not try to optimize the prompt. The point is to characterize variance under semantics-preserving perturbations, not to find the winner.
It will not measure cost or latency tradeoffs. That is Q5 from the parent post and gets its own treatment.

The scope is intentionally narrow. A small honest result is more useful than a sweeping ambiguous one.

The task

Single-call extraction of invoice line items from a paragraph of English business text into a fixed JSON schema. ~12 fields, two nested arrays, mid-complexity — closer to a real production schema than a toy one, but small enough to control.

The corpus is procedurally generated, seeded, and committed to the repo. Two hundred input documents, no PII, designed to span the kinds of formatting variation (currency symbols, date formats, line-break conventions, polite filler) that production teams hit in the wild. Ground truth is produced by the generator at corpus-build time, not by the model.
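To make "seeded, with ground truth produced by the generator" concrete, here is a minimal sketch. The field names, the vocabulary lists, and the functions `make_document` and `build_corpus` are hypothetical illustrations, not the repository's actual generator, and the sentence template is far simpler than a real invoice paragraph.

```python
import random

# Hypothetical vocabularies; a real generator would cover far more variation.
VENDORS = ["Acme Supply", "Northwind Traders", "Globex"]
ITEMS = ["widgets", "cable ties", "toner cartridges", "desk lamps"]
CURRENCY_STYLES = ["${:.2f}", "USD {:.2f}", "{:.2f} dollars"]


def make_document(rng: random.Random) -> tuple[str, dict]:
    """Return (input_text, ground_truth), both built from the same random draws."""
    vendor = rng.choice(VENDORS)
    style = rng.choice(CURRENCY_STYLES)
    line_items, phrases = [], []
    for name in rng.sample(ITEMS, k=rng.randint(1, 3)):
        quantity = rng.randint(1, 20)
        unit_price = rng.randint(100, 9999) / 100
        line_items.append(
            {"description": name, "quantity": quantity, "unit_price": unit_price}
        )
        phrases.append(f"{quantity} {name} at {style.format(unit_price)} each")
    text = (
        f"Thanks for your order. {vendor} is billing you for "
        + ", ".join(phrases)
        + "."
    )
    # The ground truth never passes through a model: it is the generator's own state.
    return text, {"vendor": vendor, "line_items": line_items}


def build_corpus(seed: int, n: int = 200) -> list[tuple[str, dict]]:
    """Same seed, same corpus: one Random instance drives every draw."""
    rng = random.Random(seed)
    return [make_document(rng) for _ in range(n)]
```

Because every draw goes through a single `random.Random(seed)` instance, rebuilding with the same seed yields the same 200 (text, truth) pairs, which is what lets the fixtures be committed and re-derived.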

Single model: Claude Sonnet 4.6. Temperature 0. Identical schema. Identical system prompt skeleton across all conditions. The only thing that varies is the perturbation.

The perturbation taxonomy

Five classes of perturbation. The full English instruction varies; the schema and the task definition do not. Each class has three variants, for fifteen total prompts.

P1 — Lexical. Synonym swaps in the instructions. "Extract the line items" becomes "pull out the line items" becomes "identify the line items." Same semantics, different surface form. This is the class most people assume is the dominant source of fragility, and BrittleBench's data suggests they are wrong about that.

P2 — Structural. The order of instruction blocks. Schema before rules vs. rules before schema vs. schema and rules interleaved. The content is identical; the sequence changes.

P3 — Formatting. Markdown headers vs. plain prose labels vs. XML-tagged sections. ### Schema vs. Schema: vs. <schema>...</schema>. The instructions read the same way to a human; the tokenization does not.

P4 — Example count. Zero-shot, one-shot, three-shot — with the same canonical examples, drawn from the same generator. This is a controlled test of in-context learning's marginal contribution, holding example content constant.

P5 — Voice and politeness. Polite vs. terse imperative vs. neutral declarative. "Please carefully extract..." vs. "Extract..." vs. "The task is to extract..." This one is mostly here because every team I have ever seen has accidentally A/B-tested it without measuring it.
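One way to generate the fifteen variants from a single template is to hold a canonical prompt specification and change exactly one axis at a time. The sketch below assumes that approach; `PromptSpec`, `render`, `all_variants`, the placeholder schema, rules, and examples are all hypothetical and not taken from the companion repository.

```python
from dataclasses import dataclass, replace

# Placeholder content; the real schema has about 12 fields.
SCHEMA = '{"vendor": "string", "line_items": [{"description": "string", "quantity": "integer"}]}'
RULES = ["Return only JSON.", "Use null for missing fields."]
EXAMPLES = [
    'Text: "Acme bills 2 widgets." Output: {"vendor": "Acme", "line_items": [{"description": "widgets", "quantity": 2}]}',
    'Text: "Globex bills 1 lamp." Output: {"vendor": "Globex", "line_items": [{"description": "lamp", "quantity": 1}]}',
    'Text: "Northwind bills 5 pens." Output: {"vendor": "Northwind", "line_items": [{"description": "pens", "quantity": 5}]}',
]


@dataclass(frozen=True)
class PromptSpec:
    verb: str = "Extract"        # P1 lexical
    order: str = "schema_first"  # P2 structural
    fmt: str = "markdown"        # P3 formatting
    shots: int = 0               # P4 example count
    voice: str = "terse"         # P5 voice and politeness


def section(label: str, body: str, fmt: str) -> str:
    if fmt == "markdown":
        return f"### {label}\n{body}"
    if fmt == "xml":
        tag = label.lower()
        return f"<{tag}>\n{body}\n</{tag}>"
    return f"{label}: {body}"  # plain prose label


def render(spec: PromptSpec) -> str:
    verb = spec.verb.lower()
    opening = {
        "polite": f"Please carefully {verb} the line items from the text.",
        "terse": f"{spec.verb} the line items from the text.",
        "neutral": f"The task is to {verb} the line items from the text.",
    }[spec.voice]
    schema = section("Schema", SCHEMA, spec.fmt)
    rules = section("Rules", " ".join(RULES), spec.fmt)
    if spec.order == "schema_first":
        blocks = [schema, rules]
    elif spec.order == "rules_first":
        blocks = [rules, schema]
    else:  # interleaved: first rule, schema, second rule
        blocks = [section("Rules", RULES[0], spec.fmt), schema,
                  section("Rules", RULES[1], spec.fmt)]
    if spec.shots:
        blocks.append(section("Examples", "\n".join(EXAMPLES[:spec.shots]), spec.fmt))
    return "\n\n".join([opening, *blocks])


AXES = {
    "P1": ("verb", ["Extract", "Pull out", "Identify"]),
    "P2": ("order", ["schema_first", "rules_first", "interleaved"]),
    "P3": ("fmt", ["markdown", "plain", "xml"]),
    "P4": ("shots", [0, 1, 3]),
    "P5": ("voice", ["polite", "terse", "neutral"]),
}


def all_variants() -> dict[str, str]:
    """Vary one axis at a time around the canonical spec: 5 classes x 3 variants."""
    base = PromptSpec()
    return {
        f"{cls}-{i + 1}": render(replace(base, **{field: value}))
        for cls, (field, values) in AXES.items()
        for i, value in enumerate(values)
    }
```

In this sketch one variant in each class is identical to the canonical prompt (for example `P1-1` and `P3-1` render the same text). Whether the real harness treats the canonical prompt that way is not stated in this post, so take that as a property of the sketch only.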

The metrics

Four dependent variables, computed per (prompt-variant, input) cell.

(1) Schema validity rate. Does the output parse against the JSON schema? Binary, per call.
(2) Field-level accuracy. Per-field exact match against ground truth, averaged across the 200 inputs.
(3) Value-set F1. For the two nested arrays, where the order of items can legitimately vary, compute precision and recall on the set of values and combine them into F1.
(4) Run-to-run agreement (Krippendorff's α). Each (variant, input) cell is repeated five times. α is computed across the five repeats, treating outputs as nominal. Median α across cells is the per-variant stability score.
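The first three metrics can be sketched with the standard library alone. This is an illustration under assumptions, not the repository's scoring code: `parses_as_object` covers only the "does it parse" half of schema validity (full validation against a JSON Schema would need a validator on top), and the function names and the idea of canonicalizing array items with sorted-key JSON are mine.

```python
import json
from collections import Counter


def parses_as_object(raw: str) -> bool:
    """Stand-in for schema validity: True if the raw output is a JSON object.
    Fenced or prose-wrapped output fails here, as it would with a strict parser."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def field_accuracy(predicted: dict, truth: dict, scalar_fields: list[str]) -> float:
    """Share of scalar fields whose predicted value exactly equals ground truth."""
    hits = sum(predicted.get(name) == truth[name] for name in scalar_fields)
    return hits / len(scalar_fields)


def canon(item) -> str:
    """Order-insensitive key for an array item. Note that 12 and 12.0
    serialize differently, so this is a strict exact match."""
    return json.dumps(item, sort_keys=True)


def value_set_f1(predicted: list, truth: list) -> float:
    """F1 over array items, ignoring item order but counting duplicates."""
    if not predicted and not truth:
        return 1.0
    pred, gold = Counter(map(canon, predicted)), Counter(map(canon, truth))
    overlap = sum((pred & gold).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)
```

A prediction that returns the right items in a different order scores 1.0, while a prediction that returns one of two expected items has precision 1, recall 0.5, and F1 of 2/3.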

The combination matters. (1) and (2) measure whether the model gets it right. (3) handles the "right answer, different order" case that punishes naive scoring. (4) — the one most production teams skip — measures whether the model agrees with itself, run to run, at temperature zero. The parent post argued that even at T=0, batching and inference-time optimizations produce different outputs across runs. (4) is where I find out how much that matters for this task.
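For the agreement metric, here is a sketch of nominal Krippendorff's α built from the coincidence matrix. The post says α is computed across the five repeats with outputs treated as nominal, but it does not say what the units are. This sketch assumes each schema field is a unit and each repeat is a coder, because α needs several units to estimate expected disagreement. That choice, the helper `units_from_repeats`, and the convention of returning 1.0 when no disagreement is possible are all assumptions.

```python
import json
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list]) -> float:
    """units[i] holds the values the repeats produced for unit i (None = missing)."""
    coincidences = Counter()
    for values in units:
        present = [v for v in values if v is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with fewer than two values cannot be paired
        # Every ordered pair of values from different repeats, weighted 1/(m-1).
        for c, k in permutations(present, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    totals = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")  # nothing pairable
    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:
        return 1.0  # assumed convention: only one value ever appeared
    return 1 - observed / expected


def units_from_repeats(repeats: list[dict], fields: list[str]) -> list[list[str]]:
    """Hypothetical helper: one unit per field, one value per repeat."""
    return [[json.dumps(r.get(f), sort_keys=True) for r in repeats] for f in fields]
```

Five repeats that agree on every field give α of 1.0; a single repeat that disagrees on one field pulls α below 1, and systematic disagreement can push it below zero.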

The hypotheses

Four falsifiable claims, each with an effect size I will accept as the threshold for "real."

H1: Brittleness exists at production scale. Across the fifteen perturbation variants, mean field-level accuracy varies by at least 5 percentage points between the best and worst variant on the same 200 inputs. The threshold is modest next to the spread of up to 12 that BrittleBench reports, so H1 can be confirmed without reproducing the largest published effect, while a spread of 2 points or less counts as falsification (see the stopping rules).

H2: Format dominates lexical. Class P3 (formatting) contributes more accuracy variance than class P1 (lexical). If true, the practical implication is unambiguous: changing your markdown headers to XML tags matters more than picking the perfect verb. Most prompt-tuning advice in 2025 had this backwards.

H3: Run instability is meaningful at T=0. Median Krippendorff's α across cells is below 0.90. If α is high, "non-determinism is a feature, not a bug" is overstated for this task class and the field's hand-wringing is misplaced. If α is low, the implication is sharper: production systems that ship after one good response have been making a category mistake.

H4: Constrained decoding does not close the semantic gap. Re-run the worst-performing variant from H1 with strict schema-enforced decoding (JSON-mode or equivalent). The gap to the best-performing variant should not close by more than half. This tests the parent post's claim that constrained decoding solves syntax but not semantics, and it is the natural bridge to Experiment #2.

The hypotheses are logically independent: H1 could fail and H3 could still hold. H4 reuses the best and worst variants identified under H1, but its claim can stand or fall on its own. H4 could fail in a way that vindicates constrained decoding, which would be the most interesting outcome of the four, because it would update my prior in the direction of "the field's existing tools are stronger than my parent post implied."

Stopping rules and what would falsify each claim

N is fixed up front. Each perturbation class has 3 variants × 200 inputs × 5 repeats = 3,000 calls, so the five classes total 15,000, plus a separate ~1,000-call budget for the H4 constrained-decoding sweep. There will be no "the result is close, let me add more samples." If the data is ambiguous at the committed N, the writeup says so.

The falsification thresholds:

H1 falsified if max - min field accuracy across all fifteen variants is ≤ 2pp. A spread between 2pp and 5pp is reported as inconclusive. If H1 is falsified, the experiment becomes a null result, the research line gets reframed, and the next post explains why; that is a more interesting story than confirming H1 would be.
H2 falsified if P1's contribution to variance is at least 90% of P3's, that is, within ±10% of P3's or larger.
H3 falsified if median α ≥ 0.90.
H4 falsified if constrained decoding closes more than 50% of the H1 gap.
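The thresholds above are simple enough to pin down as code before any data exists. The following is a sketch with hypothetical function names. Two readings in it are assumptions rather than statements from this post: a P1 contribution more than 10% above P3's is treated as falsifying H2, and an H4 comparison with no H1 gap to close is reported as inconclusive. How "contribution to variance" is computed is left to the caller.

```python
def verdict_h1(spread_pp: float) -> str:
    """spread_pp: best minus worst mean field accuracy, in percentage points."""
    if spread_pp >= 5:
        return "confirmed"
    if spread_pp <= 2:
        return "falsified"
    return "inconclusive"


def verdict_h2(p1_variance: float, p3_variance: float) -> str:
    """Falsified when P1 is within 10% of P3. Assumption: also falsified
    when P1 exceeds P3 by more than 10%, since lexical would then dominate."""
    if p1_variance < 0.9 * p3_variance:
        return "confirmed"
    return "falsified"


def verdict_h3(median_alpha: float) -> str:
    return "confirmed" if median_alpha < 0.90 else "falsified"


def verdict_h4(best: float, worst: float, worst_constrained: float) -> str:
    """Falsified if constrained decoding closes more than half of the H1 gap."""
    gap_before = best - worst
    if gap_before <= 0:
        return "inconclusive"  # assumption: no gap to close
    closed = (gap_before - (best - worst_constrained)) / gap_before
    return "falsified" if closed > 0.5 else "confirmed"
```

For example, if the worst variant scores 0.85 against a best of 0.95 and constrained decoding lifts it to 0.93, the gap shrinks from 0.10 to 0.02, 80% of it is closed, and H4 is falsified.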

If two or more of H1–H3 fail to confirm, the post that reports the data will explicitly walk through what the parent post got wrong and what I am updating on. I would rather publish that than dress up a weak result.

What the runnable harness looks like

The companion repository is the precondition for anyone — including me — taking this preregistration seriously. It contains:

The seeded corpus generator (so the 200 inputs are reproducible).
The fifteen perturbation variants, generated programmatically from a single canonical template (so the variants are not artisanally tuned).
The scoring code (schema validity, field accuracy, F1, Krippendorff's α).
A model-agnostic runner with stub support, so the entire pipeline runs end-to-end against a fake model without spending a cent — useful for review.
A pytest suite that exercises the perturbation generator, the scoring code, the corpus generator, and the runner against the stub. CI-friendly, no API calls.
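A model-agnostic runner with a stub can be as small as a protocol, a fake implementation, and a triple loop. The sketch below shows that shape; the `Model` protocol, `StubModel`, `run_grid`, and the record layout are hypothetical names chosen for illustration and may not match the repository.

```python
import json
from typing import Protocol


class Model(Protocol):
    """Anything with this method can be plugged into the runner."""

    def complete(self, prompt: str, document: str) -> str: ...


class StubModel:
    """Fake model: returns fixed JSON and counts calls. No network, no cost."""

    def __init__(self, canned: dict):
        self.canned = json.dumps(canned)
        self.calls = 0

    def complete(self, prompt: str, document: str) -> str:
        self.calls += 1
        return self.canned


def run_grid(
    model: Model,
    prompts: dict[str, str],
    documents: list[str],
    repeats: int = 5,
) -> list[dict]:
    """One record per (variant, input, repeat); raw output is stored unparsed
    so that scoring can run offline afterwards."""
    records = []
    for variant_id, prompt in prompts.items():
        for doc_index, document in enumerate(documents):
            for repeat in range(repeats):
                raw = model.complete(prompt, document)
                records.append(
                    {"variant": variant_id, "input": doc_index,
                     "repeat": repeat, "raw": raw}
                )
    return records
```

Run against the stub with 15 prompts and 200 documents, this loop makes 15 × 200 × 5 = 15,000 calls, which matches the committed budget and lets a reviewer check the bookkeeping without touching an API.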

Repository: github.com/tiarebalbi/prompt-fragility-exp1. The README explains reproduction in under five commands.

The repo will be tagged at the moment the experiment is run, so anyone reading the results post can git checkout the exact code that produced the numbers.

What "publishing the result" will look like

When the experiment runs, the writeup will include:

The raw per-cell numbers (every variant × every metric), not just summary statistics.
The confusion patterns — which fields fail, on which inputs, under which perturbations.
Calls I made that I am not confident about (and what I would do differently in Experiment #2).
The hypotheses table, marked confirmed / falsified / inconclusive, with the actual effect sizes.

If something surprises me, that gets called out. If a hypothesis fails in a way that updates the broader thesis from the parent post, that gets called out too. The point of this whole research line is not to be right; it is to make the field's load-bearing assumption — the model might make a mistake — into something that can be measured, contracted, and either upheld or replaced. A confirmed hypothesis advances that. A surprising falsification advances it more.

The bigger frame

Preregistration is a small move. It does not, by itself, get the field to precision. But it is one of the unglamorous engineering practices the parent post argued is exactly what is missing. Databases got versioned schemas. Networks got formal specifications. Web protocols got compliance test suites. Prompts get a vibe check and a thumbs-up.

The smallest version of "treat prompts like engineering artifacts" is: commit to what you will measure before you measure it. That is what this post is. The companion repo is the second-smallest version: make the measurement reproducible by someone who disagrees with you.

Experiment #2 — multi-model contrast on the same harness — is already scoped. Experiments #3 and #4 — RAG variance and agentic-trajectory stability — are queued. The point is not the individual experiment. The point is that there is now a queue.

More to come.
