Turning LLM Context Engineering Into an Evaluation Loop with DSPy
Notes from two weekends of digging into DSPy. I stopped treating prompts as the source of truth and started treating them as compiled output from a typed signature, a metric, and an optimizer. Here is the smallest end-to-end program I kept, how MIPROv2 actually searches, and where the approach breaks down in practice.
Every LLM app I have built in the last year eventually hits the same wall. A prompt that worked on a Tuesday quietly regresses on a Friday when the vendor rolls a new model snapshot. A RAG pipeline that nails the happy path falls apart on multi-hop questions. An agent loses the plot halfway through a tool chain. I used to debug that by reading traces and editing the prompt by hand. It felt like writing CSS with no browser to reload.
DSPy pushes that work into code. I spent two weekends with it — reading the source, building a throwaway ticket classifier, running MIPROv2 against a handful of held-out examples — and the takeaway from my notes is simple. The job is no longer "write a better prompt". The job is to declare what the model should do, define how to measure success, and let an optimizer compile the prompt and demonstrations for me.
Why prompt strings stop scaling
A prompt is three things tangled together: an interface (what the model takes and returns), an instruction (how it should behave), and a set of examples (what good looks like). When I change provider or tweak a downstream schema, all three churn at once, and I have no way to tell which part regressed. There is no version, no dataset, no metric. The prompt is also non-portable: switching from gpt-4o-mini to a local Qwen almost always needs a rewrite, because the old string was overfit to the old model's quirks.
DSPy untangles the three. A signature is the interface. A module is the behavior. Examples are the data. A metric scores the output. An optimizer searches the space of instructions and demonstrations against that metric. That is the whole mental model, and once I wrote it down, most of my past prompt work started to look like a giant ad-hoc optimizer I had been running inside my head.
The minimal program
Here is the smallest end-to-end DSPy program I kept in my notes. It classifies support tickets into three buckets, measures accuracy on a tiny dev set, and runs one pass of MIPROv2 to improve the prompt and pick few-shot demos automatically.
```python
# classify.py — a minimal DSPy program with a metric and an optimizer pass.
import dspy

# 1. Configure the LM. Any LiteLLM-supported provider works.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 2. Declare a signature. This replaces the prompt string.
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket as 'billing', 'technical', or 'other'."""

    ticket: str = dspy.InputField()
    category: str = dspy.OutputField()

# 3. Wire a module. Predict is simplest; ChainOfThought adds reasoning.
classify = dspy.ChainOfThought(ClassifyTicket)

# 4. A labeled dataset — the backbone for eval and optimization.
examples = [
    dspy.Example(ticket="My invoice is wrong.", category="billing").with_inputs("ticket"),
    dspy.Example(ticket="Why was I charged twice?", category="billing").with_inputs("ticket"),
    dspy.Example(ticket="The app crashes on login.", category="technical").with_inputs("ticket"),
    dspy.Example(ticket="504 error when I upload a file.", category="technical").with_inputs("ticket"),
    dspy.Example(ticket="Do you have a Slack channel?", category="other").with_inputs("ticket"),
    dspy.Example(ticket="Can I change my account timezone?", category="other").with_inputs("ticket"),
]

# 5. A metric: exact match on the category field.
def accuracy(example, prediction, trace=None):
    return example.category.strip().lower() == prediction.category.strip().lower()

# 6. Baseline eval, before any optimization.
evaluate = dspy.Evaluate(devset=examples, metric=accuracy, display_progress=True)
print("baseline:", evaluate(classify))

# 7. Optimize: MIPROv2 searches instructions and few-shot demos.
optimizer = dspy.MIPROv2(metric=accuracy, auto="light")
optimized = optimizer.compile(classify, trainset=examples)
print("optimized:", evaluate(optimized))

# Save the compiled program so it can be reloaded without re-compiling.
optimized.save("optimized_classifier.json")
```

Run it with:
```bash
pip install dspy
OPENAI_API_KEY=sk-... python classify.py
```

A few lines deserve attention. `dspy.configure(lm=...)` sets the global model via LiteLLM, so the same program runs against OpenAI, Anthropic, a local Ollama endpoint, or a SageMaker model by changing one string. The Signature is typed — `ticket: str` and `category: str` — and DSPy compiles those types into the prompt it actually sends. The metric is a plain Python function; anything I can score, the optimizer can improve. `auto="light"` asks MIPROv2 to pick reasonable defaults for trial count and candidate count based on the size of the trainset, so I am not hand-tuning the optimizer on top of hand-tuning the prompt.
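The portability claim is cheap to test. A sketch of the one-string swap, assuming a local Ollama endpoint with a Qwen model already pulled; the model tag and port are illustrative:

```python
# Same program, different backend: point the global LM at a local Ollama
# endpoint instead of OpenAI. Model tag and URL are examples, not defaults.
dspy.configure(lm=dspy.LM(
    "ollama_chat/qwen2.5:7b",           # any LiteLLM model string works here
    api_base="http://localhost:11434",  # Ollama's default port
    api_key="",                         # Ollama needs no key
))
```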
How MIPROv2 actually searches
The name is not decorative. MIPROv2 is a Bayesian search over two things at once — instructions and few-shot demonstrations — run in three phases, which I will walk through in order.
Phase one bootstraps candidate demonstrations. DSPy samples from the trainset, runs the current program, and keeps only the traces whose outputs pass the metric. The defaults, per the MIPROv2 API page, are four bootstrapped demos plus four labeled ones.
Phase two proposes candidate instructions. A `prompt_model` (the same LM by default) reads the program's code, a short summary of the dataset, and the bootstrapped demos, and drafts several instructions worth trying. This is where DSPy exploits information that hand-prompting throws away — the actual shape of the data, and the shape of the program around the prompt.
Phase three combines instructions and demos and evaluates them on a minibatch of the trainset (default size 35). Bayesian optimization picks which combinations to try next. At the end, MIPROv2 returns the best-scoring program — a module with a different compiled prompt attached.
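For orientation, here is roughly what `auto="light"` is deciding for me, written out as explicit arguments. This is a sketch: the parameter names follow the MIPROv2 API page, but treat the exact values and the split between constructor and compile arguments as assumptions that may shift between DSPy versions.

```python
# A sketch of the knobs auto="light" would otherwise choose. Parameter
# names follow the MIPROv2 API page; the values here are illustrative.
optimizer = dspy.MIPROv2(
    metric=accuracy,
    auto=None,             # turn off the presets so the knobs below apply
    num_candidates=7,      # phase two: candidate instructions to propose
    init_temperature=0.5,  # temperature for the instruction proposer
)
optimized = optimizer.compile(
    classify,
    trainset=examples,
    num_trials=15,             # phase three: Bayesian-search trials
    minibatch_size=35,         # phase three: examples scored per trial
    max_bootstrapped_demos=4,  # phase one: demos mined from passing traces
    max_labeled_demos=4,       # phase one: demos taken straight from the trainset
)
```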
The interesting result from the MIPROv2 paper is not that Bayesian search is magic. It is that I stop having to pick which part of the prompt to edit. Change the dataset, re-run compile, get a prompt that matches the new data.
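It is also worth looking at what compile actually produced. A sketch using DSPy's introspection helpers (`dspy.inspect_history` and `named_predictors()` exist today; treat the exact attribute paths as assumptions):

```python
# Run one prediction, then print the exact prompt DSPy sent to the LM.
optimized(ticket="The app crashes on login.")
dspy.inspect_history(n=1)

# The compiled instruction and the picked demos live on the module itself.
for name, predictor in optimized.named_predictors():
    print(name, "instruction:", predictor.signature.instructions)
    print(name, "demos:", len(predictor.demos))
```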
The trade-offs I hit
Compile time is real. My six-example classifier finishes `auto="light"` in roughly a minute on gpt-4o-mini. Scale that to 200 examples and a heavier program, and the rough bound I keep seeing quoted — five minutes to an hour — is consistent with how my local timings scaled. Each trial is more LM calls; there is no free lunch.
Cost moves in the same direction. I pay for the bootstraps, the instruction proposals, and the minibatch evaluations. In exchange, I can downshift to a cheaper model at serving time. Dropbox's write-up of their Dash relevance judge is the cleanest public example: they compiled quality into a smaller model instead of paying a larger one forever. I cannot reproduce their numbers, but the shape of the win matches what I saw on my toy classifier.
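The shape of that trade is easy to write down. MIPROv2 accepts separate `prompt_model` and `task_model` arguments, so a stronger model can draft instructions while the cheap serving model executes the candidates; the sketch below assumes those two kwargs behave as documented, and the model choices are illustrative.

```python
# Compile with a strong model proposing instructions; serve on a cheap one.
# Model choices are illustrative, not a recommendation.
cheap = dspy.LM("openai/gpt-4o-mini")
strong = dspy.LM("openai/gpt-4o")

dspy.configure(lm=cheap)  # the serving model: what answers tickets in production

optimizer = dspy.MIPROv2(
    metric=accuracy,
    auto="light",
    prompt_model=strong,  # drafts candidate instructions during compile
    task_model=cheap,     # runs the candidates, so scores reflect serving cost
)
optimized = optimizer.compile(classify, trainset=examples)
```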
The abstraction is unfamiliar. The first time I saw `Predict("question -> answer: float")` I read it as DSL theater. Two days in, I realized the string is a signature, the signature is a type, and the type is what lets the optimizer reason about what changed between two candidates. The friction is upfront; the payoff is that a prompt stops being a debugging dead end.
Evaluation is still the hard part. DSPy will not save me from a bad metric. A metric that rewards the wrong thing will be overfit to with enthusiasm. That is not a DSPy bug — it is the classic ML lesson that context engineering inherits. The discipline the framework forces is: write the metric first, argue about it, then optimize.
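One discipline the metric signature itself encourages: the `trace` argument is `None` during plain evaluation and populated while the optimizer bootstraps demos, so a single metric can score leniently but gate demo selection strictly. A sketch of that pattern, reusing the classifier's labels:

```python
# A metric that is stricter during bootstrapping than during evaluation.
# DSPy passes trace=None when scoring; trace is populated when deciding
# whether a passing trace is good enough to keep as a few-shot demo.
VALID_CATEGORIES = {"billing", "technical", "other"}

def graded_accuracy(example, prediction, trace=None):
    pred = prediction.category.strip().lower()
    gold = example.category.strip().lower()
    if trace is not None:
        # Bootstrapping: only keep demos with a clean, valid, exact label.
        return pred in VALID_CATEGORIES and pred == gold
    return pred == gold
```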
When I reach for it, and when I skip it
I reach for DSPy when the program has more than one LM call, or when I need the same behavior across two or more models, or when I have at least a dozen labeled examples and a metric I trust. Multi-hop RAG, classification with brittle edge cases, agentic tool use — those are where the optimizer pays rent.
I skip it for one-shot prompts I touch once a month. The boilerplate is not worth it. I also skip it when I cannot define a metric that matters. No metric, no optimization; at that point I am back to prompt engineering, and that is fine.
- Treat a prompt as the compiled output of a program, not a source of truth.
- Start by declaring the signature and writing the metric. The optimizer comes last.
- Use `auto="light"` before hand-tuning optimizer parameters. Defaults are usually fine.
- Budget real time and real tokens for compile. Cache the compiled program to disk with `.save()` and reload it with `.load()`, as in the sketch after this list.
- Portability across models is a property of the discipline, not the tool. DSPy only makes it cheap to exercise.
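Reloading is two lines. A minimal sketch, assuming the classify.py module from earlier and its saved optimized_classifier.json:

```python
# serve.py: reload the compiled program without paying for compile again.
import dspy

from classify import ClassifyTicket  # the signature defined in classify.py

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

classifier = dspy.ChainOfThought(ClassifyTicket)
classifier.load("optimized_classifier.json")  # restores instructions and demos

print(classifier(ticket="I was billed for a plan I cancelled.").category)
```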
Use DSPy when the task is measurable, the program is multi-step, or model swaps are on the roadmap. Avoid it for one-off prompts, tasks without a usable metric, or workflows where even five minutes of compile time is a non-starter.