Eval-driven development for LLM applications
If you don't have evals, you don't have a product — you have a vibe. Here's the playbook we use to make LLM behaviour testable, repeatable, and shippable.
Trillion Thoughts Engineering
The phrase that has saved us the most pain in the last twelve months of building LLM-backed products is "if you don't have evals, you don't have a product, you have a vibe."
Evals are the LLM-app equivalent of a test suite, with one important twist: they grade outputs that don't have a single right answer. Done well, they let you iterate on prompts and models with the same confidence you iterate on regular code. Done badly — or not at all — you ship and pray.
The minimum viable eval suite
You need three categories of test, in this order:
- Hard assertions. The output must contain a specific field, must be valid JSON, must not exceed N tokens, must not include a banned phrase. These are unit tests. They run in milliseconds and they fail loudly. (A sketch follows this list.)
- Reference comparisons. A human (or your best model) wrote a "good" answer. The candidate output is compared via embeddings, BLEU, or — most commonly — an LLM judge with a rubric. (A second sketch below shows the embedding variant; the judge gets its own section.)
- Production replays. Take 50 real, anonymised user interactions per week. Run them through the new prompt or model. Diff against the previous run. Triage the diffs.
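A minimal sketch of category (1), in Python. The JSON fields, token budget, and banned phrases are illustrative stand-ins, not our real checks; any test runner will do.

```python
import json

MAX_TOKENS = 512  # illustrative budget
BANNED_PHRASES = ["as an ai language model"]  # illustrative

def assert_output_shape(raw: str) -> None:
    """Hard assertions: deterministic, millisecond-fast, loud on failure."""
    payload = json.loads(raw)  # raises if the output is not valid JSON

    # Fields downstream code depends on must exist.
    assert "answer" in payload, "missing 'answer' field"
    assert "sources" in payload, "missing 'sources' field"

    # Rough token budget (whitespace split as a stand-in for a real tokeniser).
    assert len(raw.split()) <= MAX_TOKENS, "output exceeds token budget"

    # Banned phrases must not appear.
    lowered = raw.lower()
    for phrase in BANNED_PHRASES:
        assert phrase not in lowered, f"banned phrase in output: {phrase!r}"
```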
If you only have time for one, do (1). Most "the LLM gave us garbage in production" incidents would have been caught by a hard assertion.
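When you get to category (2), the embedding variant is a few lines. This sketch assumes an `embed` callable you supply (any embedding API works); the 0.85 threshold is an assumption you would calibrate on hand-graded pairs.

```python
import math
from typing import Callable

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def matches_reference(
    candidate: str,
    reference: str,
    embed: Callable[[str], list[float]],
    threshold: float = 0.85,  # calibrate against human judgments
) -> bool:
    """Soft check: is the candidate semantically close to the reference answer?"""
    return cosine_similarity(embed(candidate), embed(reference)) >= threshold
```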
The judge is also a product
Using an LLM to grade an LLM is normal practice. It is also dangerous. Two principles keep our judges honest:
- Rubrics, not opinions. The judge prompt is a list of numbered criteria with explicit yes/no questions. "Did the answer mention X? Did it cite a source? Was the tone professional?" The judge returns one bit per criterion. We aggregate. (A sketch follows this list.)
- Human-validated, periodically. Once a quarter, two engineers grade 100 outputs by hand and compare to the judge. If the judge's grades diverge from the humans' by more than 5%, we tune the rubric.
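A minimal sketch of a rubric judge, assuming a `complete` callable that sends a prompt to your judge model and returns its text. The criteria and the output parsing are illustrative.

```python
from typing import Callable

RUBRIC = [  # explicit yes/no questions, one bit each
    "Does the answer mention the user's account tier?",
    "Does the answer cite at least one source?",
    "Is the tone professional?",
]

JUDGE_PROMPT = """You are grading an answer against a rubric.
Answer each numbered question with exactly YES or NO, one per line.

Answer to grade:
{answer}

Rubric:
{criteria}
"""

def judge(answer: str, complete: Callable[[str], str]) -> list[bool]:
    """One bit per criterion; aggregation happens downstream."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    raw = complete(JUDGE_PROMPT.format(answer=answer, criteria=criteria))
    lines = [l.strip().upper() for l in raw.splitlines() if l.strip()]
    # Assumes the judge answered every criterion; add retries in real use.
    return [line.startswith("YES") for line in lines[: len(RUBRIC)]]

def aggregate(bits: list[bool]) -> float:
    return sum(bits) / len(RUBRIC)
```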
Production replays are the unlock
Every production interaction writes a replayable trace: the input, the full message history, the tool calls, the final output. When we ship a prompt change, we replay the last 1,000 traces and look at the diff (a sketch of the replay loop follows the list). Three outcomes:
- No diff. The change did nothing. Throw it away.
- Targeted diff. The change improved the cases we intended to improve, didn't touch the rest. Ship.
- Sprawling diff. The change moved everything by a little. Almost always means a regression we don't yet understand. Investigate before shipping.
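A sketch of the replay loop, assuming each stored trace carries the four fields above and a `run_pipeline` callable that runs the candidate prompt or model. Exact-string diffing is the simplest version; in practice you may diff structured fields instead.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Trace:
    input: str              # original user input
    messages: list[dict]    # full message history
    tool_calls: list[dict]  # tools invoked during the interaction
    output: str             # what production actually returned

def replay_and_diff(
    traces: list[Trace],
    run_pipeline: Callable[[str, list[dict]], str],
) -> list[tuple[Trace, str]]:
    """Re-run each trace under the candidate; return the (trace, new_output) pairs that changed."""
    diffs = []
    for trace in traces:
        new_output = run_pipeline(trace.input, trace.messages)
        if new_output != trace.output:
            diffs.append((trace, new_output))
    return diffs  # empty, targeted, or sprawling; triage as above
```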
The most expensive bug in an LLM app is the one that improves the metric you're measuring while quietly degrading the metric you forgot to measure.
What to measure
For an agentic system specifically, we track four numbers; a small aggregation sketch follows the list:
- Success rate against a fixed task suite. Did the agent complete the task?
- Tool-call efficiency. Mean and p95 number of tool calls per task. Spikes signal confusion.
- Token spend per task. Mean, p50, p95. New prompts can quietly multiply cost.
- Human-intervention rate. How often does a human have to take over? This is often the truest North Star.
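Aggregating those four numbers is plain bookkeeping. This sketch assumes each task run is recorded as a dict with `success`, `tool_calls`, `tokens`, and `human_took_over` keys; the schema is illustrative.

```python
from statistics import mean, quantiles

def percentile(xs: list[float], q: int) -> float:
    # q-th percentile via 100-quantiles (needs at least two data points)
    return quantiles(xs, n=100)[q - 1]

def agent_metrics(runs: list[dict]) -> dict:
    tool_calls = [r["tool_calls"] for r in runs]
    tokens = [r["tokens"] for r in runs]
    return {
        "success_rate": mean(r["success"] for r in runs),
        "tool_calls_mean": mean(tool_calls),
        "tool_calls_p95": percentile(tool_calls, 95),
        "tokens_p50": percentile(tokens, 50),
        "tokens_p95": percentile(tokens, 95),
        "human_intervention_rate": mean(r["human_took_over"] for r in runs),
    }
```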
The pipeline
Concretely, our pipeline looks like this (a sketch of the merge gate follows the list):
- Engineer changes the prompt or model in a PR.
- CI runs the eval suite (~3 minutes for 200 cases).
- If hard assertions fail → block the merge.
- If judge scores drop > 2% → require manual review.
- On merge, a nightly job replays last week's production traces.
- Drift report posted to Slack on Monday.
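The merge gate itself reduces to a few lines. A sketch under the rules above; how `hard_failures` and the judge scores get loaded from the eval run is up to your CI system, and the wiring below is illustrative.

```python
import sys

JUDGE_DROP_THRESHOLD = 0.02  # the 2% rule above

def ci_gate(hard_failures: int, judge_score: float, baseline_score: float) -> int:
    """Exit code drives CI: 0 = pass, 1 = block merge, 2 = manual review."""
    if hard_failures > 0:
        print(f"BLOCK: {hard_failures} hard assertion(s) failed")
        return 1
    if baseline_score - judge_score > JUDGE_DROP_THRESHOLD:
        print(f"REVIEW: judge score dropped {baseline_score - judge_score:.1%}")
        return 2
    print("PASS")
    return 0

if __name__ == "__main__":
    # Illustrative values; a real job reads these from the eval run's artifacts.
    sys.exit(ci_gate(hard_failures=0, judge_score=0.91, baseline_score=0.92))
```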
The whole thing is unglamorous. That's the point. Glamour in an LLM app usually means surprise, and surprise in production is a bug report.
