Generative AI · 22 March 2026 · 8 min read

Eval-driven development for LLM applications

If you don't have evals, you don't have a product — you have a vibe. Here's the playbook we use to make LLM behaviour testable, repeatable, and shippable.

Trillion Thoughts Engineering

The phrase that has saved us the most pain in the last twelve months of building LLM-backed products is "if you don't have evals, you don't have a product, you have a vibe."

Evals are the LLM-app equivalent of a test suite, with one important twist: they grade outputs that don't have a single right answer. Done well, they let you iterate on prompts and models with the same confidence you iterate on regular code. Done badly — or not at all — you ship and pray.

The minimum viable eval suite

You need three categories of test, in this order:

  1. Hard assertions. The output must contain a specific field, must be valid JSON, must not exceed N tokens, must not include a banned phrase. These are unit tests. They run in milliseconds and they fail loudly.
  2. Reference comparisons. A human (or your best model) wrote a "good" answer. The candidate output is compared via embeddings, BLEU, or — most commonly — an LLM judge with a rubric.
  3. Production replays. Take 50 real, anonymised user interactions per week. Run them through the new prompt or model. Diff against the previous run. Triage the diffs.

If you only have time for one, do (1). Most "the LLM gave us garbage in production" incidents would have been caught by a hard assertion.
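As a minimal sketch of category (1), here is what a hard-assertion check might look like. The field name `answer`, the token cap, and the banned-phrase list are illustrative assumptions, not values from our suite:

```python
import json

# Illustrative limits -- tune these per product.
BANNED_PHRASES = ["as an AI language model"]
MAX_TOKENS = 500

def check_output(raw: str) -> list[str]:
    """Return a list of failure messages; an empty list means the output passes."""
    failures = []
    try:
        data = json.loads(raw)  # must be valid JSON
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if "answer" not in data:  # must contain the required field
        failures.append("missing required field: answer")
    text = data.get("answer", "")
    if len(text.split()) > MAX_TOKENS:  # crude token cap via whitespace split
        failures.append(f"answer exceeds {MAX_TOKENS} tokens")
    for phrase in BANNED_PHRASES:  # must not include a banned phrase
        if phrase.lower() in text.lower():
            failures.append(f"banned phrase: {phrase!r}")
    return failures

print(check_output('{"answer": "42"}'))  # []
print(check_output("not json"))          # ['output is not valid JSON']
```

Checks like these run in milliseconds, so they can gate every CI run and even every production response.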

The judge is also a product

Using an LLM to grade an LLM is normal practice. It is also dangerous. Two principles keep our judges honest:

  • Rubrics, not opinions. The judge prompt is a list of numbered criteria with explicit yes/no questions. "Did the answer mention X? Did it cite a source? Was the tone professional?" The judge returns one bit per criterion. We aggregate.
  • Human-validated, periodically. Once a quarter, two engineers grade 100 outputs by hand and compare to the judge. If the judge drifts more than 5%, we tune the rubric.
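A sketch of the rubric approach: build the judge prompt from numbered yes/no criteria, then aggregate one bit per criterion into a pass fraction. The criteria below are illustrative, and `build_judge_prompt` assumes a model client you supply yourself:

```python
# Illustrative rubric -- each criterion is a yes/no question.
RUBRIC = [
    "Did the answer mention the user's deadline?",
    "Did it cite at least one source?",
    "Was the tone professional?",
]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the judge prompt; send it with whatever LLM client you use."""
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    return (
        "Grade the answer against each criterion. "
        "Reply with exactly one line per criterion: YES or NO.\n\n"
        f"Question: {question}\nAnswer: {answer}\n\nCriteria:\n{criteria}"
    )

def score(judge_reply: str) -> float:
    """Aggregate one bit per criterion into a pass fraction."""
    bits = [line.strip().upper().startswith("YES")
            for line in judge_reply.strip().splitlines()]
    return sum(bits) / len(RUBRIC)

print(score("YES\nNO\nYES"))  # 0.6666...
```

Forcing one bit per criterion is the point: a single 1-to-10 "quality" score hides which criterion regressed, while per-criterion bits make drift diagnosable.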

Production replays are the unlock

Every production interaction writes a replayable trace: the input, the full message history, the tool calls, the final output. When we ship a prompt change, we replay the last 1,000 traces and look at the diff. Three outcomes:

  • No diff. The change did nothing. Throw it away.
  • Targeted diff. The change improved the cases we intended to improve, didn't touch the rest. Ship.
  • Sprawling diff. The change moved everything by a little. Almost always means a regression we don't yet understand. Investigate before shipping.

The most expensive bug in an LLM app is the one that improves the metric you're measuring while quietly degrading the metric you forgot to measure.
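The three-way triage above can be sketched as a function over replayed traces. `run_prompt` stands in for your inference call, and the 20% cutoff separating "targeted" from "sprawling" is an illustrative assumption, not a fixed rule:

```python
def triage(traces: list, old_outputs: list, run_prompt) -> str:
    """Replay traces through the new prompt and classify the diff against the old run."""
    changed = sum(
        1 for trace, old in zip(traces, old_outputs)
        if run_prompt(trace) != old  # run_prompt: your inference call
    )
    frac = changed / len(traces)
    if frac == 0:
        return "no diff: change did nothing"
    if frac <= 0.20:  # assumed cutoff for a "targeted" change
        return "targeted diff: review the changed cases, then ship"
    return "sprawling diff: investigate before shipping"
```

In practice you would also persist the per-trace diffs, not just the verdict, so the triage meeting starts from concrete examples.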

What to measure

For an agentic system specifically, we track:

  • Success rate against a fixed task suite. Did the agent complete the task?
  • Tool-call efficiency. Mean and p95 number of tool calls per task. Spikes signal confusion.
  • Token spend per task. Mean, p50, p95. New prompts can quietly multiply cost.
  • Human-intervention rate. How often does a human have to take over? This is often the truest North Star.
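The four metrics above can be computed from per-task records. This is a sketch with illustrative field names (`tool_calls`, `tokens`, `success`, `human_takeover`) and a simple nearest-rank percentile:

```python
import statistics

def percentile(values: list, p: float):
    """Nearest-rank percentile; simple on purpose."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

def summarize(tasks: list[dict]) -> dict:
    """Aggregate one record per task into the dashboard metrics."""
    tool_calls = [t["tool_calls"] for t in tasks]
    tokens = [t["tokens"] for t in tasks]
    return {
        "success_rate": sum(t["success"] for t in tasks) / len(tasks),
        "tool_calls_mean": statistics.mean(tool_calls),
        "tool_calls_p95": percentile(tool_calls, 95),
        "tokens_p50": percentile(tokens, 50),
        "tokens_p95": percentile(tokens, 95),
        "intervention_rate": sum(t["human_takeover"] for t in tasks) / len(tasks),
    }
```

Reporting p95 alongside the mean matters: a prompt change that confuses the agent on a handful of tasks barely moves the mean but shows up immediately at p95.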

The pipeline

Concretely, our pipeline looks like:

  1. Engineer changes the prompt or model in a PR.
  2. CI runs the eval suite (~3 minutes for 200 cases).
  3. If hard assertions fail → block the merge.
  4. If judge scores drop > 2% → require manual review.
  5. On merge, a nightly job replays last week's production traces.
  6. Drift report posted to Slack on Monday.
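The merge gates in steps 3 and 4 reduce to a small decision function. The 2% judge-drop threshold is the one stated above; the function signature and return strings are illustrative:

```python
def gate(hard_failures: int, judge_score: float, baseline_score: float) -> str:
    """Decide what CI does with a prompt/model PR."""
    if hard_failures > 0:
        return "block"  # step 3: any hard-assertion failure blocks the merge
    if baseline_score - judge_score > 0.02:
        return "manual-review"  # step 4: judge score dropped more than 2%
    return "merge"

print(gate(0, 0.91, 0.92))  # merge
print(gate(0, 0.88, 0.92))  # manual-review
print(gate(2, 0.95, 0.92))  # block
```

Keeping this as a pure function makes the policy itself testable, so the gate that protects the evals is covered by evals of its own.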

The whole thing is unglamorous. That's the point. Glamour in an LLM app usually means surprise, and surprise in production is a bug report.