Prompt caching at production scale
Prompt caching can drop your inference cost by an order of magnitude — but only if you structure prompts the way the cache wants you to.
Trillion Thoughts Engineering
The first month we shipped a customer-facing agent, our Anthropic bill looked terrifying. The agent was sending 12,000-token system prompts on every turn, including a tool catalogue, a brand voice guide, and a retrieved context block. The model was wonderful. The accountant was not.
Two changes — both basically free engineering work — dropped that cost by 87%. Both came from taking prompt caching seriously instead of treating it as an afterthought.
How the cache thinks
Anthropic's prompt cache is prefix-keyed. The cache stores the computed attention state for the first N tokens of your request; if a new request starts with exactly the same first N tokens, the model reuses that key/value cache instead of re-encoding the prefix from scratch. Cache hits are dramatically cheaper and faster.
The two operative words are prefix and exactly. That has consequences for how you structure prompts.
Move the stable stuff to the front
Most prompts have a structure like:
[ system instructions ]
[ tool catalogue ]
[ retrieved context ]
[ conversation history ]
[ user's latest turn ]

The first two blocks are stable across thousands of requests. The next two change a little. The last one changes every turn. Order matters: keep the stable blocks at the very top so the cache prefix lines up.
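A minimal sketch of that ordering in request-building code (the constants and the buildPrompt helper are illustrative placeholders, not our production code):

// Illustrative only: assemble the request with the stable blocks first.
const SYSTEM_INSTRUCTIONS = "You are the support agent for ..."; // identical bytes on every request
const TOOL_CATALOGUE = "Tools, sorted alphabetically and never re-sorted: ..."; // ditto

type Message = { role: "user" | "assistant"; content: string };

function buildPrompt(retrievedContext: string, history: Message[], userTurn: string) {
  return {
    system: [
      // Stable content goes at the very front so the cache prefix lines up.
      { type: "text", text: SYSTEM_INSTRUCTIONS },
      { type: "text", text: TOOL_CATALOGUE },
      // Varies per request, so it sits after the stable blocks, never before them.
      { type: "text", text: retrievedContext },
    ],
    messages: [...history, { role: "user", content: userTurn }],
  };
}

// Anti-pattern: interpolating a timestamp or request ID into SYSTEM_INSTRUCTIONS
// changes the prefix on every call and takes the hit rate to zero.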
Mark the cache breakpoints explicitly
Anthropic's API takes cache_control markers on individual message parts. A practical pattern:
{
  "system": [
    { "type": "text", "text": SYSTEM_INSTRUCTIONS,
      "cache_control": { "type": "ephemeral" } },
    { "type": "text", "text": TOOL_CATALOGUE,
      "cache_control": { "type": "ephemeral" } }
  ],
  "messages": [
    ...history,
    { "role": "user", "content": userTurn }
  ]
}

That tells the cache: "store the prefix up to and including the tool catalogue; everything after is fair game to vary." We've seen 90%+ hit rates with this layout for agent loops where the only thing that changes between calls is the latest message.
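Through the TypeScript SDK the same layout looks roughly like this; a sketch rather than a definitive implementation, with the model name and constants as placeholders:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const SYSTEM_INSTRUCTIONS = "...";
const TOOL_CATALOGUE = "...";

type Turn = { role: "user" | "assistant"; content: string };

async function runTurn(history: Turn[], userTurn: string) {
  return client.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder; use whatever you run in production
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
      { type: "text", text: TOOL_CATALOGUE, cache_control: { type: "ephemeral" } },
    ],
    messages: [...history, { role: "user", content: userTurn }],
  });
}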
Quirks worth knowing
- Min length. The cache only kicks in once the prefix clears a minimum length, roughly one to two thousand tokens depending on the model. Below that, you pay full price. Worth checking before you start optimising.
- TTL. Ephemeral cache entries live for about five minutes, and the clock resets each time the prefix is read. Bursty traffic benefits more than spread-out traffic, but there's a longer-lived one-hour tier if you ask for it.
- Order is sacred. Reorder your tool catalogue and the cache invalidates. Sort tool definitions alphabetically and never re-sort.
- Cache hits show up in the response. Look for cache_read_input_tokens in the usage object. Wire it into your dashboard (see the sketch after this list). If your hit rate is below 70% on agent loops, something is varying that shouldn't be.
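A minimal sketch of that wiring; recordMetric is a hypothetical stand-in for whatever your dashboard ingests, and the field names are the ones the Messages API reports in its usage object:

// Shape of the usage block on a Messages API response (the fields we care about here).
type Usage = {
  input_tokens: number;                        // input billed at the normal rate
  cache_creation_input_tokens?: number | null; // prefix written to the cache this call
  cache_read_input_tokens?: number | null;     // prefix served from the cache this call
};

// Hypothetical metrics hook; swap in your own statsd / Prometheus / logging call.
declare function recordMetric(name: string, value: number): void;

function recordCacheUsage(usage: Usage): void {
  recordMetric("llm.input_tokens", usage.input_tokens);
  recordMetric("llm.cache_creation_input_tokens", usage.cache_creation_input_tokens ?? 0);
  recordMetric("llm.cache_read_input_tokens", usage.cache_read_input_tokens ?? 0);
}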
What we measure
One metric, plotted weekly:
cache_hit_ratio = cache_read_tokens / total_input_tokens

Sustained above 0.7 for our main agent. Anything below 0.5 triggers a review. The review almost always finds someone has injected a timestamp or a request ID into the prefix.
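One wrinkle when computing the ratio (our reading of the usage fields; worth double-checking against the current docs): input_tokens excludes cached tokens, so the denominator has to add cache reads and writes back in. Roughly:

// Weekly roll-up over the per-request counters recorded above.
function cacheHitRatio(totals: {
  inputTokens: number;      // sum of usage.input_tokens (uncached input)
  cacheReadTokens: number;  // sum of usage.cache_read_input_tokens
  cacheWriteTokens: number; // sum of usage.cache_creation_input_tokens
}): number {
  const totalInput = totals.inputTokens + totals.cacheReadTokens + totals.cacheWriteTokens;
  return totalInput === 0 ? 0 : totals.cacheReadTokens / totalInput;
}
// Below 0.5, go looking for a timestamp or request ID that crept into the prefix.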
Prompt caching is not a knob you turn on. It is a discipline about how you build prompts. The reward is steady cost as you scale, instead of a hockey stick.