Prompt caching at production scale
Prompt caching can drop your inference cost by an order of magnitude — but only if you structure prompts the way the cache wants you to.
Trillion Thoughts Engineering
The first month we shipped a customer-facing agent, our Anthropic bill looked terrifying. The agent was sending 12,000-token system prompts on every turn, including a tool catalogue, a brand voice guide, and a retrieved context block. The model was wonderful. The accountant was not.
Two changes — both basically free engineering work — dropped that cost by 87%. Both came from taking prompt caching seriously instead of treating it as an afterthought.
How the cache thinks
Anthropic's prompt cache is prefix-keyed. The cache stores the computed attention state for the first N tokens of your request; if a new request starts with exactly the same first N tokens, the model reuses that key/value cache instead of re-encoding the prefix from scratch. Cache hits are dramatically cheaper and faster.
The two operative words are prefix and exactly. That has consequences for how you structure prompts.
Move the stable stuff to the front
Most prompts have a structure like:
[ system instructions ]
[ tool catalogue ]
[ retrieved context ]
[ conversation history ]
[ user's latest turn ]

The first two blocks are stable across thousands of requests. The next two change a little. The last one changes every turn. Order matters: keep the stable blocks at the very top so the cache prefix lines up.
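A minimal sketch of that ordering in request-building code (the constants and the buildPrompt helper are illustrative placeholders, not our production code):

// Illustrative only: assemble the request with the stable blocks first.
const SYSTEM_INSTRUCTIONS = "You are the support agent for ..."; // identical bytes on every request
const TOOL_CATALOGUE = "Tools, sorted alphabetically and never re-sorted: ..."; // ditto

type Message = { role: "user" | "assistant"; content: string };

function buildPrompt(retrievedContext: string, history: Message[], userTurn: string) {
  return {
    system: [
      // Stable content goes at the very front so the cache prefix lines up.
      { type: "text", text: SYSTEM_INSTRUCTIONS },
      { type: "text", text: TOOL_CATALOGUE },
      // Varies per request, so it sits after the stable blocks, never before them.
      { type: "text", text: retrievedContext },
    ],
    messages: [...history, { role: "user", content: userTurn }],
  };
}

// Anti-pattern: interpolating a timestamp or request ID into SYSTEM_INSTRUCTIONS
// changes the prefix on every call and takes the hit rate to zero.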
Mark the cache breakpoints explicitly
Anthropic's API takes cache_control markers on individual message parts. A practical pattern:
{
  "system": [
    { "type": "text", "text": SYSTEM_INSTRUCTIONS,
      "cache_control": { "type": "ephemeral" } },
    { "type": "text", "text": TOOL_CATALOGUE,
      "cache_control": { "type": "ephemeral" } }
  ],
  "messages": [
    ...history,
    { "role": "user", "content": userTurn }
  ]
}

That tells the cache: "store the prefix up to and including the tool catalogue; everything after is fair game to vary." We've seen 90%+ hit rates with this layout for agent loops where the only thing that changes between calls is the latest message.
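Through the TypeScript SDK the same layout looks roughly like this; a sketch rather than a definitive implementation, with the model name and constants as placeholders:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const SYSTEM_INSTRUCTIONS = "...";
const TOOL_CATALOGUE = "...";

type Turn = { role: "user" | "assistant"; content: string };

async function runTurn(history: Turn[], userTurn: string) {
  return client.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder; use whatever you run in production
    max_tokens: 1024,
    system: [
      { type: "text", text: SYSTEM_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
      { type: "text", text: TOOL_CATALOGUE, cache_control: { type: "ephemeral" } },
    ],
    messages: [...history, { role: "user", content: userTurn }],
  });
}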
Quirks worth knowing
- Min length. The cache only kicks in once the prefix clears a minimum length, roughly one to two thousand tokens depending on the model. Below that, you pay full price. Worth checking before you start optimising.
- TTL. Ephemeral cache entries live for about five minutes, and the clock resets each time the prefix is read. Bursty traffic benefits more than spread-out traffic, but there's a longer-lived one-hour tier if you ask for it.
- Order is sacred. Reorder your tool catalogue and the cache invalidates. Sort tool definitions alphabetically and never re-sort.
- Cache hits show up in the response. Look for cache_read_input_tokens in the usage object. Wire it into your dashboard (see the sketch after this list). If your hit rate is below 70% on agent loops, something is varying that shouldn't be.
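A minimal sketch of that wiring; recordMetric is a hypothetical stand-in for whatever your dashboard ingests, and the field names are the ones the Messages API reports in its usage object:

// Shape of the usage block on a Messages API response (the fields we care about here).
type Usage = {
  input_tokens: number;                        // input billed at the normal rate
  cache_creation_input_tokens?: number | null; // prefix written to the cache this call
  cache_read_input_tokens?: number | null;     // prefix served from the cache this call
};

// Hypothetical metrics hook; swap in your own statsd / Prometheus / logging call.
declare function recordMetric(name: string, value: number): void;

function recordCacheUsage(usage: Usage): void {
  recordMetric("llm.input_tokens", usage.input_tokens);
  recordMetric("llm.cache_creation_input_tokens", usage.cache_creation_input_tokens ?? 0);
  recordMetric("llm.cache_read_input_tokens", usage.cache_read_input_tokens ?? 0);
}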
What we measure
One metric, plotted weekly:
cache_hit_ratio = cache_read_tokens / total_input_tokens

Sustained above 0.7 for our main agent. Anything below 0.5 triggers a review. The review almost always finds someone has injected a timestamp or a request ID into the prefix.
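One wrinkle when computing the ratio (our reading of the usage fields; worth double-checking against the current docs): input_tokens excludes cached tokens, so the denominator has to add cache reads and writes back in. Roughly:

// Weekly roll-up over the per-request counters recorded above.
function cacheHitRatio(totals: {
  inputTokens: number;      // sum of usage.input_tokens (uncached input)
  cacheReadTokens: number;  // sum of usage.cache_read_input_tokens
  cacheWriteTokens: number; // sum of usage.cache_creation_input_tokens
}): number {
  const totalInput = totals.inputTokens + totals.cacheReadTokens + totals.cacheWriteTokens;
  return totalInput === 0 ? 0 : totals.cacheReadTokens / totalInput;
}
// Below 0.5, go looking for a timestamp or request ID that crept into the prefix.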
Prompt caching is not a knob you turn on. It is a discipline about how you build prompts. The reward is steady cost as you scale, instead of a hockey stick.