Generative AI · 2 April 2026 · 9 min read

Designing durable agent workflows for production

Production agents that survive retries, partial failures, and weeks-long executions don't look like a chat loop. They look like a workflow with an LLM in the middle.

Trillion Thoughts Engineering

It is easy to build a demo agent: while (notDone) { call LLM; run tool; check }. It is hard to build one that runs for nine days, talks to seven external systems, and survives the inevitable moment when one of those systems decides today is a bad day.

We've been shipping agents to production for the better part of a year now. The architectural shape that has held up best looks almost nothing like the demos. It looks like a workflow with an LLM in the middle.

The shape

Five components, in roughly this order:

  1. A durable workflow runtime. Temporal, in our case. The workflow is the thing that "remembers". The LLM does not.
  2. The agent loop, expressed as activities. Each LLM call is one activity. Each tool call is another. The workflow simply alternates between them.
  3. A typed tool registry. Tools are functions with a schema. The LLM gets the schema; the workflow runs the function. The model never touches your network directly (see the sketch after this list).
  4. An eval harness wired to production traces. Every run produces a trace. The trace is replayable, attachable to a test case, and runnable through the eval suite.
  5. A human-in-the-loop signal. Long-running agents need a way for a human to step in. We model this as a Temporal signal that the workflow can wait on at any point.
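
To make the third item concrete, here is a minimal sketch of a typed tool registry in TypeScript, using Zod for the schemas. The tool names, argument shapes, and stub bodies are illustrative, not our production code:

import { z } from "zod";

// Each tool pairs a schema (what the model sees) with a function
// (what the workflow runs). The model never calls the function itself.
const registry = {
  get_user_by_email: {
    description: "Look up a user by email address.",
    schema: z.object({ email: z.string().email() }),
    run: async (args: { email: string }) =>
      ({ id: "u_123", email: args.email }), // stand-in for your data layer
  },
  refund_order: {
    description: "Refund an order in full.",
    schema: z.object({ orderId: z.string() }),
    run: async (args: { orderId: string }) =>
      ({ orderId: args.orderId, status: "refunded" }), // stand-in for payments
  },
} as const;

// The workflow validates the model's arguments before running anything.
async function runTool(name: keyof typeof registry, rawArgs: unknown) {
  const tool = registry[name];
  const args = tool.schema.parse(rawArgs); // throws on malformed arguments
  return tool.run(args as never); // union narrowing elided for brevity
}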

What goes inside the activity

Every LLM call is an activity. That means it gets retries and timeouts for free, and those retries are safe, but only if you write the activity to be idempotent. Three rules we follow:

  • Pass the full context in. No reading from a global conversation cache inside the activity. The workflow owns the conversation; the activity is pure.
  • Cache aggressively. System prompts, tool schemas, and long context windows go through Anthropic's prompt cache. Done right, this drops cost by an order of magnitude on long-running agents.
  • Log the full input and output. Anthropic returns a stable response ID. Pair it with your workflow ID and you have a straight line from "this run misbehaved" to "this is the exact prompt and response".
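
Put together, an activity that follows all three rules looks roughly like this. It is a sketch assuming Anthropic's TypeScript SDK and its cache_control content blocks; the model name, parameter names, and logging call are placeholders, and the parameter shape is slightly richer than the simplified loop shown in the next section:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Pure: everything the call needs arrives as arguments, so Temporal can
// retry or replay this activity without touching any hidden state.
export async function callLlm(input: {
  workflowId: string;
  system: string;
  history: Anthropic.MessageParam[];
}) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-5", // placeholder; pin the model you've evaluated
    max_tokens: 4096,
    // cache_control marks the stable prefix for Anthropic's prompt cache.
    system: [
      { type: "text", text: input.system, cache_control: { type: "ephemeral" } },
    ],
    messages: input.history,
  });

  // Pair the provider's response ID with the workflow ID so a misbehaving
  // run leads straight back to the exact prompt and response.
  console.log(JSON.stringify({ // swap in your structured logger
    workflowId: input.workflowId,
    responseId: response.id,
    request: input,
    response,
  }));

  // Reduce the raw response to the decision shape the workflow loop expects.
  const toolUse = response.content.find(
    (block): block is Anthropic.ToolUseBlock => block.type === "tool_use",
  );
  if (toolUse) return { type: "tool", tool: toolUse.name, args: toolUse.input };
  return {
    type: "final",
    value: response.content
      .filter((block): block is Anthropic.TextBlock => block.type === "text")
      .map((block) => block.text)
      .join(""),
  };
}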

What goes inside the workflow

The workflow is the boring part — and that's the point. A typical loop looks like this:

// Runs inside Temporal. callLlm and runTool are proxied activities;
// waitForSignal wraps a signal handler (see the config sketch below).
async function agentWorkflow(input) {
  const history = [];
  while (true) {
    const decision = await callLlm({ history, input });
    history.push(decision);
    if (decision.type === "final") return decision.value;
    if (decision.type === "tool") {
      const result = await runTool(decision.tool, decision.args);
      history.push(result);
    }
    if (decision.type === "human") {
      const reply = await waitForSignal("human_reply");
      history.push(reply);
    }
  }
}

Notice what's not there: no try/catch around the network, no retry loop, no idempotency keys. All of that lives in the activity definitions and the workflow runtime config.
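
In Temporal's TypeScript SDK, that looks roughly like the following. The timeout and retry numbers are illustrative, and this waitForSignal is just one way to wrap the signal primitives the loop above assumes:

import { proxyActivities, defineSignal, setHandler, condition } from "@temporalio/workflow";
import type * as activities from "./activities";

// Retries, timeouts, and backoff live here, not in the loop itself.
const { callLlm, runTool } = proxyActivities<typeof activities>({
  startToCloseTimeout: "2 minutes",
  retry: {
    initialInterval: "1 second",
    backoffCoefficient: 2,
    maximumAttempts: 5,
  },
});

// One way to build waitForSignal from Temporal's primitives.
const humanReply = defineSignal<[string]>("human_reply");

async function waitForSignal(_name: "human_reply") {
  let reply: string | undefined;
  setHandler(humanReply, (value) => { reply = value; });
  await condition(() => reply !== undefined);
  return reply;
}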

The hard parts nobody mentions

1. Drift

Models change. The prompt that worked beautifully in March will start producing slightly different outputs by June. Without an eval suite running against production traces, you'll only notice when a customer complains. Build the eval harness first. Treat it like a unit test suite for prompts.
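
A concrete starting point, as a sketch: persist every trace, then pin the ones that matter as test cases. The file path and trace fields are illustrative, and we're assuming a Vitest-style runner plus the callLlm activity sketched earlier:

import { test, expect } from "vitest";
import { readFileSync } from "node:fs";
import { callLlm } from "./activities"; // the same activity the workflow runs

// A production trace becomes a pinned test case: same input, plus an
// assertion on the behaviour we care about rather than on exact text.
const trace = JSON.parse(readFileSync("traces/refund-run-0142.json", "utf8"));

test("refund agent still reaches for refund_order", async () => {
  const decision = await callLlm({
    workflowId: "eval-refund-0142",
    system: trace.system,
    history: trace.history,
  });
  expect(decision).toMatchObject({ type: "tool", tool: "refund_order" });
});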

2. Bounded loops

Every agent will eventually try to spin forever. We cap iterations (usually 30–50), cap total token spend per workflow, and have a kill-switch signal that any human can fire. None of this should be controversial.
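
Inside the workflow, the caps are just ordinary checks. A sketch, with illustrative numbers, a token-usage field we made up for the example, and a kill signal wired the same way as human_reply above:

const MAX_ITERATIONS = 40;
const MAX_TOKENS = 2_000_000; // total budget for the whole run

async function boundedAgentWorkflow(input) {
  const history = [];
  let tokensSpent = 0;
  let killed = false; // flipped by a "kill" signal handler

  for (let i = 0; i < MAX_ITERATIONS && !killed; i++) {
    const decision = await callLlm({ history, input });
    tokensSpent += decision.tokensUsed ?? 0; // field name illustrative
    if (tokensSpent > MAX_TOKENS) throw new Error("token budget exhausted");
    // ...the rest of the loop body is unchanged from the earlier sketch,
    // including the early return on a "final" decision
  }
  throw new Error("iteration cap reached or kill switch fired");
}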

3. Tool design is product design

The tools you give the model define what it can do. Vague tools (do_query) produce vague behaviour. Sharp tools (get_user_by_email, refund_order) produce sharp behaviour. Treat the tool registry as a product surface, not as glue.

What we wouldn't do again

  • Roll our own retry loop. Use the workflow runtime's.
  • Skip the eval harness "until we ship". Without it, you ship blind.
  • Treat the agent as a chat box. Production agents are back-end systems. Build them like back-end systems.

A production agent is 10% prompts, 30% tools, and 60% the workflow runtime that holds it all together.