Saga patterns in Temporal
Distributed transactions don't go away when you split the monolith. Sagas — and Temporal's compensation model — are how we keep them sane in production.
Trillion Thoughts Engineering
The first time a payment succeeded, an inventory reservation failed, and we shipped a customer something that was no longer in stock, we rediscovered something every distributed-systems person eventually rediscovers: two-phase commit doesn't exist for the rest of the world. The bank doesn't care about your inventory service. Stripe doesn't care about your CRM. Your CRM doesn't care about your analytics warehouse.
The honest answer to "how do we make these N services agree?" is the saga pattern. The pleasant answer to "how do we implement a saga without losing our minds?" is Temporal.
Sagas, briefly
A saga is a long-running transaction split into N local steps. Each step has a forward action and a compensating action. If step 4 fails, you run the compensations for steps 3, 2, and 1 — in reverse order — to leave the system in a consistent-ish state.
"Consistent-ish" is doing a lot of work in that sentence. Saga compensations are semantic reversals, not literal ones. You don't unsend an email; you send an apology. You don't unauthorise a payment; you issue a refund. The compensation is a real business event, not a magic undo.
What Temporal contributes
Three things that turn the saga pattern from theory into something you can ship:
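- The workflow remembers where it is. If your worker dies between step 3 and step 4, Temporal resumes at step 4, not at step 1. Completed steps are read back from the recorded history rather than re-executed, so there is no resume-from-checkpoint state machine to write. An activity that was in flight when the worker died can still run again, which is why idempotency shows up under "The hard parts".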
- Compensations are first-class. You write them as activities, the same way as forward actions. You attach them to the workflow with try/catch in your normal language. No DSL to learn.
- Observability comes for free. Temporal's history shows every step, every retry, every compensation, with their inputs and outputs. The day you have to explain to a customer what happened to their order, you'll be very glad of this.
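To make the first point concrete, here is a toy model of replay in plain JavaScript. It is a sketch of the idea only, not Temporal's implementation: `makeRunner`, `history`, and `demo` are hypothetical names, and the "crash" is simulated by throwing before step 4.

```javascript
// Toy model of durable execution. `history` stands in for the persisted
// event history of one workflow run; it outlives any single attempt.
function makeRunner(history) {
  let cursor = 0; // position in the history during this attempt
  return async function step(name, fn) {
    if (cursor < history.length) {
      // Step completed in an earlier attempt: return the recorded result.
      const recorded = history[cursor++];
      if (recorded.name !== name) {
        throw new Error(`non-deterministic replay at ${name}`);
      }
      return recorded.result;
    }
    const result = await fn();      // first time through: do the work
    history.push({ name, result }); // record it before moving on
    cursor++;
    return result;
  };
}

async function demo() {
  const history = [];  // survives the simulated crash
  const executed = []; // steps that really ran, across both attempts

  async function workflow(step, crashBefore) {
    for (const name of ['step1', 'step2', 'step3', 'step4']) {
      if (name === crashBefore) throw new Error('worker died');
      await step(name, async () => {
        executed.push(name);
        return `${name}-ok`;
      });
    }
  }

  // First attempt dies between step 3 and step 4.
  await workflow(makeRunner(history), 'step4').catch(() => {});
  // Second attempt replays steps 1 to 3 from history and executes only step 4.
  await workflow(makeRunner(history), null);
  return executed;
}
```

Calling `demo()` resolves to a list in which each step name appears exactly once: the second attempt re-runs the workflow function from the top, but only step 4 does real work.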
The shape of a saga
A typical order-fulfilment saga looks like this:
    async function fulfilOrder(order) {
      const compensations = [];
      try {
        const payment = await chargeCard(order);
        compensations.unshift(() => refundCard(payment));
        const reservation = await reserveInventory(order);
        compensations.unshift(() => releaseInventory(reservation));
        const shipment = await bookShipment(order, reservation);
        compensations.unshift(() => cancelShipment(shipment));
        await sendConfirmation(order, shipment);
      } catch (err) {
        for (const undo of compensations) {
          // each compensation is itself an activity, with its own retries
          await undo();
        }
        throw err;
      }
    }

Three things to notice:
- Compensations are pushed onto a stack as we go. If we fail at step 3, we only run the compensations for steps 1 and 2.
- Compensations are activities. They get retries and timeouts. A compensation that itself fails doesn't crash the whole workflow — it just retries until it succeeds (or you escalate).
- The forward path reads like ordinary code. No state machine, no nested promise dance. The saga shape disappears into normal try/catch.
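To see the ordering in action, here is a minimal trace of the same shape with stubbed activities. The stubs, `makeStubActivities`, and `traceFailedOrder` are hypothetical, and the activities are passed in as a parameter only so the sketch runs without a Temporal worker. `bookShipment` is forced to fail at step 3.

```javascript
// Stub activities (all hypothetical): each records its name in `log`.
// bookShipment always fails so that the catch block runs.
function makeStubActivities(log) {
  const record = (name, result) => async () => {
    log.push(name);
    return result;
  };
  return {
    chargeCard: record('chargeCard', { paymentId: 'pay_1' }),
    refundCard: record('refundCard'),
    reserveInventory: record('reserveInventory', { reservationId: 'res_1' }),
    releaseInventory: record('releaseInventory'),
    bookShipment: async () => {
      log.push('bookShipment');
      throw new Error('carrier unavailable');
    },
    cancelShipment: record('cancelShipment'),
    sendConfirmation: record('sendConfirmation'),
  };
}

// Same shape as fulfilOrder above; activities are injected so they can be stubbed.
async function fulfilOrder(order, a) {
  const compensations = [];
  try {
    const payment = await a.chargeCard(order);
    compensations.unshift(() => a.refundCard(payment));
    const reservation = await a.reserveInventory(order);
    compensations.unshift(() => a.releaseInventory(reservation));
    const shipment = await a.bookShipment(order, reservation);
    compensations.unshift(() => a.cancelShipment(shipment));
    await a.sendConfirmation(order, shipment);
  } catch (err) {
    // Newest compensation first, because unshift put it at the front.
    for (const undo of compensations) await undo();
    throw err;
  }
}

async function traceFailedOrder() {
  const log = [];
  try {
    await fulfilOrder({ id: 'order_1' }, makeStubActivities(log));
  } catch (err) {
    log.push(`failed: ${err.message}`);
  }
  return log;
}
```

The resulting log reads chargeCard, reserveInventory, bookShipment, releaseInventory, refundCard, then the failure. `cancelShipment` never appears, because its compensation is only registered once the booking has succeeded.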
The hard parts
1. Idempotency upstream
Temporal will retry your activities. That's a feature, not a bug. But it means every external call has to be idempotent — usually with an idempotency key derived from the workflow ID. Stripe, Shopify, and most SaaS APIs support this directly. For your own services, build it in from day one.
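A minimal sketch of the key derivation, assuming the activity has access to its workflow ID. `idempotencyKey`, `FakePaymentApi`, and the `workflowId:stepName` format are illustrative choices, not a Temporal or Stripe API; the fake API only stands in for a provider that dedupes on a client-supplied key.

```javascript
// Hypothetical helper: a stable key per (workflow, step). Retries of the same
// step reuse the key; a different workflow or step gets a different one.
function idempotencyKey(workflowId, stepName) {
  return `${workflowId}:${stepName}`;
}

// Stand-in for a payment API that dedupes on an idempotency key.
class FakePaymentApi {
  constructor() {
    this.chargesByKey = new Map();
  }

  charge(amountCents, key) {
    if (this.chargesByKey.has(key)) {
      // Repeated request: return the original charge instead of charging again.
      return this.chargesByKey.get(key);
    }
    const charge = { id: `ch_${this.chargesByKey.size + 1}`, amountCents };
    this.chargesByKey.set(key, charge);
    return charge;
  }
}

// Activity body: safe to retry because the key does not change between attempts.
function chargeCard(api, workflowId, order) {
  return api.charge(order.amountCents, idempotencyKey(workflowId, 'chargeCard'));
}
```

Deriving the key from the workflow ID and step name, rather than generating a random one per attempt, is what makes a retried call collapse into the original request on the provider's side.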
2. Compensations that can fail forever
Sometimes a compensation simply cannot succeed. The shipment has already been picked up by the carrier; "cancel" returns 410 Gone. You need an escalation path: after N retries, page a human and park the workflow until a signal arrives. We model this as a Temporal signal, compensation_resolved, that ops fire after manual cleanup.
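The escalation path can be sketched in plain JavaScript as below. This is an illustration under assumptions rather than our workflow code: `createSignal`, `compensateWithEscalation`, and `pageOps` are hypothetical, and the promise-based signal only imitates what a real workflow would presumably do with the SDK's signal primitives (in the TypeScript SDK, `defineSignal`, `setHandler`, and `condition`).

```javascript
// Hypothetical stand-in for a workflow signal: a promise that ops can resolve.
function createSignal() {
  let fire;
  const received = new Promise((resolve) => {
    fire = resolve;
  });
  return { received, fire };
}

// Try the compensation up to maxAttempts times; if it never succeeds,
// page a human and wait until the compensation_resolved signal arrives.
async function compensateWithEscalation(runCompensation, { maxAttempts, pageOps, resolved }) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await runCompensation();
      return 'compensated';
    } catch (err) {
      lastError = err; // real code would also back off between attempts
    }
  }
  pageOps(lastError);      // escalate once, with the last failure as context
  await resolved.received; // park here until ops confirm manual cleanup
  return 'resolved_manually';
}
```

Returning a distinct outcome for the manual path keeps the difference visible downstream, for example when deciding what to tell the customer.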
3. Partial visibility to customers
While a saga is mid-compensation, the customer's view of the system is inconsistent. Their order page might show "paid" while inventory shows "released". Decide upfront which view is canonical and pin the UX to that, or you'll spend support hours explaining transient states.
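One way to pin the UX, sketched under the assumption that the saga's own state is the canonical view. The state names and customer-facing labels below are hypothetical; the point is that the order page reads one source of truth instead of stitching together per-service flags.

```javascript
// Hypothetical saga states mapped to what the customer sees.
// Transient rollback states are deliberately folded into "processing".
const CUSTOMER_STATUS = new Map([
  ['running', 'Processing your order'],
  ['compensating', 'Processing your order'],
  ['completed', 'Order confirmed'],
  ['compensated', 'Order cancelled, refund issued'],
  ['needs_manual_resolution', 'We are looking into your order'],
]);

function customerStatus(sagaState) {
  const label = CUSTOMER_STATUS.get(sagaState);
  if (label === undefined) {
    // Fail loudly rather than show the customer a state nobody designed for.
    throw new Error(`unknown saga state: ${sagaState}`);
  }
  return label;
}
```

With this shape, an order that is mid-compensation never shows "paid" next to "released"; the customer sees a single processing state until the saga settles one way or the other.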
When not to use a saga
- If everything is in one database, just use a transaction. Don't reach for sagas because they sound architectural; reach for them when actual systems disagree.
- If the steps are independent, just retry each one. Sagas are for ordered, dependent steps. Independent fan-out work belongs in a queue.
- If the failure mode is "we live with it", skip the compensation. Best-effort updates with a daily reconciliation job can be cheaper to operate than a real saga, and good enough for analytics-style work.
Sagas don't make distributed transactions easy. They make them explicit — which turns out to be the part that actually matters.
