Why Temporal beat the queue
We tried hard to keep our jobs on Redis and SQS. Here's the failure mode that finally tipped us over to Temporal — and the three patterns that made the migration worth it.
Trillion Thoughts Engineering
For most of our first year, our background jobs lived on a healthy stack of Redis Streams, SQS, and a sprinkle of cron. It worked. We had idempotency keys, retries with backoff, dead-letter queues, and a Grafana board that was, frankly, beautiful.
Then a customer's onboarding workflow got stuck halfway. The first six steps had run successfully. Step seven needed a webhook from a third party that took 36 hours to arrive. The webhook came back fine — but our worker had lost its place. Step eight ran, then step five ran again, then step nine, then a notification fired twice. The bug took an afternoon to reproduce and three days to fully understand. The problem wasn't a single piece of broken code; it was a mental model that had quietly stopped matching reality.
The thing queues don't model
Queues are excellent at moving messages between workers. They are not excellent at modelling the state of a long-running business process. Once you start needing things like:
- "Wait up to 30 days for this signal, then continue"
- "If step 4 fails, run a compensating step 3, then retry from step 4"
- "Keep an audit log of what happened, in order, with cause and effect"
…you end up writing a small workflow engine on top of your queue. We wrote ours three times. Each version was correct enough to ship and wrong enough to wake us up at 2 a.m. once a quarter.
What Temporal gives you
Temporal is a workflow engine that treats your code as the source of truth. You write a workflow as a normal function; Temporal records every decision and every side effect; if the worker dies, another worker picks up exactly where the first one left off. The fancy term is durable execution. The boring term is "your code keeps its place."
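To make "your code keeps its place" concrete, here is a toy sketch of the replay idea in TypeScript. It is not Temporal's API or implementation: ToyRuntime, onboarding, and the step names are hypothetical, and the history lives in memory rather than in a durable store. It only illustrates the core trick, which is to record each side effect's result and, on restart, hand back the recorded result instead of running the side effect again.

```typescript
// Toy model of durable execution. NOT Temporal's API, only the idea behind it.
// Each activity result is appended to a history; on replay, recorded results
// are returned instead of re-running the side effect.

type HistoryEvent = { name: string; result: unknown };

class ToyRuntime {
  private cursor = 0;

  constructor(public readonly history: HistoryEvent[] = []) {}

  // Run a side-effecting step once; on replay, return the recorded result.
  activity<T>(name: string, fn: () => T): T {
    const recorded = this.history[this.cursor];
    if (recorded !== undefined) {
      if (recorded.name !== name) {
        // The code asked for a different step than the history recorded.
        throw new Error(`non-deterministic: expected ${recorded.name}, got ${name}`);
      }
      this.cursor++;
      return recorded.result as T;
    }
    const result = fn();
    this.history.push({ name, result });
    this.cursor++;
    return result;
  }
}

// Tracks which side effects really executed (hypothetical stand-ins).
const sideEffects: string[] = [];

// Hypothetical two-step workflow; crashAfterStep simulates a dying worker.
function onboarding(rt: ToyRuntime, crashAfterStep?: number): string {
  const account = rt.activity("createAccount", () => {
    sideEffects.push("createAccount");
    return "acct-1";
  });
  if (crashAfterStep === 1) {
    throw new Error("worker died");
  }
  return rt.activity("sendWelcomeEmail", () => {
    sideEffects.push("sendWelcomeEmail");
    return `welcome sent to ${account}`;
  });
}

// The first worker crashes after step one; its history survives.
const firstAttempt = new ToyRuntime();
try {
  onboarding(firstAttempt, 1);
} catch {
  // Simulated crash: nothing to clean up, the history is already recorded.
}

// A second worker re-runs the same function against the saved history.
const secondAttempt = new ToyRuntime(firstAttempt.history);
const outcome = onboarding(secondAttempt);
```

In this sketch the second run gets the recorded result for createAccount without executing it again, and only sendWelcomeEmail actually runs. The name check also hints at why workflow code has to be deterministic: if the replayed code asked for a different step than the history recorded, there would be no safe way to resume.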
Three properties moved the needle for us:
- Determinism, by construction. Workflow code runs inside a sandbox that catches non-deterministic operations (random numbers, current time, network calls) and forces you to express them as activities. The result is a runtime where replay is safe by default — and replay is what makes resume-from-crash work.
- Activities as the only failure surface. Activities are normal functions that can fail. Workflows orchestrate them and can't. That separation gave us a single place to think about retries, timeouts, and back-pressure, instead of having that logic spread across every queue consumer.
- Signals and queries as first-class citizens. The third-party webhook that derailed us is now a signal handler. The workflow waits, the signal arrives whenever it arrives, the workflow continues. No timer juggling, no message scheduling, no "did we already process this".
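The signal pattern in the last bullet can be sketched without any Temporal dependency. This is a toy, in-memory model under stated assumptions: SignalChannel and waitForWebhook are hypothetical names, and a 50 ms timeout stands in for a deadline of days. A real workflow engine would persist the wait so it survives worker restarts, which this sketch does not attempt.

```typescript
// Toy, in-memory model of "wait for a signal, with a deadline".
// Not Temporal's API: SignalChannel and waitForWebhook are hypothetical names.

class SignalChannel<T> {
  private buffered: T[] = [];
  private waiters: Array<(value: T) => void> = [];

  // Called by whatever receives the third-party webhook.
  send(value: T): void {
    const waiter = this.waiters.shift();
    if (waiter) {
      waiter(value);
    } else {
      this.buffered.push(value); // signal arrived before anyone was waiting
    }
  }

  // Resolves with the signal, or with undefined once timeoutMs has elapsed.
  receive(timeoutMs: number): Promise<T | undefined> {
    if (this.buffered.length > 0) {
      return Promise.resolve(this.buffered.shift());
    }
    return new Promise<T | undefined>((resolve) => {
      const onSignal = (value: T): void => {
        clearTimeout(timer);
        resolve(value);
      };
      const timer = setTimeout(() => {
        // Deadline passed: stop listening and report the timeout.
        this.waiters = this.waiters.filter((w) => w !== onSignal);
        resolve(undefined);
      }, timeoutMs);
      this.waiters.push(onSignal);
    });
  }
}

// The workflow reads top to bottom: wait, then continue or give up.
async function waitForWebhook(
  webhook: SignalChannel<string>,
  timeoutMs: number
): Promise<string> {
  const payload = await webhook.receive(timeoutMs);
  return payload === undefined ? "timed out" : `continued with ${payload}`;
}

// Usage: the workflow starts waiting, then the webhook arrives.
const channel = new SignalChannel<string>();
const pending = waitForWebhook(channel, 50);
channel.send("verification-complete");
pending.then((result) => console.log(result));
```

The point of the shape is that the waiting logic sits inline with the steps before and after it, so there is no separate timer job or "did we already process this" lookup to keep in sync with the rest of the process.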
The cost we paid
The migration wasn't free. The biggest costs were the ones we underestimated:
- The mental model is new. Engineers new to Temporal spend two weeks confused about why their fetch call throws a "non-deterministic" error. The fix is always the same (wrap it in an activity), but the lesson takes a few rounds to land.
- Local dev needs a real cluster. The Temporal CLI gives you a one-command local server, but your team needs to learn to run it. We baked it into our docker compose and stopped worrying.
- Observability changes shape. You stop staring at queue depth and start staring at workflow histories. The Temporal Web UI is excellent for this; pair it with structured logs and you'll rarely miss the old dashboard.
When we'd still reach for a queue
Temporal is not a queue. If your job is a single fire-and-forget function — send this email, resize this image, push this message — a queue is still the right tool. We use SQS for that today, alongside Temporal for everything that has a "wait", a "retry compensation", or a "what step are we on" question attached.
The cleanest signal that you've outgrown a queue is when "what step is this customer on?" becomes a hard question to answer.
If you're considering it
Two pieces of unsolicited advice from a year of running Temporal in production:
- Start with one workflow. Pick a process that's burning you. Migrate it end-to-end. Resist the urge to build a framework before you've shipped one workflow.
- Write integration tests against a real Temporal cluster. The replay-style unit tests are seductive but they let you mock the interesting failures. A real cluster in CI takes thirty seconds to spin up and catches the bugs that matter.
Six months in, our on-call hours are down by a third. The biggest win wasn't latency or throughput — it was sleeping through Saturday.
