
§7.1 · May 20, 2026

Making Background Jobs Idempotent

Most job-queue bugs aren't caused by the job failing — they're caused by it running twice. A few design patterns that make retries safe by default.

backend · distributed-systems · reliability


At-least-once delivery is the default for every job queue I've used in production. SQS, Sidekiq, BullMQ, Cloud Tasks — they all promise your job will run, but not that it'll run exactly once. The result: every worker you write should assume it might be called a second time before the first call finishes.

Most of the time it isn't. Then a deployment lands at the wrong moment, a network partition outlasts the visibility timeout, or an instance is preempted, and suddenly you have duplicate charges, double emails, or two database rows where there should be one.

The core idea

A function is idempotent when calling it multiple times with the same input produces the same result as calling it once. For a job that sends a welcome email, "same result" means the user gets exactly one welcome email regardless of how many times the job runs.

There are three cheap ways to get there.

1. Natural idempotency

Some operations are idempotent by nature. Updating a row to a specific state (SET status = 'verified') is safe to replay. So is an upsert on a unique key. If you can reformulate the work as a set operation instead of an append operation, you get idempotency for free.
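To make the set-versus-append distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a production database. The table names (users, verification_log) and both helper functions are hypothetical, chosen only for illustration.

```python
import sqlite3


def mark_verified(conn: sqlite3.Connection, user_id: int) -> None:
    """Set-style write: an upsert on the primary key, safe to replay."""
    conn.execute(
        "INSERT INTO users (id, status) VALUES (?, 'verified') "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (user_id,),
    )
    conn.commit()


def log_verification(conn: sqlite3.Connection, user_id: int) -> None:
    """Append-style write: every replay adds another row."""
    conn.execute("INSERT INTO verification_log (user_id) VALUES (?)", (user_id,))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("CREATE TABLE verification_log (user_id INTEGER NOT NULL)")

# Simulate the queue delivering the same job three times.
for _ in range(3):
    mark_verified(conn, 42)
    log_verification(conn, 42)

users_rows = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
log_rows = conn.execute("SELECT COUNT(*) FROM verification_log").fetchone()[0]
```

After three deliveries the users table still holds one row, while the append-only log holds three. The second shape is the one that needs explicit deduplication.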

The trap: side effects outside the database. Sending an HTTP request, writing to S3, charging a card — these are not naturally idempotent and need explicit handling.

2. Idempotency keys

Before performing an irreversible side effect, record your intent in a table keyed on a stable identifier — the job ID, a transaction ID, whatever uniquely names this unit of work.

CREATE TABLE job_completions (
  idempotency_key TEXT PRIMARY KEY,
  completed_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);

In the worker:

BEGIN;
  -- Claim the key; on a duplicate the insert is skipped and no row comes back.
  INSERT INTO job_completions (idempotency_key) VALUES ($1)
  ON CONFLICT (idempotency_key) DO NOTHING
  RETURNING idempotency_key;
-- if no row returned, another run already claimed this key: exit early
COMMIT;
-- now do the side effect

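The ON CONFLICT DO NOTHING / RETURNING pattern lets you claim the work atomically: only one concurrent worker wins the insert, and every other run gets no row back. Because the claim commits before the side effect runs, a worker that crashes in between leaves the key claimed and the effect undone, so this pattern trades a possible duplicate for a possible miss.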
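The same claim step can be sketched from application code. This version uses Python's sqlite3 module so it runs anywhere, and it checks the cursor's rowcount instead of RETURNING (an assumption made for portability; with PostgreSQL you would read the returned row as in the SQL above). The function names claim and run_job are hypothetical.

```python
import sqlite3
from typing import Callable


def claim(conn: sqlite3.Connection, key: str) -> bool:
    """Return True only for the first caller to insert `key`."""
    cur = conn.execute(
        "INSERT INTO job_completions (idempotency_key) VALUES (?) "
        "ON CONFLICT DO NOTHING",
        (key,),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the conflict skipped it.
    return cur.rowcount == 1


def run_job(conn: sqlite3.Connection, key: str, side_effect: Callable[[], None]) -> str:
    """Perform the side effect only if this run won the claim."""
    if not claim(conn, key):
        return "deduplicated"
    side_effect()
    return "ran"


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_completions ("
    " idempotency_key TEXT PRIMARY KEY,"
    " completed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)

charges = []
# Simulate three deliveries of the same job.
results = [run_job(conn, "charge:order-1001", lambda: charges.append(4999)) for _ in range(3)]
```

Only the first delivery performs the side effect. The other two return early with a distinct result that the caller can log.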

3. Conditional side effects

For external APIs that support it, pass the idempotency key through to the downstream call. Stripe's API accepts an Idempotency-Key header, and many payment processors and messaging APIs have an equivalent. The provider deduplicates on its side, so even if your worker retries after a network timeout, the charge doesn't go through twice. This only works if every retry sends the same key, so derive it from the job's identity rather than generating a fresh one per attempt.
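A rough sketch of what passing the key through might look like with only the standard library. The endpoint URL, the payload shape, and the helper names are all hypothetical; only the Idempotency-Key header name comes from the text above. The request is built but never sent.

```python
import hashlib
import json
import urllib.request

# Hypothetical endpoint, used for illustration only.
CHARGE_URL = "https://payments.example.com/v1/charges"


def idempotency_key(job_id: str) -> str:
    """Derive the key from the job's identity so every retry sends the same value."""
    return hashlib.sha256(f"charge:{job_id}".encode()).hexdigest()


def build_charge_request(job_id: str, amount_cents: int) -> urllib.request.Request:
    """Build (but do not send) a POST that carries the idempotency key."""
    body = json.dumps({"amount": amount_cents}).encode()
    return urllib.request.Request(
        CHARGE_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key(job_id),
        },
    )


first_attempt = build_charge_request("job-123", 4999)
retry = build_charge_request("job-123", 4999)
other_job = build_charge_request("job-124", 4999)

# urllib stores header names capitalized, hence the lowercase "k" in the lookup.
first_key = first_attempt.get_header("Idempotency-key")
retry_key = retry.get_header("Idempotency-key")
other_key = other_job.get_header("Idempotency-key")
```

The first attempt and the retry carry the same key, so a provider that honors the header can treat them as one request. A different job gets a different key.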

When the external API doesn't support idempotency keys natively, check for the side effect before taking it:

# Helper names are placeholders for your own lookup, send, and bookkeeping calls.
if not email_already_sent(user_id, template="welcome"):
    send_email(...)
    record_email_sent(user_id, template="welcome")

Check-then-act leaves a race window: two workers can both pass the check before either one records the send. For most business operations the odds of two workers racing through the check within milliseconds are low enough to accept, and taking a database-level lock on the email record around the check narrows the window further.
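One way to sketch the lock-around-the-check idea, assuming SQLite as a stand-in: BEGIN IMMEDIATE takes the database write lock up front, so a second worker cannot run its own check until the first commits or rolls back. The sent_emails table and send_welcome_once are hypothetical names. In PostgreSQL a row lock or an advisory lock would play the same role.

```python
import sqlite3
from typing import Callable


def send_welcome_once(
    conn: sqlite3.Connection, user_id: int, send_email: Callable[[int], None]
) -> bool:
    """Check, send, and record under one write lock. Returns True if sent."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before checking
    try:
        already = conn.execute(
            "SELECT 1 FROM sent_emails WHERE user_id = ? AND template = 'welcome'",
            (user_id,),
        ).fetchone()
        if already:
            conn.execute("ROLLBACK")
            return False
        send_email(user_id)
        conn.execute(
            "INSERT INTO sent_emails (user_id, template) VALUES (?, 'welcome')",
            (user_id,),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the function manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE sent_emails ("
    " user_id INTEGER NOT NULL,"
    " template TEXT NOT NULL,"
    " PRIMARY KEY (user_id, template))"
)

outbox = []
results = [send_welcome_once(conn, 7, outbox.append) for _ in range(3)]
```

The trade-off is that the lock is held while the email goes out, so a slow provider stalls other writers. A crash after the send but before the commit can still produce a duplicate on retry.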

What to log

When a job exits early because it detects a duplicate, log it as a distinct event — not a failure, not a success, but a deduplication. This makes it easy to measure how often duplicates actually occur and catch regressions if a deployment causes a spike.
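As a minimal sketch of treating deduplication as its own outcome, the snippet below uses Python's standard logging module plus an in-process counter. The Outcome enum, the record_outcome helper, and the field names are assumptions for illustration; a real service would more likely emit to its metrics system.

```python
import logging
from collections import Counter
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    DEDUPLICATED = "deduplicated"  # exited early because the work was already claimed


logger = logging.getLogger("jobs")
outcome_counts = Counter()


def record_outcome(job_name: str, key: str, outcome: Outcome) -> None:
    """Log and count the outcome so duplicates can be graphed separately."""
    outcome_counts[(job_name, outcome)] += 1
    level = logging.ERROR if outcome is Outcome.FAILED else logging.INFO
    logger.log(
        level,
        "job finished",
        extra={"job": job_name, "idempotency_key": key, "outcome": outcome.value},
    )


record_outcome("send_welcome_email", "welcome:42", Outcome.SUCCEEDED)
record_outcome("send_welcome_email", "welcome:42", Outcome.DEDUPLICATED)
record_outcome("send_welcome_email", "welcome:42", Outcome.DEDUPLICATED)

dedup_count = outcome_counts[("send_welcome_email", Outcome.DEDUPLICATED)]
success_count = outcome_counts[("send_welcome_email", Outcome.SUCCEEDED)]
```

Keeping the deduplicated count apart from successes and failures is what makes a post-deployment spike in duplicates visible.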


The rule I try to follow: if the job can't be made idempotent cheaply, that's a signal the work should be broken into smaller, naturally idempotent steps. A complex job that does ten things is harder to make safe than ten jobs that each do one thing.