brief.crastinating.pro
Two-way door · Decided 21 April 2026

Which idempotency key shape for retried payment refunds?

Refund retries occasionally double-charge merchants. Three key shapes are on the table — pick the one we ship to production this quarter.

Ticket: PAY-812
Decider: Staff eng · payments platform
Team: Series-B fintech, EU + US, ~28 engineers

The blocker

Why this stalled long enough to need a brief.

  • Refund retries triggered by network blips were creating duplicate ledger entries on roughly 0.4% of refunds.
  • Two engineers had been arguing over the right key shape for nine days; the ticket was rotting and the partner team was escalating.
  • Server-side dedup alone wasn't enough — we needed the client to assert intent across retries.

Options on the table

Each one was a real proposal, not a strawman.

  • (a) refund_id + attempt_seq — stable across the same logical retry
    Picked

    Encodes intent: the same refund, the same attempt, regardless of which client process retries. Survives client crashes, easy to grep.

  • (b) freshly generated UUID per retry, dedup server-side on (tenant, refund_id)

    Pushes dedup entirely server-side. Simpler client, but obscures intent in logs and makes incident postmortems harder.

  • (c) hash(tenant_id, refund_id, retry_window_minute)

    Time-bucketed. Cute, but couples correctness to clock skew between client and server. Rejected on first read.
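
To make the objection to (c) concrete, here is a minimal sketch of a time-bucketed key. The function name `bucketed_key`, the `|` separator, and the choice of SHA-256 are assumptions for illustration, not the original proposal's exact form. It shows a retry two seconds after the first attempt landing in a different minute bucket and so producing a different key; a client and server that disagree on the current minute would presumably hit the same problem.

```python
import hashlib
from datetime import datetime, timezone


def bucketed_key(tenant_id: str, refund_id: str, at: datetime) -> str:
    """Option (c), sketched: hash of tenant, refund, and the minute of the attempt."""
    minute = at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")
    raw = f"{tenant_id}|{refund_id}|{minute}".encode()
    return hashlib.sha256(raw).hexdigest()


first_try = datetime(2026, 4, 21, 12, 0, 59, tzinfo=timezone.utc)
retry = datetime(2026, 4, 21, 12, 1, 1, tzinfo=timezone.utc)  # two seconds later

# Same minute bucket: the keys match, so the retry would be deduplicated.
same_minute = bucketed_key("t_1", "rf_42", first_try) == bucketed_key(
    "t_1", "rf_42", first_try.replace(second=10)
)

# The retry crosses the minute boundary: the keys differ, so dedup is missed.
across_boundary = bucketed_key("t_1", "rf_42", first_try) == bucketed_key(
    "t_1", "rf_42", retry
)
```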

The memo

Why we picked refund_id + attempt_seq.

Option (a) wins because the key encodes the retry intent the client is asserting. If the same refund is retried after a crash, the same key is reconstructed deterministically — there is no client-side state to lose.
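
A minimal sketch of the deterministic reconstruction described above. The brief does not specify a wire format, so the `refund_id:attempt_seq` string, the function name `idempotency_key`, and the rule that `attempt_seq` starts at 1 are all assumptions.

```python
def idempotency_key(refund_id: str, attempt_seq: int) -> str:
    """Build the key from the two values the client already knows.

    No random component and no clock: any process that knows which refund
    and which attempt it is retrying rebuilds the same key.
    """
    if not refund_id:
        raise ValueError("refund_id is required")
    if attempt_seq < 1:
        raise ValueError("attempt_seq must be >= 1")  # assumed numbering
    return f"{refund_id}:{attempt_seq}"


# Process A sends attempt 1, crashes, and process B retries the same attempt.
key_before_crash = idempotency_key("rf_42", 1)
key_after_crash = idempotency_key("rf_42", 1)

# A deliberately new attempt gets a different key.
key_next_attempt = idempotency_key("rf_42", 2)
```

Because the key is a pure function of its inputs, it can also be searched for directly in logs, which is the "easy to grep" property noted for option (a).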

We will require `refund_id + attempt_seq` on every refund call, validate it at the gateway, and fail closed on missing keys after 2026-05-15. The migration is two-way: if a partner can't produce attempt_seq, we accept a single-attempt fallback for 30 days.
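
The gateway rule could look roughly like the sketch below. `RefundRequest`, `MissingKeyError`, and `resolve_key` are hypothetical names, and the key format is assumed. The sketch also assumes the single-attempt fallback is accepted up to and including the fail-closed date; the brief gives both a 30-day fallback and a 2026-05-15 cutoff without stating how they line up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

FAIL_CLOSED_AFTER = date(2026, 5, 15)  # cutoff date taken from the memo


class MissingKeyError(Exception):
    """Raised when a refund call arrives without attempt_seq after the cutoff."""


@dataclass
class RefundRequest:
    tenant_id: str
    refund_id: str
    attempt_seq: Optional[int] = None  # None means the partner did not send it


def resolve_key(req: RefundRequest, today: date) -> str:
    """Return the idempotency key for a request, or fail closed."""
    if req.attempt_seq is not None:
        if req.attempt_seq < 1:
            raise ValueError("attempt_seq must be >= 1")
        return f"{req.refund_id}:{req.attempt_seq}"
    if today > FAIL_CLOSED_AFTER:
        # Fail closed: reject rather than guess what the client meant.
        raise MissingKeyError("attempt_seq is required on refund calls")
    # Fallback (assumed behaviour): treat the call as the single attempt.
    return f"{req.refund_id}:1"
```

Rejecting the request after the cutoff, rather than inventing a key on the client's behalf, keeps the gateway from silently accepting a retry it cannot tie to an earlier attempt.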

The ledger gains a unique constraint on `(tenant_id, idempotency_key)` and a 24-hour idempotency window. After 24 hours, the same key is treated as a new refund — long enough for any reasonable retry, short enough to avoid stale collisions on tenant-scoped ID reuse.
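
One way the constraint and the window might fit together, sketched with an in-memory SQLite database as a stand-in for the real ledger store. The table name `refund_keys`, its columns, and the approach of deleting an expired key before inserting are assumptions; the brief only specifies the unique constraint and the 24-hour window.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(hours=24)  # idempotency window from the memo


def open_ledger() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE refund_keys (
            tenant_id       TEXT    NOT NULL,
            idempotency_key TEXT    NOT NULL,
            created_at      INTEGER NOT NULL,  -- Unix seconds, UTC
            UNIQUE (tenant_id, idempotency_key)
        )
        """
    )
    return conn


def record_refund(conn: sqlite3.Connection, tenant_id: str, key: str,
                  now: datetime) -> bool:
    """Return True for a new refund, False for a duplicate inside the window."""
    cutoff = int((now - WINDOW).timestamp())
    with conn:  # commits on success, rolls back on an unhandled exception
        # Drop a claim on this key that has aged out of the window.
        conn.execute(
            "DELETE FROM refund_keys "
            "WHERE tenant_id = ? AND idempotency_key = ? AND created_at <= ?",
            (tenant_id, key, cutoff),
        )
        try:
            conn.execute(
                "INSERT INTO refund_keys VALUES (?, ?, ?)",
                (tenant_id, key, int(now.timestamp())),
            )
        except sqlite3.IntegrityError:
            return False  # the unique constraint caught a retry
    return True


conn = open_ledger()
t0 = datetime(2026, 4, 22, 9, 0, tzinfo=timezone.utc)
first = record_refund(conn, "t_1", "rf_42:1", t0)
retry = record_refund(conn, "t_1", "rf_42:1", t0 + timedelta(minutes=5))
other_tenant = record_refund(conn, "t_2", "rf_42:1", t0)
after_window = record_refund(conn, "t_1", "rf_42:1", t0 + timedelta(hours=25))
```

In this sketch the retry five minutes later is rejected by the constraint, the same key under another tenant is accepted, and the same key 25 hours later is accepted as a new refund.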

What actually happened

Followed up roughly 30 days later.

Shipped 2026-04-22. In the first 14 days: zero duplicate ledger entries on refunds. The previous baseline was 0.4%.

Two partners pushed back on the attempt_seq requirement. The fallback bought them three weeks; both shipped a compliant client before the deadline.

Lesson worth keeping: the idempotency key is a contract about intent, not a UUID generator. Pick a shape that fails closed when the client is confused.
