Which idempotency key shape for retried payment refunds?
Refund retries occasionally double-charge merchants. Three key shapes are on the table — pick the one we ship to production this quarter.
- Ticket: PAY-812
- Decider: Staff eng · payments platform
- Team: Series-B fintech, EU + US, ~28 engineers
The blocker
Why this stalled long enough to need a brief.
- Refund retries triggered by network blips were creating duplicate ledger entries on roughly 0.4% of refunds.
- Two engineers had been arguing over the right key shape for nine days; the ticket was rotting and the partner team was escalating.
- Server-side dedup alone wasn't enough — we needed the client to assert intent across retries.
Options on the table
Each one was a real proposal, not a strawman.
- (a) refund_id + attempt_seq: stable across the same logical retry (Picked)
Encodes intent: the same refund, the same attempt, regardless of which client process retries. Survives client crashes, easy to grep.
- (b) freshly generated UUID per retry, dedup server-side on (tenant, refund_id)
Pushes dedup entirely server-side. Simpler client, but obscures intent in logs and makes incident postmortems harder.
- (c) hash(tenant_id, refund_id, retry_window_minute)
Time-bucketed. Cute, but couples correctness to clock skew between client and server. Rejected on first read.
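The objection to option (c) can be illustrated with a minimal sketch. The function name `bucketed_key`, the `|` separator, and the use of SHA-256 are assumptions for illustration only; the proposal did not specify them. The point is that a retry landing on the other side of a minute boundary, whether through real elapsed time or clock skew, hashes to a different key and so is not recognized as a retry.

```python
import hashlib


def bucketed_key(tenant_id: str, refund_id: str, unix_seconds: int) -> str:
    """Hypothetical option (c) key: hash of tenant, refund, and minute bucket."""
    retry_window_minute = unix_seconds // 60
    raw = f"{tenant_id}|{refund_id}|{retry_window_minute}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


# A timestamp that sits exactly on a minute boundary.
boundary = 28_333_334 * 60

first_try = bucketed_key("tenant_42", "rf_1001", boundary - 1)
retry = bucketed_key("tenant_42", "rf_1001", boundary + 1)

# Two seconds apart, same refund, yet the keys differ because the
# minute bucket changed between the two calls.
keys_match = first_try == retry
```

Here `keys_match` is false even though both calls describe the same refund, which is the failure mode that got (c) rejected.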
The memo
Why we picked refund_id + attempt_seq.
Option (a) wins because the key encodes the retry intent the client is asserting. If the same refund is retried after a crash, the same key is reconstructed deterministically — there is no client-side state to lose.
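As a rough sketch of what "reconstructed deterministically" means here: the key is a pure function of the two fields, so any process that knows the refund and the attempt can rebuild it. The `RefundAttempt` type, the function name, and the `refund_id:attempt_seq` string format are assumptions; the memo only fixes the two components.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefundAttempt:
    """Hypothetical container for the two fields the key is built from."""
    refund_id: str
    attempt_seq: int


def idempotency_key(attempt: RefundAttempt) -> str:
    """Build the key from refund_id and attempt_seq only (no clock, no randomness)."""
    if not attempt.refund_id:
        raise ValueError("refund_id must be non-empty")
    if attempt.attempt_seq < 1:
        raise ValueError("attempt_seq must be >= 1")
    # Assumed wire format; any stable, unambiguous encoding would do.
    return f"{attempt.refund_id}:{attempt.attempt_seq}"


# The original process and a process restarted after a crash produce the same key.
key_before_crash = idempotency_key(RefundAttempt("rf_1001", 1))
key_after_restart = idempotency_key(RefundAttempt("rf_1001", 1))
```

Because nothing in the key depends on process-local state, the two values are identical, and the key stays greppable in logs.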
We will require `refund_id + attempt_seq` on every refund call, validate it at the gateway, and fail closed on missing keys after 2026-05-15. The migration is two-way: if a partner can't produce attempt_seq, we accept a single-attempt fallback for 30 days.
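A gateway check along these lines might look like the sketch below. The function and exception names are hypothetical, and the sketch assumes the single-attempt fallback means "treat a missing attempt_seq as attempt 1" and that it ends on the fail-closed date; the memo states a 30-day fallback without pinning its start, so the real window may differ.

```python
from datetime import date
from typing import Optional

# From the memo: missing keys are rejected after this date.
FAIL_CLOSED_AFTER = date(2026, 5, 15)


class MissingIdempotencyKey(Exception):
    """Raised when a refund call cannot be given a valid idempotency key."""


def resolve_key(refund_id: str, attempt_seq: Optional[int], today: date) -> str:
    """Validate the key parts at the gateway, failing closed after the cutoff."""
    if not refund_id:
        raise MissingIdempotencyKey("refund_id is required")
    if attempt_seq is None:
        if today > FAIL_CLOSED_AFTER:
            raise MissingIdempotencyKey("attempt_seq is required")
        # Assumed fallback behavior: a partner that cannot send attempt_seq
        # is treated as making a single attempt.
        attempt_seq = 1
    if attempt_seq < 1:
        raise MissingIdempotencyKey("attempt_seq must be >= 1")
    return f"{refund_id}:{attempt_seq}"
```

During the fallback period a call without attempt_seq still gets a stable key; after the cutoff the same call is rejected rather than silently accepted.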
The ledger gains a unique constraint on `(tenant_id, idempotency_key)` and a 24-hour idempotency window. After 24 hours, the same key is treated as a new refund — long enough for any reasonable retry, short enough to avoid stale collisions on tenant-scoped ID reuse.
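The interaction between the unique constraint and the 24-hour window can be sketched with SQLite. This is an illustration under assumptions, not the production schema: the `refund_keys` table, the `claim_key` function, and the choice to expire an old claim by deleting it before the insert are all hypothetical. A real ledger would likely keep entries append-only and track key claims separately, as this sketch does.

```python
import sqlite3

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour idempotency window


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with the (tenant_id, idempotency_key) constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE refund_keys (
            tenant_id       TEXT    NOT NULL,
            idempotency_key TEXT    NOT NULL,
            claimed_at      INTEGER NOT NULL,
            UNIQUE (tenant_id, idempotency_key)
        )
        """
    )
    return conn


def claim_key(conn: sqlite3.Connection, tenant_id: str, key: str, now: int) -> bool:
    """Return True if this call should create a refund, False if it is a duplicate."""
    with conn:  # one transaction: expire the old claim, then try to claim
        conn.execute(
            "DELETE FROM refund_keys "
            "WHERE tenant_id = ? AND idempotency_key = ? AND claimed_at <= ?",
            (tenant_id, key, now - WINDOW_SECONDS),
        )
        try:
            conn.execute(
                "INSERT INTO refund_keys (tenant_id, idempotency_key, claimed_at) "
                "VALUES (?, ?, ?)",
                (tenant_id, key, now),
            )
        except sqlite3.IntegrityError:
            # The unique constraint fired: same tenant, same key, inside the window.
            return False
    return True
```

A retry inside the window hits the constraint and is reported as a duplicate; the same key presented after the window is claimed again and treated as a new refund. Scoping the constraint by tenant means two tenants can use the same key without colliding.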
What actually happened
Followed up roughly 30 days later.
Shipped 2026-04-22. In the first 14 days: zero duplicate ledger entries on refunds. The previous baseline was 0.4%.
Two partners pushed back on the attempt_seq requirement. The fallback bought them three weeks; both shipped a compliant client before the deadline.
Lesson worth keeping: the idempotency key is a contract about intent, not a UUID generator. Pick a shape that fails closed when the client is confused.
The other doors
The arguments we didn't take, preserved.
- (b) freshly generated UUID per retry, dedup server-side on (tenant, refund_id)
Pushes dedup entirely server-side. Simpler client, but obscures intent in logs and makes incident postmortems harder.
- (c) hash(tenant_id, refund_id, retry_window_minute)
Time-bucketed. Cute, but couples correctness to clock skew between client and server. Rejected on first read.