services: vault, payment-gateway, checkout-worker, api-gateway, redis; level values: info, warn, error
why did checkout start failing around 3am? customers are seeing 5xx on payment submit.
▸ README — what am i looking at?
One run of the Sleuth agent on a synthetic 72-row, 5-service incident: a Stripe API key rotation that cascaded into checkout 5xx errors. You asked a question in English. The agent wrote Python over your logs. This page renders the signed JSON it produced.
Every run produces one case file: question, trajectory, root cause, cited evidence. Self-contained. Replayable. Shareable.
- ▸ Root cause — the agent's answer
- ▸ Confidence — how sure, in its own words
- ▸ Ground truth — pass/fail gate vs known answer
- ▸ Evidence — cited log lines with ±3 context
- ▸ Trajectory — every tool call, in order
checkout-worker is still presenting stripe_api_key version 6 to payment-gateway after vault rotated the secret to version 7 at 02:58:04 UTC. payment-gateway subscribed to the rotation webhook and flipped to v7 instantly, with zero grace period for v6. checkout-worker did not subscribe and caches the secret in-process with no SIGHUP handler, so every outbound charge signed with v6 is now rejected 401 invalid_api_key_version. retries re-present the same stale secret, cascading to api-gateway 502 and a redis queue backup that paged oncall as user-facing 5xx.
already applied: rolling restart of checkout-worker forced re-read of stripe_api_key v7 from vault. permanent fix: (1) subscribe checkout-worker to the vault rotation webhook so it refreshes in-process secrets within 30s of rotation, (2) add a SIGHUP handler that reloads secrets without restart, (3) have payment-gateway honor a short grace window (60s) where v(N-1) is still accepted, to absorb clock skew between subscribers.
the causal chain is fully covered by the logs: rotation event at 02:58:04, first 401 at 02:58:41, recovery immediately after pod restart at 03:45. not 1.0 because we do not directly observe the in-memory secret version inside checkout-worker, we infer it from the presented_version field payment-gateway reports.
ground truth
overlap 100%hit (5)
- ✓ vault-rotate-stripe-2026-04-17T02:58:04Z
- ✓ checkout-worker-first-401-2026-04-17T02:58:41Z
- ✓ payment-gateway-reject-v6-2026-04-17T02:58:41Z
- ✓ api-gateway-5xx-spike-2026-04-17T03:02:12Z
- ✓ redis-queue-lag-2026-04-17T03:04:30Z
miss (0)
expected root cause
checkout-worker failed to pick up the rotated stripe_api_key from vault. Secret was rotated at 02:58:04 UTC; payment-gateway reloaded to v7 immediately, but checkout-worker caches secrets in-process for the pod lifetime and only reloads on SIGHUP or restart. It kept signing outbound calls with stripe_api_key v6, which payment-gateway now rejects as 401 invalid_api_key_version. Retries amplified the error rate and backed up the checkout queue in redis, which is what paged oncall as user-facing 5xx.
evidence
4 key / 5 total- KEY 2026-04-17T02:58:04Z vault info logs/vault.log.jsonl:3
secret rotated secret=kv/stripe_api_key old_version=6 new_version=7
the rotation is the precipitating event
- KEY 2026-04-17T02:58:41Z payment-gateway warn logs/payment-gateway.log.jsonl:5
auth rejected reason=invalid_api_key_version presented=6 expected=7 caller=checkout-worker auth=[REDACTED_BEARER]
first rejection tying checkout-worker to the stale key version
- KEY 2026-04-17T02:58:41Z checkout-worker error logs/checkout-worker.log.jsonl:5
payment-gateway returned 401 upstream_reason=invalid_api_key_version
caller side of the 401, same trace_id
- KEY 2026-04-17T03:02:12Z api-gateway error logs/api-gateway.log.jsonl:7
5xx rate exceeds SLO route=POST /v1/checkout/submit error_pct=91.4
user-visible impact, this is what paged oncall
- 2026-04-17T03:04:30Z redis warn logs/redis-queue.log.jsonl:5
queue lag threshold exceeded queue=checkout.submit depth=1104
secondary symptom, confirms dequeue collapsed at the same time 401s started
trajectory
6 steps01 schema 0ms 02 top_errors limit=20 0ms error#1 (312): payment-gateway returned 401 (checkout-worker) error#2 (87+142+168+156): auth rejected surge (payment-gateway)
03 search pattern=invalid_api_key_version limit=5 0ms payment-gateway warn auth rejected presented_version=6 expected_version=7 caller=checkout-worker
04 around service=vault ts=2026-04-17T02:58:04Z window_s=60 0ms vault info secret rotated old_version=6 new_version=7; webhook dispatched to payment-gateway; checkout-worker NOT subscribed
05 trace trace_id=tr_0003 0ms checkout-worker error 401 -> api-gateway 502 -> retry -> 401
06 submit_incident_report 0ms report submitted