$ sleuth ask "why did checkout start failing around 3am? custo…" submitted
CASE #SLEUTH-0001
SIG pending
v0.1 · anthropic/claude-sonnet-4-6
$pip install sleuth-rlm
THE QUESTION

why did checkout start failing around 3am? customers are seeing 5xx on payment submit.

ROWS
72
FILES
5
STEPS
6
LLM CALLS
7
WALL CLOCK
42s
README — what am i looking at?

One run of the Sleuth agent on a synthetic 72-row, 5-service incident: a Stripe API key rotation that cascaded into checkout 5xx errors. You asked a question in English. The agent wrote Python over your logs. This page renders the signed JSON it produced.

Every run produces one case file: question, trajectory, root cause, cited evidence. Self-contained. Replayable. Shareable.

  • Root cause — the agent's answer
  • Confidence — how sure, in its own words
  • Ground truth — pass/fail gate vs known answer
  • Evidence — cited log lines with ±3 context
  • Trajectory — every tool call, in order
THE ANSWER

checkout-worker is still presenting stripe_api_key version 6 to payment-gateway after vault rotated the secret to version 7 at 02:58:04 UTC. payment-gateway subscribed to the rotation webhook and flipped to v7 instantly, with zero grace period for v6. checkout-worker did not subscribe and caches the secret in-process with no SIGHUP handler, so every outbound charge signed with v6 is now rejected 401 invalid_api_key_version. retries re-present the same stale secret, cascading to api-gateway 502 and a redis queue backup that paged oncall as user-facing 5xx.

REMEDIATION

already applied: rolling restart of checkout-worker forced re-read of stripe_api_key v7 from vault. permanent fix: (1) subscribe checkout-worker to the vault rotation webhook so it refreshes in-process secrets within 30s of rotation, (2) add a SIGHUP handler that reloads secrets without restart, (3) have payment-gateway honor a short grace window (60s) where v(N-1) is still accepted, to absorb clock skew between subscribers.

92%
confidence

the causal chain is fully covered by the logs: rotation event at 02:58:04, first 401 at 02:58:41, recovery immediately after pod restart at 03:45. not 1.0 because we do not directly observe the in-memory secret version inside checkout-worker, we infer it from the presented_version field payment-gateway reports.

ground truth

overlap 100%

hit (5)

  • ✓ vault-rotate-stripe-2026-04-17T02:58:04Z
  • ✓ checkout-worker-first-401-2026-04-17T02:58:41Z
  • ✓ payment-gateway-reject-v6-2026-04-17T02:58:41Z
  • ✓ api-gateway-5xx-spike-2026-04-17T03:02:12Z
  • ✓ redis-queue-lag-2026-04-17T03:04:30Z

miss (0)

expected root cause

checkout-worker failed to pick up the rotated stripe_api_key from vault. Secret was rotated at 02:58:04 UTC; payment-gateway reloaded to v7 immediately, but checkout-worker caches secrets in-process for the pod lifetime and only reloads on SIGHUP or restart. It kept signing outbound calls with stripe_api_key v6, which payment-gateway now rejects as 401 invalid_api_key_version. Retries amplified the error rate and backed up the checkout queue in redis, which is what paged oncall as user-facing 5xx.

evidence

4 key / 5 total
  • KEY 2026-04-17T02:58:04Z vault info logs/vault.log.jsonl:3
    secret rotated secret=kv/stripe_api_key old_version=6 new_version=7

    the rotation is the precipitating event

  • KEY 2026-04-17T02:58:41Z payment-gateway warn logs/payment-gateway.log.jsonl:5
    auth rejected reason=invalid_api_key_version presented=6 expected=7 caller=checkout-worker auth=[REDACTED_BEARER]

    first rejection tying checkout-worker to the stale key version

  • KEY 2026-04-17T02:58:41Z checkout-worker error logs/checkout-worker.log.jsonl:5
    payment-gateway returned 401 upstream_reason=invalid_api_key_version

    caller side of the 401, same trace_id

  • KEY 2026-04-17T03:02:12Z api-gateway error logs/api-gateway.log.jsonl:7
    5xx rate exceeds SLO route=POST /v1/checkout/submit error_pct=91.4

    user-visible impact, this is what paged oncall

  • 2026-04-17T03:04:30Z redis warn logs/redis-queue.log.jsonl:5
    queue lag threshold exceeded queue=checkout.submit depth=1104

    secondary symptom, confirms dequeue collapsed at the same time 401s started

trajectory

6 steps
  1. 01 schema 0ms
    services: vault, payment-gateway, checkout-worker, api-gateway, redis; level values: info, warn, error
  2. 02 top_errors limit=20 0ms
    error#1 (312): payment-gateway returned 401 (checkout-worker)
    error#2 (87+142+168+156): auth rejected surge (payment-gateway)
  3. 03 search pattern=invalid_api_key_version limit=5 0ms
    payment-gateway warn auth rejected presented_version=6 expected_version=7 caller=checkout-worker
  4. 04 around service=vault ts=2026-04-17T02:58:04Z window_s=60 0ms
    vault info secret rotated old_version=6 new_version=7; webhook dispatched to payment-gateway; checkout-worker NOT subscribed
  5. 05 trace trace_id=tr_0003 0ms
    checkout-worker error 401 -> api-gateway 502 -> retry -> 401
  6. 06 submit_incident_report 0ms
    report submitted