Eval Harness
A case file can carry a ground_truth block. When present, the GroundTruthDiff
viewer panel computes two gates and renders pass/fail badges:
The gates
evidence_overlap ≥ 0.5
Jaccard similarity between the report's evidence event_ids and ground_truth.evidence_event_ids.
Catches "the agent solved it but cited the wrong events." In the reference fixture,
the ideal trajectory scores 1.0 (all 5 ground-truth events cited).
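The overlap gate is plain Jaccard similarity over event-id sets. A minimal sketch (the function name and signature are illustrative, not the harness's actual API):

```python
# evidence_overlap gate: Jaccard similarity between the event_ids the
# report cites and ground_truth.evidence_event_ids. Passes at >= 0.5.

def evidence_overlap(report_event_ids, ground_truth_event_ids):
    """Jaccard similarity of two event-id collections."""
    cited = set(report_event_ids)
    truth = set(ground_truth_event_ids)
    if not cited and not truth:
        return 1.0  # degenerate case: nothing to cite, nothing cited
    return len(cited & truth) / len(cited | truth)

# All 5 ground-truth events cited, nothing extra -> 1.0, as in the
# reference fixture's ideal trajectory.
print(evidence_overlap(["e1", "e2", "e3", "e4", "e5"],
                       ["e1", "e2", "e3", "e4", "e5"]))  # 1.0
```

Note that citing extra events dilutes the score: the union grows while the intersection doesn't, so 3 correct citations plus 2 wrong ones lands at 3/7, below the gate.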
distractor_hits == 0
Distractors are ground-truth-flagged red herrings: events that look plausible but
aren't causal. If the agent marks any of them as is_key: true, the run fails the eval,
regardless of whether it got the root cause right. This catches overconfident pattern
matching.
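The distractor gate is a simple count that must stay at zero. A sketch, assuming a report shape of dicts with "event_id" and "is_key" keys (an assumption for illustration, not the harness's actual report schema):

```python
# distractor_hits gate: count cited events flagged is_key whose id is a
# ground-truth-flagged red herring. Any hit > 0 fails the run outright.

def distractor_hits(report_events, distractor_ids):
    """Count report events marked is_key that are known distractors."""
    distractors = set(distractor_ids)
    return sum(
        1
        for ev in report_events
        if ev.get("is_key") and ev.get("event_id") in distractors
    )

report = [
    {"event_id": "ev_first_401", "is_key": True},   # real evidence: fine
    {"event_id": "ev_redis_blip", "is_key": True},  # distractor -> fail
    {"event_id": "ev_db_pool", "is_key": False},    # mentioned, not key: fine
]
print(distractor_hits(report, ["ev_redis_blip", "ev_db_pool"]))  # 1
```

Merely mentioning a distractor is harmless; the gate only trips when the agent promotes one to key evidence.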
The reference fixture
examples/checkout-incident/ is a 72-row, 5-service fixture: a Stripe API key rotation
lands a checkout worker on a stale secret version and cascades into 5xx errors.
- 5 ground-truth events (vault rotation, first 401, caller-side 401, 5xx spike, queue lag)
- 5 distractors (redis replica blip, db pool pressure, feature-flag reload, subscription-worker retry, unrelated 429)
- Ideal trajectory:
schema → top_errors → search("invalid_api_key_version") → around(rotation_ts) → trace(tr_0003) → submit
Run it:
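The exact command isn't spelled out here; assuming a standard pytest setup and the eval/test_checkout_fixture.py path mentioned below, something like:

```shell
# hermetic run against the MockLM
pytest eval/test_checkout_fixture.py -v

# real-LLM run (spends tokens; budget accordingly)
RLM_EVAL_REAL=1 pytest eval/test_checkout_fixture.py -v
```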
Writing your own eval
Add a ground_truth.json to examples/<your-incident>/ with the same shape as the
checkout one, then parametrize eval/test_checkout_fixture.py or copy it. The MockLM
pattern keeps tests hermetic; for real-LLM runs, set RLM_EVAL_REAL=1 and budget accordingly.
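As a rough sketch of the shape: evidence_event_ids is the field the evidence_overlap gate reads; the distractor field name and all event ids below are placeholders, not the checkout fixture's actual contents.

```json
{
  "evidence_event_ids": [
    "ev_vault_rotation",
    "ev_first_401",
    "ev_caller_401",
    "ev_5xx_spike",
    "ev_queue_lag"
  ],
  "distractor_event_ids": [
    "ev_redis_replica_blip",
    "ev_db_pool_pressure"
  ]
}
```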