Measurement methodology.
How we build evaluation cases, how we score agent runs, how we calibrate judges, and where we still have work to do.
Agents are tested in realistic environments
Every evaluation case is a live environment: databases, event streams, adversarial conditions, and signed fixtures. Agents act inside the environment, producing real decisions.
Every run produces a detailed score profile
Whether the agent hit the required decision checkpoints in the case.
Structural accuracy against the ground-truth values of the case.
Calibrated LLM-as-judge scoring on rubric-defined dimensions.
Cite-to-source alignment and logical consistency of traces.
Token spend, tool calls, and time-to-decision versus a baseline.
Resistance to injection, leakage, and adversarial misinformation.
Sub-agent coordination, recovery on failure, checkpoint discipline.
Case-specific evaluator (e.g. chain-of-custody, regulatory citation).
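As a rough illustration of how the dimensions above could be collected into one per-run record, here is a minimal sketch. The field names, the 0-to-1 scale, and the two helper functions are assumptions made for this example; they are not the actual schema or scoring formulas.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ScoreProfile:
    """Hypothetical per-run record; every field is assumed to lie in [0, 1]."""
    checkpoint_coverage: float   # required decision checkpoints hit
    structural_accuracy: float   # agreement with ground-truth values
    judge_score: float           # LLM-as-judge rubric score
    trace_consistency: float     # cite-to-source alignment, logical consistency
    efficiency: float            # tokens, tool calls, time versus baseline
    robustness: float            # injection, leakage, misinformation resistance
    coordination: float          # sub-agent coordination and recovery
    case_specific: float         # case-specific evaluator


def checkpoint_coverage(required: set[str], hit: set[str]) -> float:
    """Fraction of required checkpoints that the agent actually hit."""
    if not required:
        return 1.0  # nothing was required, so nothing was missed
    return len(required & hit) / len(required)


def efficiency_vs_baseline(spent: float, baseline: float) -> float:
    """Baseline-to-spend ratio, capped at 1.0 (an assumed normalization)."""
    if spent <= 0:
        return 1.0
    return min(1.0, baseline / spent)


profile = ScoreProfile(
    checkpoint_coverage=checkpoint_coverage({"a", "b", "c", "d"}, {"a", "b", "x"}),
    structural_accuracy=0.9,
    judge_score=0.8,
    trace_consistency=0.85,
    efficiency=efficiency_vs_baseline(spent=2000, baseline=1000),
    robustness=1.0,
    coordination=0.7,
    case_specific=0.6,
)
print(asdict(profile))
```

Keeping the dimensions as separate fields, rather than collapsing them into one number, preserves the profile: a run that is accurate but wasteful stays distinguishable from one that is cheap but wrong.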
Scoring accuracy is validated quarterly
Judges operate deterministically and are paired with rubric checks. Each quarter, calibration compares judge outputs against expert-annotated holdouts. Judge drift past 40 steps remains an open problem.
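A calibration pass of this kind can be sketched as a comparison between judge scores and expert scores on the same holdout items. The function name, the 0-to-1 score scale, and both thresholds below are hypothetical, chosen only for illustration; the actual calibration procedure may differ.

```python
from statistics import mean


def calibration_report(
    judge_scores: list[float],
    expert_scores: list[float],
    tolerance: float = 0.1,   # assumed: max per-item gap counted as agreement
    max_mae: float = 0.15,    # assumed: mean error that triggers recalibration
) -> dict:
    """Compare judge scores with expert annotations on a holdout set."""
    if not judge_scores or len(judge_scores) != len(expert_scores):
        raise ValueError("score lists must be non-empty and the same length")

    # Absolute gap between judge and expert on each holdout item.
    errors = [abs(j - e) for j, e in zip(judge_scores, expert_scores)]
    mae = mean(errors)
    # Share of items where the judge is within tolerance of the expert.
    agreement = sum(err <= tolerance for err in errors) / len(errors)

    return {
        "mae": mae,
        "agreement": agreement,
        "needs_recalibration": mae > max_mae,
    }


report = calibration_report(
    judge_scores=[0.5, 0.75, 1.0, 0.25],
    expert_scores=[0.5, 0.5, 1.0, 0.25],
)
print(report)
```

Mean absolute error and within-tolerance agreement are only two possible summaries. A rank correlation or per-dimension breakdown could serve equally well, depending on how the rubric is scored.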