Xplore

Measurement methodology.

How we build evaluation cases, how we score agent runs, how we calibrate judges, and where we still have work to do.

Evaluate

Agents are tested in realistic environments

Every evaluation case is a live environment: databases, event streams, adversarial conditions, and signed fixtures. Agents act inside the environment, producing real decisions.

Scoring

Every run produces a detailed score profile

CHK
Checkpoint

Did the agent hit required decision checkpoints in the case?

MET
Metric

Structural accuracy against the ground-truth values of the case.

JDG
LLM judge

Calibrated LLM-as-judge scoring on rubric-defined dimensions.

RSN
Reasoning audit

Cite-to-source alignment and logical consistency of traces.

EFF
Efficiency

Token spend, tool calls, and time-to-decision vs baseline.

SAF
Safety

Resistance to injection, leakage, and adversarial misinformation.

ORC
Orchestration

Sub-agent coordination, recovery on failure, checkpoint discipline.

CST
Custom

Case-specific evaluator (e.g. chain-of-custody, regulatory citation).
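
As a rough illustration of what a score profile could look like in code, the sketch below stores one value per dimension code and shows two of the simpler calculations. Only the eight dimension codes come from the text above. The `ScoreProfile` class, the 0-to-1 scale, and the formulas in `checkpoint_score` and `efficiency_ratio` are assumptions made for the example, not the scoring definitions actually used.

```python
from dataclasses import dataclass, field

# The eight dimension codes are taken from the text; everything else is assumed.
DIMENSIONS = ("CHK", "MET", "JDG", "RSN", "EFF", "SAF", "ORC", "CST")


@dataclass
class ScoreProfile:
    """Hypothetical container for the per-dimension scores of one run."""
    run_id: str
    scores: dict = field(default_factory=dict)

    def set(self, code: str, value: float) -> None:
        # Reject unknown codes and values outside the assumed [0, 1] scale.
        if code not in DIMENSIONS:
            raise KeyError(f"unknown dimension: {code}")
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
        self.scores[code] = value

    def missing(self) -> list:
        # Dimensions that have not been scored yet, in canonical order.
        return [code for code in DIMENSIONS if code not in self.scores]


def checkpoint_score(required: set, hit: set) -> float:
    """Fraction of required checkpoints the agent reached (assumed CHK formula)."""
    if not required:
        return 1.0
    return len(required & hit) / len(required)


def efficiency_ratio(observed: float, baseline: float) -> float:
    """Baseline cost divided by observed cost, capped at 1.0 (assumed EFF formula)."""
    if observed <= 0 or baseline <= 0:
        raise ValueError("costs must be positive")
    return min(1.0, baseline / observed)
```

Under these assumptions, an agent that reaches two of four required checkpoints gets a CHK value of 0.5, and a run that spends twice the baseline token budget gets an EFF value of 0.5. Keeping the dimensions separate rather than collapsing them into one number preserves the profile shape the section describes.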

Calibration

Scoring accuracy is validated quarterly

Judges operate deterministically and are paired with rubric checks. Quarterly calibration compares judge outputs against expert-annotated holdouts. Judge drift on runs longer than 40 steps remains an open problem.
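
One way to picture the comparison against expert-annotated holdouts is a small agreement report. The sketch below is a minimal example and not the calibration procedure itself: the function name `calibration_report`, the use of integer rubric scores, and the two statistics (mean absolute error and agreement rate within a tolerance) are all assumptions chosen for illustration.

```python
from statistics import mean


def calibration_report(judge_scores, expert_scores, tolerance=0):
    """Compare judge scores with expert scores on the same holdout items.

    Hypothetical helper: both inputs are equal-length sequences of numeric
    rubric scores, one entry per holdout item.
    """
    if len(judge_scores) != len(expert_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    # Absolute gap between judge and expert on each holdout item.
    gaps = [abs(j - e) for j, e in zip(judge_scores, expert_scores)]
    return {
        "mean_abs_error": mean(gaps),
        # Share of items where the judge is within `tolerance` of the expert.
        "agreement_rate": sum(g <= tolerance for g in gaps) / len(gaps),
    }
```

For instance, judge scores of `[4, 3, 5, 2]` against expert scores of `[4, 4, 5, 1]` give a mean absolute error of 0.5 and an exact agreement rate of 0.5. Tracking such numbers from one quarter to the next would be one simple way to notice a judge moving away from the expert annotations.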