
Measurement methodology.

How we build evaluation cases, how we score agent runs, how we calibrate judges, and where we still have work to do.

Evaluate

Agents are tested in realistic environments

Every evaluation case is a live environment: databases, event streams, adversarial conditions, and signed fixtures. Agents act inside the environment, producing real decisions.

Scoring

Every run produces a detailed score profile

CHK (Checkpoint): Did the agent hit required decision checkpoints in the case?

MET (Metric): Structural accuracy against the ground-truth values of the case.

JDG (LLM judge): Calibrated LLM-as-judge scoring on rubric-defined dimensions.

RSN (Reasoning audit): Cite-to-source alignment and logical consistency of traces.

EFF (Efficiency): Token spend, tool calls, and time-to-decision vs baseline.

SAF (Safety): Resistance to injection, leakage, and adversarial misinformation.

ORC (Orchestration): Sub-agent coordination, recovery on failure, checkpoint discipline.

CST (Custom): Case-specific evaluator (e.g. chain-of-custody, regulatory citation).
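
As an illustration only, a score profile covering the eight dimensions above could be held in a small container like the one below. This is a minimal sketch, not our production schema: the `ScoreProfile` class, its method names, and the 0.0 to 1.0 score scale are all assumptions made for the example. Only the dimension codes come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Dimension codes as listed in the text. The 0.0-1.0 scale is an assumption.
DIMENSIONS = ("CHK", "MET", "JDG", "RSN", "EFF", "SAF", "ORC", "CST")


@dataclass
class ScoreProfile:
    """Hypothetical per-run score profile, keyed by dimension code."""
    run_id: str
    scores: Dict[str, float] = field(default_factory=dict)

    def set(self, code: str, value: float) -> None:
        # Reject codes outside the eight documented dimensions.
        if code not in DIMENSIONS:
            raise KeyError(f"unknown dimension: {code}")
        # Assumed scale: every score is normalised to [0.0, 1.0].
        if not 0.0 <= value <= 1.0:
            raise ValueError("score must be in [0.0, 1.0]")
        self.scores[code] = value

    def missing(self) -> List[str]:
        # Dimensions not yet scored, in the documented order.
        return [c for c in DIMENSIONS if c not in self.scores]

    def weakest(self) -> Optional[str]:
        # Code of the lowest-scoring dimension, or None if nothing is scored.
        if not self.scores:
            return None
        return min(self.scores, key=self.scores.get)


profile = ScoreProfile("run-1")
profile.set("CHK", 1.0)
profile.set("SAF", 0.5)
print(profile.weakest(), profile.missing())
```

Keeping the dimensions separate, rather than collapsing them into a single number, makes it possible to see which dimension is pulling a run down.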

Calibration

Scoring accuracy is validated quarterly

Judges run deterministically and are paired with rubric checks. Each quarter, we compare judge outputs against expert-annotated holdouts. Judge drift past 40 steps remains an open problem.
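
To make the holdout comparison concrete, here is a minimal sketch of one way judge scores could be checked against expert annotations. The function name `calibration_report`, the 0.0 to 1.0 score scale, and the `tolerance` default are assumptions for illustration. The text does not specify which agreement statistics are used, so mean absolute error and a within-tolerance rate stand in here.

```python
from statistics import mean
from typing import Dict, Sequence


def calibration_report(
    judge: Sequence[float],
    expert: Sequence[float],
    tolerance: float = 0.1,  # assumed acceptable gap between judge and expert
) -> Dict[str, float]:
    """Compare judge scores with expert scores on the same holdout items."""
    if len(judge) != len(expert) or not judge:
        raise ValueError("judge and expert must be non-empty and equal length")
    # Per-item absolute disagreement between judge and expert.
    errors = [abs(j - e) for j, e in zip(judge, expert)]
    return {
        "n": len(errors),
        "mean_abs_error": mean(errors),
        # Share of items where the judge lands within the tolerance.
        "within_tolerance": sum(err <= tolerance for err in errors) / len(errors),
    }


# Hypothetical scores for four holdout items.
report = calibration_report([0.5, 1.0, 0.0, 0.75], [0.5, 0.5, 0.0, 0.75])
print(report)
```

Tracking the same report from quarter to quarter would show whether a judge is moving away from the expert annotations.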
