Measuring agents you cannot see into

Why the eight-evaluator chain is the lower bound of what a production eval should check — not the end state.

#measurement #methodology #evaluators

Static benchmarks ranked models. Agent benchmarks have to rank behaviour. That is a different problem, and most teams are still solving last year’s.

We score every run on our public benchmark along eight dimensions, not one. Each dimension is independently defensible. Together they form a lower bound on what a production eval should check. The ceiling is higher. Below, we explain how we got here and where we are honestly not done.

The floor, not the ceiling

A composite score compresses signal. It is useful for a leaderboard. It is dangerous for a decision. Before you trust a number, you should be able to ask: which dimension failed on this run? Which one improved over last week? Which one is judged by an LLM and which by a mechanical check?

That is what the eight-evaluator chain is for. Not to be definitive — to be decomposable.
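To make "decomposable" concrete, here is a minimal sketch of a per-dimension run record that can answer the three questions above. It is an illustration, not our production schema: the dimension names (task_success, plan_quality, safety), the 0 to 1 score range, the 0.5 failure threshold, and the unweighted-mean composite are all assumptions made for the example, and only three placeholder dimensions are shown rather than the actual eight.

```python
from dataclasses import dataclass
from typing import Literal

# How a dimension is judged: by an LLM or by a mechanical check.
JudgeKind = Literal["llm", "mechanical"]


@dataclass(frozen=True)
class DimensionScore:
    name: str         # hypothetical dimension name, for illustration only
    score: float      # assumed to lie in [0.0, 1.0]
    judge: JudgeKind


def composite(run: list[DimensionScore]) -> float:
    """Unweighted mean of all dimensions (an assumed aggregation)."""
    return sum(d.score for d in run) / len(run)


def failed_dimensions(run: list[DimensionScore], threshold: float = 0.5) -> list[str]:
    """Which dimensions failed on this run? (threshold is an assumption)"""
    return [d.name for d in run if d.score < threshold]


def improved_dimensions(previous: list[DimensionScore],
                        current: list[DimensionScore]) -> list[str]:
    """Which dimensions scored higher than in the previous run?"""
    prev = {d.name: d.score for d in previous}
    return [d.name for d in current
            if d.name in prev and d.score > prev[d.name]]


def by_judge(run: list[DimensionScore]) -> dict[str, list[str]]:
    """Which dimensions are judged by an LLM and which mechanically?"""
    groups: dict[str, list[str]] = {"llm": [], "mechanical": []}
    for d in run:
        groups[d.judge].append(d.name)
    return groups


# Two runs with made-up scores and the same composite.
last_week = [
    DimensionScore("task_success", 0.75, "mechanical"),
    DimensionScore("plan_quality", 0.25, "llm"),
    DimensionScore("safety", 0.5, "mechanical"),
]
this_week = [
    DimensionScore("task_success", 0.25, "mechanical"),
    DimensionScore("plan_quality", 0.75, "llm"),
    DimensionScore("safety", 0.5, "mechanical"),
]

print(composite(last_week), composite(this_week))   # 0.5 0.5
print(failed_dimensions(this_week))                 # ['task_success']
print(improved_dimensions(last_week, this_week))    # ['plan_quality']
print(by_judge(this_week))
```

In this toy data the composite is 0.5 in both weeks, yet one dimension regressed below the threshold while another improved. A leaderboard would show no change; the per-dimension view shows exactly what moved and whether the judge behind it was an LLM or a mechanical check.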

What is missing

The two open problems that hurt the most are judge drift at long horizons and the absence of a first-party economic evaluator. Both are research, not product. We are looking for collaborators on both.

— Xplore Lab