Measuring agents you cannot see into
Why the eight-evaluator chain is the lower bound of what a production eval should check — not the end state.
Static benchmarks ranked models. Agent benchmarks have to rank behaviour. That is a different problem, and most teams are still solving last year’s problem.
We score every run on our public benchmark along eight dimensions, not one. Each dimension is independently defensible. Together they form a lower bound on what a production eval should check. The ceiling is higher. We explain how we got here, and where we are honestly not done.
The floor, not the ceiling
A composite score compresses signal. It is useful for a leaderboard. It is dangerous for a decision. Before you trust a number, you should be able to ask: which dimension failed on this run? Which one improved over last week? Which one is judged by an LLM and which by a mechanical check?
That is what the eight-evaluator chain is for. Not to be definitive — to be decomposable.
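To make "decomposable" concrete, here is a minimal sketch in Python. It is not our evaluator chain. The dimension names (dim_a, dim_b, dim_c), the 0-to-1 score scale, the unweighted mean used as the composite, and the 0.5 failure threshold are all assumptions made for illustration, and it uses three dimensions instead of eight to stay short. It shows two runs with an identical composite score that give different answers to the three questions above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScore:
    name: str     # hypothetical dimension label, not one of the real eight
    score: float  # assumed to be normalised to the range [0, 1]
    judge: str    # "llm" or "mechanical"


def composite(scores):
    """Collapse all dimensions into one number (assumed: unweighted mean)."""
    return mean(s.score for s in scores)


def failed(scores, threshold=0.5):
    """Which dimensions failed on this run? The threshold is an assumption."""
    return [s.name for s in scores if s.score < threshold]


def improved(current, previous):
    """Which dimensions scored higher than in the previous run?"""
    prev = {s.name: s.score for s in previous}
    return [s.name for s in current if s.name in prev and s.score > prev[s.name]]


def by_judge(scores):
    """Group dimension names by how they are judged."""
    groups = {}
    for s in scores:
        groups.setdefault(s.judge, []).append(s.name)
    return groups


# Illustrative data only: two runs with the same composite score.
last_week = [
    DimensionScore("dim_a", 0.75, "mechanical"),
    DimensionScore("dim_b", 0.75, "llm"),
    DimensionScore("dim_c", 0.75, "mechanical"),
]
this_week = [
    DimensionScore("dim_a", 1.0, "mechanical"),
    DimensionScore("dim_b", 0.25, "llm"),
    DimensionScore("dim_c", 1.0, "mechanical"),
]

print(composite(last_week), composite(this_week))  # same number for both runs
print(failed(this_week))                           # the dimension below threshold
print(improved(this_week, last_week))              # dimensions that went up
print(by_judge(this_week))                         # LLM-judged vs mechanical
```

On a leaderboard these two runs are tied. In the breakdown, the LLM-judged dimension has collapsed while the two mechanical ones improved, which is the kind of difference a decision should rest on.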
What is missing
The two open problems that hurt the most are judge drift at long horizons and the absence of a first-party economic evaluator. Both are research, not product. We are looking for collaborators on both.