Forge · Benchmarks

If the world is wrong, the score is noise.

A benchmark environment is a controlled simulation: the connectors, tools, clock, and friction your agent would see live, reset on every run. Evaluators are separate: they score how the agent behaves inside that world, not whether it memorised a clean fixture.

Public Agent 007 cases show finished examples; numbers marked “Example” are illustrative, not product limits.


100%

Isolated runs with reset state — no cross-run contamination

Reproducibility

50+ connectors

Data and system patterns you can compose into a benchmark

Platform scale

Composable

Evaluator chains with domain-specific weights, not one opaque score

Scoring model

Multi-day

Horizons from batch jobs to multi-day simulations, as the workload needs

Time in the loop

Simulation

The benchmark is the world plus the workload.

You define data planes, tool registrations, policies, and a clock. The agent reads, writes, and calls tools the way it would in production. Noise, injections, and friction are first-class — because deployment is not a lab notebook.
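As a rough illustration of what "defining the world" could look like, here is a minimal sketch of an environment declaration. Every name in it (EnvironmentSpec, ToolRegistration, SimClock, the field names, and the sample values) is hypothetical and chosen for this example; it is not Forge's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ToolRegistration:
    """A tool the agent may call (hypothetical shape, not a real Forge type)."""
    name: str
    rate_limit_per_minute: int


@dataclass
class SimClock:
    """Simulated clock: time moves only when the harness advances it."""
    now: datetime

    def advance(self, delta: timedelta) -> datetime:
        self.now += delta
        return self.now


@dataclass
class EnvironmentSpec:
    """World plus workload, declared up front so every run starts identically."""
    name: str
    data_planes: list[str]
    tools: list[ToolRegistration]
    policies: list[str]
    clock_start: datetime
    noise_rate: float = 0.0       # assumed knob: share of records corrupted
    injection_rate: float = 0.0   # assumed knob: share of adversarial documents

    def new_clock(self) -> SimClock:
        # A fresh clock per run keeps runs independent of each other.
        return SimClock(now=self.clock_start)


spec = EnvironmentSpec(
    name="supply-chain-demo",
    data_planes=["shipments_db", "carrier_feed"],
    tools=[ToolRegistration("lookup_shipment", rate_limit_per_minute=30)],
    policies=["no_reroute_without_approval"],
    clock_start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    noise_rate=0.05,
    injection_rate=0.01,
)

clock = spec.new_clock()
clock.advance(timedelta(days=2))  # a multi-day horizon is just clock movement
print(spec.name, clock.now.isoformat())
```

Keeping the clock inside the environment, rather than reading wall time, is one way to make multi-day workloads run quickly and identically on every attempt.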

Provisioned state

Databases, sandboxes, and credentials are provisioned per run, so scores stay comparable and audits stay straightforward.

Realistic coupling

Connectors and contracts match how systems actually fail: rate limits, stale fields, ambiguous alerts.

Examples you can open

Agent 007 publishes specs and leaderboards for selected industries so teams can inspect traces before they commit.

Logistic Shocks (supply chain) · All public benchmarks

Instruments

Evaluators score dimensions; they do not replace the simulation.

Each evaluator measures one thing well. You chain them, set weights for your domain, and aggregate into a scorecard — safety gates, process fidelity, cost, tone — without collapsing everything into a single opaque number that hides failure modes.

For the full composable-chain story, see evaluation.
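To make the chaining and weighting concrete, here is a minimal sketch of a weighted scorecard with a safety gate. The scores and the three weights for evidence, impact, and safety come from the logistics-shock-v1 example chain on this page. The weights for process, efficiency, and tool usage are not shown there, so the values used here are assumptions picked only so that all weights sum to 1. The names EvaluatorResult and aggregate, and the 0.9 gate threshold, are also hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorResult:
    dimension: str
    score: float   # 0.0 to 1.0
    weight: float


def aggregate(results, safety_gate=0.9):
    """Return (weighted score, gate passed, per-dimension breakdown)."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(r.score * r.weight for r in results) / total_weight
    # The gate is checked separately so a high average cannot hide a safety failure.
    safety = next((r.score for r in results if r.dimension == "safety"), None)
    passed = safety is not None and safety >= safety_gate
    breakdown = {r.dimension: r.score for r in results}
    return round(weighted, 2), passed, breakdown


results = [
    EvaluatorResult("evidence", 0.78, 0.22),
    EvaluatorResult("impact", 0.82, 0.22),
    EvaluatorResult("safety", 0.94, 0.18),
    # Assumed weights: the page does not state these three.
    EvaluatorResult("process", 0.88, 0.14),
    EvaluatorResult("efficiency", 0.76, 0.12),
    EvaluatorResult("tool_usage", 0.81, 0.12),
]

score, passed, breakdown = aggregate(results)
print(score, passed)  # 0.83 True
```

The match with the displayed weighted score of 0.83 follows from the assumed weights and should not be read as confirmation of the real ones. Returning the breakdown alongside the aggregate keeps each dimension visible, which is the point of not collapsing everything into one number.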

Forge eval chain · logistics-shock-v1

Evidence: 0.78 (w = 0.22)

Impact: 0.82 (w = 0.22)

Process: 0.88

Safety: 0.94 (w = 0.18)

Efficiency: 0.76

Tool usage: 0.81

Weighted: 0.83

Example topology

Why we model graphs, not spreadsheets.

Supply-chain and operations workloads are naturally relational: shipments, legs, sources, and tools form a graph. The counts below illustrate one representative environment design. They are enough to picture coupling and provenance pressure, but they do not guarantee that every customer topology matches these numbers.

vertices

Single-environment graph (example topology)

Illustrative

relations

Cross-source links in that example

Illustrative

sources

Representative data planes wired in

Illustrative

tools

Tool registrations exposed to the agent

Illustrative

Example node-resolution graph for a supply-chain style benchmark environment
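A toy version of such a graph shows why the relational shape matters. The sketch below is an assumption-laden illustration, not the environment's real data model: the EnvironmentGraph class, the vertex kinds, and the relation names are all invented for this example. It answers a provenance question (which sources back a given shipment) that a flat spreadsheet row tends to lose.

```python
from collections import deque


class EnvironmentGraph:
    """Typed directed graph: vertices carry a kind, edges carry a relation name."""

    def __init__(self):
        self.kind = {}    # vertex id -> kind, e.g. "shipment", "leg", "source"
        self.edges = {}   # vertex id -> list of (relation, target id)

    def add_vertex(self, vid, kind):
        self.kind[vid] = kind
        self.edges.setdefault(vid, [])

    def add_relation(self, src, relation, dst):
        if src not in self.kind or dst not in self.kind:
            raise KeyError("both endpoints must be added before linking them")
        self.edges[src].append((relation, dst))

    def relation_count(self):
        return sum(len(targets) for targets in self.edges.values())

    def reachable(self, start, kind):
        """Breadth-first search for vertices of `kind` reachable from `start`."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for _relation, target in self.edges[current]:
                if target in seen:
                    continue
                seen.add(target)
                queue.append(target)
                if self.kind[target] == kind:
                    found.append(target)
        return found


g = EnvironmentGraph()
for vid, kind in [("S1", "shipment"), ("L1", "leg"), ("L2", "leg"),
                  ("carrier_feed", "source"), ("port_feed", "source")]:
    g.add_vertex(vid, kind)
g.add_relation("S1", "has_leg", "L1")
g.add_relation("S1", "has_leg", "L2")
g.add_relation("L1", "reported_by", "carrier_feed")
g.add_relation("L2", "reported_by", "port_feed")
g.add_relation("L2", "reported_by", "carrier_feed")

print(len(g.kind), g.relation_count())  # 5 5
print(g.reachable("S1", "source"))      # ['carrier_feed', 'port_feed']
```

Leg L2 is reported by two sources, so an agent reasoning about shipment S1 has to reconcile them. That is the kind of coupling the vertex and relation counts are meant to convey.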

Noise

Contradictions, stale fields, and partial records — by design

Injections

Prompt and document-level adversarial content in the stream

Friction

Latency, rate limits, and tool errors, as in real integrations

Replay

Deterministic replay for debugging and regression comparisons
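One common way to get friction and deterministic replay together is to drive every injected failure from a seeded random generator, then fingerprint the resulting trace. The sketch below assumes that approach; the source does not say how Forge implements replay, and the function names, rates, and trace fields are hypothetical.

```python
import hashlib
import json
import random


def run_workload(seed, calls=20, error_rate=0.2, stale_rate=0.25):
    """Run a toy workload against a tool with seeded friction; return the trace."""
    rng = random.Random(seed)  # private generator: no dependence on global state
    trace = []
    for i in range(calls):
        latency_ms = rng.randint(20, 800)   # simulated latency
        if rng.random() < error_rate:       # simulated tool error
            trace.append({"call": i, "outcome": "tool_error", "latency_ms": latency_ms})
            continue
        stale = rng.random() < stale_rate   # simulated stale field
        trace.append({"call": i, "outcome": "stale" if stale else "ok",
                      "latency_ms": latency_ms})
    return trace


def fingerprint(trace):
    """Stable hash of a trace, used to compare runs byte for byte."""
    blob = json.dumps(trace, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


first = fingerprint(run_workload(seed=7))
replay = fingerprint(run_workload(seed=7))
other = fingerprint(run_workload(seed=8))

print(first == replay)  # same seed, same friction, same trace
print(first == other)   # a different seed yields a different world
```

With the friction pinned to a seed, a regression comparison between two agent versions differs only in what the agent did, not in what the world threw at it.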

Stress

Stress the agent where production will stress it.

Simulations embed the kinds of failures operators see after go-live: conflicting sources, misleading alerts, and policy edge cases. Evaluators score how the agent triages evidence, documents decisions, and stays inside guardrails — not whether it memorised a clean training slice.

Operations

Every run starts clean. Every result is reproducible.

Provisioned databases, API sandboxes, and tool credentials are reset per run so scores are comparable and audits are straightforward.

Isolated

Dedicated state per run — no shared caches or leaked rows between agents.

Reproducible

Replay the same workload to verify fixes and compare versions fairly.

Scalable

Spin up many environments in parallel for training sweeps and CI gates.
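The isolation and parallelism described above can be sketched with nothing more than the Python standard library. This is an illustration under stated assumptions, not Forge's provisioning layer: provisioned_run, run_agent, the shipments table, and the seed rows are all invented here, and an in-memory SQLite database stands in for a provisioned database.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

SEED_ROWS = [("S1", "in_transit"), ("S2", "delayed")]


@contextmanager
def provisioned_run():
    """Yield a fresh database seeded from the same fixture, then discard it."""
    conn = sqlite3.connect(":memory:")  # each connection gets its own private database
    try:
        conn.execute("CREATE TABLE shipments (id TEXT PRIMARY KEY, status TEXT)")
        conn.executemany("INSERT INTO shipments VALUES (?, ?)", SEED_ROWS)
        conn.commit()
        yield conn
    finally:
        conn.close()  # state is gone once the run ends


def run_agent(run_id):
    """Stand-in for an agent run: write one row, then count what it can see."""
    with provisioned_run() as db:
        db.execute("INSERT INTO shipments VALUES (?, ?)", (f"tmp-{run_id}", "created"))
        (count,) = db.execute("SELECT COUNT(*) FROM shipments").fetchone()
        return count


with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(run_agent, range(8)))

print(counts)  # [3, 3, 3, 3, 3, 3, 3, 3]
```

Every run sees exactly the two seed rows plus its own write, even with eight runs sharing four workers. If rows leaked between runs, the counts would drift upward.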

Domains

Environment templates by industry shape.

Start from a topology that matches your data model — supply chain, regulated operations, support, research — then swap in your connectors and policies.

Supply chain

Events, graphs, carrier and port feeds — see Logistic Shocks as a public pattern.

Clinical / regulated

Documents, cohorts, lab and regulatory references with citation rules.

Financial compliance

Transactions, sanctions and KYC feeds, policy graphs.

Customer support

Tickets, knowledge bases, product data, escalation paths.

Research & OSINT

Corpora, citation graphs, external retrieval with provenance requirements.

Custom

Your APIs, your databases, your tool contracts — we help you encode them as a benchmark.