Evaluate

You choose what matters. Forge shows where agents fail.

Safety, accuracy, cost, tone — every dimension gets its own score and its own weight. Combine them into a single verdict that reflects your priorities, not ours.

Evaluators score individual dimensions. For full workflow simulations, see benchmark environments.

Book a demo · See environments →

types · Evaluator types per chain · Eval framework

0–1 · Normalised score range · All evaluators

∞ · Custom weight combinations · Composable chains

<2s · Avg evaluator latency · Statistical evaluators

Know exactly which dimensions pass and which don't.

Chain evaluators in any order. Weight safety at 0.30 for clinical workflows, or cost efficiency at 0.25 for operations. Aggregate with weighted mean, min, or Pareto — then set floor thresholds so nothing critical slips through.

Forge eval chain · clinical-trial-v4

Safety gate: 0.94
Dose accuracy: 0.82
Biomarker: 0.75
Process: 0.88
Efficiency: 0.80
Tool usage: 0.90

Weighted: 0.85

safety_gate: w = 0.30
dose_rec: w = 0.25
pd_biomarker: w = 0.25
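As a rough illustration of how a weighted verdict and floor thresholds might fit together, here is a minimal Python sketch using the clinical-trial-v4 scores. It is not Forge's API: the `Dimension` class and both helper functions are hypothetical. The panel shows only three weights, so the weights for process, efficiency, and tool usage, and the 0.90 safety floor, are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    score: float   # normalised to [0, 1]
    weight: float


def weighted_mean(dims):
    """Weight-normalised mean of the dimension scores."""
    total = sum(d.weight for d in dims)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(d.score * d.weight for d in dims) / total


def floor_violations(dims, floors):
    """Names of dimensions whose score falls below their floor."""
    scores = {d.name: d.score for d in dims}
    return [name for name, minimum in floors.items() if scores[name] < minimum]


chain = [
    Dimension("safety_gate", 0.94, 0.30),
    Dimension("dose_rec", 0.82, 0.25),
    Dimension("pd_biomarker", 0.75, 0.25),
    Dimension("process", 0.88, 0.10),      # weight assumed, not shown in the panel
    Dimension("efficiency", 0.80, 0.05),   # weight assumed, not shown in the panel
    Dimension("tool_usage", 0.90, 0.05),   # weight assumed, not shown in the panel
]

verdict = weighted_mean(chain)
failed = floor_violations(chain, {"safety_gate": 0.90})  # hypothetical floor
print(f"weighted={verdict:.2f} floor_violations={failed}")
```

Keeping the floor check separate from the aggregate means a strong average cannot hide a failing critical dimension.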

Eight ways to measure what matters.

Statistical checks run first because they are fast and cheap. LLM judges run only when semantic understanding is required. Every score is normalised to [0, 1].

Checkpoint

Confirms critical steps happened — tools called, states reached, conditions met. Example: in a supply chain case, did the agent query the sanctions database before approving the shipment? Binary pass/fail per checkpoint.
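One way to picture an ordering checkpoint like the sanctions example is a scan over the agent's tool-call trace. This is a sketch under assumptions: the trace format (a list of dicts with a `tool` key) and the tool names are hypothetical, not Forge's schema.

```python
def checkpoint_before(trace, required, before):
    """Binary checkpoint: 1.0 if `required` was called before the first
    call to `before`, otherwise 0.0 (including when it was never called)."""
    for step in trace:
        if step["tool"] == required:
            return 1.0
        if step["tool"] == before:
            return 0.0
    return 0.0


# Hypothetical trace for the supply chain case.
trace = [
    {"tool": "lookup_shipment"},
    {"tool": "query_sanctions_db"},
    {"tool": "approve_shipment"},
]

result = checkpoint_before(trace, "query_sanctions_db", "approve_shipment")
print(result)
```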

Metric

Scores correctness with F1, exact match, or linear decay. Works on structured outputs where the correct answer is known. Example: the agent returns 14 disrupted shipments, but the ground truth is 12. F1 gives partial credit.
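To make the partial-credit idea concrete, here is a small sketch of set-based F1. The source gives only the counts (14 returned, 12 true), so the overlap of 11 correct shipments and the `SHP-` identifiers are assumptions for illustration.

```python
def f1_score(predicted, truth):
    """Set-based F1: harmonic mean of precision and recall."""
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0  # nothing expected, nothing returned
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# 12 ground-truth shipments, 14 returned by the agent, 11 in common (assumed).
truth = {f"SHP-{i:03d}" for i in range(1, 13)}
predicted = {f"SHP-{i:03d}" for i in range(2, 16)}

score = f1_score(predicted, truth)
print(f"F1 = {score:.3f}")
```

With these assumed sets, precision is 11/14 and recall is 11/12, so the agent is penalised both for the three spurious shipments and for the one it missed.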

LLM Judge

Assesses reasoning quality and report depth against your rubrics. Calibrated against human expert ratings and re-validated quarterly to prevent judge drift. Example: rates whether the agent's risk summary covers all required factors.

Safety

Detects injection attempts, PII leaks, access boundary violations, and hallucinated tool calls. Runs both rule-based pattern matching and adversarial probing. Example: the agent receives a prompt injection in a user message. Does it refuse or comply?
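Of the checks above, hallucinated tool calls are the simplest to sketch as a rule: compare every tool the agent invoked against the set of tools that actually exist. The function name, the tool names, and the fraction-based scoring below are assumptions, not a description of how Forge computes its score.

```python
def hallucinated_tool_score(called_tools, registered_tools):
    """Fraction of tool calls that target a registered tool (1.0 = none hallucinated)."""
    if not called_tools:
        return 1.0
    registered = set(registered_tools)
    valid = sum(1 for name in called_tools if name in registered)
    return valid / len(called_tools)


# Hypothetical tool registry and call log.
registered = {"query_sanctions_db", "lookup_shipment", "approve_shipment"}
calls = ["lookup_shipment", "query_sanctions_db", "check_customs_api", "approve_shipment"]

tool_score = hallucinated_tool_score(calls, registered)
print(tool_score)
```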

Reasoning Audit

Verifies the chain of reasoning: goal decomposition, evidence grounding, retry discipline. Checks whether conclusions follow from cited evidence. Example: agent claims "shipment delayed due to port congestion" — does the trace show it actually checked port status?

Orchestration

Scores sub-agent delegation: did the orchestrator pick the right specialist? Did it manage scope correctly? Did it recover when a sub-agent failed? Applies to multi-agent systems where coordination quality matters.

Custom

Your logic, your thresholds. Write a Python function that returns a 0–1 score. Forge normalises it into the eval chain. Example: a compliance team adds a check that verifies every cited regulation is from the current year's revision.
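A custom evaluator for the compliance example might look something like the sketch below. The function signature, the citation format (dicts with `id` and `revision_year`), and the choice to score the fraction of up-to-date citations are all assumptions; the source says only that the function returns a 0–1 score.

```python
from datetime import date


def current_revision_score(citations, current_year=None):
    """Fraction of cited regulations that come from the current year's revision."""
    year = current_year if current_year is not None else date.today().year
    if not citations:
        return 1.0  # nothing cited, nothing out of date
    up_to_date = sum(1 for c in citations if c["revision_year"] == year)
    return up_to_date / len(citations)


# Hypothetical citations extracted from an agent's report.
citations = [
    {"id": "REG-101", "revision_year": 2024},
    {"id": "REG-207", "revision_year": 2024},
    {"id": "REG-330", "revision_year": 2022},
    {"id": "REG-412", "revision_year": 2024},
]

revision_score = current_revision_score(citations, current_year=2024)
print(revision_score)
```

A stricter team could return 0.0 as soon as any citation is stale, which turns the same check into a pass/fail gate.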

Efficiency

Tracks token usage, tool calls, API cost, and latency against configurable budgets. Example: agent solves the task in 1,800 tokens and 4 tool calls vs. budget of 2,500 tokens and 8 calls — efficiency score 0.88.

One verdict that reflects your priorities.

Weighted mean for balanced trade-offs. Min-of-weighted when no single dimension can fail. Floor thresholds for safety-critical metrics. Per-task overrides merge cleanly with your default chain.

Forge eval chain · trade-screener-v5

Entity resolution: 0.91
Sanctions match: 0.95
HS classification: 0.87
Compliance: 0.93

Weighted: 0.91

Aggregation: weighted_mean
Floor: compliance ≥ 0.90

Per-task override · pharma-compliance

Injection resistance: 0.98
Data exfiltration: 1.00
Access boundary: 0.97
Hallucinated tools: 0.92

Weighted: 0.97
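How a per-task override could merge with a default chain can be sketched as a recursive dictionary merge: the override changes only the keys it names and everything else is inherited. The config shape, the weights, and the injection-resistance floor here are hypothetical; only the `weighted_mean` aggregation and the `compliance ≥ 0.90` floor come from the panel above.

```python
from copy import deepcopy


def merge_chain(default, override):
    """Recursively merge `override` into a copy of `default`."""
    merged = deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_chain(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


default_chain = {
    "aggregation": "weighted_mean",
    "weights": {"entity_resolution": 0.25, "sanctions_match": 0.25,
                "hs_classification": 0.25, "compliance": 0.25},  # assumed weights
    "floors": {"compliance": 0.90},
}

# Hypothetical override for the pharma-compliance task.
pharma_override = {
    "weights": {"compliance": 0.40},
    "floors": {"injection_resistance": 0.95},
}

pharma_chain = merge_chain(default_chain, pharma_override)
print(pharma_chain["floors"])
```

Because the merge works on a deep copy, the default chain stays untouched and can be reused for other tasks.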

Every score is traceable. Every threshold is yours.

No black-box quality labels. Each evaluator produces a score you can inspect, override, and trace back to the exact tool call or reasoning step that earned it. Audit-ready by design.

See how training uses these scores →

Environments →

Auto-Training →

Production Control →

Ready to define your eval chain?

See composable evaluation on your data.

Book a demo → See composable eval chains on your data.

Explore training → Scores drive training. See how.

Read the docs → API reference for evaluation chains.