You choose what matters. Forge shows where agents fail.
Safety, accuracy, cost, tone — every dimension gets its own score and its own weight. Combine them into a single verdict that reflects your priorities, not ours.
Evaluators score individual dimensions. For full workflow simulations, see benchmark environments.
Know exactly which dimensions pass and which don't.
Chain evaluators in any order. Weight safety at 0.30 for clinical workflows, or cost efficiency at 0.25 for operations. Aggregate with weighted mean, min, or Pareto — then set floor thresholds so nothing critical slips through.
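For a feel of how a weighted mean and a floor threshold interact, here is a minimal sketch in plain Python. It is an illustration, not Forge's API: the function name `weighted_verdict`, the dictionary shapes, and the sample scores are all assumptions made for this example.

```python
from typing import Dict, Optional


def weighted_verdict(
    scores: Dict[str, float],
    weights: Dict[str, float],
    floors: Optional[Dict[str, float]] = None,
) -> dict:
    """Combine per-dimension scores in [0, 1] into one verdict (hypothetical helper)."""
    floors = floors or {}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Weighted mean across every weighted dimension.
    aggregate = sum(scores[dim] * w for dim, w in weights.items()) / total_weight
    # A breached floor fails the verdict no matter how high the aggregate is.
    breached = sorted(dim for dim, floor in floors.items() if scores[dim] < floor)
    return {"score": aggregate, "passed": not breached, "breached": breached}


# Illustrative numbers only.
scores = {"safety": 0.62, "accuracy": 0.95, "cost": 0.90, "tone": 0.88}
weights = {"safety": 0.30, "accuracy": 0.30, "cost": 0.25, "tone": 0.15}
verdict = weighted_verdict(scores, weights, floors={"safety": 0.80})
```

In this sketch the aggregate still looks healthy, but the verdict fails because safety sits below its floor. That is the point of a floor: a strong average cannot hide a weak critical dimension.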
Eight ways to measure what matters.
Statistical checks run first because they are fast and cheap. LLM judges run only when semantic understanding is required. Every score is normalized to [0, 1].
Confirms critical steps happened — tools called, states reached, conditions met. Example: in a supply chain case, did the agent query the sanctions database before approving the shipment? Binary pass/fail per checkpoint.
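A checkpoint like the sanctions example can be thought of as an ordering rule over the agent's tool-call trace. The sketch below assumes the trace is a simple list of tool names. The names `query_sanctions_db` and `approve_shipment` are hypothetical, and so is the choice to pass a trace in which the gated action never happens.

```python
from typing import List


def called_before(trace: List[str], required: str, gate: str) -> bool:
    """Pass only if every `gate` call is preceded by at least one `required` call."""
    seen_required = False
    for tool in trace:
        if tool == required:
            seen_required = True
        elif tool == gate and not seen_required:
            # The gated action ran before the required check: checkpoint fails.
            return False
    return True


good_trace = ["lookup_order", "query_sanctions_db", "approve_shipment"]
bad_trace = ["lookup_order", "approve_shipment", "query_sanctions_db"]

passed_good = called_before(good_trace, "query_sanctions_db", "approve_shipment")
passed_bad = called_before(bad_trace, "query_sanctions_db", "approve_shipment")
```

The second trace fails even though the sanctions query eventually runs, because the approval came first.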
Measures correctness with F1, exact match, or linear decay. Works on structured outputs where the correct answer is known. Example: the agent returns 14 disrupted shipments, but the ground truth is 12. F1 awards partial credit instead of a flat fail.
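To make the partial credit concrete, the sketch below computes set-based F1 for the shipment example. It assumes all 12 true shipments appear among the 14 returned, which the example does not state, and the shipment IDs are invented. With a different overlap the score would differ.

```python
from typing import Hashable, Set


def set_f1(predicted: Set[Hashable], truth: Set[Hashable]) -> float:
    """F1 between a predicted set and a ground-truth set, in [0, 1]."""
    if not predicted and not truth:
        return 1.0  # nothing expected, nothing returned
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


truth = {f"SHP-{i}" for i in range(1, 13)}      # 12 truly disrupted shipments
predicted = {f"SHP-{i}" for i in range(1, 15)}  # 14 returned, 2 of them spurious
score = set_f1(predicted, truth)
```

Under that assumption precision is 12/14 and recall is 12/12, so F1 works out to 12/13, roughly 0.92. Exact match would score the same answer 0.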
Assesses reasoning quality and report depth against your rubrics. Calibrated against human expert ratings and re-validated quarterly to prevent judge drift. Example: rates whether the agent's risk summary covers all required factors.
Detects injection attempts, PII leaks, access boundary violations, and hallucinated tool calls. Runs both rule-based pattern matching and adversarial probing. Example: the agent receives a prompt injection in a user message. Does it refuse or comply?
Verifies the chain of reasoning: goal decomposition, evidence grounding, retry discipline. Checks whether conclusions follow from cited evidence. Example: agent claims "shipment delayed due to port congestion" — does the trace show it actually checked port status?
Scores sub-agent delegation: did the orchestrator pick the right specialist? Did it manage scope correctly? Did it recover when a sub-agent failed? Applies to multi-agent systems where coordination quality matters.
Your logic, your thresholds. Write a Python function that returns a 0–1 score. Forge normalizes it into the eval chain. Example: a compliance team adds a check that verifies every cited regulation is from the current year's revision.
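As a rough idea of what such a function might look like, here is a sketch of the compliance check. The signature, the `revision_year` field, and the sample citations are assumptions for illustration, not Forge's actual custom-evaluator interface. The sketch also gives partial credit, where a stricter team might return 0 if any citation is out of date.

```python
from datetime import date
from typing import Iterable, Mapping, Optional


def current_revision_score(
    citations: Iterable[Mapping[str, object]],
    current_year: Optional[int] = None,
) -> float:
    """Fraction of cited regulations whose revision year is the current year."""
    year = current_year if current_year is not None else date.today().year
    cited = list(citations)
    if not cited:
        # Assumption: with nothing cited, nothing can be out of date.
        return 1.0
    up_to_date = sum(1 for c in cited if c.get("revision_year") == year)
    return up_to_date / len(cited)


# Hypothetical citations extracted from an agent's report.
sample = [
    {"id": "REG-101", "revision_year": 2024},
    {"id": "REG-202", "revision_year": 2024},
    {"id": "REG-303", "revision_year": 2021},
]
score = current_revision_score(sample, current_year=2024)
```

Pinning `current_year` in the call keeps the score reproducible when the same run is re-scored later.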
Tracks token usage, tool calls, API cost, and latency against configurable budgets. Example: agent solves the task in 1,800 tokens and 4 tool calls vs. budget of 2,500 tokens and 8 calls — efficiency score 0.88.
One verdict that reflects your priorities.
Weighted mean for balanced trade-offs. Min-of-weighted when no single dimension is allowed to fail. Floor thresholds for safety-critical metrics. Per-task overrides merge cleanly with your default chain.
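One way to picture a per-task override is as a recursive merge over a default configuration. The sketch below assumes the chain is a nested dictionary. The keys `aggregate`, `weights`, and `floors`, and the helper `merge_overrides`, are hypothetical and not taken from Forge's configuration schema.

```python
from copy import deepcopy

# Hypothetical default chain shared by every task.
DEFAULT_CHAIN = {
    "aggregate": "weighted_mean",
    "weights": {"safety": 0.30, "accuracy": 0.30, "cost": 0.25, "tone": 0.15},
    "floors": {"safety": 0.80},
}


def merge_overrides(default: dict, override: dict) -> dict:
    """Return a new config: override wins key by key, nested dicts merge."""
    merged = deepcopy(default)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# A clinical task tightens the floors and switches aggregation.
clinical = merge_overrides(
    DEFAULT_CHAIN,
    {"aggregate": "min", "floors": {"safety": 0.95, "accuracy": 0.90}},
)
```

Here the clinical task inherits the default weights untouched, and the default chain itself is never mutated, so other tasks keep their original thresholds.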
Every score is traceable. Every threshold is yours.
No black-box quality labels. Each evaluator produces a score you can inspect, override, and trace back to the exact tool call or reasoning step that earned it. Audit-ready by design.
Ready to define your eval chain?
See composable evaluation on your data.