Forge — Xplore


Your agents: evaluated, [re]trained, deployed, controlled
for your environment.

Load your workflows, connectors, policies, and success metrics. Forge builds the benchmark, optimizes the agent, deploys what passes, and monitors for regression — continuously.


·50+ connectors (Simulation layer): Simulate workloads from your systems and data sources
·30+ evals (Eval framework): Evaluate chains with composable dimensions and domain weights
·runtimes (Training layer): [Re]Train with parallel strategies and guardrailed iteration
·Live (Operations layer): Observe drift, cost, and promotion gates across deployed versions

Evaluate

Know exactly where your agents fail.

Each domain has different priorities, so you set the weights: a safety-critical domain might weight safety at 0.30, a clinical domain might weight dose accuracy at 0.25. Forge scores each dimension separately on your real tasks, then combines the scores using those weights.

Explore evaluation →

Forge eval chain · support-bot-v3

·Tone accuracy: 0.89 (tone_acc, w = 0.25)
·Knowledge: 0.91 (knowledge, w = 0.25)
·Escalation: 0.78 (escalation, w = 0.20)
·Response time: 0.94
·Safety: 0.97
·Hallucination: 0.03
·Weighted: 0.85
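The weighting idea can be illustrated with a minimal sketch. This is not Forge's scoring code: the function name `weighted_score` and the normalization by total weight are assumptions, and only the three weights visible in the support-bot-v3 panel are used. The result is therefore a partial figure over three dimensions and is not expected to match the panel's 0.85.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-dimension scores, normalized by total weight.

    Hypothetical helper for illustration; Forge's actual formula is not
    documented on this page.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"no score for weighted dimensions: {sorted(missing)}")
    return sum(scores[dim] * w for dim, w in weights.items()) / total


# Scores and weights copied from the support-bot-v3 panel (three dimensions only).
scores = {"tone_acc": 0.89, "knowledge": 0.91, "escalation": 0.78}
weights = {"tone_acc": 0.25, "knowledge": 0.25, "escalation": 0.20}

partial = weighted_score(scores, weights)
print(f"partial weighted score over {len(weights)} dimensions: {partial:.3f}")
```

Normalizing by the total weight keeps the result on the same 0 to 1 scale even when the weights do not sum to 1, which is convenient while a domain's weights are still being tuned.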

Node Resolution graph — Supply Chain environment with 33 vertices, 63 relations, 7 data sources

Evaluate

Your agents train on production-grade data.

Environments draw on Neo4j, PostgreSQL, SAP/ERP, OpenSanctions, and OSINT sources, not toy data. The Supply Chain environment shown has 33 vertices and 63 relations across 7 real data sources, so your agents train on the same complexity they face in production.

[Re]train

Agents that work beyond their training data.

In-sample optimizes. Out-of-sample verifies. The gap tells you if the agent performs reliably beyond training data. Every iteration measured on tasks from your operational environment.

See training in action →

Forge training — IS/OS/meta fitness over automated iterations with per-category breakdown
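A rough sketch of the in-sample versus out-of-sample comparison follows. The class `IterationResult`, the function `generalizes`, the 0.05 gap tolerance, and the example scores are all hypothetical and chosen only to show the idea; they are not taken from Forge.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IterationResult:
    iteration: int
    in_sample: float       # score on tasks the trainer optimized against
    out_of_sample: float   # score on held-out tasks

    @property
    def gap(self) -> float:
        """Generalization gap: how much better the agent does on seen tasks."""
        return self.in_sample - self.out_of_sample


def generalizes(result: IterationResult, max_gap: float = 0.05) -> bool:
    """True when the gap stays within an assumed tolerance (0.05 by default)."""
    return result.gap <= max_gap


# Made-up scores: iteration 2 improves in-sample but barely moves out-of-sample.
history: List[IterationResult] = [
    IterationResult(iteration=1, in_sample=0.70, out_of_sample=0.68),
    IterationResult(iteration=2, in_sample=0.82, out_of_sample=0.69),
]

flags = [generalizes(r) for r in history]
for r, ok in zip(history, flags):
    print(f"iteration {r.iteration}: gap={r.gap:.2f} generalizes={ok}")
```

A widening gap suggests the iteration is fitting the training tasks rather than improving the agent in general.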

Agent structure diff — tools, rules, and instructions changed between iterations

Config diff · Iteration 22 → 23

trainer: step_by_step · accuracy_focus
+ tool: verify_source_citation ("Cross-check facts against original document")
~ rule: escalation_policy (threshold: 0.4 → 0.6)
~ instruction: verification section (added: "Always cite page number")
score: 0.68 → 0.73 (+0.05, promoted)

[Re]train

Understand why performance improved.

Every iteration is committed like a change in git: tools added, rules rewritten, score delta tracked. The diff covers not just prompts but the whole agent structure: tools, instructions, policies, and configurations.
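One way to picture a structural diff between two agent versions is sketched below. The nested-dictionary layout, the function `diff_config`, and the `+` / `-` / `~` markers (borrowed from the config diff panel) are assumptions for illustration, not Forge's storage format.

```python
from typing import Dict, List

AgentConfig = Dict[str, Dict[str, str]]  # section -> item name -> value


def diff_config(old: AgentConfig, new: AgentConfig) -> List[str]:
    """List added (+), removed (-), and changed (~) items between two configs."""
    lines: List[str] = []
    for section in sorted(set(old) | set(new)):
        before = old.get(section, {})
        after = new.get(section, {})
        for name in sorted(set(before) | set(after)):
            if name not in before:
                lines.append(f"+ {section}: {name}")
            elif name not in after:
                lines.append(f"- {section}: {name}")
            elif before[name] != after[name]:
                lines.append(f"~ {section}: {name} ({before[name]} -> {after[name]})")
    return lines


# Values loosely modeled on the "Iteration 22 -> 23" panel.
v22: AgentConfig = {
    "tool": {},
    "rule": {"escalation_policy": "threshold=0.4"},
}
v23: AgentConfig = {
    "tool": {"verify_source_citation": "Cross-check facts against original document"},
    "rule": {"escalation_policy": "threshold=0.6"},
}

changes = diff_config(v22, v23)
for line in changes:
    print(line)
```

Pairing a diff like this with the score delta for the same iteration is what lets a reviewer attribute an improvement to a specific structural change.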

Deploy

Agents only go live when they pass.

Auto-promote the best, set a threshold, or keep it manual. Candidates sit on a branch until they earn promotion. Every version tracked. Roll back in one click.
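The three promotion modes can be sketched as a single gate function. The enum `PromotePolicy`, the function `should_promote`, and its parameters are hypothetical names; how Forge actually evaluates a promote policy is not described on this page.

```python
from enum import Enum
from typing import Optional


class PromotePolicy(Enum):
    AUTO_BEST = "auto_best"   # promote whenever the candidate beats the live version
    THRESHOLD = "threshold"   # promote once the candidate clears a fixed score
    MANUAL = "manual"         # promote only after explicit human approval


def should_promote(
    policy: PromotePolicy,
    candidate_score: float,
    live_score: float,
    threshold: Optional[float] = None,
    approved: bool = False,
) -> bool:
    """Decide whether a candidate on its branch may replace the live version."""
    if policy is PromotePolicy.AUTO_BEST:
        return candidate_score > live_score
    if policy is PromotePolicy.THRESHOLD:
        if threshold is None:
            raise ValueError("THRESHOLD policy requires a threshold")
        return candidate_score >= threshold
    return approved  # MANUAL


# Scores borrowed from the config diff panel (0.68 -> 0.73); the 0.75 bar is made up.
print(should_promote(PromotePolicy.AUTO_BEST, 0.73, 0.68))
print(should_promote(PromotePolicy.THRESHOLD, 0.73, 0.68, threshold=0.75))
print(should_promote(PromotePolicy.MANUAL, 0.73, 0.68))
```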

Training run configuration — promote policy, trainer strategy, mutation knobs

Agent overview — Logistics v7, 148 runs, 6 versions, performance over time

Production controls · live

● Accuracy & Safety enabled
agent: logistics-v7 · last 24h: 148 runs · avg: 0.71
⚠ drift: context_adherence 0.78 → 0.61

● RAG quality enabled
agent: chatbot-prod · last 24h: 2,341 runs · avg: 0.84
alerts: 0

● Cost guard enabled
agent: bi-analyst-v1 · last 24h: 412 runs · avg: 0.91
alerts: 0

Cost Over Time — stacked area chart (agent, eval, trainer, cert)

Control

Know when agents degrade. Retrain when they do.

Live certification, drift alerts, and cost transparency. Every dollar is tracked by agent, eval, trainer, and cert. When scores drop, retraining triggers automatically.

·Context adherence, completeness, PII leak, latency SLA, token budget
·Drift alerts — score 0.78 → 0.61 flagged instantly
·Per-cycle cost visibility in the dashboard — totals by agent, eval, trainer, and certification
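A drift check of the kind listed above could look roughly like this. The functions `detect_drift` and `check_metrics`, the 0.10 drop tolerance, and the `completeness` values are assumptions; only the context_adherence drop from 0.78 to 0.61 comes from the page.

```python
from typing import Dict, List


def detect_drift(baseline: float, current: float, max_drop: float = 0.10) -> bool:
    """Flag a metric whose score fell by more than the allowed drop."""
    return (baseline - current) > max_drop


def check_metrics(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_drop: float = 0.10,
) -> List[str]:
    """Return the names of metrics that drifted, in alphabetical order."""
    return sorted(
        name
        for name in baseline
        if name in current and detect_drift(baseline[name], current[name], max_drop)
    )


baseline = {"context_adherence": 0.78, "completeness": 0.90}
current = {"context_adherence": 0.61, "completeness": 0.88}

alerts = check_metrics(baseline, current)
needs_retraining = bool(alerts)  # assumed trigger rule: any alert starts a retrain
print(alerts, needs_retraining)
```

In practice the tolerance would likely differ per metric, since a small drop in a PII-leak check matters more than the same drop in latency.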

See production controls →

Solutions