[Re]train

Fix what's broken. Automatically.

In-sample trains. Out-of-sample proves. The gap between the two scores tells you whether the agent generalizes. Every iteration is measured on tasks from your environment, not synthetic benchmarks.


30+ loops

Each loop scores on training tasks and held-out tasks so you see generalization, not just a lower loss.

In-sample vs out-of-sample

modes

Trainer modes range from human-approved steps to fully automated proposals, so you control how aggressive the loop is.

How the loop behaves

4-way

Every cycle attributes cost to agent, evaluation, trainer, and certification so finance and engineering see the same numbers.

Cost by surface

L0–L3

Depth levels set how much the system edits the agent autonomously. L4–L5 are on the roadmap for deeper arena-style search.

Automation depth


Agents that work beyond their training data.

Three curves show the full picture: in-sample fitness, out-of-sample fitness, and the meta score. When the gap between in-sample and out-of-sample fitness shrinks, your agent is learning to handle tasks it was never trained on.

Fitness · IS / OS / meta

IS 0.374 · OS 0.326 · meta 0.349 · gap −0.048
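The gap in the readout above matches out-of-sample minus in-sample fitness (0.326 − 0.374 = −0.048). A minimal sketch of that arithmetic, assuming this sign convention; the function name is hypothetical, and the meta score is left out because its formula is not given on this page.

```python
def generalization_gap(in_sample: float, out_of_sample: float) -> float:
    """Out-of-sample minus in-sample fitness.

    A negative value means the agent scores lower on held-out tasks
    than on the tasks it was trained on.
    """
    return round(out_of_sample - in_sample, 3)


# Values taken from the fitness readout above.
gap = generalization_gap(in_sample=0.374, out_of_sample=0.326)
print(gap)
```

A gap moving toward zero over iterations is the signal the page describes: held-out performance is catching up with training performance.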

IS/OS/meta fitness over automated iterations with per-category breakdown

Agent structure diff — tools, rules, and instructions changed between iterations

Config diff · Iteration 7 → 8

trainer: evolutionary · arena_selection

+ tool: verify_source_citation
    "Cross-check facts against original document"
~ rule: escalation_policy
    threshold: 0.4 → 0.6
~ instruction: verification section
    added: "Always cite page number"

score: 0.44 → 0.51 (+0.07, promoted)
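One way to picture the diff above as data, as a sketch only: the class names, field names, and the promotion rule (promote when the score improves) are assumptions for illustration, not the product's actual schema or policy.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    op: str      # "+" for added, "~" for modified
    kind: str    # "tool", "rule", or "instruction"
    name: str
    detail: str


@dataclass
class IterationDiff:
    from_iter: int
    to_iter: int
    score_before: float
    score_after: float
    changes: list = field(default_factory=list)

    @property
    def delta(self) -> float:
        return round(self.score_after - self.score_before, 2)

    def promoted(self) -> bool:
        # Assumption: a candidate is promoted whenever its score improves.
        return self.delta > 0


# The Iteration 7 -> 8 diff shown above.
diff = IterationDiff(7, 8, 0.44, 0.51, [
    Change("+", "tool", "verify_source_citation",
           "Cross-check facts against original document"),
    Change("~", "rule", "escalation_policy", "threshold: 0.4 -> 0.6"),
    Change("~", "instruction", "verification section",
           'added: "Always cite page number"'),
])
print(f"{diff.delta:+.2f}", "promoted" if diff.promoted() else "rejected")
```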


Understand why performance improved.

Every iteration is committed like a change in git: tools added, rules rewritten, score delta tracked. The diff covers the whole agent structure, not just prompts: configurations, policies, and tool definitions.


You choose how much control to keep.

Five trainer modes span the range from fully manual review to autonomous evolution, and you can pick a different one at each stage.

Manual

Full control. You review traces, you edit config, you decide what ships.

Self-improvement

The agent analyzes its own mistakes and proposes fixes. You approve or auto-promote.

Evolutionary

Generates config variants. An arena picks the winners using Pareto-optimal selection across objectives.

Deep training

Edits the full agent as files — prompts, tools, rules, code. Git-native workspace.

Step-by-step

Focuses on one evaluator at a time. Improves iteratively until plateau, then moves to the next.
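Evolutionary mode is described above as Pareto-optimal selection across objectives. A minimal sketch of that idea, assuming higher is better on every objective: a variant survives if no other variant is at least as good everywhere and strictly better somewhere. The variant names, the two objectives (score and negated cost), and all values are invented for illustration; the actual arena logic is not documented on this page.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(variants):
    """Return the names of variants that no other variant dominates."""
    return [
        name
        for name, scores in variants.items()
        if not any(
            dominates(other, scores)
            for other_name, other in variants.items()
            if other_name != name
        )
    ]


# Hypothetical variants scored on (task score, negated cost).
variants = {
    "v1": (0.51, -1.20),
    "v2": (0.48, -0.80),
    "v3": (0.47, -1.10),  # dominated by v2: lower score, higher cost
    "v4": (0.51, -1.50),  # dominated by v1: same score, higher cost
}
front = pareto_front(variants)
print(front)
```

Here v1 and v2 both stay on the front because neither beats the other on both objectives: v1 scores higher, v2 costs less.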


Agents fix more on their own as you gain confidence.

Each depth level unlocks a deeper kind of change the system can make on its own. Start with prompt adjustments, then progress to full tool creation.

Training depth ladder

Prompt tuning · Edits instructions, rules, priorities · ● live

Diagnostic training · Reads traces, finds root causes, then edits · ● live

Architecture training · Changes agent workflow: adds steps, branches · ● live

Tool creation · Writes new tools, extends agent capabilities · ● live

Curriculum evolution · Improves the tests themselves · ○ roadmap

Team synthesis · Creates specialized sub-agents · ○ roadmap
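The ladder above can be read as an ordered cap on what the trainer may change. A minimal sketch, assuming the rungs map to L0 through L5 in the order listed (the page does not label them individually) and that a cap at one level permits every live level below it. The enum and function names are hypothetical.

```python
from enum import IntEnum


class Depth(IntEnum):
    # Assumption: rungs are numbered L0-L5 in the order the ladder lists them.
    PROMPT_TUNING = 0
    DIAGNOSTIC_TRAINING = 1
    ARCHITECTURE_TRAINING = 2
    TOOL_CREATION = 3
    CURRICULUM_EVOLUTION = 4   # roadmap
    TEAM_SYNTHESIS = 5         # roadmap


# Rungs marked "live" on the ladder.
LIVE = {
    Depth.PROMPT_TUNING,
    Depth.DIAGNOSTIC_TRAINING,
    Depth.ARCHITECTURE_TRAINING,
    Depth.TOOL_CREATION,
}


def allowed_depths(max_depth: Depth) -> list:
    """Live depth levels available when the trainer is capped at max_depth."""
    return [d for d in Depth if d <= max_depth and d in LIVE]


capped = allowed_depths(Depth.ARCHITECTURE_TRAINING)
print([d.name for d in capped])
```

Raising the cap is then a one-line change, which fits the idea of granting more autonomy as confidence grows.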


How a training cycle looks in practice.

A logistics agent scores 0.31 on iteration 1. Here's what the trainer changes at each depth level to reach 0.73 by iteration 23.

L0 · Prompt edit

Iteration 2: Trainer rewrites the system prompt to emphasize checking sanctions databases before approving any shipment. Score: 0.31 → 0.38.

L1 · Tool configuration

Iteration 6: Trainer adds an "always_query" flag to the sanctions_check tool, ensuring it runs on every entity. Score: 0.38 → 0.49.

L2 · Rule addition

Iteration 11: Trainer adds a rule: "if HS code is dual-use, escalate to compliance review before proceeding." Score: 0.49 → 0.58.

L3 · Tool creation

Iteration 16: Trainer creates a new tool verify_entity_chain that cross-references beneficial owners across three registries. Score: 0.58 → 0.68.

Combined effect

Iteration 23: All improvements compound. The agent now handles sanctions screening, dual-use goods, and entity verification in one pass. Final score: 0.73.
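The score trajectory in this walkthrough can be checked with simple arithmetic. The sketch below uses only the scores quoted above and computes the gain between consecutive checkpoints; the variable names are illustrative.

```python
# Scores quoted in the walkthrough, keyed by iteration number.
scores = {1: 0.31, 2: 0.38, 6: 0.49, 11: 0.58, 16: 0.68, 23: 0.73}

iters = sorted(scores)

# Gain between each pair of consecutive checkpoints.
gains = {
    f"{a}->{b}": round(scores[b] - scores[a], 2)
    for a, b in zip(iters, iters[1:])
}

# Overall improvement from the first to the last checkpoint.
total = round(scores[iters[-1]] - scores[iters[0]], 2)

for step, gain in gains.items():
    print(f"iterations {step}: +{gain:.2f}")
print(f"total: +{total:.2f}")
```

The per-step gains sum to the overall improvement of 0.42, and the largest single step is the tool configuration change at iteration 6 (+0.11).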


Ready to see training in action?

Iterative training on your tasks and your data.
