[Re]train

Fix what's broken. Automatically.

In-sample trains. Out-of-sample proves. The gap between them tells you whether the agent generalizes. Every iteration is measured on tasks from your environment, not synthetic benchmarks.

30+ loops · In-sample vs out-of-sample
Each loop scores on training tasks and held-out tasks so you see generalization, not just a lower loss.

5 modes · How the loop behaves
Trainer modes from human-approved steps to fully automated proposals. You control how aggressive the loop is.

4-way · Cost by surface
Every cycle attributes cost to agent, evaluation, trainer, and certification so finance and engineering see the same numbers.

L0–L3 · Automation depth
Depth levels for how much the system edits the agent autonomously; L4–L5 are on the roadmap for deeper arena-style search.

Agents that work beyond their training data.

Three curves show the full picture: in-sample (IS) fitness, out-of-sample (OS) fitness, and the meta score. When the gap between IS and OS fitness shrinks, your agent is learning to handle tasks it was never trained on.

[Chart: Fitness · IS / OS / meta. IS/OS/meta fitness over automated iterations, with per-category breakdown. Y-axis: fitness, 0.0 to 1.0. X-axis: iterations. Reading shown: IS 0.374, OS 0.326, meta 0.349, gap −0.048.]
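
The gap in the chart reading can be reproduced with simple arithmetic. The sketch below assumes the gap is OS minus IS, which matches the displayed −0.048. The page does not define the meta score, so `meta_score` uses a geometric mean purely as an assumption; it happens to reproduce the displayed 0.349, but that does not confirm the actual formula. Both function names are hypothetical.

```python
import math


def generalization_gap(is_fitness: float, os_fitness: float) -> float:
    """Out-of-sample minus in-sample fitness; negative means OS trails IS."""
    return os_fitness - is_fitness


def meta_score(is_fitness: float, os_fitness: float) -> float:
    """ASSUMPTION: geometric mean of IS and OS. The page does not define 'meta'."""
    return math.sqrt(is_fitness * os_fitness)


# Values taken from the chart reading on this page.
is_fit, os_fit = 0.374, 0.326

gap = generalization_gap(is_fit, os_fit)
meta = meta_score(is_fit, os_fit)

print(f"gap  {gap:+.3f}")   # gap  -0.048
print(f"meta {meta:.3f}")   # meta 0.349
```

A gap that moves toward zero over iterations is the signal the chart is meant to surface: held-out performance is catching up with training performance.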
Config diff · Iteration 7 → 8
Agent structure diff: tools, rules, and instructions changed between iterations.
trainer: evolutionary · arena_selection
+ tool: verify_source_citation
    "Cross-check facts against original document"
~ rule: escalation_policy
    threshold: 0.4 → 0.6
~ instruction: verification section
    added: "Always cite page number"
score: 0.44 → 0.51 (+0.07, promoted)

Understand why performance improved.

Every iteration commits its changes the way git does: tools added, rules rewritten, score delta tracked. The diff covers more than prompts. The whole agent structure is visible, including configurations, policies, and tool definitions.
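
One way to picture a commit-style iteration record is as a small immutable data structure. The sketch below is an illustration only, not the product's schema: the class names `Change` and `IterationCommit` and all field names are hypothetical. The values are the ones from the iteration 7 to 8 config diff shown on this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Change:
    op: str      # "+" for added, "~" for modified
    kind: str    # "tool", "rule", or "instruction"
    name: str
    detail: str


@dataclass(frozen=True)
class IterationCommit:
    from_iteration: int
    to_iteration: int
    trainer: str
    changes: Tuple[Change, ...]
    score_before: float
    score_after: float

    @property
    def score_delta(self) -> float:
        # Rounded to two decimals to avoid float noise such as 0.07000000000000001.
        return round(self.score_after - self.score_before, 2)

    def summary(self) -> str:
        lines = [f"Iteration {self.from_iteration} -> {self.to_iteration} ({self.trainer})"]
        lines += [f"{c.op} {c.kind}: {c.name} ({c.detail})" for c in self.changes]
        lines.append(
            f"score: {self.score_before:.2f} -> {self.score_after:.2f} ({self.score_delta:+.2f})"
        )
        return "\n".join(lines)


commit = IterationCommit(
    from_iteration=7,
    to_iteration=8,
    trainer="evolutionary",
    changes=(
        Change("+", "tool", "verify_source_citation", "Cross-check facts against original document"),
        Change("~", "rule", "escalation_policy", "threshold: 0.4 -> 0.6"),
        Change("~", "instruction", "verification section", 'added: "Always cite page number"'),
    ),
    score_before=0.44,
    score_after=0.51,
)

print(commit.summary())
```

Keeping the record frozen mirrors the git analogy: once an iteration is committed, its contents do not change, and a later iteration is a new record.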

You choose how much control to keep.

Five trainer modes span the range from fully manual review to autonomous evolution, so you can set the level of control at each stage.

Manual

Full control. You review traces, you edit config, you decide what ships.

Self-improvement

The agent analyzes its own mistakes and proposes fixes. You approve or auto-promote.

Evolutionary

Generate config variants. Arena picks winners. Pareto-optimal selection across objectives.
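
Pareto-optimal selection can be sketched in a few lines. A variant stays on the front if no other variant is at least as good on every objective and strictly better on at least one. This is a generic illustration of the technique, not the arena's actual implementation; the variant names, objective names, and scores below are made up for the example.

```python
from typing import Dict, List

Scores = Dict[str, float]  # objective name -> score, higher is better


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(variants: Dict[str, Scores]) -> List[str]:
    """Names of variants that no other variant dominates."""
    return [
        name
        for name, scores in variants.items()
        if not any(
            dominates(other, scores)
            for other_name, other in variants.items()
            if other_name != name
        )
    ]


# Hypothetical variants and objectives, for illustration only.
variants = {
    "variant_a": {"accuracy": 0.51, "cost_efficiency": 0.40},
    "variant_b": {"accuracy": 0.44, "cost_efficiency": 0.70},
    "variant_c": {"accuracy": 0.40, "cost_efficiency": 0.35},  # worse than a and b on both
}

print(pareto_front(variants))  # ['variant_a', 'variant_b']
```

Neither `variant_a` nor `variant_b` beats the other on both objectives, so both survive; choosing between them is a trade-off decision rather than a ranking.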

Deep training

Edits the full agent as files — prompts, tools, rules, code. Git-native workspace.

Step-by-step

Focuses on one evaluator at a time. Improves iteratively until plateau, then moves to the next.
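
The Step-by-step behavior described above (improve against one evaluator until it plateaus, then move on) can be sketched as a loop with a simple plateau rule. Everything here is an assumption for illustration: the function name, the `min_gain` and `patience` parameters, the evaluator names, and the scripted scores standing in for real training iterations.

```python
from typing import Callable, Dict, List


def step_by_step(
    evaluators: List[str],
    improve: Callable[[str], float],
    min_gain: float = 0.01,
    patience: int = 2,
    max_iters: int = 50,
) -> Dict[str, float]:
    """Train against one evaluator at a time.

    Moves to the next evaluator after `patience` consecutive iterations
    that fail to beat the best score so far by at least `min_gain`.
    """
    best: Dict[str, float] = {}
    for name in evaluators:
        best_score = float("-inf")
        stale = 0
        for _ in range(max_iters):
            score = improve(name)  # one training iteration, returns the new score
            if score >= best_score + min_gain:
                best_score, stale = score, 0
            else:
                stale += 1
                if stale >= patience:
                    break  # plateau reached, move to the next evaluator
        best[name] = best_score
    return best


# Scripted scores stand in for real training runs (hypothetical evaluators).
scripted = {
    "citation_check": iter([0.40, 0.46, 0.50, 0.50, 0.50, 0.90]),
    "escalation_check": iter([0.30, 0.30, 0.30]),
}

result = step_by_step(list(scripted), lambda name: next(scripted[name]))
print(result)  # {'citation_check': 0.5, 'escalation_check': 0.3}
```

Note that the 0.90 in the first script is never reached: the loop stops after two flat iterations, which is the cost of any plateau rule and the reason `patience` is worth tuning.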

Agents fix more on their own as you gain confidence.

Each depth level adds a deeper kind of change the system can make. Start with prompt adjustments, progress to full tool creation. The deeper the level, the more the agent can fix on its own.

Training depth ladder
L0 · Prompt tuning · ● live
Edits instructions, rules, priorities

L1 · Diagnostic training · ● live
Reads traces, finds root causes, then edits

L2 · Architecture training · ● live
Changes agent workflow: adds steps, branches

L3 · Tool creation · ● live
Writes new tools, extends agent capabilities

L4 · Curriculum evolution · ○ roadmap
Improves the tests themselves

L5 · Team synthesis · ○ roadmap
Creates specialized sub-agents
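
The ladder can be read as an ordered scale with a ceiling you raise over time. The sketch below assumes levels are cumulative (a ceiling of L2 also permits L0 and L1 changes) and that roadmap levels are never permitted yet. The `Depth` enum, `LIVE_MAX`, and `is_allowed` are hypothetical names, not part of any documented API.

```python
from enum import IntEnum


class Depth(IntEnum):
    L0_PROMPT_TUNING = 0
    L1_DIAGNOSTIC_TRAINING = 1
    L2_ARCHITECTURE_TRAINING = 2
    L3_TOOL_CREATION = 3
    L4_CURRICULUM_EVOLUTION = 4  # roadmap
    L5_TEAM_SYNTHESIS = 5        # roadmap


# Highest level marked "live" on the ladder.
LIVE_MAX = Depth.L3_TOOL_CREATION


def is_allowed(change: Depth, ceiling: Depth) -> bool:
    """ASSUMPTION: a change is allowed if it is live and at or below the ceiling."""
    return change <= min(ceiling, LIVE_MAX)


print(is_allowed(Depth.L1_DIAGNOSTIC_TRAINING, Depth.L2_ARCHITECTURE_TRAINING))  # True
print(is_allowed(Depth.L3_TOOL_CREATION, Depth.L2_ARCHITECTURE_TRAINING))        # False
print(is_allowed(Depth.L4_CURRICULUM_EVOLUTION, Depth.L5_TEAM_SYNTHESIS))        # False
```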

How a training cycle looks in practice.

A logistics agent scores 0.31 on iteration 1. Here's what the trainer changes at each depth level to reach 0.73 by iteration 23.

L0 · Prompt edit

Iteration 2: Trainer rewrites the system prompt to emphasize checking sanctions databases before approving any shipment. Score: 0.31 → 0.38.

L1 · Tool configuration

Iteration 6: Trainer adds "always_query" flag to the sanctions_check tool, ensuring it runs on every entity. Score: 0.38 → 0.49.

L2 · Rule addition

Iteration 11: Trainer adds a rule: "if HS code is dual-use, escalate to compliance review before proceeding." Score: 0.49 → 0.58.

L3 · Tool creation

Iteration 16: Trainer creates a new tool verify_entity_chain that cross-references beneficial owners across three registries. Score: 0.58 → 0.68.

Combined effect

Iteration 23: All improvements compound. The agent now handles sanctions screening, dual-use goods, and entity verification in one pass. Final score: 0.73.
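
The score trajectory in this walkthrough can be checked with a few lines of arithmetic. The sketch below uses only the iteration numbers and scores quoted above; the variable names are illustrative.

```python
# (iteration, step label, score) as quoted in the walkthrough above.
milestones = [
    (1, "baseline", 0.31),
    (2, "L0 prompt edit", 0.38),
    (6, "L1 tool configuration", 0.49),
    (11, "L2 rule addition", 0.58),
    (16, "L3 tool creation", 0.68),
    (23, "combined effect", 0.73),
]

# Gain contributed between each pair of consecutive milestones.
gains = [round(b[2] - a[2], 2) for a, b in zip(milestones, milestones[1:])]

for (iteration, label, _), gain in zip(milestones[1:], gains):
    print(f"iteration {iteration:>2}  {label:<22} {gain:+.2f}")

total = round(milestones[-1][2] - milestones[0][2], 2)
print(f"total gain: {total:+.2f}")  # total gain: +0.42
```

The four named changes account for +0.37 of the +0.42 total. The remaining +0.05 arrives between iterations 16 and 23 and is not attributed to a specific change in the walkthrough.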