A skill is a folder containing a SKILL.md file with instructions and metadata, plus optional scripts, reference docs, and assets. Skills follow the open Agent Skills format. One skill can be shared across every agent that needs it.
A skill can be as simple as a single SKILL.md file or as rich as a full directory with scripts and data. The SKILL.md is the entry point. Scripts extend what the agent can do. References provide context without bloating the main file.
Operator treats agents as programs to be evolved, not prompts to be tweaked by hand. The manager walks you through deploying a first version, testing variants against real data, and running an evolution loop until the best approach emerges.
Define the problem
Tell the manager what you want agents to solve:
- AQI prediction
- Fraud detection
- Compound screening
SKILL.md is kept small enough to fit in context alongside test results and evolution instructions (under 500 lines). If an agent needs more, the manager splits supporting logic into scripts/ and detailed context into references/, keeping SKILL.md as the decision-making core that the evolution loop modifies.
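As a sketch, a split skill for an AQI agent might look like this (all names below are illustrative, not required by the format):

```
aqi-predictor/
├── SKILL.md              # decision-making core (under 500 lines)
├── scripts/
│   └── fetch_weather.py  # supporting logic the agent can run
└── references/
    └── data-sources.md   # detailed context, loaded only when needed
```

The evolution loop edits SKILL.md; the scripts and references change less often.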
Test
The manager spins up multiple instances (default 5, adjustable based on your plan capacity), each running a different agent variant. It explains its approach as it goes: why it chose certain strategies, what data it is using, and what the scores mean. Variants are not parameter tweaks. They use fundamentally different strategies:

| Strategy dimension | Example variants |
|---|---|
| Data sources | Weather only vs. multi-source (weather + satellite + activity) |
| Reasoning approach | Chain of thought vs. direct output |
| Algorithm | Heuristic driven vs. ensemble methods |
| Tool usage | Different integrations or API calls |
Where test data comes from
After scoring individuals, the manager may test ensembles of the top performers and report back. Two agents with similar scores but different strategies often produce a stronger combination than either alone.
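A minimal sketch of that kind of ensemble, using weighted averaging of numeric predictions. The function name and weighting scheme are illustrative, not part of the product:

```python
def ensemble_predict(predictions, weights=None):
    """Aggregate numeric predictions from several agents into one value.

    `weights` lets a stronger agent (e.g. by past score) count more.
    Defaults to a plain average.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Two agents with similar scores but different strategies,
# weighted by their evaluation scores:
combined = ensemble_predict([82.0, 78.0], weights=[0.92, 0.88])
```

Averaging suits regression-style outputs like an AQI forecast; for classification, majority voting is the analogous combination.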
How scoring works
You decide what good looks like. Tell the manager the metric and, if you have them, test cases. If you are not sure how to evaluate, the manager suggests an approach and checks with you before running it.
Common evaluation patterns
**Against historical data.** Known outcomes exist. The manager strips answers, runs the agent, and compares to ground truth. Works for: fraud detection, classification, variant scoring, extraction.

**Against future outcomes.** The task is forward facing. The manager records predictions now and checks when outcomes resolve. Works for: AQI prediction, trending detection, churn prediction, any forecast with a deadline.

**LLM as judge.** The manager uses a separate model to evaluate agent output on specific criteria (faithfulness, completeness, correctness). Works for: document summarization, code review, hypothesis generation.

These are not rigid modes. The manager adapts to whatever evaluation approach makes sense for your problem. If you have a custom scoring function or a specific dataset, describe it and the manager will use it.
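The historical-data pattern can be sketched in a few lines. Everything here is illustrative (the helper name, the accuracy metric, the toy fraud rule); the actual harness the manager builds depends on your problem:

```python
def score_against_history(agent, labeled_cases):
    """Strip known answers, run the agent, compare to ground truth.

    `agent` is any callable; `labeled_cases` is a list of
    (input, expected) pairs. Returns accuracy in [0, 1].
    """
    correct = 0
    for inputs, expected in labeled_cases:
        prediction = agent(inputs)  # the agent never sees `expected`
        if prediction == expected:
            correct += 1
    return correct / len(labeled_cases)

# Toy fraud-detection variant: flag transactions over a fixed threshold.
def flag_large(tx):
    return tx["amount"] > 1000

cases = [
    ({"amount": 1500}, True),
    ({"amount": 200}, False),
    ({"amount": 999}, True),   # fraud below the threshold: this variant misses it
]
score = score_against_history(flag_large, cases)  # 2 of 3 correct
```

The same shape works with precision, recall, or MAE in place of accuracy.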
Evolve
After a test round, the manager reports results and proposes what to try next. Winners advance. Losers get deleted.

Let agents work on real data
The best way to evaluate an agent is to have it do the actual work, not run in a sandbox. If an agent is detecting fraud, connect it to your real payment stream and let it flag real transactions. The metrics (precision, recall, false positive rate) flow back naturally because the agent is doing the actual work, not simulating it. This works because each instance is its own computer: if you give a variant instance access to your APIs, databases, and data feeds via Environment, it can run the full workflow end to end. The manager can then pull real performance data to score how each variant did.

How the manager creates the next generation
The next generation varies the top performers. Common patterns:
- Clone the best as a control. If nothing in the next generation beats it, the changes did not help.
- Change one thing per variant. A data source, a heuristic, a threshold, a reasoning strategy.
- Combine ideas from different winners. The data pipeline from one and the scoring logic from another.
- Try something completely different to avoid local optima.
- Remove a technique from the winner to test if it was actually helping.
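The patterns above can be sketched as a generation builder. This is a hand-wavy illustration, not the manager's actual algorithm; `mutate` and `crossover` are hypothetical helpers standing in for "change one strategy dimension" and "combine ideas from two winners":

```python
import random

def next_generation(ranked, mutate, crossover, n=5):
    """Build the next generation from ranked variants (best first)."""
    best, runner_up = ranked[0], ranked[1]
    children = [
        dict(best),                 # clone the best as a control
        mutate(best),               # change one thing
        crossover(best, runner_up), # combine ideas from two winners
    ]
    while len(children) < n:
        # fill remaining slots with mutations of random survivors
        children.append(mutate(random.choice(ranked)))
    return children[:n]

# Toy usage: a variant is a dict of strategy choices.
def mutate(v):
    return {**v, "threshold": v["threshold"] + 1}

def crossover(a, b):
    return {**a, "data_source": b["data_source"]}

ranked = [
    {"threshold": 5, "data_source": "weather"},
    {"threshold": 3, "data_source": "satellite"},
]
children = next_generation(ranked, mutate, crossover)
```

The unmodified clone matters: if nothing in the new generation beats it, the round's changes did not help.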
Example evolution report
Convergence
The manager tells you when agents are plateauing. Typical stopping points: no improvement for several consecutive generations, a target score is reached, or you decide the current version is good enough.

At convergence, the manager may test ensemble combinations of the best agents across all generations. The final deployed agent might combine multiple approaches and aggregate outputs.
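A simple convergence check combining those stopping points might look like this (a sketch; the function name, `patience` default, and plateau rule are assumptions, not the product's exact criteria):

```python
def has_converged(score_history, target=None, patience=3):
    """Stop when a target score is reached or scores plateau.

    `score_history` is the best score per generation, oldest first.
    `patience` is how many generations without improvement to tolerate.
    """
    if target is not None and score_history and score_history[-1] >= target:
        return True
    if len(score_history) <= patience:
        return False  # too early to call a plateau
    recent, earlier = score_history[-patience:], score_history[:-patience]
    # Plateau: nothing recent beats the best score from before the window
    return max(recent) <= max(earlier)
```

An automated loop would call this after each generation and stop itself (or report back) once it returns True.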
Running the loop with less supervision
If you want, the manager can set up an automation that triggers the next generation on a schedule. Each automated generation checks its own convergence criteria and reports back or stops itself when agents plateau.

Deploy
When you are ready, tell the manager to deploy. It installs the winning agent (or ensemble) on your production instances, cleans up variants, and gives you a summary. Because each agent is plain files (SKILL.md, scripts, references), the entire experiment history is inspectable. Every change is a human readable diff. Every score is tied to a concrete test case.
Downloading an agent’s skill
To download a skill from an instance, open the instance page, go to the Server tab, and browse the files. Skills live in `.openclaw/skills/{skill-name}/`. You can download any file individually: the SKILL.md, scripts, references, or assets. This is useful for backing up a winning agent, porting it to another instance, or inspecting what the evolution loop produced.
Agent design principles
One clear metric
“Predict AQI within 10 points” not “monitor air quality.” Narrow problems converge faster. Broad problems plateau.
Measurable output
Accuracy, precision, recall, F1, MAE, hit rate, convergence speed. If you can put a number on it, the loop can optimize.
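For classification-style agents, the core numbers reduce to a few lines. A minimal sketch (function name and list-of-booleans interface are assumptions for illustration):

```python
def precision_recall_f1(predicted, actual):
    """Binary classification metrics from parallel True/False lists."""
    tp = sum(p and a for p, a in zip(predicted, actual))        # true positives
    fp = sum(p and not a for p, a in zip(predicted, actual))    # false positives
    fn = sum(a and not p for p, a in zip(predicted, actual))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Any of the three outputs can serve as the single number the evolution loop optimizes.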
External ground truth
Agents scored against real world data (measurements, outcomes, labeled cases) improve faster than agents graded by another LLM.
Short feedback loops
Daily or weekly resolution means more generations per unit of time. The manager can handle longer timelines, but faster is better.
Related docs
Quickstart
Deploy your first agent in one session
Instances
What an instance can do
Automations
Run the evolution loop on a schedule
Integrations
Connect to external data sources