An agent is an autonomous program that tackles a hard problem. Internally, each agent is powered by a skill: a folder containing a SKILL.md file with instructions and metadata, plus optional scripts, reference docs, and assets. Skills follow the open Agent Skills format. One skill can be shared across every agent that needs it.
aqi-predictor/
├── SKILL.md           # Instructions, metadata, evaluation criteria
├── scripts/           # Executable code (Python, Bash, JS)
│   ├── fetch_weather.py
│   └── score_predictions.py
├── references/        # Additional docs the agent loads on demand
│   └── aqi-methodology.md
└── assets/            # Templates, schemas, lookup tables
    └── city-list.json
A skill can be as simple as a single SKILL.md file or as rich as a full directory with scripts and data. The SKILL.md is the entry point. Scripts extend what the agent can do. References provide context without bloating the main file. Operator treats agents as programs to be evolved, not prompts to be tweaked by hand. The manager walks you through deploying a first version, testing variants against real data, and running an evolution loop until the best approach emerges.
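As a sketch of what a scoring script in that tree might contain (the file name mirrors the tree above, but these contents are hypothetical, not the actual script), scripts/score_predictions.py could compare predicted AQI values against actual readings:

```python
"""Hypothetical sketch of scripts/score_predictions.py: score AQI
predictions against actual readings using mean absolute error (MAE)."""


def score_predictions(predicted: dict, actual: dict) -> float:
    """Return the mean absolute error over cities present in both dicts."""
    cities = predicted.keys() & actual.keys()
    if not cities:
        raise ValueError("no overlapping cities to score")
    return sum(abs(predicted[c] - actual[c]) for c in cities) / len(cities)


if __name__ == "__main__":
    # Illustrative numbers only.
    predicted = {"Delhi": 180, "Lagos": 95, "Denver": 60}
    actual = {"Delhi": 172, "Lagos": 110, "Denver": 58}
    print(f"MAE: {score_predictions(predicted, actual):.1f}")
```

Keeping the scoring logic in a standalone script like this means the evolution loop can modify the SKILL.md without touching how results are measured.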

Define the problem

Tell the manager what you want agents to solve.
Build an agent that predicts tomorrow's AQI for 50 cities.
Pull weather data, pollution feeds, satellite imagery, and local activity patterns.
Score daily against actual AQI readings.
The manager asks at most one or two clarifying questions (what is the problem, how will you measure success), then fills in sensible defaults for everything you did not specify. It deploys the agent, picks or creates an instance, writes the skill, and walks you through testing.
If it is not obvious how to measure success, the manager suggests a metric and confirms with you before proceeding.
The SKILL.md is kept small enough to fit in context alongside test results and evolution instructions (under 500 lines). If an agent needs more, the manager splits supporting logic into scripts/ and detailed context into references/, keeping the SKILL.md as the decision-making core that the evolution loop modifies.

Test

The manager spins up multiple instances (default 5, adjustable based on your plan capacity), each running a different agent variant. It explains its approach as it goes: why it chose certain strategies, what data it is using, and what the scores mean. Variants are not parameter tweaks. They use fundamentally different strategies:
| Strategy dimension  | Example variants                                              |
|---------------------|---------------------------------------------------------------|
| Data sources        | Weather only vs. multi-source (weather + satellite + activity)|
| Reasoning approach  | Chain of thought vs. direct output                            |
| Algorithm           | Heuristic driven vs. ensemble methods                         |
| Tool usage          | Different integrations or API calls                           |
All variants receive the same test inputs and get scored against the same metric.
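That fairness requirement is simple to state in code. A minimal sketch (the variant names, the stand-in predict functions, and all numbers are illustrative):

```python
# Every variant sees the same test inputs and is scored with the same
# metric, so scores across variants are directly comparable.
test_inputs = [{"city": "Delhi"}, {"city": "Lagos"}]
ground_truth = [172, 110]  # actual AQI readings (illustrative)


def mae(preds, truth):
    """Mean absolute error: the shared metric for every variant."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)


# Stand-ins for real strategies; a real variant would run its full pipeline.
variants = {
    "weather-only": lambda case: 150,
    "multi-source": lambda case: {"Delhi": 165, "Lagos": 120}[case["city"]],
}

scores = {
    name: mae([predict(case) for case in test_inputs], ground_truth)
    for name, predict in variants.items()
}
print(scores)
```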

Where test data comes from

Data you provide

Historical examples, labeled datasets, or specific test cases you give the manager.

Found via web search

Benchmarks, public datasets, and real world data sources the manager finds.

Synthetic generation

Last resort. The manager generates varied cases to cover edge conditions.
After scoring individuals, the manager may test ensembles of the top performers and report back. Two agents with similar scores but different strategies often produce a stronger combination than either alone.
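One simple way to test such a combination is to average the two agents' outputs and score the blend with the same metric. A sketch with made-up numbers, where the averaged predictions happen to beat both individual agents:

```python
def mae(preds, truth):
    """Mean absolute error against known actual values."""
    return sum(abs(p - t) for p, t in zip(preds, truth)) / len(truth)


actual = [172, 110, 58]            # actual AQI readings (illustrative)
agent_a = [180, 95, 60]            # e.g. a weather-driven variant
agent_b = [160, 120, 52]           # e.g. a satellite-driven variant

# A basic ensemble: average the two agents' predictions per city.
ensemble = [(a + b) / 2 for a, b in zip(agent_a, agent_b)]

print(mae(agent_a, actual), mae(agent_b, actual), mae(ensemble, actual))
```

The errors of the two agents point in different directions, so averaging cancels some of each; that is why similar scores with different strategies combine well.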

How scoring works

You decide what good looks like. Tell the manager the metric and, if you have them, test cases. If you are not sure how to evaluate, the manager suggests an approach and checks with you before running it.
Against historical data. Known outcomes exist. The manager strips answers, runs the agent, and compares to ground truth. Works for: fraud detection, classification, variant scoring, extraction.

Against future outcomes. The task is forward facing. The manager records predictions now and checks when outcomes resolve. Works for: AQI prediction, trending detection, churn prediction, any forecast with a deadline.

LLM as judge. The manager uses a separate model to evaluate agent output on specific criteria (faithfulness, completeness, correctness). Works for: document summarization, code review, hypothesis generation.

These are not rigid modes. The manager adapts to whatever evaluation approach makes sense for your problem. If you have a custom scoring function or a specific dataset, describe it and the manager will use it.
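The future-outcomes mode reduces to a prediction log that gets scored once outcomes resolve. A minimal sketch of that mechanic (the field names and helper functions are hypothetical, not the product's API):

```python
import datetime

# Record predictions now; score them once the outcome date arrives.
prediction_log = []


def record(city: str, date: datetime.date, predicted_aqi: float) -> None:
    prediction_log.append(
        {"city": city, "date": date, "predicted": predicted_aqi, "actual": None}
    )


def resolve(city: str, date: datetime.date, actual_aqi: float) -> None:
    """Fill in the actual reading once it is known."""
    for entry in prediction_log:
        if entry["city"] == city and entry["date"] == date:
            entry["actual"] = actual_aqi


def score_resolved() -> float:
    """Mean absolute error over predictions whose outcomes have resolved."""
    resolved = [e for e in prediction_log if e["actual"] is not None]
    return sum(abs(e["predicted"] - e["actual"]) for e in resolved) / len(resolved)


tomorrow = datetime.date(2025, 6, 2)
record("Delhi", tomorrow, 180)   # prediction made today
resolve("Delhi", tomorrow, 172)  # actual reading arrives tomorrow
print(score_resolved())
```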

Evolve

After a test round, the manager reports results and proposes what to try next. Winners advance. Losers get deleted.

Let agents work on real data

The best way to evaluate an agent is to have it do the actual work, not run in a sandbox. If an agent is detecting fraud, connect it to your real payment stream. Let it flag real transactions. The metrics (precision, recall, false positive rate) flow back naturally because the agent is doing the actual work, not simulating it. This works because each instance is its own computer. If you give a variant instance access to your APIs, your databases, and your data feeds via Environment, it can run the full workflow end to end. The manager can then pull real performance data to score how each variant did.
The more real data you expose to the instances, the better the evaluation. An agent scoring synthetic fraud in isolation tells you less than an agent scoring real transactions you can verify against confirmed chargebacks.
The next generation varies the top performers. Common patterns:
  • Clone the best as a control. If nothing in the next generation beats it, the changes did not help.
  • Change one thing per variant. A data source, a heuristic, a threshold, a reasoning strategy.
  • Combine ideas from different winners. The data pipeline from one and the scoring logic from another.
  • Try something completely different to avoid local optima.
  • Remove a technique from the winner to test if it was actually helping.
You can guide this as much or as little as you want. Tell the manager what to change, or ask it to pick. It explains its reasoning either way.
Each generation gets the same test inputs and scoring methodology so results are directly comparable. The manager tracks lineage across generations: which variant descended from which, what changed, and the score delta.
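Lineage tracking amounts to keeping a small record per variant. One way to sketch it (the record fields are illustrative, chosen to match the report format below, not the manager's internal representation):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantRecord:
    name: str
    parent: Optional[str]  # None for fresh baselines with no ancestor
    change: str            # one-line description of what differs from the parent
    score: float           # percentage score on the shared metric


def delta(records: dict, name: str) -> Optional[float]:
    """Score change relative to the parent, if the parent was scored."""
    rec = records[name]
    if rec.parent is None or rec.parent not in records:
        return None
    return rec.score - records[rec.parent].score


records = {
    "gen2-v3": VariantRecord("gen2-v3", "gen1-v1", "Added activity data", 78.2),
    "gen3-v2": VariantRecord("gen3-v2", "gen2-v3", "Added satellite data", 83.1),
}
print(f"{delta(records, 'gen3-v2'):+.1f}")  # score delta vs. parent
```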

Example evolution report

Generation 3 results:

| Variant       | Parent   | Change                      | Score  | Delta  |
|---------------|----------|-----------------------------|--------|--------|
| gen3-v1       | gen2-v3  | Clone (control)             | 78.2%  | -      |
| gen3-v2       | gen2-v3  | Added satellite data        | 83.1%  | +4.9%  |
| gen3-v3       | gen2-v1  | Swapped to ensemble method  | 76.9%  | -1.3%  |
| gen3-v4       | gen2-v3  | Crossover w/ gen2-v1        | 80.1%  | +1.9%  |
| gen3-v5       | -        | Random (heuristic baseline) | 62.3%  | -15.9% |

Winner: gen3-v2 (83.1%). Satellite imagery added a strong signal
for predicting high-pollution days. Want another generation?

Convergence

The manager tells you when agents are plateauing. Typical stopping points: no improvement for several consecutive generations, a target score is reached, or you decide the current version is good enough.
At convergence, the manager may test ensemble combinations of the best agents across all generations. The final deployed agent might combine multiple approaches and aggregate outputs.

Running the loop with less supervision

If you want, the manager can set up an automation that triggers the next generation on a schedule. Each automated generation checks its own convergence criteria and reports back or stops itself when agents plateau.

Deploy

When you are ready, tell the manager to deploy. It installs the winning agent (or ensemble) on your production instances, cleans up variants, and gives you a summary.
Deployed: aqi-predictor gen3-v2
Starting score: 61.3%  →  Final score: 83.1%
Generations: 3
Strategy: Multi-source with satellite data and burn pattern detection
Key insight: Satellite imagery is 2x more predictive than weather data alone
Because agents are powered by readable skill files (SKILL.md, scripts, references), the entire experiment history is inspectable. Every change is a human readable diff. Every score is tied to a concrete test case.

Downloading an agent’s skill

To download a skill from an instance, open the instance page, go to the Server tab, and browse the files. Skills live in .openclaw/skills/{skill-name}/. You can download any file individually: the SKILL.md, scripts, references, or assets. This is useful for backing up a winning agent, porting it to another instance, or inspecting what the evolution loop produced.

Agent design principles

One clear metric

“Predict AQI within 10 points” not “monitor air quality.” Narrow problems converge faster. Broad problems plateau.

Measurable output

Accuracy, precision, recall, F1, MAE, hit rate, convergence speed. If you can put a number on it, the loop can optimize.
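Any of these metrics fits in a few lines. For example, precision, recall, and F1 for a binary task such as fraud flagging (the labels here are made up):

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


predicted = [1, 1, 0, 1, 0]  # agent's flags (illustrative)
actual = [1, 0, 0, 1, 1]     # confirmed outcomes (illustrative)
print(precision_recall_f1(predicted, actual))
```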

External ground truth

Agents scored against real world data (measurements, outcomes, labeled cases) improve faster than agents graded by another LLM.

Short feedback loops

Daily or weekly resolution means more generations per unit of time. The manager can handle longer timelines, but faster is better.

Quickstart

Deploy your first agent in one session

Instances

What an instance can do

Automations

Run the evolution loop on a schedule

Integrations

Connect to external data sources
Start building →