A skill is a SKILL.md file with instructions and metadata, plus optional scripts, reference docs, and assets. One skill can be shared across every agent that needs it.
A skill can be as simple as a single SKILL.md file or as rich as a full directory with scripts and data. The SKILL.md is the entry point. Scripts extend what the agent can do. References provide context without bloating the main file.
Operator treats skills as programs to be evolved, not prompts to be tweaked by hand. The manager walks you through building a first version, testing variants against real work, and running an improvement loop until the skill converges.
Build
Tell the manager what you want the skill to do.
- Lead scoring
- Alert triage
- Resume screening
The SKILL.md is kept small enough to fit in context alongside test results and improvement instructions (under 500 lines). If a skill needs more, the manager splits supporting logic into scripts/ and detailed context into references/, keeping the SKILL.md as the decision-making core that the improvement loop evolves.
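The 500-line budget is easy to check mechanically. A minimal sketch of that kind of size check (the function name and limit handling here are illustrative, not the manager's actual implementation):

```python
from pathlib import Path
import tempfile

def needs_split(skill_md: Path, limit: int = 500) -> bool:
    """True when SKILL.md exceeds the line budget and supporting
    logic should move into scripts/ and references/."""
    return len(skill_md.read_text().splitlines()) > limit

# Demo on a throwaway file; the real path would be the skill's SKILL.md.
with tempfile.NamedTemporaryFile("w", suffix=".md", delete=False) as f:
    f.write("line\n" * 10)
print(needs_split(Path(f.name)))  # -> False
```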
Test
The manager spins up multiple instances (default 5, adjustable based on your plan capacity), each running a different variant of the skill. It explains its approach as it goes: why it chose certain strategies, what inputs it is using, and what the scores mean. Variants are not parameter tweaks. They use fundamentally different strategies:

| Strategy dimension | Example variants |
|---|---|
| Reasoning approach | Chain of thought vs. direct output |
| Data emphasis | Data heavy vs. heuristic driven |
| Field priority | Different inputs weighted differently |
| Tool usage | Different integrations or API calls |
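One way to picture these strategy dimensions is as structured variant configurations. A rough sketch, where every field name and value is hypothetical rather than taken from the product:

```python
# Hypothetical variant configs spanning the strategy dimensions above:
# reasoning approach, data emphasis, and field priority.
variants = [
    {"name": "cot-data-heavy", "reasoning": "chain_of_thought",
     "emphasis": "data", "field_weights": {"revenue": 0.5, "industry": 0.2}},
    {"name": "direct-heuristic", "reasoning": "direct",
     "emphasis": "heuristics", "field_weights": {"revenue": 0.2, "industry": 0.5}},
]

for v in variants:
    print(v["name"], "->", v["reasoning"], "/", v["emphasis"])
```

The point is that each variant differs in kind, not just in a tuned number.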
Where test inputs come from
After scoring individuals, the manager may test ensembles of the top performers and report back. Two variants with similar scores but different reasoning strategies often produce a stronger combination than either alone.
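A minimal sketch of what combining two top performers could look like, assuming the variants emit numeric lead scores (the averaging rule and the scores are illustrative only):

```python
# Sketch: a simple ensemble of two top-scoring variants.
# Variant outputs here are lead scores on a 0-100 scale.
def ensemble(scores_a, scores_b):
    """Average two variants' scores per lead; large disagreements
    could also be flagged for human review."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

variant_a = [80, 35, 60]   # e.g. a data-heavy variant
variant_b = [70, 45, 90]   # e.g. a heuristic-driven variant
print(ensemble(variant_a, variant_b))  # -> [75.0, 40.0, 75.0]
```

Because the two variants reason differently, their errors tend to be less correlated, which is why the blend can beat either alone.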
How testing works
You decide what good looks like. Tell the manager the metric and, if you have them, test cases. If you are not sure how to evaluate, the manager suggests an approach and checks with you before running it.
Common evaluation patterns
Against historical data. Known outcomes exist. The manager strips answers, runs the skill, and compares to ground truth. Works for: lead scoring, resume screening, classification, extraction.

Against future outcomes. The task is live or forward facing. The manager records predictions now and sets up checks for when results come in. Works for: reply rates, triage accuracy, engagement metrics, any prediction with a deadline.

Human ranking. The task is subjective. The manager presents outputs side by side without labels and you rank them. Works for: writing quality, design, tone.

These are not rigid modes. The manager adapts to whatever evaluation approach makes sense for your problem. If you have a custom scoring function or a specific dataset, describe it and the manager will use it.
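The historical-data pattern can be sketched in a few lines: hide the known outcome, run the skill on the remaining fields, then compare predictions to ground truth. Here `run_skill` is a placeholder decision rule, not the real skill:

```python
# Sketch of evaluation against historical data with known outcomes.
def run_skill(record):
    return record["score_hint"] > 50  # placeholder for the actual skill

historical = [
    {"score_hint": 80, "converted": True},
    {"score_hint": 30, "converted": False},
    {"score_hint": 60, "converted": False},
]

# Strip the answer field, run the skill, compare to ground truth.
predictions = [run_skill({k: v for k, v in r.items() if k != "converted"})
               for r in historical]
truth = [r["converted"] for r in historical]
accuracy = sum(p == t for p, t in zip(predictions, truth)) / len(truth)
print(f"accuracy: {accuracy:.2f}")  # -> accuracy: 0.67
```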
Improve
After a test round, the manager reports results and proposes what to try next. Winners advance. Losers get deleted.

Let agents do the real work
The best way to evaluate a skill is to have the agent actually do the job, not run it in a sandbox. If an agent is prospecting, connect it to your real CRM. Let it send real outbound. The metrics (open rate, reply rate, booked meetings) flow back naturally because the agent is doing the actual work, not simulating it.

This works because each instance is its own computer. If you give a variant instance access to your CRM, your email platform, and your calendar via Environment, it can run the full workflow end to end. The manager can then pull real performance data from those same systems to score how each variant did.
How the manager varies skill variants
The next generation varies the top performers. Common patterns:
- Clone the best as a control. If nothing in the next generation beats it, the changes did not help.
- Change one thing per variant. A heuristic, a step, a weight, a prompt structure, a data source.
- Combine ideas from different winners. The reasoning from one and the scoring from another.
- Try something completely different to avoid local optima.
- Remove a technique from the winner to test if it was actually helping.
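The patterns above can be sketched as next-generation construction over variant configs. All names and values here are illustrative, not the manager's actual representation:

```python
import copy

# The winning variant's config from the previous generation (illustrative).
best = {"reasoning": "chain_of_thought",
        "weights": {"revenue": 0.5, "industry": 0.2}}

def mutate_one(variant, field, value):
    """Change exactly one weight, leaving the rest of the winner intact."""
    child = copy.deepcopy(variant)
    child["weights"][field] = value
    return child

next_gen = [
    copy.deepcopy(best),                     # clone the best as a control
    mutate_one(best, "revenue", 0.7),        # change one thing
    mutate_one(best, "industry", 0.0),       # remove a technique to test it
    {"reasoning": "direct", "weights": {}},  # try something completely different
]
print(len(next_gen))  # -> 4
```

Keeping the unchanged clone in the pool is what makes regressions visible: if nothing beats it, the round's changes did not help.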
Example improvement report
Convergence
The manager tells you when the skill is plateauing. Typical stopping points: no improvement for several consecutive generations, a target score is reached, or you decide the current version is good enough.

At convergence, the manager may test ensemble combinations of the best variants across all generations. The final deployed skill might be a meta skill that runs multiple approaches and aggregates outputs.
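A plateau check of this kind can be sketched as a function over the per-generation best scores. The patience window and target handling are assumptions for illustration:

```python
# Sketch: stop when the best score has not improved for `patience`
# consecutive generations, or when a target score is reached.
def has_converged(best_scores, patience=3, target=None):
    if target is not None and best_scores and best_scores[-1] >= target:
        return True
    if len(best_scores) <= patience:
        return False
    recent = best_scores[-patience:]
    return max(recent) <= best_scores[-(patience + 1)]

print(has_converged([0.70, 0.78, 0.80, 0.80, 0.79, 0.80]))  # -> True
```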
Running the loop with less supervision
If you want, the manager can set up an automation that triggers the next generation on a schedule. Each automated generation checks its own convergence criteria and reports back or stops itself when the skill plateaus.

Deploy
When you are ready, tell the manager to deploy. It installs the winning skill (or ensemble) on your production instances, cleans up variant instances, and gives you a summary.

Because a skill is just files (SKILL.md, scripts, references), the entire experiment history is inspectable. Every change is a human readable diff. Every score is tied to a concrete test case.
Downloading a skill
To download a skill from an instance, open the instance page, go to the Server tab, and browse the files. Skills live in .openclaw/skills/{skill-name}/. You can download any file individually: the SKILL.md, scripts, references, or assets. This is useful for backing up a winning skill, porting it to another instance, or inspecting what the improvement loop produced.
Skill design principles
One clear question
“Score this lead” not “manage the pipeline.” Narrow skills compose into workflows. Broad skills plateau.
Measurable output
Accuracy, precision, recall, F1, Brier score, hit rate, reply rate, open rate. If you can put a number on it, the loop can optimize.
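As one concrete example, the Brier score measures how well probabilistic predictions match what actually happened (lower is better). The probabilities below are made up for illustration:

```python
# Brier score: mean squared error between predicted probabilities
# and binary outcomes. 0.0 is perfect; 0.25 is the score of always
# predicting 0.5.
def brier_score(probs, outcomes):
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Predicted reply probabilities vs. what actually happened (1 = replied).
print(round(brier_score([0.9, 0.2, 0.6], [1, 0, 0]), 3))  # -> 0.137
```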
External ground truth
Skills that validate against data outside the model (filings, source code, public records, outcomes) improve faster than skills graded by another LLM.
Short feedback loops
Daily or weekly resolution means more generations per unit of time. The manager can handle longer timelines, but faster is better.
Related docs
Quickstart
Build your first skill in one session
Instances
What an instance can do
Automations
Run the improvement loop on a schedule
Integrations
Connect to external data sources