A skill is a self-contained capability that an agent can execute. It follows the open Agent Skills format: a folder containing a SKILL.md file with instructions and metadata, plus optional scripts, reference docs, and assets. One skill can be shared across every agent that needs it.
lead-scoring/
├── SKILL.md           # Instructions, metadata, evaluation criteria
├── scripts/           # Executable code (Python, Bash, JS)
│   ├── score.py
│   └── fetch_crm.py
├── references/        # Additional docs the agent loads on demand
│   └── scoring-rules.md
└── assets/            # Templates, schemas, lookup tables
    └── field-weights.json
A skill can be as simple as a single SKILL.md file or as rich as a full directory with scripts and data. The SKILL.md is the entry point. Scripts extend what the agent can do. References provide context without bloating the main file. Operator treats skills as programs to be evolved, not prompts to be tweaked by hand. The manager walks you through building a first version, testing variants against real work, and running an improvement loop until the skill converges.
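As a sketch of what lives in scripts/, a script is just ordinary executable code the agent can invoke. A minimal, hypothetical score.py for the lead-scoring layout above might look like this (the weight values and field names are illustrative, not part of the format):

```python
# Hypothetical weights; the real skill might load these from assets/field-weights.json.
DEFAULT_WEIGHTS = {"email_opens": 2.0, "site_visits": 1.5, "deal_size": 0.001}

def score_lead(lead: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of numeric lead fields, clamped to the 0-100 range."""
    raw = sum(weights.get(field, 0.0) * float(value)
              for field, value in lead.items()
              if isinstance(value, (int, float)))
    return max(0.0, min(100.0, raw))

# Example: a lead with 10 email opens and 2 site visits scores 23.0.
```

Because the script is plain code, the improvement loop can diff, test, and evolve it the same way it evolves the SKILL.md.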

Build

Tell the manager what you want the skill to do.
Build a skill that scores inbound leads using our Salesforce data and email engagement.
Train on the last 12 months of closed/won and closed/lost deals.
Measure against which leads actually convert.
The manager asks at most one or two clarifying questions (what is the task, how will you know it worked), then fills in sensible defaults for everything you did not specify. It creates the skill, picks or creates an instance, installs it, and walks you through testing.
If it is not obvious how to measure success, the manager suggests a metric and confirms with you before proceeding.
The SKILL.md is kept small enough to fit in context alongside test results and improvement instructions (under 500 lines). If a skill needs more, the manager splits supporting logic into scripts/ and detailed context into references/, keeping the SKILL.md as the decision-making core that the improvement loop evolves.

Test

The manager spins up multiple instances (default 5, adjustable based on your plan capacity), each running a different variant of the skill. It explains its approach as it goes: why it chose certain strategies, what inputs it is using, and what the scores mean. Variants are not parameter tweaks. They use fundamentally different strategies:
| Strategy dimension | Example variants                      |
|--------------------|---------------------------------------|
| Reasoning approach | Chain of thought vs. direct output    |
| Data emphasis      | Data-heavy vs. heuristic-driven       |
| Field priority     | Different inputs weighted differently |
| Tool usage         | Different integrations or API calls   |
All variants receive the same test inputs and get scored against the same metric.
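Scoring a generation reduces to running every variant over one shared test set with one shared metric. A minimal sketch (function and variant names are hypothetical):

```python
def evaluate_variants(variants, test_inputs, ground_truth, metric):
    """Run each variant on identical inputs and score all of them with one metric."""
    scores = {}
    for name, run_skill in variants.items():
        predictions = [run_skill(x) for x in test_inputs]
        scores[name] = metric(predictions, ground_truth)
    return scores

def accuracy(predictions, truth):
    """Fraction of predictions matching ground truth."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)
```

Because every variant sees the same inputs and the same metric, the score differences reflect strategy differences, not test-set luck.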

Where test inputs come from

Data you provide

Historical examples, labeled datasets, or specific test cases you give the manager.

Found via web search

Benchmarks, public datasets, and real-world examples the manager finds.

Synthetic generation

Last resort. The manager generates varied cases to cover edge conditions.
After scoring individual variants, the manager may test ensembles of the top performers and report back. Two variants with similar scores but different reasoning strategies often produce a stronger combination than either alone.
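An ensemble can be as simple as aggregating each top variant's output, here by averaging numeric scores (a sketch; any aggregation the metric supports would work):

```python
from statistics import mean

def ensemble(variant_fns):
    """Combine several skill variants by averaging their numeric outputs."""
    def run(lead):
        return mean(fn(lead) for fn in variant_fns)
    return run
```

The combined skill is then scored exactly like any single variant, so the manager can tell whether the ensemble actually beats its members.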

How testing works

You decide what good looks like. Tell the manager the metric and, if you have them, test cases. If you are not sure how to evaluate, the manager suggests an approach and checks with you before running it.
Against historical data. Known outcomes exist. The manager strips answers, runs the skill, and compares to ground truth. Works for: lead scoring, resume screening, classification, extraction.

Against future outcomes. The task is live or forward-facing. The manager records predictions now and sets up checks for when results come in. Works for: reply rates, triage accuracy, engagement metrics, any prediction with a deadline.

Human ranking. The task is subjective. The manager presents outputs side by side without labels and you rank them. Works for: writing quality, design, tone.

These are not rigid modes. The manager adapts to whatever evaluation approach makes sense for your problem. If you have a custom scoring function or a specific dataset, describe it and the manager will use it.
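The historical-data mode amounts to a blind backtest: hide the known outcome, run the skill on the remaining fields, and count matches. A sketch (field names hypothetical):

```python
def backtest(run_skill, labeled_examples, label_key="outcome"):
    """Strip the known outcome from each example, run the skill blind,
    and return the fraction of examples it got right."""
    hits = 0
    for example in labeled_examples:
        features = {k: v for k, v in example.items() if k != label_key}
        if run_skill(features) == example[label_key]:
            hits += 1
    return hits / len(labeled_examples)
```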

Improve

After a test round, the manager reports results and proposes what to try next. Winners advance. Losers get deleted.

Let agents do the real work

The best way to evaluate a skill is to have the agent actually do the job, not run it in a sandbox. If an agent is prospecting, connect it to your real CRM. Let it send real outbound. The metrics (open rate, reply rate, booked meetings) flow back naturally because the agent is doing the actual work, not simulating it. This works because each instance is its own computer. If you give a variant instance access to your CRM, your email platform, and your calendar via Environment, it can run the full workflow end to end. The manager can then pull real performance data from those same systems to score how each variant did.
The more of your actual workflow you expose to the instances, the better the evaluation. An agent scoring mock leads in isolation tells you less than an agent scoring real leads that you can check against actual conversions a month later.
The next generation varies the top performers. Common patterns:
  • Clone the best as a control. If nothing in the next generation beats it, the changes did not help.
  • Change one thing per variant. A heuristic, a step, a weight, a prompt structure, a data source.
  • Combine ideas from different winners. The reasoning from one and the scoring from another.
  • Try something completely different to avoid local optima.
  • Remove a technique from the winner to test if it was actually helping.
You can guide this as much or as little as you want. Tell the manager what to change, or ask it to pick. It explains its reasoning either way.
Each generation gets the same test inputs and scoring methodology so results are directly comparable. The manager tracks lineage across generations: which variant descended from which, what changed, and the score delta.
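A lineage record can be as small as one entry per variant noting its parent, what changed, and the score delta (a sketch; the field names are illustrative, not the manager's actual schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VariantRecord:
    name: str
    generation: int
    parent: Optional[str]   # None for a seed or a random newcomer
    change: str             # one-line description of what differed
    score: float

def score_delta(record: VariantRecord, lineage: dict) -> Optional[float]:
    """Score delta vs. the parent, or None if the variant has no parent."""
    if record.parent is None:
        return None
    return round(record.score - lineage[record.parent].score, 2)
```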

Example improvement report

Generation 3 results:

| Variant       | Parent   | Change              | Score  | Delta  |
|---------------|----------|---------------------|--------|--------|
| gen3-v1       | gen2-v3  | Clone (control)     | 78.2%  | -      |
| gen3-v2       | gen2-v3  | Added recency bias  | 81.4%  | +3.2%  |
| gen3-v3       | gen2-v1  | Swapped to CoT      | 76.9%  | -1.3%  |
| gen3-v4       | gen2-v3  | Crossover w/ gen2-v1| 80.1%  | +1.9%  |
| gen3-v5       | -        | Random (heuristic)  | 72.3%  | -5.9%  |

Winner: gen3-v2 (81.4%). Recency bias helped — leads contacted
in the last 7 days convert at 2x the rate. Want another generation?

Convergence

The manager tells you when the skill is plateauing. Typical stopping points: no improvement for several consecutive generations, a target score is reached, or you decide the current version is good enough.
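Those stopping rules are easy to express as a check over the best score of each generation (a sketch; the patience and gain thresholds are illustrative defaults, not the manager's actual values):

```python
def has_converged(best_scores, patience=3, min_gain=0.5, target=None):
    """True when a target score is reached, or when the last `patience`
    generations failed to beat the earlier best by at least `min_gain`."""
    if target is not None and best_scores and max(best_scores) >= target:
        return True
    if len(best_scores) <= patience:
        return False  # not enough history to call a plateau
    baseline = max(best_scores[:-patience])
    return max(best_scores[-patience:]) < baseline + min_gain
```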
At convergence, the manager may test ensemble combinations of the best variants across all generations. The final deployed skill might be a meta skill that runs multiple approaches and aggregates outputs.

Running the loop with less supervision

If you want, the manager can set up an automation that triggers the next generation on a schedule. Each automated generation checks its own convergence criteria and reports back or stops itself when the skill plateaus.

Deploy

When you are ready, tell the manager to deploy. It installs the winning skill (or ensemble) on your production instances, cleans up variant instances, and gives you a summary.
Deployed: lead-scoring gen3-v2
Starting score: 61.3%  →  Final score: 81.4%
Generations: 3
Approach: Weighted feature extraction with 7-day recency bias
Key insight: Recent engagement is 2x more predictive than firmographic fit
Because skills are folders of readable files (SKILL.md, scripts, references), the entire experiment history is inspectable. Every change is a human-readable diff. Every score is tied to a concrete test case.

Downloading a skill

To download a skill from an instance, open the instance page, go to the Server tab, and browse the files. Skills live in .openclaw/skills/{skill-name}/. You can download any file individually: the SKILL.md, scripts, references, or assets. This is useful for backing up a winning skill, porting it to another instance, or inspecting what the improvement loop produced.

Skill design principles

One clear question

“Score this lead” not “manage the pipeline.” Narrow skills compose into workflows. Broad skills plateau.

Measurable output

Accuracy, precision, recall, F1, Brier score, hit rate, reply rate, open rate. If you can put a number on it, the loop can optimize.
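For probabilistic outputs like lead-conversion likelihood, the Brier score is a natural choice: the mean squared error between each predicted probability and the 0/1 outcome, where lower is better and 0.0 is perfect:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities (0-1)
    and binary outcomes (0 or 1). Lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(outcomes)
```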

External ground truth

Skills that validate against data outside the model (filings, source code, public records, outcomes) improve faster than skills graded by another LLM.

Short feedback loops

Daily or weekly resolution means more generations per unit of time. The manager can handle longer timelines, but faster is better.

Related pages

  • Quickstart: Build your first skill in one session
  • Instances: What an instance can do
  • Automations: Run the improvement loop on a schedule
  • Integrations: Connect to external data sources