When the 3am page fires, you don’t want to paste an alert into ChatGPT and get “have you tried checking the logs?” You want an agent that’s been watching the alerts channel already, knows your architecture from files on disk, remembers the last three times this service went down, and can actually parse the stack trace with code execution. That’s what this is. OpenClaw incident response instances are always-on agents that monitor alert channels autonomously, investigate issues with real tools, and build a persistent knowledge base of past incidents on their filesystem. They get better at their job every time something breaks.

Why this actually works

They monitor on their own. The triage agent watches your alerts channel 24/7. When an alert fires, it classifies severity, identifies the affected service, and surfaces relevant context immediately. Nobody has to DM it first. It’s already paying attention.

They remember past incidents. The agent has persistent memory of every incident it’s helped with. When the same service alerts again, it references what happened last time: “payment-api had a similar error pattern on Jan 15 caused by a Stripe webhook handler. Check if the same codepath is involved.” That’s institutional memory that doesn’t leave when engineers change teams.

They have a filesystem of runbooks and history. Your service catalog, architecture docs, runbooks, deployment history, past post-mortems. All of it lives on the instance’s volume. The agent reads them during incidents and writes new incident records back to disk. Over time the filesystem becomes a living knowledge base the agent actively uses.

They have real tools. Web search to check if an upstream dependency has a status page incident. Code execution to parse log formats or decode error payloads. File handling to read runbooks and write post-mortems. The agent investigates. It doesn’t just comment.

The setup

Create one instance for each phase of your incident workflow.

Triage instance connects to your alerts Slack channel and monitors incoming alerts autonomously. When an alert fires, it reads the service catalog from runbooks/service-catalog.md, checks its incident history in incidents/, and responds with severity classification, affected service, owner, relevant runbook, and past incident context. It writes a new incident record to incidents/{date}-{service}.md as soon as it triages.

Investigation instance connects to a private Discord server for your on-call team. It has your architecture docs, database schema, and common failure modes on its filesystem, and it uses code execution to parse stack traces and log snippets. It reads the incident record the triage agent created and helps you reason through root causes with full context of your system architecture from disk.

Post-mortem instance connects to Telegram for async follow-up. It reads your post-mortem template from templates/post-mortem.md and the incident record from the triage agent, produces structured post-mortems, and writes them to post-mortems/. These become part of the institutional knowledge. Next time a similar incident occurs, the triage agent’s incidents/ directory has the full history.
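
For reference, here is a minimal sketch of what one entry in runbooks/service-catalog.md could look like. The exact fields are up to you; the owner, escalation contact, and failure modes below are hypothetical filler, not a required schema.

    ## payment-api
    - Owner: payments team (escalation: @payments-oncall)   <- hypothetical
    - Dependencies: Postgres, Stripe webhooks, Redis          <- hypothetical
    - Runbook: runbooks/payment-api.md
    - Dashboard: <link>
    - Known failure modes: webhook timeouts, connection pool exhaustion   <- hypothetical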

How this compounds

Incident 1. Alert fires in Slack. The triage agent reads runbooks/service-catalog.md, classifies severity, identifies the payment-api owner, links the runbook, and writes incidents/0115-payment-api.md. On Discord, the investigation agent reads your architecture docs from disk, helps parse the stack trace using code execution, and identifies the root cause. The next day, the post-mortem agent reads the incident file and template from disk, produces a post-mortem, and writes it to post-mortems/0115-payment-api.md.

Incident 2 (same service). The alert fires again three weeks later. The triage agent reads its incidents/ directory, finds the January 15 incident, and immediately surfaces: “payment-api last alerted on Jan 15. Root cause was a null reference in the Stripe webhook handler after v2.4.1 deploy. Post-mortem recommended adding input validation. Check if this is a regression.” The on-call engineer starts with context instead of starting from zero. That’s the difference between a 20-minute investigation and a 2-hour one.

Month 3. The triage agent has records of 15 incidents on disk. It has become institutional memory. New on-call engineers get full context from an agent that remembers every past incident, knows every runbook, and surfaces relevant history automatically. Your mean time to resolution drops because no one is starting cold anymore.
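
As a concrete sketch, the incident record from that first incident could look something like this. Only the details already mentioned above (the Jan 15 date, the v2.4.1 deploy, the webhook null reference, the validation follow-up) come from the walkthrough; the severity and fix lines are hypothetical filler.

    # incidents/0115-payment-api.md

    - Date: Jan 15
    - Service: payment-api
    - Severity: SEV-2                                   <- hypothetical
    - Symptom: error spike after v2.4.1 deploy
    - Root cause: null reference in the Stripe webhook handler
    - Fix: patched the webhook handler                  <- hypothetical
    - Follow-up: add input validation to the webhook handler
    - Post-mortem: post-mortems/0115-payment-api.md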

What to configure

Filesystem per instance

  • Triage. runbooks/service-catalog.md (service names, owners, dependencies, escalation contacts), runbooks/{service}.md per service, incidents/ directory the agent writes to
  • Investigation. docs/architecture.md, docs/database-schema.md, docs/failure-modes.md, recent deployment notes
  • Post-mortem. templates/post-mortem.md, post-mortems/ directory for output (becomes historical reference); a combined layout sketch follows this list
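
Taken together, one possible layout for the three volumes. The filenames follow the paths mentioned above; deploy-notes.md is just an example name for the recent deployment notes.

    triage/
      runbooks/
        service-catalog.md
        payment-api.md
      incidents/
        0115-payment-api.md

    investigation/
      docs/
        architecture.md
        database-schema.md
        failure-modes.md
      deploy-notes.md

    post-mortem/
      templates/
        post-mortem.md
      post-mortems/
        0115-payment-api.md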

Skills

  • Web search to check upstream dependency status pages and search for known CVEs or outage reports
  • Code execution to parse log formats, decode error payloads, and analyze stack traces
  • File handling for all instances to read reference docs and write incident records, investigation notes, and post-mortems

Personas

  • Triage. Fast, structured. Reads service catalog and incident history before every response. Writes incident records immediately. Surfaces past context proactively.
  • Investigation. Deep, methodical. Knows your architecture from files on disk. Uses code execution to analyze technical details. Helps reason through root causes step by step.
  • Post-mortem. Blameless, action-oriented. Follows the template (a sketch of one possible template follows this list). Produces consistent, well-formatted documents. Writes to disk so the record persists.
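
A sketch of what templates/post-mortem.md could contain. These section headings are one common post-mortem structure, not a required format.

    # Post-mortem: {date} {service}

    ## Summary
    ## Timeline
    ## Root cause
    ## Impact
    ## What went well / what didn't
    ## Action items (owner, due date)
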
Three instances fit on the Pro plan; if you want per-service or per-team investigation bots, that’s the Max plan.