Why this actually works
They monitor on their own. The triage agent watches your alerts channel 24/7. When an alert fires, it classifies severity, identifies the affected service, and surfaces relevant context immediately. Nobody has to DM it first. It’s already paying attention.

They remember past incidents. The agent has persistent memory of every incident it’s helped with. When the same service alerts again, it references what happened last time: “payment-api had a similar error pattern on Jan 15 caused by a Stripe webhook handler. Check if the same codepath is involved.” That’s institutional memory that doesn’t leave when engineers change teams.

They have a filesystem of runbooks and history. Your service catalog, architecture docs, runbooks, deployment history, past post-mortems. All of it lives on the instance’s volume. The agent reads them during incidents and writes new incident records back to disk. Over time the filesystem becomes a living knowledge base the agent actively uses.

They have real tools. Web search to check if an upstream dependency has a status page incident. Code execution to parse log formats or decode error payloads. File handling to read runbooks and write post-mortems. The agent investigates. It doesn’t just comment.

The setup
Create instances for each phase of your incident workflow: Triage instance connects to your alerts Slack channel. Monitors incoming alerts autonomously. When an alert fires, it reads the service catalog from runbooks/service-catalog.md, checks its incident history in incidents/, and responds with severity classification, affected service, owner, relevant runbook, and past incident context. Writes a new incident record to incidents/{date}-{service}.md as soon as it triages.
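A minimal sketch of that record-writing step, assuming the incidents/{date}-{service}.md convention above (the function name and fields here are illustrative, not a fixed schema):

```python
from datetime import date
from pathlib import Path

# Illustrative sketch: persist a triage result as a dated markdown record.
def write_incident_record(service: str, severity: str, summary: str,
                          runbook: str, past_context: str) -> Path:
    path = Path("incidents") / f"{date.today():%m%d}-{service}.md"
    path.parent.mkdir(exist_ok=True)
    path.write_text(
        f"# Incident: {service} ({date.today():%Y-%m-%d})\n\n"
        f"- Severity: {severity}\n"
        f"- Runbook: {runbook}\n"
        f"- Past context: {past_context}\n\n"
        f"## Summary\n{summary}\n"
    )
    return path
```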
Investigation instance connects to a private Discord server for your on-call team. Has your architecture docs, database schema, and common failure modes on its filesystem. Uses code execution to parse stack traces and log snippets. Reads the incident record the triage agent created. Helps you reason through root causes with full context of your system architecture from disk.
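To make “parse stack traces” concrete, here is a small sketch of what the agent’s code execution might run against a Python traceback (the traceback format is just an example; adapt it to whatever your services emit):

```python
import re

def innermost_frame(traceback_text: str) -> dict | None:
    """Return file, line, and function of the deepest frame in a Python traceback."""
    frames = re.findall(r'File "(.+?)", line (\d+), in (\w+)', traceback_text)
    if not frames:
        return None
    file, line, func = frames[-1]  # deepest frame is listed last
    return {"file": file, "line": int(line), "function": func}

trace = '''Traceback (most recent call last):
  File "app/server.py", line 42, in handle
    process(payload)
  File "app/webhooks/stripe.py", line 17, in process
    charge = event["data"]["object"]["id"]
TypeError: 'NoneType' object is not subscriptable
'''
print(innermost_frame(trace))
# {'file': 'app/webhooks/stripe.py', 'line': 17, 'function': 'process'}
```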
Post-mortem instance connects to Telegram for async follow-up. Reads your post-mortem template from templates/post-mortem.md and the incident record from the triage agent. Produces structured post-mortems and writes them to post-mortems/. These become part of the institutional knowledge. Next time a similar incident occurs, the triage agent’s incidents/ directory has the full history.
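A sketch of the template-fill step, assuming templates/post-mortem.md uses simple {placeholder} fields; the field names in the usage comment are hypothetical:

```python
from pathlib import Path

# Illustrative sketch: fill the post-mortem template and persist the result.
def draft_post_mortem(incident_file: str, **fields) -> Path:
    template = Path("templates/post-mortem.md").read_text()
    out = Path("post-mortems") / Path(incident_file).name
    out.parent.mkdir(exist_ok=True)
    out.write_text(template.format(**fields))
    return out

# e.g. draft_post_mortem("incidents/0115-payment-api.md",
#                        root_cause="null reference in Stripe webhook handler",
#                        action_items="add input validation to the webhook codepath")
```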
How this compounds
Incident 1. Alert fires in Slack. The triage agent reads runbooks/service-catalog.md, classifies severity, identifies the payment-api owner, links the runbook. Writes incidents/0115-payment-api.md. On Discord, the investigation agent reads your architecture docs from disk, helps parse the stack trace using code execution, and identifies the root cause. Next day, the post-mortem agent reads the incident file and template from disk, produces a post-mortem, writes it to post-mortems/0115-payment-api.md.
Incident 2 (same service). Alert fires again three weeks later. The triage agent reads its incidents/ directory, finds the January 15 incident, and immediately surfaces: “payment-api last alerted on Jan 15. Root cause was a null reference in the Stripe webhook handler after v2.4.1 deploy. Post-mortem recommended adding input validation. Check if this is a regression.” The on-call engineer starts with context instead of starting from zero. That’s the difference between a 20-minute investigation and a 2-hour one.
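Mechanically, “reads its incidents/ directory” can be as simple as a filename scan. A sketch, assuming the {date}-{service}.md naming convention from the setup:

```python
from pathlib import Path

def past_incidents(service: str, limit: int = 3) -> list[str]:
    """Return the most recent incident records for a service, newest first."""
    records = sorted(Path("incidents").glob(f"*-{service}.md"), reverse=True)
    return [p.read_text() for p in records[:limit]]

# On a repeat payment-api alert, feed these into the triage prompt:
history = past_incidents("payment-api")
```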
Month 3. The triage agent has records of 15 incidents on disk. It has become institutional memory. New on-call engineers get full context from an agent that remembers every past incident, knows every runbook, and surfaces relevant history automatically. Your mean time to resolution drops because no one is starting cold anymore.
What to configure
Filesystem per instance
- Triage. runbooks/service-catalog.md (service names, owners, dependencies, escalation contacts), runbooks/{service}.md per service, incidents/ directory the agent writes to (a layout sketch follows this list)
- Investigation. docs/architecture.md, docs/database-schema.md, docs/failure-modes.md, recent deployment notes
- Post-mortem. templates/post-mortem.md, post-mortems/ directory for output (becomes historical reference)
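A bootstrap sketch for the three volumes, if you want to see the layout side by side (file names match the list; contents are yours to fill in):

```python
from pathlib import Path

# Illustrative layout: one root per instance, matching the list above.
LAYOUT = {
    "triage": ["runbooks/service-catalog.md", "incidents/"],
    "investigation": ["docs/architecture.md", "docs/database-schema.md",
                      "docs/failure-modes.md"],
    "post-mortem": ["templates/post-mortem.md", "post-mortems/"],
}

for instance, entries in LAYOUT.items():
    root = Path(instance)
    for entry in entries:
        target = root / entry
        if entry.endswith("/"):
            target.mkdir(parents=True, exist_ok=True)  # output directory
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch()  # empty placeholder doc to fill in
```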
Skills
- Web search to check upstream dependency status pages and search for known CVEs or outage reports
- Code execution to parse log formats, decode error payloads, and analyze stack traces (a decoding sketch follows this list)
- File handling for all instances to read reference docs and write incident records, investigation notes, and post-mortems
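As one concrete instance of the code-execution skill, a sketch that decodes a base64-wrapped JSON error payload, a common shape for webhook and queue errors (your payloads may differ):

```python
import base64
import json

def decode_error_payload(blob: str) -> dict:
    """Decode a base64-wrapped JSON error payload into a dict."""
    return json.loads(base64.b64decode(blob))

# Round-trip example with a made-up payload:
blob = base64.b64encode(json.dumps(
    {"error": "TypeError", "service": "payment-api", "deploy": "v2.4.1"}
).encode()).decode()
print(decode_error_payload(blob))
# {'error': 'TypeError', 'service': 'payment-api', 'deploy': 'v2.4.1'}
```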
Personas
- Triage. Fast, structured. Reads service catalog and incident history before every response. Writes incident records immediately. Surfaces past context proactively (a prompt sketch follows this list).
- Investigation. Deep, methodical. Knows your architecture from files on disk. Uses code execution to analyze technical details. Helps reason through root causes step by step.
- Post-mortem. Blameless, action-oriented. Follows the template. Produces consistent, well-formatted documents. Writes to disk so the record persists.
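Personas ultimately land as system prompts. A sketch of how the triage persona might read, with wording that is illustrative rather than prescriptive:

```python
# Illustrative system prompt for the triage instance.
TRIAGE_PERSONA = """\
You are the triage agent for the on-call team. On every alert:
1. Read runbooks/service-catalog.md to identify the service, owner,
   and escalation contact.
2. Search incidents/ for past records of the same service and quote
   anything relevant.
3. Reply with: severity, affected service, owner, runbook link,
   and past incident context.
4. Immediately write a new record to incidents/{date}-{service}.md.
Be fast and structured; no speculation beyond the files on disk.
"""
```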