Code Health & Alert Management

Plug your alerting tools (Datadog, New Relic, Sentry, SigNoz) into the swarm. Real signals kick off fixes or proposals; noise gets filtered. Daily/weekly health audits catch slow rot before it ships.

What it does

Three independent flows:

  1. Reactive — alert webhooks fire, the lead triages signal vs. noise (some alerts are expected — e.g. customer-test failures shouldn't go to Sentry), and routes real bugs to a code-capable agent.
  2. Scheduled audits — daily infra triage, daily workflow-health audit, weekly code-health scans, weekly dependency-upgrade bundling.
  3. Proposal mode — for slow-burn issues, the agent opens a PR with a proposed fix and a writeup, not a silent merge.
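
The reactive and proposal flows above can be sketched as one small triage function. This is a minimal illustration under stated assumptions, not the swarm's actual implementation: the `Alert` shape, the tag names in `EXPECTED_NOISE_TAGS`, and the `slow_burn` flag are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    DROP = "drop"          # expected noise, no task is created
    FIX = "fix"            # real bug, handed to a code-capable agent
    PROPOSAL = "proposal"  # slow-burn issue, PR plus writeup for humans


@dataclass
class Alert:
    source: str                       # e.g. "sentry" or "signoz"
    title: str
    tags: set[str] = field(default_factory=set)
    slow_burn: bool = False           # hypothetical flag set during triage


# Hypothetical noise tags; the real rules live in the lead's playbook.
EXPECTED_NOISE_TAGS = {"customer-test-failure", "block-runner"}


def triage(alert: Alert) -> Route:
    """Ask "is this signal?" first, then pick a flow."""
    if alert.tags & EXPECTED_NOISE_TAGS:
        return Route.DROP
    if alert.slow_burn:
        return Route.PROPOSAL
    return Route.FIX
```

Checking for noise before anything else mirrors the ordering described above: an alert that matches an expected-noise tag never reaches a coder, regardless of its other properties.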

Agents

  • Lead — triages incoming alerts. First question is always "is this signal?" — every alerting tool has expected noise (we route browser/block-runner errors away from Sentry because they're customer-test failures, not bugs).
  • Coder (ours is Picateclas) — implements fixes once triaged.
  • Reviewer — code review on every alert-driven PR (no auto-merge for incident fixes).
  • Researcher — root-cause investigation when triage isn't obvious.

Tools & Skills

Built-in (ships with agent-swarm)

Custom (swarm-managed)

  • signoz-interaction — traces/metrics/logs/alerts from SigNoz.
  • desplega-infra — host-level triage (CPU/mem/disk, container counts, collector uptime).
  • Webhook bridge — small custom integration that turns incoming Datadog / New Relic / Sentry / SigNoz alerts into swarm tasks with the right tags.
  • Alert filter rules — codified in the lead's playbook + per-product config: which alert tags are real vs. expected noise.
  • desloppify, an open-source multi-language codebase health scanner (29 languages, tree-sitter AST analysis, gaming-resistant scoring). We wrap it with a small swarm skill that runs it in a sandboxed sprite.
  • knip — open-source dead-code detector for JavaScript/TypeScript projects. We chain it into the weekly code-health workflow.
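
As a rough sketch of how the webhook bridge and the alert filter rules might fit together, the function below turns an incoming alert into a tagged task and marks whether it is actionable. Everything here is an assumption for illustration: the `SwarmTask` type, the `NOISE_TAGS_BY_PRODUCT` config, the `source:`/`product:` tag scheme, and the simplified `title`/`tags` payload keys, which do not reflect any vendor's real webhook schema.

```python
from dataclasses import dataclass

SUPPORTED_SOURCES = {"datadog", "newrelic", "sentry", "signoz"}

# Hypothetical per-product config: tags treated as expected noise.
NOISE_TAGS_BY_PRODUCT = {
    "web-app": {"customer-test-failure"},
}


@dataclass(frozen=True)
class SwarmTask:
    title: str
    tags: frozenset[str]
    actionable: bool  # False when the alert matches an expected-noise tag


def alert_to_task(source: str, product: str, payload: dict) -> SwarmTask:
    """Normalize an alert payload into a tagged task."""
    if source not in SUPPORTED_SOURCES:
        raise ValueError(f"unsupported alert source: {source}")
    # "title" and "tags" are assumed, simplified payload keys.
    alert_tags = set(payload.get("tags", []))
    noise_tags = NOISE_TAGS_BY_PRODUCT.get(product, set())
    tags = frozenset(alert_tags | {f"source:{source}", f"product:{product}"})
    return SwarmTask(
        title=payload.get("title", "untitled alert"),
        tags=tags,
        actionable=not (alert_tags & noise_tags),
    )
```

Keeping the noise rules in per-product data rather than in code means a new "expected noise" tag is a config change, which fits the idea of codifying the rules alongside the lead's playbook.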

Workflows / Schedules

  • daily-infra-morning-triage — daily. SigNoz-driven checklist (alert fires + resolutions, host peaks, collector uptime, container counts, metric volume). Posts a tight digest. Observation-only.
  • daily-workflow-health-audit — daily. Surfaces hard failures, halted runs, silent empty-output completions, cron-stuck schedules, consecutive-error schedules. One digest.
  • daily-blocker-digest — daily. Verifies every PR/issue reference in the operational runbook is still open; flags merged-but-still-listed items.
  • weekly-dependabot-triage — weekly. Closes out-of-scope dependabot PRs, bundles in-scope upgrades into one unified PR per path.
  • weekly-harness-upgrade-check — weekly. Checks worker-image harness versions vs. upstream, opens ONE bundled PR.
  • weekly-code-health — weekly per repo. Runs knip + desloppify in a sandboxed sprite, evaluates top-5 concerns, drain-loops one PR per concern with an internal reviewer, hands the stack to humans.
  • monthly-infra-cleanup — monthly. Docker disk-cleanup audit. Confirmation-gated — posts findings + recommendation, waits for human approval before any prune. Never auto-prunes.
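
To make the daily-workflow-health-audit checks concrete, here is a minimal sketch that groups schedules by problem type. The `Run` record, its status values, and the streak threshold of 3 are assumptions, and the cron-stuck check is left out because it would need schedule timing data not modeled here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    schedule: str
    status: str       # assumed values: "completed", "failed", "halted"
    output: str = ""


def audit(runs: list[Run], error_streak: int = 3) -> dict[str, list[str]]:
    """Group schedule names by problem type. `runs` must be oldest-first."""
    findings: dict[str, set[str]] = defaultdict(set)
    streaks: dict[str, int] = defaultdict(int)  # trailing failures per schedule
    for run in runs:
        if run.status == "failed":
            findings["hard_failure"].add(run.schedule)
            streaks[run.schedule] += 1
            continue
        streaks[run.schedule] = 0  # any non-failed run breaks the streak
        if run.status == "halted":
            findings["halted"].add(run.schedule)
        elif run.status == "completed" and not run.output.strip():
            # Reported success but produced nothing: easy to miss otherwise.
            findings["silent_empty_output"].add(run.schedule)
    for schedule, count in streaks.items():
        if count >= error_streak:
            findings["consecutive_errors"].add(schedule)
    return {kind: sorted(names) for kind, names in findings.items()}
```

The returned mapping can be rendered directly as the single digest: one heading per problem type, one line per schedule.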

Patterns used

Tips for new swarm users

  • Filter alerts before letting agents act. Codify "these tags are not real bugs" in your lead's playbook first. Acting on every alert is how you generate PR spam.
  • Bundle low-significance fixes (dependabot, harness upgrades) into one weekly PR — fewer reviews, less churn, easier rollback.
  • Track a health score over time, not just one-shot scans. The trendline tells you if you're winning the code-rot fight.
  • Gate destructive ops (disk prunes, branch deletions, force-pushes) behind a Slack approval.
  • Read failure reasons across runs. Recording a failureReason on every failed task lets you cluster (agent, error) pairs and distinguish provider-health issues, routing issues, and genuine bugs.
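
The last tip can be sketched in a few lines. The example assumes each task is a dict with `agent`, `status`, and `failureReason` keys; only `failureReason` comes from the text above, the other two names are hypothetical.

```python
from collections import Counter


def cluster_failures(tasks: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count (agent, failureReason) pairs across failed tasks, most common first."""
    pairs = Counter(
        (task["agent"], task["failureReason"])
        for task in tasks
        if task.get("status") == "failed" and task.get("failureReason")
    )
    return pairs.most_common()
```

One plausible way to read the result: the same reason recurring across many agents hints at a provider-health problem, many different reasons concentrated on one agent hints at a routing problem, and an isolated pair is more likely a genuine bug.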
