Plug your alerting tools (Datadog, New Relic, Sentry, SigNoz) into the swarm. Real signals kick off fixes or proposals; noise gets filtered. Daily/weekly health audits catch slow rot before it ships.

What it does

Three independent flows:

Reactive — alert webhooks fire, the lead triages signal vs. noise (some alerts are expected — e.g. customer-test failures shouldn't go to Sentry), and routes real bugs to a code-capable agent.
Scheduled audits — daily infra triage, daily workflow-health audit, weekly code-health scans, weekly dependency-upgrade bundling.
Proposal mode — for slow-burn issues, the agent opens a PR with a proposed fix and a writeup, not a silent merge.

Agents

Lead — triages incoming alerts. First question is always "is this signal?" — every alerting tool has expected noise (we route browser/block-runner errors away from Sentry because they're customer-test failures, not bugs).
Coder — implements fixes once triaged.
Reviewer — code review on every alert-driven PR (no auto-merge for incident fixes).
Researcher — root-cause investigation when triage isn't obvious.
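
A minimal sketch of how the hand-offs above could chain together, assuming the lead's triage boils down to two answers. `TriageResult`, `plan_agents`, and the lowercase agent labels are hypothetical names for illustration, not agent-swarm APIs.

```python
from dataclasses import dataclass


@dataclass
class TriageResult:
    is_signal: bool         # the lead's answer to "is this signal?"
    root_cause_known: bool  # False when triage isn't obvious


def plan_agents(result: TriageResult) -> list[str]:
    """Return the ordered list of agents that should touch this alert."""
    if not result.is_signal:
        return []  # expected noise: nobody acts on it
    steps: list[str] = []
    if not result.root_cause_known:
        steps.append("researcher")  # investigate before anyone writes code
    steps += ["coder", "reviewer"]  # every alert-driven PR gets a review
    return steps
```

The reviewer always comes last, which mirrors the no-auto-merge rule for incident fixes.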

Tools & Skills

Built-in (ships with agent-swarm)

investigate-sentry-issue — Sentry triage skill. Source: plugin/pi-skills/investigate-sentry-issue.
slack-post (incident comms), Linear sync (ticketing). A Linear ticket created from an alert is auto-picked up by the swarm.

Custom (swarm-managed)

signoz-interaction — traces/metrics/logs/alerts from SigNoz.
Host-infrastructure triage skill — host-level triage (CPU/mem/disk, container counts, collector uptime).
Webhook bridge — small custom integration that turns incoming Datadog / New Relic / Sentry / SigNoz alerts into swarm tasks with the right tags.
Alert filter rules — codified in the lead's playbook + per-product config: which alert tags are real vs. expected noise.
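
As a rough illustration, the bridge and the filter rules can be thought of as one function: take an alert, check it against per-product noise tags, and either drop it or emit a tagged task. Everything named here (`FILTER_RULES`, `to_task`, the `title` and `tags` keys, the tag formats) is an assumption for the sketch. Real Datadog, New Relic, Sentry, and SigNoz payloads each have their own shape and would need a per-provider adapter in front of this step.

```python
from typing import Optional

# Hypothetical per-product config: tags treated as expected noise.
FILTER_RULES = {
    "default": {"noise_tags": set()},
    "block-runner": {"noise_tags": {"customer-test-failure"}},
}


def to_task(source: str, product: str, payload: dict) -> Optional[dict]:
    """Turn an already-normalized alert payload into a swarm task, or None if it's noise."""
    rules = FILTER_RULES.get(product, FILTER_RULES["default"])
    alert_tags = set(payload.get("tags", []))
    if alert_tags & rules["noise_tags"]:
        return None  # expected noise: no task is created
    return {
        "title": f"[{source}] {payload.get('title', 'untitled alert')}",
        # Tag with origin so the lead can route and audit later.
        "tags": sorted(alert_tags | {"alert", f"source:{source}", f"product:{product}"}),
    }
```

Keeping the rules in data rather than code means a new noise tag is a config change, not a deploy.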

Third-party providers (popular tools we use)

desloppify — open-source multi-language codebase health scanner (29 languages, tree-sitter AST analysis, gaming-resistant scoring). We wrap it in a small swarm skill that runs it in a sandboxed sprite.
knip — open-source dead-code detector for JavaScript/TypeScript projects. We chain it into the weekly code-health workflow.

Workflows / Schedules

daily-infra-morning-triage — daily. SigNoz-driven checklist (alert fires + resolutions, host peaks, collector uptime, container counts, metric volume). Posts a tight digest. Observation-only.
daily-workflow-health-audit — daily. Surfaces hard failures, halted runs, silent empty-output completions, cron-stuck schedules, consecutive-error schedules. One digest.
daily-blocker-digest — daily. Verifies every PR/issue reference in the operational runbook is still open; flags merged-but-still-listed items.
weekly-dependabot-triage — weekly. Closes out-of-scope dependabot PRs, bundles in-scope upgrades into one unified PR per path.
weekly-harness-upgrade-check — weekly. Checks worker-image harness versions vs. upstream, opens ONE bundled PR.
weekly-code-health — weekly per repo. Runs knip + desloppify in a sandboxed sprite, evaluates top-5 concerns, drain-loops one PR per concern with an internal reviewer, hands the stack to humans.
monthly-infra-cleanup — monthly. Docker disk-cleanup audit. Confirmation-gated — posts findings + recommendation, waits for human approval before any prune. Never auto-prunes.
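
To make the weekly-dependabot-triage step concrete, here is a small sketch of the close-or-bundle split. It assumes scope is decided by the PR's path and that each PR is a dict with `number` and `path` keys. Both are illustrative assumptions, as is the function name.

```python
from collections import defaultdict


def bundle_dependabot(prs: list[dict], in_scope_paths: set[str]):
    """Split dependabot PRs into ones to close and per-path bundles."""
    to_close: list[int] = []
    bundles: dict[str, list[int]] = defaultdict(list)
    for pr in prs:
        if pr["path"] in in_scope_paths:
            bundles[pr["path"]].append(pr["number"])  # one unified PR per path
        else:
            to_close.append(pr["number"])  # out of scope: close it
    return to_close, dict(bundles)
```

Each key in the returned bundle map would become one unified upgrade PR.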

Patterns used

Drain Loops — weekly code-health turns top-N concerns into one PR each.
HITL Gates — destructive ops (disk prunes) wait for a human approval.
No-op When Nothing Changed — audits that skip silently on a quiet day.
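
The three patterns are small enough to sketch in a few lines each. This is one possible shape, not the swarm's implementation: the function names and callbacks are hypothetical, and dropping reviewer-rejected PRs from the stack is an assumption.

```python
from typing import Callable, Optional


def drain(concerns: list[str], open_pr: Callable[[str], str],
          review: Callable[[str], bool], top_n: int = 5) -> list[str]:
    """Drain loop: one PR per concern, each checked by a reviewer callback."""
    queue = list(concerns[:top_n])
    stack: list[str] = []
    while queue:  # keep going until the queue is empty
        pr = open_pr(queue.pop(0))
        if review(pr):  # assumption: rejected PRs are left out of the stack
            stack.append(pr)
    return stack


def digest(findings: list[str]) -> Optional[str]:
    """No-op when nothing changed: return None so the caller posts nothing."""
    if not findings:
        return None
    return "\n".join(f"- {f}" for f in findings)


def gated(action: Callable[[], str], approved: bool) -> str:
    """HITL gate: the destructive action runs only after explicit approval."""
    if not approved:
        return "waiting for approval"
    return action()
```

For example, `gated(prune_docker, approved=False)` never calls `prune_docker` (a hypothetical prune function); it only reports that it is waiting.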

Tips for new swarm users

Filter alerts before letting agents act. Codify "these tags are not real bugs" in your lead's playbook first. Acting on every alert is how you generate PR spam.
Bundle low-significance fixes (dependabot, harness upgrades) into one weekly PR — fewer reviews, less churn, easier rollback.
Track a health score over time, not just one-shot scans. The trendline tells you if you're winning the code-rot fight.
Confirmation-gate destructive ops (disk prunes, branch deletions, force-pushes) behind a Slack approval.
Read failure reasons across runs. A failureReason on every failed task lets you cluster (agent, error) pairs and tell provider-health issues from routing issues from genuine bugs.
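
Clustering failure reasons can be as simple as counting pairs. A minimal sketch, assuming each task record is a dict with `agent`, `status`, and `failureReason` keys (only `failureReason` comes from the text above; the other field names are guesses):

```python
from collections import Counter


def cluster_failures(tasks: list[dict]) -> Counter:
    """Count (agent, failureReason) pairs across failed tasks."""
    return Counter(
        (t["agent"], t["failureReason"])
        for t in tasks
        if t.get("status") == "failed" and t.get("failureReason")
    )
```

One reason showing up across many agents hints at a provider-health problem, while one agent collecting many different reasons points more toward routing.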

Observability Alert Management