Code Health & Alert Management
Plug your alerting tools (Datadog, New Relic, Sentry, SigNoz) into the swarm. Real signals kick off fixes or proposals; noise gets filtered. Daily/weekly health audits catch slow rot before it ships.
What it does
Three independent flows:
- Reactive — alert webhooks fire, the lead triages signal vs. noise (some alerts are expected — e.g. customer-test failures shouldn't go to Sentry), and routes real bugs to a code-capable agent.
- Scheduled audits — daily infra triage, daily workflow-health audit, weekly code-health scans, weekly dependency-upgrade bundling.
- Proposal mode — for slow-burn issues, the agent opens a PR with a proposed fix and a writeup, not a silent merge.
Agents
- Lead — triages incoming alerts. First question is always "is this signal?" — every alerting tool has expected noise (we route browser/block-runner errors away from Sentry because they're customer-test failures, not bugs).
- Coder (ours is Picateclas) — implements fixes once triaged.
- Reviewer — code review on every alert-driven PR (no auto-merge for incident fixes).
- Researcher — root-cause investigation when triage isn't obvious.
Tools & Skills
Built-in (ships with agent-swarm)
- investigate-sentry-issue — Sentry triage skill. Source: plugin/pi-skills/investigate-sentry-issue.
- slack-post (incident comms), Linear sync (ticketing). A Linear ticket created from an alert is auto-picked up by the swarm.
Custom (swarm-managed)
- signoz-interaction — traces/metrics/logs/alerts from SigNoz.
- desplega-infra — host-level triage (CPU/mem/disk, container counts, collector uptime).
- Webhook bridge — small custom integration that turns incoming Datadog / New Relic / Sentry / SigNoz alerts into swarm tasks with the right tags.
- Alert filter rules — codified in the lead's playbook + per-product config: which alert tags are real vs. expected noise.
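A minimal sketch of how the webhook bridge and the alert filter rules could fit together, assuming alerts arrive already normalized into a source, a title, and a set of tags. The `Alert` type, the `NOISE_TAGS` table, the tag names, and the returned dict shape are all hypothetical; they are not the swarm's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical per-source noise rules: tags that mark an alert as expected noise.
NOISE_TAGS = {
    "sentry": {"customer-test", "block-runner"},
    "signoz": {"synthetic-check"},
}


@dataclass
class Alert:
    source: str                      # e.g. "sentry", "datadog"
    title: str
    tags: set[str] = field(default_factory=set)


def triage(alert: Alert) -> dict:
    """Drop expected noise; otherwise describe a task tagged for routing."""
    noise = NOISE_TAGS.get(alert.source, set()) & alert.tags
    if noise:
        return {"action": "drop", "reason": f"expected noise: {sorted(noise)}"}
    return {
        "action": "create_task",
        "title": f"[{alert.source}] {alert.title}",
        "tags": sorted(alert.tags | {"alert", alert.source}),
    }
```

Keeping the rules in a plain data table rather than in code branches makes it easier to review "which tags are noise" per product without touching the routing logic.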
Third-party providers (popular tools we use)
- desloppify — open-source multi-language codebase health scanner (29 languages, tree-sitter AST analysis, gaming-resistant scoring). We wrap it with a small swarm skill that runs it in a sandboxed sprite.
- knip — open-source dead-code detector for JavaScript/TypeScript projects. We chain it into the weekly code-health workflow.
Workflows / Schedules
- daily-infra-morning-triage — daily. SigNoz-driven checklist (alert fires + resolutions, host peaks, collector uptime, container counts, metric volume). Posts a tight digest. Observation-only.
- daily-workflow-health-audit — daily. Surfaces hard failures, halted runs, silent empty-output completions, cron-stuck schedules, consecutive-error schedules. One digest.
- daily-blocker-digest — daily. Verifies every PR/issue reference in the operational runbook is still open; flags merged-but-still-listed items.
- weekly-dependabot-triage — weekly. Closes out-of-scope dependabot PRs, bundles in-scope upgrades into one unified PR per path.
- weekly-harness-upgrade-check — weekly. Checks worker-image harness versions vs. upstream, opens ONE bundled PR.
- weekly-code-health — weekly per repo. Runs knip + desloppify in a sandboxed sprite, evaluates top-5 concerns, drain-loops one PR per concern with an internal reviewer, hands the stack to humans.
- monthly-infra-cleanup — monthly. Docker disk-cleanup audit. Confirmation-gated — posts findings + recommendation, waits for human approval before any prune. Never auto-prunes.
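The bundling step in weekly-dependabot-triage can be pictured roughly as follows. This is only an illustration: the PR dict shape (`number`, `path`) and the `in_scope_paths` set are assumptions, and the real workflow talks to the Git host rather than working on plain dicts.

```python
from collections import defaultdict


def bundle_by_path(prs, in_scope_paths):
    """Group in-scope PR numbers by path; collect out-of-scope ones to close."""
    bundles = defaultdict(list)
    to_close = []
    for pr in prs:
        if pr["path"] in in_scope_paths:
            bundles[pr["path"]].append(pr["number"])
        else:
            to_close.append(pr["number"])
    return dict(bundles), to_close
```

Each key in the returned bundles would become one unified PR, and every number in the second list would be closed as out of scope.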
Patterns used
- Drain Loops — weekly code-health turns top-N concerns into one PR each.
- HITL Gates — destructive ops (disk prunes) wait for a human approval.
- No-op When Nothing Changed — audits that skip silently on a quiet day.
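The three patterns above can be sketched in a few lines, under stated assumptions: `fix` and `review` stand in for the coder and the internal reviewer, concerns carry a hypothetical numeric `score`, and `approved_by_human` stands in for whatever approval signal the gate waits on. None of these names come from agent-swarm itself.

```python
def drain(concerns, fix, review, top_n=5):
    """Drain loop: one PR per top-N concern, keeping only reviewed PRs."""
    queue = sorted(concerns, key=lambda c: c["score"], reverse=True)[:top_n]
    if not queue:
        return None  # no-op when nothing changed: stay silent on a quiet run
    approved = []
    while queue:
        concern = queue.pop(0)
        pr = fix(concern)          # one PR per concern
        if review(pr):             # internal reviewer before humans see it
            approved.append(pr)
    return approved


def gated(action, approved_by_human):
    """HITL gate: run a destructive action only after explicit approval."""
    if not approved_by_human:
        return "pending-approval"
    return action()
```

Returning `None` rather than an empty list lets the caller tell "nothing to do, post nothing" apart from "ran, but no PR passed review".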
Tips for new swarm users
- Filter alerts before letting agents act. Codify "these tags are not real bugs" in your lead's playbook first. Acting on every alert is how you generate PR spam.
- Bundle low-significance fixes (dependabot, harness upgrades) into one weekly PR — fewer reviews, less churn, easier rollback.
- Track a health score over time, not just one-shot scans. The trendline tells you if you're winning the code-rot fight.
- Confirmation-gate destructive ops (disk prunes, branch deletions, force-pushes) behind a Slack approval.
- Read failure reasons across runs. A failureReason on every failed task lets you cluster (agent, error) pairs and distinguish provider-health issues from routing issues and from genuine bugs.
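As a rough illustration of the last tip, clustering (agent, error) pairs needs little more than a counter. The task record shape here is an assumption: only `failureReason` is named in the text above, while the `agent` and `status` keys are hypothetical.

```python
from collections import Counter


def cluster_failures(tasks):
    """Count failed tasks by (agent, failureReason), most frequent first."""
    pairs = Counter(
        (t["agent"], t["failureReason"])
        for t in tasks
        if t.get("status") == "failed" and t.get("failureReason")
    )
    return pairs.most_common()
```

One reason repeated across many agents tends to point at a provider or infrastructure problem, while a reason concentrated on a single agent more likely indicates a routing issue or a bug in that agent's work.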
Proactive Customer Support
Standing agents per top account — persistent working directories, scheduled value-showcase reports, automatic post-meeting ingestion, and "what changed since last touchpoint?" briefings on demand.
Reports from Multiple Sources
Integrate your data warehouse, product analytics, billing, search analytics, and observability into one swarm — then ask it the questions your team would have asked a BI tool. Charts render as auto-hosted Pages.