Evals Harness
Run scenario × harness-config evaluation matrices against real Agent Swarm stacks in E2B, with deterministic checks, judge models, and persisted artifacts.
The evals/ sub-project is Agent Swarm's evaluation harness. It runs a scenario × harness-config matrix against fresh swarm stacks in E2B sandboxes, captures the resulting transcripts and artifacts, and grades each attempt with deterministic checks plus optional LLM or agentic judges.
Use it when you need a repeatable answer to questions like:
- Which harness/provider/model combination actually passes this workflow?
- Did a runtime change improve quality, cost, or speed over the last version?
- Can a multi-worker scenario still complete after a change to memory, routing, or provider boot logic?
What It Runs
Each matrix cell is one scenario paired with one harness config:
- Scenario defines the tasks, optional seeding, checks, and judging rubric.
- Harness config defines the worker provider/model setup for that attempt.
- Best@n retries are supported per cell, so you can compare pass rates instead of relying on one lucky run.
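As a rough illustration of the matrix idea, the sketch below crosses scenario names with config names and summarizes repeated attempts for one cell. The type names, the `expandMatrix` and `summarize` helpers, the `other-config` name, and the reading of best@n as "at least one of n attempts passed" are all assumptions made for this example, not the harness's actual API.

```typescript
// Hypothetical shapes; the real harness types may differ.
type Cell = { scenario: string; config: string };

// Cross every scenario with every harness config to get the matrix cells.
function expandMatrix(scenarios: string[], configs: string[]): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of scenarios) {
    for (const config of configs) {
      cells.push({ scenario, config });
    }
  }
  return cells;
}

// Summarize n attempts of one cell. bestOfN is assumed here to mean
// "at least one attempt passed"; passRate is passes divided by attempts.
function summarize(results: boolean[]): { passRate: number; bestOfN: boolean } {
  const passes = results.filter((passed) => passed).length;
  return {
    passRate: results.length === 0 ? 0 : passes / results.length,
    bestOfN: passes > 0,
  };
}

// "other-config" is a placeholder name, not a config shipped with the repo.
const cells = expandMatrix(["memory-seeded-recall"], ["claude-haiku", "other-config"]);
const summary = summarize([true, false, true]);
console.log(cells.length, summary.passRate.toFixed(2), summary.bestOfN);
```

Reporting the pass rate alongside the best@n flag keeps a cell that passes once in five tries from looking identical to one that passes every time.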
The harness boots a fresh stack for every attempt:
- Start one API sandbox and one or more worker sandboxes in E2B.
- Optionally seed SQL fixtures, memories, or workspace files.
- Create the scenario's tasks and wait for terminal outcomes.
- Grade the result with deterministic checks and optional judges.
- Persist artifacts, transcripts, task records, costs, and sandbox metadata.
- Tear the sandboxes down, even on failure.
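The lifecycle above, and in particular the guarantee that teardown happens even on failure, can be sketched with a `try`/`finally` block. Everything here is hypothetical: `startSandbox`, `killSandbox`, and `runAttempt` are in-memory stubs standing in for real E2B and swarm API calls, and the real harness may structure this differently.

```typescript
// All names are hypothetical; the stubs only record what would happen.
type Sandbox = { id: string; alive: boolean };
type AttemptResult = { passed: boolean };

const log: string[] = [];

async function startSandbox(role: string): Promise<Sandbox> {
  log.push(`start:${role}`);
  return { id: role, alive: true };
}

async function killSandbox(sandbox: Sandbox): Promise<void> {
  sandbox.alive = false;
  log.push(`kill:${sandbox.id}`);
}

async function runAttempt(
  workerCount: number,
  body: (sandboxes: Sandbox[]) => Promise<AttemptResult>,
): Promise<AttemptResult> {
  const sandboxes: Sandbox[] = [];
  try {
    sandboxes.push(await startSandbox("api"));
    for (let i = 0; i < workerCount; i++) {
      sandboxes.push(await startSandbox(`worker-${i}`));
    }
    // Seeding, task creation, grading, and persistence would run inside body.
    return await body(sandboxes);
  } finally {
    // Runs on success and on failure, so no sandbox outlives its attempt.
    await Promise.allSettled(sandboxes.map((sandbox) => killSandbox(sandbox)));
  }
}

// Simulate an attempt whose body fails; teardown should still be logged.
async function demo(): Promise<string[]> {
  await runAttempt(1, async () => {
    throw new Error("task timed out");
  }).catch(() => undefined);
  return log;
}

demo().then((entries) => console.log(entries.join(", ")));
```

Collecting sandboxes in an array before entering the risky steps means a failure partway through boot still tears down whatever was already started.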
What It Stores
The harness is designed for post-run inspection, not just a pass/fail bit. Each attempt can persist:
- Flattened transcript output
- Raw swarm session-log events
- Harness session files from the worker filesystem
- Task records and dependency outcomes
- Seed command outputs
- Session cost and token snapshots
- Roster snapshots for multi-worker runs
- Worker/API log tails
- Captured API and worker version metadata
Results are stored in a Turso-backed libsql replica, so local runs can be resumed and compared over time.
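To make the list of persisted artifacts more concrete, here is one possible record shape. The field names and nesting are invented for illustration and do not reflect the harness's real storage schema; the `persistedKinds` helper is likewise hypothetical.

```typescript
// Hypothetical record mirroring the artifact list above; not the real schema.
type AttemptArtifacts = {
  transcript?: string;
  sessionLogEvents?: unknown[];
  harnessSessionFiles?: Record<string, string>;
  taskRecords?: unknown[];
  seedOutputs?: string[];
  costSnapshot?: { usd: number; inputTokens: number; outputTokens: number };
  rosterSnapshot?: unknown[];
  logTails?: { api?: string; workers?: Record<string, string> };
  versions?: { api?: string; worker?: string };
};

// List which artifact kinds an attempt actually persisted, sorted by name.
function persistedKinds(artifacts: AttemptArtifacts): string[] {
  const kinds: string[] = [];
  for (const key of Object.keys(artifacts)) {
    if ((artifacts as Record<string, unknown>)[key] !== undefined) {
      kinds.push(key);
    }
  }
  return kinds.sort();
}

const kinds = persistedKinds({
  transcript: "worker: recalled the seeded fact",
  costSnapshot: { usd: 0.01, inputTokens: 1200, outputTokens: 300 }, // made-up numbers
});
console.log(kinds);
```

Keeping every field optional reflects the wording "each attempt can persist": a single-worker run, for example, has no roster snapshot to store.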
Quick Start
cd evals
bun install
bun src/cli.ts registry
bun src/cli.ts run --scenarios memory-seeded-recall --configs claude-haiku
bun src/cli.ts serve
The designated smoke scenario is memory-seeded-recall. It is the cheapest meaningful end-to-end check because it proves real server-side memory embedding and retrieval without paying for judge-model work.
memory-seeded-recall requires EMBEDDING_API_KEY in evals/.env. The old OPENAI_API_KEY fallback is intentionally no longer injected for seeded-memory runs.
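A preflight check along these lines can catch the missing key before any sandbox is started. The `checkSeededMemoryEnv` function is a hypothetical helper written for this example; the harness's own validation, if any, may look different.

```typescript
// Hypothetical preflight; only the two variable names come from the docs above.
function checkSeededMemoryEnv(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  if (!env.EMBEDDING_API_KEY) {
    problems.push("EMBEDDING_API_KEY is missing from evals/.env");
    if (env.OPENAI_API_KEY) {
      // The old fallback is no longer injected, so point that out explicitly.
      problems.push("OPENAI_API_KEY is set but is no longer used as a fallback");
    }
  }
  return problems;
}

const missing = checkSeededMemoryEnv({ OPENAI_API_KEY: "placeholder" });
const ok = checkSeededMemoryEnv({ EMBEDDING_API_KEY: "placeholder" });
console.log(missing, ok);
```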
Core Concepts
Scenarios
Scenarios live under evals/scenarios/. They define:
- The worker roster (workers or lead + workers)
- Optional seeding (sqlDump, memories, exec)
- One or more tasks, including native dependsOn relationships
- Outcome checks and pass thresholds
- Optional LLM or agentic-judge rubric
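As a sketch of how these pieces might fit together, the type below reuses the field names mentioned in the list (workers, lead, sqlDump, memories, exec, dependsOn). The nesting, the remaining field names, the file path, and the `danglingDependencies` helper are assumptions made for illustration; consult evals/README.md for the real scenario format.

```typescript
// Illustrative only: the structure here is assumed, not the harness's schema.
type Task = { id: string; prompt: string; dependsOn?: string[] };
type Scenario = {
  name: string;
  workers: number;
  lead?: boolean;
  seed?: { sqlDump?: string; memories?: string[]; exec?: string[] };
  tasks: Task[];
  passThreshold: number;
  rubric?: string;
};

// Report every dependsOn entry that names a task missing from the scenario.
function danglingDependencies(scenario: Scenario): string[] {
  const ids = new Set(scenario.tasks.map((task) => task.id));
  const dangling: string[] = [];
  for (const task of scenario.tasks) {
    for (const dep of task.dependsOn ?? []) {
      if (!ids.has(dep)) dangling.push(`${task.id} -> ${dep}`);
    }
  }
  return dangling;
}

const exampleScenario: Scenario = {
  name: "example-handoff", // hypothetical scenario name
  workers: 2,
  lead: true,
  seed: { memories: ["The deploy window is Tuesday."] },
  tasks: [
    { id: "draft", prompt: "Draft the rollout note." },
    { id: "review", prompt: "Review the draft.", dependsOn: ["draft"] },
    { id: "publish", prompt: "Publish the note.", dependsOn: ["approve"] },
  ],
  passThreshold: 1,
};

console.log(danglingDependencies(exampleScenario));
```

A check like this is cheap to run before booting any sandbox, since a task that depends on a nonexistent task can never reach a terminal outcome.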
Harness Configs
Harness configs live under evals/configs/. They map a scenario onto a concrete provider/model environment such as Claude, pi, Codex, or opencode. Heterogeneous rosters are supported through per-member overrides.
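Per-member overrides can be pictured as a shallow merge of member-specific settings over config-wide defaults. The `HarnessConfig` shape, the `resolveMember` helper, the config name, and the model names below are placeholders invented for this sketch, not the format used under evals/configs/.

```typescript
// Hypothetical config shape; the real files under evals/configs/ may differ.
type MemberConfig = { provider: string; model: string };
type HarnessConfig = {
  name: string;
  defaults: MemberConfig;
  overrides?: Record<string, Partial<MemberConfig>>;
};

// Resolve one roster member: start from the defaults, then apply its overrides.
function resolveMember(config: HarnessConfig, member: string): MemberConfig {
  const override = config.overrides?.[member] ?? {};
  return { ...config.defaults, ...override };
}

const mixedRoster: HarnessConfig = {
  name: "mixed-roster", // placeholder config name
  defaults: { provider: "claude", model: "placeholder-small-model" },
  overrides: {
    "worker-1": { provider: "codex", model: "placeholder-other-model" },
  },
};

console.log(resolveMember(mixedRoster, "lead"), resolveMember(mixedRoster, "worker-1"));
```

With this layout a homogeneous roster needs no overrides at all, and a heterogeneous one only lists the members that differ.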
Judges
Three grading modes can cooperate:
- Deterministic checks for concrete invariants
- LLM judge for rubric-style scoring over transcripts
- Agentic judge for live verification via tools like run_command, read_file, and api_get
The agentic judge is the strongest option when transcript-only grading is too trusting.
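The sketch below shows one way the three modes could combine into a single verdict, and why an agentic judge can catch what a transcript-only judge misses. The `Verdict` type, the `gradeAttempt` helper, and the "every verdict must pass" policy are assumptions for illustration; the harness's real scoring, including its pass thresholds, may work differently.

```typescript
// Hypothetical verdict shape and combination policy.
type Verdict = {
  mode: "deterministic" | "llm" | "agentic";
  passed: boolean;
  reason: string;
};

// Assumed policy: an attempt passes only if at least one verdict exists
// and none of them failed.
function gradeAttempt(verdicts: Verdict[]): { passed: boolean; failures: string[] } {
  const failures = verdicts
    .filter((verdict) => !verdict.passed)
    .map((verdict) => `${verdict.mode}: ${verdict.reason}`);
  return { passed: verdicts.length > 0 && failures.length === 0, failures };
}

// The transcript claims success, but live verification disagrees.
const graded = gradeAttempt([
  { mode: "deterministic", passed: true, reason: "all tasks reached a terminal state" },
  { mode: "llm", passed: true, reason: "transcript says the report was written" },
  { mode: "agentic", passed: false, reason: "read_file found no report in the workspace" },
]);

console.log(graded.passed, graded.failures);
```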
Common Uses
- Regression testing memory behavior, task routing, or dependency semantics
- Comparing providers or model tiers on the same scenario
- Verifying multi-worker handoffs and lead/worker orchestration
- Measuring cost, duration, and pass-rate trends across runtime versions
Where To Look Next
- Repo source of truth: evals/README.md
- E2B runtime model: E2B Provider Smoke Tests
- Worker/provider setup: Harness Configuration
- Provider implementation details: Adding a Harness Provider