Run scenario × harness-config evaluation matrices against real Agent Swarm stacks in E2B, with deterministic checks, judge models, and persisted artifacts.

The apps/evals/ sub-project is Agent Swarm's evaluation harness. It runs a scenario × harness-config matrix against fresh swarm stacks in E2B sandboxes, captures the resulting transcripts and artifacts, and grades each attempt with deterministic checks plus optional LLM or agentic judges.

Use it when you need a repeatable answer to questions like:

Which harness/provider/model combination actually passes this workflow?
Did a runtime change improve quality, cost, or speed over the last version?
Can a multi-worker scenario still complete after a change to memory, routing, or provider boot logic?

What It Runs

Each matrix cell is one scenario paired with one harness config:

Scenario defines the tasks, optional seeding, checks, and judging rubric.
Harness config defines the worker provider/model setup for that attempt.
Best@n retries are supported per cell, so you can compare pass rates across several attempts rather than trusting one lucky run.
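As a rough sketch of the matrix idea, the snippet below expands scenarios and configs into cells and summarizes the attempts for one cell. The `Cell` type, both function names, and the `example-config-b` config are hypothetical and are not taken from the harness source. Treating best@n as "at least one attempt passed" is also an assumption.

```typescript
// Hypothetical shapes for illustration; the real harness types may differ.
interface Cell {
  scenario: string;
  config: string;
}

// Pair every scenario with every harness config (the Cartesian product).
function expandMatrix(scenarios: string[], configs: string[]): Cell[] {
  return scenarios.flatMap((scenario) =>
    configs.map((config) => ({ scenario, config })),
  );
}

// Summarize the pass/fail results of the n attempts made for one cell.
function summarizeAttempts(attempts: boolean[]): { passRate: number; passedAny: boolean } {
  const passes = attempts.filter((passed) => passed).length;
  return {
    passRate: attempts.length === 0 ? 0 : passes / attempts.length,
    passedAny: passes > 0, // assumed best@n reading: any passing attempt counts
  };
}

// "example-config-b" is a made-up config name used only for this sketch.
const cells = expandMatrix(["memory-seeded-recall"], ["claude-haiku", "example-config-b"]);
const summary = summarizeAttempts([true, false, true, true]);
console.log(cells.length, summary.passRate, summary.passedAny);
```

Reporting the pass rate alongside the any-pass flag keeps a flaky cell visible even when one of its attempts succeeds.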

The harness boots a fresh stack for every attempt:

Start one API sandbox and one or more worker sandboxes in E2B.
Optionally seed SQL fixtures, memories, or workspace files.
Create the scenario's tasks and wait for terminal outcomes.
Grade the result with deterministic checks and optional judges.
Persist artifacts, transcripts, task records, costs, and sandbox metadata.
Tear the sandboxes down, even on failure.
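The lifecycle above could be sketched with a `try`/`finally` block, which is one common way to guarantee teardown on failure. Every name here (`Stack`, `AttemptSteps`, `runAttempt`) is a hypothetical stand-in for the harness internals, which may be structured quite differently.

```typescript
// All names are hypothetical stand-ins; this is not the harness's actual code.
interface Stack {
  id: string;
}

interface AttemptSteps {
  boot(): Promise<Stack>;                       // start API and worker sandboxes
  seed(stack: Stack): Promise<void>;            // optional fixtures, memories, files
  runTasks(stack: Stack): Promise<string[]>;    // resolves with terminal task outcomes
  grade(outcomes: string[]): Promise<boolean>;  // deterministic checks plus judges
  persist(stack: Stack, passed: boolean): Promise<void>;
  teardown(stack: Stack): Promise<void>;
}

async function runAttempt(steps: AttemptSteps): Promise<boolean> {
  const stack = await steps.boot();
  try {
    await steps.seed(stack);
    const outcomes = await steps.runTasks(stack);
    const passed = await steps.grade(outcomes);
    await steps.persist(stack, passed);
    return passed;
  } finally {
    // Runs whether the steps above returned or threw, so sandboxes never leak.
    await steps.teardown(stack);
  }
}
```

In this simplified version a thrown error skips the persist step. A fuller implementation would likely also record partial artifacts before tearing down.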

Eval-launched API and worker sandboxes pin DESPLEGA_TELEMETRY_ENV=test, so their activity stays out of production telemetry cohorts. The API still runs with NODE_ENV=production to preserve its production runtime and security behavior.
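A minimal sketch of that environment pinning follows, assuming a hypothetical `evalSandboxEnv` helper. Only the two variable names and their values come from the text above. How the harness actually assembles sandbox environments is not shown here, and the sketch deliberately leaves `NODE_ENV` unset for workers because the text only describes the API.

```typescript
// Hypothetical helper for illustration; the real harness may build env differently.
type SandboxRole = "api" | "worker";

function evalSandboxEnv(
  role: SandboxRole,
  base: Record<string, string> = {},
): Record<string, string> {
  // Spread base first so the telemetry pin always wins over inherited values.
  const env: Record<string, string> = { ...base, DESPLEGA_TELEMETRY_ENV: "test" };
  if (role === "api") {
    // The API keeps its production runtime and security behavior.
    env["NODE_ENV"] = "production";
  }
  return env;
}

console.log(evalSandboxEnv("api", { DESPLEGA_TELEMETRY_ENV: "production" }));
```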

What It Stores

The harness is designed for post-run inspection, not just a pass/fail bit. Each attempt can persist:

Flattened transcript output
Raw swarm session-log events
Harness session files from the worker filesystem
Task records and dependency outcomes
Seed command outputs
Session cost and token snapshots
Roster snapshots for multi-worker runs
Worker/API log tails
Captured API and worker version metadata

Results are stored in a Turso-backed libSQL replica, so local runs can be resumed and compared over time.

Quick Start

cd apps/evals
bun install
bun src/cli.ts registry
bun src/cli.ts run --scenarios memory-seeded-recall --configs claude-haiku
bun src/cli.ts serve

The designated smoke scenario is memory-seeded-recall. It is the cheapest meaningful end-to-end check because it exercises real server-side memory embedding and retrieval without incurring judge-model costs.

memory-seeded-recall requires EMBEDDING_API_KEY in apps/evals/.env. The harness no longer injects OPENAI_API_KEY as a fallback for seeded-memory runs; this is intentional.
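If you want to fail fast before paying for sandboxes, a preflight check along these lines is one option. `missingEnv` is a hypothetical helper, not part of the harness CLI, and the environment is passed in explicitly so the sketch stays self-contained.

```typescript
// Hypothetical preflight helper; not the harness's actual validation code.
// Returns the names of required variables that are unset or blank.
function missingEnv(
  required: string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// OPENAI_API_KEY alone does not satisfy the requirement.
const missing = missingEnv(["EMBEDDING_API_KEY"], { OPENAI_API_KEY: "placeholder" });
if (missing.length > 0) {
  console.error(`Missing required variables: ${missing.join(", ")}`);
}
```

In a real script you would pass `process.env` instead of the literal object.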

Core Concepts

Scenarios

Scenarios live under apps/evals/scenarios/. They define:

The worker roster (workers or lead + workers)
Optional seeding (sqlDump, memories, exec)
One or more tasks, including native dependsOn relationships
Outcome checks and pass thresholds
Optional LLM or agentic-judge rubric
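To make the pieces above concrete, here is a hypothetical scenario shape. The names `sqlDump`, `memories`, `exec`, and `dependsOn` appear in the list above; every other field name, the example scenario, and the `unknownDependencies` helper are assumptions for illustration. The authoritative schema is in the repo.

```typescript
// Hypothetical schema: sqlDump, memories, exec, and dependsOn are named in the
// docs; every other field name here is an assumption.
interface TaskDef {
  id: string;
  prompt: string;
  dependsOn?: string[];
}

interface ScenarioDef {
  name: string;
  roster: { lead?: string; workers: string[] };
  seed?: { sqlDump?: string; memories?: string[]; exec?: string[] };
  tasks: TaskDef[];
  passThreshold: number; // assumed semantics: fraction of checks that must pass
}

// Return every dependsOn entry that does not name a task in the scenario.
function unknownDependencies(scenario: ScenarioDef): string[] {
  const ids = new Set(scenario.tasks.map((task) => task.id));
  return scenario.tasks.flatMap((task) =>
    (task.dependsOn ?? []).filter((dep) => !ids.has(dep)),
  );
}

const example: ScenarioDef = {
  name: "example-two-step",
  roster: { workers: ["worker-1"] },
  seed: { memories: ["The deploy window is Tuesday."] },
  tasks: [
    { id: "recall", prompt: "When is the deploy window?" },
    { id: "report", prompt: "Summarize the answer.", dependsOn: ["recall"] },
  ],
  passThreshold: 1,
};

console.log(unknownDependencies(example));
```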

Harness Configs

Harness configs live under apps/evals/configs/. They map a scenario onto a concrete provider/model environment such as Claude, pi, Codex, or opencode. Heterogeneous rosters are supported through per-member overrides.
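One way to picture per-member overrides is a defaults-plus-overrides merge, sketched below. The `HarnessConfig` shape, the `resolveMember` helper, and the model names are all hypothetical. The real config format may differ.

```typescript
// Hypothetical config shape; field names and model names are placeholders.
interface MemberSetup {
  provider: string;
  model: string;
}

interface HarnessConfig {
  name: string;
  defaults: MemberSetup;
  overrides?: Record<string, Partial<MemberSetup>>;
}

// Merge the defaults with any override declared for this roster member.
function resolveMember(config: HarnessConfig, member: string): MemberSetup {
  return { ...config.defaults, ...(config.overrides?.[member] ?? {}) };
}

const mixed: HarnessConfig = {
  name: "example-mixed",
  defaults: { provider: "claude", model: "example-model-a" },
  overrides: { "worker-2": { provider: "codex", model: "example-model-b" } },
};

console.log(resolveMember(mixed, "worker-1"), resolveMember(mixed, "worker-2"));
```

Members without an override inherit the defaults unchanged, which keeps homogeneous rosters short to declare.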

Judges

Three grading modes can cooperate:

Deterministic checks for concrete invariants
LLM judge for rubric-style scoring over transcripts
Agentic judge for live verification via tools like run_command, read_file, and api_get

The agentic judge is the strongest option when transcript-only grading would take the agent's claims at face value, because it can verify the resulting state directly through its tools.
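How the grading modes combine is not spelled out here, so the following is only one plausible rule: every deterministic check must pass, and if a judge produced a score it must meet a threshold. The `GradeInput` shape, the 0 to 1 score range, and the default threshold of 0.5 are all assumptions.

```typescript
// Hypothetical combination rule; the harness's real grading logic may differ.
interface GradeInput {
  checks: boolean[];        // results of deterministic checks
  judgeScore?: number;      // assumed 0..1 range; absent when no judge ran
  judgeThreshold?: number;  // assumed default of 0.5 when omitted
}

function gradeAttempt(input: GradeInput): boolean {
  const checksPass = input.checks.every((ok) => ok);
  if (input.judgeScore === undefined) {
    return checksPass; // no judge configured: checks alone decide
  }
  const threshold = input.judgeThreshold ?? 0.5;
  return checksPass && input.judgeScore >= threshold;
}

console.log(gradeAttempt({ checks: [true, true], judgeScore: 0.8, judgeThreshold: 0.7 }));
```

Under this rule, a high judge score cannot rescue an attempt that fails a concrete invariant.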

Common Uses

Regression testing memory behavior, task routing, or dependency semantics
Comparing providers or model tiers on the same scenario
Verifying multi-worker handoffs and lead/worker orchestration
Measuring cost, duration, and pass-rate trends across runtime versions

Where To Look Next

Repo source of truth: apps/evals/README.md
E2B runtime model: E2B Provider Smoke Tests
Worker/provider setup: Harness Configuration
Provider implementation details: Adding a Harness Provider

Evals Harness