Workers wait for harness credentials at runtime instead of crash-looping the container — how the boot loop, dispatcher gating, and dashboard surface fit together

A worker container always boots and registers, even when the harness credentials it needs (e.g. CLAUDE_CODE_OAUTH_TOKEN) are missing. It parks in a waiting_for_credentials state, the dispatcher routes around it, and the dashboard shows what it's blocked on. The worker self-heals as soon as the credential lands in swarm_config — no container restart.

This guide documents that lifecycle, the endpoints that drive it, and the configuration knobs that govern its timing.

This pattern replaces the previous bash-level fail-fast in docker-entrypoint.sh, which exited the container on missing creds and forced operators to restart workers after every credential change.

1. Boot model

The Docker entrypoint is now best-effort. It still performs file-prep side effects (codex login, claude-managed config restore from swarm_config, etc.) but does not exit the process on missing harness credentials. The only hard exit it keeps is for a missing API_KEY: without that, the worker can't talk to the API at all.

After the entrypoint hands off, the worker process:

1. Calls join-swarm so the agent row exists in the DB and is visible on the dashboard.
2. Calls awaitCredentials(...) (src/commands/credential-wait.ts), which loops:
   - Runs checkProviderCredentials(provider, env) against the per-adapter predicate.
   - If not ready, calls fetchResolvedEnv(...) to merge the latest swarm_config values into process.env.
   - Re-checks. If still not ready, reports state via PUT /api/agents/{id}/credential-status and sleeps with exponential backoff.
3. Once ready, transitions the agent's row to status: idle and starts the task-claim loop.
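
The loop above can be sketched as follows. This is a minimal illustration, not the code in src/commands/credential-wait.ts: the WaitDeps interface, the report and sleep callbacks, and the doubling factor are assumptions made so the example runs on its own.

```typescript
// Sketch of the boot-time wait loop. Only awaitCredentials and
// fetchResolvedEnv are names from the guide; everything else is hypothetical.
type Env = Record<string, string | undefined>;

interface CredStatus {
  ready: boolean;
  missing: string[];
}

interface WaitDeps {
  // Per-adapter predicate (the role checkProviderCredentials plays).
  check: (env: Env) => CredStatus;
  // Returns the latest resolved swarm_config values.
  fetchResolvedEnv: () => Promise<Env>;
  // Self-report, standing in for PUT /api/agents/{id}/credential-status.
  report: (status: CredStatus) => Promise<void>;
  sleep: (ms: number) => Promise<void>;
}

async function awaitCredentials(
  env: Env,
  deps: WaitDeps,
  initialBackoffMs = 2000,
  maxBackoffMs = 30000,
): Promise<void> {
  let backoff = initialBackoffMs;
  for (;;) {
    // 1. Check what is already in the environment.
    let status = deps.check(env);
    if (!status.ready) {
      // 2. Merge the latest config values, then re-check.
      Object.assign(env, await deps.fetchResolvedEnv());
      status = deps.check(env);
    }
    // 3. Report the outcome so the agent row reflects it.
    await deps.report(status);
    if (status.ready) return;
    // 4. Still blocked: sleep, then double the delay up to the cap
    //    (the factor of 2 is an assumption).
    await deps.sleep(backoff);
    backoff = Math.min(backoff * 2, maxBackoffMs);
  }
}
```

Passing the predicate, config fetch, and sleep in as dependencies keeps the loop testable without a real API or real timers; a worker would presumably wire them to the actual helpers.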

The worker never crashes on missing harness credentials (unless BOOT_MAX_WAIT_SECONDS is set; see section 7). Operators can set credentials at any time, and the worker picks them up on its next tick.

2. Agent lifecycle

The agents.status enum is idle | busy | offline | waiting_for_credentials. Adding the fourth value keeps the dispatcher predicate trivial — getIdleWorkersWithCapacity filters on status === 'idle', so blocked workers are excluded from routing without any extra condition.
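
As a rough sketch of why the fourth enum value keeps routing simple, the filter below excludes parked workers with a single equality check. The activeTasks and maxTasks fields are hypothetical stand-ins for however capacity is really tracked.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface Agent {
  id: string;
  status: AgentStatus;
  // Hypothetical capacity fields, not taken from the real schema.
  activeTasks: number;
  maxTasks: number;
}

// Workers in waiting_for_credentials fail the status check, so no extra
// "has credentials" condition is needed in the dispatcher.
function getIdleWorkersWithCapacity(agents: Agent[]): Agent[] {
  return agents.filter(
    (a) => a.status === "idle" && a.activeTasks < a.maxTasks,
  );
}
```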

The shape of the agent row while waiting:

{
  "id": "worker-1",
  "status": "waiting_for_credentials",
  "credentialMissing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]
}

credentialMissing is null (or absent) in any other state.

| From → To | Trigger |
| --- | --- |
| `idle` → `waiting_for_credentials` | Worker `awaitCredentials` finds creds gone (rare; usually only happens at boot) |
| `waiting_for_credentials` → `idle` | Worker's next tick finds creds present in `process.env` after a `fetchResolvedEnv` refresh |
| any → `offline` | Heartbeat stops landing |

3. Per-provider predicates

Each adapter exports a checkCredentials(env, opts?): CredStatus predicate, and src/providers/credentials.ts dispatches to the right one. The CredStatus shape:

interface CredStatus {
  ready: boolean;
  missing: string[];
  satisfiedBy?: 'env' | 'file' | 'side-effect-pending';
  hint?: string;
}

Provider-by-provider:

| Provider | Ready when |
| --- | --- |
| `claude` | `CLAUDE_CODE_OAUTH_TOKEN` or `ANTHROPIC_API_KEY` is set |
| `claude-managed` | All of `ANTHROPIC_API_KEY`, `MANAGED_AGENT_ID`, `MANAGED_ENVIRONMENT_ID`, `MCP_BASE_URL` are set |
| `devin` | Both `DEVIN_API_KEY` and `DEVIN_ORG_ID` are set |
| `codex` | `~/.codex/auth.json` exists, or `OPENAI_API_KEY` is set. With only the env var, the login step still needs to run, so the status is reported as `satisfiedBy: 'side-effect-pending'` |
| `pi` | `~/.pi/agent/auth.json` exists. Otherwise model-conditional: if `MODEL_OVERRIDE` resolves to anthropic, openrouter, or openai, that provider's key is required; if it is unset, any one of the three keys is enough |
| `opencode` | Same shape as `pi`, with the file at `~/.local/share/opencode/auth.json` and model-conditional env keys |
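
To make the env-only rows of the table concrete, here is a minimal sketch of three predicates and a dispatcher. The anyOf and allOf helpers, the predicates map, and the rule that an empty string counts as unset are assumptions; the file-based providers (codex, pi, opencode) are left out because they need filesystem checks.

```typescript
type Env = Record<string, string | undefined>;

interface CredStatus {
  ready: boolean;
  missing: string[];
  satisfiedBy?: "env" | "file" | "side-effect-pending";
  hint?: string;
}

// Assumption: an empty string counts as unset.
const isSet = (env: Env, key: string): boolean => (env[key] ?? "") !== "";

// Ready when at least one key is set; otherwise report every candidate.
function anyOf(env: Env, keys: string[]): CredStatus {
  return keys.some((k) => isSet(env, k))
    ? { ready: true, missing: [], satisfiedBy: "env" }
    : { ready: false, missing: [...keys] };
}

// Ready only when every key is set; report just the absent ones.
function allOf(env: Env, keys: string[]): CredStatus {
  const missing = keys.filter((k) => !isSet(env, k));
  return missing.length === 0
    ? { ready: true, missing: [], satisfiedBy: "env" }
    : { ready: false, missing };
}

const predicates: Record<string, (env: Env) => CredStatus> = {
  claude: (env) => anyOf(env, ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]),
  "claude-managed": (env) =>
    allOf(env, [
      "ANTHROPIC_API_KEY",
      "MANAGED_AGENT_ID",
      "MANAGED_ENVIRONMENT_ID",
      "MCP_BASE_URL",
    ]),
  devin: (env) => allOf(env, ["DEVIN_API_KEY", "DEVIN_ORG_ID"]),
};

function checkProviderCredentials(provider: string, env: Env): CredStatus {
  const predicate = predicates[provider];
  if (!predicate) {
    return { ready: false, missing: [], hint: `unknown provider: ${provider}` };
  }
  return predicate(env);
}
```

Reporting every candidate key for an any-of provider matches the sample agent row in section 2, where both claude variables are listed as missing.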

4. Endpoints

Two read endpoints (for dashboard / orchestrators) and one write endpoint (for the worker itself).

GET /api/agents/{id}/credential-status

Single-agent snapshot. Response:

{
  "agentId": "worker-1",
  "name": "worker-1",
  "status": "waiting_for_credentials",
  "missing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"],
  "provider": "claude",
  "lastCheckedAt": "2026-05-06T21:35:27.791Z"
}

GET /api/agents/credential-status[?status=waiting_for_credentials]

Bulk snapshot across all agents. The optional ?status= filter narrows to one enum value. Powers the dashboard's at-a-glance fleet view without N round-trips.

PUT /api/agents/{id}/credential-status
{ "ready": false, "missing": ["CLAUDE_CODE_OAUTH_TOKEN"] }

Worker self-report. ready: true flips the agent back to idle and clears credentialMissing. ready: false sets waiting_for_credentials with the listed missing keys.
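
The effect of a self-report on the agent row can be sketched as a pure function. AgentRow, CredentialReport, and applyCredentialReport are illustrative names, and the sketch ignores anything else the real handler may do (validation, timestamps, persistence).

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface AgentRow {
  id: string;
  status: AgentStatus;
  credentialMissing: string[] | null;
}

interface CredentialReport {
  ready: boolean;
  missing: string[];
}

// Returns the updated row without mutating the input.
function applyCredentialReport(row: AgentRow, report: CredentialReport): AgentRow {
  if (report.ready) {
    // ready: true flips the agent back to idle and clears the missing list.
    return { ...row, status: "idle", credentialMissing: null };
  }
  // ready: false parks the agent and records what it is blocked on.
  return {
    ...row,
    status: "waiting_for_credentials",
    credentialMissing: [...report.missing],
  };
}
```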

There is no /ready endpoint on the worker process today. Workers don't expose an HTTP server; orchestrators that want strict gating should poll GET /api/agents/{id}/credential-status against the API. A worker-side endpoint is tracked as a follow-up in case a Kubernetes-style readiness split is needed.
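
An orchestrator-side gate built on that polling advice might look like the sketch below. The function name waitUntilReady, the injected getJson and sleep helpers, and the interval and attempt defaults are all assumptions; only the URL path and response fields come from this guide.

```typescript
interface CredentialStatusResponse {
  agentId: string;
  name: string;
  status: string;
  missing: string[];
  provider: string;
  lastCheckedAt: string;
}

interface PollDeps {
  // Hypothetical helper: performs an authenticated GET and parses the JSON body.
  getJson: (url: string) => Promise<CredentialStatusResponse>;
  sleep: (ms: number) => Promise<void>;
}

// Resolves true once the agent reports idle or busy, false if it is still
// not ready after maxAttempts polls.
async function waitUntilReady(
  baseUrl: string,
  agentId: string,
  deps: PollDeps,
  intervalMs = 5000,
  maxAttempts = 60,
): Promise<boolean> {
  const url = `${baseUrl}/api/agents/${encodeURIComponent(agentId)}/credential-status`;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snapshot = await deps.getJson(url);
    if (snapshot.status === "idle" || snapshot.status === "busy") return true;
    await deps.sleep(intervalMs);
  }
  return false;
}
```

Treating only idle and busy as ready means an offline worker keeps the gate closed, which seems the safer default for strict gating.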

5. Recovering a parked worker

Set the missing credential via the config API and the worker picks it up on its next tick (default ≤30s):

curl -X PUT https://api.example.com/api/config \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scope": "agent",
    "scopeId": "worker-1",
    "key": "CLAUDE_CODE_OAUTH_TOKEN",
    "value": "sk-ant-oat01-...",
    "isSecret": true
  }'

Scope choices:

- scope: "agent" (only this worker sees it)
- scope: "global" (all workers inherit it)

The worker's next fetchResolvedEnv call merges the new value into process.env, the predicate flips green, and the agent row transitions to idle.

Caveat: codex side-effect on first credential

codex login --with-api-key runs in the entrypoint as a one-time side effect. If OPENAI_API_KEY arrives after the entrypoint has finished, the worker's predicate reports ready (since the env var is set), but ~/.codex/auth.json may not be written until the next container restart. Plan to restart the codex worker once after it receives its first credential. Moving the side effect into TS is tracked as a follow-up.

6. Dashboard surface

The agents list (/agents) shows a WAITING FOR CREDS pill in the Status column for blocked workers. The detail view (/agents/{id}) renders a panel listing each missing variable as a chip plus a remediation hint with the exact PUT /api/config form.

The dashboard polls the credential-status read endpoints from section 4; once you set the credential, the badge clears within the polling interval.

7. Configuration knobs

All three are read from process.env at function entry, so they can be overridden per worker.

| Variable | Default | Effect |
| --- | --- | --- |
| `BOOT_INITIAL_BACKOFF_MS` | `2000` | Initial sleep between credential checks |
| `BOOT_MAX_BACKOFF_MS` | `30000` | Cap on the exponential backoff |
| `BOOT_MAX_WAIT_SECONDS` | `0` | If > 0, exit with code `78` (`EX_CONFIG`) once exceeded. `0` = wait forever |

78 is distinct from generic failures so monitoring can tell "credentials never arrived" apart from a crash.
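
The sketch below shows one way the three knobs could be parsed and applied. The function names, the fallback behaviour for invalid values, and the doubling factor are assumptions; the variable names, defaults, and exit code are taken from the table above.

```typescript
type Env = Record<string, string | undefined>;

interface BootConfig {
  initialBackoffMs: number;
  maxBackoffMs: number;
  maxWaitSeconds: number;
}

// Exit code for "credentials never arrived" (EX_CONFIG).
const EX_CONFIG = 78;

// Assumption: unset, non-numeric, or negative values fall back to the default.
function readInt(env: Env, key: string, fallback: number): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function readBootConfig(env: Env): BootConfig {
  return {
    initialBackoffMs: readInt(env, "BOOT_INITIAL_BACKOFF_MS", 2000),
    maxBackoffMs: readInt(env, "BOOT_MAX_BACKOFF_MS", 30000),
    maxWaitSeconds: readInt(env, "BOOT_MAX_WAIT_SECONDS", 0),
  };
}

// Delays for the first `attempts` sleeps, doubling up to the cap.
function backoffSchedule(cfg: BootConfig, attempts: number): number[] {
  const delays: number[] = [];
  let delay = cfg.initialBackoffMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, cfg.maxBackoffMs));
    delay = Math.min(delay * 2, cfg.maxBackoffMs);
  }
  return delays;
}

// True once the optional deadline has passed; 0 disables the deadline.
function deadlineExceeded(cfg: BootConfig, elapsedMs: number): boolean {
  return cfg.maxWaitSeconds > 0 && elapsedMs > cfg.maxWaitSeconds * 1000;
}
```

Under these assumptions the default schedule is 2000, 4000, 8000, 16000, then 30000 ms for every later tick, which is where the 30-second worst-case pickup latency comes from. A worker would exit with EX_CONFIG when deadlineExceeded returns true.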

8. What this is not

- Not a push-based notification system. Workers poll their own state on backoff; there is no SSE / WebSocket subscription to /api/config/reload. Latency: ≤30s by default.
- Not a way to bootstrap API_KEY, SECRETS_ENCRYPTION_KEY, or MCP_BASE_URL. Those are required for the worker to talk to the API at all and are validated at the bash layer.
- Not a replacement for the per-task fetchResolvedEnv refresh. That still runs on every task spawn (src/commands/runner.ts) and is what keeps long-running agents in sync with mid-flight credential rotations.

Adding a Harness Provider — implement a new provider; includes the checkCredentials predicate contract
Harness Configuration — provider-by-provider credential setup
Secrets Encryption — how swarm_config secrets are protected at rest

Worker Credential Recovery