Worker Credential Recovery

Workers wait for harness credentials at runtime instead of crash-looping the container — how the boot loop, dispatcher gating, and dashboard surface fit together

A worker container always boots and registers, even when the harness credentials it needs (e.g. CLAUDE_CODE_OAUTH_TOKEN) are missing. It parks in a waiting_for_credentials state, the dispatcher routes around it, and the dashboard shows what it's blocked on. The worker self-heals as soon as the credential lands in swarm_config — no container restart.

This guide documents that lifecycle, the endpoints that drive it, and the configuration knobs that govern its timing.

This pattern replaces the previous bash-level fail-fast in docker-entrypoint.sh, which exited the container on missing creds and forced operators to restart workers after every credential change.


1. Boot model

The Docker entrypoint is now best-effort. It still does file-prep side effects (codex login, claude-managed config restore from swarm_config, etc.) but does not exit the process on missing harness credentials. The single hard exit it keeps is API_KEY — without that, the worker can't talk to the API at all.

After the entrypoint hands off, the worker process:

  1. Calls join-swarm so the agent row exists in the DB and is visible on the dashboard.
  2. Calls awaitCredentials(...) (src/commands/credential-wait.ts) which loops:
    • checkProviderCredentials(provider, env) against the per-adapter predicate.
    • If not ready, calls fetchResolvedEnv(...) to merge the latest swarm_config values into process.env.
    • Re-checks. If still not ready, sleeps with exponential backoff and reports state via PUT /api/agents/{id}/credential-status.
  3. Once ready, transitions the agent's row to status: idle and starts the task-claim loop.

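The worker does not crash on missing credentials; the only credential-related exit is the optional BOOT_MAX_WAIT_SECONDS timeout described in section 7. Operators can set credentials at any time, and the worker picks them up on its next tick.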
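The loop above can be sketched as follows. This is a minimal illustration, not the code in src/commands/credential-wait.ts: awaitCredentialsSketch and the WaitDeps shape are hypothetical names, the dependencies are injected as stand-ins for checkProviderCredentials, fetchResolvedEnv, and the PUT self-report, and doubling the delay on each retry is an assumption about what "exponential backoff" means here.

```typescript
// Hypothetical sketch of the boot-time wait loop (not the real implementation).
interface CredStatus {
  ready: boolean;
  missing: string[];
}

type Env = Record<string, string | undefined>;

interface WaitDeps {
  // Stand-in for checkProviderCredentials(provider, env).
  check: (env: Env) => CredStatus;
  // Stand-in for fetchResolvedEnv(...): returns the latest swarm_config values.
  refresh: () => Promise<Record<string, string>>;
  // Stand-in for PUT /api/agents/{id}/credential-status.
  report: (status: CredStatus) => Promise<void>;
  // Injected so the loop can be exercised without real timers.
  sleep: (ms: number) => Promise<void>;
}

// Resolves with the number of attempts it took for the predicate to pass.
async function awaitCredentialsSketch(
  env: Env,
  deps: WaitDeps,
  initialBackoffMs = 2000,
  maxBackoffMs = 30000,
): Promise<number> {
  let backoff = initialBackoffMs;
  let attempts = 0;
  for (;;) {
    attempts += 1;
    let status = deps.check(env);
    if (!status.ready) {
      // Merge the latest config values into the environment, then re-check.
      Object.assign(env, await deps.refresh());
      status = deps.check(env);
    }
    await deps.report(status);
    if (status.ready) return attempts;
    await deps.sleep(backoff);
    // Assumption: the delay doubles on each retry, capped at the maximum.
    backoff = Math.min(backoff * 2, maxBackoffMs);
  }
}
```

Injecting sleep and refresh keeps the sketch testable; a real worker would pass a timer-based sleep and an HTTP-backed refresh.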


2. Agent lifecycle

The agents.status enum is idle | busy | offline | waiting_for_credentials. Adding the fourth value keeps the dispatcher predicate trivial — getIdleWorkersWithCapacity filters on status === 'idle', so blocked workers are excluded from routing without any extra condition.

The shape of the agent row while waiting:

{
  "id": "worker-1",
  "status": "waiting_for_credentials",
  "credentialMissing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]
}

credentialMissing is null (or absent) in any other state.

| From → To | Trigger |
| --- | --- |
| idle → waiting_for_credentials | Worker awaitCredentials finds creds gone (rare; usually only happens at boot) |
| waiting_for_credentials → idle | Worker's next tick finds creds present in process.env after a fetchResolvedEnv refresh |
| any → offline | Heartbeat stops landing |
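To illustrate why the fourth enum value keeps routing simple, here is a small sketch of a status-only filter. idleWorkers is a hypothetical stand-in for getIdleWorkersWithCapacity; the capacity check is omitted because the source does not describe it, and the AgentRow type only mirrors the fields shown in the JSON above.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface AgentRow {
  id: string;
  status: AgentStatus;
  credentialMissing?: string[] | null;
}

// Hypothetical stand-in for getIdleWorkersWithCapacity (capacity check omitted).
// Blocked workers drop out because their status is not "idle"; no extra
// credential-specific condition is needed.
function idleWorkers(agents: AgentRow[]): AgentRow[] {
  return agents.filter((agent) => agent.status === "idle");
}

const fleet: AgentRow[] = [
  { id: "worker-1", status: "waiting_for_credentials", credentialMissing: ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"] },
  { id: "worker-2", status: "idle", credentialMissing: null },
  { id: "worker-3", status: "busy" },
];

const routable = idleWorkers(fleet).map((agent) => agent.id);
```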

3. Per-provider predicates

Each adapter exports a checkCredentials(env, opts?): CredStatus (src/providers/credentials.ts dispatches to the right one). The shape:

interface CredStatus {
  ready: boolean;
  missing: string[];
  satisfiedBy?: 'env' | 'file' | 'side-effect-pending';
  hint?: string;
}

Provider-by-provider:

| Provider | Ready when |
| --- | --- |
| claude | CLAUDE_CODE_OAUTH_TOKEN or ANTHROPIC_API_KEY is set |
| claude-managed | All of ANTHROPIC_API_KEY, MANAGED_AGENT_ID, MANAGED_ENVIRONMENT_ID, MCP_BASE_URL are set |
| devin | Both DEVIN_API_KEY and DEVIN_ORG_ID are set |
| codex | ~/.codex/auth.json exists or OPENAI_API_KEY is set (login dance still needs to run; reported as satisfiedBy: 'side-effect-pending') |
| pi | ~/.pi/agent/auth.json exists, otherwise model-conditional: MODEL_OVERRIDE resolves to anthropic/openrouter/openai → that provider's key required. Unset → any one of the three is enough. |
| opencode | Same shape as pi, file at ~/.local/share/opencode/auth.json, model-conditional env keys |
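The env-only rows of the table reduce to two rules: "any of these keys" and "all of these keys". The sketch below shows one way to express them. The helper names (isSet, anyOf, allOf, envPredicates) are hypothetical, treating an empty or whitespace-only value as unset is an assumption, and the file-based providers (codex, pi, opencode) are left out.

```typescript
interface CredStatus {
  ready: boolean;
  missing: string[];
  satisfiedBy?: "env" | "file" | "side-effect-pending";
  hint?: string;
}

type Env = Record<string, string | undefined>;

// Assumption: an empty or whitespace-only value counts as unset.
const isSet = (env: Env, key: string): boolean => (env[key] ?? "").trim() !== "";

// Ready when at least one key is set; otherwise every candidate is reported.
function anyOf(env: Env, keys: string[]): CredStatus {
  return keys.some((key) => isSet(env, key))
    ? { ready: true, missing: [], satisfiedBy: "env" }
    : { ready: false, missing: [...keys] };
}

// Ready only when every key is set; otherwise only the absent keys are reported.
function allOf(env: Env, keys: string[]): CredStatus {
  const missing = keys.filter((key) => !isSet(env, key));
  return missing.length === 0
    ? { ready: true, missing: [], satisfiedBy: "env" }
    : { ready: false, missing };
}

// Hypothetical predicate map covering the env-only providers from the table.
const envPredicates = {
  claude: (env: Env): CredStatus =>
    anyOf(env, ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]),
  "claude-managed": (env: Env): CredStatus =>
    allOf(env, ["ANTHROPIC_API_KEY", "MANAGED_AGENT_ID", "MANAGED_ENVIRONMENT_ID", "MCP_BASE_URL"]),
  devin: (env: Env): CredStatus => allOf(env, ["DEVIN_API_KEY", "DEVIN_ORG_ID"]),
};
```

Reporting both keys for an unready claude worker matches the credentialMissing example in section 2, where either key would unblock the worker.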

4. Endpoints

Two read endpoints (for dashboard / orchestrators) and one write endpoint (for the worker itself).

GET /api/agents/{id}/credential-status

Single-agent snapshot. Response:

{
  "agentId": "worker-1",
  "name": "worker-1",
  "status": "waiting_for_credentials",
  "missing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"],
  "provider": "claude",
  "lastCheckedAt": "2026-05-06T21:35:27.791Z"
}

GET /api/agents/credential-status[?status=waiting_for_credentials]

Bulk snapshot across all agents. The optional ?status= filter narrows to one enum value. Powers the dashboard's at-a-glance fleet view without N round-trips.

PUT /api/agents/{id}/credential-status
{ "ready": false, "missing": ["CLAUDE_CODE_OAUTH_TOKEN"] }

Worker self-report. ready: true flips the agent back to idle and clears credentialMissing. ready: false sets waiting_for_credentials with the listed missing keys.
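The effect of the self-report on the agent row can be sketched as a pure function. applyCredentialReport is a hypothetical name, not the API handler itself, and the sketch does not special-case busy or offline agents because the source does not say how those are handled.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface AgentRow {
  id: string;
  status: AgentStatus;
  credentialMissing: string[] | null;
}

interface CredentialReport {
  ready: boolean;
  missing?: string[];
}

// Hypothetical reducer mirroring the documented PUT semantics:
// ready: true  -> idle, credentialMissing cleared
// ready: false -> waiting_for_credentials with the listed keys
function applyCredentialReport(agent: AgentRow, report: CredentialReport): AgentRow {
  if (report.ready) {
    return { ...agent, status: "idle", credentialMissing: null };
  }
  return {
    ...agent,
    status: "waiting_for_credentials",
    credentialMissing: report.missing ?? [],
  };
}

const parked = applyCredentialReport(
  { id: "worker-1", status: "idle", credentialMissing: null },
  { ready: false, missing: ["CLAUDE_CODE_OAUTH_TOKEN"] },
);
const recovered = applyCredentialReport(parked, { ready: true });
```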

There is no /ready endpoint on the worker process today. Workers don't expose an HTTP server; orchestrators that want strict gating should poll GET /api/agents/{id}/credential-status against the API. Tracked as a follow-up if a Kubernetes-style readiness split is needed.
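An orchestrator that wants strict gating could wrap the polling in a small helper like the one below. This is a sketch under assumptions: credentialStatusUrl and pollUntilReady are hypothetical names, the snapshot fetcher and sleep are injected rather than tied to a particular HTTP client, treating both idle and busy as "credentials present" is an interpretation of the status enum, and the 5-second interval and 60-poll limit are arbitrary illustration values.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface CredentialSnapshot {
  agentId: string;
  status: AgentStatus;
  missing: string[] | null;
}

// Builds the single-agent endpoint URL; baseUrl is whatever host serves the API.
function credentialStatusUrl(baseUrl: string, agentId: string): string {
  const trimmed = baseUrl.replace(/\/+$/, "");
  return `${trimmed}/api/agents/${encodeURIComponent(agentId)}/credential-status`;
}

// Polls until the agent is no longer blocked on credentials, or gives up.
// getSnapshot would wrap a GET against credentialStatusUrl(...) in practice.
async function pollUntilReady(
  getSnapshot: (agentId: string) => Promise<CredentialSnapshot>,
  sleep: (ms: number) => Promise<void>,
  agentId: string,
  intervalMs = 5000,
  maxPolls = 60,
): Promise<boolean> {
  for (let poll = 0; poll < maxPolls; poll += 1) {
    const snapshot = await getSnapshot(agentId);
    // Assumption: idle and busy both imply the credentials are in place.
    if (snapshot.status === "idle" || snapshot.status === "busy") return true;
    await sleep(intervalMs);
  }
  return false;
}
```

Returning false after maxPolls lets the caller decide whether to alert or keep waiting, rather than blocking a deployment indefinitely.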


5. Recovering a parked worker

Set the missing credential via the config API and the worker picks it up on its next tick (default ≤30s):

curl -X PUT https://api.example.com/api/config \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "scope": "agent",
    "scopeId": "worker-1",
    "key": "CLAUDE_CODE_OAUTH_TOKEN",
    "value": "sk-ant-oat01-...",
    "isSecret": true
  }'

Scope choices:

  • scope: "agent" — only this worker sees it.
  • scope: "global" — all workers inherit it.

The worker's next fetchResolvedEnv call merges the new value into process.env, the predicate flips green, and the agent row transitions to idle.
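For scripted recovery, the request body from the curl example can be built with a small helper. buildCredentialWrite is a hypothetical name, and omitting scopeId for the global scope is an assumption; the source only shows the agent-scoped form.

```typescript
type ConfigScope = "agent" | "global";

interface ConfigWrite {
  scope: ConfigScope;
  scopeId?: string;
  key: string;
  value: string;
  isSecret: boolean;
}

// Builds the PUT /api/config body. Passing an agentId targets one worker;
// leaving it out writes a global value that all workers inherit.
function buildCredentialWrite(key: string, value: string, agentId?: string): ConfigWrite {
  if (agentId === undefined) {
    // Assumption: a global write carries no scopeId.
    return { scope: "global", key, value, isSecret: true };
  }
  return { scope: "agent", scopeId: agentId, key, value, isSecret: true };
}

const agentScoped = buildCredentialWrite("CLAUDE_CODE_OAUTH_TOKEN", "example-token", "worker-1");
const fleetWide = buildCredentialWrite("CLAUDE_CODE_OAUTH_TOKEN", "example-token");
```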

Caveat: codex side-effect on first credential

codex login --with-api-key runs in the entrypoint as a one-time side effect. If OPENAI_API_KEY arrives after the entrypoint has finished, the worker's predicate reports ready (since the env var is set), but ~/.codex/auth.json may not be written until the next container restart. Plan to restart the codex worker once after its first credential is set. Moving the side effect into TS is tracked as a follow-up.


6. Dashboard surface

The agents list (/agents) shows a WAITING FOR CREDS pill in the Status column for blocked workers. The detail view (/agents/{id}) renders a panel listing each missing variable as a chip plus a remediation hint with the exact PUT /api/config form.

The dashboard polls these endpoints; once you set the credential, the badge clears within the polling interval.


7. Configuration knobs

All three are read from process.env at function entry, so each can be overridden per worker.

| Variable | Default | Effect |
| --- | --- | --- |
| BOOT_INITIAL_BACKOFF_MS | 2000 | Initial sleep between credential checks |
| BOOT_MAX_BACKOFF_MS | 30000 | Cap on the exponential backoff |
| BOOT_MAX_WAIT_SECONDS | 0 | If > 0, exit with code 78 (EX_CONFIG) once exceeded. 0 = wait forever |

78 is distinct from generic failures so monitoring can tell "credentials never arrived" apart from a crash.
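As a rough illustration of how the three knobs interact, the sketch below parses them with the documented defaults, lists the resulting sleep schedule, and decides when the exit code applies. The function names are hypothetical, falling back to the default on an unparsable value is an assumption, and so is the doubling factor between retries.

```typescript
interface BootKnobs {
  initialBackoffMs: number;
  maxBackoffMs: number;
  maxWaitSeconds: number;
}

const EX_CONFIG = 78;

// Assumption: unparsable or negative values fall back to the default.
function readNonNegativeInt(raw: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function readBootKnobs(env: Record<string, string | undefined>): BootKnobs {
  return {
    initialBackoffMs: readNonNegativeInt(env["BOOT_INITIAL_BACKOFF_MS"], 2000),
    maxBackoffMs: readNonNegativeInt(env["BOOT_MAX_BACKOFF_MS"], 30000),
    maxWaitSeconds: readNonNegativeInt(env["BOOT_MAX_WAIT_SECONDS"], 0),
  };
}

// Sleep durations before each of the first `retries` retries, assuming doubling.
function backoffSchedule(knobs: BootKnobs, retries: number): number[] {
  const delays: number[] = [];
  let delay = knobs.initialBackoffMs;
  for (let i = 0; i < retries; i += 1) {
    delays.push(Math.min(delay, knobs.maxBackoffMs));
    delay = Math.min(delay * 2, knobs.maxBackoffMs);
  }
  return delays;
}

// Returns EX_CONFIG once the wait budget is spent, or null to keep waiting.
// A budget of 0 means wait forever.
function exitCodeFor(knobs: BootKnobs, elapsedMs: number): number | null {
  const overBudget = knobs.maxWaitSeconds > 0 && elapsedMs > knobs.maxWaitSeconds * 1000;
  return overBudget ? EX_CONFIG : null;
}
```

Under these assumptions the default schedule reaches the 30-second cap on the fifth retry, which is where the "≤30s" pickup latency quoted elsewhere in this guide comes from.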


8. What this is not

  • Not a push-based notification system. Workers poll their own state on backoff; there is no SSE / WebSocket subscription to /api/config/reload. Latency: ≤30s by default.
  • Not a way to bootstrap API_KEY, SECRETS_ENCRYPTION_KEY, or MCP_BASE_URL — those are required for the worker to talk to the API at all and are validated at the bash layer.
  • Not a replacement for the per-task fetchResolvedEnv refresh — that still runs on every task spawn (src/commands/runner.ts) and is what keeps long-running agents in sync with mid-flight credential rotations.
