Worker Credential Recovery
Workers wait for harness credentials at runtime instead of crash-looping the container — how the boot loop, dispatcher gating, and dashboard surface fit together
A worker container always boots and registers, even when the harness credentials it needs (e.g. CLAUDE_CODE_OAUTH_TOKEN) are missing. It parks in a waiting_for_credentials state, the dispatcher routes around it, and the dashboard shows what it's blocked on. The worker self-heals on its next credential check after the credential lands in swarm_config (within 30 seconds by default), with no container restart.
This guide documents that lifecycle, the endpoints that drive it, and the configuration knobs that govern its timing.
This pattern replaces the previous bash-level fail-fast in docker-entrypoint.sh, which exited the container on missing creds and forced operators to restart workers after every credential change.
1. Boot model
The Docker entrypoint is now best-effort. It still does file-prep side effects (codex login, claude-managed config restore from swarm_config, etc.) but does not exit the process on missing harness credentials. The single hard exit it keeps is API_KEY — without that, the worker can't talk to the API at all.
After the entrypoint hands off, the worker process:
- Calls join-swarm so the agent row exists in the DB and is visible on the dashboard.
- Calls awaitCredentials(...) (src/commands/credential-wait.ts), which loops:
  - Runs checkProviderCredentials(provider, env) against the per-adapter predicate.
  - If not ready, calls fetchResolvedEnv(...) to merge the latest swarm_config values into process.env.
  - Re-checks. If still not ready, sleeps with exponential backoff and reports state via PUT /api/agents/{id}/credential-status.
- Once ready, transitions the agent's row to status: idle and starts the task-claim loop.
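The loop above can be sketched roughly as follows. This is a minimal illustration, not the code in src/commands/credential-wait.ts: the WaitDeps interface, the function name awaitCredentialsSketch, the doubling backoff factor, and the final ready report are all assumptions made for the example. The dependencies are injected so the loop can be exercised without a network or real timers.

```typescript
// Hypothetical shapes; the real signatures may differ.
type Env = Record<string, string | undefined>;

interface CredStatus {
  ready: boolean;
  missing: string[];
}

interface WaitDeps {
  check: (env: Env) => CredStatus;               // stands in for checkProviderCredentials
  refreshEnv: () => Promise<Env>;                // stands in for fetchResolvedEnv
  report: (status: CredStatus) => Promise<void>; // stands in for the PUT self-report
  sleep: (ms: number) => Promise<void>;
}

async function awaitCredentialsSketch(
  env: Env,
  deps: WaitDeps,
  initialBackoffMs = 2000,
  maxBackoffMs = 30000,
): Promise<Env> {
  let backoff = initialBackoffMs;
  let current = env;
  for (;;) {
    // 1. Check the credentials already present.
    let status = deps.check(current);
    if (!status.ready) {
      // 2. Merge the latest config values over the current env and re-check.
      current = { ...current, ...(await deps.refreshEnv()) };
      status = deps.check(current);
    }
    // 3. Report the outcome either way (assumed: ready is reported too).
    await deps.report(status);
    if (status.ready) return current;
    // 4. Still blocked: sleep, then grow the delay up to the cap.
    await deps.sleep(backoff);
    backoff = Math.min(backoff * 2, maxBackoffMs); // doubling is an assumption
  }
}
```

In a real worker, sleep would wrap setTimeout and report would issue the PUT request; a caller would start the task-claim loop once the promise resolves.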
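By default the worker never exits because of missing credentials (the opt-in timeout is BOOT_MAX_WAIT_SECONDS, described under Configuration knobs). Operators can set credentials at any time, and the worker picks them up on its next tick.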
2. Agent lifecycle
The agents.status enum is idle | busy | offline | waiting_for_credentials. Adding the fourth value keeps the dispatcher predicate trivial — getIdleWorkersWithCapacity filters on status === 'idle', so blocked workers are excluded from routing without any extra condition.
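To illustrate why a dedicated enum value keeps routing simple, here is a hedged sketch of a dispatcher filter. It is not the real getIdleWorkersWithCapacity; the Agent shape and the activeTasks / maxTasks capacity fields are hypothetical names chosen for the example.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface Agent {
  id: string;
  status: AgentStatus;
  activeTasks: number; // hypothetical capacity field
  maxTasks: number;    // hypothetical capacity field
}

// A single equality check on status excludes parked workers;
// no separate "has credentials" condition is needed.
function idleWorkersWithCapacity(agents: Agent[]): Agent[] {
  return agents.filter((a) => a.status === "idle" && a.activeTasks < a.maxTasks);
}
```

Had the blocked state been modeled as a flag next to status: idle, every consumer of the idle set would need to remember the extra condition.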
The shape of the agent row while waiting:
{
"id": "worker-1",
"status": "waiting_for_credentials",
"credentialMissing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]
}
credentialMissing is null (or absent) in any other state.
| From → To | Trigger |
|---|---|
| idle → waiting_for_credentials | Worker's awaitCredentials finds creds gone (rare; usually only happens at boot) |
| waiting_for_credentials → idle | Worker's next tick finds creds present in process.env after a fetchResolvedEnv refresh |
| any → offline | Heartbeat stops landing |
3. Per-provider predicates
Each adapter exports a checkCredentials(env, opts?): CredStatus (src/providers/credentials.ts dispatches to the right one). The shape:
interface CredStatus {
ready: boolean;
missing: string[];
satisfiedBy?: 'env' | 'file' | 'side-effect-pending';
hint?: string;
}
Provider-by-provider:
| Provider | Ready when |
|---|---|
| claude | CLAUDE_CODE_OAUTH_TOKEN or ANTHROPIC_API_KEY is set |
| claude-managed | All of ANTHROPIC_API_KEY, MANAGED_AGENT_ID, MANAGED_ENVIRONMENT_ID, MCP_BASE_URL are set |
| devin | Both DEVIN_API_KEY and DEVIN_ORG_ID are set |
| codex | ~/.codex/auth.json exists or OPENAI_API_KEY is set (with only the env var, the login dance still needs to run; reported as satisfiedBy: 'side-effect-pending') |
| pi | ~/.pi/agent/auth.json exists; otherwise model-conditional: if MODEL_OVERRIDE resolves to anthropic, openrouter, or openai, that provider's key is required; if it is unset, any one of the three keys is enough |
| opencode | Same shape as pi, with the file at ~/.local/share/opencode/auth.json and model-conditional env keys |
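The env-only rows of the table reduce to two rules: "any of these keys" (claude) and "all of these keys" (claude-managed, devin). The sketch below shows one way such predicates could be written against the CredStatus shape. The helper names checkAnyOf, checkAllOf, and isSet are hypothetical, and treating an empty string as unset is an assumption rather than documented behavior. The file-based providers (codex, pi, opencode) are left out.

```typescript
type Env = Record<string, string | undefined>;

interface CredStatus {
  ready: boolean;
  missing: string[];
  satisfiedBy?: "env" | "file" | "side-effect-pending";
  hint?: string;
}

// Assumption: an empty or whitespace-only value counts as unset.
const isSet = (env: Env, key: string): boolean => (env[key] ?? "").trim() !== "";

// Ready when at least one key is set; otherwise every candidate is listed as missing.
function checkAnyOf(env: Env, keys: string[]): CredStatus {
  const ready = keys.some((k) => isSet(env, k));
  return ready ? { ready, missing: [], satisfiedBy: "env" } : { ready, missing: [...keys] };
}

// Ready only when every key is set; otherwise only the absent keys are listed.
function checkAllOf(env: Env, keys: string[]): CredStatus {
  const missing = keys.filter((k) => !isSet(env, k));
  return missing.length === 0
    ? { ready: true, missing, satisfiedBy: "env" }
    : { ready: false, missing };
}

const checkClaude = (env: Env): CredStatus =>
  checkAnyOf(env, ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"]);

const checkDevin = (env: Env): CredStatus =>
  checkAllOf(env, ["DEVIN_API_KEY", "DEVIN_ORG_ID"]);
```

Listing both keys as missing in the any-of case matches the sample agent row, where a blocked claude worker reports both CLAUDE_CODE_OAUTH_TOKEN and ANTHROPIC_API_KEY.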
4. Endpoints
Two read endpoints (for dashboard / orchestrators) and one write endpoint (for the worker itself).
GET /api/agents/{id}/credential-status
Single-agent snapshot. Response:
{
"agentId": "worker-1",
"name": "worker-1",
"status": "waiting_for_credentials",
"missing": ["CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY"],
"provider": "claude",
"lastCheckedAt": "2026-05-06T21:35:27.791Z"
}
GET /api/agents/credential-status[?status=waiting_for_credentials]
Bulk snapshot across all agents. The optional ?status= filter narrows to one enum value. It powers the dashboard's at-a-glance fleet view without one round-trip per agent.
PUT /api/agents/{id}/credential-status
{ "ready": false, "missing": ["CLAUDE_CODE_OAUTH_TOKEN"] }
Worker self-report. ready: true flips the agent back to idle and clears credentialMissing. ready: false sets waiting_for_credentials with the listed missing keys.
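The effect of a self-report on the agent row can be pictured as a small pure function. This is only a sketch of the semantics described above; the AgentRow and CredentialReport types and the function name applyCredentialReport are hypothetical, and the real handler may apply extra rules (for example around busy or offline agents) that are not covered here.

```typescript
type AgentStatus = "idle" | "busy" | "offline" | "waiting_for_credentials";

interface AgentRow {
  id: string;
  status: AgentStatus;
  credentialMissing: string[] | null;
}

interface CredentialReport {
  ready: boolean;
  missing?: string[];
}

function applyCredentialReport(row: AgentRow, report: CredentialReport): AgentRow {
  if (report.ready) {
    // ready: true returns the agent to idle and clears the missing list.
    return { ...row, status: "idle", credentialMissing: null };
  }
  // ready: false parks the agent and records what it is blocked on.
  return {
    ...row,
    status: "waiting_for_credentials",
    credentialMissing: [...(report.missing ?? [])],
  };
}
```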
There is no /ready endpoint on the worker process today. Workers don't expose an HTTP server; orchestrators that want strict gating should poll GET /api/agents/{id}/credential-status against the API. Tracked as a follow-up if a Kubernetes-style readiness split is needed.
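An orchestrator that wants strict gating could poll the single-agent endpoint along these lines. This is a sketch under assumptions: the function names, the polling interval, the attempt limit, and the rule that idle or busy counts as ready are all choices made for the example, and the Bearer header on the GET request is assumed by analogy with the config API call shown later in this guide. It relies on the global fetch available in recent Node.js versions.

```typescript
interface CredentialSnapshot {
  agentId: string;
  status: string;
  missing: string[];
}

type FetchSnapshot = (agentId: string) => Promise<CredentialSnapshot>;

// Adapter for the real API (assumed auth scheme).
const fetchFromApi = (baseUrl: string, apiKey: string): FetchSnapshot =>
  async (agentId) => {
    const res = await fetch(`${baseUrl}/api/agents/${agentId}/credential-status`, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    if (!res.ok) throw new Error(`credential-status returned ${res.status}`);
    return (await res.json()) as CredentialSnapshot;
  };

const realSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves true once the agent is idle or busy, false if attempts run out.
async function waitUntilReady(
  agentId: string,
  fetchSnapshot: FetchSnapshot,
  sleep: (ms: number) => Promise<void> = realSleep,
  intervalMs = 5000,
  maxAttempts = 60,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const snap = await fetchSnapshot(agentId);
    if (snap.status === "idle" || snap.status === "busy") return true;
    if (attempt < maxAttempts - 1) await sleep(intervalMs);
  }
  return false;
}
```

A deployment script might call waitUntilReady("worker-1", fetchFromApi(baseUrl, apiKey)) and refuse to dispatch work until it resolves true.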
5. Recovering a parked worker
Set the missing credential via the config API and the worker picks it up on its next tick (default ≤30s):
curl -X PUT https://api.example.com/api/config \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"scope": "agent",
"scopeId": "worker-1",
"key": "CLAUDE_CODE_OAUTH_TOKEN",
"value": "sk-ant-oat01-...",
"isSecret": true
}'
Scope choices:
- scope: "agent" means only this worker sees it.
- scope: "global" means all workers inherit it.
The worker's next fetchResolvedEnv call merges the new value into process.env, the predicate flips green, and the agent row transitions to idle.
Caveat: codex side-effect on first credential
codex login --with-api-key runs in the entrypoint as a one-time side effect. If OPENAI_API_KEY arrives after the entrypoint has finished, the worker's predicate goes ready (since the env var is set) but ~/.codex/auth.json may not be written until the next container restart. Plan to restart the codex worker once after setting its first credential. Moving the side effect into TypeScript is tracked as a follow-up.
6. Dashboard surface
The agents list (/agents) shows a WAITING FOR CREDS pill in the Status column for blocked workers. The detail view (/agents/{id}) renders a panel listing each missing variable as a chip plus a remediation hint with the exact PUT /api/config form.
The dashboard polls these endpoints; once you set the credential, the badge clears within the polling interval.
7. Configuration knobs
All read from process.env at function entry — overridable per worker.
| Variable | Default | Effect |
|---|---|---|
| BOOT_INITIAL_BACKOFF_MS | 2000 | Initial sleep between credential checks, in milliseconds |
| BOOT_MAX_BACKOFF_MS | 30000 | Cap on the exponential backoff, in milliseconds |
| BOOT_MAX_WAIT_SECONDS | 0 | If > 0, exit with code 78 (EX_CONFIG) once exceeded. 0 = wait forever |
78 is distinct from generic failures so monitoring can tell "credentials never arrived" apart from a crash.
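As a rough illustration of how these knobs interact, the sketch below parses them with the documented defaults, lists the resulting sleep schedule, and decides when the optional deadline has passed. The helper names (readBootTiming, backoffSchedule, shouldGiveUp) are hypothetical, and the doubling growth factor and the fallback to defaults on unparsable values are assumptions; the guide only states that the backoff is exponential and capped.

```typescript
type Env = Record<string, string | undefined>;

interface BootTiming {
  initialBackoffMs: number;
  maxBackoffMs: number;
  maxWaitSeconds: number;
}

const EX_CONFIG = 78; // exit code used when credentials never arrive

// Falls back to the default when the variable is unset or not a non-negative integer.
function readInt(env: Env, key: string, fallback: number): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : fallback;
}

function readBootTiming(env: Env): BootTiming {
  return {
    initialBackoffMs: readInt(env, "BOOT_INITIAL_BACKOFF_MS", 2000),
    maxBackoffMs: readInt(env, "BOOT_MAX_BACKOFF_MS", 30000),
    maxWaitSeconds: readInt(env, "BOOT_MAX_WAIT_SECONDS", 0),
  };
}

// Sleep durations for the first `attempts` failed checks (assumed factor of 2).
function backoffSchedule(t: BootTiming, attempts: number): number[] {
  const delays: number[] = [];
  let delay = t.initialBackoffMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(delay, t.maxBackoffMs));
    delay = Math.min(delay * 2, t.maxBackoffMs);
  }
  return delays;
}

// 0 means wait forever; otherwise give up once the deadline is exceeded.
function shouldGiveUp(t: BootTiming, elapsedMs: number): boolean {
  return t.maxWaitSeconds > 0 && elapsedMs > t.maxWaitSeconds * 1000;
}
```

Under these assumptions the default schedule is 2000, 4000, 8000, 16000, then 30000 ms for every later check, which is where the "≤30s" pickup latency comes from. A worker that hits shouldGiveUp would call process.exit(EX_CONFIG).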
8. What this is not
- Not a push-based notification system. Workers poll their own state on backoff; there is no SSE / WebSocket subscription to /api/config/reload. Latency: ≤30s by default.
- Not a way to bootstrap API_KEY, SECRETS_ENCRYPTION_KEY, or MCP_BASE_URL. Those are required for the worker to talk to the API at all and are validated at the bash layer.
- Not a replacement for the per-task fetchResolvedEnv refresh. That still runs on every task spawn (src/commands/runner.ts) and is what keeps long-running agents in sync with mid-flight credential rotations.
Related
- Adding a Harness Provider: implement a new provider; includes the checkCredentials predicate contract
- Harness Configuration: provider-by-provider credential setup
- Secrets Encryption: how swarm_config secrets are protected at rest