Observability with OpenTelemetry
Send Agent Swarm traces to SigNoz or any OTLP-compatible backend, filter local runs, and inspect API, worker, MCP, and tool execution spans.
Agent Swarm can emit OpenTelemetry traces from the API server and worker runners. Tracing is disabled by default and turns on when OTEL_EXPORTER_OTLP_ENDPOINT is set.
The same wiring works with SigNoz Cloud, self-hosted SigNoz, Jaeger through an OTLP collector, Honeycomb, Grafana Tempo, and other OTLP-compatible backends.
Setup
SigNoz Cloud
Create or copy an ingestion key from SigNoz Cloud, then set these variables for the API and every worker:
OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.eu2.signoz.cloud
OTEL_EXPORTER_OTLP_HEADERS=signoz-ingestion-key=your-ingestion-key
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=agent-swarm
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,env=production,service.namespace=agent-swarm
Use the region-specific endpoint from your SigNoz Cloud account. For the EU2 region, use https://ingest.eu2.signoz.cloud as the base endpoint; the SDK appends /v1/traces automatically per the OTLP HTTP spec.
Do not commit ingestion keys. OTEL_EXPORTER_OTLP_HEADERS is treated as a secret by Agent Swarm's scrubber, but it is still an active credential and should live in your secret manager or deployment environment.
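As a rough illustration of how the two values above are interpreted, the sketch below builds the traces URL from a base endpoint and splits the header string into key/value pairs. The function names are hypothetical and are not part of Agent Swarm or the OpenTelemetry SDK; the real exporter performs this internally.

```typescript
// Hypothetical helpers for illustration only; not Agent Swarm or SDK APIs.

/** Builds the traces URL from the base OTEL_EXPORTER_OTLP_ENDPOINT value. */
function tracesUrlFromBase(base: string): string {
  // Drop trailing slashes so the result never contains "//v1/traces".
  return base.replace(/\/+$/, "") + "/v1/traces";
}

/** Parses the comma-separated key=value list used by OTEL_EXPORTER_OTLP_HEADERS. */
function parseOtlpHeaders(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip empty or malformed entries
    const key = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    if (key !== "") headers[key] = value;
  }
  return headers;
}

console.log(tracesUrlFromBase("https://ingest.eu2.signoz.cloud"));
console.log(Object.keys(parseOtlpHeaders("signoz-ingestion-key=your-ingestion-key")));
```

The second call logs only the header names, which keeps the ingestion key itself out of any output.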
Local Docker Compose
docker-compose.local.yml passes the OpenTelemetry variables through to the API, lead, Pi worker, and Codex worker services.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.eu2.signoz.cloud"
export OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=your-ingestion-key"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="agent-swarm"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=local,env=local,service.namespace=agent-swarm"
docker compose -f docker-compose.local.yml up --build
The important local filter tags are:
| Attribute | Value | Purpose |
|---|---|---|
| service.name | agent-swarm-api (API) / agent-swarm (workers) | The API process and worker processes report distinct service names; see Service names. |
| deployment.environment | local | Standard OpenTelemetry environment label. |
| env | local | Short convenience label for filtering local/dev traffic. |
| service.namespace | agent-swarm | Groups the API and worker services together; use this to query across both. |
| agentswarm.service.role | api, lead, or worker | Distinguishes API, lead runner, and worker runner spans. |
Production Docker Compose
For production, add the same variables to the API service and every worker service in your compose file:
environment:
- OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
- OTEL_EXPORTER_OTLP_HEADERS=${OTEL_EXPORTER_OTLP_HEADERS}
- OTEL_EXPORTER_OTLP_PROTOCOL=${OTEL_EXPORTER_OTLP_PROTOCOL:-http/protobuf}
- OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-agent-swarm}
- OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES:-deployment.environment=production,env=production,service.namespace=agent-swarm}
OTEL_SERVICE_NAME sets the base service name (default agent-swarm). Agent Swarm derives the per-process service.name from it (see Service names), so a single shared OTEL_SERVICE_NAME is all you need, with no per-service wiring.
Service names: API vs worker
The API process and the worker (and lead) processes report distinct service.name values so they show up as separate service cards in SigNoz:
| Process | service.name |
|---|---|
| API server | agent-swarm-api |
| Worker / lead runner | agent-swarm |
Both are derived per process from OTEL_SERVICE_NAME (the base name, default agent-swarm): the API appends an -api suffix, workers use the base name unchanged. The suffix is applied even when OTEL_SERVICE_NAME is set identically across every process — a shared env var can't collapse the API and workers onto one name.
Both processes still set service.namespace=agent-swarm, so service.namespace = 'agent-swarm' is the filter to use when you want spans from both services in one query.
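The derivation rule described above can be sketched as a small pure function. This is a minimal illustration, not Agent Swarm's actual code: the function name and the `ProcessRole` type are hypothetical, and treating an empty base name the same as an unset one is an assumption.

```typescript
// Hypothetical sketch of the per-process service.name rule.
type ProcessRole = "api" | "lead" | "worker";

function deriveServiceName(role: ProcessRole, baseName?: string): string {
  // Assumption: an empty OTEL_SERVICE_NAME falls back to the default base name.
  const base = baseName !== undefined && baseName.trim() !== "" ? baseName.trim() : "agent-swarm";
  // The API appends "-api"; lead and worker runners use the base name unchanged.
  return role === "api" ? base + "-api" : base;
}

console.log(deriveServiceName("api", "agent-swarm"));
console.log(deriveServiceName("worker", "agent-swarm"));
```

Because the suffix depends on the process role rather than on the env var, passing the same base name to every process still yields two distinct service names.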
Filtering poll noise
Worker long-polls run continuously and account for the majority of span volume in a steady-state deployment. To keep your observability backend focused on real work, Agent Swarm skips the worker.poll (worker side) and the /api/poll request span (API side) by default.
To opt in — useful when debugging queue or claim behavior — set on every API and worker process:
OTEL_TRACE_POLL=1
Truthy values: 1, true, yes, on (case-insensitive). Anything else (including unset/empty) keeps poll spans off. Expect span volume to roughly double when this is enabled.
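The truthy-value rule is simple enough to express directly. The helper below is a hypothetical sketch of that rule (the name `isTruthyFlag` is not an Agent Swarm API); it does not trim whitespace, on the assumption that anything other than the four listed values counts as off.

```typescript
// Hypothetical helper; the name is illustrative, not an Agent Swarm API.
const TRUTHY_VALUES = ["1", "true", "yes", "on"];

function isTruthyFlag(value: string | undefined): boolean {
  if (value === undefined) return false;
  // Case-insensitive match; unset, empty, and any other value count as off.
  return TRUTHY_VALUES.indexOf(value.toLowerCase()) !== -1;
}

console.log(isTruthyFlag("TRUE"));
console.log(isTruthyFlag("0"));
```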
How It Works
At startup, the API server and workers call initOtel(). If OTEL_EXPORTER_OTLP_ENDPOINT is absent, all tracing functions are no-ops. If it is present, Agent Swarm starts the OpenTelemetry Node SDK with an OTLP HTTP trace exporter.
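The "no-op when the endpoint is absent" behavior can be pictured with the sketch below. It is not the real initOtel(): `createSketchTracer` and `SketchTracer` are hypothetical names, and the `exported` array stands in for the OTLP exporter so the example stays self-contained.

```typescript
// Hypothetical stand-in for the enable/disable decision made at startup.
interface SketchTracer {
  enabled: boolean;
  withSpan<T>(name: string, fn: () => T): T;
}

function createSketchTracer(
  env: Record<string, string | undefined>,
  exported: string[], // stand-in for the OTLP exporter
): SketchTracer {
  if (!env["OTEL_EXPORTER_OTLP_ENDPOINT"]) {
    // Endpoint absent or empty: run the callback and record nothing.
    return {
      enabled: false,
      withSpan<T>(_name: string, fn: () => T): T {
        return fn();
      },
    };
  }
  return {
    enabled: true,
    withSpan<T>(name: string, fn: () => T): T {
      try {
        return fn();
      } finally {
        // Record the span even when the callback throws.
        exported.push(name);
      }
    },
  };
}

const sink: string[] = [];
const tracer = createSketchTracer({}, sink);
console.log(tracer.enabled, tracer.withSpan("worker.session", () => 42), sink.length);
```

Callers use the same `withSpan` call in both modes, which is why instrumented code needs no conditional logic of its own.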
The API and workers share trace context over HTTP headers. When a worker calls the API or an MCP tool triggers server-side work, Agent Swarm injects and extracts OpenTelemetry propagation headers so related spans can appear in the same trace where the execution path supports it.
Agent Swarm sets resource attributes once per process:
| Attribute | Source |
|---|---|
| service.name | Per process: API → <base>-api, workers → <base>; base from OTEL_SERVICE_NAME (default agent-swarm) |
| service.version | package.json version |
| service.namespace | OTEL_RESOURCE_ATTRIBUTES or agent-swarm |
| service.instance.id | AGENT_ID or a generated UUID |
| deployment.environment | OTEL_RESOURCE_ATTRIBUTES, NODE_ENV, or development |
| env | OTEL_RESOURCE_ATTRIBUTES.env or deployment.environment |
| agentswarm.service.role | Runtime role: api, lead, or worker |
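The fallback chains in the table for service.namespace, deployment.environment, and env can be sketched as follows. This is an illustrative reading of the table, not Agent Swarm's implementation; both function names are hypothetical.

```typescript
// Hypothetical sketch of the fallback order listed in the table above.
function parseResourceAttributes(raw: string | undefined): Record<string, string> {
  const attrs: Record<string, string> = {};
  if (!raw) return attrs;
  for (const pair of raw.split(",")) {
    const eq = pair.indexOf("=");
    if (eq <= 0) continue; // skip malformed entries
    attrs[pair.slice(0, eq).trim()] = pair.slice(eq + 1).trim();
  }
  return attrs;
}

function resolveEnvironmentAttributes(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const explicit = parseResourceAttributes(env["OTEL_RESOURCE_ATTRIBUTES"]);
  // deployment.environment: explicit attribute, then NODE_ENV, then "development".
  const deploymentEnvironment =
    explicit["deployment.environment"] || env["NODE_ENV"] || "development";
  return {
    "service.namespace": explicit["service.namespace"] || "agent-swarm",
    "deployment.environment": deploymentEnvironment,
    // env: explicit attribute, then whatever deployment.environment resolved to.
    env: explicit["env"] || deploymentEnvironment,
  };
}

console.log(resolveEnvironmentAttributes({ NODE_ENV: "production" }));
```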
Sensitive exception messages, status messages, tool previews, and OTLP auth headers are scrubbed before they leave the process.
Traces Emitted
API HTTP Requests
Every API request is wrapped in a span named after its route, following the OpenTelemetry HTTP server semantic conventions: {METHOD} {route-template} — for example GET /api/tasks/{id} or POST /api/tasks.
The route template is low-cardinality: a request to /api/tasks/abc-123 and one to /api/tasks/def-456 share the span name GET /api/tasks/{id}, so SigNoz can group and aggregate by endpoint without raw IDs fragmenting the data. Requests that don't match a registered route — core routes (/health, /ping, /me), the MCP transport, and 404s — fall back to {METHOD} /{first-segment} (e.g. GET /health, POST /mcp), or a bare {METHOD} for the root path. The full raw request path is always preserved on the url.path attribute.
The same route template is also published on the http.route span attribute, so SigNoz can group, filter, and aggregate by endpoint as a first-class field instead of parsing it out of the span name. http.route is omitted (not fabricated) for requests that don't match a registered route.
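The naming rules above can be summarized in a short sketch. `describeHttpSpan` is a hypothetical helper, not the server's actual code, and it assumes `path` is passed without a query string.

```typescript
// Hypothetical sketch of the span-name and http.route rules described above.
interface HttpSpanDescription {
  name: string;
  attributes: Record<string, string>;
}

function describeHttpSpan(
  method: string,
  path: string, // assumed to carry no query string
  routeTemplate?: string, // set only when a registered route matched
): HttpSpanDescription {
  const verb = method.toUpperCase();
  const attributes: Record<string, string> = {
    "http.request.method": verb,
    "url.path": path, // the raw path is always preserved
  };
  if (routeTemplate) {
    attributes["http.route"] = routeTemplate;
    return { name: verb + " " + routeTemplate, attributes };
  }
  // Unmatched request: fall back to the first path segment and omit http.route.
  const segments = path.split("/").filter((segment) => segment.length > 0);
  const first = segments[0];
  return { name: first ? verb + " /" + first : verb, attributes };
}

console.log(describeHttpSpan("GET", "/api/tasks/abc-123", "/api/tasks/{id}").name);
console.log(describeHttpSpan("GET", "/health").name);
```

Two task requests with different IDs produce the same span name and http.route, while url.path keeps the distinguishing ID.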
Request handling runs inside the HTTP server span's active context, so server-side spans created while serving the request — notably the mcp.tool spans from MCP tool calls — nest underneath it as children rather than appearing as disconnected root spans.
Earlier releases named every API request span http.server. If you have saved SigNoz queries or dashboards that filter on name = 'http.server', switch them to filter on the agentswarm.component = 'api' attribute (set on every API span) and group by span name or the http.route attribute to get a per-endpoint breakdown.
Common attributes:
| Attribute | Description |
|---|---|
| http.request.method | HTTP method |
| http.route | Low-cardinality route template, e.g. /api/tasks/{id}. Omitted for unmatched core/MCP/404 paths. |
| url.path | Raw request path (full path, including IDs) |
| url.scheme | Request scheme, https or http (honors X-Forwarded-Proto) |
| server.address | Request host with the port stripped (honors X-Forwarded-Host) |
| network.protocol.version | HTTP protocol version, e.g. 1.1 or 2 |
| user_agent.original | Raw User-Agent request header |
| http.response.status_code | HTTP response status code (SigNoz surfaces this as the responseStatusCode column) |
| agentswarm.component | api |
| agentswarm.http.duration_ms | Server-side request duration |
Useful for:
- API latency and error-rate dashboards
- task creation, polling, and completion request inspection
- checking whether workers are reaching the API
MCP Tool Calls
Server-side MCP tool handlers emit one span per call, named mcp.tool <tool-name> — for example mcp.tool store-progress — so the executed tool is readable straight from the trace tree. Tool names are a fixed enum, so the span name stays low-cardinality. These spans nest under the API HTTP request span that triggered them.
Common attributes:
| Attribute | Description |
|---|---|
| mcp.tool.name | Registered MCP tool name |
| mcp.tool.result_content_count | Number of returned content items |
| mcp.tool.is_error | Whether the MCP result is an error |
| agentswarm.task.id | Source task ID when available |
| agentswarm.tool.args_preview | Scrubbed, truncated argument preview |
| agentswarm.tool.result_preview | Scrubbed, truncated result preview |
Useful for:
- finding slow or failing MCP tools
- confirming tool calls are attached to a task
- comparing tool usage across agents and harnesses
Worker Polling
Worker poll loops emit worker.poll spans.
Common attributes:
| Attribute | Description |
|---|---|
| agentswarm.poll.result | empty, task_assigned, task_offered, pool_tasks_available, or another trigger type |
| agentswarm.worker.poll_timeout_ms | Long-poll timeout |
| agentswarm.worker.poll_interval_ms | Worker poll interval |
Useful for:
- seeing whether workers are idle or receiving tasks
- debugging queue and claim behavior
- spotting excessive polling or API contention
Worker Sessions
Worker task execution emits worker.session.create and worker.session spans.
Common attributes:
| Attribute | Description |
|---|---|
| agentswarm.task.id | Logical task ID |
| agentswarm.task.real_id | Real task ID after pool-task claim resolution |
| agentswarm.agent.role | Agent role from the registered agent |
| agentswarm.harness_provider | claude, pi, codex, opencode, or another provider |
| agentswarm.provider.session_id | Provider session ID when available |
| agentswarm.session.cwd | Session working directory |
| agentswarm.session.vcs_repo | VCS repo attached to the session when set |
| agentswarm.session.duration_ms | Session duration |
| agentswarm.session.exit_code | Runner exit code |
| agentswarm.session.outcome | ok or error |
| gen_ai.request.model | Requested model |
| gen_ai.response.model | Model reported by the provider, when cost data exists |
| gen_ai.usage.input_tokens | Input token count, when available |
| gen_ai.usage.output_tokens | Output token count, when available |
| agentswarm.cost.total_usd | Provider-reported or computed session cost |
Useful for:
- full task execution timelines
- model and provider comparisons
- task cost and token usage inspection
- finding sessions that failed before calling store-progress
Worker Tool Executions
Worker-side provider events emit worker.tool or worker.mcp.tool spans.
Common attributes:
| Attribute | Description |
|---|---|
| agentswarm.tool.name | Raw tool name reported by the harness |
| agentswarm.tool.normalized_name | Normalized tool name |
| agentswarm.tool.kind | mcp, shell/tool, or provider-specific kind |
| agentswarm.tool.call_id | Provider tool-call ID when available |
| mcp.tool.name | MCP tool name when the worker calls an MCP tool |
| agentswarm.tool.duration_ms | Tool duration |
| agentswarm.tool.args_preview | Scrubbed, truncated args |
| agentswarm.tool.result_preview | Scrubbed, truncated result |
| agentswarm.tool.missing_start | Result arrived without a matching start event |
| agentswarm.tool.implicit_close | Span closed at the assistant-message boundary because the adapter didn't emit a per-tool completion event. Applies to both worker.tool and worker.mcp.tool under the Claude harness. |
| agentswarm.tool.unclosed | Session ended before any tool_end or assistant-message boundary fired; should be very rare |
How tool spans close
Under the Claude SDK adapter, neither harness-side tools (Bash/Read/Edit/etc.) nor MCP tools receive per-tool completion events in the JSONL stream. Both kinds therefore close on the same path:
- worker.tool and worker.mcp.tool (Claude harness): close at the next assistant-message boundary, tagged agentswarm.tool.implicit_close=true. duration_ms is wall-clock time from tool_start until the next assistant turn, which includes tool execution plus the model round-trip after the tool result returned. This slightly overcounts execution time but covers the typical case.
- Other adapters that do emit explicit tool_end (e.g. Claude Managed Agents, Codex, opencode): close on the tool_end event with duration_ms set to true execution time. No implicit_close attribute.
- agentswarm.tool.unclosed=true: safety net for spans where the session ended before any assistant-message boundary arrived (e.g. the session crashed mid-tool). Should be very rare; treat its presence as a signal worth investigating.
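The three closure paths can be modeled as a small event loop. The sketch below is a simplified illustration under stated assumptions: the event shapes and the `closeToolSpans` name are hypothetical, timestamps are plain milliseconds, and the missing_start case is skipped rather than modeled.

```typescript
// Hypothetical event model; the real adapters emit richer provider events.
type ToolEvent =
  | { type: "tool_start"; callId: string; atMs: number }
  | { type: "tool_end"; callId: string; atMs: number }
  | { type: "assistant_message"; atMs: number }
  | { type: "session_end"; atMs: number };

interface ClosedToolSpan {
  callId: string;
  durationMs: number;
  implicitClose: boolean;
  unclosed: boolean;
}

function closeToolSpans(events: ToolEvent[]): ClosedToolSpan[] {
  const openStarts: Record<string, number> = {}; // callId -> start time
  const closed: ClosedToolSpan[] = [];

  // Closes every still-open span at the given time with the given flags.
  const closeAllOpen = (atMs: number, implicitClose: boolean, unclosed: boolean): void => {
    for (const callId of Object.keys(openStarts)) {
      const startMs = openStarts[callId];
      if (startMs === undefined) continue;
      closed.push({ callId, durationMs: atMs - startMs, implicitClose, unclosed });
      delete openStarts[callId];
    }
  };

  for (const event of events) {
    if (event.type === "tool_start") {
      openStarts[event.callId] = event.atMs;
    } else if (event.type === "tool_end") {
      const startMs = openStarts[event.callId];
      if (startMs === undefined) continue; // missing_start case, not modeled here
      // Explicit completion: duration is the true execution time.
      closed.push({ callId: event.callId, durationMs: event.atMs - startMs, implicitClose: false, unclosed: false });
      delete openStarts[event.callId];
    } else if (event.type === "assistant_message") {
      closeAllOpen(event.atMs, true, false); // implicit close at the boundary
    } else {
      closeAllOpen(event.atMs, false, true); // session ended: safety net
    }
  }
  return closed;
}

console.log(
  closeToolSpans([
    { type: "tool_start", callId: "a", atMs: 0 },
    { type: "assistant_message", atMs: 1200 },
    { type: "tool_start", callId: "b", atMs: 2000 },
    { type: "session_end", atMs: 2500 },
  ]),
);
```

In this run, span "a" closes implicitly at the assistant-message boundary and span "b" is flagged unclosed because the session ended first.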
Useful for:
- shell command latency
- MCP usage inside full worker traces
- detecting tools that never completed
- finding repeated tool loops or unusually expensive tool phases
Provider, Progress, Context, and Compaction Events
Provider stream events are attached to active spans as attributes or events where possible.
Common attributes:
| Attribute | Description |
|---|---|
| agentswarm.provider.name | Provider name |
| agentswarm.provider.event_name | Provider custom event name |
| agentswarm.provider.event_data_preview | Scrubbed, truncated provider data |
| gen_ai.message.role | Message role |
| gen_ai.message.content_preview | Scrubbed, truncated message content |
| agentswarm.progress.message | Scrubbed, truncated progress text |
| agentswarm.context.used_tokens | Context tokens used |
| agentswarm.context.total_tokens | Total context window |
| agentswarm.context.percent | Context usage percent |
| agentswarm.compaction.trigger | Compaction trigger |
| agentswarm.compaction.pre_tokens | Tokens before compaction |
Claude Code Telemetry
The Claude Code CLI emits its own OpenTelemetry signal — metrics, log events, and (in beta) traces — from inside the worker subprocess. This is separate from the Agent Swarm spans described above: it is the model's own view of each interaction (claude_code.interaction), every LLM request, and every tool call.
Agent Swarm leaves Claude Code's telemetry off by default and never force-enables it. Two independent controls govern it:
- The operator enables Claude Code's exporters through swarm config — the env vars below. This is what makes Claude Code emit anything at all.
- The SWARM_ENABLE_HARNESS_OTEL gate makes the adapter inject a TRACEPARENT at spawn time so the harness's spans nest inside the worker's trace (and, for Claude Code, pins privacy-safe logging defaults). This is what makes the two services show up as one end-to-end trace. The same gate also covers Codex; see Codex Telemetry.
Enabling Claude Code's exporters
Claude Code reads the standard OTEL_* exporter variables — endpoint, headers, protocol — which Agent Swarm already forwards to the subprocess. To turn its telemetry on, set these as swarm config (global, or agent-scoped to roll out gradually):
CLAUDE_CODE_ENABLE_TELEMETRY=1 # master switch — required
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 # beta traces (claude_code.interaction et al.)
OTEL_TRACES_EXPORTER=otlp # traces
OTEL_METRICS_EXPORTER=otlp # token / cost / lines-of-code metrics
OTEL_LOGS_EXPORTER=otlp # user_prompt / tool_use log events
These are independent of the SWARM_ENABLE_HARNESS_OTEL gate by design: an operator can run Claude Code telemetry on a single agent without any code change, and separately flip the gate when cross-service trace linking should turn on.
The SWARM_ENABLE_HARNESS_OTEL gate
SWARM_ENABLE_HARNESS_OTEL is the single gate for harness-subprocess trace linking — it covers both Claude Code and Codex. When it is truthy (1, true, yes, on — case-insensitive), the adapter, on every session spawn:
- Injects TRACEPARENT (and TRACESTATE when present) derived from the active worker.session span. A harness launched with this env var parents its own root span to the worker's trace instead of starting a disconnected root. (Claude Code reads TRACEPARENT in non-interactive -p mode; Codex's Rust OpenTelemetry SDK reads it via the standard tracecontext propagator.)
- Pins Claude-Code-specific privacy defaults (claude only): OTEL_LOG_USER_PROMPTS=0, OTEL_LOG_TOOL_DETAILS=0, OTEL_LOG_TOOL_CONTENT=0, OTEL_METRICS_INCLUDE_ACCOUNT_UUID=false. These are only set when the operator has not already set them explicitly. Codex does not read these env vars; it has no equivalent.
The gate is read per-spawn from the resolved swarm config, so flipping it takes effect on the next session — no container restart required. When the gate is off, spawn behavior is unchanged: a harness with its own exporters enabled still emits, but its root span is disconnected from the worker trace.
Migration note. The gate was originally introduced as SWARM_ENABLE_CLAUDE_CODE_OTEL. That name is kept as a deprecated alias — a truthy value of either SWARM_ENABLE_HARNESS_OTEL or SWARM_ENABLE_CLAUDE_CODE_OTEL turns injection on for every harness. Prefer SWARM_ENABLE_HARNESS_OTEL in new config; the alias may be removed in a future release.
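Putting the gate, its deprecated alias, the TRACEPARENT injection, and the Claude-only privacy defaults together, the per-spawn logic might look roughly like this. It is a sketch under assumptions: `buildHarnessEnv` and `ActiveSpanContext` are hypothetical names, the span context is passed in as plain hex strings, and sampling is encoded in the trace flags rather than used to skip injection. The traceparent layout (version, trace ID, parent ID, flags) follows the W3C Trace Context format.

```typescript
// Hypothetical sketch of the spawn-time env additions; not the adapter's code.
interface ActiveSpanContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
  traceState?: string;
}

function isTruthy(value: string | undefined): boolean {
  return value !== undefined && ["1", "true", "yes", "on"].indexOf(value.toLowerCase()) !== -1;
}

function formatTraceparent(ctx: ActiveSpanContext): string {
  // W3C Trace Context: version "00", trace-id, parent-id, flags ("01" = sampled).
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + (ctx.sampled ? "01" : "00");
}

const CLAUDE_PRIVACY_DEFAULTS: Record<string, string> = {
  OTEL_LOG_USER_PROMPTS: "0",
  OTEL_LOG_TOOL_DETAILS: "0",
  OTEL_LOG_TOOL_CONTENT: "0",
  OTEL_METRICS_INCLUDE_ACCOUNT_UUID: "false",
};

function buildHarnessEnv(
  harness: "claude" | "codex",
  config: Record<string, string | undefined>, // resolved swarm config
  span: ActiveSpanContext | undefined, // active worker.session span, if any
): Record<string, string> {
  const extra: Record<string, string> = {};
  // Either the current name or the deprecated alias turns the gate on.
  const gateOn =
    isTruthy(config["SWARM_ENABLE_HARNESS_OTEL"]) ||
    isTruthy(config["SWARM_ENABLE_CLAUDE_CODE_OTEL"]);
  if (!gateOn) return extra; // gate off: spawn behavior unchanged

  if (span) {
    extra["TRACEPARENT"] = formatTraceparent(span);
    if (span.traceState) extra["TRACESTATE"] = span.traceState;
  }
  if (harness === "claude") {
    for (const key of Object.keys(CLAUDE_PRIVACY_DEFAULTS)) {
      const fallback = CLAUDE_PRIVACY_DEFAULTS[key];
      // Only pin a default when the operator has not set the variable.
      if (fallback !== undefined && config[key] === undefined) extra[key] = fallback;
    }
  }
  return extra;
}

console.log(
  buildHarnessEnv(
    "codex",
    { SWARM_ENABLE_HARNESS_OTEL: "true" },
    { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", spanId: "00f067aa0ba902b7", sampled: true },
  ),
);
```

Computing the additions from the resolved config on every call mirrors the per-spawn behavior: a changed gate value is picked up by the next session without any restart.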
Expected SigNoz behavior
- A new claude-code service card appears. Claude Code overrides OTEL_SERVICE_NAME with its own service.name for its spans. This is by design and expected; Agent Swarm does not strip or re-set it.
- Span hierarchy. With the gate on, Claude Code's spans nest inside the worker trace:

    worker.session
    └── worker.session.create
        └── claude_code.interaction
            ├── claude_code.llm_request
            └── claude_code.tool

  A complete-task query (agentswarm.task.id = '<task-id>') then returns the worker's worker.tool spans and Claude Code's claude_code.* spans in a single trace.
- Log events. user_prompt and tool_use events are emitted with content redacted by default (see Privacy below).
Privacy posture. Agent Swarm's scrubSecrets does not run on Claude Code's exported payloads — they travel straight from the Claude Code process to your OTLP backend. The gate keeps OTEL_LOG_USER_PROMPTS, OTEL_LOG_TOOL_DETAILS, and OTEL_LOG_TOOL_CONTENT all at 0 so prompt and tool content never leave the process. Do not flip these to 1 without a scrubbing story for Claude Code's payloads.
Rollout
SWARM_ENABLE_HARNESS_OTEL is off by default. Recommended sequence: enable Claude Code's exporters on one agent, flip the gate agent-scoped on that same agent to validate the nested trace in SigNoz, watch span volume, then widen to global config.
Codex Telemetry
The Codex CLI also emits its own OpenTelemetry traces. Like Claude Code, it starts a fresh root span unless it is handed a W3C trace context at spawn — so the same SWARM_ENABLE_HARNESS_OTEL gate injects TRACEPARENT into the Codex subprocess env, and Codex's Rust OpenTelemetry SDK parents its spans to the worker trace via the standard tracecontext propagator.
Enabling Codex's exporters
Codex configures its OTLP exporter through TOML, not env vars — an [otel.exporter] block in ~/.codex/config.toml (endpoint, headers, protocol, plus otel.trace_exporter / otel.metrics_exporter). Setting that up is an operator config step and is out of scope for the trace-linking gate: SWARM_ENABLE_HARNESS_OTEL only injects TRACEPARENT; it does not enable Codex telemetry itself.
What the gate does for Codex
When SWARM_ENABLE_HARNESS_OTEL (or the deprecated SWARM_ENABLE_CLAUDE_CODE_OTEL alias) is on and a sampled worker.session span is active, the codex-adapter injects TRACEPARENT (and TRACESTATE when present) into the minimal env it builds for the Codex subprocess.
No privacy-default env vars are set for Codex — the OTEL_LOG_* switches are Claude-Code-specific and Codex does not read them. Codex's own redaction is governed by its TOML otel.* settings, which the operator controls separately.
With Codex's exporters enabled and the gate on, Codex's spans appear in the same end-to-end trace as worker.session, exactly as Claude Code's do.
Metrics
Agent Swarm currently emits traces, not standalone OTLP metrics. In SigNoz, create useful operational metrics from trace aggregations:
| Metric | Query shape |
|---|---|
| API request rate | rate(count()) over agentswarm.component = 'api', grouped by span name |
| API p95 latency | p95(durationNano) over agentswarm.component = 'api', grouped by span name |
| API errors | count() where hasError = true or responseStatusCode >= 500 |
| Worker task throughput | count() over name = 'worker.session', grouped by agentswarm.service.role and agentswarm.harness_provider |
| Worker session p95 duration | p95(durationNano) over name = 'worker.session' |
| Tool call volume | count() over name IN ('worker.tool', 'worker.mcp.tool') plus name LIKE 'mcp.tool %' for server-side spans, grouped by agentswarm.tool.name or mcp.tool.name |
| Slow tools | p95(durationNano) over tool spans, grouped by tool name |
| Cost by model | sum(agentswarm.cost.total_usd) over session spans, grouped by gen_ai.response.model |
| Token usage by model | sum(gen_ai.usage.input_tokens) and sum(gen_ai.usage.output_tokens), grouped by model |
| Poll behavior | count() over worker.poll, grouped by agentswarm.poll.result |
Useful SigNoz Queries
Use these in SigNoz Traces Explorer or as dashboard widget filters.
The API process reports service.name = 'agent-swarm-api' and workers report service.name = 'agent-swarm' (see Service names). Queries that should span both filter on service.namespace = 'agent-swarm' instead; worker-only queries keep service.name = 'agent-swarm'.
All Local Agent Swarm Traffic
service.namespace = 'agent-swarm' AND env = 'local'
Production Traffic Only
service.namespace = 'agent-swarm' AND deployment.environment = 'production'
A Complete Task Execution
service.namespace = 'agent-swarm' AND agentswarm.task.id = '<task-id>'
Start here when debugging a concrete task. You should see session spans, tool spans, and any server-side MCP spans that carried the task ID.
Worker Sessions for One Harness
service.name = 'agent-swarm'
AND name = 'worker.session'
AND agentswarm.harness_provider = 'pi'
Replace pi with claude, codex, opencode, or another provider.
Slow Worker Tools
service.name = 'agent-swarm'
AND name IN ('worker.tool', 'worker.mcp.tool')
AND durationNano > 5000000000
durationNano > 5000000000 means slower than five seconds.
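To pick a different threshold, convert seconds to nanoseconds (1 second = 1,000,000,000 ns). The helper names below are hypothetical and exist only to show the arithmetic.

```typescript
// durationNano is expressed in nanoseconds; 1 second = 1e9 ns.
const NANOS_PER_SECOND = 1e9;

function secondsToDurationNano(seconds: number): number {
  return Math.round(seconds * NANOS_PER_SECOND);
}

// Hypothetical helper that renders the filter clause for a given threshold.
function slowerThan(seconds: number): string {
  return "durationNano > " + secondsToDurationNano(seconds);
}

console.log(slowerThan(5));
console.log(slowerThan(0.5));
```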
MCP Tool Calls for a Task
service.namespace = 'agent-swarm'
AND agentswarm.task.id = '<task-id>'
AND (name LIKE 'mcp.tool %' OR name = 'worker.mcp.tool')
Server-side mcp.tool <tool-name> spans live under agent-swarm-api and worker.mcp.tool under agent-swarm, so this query filters on service.namespace to catch both. The LIKE 'mcp.tool %' pattern matches every per-tool span name (mcp.tool store-progress, mcp.tool poll-task, …).
Failed or Error Spans
service.namespace = 'agent-swarm' AND hasError = true
For HTTP 5xx responses:
service.name = 'agent-swarm-api'
AND agentswarm.component = 'api'
AND responseStatusCode >= 500
Unclosed Tool Calls
service.name = 'agent-swarm'
AND agentswarm.tool.unclosed = true
This should be a rare safety-net signal: the session ended (or crashed) before either an explicit tool_end or the assistant-message boundary fired. For the typical Claude-harness flow, both worker.tool and worker.mcp.tool spans close on the boundary with agentswarm.tool.implicit_close=true, not unclosed=true.
Implicit-Closed Tool Calls
service.name = 'agent-swarm'
AND agentswarm.tool.implicit_close = true
This is the expected closure path under the Claude adapter for both worker.tool spans (Bash/Read/Edit/etc.) and worker.mcp.tool spans: the adapter doesn't emit per-tool completion events for either kind, so the runner closes them at the next assistant-message boundary. duration_ms is wall-clock time from tool_start to the next assistant message, which slightly overcounts the actual tool execution time (it includes the model round-trip after the tool result returned). Spans closed by an adapter-emitted explicit tool_end (Codex, opencode, Claude Managed Agents) won't have this tag.
Sessions with Cost Data
service.name = 'agent-swarm'
AND name = 'worker.session'
AND agentswarm.cost.total_usd > 0
Context Pressure
service.name = 'agent-swarm'
AND agentswarm.context.percent >= 80
Dashboard Ideas
Start with these panels:
| Panel | Signal |
|---|---|
| API request rate by route | Trace count over agentswarm.component = 'api', grouped by span name |
| API p95 latency by route | p95(durationNano) over agentswarm.component = 'api', grouped by span name |
| Worker sessions by harness | Trace count over worker.session, grouped by agentswarm.harness_provider |
| Worker session duration | p95(durationNano) over worker.session, grouped by harness |
| Tool calls by name | Trace count over tool spans, grouped by agentswarm.tool.name |
| Slowest tools | p95(durationNano) over tool spans |
| Cost by model | sum(agentswarm.cost.total_usd), grouped by gen_ai.response.model |
| Poll outcomes | Trace count over worker.poll, grouped by agentswarm.poll.result |
| Errors by span name | Trace count where hasError = true, grouped by name |
Troubleshooting
No traces appear
Check that every process has the exporter endpoint:
docker compose exec api env | grep OTEL
docker compose exec lead env | grep OTEL
docker compose exec worker-1 env | grep OTEL
Then confirm the API is logging OTel startup:
docker compose logs api | grep OTel
If OTEL_EXPORTER_OTLP_ENDPOINT is empty, tracing is intentionally disabled.
SigNoz returns authentication errors
For SigNoz Cloud, make sure:
- OTEL_EXPORTER_OTLP_ENDPOINT is the base URL (no /v1/traces suffix; the SDK appends it)
- OTEL_EXPORTER_OTLP_HEADERS is exactly signoz-ingestion-key=<key>
- OTEL_EXPORTER_OTLP_PROTOCOL is http/protobuf
- the key belongs to the same SigNoz region as the ingest endpoint
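The first three checks in this list can be automated before deploying. The function below is a hypothetical pre-flight sketch, not an Agent Swarm command; it cannot verify that the key and endpoint belong to the same region, so that check stays manual.

```typescript
// Hypothetical pre-flight check for the SigNoz Cloud settings listed above.
function checkSignozCloudConfig(env: Record<string, string | undefined>): string[] {
  const problems: string[] = [];
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"] || "";
  const headers = env["OTEL_EXPORTER_OTLP_HEADERS"] || "";
  const protocol = env["OTEL_EXPORTER_OTLP_PROTOCOL"] || "";

  if (endpoint === "") {
    problems.push("OTEL_EXPORTER_OTLP_ENDPOINT is empty, so tracing is disabled");
  } else if (/\/v1\/traces\/?$/.test(endpoint)) {
    problems.push("OTEL_EXPORTER_OTLP_ENDPOINT should be the base URL without /v1/traces");
  }
  // Exactly one header of the form signoz-ingestion-key=<key>.
  if (!/^signoz-ingestion-key=[^,\s]+$/.test(headers)) {
    problems.push("OTEL_EXPORTER_OTLP_HEADERS should be exactly signoz-ingestion-key=<key>");
  }
  if (protocol !== "http/protobuf") {
    problems.push("OTEL_EXPORTER_OTLP_PROTOCOL should be http/protobuf");
  }
  return problems;
}

console.log(
  checkSignozCloudConfig({
    OTEL_EXPORTER_OTLP_ENDPOINT: "https://ingest.eu2.signoz.cloud/v1/traces",
    OTEL_EXPORTER_OTLP_HEADERS: "signoz-ingestion-key=your-ingestion-key",
    OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf",
  }),
);
```

The messages name the variable at fault but never echo the header value, so the ingestion key does not end up in logs.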
Local and production traces are mixed
Set explicit resource attributes in every environment:
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local,env=local,service.namespace=agent-swarm
For production:
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,env=production,service.namespace=agent-swarm
Then filter with env or deployment.environment.
API and workers show as separate services
This is intentional. The API process reports service.name = agent-swarm-api and worker/lead processes report service.name = agent-swarm so they appear as separate service cards in SigNoz — see Service names.
To query across both at once, filter on service.namespace = 'agent-swarm' (set by every process) instead of service.name. Within a single service, agentswarm.service.role further splits api, lead, and worker spans.
Related
- Deployment Guide - production Docker Compose setup
- Environment Variables - OpenTelemetry environment variable reference
- Telemetry - anonymized product telemetry, separate from your OpenTelemetry traces