Send Agent Swarm traces and OTLP metrics to SigNoz or any compatible backend, filter local runs, and inspect API, worker, MCP, and tool execution telemetry.

Agent Swarm can emit OpenTelemetry traces and OTLP metrics from the API server and worker runners. Telemetry is disabled by default and turns on when OTEL_EXPORTER_OTLP_ENDPOINT is set.

The same wiring works with SigNoz Cloud, self-hosted SigNoz, Jaeger through an OTLP collector, Honeycomb, Grafana Tempo, Datadog, and other OTLP-compatible backends.

Setup

SigNoz Cloud

Create or copy an ingestion key from SigNoz Cloud, then set these variables for the API and every worker:

.env

OTEL_EXPORTER_OTLP_ENDPOINT=https://ingest.eu2.signoz.cloud
OTEL_EXPORTER_OTLP_HEADERS=signoz-ingestion-key=your-ingestion-key
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_SERVICE_NAME=agent-swarm
OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,env=production,service.namespace=agent-swarm

Use the region-specific endpoint from your SigNoz Cloud account. For the EU2 region, use https://ingest.eu2.signoz.cloud as the base endpoint. The SDK appends the per-signal path automatically (/v1/traces for traces, /v1/metrics for metrics), per the OTLP HTTP exporter specification.
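
As a rough illustration of why the base URL is enough, the sketch below joins a base endpoint with the per-signal path the way the OTLP HTTP exporter specification describes. The helper name `otlp_signal_url` is hypothetical and is not part of Agent Swarm or the OpenTelemetry SDK.

```python
def otlp_signal_url(base_endpoint: str, signal: str) -> str:
    """Join a base OTLP/HTTP endpoint with the per-signal path (v1/<signal>).

    Hypothetical helper for illustration only; the real SDK does this internally.
    """
    if signal not in ("traces", "metrics", "logs"):
        raise ValueError(f"unknown OTLP signal: {signal}")
    # Strip a trailing slash so the result never contains "//v1/...".
    return base_endpoint.rstrip("/") + "/v1/" + signal


print(otlp_signal_url("https://ingest.eu2.signoz.cloud", "traces"))
print(otlp_signal_url("https://ingest.eu2.signoz.cloud", "metrics"))
```

If you put /v1/traces into OTEL_EXPORTER_OTLP_ENDPOINT yourself, the path would be appended a second time, which is why the troubleshooting section asks for the base URL only.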

Do not commit ingestion keys. OTEL_EXPORTER_OTLP_HEADERS is treated as a secret by Agent Swarm's scrubber, but it is still an active credential and should live in your secret manager or deployment environment.

Local Docker Compose

docker-compose.local.yml passes the OpenTelemetry variables through to the API, lead, Pi worker, and Codex worker services.

export OTEL_EXPORTER_OTLP_ENDPOINT="https://ingest.eu2.signoz.cloud"
export OTEL_EXPORTER_OTLP_HEADERS="signoz-ingestion-key=your-ingestion-key"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_SERVICE_NAME="agent-swarm"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=local,env=local,service.namespace=agent-swarm"

docker compose -f docker-compose.local.yml up --build

The important local filter tags are:

Attribute	Value	Purpose
`service.name`	`agent-swarm-api` (API) / `agent-swarm` (workers)	The API process and worker processes report distinct service names — see Service names.
`deployment.environment`	`local`	Standard OpenTelemetry environment label.
`env`	`local`	Short convenience label for filtering local/dev traffic.
`service.namespace`	`agent-swarm`	Groups the API and worker services together — use this to query across both.
`agentswarm.service.role`	`api`, `lead`, or `worker`	Distinguishes API, lead runner, and worker runner spans.

Production Docker Compose

For production, add the same variables to the API service and every worker service in your compose file:

docker-compose.yml

environment:
  - OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}
  - OTEL_EXPORTER_OTLP_HEADERS=${OTEL_EXPORTER_OTLP_HEADERS}
  - OTEL_EXPORTER_OTLP_PROTOCOL=${OTEL_EXPORTER_OTLP_PROTOCOL:-http/protobuf}
  - OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME:-agent-swarm}
  - OTEL_RESOURCE_ATTRIBUTES=${OTEL_RESOURCE_ATTRIBUTES:-deployment.environment=production,env=production,service.namespace=agent-swarm}

OTEL_SERVICE_NAME sets the base service name (default agent-swarm). Agent Swarm derives the per-process service.name from it — see Service names — so a single shared OTEL_SERVICE_NAME is all you need; no per-service wiring.

Service names: API vs worker

The API process and the worker (and lead) processes report distinct service.name values so they show up as separate service cards in SigNoz:

Process	`service.name`
API server	`agent-swarm-api`
Worker / lead runner	`agent-swarm`

Both are derived per process from OTEL_SERVICE_NAME (the base name, default agent-swarm): the API appends an -api suffix, workers use the base name unchanged. The suffix is applied even when OTEL_SERVICE_NAME is set identically across every process — a shared env var can't collapse the API and workers onto one name.
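
The derivation rule above can be sketched in a few lines. This is a minimal model of the described behavior, not Agent Swarm's actual code; the function name `derive_service_name` and its parameters are assumptions made for the example.

```python
import os

DEFAULT_BASE_NAME = "agent-swarm"  # documented default for OTEL_SERVICE_NAME


def derive_service_name(role, base=None):
    """Return the per-process service.name for a role: 'api', 'lead', or 'worker'.

    Hypothetical helper: the API gets an "-api" suffix, every other role
    uses the base name unchanged.
    """
    base = base or os.environ.get("OTEL_SERVICE_NAME") or DEFAULT_BASE_NAME
    return f"{base}-api" if role == "api" else base


for role in ("api", "lead", "worker"):
    print(role, "->", derive_service_name(role, base="agent-swarm"))
```

Because the suffix depends on the process role rather than on the environment variable, a custom base such as `my-swarm` would yield `my-swarm-api` and `my-swarm` under this model.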

Both processes still set service.namespace=agent-swarm, so service.namespace = 'agent-swarm' is the filter to use when you want spans from both services in one query.

Filtering poll noise

Worker long-polls run continuously and account for the majority of span volume in a steady-state deployment. To keep your observability backend focused on real work, Agent Swarm skips the worker.poll span (worker side) and the /api/poll request span (API side) by default.

To opt in — useful when debugging queue or claim behavior — set on every API and worker process:

OTEL_TRACE_POLL=1

Truthy values: 1, true, yes, on (case-insensitive). Anything else (including unset/empty) keeps poll spans off. Expect span volume to roughly double when this is enabled.
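
The truthy rule can be expressed as a small predicate. This is a sketch of the documented behavior; the function name `poll_tracing_enabled` is hypothetical, and whether the real parser trims surrounding whitespace is not stated, so this version does not.

```python
TRUTHY_VALUES = {"1", "true", "yes", "on"}


def poll_tracing_enabled(value):
    """Return True only for the documented truthy values, case-insensitively.

    None (unset), the empty string, and anything else keep poll spans off.
    """
    if value is None:
        return False
    return value.lower() in TRUTHY_VALUES


for candidate in ("1", "TRUE", "On", "", "0", "enabled", None):
    print(repr(candidate), "->", poll_tracing_enabled(candidate))
```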

How It Works

At startup, the API server and workers call initOtel(). If OTEL_EXPORTER_OTLP_ENDPOINT is absent, all tracing functions are no-ops. If it is present, Agent Swarm starts the OpenTelemetry Node SDK with an OTLP HTTP trace exporter.

The API and workers share trace context over HTTP headers. When a worker calls the API or an MCP tool triggers server-side work, Agent Swarm injects and extracts OpenTelemetry propagation headers so related spans can appear in the same trace where the execution path supports it.
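
To make the propagation step concrete, here is a sketch of the W3C Trace Context `traceparent` header, which is the default OpenTelemetry propagation format. It assumes Agent Swarm uses that default; the helper names are hypothetical, and the IDs are the example values from the W3C specification.

```python
import re

# version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def format_traceparent(trace_id, span_id, sampled):
    """Build a version-00 traceparent value from lowercase hex IDs."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header):
    """Return trace_id, span_id, and sampled, or None when the value is invalid."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "span_id": span_id, "sampled": bool(int(flags, 16) & 1)}


header = format_traceparent("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", True)
print(header)
print(parse_traceparent(header))
```

The sender injects this value into the outgoing request headers, and the receiver extracts it and uses the trace ID and span ID as the parent of its own span, which is what lets worker and API spans land in one trace.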

Agent Swarm sets resource attributes once per process:

Attribute	Source
`service.name`	Per process: API → `<base>-api`, workers → `<base>`; base from `OTEL_SERVICE_NAME` (default `agent-swarm`)
`service.version`	`package.json` version
`service.namespace`	`OTEL_RESOURCE_ATTRIBUTES` or `agent-swarm`
`service.instance.id`	`AGENT_ID` or a generated UUID
`deployment.environment`	`OTEL_RESOURCE_ATTRIBUTES`, `NODE_ENV`, or `development`
`env`	`OTEL_RESOURCE_ATTRIBUTES.env` or `deployment.environment`
`agentswarm.service.role`	Runtime role: `api`, `lead`, or `worker`
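
The fallback chains in the table can be modeled as follows. This is a sketch under the assumption that each source is tried left to right as listed; the function names are hypothetical, and `node_env` stands in for the NODE_ENV variable.

```python
def parse_resource_attributes(raw):
    """Parse an OTEL_RESOURCE_ATTRIBUTES string of comma-separated key=value pairs."""
    attrs = {}
    for pair in (raw or "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def resolve_environment(raw_attrs, node_env=None):
    """Apply the documented fallbacks for three of the resource attributes."""
    attrs = parse_resource_attributes(raw_attrs)
    deployment = attrs.get("deployment.environment") or node_env or "development"
    return {
        "deployment.environment": deployment,
        # env falls back to deployment.environment when not set explicitly.
        "env": attrs.get("env") or deployment,
        "service.namespace": attrs.get("service.namespace") or "agent-swarm",
    }


print(resolve_environment("deployment.environment=local,env=local,service.namespace=agent-swarm"))
print(resolve_environment(None, node_env="production"))
```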

Sensitive exception messages, status messages, tool previews, and OTLP auth headers are scrubbed before they leave the process.

Traces Emitted

API HTTP Requests

Every API request is wrapped in a span named after its route, following the OpenTelemetry HTTP server semantic conventions: {METHOD} {route-template} — for example GET /api/tasks/{id} or POST /api/tasks.

The route template is low-cardinality: a request to /api/tasks/abc-123 and one to /api/tasks/def-456 share the span name GET /api/tasks/{id}, so SigNoz can group and aggregate by endpoint without raw IDs fragmenting the data. Requests that don't match a registered route — core routes (/health, /ping, /me), the MCP transport, and 404s — fall back to {METHOD} /{first-segment} (e.g. GET /health, POST /mcp), or a bare {METHOD} for the root path. The full raw request path is always preserved on the url.path attribute.

The same route template is also published on the http.route span attribute, so SigNoz can group, filter, and aggregate by endpoint as a first-class field instead of parsing it out of the span name. http.route is omitted (not fabricated) for requests that don't match a registered route.
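
The naming rules above can be sketched as two small functions. This models the documented behavior only; the function names are hypothetical, and route matching itself (turning a raw path into a template) is assumed to have happened already.

```python
def span_name(method, path, route_template=None):
    """Return the span name for a request.

    Matched routes use "{METHOD} {route-template}". Unmatched requests fall
    back to "{METHOD} /{first-segment}", or the bare method for the root path.
    """
    method = method.upper()
    if route_template:
        return f"{method} {route_template}"
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return method
    return f"{method} /{segments[0]}"


def span_attributes(method, path, route_template=None):
    """Return the route-related attributes; http.route is omitted when unmatched."""
    attrs = {"http.request.method": method.upper(), "url.path": path}
    if route_template:
        attrs["http.route"] = route_template
    return attrs


print(span_name("GET", "/api/tasks/abc-123", "/api/tasks/{id}"))
print(span_name("POST", "/mcp"))
print(span_attributes("GET", "/health"))
```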

Request handling runs inside the HTTP server span's active context, so server-side spans created while serving the request — notably the mcp.tool spans from MCP tool calls — nest underneath it as children rather than appearing as disconnected root spans.

Earlier releases named every API request span http.server. If you have saved SigNoz queries or dashboards that filter on name = 'http.server', switch them to filter on the agentswarm.component = 'api' attribute (set on every API span) and group by span name or the http.route attribute to get a per-endpoint breakdown.

Common attributes:

Attribute	Description
`http.request.method`	HTTP method
`http.route`	Low-cardinality route template, e.g. `/api/tasks/{id}`. Omitted for unmatched core/MCP/404 paths.
`url.path`	Raw request path (full path, including IDs)
`url.scheme`	Request scheme — `https` or `http` (honors `X-Forwarded-Proto`)
`server.address`	Request host with the port stripped (honors `X-Forwarded-Host`)
`network.protocol.version`	HTTP protocol version, e.g. `1.1` or `2`
`user_agent.original`	Raw `User-Agent` request header
`http.response.status_code`	HTTP response status code (SigNoz surfaces this as the `responseStatusCode` column)
`agentswarm.component`	`api`
`agentswarm.http.duration_ms`	Server-side request duration

Useful for:

API latency and error-rate dashboards
task creation, polling, and completion request inspection
checking whether workers are reaching the API

MCP Tool Calls

Server-side MCP tool handlers emit one span per call, named mcp.tool <tool-name> — for example mcp.tool store-progress — so the executed tool is readable straight from the trace tree. Tool names are a fixed enum, so the span name stays low-cardinality. These spans nest under the API HTTP request span that triggered them.

Common attributes:

Attribute	Description
`mcp.tool.name`	Registered MCP tool name
`mcp.tool.result_content_count`	Number of returned content items
`mcp.tool.is_error`	Whether the MCP result is an error
`agentswarm.task.id`	Source task ID when available
`agentswarm.tool.args_preview`	Scrubbed, truncated argument preview
`agentswarm.tool.result_preview`	Scrubbed, truncated result preview

Useful for:

finding slow or failing MCP tools
confirming tool calls are attached to a task
comparing tool usage across agents and harnesses

Worker Polling

Worker poll loops emit worker.poll spans.

Common attributes:

Attribute	Description
`agentswarm.poll.result`	`empty`, `task_assigned`, `task_offered`, `pool_tasks_available`, or another trigger type
`agentswarm.worker.poll_timeout_ms`	Long-poll timeout
`agentswarm.worker.poll_interval_ms`	Worker poll interval

Useful for:

seeing whether workers are idle or receiving tasks
debugging queue and claim behavior
spotting excessive polling or API contention

Worker Sessions

Worker task execution emits worker.session.create and worker.session spans.

Common attributes:

Attribute	Description
`agentswarm.task.id`	Logical task ID
`agentswarm.task.real_id`	Real task ID after pool-task claim resolution
`agentswarm.agent.role`	Agent role from the registered agent
`agentswarm.harness_provider`	`claude`, `pi`, `codex`, `opencode`, or another provider
`agentswarm.provider.session_id`	Provider session ID when available
`agentswarm.session.cwd`	Session working directory
`agentswarm.session.vcs_repo`	VCS repo attached to the session when set
`agentswarm.session.duration_ms`	Session duration
`agentswarm.session.exit_code`	Runner exit code
`agentswarm.session.outcome`	`ok` or `error`
`gen_ai.request.model`	Requested model
`gen_ai.response.model`	Model reported by the provider, when cost data exists
`gen_ai.usage.input_tokens`	Input token count, when available
`gen_ai.usage.output_tokens`	Output token count, when available
`agentswarm.cost.total_usd`	Provider-reported or computed session cost

Useful for:

full task execution timelines
model and provider comparisons
task cost and token usage inspection
finding sessions that failed before calling store-progress

Worker Tool Executions

Worker-side provider events emit worker.tool or worker.mcp.tool spans.

Common attributes:

Attribute	Description
`agentswarm.tool.name`	Raw tool name reported by the harness
`agentswarm.tool.normalized_name`	Normalized tool name
`agentswarm.tool.kind`	`mcp`, shell/tool, or provider-specific kind
`agentswarm.tool.call_id`	Provider tool-call ID when available
`mcp.tool.name`	MCP tool name when the worker calls an MCP tool
`agentswarm.tool.duration_ms`	Tool duration
`agentswarm.tool.args_preview`	Scrubbed, truncated args
`agentswarm.tool.result_preview`	Scrubbed, truncated result
`agentswarm.tool.missing_start`	Result arrived without a matching start event
`agentswarm.tool.implicit_close`	Span closed at the assistant-message boundary because the adapter didn't emit a per-tool completion event. Applies to BOTH `worker.tool` and `worker.mcp.tool` under the Claude harness.
`agentswarm.tool.unclosed`	Session ended before any `tool_end` or assistant-message boundary fired — should be very rare

How tool spans close

Under the Claude SDK adapter, neither harness-side tools (Bash/Read/Edit/etc.) nor MCP tools receive per-tool completion events in the JSONL stream. Both kinds therefore close on the same path:

worker.tool and worker.mcp.tool (Claude harness): close at the next assistant-message boundary, tagged agentswarm.tool.implicit_close=true. duration_ms is wall-clock from tool_start until the next assistant turn, so it includes tool execution plus the model round-trip after the tool result returned. This slightly overcounts execution time but is a usable approximation for the typical case.
Other adapters that do emit an explicit tool_end (e.g. Claude Managed Agents, Codex, opencode): close on the tool_end event with duration_ms set to true execution time. No implicit_close attribute.
agentswarm.tool.unclosed=true: safety net for spans where the session ended before any assistant-message boundary arrived (e.g. the session crashed mid-tool). Should be very rare; treat its presence as a signal worth investigating.
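
The three closure paths can be modeled as a small event loop. This is an illustrative sketch, not the runner's implementation: the `(timestamp_ms, kind, call_id)` tuple shape and the `assistant_message` and `session_end` event names are assumptions, and whether an unclosed span records a duration is not stated in this page, so the sketch leaves it out.

```python
def close_tool_spans(events):
    """Turn a stream of (timestamp_ms, kind, call_id) events into closed tool spans."""
    open_starts = {}  # call_id -> start timestamp in ms
    spans = []
    for timestamp, kind, call_id in events:
        if kind == "tool_start":
            open_starts[call_id] = timestamp
        elif kind == "tool_end":
            if call_id in open_starts:
                # Explicit completion: duration is true execution time.
                duration = timestamp - open_starts.pop(call_id)
                spans.append({"call_id": call_id, "agentswarm.tool.duration_ms": duration})
            else:
                spans.append({"call_id": call_id, "agentswarm.tool.missing_start": True})
        elif kind == "assistant_message":
            # Implicit close: everything still open ends at the boundary.
            for open_id, start in open_starts.items():
                spans.append({"call_id": open_id,
                              "agentswarm.tool.duration_ms": timestamp - start,
                              "agentswarm.tool.implicit_close": True})
            open_starts.clear()
        elif kind == "session_end":
            # Safety net: the session ended with tools still open.
            for open_id in open_starts:
                spans.append({"call_id": open_id, "agentswarm.tool.unclosed": True})
            open_starts.clear()
    return spans


demo = [(0, "tool_start", "a"), (120, "tool_end", "a"),
        (200, "tool_start", "b"), (900, "assistant_message", None),
        (1000, "tool_start", "c"), (1500, "session_end", None)]
for span in close_tool_spans(demo):
    print(span)
```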

Useful for:

shell command latency
MCP usage inside full worker traces
detecting tools that never completed
finding repeated tool loops or unusually expensive tool phases

Provider, Progress, Context, and Compaction Events

Provider stream events are attached to active spans as attributes or events where possible.

Common attributes:

Attribute	Description
`agentswarm.provider.name`	Provider name
`agentswarm.provider.event_name`	Provider custom event name
`agentswarm.provider.event_data_preview`	Scrubbed, truncated provider data
`gen_ai.message.role`	Message role
`gen_ai.message.content_preview`	Scrubbed, truncated message content
`agentswarm.progress.message`	Scrubbed, truncated progress text
`agentswarm.context.used_tokens`	Context tokens used
`agentswarm.context.total_tokens`	Total context window
`agentswarm.context.percent`	Context usage percent
`agentswarm.compaction.trigger`	Compaction trigger
`agentswarm.compaction.pre_tokens`	Tokens before compaction

Claude Code Telemetry

The Claude Code CLI emits its own OpenTelemetry signals (metrics, log events, and, in beta, traces) from inside the worker subprocess. These are separate from the Agent Swarm spans described above: they are the model's own view of each interaction (claude_code.interaction), every LLM request, and every tool call.

Agent Swarm leaves Claude Code's telemetry off by default and never force-enables it. Two independent controls govern it:

The operator enables Claude Code's exporters through swarm config — the env vars below. This is what makes Claude Code emit anything at all.
The SWARM_ENABLE_HARNESS_OTEL gate makes the adapter inject a TRACEPARENT at spawn time so the harness's spans nest inside the worker's trace (and, for Claude Code, pins privacy-safe logging defaults). This is what makes the two services show up as one end-to-end trace. The same gate also covers Codex — see Codex Telemetry.

Enabling Claude Code's exporters

Claude Code reads the standard OTEL_* exporter variables — endpoint, headers, protocol — which Agent Swarm already forwards to the subprocess. To turn its telemetry on, set these as swarm config (global, or agent-scoped to roll out gradually):

CLAUDE_CODE_ENABLE_TELEMETRY=1        # master switch — required
CLAUDE_CODE_ENHANCED_TELEMETRY_BETA=1 # beta traces (claude_code.interaction et al.)
OTEL_TRACES_EXPORTER=otlp             # traces
OTEL_METRICS_EXPORTER=otlp            # token / cost / lines-of-code metrics
OTEL_LOGS_EXPORTER=otlp               # user_prompt / tool_use log events

These are independent of the SWARM_ENABLE_HARNESS_OTEL gate by design: an operator can run Claude Code telemetry on a single agent without any code change, and separately flip the gate when cross-service trace linking should turn on.

The `SWARM_ENABLE_HARNESS_OTEL` gate

SWARM_ENABLE_HARNESS_OTEL is the single gate for harness-subprocess trace linking — it covers both Claude Code and Codex. When it is truthy (1, true, yes, on — case-insensitive), the adapter, on every session spawn:

Injects TRACEPARENT (and TRACESTATE when present) derived from the active worker.session span. A harness launched with this env var parents its own root span to the worker's trace instead of starting a disconnected root. (Claude Code reads TRACEPARENT in non-interactive -p mode; Codex's Rust OpenTelemetry SDK reads it via the standard tracecontext propagator.)
Pins Claude-Code-specific privacy defaults (claude only) — OTEL_LOG_USER_PROMPTS=0, OTEL_LOG_TOOL_DETAILS=0, OTEL_LOG_TOOL_CONTENT=0, OTEL_METRICS_INCLUDE_ACCOUNT_UUID=false. These are only set when the operator has not already set them explicitly. Codex does not read these env vars — it has no equivalent.

The gate is read per-spawn from the resolved swarm config, so flipping it takes effect on the next session — no container restart required. When the gate is off, spawn behavior is unchanged: a harness with its own exporters enabled still emits, but its root span is disconnected from the worker trace.

Migration note. The gate was originally introduced as SWARM_ENABLE_CLAUDE_CODE_OTEL. That name is kept as a deprecated alias — a truthy value of either SWARM_ENABLE_HARNESS_OTEL or SWARM_ENABLE_CLAUDE_CODE_OTEL turns injection on for every harness. Prefer SWARM_ENABLE_HARNESS_OTEL in new config; the alias may be removed in a future release.
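
Putting the gate, the deprecated alias, and the per-spawn injection together, the behavior described in this section might look like the sketch below. It is a simplified model under stated assumptions: the function names are hypothetical, `config` stands in for the resolved swarm config merged into the subprocess environment, and the real Codex adapter builds a minimal env rather than copying everything.

```python
TRUTHY_VALUES = {"1", "true", "yes", "on"}
GATE_NAMES = ("SWARM_ENABLE_HARNESS_OTEL", "SWARM_ENABLE_CLAUDE_CODE_OTEL")  # second is the deprecated alias

CLAUDE_PRIVACY_DEFAULTS = {
    "OTEL_LOG_USER_PROMPTS": "0",
    "OTEL_LOG_TOOL_DETAILS": "0",
    "OTEL_LOG_TOOL_CONTENT": "0",
    "OTEL_METRICS_INCLUDE_ACCOUNT_UUID": "false",
}


def gate_enabled(config):
    """True when either the gate or its deprecated alias holds a truthy value."""
    return any(config.get(name, "").lower() in TRUTHY_VALUES for name in GATE_NAMES)


def build_spawn_env(config, harness, traceparent, tracestate=None):
    """Return the subprocess env for one session spawn."""
    env = dict(config)
    if not gate_enabled(config):
        return env  # gate off: spawn behavior is unchanged
    if traceparent:
        env["TRACEPARENT"] = traceparent
        if tracestate:
            env["TRACESTATE"] = tracestate
    if harness == "claude":
        for key, value in CLAUDE_PRIVACY_DEFAULTS.items():
            env.setdefault(key, value)  # never override an explicit operator choice
    return env


parent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(build_spawn_env({"SWARM_ENABLE_HARNESS_OTEL": "on"}, "claude", parent))
```

Reading the gate inside the spawn function, rather than once at startup, is what makes a config change take effect on the next session without a restart.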

Expected SigNoz behavior

A new claude-code service card appears. Claude Code overrides OTEL_SERVICE_NAME with its own service.name for its spans — this is by design and expected. Agent Swarm does not strip or re-set it.
Span hierarchy. With the gate on, Claude Code's spans nest inside the worker trace:
```
worker.session
└── worker.session.create
    └── claude_code.interaction
        ├── claude_code.llm_request
        └── claude_code.tool
```
A complete-task query (agentswarm.task.id = '<task-id>') then returns the worker's worker.tool spans and Claude Code's claude_code.* spans in a single trace.
Log events. user_prompt and tool_use events are emitted with content redacted by default (see Privacy below).

Privacy posture. Agent Swarm's scrubSecrets does not run on Claude Code's exported payloads; they travel straight from the Claude Code process to your OTLP backend. Unless you have set them explicitly, the gate keeps OTEL_LOG_USER_PROMPTS, OTEL_LOG_TOOL_DETAILS, and OTEL_LOG_TOOL_CONTENT at 0 so prompt and tool content never leave the process. Do not flip these to 1 without a scrubbing story for Claude Code's payloads.

Rollout

SWARM_ENABLE_HARNESS_OTEL is off by default. Recommended sequence: enable Claude Code's exporters on one agent, flip the gate agent-scoped on that same agent to validate the nested trace in SigNoz, watch span volume, then widen to global config.

Codex Telemetry

The Codex CLI also emits its own OpenTelemetry traces. Like Claude Code, it starts a fresh root span unless it is handed a W3C trace context at spawn — so the same SWARM_ENABLE_HARNESS_OTEL gate injects TRACEPARENT into the Codex subprocess env, and Codex's Rust OpenTelemetry SDK parents its spans to the worker trace via the standard tracecontext propagator.

Enabling Codex's exporters

Codex configures its OTLP exporter through TOML, not env vars — an [otel.exporter] block in ~/.codex/config.toml (endpoint, headers, protocol, plus otel.trace_exporter / otel.metrics_exporter). Setting that up is an operator config step and is out of scope for the trace-linking gate: SWARM_ENABLE_HARNESS_OTEL only injects TRACEPARENT; it does not enable Codex telemetry itself.

What the gate does for Codex

When SWARM_ENABLE_HARNESS_OTEL (or the deprecated SWARM_ENABLE_CLAUDE_CODE_OTEL alias) is on and a sampled worker.session span is active, the codex-adapter injects TRACEPARENT (and TRACESTATE when present) into the minimal env it builds for the Codex subprocess.

No privacy-default env vars are set for Codex — the OTEL_LOG_* switches are Claude-Code-specific and Codex does not read them. Codex's own redaction is governed by its TOML otel.* settings, which the operator controls separately.

With Codex's exporters enabled and the gate on, Codex's spans appear in the same end-to-end trace as worker.session, exactly as Claude Code's do.

Metrics

Agent Swarm emits both traces and OTLP metrics. The metric exporter runs on the same OTLP pipeline as traces (same OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS), so no additional endpoint configuration is needed.

Session Cost and Token Counters

Two cumulative counters are emitted by the API server each time a session-cost record is finalized (POST /api/session-costs). They cover every harness — claude, claude-managed, codex, pi, opencode, devin, gemini — with a single chokepoint.

Metric	Unit	Description
`agentswarm.cost.usd`	`{usd}`	USD cost per finalized cost record. Not emitted for zero-cost sessions.
`agentswarm.tokens`	`{token}`	Token count per finalized cost record, split by `token_type`. Not emitted for zero-count classes.

Both counters share the same set of low-cardinality attributes:

Attribute	Values	Description
`harness`	`claude`, `codex`, `pi`, `opencode`, `devin`, `gemini`, `unknown`	Harness provider that produced the session. `unknown` when the request omits the `provider` field.
`model`	model identifier	The model key sent by the adapter (e.g. `claude-sonnet-4-6`, `gpt-4o`). Stripped of routing prefixes by the pricing-normalize path. Scrubbed through the secret scrubber before export.
`cost_source`	`harness`, `pricing-table`, `unpriced`	How the cost figure was derived. `harness` = adapter-reported; `pricing-table` = recomputed from seeded pricing rows; `unpriced` = provider/model pair has no pricing data.
`is_error`	`true`, `false`	Whether the session ended with an error.

agentswarm.tokens additionally carries:

Attribute	Values	Description
`token_type`	`input`, `output`, `cacheRead`, `cacheWrite`, `reasoning`, `thinking`	Token class. Only classes with a non-zero count are emitted.
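
To show how one finalized cost record fans out into data points, here is a sketch of the emission rules from the tables above. The record field names (`cost_usd`, `tokens`, `is_error`, and so on) are hypothetical apart from `provider`; only the metric names, attribute names, and zero-skipping rules come from this page.

```python
TOKEN_TYPES = ("input", "output", "cacheRead", "cacheWrite", "reasoning", "thinking")


def metric_points(record):
    """Return (metric_name, value, attributes) tuples for one cost record."""
    attrs = {
        "harness": record.get("provider") or "unknown",
        "model": record["model"],
        "cost_source": record["cost_source"],
        "is_error": bool(record.get("is_error", False)),
    }
    points = []
    cost = record.get("cost_usd", 0)
    if cost > 0:  # zero-cost sessions emit no cost point
        points.append(("agentswarm.cost.usd", cost, attrs))
    tokens = record.get("tokens", {})
    for token_type in TOKEN_TYPES:
        count = tokens.get(token_type, 0)
        if count > 0:  # zero-count classes are skipped
            points.append(("agentswarm.tokens", count, {**attrs, "token_type": token_type}))
    return points


example = {"provider": "codex", "model": "gpt-4o", "cost_source": "pricing-table",
           "cost_usd": 0.5, "tokens": {"input": 1000, "output": 200, "cacheRead": 0}}
for point in metric_points(example):
    print(point)
```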

Metric Temporality

Temporality is NOT hardcoded. Set the following for Datadog or any delta-preferred backend:

OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta

Omit it for backends (SigNoz, Prometheus) that prefer cumulative temporality (the SDK default).
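
The difference between the two temporalities is easiest to see on a small series. The sketch below converts cumulative counter readings into per-interval deltas; the function name and the sample token counts are made up for illustration, and the real conversion happens inside the OpenTelemetry SDK.

```python
def cumulative_to_delta(readings):
    """Convert cumulative counter readings into per-interval deltas.

    Cumulative temporality reports the running total at each export;
    delta temporality reports only what was added since the last export.
    """
    deltas = []
    previous = 0
    for value in readings:
        deltas.append(value - previous)
        previous = value
    return deltas


cumulative = [100, 250, 250, 600]  # running total of tokens at four exports
print(cumulative_to_delta(cumulative))
```

Both forms carry the same information; the preference only controls which one the exporter sends, so pick the one your backend ingests natively.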

Example Dashboard Panels

Panel	Metric query
Total cost by harness	`sum(agentswarm.cost.usd)`, group by `harness`
Cost by model	`sum(agentswarm.cost.usd)`, group by `model`
Cost by pricing source	`sum(agentswarm.cost.usd)`, group by `cost_source`
Input tokens by harness	`sum(agentswarm.tokens)` where `token_type = 'input'`, group by `harness`
Output tokens by model	`sum(agentswarm.tokens)` where `token_type = 'output'`, group by `model`
Cache efficiency	`sum(agentswarm.tokens)` where `token_type = 'cacheRead'` ÷ `token_type = 'input'`
Error session cost	`sum(agentswarm.cost.usd)` where `is_error = true`

Trace-derived Metrics

In addition to the standalone counters above, SigNoz supports creating operational metrics from trace aggregations:

Metric	Query shape
API request rate	`rate(count())` over `agentswarm.component = 'api'`, grouped by span `name`
API p95 latency	`p95(durationNano)` over `agentswarm.component = 'api'`, grouped by span `name`
API errors	`count()` where `hasError = true` or `responseStatusCode >= 500`
Worker task throughput	`count()` over `name = 'worker.session'`, grouped by `agentswarm.service.role` and `agentswarm.harness_provider`
Worker session p95 duration	`p95(durationNano)` over `name = 'worker.session'`
Tool call volume	`count()` over `name IN ('worker.tool', 'worker.mcp.tool')` plus `name LIKE 'mcp.tool %'` for server-side spans, grouped by `agentswarm.tool.name` or `mcp.tool.name`
Slow tools	`p95(durationNano)` over tool spans, grouped by tool name
Cost by model (trace)	`sum(agentswarm.cost.total_usd)` over session spans, grouped by `gen_ai.response.model`
Token usage by model (trace)	`sum(gen_ai.usage.input_tokens)` and `sum(gen_ai.usage.output_tokens)`, grouped by model
Poll behavior	`count()` over `worker.poll`, grouped by `agentswarm.poll.result`

Useful SigNoz Queries

Use these in SigNoz Traces Explorer or as dashboard widget filters.

The API process reports service.name = 'agent-swarm-api' and workers report service.name = 'agent-swarm' (see Service names). Queries that should span both filter on service.namespace = 'agent-swarm' instead; worker-only queries keep service.name = 'agent-swarm'.

All Local Agent Swarm Traffic

service.namespace = 'agent-swarm' AND env = 'local'

Production Traffic Only

service.namespace = 'agent-swarm' AND deployment.environment = 'production'

A Complete Task Execution

service.namespace = 'agent-swarm' AND agentswarm.task.id = '<task-id>'

Start here when debugging a concrete task. You should see session spans, tool spans, and any server-side MCP spans that carried the task ID.

Worker Sessions for One Harness

service.name = 'agent-swarm'
AND name = 'worker.session'
AND agentswarm.harness_provider = 'pi'

Replace pi with claude, codex, opencode, or another provider.

Slow Worker Tools

service.name = 'agent-swarm'
AND name IN ('worker.tool', 'worker.mcp.tool')
AND durationNano > 5000000000

durationNano > 5000000000 means slower than five seconds.
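
Since durationNano is expressed in nanoseconds, a small helper avoids counting zeros by hand when you change the threshold. This is a convenience sketch; the function names are hypothetical, and the query text is the one shown above.

```python
NANOS_PER_SECOND = 1_000_000_000


def seconds_to_nanos(seconds):
    """Convert a threshold in seconds to the nanosecond value durationNano uses."""
    return int(round(seconds * NANOS_PER_SECOND))


def slow_tool_filter(threshold_seconds):
    """Build the slow-worker-tools filter for a given threshold in seconds."""
    return (
        "service.name = 'agent-swarm'\n"
        "AND name IN ('worker.tool', 'worker.mcp.tool')\n"
        f"AND durationNano > {seconds_to_nanos(threshold_seconds)}"
    )


print(slow_tool_filter(5))
```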

MCP Tool Calls for a Task

service.namespace = 'agent-swarm'
AND agentswarm.task.id = '<task-id>'
AND (name LIKE 'mcp.tool %' OR name = 'worker.mcp.tool')

Server-side mcp.tool <tool-name> spans live under agent-swarm-api and worker.mcp.tool under agent-swarm, so this query filters on service.namespace to catch both. The LIKE 'mcp.tool %' pattern matches every per-tool span name (mcp.tool store-progress, mcp.tool poll-task, …).

Failed or Error Spans

service.namespace = 'agent-swarm' AND hasError = true

For HTTP 5xx responses:

service.name = 'agent-swarm-api'
AND agentswarm.component = 'api'
AND responseStatusCode >= 500

Unclosed Tool Calls

service.name = 'agent-swarm'
AND agentswarm.tool.unclosed = true

This should be a rare safety-net signal: the session ended (or crashed) before either an explicit tool_end or the assistant-message boundary fired. For the typical Claude-harness flow, both worker.tool and worker.mcp.tool spans close on the boundary with agentswarm.tool.implicit_close=true, not unclosed=true.

Implicit-Closed Tool Calls

service.name = 'agent-swarm'
AND agentswarm.tool.implicit_close = true

This is the expected closure path under the Claude adapter for both worker.tool spans (Bash/Read/Edit/etc.) and worker.mcp.tool spans. The adapter doesn't emit per-tool completion events for either kind, so the runner closes them at the next assistant-message boundary. duration_ms is wall-clock from tool_start to the next assistant message, which slightly overcounts the actual tool execution time (it includes the model round-trip after the tool result returned). Spans closed by an explicit adapter-emitted tool_end (Codex, opencode, Claude Managed Agents) won't have this tag.

Sessions with Cost Data

Use the agentswarm.cost.usd OTLP metric counter for cost aggregations and dashboards. For per-span cost attribution in trace queries, filter on the agentswarm.cost.total_usd span attribute on worker.session spans:

service.name = 'agent-swarm'
AND name = 'worker.session'
AND agentswarm.cost.total_usd > 0

Context Pressure

service.name = 'agent-swarm'
AND agentswarm.context.percent >= 80

Dashboard Ideas

Start with these panels:

Panel	Signal
Cost by harness	`sum(agentswarm.cost.usd)` metric, grouped by `harness`
Cost by model	`sum(agentswarm.cost.usd)` metric, grouped by `model`
Token usage by type	`sum(agentswarm.tokens)` metric, grouped by `token_type`
Cache efficiency	`sum(agentswarm.tokens)` where `token_type=cacheRead` ÷ `input`
API request rate by route	Trace count over `agentswarm.component = 'api'`, grouped by span `name`
API p95 latency by route	`p95(durationNano)` over `agentswarm.component = 'api'`, grouped by span `name`
Worker sessions by harness	Trace count over `worker.session`, grouped by `agentswarm.harness_provider`
Worker session duration	`p95(durationNano)` over `worker.session`, grouped by harness
Tool calls by name	Trace count over tool spans, grouped by `agentswarm.tool.name`
Slowest tools	`p95(durationNano)` over tool spans
Poll outcomes	Trace count over `worker.poll`, grouped by `agentswarm.poll.result`
Errors by span name	Trace count where `hasError = true`, grouped by `name`

Troubleshooting

No traces appear

Check that every process has the exporter endpoint:

docker compose exec api env | grep OTEL
docker compose exec lead env | grep OTEL
docker compose exec worker-1 env | grep OTEL

Then confirm the API is logging OTel startup:

docker compose logs api | grep OTel

If OTEL_EXPORTER_OTLP_ENDPOINT is empty, tracing is intentionally disabled.

SigNoz returns authentication errors

For SigNoz Cloud, make sure:

OTEL_EXPORTER_OTLP_ENDPOINT is the base URL (no /v1/traces suffix — the SDK appends it)
OTEL_EXPORTER_OTLP_HEADERS is exactly signoz-ingestion-key=<key>
OTEL_EXPORTER_OTLP_PROTOCOL is http/protobuf
the key belongs to the same SigNoz region as the ingest endpoint

Local and production traces are mixed

Set explicit resource attributes in every environment:

OTEL_RESOURCE_ATTRIBUTES=deployment.environment=local,env=local,service.namespace=agent-swarm

For production:

OTEL_RESOURCE_ATTRIBUTES=deployment.environment=production,env=production,service.namespace=agent-swarm

Then filter with env or deployment.environment.

API and workers show as separate services

This is intentional. The API process reports service.name = agent-swarm-api and worker/lead processes report service.name = agent-swarm so they appear as separate service cards in SigNoz — see Service names.

To query across both at once, filter on service.namespace = 'agent-swarm' (set by every process) instead of service.name. Within a single service, agentswarm.service.role further splits api, lead, and worker spans.

Deployment Guide - production Docker Compose setup
Environment Variables - OpenTelemetry environment variable reference
Telemetry - anonymized product telemetry, separate from your OpenTelemetry traces
