Observability

The Digitorn daemon exposes health probes, JSON metrics, and (when prometheus_client is installed) Prometheus-formatted metrics over HTTP. Per-session metrics also live in-process and are queryable via the API.

Every endpoint and metric on this page maps to real code; entries are cited with file + line.

Health endpoints

The daemon registers three health surfaces. Two of them are designed to be hit anonymously by an orchestrator and are part of the public contract; one is admin-side.

GET /healthz is the liveness probe. It is exempt from auth middleware and returns {"status": "alive"}. Use it as your Kubernetes liveness check.

GET /readyz is the readiness probe. It returns HTTP 503 with {"status": "draining"} while the daemon is shutting down, and OK otherwise. When server.auth_enabled=true, this endpoint requires JWT auth, so for K8s readiness either disable auth on the probe path, run a sidecar that mints a token, or rely on /healthz instead.

The third surface is a richer health view that returns version, status, system metrics, event-loop lag, watchdog, and worker-pool stats. Its status field flips to "degraded" automatically when event-loop lag exceeds 500 ms or the turn pool saturates (active turns ≥ max workers). It is admin-only; use it in front-of-daemon load balancers to stop sending new traffic while the daemon is overloaded.
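The degradation rule above can be sketched as a small pure function. This is an illustration of the two stated conditions, not the daemon's code: the `HealthInputs` container, its field names, and the healthy value `"ok"` are assumptions made for the example; only the 500 ms threshold, the saturation condition, and the `"degraded"` value come from the description above.

```python
from dataclasses import dataclass

# Threshold taken from the description above.
LAG_THRESHOLD_MS = 500


@dataclass
class HealthInputs:
    """Hypothetical container; the daemon's real field names may differ."""
    event_loop_lag_ms: float
    active_turns: int
    max_workers: int


def derive_status(h: HealthInputs) -> str:
    """Return "degraded" when either documented condition holds.

    The healthy value "ok" is an assumption for this sketch.
    """
    lag_exceeded = h.event_loop_lag_ms > LAG_THRESHOLD_MS   # lag exceeds 500 ms
    pool_saturated = h.active_turns >= h.max_workers        # turn pool saturated
    return "degraded" if (lag_exceeded or pool_saturated) else "ok"
```

A load balancer check would then treat any value other than the healthy one as a signal to stop routing new traffic.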

Kubernetes example

livenessProbe:
  httpGet:
    path: /healthz
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8000
  initialDelaySeconds: 2
  periodSeconds: 5
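When auth is enabled on /readyz, a sidecar or script has to interpret the response itself. The following is a minimal sketch of such a check using only the standard library. The `Authorization: Bearer` header and the treatment of 401/403 as "unauthorized" are assumptions; the 503 status with `{"status": "draining"}` comes from the description above.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def interpret_readyz(status_code: int, body: bytes) -> str:
    """Map a /readyz response to ready / draining / unauthorized / unknown."""
    if 200 <= status_code < 300:
        return "ready"
    if status_code == 503:
        try:
            payload = json.loads(body or b"{}")
        except ValueError:
            payload = {}
        if isinstance(payload, dict) and payload.get("status") == "draining":
            return "draining"
        return "unknown"
    if status_code in (401, 403):  # assumed auth-failure codes
        return "unauthorized"
    return "unknown"


def probe(base_url: str, token: Optional[str] = None, timeout: float = 2.0) -> str:
    """Call /readyz once; the bearer-token header scheme is an assumption."""
    req = urllib.request.Request(f"{base_url}/readyz")
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return interpret_readyz(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as HTTPError; the body is still readable.
        return interpret_readyz(err.code, err.read())
```

Keeping the interpretation separate from the network call makes the mapping easy to test without a running daemon.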

Metrics

The daemon exposes both JSON and Prometheus-formatted metrics through admin-only endpoints (operational reference held by the daemon administrator).

Prometheus support is opt-in via dependency - install prometheus_client to enable the Prometheus exposition.

pip install prometheus_client
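A deployment script can confirm that the optional dependency is importable before expecting Prometheus output. This sketch uses the standard library's `importlib.util.find_spec`; it shows one way to check from the outside and is not a claim about how the daemon performs its own detection. The helper name `module_installed` is hypothetical.

```python
import importlib.util


def module_installed(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if module_installed("prometheus_client"):
        print("prometheus_client found: Prometheus exposition can be enabled")
    else:
        print("prometheus_client missing: only JSON metrics will be available")
```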

Per-session metrics

Every active session has a SessionMetrics instance tracking real-time numbers.

Fields

| Field | Description |
| --- | --- |
| app_id, session_id, agent_id, user_id, channel, model, provider | Identity. |
| status | active / idle / closed. |
| created_at, last_active_at | Unix timestamps. |
| turn, max_turns | Current turn count and the configured cap. |
| prompt_tokens, completion_tokens, total_tokens | Cumulative token usage reported by the LLM provider. |
| llm_calls, llm_total_ms, llm_last_ms | LLM latency stats. |
| tool_calls_total, tool_calls_success, tool_calls_failed | Tool-call counters. |
| tool_metrics | Per-tool breakdown - dict[str, ToolMetrics]. Each ToolMetrics tracks calls, successes, failures, avg_duration_ms, last_duration_ms, last_error. |
| context | ContextBreakdown - system / tools / messages token split for the current turn. |
| memory_goal, memory_facts_count, memory_todos_count | Memory snapshot. |
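Several useful ratios can be derived from these counters. The sketch below assumes a snapshot is a flat dict keyed by the field names in the table; that shape is an assumption, and `derived_stats` is a hypothetical helper rather than part of Digitorn.

```python
def derived_stats(snapshot: dict) -> dict:
    """Compute ratios from raw counters; None where the denominator is zero."""
    llm_calls = snapshot.get("llm_calls", 0)
    tool_total = snapshot.get("tool_calls_total", 0)
    max_turns = snapshot.get("max_turns", 0)
    return {
        # Mean LLM latency per call, in milliseconds.
        "avg_llm_ms": snapshot.get("llm_total_ms", 0) / llm_calls if llm_calls else None,
        # Fraction of tool calls that succeeded.
        "tool_success_rate": (
            snapshot.get("tool_calls_success", 0) / tool_total if tool_total else None
        ),
        # Fraction of the configured turn cap already consumed.
        "turn_budget_used": snapshot.get("turn", 0) / max_turns if max_turns else None,
    }


# Sample values invented for illustration.
sample = {
    "llm_calls": 4, "llm_total_ms": 3200,
    "tool_calls_total": 10, "tool_calls_success": 9, "tool_calls_failed": 1,
    "turn": 5, "max_turns": 20,
}
stats = derived_stats(sample)
print(stats)  # avg_llm_ms 800.0, tool_success_rate 0.9, turn_budget_used 0.25
```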

Programmatic access

from digitorn.core.runtime.session_metrics import (
    get_session_metrics,
    list_active_metrics,
    app_summary,
    global_summary,
)

# Get a single session's metrics object
m = get_session_metrics(app_id="my-app", session_id="abc-123")
print(m.snapshot())

# All active sessions across all apps
for m in list_active_metrics():
    print(m["app_id"], m["session_id"], m["total_tokens"])

# Per-app rollup
print(app_summary("my-app"))

# Daemon-wide
print(global_summary())

The same data is reachable through the admin metrics surface (see your daemon administrator).
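As an example of working with this data, the sketch below ranks sessions by token usage. It runs on plain dicts shaped like the rows indexed in the listing above (`app_id`, `session_id`, `total_tokens`); the sample rows are invented, and `top_sessions` is a hypothetical helper. In a real deployment the rows would presumably come from `list_active_metrics()`.

```python
import heapq


def top_sessions(rows, n=3):
    """Return the n rows with the highest total_tokens, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: row["total_tokens"])


# Invented sample rows standing in for list_active_metrics() output.
rows = [
    {"app_id": "my-app", "session_id": "abc-123", "total_tokens": 1200},
    {"app_id": "my-app", "session_id": "def-456", "total_tokens": 9800},
    {"app_id": "other-app", "session_id": "ghi-789", "total_tokens": 4300},
]

for row in top_sessions(rows, n=2):
    print(row["app_id"], row["session_id"], row["total_tokens"])
```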

Per-module health

Modules can expose their own health probe through the daemon's modules API. Each module decides what "healthy" means (DB connection alive, MCP server reachable, HTTP backend responding, ...). The CLI front-end is:

digitorn modules health
digitorn mcp health # per-MCP-server health

The mcp.health_check action is the LLM-callable equivalent for MCP servers.

Channel health

Each declared channel exposes a per-channel ChannelHealth snapshot through the daemon's channels API. Useful when triggers depend on inbound channels (a webhook listener that's lost its connection should flag degraded).

Credentials health

The daemon exposes a credentials-vault health endpoint that returns the state of the master-key provider, cipher, audit-log integrity, OAuth registry, and refresh loop. The exact route is admin-only; consult your daemon administrator. Documented behavior is in Credentials.

Audit log

Every gate decision in the security layer fires an audit event - see Security → Audit log. The append-only trail is queryable through admin endpoints with filters on event_type, actor_user_id, target_user_id, target_app_id, time range, success-only, limit+offset. Credential-specific audit data is stored in a hash-chained table; integrity verification is also admin-only.
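To illustrate what hash chaining buys you, here is a generic sketch of a hash-chained log and its verification. It is not Digitorn's schema or algorithm: the row layout, the SHA-256 choice, the JSON canonicalization, and the all-zero genesis value are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first row


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous row's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Add a row that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered row breaks the chain."""
    prev = GENESIS
    for row in chain:
        if row["prev_hash"] != prev or row["hash"] != entry_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True
```

Because each row's hash covers the previous hash, changing one historical row invalidates it and every row after it, which is what makes after-the-fact tampering detectable.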

Logging

The daemon uses Python's stdlib logging configured by the runtime - no third-party log framework is mandatory. Log level is controlled by the DIGITORN_LOG_LEVEL env var (or the logging section of the daemon config). For structured JSON output, set the appropriate handler in your deployment - Digitorn doesn't force structlog on you.
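A structured JSON handler can be built from the standard library alone. The sketch below shows one way to do it in a deployment-side wrapper; the chosen JSON keys, the fallback to INFO for unrecognized level names, and the `install_json_handler` helper are assumptions, and how the daemon itself wires handlers is covered in Daemon Configuration.

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exc"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def resolve_level(default: str = "INFO") -> int:
    """Read DIGITORN_LOG_LEVEL; fall back to INFO for unknown names."""
    name = os.environ.get("DIGITORN_LOG_LEVEL", default).upper()
    level = logging.getLevelName(name)  # returns an int for known level names
    return level if isinstance(level, int) else logging.INFO


def install_json_handler(logger: logging.Logger) -> logging.Handler:
    """Attach a stream handler with JSON output to the given logger."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(resolve_level())
    return handler
```

Calling `install_json_handler(logging.getLogger())` on the root logger is the simplest usage; targeting a narrower logger name would require knowing the names the daemon uses.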

For configuration of log handlers, formats, and per-module verbosity, see Daemon Configuration.

Frontend integration

The web client consumes metrics over Socket.IO event streams, not the HTTP metrics endpoints. Socket.IO is push-based (events fire as turns complete), whereas the metrics surface is admin-only and pull-based.

| Use case | Surface |
| --- | --- |
| Real-time dashboard inside the app | Socket.IO metrics:* events |
| Prometheus scrape, Grafana dashboard | admin metrics surface |
| One-off ops query | admin metrics surface |
| Kubernetes / load-balancer probe | liveness / readiness probes |

Cross-references