Observability

SIP.IO is built to observe millions of calls a month without drowning the control plane. It separates data by class and stores each where it’s cheapest to keep and fastest to query.

Three classes of data, three stores

Class	What it is	Store	Volume (at ~3M calls/mo)
CDR	One summary row per call/leg (who, whom, duration, disposition, cost).	`cdr` table in the Apache Iceberg data lake	~3–6M rows/mo
Call-event trace	Every lifecycle and flow event for a call.	`call_events` table in the Apache Iceberg data lake	~60M events/mo
SIP packet capture (optional)	Raw SIP messages for deep debugging.	HEP capture → a separate columnar store	100M+ msgs/mo

Keeping CDRs and traces in the Apache Iceberg data lake (columnar Parquet) rather than the edge SQL database is deliberate: 60M events/month is far more than a control-plane database should hold, and Iceberg makes that volume cheap to retain and quick to scan. Events are partitioned by date and account_id, so a single-call lookup that also filters on those columns is partition-pruned and reads only the relevant slice.

The correlation key

Everything ties together on one key: call_id. To reconstruct a call end-to-end:

SELECT * FROM call_events WHERE call_id = ? ORDER BY ts;
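The lookup above filters on call_id alone. Because events are partitioned by date and account_id, a lookup that also supplies those two values gives the query engine something to prune on. The sketch below builds such a query; the helper name, the `TraceLookup` shape, the `date` column name, and the `?` placeholder style are assumptions for illustration, not part of the SIP.IO API.

```typescript
// Hypothetical helper: builds a parameterized single-call lookup that also
// filters on the partition columns (date, account_id) so the engine can prune.
interface TraceLookup {
  callId: string;
  accountId: string;
  date: string; // assumed format: YYYY-MM-DD
}

interface ParameterizedQuery {
  sql: string;
  params: string[];
}

function buildTraceQuery(lookup: TraceLookup): ParameterizedQuery {
  return {
    sql:
      "SELECT * FROM call_events " +
      "WHERE date = ? AND account_id = ? AND call_id = ? " +
      "ORDER BY ts",
    // Parameter order matches the placeholders above.
    params: [lookup.date, lookup.accountId, lookup.callId],
  };
}

const q = buildTraceQuery({
  callId: "c-123",
  accountId: "acct-9",
  date: "2024-05-01",
});
```

A call that spans midnight would need a date range rather than an equality predicate.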

Who emits what

The trace is assembled from every layer, all stamped with the same call_id:

SIP signaling layer: INVITE / 180 / 200 / BYE and the routing decision.
Edge runtime: /route, /auth, /flow, and admission (/admit) decisions.
CallSessionDO: every flow node entered, its outcome, timing, variables, and DTMF. (This replaces ad-hoc flow logging, since the interpreter is the tracer.)
PresenceDO: ACD reserve / bridge / no-answer events and CAC decisions.
Media engine: play / record / conference / bridge.

To avoid a write storm, the CallSessionDO buffers a call’s trace in-object and flushes it in one batch at a terminal or bridge step, not on every poll. (A queue hold polls frequently; you don’t want a database row per heartbeat.)
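The buffer-then-flush behavior can be sketched roughly as follows. This is not the CallSessionDO code: `TraceBuffer`, `BatchSink`, the event shape, and the choice of "bridge" and "hangup" as flush points are all hypothetical names chosen to illustrate the idea.

```typescript
// Sketch of buffer-then-flush tracing. Every name here is hypothetical.
interface TraceEvent {
  callId: string;
  ts: number; // epoch milliseconds
  type: string; // e.g. "node_entered", "queue_poll", "bridge", "hangup"
}

type BatchSink = (batch: TraceEvent[]) => void;

// Assumed flush triggers: a bridge step or a terminal (hangup) step.
function isFlushPoint(type: string): boolean {
  return type === "bridge" || type === "hangup";
}

class TraceBuffer {
  private events: TraceEvent[] = [];
  private sink: BatchSink;

  // `sink` stands in for whatever hands a batch to the durable pipeline.
  constructor(sink: BatchSink) {
    this.sink = sink;
  }

  // Buffer in memory; only a flush point causes a write.
  record(event: TraceEvent): void {
    this.events.push(event);
    if (isFlushPoint(event.type)) {
      this.flush();
    }
  }

  flush(): void {
    if (this.events.length === 0) return;
    const batch = this.events;
    this.events = [];
    this.sink(batch);
  }
}

// Usage: three queue polls and a hangup produce one write, not four.
const batches: TraceEvent[][] = [];
const buffer = new TraceBuffer((batch) => {
  batches.push(batch);
});
buffer.record({ callId: "c-123", ts: 1000, type: "queue_poll" });
buffer.record({ callId: "c-123", ts: 2000, type: "queue_poll" });
buffer.record({ callId: "c-123", ts: 3000, type: "queue_poll" });
buffer.record({ callId: "c-123", ts: 4000, type: "hangup" });
```

The trade-off in this sketch is that buffered events live only in memory until the flush point, so a real implementation would need to decide what happens to them if the object is evicted mid-call.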

CDR vs. trace

The CDR is the billing-grade summary: account, direction, timestamps, billable seconds, caller/destination, disposition, hangup cause, SIP status, cost/price, recording key, and which route/flow ran. It’s derived from the event stream (or emitted at hangup) and retained long-term.
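To make "derived from the event stream" concrete, here is a minimal sketch that folds one call's events into a summary. The event types ("invite", "answer", "hangup"), the field names, and the rule of rounding billable time up to the next whole second are assumptions for illustration; the real CDR carries many more fields and its billing rules are not specified here.

```typescript
// Hypothetical shapes; the real CDR has more fields (cost, SIP status, etc.).
interface CallEvent {
  callId: string;
  ts: number; // epoch milliseconds
  type: string; // assumed types: "invite", "answer", "hangup", ...
}

interface CdrSummary {
  callId: string;
  startTs: number;
  endTs: number;
  billableSeconds: number;
  disposition: "answered" | "no_answer";
}

function deriveCdr(events: CallEvent[]): CdrSummary | undefined {
  const ordered = events.slice().sort((a, b) => a.ts - b.ts);
  const first = ordered[0];
  const last = ordered[ordered.length - 1];
  if (first === undefined || last === undefined) return undefined;

  let answerTs: number | undefined;
  let hangupTs: number | undefined;
  for (const e of ordered) {
    if (e.type === "answer" && answerTs === undefined) answerTs = e.ts;
    if (e.type === "hangup" && hangupTs === undefined) hangupTs = e.ts;
  }

  // Fall back to the last event if no explicit hangup was recorded.
  const endTs = hangupTs === undefined ? last.ts : hangupTs;
  // Assumed rule: bill from answer to end, rounded up to whole seconds.
  const billableSeconds =
    answerTs === undefined
      ? 0
      : Math.max(0, Math.ceil((endTs - answerTs) / 1000));

  return {
    callId: first.callId,
    startTs: first.ts,
    endTs,
    billableSeconds,
    disposition: answerTs === undefined ? "no_answer" : "answered",
  };
}

// Answered at 2.0 s, hung up at 65.5 s: 63.5 s of talk time, billed as 64.
const cdr = deriveCdr([
  { callId: "c-123", ts: 0, type: "invite" },
  { callId: "c-123", ts: 2000, type: "answer" },
  { callId: "c-123", ts: 65500, type: "hangup" },
]);
```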

The trace is the diagnostic detail: the full ordered story of one call, retained hot for weeks. Use the CDR for reporting and billing; use the trace for “why did this specific call do that?”

Live vs. durable

The same event stream feeds two sinks:

Durable: the streaming pipeline → Iceberg tables on object storage (the CDR and the trace).
Live: a decoupled outbox → WebSocket fan-out that powers real-time wallboards, the live flow visualizer, and agent dashboards.

The live path is intentionally decoupled from the source-of-truth write: a stateful edge object writes the delta in the same output batch, and a separate drain loop pushes it to subscribers, so broadcasting to dashboards never blocks an admit() or a reserve().
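The outbox-plus-drain pattern described above can be sketched as below. `LiveOutbox`, `Delta`, and the callback-style subscribers are hypothetical stand-ins (real subscribers would be WebSocket connections); the point is only that the hot path appends to a list while delivery happens in a separate step.

```typescript
// Sketch of a decoupled outbox. All names are hypothetical.
interface Delta {
  seq: number;
  payload: string;
}

type Subscriber = (delta: Delta) => void;

class LiveOutbox {
  private pending: Delta[] = [];
  private subscribers: Subscriber[] = [];
  private nextSeq = 1;

  subscribe(fn: Subscriber): void {
    this.subscribers.push(fn);
  }

  // Hot path (e.g. inside an admit or reserve): append only, no delivery.
  enqueue(payload: string): number {
    const seq = this.nextSeq++;
    this.pending.push({ seq, payload });
    return seq;
  }

  // Run by a separate drain loop (assumed: a timer or alarm).
  // Returns how many deltas were taken off the outbox.
  drain(): number {
    const batch = this.pending;
    this.pending = [];
    for (const delta of batch) {
      for (const fn of this.subscribers) {
        try {
          fn(delta);
        } catch (_err) {
          // A failing subscriber must not stall delivery to the others.
        }
      }
    }
    return batch.length;
  }
}

const outbox = new LiveOutbox();
const received: string[] = [];
outbox.subscribe((d) => {
  received.push(d.payload);
});
outbox.enqueue("agent-7 reserved");
outbox.enqueue("call c-123 bridged");
const seenBeforeDrain = received.length; // enqueue alone delivers nothing
const drained = outbox.drain();
```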

Support tooling

GET /calls/<callId>/trace returns the full execution trace for one call: dialplan steps, commands, outcomes, errors, in order. This is the first stop when debugging a specific call.
/debug is a live dev monitor (a WebSocket terminated at the session object) that drives the real PresenceDO. It is useful for testing flows and ACD behavior without a physical phone.
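When calling the trace endpoint from a script, the call ID should be URL-encoded if it can contain reserved characters (SIP Call-ID values, for example, often include `@`). A small sketch, assuming a hypothetical base URL and a hypothetical helper name; authentication and the response format are not covered here.

```typescript
// Hypothetical helper: builds the URL for GET /calls/<callId>/trace.
function traceUrl(baseUrl: string, callId: string): string {
  // Drop any trailing slashes so the path does not contain "//".
  const base = baseUrl.replace(/\/+$/, "");
  return base + "/calls/" + encodeURIComponent(callId) + "/trace";
}

// The host below is a placeholder, not a real SIP.IO endpoint.
const url = traceUrl("https://api.example.invalid/", "a1b2@10.0.0.5");
// url === "https://api.example.invalid/calls/a1b2%4010.0.0.5/trace"
```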

Metrics

Aggregate operational metrics (calls/sec, node-execution counts, error rates, p95 setup latency) are tracked in the edge analytics engine as sampled, aggregated values, separate from the per-call trace. At the target scale, net storage cost for the whole observability stack comes to pennies per month, because object storage has no egress fees and queries read only the partitions they touch.
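For readers unfamiliar with the p95 figure, the sketch below shows one common way to compute a percentile from a sample, the nearest-rank method. It illustrates the arithmetic only; how the edge analytics engine actually samples and aggregates is not specified here, and the function name and sample values are made up.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of the samples are less than or equal to it.
function percentile(samples: number[], p: number): number | undefined {
  if (samples.length === 0 || p <= 0 || p > 100) return undefined;
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Made-up call setup latencies in milliseconds.
const setupMs = [120, 80, 200, 150, 95, 110, 300, 90, 105, 130];
const p95 = percentile(setupMs, 95); // rank ceil(9.5) = 10, so 300
const p50 = percentile(setupMs, 50); // rank 5, so 110
```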