
Observability

Zion's observability stack covers four concerns:

  1. Distributed tracing: the tracing crate everywhere, optional OTLP gRPC export.
  2. Metrics with exemplars — Prometheus text format upgraded to OpenMetrics so each histogram bucket can carry the trace ID of the latest observation that fell into it.
  3. Audit log — HMAC-SHA256-chained JSON-Lines, opt-in.
  4. Panic hook — every panic emits one structured JSON record to stderr and to a "last-gasp" file before the process aborts.

All four are always linked into the binary; they're cheap when idle. OTLP export is the only feature gated behind a build flag (--features otel) because it pulls in tonic + prost.

Distributed tracing

The tracing crate is initialized at boot. Filtering follows RUST_LOG (full tracing-subscriber syntax); the default is zion=info,warn. Output format mirrors [server.log_format]:

| log_format | Output |
| --- | --- |
| text (default) | pretty multi-line, ANSI-colored on a TTY |
| json | one JSON object per line, wire-compatible with Loki / ELK / Datadog |

W3C Trace Context propagation

Every request carries a traceparent header. The dispatcher:

  1. Parses the inbound header per W3C Trace Context v0. All-zero IDs and malformed values are rejected (zion_traces_invalid_total counter ticks).
  2. If the inbound header was valid, it is forwarded unchanged.
  3. Otherwise, Zion generates one and forwards it.

The parsed 16-byte trace ID is attached to the latency histogram as an OpenMetrics exemplar (see below) and to every audit event for the request.
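The dispatcher steps above can be sketched in a few lines of Python. This is a minimal illustration, not Zion's implementation: the function names are hypothetical, only version 00 headers are accepted, and the sampled flag on generated headers is an assumption.

```python
import re
import secrets

# traceparent, version 00: "00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>"
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    m = _TRACEPARENT.fullmatch(value.strip())
    if m is None:
        return None                      # malformed
    trace_id, parent_id, flags = m.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None                      # all-zero IDs are rejected
    return trace_id, parent_id, flags


def resolve_traceparent(inbound):
    """Forward a valid inbound header unchanged, otherwise generate one."""
    if inbound is not None and parse_traceparent(inbound) is not None:
        return inbound
    trace_id = secrets.token_hex(16)     # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)     # 8 random bytes -> 16 hex chars
    return f"00-{trace_id}-{parent_id}-01"   # "01" (sampled) is an assumption
```

A rejected header here corresponds to the case where zion_traces_invalid_total would tick.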

Optional OTLP export

bash
cargo build --release --features otel
OTEL_EXPORTER_OTLP_ENDPOINT=http://tempo.observability.svc:4317 \
    ZION_CONFIG=zion.toml ./target/release/zion

The exporter ships every span emitted by tracing::info_span! / #[instrument] to the configured collector. Resource attributes are service.name=zion and service.version (the compile-time crate version). No batching parameters are exposed yet; the SDK defaults are in effect (5-second export interval, batches of up to 512 spans, 2048-span queue).

To verify export end-to-end without a remote collector, point OTEL_EXPORTER_OTLP_ENDPOINT at http://127.0.0.1:4317 and run otel-cli or the collector contrib distribution locally as the receiver.

Metrics with exemplars

/metrics is now OpenMetrics-compatible. Each histogram bucket gains an exemplar suffix that links to the most recent request observed in that bucket:

zion_request_duration_seconds_bucket{le="0.512"} 17 # {trace_id="0af7651916cd43dd8448eb211c80319c"} 0.481234 1714896000.123

The exemplar update cost is 4 relaxed atomic stores on a cache line we just touched — measurable in benchmarks at sub-percent overhead, hidden by the existing histogram observation cost.
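To pull the trace ID out of a scraped bucket line, a small parser is enough. The sketch below is a simplified reading of the OpenMetrics exemplar syntax (it does not handle escaped quotes or braces inside label values), and parse_exemplar is a hypothetical helper name.

```python
import re

# <name>{<labels>} <value> # {<exemplar labels>} <exemplar value> [<timestamp>]
_EXEMPLAR = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'\s+#\s+\{(?P<ex_labels>[^}]*)\}'
    r'\s+(?P<ex_value>\S+)'
    r'(?:\s+(?P<ex_ts>\S+))?\s*$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exemplar(line):
    """Split one sample line into its parts; None if it carries no exemplar."""
    m = _EXEMPLAR.match(line)
    if m is None:
        return None
    ts = m.group("ex_ts")
    return {
        "metric": m.group("name"),
        "labels": dict(_LABEL.findall(m.group("labels") or "")),
        "value": float(m.group("value")),
        "exemplar_labels": dict(_LABEL.findall(m.group("ex_labels"))),
        "exemplar_value": float(m.group("ex_value")),
        "exemplar_ts": float(ts) if ts else None,
    }


sample = ('zion_request_duration_seconds_bucket{le="0.512"} 17 '
          '# {trace_id="0af7651916cd43dd8448eb211c80319c"} 0.481234 1714896000.123')
print(parse_exemplar(sample)["exemplar_labels"]["trace_id"])
```

The extracted trace_id can then be looked up in whichever tracing backend receives the OTLP export.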

Five new counters are exposed alongside the existing ones:

| Counter | Meaning |
| --- | --- |
| zion_panics_total | Worker panics caught by the panic hook. |
| zion_audit_events_total | Audit-log events emitted (signed + chained). |
| zion_audit_events_dropped_total | Audit events dropped because the writer queue was full. Non-zero values mean either the disk is slow or audit.queue_depth is too small. |
| zion_traces_emitted_total | Request spans observed (one per request). |
| zion_traces_invalid_total | Inbound traceparent headers rejected as malformed. |

Sovereign IP classification (--features geo-ita / geo-eu)

When the [sovereign] block is enabled, every request's client IP is classified against the baked-in CIDR dataset (an O(log N) binary search + one relaxed fetch_add; no allocation, no syscall, no external GeoIP DB) and tallied:

zion_sovereign_classifications_total{class="eu"}              42891
zion_sovereign_classifications_total{class="gov_eu"}            317
zion_sovereign_classifications_total{class="residential_eu"}  18044
zion_sovereign_classifications_total{class="datacenter_eu"}    9210
zion_sovereign_classifications_total{class="unknown"}         15538
...

The dataset is generated by scripts/generate_sovereign_data.py and refreshed weekly via the sovereign-data workflow (RIPE NCC delegated stats + Team Cymru IPtoASN). The eu class is the EU-27 country-level baseline; gov_eu / residential_eu / datacenter_eu are the more specific curated-ASN roles that override it.

Reading "% EU vs non-EU traffic" — sum the EU-family classes over the grand total. In PromQL:

promql
sum(zion_sovereign_classifications_total{class=~"eu|gov_eu|residential_eu|datacenter_eu"})
  / ignoring(class) sum(zion_sovereign_classifications_total)
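
As a worked example of the same ratio, the snippet below uses the five sample values from the listing above. The trailing "..." in that listing may hide further classes, so the result is illustrative only; eu_share is a hypothetical helper.

```python
# Sample counter values copied from the listing above (illustrative only).
counts = {
    "eu": 42891,
    "gov_eu": 317,
    "residential_eu": 18044,
    "datacenter_eu": 9210,
    "unknown": 15538,
}
EU_FAMILY = {"eu", "gov_eu", "residential_eu", "datacenter_eu"}


def eu_share(counts):
    """Fraction of classified requests that fall into an EU-family class."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    eu = sum(v for cls, v in counts.items() if cls in EU_FAMILY)
    return eu / total


print(f"{eu_share(counts):.1%}")  # 70462 / 86000 -> 81.9%
```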

Both IPv4 and IPv6 clients are classified (the dataset bakes a u32 table and a parallel u128 table; IPv4-mapped IPv6 folds onto the v4 path). unknown therefore means an IP in no dataset — genuinely non-EU on the geo-eu build, or unclassified-by-ASN on geo-ita.
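The lookup described above (sorted range table, binary search, IPv4-mapped folding, one counter increment) can be sketched as follows. The table layout, the two documentation-only ranges, and the helper names are assumptions for illustration; the real baked-in dataset format is not shown here, and this sketch carries a v4 table only.

```python
import bisect
import ipaddress
from collections import Counter


def build_table(entries):
    """Turn (cidr, class) pairs into a sorted list of (start, end, class)."""
    rows = []
    for cidr, cls in entries:
        net = ipaddress.ip_network(cidr)
        rows.append((int(net.network_address), int(net.broadcast_address), cls))
    rows.sort()
    return rows


# Hypothetical miniature dataset; ranges are assumed non-overlapping.
TABLE = build_table([
    ("192.0.2.0/24", "eu"),
    ("198.51.100.0/24", "datacenter_eu"),
])
STARTS = [row[0] for row in TABLE]
tally = Counter()   # stands in for the per-class counters


def classify(ip):
    addr = ipaddress.ip_address(ip)
    # Fold IPv4-mapped IPv6 (::ffff:a.b.c.d) onto the IPv4 path.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped
    cls = "unknown"
    if addr.version == 4:
        key = int(addr)
        i = bisect.bisect_right(STARTS, key) - 1   # last range starting <= key
        if i >= 0 and TABLE[i][0] <= key <= TABLE[i][1]:
            cls = TABLE[i][2]
    tally[cls] += 1
    return cls
```

A native IPv6 address returns "unknown" in this sketch only because no v6 table is included; the document describes a parallel u128 table for that path.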

Tag-driven enforcement ([sovereign.enforce]). By default the class is a pure signal. The operator can opt a class (or an AIMP mesh-reputation threshold) into a hard 403 deny — e.g. deny = ["unknown"] on a geo-eu build blocks every non-EU source while the EU classes pass (the sovereign allowlist by complement). Denials are counted, split by reason:

zion_enforcement_denied_total{reason="class"}       1043
zion_enforcement_denied_total{reason="mesh_score"}    77

The local WAF / rate-limiter / auth gates stay authoritative — enforcement only adds a deny on top, and is off until configured.

L7 tarpit ([sovereign.enforce.tarpit], #151). When enforcement is on, the operator can escalate a deny from a cheap 403 to a held connection: a flagged source is parked for hold_secs seconds before the refusal, so a flood pays in wall-clock time and socket budget. A hard global ceiling (max_concurrent) sheds back to an immediate 403 at capacity, and is clamped at config load to ¼ of the global connection pool so held connections can't pin admission.

zion_tarpit_active        12     # gauge: connections currently held
zion_tarpit_total       4310     # counter: total ever held
zion_tarpit_shed_total   118     # counter: shed to immediate 403 at the ceiling
zion_tarpit_held_ms_total 43100  # counter: cumulative held wall-clock (ms)

Mean hold ≈ rate(zion_tarpit_held_ms_total[5m]) / rate(zion_tarpit_total[5m]). A rising zion_tarpit_shed_total means the ceiling is saturated — raise max_concurrent (bounded by the ¼-pool clamp) or lower hold_secs. The ceiling counts in-flight held requests (HTTP/2 streams), so size it with stream fan-out in mind.
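The arithmetic behind these readings is small enough to check by hand. The sketch below applies it to the sample values above; the helper names, the pool_size parameter, integer rounding of the ¼ clamp, and the assumption that shed requests are not also counted in zion_tarpit_total are all illustrative guesses.

```python
def effective_ceiling(max_concurrent, pool_size):
    """Configured ceiling, clamped to a quarter of the connection pool."""
    return min(max_concurrent, pool_size // 4)


def mean_hold_ms(held_ms_total, held_total):
    """Lifetime mean hold time; the PromQL form uses rate() over a window."""
    return held_ms_total / held_total if held_total else 0.0


def shed_ratio(shed_total, held_total):
    """Share of tarpit candidates that were shed to an immediate 403."""
    attempts = shed_total + held_total
    return shed_total / attempts if attempts else 0.0


print(mean_hold_ms(43100, 4310))        # 10.0
print(round(shed_ratio(118, 4310), 4))  # 0.0266
print(effective_ceiling(5000, 8192))    # 2048
```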

Access log

Every successful request emits one structured tracing::info! event under the access target with these fields:

| Field | Type | Notes |
| --- | --- | --- |
| status | u16 | HTTP response status. |
| latency_us | u64 | Total request duration (client → response sent), in µs. |
| method | str | HTTP method. |
| path | str | URI path with query-string redacted via [redact.query_params]. |
| remote_ip | str | Client IP after XFF resolution. |
| headers | json | Configured headers, redacted per [redact.headers]. Empty when [access_log] include_headers is empty (default). |
| mtls_fp | str | X-Client-Cert-Fingerprint value (SHA-256 hex), when present and [access_log] mtls_fingerprint = true. Never redacted, since the value is already a hash. |

Configuration (issue #60)

toml
[access_log]
# Headers to emit on every access-log line. Lowercased on parse;
# values pass through `[redact.headers]` before serialisation.
include_headers   = ["user-agent", "authorization", "host", "x-forwarded-for"]
# Surface the mTLS leaf-cert SHA-256 fingerprint as a dedicated
# `mtls_fp` field. Default true — set false to omit even when mTLS
# is configured.
mtls_fingerprint  = true

[redact]
# headers in this list are replaced by `<redacted:N>` (N = byte
# length of the original value). Same policy already protects the
# audit log's request_blocked / auth_failure events.
headers       = ["authorization", "cookie", "x-api-key"]
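
The interaction between include_headers and [redact.headers] can be sketched like this. It is an illustration of the policy described in the comments above, with a hypothetical function name, not the code Zion runs.

```python
def select_and_redact(request_headers, include_headers, redact_headers):
    """Pick the configured headers and mask the ones on the redact list."""
    lowered = {name.lower(): value for name, value in request_headers.items()}
    redact = {name.lower() for name in redact_headers}
    out = {}
    for name in (h.lower() for h in include_headers):
        if name not in lowered:
            continue                      # header absent on this request
        value = lowered[name]
        if name in redact:
            # N is the byte length of the original value.
            value = f"<redacted:{len(value.encode('utf-8'))}>"
        out[name] = value
    return out


logged = select_and_redact(
    {"User-Agent": "curl/8.5.0", "Authorization": "Bearer abc123", "Host": "example.com"},
    ["user-agent", "authorization", "host", "x-forwarded-for"],
    ["authorization", "cookie", "x-api-key"],
)
print(logged["authorization"])  # <redacted:13>
```

Keeping the length in the placeholder lets a reviewer tell an empty credential from a present one without exposing the value.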

Storage budget

Each line is a JSON object. With the default fields plus 5 headers emitted, expect ~500 B per request. At 100 k rps that's ~50 MB/s of access-log volume — operators sizing log shippers (Vector, Fluent Bit, Loki agent) should account for this when opting into include_headers. When the list is empty (default), the line stays at ~120 B.
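The figures above follow from a simple multiplication, which also gives the daily volume a log shipper has to absorb. The helper below is a hypothetical sizing aid using decimal units (1 MB = 10^6 bytes).

```python
def log_volume(bytes_per_line, requests_per_second):
    """Sustained access-log volume for a given line size and request rate."""
    per_second = bytes_per_line * requests_per_second
    return {
        "MB_per_s": per_second / 1e6,
        "TB_per_day": per_second * 86_400 / 1e12,
    }


print(log_volume(500, 100_000))  # 50 MB/s, about 4.32 TB/day
print(log_volume(120, 100_000))  # 12 MB/s, about 1.04 TB/day
```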

Audit-log mirror (request_completed)

When [access_log] opts in (any header configured, or mtls_fingerprint = true) and [audit].enabled = true, every access-log line is mirrored as a signed audit event with kind = "request_completed". The detail field carries status=N latency_us=N headers={…} mtls_fp=…, so a compliance reviewer querying the audit log sees the same shape they'd see in the access log, with the HMAC chain attached. The audit kind is defined as the canonical constant audit::kind::REQUEST_COMPLETED.

Audit log

The audit log is a tamper-evident, HMAC-SHA256-chained JSON-Lines file. It is disabled by default.

Configuration

toml
[audit]
enabled = true
path = "/var/log/zion/audit.jsonl"
key_env = "ZION_AUDIT_HMAC_KEY"   # default; the secret never lives in zion.toml
queue_depth = 4096                # bounded mpsc — events overflow ⇒ dropped + counted

[redact]
headers      = ["authorization", "cookie", "x-api-key"]
query_params = ["token", "api_key", "session"]

The HMAC key is read from the named environment variable. RFC 2104 recommends a key at least as long as the hash output (32 bytes for HMAC-SHA256); shorter keys are accepted, but Zion logs a warning at boot.

Wire format

One JSON object per line. Fields:

| Field | Type | Notes |
| --- | --- | --- |
| seq | u64 | Monotonic within a process. Resets on restart. |
| ts | string | RFC 3339 / ISO 8601 with microsecond precision. |
| kind | string | chain_init, auth_success, auth_failure, request_blocked, config_reload, admin_access, panic. |
| trace_id | string | Optional. 32-char hex. |
| remote_ip, method, path, detail | string | Optional. path's query string is redacted per [redact.query_params]. |
| prev_hash | string | 64-char hex. The HMAC of the previous record (or the genesis tag for seq=0). |
| hmac | string | 64-char hex. HMAC-SHA256(key, canonical_event_json + SEP + prev_hash), where SEP is the single pipe character (0x7C), as in the verification script. |
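
For orientation, the producing side of the chain can be sketched as the mirror image of the verification script in the next section. The genesis label and the pipe separator are taken from that script; the function names and the sample events are hypothetical, and the exact canonical JSON that Zion emits (field order, number formatting) is assumed to match compact json.dumps output.

```python
import hashlib
import hmac
import json

GENESIS_LABEL = b"ZION-AUDIT-GENESIS-V1"   # as used by the verification script


def genesis_tag(key):
    """prev_hash value for the first record (seq=0) of a chain."""
    return hmac.new(key, GENESIS_LABEL, hashlib.sha256).hexdigest()


def sign_record(key, prev_hash, event):
    """Attach prev_hash and hmac to an event dict, returning a new record."""
    # Compact separators and raw UTF-8, assumed to match the serialised form.
    canon = json.dumps(event, separators=(",", ":"), ensure_ascii=False)
    mac = hmac.new(key, canon.encode() + b"|" + prev_hash.encode(),
                   hashlib.sha256).hexdigest()
    return {**event, "prev_hash": prev_hash, "hmac": mac}


key = b"k" * 32   # placeholder key, for illustration only
first = sign_record(key, genesis_tag(key), {"seq": 0, "kind": "chain_init"})
second = sign_record(key, first["hmac"], {"seq": 1, "kind": "auth_failure"})
```

Each record's prev_hash is the hmac of the record before it, which is what makes a removed or reordered line visible to a verifier.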

Verification

A short Python script, run from the shell, verifies the chain:

bash
# The HMAC key, kept off-config. It must be exported so the Python child sees it.
export KEY="$(cat /etc/zion/audit.key)"
python3 - <<'PY'
import hashlib, hmac, json, os, sys

key = os.environ["KEY"].encode()
genesis = hmac.new(key, b"ZION-AUDIT-GENESIS-V1", hashlib.sha256).hexdigest()
prev = genesis
ok = 0
with open("/var/log/zion/audit.jsonl", encoding="utf-8") as log:
    for lineno, line in enumerate(log, start=1):
        rec = json.loads(line)
        # Each restart opens a fresh chain: chain_init at seq=0 re-anchors at genesis.
        if rec.get("kind") == "chain_init" and rec.get("seq") == 0:
            prev = genesis
        body = rec.copy()
        expected_prev = body.pop("prev_hash")
        expected_hmac = body.pop("hmac")
        if expected_prev != prev:
            sys.exit(f"chain break at line {lineno}: prev mismatch")
        # Compact separators and raw UTF-8 to match serde's compact output.
        canon = json.dumps(body, separators=(",", ":"), ensure_ascii=False)
        sig = hmac.new(key, canon.encode() + b"|" + prev.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected_hmac):
            sys.exit(f"signature mismatch at line {lineno}")
        prev = expected_hmac
        ok += 1
print(f"verified {ok} records")
PY

The verifier walks top-down and stops at the first inconsistency. Tampering with, deleting, or reordering any record is detected.

Failure semantics

The writer task runs in tokio::spawn. If:

  • the queue is full, events are dropped and zion_audit_events_dropped_total ticks. The hot path never blocks.
  • the file cannot be opened at startup, audit is disabled (an error is logged) and the rest of the daemon continues.
  • a write fails mid-run (disk full, fd revoked), the writer task exits and subsequent events are dropped. A monitor on zion_audit_events_dropped_total > 0 is the recommended alert.
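
The drop-instead-of-block behaviour in the list above can be modelled with a bounded queue and a non-blocking enqueue. This is a Python analogy for the bounded mpsc channel, with hypothetical class and attribute names; it is not how the Rust writer is implemented.

```python
import queue


class AuditQueue:
    """Bounded queue whose producer side never waits on the writer."""

    def __init__(self, depth):
        self._q = queue.Queue(maxsize=depth)
        self.dropped = 0            # stands in for zion_audit_events_dropped_total
        self.writer_alive = True    # set to False once the writer has exited

    def emit(self, event):
        """Enqueue without blocking; count the event as dropped on failure."""
        if not self.writer_alive:
            self.dropped += 1
            return False
        try:
            self._q.put_nowait(event)
        except queue.Full:
            self.dropped += 1
            return False
        return True

    def drain_one(self):
        """Writer side: take one event, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

With depth 2 and no writer draining, the third emit returns False and the drop counter becomes 1, which is the condition the recommended alert watches for.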

Each restart begins a fresh chain anchored at the genesis tag. A chain_init record is emitted as seq=0 so a verifier can spot the boundary. Continuing a chain across restarts would require trusting the on-disk tail value, which defeats tamper-evidence.

Panic hook

Installed before any worker thread is spawned. On a panic anywhere in the process:

  1. zion_panics_total is incremented.
  2. A single-line JSON record is written to stderr — including thread name, source location, and the panic payload, with all control bytes JSON-escaped.
  3. The same record is appended to a "last-gasp" file. Default: /var/lib/zion/last_panic.jsonl. Override with ZION_LAST_GASP_PATH.
  4. The previous panic hook (Rust default, or whatever the test harness installed) is chained — no loss of dev-mode backtrace UX.

Because the release profile ships panic = "abort", the hook runs once and the process exits. A sidecar / next-boot probe surfaces the persisted record. Liveness probes detect the corresponding restart through the orchestrator (Helm probes on /healthz flap; readiness goes red until a fresh process is up).
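
A next-boot probe of the kind mentioned above only needs to read the last line of the last-gasp file. The sketch below assumes nothing about the record's field names and simply returns the decoded JSON object; last_panic is a hypothetical helper.

```python
import json
import os

DEFAULT_PATH = "/var/lib/zion/last_panic.jsonl"


def last_panic(path=None):
    """Return the most recent panic record, or None if there is none."""
    path = path or os.environ.get("ZION_LAST_GASP_PATH", DEFAULT_PATH)
    try:
        with open(path, encoding="utf-8") as fh:
            lines = [line for line in fh if line.strip()]
    except FileNotFoundError:
        return None                 # no panic has been persisted yet
    return json.loads(lines[-1]) if lines else None
```

A sidecar could call this at startup and forward the record to the log pipeline before the new process begins serving.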

Mesh (--features sovereign-aimp)

The mesh layer surfaces its observability through the same triad (audit log + counters + structured boot log). All mesh counters are always rendered on /metrics, zero on builds without --features sovereign-aimp — operators can grep for the same metric name regardless of which build their distro produced.

Counters wired today (issue #69):

| Metric | Type | What it counts |
| --- | --- | --- |
| zion_mesh_claims_emitted_total | counter | Successful local emits (aimp_cp::publish_block). |
| zion_mesh_claims_received_total | counter | Inbound envelopes that passed all policy gates and merged into local state. |
| zion_mesh_claims_dropped_total{reason="signature"} | counter | Inbound envelopes rejected on Ed25519 signature verification. |
| zion_mesh_claims_dropped_total{reason="replay"} | counter | Inbound envelopes rejected as duplicates (seen-signature filter). |
| zion_mesh_claims_dropped_total{reason="other"} | counter | Other rejections: timestamp skew (past/future), magic-prefix mismatch, payload decode error, revocation by non-original source. |
| zion_mesh_score_lookups_total | counter | Dispatcher hits that found a mesh score for the client IP (the X-Zion-Mesh-Score header rate). |
| zion_mesh_gossip_bytes_in_total | counter | Total bytes received on the gossip socket (covers malformed packets too). |
| zion_mesh_gossip_bytes_out_total | counter | Total bytes sent on the gossip socket. |

Audit kinds (see mod kind in src/audit.rs for the canonical reference list):

  • mesh_publish / mesh_receive — every publish + receive can be recorded as a signed audit event carrying the envelope's signature, the resolved node_id, and the local HMAC chain prev_hash.
  • mesh_peer_joined / mesh_peer_dropped — reserved for the mesh peer-state tracker (issue #68).
  • mesh_quorum_decision — reserved for the quorum aggregator (#66 / #67).

Cost: see docs/perf/mesh-overhead.md for the budget the observability surface is allowed to spend.

The full operator-facing guide (topology, identity rotation, debugging) lives at docs/mesh/integration.md. Threat-model addendum specific to the mesh surface: docs/security/threat-model.md §10.

What's next

  • Span instrumentation: automatic span creation around process_request is wired through the W3C parser; richer per-stage spans (WAF, cache, upstream) will follow in a later change.
  • PII redaction in access logs — landed via [access_log] (issue #60). Configured headers pass through the same [redact.headers] policy that protects audit events.
  • OTLP metrics — the SDK supports it; we have not enabled the export path yet because the lock-free metrics module already covers the use cases. We may add it for parity if a downstream consumer needs an OTLP-only ingest.

Released under the MIT License.