Mesh observability — performance budget
Tracking issue: #69. Source: src/aimp_cp.rs, src/metrics.rs. Operator guide: docs/mesh/integration.md.
Target
The mesh observability surface — eight always-on counters, three new audit kinds, plus future tracing spans — must cost < 0.5 % of host throughput on a 100 k rps workload that doesn't otherwise exercise the mesh. Builds compiled without --features sovereign-aimp see zero mesh-related work; this budget governs the path on builds that opted in.
Where the cost is
The instrumentation is scattered across the mesh hot path. Each counter increment is a single relaxed fetch_add on a global AtomicU64 — same shape as the WAF / cache counters that have been in production since v0.1.x. There are exactly three categories of spend:
- Per-envelope receive cost: every UDP packet that hits the gossip socket triggers one fetch_add for gossip_bytes_in, plus one fetch_add for the disposition (received or dropped_*). That's two atomic ops per envelope, served from the receiver task on its own runtime worker. Even at 10 k envelopes/sec, far above any plausible production rate, that is 20 k relaxed atomic increments per second, a negligible load for a single core.
- Per-emit cost: publish_block bumps mesh_claims_emitted once and then enqueues the delta into the publish channel. The publisher task bumps gossip_bytes_out once per send_to. These run off the request hot path (publish_block enqueues and the publisher loop drains asynchronously); the dispatcher only pays the single fetch_add in publish_block.
- Per-request cost: the dispatcher's cp.lookup(client_ip) call was already on the path before this PR. Issue #69 adds one fetch_add on the positive lookup path (mesh_score_lookups). Negative lookups pay nothing extra, so the per-request overhead is bounded by the hit rate of the mesh-score lookup table.
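The counter pattern described in the three categories above can be sketched as follows. This is a minimal illustration, not the code in src/metrics.rs: the static names, record_envelope, lookup_score, and the HashMap-backed table are all hypothetical stand-ins chosen to mirror the counters named in the text.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical global counters; the real names and layout may differ.
static GOSSIP_BYTES_IN: AtomicU64 = AtomicU64::new(0);
static ENVELOPES_RECEIVED: AtomicU64 = AtomicU64::new(0);
static ENVELOPES_DROPPED: AtomicU64 = AtomicU64::new(0);
static MESH_SCORE_LOOKUPS: AtomicU64 = AtomicU64::new(0);

/// Receive path: exactly two relaxed increments per envelope,
/// one for bytes and one for the disposition.
fn record_envelope(len: usize, accepted: bool) {
    GOSSIP_BYTES_IN.fetch_add(len as u64, Ordering::Relaxed);
    let disposition = if accepted {
        &ENVELOPES_RECEIVED
    } else {
        &ENVELOPES_DROPPED
    };
    disposition.fetch_add(1, Ordering::Relaxed);
}

/// Request path: the counter is touched only when the lookup hits,
/// so a miss costs nothing beyond the lookup itself.
fn lookup_score(table: &HashMap<IpAddr, u32>, ip: IpAddr) -> Option<u32> {
    let score = table.get(&ip).copied();
    if score.is_some() {
        MESH_SCORE_LOOKUPS.fetch_add(1, Ordering::Relaxed);
    }
    score
}

fn main() {
    record_envelope(512, true);
    record_envelope(64, false);

    let known: IpAddr = "192.0.2.1".parse().unwrap();
    let unknown: IpAddr = "192.0.2.2".parse().unwrap();
    let mut table = HashMap::new();
    table.insert(known, 7u32);
    lookup_score(&table, known);
    lookup_score(&table, unknown);

    println!(
        "bytes_in={} received={} dropped={} lookups={}",
        GOSSIP_BYTES_IN.load(Ordering::Relaxed),
        ENVELOPES_RECEIVED.load(Ordering::Relaxed),
        ENVELOPES_DROPPED.load(Ordering::Relaxed),
        MESH_SCORE_LOOKUPS.load(Ordering::Relaxed),
    );
}
```

Relaxed ordering is sufficient here because each counter is read only for reporting and nothing synchronizes on its value.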
Measured
Numbers below are taken on a 10-core M-series Mac with cargo bench --bench (microbench harness, issue #54). The methodology is: a fresh AimpEnvelope is constructed and fed through try_merge 1 M times in a tight loop; then the same loop without the try_merge body, used as the floor.
| Path | Median time | Counter ops |
|---|---|---|
| try_merge accept (clean envelope) | (bench TODO) | 2 fetch_add |
| try_merge reject (replay) | (bench TODO) | 2 fetch_add |
| try_merge reject (signature) | (bench TODO) | 2 fetch_add |
| publish_block enqueue | (bench TODO) | 1 fetch_add + 1 channel send |
| cp.lookup hit + counter | (bench TODO) | 1 hashmap get + 1 fetch_add |
The (bench TODO) rows will land alongside #72 (Bench: --features mesh cost at idle and at saturation) once the mesh has a saturation workload to measure against.
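The floor-subtraction methodology from the Measured section (time the loop with the body, time the same loop without it, report the difference) might look roughly like this. It is a standalone sketch using only the standard library, not the issue #54 harness: body is a hypothetical stand-in that performs two relaxed increments, not the real try_merge, and time_loop and net_ns_per_iter are illustrative names.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for the measured body: two relaxed increments,
/// matching the "2 fetch_add" rows in the table. Not the real try_merge.
fn body(x: u64) -> u64 {
    COUNTER.fetch_add(1, Ordering::Relaxed);
    COUNTER.fetch_add(x & 1, Ordering::Relaxed);
    x
}

/// Time `iters` calls of `f`. black_box keeps the compiler from
/// deleting the loop or hoisting the argument.
fn time_loop(iters: u64, f: impl Fn(u64) -> u64) -> Duration {
    let start = Instant::now();
    for i in 0..iters {
        black_box(f(black_box(i)));
    }
    start.elapsed()
}

/// Net cost per iteration: measured loop minus the empty-body floor.
/// saturating_sub clamps to zero if noise makes the floor run slower.
fn net_ns_per_iter(iters: u64) -> f64 {
    let floor = time_loop(iters, |x| x);
    let measured = time_loop(iters, body);
    measured.saturating_sub(floor).as_nanos() as f64 / iters as f64
}

fn main() {
    let iters = 1_000_000;
    println!("net cost: {:.2} ns/iter", net_ns_per_iter(iters));
}
```

A single run like this is noisy; a real harness would repeat the measurement and report the median, as the table's "Median time" column implies.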
What's NOT measured here
- Tracing span overhead: tracing::info_span! is gated by the active subscriber. Under RUST_LOG=warn (production default) the mesh-receive event is never recorded; the cost is one well-predicted branch. Under RUST_LOG=debug (dev) it's a few hundred nanoseconds per event. Operators who turn mesh tracing on in production should expect a few percent of the mesh receiver task's budget to go to span recording.
- Audit-log mesh kinds: mesh_publish / mesh_receive events are written to the audit log only when [audit].enabled = true. The audit writer is async (bounded mpsc, dedicated task), so the emit is non-blocking; the cost is one try_send on a channel. Sustained audit cost is dominated by the disk-flush rate, not by zion's emit side.
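The non-blocking audit emit described in the second bullet can be illustrated with a bounded channel and try_send. This sketch makes several assumptions: it uses std::sync::mpsc::sync_channel rather than whatever async channel zion actually uses, AuditKind and AuditEmitter are hypothetical types, and dropping the event when the queue is full is an assumed policy that the source does not state.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical event kinds, named after the audit kinds in the text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AuditKind {
    MeshPublish,
    MeshReceive,
}

/// Hypothetical emit-side handle held by the hot path.
struct AuditEmitter {
    enabled: bool, // stands in for [audit].enabled
    tx: SyncSender<AuditKind>,
}

impl AuditEmitter {
    /// Returns true if the event was queued. Never blocks: when the
    /// queue is full or the writer is gone, the event is dropped
    /// (assumed policy for this sketch).
    fn emit(&self, kind: AuditKind) -> bool {
        if !self.enabled {
            return false;
        }
        match self.tx.try_send(kind) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
        }
    }
}

/// Build an emitter and the receiving end a writer task would drain.
fn emitter(enabled: bool, capacity: usize) -> (AuditEmitter, Receiver<AuditKind>) {
    let (tx, rx) = sync_channel(capacity);
    (AuditEmitter { enabled, tx }, rx)
}

fn main() {
    let (audit, rx) = emitter(true, 2);
    let queued = audit.emit(AuditKind::MeshReceive);
    let also_queued = audit.emit(AuditKind::MeshPublish);
    let overflow = audit.emit(AuditKind::MeshReceive);
    println!("queued={queued} also_queued={also_queued} overflow={overflow}");

    // A dedicated writer would drain and flush; here we just drain.
    drop(audit);
    for event in rx {
        println!("audit event: {event:?}");
    }
}
```

The design point is that the emit side pays for one channel operation and never waits on disk; backpressure from a slow writer shows up as a full queue, not as request latency.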
How to validate the budget
# 1. Build the binary with --features sovereign-aimp.
cargo build --release --features sovereign-aimp
# 2. Run zion against a backend (benchmarks/zion-bench-tls.toml)
# with the mesh enabled but no peers. publish_block is never
# called; cp.lookup runs on every request, and mesh_score_lookups
# increments only when the lookup hits.
ZION_AIMP_LISTEN=127.0.0.1:7777 ZION_AIMP_PEERS= \
./target/release/zion --config benchmarks/zion-bench-tls.toml &
# 3. wrk a 100k rps burst against /api/v1/data, scrape /metrics, and
# confirm `zion_mesh_score_lookups_total` ticks at the expected
# rate (1 per request that has a cp.lookup hit).
wrk -t8 -c128 -d30s -H 'Host: bench.local' \
https://127.0.0.1:4430/api/v1/data
# 4. The throughput delta from the same workload without
# --features sovereign-aimp is the mesh observability cost.
# Target: < 0.5 %.

Issue #72 tracks landing this measurement as a CI bench so the budget is enforced automatically.
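The pass/fail arithmetic for step 4 is a relative throughput delta. The sketch below shows one way a CI check could compute it; overhead_percent and within_budget are hypothetical helper names, and the throughput figures in main are illustrative placeholders, not measurements.

```rust
/// Relative throughput loss of the mesh build versus the baseline build,
/// as a percentage of baseline throughput.
fn overhead_percent(baseline_rps: f64, mesh_rps: f64) -> f64 {
    (baseline_rps - mesh_rps) / baseline_rps * 100.0
}

/// True when the measured overhead is strictly under the budget.
fn within_budget(baseline_rps: f64, mesh_rps: f64, budget_percent: f64) -> bool {
    overhead_percent(baseline_rps, mesh_rps) < budget_percent
}

fn main() {
    // Placeholder numbers for illustration only.
    let baseline_rps = 100_000.0;
    let mesh_rps = 99_600.0;
    let budget_percent = 0.5;

    println!(
        "overhead: {:.2} % (budget {:.2} %), pass: {}",
        overhead_percent(baseline_rps, mesh_rps),
        budget_percent,
        within_budget(baseline_rps, mesh_rps, budget_percent),
    );
}
```

At 100 k rps a 0.5 % budget is 500 requests per second, which is within the run-to-run noise of many load generators, so an automated check would likely need several runs per build before comparing.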