Optimization Log

Changes made to improve throughput and latency, with rationale. Throughput claims reference wrk benchmark results from benchmarks/bench-native.sh.

Architectural

| Change | Rationale |
| --- | --- |
| HTTP/2 upstream multiplexing | hyper-rustls HttpsConnector with ALPN H2 negotiation for HTTPS upstreams |
| TLS connection pre-warming | Health checks reuse shared HttpClient; startup pre-warm task fires GET to all upstreams |
| TCP_CORK on listener (Linux) | Batches TLS record + HTTP headers into full MSS segments |
| Connection pool pre-warming | Fires GET to all upstreams before accept loop starts |
| Thread-local route lookup cache | FNV hash of path maps to cached Arc<ResolvedRoute> (~5ns vs ~30ns radix tree) |

Compiler / Build

| Change | Rationale |
| --- | --- |
| target-cpu=native (.cargo/config.toml) | Unlocks NEON/AES-CE on Apple Silicon, AVX2/AES-NI on x86_64 |
| PGO build script (bench-pgo.sh) | Two-phase profile-guided optimization for 10-20% additional throughput |

Hot Path Allocation Elimination

| Change | Rationale |
| --- | --- |
| Traceparent: [u8;55] stack buffer | Replaces 3x format! heap allocations (-500ns/req) |
| CORS origin: HeaderValue clone | Avoids String allocation per CORS request |
| WAF content-type: borrow from parts.headers | Eliminates to_owned() clone on POST/PUT/PATCH |
| Cache key: Arc::from() direct | Skips intermediate String allocation on cache miss |
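
As a rough illustration of the traceparent row, the sketch below formats a W3C `traceparent` value into a fixed 55-byte stack array without touching the heap. The function names and the raw-byte ID parameters are assumptions made for this example; the project's actual helper may look different.

```rust
/// Lowercase hex digits used for encoding.
const HEX: &[u8; 16] = b"0123456789abcdef";

/// Writes `bytes` as lowercase hex into `out`, which must be twice as long.
fn write_hex(bytes: &[u8], out: &mut [u8]) {
    for (i, &b) in bytes.iter().enumerate() {
        out[2 * i] = HEX[(b >> 4) as usize];
        out[2 * i + 1] = HEX[(b & 0x0f) as usize];
    }
}

/// Formats a `traceparent` value into a 55-byte stack buffer (hypothetical helper):
/// "00" + "-" + 32 hex (trace id) + "-" + 16 hex (span id) + "-" + 2 hex (flags).
fn format_traceparent(trace_id: &[u8; 16], span_id: &[u8; 8], flags: u8) -> [u8; 55] {
    // Start with every byte set to '-', so the three separators are already in place.
    let mut buf = [b'-'; 55];
    buf[0] = b'0'; // version "00"
    buf[1] = b'0';
    write_hex(trace_id, &mut buf[3..35]);
    write_hex(span_id, &mut buf[36..52]);
    write_hex(&[flags], &mut buf[53..55]);
    buf
}
```

The length works out as 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55, which is why a `[u8; 55]` array is enough and no `format!` call is needed.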

Lock / Contention Reduction

| Change | Rationale |
| --- | --- |
| WebSocket TLS: OnceLock<Arc<ClientConfig>> | Builds root cert store once, not per WS upgrade |
| Metrics render: ArcSwap<(u64, Bytes)> | Lock-free /metrics endpoint, atomic timestamp+buffer pair |
| Histogram: non-cumulative differential buckets | 3 atomic ops per observation instead of 17; prefix sums at render time |
| HTTP builder: Arc<AutoBuilder> | Per-connection clone is ref-count bump, not deep copy |
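
The histogram row can be sketched as follows. With cumulative buckets, one observation would have to increment every bucket at or above its value; storing per-bucket counts instead means each observation touches one bucket plus the sum and count, and the cumulative view is rebuilt with a prefix sum when metrics are rendered. The bucket bounds, type name, and method names here are illustrative assumptions, not the project's actual definitions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Inclusive upper bounds of the finite buckets (illustrative values).
const BOUNDS: [u64; 4] = [1, 5, 10, 50];

/// Histogram whose buckets hold per-bucket (non-cumulative) counts.
struct Histogram {
    buckets: [AtomicU64; 5], // one per bound, plus a final +Inf bucket
    sum: AtomicU64,
    count: AtomicU64,
}

impl Histogram {
    fn new() -> Self {
        Histogram {
            buckets: std::array::from_fn(|_| AtomicU64::new(0)),
            sum: AtomicU64::new(0),
            count: AtomicU64::new(0),
        }
    }

    /// Records one observation with exactly three atomic adds.
    fn observe(&self, value: u64) {
        let idx = BOUNDS
            .iter()
            .position(|&bound| value <= bound)
            .unwrap_or(BOUNDS.len()); // larger than every bound: +Inf bucket
        self.buckets[idx].fetch_add(1, Ordering::Relaxed);
        self.sum.fetch_add(value, Ordering::Relaxed);
        self.count.fetch_add(1, Ordering::Relaxed);
    }

    /// Rebuilds cumulative counts with a prefix sum (done at render time).
    fn cumulative(&self) -> [u64; 5] {
        let mut out = [0u64; 5];
        let mut running = 0;
        for (i, bucket) in self.buckets.iter().enumerate() {
            running += bucket.load(Ordering::Relaxed);
            out[i] = running;
        }
        out
    }
}
```

The render path pays for the prefix sum, which is a reasonable trade when observations vastly outnumber scrapes.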

Data Structures

| Change | Rationale |
| --- | --- |
| L1 cache: O(1) LRU (index-based doubly-linked list) | Replaces O(N) VecDeque linear scan on every cache hit |
| L1/L2 generation-based coherence | Atomic generation counter invalidates stale L1 entries after L2 update |
| Host validation: single-pass byte scan | Replaces 8 separate contains() calls |
| CORS origin: FNV hash set | O(1) lookup replaces Vec linear scan; case-insensitive via pre-lowercased storage |
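
A minimal sketch of the CORS origin set, assuming a hand-rolled 64-bit FNV-1a hasher plugged into the standard `HashSet` (the project may well use a crate for this). `Fnv1a`, `FnvSet`, and `AllowedOrigins` are names invented for the example. Origins are lowercased once at insertion; for brevity the lookup here also allocates a lowercased copy, which a hot path would want to avoid.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// 64-bit FNV-1a hasher.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

type FnvSet = HashSet<String, BuildHasherDefault<Fnv1a>>;

/// Allowed CORS origins, stored pre-lowercased (hypothetical type).
struct AllowedOrigins {
    set: FnvSet,
}

impl AllowedOrigins {
    fn new(origins: &[&str]) -> Self {
        let mut set = FnvSet::default();
        for origin in origins {
            set.insert(origin.to_ascii_lowercase());
        }
        AllowedOrigins { set }
    }

    /// O(1) expected-time, case-insensitive membership check.
    fn is_allowed(&self, origin: &str) -> bool {
        self.set.contains(origin.to_ascii_lowercase().as_str())
    }
}
```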

WAF Pipeline

| Change | Rationale |
| --- | --- |
| Two pattern sets selected per profile via mode | balanced (default, ~120 high-precision patterns) or aggressive (~190, broader recall). Categories: SQLi, XSS, CMDi, path traversal, SSRF/cloud-metadata, LDAP, XXE, SSTI, CRLF, Log4Shell, prototype pollution; NoSQL, deserialization, generic XSS handlers, JS API sinks live in aggressive |
| Aho-Corasick (no regex) | O(N) single-pass, no backtracking, case-insensitive, ReDoS-immune by construction |
| Normalisation: iterative re-scan | URL-decode (%XX, +), SQL comment strip (/* … */), JSON unicode (\uXXXX); decode loop runs up to 3 passes (catches single, double, triple encoding) and re-scans after each pass |
| Buffer shrink-to-fit (>64KB) | Prevents permanent memory inflation from adversarial large bodies |
| Entropy gate: JSON-string-aware (default 6.5 bits/byte) | For application/json, computed only on bytes inside string literals so structural punctuation doesn't dilute the signal. Threshold leaves base64/JWT through; per-profile threshold + kill-switch |
| DELETE body inspection | RFC 9110 allows bodies on DELETE |
| Content-Type delimiter enforcement | Requires ; or delimiter after type match |
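
The iterative re-scan in the normalisation row can be sketched as below, covering only the URL-decoding step (SQL comment stripping and `\uXXXX` handling are left out). The function names are hypothetical, a plain case-insensitive substring search stands in for the Aho-Corasick scan, and scanning the raw input before the first decode pass is an assumption of this sketch.

```rust
/// Decodes one layer of URL encoding: `%XX` escapes and `+` as space.
/// Malformed escapes are copied through unchanged.
fn url_decode_once(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        match input[i] {
            b'%' if i + 2 < input.len() => {
                let hi = (input[i + 1] as char).to_digit(16);
                let lo = (input[i + 2] as char).to_digit(16);
                if let (Some(h), Some(l)) = (hi, lo) {
                    out.push((h * 16 + l) as u8);
                    i += 3;
                    continue;
                }
                out.push(b'%'); // not a valid escape, keep the literal '%'
            }
            b'+' => out.push(b' '),
            b => out.push(b),
        }
        i += 1;
    }
    out
}

/// Stand-in for the pattern scan: case-insensitive substring search.
fn scan(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty()
        && haystack
            .windows(needle.len())
            .any(|w| w.eq_ignore_ascii_case(needle))
}

/// Scans the raw input, then decodes up to 3 times, re-scanning after each pass.
fn matches_after_normalisation(body: &[u8], needle: &[u8]) -> bool {
    let mut current = body.to_vec();
    if scan(&current, needle) {
        return true;
    }
    for _ in 0..3 {
        let decoded = url_decode_once(&current);
        if decoded == current {
            break; // fixed point: nothing left to decode
        }
        if scan(&decoded, needle) {
            return true;
        }
        current = decoded;
    }
    false
}
```

Capping the loop at three passes bounds the work per request; in this sketch a payload encoded four times is not decoded far enough to match.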

Innovative

| Change | Rationale |
| --- | --- |
| Request coalescing (singleflight) | N concurrent cache misses = 1 upstream fetch (thundering herd protection) |
| Health probe inline fast-path | /healthz responds in ~1us, bypasses full process_request pipeline |
| SO_BUSY_POLL (Linux) | Spin-polls the NIC queue for 50us before sleeping; lowers p99 latency by 5-15us |
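
To make the coalescing idea concrete, here is a minimal blocking sketch built on `std` primitives. It swaps in `Mutex` and `Condvar` for the project's async building blocks, and `SingleFlight`, `Flight`, and `get_or_fetch` are names chosen for this example. The first caller for a key becomes the leader and runs the fetch; later callers for the same key wait on the leader's slot instead of fetching again. The sketch does not handle a panicking fetch, so waiters would block forever in that case.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// Slot for one in-flight fetch: the published result plus a wake-up signal.
struct Flight {
    result: Mutex<Option<String>>,
    ready: Condvar,
}

/// Coalesces concurrent fetches for the same key (hypothetical type).
#[derive(Default)]
struct SingleFlight {
    inflight: Mutex<HashMap<String, Arc<Flight>>>,
}

impl SingleFlight {
    fn get_or_fetch<F: FnOnce() -> String>(&self, key: &str, fetch: F) -> String {
        // Join an existing flight or register a new one, under the map lock.
        let (flight, leader) = {
            let mut map = self.inflight.lock().unwrap();
            if let Some(existing) = map.get(key).cloned() {
                (existing, false)
            } else {
                let flight = Arc::new(Flight {
                    result: Mutex::new(None),
                    ready: Condvar::new(),
                });
                map.insert(key.to_string(), Arc::clone(&flight));
                (flight, true)
            }
        };

        if leader {
            let value = fetch(); // the single upstream fetch
            *flight.result.lock().unwrap() = Some(value.clone());
            self.inflight.lock().unwrap().remove(key);
            flight.ready.notify_all();
            value
        } else {
            // Check the published value before sleeping, so a waiter that
            // arrives after publication returns immediately.
            let mut guard = flight.result.lock().unwrap();
            while guard.is_none() {
                guard = flight.ready.wait(guard).unwrap();
            }
            guard.clone().unwrap()
        }
    }
}
```

Because the waiter holds the result lock from its check until it starts waiting, the leader cannot publish in between, so no wake-up is lost.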

Allocator

| Change | Rationale |
| --- | --- |
| mimalloc global allocator | Reduces allocation contention under concurrent load compared to system malloc |

TLS

| Change | Rationale |
| --- | --- |
| TLS 1.3 default | 1-RTT handshake instead of 2-RTT (TLS 1.2) |
| Session cache 16,384 entries | More resumed sessions avoid full ECDHE key exchange |
| Session tickets (Ticketer) | Stateless resumption, works across process restarts |
| 0-RTT early data (16 KB max) | Clients can send data before handshake completes (idempotent methods only) |
| Server cipher order enforced | ignore_client_order = true |
| send_half_rtt_data = true | Server sends data before client Finished on resumed connections |
| FnvHashMap for SNI map | ~2x faster than SipHash for short hostname keys |
| Thread-local SNI cache | Invalidated via dual-generation counter (instance + global) |
| Acquire/Release ordering | Prevents stale cert serving on ARM (Graviton) after hot-reload |
| sys_membarrier (Linux) | Ensures all threads observe new cert config after reload |
| Cert pre-warming (120s) | Pre-builds ServerConfig before expiry; race-protected via generation check |
| TLS handshake timeout (10s) | Drops connections that stall during handshake |
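
The thread-local cache and Acquire/Release rows might fit together roughly as in the sketch below. It uses a single generation counter rather than the dual (instance + global) scheme named in the table, a `String` stands in for the real TLS server configuration, and `CertStore` and `CACHED` are hypothetical names. Because the thread-local slot is shared by every `CertStore` on a thread, the sketch is only valid with one store per process.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};

/// Shared certificate state; `String` is a stand-in for the real config type.
struct CertStore {
    generation: AtomicU64,
    current: RwLock<Arc<String>>,
}

thread_local! {
    /// Per-thread cached (generation, config) pair.
    static CACHED: RefCell<Option<(u64, Arc<String>)>> = RefCell::new(None);
}

impl CertStore {
    fn new(cert: &str) -> Self {
        CertStore {
            generation: AtomicU64::new(0),
            current: RwLock::new(Arc::new(cert.to_string())),
        }
    }

    /// Hot-reload: store the new config first, then publish the new
    /// generation with Release ordering.
    fn reload(&self, cert: &str) {
        *self.current.write().unwrap() = Arc::new(cert.to_string());
        self.generation.fetch_add(1, Ordering::Release);
    }

    /// Fast path: reuse the thread-local copy while the generation matches.
    fn get(&self) -> Arc<String> {
        // The Acquire load pairs with the Release increment in `reload`.
        let seen = self.generation.load(Ordering::Acquire);
        CACHED.with(|cell| {
            let mut slot = cell.borrow_mut();
            if let Some((cached_generation, cert)) = slot.as_ref() {
                if *cached_generation == seen {
                    return Arc::clone(cert);
                }
            }
            let fresh: Arc<String> = self.current.read().unwrap().clone();
            *slot = Some((seen, Arc::clone(&fresh)));
            fresh
        })
    }
}
```

On weakly ordered CPUs, a Relaxed counter would not guarantee that a thread seeing the new generation also sees the new config; the Release/Acquire pair provides that ordering.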

Network (Linux)

| Change | Rationale |
| --- | --- |
| TCP_NODELAY | Disables Nagle's algorithm on all connections |
| SO_REUSEPORT | Kernel-level connection distribution across listeners |
| TCP_DEFER_ACCEPT | Kernel holds connection until client sends data |
| TCP_FASTOPEN | Data in SYN packet for returning clients (256 pending queue) |
| TCP_QUICKACK | Immediate ACK instead of delayed ACK timer |
| TCP_CORK | Batches writes on listener; combined with NODELAY on accept |
| SO_BUSY_POLL (50us) | Spin-poll NIC queue before sleeping; trades CPU for latency |
| Listen backlog 1024 | Prevents SYN drops under burst load |
| io_uring multishot accept | Feature-gated: one syscall for N connections |
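
Of the options above, only TCP_NODELAY is exposed by Rust's standard library; the others need platform-specific `setsockopt` calls that are not shown here. A minimal sketch of applying it to each accepted connection, with `accept_nodelay` as a hypothetical helper name:

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Accepts one connection and disables Nagle's algorithm on it.
fn accept_nodelay(listener: &TcpListener) -> io::Result<TcpStream> {
    let (stream, _peer) = listener.accept()?;
    stream.set_nodelay(true)?; // sets TCP_NODELAY on the accepted socket
    Ok(stream)
}
```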

Proxy

| Change | Rationale |
| --- | --- |
| HTTP/2 upstream via hyper-rustls | ALPN H2 for HTTPS upstreams; eliminates head-of-line blocking |
| Connection pooling (128 idle per host) | Reuse upstream TCP+TLS connections; 30s idle timeout |
| Hop-by-hop header stripping (RFC 7230) | Transfer-Encoding, TE, Trailer, Proxy-Authorization, Keep-Alive |
| SSE stream: no-buffer headers | Cache-Control: no-cache, X-Accel-Buffering: no |
| WebSocket: OnceLock TLS config | Root cert store built once, not per WS TLS upgrade |
| WebSocket: forward handshake headers | Sec-WebSocket-Accept, Protocol, Extensions from upstream 101 |
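
Hop-by-hop stripping can be pictured as a filter over the header list before the request is forwarded. The sketch below uses plain `(String, String)` pairs instead of the proxy's real header map, and the function name is an assumption; it removes only the five headers listed in the table and does not handle headers nominated by a `Connection` header.

```rust
/// Hop-by-hop header names from the table above, lowercased.
const HOP_BY_HOP: [&str; 5] = [
    "transfer-encoding",
    "te",
    "trailer",
    "proxy-authorization",
    "keep-alive",
];

/// Returns the headers that are safe to forward upstream (hypothetical helper).
fn strip_hop_by_hop(headers: Vec<(String, String)>) -> Vec<(String, String)> {
    headers
        .into_iter()
        .filter(|(name, _)| !HOP_BY_HOP.iter().any(|hop| name.eq_ignore_ascii_case(hop)))
        .collect()
}
```

Header names are compared case-insensitively because HTTP field names are case-insensitive.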

WAF

| Change | Rationale |
| --- | --- |
| Aho-Corasick (no regex) | O(N) single-pass, no backtracking; all patterns of the active mode scanned simultaneously |
| Skip body inspection for GET/HEAD/OPTIONS | POST/PUT/PATCH/DELETE bodies are inspected |
| Entropy check only for bodies >= 256 bytes | Short payloads lack sufficient data for meaningful entropy analysis |
| simd-json for JSON validation | SIMD-accelerated JSON parsing where hardware supports it |
| Byte-level content-type matching | No string allocation; case-insensitive byte prefix comparison |
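
The entropy row can be illustrated with a plain Shannon entropy calculation over the body bytes, gated on the 256-byte minimum. This is a sketch under assumptions: the function names are invented, the threshold is passed in by the caller, and the JSON-string-aware variant (counting only bytes inside string literals) is not shown.

```rust
/// Minimum body size before entropy is computed (from the table above).
const MIN_ENTROPY_LEN: usize = 256;

/// Shannon entropy of `data` in bits per byte (0.0 for empty input).
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let n = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum()
}

/// True when the body is long enough to measure and exceeds the threshold.
fn exceeds_entropy_threshold(body: &[u8], threshold: f64) -> bool {
    body.len() >= MIN_ENTROPY_LEN && shannon_entropy(body) > threshold
}
```

Base64 text uses a 64-symbol alphabet, so its entropy cannot exceed 6 bits per byte; a threshold such as 6.5 therefore lets base64 and JWT payloads through while still flagging data that is close to uniformly random.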

Cache

| Change | Rationale |
| --- | --- |
| Two-level: L1 thread-local + L2 DashMap | L1 zero contention (~5ns), L2 sharded locks (~30ns) |
| L1 O(1) LRU via doubly-linked list | Index-based nodes in Vec with free-list recycling |
| L1 sized from detected L1d cache | 50% of L1d for hot entries |
| L1/L2 generation coherence | Atomic counter bumped on L2 insert; stale L1 entries detected on get |
| Cache key includes query string | Prevents cache poisoning (/api?a=1 vs /api?a=2) |
| Content-Encoding preserved | Gzip-compressed responses served with correct header |
| Singleflight coalescing | DashMap + tokio::sync::watch; wait_for inspects current value at first poll, so a waiter that subscribes after the fetcher has already published true still observes it (race-free). Inflight entry is cleaned up on all exit paths. |
| L2 eviction: expired-first | TTL-expired before live entries; oldest-TTL fallback |
| Bytes (reference-counted) | Cloning is Arc increment, not memcpy |
| Thread-local route LRU | FNV hash of path; O(1) get/insert/evict via intrusive doubly-linked list (capacity 256 per worker). Replaces a previous "first 256 then no more inserts" map that could be locked out by a flood of distinct paths. |
| Connection pool pre-warming | Fires GET to all upstreams at startup |
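
A compact sketch of an index-based O(1) LRU in the spirit of the L1 and route-cache rows: nodes live in a `Vec`, links are indices rather than pointers, and evicted slots are recycled through a free list. The `u64` key (for example a path hash), the `String` value, and all names are assumptions for illustration rather than the project's actual types.

```rust
use std::collections::HashMap;

/// Sentinel index meaning "no node".
const NIL: usize = usize::MAX;

struct Node {
    key: u64,
    value: String,
    prev: usize,
    next: usize,
}

struct Lru {
    map: HashMap<u64, usize>, // key -> index into `nodes`
    nodes: Vec<Node>,
    free: Vec<usize>, // recycled node slots
    head: usize,      // most recently used
    tail: usize,      // least recently used
    capacity: usize,
}

impl Lru {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        Lru { map: HashMap::new(), nodes: Vec::new(), free: Vec::new(), head: NIL, tail: NIL, capacity }
    }

    /// Detaches node `i` from the list in O(1).
    fn unlink(&mut self, i: usize) {
        let (prev, next) = (self.nodes[i].prev, self.nodes[i].next);
        if prev != NIL { self.nodes[prev].next = next; } else { self.head = next; }
        if next != NIL { self.nodes[next].prev = prev; } else { self.tail = prev; }
    }

    /// Makes node `i` the most recently used entry.
    fn push_front(&mut self, i: usize) {
        self.nodes[i].prev = NIL;
        self.nodes[i].next = self.head;
        if self.head != NIL { self.nodes[self.head].prev = i; } else { self.tail = i; }
        self.head = i;
    }

    fn get(&mut self, key: u64) -> Option<&str> {
        let i = *self.map.get(&key)?;
        self.unlink(i);
        self.push_front(i);
        Some(self.nodes[i].value.as_str())
    }

    fn insert(&mut self, key: u64, value: String) {
        if let Some(&i) = self.map.get(&key) {
            // Existing key: update in place and mark as most recently used.
            self.nodes[i].value = value;
            self.unlink(i);
            self.push_front(i);
            return;
        }
        if self.map.len() == self.capacity {
            // Evict the least recently used entry and recycle its slot.
            let victim = self.tail;
            self.unlink(victim);
            self.map.remove(&self.nodes[victim].key);
            self.free.push(victim);
        }
        let node = Node { key, value, prev: NIL, next: NIL };
        let i = match self.free.pop() {
            Some(slot) => { self.nodes[slot] = node; slot }
            None => { self.nodes.push(node); self.nodes.len() - 1 }
        };
        self.push_front(i);
        self.map.insert(key, i);
    }
}
```

Because a full cache evicts its tail instead of refusing inserts, a flood of distinct keys can push out hot entries but cannot lock new ones out.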

Hyper Tuning

| Change | Rationale |
| --- | --- |
| Max headers: 64 (default: 100) | Reduces memory per connection from malformed requests |
| Max header buffer: 16 KB | Limits memory consumption from oversized headers |
| Connection timeout: 1 hour | Supports HTTP/2 mux, WebSocket, SSE long-lived connections |

Released under the MIT License.