perf: optimize DNS query hot path (#15)

* perf: optimize hot path — RwLock, inline filtering, pre-allocated strings - Mutex → RwLock for cache, blocklist, and overrides (concurrent read access) - Make cache.lookup() and overrides.lookup() take &self (read-only) - Eliminate 3 Vec allocations per DnsPacket::write() via inline filtering - Pre-allocate domain strings with capacity 64 in parse path - Add criterion micro-benchmarks (hot_path + throughput) - Add bench README documenting both benchmark suites Measured improvement: ~14% faster parsing, ~9% pipeline throughput, round-trip cached 733ns → 698ns (~2.3M queries/sec). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: simplify benchmark code after review - Remove redundant DnsHeader::new() (already set by DnsPacket::new()) - Remove unused DnsHeader import - Change simulate_cached_pipeline to take &DnsCache (lookup is &self now) - Remove unnecessary mut on cache in cache_lookup_miss bench Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 02:01:08 +02:00
parent 5d454cbed5
commit 236ef7b4f5
13 changed files with 728 additions and 77 deletions
--- a/bench/README.md
+++ b/bench/README.md
@@ -0,0 +1,87 @@
+# Benchmarks
+
+Numa has two benchmark suites measuring different layers of performance.
+
+## Micro-benchmarks (`benches/`, criterion)
+
+Nanosecond-precision measurement of individual operations on the hot path.
+No running server required — these are pure Rust unit-level benchmarks.
+
+```sh
+cargo bench            # run all
+cargo bench --bench hot_path      # parse, serialize, cache, clone
+cargo bench --bench throughput    # pipeline QPS, buffer alloc
+```
+
+### What's measured
+
+**hot_path** — individual operations:
+
+| Benchmark | What it measures |
+|-----------|-----------------|
+| `buffer_parse` | Wire bytes → DnsPacket (typical response with 4 records) |
+| `buffer_serialize` | DnsPacket → wire bytes |
+| `packet_clone` | Full DnsPacket clone (what cache hit costs) |
+| `cache_lookup_hit` | Cache lookup on a single-entry cache |
+| `cache_lookup_hit_populated` | Cache lookup with 1000 entries |
+| `cache_lookup_miss` | HashMap miss (baseline) |
+| `cache_insert` | Insert into cache with packet clone |
+| `round_trip_cached` | Full cached path: parse query → cache hit → serialize response |
+
+**throughput** — pipeline capacity:
+
+| Benchmark | What it measures |
+|-----------|-----------------|
+| `pipeline_throughput/N` | N cached queries end-to-end (parse → lookup → serialize) |
+| `buffer_alloc` | BytePacketBuffer 4KB zero-init cost |
+
+### Reading results
+
+Criterion auto-compares against the previous run:
+
+```
+round_trip_cached  time: [710.5 ns 715.2 ns 720.1 ns]
+                   change: [-2.48% -1.85% -1.21%] (p = 0.00 < 0.05)
+                   Performance has improved.
+```
+
+- The three values are [lower bound, estimate, upper bound] of the mean
+- `change` shows the delta vs the last saved baseline
+- HTML reports with charts: `target/criterion/report/index.html`
+
+To save a named baseline for comparison:
+
+```sh
+cargo bench -- --save-baseline before
+# ... make changes ...
+cargo bench -- --baseline before
+```
+
+## End-to-end benchmark (`bench/dns-bench.sh`)
+
+Real-world latency comparison using `dig` against a running Numa instance
+and public resolvers. Measures millisecond-level latency including network I/O.
+
+```sh
+# Start Numa first (default port 15353 for testing)
+python3 bench/dns-bench.sh [port] [rounds]
+python3 bench/dns-bench.sh 15353 20    # default
+```
+
+### What's measured
+
+- **Numa (cold)**: cache flushed before each query — measures upstream forwarding
+- **Numa (cached)**: queries hit cache — measures local processing
+- **System / Google / Cloudflare / Quad9**: public resolver comparison
+
+Results saved to `bench/results.json`.
+
+### When to use which
+
+| Question | Use |
+|----------|-----|
+| Did my code change make parsing faster? | `cargo bench --bench hot_path` |
+| Is the cached path still sub-microsecond? | `cargo bench --bench hot_path` (round_trip_cached) |
+| How many queries/sec can we handle? | `cargo bench --bench throughput` |
+| Is Numa still competitive with system resolver? | `bench/dns-bench.sh` |
+| Did upstream forwarding regress? | `bench/dns-bench.sh` |