perf: optimize DNS query hot path (#15)
* perf: optimize hot path — RwLock, inline filtering, pre-allocated strings - Mutex → RwLock for cache, blocklist, and overrides (concurrent read access) - Make cache.lookup() and overrides.lookup() take &self (read-only) - Eliminate 3 Vec allocations per DnsPacket::write() via inline filtering - Pre-allocate domain strings with capacity 64 in parse path - Add criterion micro-benchmarks (hot_path + throughput) - Add bench README documenting both benchmark suites Measured improvement: ~14% faster parsing, ~9% pipeline throughput, round-trip cached 733ns → 698ns (~2.3M queries/sec). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * chore: simplify benchmark code after review - Remove redundant DnsHeader::new() (already set by DnsPacket::new()) - Remove unused DnsHeader import - Change simulate_cached_pipeline to take &DnsCache (lookup is &self now) - Remove unnecessary mut on cache in cache_lookup_miss bench Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit was merged in pull request #15.
This commit is contained in:
87
bench/README.md
Normal file
87
bench/README.md
Normal file
@@ -0,0 +1,87 @@
|
||||
# Benchmarks
|
||||
|
||||
Numa has two benchmark suites measuring different layers of performance.
|
||||
|
||||
## Micro-benchmarks (`benches/`, criterion)
|
||||
|
||||
Nanosecond-precision measurement of individual operations on the hot path.
|
||||
No running server required — these are pure Rust unit-level benchmarks.
|
||||
|
||||
```sh
|
||||
cargo bench # run all
|
||||
cargo bench --bench hot_path # parse, serialize, cache, clone
|
||||
cargo bench --bench throughput # pipeline QPS, buffer alloc
|
||||
```
|
||||
|
||||
### What's measured
|
||||
|
||||
**hot_path** — individual operations:
|
||||
|
||||
| Benchmark | What it measures |
|
||||
|-----------|-----------------|
|
||||
| `buffer_parse` | Wire bytes → DnsPacket (typical response with 4 records) |
|
||||
| `buffer_serialize` | DnsPacket → wire bytes |
|
||||
| `packet_clone` | Full DnsPacket clone (what cache hit costs) |
|
||||
| `cache_lookup_hit` | Cache lookup on a single-entry cache |
|
||||
| `cache_lookup_hit_populated` | Cache lookup with 1000 entries |
|
||||
| `cache_lookup_miss` | HashMap miss (baseline) |
|
||||
| `cache_insert` | Insert into cache with packet clone |
|
||||
| `round_trip_cached` | Full cached path: parse query → cache hit → serialize response |
|
||||
|
||||
**throughput** — pipeline capacity:
|
||||
|
||||
| Benchmark | What it measures |
|
||||
|-----------|-----------------|
|
||||
| `pipeline_throughput/N` | N cached queries end-to-end (parse → lookup → serialize) |
|
||||
| `buffer_alloc` | BytePacketBuffer 4KB zero-init cost |
|
||||
|
||||
### Reading results
|
||||
|
||||
Criterion auto-compares against the previous run:
|
||||
|
||||
```
|
||||
round_trip_cached time: [710.5 ns 715.2 ns 720.1 ns]
|
||||
change: [-2.48% -1.85% -1.21%] (p = 0.00 < 0.05)
|
||||
Performance has improved.
|
||||
```
|
||||
|
||||
- The three values are [lower bound, estimate, upper bound] of the mean
|
||||
- `change` shows the delta vs the last saved baseline
|
||||
- HTML reports with charts: `target/criterion/report/index.html`
|
||||
|
||||
To save a named baseline for comparison:
|
||||
|
||||
```sh
|
||||
cargo bench -- --save-baseline before
|
||||
# ... make changes ...
|
||||
cargo bench -- --baseline before
|
||||
```
|
||||
|
||||
## End-to-end benchmark (`bench/dns-bench.sh`)
|
||||
|
||||
Real-world latency comparison using `dig` against a running Numa instance
|
||||
and public resolvers. Measures millisecond-level latency including network I/O.
|
||||
|
||||
```sh
|
||||
# Start Numa first (default port 15353 for testing)
|
||||
python3 bench/dns-bench.sh [port] [rounds]
|
||||
python3 bench/dns-bench.sh 15353 20 # default
|
||||
```
|
||||
|
||||
### What's measured
|
||||
|
||||
- **Numa (cold)**: cache flushed before each query — measures upstream forwarding
|
||||
- **Numa (cached)**: queries hit cache — measures local processing
|
||||
- **System / Google / Cloudflare / Quad9**: public resolver comparison
|
||||
|
||||
Results saved to `bench/results.json`.
|
||||
|
||||
### When to use which
|
||||
|
||||
| Question | Use |
|
||||
|----------|-----|
|
||||
| Did my code change make parsing faster? | `cargo bench --bench hot_path` |
|
||||
| Is the cached path still sub-microsecond? | `cargo bench --bench hot_path` (round_trip_cached) |
|
||||
| How many queries/sec can we handle? | `cargo bench --bench throughput` |
|
||||
| Is Numa still competitive with system resolver? | `bench/dns-bench.sh` |
|
||||
| Did upstream forwarding regress? | `bench/dns-bench.sh` |
|
||||
Reference in New Issue
Block a user