diff --git a/.gitignore b/.gitignore index 649d86b..acfc601 100644 --- a/.gitignore +++ b/.gitignore @@ -4,3 +4,5 @@ CLAUDE.md docs/ site/blog/posts/ ios/ +drafts/ +site/blog/index.html diff --git a/Makefile b/Makefile index f84761a..dbff53a 100644 --- a/Makefile +++ b/Makefile @@ -32,6 +32,19 @@ blog: pandoc "$$f" --template=site/blog-template.html -o "site/blog/posts/$$name.html"; \ echo " $$f → site/blog/posts/$$name.html"; \ done + @scripts/generate-blog-index.sh + +blog-drafts: blog + @if [ -d drafts ] && ls drafts/*.md >/dev/null 2>&1; then \ + for f in drafts/*.md; do \ + name=$$(basename "$$f" .md); \ + pandoc "$$f" --template=site/blog-template.html -o "site/blog/posts/$$name.html"; \ + echo " $$f → site/blog/posts/$$name.html (draft)"; \ + done; \ + BLOG_INCLUDE_DRAFTS=1 scripts/generate-blog-index.sh; \ + else \ + echo " No drafts found"; \ + fi release: ifndef VERSION diff --git a/blog/dns-from-scratch.md b/blog/dns-from-scratch.md index 7bf666c..c626f8a 100644 --- a/blog/dns-from-scratch.md +++ b/blog/dns-from-scratch.md @@ -1,7 +1,7 @@ --- title: I Built a DNS Resolver from Scratch in Rust description: How DNS actually works at the wire level — label compression, TTL tricks, DoH, and what surprised me building a resolver with zero DNS libraries. -date: March 2026 +date: 2026-03-20 --- I wanted to understand how DNS actually works. Not the "it translates domain names to IP addresses" explanation — the actual bytes on the wire. What does a DNS packet look like? How does label compression work? Why is everything crammed into 512 bytes? diff --git a/blog/dnssec-from-scratch.md b/blog/dnssec-from-scratch.md index 01bc5c5..804b425 100644 --- a/blog/dnssec-from-scratch.md +++ b/blog/dnssec-from-scratch.md @@ -1,7 +1,7 @@ --- title: Implementing DNSSEC from Scratch in Rust description: Recursive resolution from root hints, chain-of-trust validation, NSEC/NSEC3 denial proofs, and what I learned implementing DNSSEC with zero DNS libraries. 
-date: March 2026 +date: 2026-03-28 --- In the [previous post](/blog/posts/dns-from-scratch.html) I covered how DNS works at the wire level — packet format, label compression, TTL caching, DoH. Numa was a forwarding resolver: it parsed packets, did useful things locally, and relayed the rest to Cloudflare or Quad9. diff --git a/blog/dot-from-scratch.md b/blog/dot-from-scratch.md index 448f185..859202d 100644 --- a/blog/dot-from-scratch.md +++ b/blog/dot-from-scratch.md @@ -1,7 +1,7 @@ --- title: DNS-over-TLS from Scratch in Rust description: Building RFC 7858 on top of rustls — length-prefix framing, ALPN cross-protocol defense, and two bugs that only the strict clients caught. -date: April 2026 +date: 2026-04-06 --- The [previous post](/blog/posts/dnssec-from-scratch.html) ended with "DoT — the last encrypted transport we don't support." This post is about building it. diff --git a/blog/fixing-doh-tail-latency.md b/blog/fixing-doh-tail-latency.md new file mode 100644 index 0000000..661c456 --- /dev/null +++ b/blog/fixing-doh-tail-latency.md @@ -0,0 +1,169 @@ +--- +title: Fixing DNS tail latency with a 5-line config and a 50-line function +description: We had periodic 40-140ms DoH spikes from hyper's dispatch channel. The fix was reqwest window tuning and request hedging — Dean & Barroso's "The Tail at Scale," applied to a DNS forwarder. Same ideas took our cold recursive p99 from 2.3 seconds to 538ms. +date: 2026-04-12 +--- + +Numa forwards DNS queries over HTTPS using reqwest. When we benchmarked the DoH path, we found periodic 40-140ms latency spikes every ~100ms of wall clock, in an otherwise ~10ms distribution. The tail was dragging our average — median 10ms, mean 23ms. + +
+DoH forwarding p99: 113 → 71ms (window tuning + request hedging)
+Cold recursive p99: 2.3s → 538ms (NS caching, serve-stale, parallel queries)
+Forwarding σ: 31 → 13ms (random spikes become parallel races)
+ +The fix was a 5-line reqwest config and a 50-line hedging function. This post is also an advertisement for Dean & Barroso's 2013 paper ["The Tail at Scale"](https://research.google/pubs/pub40801/) — a decade-old idea that still demolishes dispatch spikes. + +--- + +## The cause: hyper's dispatch channel + +Reqwest sits on top of hyper, which interposes an mpsc dispatch channel and a separate `ClientTask` between `.send()` and the h2 stream. We instrumented the forwarding path and confirmed: 100% of the spike time lives in the `send()` phase, and a parallel heartbeat task showed zero runtime lag during spikes. The tokio runtime was fine — the stall was internal to hyper's request scheduling. + +Hickory-resolver doesn't have this issue. It holds `h2::SendRequest` directly and calls `ready().await; send_request()` in the caller's task — no channel, no scheduling dependency. We used it as a reference point throughout. + +## Fix #1 — HTTP/2 window sizes + +Reqwest inherits hyper's HTTP/2 defaults: 2 MB stream window, 5 MB connection window. For DNS responses (~200 bytes), that's ~10,000× oversized — unnecessary WINDOW_UPDATE frames, bloated bookkeeping on every poll, and different server-side scheduling behavior. 
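To put the ~10,000× figure in concrete terms, here is a minimal sketch of the arithmetic. It assumes 2 MB means 2 × 1024 × 1024 bytes and takes the ~200-byte response size from the paragraph above. `window_ratio` is a hypothetical helper for illustration, not part of Numa or reqwest. The 65,535-byte value is the HTTP/2 protocol default for the initial window.

```rust
/// Hypothetical helper: how many typical responses fit in one
/// flow-control window (integer division, rounds down).
fn window_ratio(window_bytes: u64, response_bytes: u64) -> u64 {
    window_bytes / response_bytes
}

fn main() {
    // Approximate DNS response size quoted in the post.
    let response_bytes = 200;
    // Stream window quoted in the post, assuming 2 MB = 2 * 1024 * 1024 bytes.
    let default_stream_window = 2 * 1024 * 1024;
    // HTTP/2 protocol default initial window size (2^16 - 1).
    let protocol_default_window = 65_535;

    println!(
        "default stream window: {}x a typical response",
        window_ratio(default_stream_window, response_bytes)
    );
    println!(
        "protocol default window: {}x a typical response",
        window_ratio(protocol_default_window, response_bytes)
    );
}
```

Under these assumptions, one default stream window holds about 10,485 typical responses, while the protocol default holds about 327.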
+ +Setting both windows to the h2 spec default (64 KB) dropped our median from 13.3ms to 10.1ms: + +```rust +reqwest::Client::builder() + .use_rustls_tls() + .http2_initial_stream_window_size(65_535) + .http2_initial_connection_window_size(65_535) + .http2_keep_alive_interval(Duration::from_secs(15)) + .http2_keep_alive_while_idle(true) + .http2_keep_alive_timeout(Duration::from_secs(10)) + .pool_idle_timeout(Duration::from_secs(300)) + .pool_max_idle_per_host(1) + .build() +``` + +**Any Rust code using reqwest for tiny-payload HTTP/2 workloads — DoH, API polling, metric scraping — is probably hitting this.** + +## Fix #2 — Request hedging + +["The Tail at Scale"](https://research.google/pubs/pub40801/) (Dean & Barroso, 2013): fire a request, and if it doesn't return within your P50 latency, fire the same request in parallel. First response wins. + +The intuition: if 5% of requests spike due to independent random events, two parallel requests means only 0.25% of pairs spike on *both*. The tail collapses. + +**The surprise: hedging against the same upstream works.** HTTP/2 multiplexes streams — two `send_request()` calls on one connection become independent h2 streams. If one stalls in the dispatch channel, the other keeps making progress. + +```rust +pub async fn forward_with_hedging_raw( + wire: &[u8], + primary: &Upstream, + secondary: &Upstream, + hedge_delay: Duration, + timeout_duration: Duration, +) -> Result> { + let primary_fut = forward_query_raw(wire, primary, timeout_duration); + tokio::pin!(primary_fut); + let delay = sleep(hedge_delay); + tokio::pin!(delay); + + // Phase 1: wait for primary to return OR the hedge delay. + tokio::select! { + result = &mut primary_fut => return result, + _ = &mut delay => {} + } + + // Phase 2: hedge delay expired — fire secondary, keep primary alive. + let secondary_fut = forward_query_raw(wire, secondary, timeout_duration); + tokio::pin!(secondary_fut); + + // First successful response wins. + tokio::select! 
{ + r = primary_fut => r, + r = secondary_fut => r, + } +} +``` + +The [production version](https://github.com/razvandimescu/numa/blob/main/src/forward.rs#L267) adds error handling — if one leg fails, it waits for the other. In production, Numa passes the same `&Upstream` twice when only one is configured. We extended hedging to all protocols — UDP (rescues packet loss on WiFi), DoT (rescues TLS handshake stalls). Configurable via `hedge_ms`; set to 0 to disable. + +**Caveat: hedging hurts on degraded networks.** When latency is consistently high (no random spikes, just slow), the hedge adds overhead with nothing to rescue. Hedging is a variance reducer, not a latency reducer — it only helps when spikes are *random*. + +--- + +## Forwarding results + +5 iterations × 101 domains × 10 rounds, 5,050 samples per method. Hickory-resolver included as a reference (it uses h2 directly, no dispatch channel): + +| | Single | **Hedged** | Hickory (ref) | +|---|---|---|---| +| mean | 17.4ms | **14.3ms** | 16.8ms | +| median | 10.4ms | **10.2ms** | 13.3ms | +| p95 | 52.5ms | **28.6ms** | 37.7ms | +| p99 | 113.4ms | **71.3ms** | 98.1ms | +| σ | 30.6ms | **13.2ms** | 19.1ms | + +The internal improvement: hedging cut p95 by 45%, p99 by 37%, σ by 57%. The exact margin vs hickory varies with network conditions; the σ reduction is consistent across runs. + +## Recursive resolution: from 2.3 seconds to 538ms + +Forwarding is one job. Recursive resolution — walking from root hints through TLD nameservers to the authoritative server — is a different one. We started 15× behind Unbound on cold recursive p99 and traced it to four root causes. + +**1. Missing NS delegation caching.** We cached glue records (ns1's IP) but not the delegation itself. Every `.com` query walked from root. Fix: cache NS records from referral authority sections. (10 lines) + +**2. 
Expired cache entries caused full cold resolutions.** Fix: serve-stale ([RFC 8767](https://www.rfc-editor.org/rfc/rfc8767)) — return expired entries with TTL=1 while revalidating in the background. (20 lines) + +**3. 1,900ms wasted per unreachable server.** 800ms UDP timeout + unconditional 1,500ms TCP fallback. Fix: 400ms UDP, TCP only for truncation. (5 lines) + +**4. Sequential NS queries on cold starts.** Fix: fire to the top 2 nameservers simultaneously. First response wins, SRTT recorded for both. Same hedging principle. (50 lines) + +
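The parallel-query idea in item 4 can be sketched with nothing but the standard library. This is an illustration under stated assumptions, not Numa's resolver code: `simulated_query` and `race_two` are hypothetical names, the "nameservers" are threads that sleep for a fixed latency, and the SRTT bookkeeping is left out.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a nameserver query: sleeps for `latency`,
/// then returns a label identifying which server answered.
fn simulated_query(server: &'static str, latency: Duration) -> &'static str {
    thread::sleep(latency);
    server
}

/// Fire the same query at two servers in parallel and return whichever
/// answers first. The slower thread is left to finish on its own.
fn race_two(
    a: (&'static str, Duration),
    b: (&'static str, Duration),
) -> &'static str {
    let (tx, rx) = mpsc::channel();
    for (server, latency) in [a, b] {
        let tx = tx.clone();
        thread::spawn(move || {
            // Ignore send errors: the receiver may already be gone
            // by the time the slower query completes.
            let _ = tx.send(simulated_query(server, latency));
        });
    }
    // Blocks until the first of the two threads reports back.
    rx.recv().expect("at least one query thread sends a result")
}

fn main() {
    let winner = race_two(
        ("ns1 (stalled)", Duration::from_millis(300)),
        ("ns2 (healthy)", Duration::from_millis(20)),
    );
    println!("first response from: {}", winner);
}
```

The caller only pays for the faster of the two latencies, which is why a single slow or unreachable nameserver stops dominating the tail. A real resolver would use async tasks instead of threads and would also handle the case where the first reply is an error.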
+p99 before: 2,367ms → p99 after: 538ms; Unbound (ref): 748ms
+ +Genuine cold benchmarks — unique subdomains, 1 query per domain, 5 iterations, 505 samples per server: + +| | Baseline | Final | Unbound (ref) | +|---|---|---|---| +| p99 | 2,367ms | **538ms** | 748ms | +| σ | 254ms | **114ms** | 457ms | +| median | — | 77.6ms | 74.7ms | + +Unbound wins median by ~4% — its C implementation and 19 years of recursive optimization give it an edge on raw speed. It also has features we don't yet: aggressive NSEC caching ([RFC 8198](https://www.rfc-editor.org/rfc/rfc8198)) and a persistent infra cache. Where hedging shines is the tail — domains with slow or unreachable nameservers, where parallel queries turn worst-case sequential timeouts into races. + +Cache hits are tied across Numa, Unbound, and AdGuard Home — all serve at 0.1ms. + +--- + +## Takeaways + +The real hero of this post is Dean & Barroso. Hedging works because **spikes are random, and two random draws rarely both lose**. It's effective for any HTTP/2 client, any language, any forwarder topology. Nobody we know of ships it by default. + +If you're building a Rust service that makes many small HTTP/2 requests to the same backend: check your flow control window sizes first, then implement hedging. Don't rewrite the client. + +Benchmarks are in [`benches/recursive_compare.rs`](https://github.com/razvandimescu/numa/blob/main/benches/recursive_compare.rs) — run them yourself. If you're using reqwest for tiny-payload workloads and try the window size fix, I'd love to hear if you see the same improvement. + +--- + +Numa is a DNS resolver that runs on your laptop or phone. DoH, DoT, .numa local domains, ad blocking, developer overrides, a REST API, and all the optimization work in this post. [github.com/razvandimescu/numa](https://github.com/razvandimescu/numa). 
diff --git a/scripts/generate-blog-index.sh b/scripts/generate-blog-index.sh new file mode 100755 index 0000000..cacc033 --- /dev/null +++ b/scripts/generate-blog-index.sh @@ -0,0 +1,239 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Generate site/blog/index.html from blog/*.md frontmatter. +# Reads title, description, date from YAML frontmatter in each post. +# Sorts newest first by ISO date (YYYY-MM-DD). + +OUT="site/blog/index.html" + +# Extract frontmatter fields from a markdown file +extract() { + local file="$1" field="$2" + sed -n '/^---$/,/^---$/p' "$file" | grep "^${field}:" | sed "s/^${field}: *//" +} + +# Collect posts: "date|name|title|description" per line +posts="" +sources="blog/*.md" +if [ "${BLOG_INCLUDE_DRAFTS:-}" = "1" ] && ls drafts/*.md >/dev/null 2>&1; then + sources="blog/*.md drafts/*.md" +fi +for f in $sources; do + name=$(basename "$f" .md) + title=$(extract "$f" title) + desc=$(extract "$f" description) + date=$(extract "$f" date) + posts+="${date}|${name}|${title}|${desc}"$'\n' +done + +# Sort by ISO date (YYYY-MM-DD), newest first +posts=$(echo "$posts" | grep -v '^$' | sort -t'|' -k1 -r) + +# Format ISO date (YYYY-MM-DD) to "Month YYYY" +format_date() { + local months=(January February March April May June July August September October November December) + local y="${1%%-*}" + local m="${1#*-}"; m="${m%%-*}"; m=$((10#$m)) + echo "${months[$((m-1))]} $y" +} + +# Generate post list items +items="" +while IFS='|' read -r date name title desc; do + display_date=$(format_date "$date") + items+="
  • + +
    ${title}
    +
    ${desc}
    +
    ${display_date}
    +
    +
  • +" +done <<< "$posts" + +# Write the full index.html — style matches the existing hand-maintained version +cat > "$OUT" << HTMLEOF + + + + + +Blog — Numa + + + + + + + + +
    +

    Blog

    +
      +${items}
    +
    + + + + + +HTMLEOF + +echo " blog/index.html generated ($(echo "$posts" | wc -l | tr -d ' ') posts)" diff --git a/scripts/serve-site.sh b/scripts/serve-site.sh new file mode 100755 index 0000000..23854ff --- /dev/null +++ b/scripts/serve-site.sh @@ -0,0 +1,14 @@ +#!/usr/bin/env bash +set -euo pipefail + +PORT="${1:-9000}" + +if [[ "${1:-}" == "--drafts" ]] || [[ "${2:-}" == "--drafts" ]]; then + PORT="${PORT//--drafts/9000}" # default port if --drafts was first arg + make blog-drafts +else + make blog +fi + +echo "Serving site at http://localhost:$PORT" +cd site && python3 -m http.server "$PORT" diff --git a/site/blog-template.html b/site/blog-template.html index 54f0eae..8f8a825 100644 --- a/site/blog-template.html +++ b/site/blog-template.html @@ -267,9 +267,105 @@ body::before { .blog-footer a:hover { color: var(--amber); } /* --- Responsive --- */ +/* Hero metrics cards */ +.hero-metrics { + display: grid; + grid-template-columns: repeat(3, 1fr); + gap: 1rem; + margin: 2rem 0; +} +.metric-card { + background: var(--bg-card); + border: 1px solid var(--border); + border-radius: 6px; + padding: 1.25rem; + text-align: center; +} +.metric-vs { + font-family: var(--font-mono); + font-size: 0.7rem; + letter-spacing: 0.08em; + text-transform: uppercase; + color: var(--text-dim); + margin-bottom: 0.5rem; +} +.metric-value { + font-family: var(--font-display); + font-size: 2.4rem; + font-weight: 400; + color: var(--amber); + line-height: 1.1; +} +.metric-label { + font-size: 0.82rem; + color: var(--text-secondary); + margin-top: 0.5rem; + line-height: 1.3; +} + +/* Before/after progression */ +.before-after { + display: flex; + align-items: center; + justify-content: center; + gap: 1.5rem; + margin: 2rem 0; + padding: 1.5rem; + background: var(--bg-card); + border: 1px solid var(--border); + border-radius: 6px; +} +.ba-item { text-align: center; } +.ba-label { + font-family: var(--font-mono); + font-size: 0.7rem; + letter-spacing: 0.08em; + text-transform: uppercase; 
+ color: var(--text-dim); + margin-bottom: 0.3rem; +} +.ba-value { + font-family: var(--font-display); + font-size: 1.8rem; + font-weight: 400; + color: var(--text-secondary); +} +.ba-before { + text-decoration: line-through; + text-decoration-color: rgba(192, 98, 58, 0.4); + color: var(--text-dim); +} +.ba-after { color: var(--amber); } +.ba-arrow { font-size: 1.5rem; color: var(--text-dim); } +.ba-ref { + border-left: 1px solid var(--border); + padding-left: 1.5rem; +} + +/* Spike highlight */ +.spike { + background: rgba(192, 98, 58, 0.12); + padding: 0.15em 0.5em; + border-radius: 3px; + font-weight: 600; + color: var(--amber-dim); +} + +/* Section dividers */ +.article hr { + border: none; + height: 1px; + background: var(--border); + margin: 3rem auto; + max-width: 120px; +} + @media (max-width: 640px) { .article { padding: 2rem 1.25rem 4rem; } .article pre { padding: 1rem; margin-left: -0.5rem; margin-right: -0.5rem; border-radius: 0; border-left: none; border-right: none; } + .hero-metrics { grid-template-columns: 1fr; } + .before-after { flex-direction: column; gap: 0.75rem; } + .ba-ref { border-left: none; border-top: 1px solid var(--border); padding-left: 0; padding-top: 0.75rem; } } diff --git a/site/blog/index.html b/site/blog/index.html index 993c166..d4df9e4 100644 --- a/site/blog/index.html +++ b/site/blog/index.html @@ -168,10 +168,17 @@ body::before {

    Blog