fix(bootstrap): route numa HTTPS via IP-literal bootstrap resolver (#122) #126

Merged
razvandimescu merged 4 commits from fix/self-resolver-loop into main 2026-04-21 22:52:51 +08:00
razvandimescu commented 2026-04-21 21:19:47 +08:00 (Migrated from github.com)

Summary

When numa is its own system DNS resolver (HAOS add-on, Pi-hole-style container, /etc/resolv.conf → 127.0.0.1), every outgoing HTTPS connection — DoH upstream, ODoH relay/target, blocklist CDN — routed its hostname through getaddrinfo() back to numa. Cold boot deadlocked; steady state taxed every new TCP connection. 0.14.1's retry-with-backoff (#122) masked the startup race but not the underlying self-loop.

Three commits:

  1. refactor(ctx): coalesce forward-path upstream queries — resolve_coalesced now covers Forward + Forwarded-rule branches, not just Recursive. Fixes thundering-herd at boot when N concurrent HTTPS setups each fire independent forward queries for the same upstream hostname.

  2. fix(bootstrap): route numa HTTPS via IP-literal bootstrap resolver — new NumaResolver impl reqwest::dns::Resolve:

    • Per-host overrides (ODoH relay_ip/target_ip) short-circuit DNS entirely, preserving ODoH's zero-plain-DNS-leak property.
    • Otherwise: A+AAAA in parallel via UDP to IP-literal bootstrap servers, TCP fallback for UDP-hostile networks.
    • Bootstrap IPs come from upstream.fallback (IP-literal filtered, hostnames warned). Empty → [9.9.9.9, 1.1.1.1] default; source logged at startup.
    • doh_keepalive_loop fires immediately at boot + keepalive_doh now logs failures → bootstrap problems surface within ~100ms instead of on first client query.
  3. test(odoh): integration-verify relay_ip/target_ip override wiring — Suite 8 now asserts relay_ip/target_ip land in the bootstrap override map; NumaResolver::new logs the override map at INFO.
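For reviewers unfamiliar with query coalescing, the idea behind commit 1 can be sketched with the standard library alone. This is not numa's resolve_coalesced: the `Coalescer` type, its `resolve` method, and the `String` answer type are hypothetical names chosen for illustration, and the real code path is async rather than thread-based.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// One in-flight lookup: a result slot plus a condvar to wake waiters.
type Slot = Arc<(Mutex<Option<String>>, Condvar)>;

/// Hypothetical single-flight table keyed by query name.
#[derive(Default)]
struct Coalescer {
    inflight: Mutex<HashMap<String, Slot>>,
}

impl Coalescer {
    /// The first caller for `key` runs `lookup`; callers that arrive while
    /// it is still running wait and receive a copy of the same answer.
    fn resolve<F>(&self, key: &str, lookup: F) -> String
    where
        F: FnOnce() -> String,
    {
        // Join an existing in-flight entry, or register as the leader.
        let (slot, leader) = {
            let mut map = self.inflight.lock().unwrap();
            match map.get(key) {
                Some(slot) => (Arc::clone(slot), false),
                None => {
                    let slot: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                    map.insert(key.to_string(), Arc::clone(&slot));
                    (slot, true)
                }
            }
        };

        let (result, ready) = &*slot;
        if leader {
            let answer = lookup();
            // Publish the answer, retire the entry, then wake all waiters.
            *result.lock().unwrap() = Some(answer.clone());
            self.inflight.lock().unwrap().remove(key);
            ready.notify_all();
            answer
        } else {
            let mut guard = result.lock().unwrap();
            while guard.is_none() {
                guard = ready.wait(guard).unwrap();
            }
            (*guard).clone().expect("leader published an answer")
        }
    }
}
```

With N concurrent HTTPS setups asking for the same upstream hostname, only the leader sends a query. The sketch does not handle a panicking `lookup` (waiters would block), which a production version would need to address.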

Distinct from UpstreamPool.fallback (client-query failover), which stays untouched: client queries with no configured fallback still SERVFAIL on primary failure rather than silently shadow-routing through the bootstrap IPs.
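A rough sketch of the selection logic described above, using only the standard library. The names `select_bootstrap`, `plan_for`, and `Plan` are hypothetical and do not come from the numa source. The sketch assumes fallback entries are bare addresses without ports, and that the default list applies whenever no IP literal survives filtering.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Default bootstrap servers named in the PR description.
const DEFAULT_BOOTSTRAP: [&str; 2] = ["9.9.9.9", "1.1.1.1"];

/// Hypothetical helper: keep the IP-literal entries of `fallback` and
/// return the skipped hostnames so the caller can log a warning for them.
fn select_bootstrap(fallback: &[&str]) -> (Vec<IpAddr>, Vec<String>) {
    let mut ips = Vec::new();
    let mut skipped = Vec::new();
    for entry in fallback {
        match entry.parse::<IpAddr>() {
            Ok(ip) => ips.push(ip),
            Err(_) => skipped.push(entry.to_string()),
        }
    }
    // Assumption: use the defaults when no usable IP literal remains.
    if ips.is_empty() {
        ips = DEFAULT_BOOTSTRAP
            .iter()
            .map(|s| s.parse().expect("default is a valid IP literal"))
            .collect();
    }
    (ips, skipped)
}

/// What the resolver would do for one hostname.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Use the configured addresses; no DNS query is sent.
    Override(Vec<IpAddr>),
    /// Query these bootstrap servers for A/AAAA records.
    Bootstrap(Vec<IpAddr>),
}

/// Per-host overrides win; everything else goes to the bootstrap servers.
fn plan_for(
    host: &str,
    overrides: &HashMap<String, Vec<IpAddr>>,
    bootstrap: &[IpAddr],
) -> Plan {
    match overrides.get(host) {
        Some(ips) => Plan::Override(ips.clone()),
        None => Plan::Bootstrap(bootstrap.to_vec()),
    }
}
```

Nothing in this sketch touches the client-query path, which mirrors the separation from UpstreamPool.fallback: the bootstrap list only answers lookups for numa's own outgoing HTTPS hostnames.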

Validation

  • cargo test --lib — 354/354
  • cargo clippy --all-targets — no new warnings
  • tests/integration.sh — 101/101 (Suite 8 gained 2 override-wiring checks)
  • tests/docker/self-resolver-loop.sh (new reproducer) — passes: 397k blocklist domains loaded, first DoH query 118ms. Before fix: 0 domains, 3072ms SERVFAIL.

Test plan

  • Run tests/docker/self-resolver-loop.sh locally to confirm the reproducer passes.
  • Check startup log shows bootstrap resolver: … via … line. (verified via reproducer's numa.log tail and local smoke run.)
  • Deploy on a box where numa is the system resolver (HAOS / Pi-hole-shape) and confirm blocklist downloads on cold boot without the /etc/resolv.conf workaround. (covered by the docker reproducer: /etc/resolv.conf → 127.0.0.1, numa on :53, no workaround — 397k domains loaded cold.)
  • Confirm dig @127.0.0.1 example.com with hostname DoH primary + empty fallback returns in <200ms on cold start. (reproducer config matches exactly; measured 118ms.)
  • Confirm ODoH relay_ip / target_ip still behave as before (no DNS query leaves the box for configured endpoints). (integration Suite 8 now runs numa in ODoH mode with TEST-NET-1 override IPs and asserts both relay_ip/target_ip land in the bootstrap override map; combined with the override_returns_configured_ips_without_dns unit test, this covers the zero-plain-DNS-leak property end-to-end.)

Follow-ups

  • Graduate self-resolver-loop.sh into the standard smoke set alongside smoke-port53.sh.
  • Ping Guara92/numa-haos to drop the /etc/resolv.conf injection once 0.14.2 ships.

Closes #122.
