fix(bootstrap): route numa HTTPS via IP-literal bootstrap resolver (#122) #126
Summary
When numa is its own system DNS resolver (HAOS add-on, Pi-hole-style container, `/etc/resolv.conf` → 127.0.0.1), every outgoing HTTPS connection (DoH upstream, ODoH relay/target, blocklist CDN) routed its hostname through `getaddrinfo()` back to numa. Cold boot deadlocked; steady state taxed every new TCP connection. 0.14.1's retry-with-backoff (#122) masked the startup race but not the underlying self-loop.

Three commits:
- `refactor(ctx): coalesce forward-path upstream queries`: `resolve_coalesced` now covers Forward + Forwarded-rule branches, not just Recursive. Fixes thundering-herd at boot when N concurrent HTTPS setups each fire independent forward queries for the same upstream hostname.
- `fix(bootstrap): route numa HTTPS via IP-literal bootstrap resolver`: new `NumaResolver` implementing `reqwest::dns::Resolve`:
  - Overrides (`relay_ip`/`target_ip`) short-circuit DNS entirely, preserving ODoH's zero-plain-DNS-leak property.
  - Bootstrap IPs come from `upstream.fallback` (IP-literal filtered, hostnames warned). Empty → `[9.9.9.9, 1.1.1.1]` default; source logged at startup.
  - `doh_keepalive_loop` fires immediately at boot and `keepalive_doh` now logs failures, so bootstrap problems surface within ~100ms instead of on first client query.
- `test(odoh): integration-verify relay_ip/target_ip override wiring`: Suite 8 now asserts `relay_ip`/`target_ip` land in the bootstrap override map; `NumaResolver::new` logs the override map at INFO.

Distinct from `UpstreamPool.fallback` (client-query failover), which stays untouched: client queries with no configured fallback still SERVFAIL on primary failure rather than silently shadow-routing through the bootstrap IPs.

Validation
- `cargo test --lib`: 354/354
- `cargo clippy --all-targets`: no new warnings
- `tests/integration.sh`: 101/101 (Suite 8 gained 2 override-wiring checks)
- `tests/docker/self-resolver-loop.sh` (new reproducer): passes with 397k blocklist domains loaded, first DoH query 118ms. Before fix: 0 domains, 3072ms SERVFAIL.

Test plan
- Run `tests/docker/self-resolver-loop.sh` locally to confirm the reproducer passes.
- Startup log contains the `bootstrap resolver: … via …` line. (Verified via reproducer's `numa.log` tail and local smoke run.)
- Cold boot works without the `/etc/resolv.conf` workaround. (Covered by the docker reproducer: `/etc/resolv.conf` → 127.0.0.1, numa on :53, no workaround, 397k domains loaded cold.)
- `dig @127.0.0.1 example.com` with hostname DoH primary + empty fallback returns in <200ms on cold start. (Reproducer config matches exactly; measured 118ms.)
- `relay_ip`/`target_ip` still behave as before (no DNS query leaves the box for configured endpoints). (Integration Suite 8 now runs numa in ODoH mode with TEST-NET-1 override IPs and asserts both `relay_ip`/`target_ip` land in the bootstrap override map; combined with the `override_returns_configured_ips_without_dns` unit test, this covers the zero-plain-DNS-leak property end-to-end.)

Follow-ups
- Add `self-resolver-loop.sh` to the standard smoke set alongside `smoke-port53.sh`.
- Ask `Guara92/numa-haos` to drop the `/etc/resolv.conf` injection once 0.14.2 ships.

Closes #122.
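Reviewer note: the bootstrap IP selection described in the Summary (IP literals kept, hostnames warned about, `[9.9.9.9, 1.1.1.1]` as the default) could look roughly like the sketch below. It is not the code in this PR. `bootstrap_ips` and `BootstrapSource` are hypothetical names, entries are assumed to be bare addresses without ports, and the default is assumed to apply whenever no IP literal remains after filtering.

```rust
use std::net::IpAddr;

/// Where the bootstrap IPs came from, so the caller can log it at startup.
#[derive(Debug, PartialEq)]
enum BootstrapSource {
    ConfiguredFallback,
    BuiltInDefault,
}

/// Hypothetical helper: keep the IP literals from `fallback`, collect the
/// hostnames so the caller can warn about them, and use the defaults named
/// in this PR when nothing usable remains (assumption, see lead-in).
fn bootstrap_ips(fallback: &[&str]) -> (Vec<IpAddr>, Vec<String>, BootstrapSource) {
    let mut ips = Vec::new();
    let mut skipped = Vec::new();
    for entry in fallback {
        match entry.trim().parse::<IpAddr>() {
            Ok(ip) => ips.push(ip),
            // A hostname here would need DNS to resolve, which is the loop
            // this PR removes, so it is reported instead of used.
            Err(_) => skipped.push(entry.to_string()),
        }
    }
    if ips.is_empty() {
        let defaults = vec![IpAddr::from([9, 9, 9, 9]), IpAddr::from([1, 1, 1, 1])];
        (defaults, skipped, BootstrapSource::BuiltInDefault)
    } else {
        (ips, skipped, BootstrapSource::ConfiguredFallback)
    }
}

fn main() {
    let (ips, skipped, source) = bootstrap_ips(&["9.9.9.9", "dns.example.net"]);
    for host in &skipped {
        eprintln!("warning: ignoring hostname {host} in bootstrap list");
    }
    println!("bootstrap IPs {ips:?} (source: {source:?})");
}
```

Returning the skipped hostnames instead of logging inside the helper keeps the filtering easy to unit test.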
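Reviewer note: the coalescing idea behind the `refactor(ctx)` commit (concurrent lookups for the same upstream hostname share one upstream query) can be illustrated with a small single-flight sketch. It uses std threads and a condvar, not the async primitives `resolve_coalesced` presumably uses. `Coalescer`, `Slot`, `demo`, and the hostname and address are all hypothetical. For brevity it does not handle a panicking leader, which would leave followers waiting.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// One in-flight lookup: the result slot plus a condvar to wake waiters.
type Slot = Arc<(Mutex<Option<String>>, Condvar)>;

#[derive(Default)]
struct Coalescer {
    in_flight: Mutex<HashMap<String, Slot>>,
}

impl Coalescer {
    /// The first caller for `key` (the leader) runs `fetch`; callers that
    /// arrive while it is in flight wait and reuse the leader's answer.
    fn resolve<F: FnOnce() -> String>(&self, key: &str, fetch: F) -> String {
        let (slot, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            let existing = map.get(key).cloned();
            match existing {
                Some(slot) => (slot, false),
                None => {
                    let slot: Slot = Arc::new((Mutex::new(None), Condvar::new()));
                    map.insert(key.to_string(), Arc::clone(&slot));
                    (slot, true)
                }
            }
        };
        let (result, ready) = &*slot;
        if leader {
            let answer = fetch();
            // Publish under the lock, then retire the entry and wake waiters.
            *result.lock().unwrap() = Some(answer.clone());
            self.in_flight.lock().unwrap().remove(key);
            ready.notify_all();
            answer
        } else {
            let mut guard = result.lock().unwrap();
            while guard.is_none() {
                guard = ready.wait(guard).unwrap();
            }
            guard.clone().expect("leader published an answer")
        }
    }
}

/// Four threads ask for the same name at once; returns how many upstream
/// lookups actually ran, plus every caller's answer.
fn demo() -> (usize, Vec<String>) {
    let coalescer = Arc::new(Coalescer::default());
    let upstream_calls = Arc::new(AtomicUsize::new(0));
    let start = Arc::new(Barrier::new(4));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let c = Arc::clone(&coalescer);
            let calls = Arc::clone(&upstream_calls);
            let start = Arc::clone(&start);
            thread::spawn(move || {
                start.wait();
                c.resolve("doh.example.net", || {
                    calls.fetch_add(1, Ordering::SeqCst);
                    thread::sleep(Duration::from_millis(200)); // simulated upstream latency
                    "192.0.2.1".to_string()
                })
            })
        })
        .collect();
    let answers = handles.into_iter().map(|h| h.join().unwrap()).collect();
    (upstream_calls.load(Ordering::SeqCst), answers)
}

fn main() {
    let (calls, answers) = demo();
    println!("{} callers shared {} upstream lookup(s)", answers.len(), calls);
}
```

When the callers overlap, only the leader's closure runs, which is the property that removes the boot-time thundering herd.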