feat: recursive DNS + DNSSEC + TCP fallback (#17)

* feat: recursive resolution + full DNSSEC validation

Numa becomes a true DNS resolver — resolves from root nameservers
with complete DNSSEC chain-of-trust verification.

Recursive resolution:
- Iterative RFC 1034 from configurable root hints (13 default)
- CNAME chasing (depth 8), referral following (depth 10)
- A+AAAA glue extraction, IPv6 nameserver support
- TLD priming: NS + DS + DNSKEY for 34 gTLDs + EU ccTLDs
- Config: mode = "recursive" in [upstream], root_hints, prime_tlds

DNSSEC (all 4 phases):
- EDNS0 OPT pseudo-record (DO bit, 1232 payload per DNS Flag Day 2020)
- DNSKEY, DS, RRSIG, NSEC, NSEC3 record types with wire read/write
- Signature verification via ring: RSA/SHA-256, ECDSA P-256, Ed25519
- Chain-of-trust: zone DNSKEY → parent DS → root KSK (key tag 20326)
- DNSKEY RRset self-signature verification (RRSIG(DNSKEY) by KSK)
- RRSIG expiration/inception time validation
- NSEC: NXDOMAIN gap proofs, NODATA type absence, wildcard denial
- NSEC3: SHA-1 iterated hashing, closest encloser proof, hash range
- Authority RRSIG verification for denial proofs
- Config: [dnssec] enabled/strict (default false, opt-in)
- AD bit on Secure, SERVFAIL on Bogus+strict
- DnssecStatus cached per entry, ValidationStats logging

Performance:
- TLD chain pre-warmed on startup (root DNSKEY + TLD DS/DNSKEY)
- Referral DS piggybacking from authority sections
- DNSKEY prefetch before validation loop
- Cold-cache validation: ~1 DNSKEY fetch (down from 5)
- Benchmarks: RSA 10.9µs, ECDSA 174ns, DS verify 257ns

Also:
- write_qname fix for root domain "." (was producing malformed queries)
- write_record_header() dedup, write_bytes() bulk writes
- DnsRecord::domain() + query_type() accessors
- UpstreamMode enum, DEFAULT_EDNS_PAYLOAD const
- Real glue TTL (was hardcoded 3600)
- DNSSEC restricted to recursive mode only

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: TCP fallback, query minimization, UDP auto-disable

Transport resilience for restrictive networks (ISPs blocking UDP:53):
- DNS-over-TCP fallback: UDP fail/truncation → automatic TCP retry
- UDP auto-disable: after 3 consecutive failures, switch to TCP-first
- IPv6 → TCP directly (UDP socket binds 0.0.0.0, can't reach IPv6)
- Network change resets UDP detection for re-probing
- Root hint rotation in TLD priming

Privacy:
- RFC 7816 query minimization: root servers see TLD only, not full name

Code quality:
- Merged find_starting_ns + find_starting_zone → find_closest_ns
- Extracted resolve_ns_addrs_from_glue shared helper
- Removed overall timeout wrapper (per-hop timeouts sufficient)
- forward_tcp for DNS-over-TCP (RFC 1035 §4.2.2)

Testing:
- Mock TCP-only DNS server for fallback tests (no network needed)
- tcp_fallback_resolves_when_udp_blocked
- tcp_only_iterative_resolution
- tcp_fallback_handles_nxdomain
- udp_auto_disable_resets
- Integration test suite (4 suites, 51 tests)
- Network probe script (tests/network-probe.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DNSSEC verified badge in dashboard query log

- Add dnssec field to QueryLogEntry, track validation status per query
- DnssecStatus::as_str() for API serialization
- Dashboard shows green checkmark next to DNSSEC-verified responses
- Blog post: add "How keys get there" section, transport resilience section,
  trim code blocks, update What's Next

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use SVG shield for DNSSEC badge, update blog HTML

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: NS cache lookup from authorities, UDP re-probe, shield alignment

- find_closest_ns checks authorities (not just answers) for NS records,
  fixing TLD priming cache misses that caused redundant root queries
- Periodic UDP re-probe every 5min when disabled — re-enables UDP
  after switching from a restrictive network to an open one
- Dashboard DNSSEC shield uses fixed-width container for alignment
- Blog post: tuck key-tag into trust anchor paragraph

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: TCP single-write, mock server consistency, integration tests

- TCP single-write fix: combine length prefix + message to avoid split
  segments that Microsoft/Azure DNS servers reject
- Mock server (spawn_tcp_dns_server) updated to use single-write too
- Tests: forward_tcp_wire_format, forward_tcp_single_segment_write
- Integration: real-server checks for Microsoft/Office/Azure domains

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: recursive bar in dashboard, special-use domain interception

Dashboard:
- Add Recursive bar to resolution paths chart (cyan, distinct from Override)
- Add RECURSIVE path tag style in query log

Special-use domains (RFC 6761/6303/8880/9462):
- .localhost → 127.0.0.1 (RFC 6761)
- Private reverse PTR (10.x, 192.168.x, 172.16-31.x) → NXDOMAIN
- _dns.resolver.arpa (DDR) → NXDOMAIN
- ipv4only.arpa (NAT64) → 192.0.0.170/171
- mDNS service discovery for private ranges → NXDOMAIN

Eliminates ~900ms SERVFAILs for macOS system queries that were
hitting root servers unnecessarily.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move generated blog HTML to site/blog/posts/, gitignore

- Generated HTML now in site/blog/posts/ (gitignored)
- CI workflow runs pandoc + make blog before deploy
- Updated all internal blog links to /blog/posts/ path
- blog/*.md remains the source of truth

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review feedback — memory ordering, RRSIG time, NS resolution

- Ordering::Relaxed → Acquire/Release for UDP_DISABLED/UDP_FAILURES
  (ARM correctness for cross-thread coordination)
- RRSIG time validation: serial number arithmetic (RFC 4034 §3.1.5)
  + 300s clock skew fudge factor (matches BIND)
- resolve_ns_addrs_from_glue collects addresses from ALL NS names,
  not just the first with glue (improves failover)
- is_special_use_domain: eliminate 16 format! allocations per
  .in-addr.arpa query (parse octet instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: API endpoint tests, coverage target

- 8 new axum handler tests: health, stats, query-log, overrides CRUD,
  cache, blocking stats, services CRUD, dashboard HTML
- Tests use tower::oneshot — no network, no server startup
- test_ctx() builds minimal ServerCtx for isolated testing
- `make coverage` target (cargo-tarpaulin), separate from `make all`
- 82 total tests (was 74)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Razvan Dimescu
2026-03-28 04:03:47 +02:00
committed by GitHub
parent 7aee90c99b
commit a84f2e7f1d
31 changed files with 5477 additions and 776 deletions

419
tests/integration.sh Executable file
View File

@@ -0,0 +1,419 @@
#!/usr/bin/env bash
# Integration test suite for Numa
# Runs a test instance on port 5354, validates all features, exits with status.
# Usage: ./tests/integration.sh [release|debug]
set -euo pipefail
MODE="${1:-release}"
BINARY="./target/$MODE/numa"
PORT=5354
API_PORT=5381
CONFIG="/tmp/numa-integration-test.toml"
LOG="/tmp/numa-integration-test.log"
PASSED=0
FAILED=0
# Colors
GREEN="\033[32m"
RED="\033[31m"
DIM="\033[90m"
RESET="\033[0m"
check() {
local name="$1"
local expected="$2"
local actual="$3"
if echo "$actual" | grep -q "$expected"; then
PASSED=$((PASSED + 1))
printf " ${GREEN}${RESET} %s\n" "$name"
else
FAILED=$((FAILED + 1))
printf " ${RED}${RESET} %s\n" "$name"
printf " ${DIM}expected: %s${RESET}\n" "$expected"
printf " ${DIM} got: %s${RESET}\n" "$actual"
fi
}
# Build if needed
if [ ! -f "$BINARY" ]; then
echo "Building $MODE..."
cargo build --$MODE
fi
run_test_suite() {
local SUITE_NAME="$1"
local SUITE_CONFIG="$2"
cat > "$CONFIG" << CONF
$SUITE_CONFIG
CONF
echo "Starting Numa on :$PORT ($SUITE_NAME)..."
RUST_LOG=info "$BINARY" "$CONFIG" > "$LOG" 2>&1 &
NUMA_PID=$!
sleep 4
if ! kill -0 "$NUMA_PID" 2>/dev/null; then
echo "Failed to start Numa:"
tail -5 "$LOG"
return 1
fi
DIG="dig @127.0.0.1 -p $PORT +time=5 +tries=1"
echo ""
echo "=== Resolution ==="
check "A record (google.com)" \
"." \
"$($DIG google.com A +short)"
check "AAAA record (google.com)" \
":" \
"$($DIG google.com AAAA +short)"
check "CNAME chasing (www.github.com)" \
"github.com" \
"$($DIG www.github.com A +short)"
check "MX records (gmail.com)" \
"gmail-smtp-in" \
"$($DIG gmail.com MX +short)"
check "NS records (cloudflare.com)" \
"cloudflare.com" \
"$($DIG cloudflare.com NS +short)"
check "NXDOMAIN" \
"NXDOMAIN" \
"$($DIG nope12345678.com A 2>&1 | grep status:)"
echo ""
echo "=== Ad Blocking ==="
if echo "$SUITE_CONFIG" | grep -q 'enabled = true'; then
check "Blocked domain → 0.0.0.0" \
"0.0.0.0" \
"$($DIG ads.google.com A +short)"
else
local ADS=$($DIG ads.google.com A +short 2>/dev/null)
if echo "$ADS" | grep -q "0.0.0.0"; then
check "Blocking disabled but domain blocked" "should-resolve" "0.0.0.0"
else
check "Blocking disabled — domain resolves normally" "." "$ADS"
fi
fi
echo ""
echo "=== Cache ==="
$DIG example.com A +short > /dev/null 2>&1
sleep 1
check "Cache hit returns result" \
"." \
"$($DIG example.com A +short)"
echo ""
echo "=== Connectivity ==="
# Apple captive portal can be slow/flaky on some networks
local CAPTIVE
CAPTIVE=$($DIG captive.apple.com A +short 2>/dev/null || echo "timeout")
if echo "$CAPTIVE" | grep -q "apple\|17\.\|timeout"; then
check "Apple captive portal" "." "$CAPTIVE"
else
check "Apple captive portal" "apple" "$CAPTIVE"
fi
check "CDN (jsdelivr)" \
"." \
"$($DIG cdn.jsdelivr.net A +short)"
echo ""
echo "=== API ==="
check "Health endpoint" \
"ok" \
"$(curl -s http://127.0.0.1:$API_PORT/health)"
check "Stats endpoint" \
"uptime_secs" \
"$(curl -s http://127.0.0.1:$API_PORT/stats)"
echo ""
echo "=== Log Health ==="
ERRORS=$(grep -c 'RECURSIVE ERROR\|PARSE ERROR\|HANDLER ERROR\|panic' "$LOG" 2>/dev/null || echo 0)
check "No critical errors in log" \
"0" \
"$ERRORS"
kill "$NUMA_PID" 2>/dev/null || true
wait "$NUMA_PID" 2>/dev/null || true
sleep 1
}
# ---- Suite 1: Recursive mode + DNSSEC ----
echo ""
echo "╔══════════════════════════════════════════╗"
echo "║ Suite 1: Recursive + DNSSEC + Blocking ║"
echo "╚══════════════════════════════════════════╝"
run_test_suite "recursive + DNSSEC + blocking" "
[server]
bind_addr = \"127.0.0.1:5354\"
api_port = 5381
[upstream]
mode = \"recursive\"
[cache]
max_entries = 10000
min_ttl = 60
max_ttl = 86400
[blocking]
enabled = true
[proxy]
enabled = false
[dnssec]
enabled = true
"
DIG="dig @127.0.0.1 -p $PORT +time=5 +tries=1"
echo ""
echo "=== DNSSEC (recursive only) ==="
# Re-start for DNSSEC checks (suite 1 instance was killed)
RUST_LOG=info "$BINARY" "$CONFIG" > "$LOG" 2>&1 &
NUMA_PID=$!
sleep 4
check "AD bit set (cloudflare.com)" \
" ad" \
"$($DIG cloudflare.com A +dnssec 2>&1 | grep flags:)"
check "EDNS DO bit echoed" \
"flags: do" \
"$($DIG cloudflare.com A +dnssec 2>&1 | grep 'EDNS:')"
echo ""
echo "=== TCP wire format (real servers) ==="
# Microsoft's Azure DNS servers require length+message in a single TCP segment.
# This test catches the split-write bug that caused early-eof SERVFAILs.
check "Microsoft domain (update.code.visualstudio.com)" \
"NOERROR" \
"$($DIG update.code.visualstudio.com A 2>&1 | grep status:)"
check "Office domain (ecs.office.com)" \
"NOERROR" \
"$($DIG ecs.office.com A 2>&1 | grep status:)"
# Azure Application Insights — another strict TCP server
check "Azure telemetry (eastus2-3.in.applicationinsights.azure.com)" \
"." \
"$($DIG eastus2-3.in.applicationinsights.azure.com A +short 2>/dev/null || echo 'timeout')"
kill "$NUMA_PID" 2>/dev/null || true
wait "$NUMA_PID" 2>/dev/null || true
sleep 1
# ---- Suite 2: Forward mode (backward compat) ----
echo ""
echo "╔══════════════════════════════════════════╗"
echo "║ Suite 2: Forward (DoH) + Blocking ║"
echo "╚══════════════════════════════════════════╝"
run_test_suite "forward DoH + blocking" "
[server]
bind_addr = \"127.0.0.1:5354\"
api_port = 5381
[upstream]
mode = \"forward\"
address = \"https://9.9.9.9/dns-query\"
[cache]
max_entries = 10000
min_ttl = 60
max_ttl = 86400
[blocking]
enabled = true
[proxy]
enabled = false
"
# ---- Suite 3: Forward UDP (plain, no DoH) ----
echo ""
echo "╔══════════════════════════════════════════╗"
echo "║ Suite 3: Forward (UDP) + No Blocking ║"
echo "╚══════════════════════════════════════════╝"
run_test_suite "forward UDP, no blocking" "
[server]
bind_addr = \"127.0.0.1:5354\"
api_port = 5381
[upstream]
mode = \"forward\"
address = \"9.9.9.9\"
port = 53
[cache]
max_entries = 10000
min_ttl = 60
max_ttl = 86400
[blocking]
enabled = false
[proxy]
enabled = false
"
# Verify blocking is actually off
RUST_LOG=info "$BINARY" "$CONFIG" > "$LOG" 2>&1 &
NUMA_PID=$!
sleep 3
echo ""
echo "=== Blocking disabled ==="
ADS_RESULT=$($DIG ads.google.com A +short 2>/dev/null)
if echo "$ADS_RESULT" | grep -q "0.0.0.0"; then
check "ads.google.com NOT blocked (blocking disabled)" "not-0.0.0.0" "0.0.0.0"
else
check "ads.google.com NOT blocked (blocking disabled)" "." "$ADS_RESULT"
fi
kill "$NUMA_PID" 2>/dev/null || true
wait "$NUMA_PID" 2>/dev/null || true
sleep 1
# ---- Suite 4: Local zones + Overrides API ----
echo ""
echo "╔══════════════════════════════════════════╗"
echo "║ Suite 4: Local Zones + Overrides API ║"
echo "╚══════════════════════════════════════════╝"
cat > "$CONFIG" << 'CONF'
[server]
bind_addr = "127.0.0.1:5354"
api_port = 5381
[upstream]
mode = "forward"
address = "9.9.9.9"
port = 53
[cache]
max_entries = 10000
[blocking]
enabled = false
[proxy]
enabled = false
[[zones]]
domain = "test.local"
record_type = "A"
value = "10.0.0.1"
ttl = 60
[[zones]]
domain = "mail.local"
record_type = "MX"
value = "10 smtp.local"
ttl = 60
CONF
RUST_LOG=info "$BINARY" "$CONFIG" > "$LOG" 2>&1 &
NUMA_PID=$!
sleep 3
echo ""
echo "=== Local Zones ==="
check "Local A record (test.local)" \
"10.0.0.1" \
"$($DIG test.local A +short)"
check "Local MX record (mail.local)" \
"smtp.local" \
"$($DIG mail.local MX +short)"
check "Non-local domain still resolves" \
"." \
"$($DIG example.com A +short)"
echo ""
echo "=== Overrides API ==="
# Create override
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -X POST http://127.0.0.1:$API_PORT/overrides \
-H 'Content-Type: application/json' \
-d '{"domain":"override.test","target":"192.168.1.100","duration_secs":60}')
check "Create override (HTTP 200/201)" \
"20" \
"$HTTP_CODE"
sleep 1
check "Override resolves" \
"192.168.1.100" \
"$($DIG override.test A +short)"
# List overrides
check "List overrides" \
"override.test" \
"$(curl -s http://127.0.0.1:$API_PORT/overrides)"
# Delete override
curl -s -X DELETE http://127.0.0.1:$API_PORT/overrides/override.test > /dev/null
sleep 1
# After delete, should not resolve to override
AFTER_DELETE=$($DIG override.test A +short 2>/dev/null)
if echo "$AFTER_DELETE" | grep -q "192.168.1.100"; then
check "Override deleted" "not-192.168.1.100" "$AFTER_DELETE"
else
check "Override deleted" "." "deleted"
fi
echo ""
echo "=== Cache API ==="
check "Cache list" \
"domain" \
"$(curl -s http://127.0.0.1:$API_PORT/cache)"
# Flush cache
curl -s -X DELETE http://127.0.0.1:$API_PORT/cache > /dev/null
check "Cache flushed" \
"0" \
"$(curl -s http://127.0.0.1:$API_PORT/stats | grep -o '"entries":[0-9]*' | grep -o '[0-9]*')"
kill "$NUMA_PID" 2>/dev/null || true
wait "$NUMA_PID" 2>/dev/null || true
# Summary
echo ""
TOTAL=$((PASSED + FAILED))
if [ "$FAILED" -eq 0 ]; then
printf "${GREEN}All %d tests passed.${RESET}\n" "$TOTAL"
exit 0
else
printf "${RED}%d/%d tests failed.${RESET}\n" "$FAILED" "$TOTAL"
echo ""
echo "Log: $LOG"
exit 1
fi

128
tests/network-probe.sh Executable file
View File

@@ -0,0 +1,128 @@
#!/usr/bin/env bash
# Network probe: tests which DNS transports are available on the current network.
# Run on a problematic network to diagnose what's blocked.
# Usage: ./tests/network-probe.sh
set -euo pipefail
GREEN="\033[32m"
RED="\033[31m"
DIM="\033[90m"
RESET="\033[0m"
PASSED=0
FAILED=0
probe() {
local name="$1"
local cmd="$2"
local expect="$3"
local result
result=$(eval "$cmd" 2>&1) || true
if echo "$result" | grep -q "$expect"; then
PASSED=$((PASSED + 1))
printf " ${GREEN}${RESET} %-45s ${DIM}%s${RESET}\n" "$name" "$(echo "$result" | head -1 | cut -c1-60)"
else
FAILED=$((FAILED + 1))
printf " ${RED}${RESET} %-45s ${DIM}blocked/timeout${RESET}\n" "$name"
fi
}
echo ""
echo "Network DNS Transport Probe"
echo "==========================="
echo "Network: $(networksetup -getairportnetwork en0 2>/dev/null | sed 's/Current Wi-Fi Network: //' || echo 'unknown')"
echo "Local IP: $(ipconfig getifaddr en0 2>/dev/null || echo 'unknown')"
echo "Gateway: $(route -n get default 2>/dev/null | grep gateway | awk '{print $2}' || echo 'unknown')"
echo ""
echo "=== UDP port 53 (recursive resolution) ==="
probe "Root server a (198.41.0.4)" \
"dig @198.41.0.4 . NS +short +time=5 +tries=1" \
"root-servers"
probe "Root server k (193.0.14.129)" \
"dig @193.0.14.129 . NS +short +time=5 +tries=1" \
"root-servers"
probe "Google DNS (8.8.8.8)" \
"dig @8.8.8.8 google.com A +short +time=5 +tries=1" \
"\."
probe "Cloudflare (1.1.1.1)" \
"dig @1.1.1.1 cloudflare.com A +short +time=5 +tries=1" \
"\."
probe ".com TLD (192.5.6.30)" \
"dig @192.5.6.30 google.com NS +short +time=5 +tries=1" \
"google"
echo ""
echo "=== TCP port 53 ==="
probe "Google DNS TCP (8.8.8.8)" \
"dig @8.8.8.8 google.com A +short +tcp +time=5 +tries=1" \
"\."
probe "Root server TCP (198.41.0.4)" \
"dig @198.41.0.4 . NS +short +tcp +time=5 +tries=1" \
"root-servers"
echo ""
echo "=== DoT port 853 (DNS-over-TLS) ==="
probe "Quad9 DoT (9.9.9.9:853)" \
"echo Q | openssl s_client -connect 9.9.9.9:853 -servername dns.quad9.net 2>&1 | grep 'verify return'" \
"verify return"
probe "Cloudflare DoT (1.1.1.1:853)" \
"echo Q | openssl s_client -connect 1.1.1.1:853 -servername cloudflare-dns.com 2>&1 | grep 'verify return'" \
"verify return"
echo ""
echo "=== DoH port 443 (DNS-over-HTTPS) ==="
probe "Quad9 DoH (dns.quad9.net)" \
"curl -s -m 5 -H 'accept: application/dns-json' 'https://dns.quad9.net:443/dns-query?name=google.com&type=A'" \
"Answer"
probe "Cloudflare DoH (1.1.1.1)" \
"curl -s -m 5 -H 'accept: application/dns-json' 'https://1.1.1.1/dns-query?name=google.com&type=A'" \
"Answer"
probe "Google DoH (dns.google)" \
"curl -s -m 5 'https://dns.google/resolve?name=google.com&type=A'" \
"Answer"
echo ""
echo "=== ISP DNS ==="
# Detect system DNS
SYS_DNS=$(scutil --dns 2>/dev/null | grep "nameserver\[0\]" | head -1 | awk '{print $3}' || echo "unknown")
if [ "$SYS_DNS" != "unknown" ] && [ "$SYS_DNS" != "127.0.0.1" ]; then
probe "ISP DNS ($SYS_DNS)" \
"dig @$SYS_DNS google.com A +short +time=5 +tries=1" \
"\."
else
printf " ${DIM} System DNS is $SYS_DNS (skipped)${RESET}\n"
fi
echo ""
echo "==========================="
TOTAL=$((PASSED + FAILED))
printf "Results: ${GREEN}%d passed${RESET}, ${RED}%d blocked${RESET} / %d total\n" "$PASSED" "$FAILED" "$TOTAL"
echo ""
echo "Recommendation:"
if [ "$FAILED" -eq 0 ]; then
echo " All transports available. Recursive mode will work."
elif dig @198.41.0.4 . NS +short +time=5 +tries=1 2>&1 | grep -q "root-servers"; then
echo " UDP:53 works. Recursive mode will work."
else
echo " UDP:53 blocked — recursive mode will NOT work on this network."
if curl -s -m 5 'https://dns.quad9.net:443/dns-query?name=test.com&type=A' 2>&1 | grep -q "Answer"; then
echo " DoH (port 443) works — use mode = \"forward\" with DoH upstream."
elif echo Q | openssl s_client -connect 9.9.9.9:853 2>&1 | grep -q "verify return"; then
echo " DoT (port 853) works — DoT upstream would work (not yet implemented)."
else
echo " Only ISP DNS available. Use mode = \"forward\" with ISP auto-detect."
fi
fi