feat: recursive DNS + DNSSEC + TCP fallback (#17)

* feat: recursive resolution + full DNSSEC validation

Numa becomes a true DNS resolver — resolves from root nameservers
with complete DNSSEC chain-of-trust verification.

Recursive resolution:
- Iterative RFC 1034 from configurable root hints (13 default)
- CNAME chasing (depth 8), referral following (depth 10)
- A+AAAA glue extraction, IPv6 nameserver support
- TLD priming: NS + DS + DNSKEY for 34 gTLDs + EU ccTLDs
- Config: mode = "recursive" in [upstream], root_hints, prime_tlds

DNSSEC (all 4 phases):
- EDNS0 OPT pseudo-record (DO bit, 1232 payload per DNS Flag Day 2020)
- DNSKEY, DS, RRSIG, NSEC, NSEC3 record types with wire read/write
- Signature verification via ring: RSA/SHA-256, ECDSA P-256, Ed25519
- Chain-of-trust: zone DNSKEY → parent DS → root KSK (key tag 20326)
- DNSKEY RRset self-signature verification (RRSIG(DNSKEY) by KSK)
- RRSIG expiration/inception time validation
- NSEC: NXDOMAIN gap proofs, NODATA type absence, wildcard denial
- NSEC3: SHA-1 iterated hashing, closest encloser proof, hash range
- Authority RRSIG verification for denial proofs
- Config: [dnssec] enabled/strict (default false, opt-in)
- AD bit on Secure, SERVFAIL on Bogus+strict
- DnssecStatus cached per entry, ValidationStats logging

Performance:
- TLD chain pre-warmed on startup (root DNSKEY + TLD DS/DNSKEY)
- Referral DS piggybacking from authority sections
- DNSKEY prefetch before validation loop
- Cold-cache validation: ~1 DNSKEY fetch (down from 5)
- Benchmarks: RSA 10.9µs, ECDSA 174ns, DS verify 257ns

Also:
- write_qname fix for root domain "." (was producing malformed queries)
- write_record_header() dedup, write_bytes() bulk writes
- DnsRecord::domain() + query_type() accessors
- UpstreamMode enum, DEFAULT_EDNS_PAYLOAD const
- Real glue TTL (was hardcoded 3600)
- DNSSEC restricted to recursive mode only

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: TCP fallback, query minimization, UDP auto-disable

Transport resilience for restrictive networks (ISPs blocking UDP:53):
- DNS-over-TCP fallback: UDP fail/truncation → automatic TCP retry
- UDP auto-disable: after 3 consecutive failures, switch to TCP-first
- IPv6 → TCP directly (UDP socket binds 0.0.0.0, can't reach IPv6)
- Network change resets UDP detection for re-probing
- Root hint rotation in TLD priming

Privacy:
- RFC 7816 query minimization: root servers see TLD only, not full name

Code quality:
- Merged find_starting_ns + find_starting_zone → find_closest_ns
- Extracted resolve_ns_addrs_from_glue shared helper
- Removed overall timeout wrapper (per-hop timeouts sufficient)
- forward_tcp for DNS-over-TCP (RFC 1035 §4.2.2)

Testing:
- Mock TCP-only DNS server for fallback tests (no network needed)
- tcp_fallback_resolves_when_udp_blocked
- tcp_only_iterative_resolution
- tcp_fallback_handles_nxdomain
- udp_auto_disable_resets
- Integration test suite (4 suites, 51 tests)
- Network probe script (tests/network-probe.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: DNSSEC verified badge in dashboard query log

- Add dnssec field to QueryLogEntry, track validation status per query
- DnssecStatus::as_str() for API serialization
- Dashboard shows green checkmark next to DNSSEC-verified responses
- Blog post: add "How keys get there" section, transport resilience section,
  trim code blocks, update What's Next

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: use SVG shield for DNSSEC badge, update blog HTML

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: NS cache lookup from authorities, UDP re-probe, shield alignment

- find_closest_ns checks authorities (not just answers) for NS records,
  fixing TLD priming cache misses that caused redundant root queries
- Periodic UDP re-probe every 5min when disabled — re-enables UDP
  after switching from a restrictive network to an open one
- Dashboard DNSSEC shield uses fixed-width container for alignment
- Blog post: tuck key-tag into trust anchor paragraph

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: TCP single-write, mock server consistency, integration tests

- TCP single-write fix: combine length prefix + message to avoid split
  segments that Microsoft/Azure DNS servers reject
- Mock server (spawn_tcp_dns_server) updated to use single-write too
- Tests: forward_tcp_wire_format, forward_tcp_single_segment_write
- Integration: real-server checks for Microsoft/Office/Azure domains

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: recursive bar in dashboard, special-use domain interception

Dashboard:
- Add Recursive bar to resolution paths chart (cyan, distinct from Override)
- Add RECURSIVE path tag style in query log

Special-use domains (RFC 6761/6303/8880/9462):
- .localhost → 127.0.0.1 (RFC 6761)
- Private reverse PTR (10.x, 192.168.x, 172.16-31.x) → NXDOMAIN
- _dns.resolver.arpa (DDR) → NXDOMAIN
- ipv4only.arpa (NAT64) → 192.0.0.170/171
- mDNS service discovery for private ranges → NXDOMAIN

Eliminates ~900ms SERVFAILs for macOS system queries that were
hitting root servers unnecessarily.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: move generated blog HTML to site/blog/posts/, gitignore

- Generated HTML now in site/blog/posts/ (gitignored)
- CI workflow runs pandoc + make blog before deploy
- Updated all internal blog links to /blog/posts/ path
- blog/*.md remains the source of truth

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: review feedback — memory ordering, RRSIG time, NS resolution

- Ordering::Relaxed → Acquire/Release for UDP_DISABLED/UDP_FAILURES
  (ARM correctness for cross-thread coordination)
- RRSIG time validation: serial number arithmetic (RFC 4034 §3.1.5)
  + 300s clock skew fudge factor (matches BIND)
- resolve_ns_addrs_from_glue collects addresses from ALL NS names,
  not just the first with glue (improves failover)
- is_special_use_domain: eliminate 16 format! allocations per
  .in-addr.arpa query (parse octet instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: API endpoint tests, coverage target

- 8 new axum handler tests: health, stats, query-log, overrides CRUD,
  cache, blocking stats, services CRUD, dashboard HTML
- Tests use tower::oneshot — no network, no server startup
- test_ctx() builds minimal ServerCtx for isolated testing
- `make coverage` target (cargo-tarpaulin), separate from `make all`
- 82 total tests (was 74)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Razvan Dimescu
2026-03-28 04:03:47 +02:00
committed by GitHub
parent 7aee90c99b
commit a84f2e7f1d
31 changed files with 5477 additions and 776 deletions

View File

@@ -153,6 +153,7 @@ struct QueryLogResponse {
path: String,
rescode: String,
latency_ms: f64,
dnssec: String,
}
#[derive(Serialize)]
@@ -178,6 +179,7 @@ struct LanStatsResponse {
struct QueriesStats {
total: u64,
forwarded: u64,
recursive: u64,
cached: u64,
local: u64,
overridden: u64,
@@ -460,6 +462,7 @@ async fn query_log(
path: e.path.as_str().to_string(),
rescode: e.rescode.as_str().to_string(),
latency_ms: e.latency_us as f64 / 1000.0,
dnssec: e.dnssec.as_str().to_string(),
}
})
.collect()
@@ -477,7 +480,11 @@ async fn stats(State(ctx): State<Arc<ServerCtx>>) -> Json<StatsResponse> {
let override_count = ctx.overrides.read().unwrap().active_count();
let bl_stats = ctx.blocklist.read().unwrap().stats();
let upstream = ctx.upstream.lock().unwrap().to_string();
let upstream = if ctx.upstream_mode == crate::config::UpstreamMode::Recursive {
"recursive (root hints)".to_string()
} else {
ctx.upstream.lock().unwrap().to_string()
};
Json(StatsResponse {
uptime_secs: snap.uptime_secs,
@@ -487,6 +494,7 @@ async fn stats(State(ctx): State<Arc<ServerCtx>>) -> Json<StatsResponse> {
queries: QueriesStats {
total: snap.total,
forwarded: snap.forwarded,
recursive: snap.recursive,
cached: snap.cached,
local: snap.local,
overridden: snap.overridden,
@@ -901,3 +909,252 @@ async fn check_tcp(addr: std::net::SocketAddr) -> bool {
.map(|r| r.is_ok())
.unwrap_or(false)
}
#[cfg(test)]
mod tests {
use super::*;
use axum::body::Body;
use http::Request;
use std::sync::{Mutex, RwLock};
use tower::ServiceExt;
async fn test_ctx() -> Arc<ServerCtx> {
let socket = tokio::net::UdpSocket::bind("127.0.0.1:0").await.unwrap();
Arc::new(ServerCtx {
socket,
zone_map: std::collections::HashMap::new(),
cache: RwLock::new(crate::cache::DnsCache::new(100, 60, 86400)),
stats: Mutex::new(crate::stats::ServerStats::new()),
overrides: RwLock::new(crate::override_store::OverrideStore::new()),
blocklist: RwLock::new(crate::blocklist::BlocklistStore::new()),
query_log: Mutex::new(crate::query_log::QueryLog::new(100)),
services: Mutex::new(crate::service_store::ServiceStore::new()),
lan_peers: Mutex::new(crate::lan::PeerStore::new(90)),
forwarding_rules: Vec::new(),
upstream: Mutex::new(crate::forward::Upstream::Udp(
"127.0.0.1:53".parse().unwrap(),
)),
upstream_auto: false,
upstream_port: 53,
lan_ip: Mutex::new(std::net::Ipv4Addr::LOCALHOST),
timeout: std::time::Duration::from_secs(3),
proxy_tld: "numa".to_string(),
proxy_tld_suffix: ".numa".to_string(),
lan_enabled: false,
config_path: "/tmp/test-numa.toml".to_string(),
config_found: false,
config_dir: std::path::PathBuf::from("/tmp"),
data_dir: std::path::PathBuf::from("/tmp"),
tls_config: None,
upstream_mode: crate::config::UpstreamMode::Forward,
root_hints: Vec::new(),
dnssec_enabled: false,
dnssec_strict: false,
})
}
#[tokio::test]
async fn health_returns_ok() {
let ctx = test_ctx().await;
let resp = router(ctx)
.oneshot(Request::get("/health").body(Body::empty()).unwrap())
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body = axum::body::to_bytes(resp.into_body(), 1000).await.unwrap();
let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
assert_eq!(json["status"], "ok");
}
#[tokio::test]
async fn stats_returns_json() {
let ctx = test_ctx().await;
let resp = router(ctx)
.oneshot(Request::get("/stats").body(Body::empty()).unwrap())
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
assert!(json["uptime_secs"].is_number());
assert!(json["queries"]["total"].is_number());
}
#[tokio::test]
async fn query_log_empty() {
let ctx = test_ctx().await;
let resp = router(ctx)
.oneshot(
Request::get("/query-log?limit=10")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
assert!(json.as_array().unwrap().is_empty());
}
#[tokio::test]
async fn overrides_crud() {
let ctx = test_ctx().await;
let a = router(ctx.clone());
// Create
let resp = a
.clone()
.oneshot(
Request::post("/overrides")
.header("content-type", "application/json")
.body(Body::from(
r#"{"domain":"test.dev","target":"1.2.3.4","duration_secs":60}"#,
))
.unwrap(),
)
.await
.unwrap();
assert!(resp.status().is_success());
// List
let resp = a
.clone()
.oneshot(Request::get("/overrides").body(Body::empty()).unwrap())
.await
.unwrap();
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
assert!(String::from_utf8_lossy(&body).contains("test.dev"));
// Get
let resp = a
.clone()
.oneshot(
Request::get("/overrides/test.dev")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(resp.status(), 200);
// Delete
let resp = a
.clone()
.oneshot(
Request::delete("/overrides/test.dev")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert!(resp.status().is_success());
// Verify deleted
let resp = a
.oneshot(
Request::get("/overrides/test.dev")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert_eq!(resp.status(), 404);
}
#[tokio::test]
async fn cache_list_and_flush() {
let ctx = test_ctx().await;
let a = router(ctx.clone());
// List (empty)
let resp = a
.clone()
.oneshot(Request::get("/cache").body(Body::empty()).unwrap())
.await
.unwrap();
assert_eq!(resp.status(), 200);
// Flush
let resp = a
.oneshot(Request::delete("/cache").body(Body::empty()).unwrap())
.await
.unwrap();
assert!(resp.status().is_success());
}
#[tokio::test]
async fn blocking_stats_returns_json() {
let ctx = test_ctx().await;
let resp = router(ctx)
.oneshot(Request::get("/blocking/stats").body(Body::empty()).unwrap())
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
let json: serde_json::Value = serde_json::from_slice(&body).unwrap();
assert!(json["enabled"].is_boolean());
}
#[tokio::test]
async fn services_crud() {
let ctx = test_ctx().await;
let a = router(ctx);
// Add service
let resp = a
.clone()
.oneshot(
Request::post("/services")
.header("content-type", "application/json")
.body(Body::from(r#"{"name":"testapp","target_port":3000}"#))
.unwrap(),
)
.await
.unwrap();
assert!(resp.status().is_success());
// List
let resp = a
.clone()
.oneshot(Request::get("/services").body(Body::empty()).unwrap())
.await
.unwrap();
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
assert!(String::from_utf8_lossy(&body).contains("testapp"));
// Delete
let resp = a
.clone()
.oneshot(
Request::delete("/services/testapp")
.body(Body::empty())
.unwrap(),
)
.await
.unwrap();
assert!(resp.status().is_success());
// Verify deleted
let resp = a
.oneshot(Request::get("/services").body(Body::empty()).unwrap())
.await
.unwrap();
let body = axum::body::to_bytes(resp.into_body(), 10000).await.unwrap();
assert!(!String::from_utf8_lossy(&body).contains("testapp"));
}
#[tokio::test]
async fn dashboard_returns_html() {
let ctx = test_ctx().await;
let resp = router(ctx)
.oneshot(Request::get("/").body(Body::empty()).unwrap())
.await
.unwrap();
assert_eq!(resp.status(), 200);
let body = axum::body::to_bytes(resp.into_body(), 100000)
.await
.unwrap();
assert!(String::from_utf8_lossy(&body).contains("Numa"));
}
}