feat: add DNS-over-TLS (DoT) listener #25

Merged
razvandimescu merged 19 commits from feat/dns-over-tls into main 2026-04-08 07:53:43 +08:00
razvandimescu commented 2026-03-30 05:36:55 +08:00 (Migrated from github.com)

Summary

Adds DNS-over-TLS (RFC 7858) listener to numa, with the protocol-layer hardening and config/test consolidation that shipped alongside it.

Core feature:

  • src/dot.rs — TLS listener on port 853 with persistent connections, coalesced length-prefix + response writes, configurable bind addr/port
  • [dot] config section: enabled, port, bind_addr, optional cert_path/key_path with self-signed CA fallback
  • Refactors handle_query → transport-agnostic resolve_query so UDP and DoT share the same resolver pipeline (zero-alloc on the UDP hot path)
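
The "coalesced length-prefix + response writes" item refers to the framing DoT inherits from DNS over TCP: each message is preceded by a two-byte big-endian length (RFC 1035 §4.2.2, reused by RFC 7858). The sketch below is illustrative only and is not the code in `src/dot.rs`; the names `frame_message` and `split_frame` are hypothetical. It shows the idea of building prefix and payload in one buffer so the transport sees a single write instead of two.

```rust
/// Build one contiguous buffer holding the 2-byte big-endian length prefix
/// followed by the DNS message, so the caller can issue a single write.
/// Returns None when the message does not fit in a 16-bit length field.
fn frame_message(msg: &[u8]) -> Option<Vec<u8>> {
    if msg.len() > u16::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(2 + msg.len());
    out.extend_from_slice(&(msg.len() as u16).to_be_bytes());
    out.extend_from_slice(msg);
    Some(out)
}

/// Split one framed message off the front of `buf`.
/// Returns (message, remaining bytes), or None if the frame is incomplete.
fn split_frame(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 2 {
        return None;
    }
    let len = u16::from_be_bytes([buf[0], buf[1]]) as usize;
    let body = buf.get(2..2 + len)?;
    Some((body, &buf[2 + len..]))
}
```

Writing the prefix and the body separately can put the two-byte prefix in its own TLS record and TCP segment; a single buffer avoids that at the cost of one copy.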

Protocol hardening:

  • ALPN "dot" advertised in TLS ServerHello (RFC 7858bis §3.2) — rustls enforces strictness, rejecting handshakes with mismatched ALPN as a cross-protocol confusion defense (verified by dot_rejects_non_dot_alpn)
  • Explicit {tld}.{tld} SAN on the self-signed DoT cert, since strict TLS clients reject wildcards under single-label TLDs (verified empirically: kdig with a wildcard-only cert fails the handshake; adding the explicit SAN makes it succeed)
  • WRITE_TIMEOUT = 10s on write_framed to prevent slow-reader DoS
  • HANDSHAKE_TIMEOUT = 10s against slowloris on the TLS handshake
  • MAX_CONNECTIONS = 512 semaphore with 100ms backoff on accept errors

Timeouts, limits, error handling:

  • 30s idle timeout on persistent connections
  • SERVFAIL response echoing the question section on resolve errors so DoT clients don't hang
  • FORMERR response on parse errors preserving connection liveness
  • send_response helper unifying error-response serialization
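
As a rough illustration of the SERVFAIL and FORMERR items above, an error response can be built directly from the raw query bytes: copy the ID, set the QR bit and the RCODE, and echo the question section when it parses. This is a minimal sketch under stated assumptions, not numa's `send_response`; `error_response` and `question_end` are hypothetical names, and the RA bit is left clear for simplicity.

```rust
const HEADER_LEN: usize = 12;
const RCODE_FORMERR: u8 = 1;
const RCODE_SERVFAIL: u8 = 2;

/// Offset just past the first question (name + QTYPE + QCLASS),
/// or None if the question is truncated or malformed.
fn question_end(query: &[u8]) -> Option<usize> {
    let mut pos = HEADER_LEN;
    loop {
        let len = *query.get(pos)? as usize;
        if len == 0 {
            pos += 1;
            break;
        }
        // Label lengths never have the top two bits set; those mark
        // compression pointers, which this sketch does not follow.
        if len & 0xC0 != 0 {
            return None;
        }
        pos += 1 + len;
    }
    let end = pos + 4; // QTYPE (2 bytes) + QCLASS (2 bytes)
    if end <= query.len() { Some(end) } else { None }
}

/// Build an error response for `query` with the given RCODE.
/// The question is echoed only when exactly one well-formed question exists.
fn error_response(query: &[u8], rcode: u8) -> Option<Vec<u8>> {
    if query.len() < HEADER_LEN {
        return None; // no ID to echo, nothing sensible to send
    }
    let qdcount = u16::from_be_bytes([query[4], query[5]]);
    let question = if qdcount == 1 {
        question_end(query).map(|end| &query[HEADER_LEN..end])
    } else {
        None
    };
    let mut out = Vec::with_capacity(query.len());
    out.extend_from_slice(&query[0..2]); // ID
    out.push(0x80 | (query[2] & 0x79)); // QR=1, keep opcode and RD, clear AA/TC
    out.push(rcode & 0x0F); // RA and Z clear, RCODE in the low nibble
    out.extend_from_slice(&(question.is_some() as u16).to_be_bytes()); // QDCOUNT
    out.extend_from_slice(&[0u8; 6]); // ANCOUNT, NSCOUNT, ARCOUNT = 0
    if let Some(q) = question {
        out.extend_from_slice(q);
    }
    Some(out)
}
```

Echoing the ID and question lets the client match the error to its outstanding query on a persistent connection instead of waiting for a timeout.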

Bug fix caught by Suite 6 integration testing:

  • load_tls_config was missing rustls::crypto::ring::default_provider().install_default(), causing numa to panic on DoT startup when [dot] cert_path/key_path was set AND [proxy] enabled = false. The proxy's build_tls_config normally installs the provider as a side effect, which masked the gap. A DoT-only numa deployment is exactly the configuration that would have hit this.

Config consolidation:

  • data_dir is now a [server] TOML field instead of a hardcoded path, threaded into build_tls_config via explicit parameter. Tests and containerized deploys override it without env var injection.
  • numa's rule is now explicit: TOML is the single source of truth for app configuration; env vars are only for bootstrap discovery (HOME, SUDO_USER) and standard ecosystem conventions (RUST_LOG).

Infrastructure:

  • Dockerfile now EXPOSE 853/tcp so docker run -p 853:853 works out of the box
  • numa.toml example documents the [dot] section and [server] data_dir override
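
For orientation, a configuration using the fields named in this PR might look like the fragment below. The keys come from the description above; every value, including the paths and the string form of bind_addr, is an illustrative assumption rather than a copy of the shipped numa.toml example.

```toml
[server]
# Hypothetical path; overrides the previously hardcoded data directory.
data_dir = "/var/lib/numa"

[dot]
enabled = true
port = 853
# Assumed value format for the bind address.
bind_addr = "0.0.0.0"
# Optional; when omitted, numa falls back to its self-signed CA.
# Paths below are placeholders.
# cert_path = "/etc/numa/dot.crt"
# key_path = "/etc/numa/dot.key"
```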

Testing

Unit tests (cargo test) — 127 passing, +6 DoT-specific:

  • dot_resolves_local_zone — zone lookup over TLS
  • dot_multiple_queries_on_persistent_connection — connection reuse
  • dot_nxdomain_for_unknown — SERVFAIL propagation via blackhole upstream
  • dot_concurrent_connections — semaphore + concurrent handshakes
  • dot_negotiates_alpn — ALPN "dot" advertisement verified via conn.alpn_protocol()
  • dot_rejects_non_dot_alpn — rustls rejects mismatched ALPN (cross-protocol defense)

Integration tests (./tests/integration.sh) — 2 new suites:

  • Suite 5: DNS-over-TLS — kdig + openssl verification. Listener bound, local zone A record resolves, persistent connection (3 queries, 1 handshake via +keepopen), ALPN positive + negative via openssl s_client
  • Suite 6: Proxy + DoT coexistence — both listeners bound, no startup panics, DoT still resolves with proxy enabled, proxy HTTPS handshake still works with DoT enabled

Cross-implementation empirical verification:

  • ✓ iOS Network Extension (via numa setup-phone mobileconfig) — real iPhone resolving real queries over DoT with persistent connections observed in the log
  • ✓ kdig (GnuTLS-based) — strict client, rejected wildcard-only cert in a controlled experiment, confirming the SAN fix is load-bearing for non-iOS clients
  • ✓ openssl s_client — ALPN negotiation positive and negative paths
  • ✓ rustls round-trip (unit tests) — protocol-level correctness

Known gaps (follow-up work)

Intentionally out of scope for this PR:

  • Per-connection query pipelining (RFC 7766 §6.2.1.1) — current implementation is sequential per connection; a slow upstream query blocks subsequent fast queries on the same connection. Matters for mobile DoT clients that pipeline aggressively.
  • EDNS(0) padding (RFC 7830/8467) — not implemented; query sizes leak through TLS length fields. Real privacy gap; the reason to run encrypted DNS in the first place.
  • Oversized response handling — responses >4096 bytes get TC bit set (inherited from UDP); incorrect over TCP/TLS per RFC 7766 §8. Affects DNSSEC-signed responses.
  • Per-source-IP rate limiting — only global MAX_CONNECTIONS cap. A single source can starve the connection pool.
  • Graceful shutdown — numa has no signal handler infrastructure; needs a cross-cutting CancellationToken across all subsystems.

Each has a dedicated implementation sketch from the design discussion; none block correctness for common deployments (home/office DNS resolver with trusted clients).
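
To make the padding gap concrete, here is a small sketch of the block-length arithmetic from RFC 8467 §4.1, which recommends padding queries to a multiple of 128 octets and responses to a multiple of 468 octets using the EDNS(0) Padding option (option code 12, RFC 7830). It is not one of the implementation sketches mentioned above; `padding_len` and `padding_option` are hypothetical names, and the sketch assumes the message already carries an OPT record and ignores the maximum message size.

```rust
const PADDING_OPTION_CODE: u16 = 12; // EDNS(0) Padding, RFC 7830
const OPTION_HEADER_LEN: usize = 4; // OPTION-CODE (2) + OPTION-LENGTH (2)
const QUERY_BLOCK: usize = 128; // RFC 8467 recommendation for queries
const RESPONSE_BLOCK: usize = 468; // RFC 8467 recommendation for responses

/// Number of zero bytes to place in the Padding option so that
/// `unpadded_len` plus the option itself lands on a multiple of `block`.
fn padding_len(unpadded_len: usize, block: usize) -> usize {
    let with_header = unpadded_len + OPTION_HEADER_LEN;
    let target = ((with_header + block - 1) / block) * block;
    target - with_header
}

/// Wire encoding of a Padding option carrying `pad` zero bytes.
/// The caller must still append it to the OPT RDATA and fix up RDLENGTH.
fn padding_option(pad: u16) -> Vec<u8> {
    let mut opt = Vec::with_capacity(OPTION_HEADER_LEN + pad as usize);
    opt.extend_from_slice(&PADDING_OPTION_CODE.to_be_bytes());
    opt.extend_from_slice(&pad.to_be_bytes());
    opt.resize(OPTION_HEADER_LEN + pad as usize, 0);
    opt
}
```

With block padding, an observer of TLS record sizes sees only a handful of distinct lengths rather than the exact size of each query and response.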

🤖 Generated with Claude Code
