Browsers heuristically cached the dashboard page because the response
carried no Cache-Control header, so a numa upgrade on the daemon did
not surface updated PATH_DEFS (e.g. the UPSTREAM row added in v0.14.0)
until the user hard-reloaded. Force revalidation on every load.
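A minimal sketch of the fix, assuming the dashboard is served from an
axum handler (handler body and asset path hypothetical):

```rust
use axum::{http::header, response::{Html, IntoResponse}};

// `no-cache` still lets browsers store the body, but forces a
// revalidation on every load, so a daemon upgrade surfaces immediately.
async fn dashboard() -> impl IntoResponse {
    (
        [(header::CACHE_CONTROL, "no-cache")],
        Html(include_str!("dashboard.html")), // path hypothetical
    )
}
```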
Closes #144.
docs/ is gitignored; references to docs/implementation/*.md from public
source, configs, and packaging were dead links outside the maintainer's
machine. Adds four recipes (README, dnsdist-front, doh-on-lan,
odoh-upstream) under top-level recipes/ and repoints existing pointers.
- numa.toml, packaging/client/{README.md,numa.toml}: point to
recipes/odoh-upstream.md.
- src/{bootstrap_resolver,forward,serve}.rs: reference issue #122
directly (module scope is broader than the ODoH-specific recipe).
- src/health.rs: drop the §-ref; iOS HealthInfo remains named as the
canonical consumer.
SOA records were stored as opaque bytes (DnsRecord::UNKNOWN), so the
RFC 1035 §3.3.13 MNAME/RNAME name-compression pointers — offsets into
the upstream packet — were re-emitted verbatim. Once Numa applied its
own compression to surrounding names, those pointers landed on garbage
and clients rejected the reply ("malformed reply packet" in kdig).
Parse SOA via read_qname and write via write_qname, matching the
NS/CNAME/MX pattern. Adds the canonical-rdata arm in dnssec.rs for
RRSIG verification. Regression test round-trips a CNAME-chain response
with a compressed SOA in authority through hickory-proto strict parse.
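The read side, sketched as a match arm in the file's dnsguide-style
API (`buffer` and the field grouping are assumptions; `read_qname` and
`DnsRecord` are from this change):

```rust
// domain/ttl are parsed earlier in DnsRecord::read. MNAME/RNAME go
// through read_qname, which resolves RFC 1035 compression pointers
// instead of copying them as opaque bytes.
QueryType::SOA => {
    let mut mname = String::new();
    buffer.read_qname(&mut mname)?;
    let mut rname = String::new();
    buffer.read_qname(&mut rname)?;
    let serial = buffer.read_u32()?;
    let refresh = buffer.read_u32()?;
    let retry = buffer.read_u32()?;
    let expire = buffer.read_u32()?;
    let minimum = buffer.read_u32()?;
    DnsRecord::SOA {
        domain, ttl, mname, rname,
        serial, refresh, retry, expire, minimum,
    }
}
```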
Hedging fires a second upstream query against the same upstream after
the hedge delay. Rescues packet loss and handshake stalls on flaky
links, but every lookup shows up twice at the provider — silently
halves the headroom for anyone on a quota'd upstream (NextDNS free tier,
Control D, paid Quad9).
Surfaced by #134 (bcookatpcsd), who saw every query duplicated on the
NextDNS dashboard with a single-address DoT upstream. Not a bug — the
feature doing what it says on the tin — but a surprising default.
Flipping the default to 0 makes hedging explicitly opt-in. Users who
want tail-latency rescue on flaky nets add `hedge_ms = 10` (or higher).
No config migration needed; no breaking changes to the API surface.
Also tightens the numa.toml comment so the trade-off is visible at
config time, not retroactively on a provider dashboard.
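A sketch of the default flip as it might sit in the config struct
(struct name and placement assumed; `hedge_ms` is the real key):

```rust
#[derive(serde::Deserialize)]
pub struct UpstreamConfig {
    /// Hedge delay in ms. 0 (the default) disables hedging; set
    /// e.g. 10 to trade provider-side duplicate queries for
    /// tail-latency rescue on flaky links.
    #[serde(default)] // u64::default() == 0, i.e. opt-in
    pub hedge_ms: u64,
}
```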
- Switch overrides from HashMap to BTreeMap — deterministic iteration by
type, drops the manual sort when logging.
- Rename the flat_map closure's inner `ips` to `addrs` to stop shadowing
the outer Vec<String>.
- Trim the Suite 8 TEST-NET-1 comment to keep the "why" and drop
mechanism narration.
- Drop a redundant sleep 1 after wait — wait already blocks on exit.
Suite 8 now ends with a config using RFC 5737 TEST-NET-1 IPs as
relay_ip/target_ip, started briefly so the bootstrap resolver logs its
override map. Asserts both host=IP pairs land in that map — closing the
gap flagged on PR #126 (zero-plain-DNS-leak for ODoH endpoints was only
unit-tested).
Also: NumaResolver::new now logs the override map at INFO when non-empty,
so operators can verify their ODoH bootstrap without needing DEBUG level.
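The log call, sketched (exact wording may differ):

```rust
// In NumaResolver::new: one INFO line is enough to verify the
// ODoH bootstrap without flipping on DEBUG.
if !overrides.is_empty() {
    log::info!("bootstrap DNS overrides active: {overrides:?}");
}
```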
When numa is its own system DNS resolver (HAOS add-on, Pi-hole-style
container, /etc/resolv.conf → 127.0.0.1), every numa-originated HTTPS
connection — DoH upstream, ODoH relay/target, blocklist CDN — routed
its hostname through getaddrinfo() back to numa itself. Cold boot
deadlocked; steady state taxed every new TCP connection. 0.14.1's
retry-with-backoff masked the startup race but not the underlying
self-loop.
NumaResolver implements reqwest::dns::Resolve with two lanes (sketched
after this list):
- Per-host overrides (ODoH relay_ip/target_ip) short-circuit DNS
entirely, preserving ODoH's zero-plain-DNS-leak property.
- Otherwise: A+AAAA in parallel via UDP to IP-literal bootstrap
servers, with TCP fallback for UDP-hostile networks.
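A sketch of the two lanes (the `reqwest::dns` trait shapes are real;
`NumaResolver` is assumed cheaply cloneable, and `overrides` /
`bootstrap_lookup` are assumed names):

```rust
use reqwest::dns::{Addrs, Name, Resolve, Resolving};
use std::net::SocketAddr;

impl Resolve for NumaResolver {
    fn resolve(&self, name: Name) -> Resolving {
        let this = self.clone();
        Box::pin(async move {
            // Lane 1: ODoH relay_ip/target_ip overrides never touch
            // DNS at all (zero-plain-DNS-leak preserved).
            if let Some(ip) = this.overrides.get(name.as_str()).copied() {
                let addrs: Addrs =
                    Box::new(std::iter::once(SocketAddr::new(ip, 0)));
                return Ok(addrs);
            }
            // Lane 2: parallel A+AAAA over UDP to IP-literal bootstrap
            // servers; TCP fallback lives inside bootstrap_lookup.
            let ips = this.bootstrap_lookup(name.as_str()).await?;
            let addrs: Addrs =
                Box::new(ips.into_iter().map(|ip| SocketAddr::new(ip, 0)));
            Ok(addrs) // reqwest substitutes the URL's port
        })
    }
}
```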
Bootstrap IPs come from upstream.fallback (IP-literal filtered,
hostnames skipped with a warning). Empty fallback yields the
hardcoded default [9.9.9.9, 1.1.1.1]; the chosen source is logged
at startup so the silent default is visible.
doh_keepalive_loop now fires its first tick immediately, and
keepalive_doh logs failures at WARN — bootstrap issues surface
within ~100ms of boot instead of on the first client query.
Distinct from UpstreamPool.fallback (client-query failover) which
stays untouched: client queries with no configured fallback still
SERVFAIL on primary failure rather than silently shadow-routing.
Reproducer: tests/docker/self-resolver-loop.sh. Before: 0 blocklist
domains, 3072ms SERVFAIL. After: 397k domains, 118ms NOERROR.
resolve_coalesced now takes leader_path: QueryPath and applies it to all
three upstream branches (Forwarded-rule, Recursive, Upstream), not just
Recursive. Fixes thundering-herd at boot when N concurrent HTTPS setups
each trigger independent forward queries for the same upstream hostname.
Derive both the flaky-server drop count and the zero-delay schedule
from RETRY_DELAYS_SECS.len() so the tests keep exercising their
intended invariants — "succeeds on final attempt" and "gives up after
all attempts fail" — if the production retry schedule ever changes.
Also: rename fail_first → drop_first_n to match drop(sock); swap the
giveup test's empty body for an "unreachable" sentinel so a regression
that accidentally served couldn't silently match Some("").
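How the derivation might look (RETRY_DELAYS_SECS and drop_first_n are
from this change; the server and fetch helpers are hypothetical):

```rust
use std::time::Duration;

#[tokio::test]
async fn succeeds_on_final_attempt() {
    // Both counts derive from the production schedule, so editing
    // RETRY_DELAYS_SECS can't silently hollow out this invariant.
    let retries = RETRY_DELAYS_SECS.len();
    let sock = flaky_server(retries); // drop_first_n = retries
    let delays = vec![Duration::ZERO; retries]; // zero-delay schedule
    assert!(fetch_list_with_delays(&sock, &delays).await.is_ok());
}
```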
On cold start, reqwest's getaddrinfo can race numa's own first-query
cold-path latency — resolver timeout fires before numa warms its
upstream DoH connection. Wrap each blocklist fetch in 3 retries with
2s/10s/30s backoff; by the second attempt, the upstream is warm and
subsequent getaddrinfos succeed in <100ms.
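The retry wrapper, as a sketch (helper name and exact shape assumed;
the 2s/10s/30s schedule is from this change):

```rust
use std::time::Duration;

async fn fetch_with_retry(
    url: &str,
    delays: &[Duration], // parameterized so unit tests can pass zeros
) -> reqwest::Result<String> {
    let mut last = None;
    // First attempt is immediate; each retry sleeps its backoff first.
    for (attempt, wait) in
        std::iter::once(&Duration::ZERO).chain(delays).enumerate()
    {
        tokio::time::sleep(*wait).await;
        match reqwest::get(url).await.and_then(|r| r.error_for_status()) {
            Ok(resp) => return resp.text().await,
            Err(e) => {
                log::warn!("blocklist fetch attempt {} failed: {e}", attempt + 1);
                last = Some(e);
            }
        }
    }
    Err(last.expect("at least one attempt"))
}
```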
Also: parallelize fetches across lists via join_all (different hosts,
no warming dependency), walk the full error source chain so reqwest
failures surface the underlying cause, and parameterize retry delays
for unit-test speed.
Plain host-string equality caught the copy-paste-same-URL footgun but
let `r.cloudflare.com` + `odoh.cloudflare.com` through — two subdomains
of the same operator collapse ODoH to ordinary DoH. Add a second layer:
compare registrable domains via the PSL (`psl` crate) after the exact-
host check. Fails open on IP literals and unparseable hosts; the exact-
host check still runs in those cases.
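The second layer, sketched (function name assumed; `psl::domain_str`
is the real crate API):

```rust
// Registrable-domain comparison: r.cloudflare.com and
// odoh.cloudflare.com both reduce to cloudflare.com.
fn same_operator(relay_host: &str, target_host: &str) -> bool {
    match (psl::domain_str(relay_host), psl::domain_str(target_host)) {
        (Some(a), Some(b)) => a.eq_ignore_ascii_case(b),
        // Fail open: IP literals / unparseable hosts fall back to
        // the exact-host check alone.
        _ => false,
    }
}
```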
- Hoist ODOH_CONTENT_TYPE to a single pub(crate) constant in odoh.rs;
relay.rs imports it instead of declaring its own.
- Generalize dashboard encryptionPct(data, encryptedKeys, allKeys)
so both Inbound Wire and Outbound Wire panels share the same math
instead of drifting independently.
- Extract RelayState::new() and build_app() helpers in relay.rs so
the test spawn_relay() and production run() wire the same router
+ body-limit layer. Prevents future middleware from landing in one
path but not the other.
All 344 lib tests pass; no behavior change.
- `numa relay [PORT] [BIND]` accepts an optional bind address (defaults
to 127.0.0.1, matching the Caddy reverse-proxy deployment shape).
Required for Docker, where the relay needs 0.0.0.0 inside the
container so Caddy can reach it across the bridge network.
- Dashboard now surfaces the upstream_transport dimension as an
"Outbound Wire" panel alongside the existing "Inbound Wire" (renamed
from "Transport" for directional clarity). Sub-headers — "apps → numa"
/ "numa → internet" — make the threat-model split obvious without
jargon. Bars: UDP/DoH/DoT/ODoH, headline "X% encrypted outbound".
The PR description promised that the dashboard honestly answers "how
much of my DNS traffic left in cleartext"; that promise now holds.
Two issues surfaced from running mode = "odoh" against the live Hetzner
relay as system DNS:
1. **Bootstrap deadlock.** The reqwest HTTPS client resolves the relay
and target hostnames via system DNS. When numa is itself the system
resolver, the ODoH client loops trying to resolve through itself.
Adds optional `relay_ip` and `target_ip` to `[upstream]`, plumbed
into reqwest's `resolve()` (sketched after this list) so the HTTPS
client bypasses system DNS for those two hostnames. TLS still
validates against the URL
hostname, so a stale IP fails loudly rather than silently MITM'ing.
2. **2x relay load.** Default `hedge_ms = 10` triggers a duplicate
in-flight query for every request. Useful for UDP/DoH/DoT (rescues
tail latency cheaply); wasteful for ODoH (doubles HPKE seal/unseal,
doubles sealed-byte footprint a passive observer can correlate, no
latency win — relay hop dominates either way). Force-zero in
oblivious mode regardless of configured hedge_ms.
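For item 1, a sketch of the plumbing (`cfg`, `relay_host`, and
`target_host` are hypothetical names; port 443 assumed from the https
URLs; `ClientBuilder::resolve` is the real reqwest API):

```rust
use std::net::SocketAddr;

// Pin the two ODoH hostnames to configured IPs. Certificate
// validation still runs against the URL hostname, so a stale IP
// fails loudly instead of being silently MITM'able.
let mut builder = reqwest::Client::builder();
if let Some(ip) = cfg.relay_ip {
    builder = builder.resolve(&relay_host, SocketAddr::new(ip, 443));
}
if let Some(ip) = cfg.target_ip {
    builder = builder.resolve(&target_host, SocketAddr::new(ip, 443));
}
let client = builder.build()?;
```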
Validated end-to-end against odoh-relay.numa.rs → Cloudflare:
3 digs produced 3 forwarded_ok on the relay (was 6 before the hedge
fix), upstream_transport.odoh ticks correctly.
Client (mode = "odoh"): URL-query target routing per RFC 9230 §5,
/.well-known/odohconfigs TTL cache with 60s backoff on failure, HPKE
seal/open via odoh-rs, strict-mode default that SERVFAILs on relay
failure instead of silently downgrading. Host-equality config
validation rejects same-operator relay/target pairs.
Relay (`numa relay [PORT]`): axum server with /relay + /health.
SSRF-hardened hostname validator (RFC 1035 ASCII + dot + dash),
4 KiB body cap at the axum layer, 5s full-transaction timeout, and
static 502 on target failure (reqwest internals logged, not leaked).
Aggregate counters only — no per-request logs.
Observability: new `UpstreamTransport { Udp, Doh, Dot, Odoh }`
orthogonal to `QueryPath`, so /stats can tally wire protocols
symmetrically. Recursive mode records `Some(Udp)` for honest
"bytes egressing in cleartext" accounting.
Tests: Suite 8 exercises the client end-to-end via Frank Denis's
public relay + Cloudflare target; Suite 9 exercises `numa relay`
forwarding + guards against Cloudflare as the real far end. Full
probe script at tests/probe-odoh-ecosystem.sh verifies the entire
public ODoH ecosystem (4 targets + 1 relay per DNSCrypt's curated
list — confirms deploying Numa's relay doubles the global relay supply).
Adding a record type used to require 5 edits across the file (enum
variant, to_num, from_num, as_str, parse_str). The macro takes a
single (variant, num, str) tuple per type and generates the enum
plus all four methods.
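A sketch of the table-driven shape (macro name assumed; the real table
lists every variant, and as_str/parse_str are generated the same way
as to_num/from_num):

```rust
macro_rules! query_types {
    ($(($variant:ident, $num:literal, $name:literal)),* $(,)?) => {
        #[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
        pub enum QueryType {
            UNKNOWN(u16), // hand-coded: carries data, can't sit in the table
            $($variant,)*
        }

        impl QueryType {
            pub fn to_num(&self) -> u16 {
                match self {
                    QueryType::UNKNOWN(n) => *n,
                    $(QueryType::$variant => $num,)*
                }
            }

            pub fn from_num(num: u16) -> QueryType {
                match num {
                    $($num => QueryType::$variant,)*
                    _ => QueryType::UNKNOWN(num),
                }
            }
        }
    };
}

query_types! {
    (A, 1, "A"),
    (NS, 2, "NS"),
    (SOA, 6, "SOA"),
    (MX, 15, "MX"),
    // rest of the table elided
}
```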
UNKNOWN(u16) stays hand-coded since it carries data and can't sit
in the table.
src/question.rs: 156 lines -> 92 lines, no behavior change.
Logs were printing UNKNOWN(64), UNKNOWN(29), UNKNOWN(35) for SVCB,
LOC, and NAPTR — three RR types that have been registered for years
and show up in the wild (notably SVCB via RFC 9462 DDR clients
querying _dns.resolver.arpa).
Adds the variants and replaces the SVCB_QTYPE u16 const introduced
in #119 with QueryType::SVCB.to_num(), matching the HTTPS path.
Closes #114.
HTTPS (65) and SVCB (64) share the RDATA wire format, so the existing
parser already handles both — only the call site was HTTPS-only. Widen
the qtype check and extend the existing pipeline test with a second
query for SVCB.
- Drop `const HTTPS_TYPE: u16 = 65;` in favor of `QueryType::HTTPS.to_num()`
at the single call site — avoids a fresh magic number alongside the
existing enum mapping in question.rs.
- Add `DnsPacket::for_each_record_mut` so `strip_https_ipv6_hints` stops
hand-rolling the answers/authorities/resources walk; future section
rewrites go through the same helper (see the sketch after this list).
- Promote the SVCB test-rdata builder from `svcb::tests` to module scope
as `pub(crate) #[cfg(test)] fn build_rdata`, and reuse it in the two
pipeline tests in ctx.rs — kills ~20 lines of byte-fiddling and keeps
one RDATA-construction code path.
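The helper, sketched against the dnsguide-style packet layout (section
names are from the walk it replaces):

```rust
impl DnsPacket {
    /// Visit every record in answers, authorities, and resources.
    pub fn for_each_record_mut(&mut self, mut f: impl FnMut(&mut DnsRecord)) {
        for rec in self
            .answers
            .iter_mut()
            .chain(self.authorities.iter_mut())
            .chain(self.resources.iter_mut())
        {
            f(rec);
        }
    }
}
```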
Modifying HTTPS rdata invalidates any accompanying RRSIG, so a DNSSEC-
validating downstream would reject the response as Bogus. Gate the
strip on !client_do, matching the existing DNSSEC-records strip.
Adds a regression test that catches the gate being removed: builds a
query with EDNS DO=1, asserts the HTTPS rdata round-trips untouched.
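The gate itself, sketched (`filter_aaaa`, `client_do`, and
`strip_https_ipv6_hints` are from these commits; `packet` assumed):

```rust
// Only rewrite rdata when the client didn't set DO; otherwise the
// RRSIG must stay byte-identical or validation goes Bogus.
if filter_aaaa && !client_do {
    strip_https_ipv6_hints(&mut packet);
}
```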
When enabled, AAAA queries short-circuit to NODATA (NOERROR + empty
answer) so Happy Eyeballs clients don't stall waiting on a v6 address
they can't use. Also strips `ipv6hint` SvcParam from HTTPS/SVCB
answers (RFC 9460) so Chrome ≥103, Firefox, and Safari don't bypass
the AAAA filter via the HTTPS record path.
Local data is preserved: overrides, zones, the .numa proxy, and the
blocklist sinkhole keep whatever v6 addresses they configure — the
filter only kicks in on the cache/forward/recursive path. NODATA is
correct per RFC 2308 here; NXDOMAIN would incorrectly imply the name
doesn't exist for A queries either.
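The short-circuit, sketched in the file's dnsguide-style types (names
assumed):

```rust
// NODATA per RFC 2308: NOERROR with an empty answer section, i.e.
// "the name exists, there's just no AAAA for it".
if filter_aaaa && question.qtype == QueryType::AAAA {
    let mut resp = DnsPacket::new();
    resp.header.rescode = ResultCode::NOERROR;
    resp.questions.push(question.clone());
    return Ok(resp);
}
```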
Off by default. Opt in via `filter_aaaa = true` under `[server]`.
Re-install failed with ETXTBSY (Text file busy) because std::fs::copy
can't overwrite a binary that's currently being executed by the
running service. Switch to copy-then-rename: write the new binary to
/usr/local/bin/numa.new, then rename over /usr/local/bin/numa. Rename
swaps the path while the running process keeps the old inode alive,
so DNS keeps serving from the previous binary until restart.
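The swap, sketched (paths are from this change; `new_binary` is a
hypothetical variable):

```rust
use std::fs;

// A fresh destination path never hits ETXTBSY, and rename() swaps
// the directory entry atomically while the running service keeps
// the old inode open.
fs::copy(&new_binary, "/usr/local/bin/numa.new")?;
fs::rename("/usr/local/bin/numa.new", "/usr/local/bin/numa")?;
```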
Switch `systemctl start` to `systemctl restart` so the new binary
actually loads on
re-install (start is a no-op when the unit is already active, which
would silently leave the old binary running).
Locally verified the full CI sequence: install → curl → reinstall →
curl → uninstall → curl-fails. All three assertions pass.
Linux install_service_linux now does the {{exe_path}} substitution
inline because it uses the (potentially copied) binary path returned
by install_service_binary_linux, not current_exe(). The shared
replace_exe_path helper is dead on Linux — clippy -D warnings caught it.
Narrow the function to macos and split the placeholder test: keep the
"both templates contain {{exe_path}}" assertion as a cross-platform test
(catches placeholder removal on either file), keep the substitution test
gated to macos where the function lives.
The transient account DynamicUser=yes allocates can only traverse
world-executable directories.
The CI binary at /home/runner/work/numa/numa/target/release/numa fails
exec with EACCES because /home/runner is mode 0700; same applies to a
build under /home/<user>/, ~/.cargo/bin, or any private $HOME tree.
install_service_binary_linux now walks the binary's path. If every
ancestor grants world-execute (Linuxbrew /home/linuxbrew is 0755,
/usr/local/bin is fine, install.sh layout works), keep the source
path so brew/distro upgrades propagate in place. Otherwise copy to
/usr/local/bin/numa and reference that in the unit.
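A sketch of the ancestor walk (helper name hypothetical; o+x is the
bit DynamicUser traversal needs):

```rust
use std::os::unix::fs::MetadataExt;
use std::path::Path;

fn world_traversable(binary: &Path) -> std::io::Result<bool> {
    for dir in binary.ancestors().skip(1) {
        if dir.as_os_str().is_empty() {
            break; // relative path ran out of components
        }
        if std::fs::metadata(dir)?.mode() & 0o001 == 0 {
            return Ok(false); // e.g. /home/runner (0700): copy instead
        }
    }
    Ok(true) // e.g. /home/linuxbrew (0755): keep the source path
}
```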
Locally verified both branches in an Ubuntu 24.04 systemd container:
- CI-like /home/runner (0700) → copies + service binds 5380
- Brew-like /home/linuxbrew (0755) → keeps source path + service binds 5380
AUR installs never call `numa install` — PKGBUILD drops the unit straight
into /usr/lib/systemd/system and the user runs `systemctl enable numa`.
With User=numa the Rust installer's useradd code never fires there,
breaking Arch out of the box.
DynamicUser=yes sidesteps packaging entirely — systemd allocates a
transient UID per start and remaps StateDirectory ownership (including
legacy root-owned trees) automatically. Works on any modern systemd.
Drops the ensure_numa_user_linux/chown helpers plus NUMA_USER; the
unit file alone now captures the privilege-drop story.
- numa.service: User=numa + CAP_NET_BIND_SERVICE + sandboxing block
(ProtectSystem=strict, PrivateTmp, seccomp @system-service, etc)
- install_service_linux: create numa system user + chown data_dir
before first start so TLS-cert generation and state writes land
on a numa-owned tree
Runtime verified root-free on Linux — network_watch_loop only reads
/etc/resolv.conf; all system-DNS mutation stays in the installer,
which continues to run as root via sudo.
The SRTT ordering + failure penalty path was UDP-only, so a DoT primary
in a forwarding-rule pool was never deprioritized on failure and all
DoT entries tied at INITIAL_SRTT_MS in the sort key. With [[forwarding]]
now accepting arrays of upstreams, DoT pools are a first-class case and
need the same healthiest-first behavior the default pool gets for UDP.
- Add Upstream::tracked_ip() → Some(ip) for Udp/Dot, None for Doh
(DoH has no stable IP — reqwest pools connections by hostname); see
the sketch after this list.
- Rewire the three SRTT call sites in forward_with_failover_raw.
- Hoist srtt.read() out of the candidate-scoring loop — one lock per
query instead of N (matters now that pools commonly have N>1).
- Drop unused #[derive(Debug)] on UpstreamPool and ForwardingRule.
- Regression tests: udp_failure_records_in_srtt + dot_failure_records_in_srtt.
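The split, sketched (variant shapes assumed):

```rust
use std::net::IpAddr;

impl Upstream {
    // Only transports with a stable socket address feed the SRTT table.
    fn tracked_ip(&self) -> Option<IpAddr> {
        match self {
            Upstream::Udp { addr, .. } => Some(addr.ip()),
            Upstream::Dot { addr, .. } => Some(addr.ip()),
            // reqwest pools DoH connections by hostname; no stable IP.
            Upstream::Doh { .. } => None,
        }
    }
}
```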
Resolves src/main.rs conflict: serve loop was extracted into
src/serve.rs on main (PR #107). Ported the forwarding-rule log change
to serve.rs — fwd.upstream is now Vec<String>, logged with join(", ").
Extract parse_sc_registered and parse_sc_state as testable pure
functions. 8 new tests covering: service registration detection,
service state parsing, and Windows config_dir == data_dir invariant.
Stop the running service before disabling Dnscache so the port 53 probe
sees the real state (not Numa's own binding). Wait for SCM STOPPED
state before copying the binary to avoid os error 32 (file in use).
config_dir() on Windows now returns data_dir() (ProgramData) so config,
services.json, and log file are in the same place for both interactive
and service contexts. Service mode writes logs to numa.log via
env_logger pipe. Dashboard shows correct log path per OS.
Probe port 53 after disabling Dnscache instead of assuming reboot is
needed. Skip DNS redirect when port is blocked (service does it on
first boot). Fix readiness probe: TCP connect to API port instead of
broken UDP send_to that always succeeded.
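The fixed probe, sketched (port plumbing and timeout value assumed):

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

// UDP send_to succeeds with no listener, so it never proved anything.
// A TCP connect only succeeds once numa is actually accepting.
fn api_ready(port: u16) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, Duration::from_millis(250)).is_ok()
}
```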
service start/stop/restart/status now map to proper SCM operations
instead of re-running the full install/uninstall flow. On re-install,
stop the running service first so the binary can be overwritten.
Adds a build.rs that runs `git describe --tags --always --dirty` and
sets NUMA_BUILD_VERSION at compile time. A new `numa::version()` helper
returns the build version, falling back to CARGO_PKG_VERSION when git
is unavailable (source tarballs, Docker builds without .git).
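A sketch of that build.rs (env-var name and git args are from this
change; the real script also massages describe output into the
0.13.1+hash form):

```rust
use std::process::Command;

fn main() {
    // git describe fails in source tarballs and .git-less Docker
    // builds; version() then falls back to CARGO_PKG_VERSION.
    let desc = Command::new("git")
        .args(["describe", "--tags", "--always", "--dirty"])
        .output()
        .ok()
        .filter(|out| out.status.success())
        .and_then(|out| String::from_utf8(out.stdout).ok());
    if let Some(desc) = desc {
        println!("cargo:rustc-env=NUMA_BUILD_VERSION={}", desc.trim());
    }
    println!("cargo:rerun-if-changed=.git/HEAD");
}
```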
Version strings:
tagged release: 0.13.1
commits ahead: 0.13.1+a87f907
uncommitted changes: 0.13.1+a87f907-dirty
no git: 0.13.1
Replaces all 6 inline env!("CARGO_PKG_VERSION") call sites with the
single version() function.
Closes #108.
- Add `version` field to /stats (from CARGO_PKG_VERSION).
- Show `v0.13.1` next to the Numa wordmark in the dashboard header.
- Restructure the footer into two semantic rows:
Row 1 (paths): Config · Data · Logs (platform-detected)
Row 2 (runtime): Upstream · DNSSEC · SRTT · GitHub
- Drop Mode from the footer (redundant with Upstream label).
- Show only the matching-platform log path instead of both
macOS and Linux unconditionally.
- Drop the duplicate WINDOWS_SERVICE_NAME constant; call sites use the
single source of truth at windows_service::SERVICE_NAME.
- windows_service_exe_path and service_config_path now compose from
crate::data_dir() instead of re-parsing %PROGRAMDATA% locally.
- Factor the 6× sc.exe invocation boilerplate into a run_sc helper.
- Replace the 200ms try_recv polling loop in the service dispatcher
with a recv_timeout wait (sketched after this list) — cuts shutdown
latency and idle CPU.
- stop_service_scm/delete_service_scm now log warnings instead of
silently swallowing failures, so unexpected errors are visible.
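The wait, sketched (channel shape assumed):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

// A blocking wait replaces the 200ms try_recv spin: Stop wakes it
// immediately, timeouts just loop back around.
fn wait_for_stop(rx: &Receiver<()>) {
    loop {
        match rx.recv_timeout(Duration::from_secs(1)) {
            Ok(()) | Err(RecvTimeoutError::Disconnected) => return,
            Err(RecvTimeoutError::Timeout) => continue,
        }
    }
}
```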
Hooks the service-dispatcher scaffolding from the previous commit to
actually serve DNS, and replaces the HKLM\…\Run login-time autostart
with a proper Windows service created via sc.exe.
**Refactor**
- Extract main.rs's inline server body (~500 lines) into `numa::serve::run`
so both the interactive CLI entry and the service dispatcher drive the
same startup/serve loop. main.rs is now a thin subcommand router.
- main.rs goes sync (no #[tokio::main]); each branch that needs async
builds its own runtime and block_on's (sketch after this list).
Required so the --service path can hand off to SCM without fighting
tokio for the entry thread.
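```rust
// Sketch of the sync entry shape: every name here is hypothetical
// except numa::serve::run.
fn main() {
    match parse_subcommand() {
        // The --service path hands the entry thread to SCM; the
        // dispatcher builds its own runtime on a dedicated thread.
        Subcommand::Service => numa::windows_service::run(),
        other => {
            let rt = tokio::runtime::Builder::new_multi_thread()
                .enable_all()
                .build()
                .expect("tokio runtime");
            rt.block_on(run_subcommand(other)); // async branches block_on
        }
    }
}
```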
**Windows service wrapper**
- `numa::windows_service::run_service` now builds a multi-thread tokio
runtime on a dedicated thread and runs `serve::run` inside it. Stop/
Shutdown from SCM aborts the wait loop and reports SERVICE_STOPPED.
- Config path resolves to `%PROGRAMDATA%\numa\numa.toml` when running
under SCM (SYSTEM's cwd is System32, relative paths don't work).
**Install/uninstall**
- `install_windows` now copies numa.exe to a stable
`%PROGRAMDATA%\numa\bin\numa.exe` and registers it via `sc create`
with start=auto, obj=LocalSystem, and a failure policy of
restart/5000/restart/5000/restart/10000. Starts the service
immediately when no reboot is pending.
- `uninstall_windows` stops + deletes the service and removes the
binary copy before restoring DNS.
- Drops the old `register_autostart` / `remove_autostart` helpers that
wrote to `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run` — that
path runs at user login in the user's session with no stderr capture
and no crash-restart policy, which is why we've been flying blind in
every Windows debug session.
DNS-set bugs (netsh destructive static, IPv6 not touched, uninstall
secondary-drop) and file logging are orthogonal — tracked for follow-up.
Lets numa.exe act as a real Windows service registered with the SCM,
replacing the HKLM\...\Run login-time autostart that runs in the user
session without stderr capture.
- New `numa::windows_service` module (cfg(windows)) wraps Mullvad's
`windows-service` crate: registers with SCM, reports Running, handles
Stop/Shutdown, reports Stopped.
- `numa.exe --service` is the entry point SCM uses
(`sc create … binPath="numa.exe --service"`); interactive invocations
are unchanged.
- Dep is gated `[target.'cfg(windows)'.dependencies]` — zero impact on
macOS/Linux builds or binary size.
Scaffold only. The service currently blocks on an mpsc channel until
Stop arrives; the actual serve loop will hook in once main.rs's inline
server body is extracted into `numa::serve(config_path)` in a follow-up.
This lets `sc start Numa` / `sc stop Numa` be verified end to end today.
Resolves conflict in src/ctx.rs — both sides added independent tokio
tests (forwarding fail-over on this branch, default-pool upstream path
on main from #103). Keep both.
Mirrors `[upstream] address` — `upstream` accepts string or array
of strings, builds an `UpstreamPool` and routes queries through
`forward_with_failover_raw` so SRTT ordering and failover apply to
matched `[[forwarding]]` rules the same way they do for the default
pool.
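One common serde shape for string-or-array (a sketch; the actual impl
may differ, and the non-`upstream` field name is assumed):

```rust
#[derive(serde::Deserialize)]
#[serde(untagged)]
enum OneOrMany {
    One(String),
    Many(Vec<String>), // empty array is rejected at config load
}

#[derive(serde::Deserialize)]
struct ForwardingRule {
    domain: String,       // field name assumed
    upstream: OneOrMany,  // normalized into an UpstreamPool afterwards
}
```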
Single-string rules keep their current behavior (one-element pool,
equivalent single-upstream path). Empty array errors at config load.
Addresses item 1 of issue #102. Plan: docs/102_item1.md.