Commit Graph

152 Commits

Author SHA1 Message Date
Razvan Dimescu
3970a9f45c fix(linux): copy binary to /usr/local/bin when source path isn't world-traversable
DynamicUser=yes' transient account can only traverse world-x directories.
The CI binary at /home/runner/work/numa/numa/target/release/numa fails
exec with EACCES because /home/runner is mode 0700; same applies to a
build under /home/<user>/, ~/.cargo/bin, or any private $HOME tree.

install_service_binary_linux now walks the binary's path. If every
ancestor grants world-execute (Linuxbrew /home/linuxbrew is 0755,
/usr/local/bin is fine, install.sh layout works), keep the source
path so brew/distro upgrades propagate in place. Otherwise copy to
/usr/local/bin/numa and reference that in the unit.

Locally verified both branches in an Ubuntu 24.04 systemd container:
- CI-like /home/runner (0700) → copies + service binds 5380
- Brew-like /home/linuxbrew (0755) → keeps source path + service binds 5380
2026-04-18 11:51:32 +03:00
Razvan Dimescu
4f6159d961 refactor(linux): switch to DynamicUser=yes, drop install-time user creation
AUR installs never call `numa install` — PKGBUILD drops the unit straight
into /usr/lib/systemd/system and the user runs `systemctl enable numa`.
With User=numa the Rust installer's useradd code never fires there,
breaking Arch out of the box.

DynamicUser=yes sidesteps packaging entirely — systemd allocates a
transient UID per start and remaps StateDirectory ownership (including
legacy root-owned trees) automatically. Works on any modern systemd.

Drops the ensure_numa_user_linux/chown helpers plus NUMA_USER; the
unit file alone now captures the privilege-drop story.
2026-04-18 08:20:07 +03:00
Razvan Dimescu
695a8b963c feat(linux): run systemd service as unprivileged numa user
- numa.service: User=numa + CAP_NET_BIND_SERVICE + sandboxing block
  (ProtectSystem=strict, PrivateTmp, seccomp @system-service, etc)
- install_service_linux: create numa system user + chown data_dir
  before first start so TLS-cert generation and state writes land
  on a numa-owned tree

Runtime verified root-free on Linux — network_watch_loop only reads
/etc/resolv.conf; all system-DNS mutation stays in the installer,
which continues to run as root via sudo.
2026-04-18 07:56:59 +03:00
Razvan Dimescu
5f77af55e9 fix(forward): track SRTT for DoT upstreams, not just UDP
The SRTT ordering + failure penalty path was UDP-only, so a DoT primary
in a forwarding-rule pool was never deprioritized on failure and all
DoT entries tied at INITIAL_SRTT_MS in the sort key. With [[forwarding]]
now accepting arrays of upstreams, DoT pools are a first-class case and
need the same healthiest-first behavior the default pool gets for UDP.

- Add Upstream::tracked_ip() → Some(ip) for Udp/Dot, None for Doh
  (DoH has no stable IP — reqwest pools connections by hostname).
- Rewire the three SRTT call sites in forward_with_failover_raw.
- Hoist srtt.read() out of the candidate-scoring loop — one lock per
  query instead of N (matters now that pools commonly have N>1).
- Drop unused #[derive(Debug)] on UpstreamPool and ForwardingRule.
- Regression tests: udp_failure_records_in_srtt + dot_failure_records_in_srtt.
2026-04-17 03:39:21 +03:00
Razvan Dimescu
ab6cda0c91 Merge branch 'main' into feat/forwarding-array-upstream
Resolves src/main.rs conflict: serve loop was extracted into src/serve.rs on main (PR #107). Ported the forwarding-rule log change to serve.rs — fwd.upstream is now Vec<String>, logged with join(", ").
2026-04-17 03:14:09 +03:00
Razvan Dimescu
fe9f31616e test: add SCM output parsing and config path regression tests
Extract parse_sc_registered and parse_sc_state as testable pure
functions. 8 new tests covering: service registration detection,
service state parsing, and Windows config_dir == data_dir invariant.
2026-04-16 19:31:26 +03:00
Razvan Dimescu
9f08d8b489 fix(windows): stop service before port probe, wait for full exit
Stop the running service before disabling Dnscache so the port 53 probe
sees the real state (not Numa's own binding). Wait for SCM STOPPED
state before copying the binary to avoid os error 32 (file in use).
2026-04-16 19:21:56 +03:00
Razvan Dimescu
9bea038cb6 fix(windows): unify config/data dir and add service log file
config_dir() on Windows now returns data_dir() (ProgramData) so config,
services.json, and log file are in the same place for both interactive
and service contexts. Service mode writes logs to numa.log via
env_logger pipe. Dashboard shows correct log path per OS.
2026-04-16 19:12:42 +03:00
Razvan Dimescu
6789c321bc fix(windows): defer DNS redirect until port 53 is free
Probe port 53 after disabling Dnscache instead of assuming reboot is
needed. Skip DNS redirect when port is blocked (service does it on
first boot). Fix readiness probe: TCP connect to API port instead of
broken UDP send_to that always succeeded.
2026-04-16 18:35:09 +03:00
Razvan Dimescu
65e65028a0 fix(windows): separate service lifecycle from install flow
service start/stop/restart/status now map to proper SCM operations
instead of re-running the full install/uninstall flow. On re-install,
stop the running service first so the binary can be overwritten.
2026-04-16 16:59:54 +03:00
Razvan Dimescu
d3eab73a31 fix: use sort_by_key to satisfy clippy unnecessary_sort_by 2026-04-16 16:13:15 +03:00
Razvan Dimescu
22ec684e48 Merge remote-tracking branch 'origin/main' into feat/windows-service
# Conflicts:
#	src/main.rs
2026-04-16 16:06:49 +03:00
Razvan Dimescu
0118ab0f44 feat: embed git SHA in version string via build.rs
Adds a build.rs that runs `git describe --tags --always --dirty` and
sets NUMA_BUILD_VERSION at compile time. A new `numa::version()` helper
returns the build version, falling back to CARGO_PKG_VERSION when git
is unavailable (source tarballs, Docker builds without .git).

Version strings:
  tagged release:      0.13.1
  commits ahead:       0.13.1+a87f907
  uncommitted changes: 0.13.1+a87f907-dirty
  no git:              0.13.1

Replaces all 6 inline env!("CARGO_PKG_VERSION") call sites with the
single version() function.
2026-04-16 13:02:25 +03:00
Razvan Dimescu
cc635f2f73 feat(dashboard): show version in header, restructure footer
Closes #108.

- Add `version` field to /stats (from CARGO_PKG_VERSION).
- Show `v0.13.1` next to the Numa wordmark in the dashboard header.
- Restructure the footer into two semantic rows:
  Row 1 (paths): Config · Data · Logs (platform-detected)
  Row 2 (runtime): Upstream · DNSSEC · SRTT · GitHub
- Drop Mode from the footer (redundant with Upstream label).
- Show only the matching-platform log path instead of both
  macOS and Linux unconditionally.
2026-04-16 06:15:48 +03:00
Razvan Dimescu
7bb484ada3 refactor(windows): deduplicate after simplify review
- Drop the duplicate WINDOWS_SERVICE_NAME constant; call sites use the
  single source of truth at windows_service::SERVICE_NAME.
- windows_service_exe_path and service_config_path now compose from
  crate::data_dir() instead of re-parsing %PROGRAMDATA% locally.
- Factor the 6× sc.exe invocation boilerplate into a run_sc helper.
- Replace the 200ms try_recv polling loop in the service dispatcher
  with a recv_timeout wait — cuts shutdown latency and idle CPU.
- stop_service_scm/delete_service_scm now log warnings instead of
  silently swallowing failures, so unexpected errors are visible.
2026-04-15 23:48:09 +03:00
Razvan Dimescu
b610160cd1 feat(windows): run numa as a real SCM service, drop Run-key autostart
Hooks the service-dispatcher scaffolding from the previous commit to
actually serve DNS, and replaces the HKLM\…\Run login-time autostart
with a proper Windows service created via sc.exe.

**Refactor**
- Extract main.rs's inline server body (~500 lines) into `numa::serve::run`
  so both the interactive CLI entry and the service dispatcher drive the
  same startup/serve loop. main.rs is now a thin subcommand router.
- main.rs goes sync (no #[tokio::main]); each branch that needs async
  builds its own runtime and block_on's. Required so the --service path
  can hand off to SCM without fighting tokio for the entry thread.

**Windows service wrapper**
- `numa::windows_service::run_service` now builds a multi-thread tokio
  runtime on a dedicated thread and runs `serve::run` inside it. Stop/
  Shutdown from SCM aborts the wait loop and reports SERVICE_STOPPED.
- Config path resolves to `%PROGRAMDATA%\numa\numa.toml` when running
  under SCM (SYSTEM's cwd is System32, relative paths don't work).

**Install/uninstall**
- `install_windows` now copies numa.exe to a stable
  `%PROGRAMDATA%\numa\bin\numa.exe` and registers it via `sc create`
  with start=auto, obj=LocalSystem, and a failure policy of
  restart/5000/restart/5000/restart/10000. Starts the service
  immediately when no reboot is pending.
- `uninstall_windows` stops + deletes the service and removes the
  binary copy before restoring DNS.
- Drops the old `register_autostart` / `remove_autostart` helpers that
  wrote to `HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Run` — that
  path runs at user login in the user's session with no stderr capture
  and no crash-restart policy, which is why we've been flying blind in
  every Windows debug session.

DNS-set bugs (netsh destructive static, IPv6 not touched, uninstall
secondary-drop) and file logging are orthogonal — tracked for follow-up.
2026-04-15 22:24:23 +03:00
Razvan Dimescu
cea4b0ef88 feat(windows): add windows-service crate + SCM dispatcher scaffold
Lets numa.exe act as a real Windows service registered with the SCM,
replacing the HKLM\...\Run login-time autostart that runs in the user
session without stderr capture.

- New `numa::windows_service` module (cfg(windows)) wraps Mullvad's
  `windows-service` crate: registers with SCM, reports Running, handles
  Stop/Shutdown, reports Stopped.
- `numa.exe --service` is the entry point SCM uses
  (`sc create … binPath="numa.exe --service"`); interactive invocations
  are unchanged.
- Dep is gated `[target.'cfg(windows)'.dependencies]` — zero impact on
  macOS/Linux builds or binary size.

Scaffold only. The service currently blocks on an mpsc channel until
Stop arrives; the actual serve loop will hook in once main.rs's inline
server body is extracted into `numa::serve(config_path)` in a follow-up.
This lets `sc start Numa` / `sc stop Numa` be verified end to end today.
2026-04-15 22:14:36 +03:00
Razvan Dimescu
4afc56a052 Merge main into feat/forwarding-array-upstream
Resolves conflict in src/ctx.rs — both sides added independent tokio
tests (forwarding fail-over on this branch, default-pool upstream path
on main from #103). Keep both.
2026-04-15 21:28:04 +03:00
Razvan Dimescu
fef43635d6 fix(ci): rustfmt import order and gate Upstream import for Windows 2026-04-15 04:11:27 +03:00
Razvan Dimescu
9a0d586b13 feat: accept array of upstreams in [[forwarding]]
Mirrors `[upstream] address` — `upstream` accepts string or array
of strings, builds an `UpstreamPool` and routes queries through
`forward_with_failover_raw` so SRTT ordering and failover apply to
matched `[[forwarding]]` rules the same way they do for the default
pool.

Single-string rules keep their current behavior (one-element pool,
equivalent single-upstream path). Empty array errors at config load.

Addresses item 1 of issue #102. Plan: docs/102_item1.md.
2026-04-15 04:03:38 +03:00
Razvan Dimescu
ebb2a5db39 refactor: simplify upstream-path test — reuse pool mutex, drop narrating comment 2026-04-14 18:26:45 +03:00
Razvan Dimescu
e0e0f50838 feat: distinguish UPSTREAM vs FORWARD in logs and stats
Queries matching a [[forwarding]] suffix rule now log as FORWARD;
queries resolved via the default [upstream] pool log as UPSTREAM.
Previously both paths shared the FORWARD label, making it impossible
to tell from logs whether a rule matched.

Adds QueryPath::Upstream, a queries.upstream stats counter exposed
via /stats, plus a matching dashboard filter, bar, and path tag.

Closes part of #102.
2026-04-14 18:18:32 +03:00
Razvan Dimescu
b4b939c78b fix: accept tls:// and https:// in [[forwarding]] upstreams
Config-level forwarding rules were parsed with the UDP-only
`parse_upstream_addr` helper, silently rejecting the DoT/DoH schemes
that the rest of the forwarding pipeline already supports.

Widen `ForwardingRule.upstream` from `SocketAddr` to `Upstream` so
config rules reuse the same parser as `[upstream].address` and
`fallback`. Demote `parse_upstream_addr` to `pub(crate)` to prevent
the same mistake recurring.

Closes #100.
2026-04-14 09:22:24 +03:00
Razvan Dimescu
d3f046da4c style: assert loopback addr in subdomain test, trim verbose comment 2026-04-13 08:10:26 +03:00
Razvan Dimescu
0bdde40f40 test: verify forwarded response content from mock upstream 2026-04-13 08:07:58 +03:00
Razvan Dimescu
155c1c4da0 test: full-pipeline coverage for every resolve_query step
Test each pipeline stage in isolation through resolve_query:
- override takes precedence over all other paths
- localhost and *.localhost resolve to loopback
- local zone returns configured records
- .tld proxy resolves registered services to loopback
- blocklist sinkholes to 0.0.0.0
- cache hit returns stored response without upstream
2026-04-13 08:04:59 +03:00
Razvan Dimescu
b40004fe5e refactor: extract shared test infrastructure into testutil module
- test_ctx(): single ServerCtx builder, replaces 3 copies (ctx/api/dot)
- mock_upstream(): canned DNS response server for forwarding tests
- blackhole_upstream(): unresponsive socket for timeout tests
- Removes ~100 lines of duplicated 30-field struct literals
2026-04-13 07:56:47 +03:00
Razvan Dimescu
b8ddc16027 refactor: return QueryPath from resolve_query, add mock upstream to tests
resolve_query now returns (BytePacketBuffer, QueryPath) so callers
and tests can inspect the resolution path without reading the query
log. Production call sites (UDP, DoT, DoH) destructure and ignore it.

The forwarding test now uses a mock UDP upstream that replies with a
canned response, asserting QueryPath::Forwarded instead of != Local.
2026-04-13 07:51:14 +03:00
Razvan Dimescu
48f67be2f1 refactor: deduplicate test_ctx by delegating to test_ctx_with_forwarding 2026-04-13 07:39:55 +03:00
Razvan Dimescu
ca00846393 fix: forwarding rules override special-use NXDOMAIN for private PTR zones
Explicit [[forwarding]] rules now take precedence over the RFC 6303
special-use domain intercept. Previously, PTR queries for private
ranges (e.g. 168.192.in-addr.arpa) always returned local NXDOMAIN
even when a forwarding rule pointed them at a corporate DNS server.

Add full-pipeline resolve_query test harness (test_ctx + resolve_in_test)
and two tests covering both the default behavior and the override.

Closes #94
2026-04-13 07:36:53 +03:00
Razvan Dimescu
305935ed98 style: rustfmt strip_port 2026-04-12 23:59:51 +03:00
Razvan Dimescu
bd505813b6 test: verify TLS cert SANs (wildcard, services, loopback, localhost, bare TLD)
Parse the generated DER cert with x509-parser to assert the exact SAN
set, catching silent try_into() failures that a params-level test
would miss.
2026-04-12 23:54:55 +03:00
Razvan Dimescu
115a55b199 fix: bracketed IPv6, localhost SAN, split host-check helpers
- is_doh_host split into strip_port + is_loopback_host + is_tld_match
- strip_port handles bracketed IPv6 ([::1]:443) and rejects bare IPv6
- Add [::1] to accepted loopback hosts, add localhost DNS SAN to cert
- Remove dead sans.is_empty() guard (loopback IPs always present)
2026-04-12 23:54:26 +03:00
Razvan Dimescu
3665deb56b fix: accept loopback addresses for DoH and add IP SANs to TLS cert
The DoH endpoint rejected requests with Host: 127.0.0.1/::1/localhost,
and the generated TLS cert had no IP SANs — so browsers couldn't use
https://127.0.0.1/dns-query even with the CA trusted.

- is_doh_host now accepts 127.0.0.1, ::1, localhost (with optional port)
- TLS cert includes 127.0.0.1 and ::1 IP SANs, plus bare TLD DNS SAN

Closes #87
2026-04-12 23:54:26 +03:00
Razvan Dimescu
2101dfcf17 feat: transport protocol tracking (UDP/TCP/DoT/DoH) with dashboard visualization
Thread Transport enum through resolve pipeline, record per-query
transport in stats and query log. Dashboard gets bar chart panel
with encryption %, transport column in query log, and filter dropdown.
2026-04-12 22:14:26 +03:00
Razvan Dimescu
02e1449a45 feat: enable request hedging for all upstream protocols
Hedging was DoH-only (hyper dispatch spike mitigation). Now applies to
UDP (rescues packet loss) and DoT (rescues TLS handshake stalls) too.
Same-upstream hedging: fires a second independent request after hedge_ms
delay. First response wins. Disable with hedge_ms = 0.
2026-04-12 21:34:47 +03:00
Razvan Dimescu
6d9ee14ea6 refactor: unify warm_stale/warm_domain, remove raw_wire alloc, add Freshness enum
- Extract refresh_entry in ctx.rs — warm_domain in main.rs now delegates
  to it instead of duplicating the resolve+cache logic (~40 lines removed)
- Eliminate unconditional .to_vec() of raw wire on every UDP/DoT query —
  pass &buffer.buf[..len] directly (zero-cost for cache hits)
- Replace bare bool stale flag with Freshness enum (Fresh/NearExpiry/Stale)
  making the three states self-documenting at every call site
2026-04-12 19:56:42 +03:00
Razvan Dimescu
3c49b0e65d fix: deduplicate background refresh with per-domain guard
Multiple stale queries for the same domain now spawn only one background
refresh. A HashSet<(String, QueryType)> on ServerCtx tracks in-flight
refreshes; subsequent stale hits for the same key skip the spawn.
2026-04-12 19:49:23 +03:00
Razvan Dimescu
8ef95383a2 feat: prefetch at <10% TTL remaining, add stale behavior tests
Entries with <10% TTL remaining are now marked stale on lookup,
triggering a background refresh before they expire. Combined with
the serve-stale + background refresh from the previous commit, this
means entries are proactively refreshed — matching Unbound's prefetch
behavior.
2026-04-12 19:46:14 +03:00
Razvan Dimescu
571ce2f013 feat: background refresh on stale cache hit (RFC 8767 revalidation)
When a cached entry is expired but within the 1-hour stale window,
serve it immediately with TTL=1 AND spawn a background re-resolve.
The next query gets a fresh entry instead of another stale serve.

Without this, stale entries were served repeatedly for up to an hour
with no refresh — effectively ignoring TTL.
2026-04-12 19:42:56 +03:00
Razvan Dimescu
043a7e1ba5 feat: raise cache default to 100K entries, evict stalest instead of dropping
The 10K cap was too conservative — the blocklist alone holds 400K domains.
At ~100 bytes per wire entry, 100K entries is ~10MB.

When the cache is full and evict_expired doesn't free enough slots,
evict_stalest removes the entry with the least remaining TTL instead of
silently discarding the new insert.
2026-04-12 19:23:28 +03:00
Razvan Dimescu
05d5a5145f refactor: remove unused extract_question and read_wire_qname from wire.rs 2026-04-12 18:46:03 +03:00
Razvan Dimescu
628ed00074 refactor: extract cache_and_parse, remove dead truncation log, restore TCP_TIMEOUT to 400ms 2026-04-12 18:40:46 +03:00
Razvan Dimescu
85cff052a4 fix: restore TCP_TIMEOUT to 400ms (test race was the real issue) 2026-04-12 18:40:46 +03:00
Razvan Dimescu
67b472fea7 fix: serialize tests that share global UDP_DISABLED state
The tcp_only_iterative_resolution, tcp_fallback_resolves_when_udp_blocked,
tcp_fallback_handles_nxdomain, and udp_auto_disable_resets tests all mutate
global UDP_DISABLED / UDP_FAILURES atomics. Under cargo test parallelism,
udp_auto_disable_resets would reset the flag mid-flight causing other tests
to attempt UDP against TCP-only mock servers and time out.

Fix: static Mutex serializes tests that depend on global UDP state.
Also: tcp_only_iterative_resolution now calls forward_tcp directly,
removing its dependence on the flag entirely.
2026-04-12 18:40:46 +03:00
Razvan Dimescu
700cca9cb6 style: rustfmt warm_domain 2026-04-12 18:40:46 +03:00
Razvan Dimescu
f705f8c49f fix: bump TCP_TIMEOUT to 800ms to fix flaky CI test 2026-04-12 18:40:46 +03:00
Razvan Dimescu
17a1a6ddba refactor: remove forward_with_failover duplication, fix warm-branch hedge bug
- Remove forward_with_failover (parsed): warm_domain now uses _raw + insert_wire
- forward_udp delegates to forward_udp_raw (single UDP socket implementation)
- forward_query uses unified _raw path for all protocols
- Fix send_query_hedged warm branch: bare select! dropped secondary on primary
  error instead of waiting for it — now drains both futures like the cold branch
- Remove pointless raw_len = len rename
2026-04-12 18:40:46 +03:00
Razvan Dimescu
72b540a44a feat: wire-level cache, serve-stale, raw wire passthrough
- Cache stores raw DNS wire bytes + TTL offsets (2.4x memory reduction)
- Serve-stale (RFC 8767): expired entries returned with TTL=1 for 1hr
- handle_query captures raw_len from recv_from for zero-copy forwarding
- resolve_query accepts raw wire bytes, forwards without re-serializing
- wire.rs: TTL offset scanner, ID/TTL patching, question extraction
- 52 wire tests + 16 cache regression tests
2026-04-12 18:40:46 +03:00
Razvan Dimescu
5d9a3a809b feat: DoT client, recursive optimization, bench refactor
- Add DoT forwarding client (tls://IP#hostname upstream config)
- Recursive: cache NS delegations, serve-stale (RFC 8767), parallel
  NS queries on cold, no TCP fallback on individual UDP timeouts,
  400ms NS/TCP timeout (down from 800/1500ms)
- Reduce recursive p99 from 2367ms to 402ms (vs Unbound's 148ms)
- Refactor benchmark suite: generic compare_two engine, delete
  one-off diagnostics (1969 → 750 lines)
- Code cleanup: forward_query delegates to _raw, Option<String>
  for tls_name, saturating_sub for ns_idx
2026-04-12 18:40:46 +03:00