fix(upstream): default hedge_ms=0 to avoid silent 2x upstream query count #135

Merged
razvandimescu merged 1 commit from fix/hedge-default-off into main 2026-04-23 04:49:16 +08:00
razvandimescu commented 2026-04-23 04:31:15 +08:00 (Migrated from github.com)

Summary

Hedging fires a second upstream query after the hedge delay. It's a genuine tail-latency rescue (packet loss, dispatch spikes, TLS handshake stalls) — but every lookup appears twice at the upstream provider. On quota'd DNS (NextDNS free tier, Control D, paid Quad9), the default hedge_ms = 10 silently halves the user's headroom.
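The mechanism described above can be sketched with threads and a channel. This is a minimal illustration, not numa's implementation: `query_upstream`, `resolve_with_hedge`, and the in-process counter are hypothetical names, and the "upstream" is a sleeping thread rather than a real DNS server. It only shows why a response slower than the hedge delay is counted twice at the upstream.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for one upstream query: counts the request
/// immediately, then answers on `tx` after `delay`.
fn query_upstream(counter: &Arc<AtomicUsize>, delay: Duration, tx: mpsc::Sender<&'static str>) {
    counter.fetch_add(1, Ordering::SeqCst);
    thread::spawn(move || {
        thread::sleep(delay);
        // The receiver may be gone if the other query already won; ignore that.
        let _ = tx.send("answer");
    });
}

/// Resolve once. If `hedge_ms > 0` and no answer arrives within that delay,
/// send a second query and take whichever answer comes back first.
/// `hedge_ms == 0` means hedging is off (assumed semantics for this sketch).
fn resolve_with_hedge(
    upstream_delay: Duration,
    hedge_ms: u64,
    counter: &Arc<AtomicUsize>,
) -> &'static str {
    let (tx, rx) = mpsc::channel();
    query_upstream(counter, upstream_delay, tx.clone());

    if hedge_ms == 0 {
        // No hedge timer: wait for the single query.
        return rx.recv().expect("sender is still alive");
    }

    match rx.recv_timeout(Duration::from_millis(hedge_ms)) {
        // The answer beat the hedge timer: the upstream saw one query.
        Ok(answer) => answer,
        // Timer expired first: fire the hedge. The upstream now sees two queries.
        Err(_) => {
            query_upstream(counter, upstream_delay, tx);
            rx.recv().expect("sender is still alive")
        }
    }
}
```

With a simulated upstream that is slower than the hedge delay, the counter ends at 2 for `hedge_ms = 10` and at 1 for `hedge_ms = 0`, which is the duplication a quota'd provider would bill for.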

Surfaced by #134 (bcookatpcsd): single-address DoT upstream to NextDNS, dashboard showed every query duplicated. Not a protocol bug — hedging doing what it promises — but a surprising default.

This PR flips default_hedge_ms() from 10 to 0. Opt-in from here; users who want tail-latency rescue add hedge_ms = 10 (or higher) explicitly.

  • src/config.rs:454 — default changed + short docstring explaining the 0 default and when to turn it back on.
  • numa.toml — commented example shows new default, documents the 2× query-count trade-off so it's visible at config time.

No API or config surface changes. No migration. Existing configs that set hedge_ms = X explicitly are unchanged.
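A rough sketch of why explicit settings are unaffected, using only the standard library. Only `default_hedge_ms()` is named in this PR; `effective_hedge_ms` and `hedging_enabled` are hypothetical helpers, and the real `src/config.rs` (not shown here) presumably applies the default through its config deserializer rather than an `Option`.

```rust
/// Default hedge delay in milliseconds. After this PR it is 0, meaning off.
fn default_hedge_ms() -> u64 {
    0
}

/// Hypothetical helper: an explicit `hedge_ms` always wins, and the default
/// is consulted only when the config has no `hedge_ms` line.
fn effective_hedge_ms(configured: Option<u64>) -> u64 {
    configured.unwrap_or_else(default_hedge_ms)
}

/// Hypothetical helper: hedging runs only for a positive delay.
fn hedging_enabled(hedge_ms: u64) -> bool {
    hedge_ms > 0
}
```

Under these assumptions, a config without `hedge_ms` resolves to 0 and never hedges, while a config with `hedge_ms = 10` still resolves to 10 exactly as before.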

Test plan

  • cargo check clean.

  • cargo test --lib config::tests — 51 passed.

  • End-to-end manual test against a counting mock upstream (50ms response delay, so a 10ms hedge timer expires 40ms before the first response arrives):

    • Default config (no hedge_ms line) → mock received 1 packet, numa answered in 56ms.
    • Explicit hedge_ms = 10 → mock received 2 packets, numa answered in 50ms.

    Confirms the flip suppresses hedging for new users out-of-the-box while explicit opt-in still fires. Mock source + exact steps in the PR thread / linked gist (inline with the review if wanted).

  • CI green (cargo audit, cargo test, integration).
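The packet counts from the end-to-end test can be sanity-checked with an idealized timeline. This is a toy model under stated assumptions, not the mock used in the test: `Outcome` and `simulate` are hypothetical names, the mock is assumed to answer every packet after the same fixed delay, and network and processing overhead are ignored.

```rust
/// Idealized result of one lookup against the mock.
#[derive(Debug, PartialEq)]
struct Outcome {
    /// Packets the mock upstream receives.
    packets: u32,
    /// Milliseconds until the first answer arrives.
    answer_ms: u64,
}

/// Model one lookup: the mock answers each packet `upstream_delay_ms` after
/// receiving it, and the hedge (if enabled) fires `hedge_ms` after the first packet.
fn simulate(upstream_delay_ms: u64, hedge_ms: u64) -> Outcome {
    // Hedging off, or the first answer arrives before the hedge timer expires.
    if hedge_ms == 0 || upstream_delay_ms <= hedge_ms {
        return Outcome { packets: 1, answer_ms: upstream_delay_ms };
    }
    // The hedge fires at `hedge_ms`, so its answer lands at `hedge_ms + delay`.
    let first_answer = upstream_delay_ms;
    let hedged_answer = hedge_ms + upstream_delay_ms;
    Outcome { packets: 2, answer_ms: first_answer.min(hedged_answer) }
}
```

In this model `simulate(50, 0)` gives 1 packet and `simulate(50, 10)` gives 2 packets, matching the counts above. Both answer at 50ms because a uniformly slow mock never lets the hedge win, so the test demonstrates the query-count cost rather than any latency benefit. The 56ms versus 50ms measurements differ by overhead that the model does not capture.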

Follow-ups (not in this PR)

  • Release note in v0.14.2: "hedging is now opt-in; set hedge_ms = 10 if you want the pre-0.14.2 tail-latency behavior".
  • Reply to #134 pointing at this change + the hedge_ms = 0 workaround they can apply to 0.14.1 today.