Chicken-Egg problem on blocklist resolution #122

Closed
opened 2026-04-20 22:49:55 +08:00 by Guara92 · 2 comments
Guara92 commented 2026-04-20 22:49:55 +08:00 (Migrated from github.com)

Hi,

I'm creating a Home Assistant add-on to deploy numa on my Home Assistant installation (repo [here](https://github.com/Guara92/numa-haos)), and I'm facing an issue with the resolution of the blocklist:

```
[2026-04-20T13:53:37.839Z WARN  numa::blocklist] failed to download blocklist https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt: error sending request for url (https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt)
```

In my particular network setup, the router's DHCP server sets the clients' DNS server to the IP address of my Raspberry Pi running Home Assistant. When numa starts, it tries to download the blocklist and needs to resolve the server name, but I think numa's DNS server has not started yet, so the download fails. I also think numa throws away the underlying reqwest error, which makes debugging harder.

At startup, `tokio::spawn(load_blocklists)` fires on a worker thread before the `recv_from` loop in `run()` has a chance to start, so `getaddrinfo()` inside reqwest sends a DNS query to Numa's own port 53, which is bound but not yet serving. The query times out and the download fails.

razvandimescu commented 2026-04-21 00:26:14 +08:00 (Migrated from github.com)

Thank you @Guara92 for reporting this. I've added parallelized downloads for multiple blocklists with backoff, and will follow up shortly with 0.14.1 so you can bump the add-on.

Guara92 commented 2026-04-21 15:08:04 +08:00 (Migrated from github.com)

Thanks for the speedy fix! I'm still facing some stability issues, and I need to understand whether they're related to the Home Assistant add-on or not. I'm seeing upstream resolution hit the timeout for the primary DoH and finally resolve with high latency (> 1.6s), and during startup it seems the fallbacks are not used to resolve the blocklist DNS; in fact, during these logs:

```
[2026-04-21T06:44:08.718Z WARN  numa::blocklist] blocklist https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt attempt 1/4 failed: error sending request for url (https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt): client error (Connect): dns error: failed to lookup address information: Try again — retrying in 2s
[2026-04-21T06:44:20.733Z WARN  numa::blocklist] blocklist https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt attempt 2/4 failed: error sending request for url (https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt): client error (Connect): dns error: failed to lookup address information: Try again — retrying in 10s
[2026-04-21T06:44:40.745Z WARN  numa::blocklist] blocklist https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt attempt 3/4 failed: error sending request for url (https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt): client error (Connect): dns error: failed to lookup address information: Try again — retrying in 30s
[2026-04-21T06:45:20.761Z WARN  numa::blocklist] blocklist https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt attempt 4/4 failed: error sending request for url (https://cdn.jsdelivr.net/gh/hagezi/dns-blocklists@latest/hosts/pro.txt): client error (Connect): dns error: failed to lookup address information: Try again — giving up
```

I was able to resolve names with the upstream, but very slowly; then after a couple of minutes I started to see SERVFAIL on the numa dashboard, and these were the logs:

```
[2026-04-21T06:50:21.848Z ERROR numa::ctx] 192.168.1.3:33780 | A release-assets.githubusercontent.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:21.849Z ERROR numa::ctx] 192.168.1.3:33617 | AAAA release-assets.githubusercontent.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.533Z ERROR numa::ctx] 192.168.1.18:53724 | A gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.533Z ERROR numa::ctx] 192.168.1.18:58815 | A iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.534Z ERROR numa::ctx] 192.168.1.18:61728 | AAAA iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.534Z ERROR numa::ctx] 192.168.1.18:64475 | HTTPS gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.534Z ERROR numa::ctx] 192.168.1.18:51378 | AAAA gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:30.534Z ERROR numa::ctx] 192.168.1.18:52983 | HTTPS iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.118Z ERROR numa::ctx] 192.168.1.18:64475 | HTTPS gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.122Z ERROR numa::ctx] 192.168.1.18:51378 | AAAA gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.122Z ERROR numa::ctx] 192.168.1.18:61728 | AAAA iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.124Z ERROR numa::ctx] 192.168.1.18:52983 | HTTPS iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.124Z ERROR numa::ctx] 192.168.1.18:58815 | A iphone-ld.apple.com | UPSTREAM ERROR | deadline has elapsed
[2026-04-21T06:50:31.124Z ERROR numa::ctx] 192.168.1.18:53724 | A gspe79-ssl.ls.apple.com | UPSTREAM ERROR | deadline has elapsed
```

Also, I'd expect that even with this problem, timeouts should only occur until my primary upstream servers' addresses are cached, but instead they persist; maybe upstream DNS resolution doesn't use the internal cache.
If you want to have a look, the config used is [here](https://github.com/Guara92/numa-haos/blob/main/numa/rootfs/etc/numa/numa.toml.default).

EDIT:
Digging more, it seems that even if your PR fixes the startup race, my particular config triggers another circular problem: the container's `resolv.conf` (managed by Docker/HA Supervisor) points to an internal DNS proxy at `172.30.32.3`, which forwards to the host DNS, which in a typical home setup points back to Numa itself. Since my primary upstream addresses are hostnames, a query that started from Numa returns to Numa; Numa tries the same DoH upstreams to answer it, which call `getaddrinfo()` again: deadlock. All four attempts from PR #125 fail with `EAI_AGAIN` because the loop never resolves. TCP/TLS to both upstreams is fine (`curl --resolve` returns HTTP 200); `nslookup dns.quad9.net` from inside the container times out completely.

## Key Distinction (assisted by claude)

**LAN client queries** arrive directly on Numa's UDP socket. Numa controls the entire pipeline: it runs every upstream in sequence and replies when one succeeds. No intermediary can interrupt it.

**Internal `getaddrinfo()` calls** (reqwest, blocklist) instead go through the OS resolver → `172.30.32.3` → host DNS → Numa. `172.30.32.3` is an independent forwarder with its own timeout `T`. It waits at most `T` for Numa to respond, then returns SERVFAIL and discards any late answer.


---

### Case 1 — LAN client query for `example.com` (works but slow)

```
t=0ms     LAN client UDP → Numa:53  [no intermediary]
          forward_with_failover_raw

t=0ms       DoH[0] "https://dns.quad9.net/dns-query"
              reqwest → getaddrinfo("dns.quad9.net") → 172.30.32.3 → Numa
              [inner loop: DoH attempts time out, same pattern]
t=~800ms    getaddrinfo returns EAI_AGAIN — DoH[0] fails

t=800ms     DoH[1] "https://cloudflare-dns.com/dns-query"  [same path]
t=~1600ms   DoH[1] fails

t=1600ms    Upstream::Udp(9.9.9.9:53)  → configured fallback
              direct UdpSocket::send_to — zero getaddrinfo, zero 172.30.32.3
t=~1630ms   9.9.9.9 answers "example.com"

t=1630ms  Numa replies to LAN client ✅  (~1630ms latency)
```

The fallback fires because **Numa owns the pipeline end-to-end**. `172.30.32.3` is only involved in the inner DoH hostname lookups, not in delivering the final answer to the LAN client.


---

### Case 2 — internal `getaddrinfo("cdn.jsdelivr.net")` from blocklist (fails)

```
t=0ms     reqwest getaddrinfo("cdn.jsdelivr.net")
            → OS → UDP to 172.30.32.3
            172.30.32.3 starts its own timeout T, forwards to Numa

t=0ms     Numa: forward_with_failover_raw
t=0ms       DoH[0] "https://dns.quad9.net/dns-query"
              reqwest → getaddrinfo("dns.quad9.net") → 172.30.32.3 → Numa
t=~800ms    172.30.32.3 inner timeout fires → EAI_AGAIN — DoH[0] fails
t=800ms     DoH[1] "https://cloudflare-dns.com/dns-query"  [same]
t=~1600ms   DoH[1] fails ← at t=T, 172.30.32.3 outer timeout fires for "cdn.jsdelivr.net"
              172.30.32.3 sends SERVFAIL to getaddrinfo, closes the request
t=1600ms    Upstream::Udp(9.9.9.9:53)
              direct UDP → 9.9.9.9 → cdn.jsdelivr.net resolved ✅
t=1630ms    Numa has the answer — sends to 172.30.32.3
              172.30.32.3 has already discarded this request at t=T
t=T         getaddrinfo → EAI_AGAIN ❌
```

---

## Why the fallback structurally never fires in time

Each DoH attempt fails at `~T` (the `172.30.32.3` timeout for the inner hostname lookup). Numa needs two DoH failures before reaching the UDP fallback: **`2T` total**. But `172.30.32.3`'s outer timer for the original query also expires at `T`. Since `2T > T` always, Numa's UDP fallback answer arrives exactly `T` milliseconds too late. The gap is structural, not a tuning problem.

Another proof is that using this upstream config:

```
address = [
  "https://dns.quad9.net/dns-query",
  "https://1.1.1.1/dns-query",
]
```

latency is halved, at ~800ms.

with this config:

```
address = [
  "https://1.1.1.1/dns-query",
  "https://dns.quad9.net/dns-query",
]
```

latency is normal (between 10ms and 70ms), but blocklist resolution still fails at startup.


Reference: dearsky/numa#122