Chicken-Egg problem on blocklist resolution #122
Hi,
I'm creating a Home Assistant add-on to deploy Numa on my Home Assistant installation (repo here), and I'm facing an issue with the resolution of the blocklist:
In my network setup the router's DHCP server sets the clients' DNS server to the IP address of my Raspberry Pi running Home Assistant. When Numa starts, it tries to download the blocklist and needs to resolve the name of the server hosting it, but I think Numa's own DNS server has not started yet, so the download fails. I also think Numa is throwing away the underlying reqwest error, which makes debugging harder.
At startup, `tokio::spawn(load_blocklists)` fires on a worker thread before the `recv_from` loop in `run()` has a chance to start, so `getaddrinfo()` inside reqwest sends a DNS query to Numa's own port 53, which is bound but not yet serving. The query times out and the download fails.

Thank you @Guara92 for reporting this. I've added parallel download of multiple blocklists with backoff, and will follow up shortly with 0.14.1 so you can bump the add-on.
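The ordering fix can be sketched with plain std primitives (hypothetical names; Numa itself uses tokio, so a oneshot channel would play the role of `mpsc` here). The point is that the serve loop must signal readiness before anything that needs name resolution is allowed to start:

```rust
use std::net::UdpSocket;
use std::sync::mpsc;
use std::thread;

fn main() {
    let (ready_tx, ready_rx) = mpsc::channel();

    let server = thread::spawn(move || {
        // The sketch binds port 0 so it runs anywhere; Numa binds :53.
        let sock = UdpSocket::bind("127.0.0.1:0").expect("bind failed");
        // Signal readiness only once the socket is bound AND the loop is
        // about to serve. Binding alone is not enough, which is exactly
        // the bug above: bound but not yet serving.
        ready_tx.send(sock.local_addr().unwrap()).unwrap();
        // ... recv_from loop would run here ...
    });

    // The blocklist download waits for the signal instead of racing.
    let addr = ready_rx.recv().expect("server never became ready");
    println!("server ready on {addr}, safe to start blocklist download");
    server.join().unwrap();
}
```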
Thanks for the speedy fix! I'm still facing some stability issues, and I need to understand whether they're related to the Home Assistant add-on or not. For example, I'm seeing upstream resolution hit the timeout for the primary DoH server and finally resolve with high latency (> 1.6 s). During startup it seems the fallbacks are not used to resolve the blocklist DNS; these are the logs from that period:
I was able to resolve names through the upstream, but very slowly. Then, after a couple of minutes, I started to see SERVFAIL on the Numa dashboard, and these were the logs:
Also, I'd expect that even with this problem, timeouts would only occur until my primary upstream servers' addresses are cached; instead they persist, so maybe upstream DNS resolution doesn't use the internal cache.
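If upstream hostnames were resolved through a cache, only the first lookup would pay the `getaddrinfo()` cost. A minimal sketch of such a memoizing resolver (hypothetical names, not Numa's API) looks like this:

```rust
use std::collections::HashMap;
use std::net::{SocketAddr, ToSocketAddrs};

// Hypothetical sketch: memoize upstream hostname resolutions so that
// once a name like "dns.quad9.net" has been resolved successfully,
// later lookups never go back through the OS resolver (and its
// timeouts) at all.
struct CachedResolver {
    cache: HashMap<String, Vec<SocketAddr>>,
}

impl CachedResolver {
    fn new() -> Self {
        Self { cache: HashMap::new() }
    }

    fn resolve(&mut self, host_port: &str) -> std::io::Result<Vec<SocketAddr>> {
        if let Some(hit) = self.cache.get(host_port) {
            return Ok(hit.clone()); // cache hit: no OS-resolver round trip
        }
        let addrs: Vec<SocketAddr> = host_port.to_socket_addrs()?.collect();
        self.cache.insert(host_port.to_string(), addrs.clone());
        Ok(addrs)
    }
}

fn main() -> std::io::Result<()> {
    let mut r = CachedResolver::new();
    let first = r.resolve("localhost:53")?;  // hits the OS resolver once
    let second = r.resolve("localhost:53")?; // served from the cache
    assert_eq!(first, second);
    println!("cached {} address(es) for localhost", first.len());
    Ok(())
}
```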
If you want to have a look, the config I'm using is here.
EDIT:
Digging more, it seems that even though your PR fixes the startup race, my particular config triggers another circular problem. The container's `resolv.conf` (managed by the Docker/HA Supervisor) points to an internal DNS proxy at `172.30.32.3`, which forwards to the host DNS, which in a typical home setup points back to Numa itself. Since my primary upstream addresses are hostnames, a query that started from Numa comes back to Numa; Numa tries the same DoH upstreams to answer it, which calls `getaddrinfo()` again: deadlock. All four attempts from PR #125 fail with `EAI_AGAIN` because the loop never resolves. TCP/TLS to both upstreams is fine (`curl --resolve` returns HTTP 200); `nslookup dns.quad9.net` from inside the container times out completely.

**Key distinction** (assisted by Claude)
**LAN client queries** arrive directly on Numa's UDP socket. Numa controls the entire pipeline: it runs every upstream in sequence and replies when one succeeds. No intermediary can interrupt it.

**Internal `getaddrinfo()` calls** (reqwest, blocklist) instead go through the OS resolver → `172.30.32.3` → host DNS → Numa. `172.30.32.3` is an independent forwarder with its own timeout `T`. It waits at most `T` for Numa to respond, then returns SERVFAIL and discards any late answer.

**Case 1 — LAN client query for `example.com` (works, but slow).** The fallback fires because Numa owns the pipeline end to end. `172.30.32.3` is only involved in the inner DoH hostname lookups, not in delivering the final answer to the LAN client.

**Case 2 — internal `getaddrinfo("cdn.jsdelivr.net")` from the blocklist download (fails).**

**Why the fallback structurally never fires in time.** Each DoH attempt fails at ~`T` (the `172.30.32.3` timeout for the inner hostname lookup). Numa needs two DoH failures before reaching the UDP fallback: `2T` total. But `172.30.32.3`'s outer timer for the original query also expires at `T`. Since `2T > T` always, Numa's UDP fallback answer arrives exactly `T` milliseconds too late. The gap is structural, not a tuning problem.

Further evidence: with this upstream config:
latency is cut in half, to ~800 ms
with this config:
latency is normal (between 10 ms and 70 ms), but blocklist resolution still fails at startup.
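One way out of this kind of loop, used by resolvers such as dnscrypt-proxy and AdGuard Home, is a bootstrap path: IP-literal upstreams are usable immediately, while hostname upstreams (and the blocklist host) are resolved via a fixed bootstrap IP rather than `getaddrinfo()`. A minimal classification sketch, with hypothetical names that are not Numa's API:

```rust
use std::net::IpAddr;
use std::str::FromStr;

// Hypothetical sketch: classify an upstream so that hostname-based ones
// can be sent to a bootstrap resolver (a plain IP) instead of the OS
// resolver, which in this setup loops back through 172.30.32.3 to Numa.
#[derive(Debug, PartialEq)]
enum Upstream {
    /// Usable immediately; no resolution needed, no cycle possible.
    IpLiteral(IpAddr),
    /// Must be resolved via the bootstrap IP, never via getaddrinfo().
    Hostname(String),
}

fn classify(upstream: &str) -> Upstream {
    match IpAddr::from_str(upstream) {
        Ok(ip) => Upstream::IpLiteral(ip),
        Err(_) => Upstream::Hostname(upstream.to_string()),
    }
}

fn main() {
    // An IP-literal upstream breaks the cycle on its own.
    assert_eq!(
        classify("9.9.9.9"),
        Upstream::IpLiteral("9.9.9.9".parse().unwrap())
    );
    // A hostname upstream is what re-enters the OS resolver today.
    assert!(matches!(classify("dns.quad9.net"), Upstream::Hostname(_)));
    println!("classification ok");
}
```

This only sketches the classification step; the bootstrap resolver itself would then be used for the hostname cases, including the blocklist download at startup.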