Don’t Forget The Dot
Recently we discovered a surprising bottleneck in our Kubernetes cluster at work: our DNS server was overloaded and had started to respond slowly. We applied the quick fix suggested in Wolt’s excellent post, but it only treated the symptom. The bigger question remained: why were we sending so many DNS queries?
The DNS load had two causes:
- DNS records in Kubernetes have a low TTL by default. You can raise it, but doing so has its own drawbacks that should be considered.
- Each DNS lookup actually sent two queries, one after the other. When trying to reach `example.com`, for example, we first query `example.com.svc.cluster.local` and only then try the actual domain we wanted. The reason is that Kubernetes first checks whether the requested name is an internal service before reaching out to an external DNS server. The same applies to local services: before querying `internal.svc.cluster.local`, it first queries `internal.svc.cluster.local.svc.cluster.local`.
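To make the lookup order concrete, here is a minimal Python sketch of how a stub resolver might expand a name before querying it. It is a simplified model, not the actual resolver code: `candidate_names` is a hypothetical helper, the single `svc.cluster.local` suffix mirrors the example above, and `ndots=5` is assumed because that is the value Kubernetes normally writes into a pod's `/etc/resolv.conf`. Real pods usually carry a longer search list, so the number of extra queries can be higher than two.

```python
def candidate_names(name, search=("svc.cluster.local",), ndots=5):
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as absolute, so the search list is skipped.
    if name.endswith("."):
        return [name]

    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = name + "."

    # Names with fewer dots than `ndots` go through the search list first.
    if name.count(".") < ndots:
        return expanded + [absolute]
    # Otherwise the name is tried as-is first, then with the suffixes.
    return [absolute] + expanded


print(candidate_names("example.com"))
# ['example.com.svc.cluster.local.', 'example.com.']
print(candidate_names("internal.svc.cluster.local"))
# ['internal.svc.cluster.local.svc.cluster.local.', 'internal.svc.cluster.local.']
```

Under these assumptions, both an external name and an already complete internal name cost one wasted query before the real one.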
The Solution
The solution was actually one character long! A fully qualified domain name (FQDN), written in its absolute form, ends with a dot that represents the DNS root. So instead of querying `example.com` we changed it to `example.com.` (note the trailing dot). The root dot means the name can’t be a local service, because nothing can be appended after the last dot.
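If hostnames come from configuration rather than being typed by hand, the dot can be added in one place. The sketch below is only an illustration of that idea: `to_absolute` is a hypothetical helper, and the choice to leave IP literals untouched is an assumption about what such a helper should do, not something our setup required.

```python
import ipaddress


def to_absolute(host):
    """Append the root dot unless the host is already absolute or is an IP literal."""
    try:
        ipaddress.ip_address(host)
        return host  # IP addresses are not DNS names, so leave them alone
    except ValueError:
        pass  # not an IP address, treat it as a DNS name
    return host if host.endswith(".") else host + "."


print(to_absolute("example.com"))   # example.com.
print(to_absolute("example.com."))  # example.com.
print(to_absolute("10.0.0.1"))      # 10.0.0.1
```

The helper is idempotent, so applying it to a name that already ends with a dot leaves the name unchanged.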
Thanks to that simple change we cut the load on our DNS server almost in half. Of course, we needed other mitigations as well, because that alone wasn’t enough, at least not for long. But it did give us time to breathe.