Don’t Forget The Dot

Recently we discovered a surprising bottleneck in our Kubernetes cluster at work: our DNS server was overloaded and had started to respond slowly. We applied the quick fix suggested in Wolt’s excellent post, but that only treated the symptom. The bigger question remained: why were we making so many DNS queries?

The DNS load had two causes:

  1. DNS records in Kubernetes have a low TTL by default. You can raise it, but doing so has its own drawbacks that should be considered.
  2. For each DNS lookup we actually sent two queries, one after the other. When trying to reach an external domain, the resolver first queries that name with the cluster suffix svc.cluster.local appended, and only then tries the actual domain we wanted. The reason is that Kubernetes first checks whether the requested name is an internal service before reaching out to an external DNS server. The same applies to local services: before querying internal.svc.cluster.local, it first queries internal.svc.cluster.local.svc.cluster.local.
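The doubling described in the second point can be sketched in a few lines of Python. This is a simplified model of search-list resolution, not the resolver's actual code: the function name candidate_queries is made up for illustration, example.com is a placeholder domain, and the configuration (a single search suffix svc.cluster.local with ndots set to 5) is an assumption. Real pods usually have several search suffixes, which would add further queries.

```python
def candidate_queries(name, search_domains, ndots):
    """Return the names a search-list resolver would try, in order."""
    # A trailing dot marks the name as absolute: no suffix is ever appended.
    if name.endswith("."):
        return [name]

    absolute = name + "."
    with_suffixes = [f"{name}.{domain}." for domain in search_domains]

    # Names with at least `ndots` dots are tried as-is first;
    # shorter names go through the search list first.
    if name.count(".") >= ndots:
        return [absolute] + with_suffixes
    return with_suffixes + [absolute]


# Assumed, simplified configuration (not taken from a real cluster).
SEARCH = ["svc.cluster.local"]
NDOTS = 5

print(candidate_queries("example.com", SEARCH, NDOTS))
print(candidate_queries("internal.svc.cluster.local", SEARCH, NDOTS))
print(candidate_queries("example.com.", SEARCH, NDOTS))
```

Under these assumptions the first two calls each return two candidates, with the suffixed name first, while the name that already ends in a dot yields a single query.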

The Solution

The solution was actually one character long! A fully qualified domain name (FQDN) must end with a dot, which represents the DNS root. So instead of querying the bare domain, we just added a dot to the end of it. The trailing dot means the name can’t be a local service, because nothing can be appended after the last dot.
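If hostnames come from configuration, one way to apply the fix consistently is a small helper that appends the dot where it is missing. The sketch below is only an illustration: with_trailing_dot is a hypothetical name, example.com and 10.0.0.1 are placeholder values, and skipping IP literals is our own assumption about what such a helper should do.

```python
import ipaddress


def with_trailing_dot(host):
    """Return host as an absolute DNS name; leave IP literals untouched."""
    try:
        ipaddress.ip_address(host)
        return host  # IP addresses are not looked up in DNS
    except ValueError:
        pass  # not an IP literal, so treat it as a hostname
    return host if host.endswith(".") else host + "."


print(with_trailing_dot("example.com"))
print(with_trailing_dot("example.com."))
print(with_trailing_dot("10.0.0.1"))
```

It is worth checking that whatever client consumes the resulting name accepts a trailing dot before rolling this out widely.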

Thanks to that simple change we cut the load on our DNS server almost in half. Of course, we needed other mitigations as well, because that alone wasn’t enough, at least not for long. But it did give us time to breathe.