Making Gentlent Faster and More Reliable
A story about how the tech team at Gentlent improved reliability and coincidentally increased core web server performance fourfold.
Ah yes, the complex world of server and cloud infrastructure. What a wonderful place to spend hard-earned money and time when there is already far too little of both.
G-Core is what we call Gentlent’s monstrous codebase. It contains our domain name servers (DNS), SMTP (for email), tooling to announce our anycast and unicast IP ranges via the Border Gateway Protocol (BGP) to our data centres’ routers, and the basic Geo IP API that you might use in your next project. Even more services run on it that are not listed here, and it scales well, like really well. But like all great things, it has a major flaw, one that we had not addressed until now:
Is it about drives? Is it about power? It’s about restarting the services. To make our infrastructure reliable as a rock - and I hope y’all got the reference - we needed to improve it to support zero downtime for most, if not all, of our future updates and the restarts they require.
Now, Gentlent’s updates to its core happen multiple times a day, whether it’s a small tweak, a fix, or a major feature release. Each update starts the same process: slowly closing open connections, stopping IP announcements to the cluster, downloading the new image, and spinning the new container back up with all its services, listeners, and announcements. Updates were already done as “rolling updates”, meaning that we don’t update all servers at the same time. However, we partly host our own name servers, and spinning up these container images takes a while. At least long enough to be detected by our various in-house and third-party monitoring services. We also observed timeouts and refused connections during updates.
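To make the gap concrete, here is a minimal sketch of the rolling update sequence described above. It is a toy simulation, not our actual deployment tooling: the `Server` class, the `rolling_update` function, and the server names and version strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    version: str
    serving: bool = True

    def begin_update(self) -> None:
        # Close open connections and stop IP announcements.
        # From here on, this server answers nothing.
        self.serving = False

    def finish_update(self, new_version: str) -> None:
        # Download the new image and start the container with its
        # services, listeners, and announcements.
        self.version = new_version
        self.serving = True


def rolling_update(servers: list[Server], new_version: str) -> int:
    """Update one server at a time.

    Returns the lowest number of servers that were serving at any point.
    """
    min_serving = sum(s.serving for s in servers)
    for server in servers:
        server.begin_update()
        min_serving = min(min_serving, sum(s.serving for s in servers))
        server.finish_update(new_version)
    return min_serving


# Hypothetical three-server fleet.
fleet = [Server("fra", "v1"), Server("nyc", "v1"), Server("tyo", "v1")]
lowest = rolling_update(fleet, "v2")
```

The fleet as a whole never goes fully dark, but between `begin_update` and `finish_update` the server being updated serves nothing at all. That window is what monitoring picks up.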
What we've got:
- Our rolling updates require restarts which take down DNS and BGP.
- Even though BGP automatically re-routes users to the next nearest data centre, requests still fail until the route change has propagated, and in the meantime cache misses cannot resolve our public and internal domains correctly.
To tackle these problems, we decided to roll out another service that is intended to stay up 24/7, including during updates, and that can load balance traffic to other servers while the core services on the machine are updating or restarting. As the software load balancer (LB), which is a reverse proxy in this case, has to keep up with our current level of performance and security, we decided to go with open source software that is optimised for just that.
These reverse-proxy LBs have the benefit of being heavily optimised and written in languages like C++, which usually bring a performance gain large enough to compensate for the added latency.
Soon after implementing a first prototype of the new system, we also noticed that we could offload two of the heaviest and most CPU-intensive tasks to the LB: SSL/TLS termination and compression. This reduced our overall loading times by more than 50% and increased our infrastructure’s availability and reliability immensely.
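The failover behaviour of such a load balancer can be sketched roughly as follows. This is an illustration under assumptions, not the configuration or code of the software we deployed: the `Backend` class, the `pick_backend` function, and the backend names are hypothetical, and a real LB would set the `healthy` flag from periodic health checks.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    local: bool = False
    healthy: bool = True


def pick_backend(backends: list[Backend]) -> Backend:
    """Prefer a healthy local backend, else fall back to a healthy remote one."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backend available")
    local = [b for b in healthy if b.local]
    return (local or healthy)[0]


# Hypothetical pool: the core services on this machine plus two remote peers.
pool = [
    Backend("core-local", local=True),
    Backend("core-remote-1"),
    Backend("core-remote-2"),
]
before = pick_backend(pool).name

pool[0].healthy = False  # the local core is restarting for an update
during = pick_backend(pool).name
```

Because the LB itself keeps listening the whole time, clients keep a working endpoint while the core services behind it restart.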
Further DNS improvements
As you can imagine, it took us many hours of trial and error, but as you read above, we finally got this migration done. We seized the opportunity to further improve our in-house DNS by actively loading all zones into local memory, which speeds up load times and decreases latency. We used the tools from KeyCDN to measure our DNS performance, and the results were quite bad.
Even though our database servers are pretty fast, the server in Tokyo took almost a third of a second to respond to a single DNS query. This only happened in some circumstances, when a request reached a given server for the first time and the response was therefore served uncached. It took up to a couple hundred milliseconds to fetch all the data on the fly, adding massive latency spikes to the request.
This was due to the server making a full round trip across the ocean to Asia, then Europe, and into our core database server. With all zones actively loaded and cached, the 4 async database queries and their latency are skipped on the request path and offloaded into the abyss of the background services.
On average, we got it down from anywhere above 100 milliseconds to just under 6 ms on actively health-checked local requests.
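The idea of loading all zones up front can be sketched like this. It is a minimal illustration, assuming a single bulk fetch stands in for the database queries: `ZoneCache`, `fake_fetch_all_zones`, and the zone data are hypothetical, and in a real service `refresh` would be called periodically by a background task rather than once.

```python
from typing import Callable, Dict, Optional


class ZoneCache:
    """Keeps every zone in local memory so lookups never hit the database."""

    def __init__(self, fetch_all_zones: Callable[[], Dict[str, dict]]) -> None:
        self._fetch_all_zones = fetch_all_zones
        self._zones: Dict[str, dict] = {}

    def refresh(self) -> None:
        # Build the complete new mapping first, then swap it in with a
        # single assignment so readers never see a half-loaded state.
        self._zones = dict(self._fetch_all_zones())

    def lookup(self, zone: str) -> Optional[dict]:
        # Served purely from memory: no database round trip per query.
        return self._zones.get(zone)


# Stand-in for the slow, remote database; counts how often it is queried.
calls = {"count": 0}


def fake_fetch_all_zones() -> Dict[str, dict]:
    calls["count"] += 1
    return {"example.com": {"A": ["192.0.2.1"]}}


cache = ZoneCache(fake_fetch_all_zones)
cache.refresh()
answer = cache.lookup("example.com")
missing = cache.lookup("unknown.example")
```

The database is queried once per refresh, no matter how many DNS queries arrive in between, so the cross-ocean round trip no longer sits on the request path.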
Yet another massive improvement done, deployed, and completed. But we’re not stopping here.