Invasion of the Killer Retries

Friday, Sep 10 2021 in architecture

How can you bring your web application down? For example, by trying to make it more reliable.

Let services retry failed requests. This sounds sensible, until the service on the receiving end collapses under the weight of the retries. Circuit breakers and retries with exponential backoff can help prevent your services from taking each other down.
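Here is a minimal sketch of how the two techniques might fit together. It is not taken from any particular library: the names `CircuitBreaker`, `CircuitOpenError`, `backoff_delays` and `retry_with_backoff`, as well as every threshold and delay, are assumptions made up for illustration.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to send a request."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after N consecutive failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds (assumed value)
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, not calling the service")
            # Cool-down elapsed: allow one trial request. A single failure
            # will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def backoff_delays(attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: a random fraction of an exponentially
    growing window, capped at `cap` seconds."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


def retry_with_backoff(func, attempts=4, sleep=time.sleep, **kwargs):
    """Call func; on failure wait and try again, at most `attempts` retries."""
    delays = backoff_delays(attempts, **kwargs)
    while True:
        try:
            return func()
        except CircuitOpenError:
            raise  # the breaker is open: retrying would only add load
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries, surface the last error
            sleep(delay)
```

With a hypothetical `fetch_tickets` function, a caller could write `retry_with_backoff(lambda: breaker.call(fetch_tickets))`. The random jitter spreads retries out so that many clients do not all come back at the same instant, and the breaker stops the retries altogether once the service looks down.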

Other times it’s your human clients who generate retry storms. I was on a team developing a web ticket shop for shows and club nights. When the ticket shop loaded slowly, frustrated users hit reload. This sent extra HTTP requests, increased the load on the server and slowed down the app even more.

Then we found out about killer searches. Some search queries made the server run out of memory. The user who submitted the search got back a 502 Bad Gateway error page. The frustrated user resubmitted the same search query. The load balancer routed the request to the next server and boom!, the next server crashed as well. The user kept resending the request until they killed all servers and caused a general outage.

Unfortunately, you cannot install a circuit breaker on your users: solving these issues requires UI changes. A loading indicator eases the anxiety caused by an unresponsive page and deters users from reloading. I imagine the message telling you to wait on some payment sites is also there to reduce the server load.

The server health check configuration worsened the outage: when the application ran out of memory, it failed health checks. This triggered a replacement of the whole virtual machine. Shutting down the virtual machine and starting another one to replace it took a lot more time than killing and restarting the application process would have. On the server side, you can also give up on requests that have been stuck for too long (load shedding). This article from the Amazon Builders’ Library gives a lot of details. On AWS, you can try enabling the least outstanding requests routing algorithm, which sends each request to the server with the fewest requests in flight.
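To make the last two ideas concrete, here is a rough sketch. It shows the principle only, not how AWS or any real load balancer implements it: `SheddingQueue`, `least_outstanding` and the two-second limit are hypothetical names and values chosen for the example.

```python
import time
from collections import deque


class SheddingQueue:
    """Hypothetical request queue that drops work that has waited too long."""

    def __init__(self, max_wait=2.0, clock=time.monotonic):
        self.max_wait = max_wait  # seconds a request may wait (assumed value)
        self.clock = clock
        self._items = deque()
        self.shed = 0  # number of requests given up on

    def put(self, request):
        # Remember when the request arrived.
        self._items.append((self.clock(), request))

    def get(self):
        """Return the next request still worth serving, or None if empty."""
        while self._items:
            enqueued_at, request = self._items.popleft()
            if self.clock() - enqueued_at <= self.max_wait:
                return request
            # Too old: the user has probably reloaded or left already.
            self.shed += 1
        return None


def least_outstanding(in_flight):
    """Pick the server with the fewest in-flight requests.

    `in_flight` maps a server name to its current number of open requests.
    """
    return min(in_flight, key=in_flight.get)
```

For example, `least_outstanding({"a": 4, "b": 1, "c": 7})` picks `"b"`. A server that is stuck on a killer search accumulates open requests, so it naturally receives less new traffic than its healthy neighbours.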

Keeping a web application running requires dealing with use cases and behaviour that you’ve never anticipated, and it’s only with hindsight that you can keep making your systems more resilient.