The following is a root cause analysis and postmortem for the outage Pressable experienced during the week of January 19th, 2015.
Root Cause Analysis: What caused this outage?
Ultimately, this outage was caused by a well-crafted attack on our systems. The attack was a variant of the "slowloris" attack discovered in 2009. Slowloris opens many connections to a web server and keeps them open as long as possible by sending partial HTTP requests very slowly, until the server has no connections left for legitimate visitors. The attack went undetected for so long because of the way it was executed. Since discovering this variant, we have been working with various security professionals and teams to learn more about it and to find ways to mitigate it faster.
What was so special?
Two things contributed to the success of this attack. The first was the way the IP addresses for our two clusters were configured on our F5. Due to the nature of our needs, the F5 was configured in what's commonly known as "Bridge Mode". This means the F5 was not set to inspect any HTTP or HTTPS traffic.
The second was how the requests were made. This attack came from multiple IP addresses and targeted multiple sites hosted on our systems. Because the requests were made to active domains and paths, the traffic looked like legitimate requests that we were failing to respond to.
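All of our monitoring systems and graphs seemed to indicate that we were underpowered or had reached capacity, but that was not the case: our servers were busy holding open the attacker's slow connections rather than serving real traffic.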
We will continue to work on our infrastructure and architecture. We still have some big announcements to make on that front, but we'll save those for later.
Who was responsible?
We are still investigating who the perpetrators are, and if there is any legal recourse available to us.
We have taken two main steps in response. First, we've patched all of our systems and made changes to our F5 so that this attack, or variants of it, will be stopped before they can do much damage.
Second, it has become abundantly clear that our communication channels and processes are not as effective or transparent as they could be. Many of our customers simply wanted to know that we were aware of the situation and working to fix it. To allay concerns, we opened a chat room where we gave real-time updates. Many of you welcomed it, so we've decided to keep it open permanently. For the security and privacy of our customers, we'll be making some changes to it first, and then we'll invite all of you to join us next week.
Our Sincerest Apologies
We’d like to extend our sincerest apologies to everyone affected by this outage. Our entire team worked tirelessly through the last couple of weeks to handle the situation as best we could given the extreme circumstances.
We would like to thank all of the customers and WordPress community members who reached out to us and showed their support for our team. Your encouragement and kindness were truly appreciated and reminded us of the generosity of our community.