It’s tragic. It’s unthinkable. It’s Facebook down.
What do we do with Facebook down?
It was monumental. It took my breath away. It caused me to tweet. At some point around noon, it happened. Stop the presses, stop whatever you are doing, because we have Facebook down! While the outage was relatively quick, it did NOT go unnoticed by me or by millions of other people around the globe. According to a statement issued by Facebook, the site was inaccessible for about two and a half hours, the worst outage they have had in over four years (although FB has been around much longer than that, but I digress).
Apparently, the cause was an internal technical issue rather than something sexy like a cyber attack. While we are unimpressed with Mark and company’s lack of a business continuity plan (and would suggest they give us a call), we do applaud them for their speedy and transparent response to the public. Below is an excerpt from their statement, or click for the full statement from Facebook:
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.
Today we made a change to the persistent copy of a configuration value that was interpreted as invalid. This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.
To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key. This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they were causing even more requests to themselves. We had entered a feedback loop that didn’t allow the databases to recover.
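For the technically curious, here is a rough sketch of the feedback loop the statement describes, next to one possible guard against it. This is not Facebook’s code. Every name in it (`FlakyStore`, `StoreError`, `is_valid`, `naive_read`, `GuardedClient`, the `cooldown` parameter) is hypothetical, and the validity rule is invented purely for illustration. The naive reader treats every error as an invalid value and queries the store again on every read. The guarded reader keeps serving the last known good value and waits out a cooldown before it queries again.

```python
import time


class StoreError(Exception):
    """Raised when the (hypothetical) persistent store cannot answer."""


class FlakyStore:
    """Stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, value, failures=0):
        self.value = value
        self.failures = failures  # number of initial queries that fail
        self.queries = 0

    def get(self, key):
        self.queries += 1
        if self.failures > 0:
            self.failures -= 1
            raise StoreError("database overloaded")
        return self.value


def is_valid(value):
    # Invented validity rule for this sketch: a positive integer.
    return isinstance(value, int) and value > 0


def naive_read(cache, store, key):
    """Any problem, including a store error, is treated as an invalid value."""
    value = cache.get(key)
    if is_valid(value):
        return value
    try:
        value = store.get(key)
    except StoreError:
        cache.pop(key, None)  # delete the key, so the next read queries again
        return None
    cache[key] = value  # may itself be invalid, so the next read queries again
    return value


class GuardedClient:
    """Serves the last good value and backs off instead of re-querying."""

    def __init__(self, store, cooldown=5.0, clock=time.monotonic):
        self.store = store
        self.cooldown = cooldown
        self.clock = clock
        self.cache = {}
        self.last_good = {}
        self.retry_at = {}

    def read(self, key):
        value = self.cache.get(key)
        if is_valid(value):
            return value
        now = self.clock()
        if now < self.retry_at.get(key, float("-inf")):
            return self.last_good.get(key)  # cooling down: no query sent
        try:
            fresh = self.store.get(key)
        except StoreError:
            fresh = None  # an error is not proof that the value is invalid
        if is_valid(fresh):
            self.cache[key] = fresh
            self.last_good[key] = fresh
            return fresh
        # Store failed or returned a bad value: wait before asking again.
        self.retry_at[key] = now + self.cooldown
        return self.last_good.get(key)
```

With a store that holds an invalid value, one hundred calls to `naive_read` send one hundred queries, while one hundred calls to `GuardedClient.read` within the cooldown window send only one. A real system would likely add jitter to the cooldown so that clients do not all retry at the same instant, but that is beyond this sketch.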
Technology has changed the way the world works, in many ways for the better. However, the World Wide Web and social media sites like Facebook itself have been both a blessing and a curse. Gone are the days when a business could go down without people noticing and voicing their dissatisfaction. Today, more than ever, businesses need to be prepared with a solid and proven continuity plan, because downtime is no longer acceptable. Don’t be the next Facebook... or United, or NYSE... and the list goes on. Contact a business continuity expert for a free consultation.