Wake Up Call
My ringing cell phone woke me a little after 4:30 in the morning on Monday, September 4th. Through the fog of sleep, it seemed at first like a particularly irritating dream noise, but it resolved rapidly into a persistent call to consciousness and the unpleasant reality those middle-of-the-night calls always represent. The power had failed, the backup generator hadn't started, and the entire Volusion platform was offline.
Our incident response process was already in motion by the time I was called. (Engineers always get called first, for good reason.) Most of the Infrastructure team was already gathered on the incident conference bridge, and our local network engineer was just pulling into the data center parking lot, where he found the co-location operations team working on the generator. While we waited for them to get the generator running, we continued to telephone on-call resources from every team, so we’d all be ready to begin a cold start of the Volusion platform once electricity was restored. Each time another person connected to the conference bridge, they’d get the brief summary of the situation, and join the rest of us in holding our collective breath while listening to the repair work in progress.
An hour and 40 minutes after the outage began, the generator finally started, just as commercial power was restored. Our cabinets were energized one by one. Network devices and storage arrays started automatically, WAN and Internet circuits came up, and SQL servers booted and began their checkpoint recovery. Web servers and the myriad supporting platform elements like DNS and authentication followed. Cold starts are always graceless affairs, but two hours later all database servers and over 90% of our web servers were online and taking traffic. It was an imperfect but substantial improvement.
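The ordering above matters: storage and network have to be up before the databases, and the databases before the web tier and everything that depends on it. As a minimal sketch of that kind of dependency-ordered cold start (hypothetical service names, not our actual tooling), a topological sort over a dependency map produces a safe start sequence:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
deps = {
    "storage": [],
    "network": [],
    "sql": ["storage", "network"],
    "dns": ["network"],
    "auth": ["sql", "dns"],
    "web": ["sql", "dns", "auth"],
}

def cold_start_order(deps):
    """Return a start order in which every service follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())

order = cold_start_order(deps)  # e.g. storage and network first, web last
```

Real platforms rarely get to run this cleanly, of course; as we saw, services that start before their dependencies are ready behave in surprising ways.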
With any power event, there's a risk of subsequent equipment failure. Whether due to surge or brownout conditions, or just a return to ambient temperature, equipment that has run for years sometimes just doesn't behave the same way once stopped, if it starts at all. In our case, two chassis of blade servers, totaling 32 blades, powered back up with only six blades operational between them. One entire chassis stopped communicating internally, and ten of the sixteen blades in the other chassis had corrupt boot drives.
This drastically reduced capacity for one of our virtualization clusters and prompted the long, tedious process of redistributing load to other, fully functional clusters in the environment. Because the now-tiny group of hosts was already overwhelmed by shopper traffic, it could devote only the smallest portion of its resources to migrating load away, greatly extending the time required to stabilize the remaining customer storefronts.
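At its heart, that redistribution is a packing problem: place each displaced workload on whichever healthy cluster has the most headroom. A toy sketch of the greedy version, with made-up workload sizes and cluster capacities rather than anything from our environment:

```python
def redistribute(workloads, clusters):
    """Greedy placement: assign each workload (largest first) to the
    cluster with the most free capacity. `workloads` maps name -> size,
    `clusters` maps name -> available headroom."""
    placement = {}
    free = dict(clusters)
    for name, size in sorted(workloads.items(), key=lambda kv: -kv[1]):
        target = max(free, key=free.get)
        if free[target] < size:
            raise RuntimeError(f"no cluster can host {name}")
        free[target] -= size
        placement[name] = target
    return placement
```

The hard part in practice isn't the placement math; it's that each live migration consumes CPU and network on the already-saturated source hosts, which is exactly why ours went so slowly.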
Migrating load away from the impaired hosts was just one of several parallel efforts to restore capacity in that area. Another focus was reviving the dead blade chassis, which, after hours of diagnostics supported by the manufacturer, ended with an order for replacement components to be delivered the following day. That left working through the ten blades with corrupt boot drives one by one, reinstalling their hypervisor images and returning them to service.
Plans and Features
While one group of Volusion engineers struggled with dead server blades, another group was working on the plan/feature problem. When some of our web servers started up before the subscription entitlement database was online, the application started with the default feature set, causing many of our merchants to be prompted to upgrade to access search, reporting, and other features. Once we figured out the cause of those issues, Volusion engineers tracked down a bulk live-update tool that hadn’t been needed in years. Soon afterward, all merchant stores were working with the correct features enabled.
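The root cause here was an ordering assumption: when the entitlement database wasn't reachable at boot, the application silently fell back to the default feature set. One common guard against this class of bug, sketched below with hypothetical names rather than our actual code, is to block startup behind a bounded health-check poll and fail loudly rather than degrade silently:

```python
import time

def wait_for_dependency(check, retries=30, delay=2.0):
    """Poll a dependency health check before serving traffic.

    `check` is a callable returning True once the dependency (here, the
    entitlement database) is reachable. Raising when it never comes up is
    safer than silently starting with a default feature set.
    """
    for _ in range(retries):
        if check():
            return True
        time.sleep(delay)
    raise RuntimeError("dependency unavailable; refusing to start with defaults")
```

A server that refuses to start is an obvious, actionable failure; a server that starts with the wrong features looks healthy on every dashboard while quietly breaking merchant stores.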
Eight and a half hours after my phone first rang, the last few web servers offered up green status indicators. We had several more hours of load redistribution to work through, and although there were some hiccups, the worst was behind us. Our customers were amazingly patient and understanding, and those who helped us identify and debug issues as services recovered made a huge difference. You have our gratitude, and our apologies for the interruption to your businesses.
Justifiably, every one of our valued customers wants to know how we’ll prevent this from happening in the future. We’ve already been talking about it for a while, and you’ve probably read my other blog posts on this topic already. The answer is to redouble efforts in the migration to Google Cloud Platform. There, Volusion services won’t be taken down by a single generator failure - Google deploys them in large redundant fleets. We won’t struggle through distributing load around a couple dozen dead servers - Google deploys them by the tens of thousands. Storage and network elements are built with the same levels of robust redundancy. Cooling faults, UPS malfunctions, and Internet circuit flaps will no longer stop your shoppers from checking out.
As I’ve mentioned before, to do better, we need to leverage the operational scale that Google offers - we simply can’t do it on our own. I’m more convinced than ever this is the right move, and the right time truly is right now.