On Wednesday, we had a power outage in the Maricopa Data Center. Here is what I sent out the next morning:
Dear Maricopa,
In case you have not already been informed through other sources, yesterday the District Office computer room suffered a power outage at approximately 12:30pm due to a malfunction between the Uninterruptible Power Supply and the Power Distribution Unit. Ironically, these devices are used to ensure a proper response to power fluctuations, but in this case they were the source of the problem, not the solution. Due to some complications in resetting the UPS and PDU, the power was out to the entire data center for approximately 3 hours.
Once power was restored, ITS went through a careful process of bringing services back online and ensuring proper operation of the applications. This process takes about 2 hours of clock time. Almost all services were completely restored by 6pm. We had some lingering issues today caused by the aftereffects of computers and databases suddenly losing power. To my knowledge, these are now fixed.
As is always the case, we will take away lessons from this experience. We will make changes to the ways the UPS and PDU can be powered on and off. We will sharpen our communication, because we learned yesterday that when both phone and email are completely out, we need to have cell phone numbers and alternate email accounts on standby. There will be other changes required as well.
We can also use this as a message that Maricopa needs a second data center. Fortunately, we had already identified and acquired a property for this use, and will begin to make plans and allocate budget to make it a reality in the months ahead.
It probably goes without saying, but we regret yesterday's outage, and we in ITS pledge to use it as a learning experience to do a better job in the future.
***********************************************************************
Now it is a couple of days later, and I have these further comments:
1. I am really proud of the core service providers in ITS: the DBAs, the helpdesk, the server team(s), the network team, the email team, and the applications folks. Presented with an unexpected wrinkle, they handled it with professionalism. At one level, it was an unscheduled test of “what happens if someone pulls the plug?” We found out some things, and we had people back in business PDQ.
2. We learned that having the phones down is a bad combination with any other outage. I’ve been exploring external services that would send SMS messages to a pre-determined group. So far the one that looks promising for me to use personally is Communicator from Clickatell.com. Stay tuned for a progress report.
3. The outage confirms a belief I have held for a long time: the problem you prepare for is not the problem that is going to happen. So now we will prepare for having the phones out at the same time email is out. However, I bet the outage gremlins have something else in store for us.
4. As I said in my note, I am glad we have begun the process to bring up a second data center. One thing that we don’t talk about that much, because it sounds like excuse-making, is that it costs money to never fail. In the months ahead we will no doubt be talking more about investing money to reduce our outages. However, I’ll predict that when we look at the total cost of doing all the things that make sure we never have an outage, we will surely decide to live with some risk so that we can keep our investments at a frugal level. Stated simply: Google and Amazon can afford some strategies that we can’t.
5. The biggest lesson learned is that work is more enjoyable with technology tools. Once the data center went down, time dragged a bit for most employees. We had things to do, but the stuff we needed to do the most required technology. In a way, that’s reassuring. We are on the right track.