Saturday, December 3, 2011

Stay Connected When Disaster Strikes

Remote monitoring and control systems enable you to troubleshoot your data center from afar. Here's how to ensure you can connect to them in an emergency.

Last week I talked a bit about how important it is to use and maintain whatever remote monitoring and control systems you have available -- and why it's a good idea to drop a few extra bucks here and there to add control and visibility to your data center.
But monitoring and control systems are useless if they're inaccessible or rendered moot when trouble strikes. You can configure email alerts from your UPS and AC units all you want -- if the mail relay they use is offline, those notifications won't go anywhere. The same is true for data outages. There's nothing worse than a trouble situation that also drops the data connection, leaving you with no idea how the data center is weathering the storm because you can't see anything.
[ For best practices on how to set up remote monitoring and control systems to begin with, see Paul Venezia's "Troubleshoot your data center from the easy chair." | See if you match this profile: "Nine traits of the veteran Unix admin." ]
There are several key points to inspect when thinking about how to eliminate this problem. The first may be outside your control: how your data circuits are prepared for power outages and circuit cuts. I've seen many situations where a company has full power protection for all the gear in its data center, but down in the basement, a fiber transceiver or carrier termination unit is plugged into a $5 power strip that connects directly into mains power. If the building loses power, the data center may have generator backup, but that doesn't matter because the critical component has no juice, and the bits do not flow. This could easily be prevented with even a small UPS. You might be surprised at how many hours a 500VA UPS can run a fiber transceiver, and in this situation, you'd be very grateful.
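To put a rough number on that claim, here's a minimal back-of-the-envelope sketch in Python. The battery size, derating factor, and load figures below are assumptions for illustration, not vendor specs -- your UPS's runtime chart is the real authority.

# Rough runtime estimate for a small UPS carrying only a fiber transceiver.
# All figures are assumptions: a 12V 7Ah battery (typical for a 500VA unit),
# a derating factor for inverter losses and battery aging, a modest load,
# and a few watts of inverter overhead.
battery_wh = 12 * 7            # nominal battery energy: 84 Wh
usable_wh = battery_wh * 0.7   # derate for inverter losses and aging
load_watts = 8                 # small fiber transceiver or media converter
overhead_watts = 5             # the UPS's own inverter draw, roughly

runtime_hours = usable_wh / (load_watts + overhead_watts)
print(f"Estimated runtime: {runtime_hours:.1f} hours")  # about 4.5 hours

Even with conservative assumptions, that's several hours of connectivity you wouldn't otherwise have.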
The next area of preparation involves multiple data circuits and internal routing. It's a fantastic idea to add a business-class cable or DSL connection to your Internet connectivity. While not as reliable as a fiber drop, it will at least stand a chance of being operational when the main connection evaporates due to carrier problems or Big Joe with the backhoe. I've seen many cases where this supplemental circuit is brought into the data center and source routing at the core is used to push basic Internet browsing across that pipe, leaving the more expensive and reliable circuits to handle VPN and business-critical communications. That's fine, but if the main circuit disappears at 2 a.m., how can you remotely access the data center through the secondary circuit?
Simple: You have to prepare for that eventuality beforehand. In some cases, it makes sense to configure VPN termination on the firewall protecting that link; in other cases, it makes more sense to build a VM that routes through that circuit -- either natively or via source routing -- and poke a hole in the firewall to allow remote access. I generally do that with a Linux VM and allow SSH so that I can tunnel in quickly from wherever, including a friend's house, after downloading PuTTY or using the native ssh client in Mac OS X.
In some instances, I've even used GuruPlugs since they sip power and can easily provide a critical pivot without relying on anything but the connected switch. It's the best $100 you can spend in that case.
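Whatever box ends up being that pivot, test the back door before you need it. Here's a minimal sketch that simply checks, from outside, whether something is still answering on the emergency SSH path; the address and port are placeholders for whatever your secondary circuit presents, and a successful TCP connect is no substitute for an occasional real login.

#!/usr/bin/env python3
"""Confirm the emergency SSH pivot is reachable from outside the site.

Run this from a machine that is NOT behind the circuits being tested
(a home box, a cheap VPS, and so on). The host and port are placeholders.
"""
import socket
import sys

EMERGENCY_HOST = "198.51.100.10"   # placeholder: backup circuit's public IP
EMERGENCY_PORT = 22                # placeholder: forwarded SSH port

def path_is_up(host, port, timeout=10.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if path_is_up(EMERGENCY_HOST, EMERGENCY_PORT):
        print("emergency SSH path is reachable")
        sys.exit(0)
    print("emergency SSH path is DOWN -- fix it before you need it")
    sys.exit(1)

Cron that from somewhere outside your network and feed the nonzero exit status into whatever alerting you already trust.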
Having that emergency chute can mean the difference between being able to successfully troubleshoot a data connectivity problem with your carrier's support techs remotely and having to tell them you don't even know if your main firewall has power. In some cases, that lack of information will cause the carrier to require positive acknowledgement of firewall availability before they dispatch technicians to look for problems elsewhere along the circuit path, costing time and money.
Of course, if supplemental circuits aren't available for whatever reason, you should at least be able to get an analog phone line and dust off a few modems. It might be archaic, but even 33.6kbps beats 0kbps when you're trying to see what the heck just happened.
Also, take pains to ensure that your monitoring systems have sufficient safeguards in place to allow them to send email or SMS messages when problems occur. Tacking them all onto a single mail relay isn't a great idea -- especially if that relay is external to the site. A better idea is to use local relays (maybe even that Linux VM) for this sole purpose and configure those relays to send mail through several data paths.
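As a sketch of what that can look like, here's a small example that walks a list of relays in order and gives up only when all of them fail. The relay names are placeholders, and the assumption that each relay is reached over a different data path is a routing decision made outside the script.

#!/usr/bin/env python3
"""Send an alert through the first mail relay that accepts it.

The relay hostnames below are placeholders; the idea is that each one is
reached over a different path (main circuit, cable/DSL backup, etc.).
"""
import smtplib
from email.message import EmailMessage

RELAYS = [
    ("relay-primary.example.com", 25),  # local relay on the main circuit
    ("relay-backup.example.com", 25),   # relay reached via the backup circuit
]

def send_alert(subject, body, sender, recipient):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)

    for host, port in RELAYS:
        try:
            with smtplib.SMTP(host, port, timeout=15) as smtp:
                smtp.send_message(msg)
            return True                 # delivered to this relay
        except (OSError, smtplib.SMTPException):
            continue                    # that path failed; try the next one
    return False

if __name__ == "__main__":
    ok = send_alert("UPS on battery", "Main feed lost at 02:04.",
                    "monitor@example.com", "oncall@example.com")
    print("sent" if ok else "all relays unreachable")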
Nothing is 100 percent foolproof. But tracing your remote access chain from your gear to your house and making it as stable and resilient as possible is far more Gallant than Goofus.
