Amazon's cloud was struck by lightning earlier this week. And that's the truth. On Wednesday evening at about 6:30pm Pacific time, some Amazon cloud sitters saw their floating servers disappear - and yes, the company blamed the temporary outage on a lightning strike. According to a web post from the company, the strike zapped a …
I wondered how they managed to offer virtual hosting cheaper than my company! It's because we've wasted money on UPSes, dual redundant feeds, etc. Next time we set up a datacenter I'll remember: one of everything, and then pray/apologise as appropriate.
Why can't people just be honest about how much things cost and tell everyone: cheap service == poor service, pricey service == reliable service?
So Amazon has never consulted with a company to provide its servers with battery backup units and backup generators?
I did have trouble visiting a few Amazon-hosted sites during this time period - they were dead - but I have no way to confirm or deny that the outages were related to this. Up until now, though, those sites had never been down.
Bravo Amazon, keep up the work.
Re: Dave 10 & James Woods
Do you guys even know what a PDU is and where it falls in the distribution of power?
A PDU is generally a rack-level piece of equipment - a glorified outlet strip - that sits between the UPS/generators and the servers. While it is uncommon for a PDU to fail, it can happen - usually when it is a managed PDU and the management board takes a surge on the Ethernet interface and decides it is supposed to shut down.
The REAL question is what else in their system failed that allowed a surge to get that far into the network...
Re: Re: Dave 10 & James Woods
Care to explain how lightning managed to hit a rack? The type of PDU that got hit is most likely of the very large, outdoor sort that takes power from the utility provider and routes it to various points in the building. The better question is why the UPS and generator didn't kick in - unless it was a lower tier of hosting that didn't include that sort of protection.
A good hit...
.... can bring down machines no matter how well they are shielded / protected. Lightning is powerful enough to do that sort of thing if you get a direct hit.
Get real, complainers
"The REAL question is what else in their system failed that allowed a surge to get that far into the network..."
EVERYTHING failed; it was a direct lightning strike. If you think you have a surge protector that works against that, I suggest you get yourself to the patent office straight away.
"The better question is why didn't UPS and generator kick in"
Oh come ON. Lightning doesn't cause a power INTERRUPTION; it eats your infrastructure for breakfast. Nothing is going to protect your data center from a lightning strike. Oh, so you have a UPS system? Well, when the massive current fuses every metallic component in your UPS into a giant conductor, fat lot of good that will do you.
Does lightning knock out split-site resilience
Does lightning knock out split-site resilience where the two sites are miles (or more) apart, or do Amazon cloud customers simply not know about - or not care about - the kind of things that real IT systems have been doing for decades?
Lightning? Cloud? No I can't be bothered either but surely someone could make something of it?
Not an outage - debunking the hype
Amazon EC2 is not a normal ISP. Given a proper architecture, clients shouldn't worry about servers going down on occasion; they should already have processes in place to replace failed instances with fresh ones.
I've ranted more on this topic here: http://alestic.com/2009/06/ec2-non-outage
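The "replace failed instances with fresh ones" pattern above can be sketched in a few lines. This is a toy, self-contained illustration: the `Fleet` class, its instance IDs, and the `reconcile` method are hypothetical in-memory stand-ins, not the real EC2 API (which would use describe/terminate/run-instances calls).

```python
# Toy sketch of the "replace, don't repair" pattern. All names here are
# illustrative stand-ins for real EC2 API operations.
import itertools


class Fleet:
    """Hypothetical fleet: maps instance id -> healthy flag."""

    def __init__(self, size):
        self._ids = itertools.count(1)
        self.instances = {self._launch(): True for _ in range(size)}

    def _launch(self):
        # Stand-in for launching a fresh instance.
        return f"i-{next(self._ids):04d}"

    def mark_failed(self, instance_id):
        # Stand-in for a failed health check (e.g. after an outage).
        self.instances[instance_id] = False

    def reconcile(self):
        """Terminate unhealthy instances and launch replacements."""
        failed = [i for i, ok in self.instances.items() if not ok]
        for instance_id in failed:
            del self.instances[instance_id]         # "terminate"
            self.instances[self._launch()] = True   # launch a fresh one
        return failed


fleet = Fleet(size=3)
fleet.mark_failed("i-0002")    # simulate a zapped server
replaced = fleet.reconcile()
print(replaced)                # ['i-0002']
print(len(fleet.instances))    # 3 - fleet is back to full strength
```

The point of the pattern is that the fleet's desired state (three healthy instances) is restored automatically; no one tries to resurrect the struck server.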
Many years ago
I used to work in Grantham telephone exchange. It had a diesel standby generator, but that didn't mean it worked...
Also many years ago, we had the bearings on an 80 kW motor fail so catastrophically that the rotor shorted out the windings. It tripped every overload back to, and including, the main site incomer (north). The emergency generator started up on auto, ran up to speed and connected into the system - then cut out on overload. It did this twice more on auto before locking out. We started to check the system before connecting to the alternative incomer (south - what else?), and found a dead short: the distribution panel in the original fault path had all the terminals (three-phase) fused together as one solidified lump of molten metal.
I can imagine a direct lightning strike being somewhat worse!
It's up to the customer to plan for the worst too
To the uninformed....
I've worked in data centres where we've had N+1 and more resilience on every aspect of the infrastructure. However, a well-placed lightning strike, chiller leak, roof collapse, car crash or other act of nature/technology can cause the unexpected to happen.
Ultimately it is down to the customer to have business continuity/contingency plans for these events, however unlikely they may be. As Eric Hammond rightly states on his blog:
"A well designed architecture built on top of EC2 keeps important information (databases, log files, etc) in easy to manage persistent and redundant data stores which can be snapshotted, duplicated, detached, and attached to new servers."
If your application is mission critical then you need to spread it around.
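The snapshot / duplicate / attach workflow that the quote describes can be sketched as follows. This is a hypothetical in-memory model - `Volume`, `snapshot`, and `restore` are illustrative stand-ins for EBS volumes and snapshots, not the real AWS API.

```python
# Toy model of "snapshot, duplicate, detach, attach to a new server".
# Everything here is an illustrative stand-in for EBS operations.
import copy


class Volume:
    """Hypothetical persistent data store attached to one server."""

    def __init__(self, data):
        self.data = data
        self.attached_to = None

    def snapshot(self):
        # Point-in-time copy, independent of later writes to the volume.
        return copy.deepcopy(self.data)


def restore(snapshot, server_id):
    """Build a fresh volume from a snapshot and attach it to a new server."""
    volume = Volume(copy.deepcopy(snapshot))
    volume.attached_to = server_id
    return volume


primary = Volume({"orders.db": b"...", "app.log": b"..."})
snap = primary.snapshot()               # taken before the lightning strike
replacement = restore(snap, "i-new01")  # attach to a freshly launched server
print(replacement.attached_to)          # i-new01
```

Because the snapshot is independent of the original volume, losing the original server (or the whole facility) costs only the data written since the last snapshot - which is exactly why the commenter says mission-critical applications need to be spread around.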