Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night. But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical …
Recovery procedures. Ha!
Almost no company has complete ones, and my guess is, nobody gets to test them.
Last week I just suffered one of these situations.. and guess what.. no reliable backup, no good plans..
"gear is tested weekly"
I remember a telco that used to do that. Every Monday the generators were started to make sure all was well. Then they were stopped. No-one considered that this was just like doing lots of short journeys in a car. When the big power cut arrived, the generators started perfectly. Then, after a few minutes, the hot and thoroughly coked-up cylinders started to misfire, and the engines packed up. It took a full cylinder head-off clean and rebuild to get them running again.
After that they were only tested once a month, and allowed to run for an hour or so to reach full temperature each time.
Well, if nothing else, you've got to give them credit for putting out a detailed report. Bit of a bitch when the failover system itself fails.
Pretty historical stuff
"The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles."
It's also the first time in human history such cloudy juggling at a large scale is being attempted. Just keep on trucking!
As for the 2012-leapsecond-linux-bug, I wonder how many Linux servers are still running full blast in the various datacenters right now. I hear our hosting company noticed a fat uptick in watts consumed when the bug hit.
Precisely 20 minutes or about 20 minutes?
To quote " ... which lasted precisely about 20 minutes ... " or was that how long it takes to drag a sheep from one end of the data centre to the other??
Did you pay extra for the high-availability sheep-dragging plan?
> Did you pay extra for the high-availability sheep-dragging plan?
Baaaargin! Unless they're pulling the wool over your eyes which would be shear lunacy...
Re: Ewe bet.
Well played Sir, Well played.
Too Big to be Economically-and-Realistically-Tested
One of the problems of mega-big systems like this is that it's not economically feasible to test them at full size and load - you'd have to duplicate the mega-big system to do that.
The reason such full-size and full-load testing is necessary, is to find bugs which only rear their ugly heads under high, complex loads.
I'm not criticizing Amazon (or Google, etc.) for this. It's simply the nature of mega-big systems.
Accordingly, when you're considering moving some-or-all of your company's processing and storage to a public cloud, you'd better figure in the cost of downtime to your company, along with a-higher-probability-of-outage than you'd ordinarily "think" would be likely.
Re: Too Big to be Economically-and-Realistically-Tested
Are there any studies showing that the public cloud goes down more often than private internal ones? Lots of internal networks are much less robust than Amazon's data centres.
When Amazon goes down for an hour, it's news; when some 500-person company goes down for 12 hours, news of it never leaves the building.
Re: Too Big to be Economically-and-Realistically-Tested
Totally agreed; good advice.
One addendum: agreed, you can't feasibly (economically) test with a production-size test system. However, you can (and should) test various scenarios in smaller environments, which might at least have surfaced the software problems mentioned. (Of course, it might not, as these problems may only occur in more massive environments.)
They should make their own power
Which is what many large industrial customers do anyway. Then you use the grid for the times your power plant dies. You can make power for about the same cost as grid using 100% nat gas. Then you get 0.001% times 0.001% failure likelihood. Actually not that good, cause real big events might knock everything out, I guess.
The reliability of the North American grid is dropping over the next decade as more destabilizing wind comes on line. So it's wise to plan for that now.
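For what it's worth, the "0.001% times 0.001%" arithmetic above can be sketched in a few lines. This is only a back-of-envelope illustration: the function name and the `p_common` parameter are made up here, the 0.001% figures are the commenter's, and the independence assumption is exactly what the "real big events" caveat undermines.

```python
def both_sources_down(p_plant: float, p_grid: float, p_common: float = 0.0) -> float:
    """Chance that the on-site plant and the grid are down at the same time.

    p_common is a hypothetical probability of a single event (say, a major
    storm) taking out both sources at once; apart from that, the two sources
    are treated as independent.
    """
    for p in (p_plant, p_grid, p_common):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_common + (1.0 - p_common) * p_plant * p_grid


p = 0.001 / 100  # 0.001% expressed as a fraction
independent = both_sources_down(p, p)
with_common_cause = both_sources_down(p, p, p_common=1e-6)  # assumed value

print(f"independent:       {independent:.1e}")
print(f"with common cause: {with_common_cause:.1e}")
```

With these numbers the independent case comes out around one in ten billion, but even a one-in-a-million common-cause event swamps it, which is why the headline figure is "actually not that good".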
Re: They should make their own power
Yeah, maybe a personal nuclear power station would be handy.
"even after some of the best minds on the planet have tried to think of all the angles."
"Best minds"? Eh? Assumes facts not in evidence ...
Back when I worked for Bigger Blue (late 1970s), on Fabian Way in Palo Alto, we'd kill the mains power at 3PM on the last Friday of every month to ensure that the battery would carry the load long enough for the genset to warm up enough to take over. It ain't exactly rocket science. (In the event of failure, everyone went home early with two hours pay. It never happened while I worked there).
Keep in mind that this system ran the entire Ford Aerospace campus in Palo Alto, not just the computers. (OK, nearly the entire campus ... The fine folks in the Faraday Cage over on East Meadow Circle and across Adobe Creek on West Bayshore had their own solution ...).
"mains" or "genset" -> battery -> motor-generator -> building/campus power
My personal systems work the same way, and haven't been down since TehIntrawebTube's variation of "Flag Day" (January 1st, 1983).
Re: "even after some of the best minds on the planet have tried to think of all the angles."
+1 for testing DR on live systems... the only way to make sure they're really working.
The cloud depends upon external infrastructure, so do not rely upon it staying available.
I hear that natural gas fuel cells are getting good now, so they may make more sense than diesel or gas turbine generators, ... unless the gas is electrically pumped and you have no on-site gas tanks (preferably underground).
I've learned my lesson.. from last year and the year before that..
.. still with AWS -- nothing really compares with the maturity of the offer and the price so far -- but we don't have any illusions.. we've backed up all our critical files on the other side of the planet. We'd still be down if the worst happened, but at least we have another set of really remote backups. Google Compute will probably take another 2-3 years to be 'good enough' to even be considered.
Super reliable power supply breeds complacency
North America enjoys a pretty reliable power supply but when it goes down, it really makes a job of it.
I was in Toronto when the Northeast US and southern Canada suffered a blackout, a few years ago, that lasted up to 5 days. Generator sales went crazy, even the local butcher had two going to keep his refrigerators working.
Countries that experience failures on a regular basis are usually well prepared - with either portable or standby power units ready for service. High-rise apartment buildings in Ho Chi Minh City list features such as swimming pools and sun decks, and emergency power is always among them.
Many moons ago I worked as a technician at a Decca Navigator transmitter station and our station manager, the late John Pratt, always surprised us with his sneak power failures, not only during the day but also in the middle of the sleeping watch at night. We had banks of batteries that carried the load whilst the generator started up.
When the real thing happened, our transmitter performed flawlessly.
Haven't they heard of N+1 redundancy?
A single generator failure shouldn't have affected them if they had N+1 power redundancy. I have to question what their technicians were thinking when they came up with the power systems if that wasn't factored in...
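To make the N+1 idea concrete, here is a rough sketch of the sizing check it implies: after losing one unit, the remaining units must still carry the full load. The function name and the kW figures are invented for illustration and say nothing about how Amazon's power systems are actually laid out.

```python
def survives_failures(unit_kw: float, units_installed: int,
                      load_kw: float, failed: int = 1) -> bool:
    """True if the units still running can carry the load after `failed` units drop out."""
    healthy = max(units_installed - failed, 0)
    return healthy * unit_kw >= load_kw


# Hypothetical site: 2000 kW load, 1000 kW generators, so N = 2.
print(survives_failures(1000, 3, 2000))            # N+1 installed, one failure
print(survives_failures(1000, 2, 2000))            # only N installed, one failure
print(survives_failures(1000, 3, 2000, failed=2))  # N+1 installed, two failures
```

Note that the check only covers capacity. It assumes the surviving units actually start and that the switchgear transfers the load to them, which is a separate failure mode.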
Re: Haven't they heard of N+1 redundancy?
Probably something along the lines "This isn't a nuclear power plant next to an ocean where tsunamis can happen"...
Don't forget, if you test your generator once a month, to keep an eye on the fuel level so that if the power does fail you've not got just 30 minutes' worth left...
...like 1 local government organisation I know of.
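A quick way to see how that happens is to subtract the fuel burned by the test runs from the tank and divide what is left by the burn rate. The sketch below does just that; the tank size, burn rate and test schedule are made-up numbers, not figures from any real site.

```python
def fuel_after_tests(start_litres: float, burn_litres_per_hour: float,
                     test_hours: float, tests: int) -> float:
    """Fuel left after a number of test runs with no refuelling in between."""
    used = burn_litres_per_hour * test_hours * tests
    return max(0.0, start_litres - used)


def runtime_hours(fuel_litres: float, burn_litres_per_hour: float) -> float:
    """How long the generator can run on the fuel that remains."""
    if burn_litres_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_litres / burn_litres_per_hour


# Hypothetical: 500 L tank, 40 L/h under load, one-hour test every month for a year.
remaining = fuel_after_tests(500.0, 40.0, 1.0, 12)
print(f"{remaining:.0f} L left, {runtime_hours(remaining, 40.0) * 60:.0f} minutes of runtime")
```

Under those assumed numbers a year of unrefuelled monthly tests leaves 20 L, or about half an hour of running.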
Re: testing generators
They test it very often (I guess someone forgot to switch the generator from manual back to auto; while they are being serviced you do not want one firing up on you, as it could kill someone or damage the generators).
Someone had to walk up to it and press the manual start :)
Re: testing generators
Yep seen the same thing. After a test someone forgets to switch back to auto. Some time later a digger goes through some power cables and the entire area goes out. This is then very swiftly followed by a lack of power in the office... That's when you get to see how good your DR setup is!
"any volumes that had in-flight writes when power was lost come back in an impair state"
Another individual who seems to be ignorant of the use of the past tense.
For the avoidance of doubt:
Re: "any volumes that had in-flight writes when power was lost come back in an impair state"
"Impair" instead of "impaired" seems to be a typo, but the use of the present simple in a dependent clause to indicate and unchanging action seems correct to me.
In other words, "volumes that had in-flight writes when power was lost came back in an impaired state" indicates that the impairment happened just this one time and wouldn't normally be expected to occur, whereas "volumes that had in-flight writes when power was lost come back in an impaired state" implies that this is what is expected to happen every time.
I would agree, however, that having the whole sentence in present simple would have been equally correct and more aesthetically pleasing. At least to my ears.