Microsoft has been left reeling again after another BPOS crash but at least on this occasion it was not alone, as Amazon's EC2 web services were also downed by the same act of God in Europe. A bolt of lightning struck a transformer at a power utility provider in Dublin, causing an explosion that took down the back-systems last …
Hopefully a wakeup call to all those compsci fanbois who've been saying "put your services in the cloud" for the last couple of years. Newsflash, kiddies: there's some real engineering underneath all this, with real volts and amps and hardware and stuff. It doesn't matter how pretty your little class diagrams are - if it doesn't bloody work then it doesn't bloody work.
You seem to be confusing computer scientists with marketers
Newsflash sunshine, the people who understand the issues involved in running fault-tolerant distributed systems are not the ones selling those systems to you.
But the issue is that when people sell 'the Cloud' to the bean counters, they fail to price in the need to duplicate everything so that the data can also reside in a second cloud in a different data centre - which is what actually provides the uptime.
The bottom line is that this could have happened to a private 'cloud' as well.
Don't be so sure
This isn't exactly a crisis for cloud computing. A lightning strike could hit a data centre you own yourself, too. In fact, with cloud hosting you could recover very quickly by spinning up a new instance (say, in the US) while the Dublin centre recovers. Can't do that with your own data centre.
Since I first heard of "The Cloud" I've said only fools and geeks would succumb to the marketing hype. Why anyone wants to risk everything on someone else's unknown kit is beyond me.
lovely, some more security theater
... and we couldn't 'float' across to another server farm as you could* on the Cloud
Honestly, cloud availability is far, far better than in-house uptime. I can only think that El Reg is haunted by IT bods who feel their cushy number is threatened. Why have one BOFH for 20 servers when you only need one per 2,500 servers?
(*data replication permitting)
Volts and amps and stuff? I thought it was all just 1's and 0's.
That's true and that's part of the point.
You control your own data centers and you control your own DR.
You build in the duplication such that you don't have the down time.
But when you go to the 'cloud' like Amazon, you don't necessarily have people doing the due diligence and setting up replication to a different cloud. They don't account for the costs associated with having the second cloud.
"Power sources needed to be "phase synchronised" before being brought online to load, which needed to be done manually"
I remember doing that at Uni. 3 lightbulbs across the phase pairs, wait till they all go dark at the same time & hit the switch. Delay of a few seconds, at most. Maybe this new cloudy stuff is too high tech?
Re: Phase synchronising
You have three phases in a standard 415 V supply.
To sync your gensets you need to align the frequency and voltage of each of these so that the voltages match for a short period; then you can switch over.
The three-bulbs trick is an old one - the bulbs flicker slower and slower as the genset matches the supply, and when they match and are in phase the bulbs all go dark.
The problem seems to be that the mains went completely out and wrecked the sync/switchover gear.
In this case the gensets need only be synced with the UPS; or, if they're on the feed side of the UPS, the sync switchgear needs to be disabled and the gensets started up on their own.
It sounds like, rather than risk having two outages, they intentionally "stayed down" so they could repair the defective sync/switchgear. My other guess is that the gensets failed: quite often, datacentres do not run their gensets often enough, or they fail to bleed off the water condensation, and when the genset is finally run in anger the water gets into the filters and kills them dead just when they are needed. If this happens the "duff switchgear excuse" gets trotted out to cover incompetent maintenance.
I would estimate that roughly half the gensets in datacentres do not get the regular testing and servicing they require. I know some mechanics involved in the genset servicing game, and because the work goes to the lowest bidder, the money is a joke - so they cut corners, reuse filters, etc.
3 phases (red, yellow, blue) from the grid, three phases from your local source. Lightbulbs between grid "red" and local "red", ditto for the other two. When all the bulbs go dark together it means there's no significant voltage difference between grid and local on any phase, i.e. you know that your local source is in sync with the grid, and can be connected.
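The three-bulb check described above can be put in a toy simulation - a minimal sketch, where all the figures (50 Hz grid, 50.5 Hz genset, 325 V peak, 10 V "dark" threshold) are illustrative, not measurements from any real installation:

```python
import math

def bulb_voltage(t, phase_offset=0.0, f_grid=50.0, f_local=50.5, v_peak=325.0):
    """Instantaneous voltage across one sync lamp: the difference
    between the grid and genset waveforms on that phase."""
    v_grid = v_peak * math.sin(2 * math.pi * f_grid * t + phase_offset)
    v_local = v_peak * math.sin(2 * math.pi * f_local * t + phase_offset)
    return v_grid - v_local

def all_dark(t, threshold=10.0):
    """True when all three lamps (red, yellow, blue phase pairs) read
    near zero at once - the genset is in phase and safe to connect."""
    offsets = (0.0, 2 * math.pi / 3, 4 * math.pi / 3)
    return all(abs(bulb_voltage(t, o)) < threshold for o in offsets)
```

With a 0.5 Hz mismatch the lamps beat dark together every two seconds; note that a single lamp can momentarily read zero while the sources are still out of phase, which is exactly why three lamps are used.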
Synchronisation . . .
. . . is only necessary when going back onto the mains. In the meantime the UPS and generators should have been sufficient running isolated. So this is not the whole story. Lightning protection units? We've heard of them.
Maybe change the name to ...
Slightly biased comments
...we have both BPOS email services and live Amazon EC2 services. Microsoft was back last night whereas EC2 is big-time down; we can't even get to the images to rebuild in the USA, which was their advice (recovering from Crashplan to a new US instance).
Huge difference here, which makes the article seem biased the other way!
So, this whole "cloud computing" thing can be felled by a bolt of lightning? Is there no redundancy built into this rubbish at all? Wow.
Let's see an end to this cloud nonsense, ffs.
...there should have been a lightning conductor to protect the equipment that was struck? If not, and it caused such an almighty spike, then the utility provider should be held responsible.
Also, by the same logic, anyone else with a datacentre near Dublin on the same part of the power grid will also be equally screwed (if not more so)? So cloud hosting doesn't get rid of these kinds of risks - but that doesn't mean it's inherently worse than self-hosted datacentres.
... because I'm near Dublin, there was one almighty crack of thunder at about 3PM but I didn't hear any more after that.
It works, but it doesn't...
"the incident will rightly lead to questions about the "viability of the cloud as a delivery platform" but added outages were not a sign that the cloud does not work."
"There is also a need to address business contingency on behalf of customers...blah"
In other words, it could have worked, but it doesn't. But it will. Really. Or does it "work" already as long as you keep tweeting that you have a disruption when your disaster recovery doesn't work?
I'm sure a lot of customers really appreciate that there is a need to address business contingency. Problem is, the providers should have done so in advance....
You could mirror your sites in different AWS regions
It's unlikely that both Dublin and Virginia (I think) will both go down at the same time. Then the load balancer stuff should kick in to save your bacon. Just a thought. I only use the free micro tier, but my stuff is all hosted in the USA.
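The "load balancer stuff" can be as simple as client-side failover across regions - a minimal sketch, where the region list and the health-check function are hypothetical stand-ins for whatever probe you run against each region's copy of your service, not a real AWS API:

```python
# Try each region's endpoint in order and use the first healthy one.
# Region names are illustrative; health_check is any callable that
# probes that region's copy of the service and returns True/False.

REGIONS = ["eu-west-1", "us-east-1"]  # Dublin first, Virginia as fallback

def pick_region(health_check, regions=REGIONS):
    """Return the first region whose health check passes, else None."""
    for region in regions:
        if health_check(region):
            return region
    return None
```

If Dublin stops answering, the selector falls through to the Virginia endpoint - data replication permitting, as another commenter notes.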
Anyone else notice BPOS could stand for something else not unrelated?
Big Pile Of Sugar? ;)
Although given our recent experiences you should make it BSPOS, add a steaming in there :)
Pass the umbrella..
...rain's falling from the clouds....
Another fine mess, Stanley?
Wow, just love that timeless classic postmodern gobbledygook and Unwinese, used as a mass excuse and explanation for the failure to supply uninterruptible and incorruptible power services.
Is this ironic?
All these cloud service providers felled by something that came out of a cloud?
Does this count as cloud at all?
How is it different to shared hosting if it's all in one place? I thought cloud was supposed to replicate your stuff a bit better.
there should have been a lightning conductor to protect the equipment that was struck...
Yes, they are called pylons: f**king great metal spikes in the ground, usually covering several hundred miles. However, if the lightning decides it's found a better route, it will take it.
However, we look at this as a positive, as we have an (unrealistic) one-hour SLA to get all systems back online. Now with this we can say, "See, even MS and Amazon take several hours, so can we make it four, please?"
I iz confuzzled? Who designed these solutions?
Disaster Resilient Datacenter Designs 101 - if it really matters, get your power from two completely separate mains supplies. This is very expensive, which is probably why cheapo cloud services wouldn't have it. And you need that double sourcing for everything, not just the back-end, as it's pointless having your datacenter humming along nicely if your front-end network routers are all offline.
Disaster Resilient Datacenter Designs 102 - diesel gennies are the belt to your mains braces. Yes, they cost a lot, the diesel needs to be churned and replaced every couple of months, but - when the mains is up the creek - those gennies are completely under your control.
Disaster Resilient Datacenter Designs 103 - if you skip lessons 101 and 102, then you actually do need metro-level redundancy as well as intercontinental redundancy. That means another datacenter to fail over to in a close location (but not close enough to be taken out by the same disaster, ideally on a completely separate mains power source on the other side of the city or in another town) before you have to fail over to a whole different continent. Amazon seem to have forgotten that one.
But, seeing as I'm quite happy to see cheapo cloud crash and burn, please don't tell Amazon or M$!
What, no redundancy?
..I smell redundancies :)
But seriously: isn't a lightning strike knocking out the cloud possibly the most ironically wonderful event possible?
Weird weather here.......
We don't normally get storms as bad as this in Dublin - the lightning and torrential rain of the last two days has been startling at times. Entire areas of the city were knocked off the grid, some several times, which in this day and age is very unusual. I'm not altogether surprised that some high-prominence systems fell over: you can build as much redundancy as you like into systems, but you will always encounter scenarios that you just don't plan for.
… thundering into a datacentre near you.
Me thinks they might want to consider running the servers on DC power instead … DC does not require synchronisation.
amazon still down
I am a customer of a company who use Amazon for their hosting. Today I got an email telling me "It's not us, it's Amazon"... it doesn't matter to me who they offloaded their services to; I didn't choose them.
While I am no fan of cloud computing, I am left wondering if there is any benefit at all... since pre-cloud, the services I am using would likely have been co-located over just two boxes in different locations, whereas currently I seem to have had all my eggs put in one Amazon-shaped basket - which is an improvement how?
Re: amazon still down
I would blame your hosting company in this case. Why didn't they host your service on multiple geographically redundant instances, if Amazon allows them to do so?
Don't get me wrong, I'm a cloud skeptic as well, and Amazon is by no means fail-safe, but beware of clueless resellers.
I wonder ...
I wonder if there were some major routers knocked out at the same time? There were some decidedly strange connectivity problems yesterday ... (and no, it wasn't my ISP).
Have you tried...
...switching it off and on again...?
This is why you shouldn't let MBAs design networks ...
I was going to buy some expensive gifts last night from Amazon. The website conked out, so I took it as Divine Intervention. Seems like I was right :)
By George! The competition is getting savage these days. Two clouds knocked out by a third one.
clouds & social media
"using social media where its own service has failed"
But what if Twitter was down at the same time? I guess then you could say it would truly be time to panic? Because as long as you can tweet to your customers "we are aware that your service is currently unavailable" then you can say you've done everything you possibly could have with a straight face.
All eggs in one basket?
So, why are we surprised?
Clearly, the technology deployed so far in "the cloud" isn't resilient enough to ride out a catastrophic failure. The likely cause here was the use of transfer switches to change the power sourcing from the failed source to a new mains feed. These are semiconductor devices, and can be damaged by lightning strikes.
The solution for this has been around a few years. --- Remove the transfer switches and have two power sources on each server, each with its own PSU. Lightning might take out a few of the supplies on the A-feed, but the B-feed should keep on trucking.
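The payoff of two genuinely independent feeds is easy to put in numbers - a back-of-the-envelope sketch, with the 99.9% per-feed availability figure purely illustrative:

```python
def dual_feed_unavailability(feed_availability):
    """Probability that both independent feeds are down at once.
    Assumes failures really are independent - a shared transfer
    switch (or a single lightning strike upstream) breaks that
    assumption and wipes out the benefit."""
    p_down = 1.0 - feed_availability
    return p_down ** 2

# A single 99.9%-available feed is down about 8.8 hours a year;
# two independent feeds are both down with probability ~1e-6.
```

The caveat in the comment is the whole story of this outage: one strike on shared switchgear collapses the two "independent" feeds into one.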
Of course, this all stems from the concept of megadatacenters. Perhaps that's the crucial fallacy, as seen in the comms failure that downed Amazon a few months ago. Maybe smaller regional "bigadatacenters" might be better. The economics are surprisingly close.
Aaah - the Cloud
Best-named new technology ever!!! I sometimes wonder if some sarcastic bastard in DARPA thought that one up? I do struggle with reasonable explanations for this insanity :)
I've said it once, I'll say it again...
Amazon EC2 provides the *infrastructure* on which you can build a redundant service.
They are virtual instances running on physical hardware, not much difference to any other machine running virtual machines.
The difference is that Amazon have availability zones in the same location, and other data centres around the planet that have exactly the same setup, and they provide a single supplier to deal with. So it's much easier to build something that has redundancy built-in. However *you* have to do the work for that.
The services that they offer themselves that do have redundancy (eg. S3) were not affected. To me this is a minor incident as it only affected one availability zone - the other was running fine. So the sites that are well engineered were unaffected. It's the people running just a single EC2 instance against all advice that were affected.
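The "you have to do the work" part amounts to spreading instances across availability zones and checking you can lose one - a minimal sketch, with made-up instance IDs and zone names rather than anything from a real account:

```python
from itertools import cycle

def spread_instances(instance_ids, zones):
    """Round-robin instances across availability zones so that no
    single zone holds all of them."""
    placement = {zone: [] for zone in zones}
    for instance, zone in zip(instance_ids, cycle(zones)):
        placement[zone].append(instance)
    return placement

def survives_zone_loss(placement, failed_zone):
    """True if at least one instance sits outside the failed zone."""
    return any(ids for zone, ids in placement.items() if zone != failed_zone)
```

Running a single instance in one zone fails this check trivially, which is the "against all advice" setup the comment describes.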
Re: I've said it once...
"It's the people running just a single EC2 instance against all advice that were affected."
That would include Amazon's own shopping site, then.
EC2 is too unreliable for business use
I don't know if my company has just been unlucky, but aside from this major outage (which is still classed by Amazon as a "performance issue"!) we often have problems with servers locking up for no reason. When this happens, all you can do is reboot and hope that fixes it. Disks sometimes just don't attach properly. It simply isn't a reliable technology.
You say it's not much different to running on another virtual machine, but if I had hosting on another company's server and the power went out, I think they would get the power back and restart the server. Amazon, on the other hand, still haven't fixed many of the downed servers two days after the event. Their advice for recovering data just isn't working in many cases, as you can see from their own forum.
You say we have to do the work for the redundancy, but if it's just people not setting things up correctly, why did all of Amazon's own sites in Europe go down for about an hour when this all happened?
The truth is, Amazon say mirror stuff in different zones and you will be fine, and use their own DBMS. But although they don't admit it, it wasn't just one zone that was affected and their DBMS had serious problems too.
I can't speak for other cloud systems, but I think EC2 is not suitable for critical business applications.
I just wonder...
... if someone up there is trying to send us a message not to try and be so 'clever' at creating 'disaster-proof systems'. I wonder if the planet has had enough of being f****d around with. (And NO, I'm not an environmentalist.)
These cloud systems have fallen down more often than any data center I have been involved with. They must be too complex for the people running them--or have management decisions, influenced by marketing types, led to over-promising things that haven't been designed into the systems?
You wouldn't see these situations if the stuff they promised was actually functional.
"There is also a need to address business contingency on behalf of customers through the use of backup and mirrored facilities. That costs but it is a necessary cost and underlines the need for a web of alliances between application providers and cloud service and infrastructure providers to allow switching in the event of a failure"
So what you mean, luv, is that as well as throwing money at you and your marketing cronies for Cloud services, we should also build local redundant systems as well? What, you mean like the ones you're trying to convince us to move to the Cloud!!?!?!
Said it before, say it again: it's just another fad designed to justify the ever-increasing size and scale of modern datacentres, and the claim that it is ever going to save a business money is utterly trite (especially if Ms Cloud is now saying you'd best have your own backup as we're fskin useless).
amazon goes under blackout
Amazon, a leading cloud provider, suffered a service disruption recently, causing discomfort for many websites!