Vulnerabilities in Amazon's web services that were exposed after lightning hit power supplies at the weekend have led to stinging criticism from some customers. The bolt knocked out the utility and back-up generators in Dublin, causing a blackout which took down the Elastic Compute Cloud (EC2) and Relational Database Service ( …
sounds like amazon will take quite a reputation hit for this
Silicon Republic are reporting that it wasn't lightning:
Turns out Amazon could be telling porky pies. ESB Electric Ireland reported an outage of less than a second at a substation, but with NO reports of fire or explosion.
Plus Met Eireann has no reports of lightning at the time of the outage.
More at Silicon Republic http://www.siliconrepublic.com/strategy/item/23084-mystery-surrounds-outage-at
hope the lessons will be learned
This whole thing is a major disappointment, especially for those of us who are big believers in and advocates of the benefits of Cloud software in the SME market. The people I most feel for are the software providers who have built a business model based on AWS and who will have been badly burned - I suspect a lot of their customers will have lost a lot of faith in the Cloud model.
Just hope that Amazon and their competitors learn from this and improve the resilience of their systems. Power outages will happen, hosting companies need to have tested DR solutions in place, as Amazon clearly haven't done here.
I guess those that live in cloud houses shouldn't throw lightning..
The cloud is rubbish
The cloud is for the pen pushers who don't like the thought of sharing a company with "Geeks" and therefore sharing money with people smarter than them.
Anyone putting a business critical system into the cloud deserves all they get, especially if they are stupid enough to put their backups in the cloud!
EBS ESB BS
EBS - Elastic Block Storage
ESB - Electricity Supply Board
So Amazon are talking BS about their EBS being screwed by the ESB
At least Microsoft recovered.....
A near miss, good rehearsal?
In my case at least, I had everything back up and running (on another Ubuntu instance spun up in one of the other two Availability Zones) pretty quickly; in future, we'll probably keep the new instance running as well and have MySQL replicate between the two for redundancy.
Our backups were in S3 in the us-east region - which incurs higher bandwidth charges compared to using eu-west, but ensures better resilience (anything taking out Dublin and Virginia at the same time has to be pretty big). Since this outage, I've started keeping rsync backups with Strongspace as well: should be faster to pull data back that way if needed, and as far as I can tell it's totally independent of Amazon both technically and administratively, so even a sudden corporate failure or malicious administrator in AWS wouldn't take it all out.
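A rough way to sanity-check that kind of setup is to list every copy of the data and see which provider or region they all depend on. The sketch below is only an illustration: the `BackupCopy` type, the `shared_failure_domains` helper and the `"third-party-site"` region label are made up for the example, not anything AWS or Strongspace provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    provider: str  # who operates the storage (hypothetical label)
    region: str    # where the data physically lives (hypothetical label)


def shared_failure_domains(copies):
    """Return the provider and/or region that every copy depends on."""
    providers = {c.provider for c in copies}
    regions = {c.region for c in copies}
    shared = []
    # A single provider means one corporate or administrative failure hits every copy.
    if len(providers) == 1:
        shared.append("provider:" + next(iter(providers)))
    # A single region means one site-wide outage hits every copy.
    if len(regions) == 1:
        shared.append("region:" + next(iter(regions)))
    return shared


# Setup before the outage: live data in eu-west, backups in S3 in us-east.
before = [
    BackupCopy("live volume", "aws", "eu-west"),
    BackupCopy("S3 backup", "aws", "us-east"),
]
# Adding an rsync copy held by a separate company (region label is a placeholder).
after = before + [BackupCopy("rsync copy", "strongspace", "third-party-site")]

print(shared_failure_domains(before))  # the regions differ, but the provider is shared
print(shared_failure_domains(after))   # nothing is shared by all three copies
```

An empty result only means no single provider or region is common to every copy; it says nothing about whether the copies are current or actually restorable, which still needs a test restore.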
Amazon's handling seems quite dismal in a way: very poor communications, in particular no indication which volumes and instances were affected, and no notification of problems until long after we'd completed DR, as well as no indication of any timescale for restoration. Even now, I'm still waiting for one of the "recovery" snapshots to attach to my recovery instance, in an unaffected zone, for repairs, while another volume waits to be deleted in the affected zone.
In the end, a couple of (non-revenue) websites of ours were down for quite a few hours, and once we fully restore operations we'll be both much more resilient to future problems and much quicker to recover should it be needed ... but we could have been very much worse off, I suspect.
Amazon less than helpful
All the "recovered" snapshots I've seen are completely useless, with big chunks of data missing. Luckily we have some offline backups of our data; not so luckily, these are a while out of date, although our sites don't change that often. Amazon's response so far seems to be "we've done our (tiny) bit - you're on your own". Gee, thanks Amazon.
<quote> Amazon said on Monday it would resolve the process in 48 hours but wrote to customers yesterday informing them it had discovered an error in EBS software which "incorrectly deleted" one or more blocks when cleaning snapshots. </quote>
Well, we've had bugger all messages from Amazon about any of this. Perhaps you need to be one of their tier-1 customers and not an SME to get information from them.
Anonymous 'cus the boss is kicking off and I don't want to fan the flames...
Got an email - still not helpful
I'm not tier 1 but used a snapshot for backing up my music and documents (2Gb snapshot = £0.25/month)
I received an email with the details in the article and sure enough, I have one "errored" snapshot and a recreated copy which may, or may not, have zeroed blocks in it.
No email; maybe you're not affected?
The dangers of having all your eggs in one basket.
To me, this shows one of the dangers of cloud computing. People are encouraged to move their backups and other systems off site, onto the cloud. Often with the assumption that they can severely reduce their on-site IT facilities and staff (after all, if they don't, the cloud option will be an extra cost, so it won't cost less).
So, they happily offload mission critical systems on to Amazon et al and when their cloud provider's systems fail in some way, these people can lose mission critical systems for days. Yet they are happy to rely on Amazon because, hey, Amazon operate one of the largest on-line shops on the web, they know what they are doing, right?
You could argue that they should have backups, and they should. However, bean counters don't think of that, and once they've offloaded enough IT staff to make the move to the cloud profitable, they may not have enough IT staff to have a proper backup strategy.
And I've totally ignored the chaos that can ensue if the company's connection to the Internet falls over.
The funny thing is, I work for a Uni, and we rely on several mission critical systems. If even one falls over for more than 1 hour without a backup kicking in, we can be (and often are) called in to see the Director of Computing Services to explain ourselves.
Mitigating the risk of power outages
Amazon and Microsoft have once again suffered embarrassing power outages, but whilst high profile cloud-based outages are enough to put any customer off entrusting their IT systems to a cloud computing supplier, the risk of such a disaster can, and should, be lessened.
Regardless of what anyone says, uptime cannot be guaranteed 100% of the time. There are always going to be complications, and the bigger something gets, the more difficult it becomes to understand, manage, document, test and audit. A different approach to take is to build a series of smaller, pocket-sized data centres – 200 to 300 servers is optimum – within one larger data centre. Building on a smaller scale provides the supplier with greater control and visibility of everything that is going on in the estate. Technical teams can easily be queried as to whether they have carried out a disaster recovery (DR) rehearsal recently, auditing procedures can be assessed, and each fibre cable, network line and rack can be tested individually.
Build that to scale with tens or even hundreds of thousands of servers, as with Amazon and Microsoft, and it becomes clear how easy it would be for certain testing procedures to be missed or to be impractical, and thus for mistakes to be made. This is not to say that the cloud giants are using the incorrect operational model. The simple fact, however, is that the bigger you make something, the bigger the opportunity for flaws becomes.
The Cloud Computing Centre