Whole swathes of Web 2.0 disappeared to the dark side of the cloud today, as an outage at one of Amazon's EC2 data centres torpedoed the likes of Quora and Reddit. The service's status site showed that things started going haywire just over five hours ago, when techies began "investigating latency and error rates with EBS …
It's stuff like this..
that should be held up whenever your boss mentions any hint of going to the "cloud".
If this were your business.. you'd be hosed, and you'd have zero control over getting the problem resolved.
Not only that, but because you're not a super platinum, high paying customer.. you are condemned to sit in a queue until they get to you.
Fornicate everything about the "cloud" with an iron stick.
If you don't control the iron, you don't control the risk.
Data Center Failures
And you think if you roll your own data center, it will be more reliable and cost effective?
It might be more costly, but once you factor in the total cost of an outage... there is no comparison.
So far... reddit has been down for about 8 hours.
So, let's do some rough maths... based on a 500-man company that moves its services to the magical cloud.
So... your cloud solution shits itself and dies... so you now have about 500 people sitting around twiddling their thumbs.
Say the average pay across the company is $30/hr,
so 500 x $30/hr x 8 hrs = $120,000
That 8 hr outage has just cost you $120,000 in productivity, not to mention the loss in reputation with your customers etc.
Now, I will BET that the guys who write the contracts for cloud hosting have various "all care, no responsibility" clauses written into the contract, which means you can't recoup any of that loss.
Say this sort of outage happens 5 times in a year..
That's about $600,000 pissed away because you don't control your servers or devices.
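For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the back-of-envelope sum above. The function name and its parameters are made up for illustration, and it only counts wages paid to idle staff, not lost revenue or reputation.

```python
def outage_cost(staff: int, hourly_rate: float, hours: float) -> float:
    """Wages paid to idle staff during one outage (productivity loss only)."""
    return staff * hourly_rate * hours


# Figures from the comment above: 500 staff, $30/hr average, 8-hour outage.
per_outage = outage_cost(staff=500, hourly_rate=30, hours=8)
per_year = per_outage * 5  # assuming five such outages in a year

print(f"One outage: ${per_outage:,.0f}")   # One outage: $120,000
print(f"Five a year: ${per_year:,.0f}")    # Five a year: $600,000
```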
Now add all the risk factors surrounding security of your data.
Not to mention there is no way of independently verifying that your cloud providers are doing what they say they are.. or that their data center is not staffed with inbred, lead poisoned, brain damaged monkeys.
I stand by my earlier statement. Companies who trust the cloud, get everything they deserve.
Reliability and cost-effectiveness are different things
Running your own data centre? You'd be crazy if you didn't have a fallback one. Yes, this is expensive but failure can be even more expensive. If you do own the iron you have plans and insurance for it.
Cost-effective: well, who really gives a shit about reddit or foursquare or quora? It really depends on your definition of effective. Pity the other guys who have real business models built on this stuff.
On the same day that Skynet became self-aware? Coincidence? I think not...
Bloody cloud nonsense
Reddit has been down since I got into the office this morning, I actually got some work done ffs!
Wonder if any of these so-called 'big names' have heard of DR....?
You'll never need DR ever
because it's all in the cloud. Trust us, there's no risk at all.
DR in the cloud?
Errm ok... It's not that simple.
First you're not the owner of the hardware, or data center. So how do you impose DR?
For what you're paying... what DR?
You want DR? Build your own data center, staff it, and run your own hardware configurations.
You can do it, but it's not cheap or easy.
Outside of these startups, think about trying to handle DR for things measured in PB.
Trust me, it's not easy.
Great idea guys
When it works, but centralization of anything on any hardware is guaranteed to fail at some point, and for me the Cloud is one big failure waiting to happen.
As I always say: keep it simple, stupid.
In other news
Office productivity is up 37%
What's curious about this is that it affected all four availability zones in US-EAST-1.
When one of my company's instances started having problems this morning, I attempted to start a replacement instance in a different availability zone and restore from an EBS backup. This failed.
The whole point of having availability zones is meant to be that if one zone goes down, the others remain unaffected.
The long-standing issue with EC2 is that there's no easy way to copy images, instances and EBS volumes from one *region* to another. Ideally my company would have images and backups available in a number of different regions, so that if Virginia disappears off the map, we can just start a new instance in Dublin as if nothing had happened. As it stands, we have everything in the affected region because having anything anywhere else is impractical.
Fortunately, only one of our instances in US-EAST-1 was affected - most have carried on working fine.
To be fair to Amazon...
...only one of their availability zones had issues, and in much the same way that a private DC can have problems; if the Web 2 outfits had designed for multi-site delivery (using Amazon or another provider as a secondary site), they would be fine.
Whole region is having problems
As Campbeltonian says, a whole Region pretty much went titsup! People's multi-AZ setups are having problems in that Region.
Makes you wonder how the heck this could have happened.... then again on occasions we have whole chunks of the UK dropping off the internet supposedly due to a single router / exchange failure.
Infoworld: "IT's cloud resistance is starting to annoy businesses"
Spooky coincidence or what? Infoworld today has a blog arguing that resistance to cloud computing by IT 'luddites' may be career limiting. The sad thing is, he's probably right. And if you understand why that is, you'll understand a lot about what is wrong with business management in general and IT management in particular.
As one of the comments says: "This Kool-Aid sure tastes funny".
I love how that article moans at those with specialist knowledge in provision of IT services who are (rightly) skeptical of leaping onto the latest rebranding of mainframe computing without fully researching how it would affect the business and taking their time to determine if it's even worth the risk of transitioning the current system over to a new one.
I wonder if they would happily take some experimental medication against the advice of their doctor too?
The Web2 men, say "up today"
But all my data's gone away
And it's raining (dum dum dum-dum-dum)
Raining in my cloud.
There's an old saying:
Don't piss down my back and tell me it's raining.
The Outlaw Josey Wales (1976)
8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.
11:09 AM PDT A number of people have asked us for an ETA on when we'll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We'll update the community as we have more information.
live by the cloud, die by the cloud
The cloud is great until it turns to blue sky.
That's why I use the rock (er... is there a better antonym of cloud?)
@AC 21st April 2011 19:17 GMT
You can script AWS (esp. EC2) such that failed instances trigger new instances being fired up, and those new instances can be launched at other AWS locations. So the single-point-of-failure charge made against AWS is not entirely fair.
And, well, if you have all your infrastructure in a server room at your head office, that room or its network connections can fail too. That must count as a single point of failure!
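A rough idea of what that scripting might look like, as a provider-agnostic Python sketch. Nothing here is a real AWS call: `is_healthy` and `launch` are hypothetical callables you would have to wire up to your provider's API yourself, and the region names are just labels. The point is only the control flow: walk a preference-ordered list of locations and start the replacement in the first one that is healthy and accepts the launch.

```python
from typing import Callable, Optional, Sequence


def fail_over(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
    launch: Callable[[str], str],
) -> Optional[str]:
    """Start a replacement in the first usable region.

    Returns whatever `launch` returns (e.g. an instance ID),
    or None if every region is unhealthy or refuses the launch.
    """
    for region in regions:
        if not is_healthy(region):
            continue  # skip locations already known to be in trouble
        try:
            return launch(region)
        except RuntimeError:
            continue  # launch failed here, try the next location
    return None


# Toy stand-ins for real health checks and launch calls (hypothetical).
def demo_is_healthy(region: str) -> bool:
    return region != "us-east-1"


def demo_launch(region: str) -> str:
    if region == "us-west-1":
        raise RuntimeError("no capacity")
    return f"replacement-in-{region}"


result = fail_over(
    ["us-east-1", "us-west-1", "eu-west-1"], demo_is_healthy, demo_launch
)
print(result)  # replacement-in-eu-west-1
```

As pointed out further up the thread, this only helps if your images and backups are already available in the other location, which is the hard part.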
Which is why you have separate, redundant circuits for your network you muppet.
Sounds like they need a little more redundancy. Sounds like they need a little more redundancy.
Fragility of block storage & hypervisors for clouds
Is it just me, or does the idea of distributed block-level storage sound like a rather poor concept? The issue with distributing blocks is that, as the lowest common denominator, block storage is also the most sensitive to latency.
Fundamentally, the reason AWS requires this approach is its use of a hypervisor providing hardware emulation, which requires direct block access for the VM images. One has to ask if this is really such a good approach long term for clouds, given its considerable performance overhead as well as fragility..... PaaS anyone?
re: ElasticHosts advertising spam
Oh dear.. business must be bad if you need to resort to spamming forums featuring your so called ''direct competitor'' ... more ElasticHosts spam...
Remember What a Cloud Is
A "cloud" is not a physical entity, it's a graphical representation on a schematic that essentially means "not our responsibility." Apparently it's not Amazon's responsibility either. If your company trusts its core business to this model, it deserves to be offline, permanently.
This is the equivalent of hosting your company web site on Geocities unless your hosting agreement provides guarantees for not just hosting costs but also lost revenue.
AWS in the bunker