Problems in the Amazon cloud over the weekend crushed apps like Vine, websites like Airbnb, and numerous other services that depend on Bezos & Co's hulking cloud, and the problems were due to a familiar culprit – Elastic Block Store (EBS). EBS is a network-attached, block-level storage service for Amazon EC2 instances. Amazon …
Isn't that supposed to be one of the selling points? That and distributed resilience?
If it only affected a single region, I'm not sure what the big deal is.
This is no different than a major hosting center or hub having a network issue that affects a bunch of websites, only in this case we get to shout "but it was in the cloud."
Amazon provides easy options for sites to deploy themselves across diverse geographic regions for added resilience. Many choose not to do so, but that isn't a problem with Amazon's offering; it's down to their own cost-benefit priorities.
One of the issues with EBS in certain failures is that the break isn't clean. Sometimes you just get a 100x slowdown on one disk, without any outright failure, until you detect and route around the slowdown. Other times Amazon reports that a disk transaction has been committed, but if you try to read it back you get corrupted data every time, indicating a corrupted write in the first place (this is rare, but anyone who has spent much time with any hefty EBS instances will have seen it). So you can end up with corrupted data on both master and slave copies before any of Amazon's tooling actually reports that anything might be wrong, whilst what you have left runs like a drain.
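A minimal guard against that acknowledged-but-corrupted write mode, assuming you control the write path, is to read your own writes back and compare digests. This is only a sketch: the path handling is illustrative, and on a real system the OS page cache may serve the read, so a production check would read via O_DIRECT or from a second host.

```python
import hashlib
import os

def write_with_readback_check(path, data):
    """Write data, fsync it, then read it back and compare SHA-256
    digests. If the storage layer acknowledged a corrupted write,
    the mismatch surfaces here instead of much later."""
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push past the OS cache toward the device
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected:
        raise IOError("read-back digest mismatch: possible silent corruption")
    return expected
```

Replicating the same check on the slave copy, rather than trusting the replication layer's acknowledgement, is what catches the "corrupted on both master and slave" case described above.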
This does not make reliability engineering very easy.
Basically, it's not that EBS fails (you should plan for that as has been mentioned) it's that it sometimes fails in completely unexpected ways that go against the documentation on how it should fail, and you don't necessarily notice that it's gone off on an acid trip until it is too late.
It also doesn't help when multiple availability zones break at the same time, or when the failover takes forever because everyone else is failing over with you.
I should point out that I haven't touched AWS recently, so Amazon may have miraculously fixed all of this, and this latest issue is some unrelated problem.
Anyone with more up-to-date knowledge of AWS care to weigh in?
It is not just a single region.
Amazon core services, including amazon.co.uk itself (accessed from Virgin in the UK), have been temperamental (to put it mildly) the whole weekend long. My initial suspicion was Virgin (as usual).
However, it looks like it may be the other usual suspect.
"Given the outage during the weekend just gone, cloud-first businesses might want to start looking at EBS and working out how to design their systems around potential failures in Amazon's data center hubs."
Given the ***SLA*** they should be doing that. Of course almost nobody does, or will. That's part of the scam.
Having supported apps at two different orgs that were built from the ground up in Amazon, I saw first hand over a few years how little work goes into taking into account the failure of Amazon infrastructure (i.e. none, in many cases). Similar rules apply to in-house infrastructure as well, btw, so it's not unique to cloud: every other company I worked for had developers who did similar things with their own infrastructure.
Not to say that assuming the infrastructure won't fail is not a valid model - in many cases it is (it is often the cheapest way to go for small-to-mid scale). The problem, of course, is that most people (management) don't realize this critical distinction when using a service like Amazon.
They hear "cloud" and equate that to "it just works".
Which equates to a never-ending nightmare for folks like me (fortunately I don't have to deal with Amazon at the moment, having moved my company out over a year ago).
I sympathize that you probably got beat up over management's choices. (And beating you up is part of their plan.) But, still, do you really think your former management bozos could cobble together a better plan than "we just go down when AWS hiccups?" Sure, the revenue stream is toast for a few hours a few times a year, losing literally thousands of dollars of revenue. In principle, one can do better; in practice, I doubt they can.
What they hear in "cloud" is "they can't pin it on me."
It's a delicious irony that the ad surrounding this news article at the moment is for Amazon Web Services :-)
It's not ironic! It's AdWordic!
Another example of the stupidity of placing an entity's data on someone else's servers umpteen thousand miles away from the entity's location, with no control by the entity. I wonder how long it took for the entities to realize they lost access, let alone how long it took for Amazon to let them know, let alone fix it.
Because you always know in advance when a locally hosted piece of critical hardware or software is about to break, and have the engineering resources to work around it.
No, what this highlights is the failure to plan for failure, and the same can happen with locally hosted resources, particularly when you start looking below the FTSE 250 or equivalent, where they may not have the ability to pay for five-nines availability. The cloud at least gives them a chance to get somewhere near this level of service without bankrupting the company. That said, if I were hosting core business apps 'in the cloud' I'd push to host with two providers (Amazon, Rackspace, Azure, et al.), where you're at least spreading your bets that both providers won't break at the same time.
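The "spread your bets across two providers" idea can be sketched as an ordered health probe: try the primary, fall back to the secondary if it doesn't answer. The endpoint names and the probe callable here are hypothetical placeholders, not any provider's actual API.

```python
def pick_live_endpoint(endpoints, probe):
    """Return the first endpoint whose probe succeeds.

    `endpoints` is an ordered list (primary first, fallback after);
    `probe` is any callable returning True when the endpoint answers.
    Probe exceptions are treated the same as a failed probe."""
    for name in endpoints:
        try:
            if probe(name):
                return name
        except Exception:
            continue  # provider unreachable: try the next one
    raise RuntimeError("no provider responded")
```

In practice the probe would be an HTTP health check, and the switch would usually happen at the DNS or load-balancer layer rather than in application code, but the cost-benefit logic is the same.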
I'm still not convinced that moving non-sensitive data to 'the cloud' is a bad move for a lot of businesses.
Nuclear power plants, particle accelerators and jet engines all experience show-stopping problems more often than the Amazon cloud offering, and only the fringe complain about those things. Granted, those issues are generally handled better, but as ElNumbre states, this issue is a failure to plan for failure, not a failure of the entire concept.
One has to balance the problems against the brilliance of having to invest no capital and not having to figure out how to acquire and keep a local bureau of sysadmins that can do it better.
If I read other comments here correctly, no it's not. The plan for failure is that if part of the cloud service fails, you roll over to a different part of the same cloud service. But what happened was not a clean break, more of a severe brownout. Which means none of the planned triggers were tripped. And the failures needed to be detected at the cloud level, not at the level of the people purchasing the service.
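A trigger that catches a brownout rather than a clean break has to ask "how slow have recent requests been?" instead of "is the volume up?". A minimal sketch, with purely illustrative window and threshold values:

```python
from collections import deque

class BrownoutDetector:
    """Flag a resource as unhealthy when a sustained fraction of
    recent operations are merely slow, even if none fail outright.
    The defaults below are illustrative, not tuned for any service."""

    def __init__(self, window=20, slow_ms=500.0, slow_fraction=0.5):
        self.samples = deque(maxlen=window)
        self.slow_ms = slow_ms
        self.slow_fraction = slow_fraction

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def unhealthy(self):
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough evidence to trip yet
        slow = sum(1 for s in self.samples if s >= self.slow_ms)
        return slow / len(self.samples) >= self.slow_fraction
```

A hard up/down check never trips on a 100x slowdown; a window like this does, at the cost of having to pick thresholds, which is exactly the judgment the comment says was missing.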
This one seems to be more akin to Xerox's 6 and 8 issue with their copiers. There is a subtle failure in the system that doesn't completely break the system, and you have to be looking too damn hard for any reasonable expectation of finding it. Frankly, these are the types of errors I find most troublesome. If a failure breaks something cleanly, there are alarms. When it doesn't, not so much.
the what now? Forensic investigation? To understand how it (the network device) failed?
So it's a crime now for a network device to fail? fo . ren . sic: adjective: of, relating to, or denoting the application of scientific methods and techniques to the investigation of crime (forensic evidence / forensic investigation)
OK, I can (maybe!) see the application of scientific methods and techniques (IT is a science, right... well, for some it is; for others (including some in the IT field) it is magic / the stuff of gods). I can see that it is a crime in the eyes of Amazon for a network device to fail, but to us users / observers it still seems a bit Over The Top(TM).
Spindoctors / PR people: please use language that actually means something, and please stop using language to make it all sound nice... When you do a root cause analysis, just call it a root cause analysis. What's wrong with that? Forensic Investigation? Just makes me think: You've obviously got no clue as to what you are saying, you're just saying things because, to you, they sound nice... That, I would think, is showing, and I quote, not so much brains as earwax.
I will grant Merriam-Webster doesn't seem to have updated their definition yet, but wiki, which is more attuned to these sorts of subtle changes, has. I frequently hear "forensic" used in this newer sense:
the application of a broad spectrum of sciences and technologies to investigate situations after the fact, and to establish what occurred based on collected evidence.
And frankly given the size of the failure, the number of companies involved, and the sums of money exchanging hands, even your strict definition may be more apt than some (many?) of us would like.
If you go to a fancy school you join the forensics team, not the debate team.
fo . ren . sic: adjective: of, relating to, or denoting the application of scientific methods and techniques to the investigation of crime
Dictionaries do not specify language. And this is only one of the common, well-established meanings of "forensic".
please use language that actually means something
But it doesn't load? Is this also a Google failure/downtime, or is it ███ pulling the vid due to obvious ██████ reasons?
(oops, it's working now, tinfoil hat off.....)
Other cloud providers are available that will:
Offer you an SLA that sticks
Offer competitive pricing comparable to Amazon (after you've taken into account all the Amazon extras NOT included in the base VM price)
Actually give you a phone number and an account manager to handle your business
Let YOU decide which DC (and country) to host your application in, and not move it on a whim.
We run our own services on top of our cloud offerings, as well as those of our customers, and we've NEVER had outages like Amazon or Google.
We run our own services on top of our cloud offerings ... we've NEVER had outages like Amazon or Google.
So, you do your IT in-house, and don't have major outages? There's a surprise.
We've run our web site in US-East AZ C for 5 years and have not had a moment's downtime that wasn't our fault. That's 100% uptime over 5 years. Neither this time nor during the last apparent outage did we experience any downtime as a result. We have used EBS since it became available. By comparison, our in-house kit fails regularly, and our experience on Azure was not positive.
My conclusion is that the AWS infrastructure is resilient. No one can expect that NO problems will affect them; that's ridiculous. In this case a competent application administrator would have mirror sites and services in other zones, if not regions.
So I think that somewhere there's a marketing department trying to huff and puff some life into this moribund story.
AC @ 9:36 Be courageous, provide some comparisons (or links to robust comparisons). We've tried (and try) Azure, GoGrid and Rackspace to check prices for our needs. I'm not saying that AWS is a panacea for everyone but we've not been able to find a better price for our needs so far.
Is this the collective noun then? I would have thought a "not-in-yet of sysadmins", or a "sorry-I'm-a-bit-hungover of sysadmins"
We keep our Sysadmin locked in the basement and chained to his chair for precisely this reason.