Problems in the Amazon cloud over the weekend crushed apps like Vine, websites like Airbnb, and numerous other services that depend on Bezos & Co's hulking cloud, and the problems were due to a familiar culprit – Elastic Block Store (EBS). EBS is a network-attached, block-level storage service for Amazon EC2 instances. Amazon …
Isn't that supposed to be one of the selling points? That and distributed resilience?
If it only affected a single region, I'm not sure what the big deal is.
This is no different than a major hosting center or hub having a network issue that affects a bunch of websites, only in this case we get to shout "but it was in the cloud."
Amazon provides easy options for sites to deploy themselves across diverse geographic regions for added resilience. Many choose not to do so, but that isn't a problem with Amazon's offering; it's down to their own cost-benefit priorities.
One of the issues with EBS in certain failures is that the break isn't clean. Sometimes you just get a 100x slowdown on one disk, without any outright failure, until you detect and route around the slowdown. Other times Amazon reports that a disk transaction has been committed, but if you try to read it back you get corrupted data every time, indicating a corrupted write in the first place (this is rare, but anyone who has spent much time with any hefty EBS instances will have seen it). So you can end up with corrupted data on both master and slave copies before any of Amazon's tooling actually reports that anything might be wrong, whilst what you have left runs like a drain.
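A minimal guard against that acknowledged-but-corrupted write mode, assuming you control the write path, is to read your own writes back and compare digests. This is only a sketch: the path handling is illustrative, and on a real system the OS page cache may serve the read, so a production check would read via O_DIRECT or from a second host.

```python
import hashlib
import os

def write_with_readback_check(path, data):
    """Write data, fsync it, then read it back and compare SHA-256
    digests. If the storage layer acknowledged a corrupted write,
    the mismatch surfaces here instead of much later."""
    expected = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push past the OS cache toward the device
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    if actual != expected:
        raise IOError("read-back digest mismatch: possible silent corruption")
    return expected
```

Replicating the same check on the slave copy, rather than trusting the replication layer's acknowledgement, is what catches the "corrupted on both master and slave" case described above.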
This does not make reliability engineering very easy.
Basically, it's not that EBS fails (you should plan for that as has been mentioned) it's that it sometimes fails in completely unexpected ways that go against the documentation on how it should fail, and you don't necessarily notice that it's gone off on an acid trip until it is too late.
It also doesn't help when multiple availability zones break at the same time, or when the failover takes forever because everyone else is failing over with you.
I should point out that I haven't touched AWS recently, so Amazon may have miraculously fixed all of this, and this latest issue is some unrelated problem.
Anyone with more up-to-date knowledge of AWS care to weigh in?
It is not just a single region.
Amazon core services, including amazon.co.uk itself (accessed from Virgin in the UK), have been temperamental (to put it mildly) the whole weekend long. My initial suspicion was Virgin (as usual).
However, it looks like it may be the other usual suspect.
"Given the outage during the weekend just gone, cloud-first businesses might want to start looking at EBS and working out how to design their systems around potential failures in Amazon's data center hubs."
Given the ***SLA*** they should be doing that. Of course almost nobody does, or will. That's part of the scam.
Having supported apps at two different orgs that were built from the ground up in Amazon, I saw first hand over a few years how little work goes into taking into account the failure of Amazon infrastructure (i.e. none, in many cases). Similar rules apply to in-house infrastructure as well, btw, so it's not unique to cloud: every other company I worked for had developers who did similar things with their own infrastructure.
Not to say that assuming the infrastructure won't fail is not a valid model - in many cases it is (it is often the cheapest way to go for small-to-mid scale). The problem, of course, is that most people (management) don't realize this critical distinction when using a service like Amazon.
They hear "cloud" and equate that to "it just works".
Which equates to a never-ending nightmare for folks like me (fortunately I don't have to deal with Amazon at the moment, having moved my company out over a year ago).
I sympathize that you probably got beat up over management's choices. (And beating you up is part of their plan.) But, still, do you really think your former management bozos could cobble together a better plan than "we just go down when AWS hiccups?" Sure, the revenue stream is toast for a few hours a few times a year, losing literally thousands of dollars of revenue. In principle, one can do better; in practice, I doubt they can.
What they hear in "cloud" is "they can't pin it on me."
It's a delicious irony that the ad surrounding this news article at the moment is for Amazon Web Services :-)
It's not ironic! It's AdWordic!
Another example of the stupidity of placing an entity's data on someone else's servers umpteen thousand miles away from the entity's location, with no control by the entity. I wonder how long it took for the entities to realize they lost access, let alone how long it took for Amazon to let them know, let alone fix it.
Because you always know in advance when a locally hosted piece of critical hardware or software is about to break, and have the engineering resources to work around it.
No, what this highlights is the failure to plan for failure, and the same can happen with locally hosted resources, particularly when you start looking below the FTSE 250 or equivalent, where they may not have the ability to pay for five-nines availability. The cloud at least gives them a chance to get somewhere near this level of service without bankrupting the company. That said, if I were hosting core business apps 'in the cloud' I'd push to host with two providers (Amazon, Rackspace, Azure, et al.), where you're at least spreading your bets that both providers won't break at the same time.
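The "spread your bets across two providers" idea can be sketched as an ordered health probe: try the primary, fall back to the secondary if it doesn't answer. The endpoint names and the probe callable here are hypothetical placeholders, not any provider's actual API.

```python
def pick_live_endpoint(endpoints, probe):
    """Return the first endpoint whose probe succeeds.

    `endpoints` is an ordered list (primary first, fallback after);
    `probe` is any callable returning True when the endpoint answers.
    Probe exceptions are treated the same as a failed probe."""
    for name in endpoints:
        try:
            if probe(name):
                return name
        except Exception:
            continue  # provider unreachable: try the next one
    raise RuntimeError("no provider responded")
```

In practice the probe would be an HTTP health check, and the switch would usually happen at the DNS or load-balancer layer rather than in application code, but the cost-benefit logic is the same.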
I'm still not convinced that moving non-sensitive data to 'the cloud' is a bad move for a lot of businesses.
Nuclear power plants, particle accelerators and jet engines all experience show-stopping problems more often than the Amazon cloud offering, and only the fringe complain about those things. Granted, those issues are generally handled better, but as ElNumbre states, this issue is a failure to plan for failure, not a failure of the entire concept.
One has to balance the problems against the brilliance of having to invest no capital and not having to figure out how to acquire and keep a local bureau of sysadmins that can do it better.
If I read other comments here correctly, no it's not. The plan for failure is that if part of the cloud service fails, you roll over to a different part of the same cloud service. But what happened was not a clean break, more of a severe brownout. Which means none of the planned triggers were tripped. And the failures needed to be detected at the cloud level, not at the level of the people purchasing the service.
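A trigger that catches a brownout rather than a clean break has to ask "how slow have recent requests been?" instead of "is the volume up?". A minimal sketch, with purely illustrative window and threshold values:

```python
from collections import deque

class BrownoutDetector:
    """Flag a resource as unhealthy when a sustained fraction of
    recent operations are merely slow, even if none fail outright.
    The defaults below are illustrative, not tuned for any service."""

    def __init__(self, window=20, slow_ms=500.0, slow_fraction=0.5):
        self.samples = deque(maxlen=window)
        self.slow_ms = slow_ms
        self.slow_fraction = slow_fraction

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def unhealthy(self):
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough evidence to trip yet
        slow = sum(1 for s in self.samples if s >= self.slow_ms)
        return slow / len(self.samples) >= self.slow_fraction
```

A hard up/down check never trips on a 100x slowdown; a window like this does, at the cost of having to pick thresholds, which is exactly the judgment the comment says was missing.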
This one seems to be more akin to Xerox's 6 and 8 issue with their copiers. There is a subtle failure in the system that doesn't completely break the system, and you have to be looking too damn hard for any reasonable expectation of finding it. Frankly, these are the types of errors I find most troublesome. If a failure breaks something cleanly, there are alarms. When it doesn't, not so much.
the what now? Forensic investigation? To understand how it (the network device) failed?
So it's a crime now for a network device to fail? fo . ren . sic: adjective: of, relating to, or denoting the application of scientific methods and techniques to the investigation of crime (forensic evidence / forensic investigation)
OK, I can (maybe!) see the application of scientific methods and techniques (IT is a science, right... well, for some it is; for others (including some in the IT field) it is magic / the stuff of gods). I can see that it is a crime in the eyes of Amazon for a network device to fail, but to us users / observers it still seems a bit Over The Top(TM).
Spindoctors / PR people: please use language that actually means something, and please stop using language to make it all sound nice... When you do a root cause analysis, just call it a root cause analysis. What's wrong with that? Forensic Investigation? Just makes me think: You've obviously got no clue as to what you are saying, you're just saying things because, to you, they sound nice... That, I would think, is showing, and I quote, not so much brains as earwax.
I will grant Merriam-Webster doesn't seem to have updated their definition yet, but wiki, which is more attuned to these sorts of subtle changes, has. I frequently hear "forensic" used in this newer sense:
the application of a broad spectrum of sciences and technologies to investigate situations after the fact, and to establish what occurred based on collected evidence.
And frankly given the size of the failure, the number of companies involved, and the sums of money exchanging hands, even your strict definition may be more apt than some (many?) of us would like.
If you go to a fancy school you join the forensics team, not the debate team.
fo . ren . sic: adjective: of, relating to, or denoting the application of scientific methods and techniques to the investigation of crime
Dictionaries do not specify language. And this is only one of the common, well-established meanings of "forensic".
please use language that actually means something
But it doesn't load? Is this also a Google failure/downtime, or is it ███ pulling the vid due to obvious ██████ reasons?
(oops, it's working now, tinfoil hat off.....)
Other cloud providers are available that will:
Offer you an SLA that sticks
Offer competitive pricing comparable to Amazon (after you've taken into account all the Amazon extras NOT included in the base VM price)
Actually give you a phone number and an account manager to handle your business
Let YOU decide which DC (and country) to host your application in, and not move it on a whim.
We run our own services on top of our cloud offerings, as well as those of our customers, and we've NEVER had outages like Amazon or Google.
We run our own services on top of our cloud offerings ... we've NEVER had outages like Amazon or Google.
So, you do your IT in-house, and don't have major outages? There's a surprise.
We've run our web site in US-East AZ C for 5 years and have not had a moment's downtime that wasn't our fault. That's 100% uptime over 5 years. Neither this time nor during the last apparent outage did we experience any downtime as a result. We have used EBS since it became available. By comparison, our in-house kit fails regularly, and our experience on Azure was not positive.
My conclusion is that the AWS infrastructure is resilient. No one can expect that NO problems will affect them; that's ridiculous. In this case a competent application administrator would have mirror sites and services in other zones, if not regions.
So I think that somewhere there's a marketing department trying to huff and puff some life into this moribund story.
AC @ 9:36 Be courageous, provide some comparisons (or links to robust comparisons). We've tried (and try) Azure, GoGrid and Rackspace to check prices for our needs. I'm not saying that AWS is a panacea for everyone but we've not been able to find a better price for our needs so far.
Is this the collective noun then? I would have thought a "not-in-yet of sysadmins", or a "sorry-I'm-a-bit-hungover of sysadmins"
We keep our Sysadmin locked in the basement and chained to his chair for precisely this reason.