An Amazon engineer hit the wrong button on Christmas Eve, deleting critical data in its load balancers and ultimately knackering vid streaming biz Netflix for 20 hours. The Netflix outage hit customers in the US, Canada and Latin America on 24 December, particularly those using games consoles and mobiles to watch films, while …

COMMENTS

Wednesday 2nd January 2013 18:22 GMT Anonymous Coward

Shoulda gone with Akamai

'nuff said.

0 0
1. Wednesday 2nd January 2013 19:34 GMT John 104
  
  Re: Shoulda gone with Akamai
  
  Ya, but Akamai is expensive. :D
  
  We use it here at work and our management keeps coming to me asking if there is a cheaper alternative. I keep saying, yes, but when was the last time we had an outage because of Akamai? The answer is never.
  
  3 1
Wednesday 2nd January 2013 19:41 GMT Anonymous Coward

My DVDs were unaffected.

Actually, the outage even failed to affect my video tapes.

9 0
1. Wednesday 2nd January 2013 20:34 GMT Anonymous Coward
  
  Re: My DVDs were unaffected.
  
  lol
  
  0 0
Wednesday 2nd January 2013 19:51 GMT Anonymous Coward

I know I'm going to be unpopular but what the heck...

I'm getting rather fed up of the argument from some that this is all down to the change processes and that heads should not roll. I know my organisation's efficiency would improve immensely if I were allowed to fire some asses now and then rather than just shuffle them off to the side to some role where (I hope) they cannot do any damage. I often wish that management had not downsized HR quite so much so that there were actually some warm bodies who would help me satisfy all the regs for sacking someone so I could use the money to hire someone decent instead...

3 9
1. Wednesday 2nd January 2013 20:16 GMT Hooksie
  
  Re: I know I'm going to be unpopular but what the heck...
  
  No wonder you were AC on that comment. You think it should be ok to sack people because managers like you continue to ask them to do things that they aren't trained for, don't have the time to finish, aren't their responsibility, and that belonged to the team you already outsourced or downsized that was SUPPOSED to do that job. Oh, and on top of that you give them a 2.5% pay 'increase' then blame the market conditions.
  
  To err is human, to really fuck things up requires a computer, a tired engineer and piss poor management.
  
  14 3
2. Thursday 3rd January 2013 02:11 GMT Fatman
  
  Re: efficiency would improve immensely if I were allowed to fire some asses now and then
  
  Simples, just hire this guy:
  
  http://disqus.com/JIMTHEBOSS/
  
  Check out some of his CW posts - perfect manglement material.
  
  0 0
3. Thursday 3rd January 2013 19:19 GMT asdf
  
  Re: I know I'm going to be unpopular but what the heck...
  
  > there were actually some warm bodies who would help me satisfy all the regs for sacking someone
  
  Wow, definitely not an American in a right to work state then. Right to work someplace else, no questions asked, is what it should be called. It sounds worse than it is though, in that it is generally easier to find a job as there is less risk in hiring someone, but you are lucky if you find a place that treats you as anything but an asset.
  
  0 0
Wednesday 2nd January 2013 20:48 GMT Anonymous Coward

Making bad assumptions

Actually I'm the kind of manager that fights tooth and nail to get my team trained, proper pay rises, promotions and fight against outsourcing and downsizing. I have hated it in the past when I have had to make good people redundant. I don't ask any of my team to do anything that I cannot do myself. All of which is why I'll never rise any further. However the propensity of some people to take the piss does make life worse for everyone else. If you know your UK employment law and your employer stints on HR you can be almost unsackable.

And 2.5%. I'd love to be able to secure that kind of rise for the best people in my team.

4 0
1. Thursday 3rd January 2013 13:42 GMT Anonymous Coward
  
  Re: Making bad assumptions
  
  "I don't ask any of my team to do anything that I cannot do myself"
  
  You are either the most talented person in the world, run the least skilled IT department in the world, or are the best bullshitter in the world. Or, as you don't seem to expect your staff to do stuff you can't do, that could explain your own lack of promotion.
  
  0 0
Wednesday 2nd January 2013 21:07 GMT fnusnu

Looks like Amazon's staff are better than Chaos Monkeys...

1 0
Wednesday 2nd January 2013 21:41 GMT DaveNullstein

Shit happens.

Always will.

0 0
1. Thursday 3rd January 2013 03:40 GMT Euripides Pants
  
  Re: Shit happens.
  
  Or, in this case, clouds dissipate...
  
  1 0
Wednesday 2nd January 2013 22:36 GMT pstones578

Change Control / Change Freeze

Would it not make sense to have some proper change control? Then Amazon could have reviewed their change documentation and, hey presto, noticed a change had happened around the time of the problem. Also, while they are at it, wouldn't it be a good idea to have a change freeze around such a critical time of year? Unbelievable.

0 0
Wednesday 2nd January 2013 23:36 GMT Vince

Blame the engineer, ignore the cause.

So the problem is...

(a) Netflix have poor business continuity planning and rely on a single supplier (AWS) for its systems.

Cause: Poor management decisions/understanding

(b) AWS have poor processes that allow a single point of failure

Cause: Poor management decisions/understanding

(c) Netflix believed the "cloud" of Amazon would be redundant against anything and assumed they had covered the issues in (a)

Cause: Poor management decisions/understanding

The real issue isn't the engineer that "made an error" but the AWS system that can fail despite supposedly being uber-geo-redundant and so on, and the Netflix management who decided to put the eggs in one basket.

As I understand it, Netflix have local content caches with various ISPs so I assume the issue was the database/account side and not the underlying content availability - so it would be a *relatively* less expensive task to put a better system in place (I'm not pretending it is trivial, but it's obviously "less tricky" when you haven't got to replicate what I assume is a huge amount of content which would be costly to store/stream en masse).

A better fix would have been to have multiple providers and the ability for the Netflix client(s)/website(s) to detect a failure and choose, or be forced onto, another provider.

Of course this would require more expenditure and at £5.99 (or it seems a penny more if you subscribed more recently) it's unlikely there's enough margin I guess.

2 1
1. Thursday 3rd January 2013 01:06 GMT Don Jefe
  
  Re: Blame the engineer, ignore the cause.
  
  Well, at least we all now know Vince has never been, and probably never will be, in any sort of management role.
  
  0 0
2. Tuesday 8th January 2013 14:06 GMT Anonymous Coward
  
  Re: Blame the engineer, ignore the cause.
  
  Nice word that 'assume'
  
  0 0
Wednesday 2nd January 2013 23:47 GMT Anonymous Coward

Change control

It really says something about the Amazon change control process. It also says volumes about their support staff; both the person that did the deleting and the subsequent ones that did the troubleshooting. When they encountered missing data, the first thing should have been to look at who made a change, what the change was and what was actually changed. I think Amazon needs to invest in an AAA solution.

0 0
Thursday 3rd January 2013 00:38 GMT Anomalous Cowturd

AAA solution

Anti-Aircraft Artillery?

Make ready my 88mm please, Jeeves.

0 0
Thursday 3rd January 2013 00:40 GMT John H Woods

As with the NatWest disaster ...

... it should not be possible for a single engineer to wreak this kind of havoc: systems like this should be resistant even to deliberate malice. Your engineer could be tired, inexperienced or unwell. But they could also be a saboteur working for a competitor, an employee with a grudge, a criminal who is going to hold your system to ransom or even an out-and-out terrorist.

0 0
Thursday 3rd January 2013 01:08 GMT Don Jefe

Preparation

To prepare for every contingency is usually an excuse to ignore the real world.

- Me 2003

0 0
Thursday 3rd January 2013 01:53 GMT gcarter

/me shakes his head from side to side and adds another handful of movies to his couchpotato / newsgroup queue... can't beat locally stored content :-)

It's an irony how us web pirates have a more robust solution than the poor souls who choose to go legit ;-)

5 0
1. Thursday 3rd January 2013 02:15 GMT Fatman
  
  RE: can't beat locally stored content :-)
  
  Don't 2TB drives make for some nice amounts of locally stored content!!!!
  
  2 0
Thursday 3rd January 2013 09:37 GMT Sir Codington

Who does maintenance on Christmas eve? There is always a risk, mostly to one's holiday time.

2 0
Thursday 3rd January 2013 12:56 GMT Andrew Jones 2

I find it incredibly annoying that Netflix are still claiming it didn't affect people in the UK -

It bloody well did. But still no explanation why.....

0 0
Thursday 3rd January 2013 15:36 GMT Anonymous Coward

That's Netflix fixed... now for MSFT Media Center?

Now all we need is for the engineer, PFY or intern who went on xmas vac forgetting to flick the switch to update the TV Guide data in Media Center to sort out the updates there (we know BDS Ltd have sent data packages to MSFT for upload), then everyone will be happy :-) (For ref, UK data ended Jan 1, so we're having to use dead tree TV guides and ending up with loads of "Manual Recordings" :-( )

0 0
Monday 14th January 2013 14:46 GMT Anonymous Coward

maybe, just maybe

the idiot who didn't check his/her work should shoulder some responsibility for this. perhaps, before initiating a change that could cause a major service outage during a peak usage period, they should take a minute to really look at what they've instructed the system to do before they hit the go button.

i can't see how this is management's fault: it's just poor workmanship.

i'm assuming it's all techies who have blamed the managers. well, i am a techie and this is just someone doing a shit job because it's xmas eve and they're not paying attention.

0 0