How an Amazon engineer's slip-up started a 20-hour Netflix cock-up

An Amazon engineer hit the wrong button on Christmas Eve, deleting critical data in its load balancers and ultimately knackering vid streaming biz Netflix for 20 hours. The Netflix outage hit customers in the US, Canada and Latin America on 24 December, particularly those using games consoles and mobiles to watch films, while …

COMMENTS

This topic is closed for new posts.

I am glad I am not that man (or woman)!

7
0
Silver badge
Meh

Go on...

Go on, blame it on the little guy!

Now tell us what really went wrong....

6
0
FAIL

Sounds like a RGE - Resume Generating Event...

0
0
Anonymous Coward

Did he accidentally hit the 'DO NOT PRESS THIS BUTTON' button under the 'DO NOT PRESS THIS BUTTON' sign?

I think not.

0
0
FAIL

Alternatively, a CLM - Career Limiting Move.

1
0
Bronze badge
FAIL

RE: Sounds like a RGE - Resume Generating Event...

Preceded by an ETE - Employment Terminating Event.

The only appropriate icon for this screw-up.

0
0
Anonymous Coward

"Did he accidentally hit the 'DO NOT PRESS THIS BUTTON' button under the 'DO NOT PRESS THIS BUTTON' sign?"

It's surprising how many times someone has hit an inviting "emergency" red button - and been unable to explain why its prohibition had exerted such a fatal attraction.

0
0
Anonymous Coward

"Alternatively, a CLM - Career Limiting Move"

According to the Peter Principle it is more likely to generate a promotion.

0
0
Silver badge
Holmes

No, for that you already need upward-ballooning momentum, a nice suit and a few files on company dirt that you could "forget in the bus".

0
0
Bronze badge

RE: According to the Peter Principle it is more likely to generate a promotion.

Only if he was in manglement!!

0
0
Angel

No personal experience of such an incident, but..

It was probably the computer operators playing Frisbee with the tape container covers in the computer room and accidentally hitting the tape drive reset button with the cover.

Not that I have any actual experience of such a thing happening <cough> <cough>, but I have heard that it happens sometimes.

I'm innocent, I promise...

1
0
FAIL

In various projects I've worked on that involve critical data, we usually have a ban on major changes on Fridays or just before a public holiday, and a change freeze until Christmas is well out of the way. What went wrong at Amazon/Netflix that allowed this to happen?
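
A rule like that is simple enough to enforce in tooling rather than leave to memory. Below is a minimal sketch, assuming a hypothetical `change_allowed` helper; the freeze dates and holiday list are made-up placeholders, not anything Amazon or Netflix is known to use.

```python
from datetime import date, timedelta

# Hypothetical values: a real team would load these from its change calendar.
FREEZE_START = date(2012, 12, 17)
FREEZE_END = date(2013, 1, 2)
PUBLIC_HOLIDAYS = {date(2012, 12, 25), date(2013, 1, 1)}


def change_allowed(day: date) -> bool:
    """Return True if a major change may be deployed on `day`."""
    if FREEZE_START <= day <= FREEZE_END:
        return False  # inside the seasonal change freeze
    if day.weekday() == 4:
        return False  # Friday (Monday is 0)
    if day + timedelta(days=1) in PUBLIC_HOLIDAYS:
        return False  # the day just before a public holiday
    return True
```

A deployment script could call `change_allowed(date.today())` and refuse to proceed without an explicit emergency override.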

20
0
Silver badge
FAIL

easy

>What went wrong at Amazon/Netflix that allowed this to happen?

Poor management, which is almost always the case in these incidents. It's much easier to blame some peon, but for mission-critical infrastructure like this, not only should it not have been possible for the peon to do this accidentally, he should not have been able to affect service even if he maliciously tried (yes, I know, a pipe dream in most reactive-only crap corporate environments). If the oftentimes sociopaths in charge are going to take the big salaries, then they should occasionally be responsible for something.

12
2
Anonymous Coward

I guess it went like this.

Netflix top honcho moans to Amazon top honcho about wanting everything to be as fast as possible and super dooper for Christmas, and everything seems a bit "slow".

Head Amazon honcho moans to AWS top honcho, going "Why is Netflix complaining everything is slow RAR RAR RAR".

AWS head guy goes to Ops manager: "RAR RAR RAR I just had Our Head Honcho moan at me that the system's slow, clean it up!"

Ops manager sighs, goes to team: "I know it's bollocks, but Bob, you need to run the maintenance on the nodes for Netflix coz everyone is moaning."

Bob, wanting to go home and start drinking, runs the processes but against the wrong object ID, then goes home to the family / to the pub.

That is one possible option. Given Netflix isn't really a mature outfit and AWS will do what they're told, I can imagine that being the situation.

9
0
Silver badge

Re: easy

Bad managers blame their workers, just like bad workers blame their tools.

And yeah, I never make any change on a Friday that can't be reverted using ConnectBot on my mobile phone, on a crowded no. 38 bus (which route passes through a fairly serious 3G blackspot).

4
0
Anonymous Coward

Re: easy

Sadly I work in a stupid company where all major things happen on a Sunday morning 6am>10am (with a single engineer - who is normally also the only on call engineer.)

0
0
Bronze badge
WTF?

Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

Are you so sure that is all bad?

WROK PALCE had to """fix""" a telco related """fire hazard""" involving a shitload of phone lines that were not "plenum rated cable" (according to """fire marshal"""). On a Sunday morning, WROK PALCE is only manned by security, and no one else. So, I have to ask, do you want phone lines going down during the business day, with employees at WROK, or do you want the phone lines going down when most employees are at church???

Let me see, I will take Sunday morning, any time for this kind of downtime.

2
1
Unhappy

Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

Church?

Seriously?

Blimey.

5
2
Anonymous Coward

Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

'Murrica.

1
0
Silver badge
Stop

"What went wrong at Amazon/Netflix that allowed this to happen?"

They made a techie work on Xmas eve. That was their first mistake. He probably didn't want to be there and wanted to get home.

The second mistake was being too tight to pay the over-time for TWO guys to watch each other's backs and spot mistakes.

0
0
Bronze badge
Pint

Re: Church? Seriously?

At least that is what they say!!

Now, if you think I believe most of them, then, I have this swampland in Florida I could sell you.

For a few, I seriously doubt that they could assume a stand-up position on a Sunday morning.

Icon expresses why!!

0
1
IT Angle

"What went wrong at Amazon/Netflix that allowed this to happen?"

Did Amazon/Netflix outsource/offshore to India à la RBS?????

Just asking

0
0
Anonymous Coward

Re: easy

Lucky bugger, try between 1am and 5am Sunday mornings....

0
0
Silver badge

Re: Sunday morning 6am>10am

Granted 6pm < 10 pm on Friday would probably be better depending on the business, that's still better than 4am>8am Monday morning.

Of course the guys I really feel sorry for are the point of sale vendors for fast food joints. My friend's migration schedule is always 3am to done with training before and after.

0
0
Van

24x7 ?

The poster claiming it was operators playing frisbee is a closer guess. I would expect the data center to be manned by a large team of operators 24 x7. And with a 25% shift allowance + extra holidays, they most certainly would want to be there. Eating Pizza, watching TV, in between housekeeping tasks.

0
1
Silver badge

> Ops manager sighs

and says, "It's Christmas, we're in the middle of a change freeze and we won't be fixing anything which isn't already causing an outage, or is likely to cause an outage before the end of the freeze."

Amateurs!

1
0

Silver badge
Coat

> we usually have a ban on major changes on Fridays or just before a public holiday,

Yep, changes are Tuesdays (to leave Monday for final planning and cleaning up after the weekend and avoiding "Monday-itis") and Thursdays (because no-one wants to work weekends and it's cheaper on overtime payments).

Would it be rude to point out that torrents are naturally fault tolerant and cheaper than F5s?

0
0
Anonymous Coward

Huh ?

I would have thought that by definition, "Elastic Load Balancing" would be an adaptive process. Just deleting the state data would temporarily unbalance things until it "learned" again.

At least that's how *I* would have implemented it. If I was going to call it "Elastic".
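
To make that concrete, here is a toy sketch of a balancer that rebuilds its routing state from health checks after the state is wiped. Every name in it (`RelearningBalancer`, `wipe_state`, `relearn`, `pick`) is hypothetical, and it says nothing about how Amazon's ELB actually works. It also assumes the list of candidate backends is kept somewhere the wipe cannot reach.

```python
from typing import Callable, Iterable, List


class RelearningBalancer:
    """Round-robin balancer whose routing state can be rebuilt on demand."""

    def __init__(self, candidates: Iterable[str],
                 is_healthy: Callable[[str], bool]) -> None:
        self.candidates = list(candidates)  # assumed to survive a state wipe
        self.is_healthy = is_healthy        # health probe supplied by caller
        self.state: List[str] = []          # learned list of healthy backends
        self._next = 0

    def wipe_state(self) -> None:
        """Simulate the learned state being deleted by accident."""
        self.state = []
        self._next = 0

    def relearn(self) -> None:
        """Rebuild the state by probing every candidate backend."""
        self.state = [b for b in self.candidates if self.is_healthy(b)]

    def pick(self) -> str:
        """Return the next backend, relearning first if the state is empty."""
        if not self.state:
            self.relearn()
        if not self.state:
            raise RuntimeError("no healthy backends")
        backend = self.state[self._next % len(self.state)]
        self._next += 1
        return backend
```

With this design a wipe costs one extra round of health checks on the next request rather than an outage, which is the behaviour I would expect from something called "Elastic".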

15
0
Silver badge

Re: Huh ?

I would have made it so that if you strain it too hard it snaps and the pieces go flying across the room and hurt people. They didn't even invite me to an interview though.

14
0
Bronze badge
WTF?

Re: that if you strain it too hard it snaps and the pieces go flying across the room

WHY did an image of Steve Ballmer sitting in an executive chair, being ricocheted by an overstretched bungee cord and hurled out the window of Microsoft's HQ, suddenly pop up in my mind?????

0
0
Happy

Re: WHY, did an image of Steve Ballmer ...

Possibly for the same reason that in my house our two new all-in-ones running 'Microsoft Window' almost ended up with a mug shot of His Steveness mapped to the ClassicShell start button. We decided against, though the thought still tickles.

0
0

Re: WHY, did an image of Steve Ballmer ...

A bungee boss????

0
0
Silver badge

Re: Huh ?

Even AT&T has problems building elasticity that can handle losing a large chunk of their normal bandwidth. The expectation is for random single failures that account for maybe 1% of the load. They get good at dealing with those. But kill 25% instantaneously and the cascade failures start taking down the rest of the system. Sure they stress test it in a VM lab, but for some reason the real world never seems to work that way. And you rarely get real world

No it shouldn't be that way, but all too often it is.

0
0
Silver badge
WTF?

Conspiracy Theory Alert

Amazon own LoveFilm.

Cue much chin-scratching .....

2
0
Silver badge

good. people shouldn't be watching movies on the eve of Jesus's birthday.

Perhaps the amazon engineer was working for the church

5
3
Silver badge
Joke

You forgot your icon (I hope).

3
0
Angel

Thanks -

I probably needed to recalibrate my sarcasm detector anyway.

0
0

"good. people shouldn't be watching movies on the eve of Jesus's birthday."

Right, they should be fighting with loved ones. Or, in the words of Paul Gilmartin, "dysfunction rears its yuletide head".

1
0
Silver badge
FAIL

Do you even pagan?

I sure hope you celebrated The Aramaean One's Birthday in front of Stonehenge!

0
0

"good. people shouldn't be watching movies on the eve of Jesus's birthday Dies Natalis Solis Invicti"

Fixed it for you.

There's also the god of wine, Dionysus, also called Bacchus, also called Iacchus, Born December 25th to a virgin mother; performing miracles such as changing water into wine; died and was resurrected after three days and ascended into heaven.

If I remember correctly there was also a minor cult in the Roman army who worshipped a dead Roman soldier who was born on December 25th, died/was killed and was resurrected after three days.

0
0
Bronze badge
Facepalm

Change processes themselves have to be tested and controlled. That errant maintenance process should have been run against a test environment prior to being used on production sites.

And then a test suite needs to be included and run to ensure that the test/production sites are still up and running after the change is applied.
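
The two steps above can be sketched as a small gate. This is only an illustration: `apply_with_checks`, the dict-based environments and the smoke test are hypothetical stand-ins, and the rollback on a failed production check is an extra assumption beyond what is described above.

```python
from typing import Callable, Dict

Env = Dict[str, str]  # toy stand-in for an environment's state


def apply_with_checks(change: Callable[[Env], None],
                      smoke_test: Callable[[Env], bool],
                      test_env: Env,
                      prod_env: Env) -> bool:
    """Apply `change` to test first; touch production only if test survives."""
    change(test_env)
    if not smoke_test(test_env):
        return False  # change broke test, production is left untouched

    backup = dict(prod_env)  # snapshot so a bad change can be undone
    change(prod_env)
    if not smoke_test(prod_env):
        prod_env.clear()
        prod_env.update(backup)  # assumed rollback step
        return False
    return True
```

A maintenance process that deletes load balancer state would fail the first smoke test and never reach production.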

0
1
FAIL

This is why you don't let Devs into production environments - EVER.

8
1
FAIL

You've obviously never experienced...

...what production can actually do with your *precious* designs, have you? It barely fucking met the spec before you guys got yer hands on it etc

0
0
Anonymous Coward

Wonder if Amazon have been hiring ex RBS employees?

2
0

Least they didn't blame it on outsourcing.

1
0

but isn't using AWS a form of outsourcing by Netflix ?

0
0
Bronze badge

RE: Wonder if Amazon have been hiring ex RBS employees?

Damn you!!!!

Another keyboard fucked up!!!!!!

0
0
Bronze badge
Linux

Now that I know

Netflix is using Amazon's cloud I may just have to drop them. This is the kind of shit that is going to happen more and more as these idiots give up control of their data and infrastructure.

Besides, the fact that Netflix doesn't have a Linux app is probably the main reason I want to drop them.

0
3
Trollface

"going to happen more and more"

At current monthly subscription rates that must have been almost $0.25 worth of service we each lost there, no joke if they keep that up the whole economy will soon grind to a screeching halt! One less option for ignoring friends and family, on Christmas of all days when you all know we need it most!! Think of the children!!!

0
1
