Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down. In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. …

COMMENTS

Thursday 2nd March 2017 19:11 GMT Anonymous Coward

So much for fault injection testing !

" Hey Ravi - run this CLI ... that is what fixed it last time ... "

4 0 Reply
1. Friday 3rd March 2017 15:09 GMT Mpeler
  
  Re: So much for fault injection testing !
  
  Here's a song for them then:
  
  I've looked at clouds from both sides now
  
  From up and down, and still somehow
  
  It's cloud illusions I recall
  
  I really don't know clouds at all.....
  
  4 0 Reply
2. Friday 3rd March 2017 19:11 GMT The IT Ghost
  
  Re: So much for fault injection testing !
  
  Plenty of fault was injected, no doubt. Probably 4 or 5 people shown the door, none of them the one who actually flubbed the command.
  
  0 0 Reply
3. Monday 6th March 2017 08:11 GMT TheVogon
  
  Re: So much for fault injection testing !
  
  "an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.
  
  "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."
  
  Wow they make manual command line changes that can impact lots of production systems?! Glad I don't use Amazon then. Such changes should be planned, change controlled, scripted in a file, and 4 eyed before pressing go....
  
  1 0 Reply
Thursday 2nd March 2017 19:14 GMT Herby

To err is human...

...to really foul things up requires a computer.

To guarantee a mess put a human in charge of said computer. Enough said.

Fat fingers win every time as in "I only changed one ~~card~~ line"...

I'm showing my age...

28 0 Reply
1. Thursday 2nd March 2017 19:31 GMT TitterYeNot
  
  Re: To err is human...
  
  "To guarantee a mess put a human in charge of said computer. Enough said."
  
  And to guarantee a shitstorm of Diluvian proportions, put said barely-technical human in front of some automation they don't really understand - but hey it looked great in the management meeting.
  
  It'll save us loads of money, they said.
  
  It'll guarantee five nines availability, they said.
  
  It's foolproof, they said...
  
  23 0 Reply
2. Thursday 2nd March 2017 19:36 GMT Anonymous Coward
  
  Re: To err is human...
  
  ""I only changed one ~~card~~ line"..."
  
  What did you change?
  
  Nothing.
  
  What did you change?
  
  Nothing .....that is relevant.
  
  16 0 Reply
3. Friday 3rd March 2017 13:12 GMT SotarrTheWizard
  
  Re: To err is human...
  
  Funny that you mention punchcards: I recently pulled one of my old boxes of code stacks out of the cellar, to let my grand-daughters make the quintessential early-70s craft project: the Punchcard Christmas Wreath.
  
  I had forgotten the joys of card stacks, and the multiple marker and highlighter lines across the top of the deck to help quickly restore the deck if you dropped it.
  
  Good times, good times. . .
  
  4 0 Reply
  1. Friday 3rd March 2017 14:50 GMT Anonymous Coward
    
    Re: To err is human...Punch cards divine
    
    The 1970s, with their punch cards, were good times, a peak in many ways for Canadians, and I'm not talking Fortran WATFOR or WATFIV.
    
    Back then the average family income was about $10,000. That's about $65,000 today, which if you look up family income is still roughly about the middle of family incomes today. No real growth but apparently not much of a set back, until we look at where that income comes from and goes.
    
    In the 1970s family income usually came from a single earner. Today almost all $65K families are at least dual income and, thanks to dramatic changes in Canadian taxes (who pays and how much is collected), they do not get to keep much of it. Even the US numbers show what good times the past was when it came to growth and optimism.
    
    "Expressed in 1950 dollars, U.S. median household income in 1950 was $4,237. Expenditures came to $3,808. Savings came to $429, or 10 per cent of income. The average new-house price was roughly $7,500 – or less than 200 per cent of income. By 1975, however, it took 300 per cent of median household incomes to buy a house; by 2005, 470 per cent."
    
    Many more years in school and training are required to get a job, all adults in a family have to work, most at jobs with much longer hours and often no benefits and today it is almost impossible to get a detached house in a major Canadian city for even 10X the annual income of the average high school graduate.
    
    When I look fondly at punch cards I am reminded that the good times were largely the result of citizens being "allowed" to share in the wealth they were creating.
    
    7 0 Reply
  2. Friday 3rd March 2017 22:10 GMT Anonymous Coward
    
    Re: To err is human...
    
    as late as 2001... we used blank punch cards at IBM as note pads / post it notes .. the file cabinets were stacked with them instead of note pads.
    
    1 0 Reply
4. Friday 3rd March 2017 17:47 GMT fidodogbreath
  
  Re: To err is human...
  
  In the original Reg article about the S-pocalypse, I commented that the last voice command ever was "Alexa, turn off all the servers." Turns out, that's more or less what happened.
  
  Since the outage took down IFTTT, "Alexa, turn all the servers back on" didn't work.
  
  2 0 Reply
5. Friday 3rd March 2017 18:11 GMT ZootCadillac
  
  Re: To err is human...
  
  Herby, don't misplace the punch tape!
  
  0 0 Reply
Thursday 2nd March 2017 19:19 GMT Anonymous Coward

Homo Sapien Ergonomics

I wish my finger tips were smaller than the average keyboard key.

Otherwise, I'm quite proud of my Neanderthal heritage.

15 0 Reply
1. Friday 3rd March 2017 08:25 GMT MyffyW
  
  Re: Homo Sapien Ergonomics
  
  I'm quite proud of my amply covered form, but plump fingers are a bloody nuisance.
  
  0 0 Reply
  1. Saturday 4th March 2017 10:45 GMT Anonymous Coward
    
    Re: Homo Sapien Ergonomics
    
    plump fingers are a bloody nuisance.
    
    Yes, but for most of the bigger boned, that's down to choices they've made (eg, whilst passing Greggs). It is also one that they can unmake, if the downsides of podgy digits get too much?
    
    0 0 Reply
2. Friday 3rd March 2017 09:45 GMT gotes
  
  Re: Homo Sapien Ergonomics
  
  I wish the enter key wasn't so close to the backspace key.
  
  6 0 Reply
Thursday 2nd March 2017 19:25 GMT John Smith 19

Makes me wonder how many others in the "playbook" have this capacity.

Well it should be making Amazon wonder that.

Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?

16 1 Reply
1. Thursday 2nd March 2017 23:35 GMT Anonymous Coward
  
  Re: Makes me wonder how many others in the "playbook" have this capacity.
  
  They dig into this to an extent in the full statement. The command alone wasn't enough to do it. It was running a command designed for a much smaller scale of S3 over too many machines causing a bunch of systems subsequently layered over those machines to mutually screw each other up.
  
  Critically it was the requirement to restart that really screwed them. The system hadn't been restarted in so long no one noticed the restart procedure took a really, really long time. Cheeky little humblebrag, methinks.
  
  They also mention a full audit of existing operations to ensure sanity checks are in place. I for one look forward to the outage caused by being unable to effect a change to as many machines as actually needed, because sod's law's just like that.
  
  12 0 Reply
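
One kind of sanity check the comment above alludes to could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: `check_removal`, `CapacityError`, the `minimum` floor and the 5% `max_fraction` limit are hypothetical, not anything taken from Amazon's statement or tooling.

```python
class CapacityError(Exception):
    """Raised when a removal request looks too large to be intentional."""


def check_removal(current, to_remove, minimum, max_fraction=0.05):
    """Refuse removals that breach a capacity floor or a per-command limit.

    current:      servers currently in service (hypothetical input)
    to_remove:    servers the operator asked to remove
    minimum:      fewest servers the subsystem needs to keep running
    max_fraction: largest share of the fleet one command may remove
    """
    if to_remove <= 0:
        raise ValueError("to_remove must be a positive number")
    if current - to_remove < minimum:
        raise CapacityError(
            f"removing {to_remove} leaves {current - to_remove}, "
            f"below the minimum of {minimum}"
        )
    if to_remove > current * max_fraction:
        raise CapacityError(
            f"removing {to_remove} of {current} exceeds the "
            f"{max_fraction:.0%} per-command limit"
        )


if __name__ == "__main__":
    check_removal(current=1000, to_remove=20, minimum=900)  # passes quietly
    try:
        check_removal(current=1000, to_remove=200, minimum=900)
    except CapacityError as err:
        print("refused:", err)
```

As the same comment predicts, a guard like this also blocks a legitimately large change, so a real tool would presumably need some reviewed override path.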
  1. Friday 3rd March 2017 08:10 GMT Bronek Kozicki
    
    Re: Makes me wonder how many others in the "playbook" have this capacity.
    
    I think they need "chaos monkey" to occasionally reset some machine or shutdown some process. At random. That would force them to learn building inherently resilient systems, quickly.
    
    7 1 Reply
  2. Friday 3rd March 2017 09:31 GMT John Smith 19
    
    "They also mention a full audit of existing operations to ensure sanity checks are in place. I"
    
    Oh dear, that sounds like an event.
    
    Not a process.
    
    Which suggests they will find (and hopefully fix) all such issues this time round, but a whole new bunch will accumulate over time till the next one surfaces and borks them again.
    
    Periodic review following significant (cumulative) changes should be SOP for such a large operation.
    
    3 0 Reply
2. Friday 3rd March 2017 10:15 GMT Anonymous Coward
  
  Re: Makes me wonder how many others in the "playbook" have this capacity.
  
  "Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?"
  
  Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? types prompts (or even somehow require 2 people/keys nuclear missile launch style), not something that can be done with a single mistyped command. In the end it's a balancing act between treating your admins like responsible professionals and not children who need to be hand-held, but also ensuring one tired person can't make an almighty cock up.
  
  4 1 Reply
  1. Friday 3rd March 2017 13:22 GMT Keith Langmead
    
    Re: Makes me wonder how many others in the "playbook" have this capacity.
    
    "However it should be made a multistep process with plenty of Are You Sure? types prompts"
    
    Not just "are you sure Y/N", but also "Here's exactly what is about to be done... is that correct and what you actually intended? Y/N", otherwise anyone would just assume the command they'd entered would do what THEY intended, not what the command was about to do.
    
    7 0 Reply
    1. Friday 3rd March 2017 18:11 GMT Bronek Kozicki
      
      Re: Makes me wonder how many others in the "playbook" have this capacity.
      
      Not Y/N, but "in the prompt below, enter the part missing from the above shell command, to make it work". Force them to read and think, that is.
      
      1 0 Reply
      1. Wednesday 28th February 2018 10:00 GMT donk1
        
        Re: Makes me wonder how many others in the "playbook" have this capacity.
        
        1st prompt
        
        This will shut down 1040 servers, please type 1040 to continue.
        
        2nd prompt
        
        This will reduce capacity enough to cause a service failure for the following 8 services
        
        A
        
        ...
        
        G
        
        Please type "8 SERVICE FAILURES" to continue.
        
        0 0 Reply
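
The two-step prompt described above could be sketched roughly as follows in Python. This is only an illustration: `confirm_by_typing` and `confirm_shutdown` are hypothetical names, nothing here reflects any real AWS tool, and the `read` parameter exists only so the prompts can be exercised without a terminal.

```python
from typing import Callable, Sequence


def confirm_by_typing(message: str, expected: str,
                      read: Callable[[str], str] = input) -> bool:
    """Return True only if the operator types `expected` exactly."""
    answer = read(f'{message} Please type "{expected}" to continue: ')
    return answer.strip() == expected


def confirm_shutdown(server_count: int, affected_services: Sequence[str],
                     read: Callable[[str], str] = input) -> bool:
    # First prompt: the operator has to retype the number of servers,
    # so a reflexive "y" is never enough.
    if not confirm_by_typing(
        f"This will shut down {server_count} servers.",
        str(server_count),
        read,
    ):
        return False
    # Second prompt: only shown when some services would fail.
    if affected_services:
        phrase = f"{len(affected_services)} SERVICE FAILURES"
        listing = ", ".join(affected_services)
        return confirm_by_typing(
            f"This will cause a service failure for: {listing}.",
            phrase,
            read,
        )
    return True


if __name__ == "__main__":
    if confirm_shutdown(1040, ["A", "B", "C"]):
        print("proceeding")
    else:
        print("aborted")
```

The point of the design is that the text to type is derived from the consequences of the command, so the operator has to read them.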
    2. Saturday 4th March 2017 00:19 GMT Allan George Dyer
      
      Re: Makes me wonder how many others in the "playbook" have this capacity.
      
      "However it should be made a multistep process with plenty of Are You Sure? types prompts"
      
      So HAL was just working to design?
      
      "I think you know what the problem is, Dave"
      
      0 0 Reply
  2. Friday 3rd March 2017 23:45 GMT John Smith 19
    
    "not something that can be done with a single mistyped command. "
    
    My point exactly.
    
    Yes servers have to be taken down. Yes sometimes clusters of servers have to be taken down. But it should be very rare that all need to be taken down at the same time.
    
    And it should be impossible to do so without whoever's doing it realizing exactly what is about to happen.
    
    0 0 Reply
  3. Sunday 5th March 2017 06:33 GMT Adam 1
    
    Re: Makes me wonder how many others in the "playbook" have this capacity.
    
    > Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? types prompts
    
    How about "Please enter the shutdown validation GUID. This can be found on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard'."
    
    0 0 Reply
3. Friday 3rd March 2017 10:31 GMT Wayland
  
  Re: Makes me wonder how many others in the "playbook" have this capacity.
  
  One command is better than having to type a 100. 100 commands put into a file, we call that a 'program'.
  
  1 1 Reply
Thursday 2nd March 2017 19:26 GMT gv

PEBKAC

That is all.

17 0 Reply
Thursday 2nd March 2017 19:27 GMT Anonymous Coward

Next problem:

"I'm sorry Dave, I can't let you do that"

23 0 Reply
Thursday 2nd March 2017 19:27 GMT Your alien overlord - fear me

I want to know what was the command they were supposed to enter and what did they actually enter.

5 1 Reply
1. Thursday 2nd March 2017 19:58 GMT Anonymous Coward
  
  It's a super awesome convenience to be able to hit tons of machines in a big data center operation, but as you can see things can go wrong in a big way. It would be interesting to see a pseudo-syntax of what happened, whether this was a web GUI, a CLI, a script, what have you. I can tell you that at the Yahoo! CEO shuffle I attended a few years back we could address wide swaths of machines, but most of the folks knew what not to do, and how to break up big jobs (ha!) into easy-to-handle tasks.
  
  For instance, my first task was to run a script that fixed the storage issue with NetApp "moosehead" disks that would cause it to lose data and, the extra cool thing, not be able to recover from their RAID! Good times! This was on over 300 mail "farms", which were middle-tier mail handling clusters that sorted mail from junk/spam. The spam goes off to cheapo storage, and "good mail" goes to the main stores. Anyway, the IDs needed fixing to point users' mail to the new storage, by running a script on close to 6000 machines, no VMs, all pizza boxes.
  
  No WAY was I to just go nuts and try to run them all at once, even though you could very well do that with Limo, their internal custom multi-host command tool, later replaced by a tool called Pogo. Clusters of machines could also be addressed with aliases, so I could say "all hosts in a group with a simple name"; turn off the flag to show availability to the VIP. For the script work I was clued in via change management meetings, then I ran the script on one farm to make sure it worked and that we did not clobber any users, then we did 10 farms, then 100, and the rest (are here on Gilligan's Island!). No problem. My goal was to not cause any issue that would make it into the news. :P I also had nothing to do with security, which is a big embarrassment to their new owners, I'm sure.
  
  I was also in Search (AKA the Bing Gateway) and there we typically chose UTC midnight on Wednesdays to perform updates to the front end search servers. In the US there were two big data centers, each with two clusters of 110 hosts to handle the web facing search front end. For maintenance, you just choose a single host, take it out of the global load balancer, then update it, and drop it back in with extra monitoring turned up. If it does not crap itself, we could then take out half of a data center, do the update, put them back in, then repeat the process three more times for the other clusters, and that was that. But, yes, super easy to fuck up and take out every data center if you don't pay attention to your machine lists.
  
  6 1 Reply
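
The "one farm, then 10, then 100, then the rest" approach described above can be sketched as a small Python function. This is an assumption-laden illustration, not how Limo or Pogo worked: `staged_rollout`, `update` and `healthy` are hypothetical names, and the health check is whatever the operator chooses to supply.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    stage_sizes: Iterable[int],
    update: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Update hosts in stages and stop after the first unhealthy stage.

    Returns (hosts updated so far, True if every stage stayed healthy).
    """
    updated: List[str] = []
    remaining = list(hosts)
    # After the listed stages, one final stage takes everything left over.
    for size in list(stage_sizes) + [len(remaining)]:
        stage, remaining = remaining[:size], remaining[size:]
        if not stage:
            continue
        for host in stage:
            update(host)
        updated.extend(stage)
        # Check the whole stage before touching any more machines.
        if not all(healthy(host) for host in stage):
            return updated, False
    return updated, True


if __name__ == "__main__":
    farms = [f"farm{i}" for i in range(300)]
    done, ok = staged_rollout(
        farms,
        stage_sizes=[1, 10, 100],
        update=lambda host: None,          # stand-in for the real script
        healthy=lambda host: host != "farm5",
    )
    print(len(done), ok)
```

With the stand-in health check above, the rollout halts after the second stage, so only 11 of the 300 farms have been touched when the problem is noticed.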
  1. Friday 3rd March 2017 11:42 GMT Anonymous Coward
    
    It's a super awesome convenience...
    
    You could take down Bing or Yahoo! any time you like and for as long as you like for "maintenance" and pretty much no-one would ever notice. In fact, why not just leave them down and free up some server space?
    
    2 1 Reply
    1. Saturday 4th March 2017 15:09 GMT fredesmite
      
      Re: It's a super awesome convenience...
      
      Quite honestly - if Bing , FB, google , yahoo , blah blah - disappeared would they really be missed ?
      
      They produce nothing other than hordes of advertising spam . Remember the days before that crap existed .. young adults could actually have a face to face conversation , working meant doing something other than browsing the internet for links to share among co-workers ...
      
      1 0 Reply
  2. Wednesday 28th February 2018 10:16 GMT donk1
    
    6000 machines...so run 200 machines at a time for 30 times.
    
    What is this obsession with 10,100,2000,rest and doing a massive population in 5 steps?
    
    Even if 2110 machines worked fine how long would it take to fix the last 3900 machines if enough of them broke?
    
    For failures it is not the number of times you have done it before but the size of the failure domain and how long it takes to fix.
    
    It should be possible to roll out automatically in small batches and even have multiple upgrades rolling out at the same time on an automatic schedule, ripple across the farm!
    
    If it is automated and scheduled who cares how many batches of upgrades are run?
    
    You would catch errors with less impact that way as the failed batch size would be smaller and it would be minimal extra work if designed correctly.
    
    This is the next stage in cloud service design - being able to have slower rolling upgrades with smaller batches!
    
    0 0 Reply
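
The failure-domain argument above comes down to simple arithmetic, which this short Python sketch works through. The helper names `fixed_batches` and `growing_stages` are made up for the illustration; the figures (6000 machines, batches of 200, stages of 10, 100 and 2000) are the ones quoted in the comment.

```python
def fixed_batches(total, batch_size):
    """Split `total` machines into equal batches (the last may be smaller)."""
    full, rest = divmod(total, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def growing_stages(total, stages):
    """The listed stage sizes, then one final stage with everything left."""
    used = sum(stages)
    return list(stages) + ([total - used] if total > used else [])


fixed = fixed_batches(6000, 200)
growing = growing_stages(6000, [10, 100, 2000])

# Number of steps and the largest group of machines changed in one step.
print(len(fixed), max(fixed))      # 30 200
print(len(growing), max(growing))  # 4 3890
```

The largest single step is the worst-case number of machines that can break at once: 200 for the fixed batches against 3890 for the final "rest" stage.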
2. Thursday 2nd March 2017 21:22 GMT fronty
  
  rm -rf /
  
  4 0 Reply
  1. Thursday 2nd March 2017 22:51 GMT Kevin McMurtrie
    
    Funny, this should have finished while I was at lunch
    
    $ cd storage
    
    $ rm -rf tmp1* tmp2* tmp3 *
    
    19 0 Reply
    1. Friday 3rd March 2017 09:21 GMT muddysteve
      
      Re: Funny, this should have finished while I was at lunch
      
      >$ cd storage
      
      >
      
      >$ rm -rf tmp1* tmp2* tmp3 *
      
      That's always been the trouble with computers - they do what you tell them to, rather than what you wanted them to.
      
      4 0 Reply
    2. Friday 3rd March 2017 09:28 GMT Doctor_Wibble
      
      Re: Funny, this should have finished while I was at lunch
      
      When it comes to spotting mistakes, the first guess is probably the correct one - and having had numerous requests for file recovery over the years, the 'extra space' problem is not that rare.
      
      Perhaps oddly it seemed to be more common amongst people who did know what they were doing but didn't stop to re-inspect what they typed to see if they accidentally batted the space bar somewhere.
      
      Though at the other end of the scale, someone trying to follow unfamiliar instructions printed in a poorly-selected font where they have been told 'do this exactly' and it sure as hell looks like that's meant to be a space there...
      
      1 0 Reply
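
To make the "extra space" problem concrete, here is a small Python sketch that splits a command line the way a POSIX shell would split it into words, then flags any argument made up of nothing but wildcard characters. The `bare_wildcards` helper is hypothetical and is not a feature of `rm` or of any shell; it only illustrates the kind of check a wrapper script might perform before running the command.

```python
import shlex


def bare_wildcards(command_line):
    """Return the arguments that consist only of wildcard characters."""
    args = shlex.split(command_line)[1:]   # drop the command name itself
    return [arg for arg in args if arg and set(arg) <= set("*?")]


typed = "rm -rf tmp1* tmp2* tmp3 *"      # stray space before the last *
meant = "rm -rf tmp1* tmp2* tmp3*"

print(shlex.split(typed))
print(bare_wildcards(typed))   # ['*']
print(bare_wildcards(meant))   # []
```

In the mistyped version the final `*` is a word of its own, so the shell expands it to every name in the current directory (apart from dotfiles, by default) before `rm` even starts.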
    3. Friday 3rd March 2017 12:08 GMT Colin Bull 1
      
      Re: Funny, this should have finished while I was at lunch
      
      It is very easy to set an alias for rm so that it lists all directories it is going to delete and asks you for confirmation first - simples
      
      0 1 Reply
      1. Friday 3rd March 2017 22:32 GMT Anonymous Coward
        
        Re: Funny, this should have finished while I was at lunch
        
        or just use " rm -i"?
        
        0 0 Reply
    4. Friday 3rd March 2017 23:21 GMT stu 4
      
      Re: Funny, this should have finished while I was at lunch
      
      I did similar thing about 2 months ago on my mac while trying to tidy stuff up in the root drive.
      
      UserTemp
      
      Usertemp
      
      ...
      
      sudo rm -rf User*
      
      hmm.that's taking an awful long time to delete some temporary crap....
      
      ..argh!@!!@#!@^#^
      
      CntlC CntlC CnltC
      
      Luckily good old timemachine got me back to an hour before and I had a 'Users' directory again.
      
      I have to say, in 10 years of mac ownership... one of the many many many times timemachine has got me out of a deep deep hole.
      
      I also remember one time, about 20 years ago - working for a large UK telecom company...needed to reboot one of the live boxes that handled 30% of the load of UK non geographic phone calls (0845, 0800, etc)...
      
      sudo shutdown now -r
      
      ....
      
      ...
      
      hmm can't seem to connect to that... doesn't seem to be coming back up..
      
      It was in an unmanned exchange 30 miles from the nearest engineer.... had to get one of em to go out there, and press the ON button again.
      
      0 0 Reply
3. Friday 3rd March 2017 08:07 GMT roselan
  
  rm -rf //
  
  0 0 Reply
4. Friday 3rd March 2017 13:21 GMT TomChaton
  
  re: ...and what did they actually enter.
  
  I suspect it had an asterisk in it somewhere.
  
  1 0 Reply
Thursday 2nd March 2017 19:30 GMT Dwarf

This command will affect 13,432,454,456,234 objects. Are you sure?

Of course I'm sure, it's pre-programmed: I hit Yes when any pop up or confirmation is shown.

17 0 Reply
1. Thursday 2nd March 2017 19:43 GMT Anonymous Coward
  
  We used to ask for double confirmation on important decisions like abandoning things.
  
  We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.
  
  28 0 Reply
  1. Friday 3rd March 2017 14:52 GMT Anonymous Coward
    
    We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.
    
    Works even better if you present the two dialogs in a random order...
    
    3 0 Reply
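
The inverse-question idea from the two comments above, including the random ordering, might be sketched like this in Python. The function name `double_confirm`, the wording of the questions and the `read`/`rng` parameters are all assumptions made for the example.

```python
import random


def double_confirm(action, read=input, rng=random):
    """Ask two opposite questions in random order.

    The action goes ahead only if the operator answers "yes" to the
    direct question and "no" to the inverse one, so answering "yes"
    to everything cancels it.
    """
    questions = [
        (f"Really {action}? [yes/no] ", "yes"),
        ("Keep everything and cancel? [yes/no] ", "no"),
    ]
    rng.shuffle(questions)  # the operator cannot learn a fixed key sequence
    for text, required in questions:
        if read(text).strip().lower() != required:
            return False
    return True


if __name__ == "__main__":
    if double_confirm("abandon all changes"):
        print("abandoning")
    else:
        print("cancelled")
```

Because the order changes from run to run, muscle memory for "yes, yes" or "yes, no" no longer works and each question has to be read.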
Thursday 2nd March 2017 20:05 GMT Daedalus

Wur doomed

The real Y2K problem was that in the year 2000 technology got big enough that there would never be enough wise people to look after it.

15 0 Reply
Thursday 2nd March 2017 20:20 GMT Anonymous Coward

SELECT * FROM EC3_Instance THEN DROP ALL$

Beware the wildcard!

1 2 Reply
Thursday 2nd March 2017 20:22 GMT Anonymous Coward

Availability Zones

What Amazon left out, and what El Reg didn't mention in their article 12 hours ago, is Availability Zones. You're not supposed to have to go multi-region in order to be able to sustain a major AWS outage. Being in multiple AZs is supposed to allow you to survive a fat finger by an AWS employee.

The fact that Amazon's statement talks so casually about US-EAST-1 S3 makes it clear that there is no segmentation of S3 between AZs. If S3 isn't segmented that probably means other AWS services aren't either. Paid extra for multi-AZ RDS? Added extra EC2 instances for multi-AZ load balancing? It won't help at all if RDS and ELB are administered at the regional level anyway.

I think Amazon has some splaining to do. If their own services aren't redundant across AZs then what is the point of customers paying extra to be in multiple AZs? Is the only independent component of AZs the power source? That is a far cry from Amazon's selling points of multiple AZs.

3 8 Reply
1. Thursday 2nd March 2017 20:25 GMT diodesign
  
  Re: Availability Zones
  
  We didn't mention AZs because S3 doesn't use availability zones. That's for EC2.
  
  C.
  
  15 1 Reply
  1. Friday 3rd March 2017 06:14 GMT Anonymous Coward
    
    Re: Availability Zones
    
    > We didn't mention AZs because S3 doesn't use availability zones. That's for EC2.
    
    Pretty much every service uses AZs except for S3. RDS, EBS, EFS, Elasticache, ELB. Maybe S3 doesn't because it was one of their original services. But it's worth asking why they haven't upgraded it yet. If they had, most sites that were affected by the outage would probably have been fine.
    
    1 1 Reply
2. Thursday 2nd March 2017 21:18 GMT jamesb2147
  
  Re: Availability Zones
  
  Also they're still physically the same datacenter, so susceptible to combinations of backhoes, bad weather, and poorly performing power cutover systems, etc.
  
  Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.
  
  Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.
  
  This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems, no matter how "easy" they are... Operations teams tend to focus like a laser on uptime and stability, while developers are more interested in maximizing new features.
  
  6 1 Reply
  1. Friday 3rd March 2017 00:27 GMT Doctor Syntax
    
    Re: Availability Zones
    
    "If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement."
    
    We keep hearing people saying things like this. And we have to keep replying that marketing has set inappropriate expectations with these very people who are the ones who make the decisions. They've been told that ~~cloud~~ someone else's computer is cheap and that it's resilient.
    
    "This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems"
    
    To some extent I take objection to this. Back in the day it was possible to be in charge of development and operation and be paranoid about stability and uptime. It encouraged not developing what you knew you couldn't run. Times have changed and not, I think, for the better.
    
    But some ~~cloud~~ someone else's computer usage is shadow IT, paid for with a company credit card by people who don't see the need for all the costs and time needed for the detailed stuff which enables in-house developers and operations to combine to provide reliable systems. Don't assume either real developers or operations get anywhere near such deployments. Again, sales and marketing by providers have to take some responsibility here.
    
    And whilst you're extolling operations, don't forget it seems to have been Amazon's operations staff who grew fat fingers in this instance.
    
    8 1 Reply
  2. Friday 3rd March 2017 06:15 GMT Haberdashist
    
    Re: Availability Zones
    
    > Also they're still physically the same datacenter
    
    No, each region is made of many data centers. US-EAST-1 is spread across Northern Virginia.
    
    > Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.
    
    >
    
    > Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.
    
    Have fun living on your planet where everyone has the budget and time for multi-provider multi-region setups. It's one thing to chide people for not having proper backups or never considering HA, but expecting every site to launch their own satellite to maintain continuity in case the internet fails is pretty pointless.
    
    4 1 Reply
    1. Friday 3rd March 2017 09:39 GMT dancres
      
      Re: Availability Zones
      
      Those that don't have the budget presumably are spending it on features? That's not about cost that's about where one believes the revenue is ie features. However, if you're down, your features don't get used. A similar argument can be made for time expended in building HA: You can expend engineering effort once or support and admin effort every time you're down.
      
      Ultimately, this is about your users. Do you care enough about them to put their fate and yours in another's hands or do you choose to use the available facilities (and if you used a DR style arrangement you could save much of the infrastructure cost until time of need, magic of elasticity) to protect everyone?
      
      No doubt, for a fledgling company the choice has to be features but it should be a knowing choice. Amazon make it clear what needs doing for HA, choosing not to do it is on the respective business owner. For those with a decent paying user base the balance is somewhat different, all about how much you value your reputation. Blaming Amazon for your downfall will be limited consolation for your users. If you fall victim often enough you'll be paying the cost in lost revenue through inaction and support interactions. Alternatively you can pay the cost of moving clouds or developing your HA options.
      
      1 0 Reply
Thursday 2nd March 2017 20:45 GMT Anonymous Coward

Isn't puppet , chef , and Jenkins .. CI/CD .... devops

Suppose to cure this type of HUMAN fkck ups ?

2 3 Reply
1. Thursday 2nd March 2017 21:43 GMT zanshin
  
  Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops
  
  "Suppose to cure this type of HUMAN fkck ups ?"
  
  In a word, no.
  
  Those tools and the processes they support are for automated testing of changes you plan to roll out, and automated deployment of those changes, hopefully after someone or something has approved them. They make replication of change across many environments simple, including setup of servers, environments and so forth.
  
  The people in question were carrying out triage on a production performance issue. "Infrastructure as code" isn't really that helpful during triage. You usually have to dive in and run commands by hand. In such a situation, if what you are trying to resolve is related to production load and scale, you probably cannot replicate it on-demand in a test environment, even if you'd like to. That, in turn, can mean you can't really usefully test the command you plan to run.
  
  Given the nature of AWS/S3, I'm quite sure the command line entered did something heavily automated at scale, and might well have been executed with their equivalent of something like Chef, but *what* it was told to do was likely derived from the triage efforts. You can bork your production environment just fabulously with the wrong command inputs to a tool like Chef. It will dutifully obey you if the command you give it is legit. (They mention that they will change their definition of what's legit based on this experience.)
  
  I certainly do run what I perceive as "dangerous" commands in test environments before I run them in production, just to make sure I got them right. I can then copy-paste them exactly from dev into prod, at least where the command will be identical in either environment. But if I don't think the command is dangerous, possibly just because I've become used to running it without failure, I could conceivably type it out in full confidence and still screw it up. Triple-checking yourself before you hit "enter" is a matter of experience and, too often, not being over-tired or in a rush.
  
  5 0 Reply
  1. Saturday 4th March 2017 15:13 GMT fredesmite
    
    Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops
    
    ...
    
    Certainly Agile baby sitting with story book of post-its on a white board would have prevented it ....
    
    0 0 Reply
2. Friday 3rd March 2017 08:01 GMT Anonymous Coward
  
  Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops
  
  No. As long as someone uses a CLI he or she *will* make mistakes. Especially when the switches/parameters you have to set have a man page that looks like the Encyclopedia Britannica, and the average command line is just a little shorter than "The Rime of the Ancient Mariner".
  
  1 4 Reply
3. Friday 3rd March 2017 13:00 GMT 1Rafayal
  
  Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops
  
  Hmm, no, it isn't.
  
  DevOps is intended to support developers. Clue is in the name.
  
  0 0 Reply
  1. Sunday 5th March 2017 00:26 GMT Anonymous Coward
    
    Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops
    
    You are clueless - dev -> test -> production ... repeat ... the CI model
    
    0 0 Reply
Thursday 2nd March 2017 21:19 GMT Anonymous Coward

This is why spending money to go beyond 4 9s is generally wasted

Unless you have ironclad procedures (which would include prepackaged scripts to do all such tasks, so command line access is available to virtually no one) you'll lose your 5th 9 due to human error, 9 times out of 10 9.

5 0 Reply
Thursday 2nd March 2017 21:31 GMT Mage

Wizards know

1 in a million miracles happen 9 times out of 10.

Or something.

Next time it will be a rush to release patch that is auto updated. Perhaps like HP toner or ink cartridge DRM it won't be obvious till later.

Beware potato based Cloud computing.

4 0 Reply
Friday 3rd March 2017 02:52 GMT Anonymous Coward

Oops

I work for a large bank. I once took all of the ATMs off the air by entering a simple command to empty a load library on the mainframe. I was asked to do it by the application expert coz he knew what he was talking about and I had the access.

Oops. Came half a bee's dick from losing my job.

2 0 Reply
Friday 3rd March 2017 04:08 GMT Alan W. Rateliff, II

-Confirm:$false

Now always your friend.

1 0 Reply
Friday 3rd March 2017 06:23 GMT Brenda McViking

I remain impressed

By the ability of Amazon to do a root cause so quickly and go public with it.

In most corporates I've worked with it would take them at least 3 months to figure this out, even with C-Suite backing, and they'd only admit it 2 years later, because lawyers or something.

Makes a breath of fresh air that they have kept us informed. Unlike, say, every bank ever, or TalkTalk, or Adobe. Although naturally they've used up 5 years of their standard 99.99% availability quota in a single day, so I'm by no means advocating they get supplier of the year... Just that others might learn that this is the proper way to keep users informed after a crisis.
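
For a sense of scale on the "5 years of quota" remark, the downtime budget implied by an availability target is simple arithmetic. A rough sketch, assuming a 365-day year and that every minute of outage counts against the budget (the function name is made up for illustration):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_percent: float, years: float = 1.0) -> float:
    """Minutes of downtime permitted over `years` at the given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    # Yearly budget for three, four and five nines.
    for target in (99.9, 99.99, 99.999):
        print(f"{target}%: {downtime_budget_minutes(target):.2f} min/year")

    # The figure the comment alludes to: five years of a 99.99% budget.
    five_years = downtime_budget_minutes(99.99, years=5)
    print(f"99.99% over 5 years: {five_years:.1f} min (~{five_years / 60:.1f} h)")
```

At 99.99% the budget works out to 52.56 minutes a year, so five years is about 263 minutes, or roughly 4.4 hours.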

14 0 Reply
1. Friday 3rd March 2017 09:58 GMT Broooooose
  
  Re: I remain impressed
  
  "By the ability of Amazon to do a root cause so quickly and go public with it"
  
  If you're a CIO and the whole company is betting on your strategy and you choose to go with a provider who fails and it takes them 3 months to figure out and report on that failure, then you're gonna be asked to move off it pretty quickly and your credibility goes down the swanny.
  
  If AWS want to maintain their leadership and gain their customers' trust, they have to be transparent and quick to resolve. And yes, it is impressive. But I don't think they get a choice.
  
  2 0 Reply
Friday 3rd March 2017 06:37 GMT Anonymous Coward

Wait! They want me to use their automated tools..

..while they're doing stuff manually? hmmm.. what'd I miss?

2 0 Reply
Friday 3rd March 2017 07:12 GMT smartypants

I fixed it!

Without lifting a finger.

This cloud stuff is brilliant!

0 0 Reply
Friday 3rd March 2017 08:21 GMT Joe H.

The website is down dude...

The outage reminded me of this,

https://www.youtube.com/watch?v=W8_Kfjo3VjU

0 0 Reply
Friday 3rd March 2017 08:27 GMT Potemkine

Pity for the poor SOB

[rumor] I heard someone at the IT was displaced to Amazon's warehouse on Northern Alaska to wrap toothpicks 8 hours a day [/rumor]

1 0 Reply
Friday 3rd March 2017 09:10 GMT EnviableOne

~~Cloud Servers~~ Other People's Tin (OPT) is just as likely to go down as your own, due to Layer 8 errors.

The problem is their marketing machine says "don't worry if some idiot does it, we got some more Tin over here that we'll move your stuff to."

But as this demonstrates, no one told ops that.

3 0 Reply
Friday 3rd March 2017 09:28 GMT DaddyHoggy

"You're about to break the Internet. Are you sure? Y/N"

Y

"No, seriously. This will expose Cloud Based solutions as the delicate soap bubble it is. Are you sure? Y/N"

Y

"Sigh... OK..."

4 0 Reply
Friday 3rd March 2017 09:58 GMT rh587

Could have been worse

Who remembers this epic typo in 2014?

1 0 Reply
Friday 3rd March 2017 10:13 GMT Anonymous Coward

Oh well, take learnings, move on, this kind of thing happens every day in companies all over the world (though hopefully only once per mistake.) It's only more noticeable because of scale.

4 0 Reply
Friday 3rd March 2017 11:11 GMT wyatt

They're not the first and won't be the last. I'm guilty of taking servers down instead of workstations due to 1 character being different; the knock-on effect can be massive. Limiting this is essential, along with recovering from it.

3 0 Reply
1. Friday 3rd March 2017 11:56 GMT Locky
  
  Me too
  
  Once powershelled the entire company to have an out of office saying "I have now left the company"
  
  That was the day I learned that with one specific get-mailbox filter, if it finds zero results, it selects all mailboxes. Thanks for that M$
  
  3 0 Reply
Friday 3rd March 2017 12:02 GMT DropBear

"...limiting the ability its debugging tools have to take multiple subsystems offline"

...from now on, they'll need to use "sudo".

0 0 Reply
Friday 3rd March 2017 13:59 GMT HurdImpropriety

Move everything to the cloud yet have a single point of failure...nice.

"Those two subsystems handled the indexing for objects stored on S3 and the allocation of new storage instances. Without these two systems operating, Amazon said it was unable to handle any customer requests for S3 itself, or those from services like EC2 and Lambda functions connected to S3."

1 1 Reply
Friday 3rd March 2017 14:38 GMT Bowlers

I wonder

I wonder how long before the fat fingered one feels confident enough to report this to EL REG's On Call?

3 0 Reply
Friday 3rd March 2017 14:41 GMT TeeCee

Hmm. "Playbook".

That'll be the source of the problem. When such is in use, it can only mean one thing.

The person typing the commands, while "authorised" to do so, almost certainly hasn't got a clue what they actually do.

If they did then a) they wouldn't need someone else to have written it down for them and, more importantly, b) they'd have spotted the typo before hitting Enter.

2 3 Reply
1. Friday 3rd March 2017 22:25 GMT fredesmite
  
  Re: Hmm. "Playbook".
  
  So airline pilots should skip pre-check list ... because only rookies would need that.
  
  1 0 Reply
Friday 3rd March 2017 14:44 GMT Anonymous Coward

If we give a human the power to destroy at some point they will do it - usually by accident

First an admission, I have performed too many of the "fat finger" events over my lifetime. Back in the day when most of this was new it was accepted. Technology and its intricacies have changed dramatically and leaving the mere mortal to the capabilities of unrestricted command lines leaves us open to the next fat finger event.

Solutions to the problem have existed for a long time under many names. Currently it is called "service orchestration", but companies still need to invest in it and technicians have to embrace it. Simply put, it allows any technology's command facilities to be exposed, but in a controlled state, removing the potential for the next "fat finger". For any company where the technology is the company, there is no excuse for blaming the human. It is time for the company to own up and say: we didn't support the human, we provided the capability for this to happen and, as it always will, it happened.

So to Amazon and any other high tech company providing critical services I would ask - what are you doing to make certain this never happens again - get the cheque book out and remove the potential for it to happen in the first place.
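
One way to read "exposed but in a controlled state" is a wrapper that validates a request before anything is touched. A minimal sketch, not Amazon's tooling: the function, the limits and the server names are all hypothetical, and it only plans a removal rather than performing one.

```python
from __future__ import annotations


class CapacityGuardError(Exception):
    """Raised when a removal request would breach a safety limit."""


def plan_removal(fleet: list[str], requested: list[str],
                 min_remaining: int, max_batch: int) -> list[str]:
    """Return the servers that are safe to remove, or raise before touching anything."""
    fleet_set = set(fleet)
    targets = sorted(set(requested))

    # A mistyped input often names servers that do not exist in this fleet.
    unknown = [name for name in targets if name not in fleet_set]
    if unknown:
        raise CapacityGuardError(f"unknown servers: {unknown}")

    # Cap how much one command may remove in a single step.
    if len(targets) > max_batch:
        raise CapacityGuardError(
            f"{len(targets)} servers requested, batch limit is {max_batch}")

    # Never drop the fleet below its minimum required capacity.
    if len(fleet_set) - len(targets) < min_remaining:
        raise CapacityGuardError(
            f"removal would leave fewer than {min_remaining} servers")

    return targets


if __name__ == "__main__":
    fleet = [f"srv{i}" for i in range(10)]  # hypothetical fleet
    print(plan_removal(fleet, ["srv1", "srv2"], min_remaining=6, max_batch=3))
    try:
        plan_removal(fleet, fleet, min_remaining=6, max_batch=3)
    except CapacityGuardError as err:
        print("rejected:", err)
```

The point of the design is that the typo still happens, but the tool refuses the request instead of dutifully obeying it.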

1 1 Reply
Friday 3rd March 2017 15:13 GMT russmichaels

Good on them for being honest about the cause and not trying to blag everyone. These things happen.

2 0 Reply
Friday 3rd March 2017 16:01 GMT Anonymous Coward

All I can say is...

Looks like some companies are going to reconsider cloud and bring back their data on-prem. Good day to be a salesman =p

3 0 Reply
Friday 3rd March 2017 16:52 GMT Anonymous Coward

/dev/sda1 has gone 2323 days without being checked, check forced

BTST. Alternative good times.

These systems were probably around since Amazon first needed a storage system. And never mind that the variables in the code all refer to books. That's historical legacy code, the original developers have left and you'd better not touch it.

1 0 Reply
Friday 3rd March 2017 17:28 GMT Anonymous Coward

The internet *is* "someone else's computers"

Sorry, I really get annoyed by this "other people's tin" quip.

You cannot run Internet services on your own computers alone. If a major ISP mucks up their routers and goes dark, your precious customers will not be able to see your website, no matter how much disaster recovery you planned for.

AWS is a "cloud" hardware store. They stock some nice tools and might offer advice on your design questions. You're still responsible for building a stable system yourself.

3 0 Reply
Friday 3rd March 2017 17:46 GMT quxinot

It suddenly occurs to me the primary issue with cloud stuff.

Cloud is way cheaper than doing it on site. It's also comically, laughably more expensive than having it under your own roof. The differences are pretty simple: Doing it right is significantly more expensive than doing it wrong--to the point that the tools you're using aren't important. If your cloud stuff goes down because you don't have geographic failover redundant whatever etc because you cheaped out, you did it wrong. Holds precisely as true as the day the roof leaks and your rack emits a sad little pfzzt sound.

I do wish I lived in a world, or even a location, where this magical internet of perfect connectivity at wonderful speed was available, though. The last thing I want is any important data or processing being done on the other end of an unreliable, tiny, crooked straw. Someday, someday...

2 0 Reply
Friday 3rd March 2017 18:30 GMT Anonymous Coward

Put it in a shell script

At our company, the rule was to put anything potentially bad in a shell script, show it to someone else first, and no $* in the script.

0 0 Reply
Friday 3rd March 2017 20:25 GMT Anonymous Coward

Was it an outsourced Sys Admin from India, who did this?

0 1 Reply
Friday 3rd March 2017 22:19 GMT Anonymous Coward

for those who haven't done :

[root] : rm -rf /* when you meant

rm -rf ./*

Have not really experienced life as a software guru.
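
The two commands differ only in where the glob is anchored: the filesystem root versus the current directory. A guard that resolves the path first and refuses anything outside an agreed working directory would catch that class of slip. A sketch with hypothetical names; it only checks the path and deletes nothing:

```python
from pathlib import Path


def checked_delete_target(target: str, allowed_root: str) -> Path:
    """Resolve `target` and return it only if it lies strictly inside `allowed_root`."""
    root = Path(allowed_root).resolve()
    path = Path(target).resolve()

    # Reject the allowed root itself and anything that is not beneath it,
    # which includes "/" and paths that escape via "..".
    if path == root or root not in path.parents:
        raise ValueError(f"refusing to delete {path}: not inside {root}")
    return path


if __name__ == "__main__":
    try:
        checked_delete_target("/", allowed_root=".")
    except ValueError as err:
        print(err)
```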

0 0 Reply
Sunday 5th March 2017 14:30 GMT Anonymous Coward

FLUTTER OF BUTTERFLY WINGS

How's that go, a slight disruption of air over here leads to a hurricane over there...?

It's all connected folks, so get ready when it all comes crashing down..

1 0 Reply
