Fat-fingered admin downs entire Joyent data center

Cloud operator Joyent went through a major failure on Tuesday when a fat-fingered admin brought down an entire data center's compute assets. The cloud provider began reporting "transient availability issues" for its US-East-1 data center at around six-thirty in the evening, East Coast time. "Due to an operator error, all …

COMMENTS

  1. Anonymous Coward
    Anonymous Coward

    How fat was the finger?

    We need to know, so we can lean from his mistak: How fat is his finger?

    I will obviously be instituting a company wide lathe based pre-emptive fix for this for my staff. Now that's ISO9000 Preventative Action stuff. It's not Corrective - none of us have ever done something like that before.

    Cheers

    Jon

    1. Pete 2 Silver badge

      Treating the symptoms, not the disease

      Unfortunately, this[1] is exactly what most companies do when faced with this sort of issue. They say "oooh, the <command> is far too powerful - let's remove it, or require an operator to get approval from the change board before it's used in future".

      Although Joyent have said they are instigating a full investigation, they will find that their system has so many fundamental holes designed in that fixing them all will require not only a total re-write, but a complete redesign of their software and operational practices. A prospect that is likely (considering how poor the whole discipline of system design is) to introduce as many new problems as it fixes.

      So ultimately I fully expect the expedient solutions to be applied: an extra layer of checks that will slow down operations and make life for operators even more exasperating (such as an "are you sure" dialog after every command) and will soon become ineffective due to the pressures of getting stuff done (a 10% decrease in operational effectiveness is never paid for with a 10% increase in staff numbers) and management cuts.

      [1] yes, satire: I get it

    2. I. Aproveofitspendingonspecificprojects

      Re: How fat was the finger?

      I think The Register unit is sausages per finger, but we are dealing with lemons here - until we learn more.

    3. TheVogon

      Re: How fat was the finger?

      "How fat is his finger?"

      Fat enough to press a big red button labelled 'EPO' - and stupid enough to confuse it with the light switch!

  2. Nate Amsden

    worse

    http://www.theregister.co.uk/2008/08/28/flexiscale_outage/

    "As Lucas explained in an email to customers - posted to the Web by CNet - the outage occurred when an XCalibre engineer accidentally deleted one of FlexiScale's main storage volumes."

    http://www.theregister.co.uk/2009/05/15/flexiscale_upgrade/

    "Nine months after an engineer accidentally deleted its Amazon-like compute cloud - and six months after a second major outage - FlexiScale has finally completed a software overhaul meant to avoid such extended blackouts."

    1. A Non e-mouse Silver badge

      Re: worse

      In the days before "cloud", and when Novell were still alive, they had a tool (in beta) which allowed you to run commands on all your NetWare servers at once. It was useful, but then Novell wiped the tool from the face of the planet. I did some asking around, and was told that Novell pulled it because it was too easy to use and too many customers were shooting themselves in the foot with it. I heard tales of people deleting NDS from every server in the tree with the tool. Ouch!

  3. Jim 59

    op/admin

    Joyent says operator. El Reg says administrator. Which is it?

    1. Anonymous Coward
      Anonymous Coward

      Re: op/admin

      .. maybe it's a forklift operator inside them datacenters! Happens all the time with telco cables on the street when somebody's digging.

    2. foxyshadis

      Re: op/admin

      You don't consider the BOFH an administrator?

      Well he administers the pints, that's for sure.

    3. Anonymous Coward
      Anonymous Coward

      Re: op/admin

      Operator of the mop, or administrator of the bucket - does it matter?

  4. Allan George Dyer
    Joke

    Shock collars?

    He makes it sound like he tried. Guess they should invest in waterproof keyboards that can be operated with flippers.

    1. Anonymous Coward
      Anonymous Coward

      Re: Shock collars?

      Patronisingly referring to his staff as "dolphins" is going to make him REALLY popular I'm sure. No doubt another clueless accounting oik parachuted into the CTO job who has no idea what his tech staff actually do and probably thinks it's all pretty easy and he could do sooo much better himself if only he had enough spare time away from the golf course.

      1. Anonymous Coward
        Anonymous Coward

        Re: Shock collars?

        That or he is a proper network tech, and is aware of the in-joke of that remark (which you're obviously not).

      2. Anonymous Coward
        Anonymous Coward

        Re: Shock collars?

        No doubt another clueless accounting oik parachuted into the CTO job

        Hmm, you've no idea who Bryan Cantrill is, have you? He jumped ship to Joyent when Oracle bought Sun, where he was one of the senior Solaris/ZFS designers. You should try and catch one of his presentations one day, he can be a PITA at times but he's far from clueless, and very entertaining.

        1. Anonymous Coward
          Anonymous Coward

          Re: Shock collars?

          As an engineer I worked with Bryan. He's one of the smartest and most capable people I've ever seen or known. He's far from being "another clueless accounting oik"; if he were, he'd be unlikely to have spoken so frankly about this event.

          1. Jamie Jones Silver badge
            Thumb Up

            Re: Shock collars?

            Well said, anon.

            It's the inept who are used to bluffing their way through life that naturally turn to spin and bullshit.

            Those who actually have a clue are quite open to admit when there's been a cockup, as their reputation isn't based on smoke and mirrors.

      3. bcantrill

        Re: Shock collars?

        (I'm the CTO in question. Normally, I could/would resist being trolled, but given that this constitutes special circumstances, I feel it's appropriate to clarify a few things.)

        First, I was clearly not referring to our own staff as dolphins, but rather that meting out punishment rarely changes behavior. I can assure you that no one internally took this the wrong way.

        Second, as for me being a "clueless accounting oik": to the contrary, I have dedicated my entire career to the understanding and improvement of software operating production systems. Having spent a ton of time on production systems, I have also made my fair share of mistakes -- and (like every human who has done something important that requires precision) have had plenty of near-misses that I only found when I double- or triple-checked my own work. So not only do I not think it's easy, I know it isn't.

        Finally: I hate golf.

        1. Anonymous Coward
          Anonymous Coward

          Re: Shock collars?

          Fair enough - I retract my remarks. I'll leave my original post up otherwise no one will know what this thread is about.

        2. Vic

          Re: Shock collars?

          > meting out punishment rarely changes behavior

          I wish you'd been one of my bosses.

          I've been to so many places[1] where the "important" part of a post-mortem is working out who carries the can. It never helps.

          CxOs who understand the need to fix the problem rather than fix the blame seem to be few and far between...

          Vic.

          [1] Usually as a contractor, called in to sort out the cock-up. Thank $deity...

        3. Jamie Jones Silver badge
          Pint

          Re: Shock collars? @Bryan

          I was going to post this as a new message, but seeing you've posted here, I'll reply instead.

          I found your candidness in your response, your openness, and your proposal for moving forward very refreshing.

          If I was a customer, I'd have found it most reassuring.

          Other companies (and politicians!) should note that bullshit and spin and skirting around the issue impresses no-one.

  5. Don Jefe

    Biomed Coverup, Political Intrigue or Rock Band

    I'm pretty sure 'transient availability issues' should be the pivotal soapbox issue of the next presidential election. You can play that so many ways: 'Enhanced border protection has resulted in transient availability issues that have caused a spike in food prices, leaving many to starve', for example.

    It also works for unsanctioned medical experiments, value priced organ transplants and urban hunting outfitters: 'Transient availability issues have slowed field trial results and publication of Stage IV results of REDACTED will not be available until Q4'. 'Transient availability issues have resulted in an unforeseen shortage of viable Human organs. Until inventory levels have stabilized hunting at inner city housing projects will be permitted between 1-5 AM'.

    Also, 'Transient Availability Issues' would be a great band name!

    1. Fatman

      Re: Biomed Coverup, Political Intrigue or Rock Band

      HUH????

      For a moment, I thought I was reading one of 'amanfrommars' posts!

      1. keithpeter Silver badge

        Re: Biomed Coverup, Political Intrigue or Rock Band

        @Fatman

        Google 'gonzo journalism' and HST if you like the style.

        I suspect Mr Jefe would not be allowed within 25 feet of a console capable of controlling more than one running kernel.

        1. Don Jefe

          Re: Biomed Coverup, Political Intrigue or Rock Band

          Furthermore, I'll have you know my console is capable of running large numbers of kernels simultaneously. Long before you lot were messing about with your virtual kernels and manipulating kernels with concentrated RF radiation I was commanding THOUSANDS of kernels simultaneously using naught but alternating current, a dead short and a bunch of fucking butter. So suck it.

        2. Mpeler
          Paris Hilton

          Re: Biomed Coverup, Political Intrigue or Rock Band

          I suspect Mr Jefe would not be allowed within 25 feet of a console capable of controlling more than one running kernel.

          Ah, me glasses again - I read that as "controlling more than one running kennel"....wondering what kind of transients were meant.....

          It's a dog's life.....

          Paris...her glasses don't seem to be working either....

      2. Don Jefe

        Re: Biomed Coverup, Political Intrigue or Rock Band

        Pah! amanfrommars uses bendy logic and a very special form of punctuation in his posts. I, on the other hand, utilize the wholly correct method of applying various definitions of a word in a context other than that used in the original statement.

        Indeed, others, some from inside asylums for the disturbed, foresaw such intentional mangling of the words of others and even created an alphabetized index of words and their meanings specifically so that context and definitions can be interchanged to great effect. You can make great jokes, create stunning headlines, delight and horrify shareholders or run for public office based solely on understanding what a dictionary is and mastering its use. Dictionary Expansion Packs include Thesaurus, Foreign Language Cross-Indices, The Urban Dictionary and Trade Jargon Indexes.

  6. Hazmoid

    So I suspect that operator will be told subtly that maybe he would be better looking for work somewhere else. Either that or they promote him out of the operator position so he can't do it again :)

    1. Mark 85

      I suspect you may be right, or they will create a position of "corporate scapegoat" and put him in it. Then everything from then on that goes wrong can be blamed on a "fat fingered admin".

    2. Uffish

      Re: don't do it again.

      A colleague of mine used to work for a small company where people had to make quick decisions on the fly. They had a 'mistakes book' where all bad decisions and goofs were recorded. You could make any sort of mistake as long as it wasn't in the book. He said the system worked well.

      1. Anonymous Coward
        Anonymous Coward

        Re: don't do it again.

        A new CEO (Marcus Steyn?) of M&S looked at the company's very thick file of accumulated corrective actions. He took it away and returned it slimmed down to a few pages of generic guidelines. He said that the original was too much for most people to have read - therefore its only possible use had been to apportion blame after the event.

      2. Anonymous Coward
        Anonymous Coward

        Re: don't do it again.

        "You could make any sort of mistake as long as it wasn't in the book."

        In the evening the work surfaces along the computer suite walls were stacked high with card boxes for the night runs. A development programmer was standing waiting for his dedicated hands-on timeslot. He leaned back against the card trays - and they moved backwards. Unfortunately there was an emergency off mushroom button at that point on the wall.

        As a result - the work surface cabinets were re-arranged to leave a gap in front of the button. A few days later the same programmer was waiting again. He was now wary of leaning against the boxes. So he positioned himself in the convenient gap in the work surface cabinets and leaned back against the wall - and against the emergency off button.

        After that a papertape reel plastic core was taped round the button as a shield - so that only a finger could press it.

    3. Anonymous Coward
      Pint

      Wrong answer!

      Unless it's malicious, you take the person aside and in private figure out what went wrong, and why. Then you gather everyone after everything's back up, publicly thank everyone for their response and do an after-action review. What I wouldn't do is to pass the operator's name around. If anyone's ass is going to be chewed on, it's mine.

      I have to get the recipe exactly right since I have the social graces of a male warthog in mating season. I do know how to react in this one.

  7. skeptical i

    Jheez, poor bastard. :\

    There but for the grace ....

    1. Alan W. Rateliff, II
      Paris Hilton

      Re: Jheez, poor bastard. :\

      ... really go all of us. Even the snotty it-won't-ever-happen-to-me RegTards.

      1. Anonymous Coward
        Anonymous Coward

        Re: Jheez, poor bastard. :\

        I'm always doubtful, and I suppose it is a healthy and respectful attitude to have. The trouble is the "mgnt." see that as a weakness - not a strength. So they get in some cock sure punk, and I let him cock it up - just for the LULZ!

    2. Anonymous Coward
      Anonymous Coward

      Re: Jheez, poor bastard. :\

      You're not a real sysadmin until you've made a major cockup (e.g. rebooting the Production system rather than the Test one).

      Until then, you're an accident waiting to happen. Afterwards, you have the heightened situational awareness and the little voice in your head that whispers "something's not right" to stop you hitting the wrong button again.

      So Joyent now have an experienced sysadmin. Why would they sack him and bring in another overconfident time-bomb?

      Anon, because "voices in your head" doesn't sound good!

      1. Trygve Henriksen

        Re: Jheez, poor bastard. :\

        Absolutely.

        And the way to see who is a good sysadmin from a bad is if he admits that he screwed up or tries to blame someone else.

        If he blatantly blames someone else, he's a bad one.

        Taking the blame is a good one...

        Successfully laying the blame on Microsoft... Well, then you have a bona-fide Guru on your hands.

        ;-)

        1. Pete 2 Silver badge

          Re: Jheez, poor bastard. :\

          > And the way to see who is a good sysadmin from a bad is if he admits that he screwed up or tries to blame someone else.

          Maybe. But the mark of a truly excellent (ahem!) sysadmin is that he / she gets the problem fixed before anyone else notices.

        2. Fatman

          Re: Jheez, poor bastard. :\

          If he blatantly blames someone else, he's a bad one mangler in training.

          There!

          FTFY!

      2. Vic

        Re: Jheez, poor bastard. :\

        > e.g. rebooting the Production system rather than the Test one

        I've found it useful, in environments where someone might get confused like that, to blacklist shutdown and reboot in the sudoers file.

        It won't stop someone with privilege from rebooting the machine, of course, but it does mean the procedure is slightly different, which prevents accidental reboots. It's remarkably effective...

        Vic.

        1. Anonymous Coward
          Anonymous Coward

          Re: Jheez, poor bastard. :\

          >I've found it useful, in environments where someone might get confused like that, to blacklist

          >shutdown and reboot in the sudoers file.

          >It won't stop someone with privilege from rebooting the machine, of course, but it does mean the

          >procedure is slightly different, which prevents accidental reboots. It's remarkably effective...

          You're going to have to explain how someone "accidentally" types shutdown or reboot.

          1. noboard
            Pint

            Re: Jheez, poor bastard. :\

            No they don't, it covers the case where the person wants to shut-down or reboot the test server, but hasn't realised they're on the live server.

            Have a pint for the old reading comprehension :)

          2. Missing Semicolon Silver badge
            Boffin

            Re: Jheez, poor bastard. :\

            ... by being logged in to an SSH session, and not noticing which machine you are connected to.

            I now make sure that every VM I create has a descriptive hostname, so that it appears at the shell prompt (and in the title bar of Putty).

          3. Anonymous Coward
            Anonymous Coward

            Re: Jheez, poor bastard. :\

            >You're going to have to explain how someone "accidentally" types shutdown or reboot.

            I'm guessing you're not a Linux admin? Very simply - tab completion. Say you have powermt installed in 98% of your boxes. You'd probably gain a bad habit of typing power<tab>&<enter> to autocomplete powermt and execute the command.

            Say one day you hit one of those 2% machines that don't have powermt installed or configured in the path. power<tab>&<enter> suddenly becomes power<off>. Machine gone.

            There are lots of examples with other dangerous commands. Personally I'd like to see init, poweroff and reboot etc all require a -y flag by default.
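For what it's worth, the confirmation idea discussed above (a different procedure for reboots, a hostname you cannot miss, a "-y"-style extra step) can be sketched in a few lines. This is only an illustration in Python, not how any real tool does it: the names `GUARDED`, `needs_confirmation`, `confirmed` and `guard` are made up for the example, and the command list is an assumption. The operator has to type the short hostname of the box they are actually on before a guarded command is allowed through.

```python
import socket

# Hypothetical set of command names worth guarding; adjust to taste.
GUARDED = {"shutdown", "reboot", "poweroff", "halt", "init"}


def needs_confirmation(argv):
    """Return True if the command name (ignoring its directory) is guarded."""
    if not argv:
        return False
    name = argv[0].rsplit("/", 1)[-1]
    return name in GUARDED


def confirmed(typed, hostname=None):
    """Accept only an exact match against this machine's short hostname."""
    if hostname is None:
        hostname = socket.gethostname()
    short = hostname.split(".", 1)[0]
    return typed.strip() == short


def guard(argv, typed, hostname=None):
    """Decide whether argv may run.

    Unguarded commands always pass; guarded ones pass only when the
    operator typed the name of the host they are logged in to.
    """
    if not needs_confirmation(argv):
        return True
    return confirmed(typed, hostname)
```

Someone who believes they are on the test box types the test box's name, which will not match on the live server, so the mistake is caught before the command runs. It does nothing against an operator who is determined, which is rather the point made above: it only changes the procedure enough to stop accidents.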

      3. J P

        Re: Jheez, poor bastard. :\

        Perhaps someone high up at Joyent has themselves had the fat fingered moment of dread, and recognises the value of experience...

        There's a similar thing with discharged bankrupts; they are statistically the least likely people to go bankrupt (yes, again). Despite which, they are also one of the groups that finds it hardest to get bank accounts. The test is of course administered by bankers who have not themselves gone bankrupt.

      4. Captain Scarlet
        Unhappy

        Re: Jheez, poor bastard. :\

        "You're not a real sysadmin until you've made a major cockup (e.g. rebooting the Production system rather than the Test one"

        I have to confess I shut down the wrong customer server: after being unable to RDP to the server, I went on via a KVM and misread an 'o' as an 'a' in a list of a few thousand similarly named servers. I was on the phone with people in front of the machine and I misread the name four times before hitting shutdown. I then saw the wrong green dot go red on the network screen and realised what I had done!

      5. Keith Langmead

        Re: Jheez, poor bastard. :\

        Definitely true! Only after experiencing that sinking feeling where it feels like the bottom has just dropped out of your world, and then having to tell your boss what you've done, only then can you truly appreciate axioms like "don't assume, check" and "hope for the best but plan for the worst". Until then they're just words that are impossible to put into proper context.

        1. Vic
          Joke

          Re: Jheez, poor bastard. :\

          Only after experiencing that sinking feeling where it feels like the bottom has just dropped out of your world

          That's for beginners.

          You know things have gone really wrong as you experience that feeling of the world dropping out of your bottom...

          Vic.

    3. Vic

      Re: Jheez, poor bastard. :\

      > There but for the grace ....

      I once deleted a live /var filesystem while trying to clean up a machine.

      Luckily for me, it was one of my own.

      The box is still in use - and I keep finding orphaned unpackaged files lying around from before the accident...

      Vic.

  8. A Non e-mouse Silver badge
    Pint

    Poor sod...

    I feel sorry for that poor sysadmin. It's going to be a while before they get over this. The only glimmer of hope, is that Joyent don't sound like they're going to hang the sysadmin out to dry.

    Have a beer to drown your sorrows.

    1. Keith Langmead

      Re: Poor sod...

      Though you can guarantee he'll be getting the piss taken out of him over it by the other sysadmins for years to come (in a light-hearted way of course). My major screw-up was about 10 years ago and it still comes up occasionally, and we still mention a friend of mine's best screw-up now and again some 15 years later.

  9. Etherealmind

    Its all about blame

    As an IT executive, I love cloud because when something goes wrong I can blame someone else.

    Of course our own data centres have the same types of outages and problems, but I'm responsible for those. Putting it in the cloud means there are a lot fewer things that are my fault.

    1. A Non e-mouse Silver badge

      Re: Its all about blame

      As an IT executive with in-house IT, when something goes wrong, you can pull staff from (almost) anywhere to chip in and get things working again. If you treat your staff well, they may even offer to work silly hours to help you and the company out.

      As an IT executive with outsourced IT, you can fire off ranty emails (and swear down the phone if you're lucky) and....?

      1. Preston Munchensonton

        Re: Its all about blame

        When everything is humming along smoothly, IT always seems like it's a service that could be offloaded. When IT issues come crashing down all around you, that's the moment that you realize that your CV isn't updated enough to help you get another job when they sack you for outsourcing IT.

        IT is a resource. In-source as much as possible and out-source only those functions that you can't support well for lack of talent (and work on finding the talent).

        1. tfewster

          Re: Its all about blame

          I'd go further than that - recognise that Corporate IT is a core function that creates competitive advantage, not a commodity service that can be blindly outsourced.

          1. JeffTravis

            Re: Its all about blame

            Some Corporate IT is, some Corporate IT will be your differentiator.

            Some Corporate IT is just the cost of doing business.

            Some Corporate IT is just an expensive way of turning power into heat.

    2. Fatman

      Re: Its all about blame

      Let me fix your first sentence, it should read:

      "As a IT executive mangler, I love cloud because when something goes wrong I can blame point the finger at someone else, absolving myself in the process of having to take responsibility for a bad decision."

      I would append: "Once I have shit all over my current employer, I will fly off to find another place to park my ass, and shit all over that employer in true seagull manglement style."

  10. Longrod_von_Hugendong
    FAIL

    Molly Guard...

    That is what you are needing

  11. Polhotpot

    As someone who has, in the past, managed to power off an entire rack of backbone switches at a corporate datacentre whilst doing a move, I can entirely sympathise with that unique sinking feeling.

    1. Phil O'Sophical Silver badge

      There's nothing like the silence you get when you've accidentally powered off the whole server room, AC as well, and your footsteps echo off the raised floor all the way across to the phone...

      1. JLH

        Oh yes. Re. the comments above, I agree about the 'voices in the head'

        I once installed an Oracle RAC cluster in an English university (OK it was UMIST).

        I did ask if there was enough power for it before turning it on.... Oh yes I was told.

        Sure enough, power up the racks and...... silence. Except for the beeping of UPSes and the running of IT staff feet towards the machine room.

        1. Anonymous Coward
          Anonymous Coward

          Oh yes. Re. the comments above, I agree about the 'voices in the head'

          I once installed an Oracle RAC cluster in an English university (OK it was UMIST).

          I did ask if there was enough power for it before turning it on.... Oh yes I was told.

          Sure enough, power up the racks and...... silence. Except for the beeping of UPSes and the running of IT staff feet towards the machine room.

          That reminds me of the time that "we" were given an old working "mainframe" system. I don't recall which one this was as it was one of several that we were given around that time. A PDP11, I think but it was a long time ago. The machine in this memory was housed in two 19inch racks. We carted the machine's parts home to the apartment complex where we were staying, set it up in the laundry room on the patio and went down the checklist. We used the laundry room because it had a 220 power feed for the dryer and we didn't have a washer and dryer. After we had checked over everything, we threw the switch, applied power and sat back to a facility wide brownout while the system came online and drives spun up. We only started it a few times and it was great fun while it lasted but finally we were forced to break it down. Some of the drive platters became wall clocks. Pretty nice ones, too :)

          Currently I have a small system36 sitting in the corner holding up one end of my eight foot long workbench/computer desk. The other end is held up by one of those metal two drawer filing cabinets ;)

          Anonymous to protect the innocence :)

      2. Anonymous Coward
        Anonymous Coward

        "There's nothing like the silence you get when you've accidentally powered off the whole server room"

        That deafening silence is truly awe-inspiring. The mainframe operator couldn't explain why he had had an impulse to press that large red button by the door.

    2. Alan W. Rateliff, II
      Paris Hilton

      Re: that unique sinking feeling

      A bit like that feeling you get just as you let go of the car door and at the exact moment realize your keys are on the front seat, innit?

      1. Steven Raith

        Re: that unique sinking feeling

        Or when you realise that the car keys are in the boot, and that electric boot release that you meant to fix three months ago....oops.

        Thank god for spare keys an inside pocket.

        I have had that horrible sinking feeling when you realise the wrong server is powering down. And once I tripped over an outlet and pulled out the power to an entire rack. Thankfully, the (massively overspecced) UPS saved my arse on that occasion, but I've always trodden very carefully in server rooms (or broom closets, or the back of the office on the empty desk that substitutes for their server room, etc.).

        Nothing teaches you how to make things reliable like accidentally breaking stuff early in your career, I'll tell you that. The trick is to stop doing it once you are looking after other people's gear....!

        Steven R

      2. ecofeco Silver badge

        Re: that unique sinking feeling

        "A bit like that feeling you get just as you let go of the car door and at the exact moment realize your keys are on the front seat, innit?"

        There was a "sniglet" for that: ignasec

    3. Mpeler
      Alien

      That sinking feeling

      That admin is probably not ever going to like the song "I've looked at clouds from both sides now...."....

      AManFromMars, because he might see clouds from both sides too (still waiting, AMFM....)...

  12. Rubber chicken
    Mushroom

    Nom Nom Nom Nom.....

    I have very nearly chewed my way through my chair as a system has gone down at my hands in error. (and no - not using my mouth to chew - the pucker was that strong)

  13. Anonymous Coward
    Anonymous Coward

    https://www.youtube.com/watch?v=Zv1w9bg3bMM

  14. Michael H.F. Wilkinson Silver badge
    1. Diamandi Lucas
      Pint

      Re: Fat fingers or ...

      "BOFH keyboard, or BOFH control centre"

      most likely a fat-handed twat

      A pint for the poor operator he/she could use a drink or several.

  15. Pascal Monett Silver badge

    Kudos to Joyent

    First they are upfront about the issue and about the cause of the issue, and they commit to publishing the follow-up and not hanging the admin.

    That's a far cry from just about everybody else who start by denying any problem until it is absolutely blindingly obvious to everyone that they've got their heads up their ass, then go on to make absurd "only a small percentage of users were impacted" statements.

    Joyent sounds like a rather good company to work for, where the entire management chain is capable and takes responsibility. Almost unheard-of these days.

    HP, are you noticing ?

  16. batfastad

    Cloud != Cloud

    If it's all in one datacentre, attached to one network, in one country, it's not a cloud. Just a bunch of dedicated servers with a nice fat cloud(TM) price markup.

  17. kmac499

    Apocryphal ?? There but for the grace ..

    I was told these ones years ago, If true my sympathies to all involved

    An engineer working in a server room of a big banky\finance outfit leant against a wall, or stretched out an arm, ( we all know how cramped it can get working in a tight space.) and hit the emergency off kill switch for the room. Once order had been restored the great and good were gathered in the room to conduct the inquest and learn from it.

    "Right so what exactly did you do..?"

    " Well I just stood up stretched out and ..... Oh Shit.."

    Outside a big disaster recovery centre Paddy was playing\prospecting in the car park with his company car, a JCB, laying new pipes or summat. As sometimes happens he found the mains duct and broke it, killing the power. The UPS's kicked in and everyone waited for the standby gennies to start up, and waited and waited. Then someone realised the control cables and power feeds were in the same duct as the outside mains..

    Finally a software one to spread the blame...

    Another bank was planning a small mail shot to its personal clients, aka high value individuals. The vellum was loaded into the printers. The delivery owls were put on standby. Unfortunately the test data used in the mail shot, as created by I suspect a lowly paid junior programmer, 'leaked' into the production run.

    So each letter began Dear Rich Bastard...

    The current whereabouts of said programmer unknown.

    1. Nigel 11

      Re: Apocryphal ?? There but for the grace ..

      #1. Someone needed a Mollyguard clue.

      (Out of interest, what do the military call Mollyguards? You know, the ones that stop you accidentally launching an ICBM when you sneeze, or blowing the bridge before your army has retreated over it, things like that? )

      1. Roger Varley

        Re: Apocryphal ?? There but for the grace ..

        REME's

      2. Anonymous Coward

        Re: Apocryphal ?? There but for the grace ..

        Switch-guard? It's been a long while and the only one I dealt with as a matter of course was for reactor-scrams.

  18. Anonymous Coward

    # rm -rf / tmp/foo/no-more-rubbish_here

    Spot the typo.

    Add NFS with parts of the sub-tree structure chmod'd 0777

    Add users who think backup is for sometime when they get a free month.
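
    For anyone who didn't spot it: a minimal Python sketch of how a POSIX-style shell would split that line into words. This is an illustration only; `removal_targets` and `looks_dangerous` are names invented for the example, not part of any real tool.

```python
import shlex


def removal_targets(command: str) -> list[str]:
    """Return the path operands an rm-style command line would act on."""
    words = shlex.split(command)
    # Skip the command name itself and any option words such as -rf.
    return [w for w in words[1:] if not w.startswith("-")]


def looks_dangerous(command: str) -> bool:
    """Flag a command whose operands include the root directory itself."""
    return any(t.rstrip("/") == "" for t in removal_targets(command))


typo = "rm -rf / tmp/foo/no-more-rubbish_here"
intended = "rm -rf /tmp/foo/no-more-rubbish_here"

print(removal_targets(typo))      # ['/', 'tmp/foo/no-more-rubbish_here']
print(looks_dangerous(typo))      # True
print(looks_dangerous(intended))  # False
```

    The stray space turns one path into two operands, and the first of them is the root directory.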

    1. John Brown (no body) Silver badge

      Re: # rm -rf / tmp/foo/no-more-rubbish_here

      That's why I always use tab completion. No tab completion = something's wrong.

      1. Nigel 11

        Re: # rm -rf / tmp/foo/no-more-rubbish_here

        Tab completion: Yes. To which I'd add,

        # rm / tmp/foo/no-more-rubbis [TAB]

        and add the -rf at the end of the line, if and only if it does tab-complete, and after you've mentally checked for the very last time, "do I really mean this"?

        BTW if your command isn't amenable to this sort of rearrangement, an open-bracket will accomplish much the same for other commands. Add ) CR after you've thought hard.

        # ( dangerous_command

        >

        1. Vic

          Re: # rm -rf / tmp/foo/no-more-rubbish_here

          > Add ) CR after you've thought hard.

          *Nice*.

          Vic.

  19. Anonymous Coward

    Shutdown -r -f -t 0

    ...is pretty much harmless and hard to type (since windorz reboots every Patch Tuesday anyway), until you create a desktop shortcut for it. Don't. Type it at the prompt every time and get used to doing so, for safety.

    The other jewel I remember was something like a chat app... that ran straight off a prompt. Saying the words quit or reboot would cause lulz.... I just can't remember what kind of hardware that was... a phone or pad....

  20. Anonymous Coward

    Misunderstood automation software?

    Saw it happen once in a data centre. Company installed some global automation software, with agents on every server. Some wazzock decided to test a reboot job and, instead of sending it to a small selection of dev servers, accidentally sent it to every single agent for execution. Not immediately of course, oh no. They set it to go at 9pm on a weekday, right in the middle of the batch runs! Next thing we hear is callouts to every Windows admin in the company as 450 Windows app servers start rebooting all at once!

    I refused to let them install the agents on the production Unix boxes, and as the jobs didn't have the right scripting to kickstart reboots on the test/dev servers, they were safe from the "fat finger" cock-up!

  21. ElNumbre
    Joke

    Hello?

    Hello IT, yes we have tried switching it off and back on again.

    1. ecofeco Silver badge

      Re: Hello?

      *snerk*

      Well done! Have an upvote.

  22. The Dude
    Boffin

    Been there... done that.

    Usually when working on virtual machines and accidentally rebooting the host. Easy enough to get confused, they all look the same at the end of a long day.

    It's happened to other people I've worked with, usually accompanied by a very quiet "oops" coming from their cubicle.

  23. ecofeco Silver badge
    Trollface

    I won't say I told you so

    Just "derp"

  24. quartzie

    Worst day of my life: Shot down a cloud

    I'm sure it's considered a magic trick to shoot down clouds, unless you work as a BOFH of one.

  25. naylorjs

    Brown out.....

    I think it was a brown out in more than one sense!

  26. JeffTravis

    I concur - experience is everything

    I always compare a big outage to the scene in "It's a Wonderful Life" where Clarence the Angel says that every time a bell rings it's an angel getting its wings.

    Every time there's an outage a sysadmin is earning theirs.

  27. Speltier
    Pint

    Oh Dear Oh Dear Oh Dear

    The child safe translation of the verbiage emitted by a meatbag after hitting the enter key and by mere milliseconds belatedly realizing that the machine minions are dutifully following orders to cause an unintended bitfaust. Or in the case referenced, a cluster bitfaust.

    Worse, dead server walking and the pardon is not on the way.

  28. Dave Nicholson
    IT Angle

    GUIs are LOLing at the CLI crowd right about now

    How about a button that says "Reboot All Servers"? When pressed, a dialogue box with "Are you sure you want to reboot YOUR ENTIRE DATA CENTRE???" would appear.

    It is hard to call missing a dash or a pipe or a space "fat fingering".

    Build a GUI or hire a robot. :-)

    1. Anonymous Coward

      Re: GUIs are LOLing at the CLI crowd right about now

      "Are you sure you want to reboot YOUR ENTIRE DATA CENTRE???"

      Followed by another prompt "Are you really, really sure you want to KILL EVERYTHING?"

      In the days of Teletype mainframe consoles it was found to be best that dangerous actions should ask a series of questions - with the last one requiring a different affirmative than the standard "Y". Otherwise the operator often went into auto mode and typed "Y" without thinking.
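
      A rough Python sketch of that idea, for illustration only: the first question takes the usual "y", the last one only accepts the name of the thing about to be rebooted. `confirm_destructive`, the `ask` parameter and the `us-east-1` label are all made up for the example and are not taken from any real operator tool.

```python
from typing import Callable


def confirm_destructive(scope: str, ask: Callable[[str], str] = input) -> bool:
    """Two-stage confirmation: a habitual 'y' alone is never enough."""
    first = ask(f"Reboot every server in {scope}? [y/N] ")
    if first.strip().lower() not in ("y", "yes"):
        return False
    # The final answer must be the scope name itself, typed out in full,
    # so an operator on autopilot who types "y" again is stopped here.
    second = ask("Type the data centre name to proceed: ")
    return second.strip() == scope


def scripted(answers: list[str]) -> Callable[[str], str]:
    """Stand-in for a keyboard: returns the canned answers in order."""
    remaining = iter(answers)
    return lambda prompt: next(remaining)


print(confirm_destructive("us-east-1", scripted(["y", "y"])))          # False
print(confirm_destructive("us-east-1", scripted(["y", "us-east-1"])))  # True
```

      Passing `ask` in as a parameter is only there so the prompts can be exercised without a real terminal; left at its default it reads from the keyboard.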

  29. cordwainer 1
    Pint

    Per the El Reg offer, have a pint...

    "El Reg would like to commend Joyent for its transparency about the outage and has made one virtual Sorry You Borked A Bit Barn pint available to the operator that caused the error. Interested parties can provide additional pints by selecting the beer icon in the comments below."

    A pint for the poor operator, who will be needing to drown his sorrows.

    And a pint to Cantrill for what may be the only comprehensive, transparent, honest corporate explanation for a screw-up I've read in years. Should be required reading for all clueless executives and companies (eBay, yes, I DO mean you).

  30. kbsartain

    Depending on the tool is not sufficient to avoid issues. Partitioning of the datacenter such that any one operation only had a potential scope of impact of, say, 1/3rd of the datacenter is a better approach. That can be accomplished through bastion hosts/gateways. It means that the admin has to log in to a unique gateway for every partition. That 2-step process is often enough of a sanity check to avoid these types of issues entirely.
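
    To make the partitioning idea concrete, here is a small Python sketch under made-up assumptions: three hypothetical gateways (`gw-a`, `gw-b`, `gw-c`), each owning a third of six hypothetical compute nodes, and a `check_scope` helper invented for the example. It says nothing about how Joyent or any other operator actually lays out a datacenter.

```python
# Hypothetical layout: each gateway can only reach its own partition.
PARTITIONS: dict[str, set[str]] = {
    "gw-a": {"cn-01", "cn-02"},
    "gw-b": {"cn-03", "cn-04"},
    "gw-c": {"cn-05", "cn-06"},
}


def check_scope(targets: set[str], gateway: str) -> set[str]:
    """Return the targets if they all sit behind this gateway, else refuse."""
    outside = targets - PARTITIONS[gateway]
    if outside:
        raise PermissionError(
            f"{len(outside)} target(s) are outside partition {gateway!r}"
        )
    return targets


all_hosts = set().union(*PARTITIONS.values())

# An operation confined to one partition goes through.
print(sorted(check_scope({"cn-01", "cn-02"}, "gw-a")))  # ['cn-01', 'cn-02']

# "Reboot everything" issued from a single gateway is refused.
try:
    check_scope(all_hosts, "gw-a")
except PermissionError as err:
    print("refused:", err)
```

    With a layout like this, the worst a single mistyped command can do is take out the partition behind the gateway it was typed on.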
