Sysadmin shut down server, it went ‘Clunk!’ but the app kept running

Hello? Anyone there? We understand that plenty of you in the northern hemisphere might not bother this week. For those of you who are still working, welcome to another instalment of “Who, me?”, The Register’s confessional column in which readers reveal their worst mistakes. This week meet “Rick” who told us that in the mid- …

  1. alain williams Silver badge

    Halted machine on other side of the planet

    Some 25 years ago: a small amount of inattention and it was a machine in California, not the machine in Blighty that I powered down. Whoops! I sent a grovelling email & had to wait until they arrived the following morning.

    Fortunately: a development machine so my only penalty was to be the butt of jokes for a while.

    1. monty75

      Re: Halted machine on other side of the planet

      I've done the same with my home server. Accidentally did a shutdown -h instead of shutdown -r whilst remoting in from India. Not a lot you can do about that when you're five thousand miles away.

      1. DougS Silver badge
        Pirate

        Re: Halted machine on other side of the planet

        You could have hacked into your home city's power grid and caused an outage long enough for your home UPS to drain, so when the power returns your server will power up.

        1. Anonymous Coward
          Anonymous Coward

          "so when the power returns your server will power up."

          I avoid having expensive machines power up as soon as they see an electron arriving at their PSU - there are situations where power goes up and down for a while, and if the UPS can't keep feeding everything properly, things can get nasty.

          If you really need to boot remotely, wake-on-LAN, remote-controlled switches, or proper server out-of-band management are better solutions.

          1. DougS Silver badge

            Re: "so when the power returns your server will power up."

            I was talking about a home system of the guy I was responding to. I agree you don't want servers in a datacenter powering up just because they see power - if for no other reason than the inrush might cause even more problems than the outage did. But at home, if you have something you want up all the time like a home email server, it better come up when the power does or it may be down for the duration of your vacation if it loses power on your way to the airport.

            1. Cirieno

              Re: "so when the power returns your server will power up."

              Not if it's a Gigabyte H61M-USBV3 -- this POS mboard gets stuck in a BIOS reboot loop if you try to boot it with any external USB hard drives plugged in... I've had to make a midnight drive some 200 miles home, in the middle of the week, to pull two cables and reboot, else it would have been rebooting/powering the drives/failing for ten days before I was "due" home again.

            2. CrazyOldCatMan Silver badge

              Re: "so when the power returns your server will power up."

              if it loses power on your way to the airport

              Had to drive home from a holiday in Wales once - my parents were house-sitting for us and the guy arrived to service the AC in the room at home where I have the servers.

              He did the service and then the muppet proceeded to plug his small vacuum cleaner into one of the power strips clearly marked "Connected to UPS - Computers only" and blow up the UPS.

              Rather than trying to talk my elderly parents[1] through how to shut everything down cleanly (the UPS went into bypass mode and was continuously bleeping), we drove the 150 miles back home so that I could reset everything and bypass the UPS (which needed a new power board).

              The idiot AC technician then had the temerity to phone me up later that month to try and book his next visit. My reply was short and to the point and involved snowballs and very hot places. He's lucky that I didn't send him the bill for fixing the UPS.

              [1] One of whom was technically capable but had very poor eyesight and co-ordination and the other could see OK but was clueless about computers

      2. Nick Kew Silver badge

        Re: Halted machine on other side of the planet

        I must be too boring: always been too super-careful with distant machines. In setting up a firewall, I've used a cron job to reset everything every few hours, as an ultimate failsafe against accidentally locking myself out. The cron job gets stopped only once I've finished configuring and verified my own access.

        Nowadays I have a cloud-based server and a web-based control panel. I can ssh in as root, but for something like a reboot I'll use the web panel to protect from certain possible accidents.
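
        That failsafe pattern can be sketched roughly like this (the schedule, script name and accept-all policy below are assumptions - the real "reset" would be whatever known-good state you trust):

        ```shell
        # Illustrative crontab entry: every two hours, fall back to a
        # known-good firewall state so a bad rule can't lock you out for long.
        # 0 */2 * * * root /usr/local/sbin/firewall-failsafe.sh

        # firewall-failsafe.sh (sketch): revert to accept-all and flush rules.
        # Remove the crontab line only once the config is finished and you've
        # verified your own access still works.
        iptables -P INPUT ACCEPT
        iptables -P FORWARD ACCEPT
        iptables -P OUTPUT ACCEPT
        iptables -F
        ```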

        1. GrumpenKraut Silver badge
          Thumb Up

          Re: Halted machine on other side of the planet

          > I must be too boring: always been too super-careful ...

          Have an upvote from another boring person.

        2. CrazyOldCatMan Silver badge

          Re: Halted machine on other side of the planet

          I've used a cron job to reset-everything every few hours

          On my home virtualisation server I was getting random hangups in one of the VMs (happened to be the old, much patched and upgraded Ubuntu VM that handled email and web service) so I put in a cron job that pinged it every 30 minutes and, if 3 replies failed, would stop and restart the VM.

          Finally solved that one by cloning it to a new VM and using the new VM instead - which never, ever falls over. I suspect that one of the SSDs has a transient error or there was a file size error on the qcow2 image and copying it fixed the error.
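
          A rough sketch of that sort of watchdog (the VM name and the virsh restart commands are assumptions; the real job would run from cron every 30 minutes):

          ```shell
          # Illustrative ping watchdog: three consecutive failed pings
          # trigger a restart of the guest.
          check_vm() {
            vm_host=$1
            fails=0
            for try in 1 2 3; do
              # one ping, short timeout; count the misses
              ping -c 1 -W 5 "$vm_host" >/dev/null 2>&1 || fails=$((fails + 1))
            done
            if [ "$fails" -eq 3 ]; then
              echo "no reply from $vm_host three times; restarting"
              # virsh destroy "$vm_host" && virsh start "$vm_host"   # on the real host
            else
              echo "$vm_host is up"
            fi
          }
          ```

          Run from cron every half hour it approximates the behaviour described; the destroy/start pair stays commented out here since it is host-specific.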

      3. Anonymous Coward
        Anonymous Coward

        Re: Halted machine on other side of the planet

        I typed shutdown -h on a system to bring up the help. Except it didn't bring up the help :-)

        1. Scott Marshall
          Black Helicopters

          Re: Halted machine on other side of the planet

          We're the Sysadmin Police, and we're here to "help" you.

          1. CrazyOldCatMan Silver badge

            Re: Halted machine on other side of the planet

            Sysadmin Police, and we're here to "help" you

            Please take no notice that we appear to be guiding you towards *that* window.. you know, the famously loose one..

        2. waldo kitty
          Facepalm

          Re: Halted machine on other side of the planet

          I typed shutdown -h on a system to bring up the help. Except it didn't bring up the help :-)

          yeah, that should have been "--help" instead of that old DOSism "-h" :lol:

      4. HWwiz

        Re: Halted machine on other side of the planet

        @Monty75

        Yes there is. You log into the iLO or iDRAC, or whatever your server has, and simply power it back up. No matter where it is in the world. That's why servers have remote access!

        1. monty75

          Re: Halted machine on other side of the planet

          Not when the "server" is a Raspberry Pi hanging off the end of a consumer broadband connection.

          1. Francis Boyle Silver badge

            Re: Halted machine on other side of the planet

            Not when the "server" is a Raspberry Pi hanging off the end of a consumer broadband connection.

            Then get another Pi to manage the first one. The solution* is always another Raspberry Pi.

            *Even when there isn't actually a problem.

      5. onefang Silver badge

        Re: Halted machine on other side of the planet

        "Not a lot you can do about that when you're five thousand miles away."

        This is the sort of thing that Intel's IME is actually good for: remote control of the power button, even for turning on accidentally powered-down servers. Though generally you need a server-class computer to use that sort of thing, even if it is in most / all Intel CPUs these days.

      6. Nathanial Wapcaplet

        Re: Halted machine on other side of the planet

        live and learn to set up Wake-on-LAN, my good man
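
        For the curious, the Wake-on-LAN "magic packet" itself is trivial: six 0xFF bytes followed by the target's MAC address repeated 16 times, 102 bytes in all, usually sent by UDP broadcast (tools like wakeonlan or etherwake do this for you). A sketch of building one in shell, with a made-up MAC:

        ```shell
        # Build a 102-byte WoL magic packet: 6 x 0xFF, then the MAC 16 times.
        # Sending is left to the real tools; the MAC here is a made-up example.
        build_magic_packet() {
          mac=$(printf '%s' "$1" | tr -d ':-')
          for i in 1 2 3 4 5 6; do printf '\377'; done   # the FF header
          reps=0
          while [ "$reps" -lt 16 ]; do
            # split the MAC into hex pairs, emit each as a raw byte
            printf '%s\n' "$mac" | fold -w2 | while read -r byte; do
              printf "\\$(printf '%03o' "0x$byte")"
            done
            reps=$((reps + 1))
          done
        }

        build_magic_packet 00:11:22:33:44:55 > /tmp/wol.bin   # 102 bytes
        ```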

    2. Rattus
      Alert

      FIXED: Halted machine on other side of the planet

      Fingers are often faster than synapses :-)

      The solution?

      molly-guard

      Taken from packages.debian.org:

      The package installs a shell script that overrides the existing shutdown/reboot/halt/poweroff/coldreboot/pm-hibernate/pm-suspend* commands and first runs a set of scripts, which all have to exit successfully, before molly-guard invokes the real command.

      One of the scripts checks for existing SSH sessions. If any of the four commands are called interactively over an SSH session, the shell script prompts you to enter the name of the host you wish to shut down. This should adequately prevent you from accidental shutdowns and reboots.

      molly-guard diverts the real binaries to /lib/molly-guard/. You can bypass molly-guard by running those binaries directly.
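
      The heart of that check can be sketched in a few lines (an illustration of the idea, not molly-guard's actual code; the function name is made up):

      ```shell
      # Minimal sketch of a molly-guard-style hostname check: only hand over
      # to the real binary once the operator types this machine's hostname.
      guarded_shutdown() {
        this_host=$(uname -n)
        printf 'I demand the name of the host you want to shut down: '
        read -r answer
        if [ "$answer" = "$this_host" ]; then
          echo "proceeding: shutdown $*"
          # exec /lib/molly-guard/shutdown "$@"   # the diverted real binary
        else
          echo "good thing I asked; not shutting down $this_host"
        fi
      }
      ```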

      1. HCV

        Re: FIXED: Halted machine on other side of the planet

        Thank you! I've been trying to remember the "girl the plastic cover is named after" name on and off for years, and for whatever reason both my Google Fu and my friends' memories have failed me. I must travel in the wrong circles, or live in the wrong country.

      2. CrazyOldCatMan Silver badge

        Re: FIXED: Halted machine on other side of the planet

        molly-guard

        Now installed on one of my devuan VMs - let's see how it goes..

      3. Criggie

        Re: FIXED: Halted machine on other side of the planet

        Yep - I heard it was good, so I tested molly-guard by running the molly-guard binaries directly.

        Works exactly as described on the box - insta-shutdown when called directly.

    3. CrazyOldCatMan Silver badge

      Re: Halted machine on other side of the planet

      and it was a machine in California

      Been there, done that. Fortunately, we had had the foresight to have all the machines on a combined KVM/serial/power board that you could telnet into (remember telnet? Oh, how innocent we were in those days) and bounce the power on individual servers..

      I did get questioned about an unexpected reboot by the US sysadmin but, since it was in their early morning, he didn't really care. And it was only the SMTP spool so nothing of value was affected :-)

      1. Criggie

        Re: Halted machine on other side of the planet

        I remember a chap doing desktop support at a church. All the staff went out for lunch, and locked the office door while he was working elsewhere on the site. However he needed to reboot the server in the office.

        No iLO or anything - it's a glorified desktop because churches lack money. No UPS either, so therein lies the basis of his fix.

        What was his fix? The power distribution board was overhead, so he hit the cut-outs labelled "front office" until the room got suspiciously quiet. After a 10-count, he turned the power back on and the server rebooted. Problem solved and he could leave.

        1. Juan Inamillion

          Re: Halted machine on other side of the planet

          What was his fix? The power distribution board was overhead, so he hit the cut-outs labelled "front office" until the room got suspiciously quiet. After a 10-count, he turned the power back on and the server rebooted. Problem solved and he could leave.

          Love it. Lateral thinking. The sort of thing a British roadie would do...

    4. Gerhard Mack

      Re: Halted machine on other side of the planet

      "Some 25 years ago: a small amount of inattention and it was a machine in California, not the machine in Blighty that I powered down."

      I've done this often enough that molly-guard is a standard package that gets installed on all servers that I maintain. It helpfully requires that you confirm a shutdown or reboot by typing the hostname of the machine in question.

  2. big_D Silver badge

    DEC Engineer

    No, not me...

    We had a series of VAX 11/7xx machines in a row in the computer room, about a dozen of them.

    DEC sent an engineer out to do some maintenance and upgrade the memory on one machine. We duly moved all jobs and users to the next machine in line, shut the machine down and told the engineer he could power down the machine.

    He disappeared behind the CPU cabinet and... Nothing. He reappeared, the VAX was still in Shutdown mode. His face went a bit pale and, suddenly, there were screams and shouts from the next machine in the row. You know, the one we had shoved all the users and jobs onto from the one we had shut down.

    He'd managed to mix up the circuit breaker for the machine he wanted to turn off and the one next to it.

    1. Rich 11 Silver badge

      Re: DEC Engineer

      He'd managed to mix up the circuit breaker for the machine he wanted to turn off and the one next to it.

      I've seen a competent and experienced senior sysadmin do that. A few months previously, while he'd been on holiday and left everything in his deputy's capable hands, we'd had a fourth minicomputer added to our machine room, and a sparky had been brought in to ensure everything was wired up properly. They also -- quite sensibly -- took the opportunity of the weekend's downtime to sort out the less than perfect wiring of the previous installations and to standardise everything. The end result was that we all naturally assumed that the location of each shiny, new circuit breaker on the board matched the location of the minicomputer it controlled. Come the first occasion where it was necessary to power down one of our machines, we were strongly reminded of the old adage to never assume anything.

      1. Anonymous Coward
        Anonymous Coward

        Re: DEC Engineer

        You think that's bad? We switched off all the servers in an entire building because the power company was having to drop power for longer than the UPSes could stay up. Cue lots of meetings, late-night working to shut everything down nicely, etc.

        At which point they switched off the power on our *other* building...

    2. This post has been deleted by its author

    3. imanidiot Silver badge

      Re: DEC Engineer

      That's what you get for letting an outside engineer operate the circuit breakers in your server room. Someone familiar with the system should have been doing that imho.

      1. Adam Foxton

        Re: DEC Engineer

        @imanidiot

        Absolutely disagree. Breakers should be labelled with which machine(s) they power, and ideally machines labelled with appropriate breakers too.

        At the very least have a map.

        Having to know "okay, so it's the second breaker down for the first machine's primary PSU (excluding the red one for the UPS) and fifteenth up on that other switchboard for the secondary PSU, ah, no, wait, fifteenth /single phase/ one" is a recipe for disaster.

        1. imanidiot Silver badge

          Re: DEC Engineer

          @Adam Foxton,

          I don't think you disagree with me, actually. Breakers SHOULD be clearly labelled - that's not in question. But even with clearly labelled breakers, it shouldn't be an outside contractor unfamiliar with your setup that does the actual switching. If it's YOUR mission-critical stuff, YOU should be doing the switching.

          I'm absolutely NOT advocating "security by obfuscation" or anything. Even people familiar with how stuff works, who could find the right breakers in their sleep, shouldn't have to rely on memory to find the correct switch at 3am on a Monday morning (or any other time of any day).

          1. The Oncoming Scorn Silver badge
            FAIL

            Re: DEC Engineer

            The wiring of the circuit breakers & the labelling thereof down in the basement of my old house bore only a passing resemblance to what it might actually be connected to.

            1. Rich 11 Silver badge

              Re: DEC Engineer

              The wiring of the circuit breakers & the labelling thereof down in the basement of my old house bore only a passing resemblance to what it might actually be connected to.

              A few years ago I stayed in a hotel which had been built in four stages over about 120 years, with the original building having been retro-wired once electricity became popular, and it and the various wings rewired whenever the regs of the day required it. They were having power problems all weekend, with no-one knowing which circuit controlled what, and a sparky plunging entire sections into darkness apparently at random while trying to figure out what was going on. The only thing which made the situation bearable was that the smaller bar and the kitchen had their own supply, so we were never short of refreshment. It was, however, a lottery as to whether or not you'd get your feet wet mid-flow as the lights went out when visiting the adjacent toilet, unfortunately located in the Victorian block. I also don't recommend trying to take a dump in a narrow tiled cubicle lit only by a mobile phone held in one hand, and... well, you get the drift.

        2. CrazyOldCatMan Silver badge

          Re: DEC Engineer

          Breakers should be labelled with which machine(s) they power

          A bit like how network leads should be labelled correctly. So that when your contract network engineer disconnects the Head of Site (for the 3rd time) you can be sure that they can't blame *you*.

          They will of course, but at least you'll know you were right.

          1. Anonymous Coward
            Anonymous Coward

            Re: DEC Engineer

            I've no experience of this in these big network situations, but from real-life experience in other bits of connected or powered kit I'd never trust the label on a lead. I'd use it as a guide to make sure it began and ended where it was meant to, but that's all. Because there's always a good chance that some stupid sod has moved the label or swapped the leads over or something. I even came across a case where someone had switched some patch cables for longer ones, and had then replaced the labels onto the new cables, but not necessarily on the correct ones. And one label had lost its stickiness, fallen off or something, and been stuck back loosely on the nearest bit of wire (which wasn't even a patch cable).

      2. The First Dave

        Re: DEC Engineer

        This is _exactly_ why you MUST let the outside person do it. He might be more likely to make the mistake, (though is more likely to check things properly beforehand,) but when things go wrong, it's not _you_ that gets the bullet.

        1. usbac

          Re: DEC Engineer

          @The First Dave

          "This is _exactly_ why you MUST let the outside person do it. He might be more likely to make the mistake, (though is more likely to check things properly beforehand,) but when things go wrong, it's not _you_ that gets the bullet."

          Back when I worked in consulting, I often thought that was the reason we were there. I thought the in-house guys were more than capable of doing some of the projects we worked on. I think the real reason they called us was to have someone to blame if things turned to shit!

    4. This post has been deleted by its author

    5. Anonymous Coward
      Anonymous Coward

      Re: DEC Engineer

      We had two servers in the office: one was running some new software in testing before going live, the other was being used for something visible on the company website. So when we had a serious problem with the software under test, the firm came in to fix it. I was delegated to stay and keep an eye on them that night, as their quick fix had taken a bit longer than that. At one point they asked if they could reboot the server. I said yes and showed them which machine it was in the racking. However, they were a few minutes away from being ready, so I went to collect my takeaway delivery from reception.

      As I returned they had just powered down the server and were looking confident. My phone rang seconds later and it was the boss asking why the website was down. I looked at the racks and saw the wrong server rebooting. I asked the staff if they had rebooted the wrong one on purpose and they went ashen. They protested that they hadn't, and that the one I'd shown them had been powered down. They walked round the left side of the racks and showed me the box they'd touched, which was the first one you came to in that rack. I said that was the wrong one; I had taken them round the right side and shown them the correct one, which was also the first one you came to, but from the other side.

      It might have been understandable if the boxes hadn't been labelled, but this one had a note saying do not touch without checking with the manager, and red striped tape round it to reinforce the message. That server was actually running something critical for the website. It wasn't in the server room because it had only arrived a few days earlier and had been rushed into service that day due to a PSU issue with the live one. Fortunately it came back up quickly, but the software firm were deeply unpopular as a result.

    6. HWwiz

      Re: DEC Engineer

      big_D

      Did you know there's a certain large bank in the UK that still has loads of DEC VAX machines running? I walk past them every night.

      Even have 5 dead ones that are kept for parts.

  3. Pascal Monett Silver badge

    I crashed a server once, at client site

    I was working on client site about 15 years ago, one of a multinational's many branches. I had access to the server room, although I didn't really need it because I'm a developer, not a sysadmin. When the summer was hot, I did sneak in from time to time, to cool off, but I digress.

    I knew the product I worked with didn't much like having the client opened locally on the server. I knew it. One day, for some reason I can't even recall, my workstation was doing something time-consuming and I was in a hurry to get another thing done. Those among you with experience will recognize right there a recipe for disaster.

    I decided that it was a small thing and I could do it quickly on the server. I knew where it was, so I badged myself into the server room, walked over to the dedicated Solaris server I needed and double-clicked the icon for the client. The client's loader flashed a screen, and the server went down then and there. I was left to facepalm while nervously watching the server come back online.

    Of course, when a production server goes down at a site that employs over a thousand people, there will be people who notice. I could only go to the head of IT and report my actions. As penance, I asked for my access to the server room to be revoked, which the IT manager graciously accepted.

    I never did try that again, anywhere.

    1. OzBob

      Re: I crashed a server once, at client site

      "I didn't really need it because I'm a developer, not a sysadmin"

      I call BS, there's not a developer alive who doesn't think he can do a sysadmin's job better. (I coded for 6 years before I switched over and every now and then I have to knock back a developer in a meeting)

      1. Pascal Monett Silver badge

        Apparently, there is at least me.

        1. spodula

          And me.. although admittedly in my case, there was a "Learning experience" which proved the point to me.

      2. Nick Kew Silver badge

        Re: I crashed a server once, at client site

        I call BS, there's not a developer alive who doesn't think he can do a sysadmin's job better.

        I've certainly encountered sysadmins whose job I can do better than them. In some cases I did - 'cos otherwise it just wouldn't have got done.

        But I wouldn't say that of sysadmins in general. Nor would I wish to antagonise a sysadmin by backseat-driving the job, unless it was clear that the individual was one of those who really need my help.

      3. Loyal Commenter Silver badge

        Re: I crashed a server once, at client site

        I call BS, there's not a developer alive who doesn't think he can do a sysadmin's job better.

        I wouldn't want to do a sysadmin's job better than they do. For a start it would involve learning AD and a whole slew of internal procedures, which would be sure to push something more interesting out of my brain.

      4. Anonymous Coward
        Anonymous Coward

        @ozbob

        "I call BS, there's not a developer alive who doesn't think he can do a sysadmin's job better."

        In my opinion that would depend more on the sysadmin than the developer(s). Most developers I came across started complaining when they couldn't do their job properly. And why wouldn't they complain? In many cases they're the actual heart of the organization, the part that keeps the whole thing running (especially if you're selling software products).

        I've worked on both sides of the spectrum (though I'm not a professional developer, as in: never took on a full-time job as a developer) and in my opinion it's mostly certain sysadmins who come over as sort of arrogant because they know how to keep the company safe. And if you then keep in mind that "keeping safe" usually boils down to "limiting users", you've got yourself a dilemma.

        Of course in many cases those sysadmins weren't really arrogant at all, but the way they expressed themselves... ye gods. And there lies your problem in the making because action = reaction.

        1. cream wobbly

          Re: @ozbob

          ' "I call BS, there's not a developer alive who doesn't think he can do a sysadmin's job better." '

          ' In my opinion that would depend more on the sysadmin than the developer(s). '

          Nah, in my experience, it doesn't matter how genius the sysadmin is, developers are always complaining that they could do a better job with their eyes shut...

          ...which is generally how the buggers code, anyway.

        2. J. Cook Silver badge
          Big Brother

          Re: @ozbob

          @ShelLuser:

          Of course in many cases those sysadmins weren't really arrogant at all, but the way they expressed themselves... ye gods. And there lies your problem in the making because action = reaction.

          That's something that a lot of sysadmins have problems with, myself included. We tend to be blunt, to the point, and cranky when asked to allow Yet Another Hole in the security wall without adequate explanation why.

      5. CrazyOldCatMan Silver badge

        Re: I crashed a server once, at client site

        there's not a developer alive who doesn't think he can do a sysadmin's job better

        Or sees any need to follow those tedious change and release procedures that were put in place after the last IT disaster that they caused by releasing software with a major bug that crashed all the POS tills..

        Not that I'm bitter or anything.

    2. hmv

      Re: I crashed a server once, at client site

      That right there is the reason I fought for years to get the list of people allowed to enter a DC reduced to the absolute minimum.

      Not because sysadmins are any less likely to do Dumb Things (although we do get more opportunity to appreciate the "measure twice, cut once" rule), but because the fewer people who can do Dumb Things in a data centre, the less frequently painful lessons are learnt.

      1. Nolveys Silver badge

        Re: I crashed a server once, at client site

        although we do get more opportunity to appreciate the "measure twice, cut once" rule

        Why is the server down and why are you holding a gas powered angle grinder?

        1. CrazyOldCatMan Silver badge

          Re: I crashed a server once, at client site

          why are you holding a gas powered angle grinder?

          Because I couldn't find a mains outlet to plug my electric one in?

  4. bombastic bob Silver badge
    Facepalm

    type 'reboot' in the local console instead of the remote one

    Did that a few weeks ago, while working on important kernel updates for a popular ARM platform. whoops. And I had over 100 days of uptime on the box that got rebooted, too.

    it was too late when I realized it. I had to watch the shutdown complete and the system restart. It turned out ok, as it eventually motivated me to update the kernel+world and some necessary kernel modules. All good now.

    [working versions of the kernel updates are going into the target OS, too. all good]

    icon because it's what I did when I saw it.

    1. Symon Silver badge
      Coat

      Re: type 'reboot' in the local console instead of the remote one

      "over 100 days of uptime"

      How do you know you're a sysadmin? When your uptime is longer than you've had a girlfriend*.

      *Other partners are available.

      1. jake Silver badge

        Re: type 'reboot' in the local console instead of the remote one

        As I typed elsewhere a couple weeks ago: "Who gives a flying fuck about uptime on any single given machine? Keeping it up forever pales in comparison to overall system stability and security. If a box needs a reboot, then reboot the fucking thing already! It's not like it sentences your firstborn to death or anything.

        "Honestly, I thought THAT particular DSW was over a couple decades ago."

        1. spodula

          Re: type 'reboot' in the local console instead of the remote one

          "It's not like it sentences your firstborn to death or anything."

          You've obviously never dealt with senior product managers.

          1. DougS Silver badge

            Long uptimes are a disaster waiting to happen

            Except perhaps in very stable systems. Not talking about security patching, though that matters too. I'm talking about startup scripts. If you apply patches to application software, sometimes it will futz with startup scripts - either removing your customizations or making changes that don't take your changes into account. Or you might change them yourself, because of other changes you made (maybe you add a drive to a cluster, and modify the mount script for that cluster app accordingly)

            If you haven't rebooted in a year, and something goes wrong that isn't immediately obvious, it can be incredibly difficult to track down. Especially if it is something like a typo in the mount point for a new cluster drive, which causes the cluster app to mostly but not entirely function, and which you won't notice just looking at 'df' unless you are intimately familiar with the application.

            You should never let servers go too long without a reboot, where "too long" varies depending on how much non-reboot change activity is happening on it.

            1. tfb Silver badge

              Re: Long uptimes are a disaster waiting to happen

              This has happened to me several times. It's remarkably easy for machines to get into a state where they won't boot because no-one has bothered to test they will. A particularly nice case is a machine which had a disk in its boot mirror fail: it gets replaced, resynced but the bootblocks never get written because someone forgot. Then the same thing happens to the other half of the mirror, and now you have to hope you can either netboot the thing or have the right media.

            2. l8gravely

              Re: Long uptimes are a disaster waiting to happen

              I agree that rebooting things more frequently is a nice thing to do, but when you have legacy applications which aren't horizontally scalable, then it can be extremely difficult to get the downtime. I had a bunch of Netapps with over three years of uptime before I was allowed to shut them down, and then only because we were moving them across town.

              Let me tell you, when they booted and came up fine, I was very happy! They were off netapp support, and disk failures were at the wrong end of the bathtub curve... it's got to be replaced one of these days, but they won't move until it falls over I suspect.

            3. onefang Silver badge

              Re: Long uptimes are a disaster waiting to happen

              That is precisely why I reboot computers after any update that touches something involved in the boot process. I want to test that right now, not at some random point in the future when it's important for the machine to reboot quickly and smoothly coz the boss is breathing down your neck and the customers are waiting for normal service to return after the week-long power failure that drained all your UPSes. That's exactly the wrong time to find out you need to reboot into repair mode to undo some random update that happened months ago, and that it might take some time to fix.
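On Debian/Ubuntu-flavoured systems there's even a flag for this: apt drops a file when an update wants a reboot (on Red Hat-ish boxes `needs-restarting -r` plays a similar role). A quick check, as a sketch:

```shell
# Debian/Ubuntu convention: /var/run/reboot-required appears when an
# update (kernel, libc, etc.) wants a reboot; the .pkgs file lists which
# packages asked for it.
if [ -f /var/run/reboot-required ]; then
    echo "reboot pending"
    cat /var/run/reboot-required.pkgs 2>/dev/null   # which packages asked
else
    echo "no reboot flagged"
fi
```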

              1. tfewster Silver badge

                Re: Long uptimes are a disaster waiting to happen

                I usually recommend rebooting before making any significant changes as well as after. If it was broken before I got there, I don't want to get the blame.

        2. John Sanders
          Childcatcher

          Re: type 'reboot' in the local console instead of the remote one

          I keep asking the same question of people who religiously refuse to reboot or patch for fear of the "downtime"

          I ask: Is this the fire department server? Police? Emergency services? The NHS?

          No?

          Then patch & reboot out of hours during your period of least activity.

        3. cream wobbly

          Re: type 'reboot' in the local console instead of the remote one

          I quite often advise other sysadmins to reboot the box instead of clicking about trying to "find root cause", much to their horror. In my eyes, a 30 day uptime is a prime contributor to root cause, and because of failure to patch, an increasing security risk. (Or if it's Windows, 7 days.)

          You can do your analysis after you've rebooted. But I know they won't.

          If you think that's too harsh, have a word with yourself.

          1. Doctor Syntax Silver badge

            Re: type 'reboot' in the local console instead of the remote one

            "In my eyes, a 30 day uptime is a prime contributor to root cause"

            If you think a 30 day uptime is a prime contributor to root cause I've got news for you. The prime contributor is some other problem, maybe a memory leak, that's slowly poisoning your system. Your reboot is just treating the symptom.

        4. Terry 6 Silver badge

          Re: type 'reboot' in the local console instead of the remote one

          Even with standalone machines, and even with home PCs, there are still people proudly saying that they never reboot the thing. So they are proudly burning carbon every night. And why? To save a couple of minutes' boot time in the morning? Or because they think hardware in 2018 is too flaky to boot regularly?

      2. tfb Silver badge
        Meh

        Re: type 'reboot' in the local console instead of the remote one

        How do you know when you're a real sysadmin? When, rather than a spurious pride that the machines have been up for a very long time, you start worrying about machines with uptimes greater than the time since the release of the most recent critical security patch.
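One rough way to quantify that worry, sketched for a Linux box with GNU find (the paths and the heuristic are assumptions, not a proper audit): compare the boot time against the newest file under /boot.

```shell
# If anything under /boot is newer than the last boot, you're running an
# older kernel than the one installed - i.e. uptime has outlived a patch.
now=$(date +%s)
up=$(cut -d. -f1 /proc/uptime 2>/dev/null || echo 0)
boot_time=$((now - up))
newest=$(find /boot -type f -printf '%T@ %p\n' 2>/dev/null | sort -rn | head -n1)
if [ -z "$newest" ]; then
    echo "nothing in /boot to compare"
elif [ "${newest%%.*}" -gt "$boot_time" ]; then   # strip fractional seconds
    echo "newer kernel installed since boot - reboot overdue"
else
    echo "running the newest installed kernel"
fi
```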

        1. Doctor Syntax Silver badge

          Re: type 'reboot' in the local console instead of the remote one

          "How do you know when you're a real sysadmin? When, rather than a spurious pride that the machines have been up for a very long time"

          If you're an old enough sysadmin you can remember when patches arrived at very infrequent intervals. Personally I can remember a machine which we really didn't ever want to reboot as we weren't convinced its disks would restart. That had an uptime of years.

          1. imanidiot Silver badge
            Joke

            Re: type 'reboot' in the local console instead of the remote one

            Servers with high up time are like senior citizens. You put them to bed and you never know if they'll wake up again.

            1. CrazyOldCatMan Silver badge

              Re: type 'reboot' in the local console instead of the remote one

              You put them to bed and you never know if they'll wake up again

              And people misunderstand it if you apply the same percussive maintenance[1] to them as you would to old Sparc drive arrays..

              [1] Moved a very old array from one site to another. As expected, about 50% of the drives failed to spin up. The Sun engineer's advice was to take the drives out, tap them sharply on the edge of the desk, then put them back in. Reduced the drive failure to only one drive (and that one was terminal, if the dark char-marks on the circuit board were anything to go by)

      3. katrinab Silver badge
        Unhappy

        Re: type 'reboot' in the local console instead of the remote one

        [katrina@naoto ~]$ uptime

        12:03PM up 174 days, 14:33, 2 users, load averages: 0.15, 0.19, 0.17

        Time with girlfriend, 4322 days, 2:38

        So I guess I'm not a sysadmin

      4. Doctor Syntax Silver badge

        Re: type 'reboot' in the local console instead of the remote one

        "When your uptime is longer than you've had a girlfriend"

        Stay off the Viagra.

  5. Anonymous Coward
    Anonymous Coward

    I called an admin in an offshore office on his mobile, and asked him to reboot a hung email server. He came back to the phone telling me it was done, but I was still moving the mouse around on the hung remote session. When pressed to confirm details, I couldn't recognise the server name he gave me. It all became clear when we figured out he'd changed companies, kept his mobile number, and had just rebooted his new employer's mail server. Oops...

  6. Tigra 07 Silver badge
    Facepalm

    I pushed the wrong button on the fan this morning and blew all my papers on the floor...

    1. jake Silver badge

      Not a fan of that ...

      ... I flipped the wrong breaker in my shop once, starting up a ceiling fan that hadn't run in the several years since I had installed air conditioning ... dumping the dust that had collected on the blades onto the components of a 351 Cleveland that I was about to start assembling.

      I didn't realize that I could swear continuously (at myself!) for over an hour. Not the most joyous occasion of my life. Took a couple days to clean it up to my satisfaction ...

  7. jake Silver badge

    I was burning in some two dozen nodes of T-carrier gear.

    After the end of the ten day burn-in period, one of the final tests was to physically pull the plugs on the redundant power supplies, and then re-insert the plug, before moving on to the next supply. Lather, rinse, repeat, first with the supplies plugged into "power A", and then the supplies plugged into "Power B". A Sun workstation logged the relevant voltages & currents, to be printed out as part of the complete verification package for each machine. I got to the end of the long line of plugs, absentmindedly noted that this plug was different from the rest & unplugged it ... only to take down the Sun box, and completely trash the drive holding the data. It's the only time I ever lost data on a CDC Wren SCSI drive ... and it invalidated the ten day burn-in results for about two million dollars worth of "must ship" gear at quarter end.

    I shouldered the blame, as I had pointed out to my Boss that having the Sun plugged into the test power bus was probably not a very good idea. All I could do was claim exhaustion, having worked a couple of weeks of fourteen hour days because Sales had over-sold production capability and for some reason TheBoard decided we had to make the projected sales figures. Fortunately, my Boss managed to cover my ass & I kept the job. Daft thing is that I wasn't even part of the QA group that managed the burn-in, I was only roped in to help because of a lack of hands ...

    1. Nick Kew Silver badge
      Pint

      Re: I was burning in some two dozen nodes of T-carrier gear.

      Jake, in this instance you should be the main story, not a comment.

  8. Anonymous Coward
    Anonymous Coward

    BBC2's transmission mixer...

    Many years ago, BBC Presentation (the part that plays the programmes out and does the announcements etc) had 3 control rooms and associated equipment for BBC1 and 2. This was to allow for upgrades, testing, redundancy etc. The areas were called Red, Green and Blue. One day a colleague and I were working in the Green area (not on air) while Red was doing BBC1 and Blue BBC2. We wanted to restart the Green mixer so two of us trolled through to the apps room* taking great care to "support each other" and stood in front of the Red bays**. These were clearly painted red. We were seconds away from pulling the power before we both said "NO!!!! Green, Green Green...."

    I think it was only Wimbledon so not many people watching....

    *Broadcast/BBC terminology

    **Racks to most of you

    1. Doctor Syntax Silver badge

      Re: BBC2's transmission mixer...

      "I think it was only Wimbledon so not many people watching"

      As the Beeb often contrived to have Wimbledon on both channels (even in the days when it had the test match coverage) you might only have lost half of it.

  9. Why Not?

    Label, Label Label

    When you build a server put a label on every switch & power socket.

    Also check the Mac address if unsure.

    Been there, done that, took the sh1t

    1. Symon Silver badge
      Paris Hilton

      Re: Label, Label Label

      Yes, and may I also recommend the checklist?

      E.g. "Simple and seemingly obvious items like "FUEL QUANTITY" in a pre-departure checklist, or "LANDING GEAR DOWN" in a landing checklist, are there for a very important reason."

      http://aviationknowledge.wikidot.com/aviation:checklists

      1. monty75

        Re: Label, Label Label

        My first job in IT was with a major fibre network provider. They'd grown too fast and hadn't kept proper records of where their fibres actually went and how they were connected up. I was employed to map the network from a pile of surveyor's notes. There were parts of the network where fibre would go down a duct only to have disappeared by the time it got to the next inspection chamber. This was live fibre carrying live traffic so it must have gone somewhere but buggered if we knew where.

        Unfortunately, I didn't stay there long enough to find out if they ever tracked it down.

        1. I Am Spartacus
          Pint

          Re: Label, Label Label

          @Monty75

          This rings a distant bell. I may have worked at the same seat-of-the-pants telco as well.

          1. monty75

            Re: Label, Label Label

            @I Am Spartacus

            If it involved canals then it was the same place :)

          2. GrizzlyCoder

            Re: Reminds me of a story

            "This rings a distant bell. I may have worked at the same seat-of-the-pants telco as well."

            Likewise, especially if they started out by buying the "London Hydraulic Company"....

            I served a short time as the trainer for the GIS system they were building to try and map the underground plant so that they didn't have to keep toting the drawings from Bracknell every time they expanded/modified the network

        2. Jock in a Frock

          Re: Label, Label Label

          Nope, we're still scratching our heads.....

    2. Tinslave_the_Barelegged Silver badge

      Re: Label, Label Label

      Oh sure, like one place where the labelling of the underfloor wiring was dodgy, so they decided to put the labels on the ceiling tiles. Great idea, until one weekend the aircon guys had to come in and do work, collected all the ceiling tiles into a pile and at the end of the job restored them, not, as the saying goes, necessarily in the correct order.

    3. I am the liquor

      Re: Label, Label Label

      Indeed, good advice.

      The Dymo machine's not just for printing "fuck the police" labels to stick on the boss's car's bumper, after all.

      1. Chairman of the Bored Silver badge

        Re: Label, Label Label

        Good times! Probably as much fun as taking a razor blade, a stack of new bumper stickers, and then modifying the boss' 'US Navy Retired' bumper sticker to read 'US Navy Retarded'

        1. jake Silver badge

          Re: Label, Label Label

          Speaking of modifying stickers ... anybody here listen to Silly Con Valley's own KOME from the early '70s through the '80s? I've seen their decals as far away as Perth ... and a friend sent me a picture of one that reads "-98.5 Our Temp" on the back of a Cat at McMurdo ...

          1. Chairman of the Bored Silver badge

            Re: Label, Label Label

            Saw KOME sticker on a large ETM Electromatic transmitter power supply in Indiana, circa 2001.

            Speaking of which, I lived in Indiana in the early 90s, and the police gave out these public service announcement bumper stickers showing a seat-belted stick man and the words "Indiana - buckled up for life", which we would mod with razors so they read "Indiana - f...cked up for life". Bonus points if the stick man was manipulated into an appropriate posture to receive...

            1. onefang Silver badge

              Re: Label, Label Label

              In the place where I volunteer my IT services to seniors, about a month and a half ago, I was contemplating ways of getting better WiFi in the other end of the building. Inside a small room with a sign saying "STORAGE ROOM" on the door was a network socket, which would have been the perfect place to plug in an old Cisco AP I had found in a cupboard. So I updated and configured the AP while it was plugged into my laptop, then plugged it in to that room. No joy. The network socket had a label saying "PROJECT ROOM", and none of the staff, all of whom had been there far longer than I had, had ever heard of a project room. After some searching, in the main office, lying on the ground, hidden behind a rather heavy desk, was an unplugged network cable, also labelled "PROJECT ROOM". I plugged it into a spare port on one of the switches, ran back to the storage room, and the AP had lit up. Much testing later, and I declared the WiFi issues in that end of the building solved. The new AP was working perfectly, instead of dropping out all the time at that end of the building like the existing AP would, coz the existing AP was too far away at the other end.

              Over the last three weeks, the place had been closed while they replaced the old floor. Prior to that, we packed everything away in labelled boxes, so the construction crew could shift everything to one end of the building, replace the floor in the cleared area, move everything to the other end, and replace the rest of the floor. On Monday we got to move everything back, including plugging all the computers and phones back in. Luckily the network and phone gear had mostly been left in place, being bolted, nailed, glued, or embedded in the walls. Very lucky, as that PROJECT ROOM label was the only label. No labels on anything else; we would have had no idea what to plug in where. We got everything working again except for one phone, though it turned out the plug on one end of its cable had been cut off. I don't happen to have any RJ11s, but the IT guy they pay to look after the half of the computers that I don't look after said he has some back at his office, and promised to fix that on Tuesday. I usually only work there on Mondays.

              After the dust had settled, I gave a little "Label, label, label" type speech. I'm guessing they'll ask me to do that next week.

              P.S. For those wondering, the computers used by the office staff were paid for by the organisation itself, and they have been paying a computer business to look after those computers. The other computers were donated by a government scheme called "Broadband For Seniors", for the purpose of educating seniors and letting them use these computers. Educating seniors, and helping them with their computers / smartphones, is my volunteer job, so I get to look after these freebies. Oddly enough, one of those freebies is the most powerful computer in the entire building.

              There are archeological layers of "old" computer equipment in that place. I've learned that it sometimes pays to go digging, which is how I found that Cisco AP. They had asked me to advise them about a new scanner they wanted to buy. They told me their requirements, and my reply was "Hang on, I think I saw one of those in that cupboard over there." Came back a minute later, "Here, will this do?". They've been very happy with that scanner.

  10. Anonymous Coward
    Anonymous Coward

    Ooops....

    A few years ago I was replacing an old server we have co-located in a server farm. I was escorted to the rack where our server was located, and the engineer unlocked the cabinet and left me to work. I duly powered our server down, and went to the rear of the rack to remove the cables so I could slide the server out. I moved back to the front of the rack and started pulling on our server, but it would not slide out: the cables were all still plugged in.

    The cables I'd removed? Belonged to someone else's server. I quickly plugged everything back in and powered up the server, hoping that whoever it belonged to would not notice. I carried on with our server replacement, and no-one collared me when I left, so I did get away with it.

  11. Cem Ayin

    Copy & paste

    Many years ago I happened to be admin of a student computing lab running Solaris 9 on both server and workstations (those were the days...!) The workstations were set up to allow passwordless SSH logins from the server, so whenever a remote root shell session was needed on one of the workstations, the usual procedure was to log into the server, start (or re-attach) a GNU screen session and log into the clients from there. So far so good.

    One fine day I had to reboot a number of workstations I had been working on in "parallel" (really "time sharing" between the screen windows of course). So I typed the usual command: "/usr/sbin/shutdown -i6 -g0<CR>" in the first window, down and back up the client went, and being a lazy sysadmin, I just marked this same command line to be pasted into the second window, which I duly did next. Unfortunately, the second workstation had somehow reset the SSH connection in the meantime, so after the "/" had been consumed from the cut buffer and gone down the bit bucket, the SSH session's TCP connection was closed with a "connection abort" error message.

    Guess what happened next?

    Well, the terminal was of course feeding the string "usr/sbin/shutdown -i6 -g0<CR>" to the underlying shell session *on the server*; this being Solaris 9, root's home directory, and thus the one that was normally your CWD when working as root, was "/", and, yes - hate to admit it - "." was in root's $PATH...

    (Fortunately, I had made it a habit *not* to use -y with shutdown, so the server was duly asking for confirmation of the shutdown, which I *happily* declined...)
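The "." in root's $PATH was half the trap here. A portable sketch for auditing a PATH string (the function name is made up) that flags "." or an empty element, either of which makes the current directory executable:

```shell
# Returns success (0) if the given PATH-style string would let the shell
# execute programs from the current working directory.
path_has_cwd() {
    case ":$1:" in
        *:.:*|*::*) return 0 ;;   # "." or an empty element present
        *)          return 1 ;;
    esac
}

# e.g.  path_has_cwd "$PATH" && echo "WARNING: CWD is on your PATH"
```

Wrapping `$1` in colons means every element, including the first and last, is checked the same way.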

  12. codebran

    I was a trainee programmer in a small software house in London (30yrs back). 10 staff all ran off a little Unix box that was located on a table to the right just as you walked into the office.

    As I was keen I was often the first one in and would flick the switch on the machine to boot it. This process became one of nonchalance until one day I did it and then walked past the MD's office to see a bunch of faces looking out of the window inquiring why the demo had been so rudely interrupted.

  13. Anonymous Coward
    Anonymous Coward

    My shop a long time ago had a rather large outage

    Electricians outside the datacentre (of a global financial services company) in the battery room pulled the wrong isolator.

    The entire datacentre went clunk. Quite odd, a silent datacentre without even the fans running, I'll tell you.

    Didn't take long to get most things working again, but it took days for a few systems to get their data integrations synced.

    1. Cpt Blue Bear

      Re: My shop a long time ago had a rather large outage

      A mate worked for a (read only) hosting centre in this one horse town. One morning the sparkies come in to test UPS batteries. They take them offline and test each and all is good. Then they decide to test the failover. Only issue is they didn't put the UPSs back online first...

  14. Anonymous Coward
    Anonymous Coward

    You would think someone would create something small that plugs into a USB port and displays the host name. You would set it with some kind of key for security, so only you could see your own host names. Then you don't have to worry about labels coming off or being incorrect.

    1. MonkeyCee Silver badge

      Random USB

      If you stick a random USB device into a prod server under my care, there is a roll of carpet with your name on it.

      1. Anonymous Coward
        Anonymous Coward

        Re: Random USB

        Didn't think of that. Good point.

        1. Doctor Syntax Silver badge

          Re: Random USB

          "Didn't think of that. Good point."

          However, an LCD panel built into the back and front of each server could display the name.

          1. keith_w

            Re: Random USB

            Isn't that what Brother P-Touch or variants was designed for?

          2. Anonymous Coward
            Anonymous Coward

            Re: Random USB

            @Doctor Syntax

            I did think that but then would you want to advertise the host name? I suppose you could give it a code name like "Sitting Bull".

            1. eldakka Silver badge

              Re: Random USB

              > I did think that but then would you want to advertise the host name?

              If you are inside the server room already, such that you can see the hostname on a server's little LCD info screen, then you already have physical access to the hardware, so that type of 'obscurity' - hiding hostnames - probably no longer matters.

          3. Smoking Man

            Re: Random USB

            Decades ago, when I worked in tech presales for a somewhat bigger system supplier, I learned in a discussion with one of our director folks that an LCD display on the front panel _and_ on the back panel of a system sold for more than a million £/$/€ was way too expensive and therefore would never happen.

            I had thought of, and described, the same sort of LCD display that was used in those days in "our" laser printers...

      2. Doctor Syntax Silver badge

        Re: Random USB

        "a roll of carpet with your name on it"

        Or, given the present context, somebody's name on it.

  15. Anonymous Coward
    Anonymous Coward

    The traders were out to lunch.

    Many,many moons ago I was sent out to the branch office in Milan to shuffle the NFS servers along to the end of the row. It should have been a great gig. The weather fantastic, nice hotel - everything going for it.

    So the Sun Solaris NFS servers were a pair, primary and secondary. Pretty std. stuff. I shut down the secondary to boot prom and powered it off. This model of Sun wouldn't power off all the way down and had no UID lights. I proceeded to sticker it up with the little red, yellow, blue, green stickers you got from WHSmith on the cables and ports, and disassembled it onto the cart.

    You can guess where this is going... I received a system email on my terminal from head office complaining that the Milan NFS servers were both offline. Muggins here had powered down the secondary and pulled power from the primary NFS server for a fixed income trading desk at a global investment bank. I was looking at it in bits on a cart.

    Cue a quick panicked phone call to HO to confirm my disbelief; they laughed at me roundly and told me to fix it double quick. Powered up the secondary NFS VERY quickly and went up to see the traders to grovel for my career. It took me a good 45 mins to make this apology, as every last one of them was out to lunch, this being Italy in summer. Not a single one of the dozen stations was in use and they had barely even noticed. Italian Sun workstations being even more laid back than their users....

    In the end I reassembled the primary even faster in the correct location, noticing that some marvellous chap had swapped the CPU units on the primary and secondary servers without changing the labels. I then allowed myself to go for a cigarette, coffee and a nervous breakdown. Still love Milan though.

  16. Anonymous Coward
    Anonymous Coward

    The northern hemisphere is still at work...

    ... here, at least until August... our HR still believes in '60s-'70s-style holidays, everybody on holiday the same weeks of August... when the site shuts down. Yes, many years ago this was mostly a manufacturing site, no longer today, but who cares? Traditions, traditions! As long as they fit HR holiday schedules, of course...

    1. Anonymous Coward
      Anonymous Coward

      Re: The northern hemisphere is still at work...

      The UK 1976 heatwave was very, very long. At the time I was cooped up in a succession of customers' air conditioned windowless computer suites solving system problems. Every day I would walk out into the relative cool of evening - and be back in again early next morning.

      Finally all the problems were resolved. Rang my boss to tell him I was taking a week's holiday. Drove down to Cornwall with the windows wide open to get some breeze. Arrived at Newquay and booked into a hotel with a room having a grand view of the setting sun on the placid sea.

      Next morning at 6am I opened the curtains - to see lashing squalls sweeping in all the way from the horizon over a grey sea. The heatwave had broken.

      The government had finally decided to appoint a minister, Dennis Howell, as "Minister for Drought" a few days earlier. They soon changed his title to "Minister for Floods".

  17. Stuart Castle

    A few years ago, when I first started work after my degree.. We didn't have a server room as such, more a bench in my office with the PDC, one BDC, various storage servers and our web server. I was working on my own one day, so I plugged in a portable CD player with the idea of listening to it while working. There were no spare sockets around my own desk (my PC took the only socket I had), and I'd been told my boss's desk (which was just behind mine) was out of bounds.

    All of a sudden, after the first CD finished, the power went and the UPS went crazy. I worked out that me plugging my CD player in had caused the circuit breaker to trip (luckily, it was in the same room), and I did manage to reset the breaker and power everything up before too much damage was done. I came clean to my boss. Luckily, while he wasn't happy about it, he admitted he'd been told that the circuit we'd plugged all the servers into was at its limit, so it wasn't entirely my fault..

    The other time I am thinking of needs some explanation. I work for a university managing computer labs. Most of them have 30 computers in, so, at the time, each lab tended to be on its own network switch. We had one switch that had been showing an error light for a few days, and our network bods had asked us to reboot it when the lab associated with it was unused. Unfortunately, the lab was, and still is, used from 9am to 9pm nearly every day, and we finish work at 5. For security reasons, the part time staff don't have access to the patch rooms.

    So, one day, I was in the patch room. I don't have remote access to the switches, so I had to go to the room and physically switch it off and on again.. I couldn't reach the power switch, so I traced the power cord to what I thought was the right socket and pulled the plug. On the wrong switch. A switch through which 30 students were trying to log in. I quickly plugged it back in again, and luckily those old 3com switches powered up a lot more quickly than the ciscos we currently use.

  18. Chairman of the Bored Silver badge

    Label, label, label?

    Your $30 label maker is your friend, but ensure that you Trust But Verify.

    War story- worked in lab with very good electrical lock out/tag out discipline. All of the 480VAC boxes that had more than one source of input power were clearly labeled. Except of course for the one I reached my hand into. In the aftermath I had a shattered forearm, a heart beating faster than a hummingbird on meth, and a lot of DNA evidence spattered everywhere.

    I got lazy and didn't bother to use meter or chicken stick to check the box. Just because a label {is | is not} present...

    1. imanidiot Silver badge

      Re: Label, label, label?

      The $30 ones work for flat surfaces, not in direct sunlight, in relatively constant temperatures. For anything else (or when business critical) add a zero and get the real stuff, especially for cable labels. The pro printed labels really make all the difference. The $30 cheapo labels will have disappeared from the cable after a year and ended up in a pile on the floor (or in the cleaner's vacuum). The pro ones last the lifetime of the device.

      1. J. Cook Silver badge

        Re: Label, label, label?

        Plus many; the Brother P-Touch labels have that nasty tendency to detach themselves from where you put them when sitting in a data center environment after about 6 months.

        When we were using the Dell Poweredge servers, I generally set the LCD display on the front to the host name. There is also an indicator light that is triggered by a button on front and back that blinks for the same reasons mentioned all over these comments; locate the machine in the front by the hostname, push the ID button, walk around back and look for the blinking LED. The Cisco UCS B chassis, blades, and C-series servers have similar.

  19. IHateWearingATie

    Was told this story by a work colleague...

    He was working in Germany for a telco and they were upgrading the redundant systems for a couple of data centres (electrics, hardware, software etc). The demo was set for the CIO to come and see it all working and properly test it by killing grid power supply to the data centre by throwing the breakers, simulating a power outage.

    The day came, and with great ceremony the CIO pulled the breakers; grid power ceased and lo and behold everything worked as it needed to. Generators generated, UPSs hummed, servers shut down in a controlled and graceful manner. Seeing the success, the CIO then said (in German of course), "Excellent, well done. Well, better get the power back then...." and threw the breakers back before anyone could stop him, skipping several hours of carefully prepared procedure to move back to normal operation in a single swoop. The surge of current took out enough power hardware (generators, switches, UPS etc) that it was days and days before the data centre was back up and running.

    Ooops.

    1. Anonymous Coward
      Anonymous Coward

      An automated tram system in the USA was a shuttle service on a single track. At the end of each run the conductor turned a key in a wall panel to reverse the power for the other direction. It was a rotary switch - with a central "neutral" position.

      Switching through "neutral" without sufficient pause caused - IIRC - welded contacts elsewhere.

      Eventually the problem was solved by having two key switches - each having only ON/OFF positions. The key could only be removed in the OFF position - and the conductor only had one key. The switches were widely spaced on the panel - so that moving the key between them took long enough to give the required delay.

  20. jms222

    shutdown silliness

    > I used the shutdown -h

    Very easy. Don't use the shutdown command.

    If you want to stop the machine use the halt command (perhaps with "-p")

    If you want to reboot use the reboot command (with "now")

    Documentation suggesting use of the shutdown command is a relatively modern phenomenon.
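The distinction jms222 is drawing can be summarised in a harmless dry-run helper. It only *describes* what each command classically does (per the discussion in this thread; on modern systemd Linux all of these route through systemctl, so the old distinctions have blurred), so it is safe to run anywhere:

```shell
#!/bin/sh
# Dry-run cheat sheet for the halt/reboot/shutdown debate above.
# Descriptions reflect classic Unix/BSD behaviour; details vary by OS.
describe() {
    case "$*" in
        "shutdown -h now") echo "warn users, run shutdown scripts, then halt";;
        "shutdown -r now") echo "warn users, run shutdown scripts, then reboot";;
        "halt -p")         echo "stop the OS and power off, no niceties";;
        "reboot")          echo "restart immediately, no niceties";;
        *)                 echo "not in this table";;
    esac
}

d1=$(describe shutdown -h now)
d2=$(describe reboot)
echo "$d1"
echo "$d2"
```

Nothing here actually touches the machine, which is rather the point when you are five thousand miles from the power button.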

    1. Nick Kew Silver badge

      Re: shutdown silliness

      Relatively modern?

      It was in the mid-'90s I first read TFM recommending shutdown -[r|h] over reboot or halt.

      1. Smoking Man

        Re: shutdown silliness

        Ahh, in the good old days [tm] on "my" systems that was "CTRL-B RS" or "CTRL-B TC".

        Nowadays it's a "echo b > /proc/sysrq-trigger"

        Or as I like to call the procedure: "take the system to a known state."

    2. Doctor Syntax Silver badge

      Re: shutdown silliness

      "Don't use the shutdown command."

      I suppose it doesn't make as much difference as it used to do, but when shutdown was first introduced it was a script. It enabled us to build in various extras, such as inhibiting new database connections during the grace period. Good luck doing that with halt.
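A site-local wrapper of the kind Doctor Syntax describes might have looked something like this. This is a sketch only: the flag-file convention, path, and grace period are invented for illustration, and the final halt is left as an echo:

```shell
#!/bin/sh
# Sketch of a script-era shutdown wrapper: raise a flag that the
# database front end checks before accepting new connections, wait
# out a grace period, then hand off to the real halt.
FLAG=/tmp/no_new_connections
GRACE=1                        # seconds; a real site would use minutes

touch "$FLAG"                  # apps refuse new logins while this exists
echo "draining: no new connections for ${GRACE}s"
sleep "$GRACE"
echo "would now exec: /sbin/halt"   # left as an echo in this sketch
rm -f "$FLAG"
drained=yes
```

The application side only has to test for the flag file on each new connection attempt, which is the sort of extra a plain halt can never give you.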

      1. onefang Silver badge

        Re: shutdown silliness

        "It enabled us to build in various extras such as inhibit new database connections during the grace period. Good luck doing that with halt."

        That's the sort of thing the previously mentioned molly-guard is for, either available in other flavours of OS, or you could roll your own using similar principles.

        1. Doctor Syntax Silver badge

          Re: shutdown silliness

          "That's the sort of thing the previously mentioned molly-guard is for"

          If I were still working I'd look at that.

      2. tfewster Silver badge
        Facepalm

        Re: shutdown silliness

        It used to be the case (HP-UX?) that `shutdown` ran the shutdown scripts and then issued `reboot`, whereas `reboot` or `halt` didn't bother with such niceties.

        `shutdown` also prompts you with "are you sure?". Which would have been nice when I typed `last |grep reboot` but, for some inexplicable reason, didn't actually type the "grep" part in.

        1. Doctor Syntax Silver badge

          Re: shutdown silliness

          "HP-UX?"

          That was the one.

        2. jake Silver badge

          Re: shutdown silliness

          The earliest version of shutdown that I can find at the moment has a man page creation date of April 3 1983, in 4.2BSD ... The top line of the actual man page says "SHUTDOWN 8 "1 April 1981"" ... Seems to me that it came over from UNIX Version 6, and made it into 4.1BSD. Very early 1980s, anyway. It's already a C program ...

          #ifndef lint
          static char *sccsid = "@(#)shutdown.c 4.19 (Berkeley) 83/06/17";
          #endif

          #include <stdio.h>
          #include <ctype.h>
          #include <signal.h>
          #include <utmp.h>
          #include <sys/time.h>
          #include <sys/resource.h>
          #include <sys/types.h>

          /*
           * /etc/shutdown when [messages]
           *
           * allow super users to tell users and remind users
           * of iminent shutdown of unix
           * and shut it down automatically
           * and even reboot or halt the machine if they desire
           */

  21. Iggle Piggle

    I did share a room with a couple of guys and a server a long while ago. This is back in the day when the power button was a real push button switch. Well my colleague went over to the server and switched on the monitor, did some work and then pushed the button to turn the monitor power off. Only he didn't, he pushed the server power button but realised his mistake before letting go.

    His job then became simply to stand there holding the button in while we went round to all the users of the server and got them to log out. Then we safely shut the server down via the keyboard, and only then could he let go of the button.

  22. steviebuk Silver badge

    I thought I was being....

    ...super secure by changing the security settings on Equitrac to purge all print jobs when a person logs off an MFD. I did this without a Request for Change.

    Then I started seeing calls come in. "My print jobs are only half printing. What's going on?" by various people.

    It then dawned on me. When people sent print jobs, because they were impatient, they'd log in with their card, start the print, then log off mid-print. Or the MFD would eventually time out because their print job was too long, and just log them off. At which point it would purge the rest of their job.

    Oops.

    I changed the setting back and closed all the open calls before anyone in management noticed.

  23. Anonymous Coward
    Anonymous Coward

    Beware of Windows clusters

    Junior colleague in a global trading bank was tasked to patch a Windows server cluster. Howls of outrage erupted when the concluding reboot revealed that the patched node was active, and not the standby he had assumed. Chastened by the deluge of constructive criticism, he promised to take more care in future.

    A resolution he put into practice an hour later by carefully patching the node he now knew to be passive. If only the cluster hadn't responded to his first intervention by dutifully failing over. More howls ensued and our SLA was well and truly shot.

  24. e-wan

    Set the time on a NetWare server

    In a university computer studies department, I had championed using Novell Netware as a departmental file & print service. The head of department also got a copy of Oracle running as an NLM on Netware, and was teaching in a classroom next door to the computer studies office.

    A student came in to complain that all the machines in the labs were 2 or 3 minutes off the right time, as their clocks were synced with the NetWare environment. I flicked the NetWare console screen on, and sure enough noticed that the server was running slow. No problem, thought I, and typed in (say) SET TIME 3:15 <enter>

    Immediately, several hundred lines blurred past with the server beeping furiously, as each open session was unceremoniously booted off. We had time restrictions set up, you see, so the labs were only open from 8am to 9pm... and I had just set the clock to the middle of the night, so every logged-in user was kicked off, and I think the Oracle NLM came crashing to a halt (or at least all of its users did). Whereupon the door flew open and the HoD (a big man) came storming in, red faced, wanting to know what the hell had happened.

    Next time, make sure you know the difference between 3:15 and 15:15 ...
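The trap here is that the console read 3:15 as 3:15 a.m. One defensive habit (sketched below as a plain shell function - not anything NetWare actually offered) is to refuse any hour that could be read two ways and insist on an unambiguous two-digit 24-hour form:

```shell
#!/bin/sh
# Refuse ambiguous clock times: accept only two-digit 24-hour hours,
# and for anything else print both possible readings back at the
# operator. A sketch of a habit, not a real SET TIME replacement.
check_time() {
    hour=${1%%:*}
    mins=${1#*:}
    case "$hour" in
        0?|1?|2?) echo "ok: $1 (unambiguous 24-hour form)";;
        *)        echo "refused: '$1' is ambiguous; write $(printf '%02d' "$hour"):$mins or $((hour + 12)):$mins";;
    esac
}

r1=$(check_time 3:15)
r2=$(check_time 15:15)
echo "$r1"
echo "$r2"
```

A single-digit hour gets bounced with both candidate times spelled out, which is exactly the moment to stop and think.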

    1. Iggle Piggle
      Facepalm

      Re: Beware of Windows clusters

      Your comment reminds me of another instance and I think you might get to the punch line before the end. We installed a system for a client where two windows servers sat side by side. One was a fail over for the other and, the fail over was configured with some wonderful software that mirrored important data to the fail over machine's hard drive so that if the first machine stopped responding, the fail over would take over with the latest data.

      So imagine our horror when the client called to say both machines had failed. We looked carefully at the first machine that had died, and it turned out that a system log file had grown to the point where Windows just seized up. Well, someone had decided that the Windows log files should be mirrored, so naturally the failover server started up and discovered it too had a log file that was full.

  25. chivo243 Silver badge
    Pint

    Wrong serial cable, wrong ups

    Whhrrrr, silence, darkness!

    That's the worst I will admit to ;-}

  26. rpjs

    Sounds like local government to me

    Back in the 90s I once deleted three years of financial records for the county purchasing dept from a production Oracle DB thinking I was logged on to dev. The relevant manager didn’t really care because they were the oldest records and only kept for audit purposes which was highly unlikely to happen.

    Nevertheless we all agreed the Right Thing To Do would be to restore from the backup tapes. But the most recent backup wouldn’t restore, and investigations showed that no backups had been verified for some months. The problem was found and fixed and in the end I was not in trouble as I had inadvertently prevented potentially much more serious data loss if something important had been fubarred.

    I don’t think we ever got the deleted data restored cos the manager was happy that there were some older verified backup tapes which could be mounted if audit ever needed them (they never did).

  27. Anonymous Coward
    Anonymous Coward

    We'll send our best engineer....

    In the early 2000's I was earning my stripes as a filed engineer in a data storage company. (Later they became a Data Management Company)

    I was on-call and had to attend a disk system at a large bank, which made me nervous. They actually had enough kit to have an onsite engineer during the week, but at weekends and nights the on-call engineer had to attend site. The disk system had 3 power supplies and 1 was faulty. The power supplies were wired in a way that even a faulty PS could still light its amber fault light. Armed with instructions and a new PS, I attended site and pulled the faulty PS with the amber light. The next thing I heard sounded like a 747 turning off its engines. As it turns out, there was a software bug that would illuminate the fault light on the wrong PS. Next time I would check fans & airflow, but the system needed 2 out of 3 supplies to stay up.

    About 20 minutes later (it took that long to boot) the disk subsystem was up again. After another 30 minutes of confusion and apologising to the operators, the applications were up again. I took one for the engineering team.

    Next time - same customer - I had an issue with a large robotic tape library. Tapes stuck, drives offline or "boxed"... I attended site again (on call) and as a first step I tried to unmount the tapes manually via the Solaris-based library management software (ACSLS). That didn't work either. So I rebooted the Solaris server - which then took down ALL the libraries, not just the one that had trouble.

    Turns out that one library controller (or management unit) had been switched from BNC to a brand new TCP/IP card - which would hang every other week due to a software bug. The bug was already known, but the onsite engineer liked to collect a call-out fee once in a while. And back then it was quite a bit of $$$. So now I copped it for my colleague.

    The customer already thought I was a complete idiot. Every time I show up they have a major incident.

    It was about that time I realised that it's less important what you know than how you sell yourself.

    My colleague - who didn't warn me about the TCP/IP card - liked to talk himself up. He once spoke to a customer on the phone. He told the customer "we will send our best engineer" - he then closed his laptop, took his tool bag and visited that customer.

    Everybody else thought the guy was a fraud, but our boss thought he was the bee's knees.

    1. tfewster Silver badge

      Re: We'll send our best engineer....

      A "filed" engineer? One who's been smoothed off?

      1. onefang Silver badge

        Re: We'll send our best engineer....

        Or has been filed away in the small circular filing cabinet by the HR derpartment?

  28. Radelix
    Mushroom

    reboot of terror

    I was working for an MSP in Orange County, CA. One of my responsibilities was to configure new equipment being shipped to a customer's retail locations. The engineering team made this stupid easy, in so far as: open ticket in system > click config > copy config > paste to ssh terminal > write config > reboot and verify that it is communicating with the headend.

    I was configuring something in the area of 10 boxes and I was starting to close in on the shipping window. I had them all stacked, and I had all the boxes open in SecureCRT with the headend as the first tab. Pasted and wrote all the configs with no issues. Started issuing reboots to test and, sure as shit, I rebooted the headend.

    Normally, for our customers, this is not an issue because they usually have redundant headends, not this one. Open a console, start pinging public IP which is whitelisted to respond from our site

    no reply

    no reply

    no reply

    no reply

    ....shit

    Go talk to the engineer who is packing up to leave, fess up to what I did, the broken man puts his stuff down, opens a console and picks up the phone. I ask if I need to stay and he says no.

    The next day the engineer is absent and I look up the customer: our ticketing system, at somewhere around 3 AM, had autogenerated and autoclosed somewhere north of 3000 tickets. Come to find out from another engineer that the headend in question had been running with bad NVRAM, so the reboot wiped the config and there had been no IOS to load. The customer's engineer had discovered the headend sitting at a ROMMON prompt. A new IOS and new base config later, and the customer was back up.

    In the end, I was told everyone gets to do that once.

  29. PeterO

    I'm sure I'm not the only one to have pulled the disk in the failed slot out of the perfectly working raid array :-) "Why are they both beeping at me now ?"

    1. onefang Silver badge

      I had a machine sitting on my desk with two hot swappable drive caddies. I don't recall why, but I do recall at one point pulling out the one I thought was not currently in use, but instead pulling out the one the OS was operating from. Luckily no damage, and nothing critical was running at the time.

  30. Will Godfrey Silver badge
    Unhappy

    I've done the exact opposite

    Shut down the wrong machine then {cough} hot swapped {cough} a drive in a running machine. It didn't go well.

  31. Pliny the Whiner

    Had I been Rick's boss ...

    ... I would have laughed my ass off, too. Then asked him if he needed to run home to change his underwear. (If he had fresh underwear in his desk drawer, it would sort of make you wonder if this type of thing happened often.)

    One thing's for certain: I'll bet Rick never did that again.

  32. Marshalltown

    Power? What power?

    About 20 years ago I was working on an archaeological project in Israel, where the director had the habit of buying and shipping most of the computer hardware over from the US at the start of the field season. Actually, every member of the crew carried some expedition-owned gear as part of the baggage allotment. So on the first day, our job as lab staff was to set up the computers and LAN. So, as noted: hardware purchased in the US, staff from the US, in Israel. Israel uses a 240 volt power standard similar to the UK and other European countries.

    Setting up the gear, we had several "work stations" - actually PCs with MS Windows (maybe even Windows 98) using the basic networking tools that came with Windows. One machine was much more powerful since it handled the GIS system. We installed and powered up each system safely until we got to the GIS machine, with a Pentium processor, loads of RAM and a huge hard drive (for the time). I leaned over to plug in the big guy and there was a loudish "POP" from inside the box, accompanied by that odour of burnt electrics. On that one machine, the biggest and most critical, we had forgotten to throw the PSU's switch over for 240 volt power. Happily it only took a trip to Haifa to find a replacement power supply.

  33. Hazmoid
    FAIL

    Speaking of Financial institutions

    Many years ago we were running a very small options desk and running off a single server. When the time came to upgrade, we had to do it overnight, and by 4.30 am (after starting at 6am the previous day) I was more than a little groggy and decided that it was time to call it a night ( and crashed out on one of the couches in the board room). Next day with all the stress, I forgot to copy the historical data across to the new server. A few days later, one of the other admins asked if they could re-use the server and without thinking I said yes. About a day later I was asked where all the historical data was :(

    Unfortunately they needed it for a court case and the only backups were done on DAT tapes using a reader in the old server. So after a Stat Dec saying that the data had been irretrievably lost (We spent ~$5k on a data recovery service but were unable to recover the data) I learnt a couple of things.

    1. document the upgrade process and test run it before proceeding.

    2. make sure you have a checklist for data checking and completion.

    3. have a roll back position (much easier with VMware)

    4. retain the old server for at least a month after the process.

    5. never do a job like this when sleep deprived

  34. old_iron

    Follow the process...

    Facilities reported a major outage at "a leading petro-chemical firm". Ops chaps duly grabbed the playbook to understand and execute the tidy shut down of some systems and the transfer of some other workloads to the secondary site (More than just DR...)

    Process duly executed they sat back, satisfied until the phone rang...

    They had taken the healthy facility down, but in a very professional and orderly manner.

    The UPS in the other DC had just run out...

  35. olemd

    Tried to log in to a virtual machine to shut it down. Didn't quite catch the fact that the login failed, one "sudo shutdown -h now" later and my workstation with everything I was working on went black.

    Even molly-guard doesn't help when you're local. Fortunately it was just an annoyance rather than an actual problem. Obviously the VM was set to auto boot so I had to shut it down after the reboot.

    1. onefang Silver badge

      "Even molly-guard doesn't help when you're local."

      I've only recently started using molly-guard, and I haven't delved deep into its configuration yet, just a quick test that told me "yep, when you try to shut down via ssh, it asks you 'which box do you think this is'". I do believe you can add your own checking scripts though. Say, for example, "Test if this machine is currently running any VMs, then ask - Which VM do you think you are trying to shutdown?", which might have helped you. If you answer "Eddy", being the name of one of the VMs, but the local machine is called "Bob", then Bob's still your uncle, Molly bitches, you slap your forehead, and try telling Eddy to go away instead.
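For the curious: molly-guard runs extra checks from scripts dropped into /etc/molly-guard/run.d/, and any check that exits non-zero aborts the shutdown. A VM-aware check along the lines onefang suggests might look like this sketch, where `list_vms` is a stand-in for a real listing command (e.g. `virsh list --name`):

```shell
#!/bin/sh
# Hypothetical molly-guard check: if VMs are running on this host,
# make the operator name the machine they mean to shut down, and
# refuse (non-zero exit) unless the answer names this host.
list_vms() { echo "Eddy"; }    # stand-in for e.g. `virsh list --name`

vm_guard() {
    running=$(list_vms)
    [ -z "$running" ] && return 0      # no VMs, nothing to guard
    echo "VMs running here: $running"
    printf 'Which machine are you shutting down? '
    read -r target
    [ "$target" = "$(uname -n)" ] && return 0
    echo "'$target' is not this host ($(uname -n)); refusing."
    return 1
}

# Demo: answering with the VM's name, as in onefang's scenario,
# blocks the shutdown instead of taking down Bob.
guard_out=$(echo Eddy | vm_guard) || blocked=yes
echo "${blocked:-no}"
```

If the answer is "Eddy" and the host is "Bob", the check refuses, Molly bitches, and Bob stays up.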
