A sysadmin's top ten tales of woe

Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good “…and then it all went horribly sideways” yarn, and we all have more than one. The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability. From common …

COMMENTS

This topic is closed for new posts.

  1. NoneSuch Silver badge
    Boffin

    You missed a big one.

    Inconsistent times set on the company servers. Nothing is worse than trying to fault-find a sequence of errors across several servers and/or workstations when the time is not set consistently. It's surprising in this day and age just how common this is.
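
    For what it's worth, a minimal sketch of the usual cure, assuming chrony on Linux boxes (the internal server names here are placeholders, not anything from the story):

      # /etc/chrony.conf - point every server and workstation at the same internal time source
      server ntp1.example.internal iburst
      server ntp2.example.internal iburst
      # step the clock if it is badly wrong at startup, slew gently thereafter
      makestep 1.0 3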

  2. Anonymous Coward
    FAIL

    backup the backup, eh what?

    In the early 90s I worked for a company that had ported a software application from proprietary hardware to a Unix platform. As part of this, a backup solution had been developed for our customers: "the Unix guy" would go on site and install the backup software. To test it, he would back up the folder containing the backup software. Yup, you guessed it: three years after "the Unix guy" had left, a customer had a disk failure, and when we came to restore their data it turned out they had only ever been backing up the backup software.

    1. Mark Allen
      Pint

      backup to the backup folder

      That reminds me of the client who would do their own zip file backups of data folders, and then store that zip file in the original data folder. So the next backup would include the previous backup, and so on. These (multiple) backup files grew exponentially until the hard disk jammed full...

      And how many people have found their clients backing up their Sage data to the same folder as Sage was installed in? And never taking those backups off site...

      Users are such fun...

      1. Anonymous Coward
        Coat

        users are such fun?

        Well yes, they are, but my biggest single lesson as a sys admin was, when stuff broke, not to ask the users what they had been up to until I had thoroughly examined what I had been up to.

        Send not to know what the users did:

        The person most likely to have broken the system is you!

      2. chr0m4t1c

        As was said elsewhere

        It's not just the users.

        I remember discovering that the default configuration on some NetWare backup software we had was to back up the server structure. Nothing else. So all you could have recovered were the directories, filenames and permissions, not the actual data in the files.

        Fortunately it was just an irritation, as I became suspicious that the first night's backup of what should have been the entire server took around 4 minutes when I had expected about 3 hours based on the amount of data and tape speed.

        It wasn't even a schoolboy error on my part: I'd selected "Full Backup" on the scheduler, and it was only when you drilled into the options for what to include in a full backup that you could see the default was to behave like this. Epic fail from the software company there, I think.

    2. Anonymous Coward
      Anonymous Coward

      You mean the flashy red light on the backup was important?

      Client has multi-million dollar sales database being backed up by secretary/receptionist. Head secretary has problem with sales database. HS decides reboots fix PCs so the best thing to do is delete the database because it will rebuild itself. Is surprised when it doesn't. Head pilot fish takes call, says "no problem, that's what your back-up is for." SR puts in the tape and it won't read. Head Pilot fish asks "Is the read light on the tape backup blinking?" SR says "why yes it is." Head Pilot fish asks "How long has that been blinking?" SR answers "Not sure, but at least 3 weeks. Why is it important?"

      Fortunately for both HS and SR, Head Pilot fish (not me) was damned good and able to reconstruct their entire database from the Novell 3.12 recovery area (I have since forgotten their name for the very handy feature).

    3. informavorette
      Facepalm

      Not only the users fall for such strategies

      At my old job, we were lucky to have a really good admin. He did everything as he should, running a separate backup server for each of the primary servers. When virtualization came around, it went really smoothly. Everything continued as it always had, just on virtual servers, completely transparent to us happy users. He may have been falling behind on documentation, but the bosses understood that, as hard as he worked to keep things running, there wasn't time for proper documentation.

      Eventually, we got the budget for a second admin. His first task was to get to know the systems and document them as he went. He found some interesting system landscape choices, including the fact that most of the backup virtual servers ran on the same physical machine as their corresponding virtual primary server.

  3. raving angry loony

    another one

    My horror story is trying to find work when the only people you can talk to are H.R. people with zero understanding of system administration, job "requirement" sheets that list a plethora of buzzwords that will be obsolete in 90 days, and absolutely no understanding that one of the biggest jobs a sysadmin has is "learning new stuff ALL the time".

    At least I'm on the outside when their data centres, staffed by buzzword-compliant know-nothings, burst into flames. It's happened to two locally already, and the lessons aren't being learned.

    1. Anonymous Coward
      Meh

      Recruiters and HR: a close match.

      Saw a job that tickled me fancy, so gave it a go. Recruiter comes back that HR says I'd already have to live in that city because moving there means blah blah can't take the risk yadda yadda bad experiences mumble bs mumble mumble.

      Sure. Couple of stains on that reasoning: It wasn't hard to work out just what gpu super shop it was, there aren't too many around in Europe n'mind Amsterdam. Their linkedin account shows fresh hires from Germany and France. And the job --installing the product at customer sites, requiring good unix and troubleshooting, which I definitely have-- has as requirement "extensive travel in Benelux and Europe". But they don't want you unless you already live in Amsterdam. Go on then, pull the other one.

      Sounds like a bit of a disconnect between HR and the operational side of the company. Or maybe they just suck at the "white lies" people skill. Either way, I'd not put much trust in their shares.

      1. raving angry loony

        sounds right

        I keep getting the "you're overqualified" excuse from recruiters and HR. Because of course actually knowing too much or having too much experience might be detrimental to the job? WTF?

        1. David_H
          Thumb Down

          Been there as well

          HR droid got me to fill in the IQ test sheet.

          She came back 10 minutes later looking hot under the collar.

          Some bloke popped his head round the door saying "you scored more than anyone else here ever has, you'll be bored and leave. Goodbye" and left. And the droid escorted me offsite, refusing to enter into conversation.

          FFS, for three times my salary at the time in research, I'd have stuck it out for more than 6 months!

  4. Vic

    Counting the backups...

    I got a call-out to a customer. They'd lost a load of data.

    On arrival at site, I found a dead disk. It was a DeathStar, and had been in continuous operation for a goodly number of years - a good order of magnitude longer than most of them lasted.

    RAID? You've got to be kidding.

    I gave them the bad news, and told them they'd need to restore from backup. Blank looks ensued.

    "Oh", says one girl, "There's a copy of everything on the other machine. We'll be OK".

    But there wasn't. There was a share to the machine that had gone down.

    So I had yet another round of disk recovery, and actually recovered most of their data. But I was fielding calls for weeks that started "we used to have this file on the desktop..."

    Vic.

    1. Ross K Silver badge
      Mushroom

      Ah Deathstars

      with their glass platters... I can still remember that scratch, scratch, scratch noise even though I haven't touched one of those pieces of crap in ten years.

  5. Pete 2 Silver badge

    It's not disaster recovery unless you know it works

    I was in a meeting a couple of years back when the following dialogue took place:

    Yes, we have a best-practice disaster recovery procedure. We have fully redundant hot-standby servers at a beta site, mirrored disks at the location and two sets of network connections with no common points of failure.

    When did you last perform a failover test?

    Oh we've never tested it

    Why not?

    It might not work.

  6. Andrew Moore
    Facepalm

    The cleaner unplugged it

    I remember a client once screaming and cursing at me over the phone, threatening all manner of abuse, because one of his systems was down and if I did not do something about it right now there would be legal action, I'd be sued for lost revenue etc, etc, etc. So after a 2-hour drive to his office I walked in, looked at the computer, then plugged it back in and powered it up. I then just turned and stared at the client.

    I then sent his boss an invoice for a half day of emergency onsite maintenance, with expenses, with "Problem" filled in as "System unplugged" and "Resolution" filled in as "System plugged back in".

    1. PatientOne

      This isn't so uncommon

      Had this a couple of times when I was working for a company down Swindon way.

      User calls, makes demands and threats, I drop what I'm doing, head on down and find the computer was unplugged. The user, of course, was never around when I got there. So I plug the computer back in and leave it plugged in. The last time I went down, I checked the computer worked, then unplugged it again and returned to the office to report on the ticket. Note in the comments box: No fault found. Computer left in the state it was found in. I didn't mention that the problem seemed to occur at about 8:30, when the canteen opened...

      And as to cleaners: yes, had that too. The server room was kept locked: the main door was key-coded with access audited. A rear escape door was secure from the outside: you had to break a glass tube to unlock it. The side door was kept locked with 'no entry' markers on it, and the office it was accessed from was also kept locked. So the cleaner went through the office and in through the side door using the master key. We only found out this was happening when she forgot to plug the server back in when she left one night: the UPS had been keeping it running the rest of the time.

      But my favorite has to be the offsite server farm: two sites mirrored, just in case, and in separate counties. Then the substation supplying power to one went down, and both sites went dark. When they investigated why, they found both sites were supplied by the same substation. Apparently no one had thought to check that possibility...

  7. Anonymous Coward
    Facepalm

    Once experienced the classic "single tape" cock-up

    "Yes we have backups every night, without fail for the last 18 months."

    "Oh great, should save some time. We need to test restores for the auditors. Who changes the tapes and what's the offsite storage like, fire-proof that sort of thing?"

    "Change the tape? No need, it's big enough, well it's never run out yet."

    "No, you only have 150MB of space on a tape and the server is holding 40MB of data, so that's only 5 days worth!" ( 40MB on a server, that shows you how long ago this was! )

    "Sorry?!"

    "There is no way that tape could do more than 3 nights of backups before it fills up. You've most likely overwritten the same tape hundreds of times, so you have no historical data available."

    "Sorry?!"

    ( You will be! )

    A quick check and, yep, they'd run 450 backups onto the same tape, over and over and over and over... the software didn't bother to check whether the tape already had data on it, it just rewound and overwrote it.

    Needless to say the auditors were not in the least bit impressed and lots of shouting ensued at management level and plenty of overtime was paid out to IT to make sure it did not happen again!

  8. praxis22
    Stop

    the emergency switch

    That big red power button that takes down the whole data center, the one in plain sight, make sure it has a cover. At some point somebody will trip, or shoot out a stabilising hand at exactly the wrong location. You do not want to be anywhere near that person when it happens.

    1. KroSha
      Mushroom

      Similar one

      The emergency cutoff switch is right next to the exit door. The exit door is access controlled. The new guy did not realise that you have to swipe your card to get out, not just press a button. 500 servers go down in an instant.

      1. Mayhem

        Or the cabling gets a little loose

        The unlabelled emergency cutoff was located on a false wall near the door, and its connections had shaken loose over the years from the vibration of the door closing. Walked in, turned on the lights, turned off a room full of servers. Very good heart attack test.

        The only vindication was the astonished expressions on the faces of the highly sceptical building electricians when I did it again in front of them, after dragging them in to explain to me what had gone wrong with the lighting circuits.

    2. Anonymous Coward
      Anonymous Coward

      EPO

      That happened to MCI where I live. To make things worse, this site was where they do the peering for AT&T and Sprint. They lost a few routers, and DNS took a week to fix. When I say they lost a few routers, I mean they didn't turn back on. Some of the T1 cards were fried too. Then there were the routers and switches that lost their configs. Five hours offline. By the way, this site controlled the zone for northern California.

    3. Anonymous Coward
      Mushroom

      Or the flirty junior playing "what does this switch do?"

      CLUNK !

      COS for high-integrity 24v DC C&I supply.

      Screaming Yodalarms.

      Primary DC was offline, and the X-hour duration 'backup' battery hadn't yet been installed.

      30 seconds later, temporary 'buffer' battery is fully discharged.

      Immediately followed by a resounding series of BANGs, as every breaker in 6 multi-MW switchboards trips-out.

      Cue total darkness and ominous silence, broken only by watchkeeper's swabbie-level cursing.

      How to take out a nuke-sub from the inside.

    4. Anonymous Coward
      Anonymous Coward

      Re: the emergency switch

      Seen that happen. Except it wasn't a server room, it was a large ocean going vessel, and the button in question was the Man Overboard Button.

      Mind you, at least nobody got hurt.

    5. Matt K

      EPO switches

      ...another on the subject of EPO switches.

      When told that your switch should be covered to stop it being accidentally triggered, don't forget about it, then call your local electrician in a panic when reminded, and watch as said electrician sticks a drill bit into your PDU in the middle of your peak processing window...

      Change management: more than just the logical stuff.

    6. Peter Simpson 1
      Devil

      It's usually the boss

      or his relative.

      //you can't even yell at them

    7. Alain

      Re: the emergency switch

      A variant of this one: in a very large computerised hospital, due to an electrical fire in a transformer room (it was mid-August of course... Murphy's law at work), we had been doing several complete system shutdowns and startups (6 Unix clusters, 20+ Oracle DBs, 50+ blade servers) over a few days. This was required by frequent and mostly announced (but at very short notice) blackouts due to problems with the generator trucks they had parked on the street next to the building. At some point, totally exhausted after three almost sleepless nights, we were doing yet another system start-up, having just received confirmation that power was back and "hopefully" stable. A guy taking care of just a couple of not-too-important Windows servers came into the room to boot up his own boxes. He almost never came in, doing his work remotely. When he was finished, he went out and... switched off the lights in the room. None of us instantly died of a heart attack, but it was close.

    8. Andy Miller
      Facepalm

      Location, location, location

      Or simply put the emergency-off button next to the light switch in a server room that is usually run lights-out. So groping around to turn the room lights on turns all the little lights off....

  9. Inachu
    Pint

    Demanding employees and their email.

    One employee demanded to have access to his email 24/7 and wanted to have company email forwarded to his home ISP email.

    Well, sooner or later his home ISP inbox became full, and it not only sent a message saying the inbox was full but also copied the original message back to the sender at the company.

    So in effect it filled up the email server and it crashed and had to be rebuilt.

    1. pixl97

      Poor admins.

      The story here is the crappy email server that commits suicide when the disk is full.

      This is why you don't store queues on your operating system partition. All sysadmins should know this.

      1. Anonymous Coward
        Anonymous Coward

        very poor admins

        The story here is also that someone didn't set mailbox size limits?
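
        For the record, a minimal sketch of the sort of guard rails both replies are pointing at, assuming a Postfix-style MTA (the original server isn't named, so the directives below are Postfix's, not necessarily theirs):

          # /etc/postfix/main.cf
          queue_directory    = /var/spool/postfix   # keep the mail queue off the OS partition
          mailbox_size_limit = 1073741824           # ~1GB cap per local mailbox
          message_size_limit = 26214400             # refuse anything over ~25MB

        Plus a dedicated partition for the spool in /etc/fstab, so a runaway queue fills its own disk rather than the root filesystem:

          /dev/sdb1  /var/spool  ext4  defaults  0  2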

  10. fiddley
    Thumb Up

    Aircon fail

    In the 2006 London summer heatwave our aircon was stressed too much and the lot failed. What did the PHB do, order a repair or a replacement? Nah, he sent us to Somerfield for some turkey-sized tinfoil to line the windows of the server room and "keep the heat out". Cue 45-degree temps and a dead Exchange server the next day. Ha! Served him right; needless to say we didn't rush the restore!

    Last time I went past the building, the foil was still in the windows :)

    1. Anonymous Coward
      FAIL

      @Aircon...

      Or the time when I came back from a weekend off to find two of my servers with fried power supplies .. turns out a DR exercise had happened over the weekend and the ops staff were getting a bit cold in the small machine room, so they might have turned off some of the aircon for a few hours during the exercise (and my servers were furthest away from the cold outflow) ... hmm, now looking at the logs, when did the exercise start and when did the logs get truncated? 8-(

      I suspect it was pure karma, as a few years earlier in the same machine room I turned off one of the aircon blowers because I was getting hypothermia working next to the cold air outlet, and forgot to turn it back on before leaving for the day .. the temperature alarm went off overnight .. whoops (I owned up) ... funnily enough, little plastic covers were screwed over the aircon controls a few days later .. you then had to use a pen through a small hole in the cover to flick them on and off 8-)

  11. Peter Jones 2
    Pirate

    Domain name...

    A small company that relies on e-commerce and e-mail "downsizes" their sysadmin to replace him with a cheaper outsourcing company.

    Three months later, their site and e-mail stop working. Numerous phone calls to the outsourcing company yield nothing. I am called in to troubleshoot a week later. One WHOIS trawl, and I ask "so who is John xxx?" "He was our old sysadmin." "Well, you may want to call and ask him for your domain back."

    The sysadmin had been paying the bills through his limited company, and effectively "owned" the domain. When the renewal came up, it was pointed at a parking site. I'm not sure whether the company bought the domain back, went through arbitration, or found some other solution. But at every company since, I have been interested to see that a lot of sysadmins do this as a form of "insurance", ostensibly because "it's easier to have them contact me."

  12. Woodnag

    Floppy days are here again

    1. The person doing MSBACKUP only put one floppy in, and just pressed ENTER when asked to replace it with the next disk.

    2. New software is purchased. The IT person makes a master backup floppy for the office safe, a working backup floppy, and a working copy floppy for the user. The original floppy goes home with the CEO. Loads of redundancy, multi-site, gold stars all round.

    One day, the user's machine says bee-baaaaar, bee-baaaaar etc: can't read the disk. OK, we'll make you a new one from the working backup floppy. Oops, same problem. Try the master backup floppy. That's duff too. The CEO's copy is brought in; bad also. Of course the problem was the floppy drive, which had now killed all copies of the software...

    1. Anonymous Coward
      Anonymous Coward

      Re: Floppy days are here again

      There is a utility under Unix called 'dd' that allows you to make an image of your floppy; you can then write that image to a fresh floppy when you need it.

      I know this utility has been ported to DOS and I have even made use of it in the past, but for some reason I can't remember the name it goes by under DOS, so I can't give you a link for it. Sorry.
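
      For reference, a minimal sketch of the Unix usage (assuming the first floppy drive shows up as /dev/fd0, as it does on most Linux boxes):

        # read the physical floppy into an image file
        dd if=/dev/fd0 of=floppy.img bs=512
        # later, write that image back out to a fresh, formatted floppy
        dd if=floppy.img of=/dev/fd0 bs=512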

      1. galbak

        useful tool

        DD.EXE is used to create an image file from a floppy bootdisk. Do not confuse it with the Unix "dd" command; it is not quite the same.

        Example: dd a: <filename>.img

        This will create an image file from your bootdisk in drive A:.

        You could also use winimage (not freeware) for this, but remember to save your floppy image as .IMA (not compressed) file.

        WINIMAGE http://www.winimage.com

        DD.EXE http://www.nu2.nu/download.php?sFile=dd.zip

        A useful tool, along with

        unlockerassistant http://www.softpedia.com/reviews/windows/Unlocker-Review-106258.shtml

        windowsenabler http://www.softpedia.com/get/Others/Miscellaneous/Windows-Enabler.shtml

        and sharpkeys http://www.randyrants.com/2008/12/sharpkeys_30.html

        1. trindflo Bronze badge

          loaddskf savedskf

          IBM had two compatibility-mode utilities to make and restore an image of a floppy. They run in DOS, OS/2, and last time I checked Windows. They were loaddskf.exe and savedskf.exe.

  13. Drummer Boy
    FAIL

    It never rains but it pours

    Especially when the pouring is from the aircon overflow tray above the large IBM piece of tin that ran 15 warehouses (the real kind, not data ones!) spread across Europe.

    It took 4 days to get the system back up, and then senior management suddenly saw the sense in spending several million on a separate site system.

    They lost £5m per day.

    Or the 'meat' error in the same company, where the clerk pushed through a quarter's VAT payments a day early and lost the company IRO £3m in VAT, on an entire warehouse of fags and booze.

    1. Anonymous Coward
      Anonymous Coward

      Pipe leak on a floor above the IT floor.

      Water flows over the box with the mains. Eventually the mains blow. The team has a disaster recovery plan, including a generator for power and a failover. Head Server Pilot fish confirms the failover has worked properly and the servers are now on the generator, a different circuit from the mains. Eventually we need to evacuate the building; no problem, remote access is set up to cleanly power down the servers. Small problem: Head Server Pilot fish lives 45 minutes away. But everything's working, so it should be fine. Except that when the maintenance guys came through and saw the generator circuits were live, they turned the generator off so it wouldn't be dangerous (mind you, this was the same maintenance worker who previously saw no problem with water flowing over the mains). The battery backups were only good for 20-30 minutes, so by the time Head Server Pilot fish got home to remotely shut down the servers, they'd already gone down hard.

    2. irrelevant

      AirCon

      Ah yes ... aircon drip trays sited directly above the brand new IT room, raining down into the cupboard with the power distribution racks in it. Major player in the mobile phone retail sector, mid 90s. I was on-site at the time, too..

      Same place, year or two earlier, I was dragged out of bed because their tech had done an rm -r in exactly the wrong folder ... at least he'd phoned his boss before bolting. We arrived to find the place empty and barely even locked up.

  14. Pat 4

    UPS Cables

    When installing and configuring a UPS monitoring system that will automatically and gracefully shut down your data center in proper order before the batteries run out, always make sure you keep track of the serial cable that came WITH said UPS.

    I once installed one of those for a medium size ISP and got my hands on the wrong cable. It did not take me long to realize that on a regular cable, pin 5 is grounded, and on a UPS, pin 5 to ground means emergency shut-off... The sound of a big UPS, 25+ servers and a plethora of telecom equipment all clicking off simultaneously is not one that I ever want to hear again...

    Best of all... RTFM.

    1. philblue

      Been there...

      Did exactly the same thing but thankfully only on a single Windows SBS server - the last person to work on the server had left the serial cable for the UPS loose down the back so like a good boy I plugged it back in - instant silence.

      The odd thing was, 15 minutes later when the server finally came back up, no-one had noticed...

    2. Peter Simpson 1
      WTF?

      The real WTF is...

      The damn UPS manufacturer "thought outside the box" and used a DB-9 connector with a non-standard pinout for a "serial port". Connecting a standard cable causes Bad Things to happen.

      Poor design doesn't even begin to cover it.

      1. Slow Joe Crow
        Meh

        I blame lock in

        I think it was more a matter of making you buy their "special" cable at twice the price of a generic DB9 serial cable. Fortunately USB has made this a thing of the past since I also found out the hard way about UPS cable pinouts, but luckily it was only a desktop machine.

    3. The First Dave
      Mushroom

      UPS

      On a similar note, I once discovered that if you have a UPS running a couple of servers and decide to re-install Windows on the one that actually has the serial cable in it, then as part of the hardware check a little signal gets sent to the UPS that shuts it down instantly...

  15. Arthur the cat Silver badge

    Make sure the backup is going where you think it is going.

    One of our guys installed a SunOS-based customer site which backed up to an Exabyte tape. The backup would verify the tape contents after backing up to ensure the tape was written correctly, and each day the customers religiously rotated the tapes and put them in the fire safe. One day they wanted to duplicate their data onto another machine, so they tried to restore the backup tapes onto the new machine. Nothing on the tape. Nothing on *any* tape. Turns out that the backup had been going to /dev/rmt0 when the Exabyte was /dev/rst0 or some such name, i.e. the backups had simply been written into a file in /dev on the original machine. Fortunately they hadn't actually lost anything and it was corrected, but if the original machine had fried they'd have lost man-years of work.

    1. Anonymous Coward
      Thumb Up

      Re: Make sure the backup is going where you think it is going.

      This is actually a good lesson to learn.

      I have made a note that from now on one shouldn't just test that it works when it's supposed to; one may also want to make sure it doesn't work when it shouldn't!

      1. Peter Mc Aulay
        Thumb Up

        Re: Make sure the backup is going where you think it is going.

        This is why, if at all possible, I query the tape drive for status during the first test backup to see that it's busy, or better yet, go up to the machine and check for the correct blinkenlights & noises.
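
        A minimal sketch of that sort of sanity check, assuming SunOS/BSD-style device names (adjust to whatever your drive really is; Linux uses /dev/st0 and /dev/nst0):

          # the device node should be a character special file, not a plain file someone created in /dev
          ls -lL /dev/nrst0
          # the drive should answer a status query, and show activity while the test backup runs
          mt -f /dev/nrst0 status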

        1. relpy
          Thumb Up

          /dev/null

          backups are much faster that way...

  16. alain williams Silver badge

    Where do I start ... ?

    Both sides of a mirror on the same disk ...

    Multi-file tar backup to the rewind (not no-rewind) tape device, all except the last archive overwritten (see the sketch at the end of this post) ...

    Sysadmin working in the machine room at the weekend felt a little cold, so turned the air con off. On Monday the servers were fried ...

    Top sysadmin and deputy are the only ones who understand things. They fall in love and give nine months' notice of a round-the-world trip. The company starts looking for replacements three days before they leave ...

    RAID 1 is backup, isn't it? Don't need anything else. Until a user error deletes a file. Cos it is RAID 1, both copies go ...

    Back up to tape, read-verify it, all the files seem to be there. Disks blow up, restore from tape. Why is the data six months old? Because six months ago the tape write head failed; they had been verifying old data ...
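
    On the rewind vs no-rewind point above, a minimal sketch of how multi-archive tapes are meant to work, with the usual SunOS/BSD device names assumed (Linux calls them /dev/st0 and /dev/nst0):

      # /dev/nrst0 is the no-rewind device: the tape stays positioned after each archive
      mt -f /dev/nrst0 rewind
      tar cf /dev/nrst0 /home      # archive 1
      tar cf /dev/nrst0 /etc       # archive 2 lands after archive 1 instead of over it
      mt -f /dev/nrst0 rewind
      # use the rewinding device (/dev/rst0) instead and every tar starts back at the
      # beginning of the tape, clobbering whatever was written before it
      # to read archive 2 later: skip one file mark from the start, then extract
      mt -f /dev/nrst0 fsf 1
      tar xf /dev/nrst0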
