A sysadmin's top ten tales of woe

Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good “…and then it all went horribly sideways” yarn, and we all have more than one. The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability. From common …

COMMENTS

This topic is closed for new posts.


  1. NoneSuch
    Boffin

    You missed a big one.

    Inconsistent times set on the company servers. Nothing is worse than trying to fault-find a sequence of errors across several servers and/or workstations when the time is not set consistently. It's surprising in this day and age just how common this is.
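    This failure mode is easy to demonstrate. A hypothetical sketch (the timestamps are invented): given "same event" timestamps pulled from two servers' logs, flag any skew bigger than a few seconds before trusting a cross-machine error timeline.

    ```shell
    # Epoch seconds for two log entries that should describe the same moment.
    t1=$(date -u -d '2011-07-20 10:00:05' +%s)   # server A's log line
    t2=$(date -u -d '2011-07-20 10:03:41' +%s)   # server B's log line
    skew=$(( t2 > t1 ? t2 - t1 : t1 - t2 ))      # absolute difference
    [ "$skew" -gt 5 ] && echo "clock skew ${skew}s - timelines not comparable"
    ```

    In practice, an NTP client on every box makes the whole problem go away.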

  2. Anonymous Coward
    FAIL

    backup the backup, eh what?

    In the early '90s I worked for a company that had ported a software application from proprietary hardware to a Unix platform. As part of this, a backup solution had been developed for our customers: "the Unix guy" would go on site and install the backup software. To test it, he would back up the folder containing the backup software. Yup, you guessed it: three years after "the Unix guy" had left, a customer had a disk failure, and when we came to restore their data, it turned out they had only ever been backing up the backup software.

    1. Mark Allen
      Pint

      backup to the backup folder

      That reminds me of the client who would do their own zip file backups of data folders, and then store that zip file in the original data folder. So the next backup would include the previous backup. And so on. These (multiple) backup files grew exponentially until the hard disk jammed full...

      And how many people have found their clients backing up their Sage data to the same folder as Sage was installed in? And never taking those backups off site...

      Users are such fun...

      1. Anonymous Coward
        Coat

        users are such fun?

        Well yes, they are, but my biggest single lesson as a sys admin was, when stuff broke, not to ask the users what they had been up to until I had thoroughly examined what I had been up to.

        Send not to know what the users did:

        The person most likely to have broken the system is you!

      2. chr0m4t1c

        As was said elsewhere

        It's not just the users.

        I remember discovering that the default configuration on some NetWare backup software we had was to back up the server structure. Nothing else. So all you could have recovered were the directories, filenames and permissions, not the actual data in the files.

        Fortunately it was just an irritation, as I became suspicious that the first night's backup of what should have been the entire server took around 4 minutes when I had expected about 3 hours based on the amount of data and tape speed.

        It wasn't even a schoolboy error on my part; I'd selected "Full Backup" on the scheduler. It was only when you drilled into the options for what to include in a full backup that you could see the default was to behave like this. Epic fail from the software company there, I think.

    2. Anonymous Coward
      Anonymous Coward

      You mean the flashy red light on the backup was important?

      Client has a multi-million dollar sales database being backed up by the secretary/receptionist (SR). Head secretary (HS) has a problem with the sales database. HS decides reboots fix PCs, so the best thing to do is delete the database because it will rebuild itself. Is surprised when it doesn't. Head Pilot fish takes the call, says "no problem, that's what your backup is for." SR puts in the tape and it won't read. Head Pilot fish asks "Is the read light on the tape backup blinking?" SR says "Why yes it is." Head Pilot fish asks "How long has that been blinking?" SR answers "Not sure, but at least 3 weeks. Why is it important?"

      Fortunately for both HS and SR, Head Pilot fish (not me) was damned good and able to reconstruct their entire database from the Novell 3.12 recovery area (I have since forgotten their name for that very handy feature).

    3. informavorette
      Facepalm

      Not only users fall for such strategies

      At my old job, we were lucky to have a really good admin. He did everything as he should, running a separate backup server for each of the primary servers. When virtualization came around, the transition went really smoothly. Everything continued as it always had, just on virtual servers, completely transparently for us happy users. He may have been falling behind on documentation, but the bosses understood that, as hard as he worked to keep things running, there wasn't time for proper documentation.

      Eventually, we got the budget for a second admin. His first task was to get to know the systems and document them as he went. He found some interesting system landscape choices, including the fact that most of the backup virtual servers ran on the same physical machine as their corresponding virtual primary server.

  3. raving angry loony

    another one

    My horror story is trying to find work when the only people you can talk to are HR people with zero understanding of system administration, job "requirement" sheets that list a plethora of buzzwords that will be obsolete in 90 days, and absolutely no understanding that one of the biggest parts of a sysadmin's job is learning new stuff ALL the time.

    At least I'm on the outside when their data centres, staffed by buzzword-compliant know-nothings, burst into flames. It's happened to two locally already, and the lessons aren't being learned.

  4. Vic

    Counting the backups...

    I got a call-out to a customer. They'd lost a load of data.

    On arrival at site, I found a dead disk. It was a DeathStar, and had been in continuous operation for a goodly number of years - a good order of magnitude longer than most of them lasted.

    RAID? You've got to be kidding.

    I gave them the bad news, and told them they'd need to restore from backup. Blank looks ensued.

    "Oh", says one girl, "There's a copy of everything on the other machine. We'll be OK".

    But there wasn't. There was a share to the machine that had gone down.

    So I had yet another round of disk recovery, and actually recovered most of their data. But I was fielding calls for weeks that started "we used to have this file on the desktop..."

    Vic.

  5. Pete 2 Silver badge

    It's not disaster recovery unless you know it works

    I was in a meeting a couple of years back when the following dialogue took place:

    "Yes, we have a best-practice disaster recovery procedure. We have fully redundant hot-standby servers at a beta site, mirrored disks at the location, and two sets of network connections with no common points of failure."

    "When did you last perform a failover test?"

    "Oh, we've never tested it."

    "Why not?"

    "It might not work."

  6. Andrew Moore Silver badge
    Facepalm

    The cleaner unplugged it

    I remember one client screaming and cursing at me over the phone, threatening all forms of abuse, because one of his systems was down and if I did not do something about it right now there would be legal action, I'd be sued for lost revenue, etc., etc. So after a two-hour drive to his office I walked in, looked at the computer, plugged it back in and powered it up. I then just turned and stared at the client.

    I then sent his boss an invoice for half a day's emergency onsite maintenance, with expenses, with "Problem" filled in as "System unplugged" and "Resolution" filled in as "System plugged back in".

  7. Anonymous Coward
    Facepalm

    Once experienced the classic "single tape" cock-up

    "Yes we have backups every night, without fail for the last 18 months."

    "Oh great, should save some time. We need to test restores for the auditors. Who changes the tapes and what's the offsite storage like, fire-proof that sort of thing?"

    "Change the tape? No need, it's big enough, well it's never run out yet."

    "No, you only have 150MB of space on a tape and the server is holding 40MB of data, so that's barely 3 days' worth!" (40MB on a server - that shows you how long ago this was!)

    "Sorry?!"

    "There is no way that tape could do more than 3 nights of backups before it fills up. You've most likely overwritten the same tape hundreds of times, so you have no historical data available."

    "Sorry?!"

    ( You will be! )

    Quick check and yep, they'd run 450 backups on the same tape, over and over and over and over...the software didn't bother to check the tapes had data, it just rewound and overwrote it.

    Needless to say the auditors were not in the least bit impressed and lots of shouting ensued at management level and plenty of overtime was paid out to IT to make sure it did not happen again!

  8. praxis22
    Stop

    the emergency switch

    That big red power button that takes down the whole data center, the one in plain sight, make sure it has a cover. At some point somebody will trip, or shoot out a stabilising hand at exactly the wrong location. You do not want to be anywhere near that person when it happens.

  9. Inachu
    Pint

    Demanding employees and their email.

    One employee demanded to have access to his email 24/7 and wanted his company email forwarded to his home ISP email.

    Well, sooner or later his home ISP inbox became full, and it not only sent a message that the inbox was full but also copied the message back to the sender at the company.

    So in effect it filled up the email server, which crashed and had to be rebuilt.

  10. fiddley
    Thumb Up

    Aircon fail

    2006: the London summer heatwave stressed our aircon too much and the lot failed. What did the PHB do, order repair or replacement? Nah, sent us to Somerfield for some turkey-sized tinfoil to line the windows of the server room and "keep the heat out". Cue 45-degree temps and a dead Exchange server the next day. Ha! Served him right; needless to say we didn't rush the restore!

    Last time I went past the building, the foil was still in the windows :)

  11. Peter Jones 2
    Pirate

    Domain name...

    Small company that relies on e-commerce and e-mail "downsizes" their sysadmin to replace with a cheaper outsourcing company.

    Three months later, their site and e-mail stops working. Numerous phone calls to the outsourcing company yield nothing. I am called in to troubleshoot a week later. One WHOIS trawl, and I ask "so who is John xxx?" "He was our old sysadmin" "Well you may want to call and ask him for your domain back."

    The sysadmin had been paying the bills through his limited company, and effectively "owned" the domain. When the renewal came up, it was forwarded to a parking site. I'm not sure whether the company bought the domain back, went through arbitration, or found some other solution. But at every company since, I have been interested to see how many sysadmins do this as a form of "insurance", ostensibly because "it's easier to have them contact me."

  12. Woodnag

    Floppy days are here again

    1. The person doing MSBACKUP only put one floppy in, and just pressed ENTER when asked to replace it with the next disk.

    2. New software is purchased. The IT person makes a master backup floppy for the office safe, a working backup floppy, and a working copy floppy for the user. The original floppy goes home with the CEO. Loads of redundancy, multi-site, gold stars all round.

    One day, the user's machine says bee-baaaaar, bee-baaaaar etc - can't read the disk. OK, we'll make you a new one from the working backup floppy. Oops, same problem. Try the master backup floppy. That's duff too. The CEO's copy is brought in: bad as well. Of course the problem was the floppy drive, which had by now killed every copy of the software...

  13. Drummer Boy
    FAIL

    It never rains but it pours

    Especially when the pouring is from the aircon overflow tray above the large piece of IBM tin that ran 15 warehouses (the real kind, not data ones!) spread across Europe.

    It took 4 days to get the system back up, and then senior management suddenly saw the sense in spending several million on a separate site system.

    They lost £5m per day.

    Or the 'meat' error in the same company, where a clerk pushed through a quarter's VAT payments a day early and lost the company IRO £3m in VAT on an entire warehouse of fags and booze.

  14. Pat 4

    UPS Cables

    When installing and configuring a UPS monitoring system that will automatically and gracefully shut down your data center in proper order before the batteries run out, always make sure you keep track of the serial cable that came WITH said UPS.

    I once installed one of those for a medium-size ISP and got my hands on the wrong cable. It did not take me long to realize that on a regular cable pin 5 is grounded, and on a UPS, pin 5 to ground means emergency shut-off... The sound of a big UPS, 25+ servers and a plethora of telecom equipment all clicking off simultaneously is not one I ever want to hear again...

    Best of all... RTFM.

  15. Arthur the cat Silver badge

    Make sure the backup is going where you think it is going.

    One of our guys installed a SunOS based customer site which backed up to an Exabyte tape. The backup would verify the tape contents after backing up to ensure the tape was written correctly, and each day the customers religiously rotated the tapes and put them in the fire safe. One day they wanted to duplicate their data onto another machine, so tried to restore the backup tapes onto the new machine. Nothing on the tape. Nothing on *any* tape. Turns out that the backup had been going to /dev/rmt0 when the Exabyte was /dev/rst0 or somesuch name, i.e. the backups had simply been written into a file in /dev on the original machine. Fortunately they hadn't actually lost anything and it was corrected, but if the original machine had fried they'd have lost man years of work.
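    One hedged sanity check that would have caught this (paths are placeholders; the real node in the story was /dev/rmt0): verify that the backup target is a character device before trusting it, since a typo'd device name silently becomes an ordinary file under /dev.

    ```shell
    # Simulate the bug: create an ordinary file where a tape device was expected.
    TAPE=./rmt0          # stand-in path for /dev/rmt0
    : > "$TAPE"
    if [ -c "$TAPE" ]; then
        echo "ok: $TAPE is a character device - plausibly a real drive"
    else
        echo "WARNING: $TAPE is a plain file - backups never reached the tape"
    fi
    ```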

  16. alain williams Silver badge

    Where do I start ... ?

    Both sides of a mirror on the same disk ...

    Multi-file tar backup to the rewind (not norewind) tape device, all except the last archive overwritten ...

    Sysadmin working in the machine room at the weekend felt a little cold, so turned the aircon off. On Monday the servers were fried ...

    Top sysadmin and deputy are the only ones who understand things. They fall in love, give 9 months notice of round world trip. Company starts looking for replacement three days before they leave ...

    RAID 1 is backup, isn't it? Don't need anything else. Until a user error deletes a file. Cos it is RAID 1, both copies go ...

    Backed up to tape, read-verified it, all the files seemed to be there. Disks blow up, restore from tape. Why is the data 6 months old? Because 6 months ago the tape write head failed; they had been verifying old data ...
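    The rewind-device mistake above can be sketched with a plain file standing in for the tape: re-running "tar cf" against the same target clobbers the previous archive, exactly as the auto-rewinding device does. (On Linux, writing successive archives to the norewind node, e.g. /dev/nst0, then rewinding explicitly with mt, avoids the overwrite.)

    ```shell
    mkdir -p a b && echo one > a/f1 && echo two > b/f2
    tar cf backup.tar a        # first "nightly" archive
    tar cf backup.tar b        # second run overwrites it, like the rewind device
    tar tf backup.tar          # lists only b/f2 - archive 1 is gone
    ```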

  17. Boris the Cockroach Silver badge
    Flame

    Number 6

    is not just the bane of sysadmins' lives; it's the bane of every person who ever has to use a PC remotely.

    E.g. a certain industrial robot programmer has to upload a bunch of programs to the robots; this is done via a file browser on the robots' control panel coupled to the programming station's PC.

    The PC is in the office, 100 yards away from the robots... so set the PC comms program running and walk across the factory.

    At least once a week some idiot decides to turn the PC off halfway through the transfer because either his iPod/iPad/iArse needs charging or 'oh look, Boris has left his PC on again'.

  18. richardgsmith
    Unhappy

    Reminded me

    It's been a while now, but the incident of the cleaner and the plug socket is certainly no urban legend. Whilst I was contracting at a bank of Scottish origin in the late '90s, we spent days trying to work out why the overnight stored procedures failed every time. Yep, it was the 'woman who does'.

    Which leads me to suggest an entry for a similar article (which I'd like to see) on DBA disasters, along the lines of 'figuring out SQL performance issues at the end of the project doesn't always work out'.

    1. KroSha
      Mushroom

      Similar one

      The emergency cutoff switch is right next to the exit door. The exit door is access controlled. The new guy did not realise that you have to swipe your card to get out, not just press a button. 500 servers go down in an instant.

      1. Mayhem

        Or the cabling gets a little loose

        Unlabelled emergency cutoff was located on a false wall near the door; its connections had shaken loose over the years from the vibration of the door closing. Walked in, turned on the lights, turned off a room full of servers. Very good heart-attack test.

        Only vindication was the astonished expressions on the faces of the highly sceptical building electricians when I did it again in front of them, after dragging them in to explain to me what had gone wrong with the lighting circuits.

    2. Jeremy 2

      Really?

      I'm no expert in server protection, but surely there are simple mitigations against the cleaner unplugging stuff, from the simple (a sticker) to the more complex (a locking shield over the outlet), not to mention not leaving mission-critical machines, or ones holding sensitive data, in rooms the cleaning staff have access to.

      Surely any sensible person would take measures to prevent exactly this scenario?

      1. cnorris517

        Yes....but...

        Fair point in an ideal world, but sadly the bean counters are in control. The number of times I've seen bean counters refuse expenditure of a few thousand pounds upfront, only to cost themselves tens of thousands down the line, is unreal.

    3. sandman

      More cleaner woes

      I used to have to do a lot of 3D rendering. In this case the project was to render the column bedecked interior of a planned large building - and then create a plot to be shown to the royal carbuncle hater next day. This needed an overnight run. Despite the plug switch being taped open and a "do not switch off" sign hung over it, the cleaner (no, she was English and could read) turned it off.

      This resulted in much swearing (me), much panic and fear of vanishing chances of making the Honours List (the CEO and directors of the charity) and hiring a motorbike courier to get the new drawings up to London just after lunch.

    4. Anonymous Coward
      Anonymous Coward

      EPO

      That happened to MCI where I live. To make things worse, this site was where they do the peering for AT&T and Sprint. They lost a few routers, and DNS took a week to fix. When I say they lost a few routers, I mean they wouldn't turn on. Some of the T1 cards were fried too, and some of the routers and switches lost their configs. Five hours offline. By the way, this site controlled the zone for northern California.

    5. philblue

      Been there...

      Did exactly the same thing, but thankfully only on a single Windows SBS server: the last person to work on the server had left the serial cable for the UPS loose down the back, so like a good boy I plugged it back in. Instant silence.

      The odd thing was, 15 minutes later when the server finally came back up, no-one had noticed...

    6. Anonymous Coward
      Anonymous Coward

      Pipe leak on a floor above the IT floor.

      Water flowed over the box with the mains. Eventually the mains blew. The team had disaster recovery covered, including a generator for power and a failover. Head Server Pilot fish confirms the failover has worked properly and the servers are now on the generator, a different circuit than the mains. We finally need to evacuate the building; no problem, he's set up for remote access to cleanly power down the servers. Small problem: Head Server Pilot fish lives 45 minutes away. But everything's working, so it should be fine. Except when the maintenance guys came through and saw the generator circuits were live, they turned them off so they wouldn't be dangerous (mind you, this was the same maintenance worker who previously saw no problem with water flowing over the mains). The battery backups were only good for 20-30 minutes, so by the time Head Server Pilot fish got home to remotely shut down the servers, they'd already gone down hard.

    7. Anonymous Coward
      Thumb Up

      Re: Make sure the backup is going where you think it is going.

      This is actually a good lesson to learn.

      I have made a note that from now on, one shouldn't just test that it works when it's supposed to--one may also want to make sure it doesn't work when it shouldn't!

      1. Peter Mc Aulay
        Thumb Up

        Re: Make sure the backup is going where you think it is going.

        This is why, if at all possible, I query the tape drive for status during the first test backup to see that it's busy, or better yet, go up to the machine and check for the correct blinkenlights & noises.

        1. relpy
          Thumb Up

          /dev/null

          backups are much faster that way...

    8. Anonymous Coward
      Mushroom

      Or the flirty junior playing - What does this switch do ?

      CLUNK !

      COS for high-integrity 24v DC C&I supply.

      Screaming Yodalarms.

      Primary DC was offline, and the X-hour duration 'backup' battery hadn't yet been installed.

      30 seconds later, temporary 'buffer' battery is fully discharged.

      Immediately followed by a resounding series of BANGs, as every breaker in 6 multi-MW switchboards trips-out.

      Cue total darkness and ominous silence, broken only by watchkeeper's swabbie-level cursing.

      How to take out a nuke-sub from the inside.

    9. Anonymous Coward
      Anonymous Coward

      Re: the emergency switch

      Seen that happen. Except it wasn't a server room, it was a large ocean going vessel, and the button in question was the Man Overboard Button.

      Mind you, at least nobody got hurt.

    10. Matt K

      EPO switches

      ...another on the subject of EPO switches.

      When told that your switch should be covered to avoid accidental triggering: forget all about it, then call your local electrician in a panic when reminded, and watch as said electrician sticks a drill bit into your PDU in the middle of your peak processing window...

      Change management: more than just the logical stuff.

    11. pixl97

      Poor admins.

      The story here is the crappy email server that commits suicide when the disk is full.

      This is why you don't store queues on your operating system partition. All sysadmins should know this.

    12. Anonymous Coward
      Anonymous Coward

      Re: Floopy days are here again

      There is a utility under Unix called 'dd' that allows you to make an image of your floppy; you can then write that image onto a fresh floppy when you need it.

      I know this utility has been ported to DOS and I have even made use of it in the past, but for some reason I can't remember the name it uses under DOS, so I can't give you a link for it. Sorry.
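      A rough sketch of the Unix side: dd copies raw blocks, so the same two commands that image a floppy also round-trip an ordinary file. disk.bin stands in for the floppy device here (/dev/fd0 on Linux) so the example runs without a drive.

      ```shell
      # Imaging and restoring a "floppy", using a file as a stand-in device.
      FLOPPY=disk.bin                                               # substitute /dev/fd0 for a real drive
      dd if=/dev/urandom of="$FLOPPY" bs=512 count=16 2>/dev/null   # fake "disk" contents
      dd if="$FLOPPY" of=floppy.img bs=512 2>/dev/null              # take the image
      dd if=floppy.img of=restored.bin bs=512 2>/dev/null           # write it back out
      cmp -s "$FLOPPY" restored.bin && echo "image round-trip OK"
      ```

      The point of taking the image early is that a failing drive can no longer destroy the only good copies, as happened in the story above.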

      1. Anonymous Coward
        Anonymous Coward

        very poor admins

        The story here is also that someone didn't set mailbox size limits?

      2. galbak

        Useful tool

        DD.EXE is used to create an image file from a floppy boot disk. Do not confuse it with the Unix "dd" command; it is not quite the same.

        Example: dd a: <filename>.img

        This will create an image file from your bootdisk in drive A:.

        You could also use winimage (not freeware) for this, but remember to save your floppy image as .IMA (not compressed) file.

        WINIMAGE http://www.winimage.com

        DD.EXE http://www.nu2.nu/download.php?sFile=dd.zip

        Useful tools, along with

        unlockerassistant http://www.softpedia.com/reviews/windows/Unlocker-Review-106258.shtml

        windowsenabler http://www.softpedia.com/get/Others/Miscellaneous/Windows-Enabler.shtml

        and sharpkeys http://www.randyrants.com/2008/12/sharpkeys_30.html

        1. trindflo

          loaddskf savedskf

          IBM had two compatibility-mode utilities to make and restore an image of a floppy. They run in DOS, OS/2 and, last time I checked, Windows. They were loaddskf.exe and savedskf.exe.

    13. Ross K Silver badge
      Mushroom

      Ah Deathstars

      with their glass platters... I can still remember that scratch, scratch, scratch noise even though I haven't touched one of those pieces of crap in ten years.

    14. Peter Simpson 1
      WTF?

      The real WTF is...

      The damn UPS manufacturer "thought outside the box" and used a DB-9 connector with a non-standard pinout for a "serial port". Connecting a standard cable causes Bad Things to happen.

      Poor design doesn't even begin to cover it.

    15. Peter Simpson 1
      Devil

      It's usually the boss

      or his relative.

      //you can't even yell at them

    16. The First Dave
      Mushroom

      UPS

      On a similar note, I once discovered that if you have a UPS running a couple of servers and decide to reinstall Windows on the one that actually has the serial cable in it, then as part of the hardware check a little signal gets sent to the UPS that shuts it down instantly...

      1. Slow Joe Crow
        Meh

        I blame lock in

        I think it was more a matter of making you buy their "special" cable at twice the price of a generic DB9 serial cable. Fortunately USB has made this a thing of the past. I also found out the hard way about UPS cable pinouts, but luckily it was only a desktop machine.

    17. Alain

      Re: the emergency switch

      A variant of this one: in a very large computerised hospital, due to an electrical fire in a transformer room (it was mid-August, of course... Murphy's law at work), we had been doing several complete system shutdowns and startups (6 Unix clusters, 20+ Oracle DBs, 50+ blade servers) over a few days. This was required by frequent and mostly announced (but on very short notice) blackouts due to problems with the generator trucks they had parked on the street next to the building.

      At some point, totally exhausted after three almost sleepless nights, we were doing yet another system start-up, having just received confirmation that power was back and "hopefully" stable. A guy who takes care of just a couple of not-too-important Windows servers came into the room to boot up his own boxes. He almost never comes in, doing his work remotely. When he was finished, he went out and... switched off the lights of the room. None of us actually died of a heart attack, but it was close.

    18. irrelevant

      AirCon

      Ah yes... aircon drip trays sited directly above the brand-new IT room, raining down into the cupboard with the power distribution racks in it. A major player in the mobile phone retail sector, mid-'90s. I was on-site at the time, too...

      Same place, a year or two earlier: I was dragged out of bed because their tech had done an rm -r in exactly the wrong folder... at least he'd phoned his boss before bolting. We arrived to find the place empty and barely even locked up.

    19. Anonymous Coward
      Devil

      Who cares about the cleaner

      Agree about the cleaner. My wife hired a Chinese bloke to clean the house once a week a couple of years back. His first deed was to bring down the network by plugging a 2kW vacuum cleaner into the UPS-backed socket. I fired him after he did it a second time and bought a Roomba.

      Which reminds me - have you seen the incident where builders plug welding apparatus into the UPS socket? Half of the UPS overload protection out there does not work correctly with inductive loads that size. Older APC kit definitely does not. Trust me, a Galaxy-class APC charged to the hilt exploding in a 3x3m server room is not a pretty sight.

      Also - on number 7 - snow storms. Snow storms are actually not that bad if you have the right vehicle, warm clothes and a shovel. Now East Anglian fog... In my old job I had to go and plug in cold spare equipment in the office at 11pm with visibility under 10m. The only thing you could see through the windshield was a white wall. You do not see the road, nothing. So you crawl along at 5mph with dead reckoning and hearing as your primary means of navigation. Thankfully, the paint used for road markings in the UK is thick enough that you actually notice when you drive over it.



Biting the hand that feeds IT © 1998–2019