A sysadmin's top ten tales of woe

Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good “…and then it all went horribly sideways” yarn, and we all have more than one. The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability. From common …

COMMENTS

This topic is closed for new posts.
  1. NoneSuch Silver badge
    Boffin

    You missed a big one.

    Inconsistent times set on the company servers. Nothing worse than trying to fault-find a sequence of errors across several servers and/or workstations when the time is not set consistently. Surprising in this day and age just how common this is.
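
    A minimal sketch of the sort of check that avoids this, assuming ntpd and its standard tools are installed (pool.ntp.org is just an example reference source):

    # ask each box how far its clock is from a common reference, without setting it
    ntpdate -q pool.ntp.org

    # or, if ntpd is running, confirm it is actually synchronised to a peer
    ntpq -p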

  2. Anonymous Coward
    FAIL

    backup the backup, eh what?

    In the early '90s I worked for a company that had ported a software application from proprietary hardware to a Unix platform. As part of this, a backup solution had been developed for our customers: "the Unix guy" would go on site and install the backup software. To test it, he would back up the folder containing the backup software. Yup, you guessed it: three years after "the Unix guy" had left, a customer had a disk failure, and when we came to restore their data it turned out they had only ever been backing up the backup software.

    1. Mark Allen
      Pint

      backup to the backup folder

      That reminds me of the client who would do their own zip file backups of data folders. And then store that zip file in that original data folder. So the next backup would include the previous backup. And so on. These (multiple) backup files exponentially grew and grew until the hard disk jammed full...

      And how many people have found their clients backing up their Sage data to the same folder as Sage was installed in? And never taking those backups off site...

      Users are such fun...

      1. Anonymous Coward
        Coat

        users are such fun?

        Well yes, they are, but my biggest single lesson as a sys admin was, when stuff broke, not to ask the users what they had been up to until I had thoroughly examined what I had been up to.

        Send not to know what the users did:

        The person most likely to have broken the system is you!

      2. chr0m4t1c

        As was said elsewhere

        It's not just the users.

        I remember discovering that the default configuration on some NetWare backup software we had was to back up the server structure. Nothing else. So all you could have recovered were the directories, filenames and permissions, not the actual data in the files.

        Fortunately it was just an irritation, as I became suspicious that the first night's backup of what should have been the entire server took around 4 minutes when I had expected about 3 hours based on the amount of data and tape speed.

        It wasn't even a schoolboy error on my part: I'd selected "Full Backup" on the scheduler, and it was only when you drilled into the options for what to include in a full backup that you could see the default was to behave like this. Epic fail from the software company there, I think.

    2. Anonymous Coward
      Anonymous Coward

      You mean the flashy red light on the backup was important?

      Client has multi-million dollar sales database being backed up by secretary/receptionist. Head secretary has problem with sales database. HS decides reboots fix PCs so the best thing to do is delete the database because it will rebuild itself. Is surprised when it doesn't. Head pilot fish takes call, says "no problem, that's what your back-up is for." SR puts in the tape and it won't read. Head Pilot fish asks "Is the read light on the tape backup blinking?" SR says "why yes it is." Head Pilot fish asks "How long has that been blinking?" SR answers "Not sure, but at least 3 weeks. Why is it important?"

      Fortunately for both HS and SR, Head Pilot fish (not me) was damned good and able to reconstruct their entire database from the Novell 3.12 recovery area (I have since forgotten their name for the very handy feature).

    3. informavorette
      Facepalm

      Not only the users fall for such strategies

      At my old job, we were lucky to have a really good admin. He did everything as he should, running a separate backup server for each of the primary servers. When virtualization came around, it went really smoothly. Everything continued as it always had, just on virtual servers, completely transparent for us happy users. He may have been falling behind on documentation, but the bosses understood that, as much as he worked to keep things running, there wasn't the time for proper documentation.

      Eventually, we got the budget for a second admin. His first task was to get to know the systems and document them as he went. He found some interesting system landscape choices, including the fact that most of the backup virtual servers ran on the same physical machine as their corresponding virtual primary server.

  3. raving angry loony

    another one

    My horror story is trying to find work when the only people you can talk to are H.R. people with zero understanding of system administration, job "requirement" sheets that list a plethora of buzzwords that will be obsolete in 90 days, and absolutely no understanding that one of the biggest jobs a sysadmin has is "learning new stuff ALL the time".

    At least I'm on the outside when their data centres, staffed by buzzword-compliant know-nothings, burst into flames. It's happened to two locally already, and lessons aren't being learned.

    1. Anonymous Coward
      Meh

      Recruiters and HR: a close match.

      Saw a job that tickled me fancy, so gave it a go. Recruiter comes back that HR says I'd already have to live in that city because moving there means blah blah can't take the risk yadda yadda bad experiences mumble bs mumble mumble.

      Sure. A couple of stains on that reasoning: it wasn't hard to work out just what GPU super shop it was; there aren't too many around in Europe, n'mind Amsterdam. Their LinkedIn account shows fresh hires from Germany and France. And the job --installing the product at customer sites, requiring good Unix and troubleshooting, which I definitely have-- has as a requirement "extensive travel in Benelux and Europe". But they don't want you unless you already live in Amsterdam. Go on then, pull the other one.

      Sounds like a bit of a disconnect between HR and the operational side of the company. Or maybe they just suck at the "white lies" people skill. Either way, I'd not put much trust in their shares.

      1. raving angry loony

        sounds right

        I keep getting the "you're overqualified" excuse from the recruiters and H.R. Because of course actually knowing too much or having too much experience might be detrimental to the job? WTF?

        1. David_H
          Thumb Down

          Been there as well

          HR droid got me to fill in the IQ test sheet.

          She came back 10 minutes later looking hot under the collar.

          Some bloke popped his head round the door saying "you scored more than anyone else here ever has, you'll be bored and leave. Goodbye" and left. And the droid escorted me offsite, refusing to enter into conversation.

          FFS for 3 times my salary at the time in research, I'd have stuck it for more than 6 months!

  4. Vic

    Counting the backups...

    I got a call-out to a customer. They'd lost a load of data.

    On arrival at site, I found a dead disk. It was a DeathStar, and had been in continuous operation for a goodly number of years - a good order of magnitude longer than most of them lasted.

    RAID? You've got to be kidding.

    I gave them the bad news, and told them they'd need to restore from backup. Blank looks ensued.

    "Oh", says one girl, "There's a copy of everything on the other machine. We'll be OK".

    But there wasn't. There was a share to the machine that had gone down.

    So I had yet another round of disk recovery, and actually recovered most of their data. But I was fielding calls for weeks that started "we used to have this file on the desktop..."

    Vic.

    1. Ross K Silver badge
      Mushroom

      Ah Deathstars

      with their glass platters... I can still remember that scratch, scratch, scratch noise even though I haven't touched one of those pieces of crap in ten years.

  5. Pete 2 Silver badge

    It's not disaster recovery unless you know it works

    I was in a meeting a couple of years back when the following dialog took place:

    Yes, we have a best-practice disaster recovery procedure. We have fully redundant hot-standby servers at a beta site, mirrored disks at the location and two sets of network connections with no common points of failure.

    When did you last perform a failover test?

    Oh we've never tested it

    Why not?

    It might not work.

  6. Andrew Moore
    Facepalm

    The cleaner unplugged it

    I remember one client screaming and cursing at me over the phone, threatening all forms of abuse, because one of his systems was down and if I did not do something about it right now there would be legal action, I'd be sued for lost revenue, etc, etc, etc. So after a 2 hour drive to his office I walked in, looked at the computer, plugged it back in and powered it up. I then just turned and stared at the client.

    I then sent his boss an invoice for a half day of emergency onsite maintenance, with expenses, with "Problem" filled in as "System unplugged" and "Resolution" filled in as "System plugged back in".

    1. PatientOne

      This isn't so uncommon

      Had this a couple of times when I was working for a company down Swindon way.

      User calls, makes demands and threats, I drop what I'm doing, head on down and find the computer was unplugged. User, of course, was never around when I got there. So I plug the computer back in and leave it plugged in. The last time I went down, I checked the computer worked, then unplugged it and returned to the office to report on the ticket. Note in the comments box: No fault found. Computer left in state found in. I didn't mention that the problem seemed to occur at about 8:30, when the canteen opened...

      And as to cleaners: Yes, had that, too. The server room was kept locked: the main door was key-coded with access audited. A rear escape door was secure from the outside: you had to break the glass tube to unlock it. The side door was kept locked with 'no entry' markers on it, and the office it was accessed from was also kept locked. So, the cleaner went through the office and in through the side door using the master key. We only found out this was happening when she forgot to plug the server back in when she left one night: the UPS had been keeping it running the rest of the time.

      But my favorite has to be the offsite server farm: two sites mirrored, just in case, and in separate counties. Then the substation supplying power to one went down, and both sites went dark. When they investigated why, they found both sites were supplied by the same substation. Apparently no one had thought to check that possibility out...

  7. Anonymous Coward
    Facepalm

    Once experienced the classic "single tape" cock-up

    "Yes we have backups every night, without fail for the last 18 months."

    "Oh great, should save some time. We need to test restores for the auditors. Who changes the tapes and what's the offsite storage like, fire-proof that sort of thing?"

    "Change the tape? No need, it's big enough, well it's never run out yet."

    "No, you only have 150MB of space on a tape and the server is holding 40MB of data, so that's only 5 days worth!" ( 40MB on a server, that shows you how long ago this was! )

    "Sorry?!"

    "There is no way that tape could do more than 3 nights of backups before it fills up. You've most likely overwritten the same tape hundreds of times, so you have no historical data available."

    "Sorry?!"

    ( You will be! )

    Quick check and yep, they'd run 450 backups on the same tape, over and over and over and over... the software didn't bother to check the tape had data on it; it just rewound and overwrote it.

    Needless to say the auditors were not in the least bit impressed and lots of shouting ensued at management level and plenty of overtime was paid out to IT to make sure it did not happen again!

  8. praxis22
    Stop

    the emergency switch

    That big red power button that takes down the whole data center, the one in plain sight, make sure it has a cover. At some point somebody will trip, or shoot out a stabilising hand at exactly the wrong location. You do not want to be anywhere near that person when it happens.

    1. KroSha
      Mushroom

      Similar one

      The emergency cutoff switch is right next to the exit door. The exit door is access controlled. The new guy did not realise that you have to swipe your card to get out, not just press a button. 500 servers go down in an instant.

      1. Mayhem

        Or the cabling gets a little loose

        Unlabelled emergency cutoff was located on a false wall near the door, connections had shaken loose from vibrations from door closing over the years. Walked in, turned on the lights, turned off a room full of servers. Very good heart attack test.

        The only vindication was the astonished expressions on the faces of the highly sceptical building electricians when I did it again in front of them, after dragging them in to explain to me what had gone wrong with the lighting circuits.

    2. Anonymous Coward
      Anonymous Coward

      EPO

      That happened to MCI where I live. To make things worse, this site was where they do the peering for AT&T and Sprint. They lost a few routers, and the DNS took a week to fix. When I say they lost a few routers, I mean they didn't turn on. Some of the T1 cards were fried too. Then there were the routers and switches that lost their config. 5 hours offline. By the way, this was the site controlling the zone for northern California.

    3. Anonymous Coward
      Mushroom

      Or the flirty junior playing - What does this switch do ?

      CLUNK !

      COS for high-integrity 24v DC C&I supply.

      Screaming Yodalarms.

      Primary DC was offline, and the X-hour duration 'backup' battery hadn't yet been installed.

      30 seconds later, temporary 'buffer' battery is fully discharged.

      Immediately followed by a resounding series of BANGs, as every breaker in 6 multi-MW switchboards trips-out.

      Cue total darkness and ominous silence, broken only by watchkeeper's swabbie-level cursing.

      How to take out a nuke-sub from the inside.

    4. Anonymous Coward
      Anonymous Coward

      Re: the emergency switch

      Seen that happen. Except it wasn't a server room, it was a large ocean going vessel, and the button in question was the Man Overboard Button.

      Mind you, at least nobody got hurt.

    5. Matt K

      EPO switches

      ...another on the subject of EPO switches.

      When told that your switch should be covered to avoid being accidentally triggered, don't forget, call your local electrician in a panic when reminded, and watch as said electrician sticks a drill bit into your PDU in the middle of your peak processing window...

      Change management: more than just the logical stuff.

    6. Peter Simpson 1
      Devil

      It's usually the boss

      or his relative.

      //you can't even yell at them

    7. Alain

      Re: the emergency switch

      A variant of this one: in a very large computerised hospital, due to an electrical fire in a transformer room (that was mid-August of course... Murphy's laws at work), we had been doing several complete system shutdowns and startups (6 Unix clusters, 20+ Oracle DBs, 50+ blade servers) over a few days. This was required by frequent and mostly announced (but on very short notice) blackouts due to problems on the generator trucks they had parked on the street next to the building. At some point, totally exhausted after 3 almost sleepless nights, we were doing yet another system start-up after just receiving a confirmation that power was back and "hopefully" stable. A guy taking care of just a couple of not-too-important Windows servers came into the room to boot up his own boxes. He almost never comes here, doing his work remotely. After being finished, he goes out and ... switches off the lights of the room. None of us instantly died of a heart attack, but that was close.

    8. Andy Miller
      Facepalm

      Location, location, location

      Or simply put the emergency off next to the light switch in a server room that is usually run lights out. So groping around to turn the room lights on turns all the little lights off....

  9. Inachu
    Pint

    Demanding employees and their email.

    One employee demanded to have access to his email 24/7 and wanted company email forwarded to his home ISP email.

    Well sooner or later his home ISP email inbox became full and not only sent a message that the inbox was full but also copied the message back to the sender at the company.

    So in effect it filled up the email server and it crashed and had to be rebuilt.

    1. pixl97

      Poor admins.

      The story here is the crappy email server that commits suicide when the disk is full.

      This is why you don't store queues on your operating system partition. All sysadmins should know this.
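
      A minimal sketch of that idea, assuming a Linux-style layout and a hypothetical spare disk (the /dev/sdb1 device and the mount point are assumptions):

      # /etc/fstab - give the mail spool its own filesystem, so a runaway
      # queue fills /var/spool rather than the OS partition
      /dev/sdb1   /var/spool   ext4   defaults   0   2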

      1. Anonymous Coward
        Anonymous Coward

        very poor admins

        The story here is also that someone didn't set mailbox size limits?

  10. fiddley
    Thumb Up

    Aircon fail

    2006: the London summer heatwave stressed our aircon too much and the lot failed. What did the PHB do, order repair or replacement? Nah, sent us to Somerfield for some turkey-sized tinfoil to line the windows of the server room and "keep the heat out". Cue 45 degree temps and a dead Exchange server the next day. Ha! Served him right; needless to say we didn't rush with the restore!

    Last time I went past the building, the foil was still in the windows :)

    1. Anonymous Coward
      FAIL

      @Aircon...

      Or the time when I came back from a weekend off to find two of my servers with fried power supplies .. turns out a DR exercise had happened over the weekend and the ops staff were getting a bit cold in the small machine room, so they might have turned off some of the aircon for a few hours during the exercise (and my servers were furthest away from the cold outflow) ... hmm, now looking at the logs, when did the exercise start and when did the logs get truncated? 8-(

      I would suspect that it was pure karma, as a few years earlier in said machine room I turned off one of the aircon blowers as I was getting hypothermia working next to the cold air outlet, and forgot to turn it back on before leaving for the day .. the temperature alarm went off overnight .. whoops (I owned up) ... funnily enough, little plastic covers were screwed over the aircon controls a few days later .. you then had to use a pen through a small hole in the cover to flick them on and off 8-)

  11. Peter Jones 2
    Pirate

    Domain name...

    Small company that relies on e-commerce and e-mail "downsizes" their sysadmin to replace with a cheaper outsourcing company.

    Three months later, their site and e-mail stops working. Numerous phone calls to the outsourcing company yield nothing. I am called in to troubleshoot a week later. One WHOIS trawl, and I ask "so who is John xxx?" "He was our old sysadmin" "Well you may want to call and ask him for your domain back."

    The sysadmin had been paying the bills through his limited company, and effectively "owned" the domain. When the renewal came up, it was forwarded to a parking site. Not sure whether the company bought the domain back, went through arbitration, or found some other solution. But at every company since, I have been interested to see that a lot of sysadmins do this as a form of "insurance", ostensibly because "it's easier to have them contact me."

  12. Woodnag

    Floppy days are here again

    1. The person doing MSBACKUP only put one floppy in, and just pressed ENTER when asked to replace it with the next disk.

    2. New software is purchased. The IT person makes a master backup floppy for the office safe, a working backup floppy, and a working copy floppy for the user. The original floppy goes home with the CEO. Loads of redundancy, multi-site, gold stars all round.

    One day, the user's machine says bee-baaaaar, bee-baaaaar etc - can't read the disk. OK, we'll make you a new one from the working backup floppy. Oops, same problem. Try the master backup floppy. That's duff too. The CEO's copy is brought in; bad also. Of course the problem was the floppy drive, which had now killed all copies of the software...

    1. Anonymous Coward
      Anonymous Coward

      Re: Floppy days are here again

      There is a utility under Unix called 'dd' that allows you to make an image of your floppy; you can then write that image to a fresh floppy when you need it.

      I know that this utility has been ported to DOS and I have even made use of it in the past, but for some reason I can't remember the name used under DOS, so I can't give you a link for it. Sorry.
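
      A minimal sketch of the Unix version described above (the device name is an assumption; /dev/fd0 is typical on Linux):

      # read the floppy into an image file
      dd if=/dev/fd0 of=backup.img bs=512

      # later, write the image back onto a fresh floppy
      dd if=backup.img of=/dev/fd0 bs=512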

      1. galbak

        Useful tool

        DD.EXE is used to create an image file from a floppy bootdisk. Do not confuse it with the Unix "dd" command; it is not quite the same.

        Example: dd a: <filename>.img

        This will create an image file from your bootdisk in drive A:.

        You could also use WinImage (not freeware) for this, but remember to save your floppy image as an .IMA (uncompressed) file.

        WINIMAGE http://www.winimage.com

        DD.EXE http://www.nu2.nu/download.php?sFile=dd.zip

        A useful tool, along with

        unlockerassistant http://www.softpedia.com/reviews/windows/Unlocker-Review-106258.shtml

        windowsenabler http://www.softpedia.com/get/Others/Miscellaneous/Windows-Enabler.shtml

        and sharpkeys http://www.randyrants.com/2008/12/sharpkeys_30.html

        1. trindflo Bronze badge

          loaddskf savedskf

          IBM had two compatibility-mode utilities to make and restore an image of a floppy: loaddskf.exe and savedskf.exe. They run in DOS, OS/2 and, last time I checked, Windows.

  13. Drummer Boy
    FAIL

    It never rains when it pours

    Especially when the pouring is from the aircon overflow tray above the large IBM piece of tin that ran 15 warehouses (the real kind, not data ones!) spread across Europe.

    It took 4 days to get the system back up, and then senior management suddenly saw the sense in spending several million on a separate site system.

    They lost £5m per day.

    Or the 'meat' error in the same company where the clerk pushed through a quarter's VAT payments a day early and lost the company IRO £3m in VAT, on an entire warehouse of fags and booze.

    1. Anonymous Coward
      Anonymous Coward

      Pipe leak on a floor above the IT floor.

      Water flows over the box with the mains. Eventually the mains blow. The team has a disaster recovery plan including a generator for power and a failover. Head Server Pilot fish confirms the failover has worked properly and the servers are now on the generator, a different circuit from the mains. Eventually we need to evacuate the building; no problem, they're set up for remote access to cleanly power down the servers. Small problem: Head Server Pilot fish lives 45 minutes away. But everything's working, so it should be fine. Except when the maintenance guys came through and saw the generator circuits were working, they turned the generator off so it wouldn't be dangerous (mind you, the same maintenance worker who previously saw no problem with water flowing over the mains). Battery backup was only good for 20-30 minutes, so by the time Head Server Pilot fish got home to remotely shut down the servers, they'd already gone down hard.

    2. irrelevant

      AirCon

      Ah yes ... aircon drip trays sited directly above the brand new IT room, raining down into the cupboard with the power distribution racks in it. Major player in the mobile phone retail sector, mid 90s. I was on-site at the time, too..

      Same place, year or two earlier, I was dragged out of bed because their tech had done an rm -r in exactly the wrong folder ... at least he'd phoned his boss before bolting. We arrived to find the place empty and barely even locked up.

  14. Pat 4

    UPS Cables

    When installing and configuring a UPS monitoring system that will automatically and gracefully shut down your data center in proper order before the batteries run out, always make sure you keep track of the serial cable that came WITH said UPS.

    I once installed one of those for a medium size ISP and got my hands on the wrong cable. It did not take me long to realize that on a regular cable, pin 5 is grounded, and on a UPS, pin 5 to ground means emergency shut-off... The sound of a big UPS, 25+ servers and a plethora of telecom equipment all clicking off simultaneously is not one that I ever want to hear again...

    Best of all... RTFM.

    1. philblue

      Been there...

      Did exactly the same thing but thankfully only on a single Windows SBS server - the last person to work on the server had left the serial cable for the UPS loose down the back so like a good boy I plugged it back in - instant silence.

      The odd thing was, 15 minutes later when the server finally came back up, no-one had noticed...

    2. Peter Simpson 1
      WTF?

      The real WTF is...

      The damn UPS manufacturer "thought outside the box" and used a DB-9 connector with a non-standard pinout for a "serial port". Connecting a standard cable causes Bad Things to happen.

      Poor design doesn't even begin to cover it.

      1. Slow Joe Crow
        Meh

        I blame lock in

        I think it was more a matter of making you buy their "special" cable at twice the price of a generic DB9 serial cable. Fortunately USB has made this a thing of the past since I also found out the hard way about UPS cable pinouts, but luckily it was only a desktop machine.

    3. The First Dave
      Mushroom

      UPS

      On a similar note, I once discovered that if you have a UPS running a couple of servers, and decide to re-install windows on the one that actually has the serial cable in it, then as part of the hardware check a little signal gets sent to the UPS, that shuts it down instantly...

  15. Arthur the cat Silver badge

    Make sure the backup is going where you think it is going.

    One of our guys installed a SunOS based customer site which backed up to an Exabyte tape. The backup would verify the tape contents after backing up to ensure the tape was written correctly, and each day the customers religiously rotated the tapes and put them in the fire safe. One day they wanted to duplicate their data onto another machine, so tried to restore the backup tapes onto the new machine. Nothing on the tape. Nothing on *any* tape. Turns out that the backup had been going to /dev/rmt0 when the Exabyte was /dev/rst0 or somesuch name, i.e. the backups had simply been written into a file in /dev on the original machine. Fortunately they hadn't actually lost anything and it was corrected, but if the original machine had fried they'd have lost man years of work.
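
    A minimal sketch of the sanity check this story suggests, using the device names mentioned above (mt and ls behaviour assumed typical of a Unix box of that era):

    # the backup target should be a character device, not an ordinary file sitting in /dev
    ls -l /dev/rst0

    # and the drive should answer a status query before you trust a single tape
    mt -f /dev/rst0 status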

    1. Anonymous Coward
      Thumb Up

      Re: Make sure the backup is going where you think it is going.

      This is actually a good lesson to learn.

      I have made a note that from now on, one shouldn't just test that it works when it's supposed to--one may also want to make sure it doesn't work when it shouldn't!

      1. Peter Mc Aulay
        Thumb Up

        Re: Make sure the backup is going where you think it is going.

        This is why, if at all possible, I query the tape drive for status during the first test backup to see it's busy, or better yet, go up to the machine and check for the correct blinkenlights & noises.

        1. relpy
          Thumb Up

          /dev/null

          backups are much faster that way...

  16. alain williams Silver badge

    Where do I start ... ?

    Both sides of a mirror on the same disk ...

    Multi-file tar backup to the rewind (not norewind) tape device, all except the last archive overwritten ...

    Sysadmin working in the machine room at the weekend, felt a little cold so turned the air con off. On monday the servers were fried ...

    Top sysadmin and deputy are the only ones who understand things. They fall in love, give 9 months notice of round world trip. Company starts looking for replacement three days before they leave ...

    RAID 1 is backup, isn't it? Don't need anything else. Until a user error deletes a file. Cos it is RAID 1, both copies go ...

    Backup to tape, read-verify it, all the files seem to be there. Disks blow up, restore from tape. Why is the data 6 months old? Because 6 months ago the tape write head failed; they had been verifying old data ...

  17. Boris the Cockroach Silver badge
    Flame

    Number 6

    IS not just the bane of sysadmins' lives; it's the bane of every person who ever has to use a PC remotely.

    E.g. a certain industrial robot programmer has to upload a bunch of programs to the robots; this is done via a file browser on the robots' control panel coupled to the programming station's PC.

    The PC is in the office 100 yrds away from the robots.... so set the PC comm program running and walk across the factory.

    At least once a week some idiot decides to turn the PC off halfway through the transfer because either his iPod/iPad/iArse needs charging or 'oh look, Boris has left his PC on again'.

  18. richardgsmith
    Unhappy

    Reminded me

    It's been a while now, but the incident of the cleaner and the plug socket is certainly no urban legend. While I was contracting at a bank of Scottish origin in the late 90s we spent days trying to work out why the overnight stored procedures failed every time - yep, it was the 'woman who does'.

    Which leads me to suggest an entry for a similar article (which I'd like to see) for DB admin disasters, which goes along the lines of 'Figuring out SQL performance issues at the end of the project doesn't always work out'.

    1. Jeremy 2

      Really?

      I'm no expert in server protection stuff but surely there are simple mitigations against the cleaner unplugging stuff, from the simple (a sticker) to the more complex (locking shield over the outlet) not to mention not leaving machines with sensitive data and/or that are mission critical in rooms that the cleaning staff have access to.

      Surely any sensible person would take measures to prevent exactly this scenario?

      1. cnorris517

        Yes....but...

        Fair point in an ideal world, but sadly the bean counters are in control. The number of times I've seen bean counters refuse expenditure of a few thousand pounds upfront only to cost themselves tens of thousands down the line is unreal.

    2. sandman

      More cleaner woes

      I used to have to do a lot of 3D rendering. In this case the project was to render the column-bedecked interior of a planned large building - and then create a plot to be shown to the royal carbuncle-hater the next day. This needed an overnight run. Despite the plug switch being taped open and a "do not switch off" sign hung over it, the cleaner (no, she was English and could read) turned it off.

      This resulted in much swearing (me), much panic and fear of vanishing chances of making the Honours List (the CEO and directors of the charity) and hiring a motorbike courier to get the new drawings up to London just after lunch.

    3. Anonymous Coward
      Devil

      Who cares about the cleaner

      Agree about the cleaner. My wife hired a Chinese bloke to clean the house once a week a couple of years back. His first deed was to bring down the network by plugging a 2KW vacuum cleaner into the UPS-backed socket. I fired him after he did it for the second time and bought a Roomba.

      Which reminds me - have you seen the incident where the builders plug a welding apparatus into the UPS socket? Half of the UPS overload protection designs out there do not work correctly with inductive loads that size. Older APC kit definitely does not. Trust me, a Galaxy-class APC charged to the hilt exploding in a 3x3m server room is not a pretty sight.

      Also - on 7 - snow storm. Snow storms are actually not that bad if you have the right vehicle, warm clothes and a shovel. Now, East Anglian fog... In my old job I had to go and plug in cold spare equipment at 11pm in the office with visibility under 10m. The only thing you could see through the windshield was a white wall. You do not see the road, nothing. So you crawl along at 5mph with dead reckoning and hearing as your primary means of navigation. Thankfully, the paint used for roads in the UK is thick enough - you actually notice when you drive over it.

      1. TeeCee Gold badge
        Happy

        Re: Who cares about the cleaner

        Yup, seen that one too. Regional (i.e. Europe) server and comms room. Every AM, something's been unplugged. Mostly trivial but annoying. The solution was to carefully label each plugged in item in Dymo with "Do Not Remove This Plug". Sorted.

        That night, the cleaner came in and found nowhere to plug in the hoover as all the sockets were occupied with plugs so labelled. Then she noticed a nice block of sockets in a line, handily mounted at "not having to bend down" level too. She whacked in the old Numatic "Henry" and switched it on. Said sockets were in the back of the comms rack and the clean power supply they were attached to shat itself on the spot, producing a Europe-wide outage of all the European shared services.

        Moral: Sometimes, finding your screen unplugged in the morning ain't such a bad thing......

        1. Anonymous Coward
          Megaphone

          Moral of the story?

          So what stops us from finding a handy outlet in a spot convenient for the cleaner and labeling it "for cleaning use ONLY"?

          Sometimes it's really more useful to facilitate what you'd like to happen than it is to forbid every niggle and thing you don't want to happen. For bonus points wire it up to a nice and isolated group, then get with staffing and make sure all new cleaning hires know that in rooms full of computing equipment there will be a socket specifically for them, labeled and such and free, and to use any other is a firing offence.

          Why yes, the cleaning staff too should know where their priorities lie: in certain rooms a bit of dust is preferable to having blinkenlights go dark. Go clean the visitors' area again instead, hmkay.

          My story? Getting all comms, net, phone, alarm, everything, ripped out without the aid of a backhoe. Telco street cabinet cable administration mass spring cleaning session. Four-hour service contract and watching the telco not care. "We're busy! Come back next week!" Silver lining? Sending in someone from ceerow to shout at them; if nothing else, it saves my hearing. The icon is for that guy, though he wouldn't need one.

          1. Field Marshal Von Krakenfart

            Moral of the Story???

            You can't make idiot proof software/hardware/procedures, the idiots are just too inventive.

            Or as Albert Einstein said, "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former."

          2. David_H
            Holmes

            Sockets

            There is a very simple answer.

            When I wired up offices I just bought a whole load of 13 Amp plugs and sockets with the live and neutral oriented at 45 degrees to normal. These sockets were on a separate supply to everything else, and all the cleaners' devices had their normal plugs removed and fitted with these special plugs. I even used to give them an extension cable from the special plug to an ordinary socket for their mobiles etc.

            Either that, or the fact that I made sure they knew where I kept the baseball bat, meant that I've never had such problems.

  19. Anonymous Coward
    Anonymous Coward

    More cleaner/plug woes

    Back in the day I did frequent battle with a flaky Sparc box that, pretty regularly, either the cleaner insisted on unplugging in favour of the higher interests of the vacuum cleaner, or the users insisted on shutting down using the great big OFF switch instead of triggering the shutdown procedure.

    In theory this wasn't a big deal: you'd provide some juice, it would run a fsck, boot and everything would be hunky-dory.

    In practice it was a pain in the arse, because the fsck just went into an infinite loop until such time as you got bored and pulled the plug again and started the game of floppy roulette that was reloading the bugger: 35 disks, 3 hours and a 1 in 3 chance that number 30 would be the one that failed to read, sending you back to the beginning.

    At least, that is, until I discovered that you could boot it to a kernel from one floppy, then telnet in and delete the fsck request...

  20. Jacqui

    cleaners and security

    First the cleaners. When I worked for Cray some 17+ years ago, one of the high-up sales bods complained the cleaners did not clean his keyboard properly. Evidently bits of sandwich were still in his Mac when he came in the next day.

    So the cleaners were told to clean keyboards - big, big mistake. Our dev desktops were various systems, but many were Suns, and a cleaner giving the KB a bloody good wipe would crash the Suns.

    Dev staff installed a mouse-driven lockscreen so we avoided death-by-cleaner until HR told the cleaners to stop cleaning KBs.

    Next was the dev box that crashed every night at midnight. We finally found it was a security guard who SLAMMED office doors when doing his rounds. Some of the dev boxen were on a table next to the door - guess what, disk crashes. We asked him to stop slamming doors - you don't want to hear his reply... The fix?

    1) move table away from wall - duh!

    2) install a cron job to run "breaking glass" on a Sun at one end of the dev corridor, then at the other end of the corridor after just enough time to let him get to the other end.

    Repeat at various times throughout the night :-)

    We left the breaking glass running on odd nights for a few weeks - he stopped slamming doors - the Suns had mics - and looked a lot fitter not long after.

    Finally, the cheap UPS. I had two of these beasties - bought a cheap external modem and used it on a serial port to detect when the power had gone off.

  21. Anonymous Coward
    Anonymous Coward

    My stories ...

    My personal favourite: A young operator head butting the Emergency Power Off button when tripping over in the machine room whilst running to pick up a ringing phone .. the famous "lift to use" plastic cover was added to said button after all the systems were brought back online. A few years later I changed jobs and when visiting a different data centre I noticed the lack of covers on their Emergency Off buttons .. sometime later on another visit the covers had been added .. I did wonder.

    In the age of green screen consoles you knew which server you were rebooting .. bring on serial console terminal servers and it took more effort to concentrate .. now with GUIs it takes extreme concentration to stop from typing "shutdown -r now" or even "rm -rf *" in the wrong window. [I would never admit to either of those finger troubles 8-]

    Ahh cleaners - In some DC's the "cleaner's' sockets" are widely advertised and the only ones you can plug your laptop power brick in without needing an approved "power change request" 8-(

    1. There's a bee in my bot net

      Plusnet?

      You didn't work for Plusnet a few years back did you?

      That is exactly what happened during an email storage migration. Two console windows open to the old and new storage devices. Format the new storage array - which was actually the old one. Cue angry customers, weeks of trying to recover customers' data from the old storage array by the storage array vendor, and the admission that they didn't back up the storage array. The storage vendor wasn't able to recover much data.

      My email now lives in the cloud, my own personal cloud; oh all right, it's a little Atom-based server that sits on a shelf at home. This all reminds me, I must see if we have any old tape drives kicking around at work that I might be able to take home for personal use... (RAID 5 plus scripted backups to another disk is the only backup, but it still feels safer than trusting someone else - and before someone says it - I know RAID isn't a backup - jeez you people are so critical).

      1. Anonymous Coward
        Thumb Up

        @Plusnet?

        >You didn't work for Plusnet a few years back did you?

        Nope, not me, but as a long-time Plusnet customer I remember that incident .. didn't affect me though as I run my own mail server and so any email loss would be purely my fault .. which is why I back up to a RAID partition and to tape and have UPSes and also use IMAP to synchronize my mail to various laptops ... Paranoid? Nope, just have the scars of experience, my son.

        Two things an admin always needs to use and should check regularly .. Backups and Logs .. both invaluable when used correctly.

  22. K. Adams
    Mushroom

    Old Electronic Engineers' maxim:

    "It won't work after you let the smoke out..."

    (Often said to EE students at University, as an observation by the Professor or Prof's Assistant regarding a student's freshly-blown and still-wafting circuit.)

  23. Smoking Man
    Happy

    Professional system care

    Had once to present our servers to a -hopefully- new client.

    Was asked several times about "how to upgrade firmware, can you downgrade firmware?"

    Answer "yes, of course, was always possible and will be."

    Some time later I was told the reason for this strange kind of question:

    They had their most critical SAP system deployed as a cluster. Multiple database servers, but all connected to the same single storage array. A technician comes in, "does a small online firmware upgrade" to the disk array (where the data was perfectly RAID-protected), and, here you go, the storage array goes belly-up. No firmware downgrade possible. The array vendor (same as the server vendor) had to bring in a new disk array, configure it, and reload all the data from backup.

    5 days of production loss.

    Next cluster (from us, servers and storage :-)) did array-based replication to a 2nd site, and, obviously to a 2nd array.

  24. Lord Lien
    Facepalm

    Not my horror story...

    .. but I know a sysadmin who asked for backups of the 5 1/4 inch disks that came with the system & someone pulled photocopies out of a filing cabinet.

    1. GrahamT

      True

      That happened at one of our clients, but they were 8" discs then.

  25. Anonymous Coward
    Mushroom

    My favourite... 5 1/4 inch disks

    This story was told to me a few years ago.

    Back in the day the client used to do backups to multiple 5 1/4 inch floppy disks.

    One day a disk failure takes down the server and the technician is called in to restore the data...

    Tech: "Where do you keep the backup disks?"

    Client: "You'll be pleased to hear that we're very organised - we keep all of out backup disks in a folder in the fire safe"

    The client then hands over the folder to the technician - a ring binder full of neatly hole-punched 5 1/4 inch disks.

    I've always loved that story :)

    1. JasperJay
      Facepalm

      5 1/4 inch disks were great

      Years ago in a colonial backwater I responded to a client's request for assistance with their backup procedure for a stock app as they were unable to do any more because no more disks would go in the machine?!?

      Turns out the warehouse manager who had been given the job of running the backups had tired of it and so passed the task on to a lackey with very little IT nous. Turns out the little lady had been loading 5 1/4 inch disks into the computer religiously every night as required, but had never taken any back out again! As the little door on the floppy drive was down, preventing the disk going in, she had fed them all into a little gap between the floppy drive and the case of the machine. Opened up the box and out cascaded years' worth of pristine blank floppies...

      1. T.a.f.T.
        Coffee/keyboard

        Mi Ol Mum

        I could have seen my mum doing this a while ago. It was just too funny not to comment.

    2. Anonymous Coward
      Facepalm

      Or customers returning their data

      on 5 1/4inch floppies, stapled to letters telling us what was wrong

    3. Apocalypse Later

      I still keep one...

      ...5 1/4 inch floppy drive installed on one computer just in case I need to access ancient media. I also keep a pair of forceps to remove CDs and DVDs from said drive. It is actually more likely to get a DVD in it than a floppy, these days.

  26. Klio
    Facepalm

    Check your syntax

    We needed to merge the data from 3 different DB tables from an old system into our new one, for which the developer chose to write a PHP script. After successfully pulling the data together his loop set about inserting it into the new table, but his query had an error in it. Normally this wouldn't be an issue, except the DB wrapper he'd used had an 'email on error' function built into it.

    Close to 400,000 emails later the building got back some sort of connectivity.

    Moral: test your query once before doing it half a million times.

    1. Synonymous Howard
      FAIL

      SMS alerts

      Or when auto-raising alerts via SMS try to ensure you don't send out duplicates 8-(

      My "quick hack" to dial-up the SMS message centre and send alerts to peoples' on-call mobiles went down a treat until one night hundreds of alerts were raised due to a mail queue overflowing .. said colleagues were deleting the messages from their phones for days after [10 years ago phones only held something like 10 SMSes at a time and so deleting a few hundred took a while we had to wait for them to be received in batches ... beep beep every few minutes 8-]

      Suffice it to say, another quick hack later and duplicate alerts were ignored for an hour or two, with only one SMS sent out!

  27. A1batross
    Mushroom

    Separate networks

    Invoked like a magical spell, utility SCADA networks are supposed to be "completely separate" from the utility Intranet. Of course then there are the metering applications that watch SCADA electrical generation in order to sell power to the grid, but those are "just screen scrapers," read-only interpreters of SCADA reporting. So I was told.

    So I'm doing a site audit at a nuclear generating facility in the USA, following a wire on the SCADA network. The wire runs up onto the chief system administrator's desk, where he has a Windows NT (yes... NT) box... with another wire from a second network card running off to the Intranet.

    "I thought you told me these were totally separate networks?" I asked.

    He looked at me and sneered. "I *HAVE* a routing table!"

  28. Anonymous Coward
    Meh

    Test your DR scenario.

    I was managing a servicedesk at a previous job. I hadn't been there long when this little gem happened.

    We discovered the server room had a problem when something (failed backup from previous night being rerun through the day IIRC?) failed and emailed the Servicedesk with a failure report. A tech sent to check the tape discovered the server room at ~30°C at the thermometer (/at the door/) and reported this back quickly.

    Simultaneous failure of a dozen air con units? I'm not a believer in coincidences. Checks reveal power to the server room is down, and the servers are running from the UPS but not the aircon. A facilities dept sparky was produced in record time (especially considering it was lunchtime) when nature of the problem was explained to my counterpart in facilities. ~35°C

    Frantic improvised efforts to provide airflow in attempt to avert thermal overload start. ~40°C

    Bad things™ start happening with servers as temp starts stabilising at ~45°C. Staff evacuate server room.

    Stewed Area Network shuts down at unknown, but absurd temperature taking everything on a VM down. Conventional servers keep plugging along despite tropical temperatures.

    A post mortem showed that the UPS had actually sent an email when it took on the server room load, but it only went to the network manager, who was at lunch.

    Lessons learned.

    1) Senior management panic in a crisis.

    2) Make sure that you have an alert group so that everybody with a need to know gets critical alerts, instead of just one frigging person.

    3) Actually do a real world test of the DR scenario.

  29. Anonymous Coward
    FAIL

    You need more than one backup tape?

    Small local recruitment company, SBS2003 backing up to tape (12 tapes, so 2 weeks' worth of fallback). We set the system up and checked the backup system worked, then showed the sysadmin (the owner's 16-year-old son who "knew about computers" - they didn't want to pay support fees) the basics and left.

    9 months later, panic call: they had been broken into, server stolen etc etc, and would we assist in getting everything back up and running. We started arranging a new server and workstations, & I went in to collect the backup tapes.

    Changing the tapes every day was a bit of a chore, so he had been just putting the same tape back in every day. Daddy was so proud of his boy that day (we, of course, still got the blame).

  30. philblue
    Mushroom

    Always know which server you're wiping

    I was called out a few years ago to a small 5-person firm to replace a 2-month-old server which had given no end of trouble from the beginning - an exact replacement was provided, and as the original was sat under a desk at the time pending the completion of its new home (a cupboard with a vent in the door...) I plonked the new one next to it.

    The live server was taken offline while the migration completed. One of the first things I had to do was adjust the RAID config on the new replacement server. I was constantly rolling from one side of the desk to the other between the two servers and had other work on my mind too. It wasn't until about five seconds after I'd hit 'INITIALIZE ARRAY' in the RAID BIOS that it dawned on me that I'd done it on the wrong server.

    Thank heavens it didn't actually do anything to the data on the disks - the 2 minutes it took to reboot before the Windows logo appeared on the screen were the longest of my life.

  31. Robert Helpmann??
    Childcatcher

    Dancing on the Power Switch

    I installed a new switch for our main campus as part of a network and server room upgrade. It had four power supplies - redundant and hot-swappable. I plugged each one into a separate and dedicated surge suppressor and each one of those into a different outlet all with UPS and generator back-up. I put the surge suppressors under the raised floor behind the wiring cabinets, between them and the wall. They were spread out a total distance of about 16 feet.

    Enter the electrician! Someone else had hired an electrical contractor to upgrade some of the wiring. As part of this effort, he pulled up the flooring and walked behind the wiring cabinets. It seems he could not hear me screaming for him to stop moving over the sound of the HVAC. He stepped on one power switch after another, completely downing our campus network. Alas, I had bought 3COM equipment which had no persistent configuration, but I did have it backed up externally and was able to restore it while the director of IT stood and tapped his foot on the other side of the wiring closet door.

  32. Anonymous Coward
    Anonymous Coward

    Trees, trees and more trees

    A colleague at a major south coast university was testing some NetWare objects. Having finished, he decided to rename the tree on his test box.

    Only it wasn't the test box.

    After 4 lots of 'Server xxxx is going down' it was realised that not only had he done this, but he'd renamed the root tree, had confirmed his mistake, and 200+ servers were busily deleting/destroying/shredding users, NAL objects and home dirs.

    Hats off to him, he went right into the IT director's office and fessed up - possibly the only reason he is still there 10 years on. It took weeks to get it all back up and it was never 'right' again. Some students lost a hell of a lot.

  33. Anonymous Coward
    Facepalm

    Fresh from the battlefield.

    Me: Are you *sure* it's not a part of the RAID0 set? It looks like it is to me.

    Remote admin: Positive, just pull it and replace it, it's hot swap.

    Me: ID the disk for me please <cue flashing amber on relevant disk> Right, Channel1 ID3 yes?

    Remote Admin: Yup.

    Me: Last Chance, you're absolutely sure it's not part of the RAID0?

    Remote Admin: Look, I know what I'm doing, just pull the bloody disk and replace it.

    Me: Oh, well if you're that sure, OK, here goes <click> <pull><bluescreen> Ooopsy, looks like it was part of the RAID0 doesn't it..

    Remote Admin: Ermmm.

    Me: Want me to save your arse for you?

    Remote Admin: <small voice> Please?

    AC because it was only last week; the muppets had set up a RAID0 spanned across the first ~9GB of all five disks and then a RAID5 on the remaining space on all 5 disks, so every disk was part of both sets. (Yeah, I didn't think it was possible, or that anyone would be so bloody stupid even if it were.) The scary thing is that every server in the business, on every remote site, was apparently configured in the same way.

  34. Tom 13

    Always know where your time server is pointed

    Windows server farm, don't remember what the original time server was, but it failed. So it started going down its list of alternates. You know, the default list. Until at last it queried the core switch. Which read its date and time from the BIOS ROM. And told the primary AD server it was now more than 2 years ago, in fact I think it was 10 years. At which point the server promptly began tombstoning all of the current accounts, starting with the first ones created. Which of course were the network admin accounts. Ugh!

  35. Anonymous Coward
    Thumb Up

    While writing software for a website undersold by noone.

    A certain cloud adherent, in his first job, which he now describes as project management, modified the 404 to reroute to an ASP page which loaded a library file. The library file cleared the shopping basket.

    On one of the pages, which we will call for the sake of argument product.asp, an image was referenced which didn't exist. This then loaded the 404 page, which emptied the shopping basket.

    A man who Leeds from the front in DBA circles then proposed tracing using DBCC. He has a recommendation to this day on LinkedIn for this brilliant piece of analysis.

    Naturally, it was early 2000's so we fixed it and all went on the piss, and the man, let's call him Sidney, went on to hugely respectable career, having learned a lot from this.

    The moral: everyone makes mistakes. It's the fixing of them which shows brilliance.

  36. Richard Pennington 1
    FAIL

    Those were the days ...

    ... at my first job post-university (mid-1980s), they had a machine room downstairs, with rudimentary air-conditioning controls, i.e. temperature and humidity.

    Rumour had it that by proper (i.e. highly improper) adjustment of the controls (temperature right down and humidity right up) it was possible to get it to snow inside the machine room...

  37. Dave 32
    Unhappy

    Crash!

    Well, let's see. There's the contractor that was in to change the blown light bulbs in the machine room. On the way out, he leaned his ladder against the wall near the door. And, it just so happens that one of the steps was at the exact correct height to push that big red EPO button. CLUNK! Followed by the sound of total silence in a machine room with 15+ mainframes. Ohoh!

    So, the obvious solution was to put a protective cover around the big red EPO button. So, a contractor was called in to do it. He opens the door, brings his ladder in (despite the fact that he really didn't even need a ladder!), leans it against the wall next to the door....CLUNK! Ohoh!

    As for backups, the I/T group assured us that the system disks were being backed up to tape every night. A few months in, someone asked them for exact details, and they gladly told us which two tapes the system disks were being backed up to. We eventually realized that when they said "system disks", they meant the two disks with the operating system on it, not the 40+ disks with the user data on them that were attached to the system! Whoopsie!

    Then, there was the time that the automated backup job would run overnight, and back user data up to tape. Unfortunately, the amount of user data had eventually grown to the point where it wouldn't fit on the tapes allocated. So, the night operations staff simply grabbed the needed tapes from the scratch tape pool, and satisfied the backup tape requests with those tapes. Oh, and, of course, as soon as the backup jobs were finished, they returned the tapes to the scratch tape pool! Yeah, I lost some data from that one, but it was almost worth it for hearing the explanation from the I/T installation manager.

    Let's see, there was that incident where the people on the third floor called the machine room on the fourth floor to ask them why there was a stream of water coming through their ceiling. That was about the same time that the operations staff noticed that one of the mainframes had started failing. Can you say "Broken chilled water line?". Ohoh!

    Then, there was that time I requested a tape from the automated tape librarian. The tape was assigned to me, mounted, and I dumped a sizable quantity of data to it. I then unmounted the tape, but kept it assigned to me. A month later, when I went to retrieve the data, I requested that the tape be mounted. After an hour of waiting, tape operations finally called me up: "We keep seeing this mount request for a tape for you..." "Correct." "only, we don't have a tape with that volume serial. Furthermore, we've never had a tape with that volume serial." "Err, how did the automated tape librarian assign it to me last month, and how was it mounted then, and how did I use it, if it doesn't exist?" "Ohoh!" (Yeah, I had a LOT of *fun* with tapes.).

    Then, there was that lightning strike. Did you know that lightning can make CRT displays show funny colours for days afterwards? Did you know that lightning can also make pretty colours inside of computers, routers, and just about anything else? I'm still trying to clean up that mess. :-(

    There are always those amusing, and somewhat rhythmic, yet very unmusical sounds from disk drives as they're dying a horrible death: "Screeeee-CLUNK-CLICK-Screeeee-CLUNK-CLICK-Screeeee-CLUNK-CLICK..."

    As for sensations, who can forget the smell of hot electrolytic capacitor electrolyte from inside a system. Can you say "Bulged capacitor syndrome"? I can! :-(

    I could go on for hours. :-(

    Dave

  38. Anonymous Coward
    Facepalm

    O NOES IT'S Y2K!!!!!!1!!!ELEVEN

    I took over Messaging around 2001 for a company from an eccentric gearhead that had absolutely no idea what proper IT practice was all about. Over the years we had found all sorts of *interesting* things he had done, but this one in particular really was beyond the pale.

    The server in question was in a closet on a factory floor. Accounts differ on what exactly happened, circa 2003 IIRC, but it was something along the lines of: someone was trying to unplug something else in a rat's nest of cables and accidentally pulled the plug on the firewall.

    The aforementioned gearhead - who hadn't left on the best of terms - had never bothered to tell anyone that for years he had been rolling back the BIOS clock every time the system had to be restarted. The "firewall" (to use the term loosely) was a non-Y2K-compliant DOS (yes, friggin DOS in ~2003) based system and would fry itself if it were allowed to boot up with its date and time post-Y2K.

    Of course, if it were not for this experience I probably would have never learned how to read/write PIX firewall configs as we stumbled through building the config - which was also never documented - from the ground up. I think we might have been one of the only companies around to have a Y2K issue and, LOL, it was in 2003!

    1. Anonymous Coward
      Happy

      Sorry to reply to my own post, but forgot to mention...

      That was the only firewall in the environment and when it fried itself it took down their only external connection. All their inbound/outbound internet mail, web, etc... and every third party connection that they had went dark too until we were able to source, install and configure a new one - which I thought we did pretty well at considering that they were mostly back up and running in three days.

      I think someone still owes someone a kidney for getting a PIX in under 48 hours - still not sure how or who managed to pull that one off.

  39. Robert E A Harvey

    doing things on the cheap

    Large engineering company developing its own control system.

    Decided not to spend money on hardware, gave old PDP11 to software team.

    Only storage was two 10" removable hard drives. Compiler and target development done on one, backed up to the other. Software man took backup platter home each day, swapped for another one the next day. Very professional.

    Team on holiday during works shutdown. Compulsory that.

    During works shutdown man from DEC came to do contract maintenance. Discovered lots of things wrong with both 10" drive stacks, carefully re-aligned heads (remember eyeball diagrams?) and adjusted read/write amplifiers. Got jobcards signed.

    Back on Monday morning, and neither the system nor any of the backups could be read. 2 years of development down the tubes. Large amounts of cash wasted trying to recover data. Project cancelled. Half a million quid wasted saving the price of a new VAX.

  40. kosh

    Backups, db sync ...

    Modern data protection doesn't use backups or database sync; it takes single-instance archive snapshots.

  41. Anonymous Coward
    Anonymous Coward

    Whoops!

    A common problem I have noticed is logging into the wrong environment.

    A simple configuration change is taking place in a non-production system. To ensure the values are accurate a screen to production is opened to copy values from. The two screens are basically identical... production data ends up in a test system. That has happened a number of times. Not too difficult to recover from but always embarrassing.

    Or better yet, a problem has been spotted in production. Admin logs in to investigate. Admin then logs into a test environment to replicate the problem. Sets up something incorrectly and deletes the setup to try again and match production. Oops, that was production and the stack has just been blown away. Slight embarrassment and a good few hours recovering.

    A RAID 5 array had been in use for many years. Never suffered any failings. Suddenly it goes down and users are complaining. A brief investigation finds that a significant number of the disks in the array were dead... and had been for some time. Shame no-one monitored it.

  42. Richard Pennington 1
    FAIL

    Those were the days (2)

    On to about 1988, and a small consulting firm. One day one of the consultants found a small machine downstairs and took a board out of it for use in another machine. There was only one problem: the machine he'd cannibalised was still live ... and running the firm's timesheeting operation.

    The next round of timesheets was done on new-fangled IBM PCs...

  43. Anonymous Coward
    Facepalm

    My worst

    My management believed that our top-tier colocation facility's backup generators were insufficient protection against a power outage, so, against the colo's advice, they insisted on installing UPSes in each of our production racks. To comply with fire regulations, the colo required an emergency power off (EPO) circuit be installed so that all the UPS units could be shut down at once. One of the UPSes faulted, so I went down to the colo to reset it if possible or take it out of service if necessary. I hit the power button, and the UPS shorted back along the EPO circuit, shutting down all the UPSes and taking out our entire production infrastructure, including the SAN, while the dev and QA systems hummed merrily along. There was a tense moment like something out of a bad movie involving a bunch of us huddled around the back of the UPS with an electrician as we figured out what to do upon discovering that the EPO wire had fused to the innards of the faulted UPS and that the only solution was to cut the wire.

    He cut the wire, we spent a few hours verifying that everything had come up correctly, and a few weeks later, the UPSes were gone.

  44. Anonymous Coward
    Anonymous Coward

    Errrr... Sorry!

    Possibly the time I went to do some work on one of my servers in a county council machine room, which needed me to get behind a couple of comms cabinets. This was fine, until I tripped over a cable on the way out. "Not to worry," thinks I, the UPS will kick in until I pop that 13 amp plug back in; it's not like anyone would have important stuff running straight off the mains, is it? Oh, except for the entire VoIP system for the entire county council... It was a nervous few minutes till that came back up.

  45. Anonymous Coward
    Anonymous Coward

    Sometimes it pays to walk around the outside of the building.

    A company set up a tape-based backup system, including a fireproof cabinet to store the on-site backups. Their testing indicated that their backup process worked as they expected. However, a few months later, when they needed to retrieve some files from backup, the critical tapes were unreadable. Thank god for off-site backups. They reevaluated their process, and everything seemed to check out, so they wrote it off as “one of those things”. Several months later, they encountered the same problem, and this time they called in a consultant to determine what they were doing wrong. Long story short, it turned out the tapes were being marinated in a strong magnetic field from the transformer pad located outside the building right against the wall where the tape storage cabinet was located!

  46. Diogenes
    Windows

    Bin there got the scars and I was a developer not a sysadmin

    The BRS (big red switch) without a cover - check

    The duff backup not being done for 12 months - check

    The overheated servers because the admin turned the AC off - check

    The best I had, though, was an ex-colleague who was working near Sydney (Mascot) airport and the AS400 would start to boot, then crash after 2 - 3 seconds. AS400 replaced 5 times (and all worked when tested at IBM). Company called in all the experts - all baffled until a very, very senior vendor engineer (flown in from IBM US) happened to look out the window and noticed that when a radar was pointed at the building, the server crashed. Two dollars' worth of tinfoil fixed the problem.

    Icon 'cause that's what my colleague looked like!

  47. Field Commander A9
    Thumb Up

    Funniest post

    and replies I ever read on El Reg!

  48. Anonymous Coward
    Anonymous Coward

    tales of OH SH__!

    anonymous to protect the guilty...

    1) The first story I remember, my father worked for a mainframe company as hardware tech support, one day he gets a call that a set of backup tapes are bad, and they've called to have the drives checked. He goes in, verifies that the drives (reel-to-reel) are properly aligned, etc, runs a job to the device, reads it on a 2nd device, all's well, he then tells the ops manager to call him if he has any more problems. Next week, another call, same problem. He goes in, pulls the drives apart and puts them back together. Runs his tests (which pass), reports this to the ops manager, who is getting upset. 6 days later he's in performing some off-hours work, totally unrelated, when housekeeping unlocks the door, drags the industrial floor buffer into the data center, plugs it in and starts buffing the floor tiles. After the cleaner finished doing the aisle between the front and back doors, he pushes the buffer around the operator console and starts buffing the tiles between the tape racks...30 seconds later dad has pulled the plug on the buffer and is calling the ops manager at home. Three years later, the floor looked grubby, but by god the backups were readable.

    2) UPS's and cabinet design. Working for a large manufacturing operation ($500K/hr.) In order to help us prioritize our time, our manager requested that anyone in operations or support use the text paging system to contact us, and to supply some kind of detail so we could properly prioritize things. It's lunchtime, so 4 of us go to lunch, me (on call), other network guy, an applications guy, and my manager. Sitting in a restaurant, I get a number-only '911' page. I ignore it because there's no message... 2 minutes later, all of our pagers go off with just '911'. Manager calls operations, and when he hangs up the phone, wraps up the rest of his lunch and says we've got to rush back, everything's down. That was hard to believe, we had 100 separate circuits in the data center, all fed by an inline UPS. We found out later that the UPS maintenance tech had come in to do a PM/test on all of the UPS batteries, and had walked down the row opening all of the cabinet doors. When he opened the last door, the next-to-last door finished swinging open, and the latch went right between the vertical bars, inside the last cabinet, that protect the main power switch... and shut down the entire UPS. This, of course, took down every server/switch/router in the data center.

    Before the PM was completed, maintenance had built a hood to mount over the switch to prevent that from happening again. Also, shortly after this, we started a project (which made the union electricians really happy) to convert half of the data center power to bypass the UPS. These were fed from site power which was fed from two different power grids via two onsite substations

    3) is that a complete backup? This one happened in college. The computer science department had a machine running AIX that supported some programming and admin classes. The department sysadmin got some of his more "bright" students to admin the box: account maintenance, backups, volume management, etc. Well, one evening around 8pm, the server went offline. I called the admin (because I usually spent 20hrs/day online), and he met us in the computer room to work on the machine. The problem was a disk failure, and required the disk to be replaced, formatted and reinstalled from tape. Everything went well, and could possibly have been a shining example of what we could do...until we tried to restore the user accounts. After restoring from tape, we couldn't log in, but root could log in. As it turned out, the only partition that had been backed up was /home. We spent the next several hours re-creating, by hand, /etc/passwd and /etc/group, from a dump of currently enrolled CS students, and then searching through the files in the restored directories to try to determine who the owners were. Fortunately, they were mainly CS students and had assignment program files with name/course/instructor/project, so we were easily able to set their name/uid/gid. We then generated new passwords for every account and then printed the list. The last thing we did was change the motd to explain that we'd had a hardware failure, that all of the data was restored but all of the passwords had been reset and could be obtained from the department secretary during normal hours.

  49. Anonymous Coward
    Anonymous Coward

    Checkpoint FW woes...

    New-ish tech was hired for his PIX experience, but we had CP FW's. So he proceeded to make changes to one FW config, then pushed it out - to all 5 FW's running from that console. Took the best part of the afternoon to sort that wee error out... And new-ish tech was then sent to CP admin class.

  50. Anonymous Coward
    FAIL

    Learn maths: 24 x 7 x 365

    24 x 7 x 365 provides seven times as much availability as there are hours in a year. Lack of thinking about things properly is the NUMBER ONE cause of problems.
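
    A back-of-envelope check, for anyone who wants to see the damage (just a quick hypothetical snippet, nothing more):

        hours_in_year = 24 * 365            # 8,760 -- what a year actually contains
        marketing_hours = 24 * 7 * 365      # 61,320 -- the "24 x 7 x 365" version
        print(marketing_hours / hours_in_year)   # 7.0 -- seven years of uptime promised per year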

    http://www.theregister.co.uk/Design/graphics/icons/comment/fail_32.png

    1. Frumious Bandersnatch
      Thumb Down

      24/7/365

      I was going to post something similar but checked before posting. I'd have given you an upvote except that you seem to think dragging the fail icon into the message box tags your post with an icon.

  51. Anonymous Coward
    Anonymous Coward

    Beware automation

    Outsourced support are testing a new desktop image, instead of pushing this to just the two test laptops via SCCM, they force a mandatory OS re-build on all PC's in the collection. Cue 350+ PC's starting to rebuild. Cue much running around pulling network cables and shutting down processes by in-house and outsourced support staff.

    In the end only 9 machines were wiped, luckily all the senior managers were at an off-site meeting that day....

    One week later, the same team pushed the wrong patch to all revenue-generating production servers.

  52. Anonymous Coward
    FAIL

    Bad things to do to tech stuff

    Co-location facility, thousands per month per rack for rental. A facilities power test shuts down the environmental control system - redundancy fails. It's discovered that instead of the cooling running at steady state till the control system comes back, it all shuts down. Hundreds of very warm client systems...

    Following a light rain shower at a site where I used to work I pointed out (in email) to the site manager that drainage at the front of the building was poor - given water had run up to the front doors of reception & the office area (including the server room) was all on one floor level. Was assured all would be fine, they were cleaning leaves out of the drain & it wouldn't be a problem. I suggested that having a drainage channel across the footpath to prevent water running from the road straight downhill to the front doors might be a good idea - was told not to worry about it. Suggested maybe a false floor to get the UPS & server (tower) off the floor in the server room would be good - told not to worry about it, the room has never flooded before. Flooded server room 2 months later, a pain. Having his reply emails to accompany my "I told you so", priceless. Company's response - drain channel & false floor installed.

    Different site for the same company was having a problem where part of their network would drop out - but only during the evening shift. Eventually discovered that a prior admin had saved the cost of installing a fiber optic run by using a switch as a network extender - approx 90m from server room to switch location, 80m from there to end user. Fine except for the fact that the switch was installed in the bottom of a chest of drawers in the security office. Cables run in by using a hole saw to drill through the back of the bottom drawer. One security guard used to unplug the power cord for the switch - because he couldn't see what it connected to in the locked drawer - to connect his portable TV for the night shift.

    Similar story from a person I worked with for a bit. Their server room was being revamped, new layout, new racks, new cooling. One corner was being closed off as a test / work area. 1RU pizza box server used as an Active Directory domain controller had to be left running whilst racks moved around, so was pulled from the rack live & leaned against a wall. Just happened to be the wall under construction. No one found it until about 5 years after it had been plastered inside the new wall.

    My company took over a smaller company & I was sent to investigate what would need to be done to migrate their users & data into our systems. Walked into the "server room" (aka closet behind the stairs) to find it about 35 degrees Celsius. Told by the office manager that their IT support guy came from Lebanon originally & didn't like being cold. Not surprisingly, both of their servers had drive failures when we shut them down & relocated them - glad we copied the data off them first.

    1. Anonymous Coward
      Anonymous Coward

      Water...

      So there was a problem with groundwater penetrating into the below-ground server room. The facilities people came and looked, then sent a brickie to build a trough at the foot of the wall. The trough had a pump and a floating switch to trigger it. Except that it pumped out through the wall to resume the cycle. The floor also sloped and the switch was at the uphill end, so it only started pumping when the water was lapping at the brim of the trough.

      Then the room next door started leaking through a crack in the wall. The initial solution was to use silicone mastic to build a little fence around the leak on the basement floor beneath a lifted floor tile. A security guard was detailed to sit in the server room overnight with a big roll of kitchen tissue to soak up the water. If it had risen enough to cross the 'silicone bund' it would have hit all the power strips for the servers.

      I took a walk around the outside of the building and discovered a large puddle lying up against the wall where we had the leak. Had a moan and got the facilities people to get a trench dug around the outside of the building to lead the water to the drain. The problem went away. As did the leak in the wall with the trough.

      And as the puddle drained we found a bent hammer that someone had used to try and pry the metal mesh off the server room window (it was in a rough area).

  53. Anonymous Coward
    Anonymous Coward

    And there's more ...

    Customer is complaining that his server is slow, then he finds he has a minor data-corruption problem. Turns out he'd had a dead drive in a raid 5 array for a long time but didn't want to pay to have it replaced. Then the machine had a problem and they kept restarting it - by turning it off and on again at the mains.

    Funnily enough, it ran a lot better when the drive was replaced - and it cost him more than it saved in our time to fix his data.

    And from a while ago - and not related to my current employer. Customer has storage array, which reports a bad drive. Service tech arrives with new drive and proceeds to start pulling each drive - pull drive, nope, not that one, shove drive back in. For some reason, this didn't do the array much good.

    And one of my customers ...

    Bought an Apple XServe. One day, one of their lads wanted to plug a USB device in so went across the front panel "popping out the covers" to find the USB ports. As he hit the second one, the machine stopped dead and started beeping - each of the "access covers" was in fact a drive caddy, and on popping the second one out it killed the array. I was able to force the "last one out" drive back online and let it rebuild the other, it seems to have survived.

  54. Anonymous Coward
    Anonymous Coward

    A/C for obvious reasons ...

    Recent one.

    Customer has a new office built, with a dedicated "server room" - only 4 servers, but it's a global operation. My preference was for passive fresh air cooling; round here it simply doesn't get too hot for servers if you can provide them with fresh air. Unfortunately, planning restrictions limit what they can do, and before I could discuss options (yes, they did have options) they'd decided on (= been sold) a split-system A/C.

    Roll forward 15 months, and one weekend our monitoring alerts that one of their servers is down. Coincidentally, it was a hot weekend. Yup, A/C had failed.

    But, I'd been onsite a few weeks earlier and we'd noticed the A/C wasn't functioning properly and alerted the customer. They'd done nothing until a server actually shut down with temperature - then they called in the A/C guys. The next bit you couldn't make up - the A/C technician told them the unit was working perfectly and it was their room that was "the wrong kind of room"! There then followed a week or two of messing around trying to blow air in with portable fans etc while the A/C tech refused to accept the A/C unit wasn't working. I'd offered to speak to the A/C people, but this didn't get taken up.

    Eventually, customer gives me name and number of an engineering person at the A/C outfit, I speak to him, he agrees that it doesn't sound like the system is working - and service call is placed. Different tech immediately accepts the unit isn't working, but spends all afternoon trying to find the fault. It turned out the reversing valve in the outdoor unit was sticking - they had to swap the outdoor unit with another one from the meeting room.

    Closer to home.

    Then in our own server room, one day we had the office centre manager come in "in a bit of a fluster" - "the building is about to burn down" sort of fluster you get from non-technical people when something electrical is getting a tad warm. It turns out that the "hot electrics" smell downstairs was the switch in the meter room that feeds our unit - and by now it was hot and a bit brown. This was Friday afternoon, and the electrician (quite correctly) said it couldn't wait. So what were we to do?

    Plan was - nip out to electrical wholesaler and get a length of cable, attach the 63A trailing socket we had to fit the genny input, and wire it to the board in the next unit. Switch load across, sparky could then swap the switch. Well, that's the theory!

    The distribution boards in the units are an old design and breakers are not an off-the-shelf item. I didn't have a spare 63A breaker, and we couldn't wait a week or two to get one. OK, we've a UPS that tests show should last about an hour - so get as much done as possible, knock the power off, whip the breaker out of the board, stick it in the other board, and just the live to connect - what could possibly go wrong? So there I am, just tightening the last screw when the hum is replaced by the sound of fans spinning down, and then silence, and then some utterances which I think meant something like "that's odd, it's not supposed to do that"! Yup, the UPS didn't have the capacity it claimed in tests - because the batteries were past their best.

    It turns out that when it does a battery test, it doesn't completely run the load off the batteries, but only partially transfers it. The batteries were fine under the (about 2/3) load applied during tests - but gave up quickly when presented with the 100% load.

  55. Scott 19

    The first 2

    The first 2 times my engineer clicked shut down instead of log off on a terminal I smiled grimly and then threatened to break his fingers if he did it a 3rd time. 2 1/2 years later and he's still not done it again.

    Forgetting to order the software that the customer has paid for and realising the day before, so instead of telling your team they downloaded a BitTorrent version and installed that on all the machines. The licence key generator gave it away.

    And generally, young upstarts who think you're just doing things the hard way to make more work for people, and that all the rules that are in place for IT hardware/software don't count for them - and then they do something stupid.

  56. Jedit Silver badge

    Number 9

    Didn't some fired sysadmin recently face criminal charges for refusing to turn over passwords to company machinery?

    My own tale of woe:

    "I turned on my PC and it went bang, now it won't turn on."

    "OK ... you pushed the button on the front and it went bang?"

    "No, the switch on the back."

    "The rocker switch with 0 and 1 on it?"

    "No, the sliding switch..."

    For those familiar only with modern power supplies, that would be the switch toggling the PSU between running at 115V and 230V. "Bang", indeed.

    1. Anonymous Coward
      Anonymous Coward

      Terry Childs

      The aforementioned sysadmin. Google/Wikipedia has details.

    2. Peter Simpson 1
      Mushroom

      That would be...

      ...the self-destruct switch. That's why it's recessed and needs to be operated with a tool.

      //idiot-proof design only serves to create smarter idiots

      1. Jedit Silver badge
        Boffin

        The voltage toggle switch...

        ... is recessed because it would only be operated BY a tool.

      2. T.a.f.T.
        Meh

        Finger is good enough tool

        I have done this before; I was surprised just how easy it was to toggle this switch when having a blind feel behind a machine. Fortunately, the fact that nothing came back on as expected alerted me that I may not have flipped just the black power switch, and I took a good look before powering things on.

        No tool needed, just a bit of a fumble in the dark.

      3. James Dore
        Trollface

        Self destruct switch

        ITYM "operated BY a tool"

    3. Anonymous Coward
      WTF?

      re: Number 9...

      My sister used to work in a Biology Lab, installed across several old houses neighboring the University (how do you expand a lab next to a University?). Cue complete rewiring of said homes so the lab stuff could run properly, including some Sparc kit, along with the Novell kit running mail and shared folders (old kit, but running OK). The whole thing had been properly beefed up so it could run several fridges (it is a biolab, anyway), several independent A/C units and whatever.

      Enter Mr. Mad Ron Sparky, rewires the whole place with 127V and 220V sockets side-by-side (you know where this is going)... and everything seems to be working just fine.

      Until someone turns off the lights by the end of day. I noticed that all the sockets would spark when the lights were turned off. Odd, really odd.

      Nobody knows how, but turning off the lights would change all the 127V sockets to 220V. WTFFF?

      Every PC still running an old PSU with a voltage selector switch would immediately fry or trip off. All the replaced PSUs were multi-voltage, and couldn't care less what voltage was feeding them.

      Nobody figured this out until she asked me about it while I was waiting for a lab test to finish so I could give her a lift home. It took a solid week of PC PSUs going belly-up and being replaced. It turned out the servers didn't fry because their UPS had fried on day 1 and was replaced by a multi-voltage compatible one too.

  57. Anonymous Coward
    Anonymous Coward

    I'm sure office politics is missing from the list

    It seems a common theme amongst the comments..... oh idiot users too.

    I was a new sysadmin (first IT job) at a smallish company (100+ users).

    Every day I would hear from the Design department how they had implemented the network, installed the servers, etc. And how dare I not give them the admin account passwords they would get me fired etc.... Not really bothered, security above idiots.

    A few days later the MD calls me in, asks about password for Design department. I pulled out my server log showing their internet abuse (Pron, gambling, etc) during work time (not lunch). Can't trust them with something like that, why should they have control of the network? Never heard about it again.

    Or explaining to the sales department that I can read anyone's emails from anywhere. I pulled up the remote server, browsed the user list and accessed my email on their machine.... cue lots of rapid clicking and tapping from 20 people listening in :D

    And the user who was logging on as someone else, stealing data, trying to get someone fired..... they just logged out, logged in as user B and then back in as themselves. Didn't even change machines.

  58. Anonymous Coward
    Facepalm

    Couple of fails,....

    1. Head Sysadmin says "the intruder alarms on the outside of the building only go off when both beams are broken, like this,..." Beep beep beep beep etc. "Uh Mike, where's the reset code for the alarm system?" Cue long day sifting through reams of paper to find the reset code.

    2. I implemented a split-mirror backup and 6 months later was asked to restore some data. Turned out I had been splitting the wrong mirror to be backed up. Cue frantic rejig of the scripts to put a timestamp of the epoch into a file, and rework the split scripts to check if the epoch file is within 15 minutes of the current date. Oooops.

  59. Anonymous Coward
    Linux

    Rebooting the wrong server over identical-looking SSH sessions?

    molly-guard is your friend.

    http://packages.debian.org/squeeze/molly-guard
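
    If you can't install the package, the idea is easy enough to roll yourself. A minimal sketch of the same trick (a hypothetical Python wrapper, not molly-guard itself): refuse to reboot unless the operator types the hostname of the box they're actually logged into.

        #!/usr/bin/env python3
        # Hypothetical stand-in for the molly-guard idea: make "reboot" demand
        # the hostname of this machine before it will actually do anything.
        import socket
        import subprocess
        import sys

        def guarded_reboot():
            actual = socket.gethostname()
            typed = input("Type the hostname of the machine to reboot: ").strip()
            if typed != actual:
                sys.exit(f"Good thing I asked - this box is {actual!r}, not {typed!r}. Doing nothing.")
            # Only reached when the operator really did mean this host (needs root).
            subprocess.run(["shutdown", "-r", "now"], check=True)

        if __name__ == "__main__":
            guarded_reboot()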

  60. Anonymous Coward
    Trollface

    "And one of my customers ..."

    "Bought an Apple XServe."

    <insert trolling here>

  61. This post has been deleted by its author

  62. Anonymous Coward
    Anonymous Coward

    A couple of oopses...

    Heard from a friend in Britain about a hospital where he worked that 'suddenly the network died' on one floor.

    When they checked the comms cupboard, the network switch was switched off!

    They switched it back on...

    A few days later it happened again.

    Eventually, they caught one of the nurses in the act of switching it off.

    He 'didn't like the noise'...

    ---

    Back in the days when I was relatively new on routers and bridges, and we still used 64Kbps lines, I began reading up on compression...

    Since we had UB networks equipment everywhere (UB Nile/Danube routers at the smaller offices, and an older 'network cabinet' with slot-in devices at the main office) I had the idea to switch on compression...

    That day I learned two things...

    1. Make absolutely certain that two 'similar' devices use the same compression techniques...

    2. Have a fallback plan for when it fails. There's a reason that some types of equipment have a 'revert to saved config in xx seconds' function...

    Also, talking a remote user through logging in to a router and changing the setting is a pain in the posterior...

    another, and more recent experience...

    We moved an office to a temp location. We installed a couple of Cisco switches, a UPS, a server, router, modem... Everything went into the rack with no problem.

    There was this annoying beeping sound, but as it seemed to come from amongst the equipment belonging to another tenant in the same building, we ignored it.

    Seems that the power sockets at the bottom of the rack weren't fed directly from the grid, but through a cute little UPS belonging to another tenant... Our server alone was enough to overload it... And with ALL our equipment hanging off our UPS, which in turn was hanging off that little toy...

    It all came down a few days later...

    LABEL EVERYTHING in a rack!

  63. perolsen

    Murphy rules ...

    A major telecom company in Denmark had a datacenter with a diesel generator as backup power. But the generator was getting too small, so a bigger one was to be installed.

    Before the new generator could be installed the old one had to be removed. It was pulled out of the building and loaded onto a truck. The truck then reversed and drove into the transformer box housing the main power line.

  64. Robert E A Harvey

    Datapoint

    Datapoint had a magic OS (Dos.h? dos.K?) which was agnostic about drive IDs. You defined a removable device in the form <volume label>filename.ext. During a payroll run, for example, you could knock two drives offline, spin them down, physically swap <employee details> and <timesheets>, spin up, online, and it would carry on processing from where it left off. Brilliant.

    At the end of each end-of-month run we would copy the result table to a backup device so we had two copies.

    Or at least we thought we did. I had tested both backing up and restoring and listing the two copies, etc., without noticing that I had defined them as <wages>EOM.dat and <wages>EOM_month.dat. <wages1> remained surprisingly empty when I needed it....

  65. Wize
    Thumb Up

    "frankenbreakers"

    Oh, the joys of old buildings with their fused neutrals and the like.

  66. Mark 110

    Power

    At a certain cable company I worked for a few years ago, every time networks did a generator test at a remote network node (generally done at 2am on a Sunday morning to minimise risk) we seemed to lose that node a few hours later just as our lucrative business customers were arriving at work. It seems engineers were buggering off home after the test without checking mains power was restored to everything, so 6 hours later, when the UPSs died, the site fell over, leaving whole cities with no internet.

    Sigh

  67. DanB

    Things you should not take for granted

    RAID 5 and you're safe - when a disc fails and the rebuild starts, you will soon learn that another disc has an unreadable sector. Bye bye. Use RAID 6 and monthly RAID checks (see the sketch at the end of this post).

    Stupid RAID controller. I've seen an IBM server that would remain stuck initializing RAID because the battery was dead (visibly swollen). Removing the battery saved the day.

    RAID controller software - some lets you do stupid things. Like reinserting a disc (with now-stale data) and putting it online without rebuilding. No warnings. Just the expected file system corruption.

    RAID array with the RAID configuration saved on each disk. Sounds great, no more configuration backups from the RAID BIOS to floppy. Except when a power glitch to the disk cage triggers some bug and corrupts all the disks. Bonus: because of previous RAID reshaping the array did not start at the beginning of the disks, requiring manual wizardry to recover.

    UPS. IBM-branded APC, batteries full. Except when I tripped the main breaker the UPS shut down immediately, signalling battery discharged. Subsequent tests and calibrations went perfectly fine.
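
    On the monthly checks: for Linux md software RAID (an assumption on my part - the boxes above are hardware controllers, which need their own patrol-read or scrub settings instead) a scrub can be kicked off from a monthly cron job with something like this rough sketch:

        #!/usr/bin/env python3
        # Rough sketch: start a consistency check ("scrub") on every idle Linux md
        # array by writing "check" to its sync_action file. Needs root.
        import glob

        def start_md_checks():
            for path in glob.glob("/sys/block/md*/md/sync_action"):
                with open(path) as f:
                    state = f.read().strip()
                if state != "idle":
                    continue                  # don't interrupt a rebuild or resync
                with open(path, "w") as f:
                    f.write("check\n")        # kick off the scrub on this array
                print(f"check started via {path}")

        if __name__ == "__main__":
            start_md_checks()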

  68. Linker3000

    HOW DEEP??

    Oh - I have sooo many!

    Best one was a firm that had a new business centre built and somehow between the architects and the builders, the computer room floor void jumped from its 4 inch deep spec to 4 FEET. How this was not spotted during construction I will never know. Special pedestals were built for the floor tiles and you really really didn't leave a tile up lest someone disappeared into the void, which - as it happened - was put into use to house a second level of servers; albeit they were pigs to work on.

    Big Red buttons right behind the light pull switch - check

    Cleaners unplugging things - check

    Computer room aircon that just didn't seem to work - check (endgame: when you box in an old room with new stud walls & dry lining, TURN OFF THE RADIATORS YOU'RE BOXING IN TOO)

  69. Anonymous Coward
    Anonymous Coward

    A couple I ran across...

    1) If you're installing a nice new server room in a (very) old building, it's good to spec all kinds of redundancy, high-quality racks and UPSs, extra air-con, and so on. It's also good to check, though, that the floor can support the weight. One joist cracked, the corner slumped, and everything in the room slid down to that corner and then fell through to the floor below, taking a few square yards of the server room floor with it. Fortunately, the server room was on the ground floor, so stuff fell into the basement rather than keeping going; it hadn't gone live or had any data loaded yet, and the company was well insured - which was just as well, because it was impossible to keep a straight face when you looked at the mess. Even the MD, having heard about the 'disaster', came down from head office to have a look - came in looking very severe, took one look, tried to hold his expression, and broke down in giggles.

    2) On a similar note, I once saw an installation where the server room was 'secured' against accident, because only authorised, highly trained personnel could gain access to the server room. Cue Big Johnny Fatbastard tripping on a cable outside the server room, and literally falling through the partition wall. Wiped out a rack, but escaped any serious injury. (And when I say wiped out a rack, I mean he pancaked the thing.) The servers were not so lucky. Not really the sysadmin's fault, although I did warn her of the possibility some time previously - but neither of us took it seriously.

  70. Anonymous Coward
    Flame

    Server Room Security

    My first day where I work was odd: the mail server disappeared from a remote office overnight even though cameras and door access were in place. No-one admitted not locking the doors, as obviously door access can't be considered security if it's just a reader and a magnet on the door.

    Never did find out what happened to it and why on earth they chose the P2 server instead of the shiny new P4 Xeon.

    1. Robert E A Harvey

      Ah, but that's theft.

      In the 1970s the firm I worked for built "robot" garages - just a forecourt and pole sign, no buildings. Each pump had a "note acceptor" next to it. I designed and commissioned the note acceptors.

      We built one on the A45, just after it stopped being the M45. I got a call one day that the site had gone. "Gone wrong? Gone haywire?" I asked. "No. Gone."

      I drove up there. The whole lot had gone. Pumps, note acceptors, pole sign. The only bit left was the tank underground, and they had emptied that.

  71. philbo
    FAIL

    Actually taking backups..

    ..is a good idea:

    In one London hospital with a system of mine, I was called back after about six years because a power surge had killed the PC, and when it came back up the database was corrupt.

    The PC had a tape drive, and the tape in that drive was the original one I had left there all those years ago. Checking the logs, the customer had not run a single backup since my first site visit. They lost all their data.

  72. Anonymous Coward
    Flame

    DR without planning

    Working for a big multinational manufacturing company, we had 2 UK sites. IT ran from the other site, but there were IT servers on both sites. We in the manufacturing systems side of things had servers in the IT server room and backup servers on site in another location.

    So disaster struck: a big site-wide fire, massive smoke damage to the IT server room, but not to our backup server location.

    IT came up from their site, said all would be fine, took their kit back with them and plugged it into their server room. We got our kit washed and cleansed by the recovery company; they did not. We carried on running on the backup servers, got the primary kit back in a week and carried on as if nothing had happened.

    They polluted their server room with contaminated smoke in the kit they had 'rescued'. All that kit failed in a couple of days, lots of other items in their server room also failed due to corrosive smoke damage thrown out of the fans from the rescued kit!

    Recovery is possible as long as you plan for the recovery activity in your DRP.

    1. Anonymous Coward
      Anonymous Coward

      My DR plan

      Working for major company during Y2k.

      They had to come up with a plan in case one of their centres lost all power in the new year. Even though the centres had backup generators we had to assume all power had failed.

      I jokingly suggested they have lorries standing by and in the event of power failing they load any mission critical servers into the lorries, drive them to a working centre and power them up (with appropriate changes). On Jan 31st 1999 they had lorries standing by.....

      1. Tel
        WTF?

        erm?

        *JANUARY* 31st?

        Surely that would have been a bit of a long wait for the lorries? Then again, this does rather seem to be appropriate for management fail...

  73. J.G.Harston Silver badge
    FAIL

    Space under tables are for FEET!

    Set up an office for a small printing/design firm. Everything neatly wired along cable lines along the walls, so no trip hazards. Put the PCs on the desks for easy access for incoming disks, USB sticks, cameras, memory cards, other data devices, etc.

    Went along a few weeks later to do some work of my own, and all the computer kit - along with boxes of paper and crap - had been put under the desks. Ah, I suppose everybody that works here has no legs and four-foot-long arms. Where the hell do you put your feet when trying to work? I presume everybody has to sit side-saddle. Multiple instances of flailing feet hitting shutdown buttons and getting tangled in under-table power cables. People have to grovel under tables to plug in USB devices to get photos off cameras etc.

  74. Anonymous Coward
    Facepalm

    Aircon causing problems

    More servers go into the server room. "We will need more air con installing," say us. "Just turn the temperature down," say the beancounters and facilities, "it will be fine."

    Cue hot day, aircons working overtime, pipes freeze due to being too cold, air con stops working, servers overheat and switch themselves off. Happened 4 times before we could convince them to up the capacity.

    Similar thing also used to happen when it was cold, no insulation on the pipes outside, they froze, aircon stops, server room overheats again!

  75. Anonymous Coward
    FAIL

    Planning :)

    Anonymous, because I'm sure someone else will recognise this.

    About 4 years ago, purpose built new head office.

    With centralised comms & server room

    Moved all the kit from the old head office.

    Everything working OK.

    Then there's a power cut.

    No problem, comms & servers onto backup gens.

    Critical desktops still powered up.

    Then someone noticed that the air con in the comms & server room isn't running any more, and the temp is rapidly rising.

    Cue half of IT support dashing about, finding fans to cool servers, while the other half started shutdowns.

    Not sure if the extra gen to run the air con was ever installed :)

    1. Anonymous Coward
      Anonymous Coward

      could be worse

      Something similar happened at a previous employer: underground "high security" bunker with airlock access, UPS and diesel backup etc, housing operations staff as well as a small server room in the corner with all the servers in. Weekly tests shut off the power for 15 minutes to test the UPS, then fired up the diesels for 30 minutes, then back to mains power - all works fine.

      Actual power cut happens, and the redundant aircon units all shut off. It soon starts to get a bit too hot, so the shift supervisor authorises everyone to call the DR providers for their respective systems to take over, then go wait in the car park - at which point they find another problem: the door controls also don't work. Aircon and doors were wired in to the (unprotected) office building above, not the bunker's protected power supply (and the doors are fail-secure)... Luckily it happened during the day, so someone with the manual override key for the airlock doors was in the building upstairs, otherwise it could have been much worse!

  76. Anonymous Coward
    Anonymous Coward

    The good, the bad and the ugly...

    The good: Many years ago now I did my sandwich year working for a government department, mostly Cobol on VME/B. :-) I got on so well with the team, they invited me back for the Christmas party after I had gone back to University. I arrived lunchtime expecting to start early down the pub but was quickly accosted. They had deleted an entire directory of source code some months previously and wondered if I had saved a copy anywhere? No, but the automatic backups they weren't aware of had. Two minutes at the command line and their files were back. A good party was had by all. :-)

    The bad: not me personally, thank God, but as part of an upgrade weekend for a large Government department, the firewall management software was being upgraded. The technician has idiot-proof, detailed step-by-step instructions. One step is to set a new admin password. Password policy is very strong so this has been pre-generated for him and consists of a 100-plus-character, entirely random alphanumeric password. All he has to do is type it in. He duly does and the software asks him to confirm. He types the password in again and of course it is different. Try again. Different again. Try a third time. Different again. The software decides that it will fail 'safe' by wiping the config on *all* of the firewalls it controls. Result: THIRTY THOUSAND users can no longer use their PCs. It took four days to fix. The service penalty payments went into millions. Oops. Lesson learned: next time have two people - one to read out the letters, the other to type them in. (But one does have to wonder WhoTF thought that wiping firewall configs was appropriate 'default' behaviour as opposed to leaving them unchanged.)

    The ugly: same Government department. A new system which is not too significant at the moment but is rapidly gaining use and will soon be a lynchpin component has no contractual DR requirement. So if the data centre fails, our DR process consists of starting a contract negotiation. ;-)

  77. Anonymous Coward
    Anonymous Coward

    The RAM disappeared? ! MAGIC !

    Mid-90s, the sales department found that the RAM from their Sparc Stations also worked in their home Pentium computers. For several months we'd get a call telling us that their Sparc Station "just stopped working." We eventually found that the RAM had magically disappeared. Everyone on the Sales team is "stunned."

  78. Anonymous Coward
    Unhappy

    10$mil accidental reboot

    A junior sysadmin was ssh'd in to the ONLY directory server for a 3000+ user office. From the directory server, ssh'd in to other systems (other systems only allowed remote access from the directory server). Not paying very close attention to which system he was on at the time...shutdown -r now...oops. The President of the company comes down from the 10th floor to ask why THE ENTIRE COMPANY was suddenly kicked out of their Unix systems and to remind us that their burn rate is 1 million per minute. The receptionist announces over the PA system to a building with 10-floors filled with people that "the system" would be back up within 5 minutes. 3000+ people all decide to walk around and chat...many going outside. After 5 minutes many people are outside...it looks like a fire drill. Once the directory server comes back online few people notice and continue to loiter and chat. The PA announcement isn't heard. The IT team had to quickly cover great distances outside and in, to tell people nicely that 'the system is back up'. (get back to work ya lazy Hobo's!). It took about 10 minutes to restore productivity.

    I.T. later had a pizza party to commemorate the most colossal non-destructive goof up to date.

  79. Steve Brooks

    Logic Watson, logic

    Most failures are failures of logic. For instance, I had a customer who had two computers and two email addresses. He wanted the same emails to arrive on both computers, so he set up a rule on one computer to send a copy of all incoming emails to the other email address and, yes you guessed it, the same rule on the other computer. The emails ran on an endless loop all day before I managed to get there, and I had to struggle not to burst out laughing as thousands of emails ran merrily in a circle between email addresses. Even when I explained it to him he couldn't understand the problem, so I just fixed it, walked out and waited for the next call.

    The same applies to most big and small disasters. One company's secretary turned on the computer every morning and looked at the screen that said "your last backup was done xxx days ago, would you like to do a backup now?", then promptly clicked no and ignored it. Yes, they did have an automated backup system, but it does actually rely on the backup system being powered up and plugged into the server. They had moved computers months ago and unplugged the backup system, plugged it back into the wrong computer - they must have wondered what that extra power cord was for - then tucked it away into a drawer and forgot about it. I just happened to be there on a routine service when I spotted the message on the server screen that said, "your last backup was done 265 days ago.....etc". The software knew there was a problem because the backups weren't happening, oh well.

  80. Anonymous Coward
    Anonymous Coward

    Human errors are still the best.

    Duplicate root users... What's this strange user "toor"? Security policy dictates only one uid 0.

    Remove user. Hmm, this is taking a while. Strange errors start to appear. What was toor's home directory? Oh. (We were just lucky this was in the backup data centre.)

    Then there was the rogue DHCP server... a wireless router a manager had brought in and then refused to fess up to, even though the network admin could trace the MAC address through the switches to the port under his desk... Couldn't tell him off for political reasons. Bah.

    At my first job: pulled out wrong disk while replacing a failed RAID5 member. Spent 5 hours restoring from backup, which luckily I had actually set up properly, so no data was lost.

    Firewall firmware upgrades... don't start me on firewall upgrades... to the list of Great Lies such as "this won't hurt a bit" you can safely add "this upgrade will be transparent" and "maybe just a blip in the network". (Just like storage arrays, basically.)

  81. WFW
    FAIL

    Tales of Woe

    Years ago, we had shiny new ISDN phones. The UPS was even configured to call various sysadmins if the power shut down.

    Except that the ISDN phone system wasn't on the UPS...

  82. TeeCee Gold badge
    Facepalm

    Outage by routine....

    A mate worked as operator on an IBM midrange site (System / 38). Small site, one machine, half a dozen Programmers / Analysts and an IT manager. Mostly bespoke software. One of his daily tasks was to unload the report stacks from the two band printers and split / stack them for distribution.

    One day he took a day off. Other staff are busy, so it falls to The Boss to perform the donkey work, including sorting out the reports. He's noticed that Joe has a very efficient way of doing this. Behind the two printers is a nice, wide windowledge with a smooth granite surface, so the drill is: Flick, Tear, Remove, Stack, followed by sliding the stack of reports down along the windowledge and starting a new stack. End result; a set of nice, neat report stacks along the ledge.

    Joe knows one *very* important piece of information which he does not. How high the first stack may be built before it hits the EPO push button at the end of the ledge..........

  83. Anonymous Coward
    FAIL

    The Generation Game.

    I may have told this one before here, if so, apologies.

    Nice shiny new Data Centre with full online building UPS and a diesel genny out back. Genny is tested weekly with a full load cutover test every month. That should cover it, right? So when the power actually *did* fail, do you think we could get the sodding thing to start? The looks on the faces of those trying as the UPS Death Clock counted down was a picture though.

    Anyhow, single point of failure lesson is learned, the arse-covering budget gets a caning and some jiggery-pokery with planning permission later, the building loses a car parking space and sprouts a backup generator for the backup generator.

    Now, when the power *next* failed, what do you think happened to the power switchgear specced for one genny when it got two across it simultaneously........?

    1. Anonymous Coward
      WTF?

      Nasty switchgear

      As support for a large retail chain, I'm paged to a store at opening time because there's no power in the server room and all retail systems are down. The rest of the store is running right along. I check, all the protected outlets are dead; the help desk has talked management through moving all the UPS to standard power. An electrician is on the way, I hang around until he gets there. He takes a quick look, says "I've seen this before." and walks straight to the power room, where he gives the switch solenoid a sharp tap with a screwdriver. It shifts, everything comes up, and I spend the next half-hour moving power connectors back to where they belong.

      Turns out there had been a power outage overnight. The generator came up and the switchgear switched. Then, when main power was restored, the generator shut down, but the solenoid didn't release. It happened again about a week later; now there's a new solenoid.

  84. Destroy All Monsters Silver badge
    Devil

    Did I tell you about the one...

    ...where a new Main Circuit Breaker was installed but the closet wouldn't close any longer [the lever thingy being too large]. So the closet is left open for the time being.

    Some guys are shown around the Big IBM Iron Computer Room. It's hot, and one genius thinks it's a good idea to hang his coat onto the big red lever there.

    The rest is history.

  85. Anteaus
    Devil

    Beware the 'smart wizard' ....

    One of the worst snafus I've seen was with database software which had one of those 'wizards' which kicks in every time a new user logs on. The wizard asks the user where to store the data, and defaults to a location on the C: drive.

    So, one morning the cleaner snags the LAN cable and rips the RJ45 off. The user, a data-entry clerk, logs on locally and is greeted by this wizard because the network data isn't accessible. User answers, 'Yes, Yes, Yes, blah-de-blah' and continues entering data (company registrations) as usual.

    Later that day an IT guy calls and fixes the LAN cable. Pings server..OK. Asks user if anything else needs looking-at. Told no. Goes on his way, job done.

    User continues entering data for several months.

    Alarm is finally raised when one of the accounts team notices that certain companies' data hasn't been updated for recent corporation-tax changes. They contact the data-entry clerk, who dutifully re-enters the data. Only to be told by Accounts that she hasn't entered anything, the data they're seeing is unchanged.

    At this point I get called-in to investigate, and discover a cache of data in the user's local "Application Data" folder going back several months. Merging this data with the proper, network database involves special code being written by the software authors, with four-figure costs.

    Meanwhile, at the company's behest, I set up a test platform to find out exactly how this snafu arose, and discover that any LAN interruption, however brief, will trigger this 'wizard' into instructing the user to permanently reset the data location to C:. This will happen even if the user has limited rights.

    I report the design fault in this 'wizard' to the software authors as a critical bug.

    Several versions later, that damn wizard is still there, still as stupid.

    Moral: Where data security is concerned, 'smart' software is your worst enemy.

  86. Anonymous Coward
    Flame

    Air con failures...

    Came into work a few months back to see all doors to our server room open, with practically every fan in the building press-ganged into keeping the servers cool enough to keep running, and the on-site network guys scurrying around trying to find more fans.

    It worked for long enough for the aircon to get fixed.

    Icon because toasty, toasty servers.

  87. Anonymous Coward
    Anonymous Coward

    Heard of so many....

    1) Somebody plugging a KVM into a Solaris server, but only the keyboard worked. Go to the KVM to use a different machine, see the black screen and assume Windows has crashed -> CTRL-ALT-DEL -> critical Solaris server reboots.

    2) Calling BT out for a failed NTE after a power-down. Only when they arrive do we realise it was plugged into a different power bank which wasn't switched on.

    3) Customer required some security software on all machines to prevent unauthorised software and hardware running. It worked off whitelists provided by a seriously overloaded server. Despite warnings the customer wouldn't upgrade it. One day the server throws a hissy fit and provides corrupted whitelists which blocked NT.DLL, preventing login on all machines which had updated the whitelist.....

  88. jwkinneyjr

    Staff Efficiency

    Many years ago in a universe far far away, I was the IT managing partner in a small architectural firm. We used a Wang minicomputer (remember those?) for bookkeeping duties that was generally managed and maintained by an outside service.

    One day the service technician walked into my office with a face the color of fresh clean photocopier paper. He explained that he had made an error in upgrading some routine and the Wang was not working. He mounted the supposedly current backup (the techs always called ahead and asked the staff to do a fresh backup before they came over to make changes); however, the newest backup was showing 8-week-old data. These backups were done on 8-inch floppy disks that were verified at the end of each set. We had 4 disk sets in rotation, all properly numbered, verified and dated. Sure enough, all the current sets, including the one dated earlier in the day, had 8-week-old data.

    So I asked the bookkeeper to show us how she had done the backup. Instead of going to the regular backup screen, she pressed a combination of keys which flashed a warning signal, "For computer technician use ONLY. Bookkeeping personnel DO NOT USE." Near the center of the screen was an option titled "Add updates to current backup." The bookkeeper said, "I've been using this update for the past couple of months. It's a lot faster than the regular one. I don't know why they don't want us to use it."

    The technician turned even whiter, checked the backups again and sure enough. The backup record showed that it had been run daily for the past two months, dutifully noting that there had been no system updates during that time. The technician finally asked the bookkeeper to do paper printouts of all the reports on the system, which the bookkeeper grudgingly agreed to do. Later that day, the technician walked out with two boxes of printout and the hard drive from the system. We hired some help and switched to manual bookkeeping (payroll that day as I recall).

    A week later the service boss called and said they were unable to reconstruct the data from the hard drive in a meaningful way. He offered to loan us an identical computer so we could reconstruct the system by entering the data by hand from the paper reports, and proposed that we share the cost. I finally convinced my senior partners that this was a fair arrangement. The service also modified the Wang DOS so that updating the system would never overwrite valid data. We were back in operation in about 3 weeks.

    We eventually outgrew the Wang and switched to a miniPDP, but that's another near-tragedy for another time.

    John

  89. Bruno Girin

    Power failure

    Power failure in the building, go up to the sales department and find out that none of the sales staff computers are connected to the UPS backed sockets but the Christmas tree is: nice blinking lights!

  90. Slackness
    FAIL

    Or..

    Who bought a bloody great UPS... with no batteries in the early 80's and never tested it.

    Then one day.....

  91. Justin Bennett
    Thumb Up

    Re: Make sure the backup is going where you think it is going.

    Put scripting in to estimate how much data you're going to back up, then query the backup solution to find out how much (file count / overall size) it actually backed up...

    Then get that pumping into a DB, with alerting if a run backs up something not consistent with history.

    Oh, and do restores - just to check; they're more important than the backups :)
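
    A minimal sketch of that idea in shell - the summary file, the 10% threshold and the alert address are all invented for illustration; real backup products expose their run reports in their own ways, and you'd likely push the numbers into a DB rather than just mailing them:

    #!/bin/sh
    # Compare what we *expect* to back up with what last night's job
    # *claims* it backed up, and shout if the two drift apart.
    SRC=/data                                   # tree we intend to back up
    EXPECTED_KB=$(du -sk "$SRC" | awk '{print $1}')
    EXPECTED_FILES=$(find "$SRC" -type f | wc -l)

    # Hypothetical summary written by the backup job: "<kilobytes> <filecount>"
    read ACTUAL_KB ACTUAL_FILES < /var/log/backup/last_run_summary

    DIFF_KB=$((EXPECTED_KB - ACTUAL_KB))
    [ "$DIFF_KB" -lt 0 ] && DIFF_KB=$((-DIFF_KB))

    # Alert on more than 10% size drift, or any difference in file count.
    if [ $((DIFF_KB * 10)) -gt "$EXPECTED_KB" ] || [ "$EXPECTED_FILES" -ne "$ACTUAL_FILES" ]; then
        echo "Expected ${EXPECTED_KB}KB/${EXPECTED_FILES} files, got ${ACTUAL_KB}KB/${ACTUAL_FILES}" \
            | mail -s "Backup sanity check FAILED" ops@example.com
    fi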

  92. Vic

    Testing your Makefiles...

    This one happened just a few weeks ago.

    I was on-site at a customer. I accidentally overheard a conversation.

    The group in question had a makefile which would take a project name and apply the make to that - so you'd do something along the lines of "make PROJ=foo all". In this situation, the user had typed "make PROJ=foo clean".

    Except he hadn't.

    He'd typed "make PROJ= foo clean". And that spurious space meant that PROJ was defined as null within the makefile.

    Did the makefile do something sensible with null values? Did it buggery. It merrily deleted everything from the current directory down - including all the user's source code. Which wasn't checked in. Or backed up. And hadn't been for nearly a fortnight.
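
    A guard worth having against that class of accident - a sketch only, with invented names, but the shell's ${VAR:?message} expansion aborts the command when the variable is unset or empty, and it works just as well inside a make recipe:

    #!/bin/sh
    # Hypothetical stand-in for the dangerous bit of such a clean rule.
    # If PROJ is empty, ${PROJ:?...} makes the shell bail out with the
    # message instead of quietly expanding to nothing and turning
    # "rm -rf build/foo" into "rm -rf build/" (or worse).
    PROJ="$1"
    rm -rf "build/${PROJ:?PROJ must not be empty}"

    # The same guard inside a Makefile recipe (note the doubled dollar):
    #   clean:
    #           rm -rf "build/$${PROJ:?PROJ must not be empty}"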

    I earnt many brownie points for pulling out a copy of foremost :-)

    Vic.

  93. Anonymous Coward
    Mushroom

    This is starting to sound familiar...

    The first time our computer room UPS failed to operate during a power cut, we called out the service engineer, who dutifully arrived on site and pressed the "ON" button... Oops. It had been sitting there for nearly a year doing absolutely nothing!

    The second time it failed when one of the large capacitors on the rectifier went bang - quite literally. I still remember vividly, talking to the UPS engineer over the phone and asking him to confirm, again, that it would be ok to flick the breakers to put the thing into bypass mode, whilst smoke was still pouring out from the top...

  94. Tech monkey
    Mushroom

    Not the cleaner for once.

    Way way back at the Uni we had a cleaner unplugging stuff on a regular basis.

    So one day when a server was down I made my way over to the dingy "server room" to plug it back in.

    Opened the door, saw a rack of half-melted servers, slammed the door and called the fire department.

    Turned out a holding tank of sulfuric acid on the floor above had sprung a leak.

  95. Anonymous Coward
    Happy

    /me sniggers at jwkinneyjr

    regarding Wangs not working properly and being too small.

  96. Anonymous Coward
    Facepalm

    Virtualization blues

    Good idea: converting a dozen separate servers to virtual containers on OpenVZ, freeing up a whole rack.

    Better Idea: having a second OpenVZ that mounts/runs the containers from the shared NAS so you can shove containers from one to the other in seconds.

    Bad idea: running debian updates on the primary OpenVZ host without testing them on the secondary first.

    Result: server finds no OS at next reboot. Cue failover of critical containers to secondary, (yay!) and all-nighter trying to fix primary (woe!), with every step seeming to just make things worse, until it ends in a clean reinstall of the OpenVZ host.

    Cause? Grub update silently changed the device.map

    Cunning idea: writing a script that will run the same command on all the virtual containers you list on the command line.

    One careless moment later: renaming the script to 'clusterfuck.sh' so anyone trying to use it knows what they're risking.
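
    For the curious, a rough sketch of that kind of helper, assuming OpenVZ's vzctl exec interface - the real script's contents are unknown, so treat the confirmation prompt and everything else here as invented:

    #!/bin/sh
    # clusterfuck.sh - run one command in every container ID given.
    # Usage: ./clusterfuck.sh "apt-get update" 101 102 103
    # The name is the warning; the prompt is the seatbelt.
    CMD="$1"; shift

    echo "About to run '$CMD' in containers: $*"
    printf "Type yes to continue: "
    read answer
    [ "$answer" = "yes" ] || exit 1

    for ctid in "$@"; do
        echo "== container $ctid =="
        vzctl exec "$ctid" "$CMD" || echo "WARNING: command failed in $ctid" >&2
    done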

    Also: NAS (Qnap in this case) firmware updates are evil.

  97. scott 30

    yup

    Tales from the Crypt of 15 years in the IT world.

    1. A certain global company had 3 operations centres. I worked in the EMEA one. The server room had lovely UPS rated for a couple of hours and was hooked up to an external generator.

    When the local utility company sawed through the mains for a laugh, everything went A-OK. When it became obvious the generator would be needed, it was spooled up, but it required someone to be physically in the server room to let the juice from the generator in (a security feature, apparently). Except the only way into the server room was via a badge reader. Yup, the lock was electric and not hooked to the UPS. The only key holder for the physical lock was 10 hours' flight away... with the key. Cue panic and IT managers running for the fire axes.

    2. The same global company later outsourced half the ops work to a famous Indian outsourcing brigade. The lovely IT campus on the East coast of the subcontinent had survived the tsunami a few months earlier, but the whole area was regularly flooded. I was there during one such flood. The servers were safely tucked away at least 5m above the highest theoretical tsunami/flood mark. Except the emergency generators were... in the basement! Cue a small army of locals shifting sandbags, running bilge pumps and chucking buckets by hand.

    3. The same company had operations in just about every landmass with more than a couple of thousand inhabitants. The SAP rollout required as much data as could be sucked in - meaning IT infra in places served by meagre comms and lacking anyone vaguely qualified in IT. The best they could be expected to do was wire a plug. One local expert did well to physically build the server and attach the storage. It had been shipped imaged - already in the domain etc. - then stripped down for the journey. The build process required a test of the RAID array, which is where it all went horribly wrong. Cue a 3-day journey for the 3rd-level support guru by train/plane/boat to discover the poor local had thought that testing the hot-swap meant taking out all the disks sequentially (at the same time) and putting them back.

    4. A certain behemoth of an IT company I worked for. We needed to move an NT server that had become mission critical from an office space to the nuke-proof data centre at the end of the corridor (it genuinely did have real 1960s blast doors, sprung floors, Faraday cage etc). Months of planning, contingencies coming out our @rses. One thing we didn't think of was that the machine hadn't been cold-booted in about 4 years. Click, click, click. Oh dear... processor dead. In those days you couldn't just whack the disks in another box; it had to be exactly the same hardware.

    I could go on and on....

    1. paulf
      Coffee/keyboard

      Classic

      "Yup, the lock was electric and not hooked to the UPS. The only key holder for the physical lock was 10 hours flight away..with the key. Queue panic and IT managers running for the fire axes."

      Running for the fire axes? You owe me a new keyboard - excellent story.

  98. Yet Another Anonymous coward Silver badge

    The Snow bomb

    Great story on the daily WTF about a US university that cut into an unused elevator shaft to lower some big iron into the basement machine room. They then covered over the shaft at the bottom - not the top.

    After a winter of heavy snowfall came the spring thaw - and a 5 storey plug of snow and ice melted into the machine room. Quite impressive tsunami apparently.

  99. Anonymous Coward
    Flame

    Air con woes

    And in the same room as the exploding UPS we had some spectacular air conditioning failures...

    Most of these could be attributed to the senior management, who insisted on going with the cheapest quote when upgrading the air con to cope with the increase in the number of servers.

    Despite being told, time and time again that you need proper computer room air con, they insisted on going with the cheapest office comfort cooling units they could find.

    Some of the highlights..

    1. The external air con units vented out into the warehouse rather than the open air. They quite regularly got turned off by the night shift in the warehouse because they didn't like having warm air being blown onto them - until we put padlocks on the isolators !

    2. One of the "new" air con units to increase the cooling capacity had a faulty condensate pump. The resulting "overflow" created rather a large puddle under the floor, where all the power outlets happened to be. Fortunately it didn't get too deep to trip anything out !

    3. Air con units designed for office environments don't like being run at full pelt 24 x 7, and they tend to shut themselves down when they get too hot....(which somewhat defeats the point of them being there) The ensuing "melt down" as all six units decided that they'd got too hot to function was quite interesting to say the least. I still remember nearly burning my hand on one of the server racks ! It took nearly a day to get some of the servers cool enough to work again.

    Needless to say the "I told you so" fell on deaf ears.

  100. Locky
    FAIL

    RDP chain

    Note to self - When RDPing from one server to another, check, double check and triple check which machine you're on before clicking Shutdown.

    I thought I was on Cape Town's backup server; evidently it was Frankfurt's production one...
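
    The story above is Windows and RDP, but the same cheap safeguard translates anywhere - a sketch (nothing from the original) of a wrapper that makes you type the hostname back before it will power the box off:

    #!/bin/sh
    # safe-shutdown: refuse to halt unless the operator types this
    # machine's hostname back - catches the "wrong remote window" mistake.
    HOST=$(hostname)
    printf "You are on '%s'. Type the hostname to confirm shutdown: " "$HOST"
    read confirm
    if [ "$confirm" = "$HOST" ]; then
        shutdown -h now
    else
        echo "Hostname mismatch - not shutting down." >&2
        exit 1
    fi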

  101. Anonymous Coward
    Anonymous Coward

    Brand new access card system...

    ... all controlled by a server behind a secured door.

    You guessed it, the server failed. The on-site fire fighters soon chopped through the door with an axe.

  102. Anonymous Coward
    Anonymous Coward

    Do talk things through with colleagues, don't blindly trust automated protection

    In the past I managed, with the assistance of a friendly UNIX admin, to set up a mail loop between an Exchange system and the OfficePower setup it was replacing. Each of us had taken it upon ourselves to set up mail synch, pushing new mails out to the other system - both worked very well.

    Well, up to a certain volume of mail.

    We were on our fourth lunchtime pint when someone arrived to drag us out of the Cross Keys.

    -

    On a similar theme, the same organisation trusted mail loop detection enough to allow some senior staff to set up automatic forwarding to external mail services.

    Sadly a "This mailbox is full" auto-response from the automated remote system admin was considered a new mail and not part of a loop so when the then-stingy webmail box was full it merrily informed the sending system of the fact which dutifully forwarded the information on . . .

  103. Zippy the Pinhead

    I've seen a few of these as well

    Every 3rd business day, one of the ISDN sites for a company I used to work for would lose all customer connectivity. Turns out the cleaning crew had access to the comms room and would unplug the Remote Access Server to plug in the vacuum. Why you need a vacuum instead of a duster is beyond me...

    Also, every Friday at 3pm for the same company, in the building we had just built, customer connection service levels would plummet but network traffic would soar... Seems there was a large group of people who would fire up a nice rousing game of Duke Nukem and kill the network.

  104. Smoking Man
    Facepalm

    So you're root? Unfortunately, yes.

    Let's make a small contest here, what went wrong:

    I happened to be an outsourcing slave for a while. The job included fixing everything that the customer, HAVING ROOT ACCESS, did to the system(s).

    Once in a while, one system shows really strange and stupid behaviour. On a Sunday, naturally. Buddy calls to ask if I can take a look. No network access. Drive to the office to look at the serial console. OK, looks like the whole OS is gone?!?!? Must be the boot disk, then. Not mirrored, 'cause a 2nd disk costs real money, doesn't it? OK, fix the OS; thank god the SAP-Oracle database is still there. Next day, same issue. Machine looks and feels strange. OS is gone again. Fix that da*** disk and call the bloody hardware vendor's support. They swear that the disk is physically fine. Next day, you guessed it, OS is gone again. Fix/restore from backup. On the 4th day, same problem, I do a fresh install of the OS to a brand new disk. From now on, the system runs like a charm. "See, it must have been the disk!!!" No.

    2 weeks later a colleague gives me a call: "Hey, problems on that system start over again, OS is gone!"

    Me: "What changed???"

    Him: "Mr. <customer root user> asked me to enable his shell-script via cron again."

    Me: "So let's take a closer look at that script now..."

    #!/bin/sh

    cd /<somedirectory>

    find . -mtime +30 -exec rm {} \;

    #EOF

    Hint 1: What happens when /<somedirectory> doesn't exist ???

    Hint 2: What's (quite often) the home directory of root?
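
    For completeness, a minimal defensive version of that script (the directory name is kept as the original's placeholder): abort when the cd fails, instead of letting find run loose from wherever cron started the job - which for root is usually / or /root:

    #!/bin/sh
    # Safer cleanup job: bail out if the target directory is missing,
    # and only ever delete plain files under it.
    DIR=/<somedirectory>            # placeholder, as in the original

    cd "$DIR" || exit 1             # the missing guard
    find . -mtime +30 -type f -exec rm {} \;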

    1. Gabor Laszlo
      FAIL

      root

      One time (and never again) I let myself be persuaded to give root to a developer on a Solaris 8 box. Next thing you know, really weird failures: you can ssh in, but a lot of the system commands throw errors or just fail outright - not all of them, though. Some digging revealed that everything under /usr/lib was gone. Solaris 8 system utils were mostly dynamically linked (which is a design fail in itself - try to fsck /usr after a power failure and you'll see why).

      1. Vic

        Re: root

        > I let myself be persuaded to give root to a developer

        I once had a wonderful argument with a customer.

        He owned the box. I commissioned it. He *insisted* on having the root password.

        I took the password to the office in a sealed envelope and got them to put it in the fire safe. I told them in no uncertain terms that if I saw a login from any address but mine, there would be no warranty whatsoever, and all repairs would be chargeable at full cost.

        Said punter was apoplectic, claiming that he should have the ability to do whatever he liked with his own machine. "You do", I replied, "but I'm not responsible for picking up the pieces". The rest of the staff generally decided to stick to my recommended way of working, and the root password has still not been touched.

        ...Which is good. A few weeks after this bust-up, I installed Mediawiki for them. The same guy had another fit when I refused, without written authorisation, to configure it to allow PHP script on the pages to be executed... :-)

        Vic.

  105. keylevel Silver badge

    Before you go, can you just...

    One of the IT systems admins got asked to do a 'quick' job on a day when he really, really needed to leave on time.

    Five minutes after he had to go, the job was done.

    shutdown -h now

    "That's strange. My system doesn't seem to want to shut down," he says, shortly before all the phones start ringing.

    "Remote connection closed" appears on the screen...

  106. Ron Christian

    sometimes it takes a little detective work

    Some of the stories here remind me of my first job as a sysadmin, back in the days of very low density removable disk pack drives. Space was so tight that most of my time was spent trying to find stuff to delete. Once I figured I could save some space in the root partition by stripping the kernel of its symbol table. That was a long night.

    In this same job, once a month, I'd come in Thursday morning and see a bunch of new disk errors. Head crashes happened now and then (disks weren't very reliable back then) but the fact that the great majority of them happened on a Wednesday was bizarre.

    So one Wednesday I camped out in the computer room. Just before midnight, a pair of janitors came in with a floor buffing machine and proceeded to buff the raised floor of the computer room, banging THUMP.... THUMP... THUMP into the disk racks as they worked down the rows. I chased them out, and next day met with the building manager, and COULD NOT make him understand that (a) the raised floor did not need to be BUFFED, and (b) the act of slamming the buffing machine into the disk racks was causing thousands of dollars in damage.

    Finally I locked out the janitors by changing the combination on the push button door lock. Since I wasn't given the "reset" key to this lock, I had to take it apart, figure out how it worked, and change the combination with a screwdriver. The building manager complained to my boss that the janitors couldn't get into the computer room. My boss told him to get stuffed.

    What I did not understand at the time was that the floor *did* need to be dusted occasionally. Then one day the halon alarm went off, fortunately when one of my operators was in the computer room. I ran downstairs and found him looking panicked as he held down the "abort" button. Dust had gotten into the smoke detectors under the floor. We were much more tidy after that.

  107. Getter lvl70 Druid
    Coat

    When setting up BGP for an ISP

    Always remember: Copy Start Run

    That is all.

  108. Scott 9
    Facepalm

    Ah, the memories

    Over 15 years ago but..... being written up for "refusing to install a program properly" because they thought a Windows 95 program should be able to run on Windows 3.1, even when I ran the installer to show them it didn't work. Being yelled at when installing a retail copy of Office 95 because they saw the ads for other Microsoft products flashing across the screen and thought those were being installed too. Lastly, letting someone's teenage son play on my computer all day and then being accused of looking at porn when I wasn't even there.

  109. Ron Christian

    don't step on the orange cables...

    Years ago I shared responsibility to build up a huge (for the time) decision support system. System and storage interconnect was via many strands of fiber optic cable, which for some reason was run along the floor behind the cabinets.

    One morning we came in and the system was totally hosed. Eventually found out that some minor wiring repair had been done the night before, and the electricians had walked behind the cabinets in their big old boots right on the orange fiber optic cables.

    We eventually got everything working again, but for reasons I can't recall, the cables still weren't put in trays, educating the electricians was deemed prevention enough, and two months later it happened again! Different electricians, of course.

  110. scott 30
    FAIL

    On Error Resume Panic

    During my stint managing the desktop park for a household name (around 60k XP boxes in Europe, 200k worldwide...) me and my team managed to shoot down a fair few kamikaze runs. Nothing was allowed to be distributed without going through us first, and we did a pretty good job.

    One minor Achilles Heel was that the high gods in the Engineering team could (but shouldn't) change the login script unannounced. Sure enough, one morning I came into a sea of white faces. The incident management team were getting hundreds of calls a minute. Everyone had a VBS error on their screen, which couldn't be cancelled and blocked the login dialog.

    Four simple little words had been commented out in part of the script: "ON ERROR RESUME NEXT". A bad reference to a share had caused the untrapped script to throw its toys out of the pram, and meant millions of dollars of downtime.

    More recently, a heavyweight banking client of mine had an unplanned data center swap during trading hours - causing billions of dollars of transactions to be frozen (~2 hours from event to complete restore of service in another DC). The CEO was called to the Regulators to explain how such a thing happened, and what was put in place to prevent it ever happening again. Old Murphy was on fine form that day when he got one of the SAN guys to dismount the live storage rather than the redundant one at exactly the time the CEO was giving his presentation. Cue automated failover, and all the local bars suddenly getting a rush of mid-afternoon customers.

    Funnily enough, we'd all voiced our concerns to the fancy suit wearing US IT Consultant (Mc something...) that reducing overtime costs by doing infra maintenance during business hours was a baaaad idea. Who ever listens to the people doing the job though. McSomething and the CEO are still here, but the poor tech who thought he was working on X when he was actually on Y didn't fare so well...

  111. Anonymous Coward
    Flame

    Can you get me five books and a fire extinguisher please?

    Working at a top university library in 1997, there were OPACs (library terminals) - 386SX-16s with 1MB, loading a small network stack and Telnet client. These had been in place for eight or more years (new, in Uni/Library terms) and had been unmoved, and unopened, for much of that time.

    Dust accumulation was apparently a problem, as one self-immolated after the PSU got too hot.

  112. Anonymous Coward
    Anonymous Coward

    If you're going to use RAID 5 make sure you know how it works...

    Joined a small company back in '04. One of my first tasks as a developer was to build out a new Exchange 2003 server (small company - if you were in IT you did everything). After much reading up on the subject I managed to deploy an Exchange 2003 server along with a new Active Directory server, and migrated the entire company's mail from Exchange 2000 to 2003 while implementing Active Directory to support the new Exchange system. Everything ran fine for about a week (including backups) until one particularly warm weekend.

    Now, I should mention that IT was located on the 3rd floor of a 3-floor building. This included the server rack, which sat next to my desk. The building was circa 1940 and the cooling for the servers came from 2 air conditioners in the windows, which were usually turned off at the end of the day. Along comes a hot August weekend, and I returned to the building early Monday morning to find that the blade the Exchange instance was running on had a blinking red light. Not knowing what it meant, I rebooted the server to see if I got any useful messages. During the POST I was informed that one of the drives in the RAID was bad but, given that it was RAID 5, everything booted up and ran normally.

    Once my boss (the head of IT) arrived I informed him that we had a bad drive in the array and should request a new one from the vendor. He goes over to the machine, looks at it, proceeds to reboot it, goes into the RAID controller and marks the drive as normal. I'm not sure exactly what the next prompt was, but whatever he selected proceeded to take the data on the re-mounted drive and dish it out across the rest of the RAID. The data, of course, was corrupt due to the failure, and the entire Exchange server I had built over 3 days went to pieces.

    "Well," he says, "at least we have our offsite backups." So I spent the rest of the day rebuilding the Exchange server on another blade we had received for a different project. Once it was up (around 6pm) all we needed to do was restore the data from the offsite backup. Now, we had restored other files from offsite and it was pretty quick, but I knew that we'd be restoring multiple GB of data. My boss said it shouldn't take more than an hour or 2, since that's how long it takes to get the backups out to the offsite location and we had a fast link to them. Apparently the up link was faster than the down link, because 12 hours later we finally had the backups onsite and were able to perform the restore....

  113. Anonymous Coward
    Linux

    anti-viruses

    I haven't been working in the field for too long (a whole 2 years) so I don't have too many fun stories, but I thought I should share one that happened recently.

    I work for a small company and we rent a server from a company on the other side of the country. We run backups of the data and download them over DSL; as the site got bigger the downloads got bigger, and we were pulling 17 gigs at 120kbps. Management asked me to make this better, so I made a backup that tars the data and then makes a diff of that week's data against last week's data, to be patched onto the local machine. Some time later the diffs stopped patching: the md5 of the tar from the last patch no longer matched the md5 from the server. After re-downloading and confirming the md5, I think "OK..." Next week it didn't patch either; the md5 of the tar had changed. After some investigation we found that the anti-virus had gone into the tar, found some files it didn't like, and DELETED them. This changed the tar and the diff wouldn't patch.
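
    Roughly that scheme, sketched in shell - the original doesn't say which delta tool was used, so xdelta3 and all the paths here are assumptions:

    #!/bin/sh
    # Server side: tar this week's data, produce a binary delta against
    # last week's tar, plus an md5 so the far end can verify the rebuild.
    DATA=/srv/site/data
    NEW=/backup/site-$(date +%Y%m%d).tar
    OLD=$(ls -1t /backup/site-*.tar 2>/dev/null | head -n 1)   # last week's tar, if any

    tar -cf "$NEW" -C "$DATA" .
    if [ -n "$OLD" ]; then
        xdelta3 -e -s "$OLD" "$NEW" "$NEW.vcdiff"   # ship this instead of the full tar
    fi
    md5sum "$NEW" > "$NEW.md5"

    # Local side: rebuild this week's tar from last week's copy plus the
    # delta, then verify. An anti-virus that "cleans" files inside either
    # tar changes its md5 and the whole chain falls apart - as above.
    #   xdelta3 -d -s last-week.tar site-YYYYMMDD.tar.vcdiff this-week.tar
    #   md5sum -c site-YYYYMMDD.tar.md5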

  114. James Dore
    Coffee/keyboard

    Why I buy Big Brand desktops.

    When I first started in IT work, fresh out of college, my employer was in the middle of a migration to Windows 95 (ahh, happy days. Except Win95.) They had to purchase new hardware too, despite having reasonable 486 DX100/120s to run it on. Turns out they'd bought them from a UK system builder that built their own motherboards. Which had a fault. After 30 or so failures, we were informed that there were no more spares and they weren't making any more. That's the rest of the custom-made 1-year-old machines up the spout, then.

  115. David McCarthy

    Two tales from the 1990s

    Two problems, both caused by me:

    1. Changed an account latency setting in Netware (to stop people staying logged in when they didn't need to be on an oversubscribed server). Set the wrong value and watched as Netware terminated the connections of over 100 users in just 30 seconds.

    2. Was testing conditional email forwarding in Lotus Notes. So, unfortunately, was a colleague. We set the same conditions and forwarded to each other. The system sent thousands of forwarded emails before we could break the loop.

    Ah, happy days!

  116. CleverRichard
    Headmaster

    Backup with a punch

    An Act Sirius PC backed up onto five-and-a-quarter-inch floppy disks each night.

    Customer visited to find out why the second week's backups had failed.

    All the backup disks had been neatly hole-punched and filed in a ring binder.

  117. Keith Langmead

    A little knowledge can be a dangerous thing

    Many years ago whilst at Uni I was the main sysadmin for the SU's computing society (TermiSoc for those in the know), which had three linux servers of our very own, stored in one of the Uni building's basement.

    There were a few other guys who also had root access, one of whom was very interested in security and spent a lot of time attempting to hack into and then improve our systems.

    Now this guy had been reading about the risks of files being owned by root and having execute permission within user accessible folders. He started searching through the filesystem, and discovered that within each users folder there was a . and .. folder with the permissions he'd been looking out for. Now while the exact details are a little fuzzy (it was at least 12 years ago) I know our ever diligent security geek decided to fix this issue. He proceeded to change the permissions on both folders to prevent executing by normal users.

    Shortly afterwards he started hearing people in the lab comment that they could no longer log in. Of course, removing that permission prevents a user from traversing back through the folder structure, and the login process is unable to traverse to the home and /etc directories. The only user able to log in was root, but we'd already restricted that so remote connections were only allowed for normal users, who could then su to root - so we had no remote access whatsoever.
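
    For anyone who hasn't been bitten by it, a tiny demonstration of why the execute (search) bit on a directory matters - run it as an ordinary user on a scratch directory in /tmp, never on real home directories:

    #!/bin/sh
    # Removing the execute bit on a directory stops anyone traversing
    # *into* it, which is exactly what broke every login above.
    mkdir -p /tmp/permdemo/home
    echo hello > /tmp/permdemo/home/.profile

    chmod a-x /tmp/permdemo/home           # drop the search bit
    cat /tmp/permdemo/home/.profile        # -> Permission denied
    cd /tmp/permdemo/home                  # -> Permission denied

    chmod a+x /tmp/permdemo/home           # put it back; all is well again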

    Myself and another sysadmin friend, with resident security geek in tow, had to get someone to let us into the basement so we could get console access to the machine and fix the glitch. A fun day, but I think everyone learnt a valuable lesson, and of course the story continues to be recounted occasionally to this day!

  118. Anonymous Coward
    FAIL

    You gave root

    to someone who didn't know what . and .. are?

  119. Anonymous Coward
    FAIL

    On Error Resume Next

    is not a substitute for proper debugging and error handling!

  120. aspir8or
    Facepalm

    It's in the post

    In the late '80s we had a 386 set up as a bridge, running software off a 5 1/4" floppy. This little beastie fell over, requiring me to come in an hour early every morning to download the previous day's db transactions to a 3 1/2" floppy and then upload them to the accounts department's system (I offered to reconfigure and bring everything up to date, but too expensive for the bean counters). After 5 days of this and tracking down the guy who wrote the bridge (he was way out in the outback of Australia), he agreed to send over an updated version. A few days later I got an envelope which had been folded in half somewhere on its journey, and which contained the 5 1/4" cardboard floppy disk that was supposed to give me my morning hour back. Another week later I got a properly packaged and undamaged floppy in the mail. I handed in my notice a week later,

    when my boss decided to go with my original update plan so we (meaning I) wouldn't have to go through that again. Being the entire IT dept with no budget just wasn't me.

  121. Inachu
    FAIL

    NOVELL HELL.

    At a soon-to-fail insurance company, the CFO learned that our Novell 3.1.1 server did not have any backdoor: supply the wrong admin password and the OS would be on permalock, and you would not be able to get any data back.

    I had ZERO training in Novell and had no desire to touch it. But the CFO wanted me to log into it.

    (I refused to touch it.) Found out the CFO wanted the server to have total data loss so he could blame someone else and not himself. And so was the end of PBHC health care.

    Yes, this was the same one that made headlines when patients were getting medical support from teenage girls with no medical license or background at all, reading from Q&A books.

  122. Inachu
    Facepalm

    ISP in USA virginia.

    The ISP who served Virginia, DC and Maryland ran their servers not in server or PC cases but just sat the motherboards on plywood with those industrial fans blowing on them.

    The server room had no locks.

    A ticked-off employee walked in and ripped out the server memory, and people called tech support wondering why they couldn't connect to the server via dialup.

  123. Inachu
    Pint

    another story from failed insurance company.

    Same failed insurance company; the IT manager had no IT training at all.

    He wanted me to build 5 servers from scratch, and already had some PC cases lying around to use.

    Walked back to the manager to inform him the PC cases were too small to fit the motherboards in.

    He did not care.... He replied, "MAKE IT FIT!"

    OK, so I made it fit and had them all built, and they all bluescreened, crashed or did not boot at all.

    Because the cases were so tight, the motherboards were being made to fault, sometimes bending and/or touching the sides of the case, and just died.

    A multi-certified DBA/network guru was sent to look at my handiwork; I told him what was happening and he was like, "Really?" So after he had a go at it they still failed.

    The IT manager wasted over $12,000 on things that did not work.

  124. Anonymous Coward
    Anonymous Coward

    And not even AC

    Enjoy your lawsuit bro.
