Get enough people of any profession in one room and the conversation drifts inexorably towards horror stories. Everyone loves a good “…and then it all went horribly sideways” yarn, and we all have more than one. The IT profession is flush with tales of woe. There are no upper boundaries on ignorance or inability. From common …
You missed a big one.
Inconsistent times set on the company servers. Nothing worse than trying to fault-find a sequence of errors across several servers and/or workstations when the time is not set consistently. Surprising in this day and age just how common this is.
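For what it's worth, the standard mitigation is to point every machine at the same time source and let the daemon keep them in step. A minimal chrony-style config sketch (the pool name is the public default; substitute your own internal source):

```conf
# /etc/chrony.conf (sketch) - identical on every server
pool pool.ntp.org iburst          # common set of time sources
makestep 1.0 3                    # step the clock at startup if it is badly off
driftfile /var/lib/chrony/drift   # remember the clock's drift rate between runs
```

With everything disciplined to one source, log timestamps across servers line up closely enough that cross-machine fault-finding becomes feasible.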
backup the backup, eh what?
In the early '90s I worked for a company that had ported a software application from proprietary hardware to a Unix platform. As part of this, a backup solution had been developed for our customers: "the Unix guy" would go on site and install the backup software. To test it, he would back up the folder containing the backup software. Yup, you guessed it: three years after "the Unix guy" had left, a customer had a disk failure, and when we came to restore their data, we found they had only ever been backing up the backup software.
backup to the backup folder
That reminds me of the client who would do their own zip file backups of data folders. And then store that zip file in the original data folder. So the next backup would include the previous backup. And so on. These (multiple) backup files grew exponentially until the hard disk jammed full...
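The general cure is to land the archive somewhere outside the tree being archived (or at least exclude earlier archives). A minimal sketch with tar and throwaway /tmp paths:

```shell
set -eu
mkdir -p /tmp/demo/data /tmp/demo/backups
echo "important stuff" > /tmp/demo/data/file.txt

# Wrong: tar czf /tmp/demo/data/backup.tar.gz ... - the next run swallows this one.
# Right: the archive lands OUTSIDE the folder being archived.
tar czf /tmp/demo/backups/backup.tar.gz -C /tmp/demo data
tar tzf /tmp/demo/backups/backup.tar.gz   # lists data/ and data/file.txt, no old archives
```

zip has an equivalent escape hatch (`-x 'data/*.zip'`), but keeping backups out of the data folder entirely avoids the whole class of problem.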
And how many people have found their clients backing up their Sage data to the same folder as Sage was installed in? And never taking those backups off site...
Users are such fun...
users are such fun?
Well yes, they are, but my biggest single lesson as a sys admin was, when stuff broke, not to ask the users what they had been up to until I had thoroughly examined what I had been up to.
Send not to know what the users did:
The person most likely to have broken the system is you!
You mean the flashy red light on the backup was important?
Client has a multi-million-dollar sales database being backed up by the secretary/receptionist. Head secretary has a problem with the sales database. HS decides reboots fix PCs, so the best thing to do is delete the database because it will rebuild itself. Is surprised when it doesn't. Head pilot fish takes the call, says "no problem, that's what your backup is for." SR puts in the tape and it won't read. Head pilot fish asks, "Is the read light on the tape backup blinking?" SR says, "Why yes it is." Head pilot fish asks, "How long has that been blinking?" SR answers, "Not sure, but at least 3 weeks. Why is it important?"
Fortunately for both HS and SR, head pilot fish (not me) was damned good and able to reconstruct their entire database from the Novell 3.12 recovery area (I have since forgotten their name for the very handy feature).
Not only the users fall for such strategies
At my old job, we were lucky to have a really good admin. He did everything as he should, running a separate backup server for each of the primary servers. When virtualization came around, it went really smoothly. Everything continued as it always had, just on virtual servers, completely transparent to us happy users. He may have been falling behind on documentation, but the bosses understood that, with all the work he put into keeping things running, there wasn't time for proper documentation.
Eventually, we got the budget for a second admin. His first task was to get to know the systems and document them as he went. He found some interesting system-landscape choices, including the fact that most of the backup virtual servers ran on the same physical machine as their corresponding virtual primary server.
As was said elsewhere
It's not just the users.
I remember discovering that the default configuration on some NetWare backup software we had was to back up the server structure. Nothing else. So all you could have recovered were the directories, filenames and permissions, not the actual data in the files.
Fortunately it was just an irritation, as I became suspicious that the first night's backup of what should have been the entire server took around 4 minutes when I had expected about 3 hours based on the amount of data and tape speed.
It wasn't even a schoolboy error on my part: I'd selected "Full Backup" on the scheduler, and it was only when you drilled into the options for what to include in a full backup that you could see the default was to behave like this. Epic fail from the software company there, I think.
It's been a while now, but the incident of the cleaner and the plug socket is certainly no urban legend. Whilst I was contracting at a bank of Scottish origin in the late '90s, we spent days trying to work out why the overnight stored procedures failed every time. Yep, it was the 'woman who does'.
Which leads me to suggest an entry for a similar article (which I'd like to see) for db admin disasters, which goes along the lines of 'Figuring out sql performance issues at the end of the project doesn't always work out'.
I'm no expert in server protection stuff, but surely there are simple mitigations against the cleaner unplugging things, from the simple (a sticker) to the more complex (a locking shield over the outlet), not to mention not leaving machines with sensitive data and/or that are mission-critical in rooms that the cleaning staff have access to.
Surely any sensible person would take measures to prevent exactly this scenario?
More cleaner woes
I used to have to do a lot of 3D rendering. In this case the project was to render the column-bedecked interior of a planned large building, and then create a plot to be shown to the royal carbuncle hater the next day. This needed an overnight run. Despite the plug switch being taped open and a "do not switch off" sign hung over it, the cleaner (no, she was English and could read) turned it off.
This resulted in much swearing (me), much panic and fear of vanishing chances of making the Honours List (the CEO and directors of the charity) and hiring a motorbike courier to get the new drawings up to London just after lunch.
Fair point in an ideal world, but sadly the bean counters are in control. The number of times I've seen bean counters refuse expenditure of a few thousand pounds upfront, only to cost themselves tens of thousands down the line, is unreal.
Who cares about the cleaner
Agree about the cleaner. My wife hired a Chinese bloke to clean the house once a week a couple of years back. His first deed was to bring down the network by plugging a 2kW vacuum cleaner into the UPS-backed socket. I fired him after he did it the second time and bought a Roomba.
Which reminds me - have you seen the incident where the builders plug welding apparatus into the UPS socket? Half of the UPS overload-protection circuits out there do not work correctly with inductive loads that size. Older APC kit definitely does not. Trust me, a Galaxy-class APC charged to the hilt exploding in a 3x3m server room is not a pretty sight.
Also - on 7 - snow storm. Snowstorms are actually not that bad if you have the right vehicle, warm clothes and a shovel. Now, East Anglian fog... In my old job I had to go and plug in cold-spare equipment at 11pm in the office with visibility under 10m. The only thing you could see through the windshield was a white wall. You do not see the road, nothing. So you crawl along at 5mph with dead reckoning and hearing as your primary means of navigation. Thankfully, the paint used for roads in the UK is thick enough that you actually notice when you drive over it.
Re: Who cares about the cleaner
Yup, seen that one too. Regional (i.e. Europe) server and comms room. Every AM, something's been unplugged. Mostly trivial but annoying. The solution was to carefully label each plugged in item in Dymo with "Do Not Remove This Plug". Sorted.
That night, the cleaner came in and found nowhere to plug in the hoover as all the sockets were occupied with plugs so labelled. Then she noticed a nice block of sockets in a line, handily mounted at "not having to bend down" level too. She whacked in the old Numatic "Henry" and switched it on. Said sockets were in the back of the comms rack and the clean power supply they were attached to shat itself on the spot, producing a Europe-wide outage of all the European shared services.
Moral: Sometimes, finding your screen unplugged in the morning ain't such a bad thing......
Moral of the story?
So what stops us from finding a handy outlet in a spot convenient for the cleaner and labeling it "for cleaning use ONLY"?
Sometimes it's really more useful to facilitate what you'd like to happen than to forbid every niggle and thing you don't want to happen. For bonus points, wire it up to a nice isolated group, then get with staffing and make sure all new cleaning hires know that in rooms full of computing equipment there will be a socket specifically for them, labelled and free, and that using any other is a firing offence.
Why yes, the cleaning staff too should know where their priorities lie: in certain rooms a bit of dust is preferable to having the blinkenlights go dark. Go clean the visitors' area again instead, hmkay.
My story? Getting all comms, net, phone, alarm, everything, ripped out without the aid of a backhoe. Telco street cabinet cable administration mass spring cleaning session. Four-hour service contract, and watching the telco not care. "We're busy! Come back next week!" Silver lining? Sending in someone from ceerow to shout at them; if nothing else, saves my hearing. The icon is for that guy, though he wouldn't need one.
Moral of the Story???
You can't make idiot proof software/hardware/procedures, the idiots are just too inventive.
Or as Albert Einstein said, "Only two things are infinite, the universe and human stupidity, and I'm not sure about the former."
There is a very simple answer.
When I wired up offices, I just bought a whole load of 13-amp plugs and sockets with the live and neutral pins oriented at 45 degrees to normal. These sockets were on a separate supply to everything else, and all the cleaners' devices had their normal plugs removed and these special plugs fitted. I even used to give them an extension cable from the special plug to an ordinary socket for their mobiles etc.
Either that, or the fact I made sure they knew where I kept the baseball bat, made sure that I've never had such problems.
It never rains when it pours
Especially when the pouring is from the aircon overflow tray above the large IBM piece of tin that ran 15 warehouses (the real kind, not data ones!) spread across Europe.
It took 4 days to get the system back up, and then senior management suddenly saw the sense in spending several million on a separate site system.
They lost £5m per day.
Or the 'meat' error in the same company, where the clerk pushed through a quarter's VAT payments a day early and lost the company IRO £3m in VAT, on an entire warehouse of fags and booze.
Pipe leak on a floor above the IT floor.
Water flows over the box with the mains. Eventually the mains blow. Team has a disaster recovery plan including a generator for power and a failover. Head server pilot fish confirms the failover has worked properly and the servers are now on the generator, a different circuit from the mains. We eventually need to evacuate the building; no problem, they're set up for remote access to cleanly power down the servers. Small problem: head server pilot fish lives 45 minutes away. But everything's working, so it should be fine. Except when the maintenance guys came through and saw the generator circuits were working, they turned them off so they wouldn't be dangerous (mind you, the same maintenance worker who previously saw no problem with water flowing over the mains). Battery backups were only good for 20-30 minutes, so by the time head server pilot fish got home to remotely shut down the servers, they'd already gone down hard.
Ah yes ... aircon drip trays sited directly above the brand new IT room, raining down into the cupboard with the power distribution racks in it. Major player in the mobile phone retail sector, mid 90s. I was on-site at the time, too..
Same place, year or two earlier, I was dragged out of bed because their tech had done an rm -r in exactly the wrong folder ... at least he'd phoned his boss before bolting. We arrived to find the place empty and barely even locked up.
Demanding employees and their email.
One employee demanded to have access to his email 24/7 and wanted company email forwarded to his home ISP email.
Well, sooner or later his home ISP inbox became full, and it not only sent a message that the inbox was full but also copied that message back to the sender at the company.
So in effect it filled up the email server, which crashed and had to be rebuilt.
The story here is the crappy email server that commits suicide when the disk is full.
This is why you don't store queues on your operating system partition. All sysadmins should know this.
very poor admins
The story here is also that someone didn't set mailbox size limits?
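Both complaints map onto real knobs in most MTAs. A Postfix-style sketch (values are illustrative, not recommendations):

```conf
# main.cf fragment (sketch)
queue_directory    = /var/spool/postfix   # keep the mail queue off the OS partition
                                          # (ideally /var/spool is its own filesystem)
mailbox_size_limit = 1073741824           # cap each mailbox at ~1 GB
message_size_limit = 26214400             # refuse messages over ~25 MB
```

With the queue on its own filesystem, a mail loop can still fill the queue, but it can no longer take the operating system down with it.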
The cleaner unplugged it
I remember once client screaming and cursing me over the phone threatening all forms of abuse at me because one of his systems was down and if I did not do something about it now there would be legal action, I'd be sued for lost revenue etc, etc, etc. So after a 2 hour drive to his office I walked in, looked at the computer and then plugged it back in and powered it up. I then just turned and stared at the client.
I then I sent his boss an invoice for a half day, emergency onsite maintenance with expenses with "Problem" filled in with "System unplugged" and "Resolution" filled in with "System plugged back in"
This isn't so uncommon
Had this a couple of times when I was working for a company down Swindon way.
User calls, makes demands and threats, I drop what I'm doing, head on down and find the computer unplugged. User, of course, was never around when I got there. So I plug the computer back in and leave it plugged in. The last time I went down, I checked the computer worked, then unplugged it and returned to the office to report on the ticket. Note in the comments box: "No fault found. Computer left in state found in." I didn't mention that the problem seemed to occur at about 8:30, when the canteen opened...
And as to cleaners: yes, had that too. The server room was kept locked: the main door was key-coded with access audited. A rear escape door was secure from the outside: you had to break the glass tube to unlock it. The side door was kept locked with 'no entry' markers on it, and the office it was accessed from was also kept locked. So the cleaner went through the office and in through the side door using the master key. We only found out this was happening when she forgot to plug the server back in when she left one night: the UPS had been keeping it running the rest of the time.
But my favorite has to be the offsite server farm: two sites mirrored, just in case, and in separate counties. Only when the substation supplying power to one went down did both sites go dark. When they investigated why, they found both sites were supplied by the same substation. Apparently no one had thought to check that possibility...
It's not just the bane of sysadmins' lives; it's the bane of every person who ever has to use a PC remotely.
Eg A certain industrial robot programmer has to upload a bunch of programs to the robots, this is done via a file browser on the robots' control panel coupled to the programming station's PC.
The PC is in the office 100 yards away from the robots... so you set the PC comms program running and walk across the factory.
At least once a week some idiot decides to turn the PC off halfway through the transfer because either his iPod/iPad/iArse needs charging or 'oh look, Boris has left his PC on again'.
Where do I start ... ?
Both sides of a mirror on the same disk ...
Multi-file tar backup to the rewind (not no-rewind) tape device, all except the last archive overwritten ...
Sysadmin working in the machine room at the weekend felt a little cold, so turned the aircon off. On Monday the servers were fried ...
Top sysadmin and deputy are the only ones who understand things. They fall in love, give 9 months' notice of a round-the-world trip. Company starts looking for a replacement three days before they leave ...
RAID 1 is backup, isn't it? Don't need anything else. Until a user error deletes a file. Cos it is RAID 1, both copies go ...
Backed up to tape, read-verified it, all the files seem to be there. Disks blow up, restore from tape. Why is the data 6 months old? Because 6 months ago the tape write head failed; they had been verifying old data ...
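The rewind/no-rewind trap above comes down to device names. On Linux, for instance, /dev/st0 auto-rewinds on close, so each successive tar archive overwrites the previous one at the start of the tape, while /dev/nst0 leaves the tape positioned after the last archive written. An illustrative sketch only - device names vary by OS, and don't try this on a tape you care about:

```shell
# Wrong: the auto-rewind device resets to the start after every close,
# so the second archive lands on top of the first.
tar cf /dev/st0  /home/alice
tar cf /dev/st0  /home/bob     # alice's archive has just been overwritten

# Right: the no-rewind device appends archives back to back.
tar cf /dev/nst0 /home/alice
tar cf /dev/nst0 /home/bob
mt -f /dev/nst0 rewind         # rewind explicitly once the run is finished
```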
Floppy days are here again
1. The person doing MSBACKUP only put one floppy in, and just pressed ENTER when asked to replace it with the next disk.
2. New software is purchased; IT person makes a master backup floppy for the office safe, a working backup floppy, and a working copy floppy for the user. The original floppy goes home with the CEO. Loads of redundancy, multi-site, gold stars all round.
One day, the user's machine says bee-baaaaar, bee-baaaaar etc - can't read the disk. OK, we'll make you a new one from the working backup floppy. Oops, same problem. Try the master backup floppy. That's duff too. The CEO's copy is brought in; bad also. Of course the problem was the floppy drive, which has now killed all copies of the software...
Re: Floppy days are here again
There is a utility under Unix called 'dd' that allows you to make an image of your floppy; you can then write that image to a fresh floppy when you need it.
I know this utility has been ported to DOS and I have even made use of it in the past, but for some reason I can't remember the name used under DOS, so I can't give you a link for it. Sorry.
DD.EXE is used to create an image file from a floppy boot disk. Do not confuse it with the Unix "dd" command; it is not quite the same.
Example: dd a: <filename>.img
This will create an image file from your boot disk in drive A:.
You could also use WinImage (not freeware) for this, but remember to save your floppy image as an .IMA (uncompressed) file.
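On the Unix side, dd really is just a raw block copy, which is why it works for floppy imaging. A sketch using an ordinary file in place of the real floppy device (which would be something like /dev/fd0), so it can be tried without any hardware:

```shell
set -eu
# Create a stand-in for a 1.44 MB floppy: 2880 sectors of 512 bytes.
dd if=/dev/zero of=/tmp/floppy.img bs=512 count=2880 2>/dev/null
# "Imaging" is a straight block copy; with real hardware, if= would be /dev/fd0.
dd if=/tmp/floppy.img of=/tmp/floppy-copy.img bs=512 2>/dev/null
cmp /tmp/floppy.img /tmp/floppy-copy.img && echo "images identical"
```

Writing the image back to a fresh disk is the same command with if= and of= swapped.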
useful tool, along with
and sharpkeys http://www.randyrants.com/2008/12/sharpkeys_30.html
IBM had two compatibility-mode utilities to make and restore an image of a floppy: loaddskf.exe and savedskf.exe. They run in DOS, OS/2 and, last time I checked, Windows.
Once experienced the classic "single tape" cock-up
"Yes we have backups every night, without fail for the last 18 months."
"Oh great, should save some time. We need to test restores for the auditors. Who changes the tapes and what's the offsite storage like, fire-proof that sort of thing?"
"Change the tape? No need, it's big enough, well it's never run out yet."
"No, you only have 150MB of space on a tape and the server is holding 40MB of data, so that's only 3 days' worth!" ( 40MB on a server, that shows you how long ago this was! )
"There is no way that tape could do more than 3 nights of backups before it fills up. You've most likely overwritten the same tape hundreds of times, so you have no historical data available."
( You will be! )
Quick check and yep, they'd run 450 backups onto the same tape, over and over and over and over... the software didn't bother to check whether the tape already had data; it just rewound and overwrote it.
Needless to say the auditors were not in the least bit impressed and lots of shouting ensued at management level and plenty of overtime was paid out to IT to make sure it did not happen again!
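As the story shows, a backup you've never restored is only a hope. A minimal restore-test sketch in shell, with plain tar and throwaway /tmp paths standing in for the tape software:

```shell
set -eu
mkdir -p /tmp/audit/src /tmp/audit/restore
echo "ledger" > /tmp/audit/src/accounts.txt

tar czf /tmp/audit/backup.tar.gz -C /tmp/audit src       # the "nightly backup"
tar xzf /tmp/audit/backup.tar.gz -C /tmp/audit/restore   # the actual test: restore it
diff -r /tmp/audit/src /tmp/audit/restore/src            # exits non-zero on any mismatch
```

The diff is the part the site above never had: proof that what's on the media matches what's on the disk.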
2006: the London summer heatwave stressed our aircon too much and the lot failed. What did the PHB do, order repair or replacement? Nah, sent us to Somerfield for some turkey-sized tinfoil to line the windows of the server room and "keep the heat out". Cue 45-degree temps and a dead Exchange server the next day. Ha! Served him right; needless to say we didn't rush with the restore!
Last time I went past the building, the foil was still in the windows :)
Or the time when I came back from a weekend off to find two of my servers with fried power supplies... turns out a DR exercise had happened over the weekend, and the ops staff were getting a bit cold in the small machine room, so they might have turned off some of the aircon for a few hours during the exercise (and my servers were furthest away from the cold outflow)... hmm, now, looking at the logs, when did the exercise start and when did the logs get truncated? 8-(
I would suspect that it was pure karma as a few years earlier in same said machine room I turned off one of the aircon blowers as I was getting hypothermia working next to the cold air outlet and forgot to turn it back on before leaving for the day .. the temperature alarm went off overnight .. whoops (I owned up) ... funnily enough little plastic covers were screwed over the aircon controls a few days later .. you then had to use a pen through a small hole in the cover to flick them on and off 8-)
When installing and configuring a UPS monitoring system that will automatically and gracefully shut down your data center in proper order before the batteries run out, always make sure you keep track of the serial cable that came WITH said UPS.
I once installed one of those for a medium size ISP and got my hands on the wrong cable. It did not take me long to realize that on a regular cable, pin 5 is grounded, and on a UPS, pin 5 to ground means emergency shut-off... The sound of a big UPS, 25+ servers and a plethora of telecom equipment all clicking off simultaneously is not one that I ever want to hear again...
Best of all... RTFM.
Did exactly the same thing but thankfully only on a single Windows SBS server - the last person to work on the server had left the serial cable for the UPS loose down the back so like a good boy I plugged it back in - instant silence.
The odd thing was, 15 minutes later when the server finally came back up, no-one had noticed...
On a similar note, I once discovered that if you have a UPS running a couple of servers, and decide to re-install Windows on the one that actually has the serial cable in it, then as part of the hardware check a little signal gets sent to the UPS that shuts it down instantly...
The real WTF is...
The damn UPS manufacturer "thought outside the box" and used a DB-9 connector with a non-standard pinout for a "serial port". Connecting a standard cable causes Bad Things to happen.
Poor design doesn't even begin to cover it.
I blame lock in
I think it was more a matter of making you buy their "special" cable at twice the price of a generic DB9 serial cable. Fortunately USB has made this a thing of the past since I also found out the hard way about UPS cable pinouts, but luckily it was only a desktop machine.
the emergency switch
That big red power button that takes down the whole data center, the one in plain sight, make sure it has a cover. At some point somebody will trip, or shoot out a stabilising hand at exactly the wrong location. You do not want to be anywhere near that person when it happens.
The emergency cutoff switch is right next to the exit door. The exit door is access controlled. The new guy did not realise that you have to swipe your card to get out, not just press a button. 500 servers go down in an instant.
That happened to MCI where I live. To make things worse, this site was where they do the peering for AT&T and Sprint. They lost a few routers, and the DNS took a week to fix. When I say they lost a few routers, I mean they didn't turn on. Some of the T1 cards were fried too. Then there were some routers and switches that lost their config. Five hours offline. By the way, this site was controlling the zone for northern California.
Or the flirty junior playing "what does this switch do?"
COS for the high-integrity 24V DC C&I supply.
Primary DC was offline, and the X-hour duration 'backup' battery hadn't yet been installed.
30 seconds later, the temporary 'buffer' battery is fully discharged.
Immediately followed by a resounding series of BANGs, as every breaker in 6 multi-MW switchboards trips out.
Cue total darkness and ominous silence, broken only by the watchkeeper's swabbie-level cursing.
How to take out a nuke sub from the inside.
Re: the emergency switch
Seen that happen. Except it wasn't a server room, it was a large ocean going vessel, and the button in question was the Man Overboard Button.
Mind you, at least nobody got hurt.
...another on the subject of EPO switches.
When told that your switch should be covered to avoid being accidentally triggered: forget, call your local electrician in a panic when reminded, and watch as said electrician sticks a drill bit into your PDU in the middle of your peak processing window...
Change management: more than just the logical stuff.
Or the cabling gets a little loose
An unlabelled emergency cutoff was located on a false wall near the door; its connections had shaken loose from the vibrations of the door closing over the years. Walked in, turned on the lights, turned off a room full of servers. Very good heart-attack test.
The only vindication was the astonished expressions on the faces of the highly sceptical building electricians when I did it again in front of them, after dragging them in to explain to me what had gone wrong with the lighting circuits.
It's usually the boss
or his relative.
//you can't even yell at them
Re: the emergency switch
A variant of this one: in a very large computerised hospital, due to an electrical fire in a transformer room (that was mid-August of course... Murphy's laws at work), we had been doing several complete system shutdowns and startups (6 Unix clusters, 20+ Oracle DBs, 50+ blade servers) over a few days. This was required by frequent and mostly announced (but on very short notice) blackouts due to problems on the generator trucks they had parked on the street next to the building. At some point, totally exhausted after 3 almost sleepless nights, we were doing yet another system start-up after just receiving a confirmation that power was back and "hopefully" stable. A guy taking care of just a couple of not-too-important Windows servers came into the room to boot up his own boxes. He almost never comes here, doing his work remotely. After being finished, he goes out and ... switches off the lights of the room. None of us instantly died of a heart attack, but that was close.
Location, location, location
Or simply put the emergency off next to the light switch in a server room that is usually run lights out. So groping around to turn the room lights on turns all the little lights off....
It's not disaster recovery unless you know it works
In a meeting a couple of years back when the following dialog took place:
Yes, we have a best-practice disaster recovery procedure. We have fully redundant hot-standby servers at a beta site, mirrored disks at the location and two sets of network connections with no common points of failure.
When did you last perform a failover test?
Oh we've never tested it
It might not work.
Small company that relies on e-commerce and e-mail "downsizes" their sysadmin to replace with a cheaper outsourcing company.
Three months later, their site and e-mail stop working. Numerous phone calls to the outsourcing company yield nothing. I am called in to troubleshoot a week later. One WHOIS trawl, and I ask, "So who is John xxx?" "He was our old sysadmin." "Well, you may want to call and ask him for your domain back."
The sysadmin had been paying the bills through his limited company, and effectively "owned" the domain. When the renewal came up, it was forwarded to a parking site. Not sure whether the company bought the domain back, went through the arbitration, or some other solution. But every company since, I have been interested to see that a lot of sysadmins do this as a form of "insurance", ostensibly because "it's easier to have them contact me."