Sysadmin wiped two servers, left the country to escape the shame

Grab a very small cake and a bunch of candles, dear readers, for today we mark the 10th edition of “Who, me?”, The Register’s confessional for IT pros who broke things badly. This week, meet “Graham”, who “ended up as an authority on a fledgling new product called SFT III from Novell”. SFT stood for “system fault tolerance”. …

Anonymous Coward

A young boy was trying to grow cauliflowers. He had heard about a technique which his gardener grandfather said would never work. The system meant that the growth of the plants could not be observed.

On the day of the local gardening competition his grandfather watched as the boy revealed each plant. After the third stunted one his grandfather shook his head and went home. The next one was a giant. The boy took it to the competition tent - which was deserted - and managed to fit it into the display by removing its leaves.

After the judging - the boy and his grandfather looked at the results. The grandfather admired the cauliflower which was now sporting a "2nd" rosette. A judge remarked that if it had had its leaves it would have been the winner.

The grandfather put his spectacles on and picked up the plant's identification card saying admiringly "Now let's see who this expert is".

***a story from the autobiography of someone raised in a northern English town in the middle of the 20th century.

8
5
Silver badge
Trollface

Biggest point - glossed over.

The backups actually restored.

40
0
Silver badge

Re: Biggest point - glossed over.

"The backups actually restored."

All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?

30
0

Re: Biggest point - glossed over.

"The backups actually restored."

All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?

My non-fiction contains success most times, but it always takes longer than the last recovery exercise because the backup has got bigger. Especially when you are the sole resource/implementer in the wee smalls, your sphincter does goldfish impressions until it is done and running. I have been lucky: the worst I've had to do is load incremental backups on an older full one, but colleagues have told me of nightmare scenarios.
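For anyone who has not had to do it, loading incrementals on top of an older full backup comes down to picking the right tapes in the right order. The following is a minimal sketch, not any particular product's logic. It assumes each incremental holds the changes since the previous backup of any kind, and the names `Backup` and `restore_chain` are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental" (hypothetical labels)


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to load, oldest first, to reach the state at `target`."""
    # Only backups taken at or before the target time are of any use.
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = fulls[-1]  # the most recent usable full backup
    # Every incremental taken after that full must be applied, in order.
    increments = [b for b in usable
                  if b.kind == "incremental" and b.taken_at > base.taken_at]
    return [base] + increments
```

If any incremental in the returned chain is unreadable, everything after it is suspect, which is one reason the restore always feels longer than the rehearsal.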

6
0

I've had a backup restore

When I moved a number of user accounts from one Active Directory OU to another then deleted the (now empty) OU. Then I discovered that in an act of malice the servers decided that deleting the OU (and all that it contains) should synchronise throughout the domain before the account moves should synchronise. At least that's the only explanation I could come up with for all the user accounts vanishing.

But, as I said, the backup worked.

9
0
Silver badge

Re: Biggest point - glossed over.

"The backups actually restored."

and nobody mentioned the loss of work done since the last backup.

RPO and all that.

4
0
Silver badge

Re: Biggest point - glossed over.

when in the history of IT has the backup EVER restored

I'm sure it must have happened once. After all, the laws of infinite improbability demand it.

5
0
Silver badge

Re: I've had a backup restore

But, as I said, the backup worked.

You managed to back up (and restore) AD in a way that actually worked? Wow!

11
0

Re: I've had a backup restore

I cheated, which I could do because it's a small network.

2
0
Silver badge
Boffin

Re: Biggest point - glossed over.

@Gordon

when in the history of IT has the backup EVER restored when you really really really needed it to

You do test the backups regularly, right ?

You do have multiple backups, right ?

Kept at different locations ?

And, if your management has common sense, hard to come by, agreed, you even have DR, right ?

Go tell the bean-counters that you are PRODUCTION, NOT A COST CENTER, if they don't believe you, simulate an IT incident, with good tested backups of course ... and let them sweat it for, say, an hour ... ;-)

8
1
TVU
Bronze badge

Re: Biggest point - glossed over.

"All fiction, when in the history of IT has the backup EVER restored when you really really really needed it to (especially at 3 in the morning when it's your last chance)?"

With me for one although it took quite a bit of time and I did lose the most recent data but the bulk of it was saved by the backups. As Aunt Mabel used to say, "One can never have too many back up options".

OK, so I did make that very last bit up but the point stands - backups can save your bacon/butt/job/etc.

2
0
Silver badge

Re: Biggest point - glossed over.

"and let them sweat it for, say, an hour"

And make sure the beancounters' stuff is the last to get restored.

5
0
Anonymous Coward

Re: Biggest point - glossed over.

> simulate an IT incident

Yeah, back in the '90s with S-100 systems, the accountants told us we couldn't have backup hardware (tape drives) 'cuz they're just too expensive.

So the next payroll run, the "systems went down" (since the admin disabled logins) and "we have no backups so it'll be a while" so the accountants had to hand write paychecks for 200+ folks. We got our tape drives.

Anon b/c he's probably reading...

12
0
Silver badge

Re: Biggest point - glossed over.

Never let the truth get in the way of a good story...

2
0
Anonymous Coward

Re: Biggest point - glossed over.

Always, because I always randomly perform test restores, which saved my neck on several occasions.
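A test restore is only worth something if the restored files are actually compared against a known-good copy. The following is a minimal sketch of such a comparison, assuming the restore went into a scratch directory and the original tree has not changed since the backup was taken. The function names `sha256_of` and `verify_restore` are made up for illustration.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_root: Path, restored_root: Path) -> List[str]:
    """Return relative paths that are missing or differ after a test restore."""
    problems = []
    for original in sorted(original_root.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(original_root)
        restored = restored_root / relative
        # A file counts as a problem if it is absent or its contents differ.
        if not restored.is_file() or sha256_of(original) != sha256_of(restored):
            problems.append(relative.as_posix())
    return problems
```

An empty list means every file in the original tree came back intact. It says nothing about files that exist only in the restored tree.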

4
0
Anonymous Coward

Re: I've had a backup restore

Easy. It just needs practice.

0
0
Anonymous Coward

Re: Biggest point - glossed over.

"[...] I always randomly perform test restores [...]"

It's always the one you haven't tested that lets you down.

0
0

Re: Biggest point - glossed over.

Once again, a long time ago, using an odd OS called Magix (I think, could have been Magic), a Pick-like PC-oriented OS for database development. We had a number of sites with their own server running a production and sales database, and each system had a tape backup unit with a well established backup process. You know the sort of thing, 2 x Monday, Tuesday etc, 5 x Fridays, all rotated and stored in a separate part of the site - and even tested so that we were reasonably certain that the backup and restore process worked.

So one site had a serious hardware fault that brought the server down, so we replaced it. Loaded a fresh OS copy, configured everything and then fetched the latest tape and set about restoring the data. Up popped a message - "Please put in Tape 1". Puzzled, we tried again, and then tried the rest of the tape series, all with the same message. So we summoned the operator and asked about this. His response was that a few weeks back the backup process had started to pop the tape out and ask for a second tape to be inserted - so he just pushed the same tape back in! Never thought to tell anyone about this. Disaster loomed, as we'd tried every day and every Friday tape we knew about. Then by sheer luck I saw another tape sitting over the other side of the office on a shelf and asked what it was - turned out that two days before the user had wondered about the two tape thing and had actually put in a new tape on that day, and that was the magical "Tape 1" from two days back. And hallelujah, it was exactly that: put that in and then the two-day-ago tape 2, and voila, a working system with only 2 days of update data required.

Quite why the operator said nothing at first, and then did try a second tape at that crucial point - and not mention it earlier - we never did find out. We just thanked our lucky stars and did a runner to the pub for something to calm the nerves.

Next day a memo goes out saying that if the system asked for a second tape: first use a new tape (and label it) and second, tell IT Support !
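For the curious, the "2 x Monday, Tuesday etc, 5 x Fridays" rotation can be sketched as a function that names the tape for a given date. This is only an illustration of the scheme as described, not what the site actually ran. It assumes the two Monday-to-Thursday sets alternate week by week and the five Friday tapes cycle in order, and the label format is made up.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday"]


def tape_label(day: date) -> str:
    """Name the tape to use on `day` under a 2 x weekday, 5 x Friday rotation."""
    weekday = day.weekday()  # Monday is 0, Sunday is 6
    if weekday > 4:
        raise ValueError("no backup is scheduled at weekends")
    # Ordinal 1 (0001-01-01) is a Monday, so this counts Monday-to-Sunday weeks.
    week_index = (day.toordinal() - 1) // 7
    if weekday == 4:
        return f"Friday {week_index % 5 + 1}"  # five Friday tapes in a cycle
    return f"{WEEKDAYS[weekday]} {week_index % 2 + 1}"  # two sets, alternating
```

None of this helps when the backup outgrows a single tape, which is the failure described above: the label only says which tape goes in first.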

9
0
Silver badge
Windows

Re: Biggest point - glossed over.

Just to add that back in the day with mainframes and exchangeable discs we always (on full backup day once a week) took the full backup to tape, removed the discs, put in fresh disc packs and restored to them.

We always had a proven set of backup tapes plus the discs from a running system.

Three sets of tapes as well.

Tape store, on site disaster store, off site disaster store.

Tell that to the kids of today....

4
0
Silver badge

Regarding RAIDs and hotswaps - I prefer to do a full backup, then put said backup offline, then insert the new hard drive for a rebuild session.

Never had the need to restore from backups.

11
0

SFTIII was great

It mirrored servers perfectly, down to the smallest of faults. One server developed a fault? You could be sure the other would follow as soon as the mirror caught up.

Apart from that the majority of SFTIII installations I encountered suffered from the hardware being Netware compliant but not SFTIII compliant, so it didn't work correctly.

2
0
Silver badge
Facepalm

A long time ago I was making totally legal backups of Amiga games. Ahem.

Yeah, ok, so I was pirating some games, and to do so, a friend had lent me his copy of White Lightning, which was reputed to be able to copy pretty much anything, and fast. Not having much in the way of pocket money, even new blank discs were pretty pricey, so I'd not even bothered to make a copy of White Lightning, and so I was using my friend's diskette to copy various games.

You can probably guess what happened next, somehow I mixed up which disc was in which drive, and wrote over my friend's copy program. At the time I remember trying to blame it on him leaving the copy protect tab open on the disc. Sorry Tim.

Since then I've learnt my lesson and only got 'Source' and 'Destination' the wrong way round about four or five times since. Maybe six. Or seven.

19
0

Started learning my more serious computing with a Sinclair QL (when they were being sold cheap at Dixons). The first lesson with a QL was backup everything because the microdrives were finicky as hell!

I ended up having 3 or more copies of everything I was using. Still paranoid even now.

6
0

I made a slightly more serious mistake when my 20MB Amiga HDD filled and I bought a new one. My RocTec HDD didn't have a second IDE power cable or splitter so to copy from one to the other, I plopped the two drives on top of each other so the connectors were aligned and jury rigged a power cable by putting the loose ends of a spare 4 pin connector into the same positions of the old drive.

Sadly, somehow the pins were in the other order on the new drive, and I toasted my old drive.

Fortunately I'd backed up most of my work onto floppies, but it was an expensive lesson, particularly back in 1992 odd.

2
0
Anonymous Coward

Modern disks

I miss the days when disk drives had physical write-protect buttons, with red lights to show the drive was protected.

12
0
Facepalm

dd if=/mnt/some/long/involved/path of=/dev/sda

Followed 0.03 seconds later by swearing and control-C. It didn't have time to do much... other than totally corrupt the file system on /dev/sda. Target should have been /dev/sdb, /dev/sda was the backup drive I was trying to recover from.
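A wrapper that checks the target before anything is written might have caught this. The following is a minimal sketch, not a feature of dd: the function name `check_copy_target` and the idea of a caller-supplied list of protected devices are assumptions made for illustration.

```python
import os
from typing import Iterable


def check_copy_target(source: str, destination: str,
                      protected: Iterable[str]) -> None:
    """Raise ValueError if writing to `destination` looks unsafe."""
    # Resolve symlinks and relative paths so aliases of one device compare equal.
    src = os.path.realpath(source)
    dst = os.path.realpath(destination)
    if src == dst:
        raise ValueError(f"source and destination are the same: {dst}")
    for device in protected:
        if dst == os.path.realpath(device):
            raise ValueError(f"{dst} is marked as protected; refusing to write")
```

Here the backup drive, /dev/sda, would have gone in the protected list before the recovery started, and the copy would only run if the check returned without raising.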

15
0
Silver badge

The only good cock-up is one you are able to recover from.

5
0

300km?

People don't drive 300 km in Britain.

They drive 186.411 miles.

Most of it over potholes.

19
3

Re: 300km?

> They drive 186.411 miles. Most of it over potholes.

That would have been in 'creator mode' - the word you wanted was "through".

7
0
Silver badge
Headmaster

Re: 300km?

Or for some of them around here, "into"...

7
0
TVU
Bronze badge

Re: 300km?

"Most of it over potholes"

...because too much money is being blown on HS2 instead.

4
3

SFTIII

I remember having this in my client's site in the mid 90's. It worked too well, when one server went down it switched seamlessly over to the secondary... so seamlessly no-one noticed the primary had failed, until two months later when the secondary also failed.

And then we had to work out which one had failed first so that when we brought them back up again, they synched the right way. It was a 50/50 chance, so of course we got it wrong.
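The decision itself is simple if each server records when it last committed a write: the one that failed last holds the newer data and should be the source. The following is a minimal sketch of that rule only. It does not reflect how SFT III tracked state, and `ServerState`, `last_write` and `choose_sync_source` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ServerState:
    name: str
    last_write: datetime  # hypothetical: time of the newest committed write


def choose_sync_source(a: ServerState, b: ServerState) -> ServerState:
    """Return the server whose data should overwrite the other's."""
    if a.last_write == b.last_write:
        # Equal timestamps give no basis for a choice, so do not guess.
        raise ValueError("cannot tell which server holds the newer data")
    return a if a.last_write > b.last_write else b
```

Without a record like that, it really is a coin toss.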

12
0
Anonymous Coward

Re: SFTIII

Had a couple of instances at a council. First one had one physical server die (hardware failure) and it ran better with one server than it ever did with two.

Second one was just being installed, using that new-fangled Netware 4 SFT, and it nicely decided to get involved in a reboot-crash cycle, with both servers alternately ABENDing, restarting, joining the SFT III cluster-thingy just before the second server crashed and repeated the cycle. Users didn't see much of a problem, but it was a sod to fix.

5
0

Customer sites

Had to fly to Jersey to do a customer upgrade

Solaris upgrade and application upgrade.

Specified to customer that there must be a full backup before I got there on Saturday morning

Got there Saturday morning, asked about the backup, finally got hold of their Unix admin, nope, didn't do it.

Okay!

On site customer said do the upgrade anyway

Okay!

Disk 1 of 2 - whirr whirr whirr - OK

Disk 2 of 2 - whirr whirr, disk read error

Oh shit!

Retry, whirr whirr, disk read error

<swear swear swear, no backup, swear swear>

remove CD, look at scratches, customer doesn't have another copy

Oh shit.

Bit of saliva, bit of cleaning, bit more saliva, bit more buffing

Put disk in, whirr, whirr whirr (20 minutes or so later) - done.

Seat of the pants upgrade with customer standing there too!

Was also their Solaris disks!

Another customer (in Jersey oddly)

IT Manager "you broke our system"

me "it was OK on the weekend it was upgraded" - This was wednesday

IT Manager, "you have to come over fix it now, rant rant rant rant"

Me - immediate flight to Jersey

IT manager's manager, flies over from Guernsey

All in computer room at console

dig through events, log in on wednesday, 5 minutes later, errors

All look at IT Manager.

Sheepishly admits he logged in and "done something"

Irate Guernsey Manager

2 return flights bought at no notice and one emergency consultancy day shows IT manager in bad light.

5
0
Anonymous Coward

Another interesting one, upgrading an IBM Power server with new firmware, and it failed, borking the machine. Turned out that the check processor that solely existed to check that things were working OK, wasn't working so the machine failed because it couldn't check that it was running OK !

There must be a way to start without the check processor so we could restart the firmware upgrade process. IBM tech discovered that process at 4:30 am (started at 11 pm) on page 400 or something like that in Appendix XII of the manual. After that it took a mere 2 minutes to boot, reload the firmware, and restart and all was go.

But a tension-filled 5 1/2 hour wait in between.

6
0
Silver badge

You got to know when to run

Just out of high school I worked briefly at a steel mill in the electrical shop. They had these "little" 1 to 10MW aux generators that ran on gas to handle peak loads. Part of my job was to synch these to mains bus before closing breakers. I failed, and one of the synchronous machines ripped off its mounts and left the shop. For the most part it went dark. I quit real fast and did a GTFO before the metal workers - paid by the piece - arrived to perhaps literally tear me from limb to limb.

5
0
Silver badge

Thought for a moment it was the SFT II failure I encountered

(which I mentioned previously) where the customer *didn't* have a backup, and an engineer had to sit on site for a week recreating the system from printouts..

There was also the customer who really knew their stuff, and had asked their supplier for a particular HP tape drive. The tape drive arrived, and Did Not Work. Looking at it the customer noted that this wasn't an HP tape drive 'that's ok' said the supplier 'it's actually exactly the same drive, just not HP branded'. 'No', said the customer-with-clue, 'the HP drive has a special chip which enables it to work in this server, and your compatible version does not. Please ship the genuine part'

4
0
FAIL

Self harmed

What I did in the 80s was very similar. Had spent a good few man months working on a very detailed, fully illustrated User Guide for what became our Classic BiT BOPPER Audio Reactive Video Entertainment System: well reviewed and well loved, but a commercial failure because it was TBEX (Too Bloody Expensive) during a crashing late 1980s economy.

One day, I fired up the PC or whatever it was I was using to write the user guide on to do a backup. This involved inserting the 3 or 3.5" floppy disk into the machine's drive and copying the Word file on top of the backup file. Simples!

Errr...

I dragged the user guide file onto the backup floppy. Grrdd grrdd grrdd grrddd etc. Done. (I recall that's the noise they made back then.)

Later, when I went to open up the working user guide to wrap it up prior to launch, the Word file was empty, just a few lines of text.

As one does during such a crisis, I tried to understand what was going on. Flashes of my life and hard work went past as I tried to mentally deny I had lost months of effort. A corrupt file, virus etc?

I then got the backup out, it was the same as the working file!

On my PC, the windows I opened up were identical, as was (of course) the name of the file inside.

I had dragged the WRONG file from the WRONG window, and overwritten my working file with a backup from months ago, that was in fact blank except for a few seconds of typing to open up the new Word file. I had for whatever idiotic reason, probably to do with time passing quickly and being busy, procrastinated on making full on periodical backups. I had lost everything, including the diagrams I had painstakingly drawn.

I had to re do the whole user guide by copying from hard copy I had made a few weeks earlier of an earlier version, although I stopped about 75% way through when it became apparent the product itself was doomed due to the economic crash of the late 80s/1990.

The upside is we went on to develop a way cheaper compact version based on the Atari Falcon, and its user guide was published both in print - AND online, one of the first user guides ever published online. Here you go, it's still there in lovely 1918 style html...

http://www.tecterran.com/svbb/userguide/

(For the record, people are still using this machine today for the retro visuals.)

I learned a few lessons, that are less relevant today in a world of online publishing and dynamic cloud backups/syncing etc: a) Don't procrastinate in backing up stuff. b) Name your folders/windows differently!

(I have never made such an error since.)

In a way, what happened was probably a divine entity telling me I was wasting my time anyway, being the product was doomed.

5
0
gb2

SFT III was a bit difficult to get right.

I remember finding driver issues with the first round of the MSL cards, random disconnections under heavy load.

Having fixed that I remember Arcserve continuing to be a PITA. Nobody wants a mirror disconnect during the backup.

If memory serves, the abend message was:

Richard Keil Memorial Abend #27

0
0
Silver badge
Facepalm

One from this weekend

Migrating a customer from Windows XP/Outlook Express to Linux/Thunderbird. I'm still a bit hazy on the details, but it seems to be the difference between POP3 & IMAP.

Imported the Outlook files into Thunderbird just fine and got Thunderbird talking to the ISP's mail server. Then the Good Idea Fairy suggested moving the migrated Outlook Inbox contents into the "real" Thunderbird Inbox. Great idea G.I.F., the customer will be happy everything is organised the same! But it seems that with IMAP, those emails are no longer on the server, so it duly synched the Thunderbird Inbox to remove all the "deleted" emails.

Luckily the "customer" is my Mum, so I didn't get sacked. I just have to redo the import again.

5
0
Silver badge
IT Angle

water, water, everywhere!

Decades ago I had an "in between" job as a building maintenance guy, working graveyard shift. One morning I went to turn on the A/C unit, but when I started the cooling tower pump, someone had shut the valve on the inlet of the A/C unit [only one guy ever did that]. It broke several 12" diameter PVC pipes running around the building, dumping tons of water into the parking garage. Yeah, the switch for the pumps was on the opposite side of the building from the A/C unit with the valve. I didn't get fired, though. But the A/C was down for several days [fortunately weather wasn't bad, and regular ventilation was sufficient] while the pipes were all re-done. With double-thickness pipe so it wouldn't happen again.

ok NOT related to I.T. hence icon.

3
0
Silver badge
Meh

Break a mirror? 7 y bad luck?

I remember one of my first IT jobs where I supported POS systems. (in both senses of the acronym) Discovering early into the job that there were tape drives but only a few, filthy, incomplete backup tape sets, and no meaningful backups being done, I pushed the CIO to let me buy backup tape sets for all locations. Over $6K worth of tapes were bought. Subsequently I had some system failures and discovered that even with new tapes and perfectly working drives, I could only recover data about 1 in 5 times. (and this was when the closing store manager remembered to type "backup" at the login prompt before leaving)

Add to that the daily backup process also mirrored the primary drive to the inert secondary (that was only used when the primary failed) in the wee hours of the morning. It would cheerfully mirror a failing, corrupted drive, and wipe out the last night's perfectly good backup on the secondary. To change the drive involved swapping drives and setting jumpers, then relicensing the software, which was tied to the drive's serial number. I drove 100 miles through a blizzard at least once to perform this asinine, 15-minute process.

2
0
Silver badge
Pint

Mirror C: to X: first

HDDs have never been that expensive that a server couldn't have several.

0
4
Silver badge

Re: Mirror C: to X: first

Oh, I dunno. In 1986 my Sun server had a bottomless pit of a drive for user space. It was a 300 megabyte CDC Wren IV SCSI drive. It cost US$14,000 ... in 1986 dollars. (The "user" was a database archiving network statistics, if anybody's wondering.)

7
2
Silver badge

Re: Mirror C: to X: first

That assumes you can sensibly attach extra drives to a server, which isn't a given. Modern servers all have fast USB; historically you'd better hope there's external SCSI, spare drive bays, or similar.

5
0

Re: Mirror C: to X: first

"historically you'd better hope there's external SCSI, spare drive bays, or similar."

My first server was a 100MHz P1 bought from university surplus for $10, in 2000. A Compaq with all kinds of weird custom parts inside (memory and processor on a riser card!), and the boot disk was a SCSI hard drive too small for practical use. So we added a second drive to an available IDE port. No extra drive bays - no problem, just zip-tied it to the existing drive bays. Ran like that for 5+ years, never had any problems with it. (Eventually replaced with a P3, which I am currently in the process of replacing. It's amazing how low-powered a small web/mail/file server can be.)

3
0
Silver badge

Re: Mirror C: to X: first

Sometimes you're lucky, but if the port is on a SAS/SCSI backplane IDE/SATA isn't an option.

I recently copied VMs from one ESXi server to another, and initially thought I'd re-use the sole SATA connection intended for optical drives to directly copy between disks. It turned out that whilst there is an SATA connection, there is no SATA power feed or way to extract power from the custom redundant PSUs.

Resorted to an external USB to SATA dock. Previous versions of ESXi had fought tooth and nail against USB storage, but ESXi 6.5 was almost straightforward (once the USB service had been stopped, and various obscure commands run to identify the long GUID/LUN identifier for the external disk and mount it).

1
0

I was on the other side of planet earth all on my own, no-one to call when things got ugly. I was doing some remote work on a server in the country I was in. For a change, I thought I would have a go at using Windows 3.1 Winfile instead of DOS Norton Commander, but over the 14.4k modem the screen updates were really-really slow. "Oops" I clicked delete and then OK for it to delete all files on the OS/2 server. It was like a slow motion train wreck, nothing I could do, it slowly deleted everything in front of my eyes. The server I had deleted all the files from was core to the data communication system; all other servers of the data communication system started up from it. Complicated stuff with many configuration files that were being regularly updated whilst debugging the new unreliable system, no regular tape backups; just the odd occasional one. The system kept running...phew! All I had to do now was book a flight to the other end of the country the following day. I had to get there before the next system crash, as crashes happened most days being a new system. Attempt to restore from backup, if complete enough, if good enough, if current enough to restore the folders, configuration files, then a very brave attempt of a complete system restart. Sphincter was like a goldfish mouth for 24 hours... got it going! If I hadn't, I don't know what I would have done.

5
0

SFT wasn't so bad

It was clever stuff - Fault Tolerance all done in SW on old (generic) kit with Intel single CPU speeds measured in the range of maybe 25-70MHz. So it was essential for the 2 machines to be identical if they were to operate in lock-step with each other - there simply weren't enough spare CPU cycles to deal with Hardware Abstractions and translations.

Fault Tolerance (more or less) went away for a loooong time on generic Intel kit until VMware came along and virtualised everything but with the benefit of CPUs running at 50x the clock speed and multiple cores.

0
0
Anonymous Coward

I remember SFT at a Novell marketing session. The company I worked at only had a single server. It had a failed hard drive twice in my time there, both on a Friday. The first time our vendor met their SLA with a replacement drive, but not with a restore by a Netware expert. I, being a fresh minted CS grad with only 4 months Netware experience, had to pull a weekend miracle to have it reinstalled and restored for Monday morning. The second time I had a bit more experience and was restoring from my redesigned backup system. I had fired the vendor for their dropping of the ball multiple times. I had ordered a new drive with overnight shipping via UPS. It was coming through California which had a major earthquake earlier that week. It arrived Sunday morning instead of Saturday morning. I was still able to have a working fully restored system by Monday morning. I also had UPS regional manager in my office over the lack of cooperation on their part to get the shipment there in spite of it being at the local UPS warehouse by Saturday afternoon. In both instances, I tested restores regularly prior to the event. My redesign was due to "lessons learned" from the first event. That redesign and my greater experience allowed me to do the same task in half the time (actually less). "Lessons learned" from the second event allowed me to redesign yet again with the goal to achieve an overnight complete system restoration given working hardware to which to restore. Never had to actually do one, but I simulated it multiple times to test the design, earlier tests failed, fix and redo.

0
0
