Sysadmins: Everything they told you about backup WAS A LIE

So, you're a sysadmin, slaving away to maintain the impossible 100 per cent uptime demanded by The Powers That Be. How many common myths about storage do you really believe? More to the point, how many of these common myths do your bosses believe? Of course, it really doesn’t matter which backup vendor you use - the myths are …

COMMENTS

This topic is closed for new posts.

Silver badge
Pint

Some extra points

* Ask yourself: how will I cope with vendor lock-in?

* What is backed up is the responsibility of the data owner, not the IT department. Are the other departments aware of the costs and time involved? Ask them to reconsider what they need compared to what they want.

* Add the time it takes to bring the offsite backup tapes onsite into the restoration SLA.

* VERIFY YOUR LOGS... EVERY DAY

* Inform people when backups fail - it is, after all, their data.

* ALWAYS have more than one backup medium (hard disks + tapes). Example: hard disks onsite + tapes offsite.

* Ensure that new network shares are added to the backup selection.

* Have you tested restoring your backup tapes at another site, on other hardware, on another server? If your building and your hardware are completely destroyed, you will be glad that you can recover your data elsewhere.
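
A rough way to put numbers on the offsite-tape point in the list above. This is only a sketch: the stage names (retrieval, load, restore, verify) and every figure you feed it are assumptions for illustration, not taken from any real SLA.

```python
from dataclasses import dataclass


@dataclass
class RestorePlan:
    """Hypothetical stages of a restore from offsite tape, all in hours."""
    retrieval_hours: float  # courier brings the tapes back onsite
    load_hours: float       # cataloguing and loading the tapes
    restore_hours: float    # the actual data restore
    verify_hours: float     # checking that the restored data is usable

    def total_hours(self) -> float:
        """End-to-end time, which is what the SLA should be measured against."""
        return (self.retrieval_hours + self.load_hours
                + self.restore_hours + self.verify_hours)

    def meets_sla(self, sla_hours: float) -> bool:
        """True only if the whole chain, not just the restore, fits the SLA."""
        return self.total_hours() <= sla_hours
```

With made-up figures of 4 h retrieval, 1 h load, 6 h restore and 1 h verify, the restore alone fits an 8 hour SLA but the full chain (12 h) does not, which is the point of the bullet.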

4
0
Silver badge

But who gets it in the neck?

> What is backed up is the responsibility of the data owner not the IT Department.

That's all very well. The problem is that the universal impression of pretty much everyone in business is that if it runs on a computer, it's the IT dept's fault when it fails. No matter how much you'd like to argue about charters, SLAs, job descriptions or anything else, these will all be perceived as excuses trying to weasel out of an IT failure and blame the problem on someone else.

In fact if you are successful in getting this point across you could easily find you've just talked yourself out of a job. [ MD thought process: Well, if IT aren't responsible for this, what are they doing ... maybe it's time to "do more with less" ]

3
0
Silver badge

Re: But who gets it in the neck?

Pete, I understand completely what you are saying. If a counterargument is required, it all boils down to "finance".

The accounting department has requested that 1 TB of data "must" be backed up every day (their choice, not yours). OK, no problem: the price of disks/tapes/hardware will be X per year.

My role as the IT guy is also to tell them that the alternative may be to back up only a subset of that data (everything else can be archived). In that case only 20 GB of data will be backed up daily, and the cost will be Y per year, which will be lower if you are careful. You gain not only in cost but also in backup time (important when problems arise) and in recovery time (vital for DRP scenarios: it gets them up and running quicker). BUT it "must" be the accounts department that decides which data is vital and which is not.

The bean counters understand figures far better than technology.
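
Since figures are what gets through, here is a minimal sketch of the time side of that argument. The throughput is an assumption (pick whatever your kit actually sustains), and backup_hours is a hypothetical helper, not part of any backup product.

```python
def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to move data_gb (decimal gigabytes) at a sustained rate.

    Ignores compression, deduplication and tape changes, so treat the
    result as a lower bound rather than a promise.
    """
    seconds = (data_gb * 1000) / throughput_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    # Assumed sustained throughput of 100 MB/s, purely for illustration.
    for label, size_gb in (("1 TB daily", 1000), ("20 GB daily", 20)):
        print(f"{label}: {backup_hours(size_gb, 100):.2f} h")
```

At an assumed 100 MB/s the 1 TB job takes close to three hours and the 20 GB job a few minutes. Restores follow the same arithmetic, usually at a lower rate.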

2
0
Silver badge

Re: Some extra points

You need to back up the working environment, not just the data.

Our data is totally backed up and we can restore it instantly to our server farm, which is currently:

a, underwater after we were flooded

b, in the middle of an area of the city that is cordoned off because of the bomb threat

c, working perfectly but inaccessible to all of our staff because of the above

0
0
Anonymous Coward

I've been in Storage, Backup & Recovery, data protection and DR for about 16 years or so, in industry (financial services) and now as a software and solution designer, and I totally agree with everything you said.

Which is a shame, because I'm feeling rather contrary today, but I can't pick a hole in your article. People need to design, build and fund recovery solutions. The sooner we stop talking about backup and start talking about recovery, the better.

10
0
Silver badge
Trollface

If you want to nitpick ... the numbering is off.

Numbering should be machine-generated.

0
0
Coat

Numbering

The repeated point *was* about replication....

1
0
Anonymous Coward

@AC 12th July 2013 10:02 GMT

Precisely!

Actually we had a massive database corruption yesterday, thanks wholly to Windows Cluster & SQL Server not playing together. A cluster fail over event for no apparent reason resulted in an SQL Server database corruption. The event occurred DURING the daily incremental backup. The suspect database went into recovery on cluster restart, and we had no idea how long that would take (In the end it took 13 hours).

Sadly, plan B, restoring the database, meant recovering 1 full backup+1 incremental backup+23 hours and 53 minutes worth of log files. Unluckily, yesterday included a massive database grooming exercise removing a few hundred million rows, so the log files were larger than normal :(. After the restore we would have to post everything that didn't get posted after the failure.

This, I submit, was Murphy at work. There was no worse moment for it all to go pear shaped.

It took 10 hours to recover the database and another 4 hours to bring it up to date before I could start delivering data to customers.

We now know our worst case scenario :(
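
For anyone who wants to reason about a restore like the one above before the bad day arrives, here is a minimal, generic sketch of working out which backups have to be applied and in what order. It is not how SQL Server itself does it. The Backup type, the kind labels and restore_chain are hypothetical names, and it assumes plain incrementals (every one since the full is needed) rather than differentials.

```python
from datetime import datetime
from typing import List, NamedTuple


class Backup(NamedTuple):
    kind: str           # "full", "incremental" or "log" (labels are assumptions)
    taken_at: datetime


def restore_chain(backups: List[Backup], target: datetime) -> List[Backup]:
    """Return the backups to apply, in order, to recover up to `target`."""
    # Only backups taken at or before the target time are of any use.
    usable = sorted((b for b in backups if b.taken_at <= target),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup taken before the target time")
    base = fulls[-1]
    # Every incremental since the most recent full is required.
    incrementals = [b for b in usable
                    if b.kind == "incremental" and b.taken_at > base.taken_at]
    last_point = incrementals[-1].taken_at if incrementals else base.taken_at
    # Log backups are replayed from the last full/incremental onwards.
    logs = [b for b in usable if b.kind == "log" and b.taken_at > last_point]
    return [base, *incrementals, *logs]
```

The length of the returned list, and the size of the logs in it, is a fair first guess at how bad the worst case is.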

1
0
Anonymous Coward

Re: @AC 12th July 2013 10:02 GMT

You were, intentionally albeit perhaps reluctantly, doing [non-routine thing A] and [non-routine thing B] and maybe [non-routine thing C].

Yet you seem surprised that [non-routine thing D] happened.

Is it possible that [non-routine thing D] occurred specifically *because* of all the other non-routine stuff going on?

Sometimes people are unlucky. Sometimes they're just waiting for the inevitable.

0
1
Anonymous Coward

Re: @AC 12th July 2013 10:02 GMT

Daily Incremental - routine (coincident with cluster event)

Weekly purge - routine (completed long before cluster event)

Daily Ops - routine (nothing unusual or extra was running)

Cluster failover due to LOCK exhaustion (64 bit 2008) - definitely non-routine and unpredictable

Thus, you are in fact incorrect.

Predicting when SQLServer will go tits up because of lock table exhaustion is a tricky business, impossible I would suggest. Causing it to happen coincident with a regular differential backup would be impossible to do on purpose, and statistically in the fabulously unlikely bucket by accident.

The obvious answer is more incremental backups. I already generate 1/2 TB of backup files/day - there are limits to everything, including disk space.

0
0
Bronze badge

scary but true

Makes one break out in a sweat thinking about it.

0
0
Anonymous Coward

Point 3 is wrong

You can either deploy a new server, or you can restore the O/S. Neither option is definitely right or definitely wrong, but to say, "you’ll never use it to restore a system" is just plain wrong.

2
0

This post has been deleted by its author

Re: Point 3 is wrong

It must be twenty years since I restored an OS. When it comes to metal, you need an OS to be able to restore anything, so why restore the OS on top of a working OS? Disk images for virtuals are another kettle of fish, I guess.

2
0
Anonymous Coward

Re: Point 3 is wrong

" you need an OS to restore anything, so why restore the OS on top of a working OS"

Because the lightweight single function OS you use to do the restore isn't the same as the OS you use to run the application, perhaps?

Back in the days when I used to care about these things, you booted DOS or a minimal Linux to do the restore, even if the eventual target environment was a Windows box.

Does it not work like that any more?

Or are you seriously suggesting installing Windows to do the restore and then using the same Windows to run the application(s)? Anyone see any problems with that?

2
0
Silver badge
Meh

Re: Point 3 is wrong

My view is that it depends entirely on how much has changed in the OS since it was installed, and that is probably determined by the function of the system being backed up.

I've worked in an environment where every server in the server farm is a basic install with scripted customisations, with all the data contained in silos that can be moved from one server to another (the bank I used to work for had been doing this on a proprietary UNIX since the turn of the century, before Cloud was fashionable). These systems can be re-installed rather than restored.

I've also worked in environments where each individual system has a unique history that is difficult to replicate or isolate. These systems need to be restored.

One example of this latter category is the infrastructure necessary to reinstall systems in the former category!

There just is not one fixed way of doing things. Each environment is different.

2
0
Bronze badge

Re: Point 3 is wrong

Rebuilding a server only works if you know EXACTLY how the current one was built - not just the OS config but the application stack and EVERYTHING! And I know too many companies where that simply isn't the case.

2
0
Anonymous Coward

Strongly Agree: Backups don't really matter.

Restores matter a lot more. Or recovery, if you'd prefer that word.

Rarely has so much wisdom been concentrated in so few words in The Register.

More of this kind of thing, please.

[I'm not AC 10:02, I've never been in the City, but we're apparently thinking very similarly on this subject]

4
0
Gold badge

Re: Strongly Agree: Backups don't really matter.

"Rarely has so much wisdom been concentrated in so few words in The Register."

We can probably precis it down even further to just point (2), though. Really the other 9 follow logically from there.

There is no such thing as backup; there is only restore.

1
0

Re: Strongly Agree: Backups don't really matter.

I can condense it down even further.

Nobody ever gets fired for backup failures. They get fired for failing to restore.

.

0
0

Well, yeah.

Apparently, health care has recently been revolutionized by use of check lists. These are sets of must-do items that, considered individually, are obvious, until you put them on a check list so that they don't keep getting missed.

So, yes, backups are for restoring, when you have to. And when you have to, you have to. If you aren't ready to restore then you aren't ready. And you don't know for sure that you have a backup ready to restore, until you restore it.

Backing up is a tedious inconvenience - until you need to restore.
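
As a sketch of what "restore it and find out" can look like for plain files: restore to a scratch location, then compare checksums against the original. The function names here are made up for illustration, and it reads whole files into memory, so for large files you would hash in chunks instead.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root to its SHA-256 hex digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            name = path.relative_to(root).as_posix()
            digests[name] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_mismatches(original: Path, restored: Path) -> List[str]:
    """List relative paths that are missing, extra or different after a restore."""
    before, after = tree_digests(original), tree_digests(restored)
    return sorted(name for name in before.keys() | after.keys()
                  if before.get(name) != after.get(name))
```

An empty list means the restored tree matches file for file. It says nothing about permissions, open files or application consistency, which is where the real surprises tend to be.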

I suppose we probably aren't talking about desktop PCs here, but there is that factor too - as far as I know, Microsoft Windows still puts itself and its users' data all on one disk partition. If you want to back up the whole system (which seems like a -good- idea to me for a fast restore), you have to back up the -whole- system - unless you do partitioning yourself, which -is- easier nowadays.

Another factor, though, is that if what I once read about Microsoft's way with GPT applies, then even a fairly straightforward design of disk with separate partitions for useful things is liable to be littered with tiny extra partitions for Microsoft's own amusement.

But then, restoring the Windows partitions probably won't get your PC running again in any case.

I think I may be saying that some things you just can't back up.

6
2
Silver badge

Re: Well, yeah.

Checklists - pilots with tens of thousands of flying hours still use checklists. It is a very simple, efficient system: even when a job becomes habitual, you still check, and check your assumptions.

2
0
Holmes

Re: Well, yeah.

Aviation was the original home of the checklist for complex but routine operations. The use of them came out of a crash that nearly bankrupted Boeing.

More at http://www.atchistory.org/History/checklst2.htm

0
0
Silver badge
Trollface

Re: Well, yeah.

Checklists gave Hitler the key to Europe!

0
0
Anonymous Coward

Re: Well, yeah.

And when people ignore the checklist or trust the tick in the box rather than actually looking, Bad Things eventually happen.

Design engineer: Have we got a simple visual indicator to tell if the bonnet is latched closed? No problem, costs too much anyway, people will always check it the hard way.

Service technician: Is the bonnet properly latched closed? Y/N

Post-service inspector: Has the technician *actually* ensured the bonnet is properly latched closed (not just ticked the box)?

Operational crew: Have the technician and the inspector *actually* ensured the bonnet is properly latched closed (not just ticked the box)?

Not talking about car bonnets, but aircraft engine cowlings, and what happens when these four people (plus others) all fail in sequence, and the problem is known about and largely ignored for two decades:

http://www.flightglobal.com/news/articles/dual-cowl-mystery-at-centre-of-ba-a319-probe-386495/

0
0

Inspector?

I think I understand correctly that the pilot always personally walks around the outside of the plane looking for anything wrong. Wouldn't you, too?

The incident that your link is about seems to be a relatively rare case of this failing, because - as hinted - it can be difficult to see whether these engine door things were properly closed.

But I bet they're still checking extra-carefully now.

0
0

This post has been deleted by its author

Anonymous Coward

Re: Inspector?

"I think I understand correctly that the pilot always personally walks around the outside of the plane looking for anything wrong. Wouldn't you, too?"

Indeed, and that was emphasised particularly after the first few incidents of this nature. But apparently the latch in question is on the underside of the engine, and viewing it would apparently involve bending down. Apparently nobody's yet thought of using a mirror-on-a-stick as used to be used for under-car bomb inspections. A microswitch cabled to a light in the cockpit (a long way away) and/or to an input on the engine control unit (a few feet away) is out of the question too apparently, given all the other failsafes which have to go wrong for one of these to escape.

I'm not that sure it deserves to be called relatively rare - check the history of airworthiness directives etc.

How does this relate to backup?

Well, it adds a supplementary to Ken Hagan's terse but entirely appropriate summary posted at 17:31

The long version of the supplementary is "What can go wrong, will go wrong. However many failsafe checks you build in, someone/something will one day defeat each one. Eventually, if you repeat the sequence enough times, there will very likely be an occasion where someone/something will defeat all of the checks in a way that was probably entirely foreseeable but considered infinitely improbable. Sometimes it won't matter. Occasionally it will. Are you feeling lucky?"
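
The arithmetic behind that can be sketched in a few lines. The big assumption, labelled loudly, is that the checks fail independently of each other; in real life they often don't (the inspector trusts the technician's tick), which makes things worse than this suggests. The miss rates are invented numbers, not data from any incident.

```python
import math
from typing import Sequence


def p_all_checks_fail(miss_rates: Sequence[float]) -> float:
    """Chance that every check misses the fault on one occasion (independence assumed)."""
    return math.prod(miss_rates)


def p_at_least_one_escape(miss_rates: Sequence[float], repetitions: int) -> float:
    """Chance that, over many repetitions, at least one fault slips past all checks."""
    return 1.0 - (1.0 - p_all_checks_fail(miss_rates)) ** repetitions
```

With four checks that each miss one time in ten, a single occasion gets through all of them one time in ten thousand. Repeat the sequence ten thousand times and the chance of at least one escape is about 63 per cent.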

I'll leave Ken to summarise again, if he'd like to.

0
0
Bronze badge

'as far as I know, Microsoft Windows'

Cripes. And it's already got upvotes too!

Look, if what you want to say about Windows can be tagged "if what I once read about Microsoft's way", then it isn't worth writing, no matter how many equally ignorant people there are to agree with you.

Anyway, responding to your central point, 'Microsoft Windows' doesn't put itself and user data all on one disk partition.

Users do.

Like on all of the *nix based home and office systems I have ever seen, including those down on the factory floor right now. In contrast, ALL of the enterprise level workstations I have worked with for the last 15 years, from around 1998 to now, have put the OS and the user's data on separate disk partitions. So shoot me. I've never worked at enterprise level with Linux workstations. At least I don't make dumb comments like "as far as I know, OSX still puts itself and its users' data all on one disk partition."

And you NEVER restore the OS partition. If something goes wrong with that, you just re-install.

1
0
Anonymous Coward

Re: 'as far as I know, Microsoft Windows'

I'm having trouble understanding your point here.

I've been using and sysadminning NT since the MSDN pre-release (1993?), and still do with its successors.

It is hard (to the point of insanity) to keep a consistent system backup of an individual system, unless user data and OS data are backed up at the same time. Didn't say impossible (you want to move My Documents etc, feel free, but there's other stuff too). You don't see the inconsistencies till the restore, of course.

I've been using and sysadminning *NIX (and VMS) rather longer. And Linux almost as long (mostly Suse, occasionally others) since, well, whenever Suse 8 was (2000ish). Yes I'm a dinosaur.

Maybe it's just me, but my experience has been that with those OSes, getting a consistent restore (eg by keeping "/" separate from "/home" on *ix) is trivial in comparison with doing the equivalent on Windows.

YMMV.

2
0
Anonymous Coward

Re: 'as far as I know, Microsoft Windows'

That's because Unix, and later Linux, was written slowly and carefully to do the job properly. By geeks for geeks.

Windows has always had the marketing and bean counters overriding some extremely clever engineers because they so desperately yearned to see that Windows logo on every computer in the world.

1
1

And maybe the most important one: backup/restore is not intended to address or solve archiving or data retention requirements... Maarten

4
0

Utilities

IT is a utility. That gives it certain weird and unpleasant characteristics, and we need more thought on this topic. I'd write a book but I have contracts to run.

1. A utility is an asymmetrical service. In a utility, when you spend a fortune, innovate brilliantly, bust your gut to make things run perfectly, then save the business from a problem it didn't even know it had, you get this well-known result:

Nothing.

2. When you take your eye off the ball for one minute of the half million minutes in a year, or when something breaks that's within your remit but beyond your control, or when you make a dumb mistake, you get this well known result:

Shit.

Ever wondered why your job is so thankless? a. You work in a utility, and these are the only two possible results of your work. b. The five nines of Nothing you have produced in no way shields you from the amount of Shit that will rain on you when something goes wrong.

3. Utilities are easy to shave costs from. Why spend all this time and money on Nothing? If you spend less, you still get Nothing, at least for a while. This means that in return for the Nothing you produce as a utility provider, what you can expect from the organization for the production of Nothing is, therefore, Less.
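
For the record, the "half million minutes" and "five nines" figures above work out as follows. A tiny sketch; allowed_downtime_minutes is just an illustrative helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 in a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (5 means 99.999%)."""
    return MINUTES_PER_YEAR * 10 ** -nines
```

Five nines leaves a budget of a little over five minutes a year, so taking your eye off the ball for one of those minutes really is a fifth of the allowance.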

IT is not special - this is true at water companies, chicken farms, and banks, and all the other things humans have got operationally good at.

IT has responded by trying to enable things and innovate. That's nice, and probably necessary, but today's innovation becomes tomorrow's baseline. Now you have to work harder to produce Nothing. At best you might be able to argue for more resources. But not for long.

DR and security are the most utility-ish part of IT because most of the effort manifestly produces Nothing, by design. That's why DR and security have to resort to a bit of hyperbole once in a while to get proper funding.

It's systemic.

36
0
Gold badge
Thumb Up

So backup is backup and everything else (which people might use as backup) is not

I know, when it's put that way it's obvious, except when it's not and admins (or perhaps their PHBs) think they can get away with using something like backup.

Excellent article (and some excellent comments); the part about "utilities" would explain a lot.

Thumbs up for a useful reminder of some things some people may have forgotten.

1
0

Completely agree...

Except for this:

" If you feel the need to back up an operating system several thousand times… feel free, I guess, but you’ll never use it to restore a system."

Well that depends. I have an old OS on a machine, which won't be upgraded because that will be a lot of work for somebody - likely none of the apps that are on it will work. So in addition to backing this machine up using *some piece of backup software*, I also back it up using ufsdump. If I ever want to restore this machine on bare metal, I will use that dump to install the OS. I know it will restore exactly as the box is now. I know this because I have tested the restores in VirtualBox.

There is practically no chance of finding the original OS install disks, and the long-forgotten patches and tweaks, that have been applied to this box. So dumping and restoring the whole thing is the best way.

Of course the box should be upgraded/updated. But it won't be, unless it breaks.

5
0
Silver badge
FAIL

Re: Completely agree...

You cannot be serious! This application/system is a major point of failure, and by your own admission, if this box fails you have no hardware or software which can run this application!

Even though you are religiously backing up both the OS and the application, a single hardware failure could leave this system broken until you can urgently migrate it to a different host. If you can't restore a backup, it isn't a backup.

1
3

Re: Completely agree...

I am serious.

We can restore the backup - I test this from time to time in VirtualBox. If the worst came to the worst that is exactly what I would do as an interim measure.

As and when it fails, the powers that be will understand its importance and maybe then provide the time and money to update it.

3
0
Silver badge
Boffin

Missed a key common fail

A backup is NEVER an archive. (regular readers of my comments will know I bang on about this)

A backup is there to allow you to recover (as you state).

An archive is a primary copy in the information lifecycle.

Two entirely different concepts.

3
0
Pint

Re: Missed a key common fail

A backup is NEVER an archive.

EXCEPT: When you need one, the other is much better than nothing.

0
0
Silver badge

A tale of restoring..

I back up my whole computer nightly to a headless server.

The other week it died. Never mind why, I just got fed up with waiting for it to do something mindless and yanked the power cord. Couldn't discover which bit of it had got trashed. Took a view: quicker to reinstall an upgraded OS and then recover all the parts from the backup.

And it worked.

Bit by bit everything came back, with the old machine's image mounted on a spare directory and the crucial bits copied back across.

Nothing was lost, and whilst I echo the point that you don't need to back up everything, since in this case the bulk of the data was on the server anyway, backing up the whole OS was not in fact a huge hardship.

Let's say that disk is cheap, working out what to put on it is not.

1
1
KPz

And when designing your storage...

...make sure you allow for how you're going to back it up.

I've seen 12 TB CIFS volumes configured, which make backups interesting.

1
0

Nice to see a decent sysadmin article on this site which isn't written by a petulant prick and isn't a glorified advertisement for some shitty product which only runs on Windows. Good stuff. Can we have more, please?

4
0
Silver badge
Boffin

Full and incremental backups

Usually you do a full backup and then incremental ones during the week. That's why you don't have to back up terabytes upon terabytes of data. Of course, you should also have another team restoring said backups on the DR platform, which serves both as a DR readiness exercise and as a test that the backup media is actually working.
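
A bare-bones illustration of the full versus incremental distinction, selecting files by modification time. This is an assumption-laden sketch: real backup products typically rely on archive bits, change journals or block-level tracking rather than a walk of the tree, and files_to_back_up is a hypothetical name.

```python
from pathlib import Path
from typing import List, Optional


def files_to_back_up(root: Path, last_backup_epoch: Optional[float]) -> List[Path]:
    """Pick files for a backup run.

    last_backup_epoch is None for a full backup (everything is selected);
    otherwise only files modified after that time are selected.
    """
    selected = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        if last_backup_epoch is None or path.stat().st_mtime > last_backup_epoch:
            selected.append(path)
    return selected
```

The saving on the backup side is paid back at restore time, when the full and every incremental since have to be applied in order.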

Ah, the woes of a certain company that found out their backups were worthless the day their Server went down...

0
0

So what would be the most efficient way of doing a system backup with full system restore? One of our customers is getting paranoid about server failure at the moment. The current server is running Win2008R2 for some main programs and data storage, with a Win2008R2 virtual server handling their Exchange. They are using around 45 GB on their C drive and about 100 GB on their data drive.

We have two QNAP 2 TB drives set up: one does a daily incremental backup and stays on site, and the other does full weekly backups of the whole system and is removed from site the following morning. We're using Acronis Backup and Recovery 11.5.

The customer wants the best way to get back up ASAP with minimal data loss if the server goes down. They are thinking of having another server built to mirror the existing one so they can swap to it if need be. I don't really know what would be involved in that, whether for a daily or a weekly snapshot.

Also, bear in mind their internet connection is only a 2 Mb line, so cloud backup would be useless.
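
To put a figure on "useless", here is a back-of-the-envelope sketch. It assumes the line is 2 megabits per second and that all of it is available for upload, which is optimistic; upload_days and the efficiency parameter are illustrative names.

```python
def upload_days(data_gb: float, line_mbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to push data_gb (decimal GB) up a line of the given speed.

    efficiency is the fraction of the nominal rate actually achieved
    (protocol overhead, other traffic); 1.0 is the best case.
    """
    bits = data_gb * 1e9 * 8
    seconds = bits / (line_mbit_per_s * 1e6 * efficiency)
    return seconds / 86400
```

For the roughly 145 GB described above (45 GB plus 100 GB), a single full copy over 2 Mbit/s comes to about 6.7 days in the best case, and pulling it back down after a failure is the same order of magnitude.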

0
0

This post has been deleted by its author

Boffin

Re: >AHEM<

Nope. He's implying that they *can* do that (due to the access they have to all the data), so you better choose people who *won't* do that.

For SMEs this probably isn't an issue - anyone in IT probably has admin access to every system anyway. But for big companies where different systems have different administrators, if you have a common backup team then they might be the only people with access to everything.

0
0

AntyEM

Folks in the application community know the best backup is one taken by the application immediately prior to the point of failure. Of course, it is always better to immediately restore the backup to validate the recovery. Unfortunately, precognitive backup software does not exist, just like service availability = 1. If storage software could tell you when your storage array was about to go TU or your datacenter EPO was about to be tripped, we really wouldn't need to worry about accidental data loss. I have experienced both. Here's the fix: get rid of the red button and store the data in more than one place. If the place burns down, tag the site as a fire hazard and move instead of replacing the wires.

0
0

It's very simple

you have user data backups

you have OS backups

OS backups require

1) Mirror OS to prevent issues with h/w failures

2) Alternate boot disks - copy of OS disk that is not online except when you make the backups

3) Bootable tapes/DVD if possible - off site

No, you cannot restore the OS from kickstart/PXE/Puppet etc. unless you are constantly refreshing that data.

0
0

Backups are tickbox crap for auditors

Recovery is what matters. Only after having done several offsite restores was this driven home to me. The old ways of doing stuff (file based / tape / only 'important' databases, the rest can be done manually / manually reconfiguring networking in the event of DR / anything involving Backup Exec / ignoring fast recovery of client infrastructure): all crap.

0
0

Hey Author, you didn't say "cloud" or "bare metal". Will you marry me?

4
0
