Back up all you like - but can you resuscitate your data after a flood?

When it comes to backups, two sayings are worth keeping in mind: "if your data doesn't exist in at least two places, it doesn't exist" and "a backup whose restore process has not been tested is no backup at all". There is nothing like a natural disaster affecting one of your live locations to test your procedures. I have just …

COMMENTS

This topic is closed for new posts.

Been there

We're on a much smaller setup but have been there with the VMs. Now I don't bother dumping the database, website or anything - we just dump the entire VM every few hours and back that up. The procedure is much simpler and uniform across all our systems, and when we've had hardware fail, firing up a VM on a spare machine is only a minute or two away.

2
0

DR - never important until you need it

Part of your DR plan is always about having a location to put it!

Also, no DR plan works the first time, so an actual disaster shouldn't be the first time you implement it. It should be tested until the most junior tech can make it work.

If they had an alternate data centre with 'gobs of storage', why wouldn't you have copies of the VM already stored and tested, ready to bring up and have the latest backup data restored to it?

Also, if you're having to build from scratch, why would you install a different OS version? Wouldn't you put in place the exact same version?

I was bitten by the PHP short tags deprecation a couple of months ago, but it was for a planned migration so it wasn't panic inducing.

1
0
Gold badge

Re: DR - never important until you need it

We had tested the DR plan by restoring to a copy of the VM. We thought we had a copy of the VM when we didn't. (The copy on the backup server was corrupt.) Our records indicated the VMs in use were the latest version - and they are - but they had gotten there by upgrading from previous versions; meaning the php configuration file was from an older version despite the binaries being up-to-date.
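One way to catch this kind of dependency before a rebuild is to scan the site code for short open tags, so you know whether a fresh default config will break it. A minimal sketch; the function name is made up for illustration, and it assumes `<?=` is acceptable on the PHP version in use:

```python
import re

# Matches "<?" that is not followed by "php" or "=".
# Assumption: "<?=" is fine on the target PHP version; adjust if it is not.
SHORT_TAG = re.compile(r"<\?(?!php\b|=)", re.IGNORECASE)


def find_short_tags(source: str) -> list[int]:
    """Return 1-based line numbers that contain a PHP short open tag."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if SHORT_TAG.search(line)
    ]
```

Any hit means the code depends on short tags being enabled in the PHP configuration, which is exactly the setting that went missing on the rebuilt VM.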

3
0

Re: DR - never important until you need it

That makes it tricky then. I would be highly concerned at corrupted backup files.

Might need to update your asset registry with application version levels as well as OS

0
0
Silver badge

Re: DR - never important until you need it

> We had tested the DR plan by restoring to a copy of the VM.

Well, you tested the backup plan. Doesn't sound like you had a full DR plan, one that identified services, risks, impact, allowed downtime numbers, etc. Put all that together and the missing bits tend to stand out.

I suspect you designed this from the bottom up, starting with the backups, and working out how to transfer them, how to restore them. That tends to focus your attention on what is in the backup, it's easy to miss what isn't in it.

Full tests are a pain, but are essential. I have a customer that swaps their production and DR centres once a month, back and forth. They take a short outage, in the small hours of a Sunday morning, at what they know to be off-peak hours, but they do know for certain that their DR site will be there for them when they need it. Given the size of their travel/hospitality business, they need that assurance.

4
0
Gold badge

Re: DR - never important until you need it

We worked from the bandwidth on up. We have a fixed amount of bandwidth every night that we can use for backups. X number of bits can be transferred every night. Backups are all about fitting everything inside that window.
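The arithmetic behind that window is simple enough to sketch. The link speed, window length and efficiency factor below are illustrative parameters, not figures from this setup:

```python
def max_backup_bytes(link_mbps: float, window_hours: float, efficiency: float = 0.8) -> int:
    """Bytes that fit through a link in one backup window.

    efficiency is an assumed fudge factor for protocol overhead and
    contention; 1.0 means the full line rate is usable.
    """
    bits = link_mbps * 1_000_000 * window_hours * 3600 * efficiency
    return int(bits // 8)


def fits_window(backup_bytes: int, link_mbps: float, window_hours: float,
                efficiency: float = 0.8) -> bool:
    """True if a nightly backup of this size fits inside the window."""
    return backup_bytes <= max_backup_bytes(link_mbps, window_hours, efficiency)
```

Anything that does not fit has to be dropped from the nightly set, compressed harder, or moved some other way.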

0
0
Anonymous Coward

Ouch...

....as soon as I read the line

"This information is made highly available to deal with hardware failure but is not replicated offsite."

I just cringed, knowing where this was heading.

You mention that bigger companies have more budget, but it's often the sheer scale that makes it impossible. For example, you mention a lab where you test your backups (by the sounds of it, daily). For us to do that would cost hundreds of thousands of pounds and need a dedicated team; we're talking terabytes of data each night (let's not even think about weekly and monthly backups).

Combine that with the server-to-sysadmin ratio.

A small place may have one or two admins looking after, say, 10 servers, and no doubt some network kit and end users as well.

Somewhere with 1,000 servers won't have 100x more staff; you'd be lucky if it has 10x more staff, and this is the bit the bean counters don't appreciate. If the shit hits the fan, there simply won't be enough people to invoke a full DR in a reasonable time.

So even if you are a big boy, getting everything running quickly is in the hands of automation. If that fails, then it's much more of a headache!

3
0
Gold badge

Re: Ouch...

The theory behind the stuff that isn't replicated off-site is that it is relevant only to that site. That means that if it all goes splork and we can't recover it something horrible has happened to that site and we're in to "contacting the insurance company to replace a site" anyways. By the time the site is back online the data in question will no longer be relevant.

2
0

Out-of-date OS?

It seems that most of your problems were due to running your live systems on an out-of-date version of CentOS, then reinstalling the backups on a more recent version. The solution is either to keep the live systems updated, or to document exactly which OS flavour and version works.

Not that I've ever done anything similar of course, oh dear me no...

1
1
Gold badge

Re: Out-of-date OS?

The live version is fully up-to-date (at least as far as Yum is concerned, and both the 5 and 6 series are under active support) but the original image was an older version. That means that the php config file was from the past - and allowed the short tags - but the binaries got updated over time.

1
0
Anonymous Coward

So, you didn't test your DR plan...

...what did you expect. Just like un-test-restored backups could turn out to be useless, an untested DR plan is a DR plan that in all probability will not work. Sloppy service management - see it all too frequently.

0
5
Bronze badge
FAIL

Re: So, you didn't test your DR plan...

If you'd read Trevor's replies, you'd know that restarting the setup should have gone pretty smoothly, if it weren't for a corrupted backup of the VM, which happened to be the only thing that wasn't tested regularly.

The devil is always in the details, but "sloppy"...? Anon Troll is Anon, of course..

5
0
Gold badge

Re: So, you didn't test your DR plan...

The bitch of it is the corrupt VM ended up being caused by a flaky RAID SAS cable combined with some flaky disks on the backup server. Not outright dead, but dead *enough* that things acted wonky. It has since been replaced.
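For what it's worth, this sort of not-quite-dead hardware is what a checksum pass over the backup copies is meant to catch. A minimal sketch, assuming a SHA-256 digest was recorded when the image was written; the function names are illustrative:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large VM images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(path: Path, expected_hex: str) -> bool:
    """True if the stored copy still matches the digest recorded at backup time."""
    return sha256_of(path) == expected_hex
```

A matching digest only proves the bytes are unchanged since they were hashed; it does not prove the image boots, so it complements a restore test rather than replacing one.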

0
0
Bronze badge

Re: So, you didn't test your DR plan...

"sparking contact syndrome". Inviting Murphy for a 6-course dinner since the dawn of electric era.....

0
0
Anonymous Coward

Re: So, you didn't test your DR plan...

"Not outright dead, but dead *enough* that things acted wonky. It has since been replaced ..."

... and procedures introduced to prevent this kind of undetected data corruption being repeated, including removal of all potentially affected hardware in critical roles.

Please?

0
1
Gold badge

Re: So, you didn't test your DR plan...

You're funny. That would only occur in a world where the people in question have the kind of money to throw away whole servers because they act up. Try fighting like a caged rat for two years to get a storage replacement for 6-year-old drives and then having to spend the better part of two months grinding every vendor on earth against each other to slide in at budget.

Different worlds.

1
0
Bronze badge

Re: So, you didn't test your DR plan...

That would only occur in a world where the people in question have the kind of money to throw away whole servers because they act up.

Bean counters, the bane of IT.

0
0

This is where things like puppet come into their own. Stateful config rather than a complete backup of a particular machine means you can build a new one from scratch to the exact same spec as the old one really, really fast. Of course, don't forget to back up your puppet manifests...

4
0
Silver badge
Go

See Chapter 21 (New York Board of Trade) of "Blueprints for High Availability" (Evan Marcus and Hal Stern) which itself is a reprint of a VERITAS Software book "The Resilient Enterprise" (Richard Barker and Paul Massiglia). (I have no association with authors or publishers)

Well worth a read - made me focus more on what the rest of the business was going to do once I'd tested out the recovery of the core IT. Because your job's not finished even when you've got the core IT nailed!

2
0
Facepalm

You're not using MySQL's built-in replication???

You say "None of the databases for our public websites can be set up for live replication because that would require rewriting code to accommodate it." and in the next paragraph, you state that you're using MySQL.

MySQL has had built-in near-real-time replication to a remote mirror server for over a decade. I was using it as a data-protection solution back in 2002! It doesn't require any changes to client code, since it happens within the database server.

And there are robust open-source solutions for backing up a running server, such as Percona's XtraBackup.

0
0
Gold badge

Re: You're not using MySQL's built-in replication???

Certainly does if you are doing multi-point writes! Your app needs to not blow up horribly on read-only DB instances and/or be somewhat aware of the underlying architecture to ensure write coherence. The built in replication doesn't work for all situations, sadly...

2
0

Re: You're not using MySQL's built-in replication???

Are you talking about multi-master replication? <Shudder>

There are open-source solutions that allow you to avoid that kind of thing, such as Galera Cluster, as well as commercial products.

Thankfully, I don't have to support multiple primary sites, so I've avoided having to implement such setups.

0
0
Gold badge

Re: You're not using MySQL's built-in replication???

If you're running apps that aren't master-slave aware and that require write capability to do even basic things, then you usually end up in a multi-master scenario. MySQL built-in replication just doesn't work unless you have a properly designed application. By "properly designed" I mean something that's aware of DB replication scenarios and which uses the database architecture for relational key tracking.

As soon as any part of your app is manually creating or updating indexes, or is storing data in a table somewhere along with an index reference but isn't using that index reference in a relational manner (common when the application is designed by a developer and not a DBA), then you're deep into a world where replication causes all sorts of horrible, horrible things.

How many times in our industry has some horrible kludge designed to solve a temporary problem been pressed into mainline production, built upon dozens of times over the years and ended up as some patchwork bandaid application that is layers of plaster over the same kludgy, unscalable core? How many applications, both in house and off the shelf, suffer this? Too many, in my experience.

MySQL replication assumes a spherical cow. That's great if you're designing from scratch, but not so helpful if your cow is in fact a 12th dimensional meatcube extruded through a hole in space-time.

4
0
Bronze badge

Re: You're not using MySQL's built-in replication???

You don't have to rewrite application code to take advantage of MySQL replication, only if you want to use replication for load balancing/distribution. I have several DB servers with passive slaves, just ticking along. If nothing else it's certainly better than not doing it!

Even if there's slight replication lag (marginal in our experience) the chances are that the data on the slave is still more recent than your last backup.
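If you do run a passive slave, it's worth checking that it is actually keeping up. A rough sketch that judges one row of `SHOW SLAVE STATUS` output, passed in as a dict; how you fetch that row and the 300-second threshold are assumptions for illustration:

```python
def replica_is_usable(status: dict, max_lag_seconds: int = 300) -> bool:
    """Judge a replica from one row of SHOW SLAVE STATUS.

    Both replication threads must be running and the reported lag must be
    within the threshold. Seconds_Behind_Master is NULL (None here) when
    the lag cannot be determined, which is treated as unusable.
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")
    if not (io_ok and sql_ok) or lag is None:
        return False
    return int(lag) <= max_lag_seconds
```

Wired into whatever monitoring you already have, this at least tells you the slave has stalled before the day you need it.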

We've recently deployed a Unitrends virtual appliance and very good it is too!

0
0
WTF?

Re: You're not using MySQL's built-in replication???

"MySQL built-in replication just doesn't work unless you have a properly designed application. By "properly designed" I mean something that's aware of DB replication scenarios and which uses the database architecture for relational key tracking."

I'm sorry, Trevor, but either this is a troll or you really don't understand MySQL replication and you're just repeating what some bloke down the pub told you once.

Applications do NOT need to be aware of replication. I should know -- I've written large-scale applications which talk to MySQL databases that had replication slaves. Not once did I have to alter my code because of that. And these days I get paid to manage hundreds of MySQL servers, ALL of which have replication slaves, a fact that most of the developers are blissfully unaware of. And that's how it should be, of course.

1
0
Silver badge
Thumb Up

Thanks Trevor!

Lines like these are why I read El Reg:

...not so helpful if your cow is in fact a 12th dimensional meatcube extruded through a hole in space-time.

2
0
Gold badge

Re: You're not using MySQL's built-in replication???

I'm sorry, David Harper 1, it looks like you're either a troll or you don't actually read other people's comments, interjecting instead your personal experiences as though they were valid for all circumstances. I could cheerfully write an application such that it would work just fine with MySQL replication. I could also write one such that it didn't.

MySQL Master-Master replication would work for the application at hand, but it would also be a monumental bitch to set up and maintain. Master-Slave doesn't work and causes muchos big time problems in failover.

I can believe that your personal coding practices - and those of developers you work with - are subconsciously such that they "just work" with master-slave replication. Bully for you. That said, your experiences, tics, mannerisms, and stylistic choices are not present in all members of our species. Different people do different things. This results in configurations that even you, with your vast and phallus-enhancing experience haven't worked with. The job of the sysadmin is to beat the infrastructure into shapes that cope with such things. We don't always get to have things recoded to meet our desires.

Applications need to be aware of replication insomuch as the developers of those applications need to avoid doing things that break replication. (Which I call a replication-aware application. It is designed with the idea that you need to do things "properly" from the beginning.)

One scenario in which things go sideways is when your production-facing servers can't see the "master" DB at all. They can only see their local copy. (The DBs can talk to one another.) In the failover scenario where the DR site is now the "active" one, the DR site's systems will start writing to the slave. Bringing up the primary site won't cause the slave to replicate back its new data, but the automatics would switch the front-facing servers back. (Politics dictate that if real-time replication were occurring then automated failover and re-transfer would be gun-to-the-head forced.)

The application simply blows up if it cannot write to the DB (every single script writes something, even if it's only tracking data) and thus can't work with a read-only database copy. Worse, if I had a fully active setup on the DR site linked to a slave system I could measure the time before a pointy-haired-boss demanded that we switch our setup to pulling reports off the DR site's copy in minutes. As I said, every page performs writes and your databases suddenly start diverging.

For added fun and games, the web servers running the PHP on the DR site will never be allowed to "see" the master DB. (Routing rules.) The database servers could be set up to tunnel to one another for replication, but items in one site's DMZ would not be allowed to talk to backend systems in another site's DMZ.

These are scenarios that break replication. They are dealing with "real world stuff" that includes politics, bad design choices by developers and more. MySQL master-slave replication does not solve all ills.
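To make the divergence concrete, here is a toy model with each site's table as a plain dict. It is not how MySQL tracks anything; it only shows the rows a one-way master-to-slave link never carries back. The names are made up:

```python
_MISSING = object()  # sentinel so a stored None still compares correctly


def rows_unknown_to_primary(primary: dict, dr_copy: dict) -> dict:
    """Rows added or changed on the DR copy that the primary never received.

    With one-way replication, anything written at the DR site during a
    failover ends up in this set once traffic switches back.
    """
    return {
        key: value
        for key, value in dr_copy.items()
        if primary.get(key, _MISSING) != value
    }
```

If every page view writes a row, this set is non-empty within seconds of a failover, and something has to reconcile it before the primary can safely take over again.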

1
0

Re: You're not using MySQL's built-in replication???

"Your app needs to not blow up horribly on read-only DB instances..."

It shouldn't be a burden to remove the read-only denotation from your my.ini on your slave DB (since you're in there changing the slave bit anyway) in the event of a DR scenario to bring it up as a master. The replication was suggested to keep a nearly-live sync of your DB on a second server. Also, who said your app needs to know how to run on a read-only DB? The replication, in your case, would be solely for DR, not for active use.

0
0
Gold badge

Re: You're not using MySQL's built-in replication???

That's where the political issues come in. If a "live synced" copy existed then the powers that be would take a matter of days before they demanded that production workloads started operating off of it.

TPTB would also demand that switchover be automated. That would mean that any minor outage in the primary site (say because the ISP is having problems with the fibre card in their routers again) would immediately trigger a switch to the copy stored on the DR site. They would not be capable of viewing the synced copy as "for emergency, disaster-only use".

This would result in either things going horribly wrong as databases diverged or massive amounts of resources needing to be invested in retooling the application in question (and a large chunk of the rest of the infrastructure) to go from "DR" to "multi-site HA."

Solutions that are "technically possible, if you can control for various factors" don't work when politics do not let you control the requisite factors.

0
0
Stop

Re: You're not using MySQL's built-in replication???

"...would take a matter of days before they demanded that production workloads started operating off of it."

You bill it as a "backup." They wouldn't, rightly, demand to run your backup copies of the network shares as a production datastore, so they should not demand a backup DB to be a production workload. It is the network admin's job to teach that.

As for TPTB demanding automated switchover: your example of why auto failover is a Bad Thing in your case should be the exact argument against doing so. As an admin, there's a fine line to walk between "I can make it do that" and "that simply can't [shouldn't] be done." IT is as much an advisory source as it is an enabler. Just because I can set up a group of FreeNAS boxes as iSCSI targets so I can scale up my environment to 60TB doesn't mean I should, simply because TPTB demand more space but won't pay for a SAN. Likewise, caving to each want and whim of TPTB that don't allocate proper funding to do it right (or at least "better") is not correct. Of course, with their software, there's not much of an "ideal" way to do it. Manual failover, manual corrections in the event of DR, etc. It's just how it is, and TPTB need to understand that.

0
0
Gold badge

Re: You're not using MySQL's built-in replication???

@Ammaross Danan: you seem to believe that everyone will listen to their sysadmins and/or be swayed by logic. Even if you pull out bullshit ideas like "it is the network admin's job to teach that" you are still simply wrong. Computers are easy, politics are hard...and you cannot simply reprogram people until they obey you.

Armchair quarterbacking on the internet is so much easier when you can simply demand that other people change the rules around them though, isn't it? Makes me ask all sorts of questions about how well you manage to interact with human beings in the real world. Or if you do much of that at all. Compromises suck, but they are the way of the world.

1
0

Re: You're not using MySQL's built-in replication???

@Trevor_Pott

I have to practice politics every day too. You've had to deal with a wider range, due to the nature of contract work. I, like yourself, tend to end up implementing compromised solutions IRL, because that is exactly how the world works. With office politics, as with armchair quarterbacking on the internet, you recommend the more-ideal solution first, then let it get whittled and compromised down into the end result. But yes, it is the sysadmin's job (or, more accurately, the CIO/CTO's) to emphasize disadvantages or shortcomings of implementations. For a consultant, it falls to the consultant to point out those things too.

0
0
Gold badge

Re: You're not using MySQL's built-in replication???

I consider it a matter of statistics. Talking about how things "should be" in IT is like physicists talking about a spherical cow. Everyone talks about the whitepapered version of reality in which everything has infinite budgets, change controls and completely pliant users that do whatever IT says.

Why would anyone read about that? Why should they? Such imagined fantasies have less to do with the real world than the spherical cow. In my view it's far better to begin discussions by asking "what are the constraints of operation and budget?" Skip 14 layers of dancing around the problem and argument and get right down to "where are the walls and what can we do within them?"

I also think it's interesting to discuss real world implementations - both successful and failed - because they have to work inside these walls. The reason we get paid isn't to implement spherical cows but to make judgments about where compromises could or should be made.

Discussions that revolve around "no compromise" scenarios help no one; the discussions that need to be happening are "what are the constraints in existence, what compromises were made and were those rational compromises given the circumstances?" If the compromises aren't rational, then where should the compromises have been made? It is in discussing the making of the sausage of IT - when and where we can and should be making compromises to turn our spherical cow into a real one - that we evolve the discussion of our craft.

I posit that there is far more to be learned from failure - and from successful compromise - than there ever will be from "by the book."

0
0
Silver badge

Not too shabby

The CentOS version problem and not storing the VM definitions in both sites should not have happened, but I would not bash yourself over the head wrt the sendmail config.

Sometimes it is not enough to do a restoration test. For some services, it's necessary to actually run for a period of time in your alternate location. I suspect that any number of 99% tested DR plans may hold something like your sendmail problem.

This is normally because of the high cost of a full DR test. As a result, 5 minutes after the last DR test has been concluded successfully, an apparently minor change somewhere in the depths of the environment may invalidate it!

Of course, if you do run from your alternate location for enough time to make sure that you've got most of the bugs, it introduces another problem, that of fail-back. This is something that many, many administrators just do not think about. If you run from your alternate location for any length of time (to rattle any connectivity problems out), you have to have a procedure to revert back to your primary site. And it's not always a reverse of the DR plans, because these are often asymmetric.

The background to this is that most businesses don't think beyond restoring the service. One bank I worked for acknowledged (or at least their DR architect did) that it would be almost impossible to revert back to the primary site if they invoked their full site disaster plan for their main data centre. The services would be back up, but vulnerable to another failure.

4
0
Gold badge

Re: Not too shabby

It isn't enough to just test the DR plans; frequency of tests is an issue. A copy of the VM existed on the target site...but that copy was corrupted. Couldn't get it to boot. (Most likely an incomplete backup run at some point.)

So the DR plans were good, they were tested to inject new information and files into a known-good VM...but the known good VM turned out to be not so good. At that point, down the rabbit hole you go...

3
0

Re: Not too shabby

"It isn't enough to just test the DR plans; frequency of tests is an issue. A copy of the VM existed on the target site...but that copy was corrupted. Couldn't get it to boot. (Most likely an incomplete backup run at some point.)

So the DR plans were good, they were tested to inject new information and files into a known-good VM...but the known good VM turned out to be not so good. At that point, down the rabbit hole you go..."

Unless you just snag the VM copy from a previous version. But if you don't keep previous backups of your VMs, and instead overwrite each VM each night, then you're just asking for trouble. This could have been avoided if you simply had "the night before" the corrupted VM. Software that can back up using incrementals rather than fulls also helps. I'm willing to bet, though, that DFSR was the sole means of remote-site copies (which does have remote differential transfers, if you're not politically stuck on Win2003....)

0
0
Bronze badge

Re: Not too shabby

But you can't keep every single nightly backup of your VM, even as incrementals, at some point you'll have to delete old copies to save space. The problem is if your DR testing is less frequent than the length of time you keep old snapshots around.
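That relationship between test frequency and retention can be written as a one-line check. A minimal sketch; the function name and the idea of tracking a "last restore test" date are assumptions for illustration:

```python
from datetime import date, timedelta


def verified_copy_still_on_disk(last_restore_test: date, today: date,
                                retention_days: int) -> bool:
    """True if the snapshot proven by the last restore test is still retained.

    If this is False, every copy you currently hold is newer than anything
    you have actually restored from.
    """
    oldest_kept = today - timedelta(days=retention_days)
    return last_restore_test >= oldest_kept
```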

And given that a full, 100% DR test has effectively the same effect on your business as a real disaster, I'm going to guess that you don't have backups going back that far.

In my old job, we assumed that if a disaster was big enough to take out ALL the physical hardware, it would also take out most of the rest of the office as well, and pretty much destroy the company. So, we kept long-term offsite backups of data, but in the event we'd have to use them, we'd pretty much be building a new infrastructure from scratch, so OS-level backups were not much use.

1
0

Re: Not too shabby

"...I'm going to guess that you don't have backups going back that far."

Actually, we keep about 2 weeks worth of daily VM backups offsite with a week lag on cycling, so actually, YES, we do keep a fair amount of backups for which at least one image per VM would be restorable even in the event "last night's" backup failed for some reason. It's not hard to do, but certainly requires a decent storage device (ours has a good 20TB in it, but easy enough for a no-budget shop like Trevor's to set up a FreeNAS to do the same thing...)
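The rotation itself is not much code. A sketch that splits a list of backup dates into keep and expire sets for a 14-day window; it does not model the week of lag on cycling, and the names are illustrative:

```python
from datetime import date, timedelta


def split_by_retention(backup_dates, today: date, keep_days: int = 14):
    """Return (keep, expire) lists of backup dates, each sorted oldest first.

    A backup is kept if it is less than keep_days old; everything else is
    a candidate for deletion.
    """
    cutoff = today - timedelta(days=keep_days)
    keep = sorted(d for d in backup_dates if d > cutoff)
    expire = sorted(d for d in backup_dates if d <= cutoff)
    return keep, expire
```

Deleting only from the expire list, and only after the newest backup has been verified, keeps one bad night from leaving you with nothing restorable.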

0
0
Bronze badge

As I read somewhere, "It's not a backup until it's been restored."!

1
2
Silver badge

DR

ie. You had a backup regime but no DR or bare metal restore plans. First part of the DR plan is usually rebuilding the metal or virtual infrastructure. Second is restoring data. DR agreement is expensive but allows you to practice the whole rebuild at 3rd party DR site. Installing the right operating system version... er, should be a no-brainer ? (Sorry about that!)

1
2
Gold badge

Re: DR

It isn't the "right operating system version." It's about the config file version. The version of the OS on the production system is up to date (binary-wise); however, the originally installed version was older than the newest installed version. This means that a brand new install to the same version as is currently running in production will install different default config files.

The real lesson here is "add the php config files to the nightly backup set." It obviously isn't enough to rely on operating system version to keep those straight.

I would have thought that for someone who read the article that lesson was a no brainer.

2
0
Bronze badge

Re: DR

"The version of the OS on the production system is up to date (binary-wise,) however, the originally installed version was older than than the newest installed version. This means that a brand new install to the same version as is currently running in production will install different default config files."

Is that why apt-get on Debian based systems sometimes gives you a warning when upgrading about changed config files? You get three alternatives and I always seem to pick the wrong one (as I'm an end user, no major damage results).

CentOS/RHEL world just uses the older config and assumes you know what the consequences are I suppose.

We need a light bulb icon.

0
0

Re: DR

"The real lesson here is 'add the php config files to the nightly backup set.'"

Except I don't think that would have worked here.

From memory, the version of PHP that came with earlier versions of CentOS didn't require a config entry to allow for short tags; it's only in newer versions.

Even if you had the original config files it would still fail. At least that was my experience going from CentOS 4 to 6, where I had the original config.

0
0
Pint

Re: DR

Hi Trevor, I feel your pain regarding the config files - although you should slap your developers for using short tags. The most important rule is *always* back up /etc - even if you don't want to use it directly, then you can restore it (/root/oldetc perhaps) and manually check/compare things at the very least... been there, done that.
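The compare step can be done with the standard library once the old /etc is restored somewhere. A small sketch using `filecmp.dircmp`; it only looks at the top level of each directory, and the function name and result keys are made up:

```python
import filecmp


def config_drift(live_dir: str, restored_dir: str) -> dict:
    """Summarise differences between a live config dir and a restored copy.

    Only the top level is compared; walk dircmp's subdirs attribute if you
    need to recurse into subdirectories.
    """
    comparison = filecmp.dircmp(live_dir, restored_dir)
    return {
        "only_live": sorted(comparison.left_only),
        "only_restored": sorted(comparison.right_only),
        "changed": sorted(comparison.diff_files),
    }
```

Something like `config_drift("/etc", "/root/oldetc")` would then list the files worth diffing by hand.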

2
0
Silver badge

Re: DR

I can't pretend to understand the whole event, but a backup regime normally includes all data on the system, including the OS and all config files. Unless there is good reason for exclusion, eg cache files.

0
2
Gold badge

Re: DR

A local backup regime? Sure. To deal with equipment failure. A disaster recovery scheme? Rarely, if ever. The cost of bandwidth is prohibitive and there isn't always access to offsite vaulting companies willing to work for the prices you can afford.

0
0
Silver badge

Re: DR

"Never underestimate the bandwidth of a truck full of tapes".

Over the net is fine as far as it goes, but it does not have to be the only mechanism used. That's why most large datacentres use tape with offsite storage pools for their DR plan.
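The old saying is easy to put numbers on. A back-of-envelope sketch; the tape capacity and drive time in the usage note are invented for illustration:

```python
def effective_mbps(total_terabytes: float, transit_hours: float) -> float:
    """Average throughput of physically moving data, in megabits per second.

    Uses decimal terabytes (10**12 bytes) and ignores the time spent
    writing and reading the tapes at either end.
    """
    bits = total_terabytes * 1e12 * 8
    return bits / (transit_hours * 3600) / 1e6
```

For example, 9 TB of tapes on a two-hour drive works out to 10,000 Mbit/s on average, which is far more than a small shop's nightly link, though the latency is terrible.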

2
0
Silver badge

Re: DR

DR is expensive, usually reserved for a company's most critical stuff. Companies tend to categorize web servers as less critical, rightly or wrongly. I have worked with several clients on their DR tests, often it is covering a production data warehouse or similar. But some don't have it. DR tends to be a subject managers don't like to think about.

0
0
PM.

Good read. Thanks !

0
0
Silver badge

This is the dog+flood picture you are actually looking for:

http://mubi.com/lists/my-favorite-films-of-all-time-always-under-construction

0
0
