Back up all you like - but can you resuscitate your data after a flood?

When it comes to backups two sayings are worth keeping in mind: "if your data doesn't exist in at least two places, it doesn't exist" and "a backup whose restore process has not been tested is no backup at all". There is nothing like a natural disaster affecting one of your live locations to test your procedures. I have just …

COMMENTS

  1. Androgynous Cupboard Silver badge

    Been there

    We're on a much smaller setup, but we've been there with the VMs. Now I don't bother dumping the database, website or anything - we just dump the entire VM every few hours and back that up. The procedure is much simpler and uniform across all our systems, and when we've had hardware fail, firing up a VM on a spare machine is only a minute or two away.
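
    For the curious, the dump itself is only a few lines of script. A rough sketch, assuming a libvirt/KVM host, with illustrative VM names and paths (adjust for whatever hypervisor and storage you actually run):

    ```bash
    #!/bin/bash
    # Hypothetical whole-VM dump for a libvirt/KVM guest; names and paths are examples only.
    set -euo pipefail

    VM=web01
    DEST=/backup/vms/$(date +%Y%m%d-%H%M)
    mkdir -p "$DEST"

    virsh suspend "$VM"                                  # pause the guest so the disk is consistent
    virsh dumpxml "$VM" > "$DEST/$VM.xml"                # keep the VM definition alongside the disk
    cp --sparse=always "/var/lib/libvirt/images/$VM.qcow2" "$DEST/"
    virsh resume "$VM"

    rsync -a "$DEST" sparehost:/srv/vm-backups/          # ship it to the spare machine
    ```

    Restoring on the spare box is then just virsh define plus virsh start against the copied image.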

  2. NinjasFTW

    DR - never important until you need it

    Part of your DR plan is always about having a location to put it!

    Also, no DR plan works the first time, so an actual disaster shouldn't be the first time you implement it. It should be tested until the most junior tech can make it work.

    If you had an alternate data centre with 'gobs of storage', why wouldn't you have copies of the VMs already stored and tested, ready to bring up with the latest backup data restored to them?

    Also, if you're having to build from scratch, why would you install a different OS version? Wouldn't you put in place the exact same version?

    I was bitten by the PHP short tags deprecation a couple of months ago, but it was for a planned migration so it wasn't panic inducing.

    1. Trevor_Pott Gold badge

      Re: DR - never important until you need it

      We had tested the DR plan by restoring to a copy of the VM. We thought we had a copy of the VM when we didn't. (The copy on the backup server was corrupt.) Our records indicated the VMs in use were the latest version - and they are - but they had gotten there by upgrading from previous versions; meaning the php configuration file was from an older version despite the binaries being up-to-date.

      1. NinjasFTW

        Re: DR - never important until you need it

        That makes it tricky then. I would be highly concerned at corrupted backup files.

        Might need to update your asset registry with application version levels as well as OS

      2. Phil O'Sophical Silver badge

        Re: DR - never important until you need it

        > We had tested the DR plan by restoring to a copy of the VM.

        Well, you tested the backup plan. Doesn't sound like you had a full DR plan, one that identified services, risks, impact, allowed downtime numbers, etc. Put all that together and the missing bits tend to stand out.

        I suspect you designed this from the bottom up, starting with the backups and working out how to transfer and restore them. That tends to focus your attention on what is in the backup; it's easy to miss what isn't in it.

        Full tests are a pain, but are essential. I have a customer that swaps their production and DR centres once a month, back and forth. They take a short outage, in the small hours of a Sunday morning, at what they know to be off-peak hours, but they do know for certain that their DR site will be there for them when they need it. Given the size of their travel/hospitality business, they need that assurance.

        1. Trevor_Pott Gold badge

          Re: DR - never important until you need it

          We worked from the bandwidth on up. We have a fixed amount of bandwidth every night that we can use for backups: X number of bits can be transferred each night, and backup is all about fitting everything inside that window.
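
          (Purely illustrative numbers, not our actual link - run the sums on whatever pipe you have and that's your ceiling. For example:)

          ```bash
          # Illustrative only: a 100 Mbit/s link used flat-out for an 8-hour overnight window
          echo "$(( 100 * 1000000 / 8 * 3600 * 8 / 1000000000 )) GB per night"   # ~360 GB, before overheads
          ```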

  3. Anonymous Coward

    Ouch...

    ....as soon as I read the line

    "This information is made highly available to deal with hardware failure but is not replicated offsite."

    I just cringed knowing where this was heading.

    You mention that bigger companies have bigger budgets, but it's often the sheer scale that makes it impossible. For example, you mention a lab where you test your backups (by the sounds of it, daily). For us to do that would cost hundreds of thousands of pounds and need a dedicated team - we're talking terabytes of data each night (let's not even think about weekly and monthly backups).

    Combine that with the server-to-sysadmin ratio.

    A small place may have one or two admins looking after, say, 10 servers, and no doubt some network kit and end users as well.

    Somewhere with 1,000 servers won't have 100x more staff; you'd be lucky if it has 10x more, and this is the bit the bean counters don't appreciate. If the shit hits the fan, there simply won't be enough people to invoke a full DR in a reasonable time.

    So even if you are a big boy, getting everything running quickly is in the hands of automation. If that fails, then it's much more of a headache!

    1. Trevor_Pott Gold badge

      Re: Ouch...

      The theory behind the stuff that isn't replicated off-site is that it is relevant only to that site. That means that if it all goes splork and we can't recover it, something horrible has happened to that site and we're into "contacting the insurance company to replace a site" territory anyways. By the time the site is back online the data in question will no longer be relevant.

  4. Dave Pickles

    Out-of-date OS?

    It seems that most of your problems were due to running your live systems on an out-of-date version of CentOS, then reinstalling the backups on a more recent version. The solution is either to keep the live systems updated, or to document exactly which OS flavour and version works.

    Not that I've ever done anything similar of course, oh dear me no...

    1. Trevor_Pott Gold badge

      Re: Out-of-date OS?

      The live version is fully up-to-date (at least as far as Yum is concerned, and both the 5 and 6 series are under active support) but the original image was an older version. That means that the php config file was from the past - and allowed the short tags - but the binaries got updated over time.
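
      (For the curious, the culprit directive is short_open_tag. One way to spot that sort of config drift - a rough sketch, assuming the stock CentOS PHP package and paths:)

      ```bash
      # Is the live box still carrying the old default?
      grep -n '^\s*short_open_tag' /etc/php.ini

      # rpm -V flags config files that no longer match what the installed package shipped
      rpm -V php-common | grep php.ini

      # An update that leaves an edited config alone drops its new defaults alongside as .rpmnew
      [ -f /etc/php.ini.rpmnew ] && diff -u /etc/php.ini /etc/php.ini.rpmnew
      ```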

  5. Anonymous Coward

    So, you didn't test your DR plan...

    ...what did you expect? Just as backups that have never been test-restored can turn out to be useless, an untested DR plan is one that in all probability will not work. Sloppy service management - I see it all too frequently.

    1. Grikath
      FAIL

      Re: So, you didn't test your DR plan...

        If you'd read Trevor's replies, restarting the setup should have gone pretty smoothly, had it not been for a corrupted backup of the VM, which happened to be the only thing that wasn't tested regularly.

      The devil is always in the details, but "sloppy"...? Anon Troll is Anon, of course..

      1. Trevor_Pott Gold badge

        Re: So, you didn't test your DR plan...

        The bitch of it is the corrupt VM ended up being caused by a flaky RAID SAS cable combined with some flaky disks on the backup server. Not outright dead, but dead *enough* that things acted wonky. It has since been replaced.

        1. Grikath

          Re: So, you didn't test your DR plan...

          "sparking contact syndrome". Inviting Murphy for a 6-course dinner since the dawn of electric era.....

        2. Anonymous Coward

          Re: So, you didn't test your DR plan...

          "Not outright dead, but dead *enough* that things acted wonky. It has since been replaced ..."

          ... and procedures introduced to prevent this kind of undetected data corruption being repeated, including removal of all potentially affected hardware in critical roles.

          Please?

          1. Trevor_Pott Gold badge

            Re: So, you didn't test your DR plan...

            You're funny. That would only occur in a world where the people in question have the kind of money to throw away whole servers because they act up. Try fighting like a caged rat for two years to get a storage replacement for six-year-old drives, and then having to spend the better part of two months grinding every vendor on earth against each other to slide in at budget.

            Different worlds.

            1. Fatman

              Re: So, you didn't test your DR plan...

              "That would only occur in a world where the people in question have the kind of money to throw away whole servers because they act up."

              Bean counters, the bane of IT.

  6. theOtherJT Silver badge

    This is where things like Puppet come into their own. Stateful config rather than a complete backup of a particular machine means you can build a new one from scratch to the exact same spec as the old one really, really fast. Of course, don't forget to back up your Puppet manifests...
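
    Backing the manifests up can be as dumb as a nightly cron job that pushes the tree somewhere off-box. A rough sketch, with a hypothetical path and remote name:

    ```bash
    #!/bin/bash
    # Hypothetical nightly job: keep the Puppet tree in git and push it off-site.
    cd /etc/puppet || exit 1
    git add -A
    git commit -m "manifest snapshot $(date +%F)" || true   # no-op when nothing has changed
    git push offsite master                                  # 'offsite' is a placeholder remote
    ```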

  7. Velv
    Go

    See Chapter 21 (New York Board of Trade) of "Blueprints for High Availability" (Evan Marcus and Hal Stern) which itself is a reprint of a VERITAS Software book "The Resilient Enterprise" (Richard Barker and Paul Massiglia). (I have no association with authors or publishers)

    Well worth a read - it made me focus more on what the rest of the business was going to do once I'd tested out the recovery of the core IT. Because your job's not finished even when you've got the core IT nailed!

  8. David Harper 1
    Facepalm

    You're not using MySQL's built-in replication???

    You say "None of the databases for our public websites can be set up for live replication because that would require rewriting code to accommodate it." and in the next paragraph, you state that you're using MySQL.

    MySQL has had built-in near-real-time replication to a remote mirror server for over a decade. I was using it as a data-protection solution back in 2002! It doesn't require any changes to client code, since it happens within the database server.

    And there are robust open-source solutions for backing up a running server, such as Percona's XtraBackup.
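
    For anyone who hasn't set it up, a rough sketch of classic master-slave replication (hostnames, credentials and binlog coordinates are placeholders, and you'd seed the slave from a consistent dump or XtraBackup copy first):

    ```bash
    # On the master, my.cnf needs binary logging and a unique server id:
    #   [mysqld]
    #   server-id = 1
    #   log_bin   = mysql-bin
    mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.0.%' IDENTIFIED BY 'secret';"
    mysql -u root -p -e "SHOW MASTER STATUS;"    # note the File and Position values

    # On the slave (its own server-id, e.g. 2), after seeding it with that consistent copy,
    # point it at the master using the coordinates noted above:
    mysql -u root -p -e "
      CHANGE MASTER TO
        MASTER_HOST='master.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='secret',
        MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=4;
      START SLAVE;"
    mysql -u root -p -e "SHOW SLAVE STATUS\G"    # Slave_IO_Running / Slave_SQL_Running should both say Yes
    ```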

    1. Trevor_Pott Gold badge

      Re: You're not using MySQL's built-in replication???

      Certainly does if you are doing multi-point writes! Your app needs to not blow up horribly on read-only DB instances and/or be somewhat aware of the underlying architecture to ensure write coherence. The built-in replication doesn't work for all situations, sadly...

      1. David Harper 1

        Re: You're not using MySQL's built-in replication???

        Are you talking about multi-master replication? <Shudder>

        There are open-source solutions that allow you to avoid that kind of thing, such as Galera Cluster, as well as commercial products.

        Thankfully, I don't have to support multiple primary sites, so I've avoided having to implement such setups.

        1. Trevor_Pott Gold badge

          Re: You're not using MySQL's built-in replication???

          If you're running apps that aren't master-slave aware and require write capability to do even basic things, then you usually end up in a multi-master scenario. MySQL built-in replication just doesn't work unless you have a properly designed application. By "properly designed" I mean something that's aware of DB replication scenarios and which uses the database architecture for relational key tracking.

          As soon as any part of your app is manually creating or updating indexes, or is storing data in a table somewhere along with an index reference but isn't using that index reference in a relational manner (common when the application is designed by a developer and not a DBA), then you're deep into a world where replication causes all sorts of horrible, horrible things.

          How many times in our industry has some horrible kludge designed to solve a temporary problem been pressed into mainline production, built upon dozens of times over the years, and ended up as some patchwork bandaid application that is layers of plaster over the same kludgy, unscalable core? How many applications, both in-house and off the shelf, suffer this? Too many, in my experience.

          MySQL replication assumes a spherical cow. That's great if you're designing from scratch, but not so helpful if your cow is in fact a 12th dimensional meatcube extruded through a hole in space-time.

          1. batfastad

            Re: You're not using MySQL's built-in replication???

            You don't have to rewrite application code to take advantage of MySQL replication unless you want to use replication for load balancing/distribution. I have several DB servers with passive slaves, just ticking along. If nothing else it's certainly better than not doing it!

            Even if there's slight replication lag (marginal in our experience) the chances are that the data on the slave is still more recent than your last backup.

            We've recently deployed a Unitrends virtual appliance and very good it is too!

          2. David Harper 1
            WTF?

            Re: You're not using MySQL's built-in replication???

            "MySQL built-in replication just doesn't work unless you have a properly designed application. By "properly designed" I mean something that's aware of DB replication scenarios and which uses the database architecture for relational key tracking."

            I'm sorry, Trevor, but either this is a troll or you really don't understand MySQL replication and you're just repeating what some bloke down the pub told you once.

            Applications do NOT need to be aware of replication. I should know -- I've written large-scale applications which talk to MySQL databases that had replication slaves. Not once did I have to alter my code because of that. And these days I get paid to manage hundreds of MySQL servers, ALL of which have replication slaves, a fact that most of the developers are blissfully unaware of. And that's how it should be, of course.

            1. Trevor_Pott Gold badge

              Re: You're not using MySQL's built-in replication???

              I'm sorry, David Harper 1, it looks like you're either a troll or you don't actually read other people's comments, interjecting instead your personal experiences as though they were valid for all circumstances. I could cheerfully write an application such that it would work just fine with MySQL replication. I could also write one such that it didn't.

              MySQL Master-Master replication would work for the application at hand, but it would also be a monumental bitch to set up and maintain. Master-Slave doesn't work and causes muchos big time problems in failover.

              I can believe that your personal coding practices - and those of developers you work with - are subconsciously such that they "just work" with master-slave replication. Bully for you. That said, your experiences, tics, mannerisms, and stylistic choices are not present in all members of our species. Different people do different things. This results in configurations that even you, with your vast and phallus-enhancing experience haven't worked with. The job of the sysadmin is to beat the infrastructure into shapes that cope with such things. We don't always get to have things recoded to meet our desires.

              Applications need to be aware of replication insomuch as the developers of those applications need to avoid doing things that break replication. (Which I call a replication-aware application. It is designed with the idea that you need to do things "properly" from the beginning.)

              One scenario in which things go sideways is when your production-facing servers can't see the "master" DB at all. They can only see their local copy. (The DBs can talk to one another.) In the failover scenario where the DR site is now the "active" one, the DR site's system will start writing to the slave. Bringing up the primary site won't cause the slave to replicate back its new data, but the automatics would switch the front-facing servers back. (Politics dictate that if real-time replication were occurring then automated failover and re-transfer would be gun-to-the-head forced.)

              The application simply blows up if it cannot write to the DB (every single script writes something, even if it's only tracking data) and thus can't work with a read-only database copy. Worse, if I had a fully active setup on the DR site linked to a slave system, I could measure in minutes the time before a pointy-haired boss demanded that we start pulling reports off the DR site's copy. As I said, every page performs writes, and your databases suddenly start diverging.

              For added fun and games, the web servers running the PHP on the DR site will never be allowed to "see" the master DB. (Routing rules.) The database servers could be set up to tunnel to one another for replication, but items in one site's DMZ would not be allowed to talk to backend systems in another site's DMZ.

              These are scenarios that break replication. They are dealing with "real world stuff" that includes politics, bad design choices by developers and more. MySQL master-slave replication does not solve all ills.

          3. Tom 13
            Thumb Up

            Thanks Trevor!

            Lines like these are why I read El Reg:

            ...not so helpful if your cow is in fact a 12th dimensional meatcube extruded through a hole in space-time.

      2. Ammaross Danan

        Re: You're not using MySQL's built-in replication???

        "Your app needs to not blow up horribly on read-only DB instances..."

        It shouldn't be a burden, in the event of a DR scenario, to remove the read-only setting from the my.ini on your slave DB (since you're in there changing the slave bit anyway) to bring it up as a master. The replication was suggested to keep a nearly-live sync of your DB on a second server. Also, who said your app needs to know how to run on a read-only DB? The replication, in your case, would be solely for DR, not for active use.
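
        For what it's worth, the promotion itself is only a handful of statements - a rough sketch, assuming the slave runs with read_only = 1 in its config and the old master is genuinely gone:

        ```bash
        # Stop applying relay logs from the dead master, open the copy up for writes,
        # and forget the old master entirely (RESET SLAVE ALL needs MySQL 5.5.16+).
        mysql -u root -p -e "STOP SLAVE; SET GLOBAL read_only = OFF; RESET SLAVE ALL;"

        # Flip read_only in the config file too so the change survives a restart, then
        # repoint the application at this host; the old primary gets rebuilt as a slave
        # of the promoted box before any (careful) fail-back.
        ```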

        1. Trevor_Pott Gold badge

          Re: You're not using MySQL's built-in replication???

          That's where the political issues come in. If a "live synced" copy existed then the powers that be would take a matter of days before they demanded that production workloads started operating off of it.

          TPTB would also demand that switchover be automated. That would mean that any minor outage in the primary site (say, because the ISP is having problems with the fibre card in their routers again) would immediately trigger a switch to the copy stored on the DR site. They would not be capable of viewing the synched copy as "for emergency, disaster-only use".

          This would result in either things going horribly wrong as databases diverged or massive amounts of resources needing to be invested in retooling the application in question (and a large chunk of the rest of the infrastructure) to go from "DR" to "multi-site HA."

          Solutions that are "technically possible, if you can control for various factors" don't work when politics do not let you control the requisite factors.

          1. Ammaross Danan
            Stop

            Re: You're not using MySQL's built-in replication???

            "...would take a matter of days before they demanded that production workloads started operating off of it."

            You bill it as a "backup." They wouldn't - rightly - demand to run your backup copies of the network shares as a production datastore, so they should not demand that a backup DB become a production workload. It is the network admin's job to teach that.

            As for TPTB demanding automated switchover: your example of why auto-failover is a Bad Thing in your case is the exact argument against doing it. As an admin, there's a fine line to walk between "I can make it do that" and "that simply can't [shouldn't] be done." IT is as much an advisory source as it is an enabler. Just because I can set up a group of FreeNAS boxes as iSCSI targets so I can scale up my environment to 60TB doesn't mean I should, simply because TPTB demand more space but won't pay for a SAN. Likewise, caving to each want and whim of TPTB who don't allocate proper funding to do it right (or at least "better") is not correct. Of course, with their software, there's not much of an "ideal" way to do it. Manual failover, manual corrections in the event of DR, etc. It's just how it is, and TPTB need to understand that.

            1. Trevor_Pott Gold badge

              Re: You're not using MySQL's built-in replication???

              @Ammaross Danan: you seem to believe that everyone will listen to their sysadmins and/or be swayed by logic. Even if you pull out bullshit ideas like "it is the network admin's job to teach that" you are still simply wrong. Computers are easy, politics are hard...and you cannot simply reprogram people until they obey you.

              Armchair quarterbacking on the internet is so much easier when you can simply demand that other people change the rules around them though, isn't it? Makes me ask all sorts of questions about how well you manage to interact with human beings in the real world. Or if you do much of that at all. Compromises suck, but they are the way of the world.

              1. Ammaross Danan

                Re: You're not using MySQL's built-in replication???

                @Trevor_Pott

                I have to practice politics every day too. You've had to deal with a wider range, due to the nature of contract work. I, like yourself, tend to end up implementing compromised solutions IRL, because that is exactly how the world works. With office politics, as with armchair quarterbacking on the internet, you recommend the more-ideal solution first, then let it get whittled and compromised down into the end result. But yes, it is the sysadmin's (or more accurately, the CIO/CTO's) job to emphasize disadvantages or shortcomings of implementations. As a consultant, it falls to the consultant to point those things out too.

                1. Trevor_Pott Gold badge

                  Re: You're not using MySQL's built-in replication???

                  I consider it a matter of statistics. Talking about how things "should be" in IT is like physicists talking about a spherical cow. Everyone talks about the whitepapered version of reality in which everything has infinite budgets, change controls and completely pliant users that do whatever IT says.

                  Why would anyone read about that? Why should they? Such imagined fantasies have less to do with the real world than the spherical cow. In my view it's far better to begin discussions by asking "what are the constraints of operation and budget?" Skip 14 layers of dancing around the problem and get right down to "where are the walls, and what can we do within them?"

                  I also think it's interesting to discuss real world implementations - both successful and failed - because they have to work inside these walls. The reason we get paid isn't to implement spherical cows but to make judgments about where compromises could or should be made.

                  Discussions that revolve around "no compromise" scenarios help no one; the discussions that need to be happening are "what are the constraints in existence, what compromises were made, and were those rational compromises given the circumstances?" If the compromises weren't rational, then where should the compromises have been made? It is in discussing the making of the sausage of IT - when and where we can and should be making compromises to turn our spherical cow into a real one - that we evolve the discussion of our craft.

                  I posit that there is far more to be learned from failure - and from successful compromise - than there ever will be from "by the book."

  9. Peter Gathercole Silver badge

    Not too shabby

    The CentOS version problem and not storing the VM definitions in both sites should not have happened, but don't bash yourself over the head wrt the sendmail config.

    Sometimes it is not enough to do a restoration test. For some services, it's necessary to actually run for a period of time in your alternate location. I suspect that any number of 99% tested DR plans may hold something like your sendmail problem.

    This is normally because of the high cost of a full DR test. As a result, 5 minutes after the last DR test has been concluded successfully, an apparently minor change somewhere in the depths of the environment may invalidate it!

    Of course, if you do run from your alternate location for enough time to make sure that you've got most of the bugs, it introduces another problem, that of fail-back. This is something that many, many administrators just do not think about. If you run from your alternate location for any length of time (to rattle any connectivity problems out), you have to have a procedure to revert back to your primary site. And it's not always a reverse of the DR plans, because these are often asymmetric.

    The background to this is that most businesses don't think beyond restoring the service. One bank I worked for acknowledged (or at least their DR architect did) that it would be almost impossible to revert back to the primary site if they invoked their full site disaster plan for their main data centre. The services would be back up, but vulnerable to another failure.

    1. Trevor_Pott Gold badge

      Re: Not too shabby

      It isn't enough to just test the DR plans; frequency of tests is an issue. A copy of the VM existed on the target site...but that copy was corrupted. Couldn't get it to boot. (Most likely an incomplete backup run at some point.)

      So the DR plans were good; they were tested by injecting new information and files into a known-good VM... but the known-good VM turned out to be not so good. At that point, down the rabbit hole you go...

      1. Ammaross Danan

        Re: Not too shabby

        "It isn't enough to just test the DR plans; frequency of tests is an issue. A copy of the VM existed on the target site...but that copy was corrupted. Couldn't get it to boot. (Most likely an incomplete backup run at some point.)

        So the DR plans were good; they were tested by injecting new information and files into a known-good VM... but the known-good VM turned out to be not so good. At that point, down the rabbit hole you go..."

        Unless you just snag the VM copy from a previous version. But if you don't keep previous backups of your VMs, and instead overwrite each VM each night, then you're just asking for trouble. This could have been avoided if you simply had "the night before" the corrupted VM. Software that can back up incrementally rather than doing fulls also helps. I'm willing to bet, though, that DFSR was the sole means of remote-site copies (which does have remote differential transfers, if you're not politically stuck on Win2003....)
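
        A rough sketch of the sort of rotation I mean, with illustrative paths and a two-week retention:

        ```bash
        #!/bin/bash
        # Keep dated copies of the nightly VM dumps so one corrupt image isn't fatal.
        set -euo pipefail

        SRC=/backup/vms/latest                      # tonight's dumps land here
        ARCHIVE=/backup/vms/archive

        mkdir -p "$ARCHIVE/$(date +%F)"
        rsync -a "$SRC"/ "$ARCHIVE/$(date +%F)"/

        # prune anything older than 14 days
        find "$ARCHIVE" -mindepth 1 -maxdepth 1 -type d -mtime +14 -exec rm -rf {} +
        ```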

        1. phuzz Silver badge

          Re: Not too shabby

          But you can't keep every single nightly backup of your VM, even as incrementals; at some point you'll have to delete old copies to save space. The problem is when the interval between DR tests is longer than the length of time you keep old snapshots around.

          And given that a full, 100% DR test has effectively the same effect on your business as a real disaster, I'm going to guess that you don't have backups going back that far.

          In my old job, we assumed that if a disaster was big enough to take out ALL the physical hardware, it would also take out most of the rest of the office as well, and pretty much destroy the company. So, we kept long-term offsite backups of data, but in the event we'd have to use them, we'd pretty much be building a new infrastructure from scratch, so OS-level backups were not much use.

          1. Ammaross Danan

            Re: Not too shabby

            "...I'm going to guess that you don't have backups going back that far."

            Actually, we keep about two weeks' worth of daily VM backups offsite, with a week's lag on cycling, so YES, we do keep a fair amount of backups, from which at least one image per VM would be restorable even if "last night's" backup failed for some reason. It's not hard to do, but it certainly requires a decent storage device (ours has a good 20TB in it, but it's easy enough for a no-budget shop like Trevor's to set up a FreeNAS to do the same thing...)

  10. Amorous Cowherder

    As I read somewhere, "It's not a backup until it's been restored."!

  11. Jim 59

    DR

    i.e. you had a backup regime but no DR or bare-metal restore plans. The first part of a DR plan is usually rebuilding the metal or virtual infrastructure; the second is restoring data. A DR agreement is expensive but allows you to practice the whole rebuild at a 3rd-party DR site. Installing the right operating system version... er, should be a no-brainer? (Sorry about that!)

    1. Trevor_Pott Gold badge

      Re: DR

      It isn't the "right operating system version." It's about the config file version. The version of the OS on the production system is up to date (binary-wise,) however, the originally installed version was older than than the newest installed version. This means that a brand new install to the same version as is currently running in production will install different default config files.

      The real lesson here is "add the php config files to the nightly backup set." It obviously isn't enough to rely on operating system version to keep those straight.

      I would have thought that for someone who read the article that lesson was a no brainer.
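
      For concreteness, folding them into the nightly set is a one-liner (paths assume a stock CentOS layout; the destination is whatever already goes off-site each night):

      ```bash
      # -R keeps the full /etc/... paths under the destination
      rsync -aR /etc/php.ini /etc/php.d /etc/httpd/conf /etc/httpd/conf.d /backup/nightly/configs/
      ```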

      1. keithpeter Silver badge

        Re: DR

        "The version of the OS on the production system is up to date (binary-wise,) however, the originally installed version was older than than the newest installed version. This means that a brand new install to the same version as is currently running in production will install different default config files."

        Is that why apt-get on Debian based systems sometimes gives you a warning when upgrading about changed config files? You get three alternatives and I always seem to pick the wrong one (as I'm an end user, no major damage results).

        CentOS/RHEL world just uses the older config and assumes you know what the consequences are I suppose.

        We need a light bulb icon.

      2. NinjasFTW

        Re: DR

        "The real lesson here is 'add the php config files to the nightly backup set.'"

        Except I don't think that would have worked here.

        From memory, the version of PHP that came with earlier versions of CentOS didn't require a config entry to allow short tags; it's only needed in newer versions.

        Even if you had the original config files it would still fail. At least that was my experience going from CentOS 4 to 6, where I had the original config.

      3. Blane Bramble
        Pint

        Re: DR

        Hi Trevor, I feel your pain regarding the config files - although you should slap your developers for using short tags. The most important rule is *always* back up /etc - even if you don't want to use it directly, you can restore it somewhere (/root/oldetc perhaps) and manually check/compare things at the very least... been there, done that.
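
        Something along these lines, with illustrative names - the trick is restoring the old /etc out of the way and diffing, rather than clobbering the fresh install:

        ```bash
        # In the nightly job:
        tar -czpf /backup/nightly/etc-$(hostname)-$(date +%F).tar.gz /etc

        # On the rebuilt box: unpack somewhere safe and compare against the new defaults
        mkdir -p /root/oldetc
        tar -xzpf etc-web01-2013-06-21.tar.gz -C /root/oldetc --strip-components=1
        diff -u /root/oldetc/php.ini /etc/php.ini
        ```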

      4. Jim 59

        Re: DR

    I can't pretend to understand the whole event, but a backup regime normally includes all data on the system, including the OS and all config files, unless there is good reason for exclusion, e.g. cache files.

        1. Trevor_Pott Gold badge

          Re: DR

          A local backup regime? Sure. To deal with equipment failure. A disaster recovery scheme? Rarely, if ever. The cost of bandwidth is prohibitive and there isn't always access to offsite vaulting companies willing to work for the prices you can afford.

          1. Peter Gathercole Silver badge

            Re: DR

            "Never underestimate the bandwidth of a truck full of tapes".

            Over the net is fine as far as it goes, but it does not have to be the only mechanism used. That's why most large datacentres use tape with offsite storage pools for their DR plan.

          2. Jim 59

            Re: DR

            DR is expensive and usually reserved for a company's most critical stuff. Companies tend to categorize web servers as less critical, rightly or wrongly. I have worked with several clients on their DR tests; often it covers a production data warehouse or similar. But some don't have it at all. DR tends to be a subject managers don't like to think about.

  12. PM.

    Good read. Thanks !

  13. Destroy All Monsters Silver badge

    This is the dog+flood picture you are actually looking for:

    http://mubi.com/lists/my-favorite-films-of-all-time-always-under-construction

  14. Daniel B.
    Boffin

    I feel your pain

    Ever-changing defaults on config files have been a headache precisely because they hit me when I migrate stuff to new boxes. Incidentally, the first time I got hit with something like this was with PHP, so I see they have marched on with the never-ending changing of default settings.

    It's been three years since I last experienced a test switchover to the DR system, and that was at a former employer. At least the systems I managed worked fine, though the DR site was heavily underpowered. Hopefully they'll never need to use it, as everything does run, but much, much slower.

    By the way, I wouldn't quite spend the budget on cloudy backups; what that particular employer did was to have the DR stuff in a DR-specialized facility. They even had an Ops Center that could be used by the operational team for both testing and actual work if the DR plan had to be executed. So while the company didn't own the DR facilities, they were there for the using. Much better than relying on 'the cloud'...

  15. John Smith 19 Gold badge
    Thumb Up

    Unfortunate. Something to add to the new customer checklist.

    But otherwise the system got back up. Lesson learned.

    Which is not to say it won't be tough to keep track of it all as the checklist gets longer.

    But I agree: regular backups without regular restore tests are just voodoo IT. Common sense when you think about it, but surprisingly uncommon IRL.

  16. Trevor_Pott Gold badge

    DR plans

    Seriously, it isn't just testing the DR plans...it's testing them with some regularity. One bloke up thataway made mention that even a minor change can invalidate a DR plan.

    Like "yum update", perhaps?

    Security says update every month, at a minimum. Do you have time/money/etc to test your DR plans for every single change every month? If so...I want to work where you work.

    1. Anonymous Coward

      Re: DR plans

      "Do you have time/money/etc to test your DR plans for every single change every month? "

      Do your clients have the financial reserves to survive the immediate and ongoing impact of a DR plan that ends up not quite working as quickly as was hoped, because it hasn't been tested frequently enough?

      If they don't have the financial reserves, they're almost certainly toast once Bad Things happen.

      If they don't test frequently enough, they may well be toast if Bad Things happen.

      Shit happens. Sometimes it's important to understand which bits matter, and require investment upfront and on an ongoing basis.

      1. Trevor_Pott Gold badge

        Re: DR plans

        You are absolutely correct. In fact, I think I've written the exact same thing in about a dozen different ways on this very site. Unfortunately, nerds don't control the business.

        Or fortunately? It depends on your outlook. Nerds would spend a virtually unlimited amount of money on things, restrict changes to rigid procedures that had long time horizons and generally play things incredibly paranoid and "safe." This would result in an unbeatable network, but a massive money sink and virtually zero agility. At large enough scale you could provide agility - sort of - but certainly not in the SME space. So the owners of the business make choices and they take risks. "Continue operating today" versus "prevent a risk that may not happen." There isn't always money for both.

        What really gets me is the armchair quarterbacks that seem to think that any systems administrator or contractor on the planet has the ability to force their clients/employers/etc to spend money and make the choices that the armchair quarterback would make.

        Of course, when the Anonymous Coward knows only 10% of the story, that isn't a problem, because it's obvious that everyone should do everything according to the most paranoid possible design costing the maximum amount of money using the best possible equipment and all of the relevant whitepapers. The part where doing that would bankrupt most SMEs is irrelevant. Nerds believe in IT over all things.

        Forget the people, forget cashflow; the money is always (magically) there, it is just that business owners are withholding it to fund their massage chair. Salaries of staff don't need to be paid; you need to hire more IT guys. The ability of sales, marketing etc to generate revenue is irrelevant, all that matters is that they cannot possibly affect the system stability and that the data (generated by what? Why?) is secure.

        So yeah; shit happens, and in a perfect world you'd get an up front investment from them to prevent issues and solve potential issues. In the real world, however, things get messy. Oftentimes they simply don't have the money, can't obtain it and/or aren't willing to do things like mortgage their own house to cover a remote possibility event.

        Other times, they are unwilling to make the investment and there's nothing you can do. It's your job as a sysadmin to do the best you can with what you have. You make your recommendations, you accept the choices the client makes and you help them as best you can.

  17. Pete 2 Silver badge

    Never forget the personnel angle

    > two sayings are worth keeping in mind:

    There's a third. Restores are useless unless you have the staff available to apply them

    All the talk about backups, restores, DR and high availability focuses on the technical aspect and never seems to address the issue regarding people. There's little point having a fully tested recovery plan, or backups that you *know* you [ well: someone ] will restore if needed, if that someone is either unavailable, indisposed, sacked or chooses not to do it (Yeah, go ahead: fire me. How will that get your system back up and running?)

    It could be something as mundane as the staff canteen serving up a dodgy lunch that lays the whole IT staff low, a particularly good party that does the same but more pleasantly, a scheduling "hundred year wave" where all the players are simultaneously on holiday, off sick, on strike, on maternity leave and freshly redundant, or any other unforeseen circumstance that means nobody answers the phone when the call goes out.

    Possibly the worst of all is when no-one can remember the key to all the encrypted personal data that was backed up and can be successfully restored, but for one tiny detail.

    So yes: make sure your tech is all fired up and ready to rock. But don't take for granted the person who has to make it all happen.

  18. Anonymous Coward

    Relying on capped data links

    The problem I see with this is that while the ISP can promise you that uncapping the link to 100Mb will work any time you care to do it, if the disaster that takes out your primary site affects enough other customers with similar agreements they may not be able to give everyone the promised uncapped speed.

    That's something I'd look at VERY carefully before making that part of a DR plan. Perhaps that was done in this case, but seeing as how the ISP would probably consider any penalties for failure to provide the full bandwidth as a "cost of doing business", I'd want some pretty stiff penalties before this arrangement let me sleep at night.

    1. Trevor_Pott Gold badge

      Re: Relying on capped data links

      If wishes were horses we'd all ride.

      What I would like from an ISP arrangement, or amounts of available bandwidth, or budget, time, storage, development cycles, applications, operating systems, coffee vendors, dispensaries of bagels and whatever else it is that runs my life has very little to do with what I get. You get what's available. Your job is to make things work as well as possible within those boundaries.

      As it is, the cost of bandwidth is mind-numbingly prohibitive. Canada: lots of cheap, shitty quality downstream bandwidth, but you'll have to toss virgins into a very rare Ebrus-class stratovolcano to get upstream that isn't utter pants.

  19. OzBob

    Kudos to Trevor

    for sharing his experience and exposing himself to the self-righteous and indignant of the world.

  20. Anonymous Coward

    Tiger Team

    Maybe what was missing was a "Tiger Team"; one or preferably more people, chosen for their cynical and destructive temperament, whose entire job would be to find points of failure. Certainly an expensive investment, but maybe not a luxury.

    One of the oldest and soundest rules of testing is that the people who made something are never the best people to find mistakes in it. Mother love?

    1. Trevor_Pott Gold badge

      Re: Tiger Team

      Agree entirely. And it's a fantastic argument for external audits, too. :)

  21. SirDigalot

    100Mbps? How quaint...

    We vrep all our VMs to the DR site; databases are log-shipped.

    Email uses DAG and is basically instant.

    The vrep machines need a little manual jiggery-pokery and are live.

    We can bring our entire SaaS product(s), with over a TB of data across 4 separate systems, online and open for business after a total obliteration of our main datacenter in less than 30 mins - if there is anyone alive to do it... (probably me, since everyone else is in the office next to the datacenter... I have often considered some radical promotion prospects over the years...)

    Under 2 hours and we have the entire company working from our DR location.

    Unfortunately our dev dept has not yet got to grips with true geographically diverse databases, so we have to do it the old-fashioned way until they work out how to keep all databases online and updated in more than one place.

    It could be done on a 100Mbps line - we only had 200 when I started here - however we opt for a little more bandwidth for safety, somewhere near 1000Mbps. We do the same for our Florida office too, and that has basically nothing in the way of server infrastructure, though it could.

  22. Long John Brass

    spherical cows in a vacuum

    Some clichés:

    No battle plan survives contact with the enemy.

    Cheap, fast, reliable; Pick any two

    My favourite solution to the DR problem is to use the DR systems as part of the QA or integration testing cycle.

    That way you know that the DR rig actually works; DR then involves switching out the QA/Int DBs for the Prod copy and running a few scripts that swap any configs that need to be changed.

    Actually managed to convince one client to implement the above, and they even agreed to a live DR test once a year. Once a year, Prod would migrate over to DR, run there for a week or two, then migrate back. The first time was hairy and scary... it got easier after that :)

    Old prod kit would move to DR (QA/Int test), then to the dev/test racks. I believe the dev systems should be slow (keeps the devs honest).

This topic is closed for new posts.