Sysadmins: Everything they told you about backup WAS A LIE

So, you're a sysadmin, slaving away to maintain the impossible 100 per cent uptime demanded by The Powers That Be. How many common myths about storage do you really believe? More to the point, how many of these common myths do your bosses believe? Of course, it really doesn’t matter which backup vendor you use - the myths are …

COMMENTS

  1. Khaptain Silver badge
    Pint

    Some extra points

    * Ask yourself: how will I cope with vendor lock-in?

    * What is backed up is the responsibility of the data owner, not the IT department. Are the other departments aware of the costs and time involved? Ask them to reconsider what they need compared to what they want.

    * Add the time it takes to bring the offsite backup tapes onsite into the restoration SLA.

    * VERIFY YOUR LOGS... EVERY DAY (a minimal sketch of such a check follows this list).

    * Inform people when backups fail - it is, after all, their data.

    * ALWAYS have more than one backup medium (hard disks + tapes), for example hard disks onsite + tapes offsite.

    * Ensure that new network shares are added to the backup selection.

    * Have you tested restoring your backup tapes on another site / hardware / server? If your building and your hardware are completely destroyed, you will be glad that you can recover your data elsewhere.
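
    On the "verify your logs" point, here is a minimal sketch of what a daily check might look like, assuming a hypothetical directory of per-job log files in which the backup software writes a success marker; the paths, marker string and addresses are all illustrative, not any particular product's format:

    ```python
    #!/usr/bin/env python3
    """Minimal daily backup-log check (illustrative sketch only).

    Assumes each backup job leaves a log file in LOG_DIR and that the backup
    software writes a line containing SUCCESS_MARKER when a job completes.
    Paths, marker and addresses are hypothetical - adjust for real tools."""
    import datetime
    import smtplib
    from email.message import EmailMessage
    from pathlib import Path

    LOG_DIR = Path("/var/log/backup")            # hypothetical log location
    SUCCESS_MARKER = "completed successfully"    # hypothetical success line
    RECIPIENTS = ["it-team@example.com", "data-owner@example.com"]

    def check_todays_logs() -> list[str]:
        """Return a list of problems found in today's backup logs."""
        today = datetime.date.today()
        problems = []
        logs = [p for p in LOG_DIR.glob("*.log")
                if datetime.date.fromtimestamp(p.stat().st_mtime) == today]
        if not logs:
            problems.append("No backup logs written today - did the jobs run at all?")
        for log in logs:
            if SUCCESS_MARKER not in log.read_text(errors="replace"):
                problems.append(f"{log.name}: no success marker found")
        return problems

    def notify(problems: list[str]) -> None:
        """Tell the data owners as well as IT - it is, after all, their data."""
        msg = EmailMessage()
        msg["Subject"] = "Backup verification FAILED"
        msg["From"] = "backup-check@example.com"
        msg["To"] = ", ".join(RECIPIENTS)
        msg.set_content("\n".join(problems))
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        issues = check_todays_logs()
        if issues:
            notify(issues)
    ```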

    1. Pete 2 Silver badge

      But who gets it in the neck?

      > What is backed up is the responsibility of the data owner, not the IT department.

      That's all very well. The problem is that the universal impression of pretty much everyone in business is that if it runs on a computer, it's the IT department's fault when it fails. No matter how much you'd like to argue about charters, SLAs, job descriptions or anything else, these will all be perceived as excuses for trying to weasel out of an IT failure and blame the problem on someone else.

      In fact, if you are successful in getting this point across you could easily find you've just talked yourself out of a job. [ MD thought process: Well, if IT aren't responsible for this, what are they doing ... maybe it's time to "do more with less" ]

      1. Khaptain Silver badge

        Re: But who gets it in the neck?

        Pete, I understand completely what you are saying. If a counterargument is required, it all boils down to finance.

        The accounting department has requested that 1TB of data "must" be backed up every day (their choice, not yours). OK, no problem: the price of disks/tapes/hardware will be X per year.

        My role as the IT guy is also to say to them that maybe the alternative is to back up only a subset of that data (everything else can be archived). In that case only 20GB of data will be backed up daily and the cost will be Y per year, which will be lower if you are careful. You gain not only in cost but also in backup time (important when problems arise) and in recovery time (vital for the DR scenarios - get them up and running quicker). BUT it "must" be the accounts department which makes the decision as to which data is vital and which is not.

        The bean counters understand figures far better than technology.

    2. Yet Another Anonymous coward Silver badge

      Re: Some extra points

      You need to backup the working environment not just the data.

      Our data is totally backed up and we can restore it instantly to our server farm, which is currently:

      a) underwater after we were flooded

      b) in the middle of an area of the city that is cordoned off because of a bomb threat

      c) working perfectly but inaccessible to all of our staff because of the above

  2. Anonymous Coward
    Anonymous Coward

    I've been in Storage, Backup & Recovery, data protection and DR for about 16 years or so, in industry (financial services) and now as a software and solution designer, and I totally agree with everything you said.

    Which is a shame, because I'm feeling rather contrary today, but I can't pick a hole in your article. People need to design, build and fund recovery solutions. The sooner we stop talking about backup and start talking about recovery, the better.

    1. Destroy All Monsters Silver badge
      Trollface

      If you want to nitpick ... the numbering is off.

      Numbering should be machine-generated.

      1. Lozzer292
        Coat

        Numbering

        The repeated point *was* about replication....

    2. Anonymous Coward
      Anonymous Coward

      @AC 12th July 2013 10:02 GMT

      Precisely!

      Actually we had a massive database corruption yesterday, thanks wholly to Windows Cluster & SQL Server not playing together. A cluster failover event for no apparent reason resulted in a SQL Server database corruption. The event occurred DURING the daily incremental backup. The suspect database went into recovery on cluster restart, and we had no idea how long that would take (in the end it took 13 hours).

      Sadly, plan B, restoring the database, meant recovering 1 full backup + 1 incremental backup + 23 hours and 53 minutes' worth of log files. Unluckily, yesterday included a massive database grooming exercise removing a few hundred million rows, so the log files were larger than normal :(. After the restore we would have to post everything that didn't get posted after the failure.

      This, I submit, was Murphy at work. There was no worse moment for it all to go pear shaped.

      It took 10 hours to recover the database and another 4 hours to bring it up to date before I could start delivering data to customers.

      We now know our worst case scenario :(
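
      As an aside on that "1 full + 1 incremental + a day of log files" chain: the logic for choosing which files make up a point-in-time restore chain is simple enough to sketch. This is a hypothetical illustration over generic file metadata, not SQL Server's actual RESTORE syntax or this commenter's setup:

      ```python
      from dataclasses import dataclass
      from datetime import datetime

      @dataclass
      class Backup:
          kind: str            # "full", "diff" (differential/incremental) or "log"
          taken_at: datetime

      def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
          """Pick the files needed to restore to `target`: the last full backup
          before the target, the last diff taken after that full, and every log
          backup taken after the diff (or the full) up to the target."""
          fulls = [b for b in backups if b.kind == "full" and b.taken_at <= target]
          if not fulls:
              raise ValueError("no full backup available before the target time")
          full = max(fulls, key=lambda b: b.taken_at)

          diffs = [b for b in backups
                   if b.kind == "diff" and full.taken_at < b.taken_at <= target]
          diff = max(diffs, key=lambda b: b.taken_at) if diffs else None

          base_time = diff.taken_at if diff else full.taken_at
          logs = sorted((b for b in backups
                         if b.kind == "log" and base_time < b.taken_at <= target),
                        key=lambda b: b.taken_at)

          return [full] + ([diff] if diff else []) + logs
      ```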

      1. Anonymous Coward
        Anonymous Coward

        Re: @AC 12th July 2013 10:02 GMT

        You were, intentionally albeit perhaps reluctantly, doing [non-routine thing A] and [non-routine thing B] and maybe [non-routine thing C].

        Yet you seem surprised that [non-routine thing D] happened.

        Is it possible that [non-routine thing D] occurred specifically *because* of all the other non-routine stuff going on?

        Sometimes people are unlucky. Sometimes they're just waiting for the inevitable.

        1. Anonymous Coward
          Anonymous Coward

          Re: @AC 12th July 2013 10:02 GMT

          Daily Incremental - routine (coincident with cluster event)

          Weekly purge - routine (completed long before cluster event)

          Daily Ops - routine (nothing unusual or extra was running)

          Cluster failover due to LOCK exhaustion (64 bit 2008) - definitely non-routine and unpredictable

          Thus, you are in fact incorrect.

          Predicting when SQL Server will go tits up because of lock table exhaustion is a tricky business - impossible, I would suggest. Causing it to happen coincident with a regular differential backup would be impossible to do on purpose, and statistically in the fabulously unlikely bucket by accident.

          The obvious answer is more incremental backups. I already generate 1/2 TB of backup files/day - there are limits to everything, including disk space.

  3. graeme leggett Silver badge

    scary but true

    Makes one break out in a sweat thinking about it.

  4. Anonymous Coward
    Anonymous Coward

    Point 3 is wrong

    You can either deploy a new server, or you can restore the OS. Neither option is definitely right or definitely wrong, but to say "you’ll never use it to restore a system" is just plain wrong.

    1. This post has been deleted by its author

    2. JimC

      Re: Point 3 is wrong

      It must be twenty years since I restored an OS. When it comes to bare metal you need an OS to be able to restore anything, so why restore the OS on top of a working OS? Disk images for virtuals are another kettle of fish, I guess.

      1. Anonymous Coward
        Anonymous Coward

        Re: Point 3 is wrong

        " you need an OS to restore anything, so why restore the OS on top of a working OS"

        Because the lightweight single function OS you use to do the restore isn't the same as the OS you use to run the application, perhaps?

        Back in the days when I used to care about these things, you booted DOS or a minimal Linux to do the restore, even if the eventual target environment was a Windows box.

        Does it not work like that any more?

        Or are you seriously suggesting installing Windows to do the restore and then using the same Windows to run the application(s)? Anyone see any problems with that?

        1. Peter Gathercole Silver badge
          Meh

          Re: Point 3 is wrong

          My view is that it depends entirely on how much has changed in the OS since it was installed, and that is probably determined by the function of the system being backed up.

          I've worked in an environment where every server in the server farm is a basic install with scripted customisations, with all the data contained in silos that can be moved from one server to another (the bank I used to work for had been doing this on a proprietary UNIX since the turn of the century, before Cloud was fashionable). These systems can be re-installed rather than restored.

          I've also worked in environments where each individual system has a unique history that is difficult to replicate or isolate. These systems need to be restored.

          One example of this latter category is the infrastructure necessary to reinstall systems in the former category!

          There just is not one fixed way of doing things. Each environment is different.

    3. NogginTheNog

      Re: Point 3 is wrong

      Rebuilding a server only works if you know EXACTLY how the current one was built - not just the OS config but the application stack and EVERYTHING! And I know too many companies where that simply isn't the case.

  5. Anonymous Coward
    Anonymous Coward

    Strongly Agree: Backups don't really matter.

    Restores matter a lot more. Or recovery, if you'd prefer that word.

    Rarely has so much wisdom been concentrated in so few words in The Register.

    More of this kind of thing, please.

    [I'm not AC 10:02, I've never been in the City, but we're apparently thinking very similarly on this subject]

    1. Ken Hagan Gold badge

      Re: Strongly Agree: Backups don't really matter.

      "Rarely has so much wisdom been concentrated in so few words in The Register."

      We can probably precis it down even further to just point (2), though. Really the other 9 follow logically from there.

      There is no such thing as backup; there is only restore.

      1. perlcat

        Re: Strongly Agree: Backups don't really matter.

        I can condense it down even further.

        Nobody ever gets fired for backup failures. They get fired for failing to restore.


  6. Robert Carnegie Silver badge

    Well, yeah.

    Apparently, health care has recently been revolutionized by the use of checklists. These are sets of must-do items that, considered individually, are obvious - until you put them on a checklist so that they don't keep getting missed.

    So, yes, backups are for restoring, when you have to. And when you have to, you have to. If you aren't ready to restore then you aren't ready. And you don't know for sure that you have a backup ready to restore, until you restore it.

    Backing up is a tedious inconvenience - until you need to restore.

    I suppose we probably aren't talking about desktop PCs here, but there is that factor too - as far as I know, Microsoft Windows still puts itself and its users' data all on one disk partition. If you want to back up the whole system (which seems like a -good- idea to me for a fast restore), you have to back up the -whole- system - unless you do partitioning yourself, which -is- easier nowadays.

    Another factor, though, is that if what I once read about Microsoft's way with GPT applies, then even a fairly straightforward design of disk with separate partitions for useful things is liable to be littered with tiny extra partitions for Microsoft's own amusement.

    But then, restoring the Windows partitions probably won't get your PC running again in any case.

    I think I may be saying that some things you just can't back up.

    1. Cliff

      Re: Well, yeah.

      Checklists - pilots with thousands of flying hours still use checklists. It is a very simple, efficient system: even though a job becomes habitual, you still check and re-check your assumptions.

      1. Neil Woolford
        Holmes

        Re: Well, yeah.

        Aviation was the original home of the checklist for complex but routine operations. The use of them came out of a crash that nearly bankrupted Boeing.

        More at http://www.atchistory.org/History/checklst2.htm

        1. Destroy All Monsters Silver badge
          Trollface

          Re: Well, yeah.

          Checklists gave Hitler the key to Europe!

        2. Anonymous Coward
          Anonymous Coward

          Re: Well, yeah.

          And when people ignore the checklist or trust the tick in the box rather than actually looking, Bad Things eventually happen.

          Design engineer: Have we got a simple visual indicator to tell if the bonnet is latched closed? No problem, costs too much anyway, people will always check it the hard way.

          Service technician: Is the bonnet properly latched closed? Y/N

          Post-service inspector: Has the technician *actually* ensured the bonnet is properly latched closed (not just ticked the box)?

          Operational crew: Have the technician and the inspector *actually* ensured the bonnet is properly latched closed (not just ticked the box)?

          Not talking about car bonnets, but aircraft engine cowlings, and what happens when these four people (plus others) all fail in sequence, and the problem is known about and largely ignored for two decades:

          http://www.flightglobal.com/news/articles/dual-cowl-mystery-at-centre-of-ba-a319-probe-386495/

          1. Robert Carnegie Silver badge

            Inspector?

            I think I understand correctly that the pilot always personally walks around the outside of the plane looking for anything wrong. Wouldn't you, too?

            The incident that your link is about seems to be a relatively rare case of this failing, because - as hinted - it can be difficult to see whether these engine door things were properly closed.

            But I bet they're still checking extra-carefully now.

            1. This post has been deleted by its author

            2. Anonymous Coward
              Anonymous Coward

              Re: Inspector?

              "I think I understand correctly that the pilot always personally walks around the outside of the plane looking for anything wrong. Wouldn't you, too?"

              Indeed, and that was emphasised particularly after the first few incidents of this nature. But apparently the latch in question is on the underside of the engine, and viewing it would apparently involve bending down. Apparently nobody's yet thought of using a mirror-on-a-stick as used to be used for under-car bomb inspections. A microswitch cabled to a light in the cockpit (a long way away) and/or to an input on the engine control unit (a few feet away) is out of the question too apparently, given all the other failsafes which have to go wrong for one of these to escape.

              I'm not that sure it deserves to be called relatively rare - check the history of airworthiness directives etc.

              How does this relate to backup?

              Well, it adds a supplementary to Ken Hagan's terse but entirely appropriate summary posted at 17:31

              The long version of the supplementary is "What can go wrong, will go wrong. However many failsafe checks you build in, someone/something will one day defeat each one. Eventually, if you repeat the sequence enough times, there will very likely be an occasion where someone/something will defeat all of the checks in a way that was probably entirely foreseeable but considered infinitely improbable. Sometimes it won't matter. Occasionally it will. Are you feeling lucky?"

              I'll leave Ken to summarise again, if he'd like to.

    2. david 12 Silver badge

      'as far as I know, Microsoft Windows'

      Cripes. And it's already got upvotes too!

      Look, if what you want to say about Windows can be tagged "if what I once read about Microsoft's way", then it isn't worth writing, no matter how many equally ignorant people there are to agree with you.

      Anyway, responding to your central point, 'Microsoft Windows' doesn't put itself and user data all on one disk partition.

      Users do.

      Like on all of the *nix based home and office systems I have ever seen, including those down on the factory floor right now. In contrast, ALL of the enterprise-level workstations I have worked with for the last 15 years, from around 1998 to now, have put the OS and the user's data on separate disk partitions. So shoot me. I've never worked at enterprise level with Linux workstations. At least I don't make dumb comments like "as far as I know, OSX still puts itself and its users' data all on one disk partition."

      And you NEVER restore the OS partition. If something goes wrong with that, you just re-install.

      1. Anonymous Coward
        Anonymous Coward

        Re: 'as far as I know, Microsoft Windows'

        I'm having trouble understanding your point here.

        I've been using and sysadminning NT since the MSDN pre-release (1993?), and still do with its successors.

        It is hard (to the point of insanity) to keep a consistent system backup of an individual system, unless user data and OS data are backed up at the same time. Didn't say impossible (you want to move My Documents etc, feel free, but there's other stuff too). You don't see the inconsistencies till the restore, of course.

        I've been using and sysadminning *NIX (and VMS) rather longer. And Linux almost as long (mostly Suse, occasionally others) since, well, whenever Suse 8 was (2000ish). Yes I'm a dinosaur.

        Maybe it's just me, but my experience has been that with those OSes, getting a consistent restore (eg by keeping "/" separate from "/home" on *ix) is trivial in comparison with doing the equivalent on Windows.

        YMMV.

        1. Anonymous Coward
          Anonymous Coward

          Re: 'as far as I know, Microsoft Windows'

          That's because Unix, and later Linux, was written slowly and carefully to do the job properly. By geeks for geeks.

          Windows has always had the marketing and bean counters overriding some extremely clever engineers because they so desperately yearned to see that Windows logo on every computer in the world.

  7. McDoes

    And maybe the most important one: backup/restore is not intended to address or solve archiving or data retention requirements... Maarten

  8. Ant Evans

    Utilities

    IT is a utility. That gives it certain weird and unpleasant characteristics, and we need more thought on this topic. I'd write a book but I have contracts to run.

    1. A utility is an asymmetrical service. In a utility, when you spend a fortune, innovate brilliantly, bust your gut to make things run perfectly, then save the business from a problem it didn't even know it had, you get this well-known result:

    Nothing.

    2. When you take your eye off the ball for one minute of the half million minutes in a year, or when something breaks that's within your remit but beyond your control, or when you make a dumb mistake, you get this well known result:

    Shit.

    Ever wondered why your job is so thankless? a. You work in a utility, and these are the only two possible results of your work. b. The five nines of Nothing you have produced in no way shields you from the amount of Shit that will rain on you when something goes wrong.

    3. Utilities are easy to shave costs from. Why spend all this time and money on Nothing? If you spend less, you still get Nothing, at least for a while. This means that in return for the Nothing you produce as a utility provider, what you can expect from the organization for the production of Nothing is, therefore, Less.

    IT is not special - this is true at water companies, chicken farms, and banks, and all the other things humans have got operationally good at.

    IT has responded by trying to enable things and innovate. That's nice, and probably necessary, but today's innovation becomes tomorrow's baseline. Now you have to work harder to produce Nothing. At best you might be able to argue for more resources. But not for long.

    DR and security are the most utility-ish part of IT because most of the effort manifestly produces Nothing, by design. That's why DR and security have to resort to a bit of hyperbole once in a while to get proper funding.

    It's systemic.

  9. John Smith 19 Gold badge
    Thumb Up

    So backup is backup and everything else (which people might use as backup) is not

    I know, when it's put that way it's obvious - except when it's not, and admins (or perhaps their PHBs) think they can get away with using something else as backup.

    Excellent article (and some excellent comments) - the part about "utilities" would explain a lot.

    Thumbs up for a useful reminder of some things some people may have forgotten.

  10. HereWeGoAgain

    Completely agree...

    Except for this:

    " If you feel the need to back up an operating system several thousand times… feel free, I guess, but you’ll never use it to restore a system."

    Well that depends. I have an old OS on a machine, which won't be upgraded because that will be a lot of work for somebody - likely none of the apps that are on it will work. So in addition to backing this machine up using *some piece of backup software*, I also back it up using ufsdump. If I ever want to restore this machine on bare metal, I will use that dump to install the OS. I know it will restore exactly as the box is now. I know this because I have tested the restores in VirtualBox.

    There is practically no chance of finding the original OS install disks, and the long-forgotten patches and tweaks, that have been applied to this box. So dumping and restoring the whole thing is the best way.

    Of course the box should be upgraded/updated. But it won't be, unless it breaks.

    1. Tom 38
      FAIL

      Re: Completely agree...

      You cannot be serious! This application/system is a major point of failure, and by your own admission, if this box fails you have no hardware or software which can run this application!

      Even though you are religiously backing up both the OS and the application, a single hardware failure could leave this system broken until you can urgently migrate it to a different host. If you can't restore a backup, it isn't a backup.

      1. HereWeGoAgain

        Re: Completely agree...

        I am serious.

        We can restore the backup - I test this from time to time in VirtualBox. If the worst came to the worst, that is exactly what I would do as an interim measure.

        As and when it fails, the powers that be will understand its importance and maybe then provide the time and money to update it.

  11. Velv
    Boffin

    Missed a key common fail

    A backup is NEVER an archive. (regular readers of my comments will know I bang on about this)

    A backup is there to allow you to recover (as you state).

    An archive is a primary copy in the information lifecycle.

    Two entirely different concepts.

    1. Tony-A
      Pint

      Re: Missed a key common fail

      A backup is NEVER an archive.

      EXCEPT: When you need one, the other is much better than nothing.

  12. itzman

    A tale of restoring..

    I back up my whole computer nightly to a headless server.

    The other week it died. Never mind why - I just got fed up with waiting for it to do something mindless and yanked the power cord. Couldn't discover which bit of it had got trashed. Took a view: quicker to reinstall an upgraded OS and then recover all the parts from the backup.

    And it worked.

    Bit by bit everything came back, with the old machine's image mounted on a spare directory and crucial bits copied back across.

    Nothing was lost, and whilst I echo the point that you don't need to back up everything - in this case the bulk of the data was on the server anyway - backing up the whole OS was not in fact a huge hardship.

    Let's say that disk is cheap, working out what to put on it is not.
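
    For the curious, "crucial bits copied back across" can be as simple as the following sketch, assuming the old machine's image is already mounted read-only at a hypothetical mount point; the path list is purely illustrative and should be reviewed before overwriting anything on a working system:

    ```python
    #!/usr/bin/env python3
    """Selectively copy configuration and data back from a mounted backup image
    onto a freshly reinstalled system (illustrative sketch only - review each
    path before overwriting anything on a working machine)."""
    import shutil
    from pathlib import Path

    OLD_IMAGE = Path("/mnt/old-machine")   # hypothetical mount point of the backup image
    CRUCIAL = [                            # illustrative paths - choose your own
        "etc/ssh",
        "home/me/.config",
        "var/lib/important-app",
    ]

    for rel in CRUCIAL:
        src = OLD_IMAGE / rel
        dst = Path("/") / rel
        if not src.exists():
            print(f"skipping {rel}: not present in the image")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            # dirs_exist_ok merges into an existing directory instead of failing
            shutil.copytree(src, dst, dirs_exist_ok=True)
        else:
            shutil.copy2(src, dst)
        print(f"restored {rel}")
    ```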

  13. KPz

    And when designing your storage...

    ...make sure you allow for how you're going to back it up.

    I've seen 12TB CIFS volumes configured, which makes backups interesting.

  14. Anonymous Coward
    Anonymous Coward

    Nice to see a decent sysadmin article on this site which isn't written by a petulant prick and isn't a glorified advertisement for some shitty product which only runs on Windows. Good stuff. Can we have more, please?

  15. Daniel B.
    Boffin

    Full and incremental backups

    Usually you do a full backup and then incremental ones during the week. That's why you don't have to back up terabytes upon terabytes of data. Of course, you should also have another team restoring said backups on the DR platform, which serves as both DR readiness and a test that the backup media is actually working.

    Ah, the woes of a certain company that found out their backups were worthless the day their server went down...
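
    A minimal sketch of the full-plus-incremental idea, assuming a hypothetical manifest of file modification times so that each incremental run copies only what changed since the previous run; real backup software does this far more robustly (block-level, catalogues, retention and so on):

    ```python
    import json
    import shutil
    from pathlib import Path

    SOURCE = Path("/data")              # hypothetical data to protect
    DEST = Path("/backups")             # hypothetical backup target
    MANIFEST = DEST / "manifest.json"   # file mtimes recorded at the last run

    def run_backup(label: str) -> None:
        """Copy files changed since the last run; no manifest yet means a full backup."""
        DEST.mkdir(parents=True, exist_ok=True)
        previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
        current, copied = {}, 0
        for path in SOURCE.rglob("*"):
            if not path.is_file():
                continue
            rel = str(path.relative_to(SOURCE))
            mtime = path.stat().st_mtime
            current[rel] = mtime
            if previous.get(rel) != mtime:      # new or changed since the last run
                target = DEST / label / rel
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(path, target)
                copied += 1
        MANIFEST.write_text(json.dumps(current))
        print(f"{label}: {copied} files copied")

    run_backup("full-sunday")    # first run copies everything
    run_backup("incr-monday")    # subsequent runs copy only what changed
    ```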

  16. CellThree

    So what would be the most efficient way of doing a system backup with full system restore? One of our customers is getting paranoid about server failure at the moment. The current server is running Win2008R2 for some main programs and data storage, with a Win2008R2 virtual server handling their Exchange. They are using around 45GB on their C: drive and about 100GB on their data drive.

    We have two QNAP 2TB drives set up. One does a daily incremental backup and stays on site; the other does full weekly backups of the whole system and is removed from site the following morning. We're using Acronis Backup and Recovery 11.5.

    The customer wants the best solution for getting back up ASAP with minimal data loss if the server goes down. They are possibly thinking of having another server built which will mirror the existing one, so they can swap to it if need be. I don't really know what would be involved in that for carrying out either a daily or weekly snapshot.

    Also, bear in mind their internet connection is only a 2Mb line, so cloud backup would be useless.

  17. This post has been deleted by its author

    1. Jon 37
      Boffin

      Re: >AHEM<

      Nope. He's implying that they *can* do that (due to the access they have to all the data), so you better choose people who *won't* do that.

      For SMEs this probably isn't an issue - anyone in IT probably has admin access to every system anyway. But for big companies where different systems have different administrators, if you have a common backup team then they might be the only people with access to everything.

  18. MrScott

    AntyEM

    Folks in the application community know the best backup is one taken by the application immediately prior to the point of failure; of course, it is always better to immediately restore the backup to validate the recovery. Unfortunately precognitive backup software does not exist, just like service availability = 1. If storage software could tell you when your storage array was about to go TU or your datacenter EPO was about to be tripped, we really wouldn't need to worry about accidental data loss. I have experienced both. Here's the fix: get rid of the red button and store the data in more than one place. If the place burns down, tag the site as a fire hazard and move instead of replacing the wires.

  19. Stewart McKenna

    It's very simple

    you have user data backups

    you have OS backups

    OS backups require

    1) Mirror OS to prevent issues with h/w failures

    2) Alternate boot disks - copy of OS disk that is not online except when you make the backups

    3) Bootable tapes/DVD if possible - off site

    No, you cannot restore the OS from kickstart/PXE/Puppet etc. unless you are constantly refreshing that data.

  20. RonWheeler

    Backups are tickbox crap for auditors

    Recovery is what matters. Only after having done several offsite restores was this driven home to me. The old ways of doing stuff (file-based / tape / only 'important' databases with the rest done manually / manually reconfiguring networking in the event of DR / anything involving Backup Exec / ignoring fast recovery of client infrastructure) - all crap.

  21. J__M__M

    Hey Author, you didn't say "cloud" or "bare metal". Will you marry me?

  22. Anonymous Coward
    Anonymous Coward

    OS Backup

    Point 3 (i) reads like someone who hasn't tried to recover a Windows server. Given the dependency on the registry and all the tweaks various apps require - such as having to disable UAC, custom permissions on folders, use of HKCU, or apps installed on the same drive as the OS, or even if another partition has been selected the app will still go and write something to system32 anyway - sadly, yes, you need to back up the OS. You *nix sysadmins don't know how good you have it. ;-)

  23. N2

    Hmm Backup

    My bosses believed anything they saw fit - everything is backed up because we have an IT manager...

    IT manager to bosses - ooh, you need a NAS server with a few disks to store an independent backup that is secure, and an off-site backup to Backblaze (or some other off-site service) in case it all burns down.

    Bosses - Err, that's too expensive.

    //Bollox, duck you then.

  24. Peter Gathercole Silver badge

    Full tests are good

    I did most of the technical design for the backup/recovery and DR of UNIX systems at a UK Regional Electricity Company back in the late '90s.

    The design revolved around having a structured backup system based around an incremental forever server and a tape library.

    One of the requirements of getting the operating license for the 1998 deregulated electricity market in the UK was passing a real disaster recovery test. A representative of the regulator turned up on a known day, and said "Restore enough of your environment to perform a transaction of type X". The exact transaction was not known in advance.

    We had to get the required replacement hardware from the recovery company, put it on the floor, and then follow the complete process to recover all the systems from bare metal up. This included all of the required infrastructure necessary to perform the restore.

    First, rebuild your backup server from an offsite OS backup and tape storage pool, and reconstruct the network (if necessary). Then rebuild your network install server using an OS backup and data stored in the backup server. Then rebuild the OS on all the required servers from the network install server and data from the backup server. All restores on the servers had to be consistent to a known point in time to be usable. Then run tests, and the requested transaction.
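
    That paragraph is really a dependency chain; one way to make the ordering explicit is to express each rebuild step with its prerequisites and let a topological sort produce the runbook order. The step names below are illustrative, not the actual REC environment:

    ```python
    from graphlib import TopologicalSorter

    # Each rebuild step maps to the steps that must complete before it (illustrative names).
    dr_steps = {
        "rebuild backup server from offsite OS backup": set(),
        "reconstruct network": set(),
        "rebuild network install server": {
            "rebuild backup server from offsite OS backup",
            "reconstruct network",
        },
        "rebuild OS on application servers": {"rebuild network install server"},
        "restore application data to a known point in time": {
            "rebuild OS on application servers",
            "rebuild backup server from offsite OS backup",
        },
        "run tests and the requested transaction": {
            "restore application data to a known point in time",
        },
    }

    for step in TopologicalSorter(dr_steps).static_order():
        print("->", step)
    ```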

    And where possible, do this using people other than the people who designed the backup process, from only the documentation that was stored offsite with backups, using hardware that was very different from the original systems (same system family, but that was all).

    Apart from one (almost catastrophic) error in rebuilding the backup server (the install admin account for the storage server solution had been disabled after the initial install) - which the inspector was told about, but allowed us to fix and continue because we demonstrated, while he was there, a permanent change that overcame the problem - the process worked from beginning to end. Much running around with tapes (the kit from the DR company did not have a tape library large enough!), and a frantic two days (the time limit to restore the systems), but it was good fun and quite gratifying to see the hard work pay off. I would recommend that every system administrator goes through a similar operation at least once in their career.

    We were informed afterwards that we were the only REC in the country to pass the test first time, even with my little faux pas!

    When the supply and distribution businesses split, we used the DR plan to split the systems - so good plans are not only used in disasters - and I've since done similar tests at other companies.

    1. Anonymous Coward
      Anonymous Coward

      Re: Full tests are good

      Sir, respect is due.

      Does that kind of thing still go on, or has "light touch regulation" (see also: ISO9000, CQC tick-the-box paper-based audits, and many others across many fields), together with a mostly-brainless Windows-centric monoculture, largely rendered doing things RIGHT redundant?

  25. OzBob

    Some slightly enhanced additions...

    Determine before a crisis what the recovery strategy will be for the various scenarios. You can guarantee a DBA will pipe up and go "I reckon I can fix that", then stuff around for 4+ hours before admitting defeat, reducing the actual time you have to work with.

    By all means do a filesystem restore of the application, but make sure you have an OS image of all the libraries / dependencies / config files - selectively trying to pick packages / files to reinstall in a crisis is a painful hit-and-miss exercise.

    Be sure you know how to switch off all the processes that are non-essential to the system while it is being recovered. There is nothing worse than having an automated process stomp all over what you are trying to do.

  26. naw

    Great article

    Don't often find myself nodding in agreement - very insightful and a good read

  27. OkRay

    How True!

    I may not post as often as I should, but your irreverent style brightens my day and keeps me coming back for more!

    Great job!

This topic is closed for new posts.