Tech is the biggest problem facing archiving

Technology is the biggest problem facing archiving. Archives grow bigger and bigger. The amount of data to be kept grows ever bigger and threatens to overflow an archive installation. So, let's use LTO-6 tapes instead of LTO-5 ones because they hold twice as much data in the same physical space. That's logical but there is an …

COMMENTS

This topic is closed for new posts.
  1. Just a geek
    Flame

    There are actually places that keep older-style drives so they can recover data from the tapes sitting at places like Iron Mountain at the moment. Of course, the drive is only one issue when you need to recover data from 10 years ago or more; it goes a lot further than just the tape drives.

    So, once we've found the tape drive that we need, we'd then need to find the tape(s) and hope they work... Hang on, how do we restore it? OK, we need to build a server, so we need the correct software, maybe even the exact patch version. Where do we get that? The supplier doesn't exist any more (or has been bought out). We've got the server backed up on tape but are now in a chicken-and-egg situation, and even if we get the server back we may well need the domain, as some backup products (NetBackup, for example) will use the domain SID to generate an encryption key.

    You did keep the passwords for the domain somewhere safe didn't you and not just on the tape????

    Long term archive presents a whole series of problems. Tape format and machines to read it are just one of those problems.

    1. ukaudiophile

      Proprietary backup format = No backup

      You make a good point: storing ageing data in a proprietary format is fraught with problems. One site I worked at had a badly organised backup & storage system which eventually failed. Recovering the data involved building a NetBackup server from scratch. If those files had been stored as a regular tarball or a filesystem dump they could have been recovered far more rapidly.

      Don't confuse the backup with the archive. When I back up archive data, I always use the original format and ensure an O/S install disk is kept with it. I tend to favour optical media now; I find it durable and readers are easy to come by (I can still read CD-ROMs from 20 years ago). This way, once the O/S is up and running, the data can be accessed easily.

      Avoiding proprietary formats means there is one less problem between you and your data.

      1. Charles 9

        Re: Proprietary backup format = No backup

        The trouble is that ANY format can eventually become either proprietary or otherwise obsolete. Take your optical discs. More and more computers are showing up with no optical drives to speak of. So what happens when the time comes to actually retrieve that archived data, your drive is discovered to be knackered, and they don't make optical drives anymore (because they've moved on to things like 3-D crystal storage or something)?

        I can see what's going on. It's the competing pressures of paying up to keep all this archived data and the risk that the archived data may not be there when it actually becomes necessary. How do you know you actually need the data...before you actually need the data? And if the threat is existential, then price is no longer an object.

        1. Christian Berger

          Re: Proprietary backup format = No backup

          Well... That's why Unix people decreed that you should use text for everything. Every medium we have can store text, from stone tablets through microfilm to flash memory.

          Of course, thinking that LTO or anything else has a shelf life is idiotic; if you are lucky you can get a drive which can read it for the next 10 years. If you want something like that you should use a popular format like CD-ROMs or diskettes. Those might be available for 20 or 30 years, if you are lucky.

          The lesson the Unix people learned back then was simple: binary formats are unportable. Back then you might have had an 18-bit machine, storing three 6-bit bytes in an 18-bit word. And of course all of the tapes used there had 18-bit words. You cannot easily transfer that to an 8-bit machine; it just won't work. Text, however, can be punched out in Baudot or ASCII onto a paper strip and read in without any trouble.
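
          To make the 18-bit point concrete, here is a minimal sketch of packing three 6-bit codes into one 18-bit word and then dumping the word as plain text. The function names and the field order (first code in the high bits) are assumptions for illustration only; real machines of that era differed in both layout and character set.

```python
# Hypothetical illustration: three 6-bit codes packed into one 18-bit word.
# The codes are arbitrary values in 0..63, not any historical character set.

def pack_word(codes):
    """Pack exactly three 6-bit codes into a single 18-bit integer."""
    if len(codes) != 3 or any(not 0 <= c < 64 for c in codes):
        raise ValueError("need exactly three codes in the range 0..63")
    return (codes[0] << 12) | (codes[1] << 6) | codes[2]


def unpack_word(word):
    """Split an 18-bit integer back into its three 6-bit codes."""
    if not 0 <= word < (1 << 18):
        raise ValueError("word does not fit in 18 bits")
    return [(word >> shift) & 0x3F for shift in (12, 6, 0)]


def words_to_text(words):
    """Render each word as one line of decimal codes separated by spaces."""
    return "\n".join(" ".join(str(c) for c in unpack_word(w)) for w in words)
```

          The packed integer only makes sense to a reader who knows the word size and field order, whereas the output of `words_to_text` can be read back on any machine that handles text.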

  2. Lusty

    I think someone has been drinking a little too deeply from the marketing cup. Does Shell really need to archive much from its history? They strike me as the sort of company that has enormous quantities of current data but not a whole lot of historical data which needs looking after long term. They may well have loads of crap lying around, but I doubt very much of that is needed.

    I've worked with numerous companies in my role as a consultant and very few of them need to archive anything, and the vast majority of "archives" are simply old backups and not archives at all. There are the odd exceptions, such as museums and companies specifically associated with collecting data.

    The real main problem in the archive industry is a human one - training. People need to keep fewer files and companies need to clearly define what data needs to be archived and for how long. Even the law firm I worked in, when asked what would happen if the paper archive they are required by law to keep burned to the ground, said "not much, we're allowed to not have an archive if it was destroyed".

    1. Gordon Pryra

      @Lusty

      Then you also know the impossibility of getting a company to perform any form of data cleansing on any of their data.

      Even when you show them 3 TB of files ending in .tmp, they still refuse to delete them.

      Even when you show them the costs of hosting that data and the costs of backing it up, they refuse. Managers are too scared to delete anything, rightly believing that the moment something goes wrong at some point in the future they will be held accountable personally.

      The story mentions that drives are not such a problem. Pfft. I've seen a box of old MFM drives that are being kept by a big company "in case"; when you ask how the hell they will ever find a controller to actually spin those drives up, they tend to go glassy-eyed and forget you asked the question.
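
      For anyone wanting to put a number on the .tmp pile before making the case to management, a rough sketch along these lines would do. The function name `tmp_usage` and the case-insensitive suffix match are assumptions for illustration, not part of any particular tool.

```python
import os


def tmp_usage(root, suffix=".tmp"):
    """Return (file_count, total_bytes) for files under root ending in suffix."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable; skip it rather than abort.
                continue
            count += 1
    return count, total
```

      Multiplying the byte total by whatever the site pays per TB for hosting and backup gives the cost figure to show alongside it.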

      1. Lusty

        Re: @Lusty

        I never have issues with convincing people to remove data; the trick is not to pose it as a question. Do a thorough investigation into the data and then make a recommendation. It's probably your own ass-covering tactics which are causing your managers to rely on theirs. If you are confident then they will be confident. Why would you expect a manager to even know what a tmp file is, let alone why it's on YOUR server? The IT staff are in charge of data and should therefore either know what it is or remove it. Removing doesn't need to be permanent; there are ways of safely removing access for a given time before permanently removing it. You could even take a backup before final removal and stash the tape for a year; just label the tape "destroy data after dd/mm/yyyy" to ensure that the person who finds it knows it's useless.
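
        One way the "remove access first, delete later" idea could look on disk is sketched below. The directory naming scheme, the function names and the 365-day default are all assumptions for illustration; a real site would also need to handle name clashes and permissions.

```python
import shutil
from datetime import date, timedelta
from pathlib import Path

PREFIX = "destroy-after-"  # hypothetical naming convention for batches


def quarantine(path, quarantine_dir, today, hold_days=365):
    """Move path into a batch directory labelled with its destroy-after date."""
    src = Path(path)
    destroy_after = today + timedelta(days=hold_days)
    batch = Path(quarantine_dir) / f"{PREFIX}{destroy_after.isoformat()}"
    batch.mkdir(parents=True, exist_ok=True)
    dest = batch / src.name
    shutil.move(str(src), str(dest))
    return dest


def purge_expired(quarantine_dir, today):
    """Delete batches whose destroy-after date has passed; return their names."""
    removed = []
    for batch in sorted(Path(quarantine_dir).glob(PREFIX + "*")):
        try:
            destroy_after = date.fromisoformat(batch.name[len(PREFIX):])
        except ValueError:
            continue  # Not one of our batch directories; leave it alone.
        if batch.is_dir() and today > destroy_after:
            shutil.rmtree(batch)
            removed.append(batch.name)
    return removed
```

        The label on the directory plays the same role as the label on the tape: whoever finds it later can see at a glance when it stops mattering.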

        1. Yet Another Anonymous coward Silver badge

          Re: @Lusty

          What if you were interested in looking for oil under Borat-istan? Before bribing the local rulers or persuading the US to invade, it might be interesting to look at those surveys you did in the 1920s when it was British Eastern Turkey.

          1. Brewster's Angle Grinder Silver badge

            Re: @Yet Another Anonymous coward

            "What if you were interested in looking for oil under Borat-istan..."

            Then it shouldn't be in a .tmp file.

            1. Anonymous Coward
              Devil

              Re: @Yet Another Anonymous coward

              Everything in the universe is *.tmp.

          2. JEDIDIAH
            Linux

            Re: @Lusty

            More than just storage technology has improved since 1920.

            Any information that's 100 years old is suspect. It doesn't matter what kind of condition the media is in. The more sophisticated the information, the more likely it is to become obsolete before the computer storage format it resides on becomes unusable.

            1. Anonymous Coward

              Re: @Lusty

              Well, yes, but I'm sure they could still have very usable location data for old wells.

              1. Lusty

                Re: @Lusty

                And you don't think an oil company can afford to keep that info in online storage?

    2. Johan Bastiaansen

      No archive

      Being allowed to not have an archive if it is destroyed, is not the same as "don't need an archive".

      1. hplasm
        Headmaster

        Re: No archive

        "Is required to have..." is not the same as "need to have..."

  3. Luke McCarthy
    Linux

    Word 2113

    Is a non-problem. Stop using proprietary, undocumented data formats and the problem solves itself.

  4. SirDigalot

    Throw it in the cloud...

    After all, that's what everyone is pushing these days. If it is in the cloud it will always be there; I know because there are websites on the internet that are still around after 10 years, so it must be true...

    If you have enough money to keep 100 years' worth of data in pretty much any format, there is enough room in the storage vault to keep a server and tape library to read it. We keep older EOL stuff around for the oldest data, then when we no longer need it (after a paltry 7 years) we shred the lot and recycle the drive/server.

  5. Drummer Boy
    FAIL

    It's the same with paper

    Financial records get shredded after 10 years, or so, so why not all other records?

    Where I work we store offsite backups in destruction date order, as they are far more likely to be destroyed than accessed!

    Also, the chances are, if you don't need a backup in the first couple of years, you'll never need to access it.

    Also, as time goes on, all your historical 'tapes' (I use tape as a unit of currency rather than a physical medium) will fit onto a single, super-capacity future tape format, so the ongoing transfer will be a single tape at a time.

    Are we buying into the keep everything for ever trap, and what happens with 'Big Data' which doesn't even get backed up?

    1. Alan Brown Silver badge

      Re: It's the same with paper

      "Financial records get shredded after 10 years, or so, so why not all other records?"

      You're confusing "minimum retention period" with "mandatory destruction" - yes you may shred 'em after 10 years or you may keep them longer, but you may not shred them before that period - because that's how long the tax department is allowed to go back when they audit you.

    2. Radbruch1929

      Re: It's the same with paper

      @Drummer Boy:

      > "Financial records get shredded after 10 years, or so, so why not all other records?"

      Because you keep them for different purposes. Financial records are a question of tax law. Now consider product liability: It may be necessary to recall your product because you find a non-random defect very much later. Then you need to know what was affected and what to recall. Depending on your legislation, claims may go back 30 years for example.

      Or patent infringement: a patent usually lasts up to 20 years, so you may need to keep your patent documentation (application etc.) longer than that in order to claim. On the defensive side, claims may last 10 years, so you may want to know what you did in those 10 years and also prior to the priority date, since you may have a right to pursue your own prior knowledge.

      To me, the question of how and what to archive should be answered along the lines of the purpose you retained the information for in the first place. And that may be a hard question to answer that few companies are willing to consider.

  6. Ian 62
    Joke

    No-ones said it yet?

    The CLOUD!

    With our "Cloudiest Cloud"™ all your stuff is safe forever!

    With a guarantee to keep all your data available forever with 100% uptime. (NB. Not a guarantee)

    As a limited time extra, we promise the FBI or the MPAA won't switch it off. (Until they really want us to).

    1. Ian 62

      Re: No-ones said it yet?

      Damn..someone did.

  7. bag o' spanners
    Devil

    Place a pizza box and yesterday's Metro on top of anything that looks like an archive. Hey presto! Problem solved.

    The digital pizza box has been with us for years. Love those proprietary legacy enterprise applications, which not only stored junk in a proprietary format, but did it using technology that was obsolete when it was rolled out. The client, the budget.

  8. Crisp

    Virtually no data is needed for ever

    But I don't want to throw it away! You never know when something might come in useful!

    Guaranteed as soon as I throw away that credit card receipt from 1985 I'll need it for something!

  9. Crisp

    Another problem with archiving.

    Those tapes have a limited shelf life.

    What I'd like to see is the digital equivalent of a stone tablet. A storage medium that can be viable for millennia.

    1. John Brown (no body) Silver badge
      Joke

      Re: Another problem with archiving.

      zip or rar the files to punched tape?

  10. Neil Barnes Silver badge
    Holmes

    Paging Hal Draper!

    http://folk.uio.no/knuthe/msfndinalbry.html <-- he knows whereof he speaketh. Notched quanta FTW.

  11. Velv
    Boffin

    Lifecycle

    You need to understand the CONCEPTS behind the data life cycle, and it will only ever work if you classify your data correctly.

    An archive is an active part of the life cycle of data. An archive is the PRIMARY copy of an item of data. You move it to cheaper storage as you need to maintain the data for a defined period of time. Ergo, since it remains the primary copy, you will always need to keep the data on your active technology. You change tape drive types, you move the data to the new tape types. It is therefore vital that you fully understand the life cycle of your data, and delete it at the earliest opportunity once it has served its (regulatory, legal, business, analytical) purpose.

    A backup is a secondary copy to allow you to recover in the event of loss of the primary copy - a completely different scenario. Backups are NEVER an archive. (Tape is just a medium, a tool, like disk, paper, CD, stone, etc.)

    Too many people and businesses confuse the two when they are fundamentally different use cases.

  12. jabuzz
    FAIL

    LTO1 tape drives are still available on the used/surplus market. If you are using a decent software package to handle your archives, are sensibly keeping all your tapes in the library, and have multiple libraries in separate locations, then the backup/archiving software will take care of migration from one tape technology to the other, with the tape robot in the library shuffling all the tapes between the drives and the slots with no human interaction once you set it away. It might take some time, as in a couple of months, to complete, but if no human interaction is required in the process that is not a problem.

  13. Pen-y-gors

    Won't somebody think of the history?

    Can I put in a plea for keeping at least some of the 'no longer needed' data?

    Historians learn a lot from the odd bits of paper that happen (by luck/coincidence) to have been preserved beyond their useful life. I did my MA thesis on an archive of company documents from the 1850s that just happened to have survived - it included all the invoices and receipts (even cheque stubs), daily work sheets and payments, letters, every detail of the life of a small, failed Welsh mining company - each one individually pretty trivial and unnecessary, but together the amount of information that could be extracted from them was incredible - comparison of wage rates (men/women/children), security of employment, even the weather and how goods were transported from A to B (before the railways - it involved coach, canal, horse-and-cart and a lot of faffing!)

    And I'm very worried that historians of the 23rd century won't have access to similar source material for any dates after about 1990 - everything is electronic which, even if preserved, will probably be unreadable, but which (most likely) will have been erased for data protection or other reasons. Just because there is no legal requirement to keep financial records for more than 10 years doesn't mean you must destroy them. Please, print them all out on acid-free paper, lock them in an airtight, nitrogen-filled steel box and present them to your local county record office!

    (That was an appeal from the Society of European Historians, first broadcast on Radio 4 in April 2237)

    1. Locky

      Re: Won't somebody think of the history?

      Historians in the 23rd C will know our time as the period when people took photos of everything with very poor cameras. Which is odd, as 10 years before all the analogue pictures were pin sharp.

      1. /dev/null
        Facepalm

        Re: Won't somebody think of the history?

        ...and then lost all those photos because they only existed in the digital domain, and nobody really does backing up (never mind archiving) properly...

      2. Neil Barnes Silver badge

        Re: Won't somebody think of the history?

        I'm doing my best... http://stereo.nailed-barnacle.co.uk/#0.69

      3. Dave 32
        Coat

        Re: Won't somebody think of the history?

        They'll also believe that we worshiped cats, based on the number of "cute" cat photos that have been posted to Facebook, Pinterest, etc.

        Dave

        P. S. I'll get my coat; it's the one with the kitty litter in the pocket.

      4. JEDIDIAH
        Linux

        Re: Won't somebody think of the history?

        > all the analogue pictures were pin sharp

        Spoken like someone that's never taken or looked at an analog photograph.

        Chemical film is no protection from incompetent photography. What you are attempting to describe sounds more like the pre-snapshot era that existed before Kodak made (analog) photography accessible to every incompetent and tasteless amateur.

        Snapshot photography really hasn't changed much. If anything, these days you can cheaply take 100 pictures and discard 99 of them. Even with "cheap" film, any analog print was far more dear.

        1. Neil Barnes Silver badge

          Re: Won't somebody think of the history?

          There's plenty of evidence that the quality of the image is proportional to the cost of the film: two good shots off a roll of film, whether it has 36, 24, 15, 12, or 8 pictures on it (in order of increasing size). It's when you get to 4x5 or larger that you start to take real care with composition and exposure... but then, I'm a silver lover, and I *hate* what Kodak did to the art.

          I wonder if it's a 'law' of nature: the cheaper it is to archive digits, the more crap digits will be archived...

  14. Duncan Macdonald

    Old data can be VERY important

    Seismology records from early oil exploration when reprocessed with modern computers can lead to the discovery of additional oil reserves or hidden earthquake fault lines.

    With current LTO 6 tape prices of about $93/tape and a 2.5TB capacity, the cost per TB per year would amount to around $50. (Assumptions - tape replacement cycle 5 years, cost of automated tape library less than $125/tape, tape library replaced with the tapes (no trade in value), cost of automated stager to newer tape less than $15 per tape copied to new media.)

    For a large database of 100PB, this amounts to $500,000 per year which is probably lost in the noise of any organisation large enough to have a 100PB archive.
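
    The arithmetic can be checked with a short sketch. It takes the upper bounds quoted above at face value and assumes decimal units (1 PB = 1000 TB); the constant and function names are made up for illustration. Under those assumptions it comes out at about $46.60 per tape per year, about $18.64 per TB per year, and roughly $1.86 million per year for 100 PB, so the $50 figure reads more naturally as a per-tape cost and the 100 PB total lands in the low millions rather than at $500,000.

```python
# Upper-bound figures quoted in the post; decimal units (1 PB = 1000 TB).
TAPE_COST = 93.0         # USD per LTO-6 cartridge
LIBRARY_COST = 125.0     # USD of tape library cost attributed to each tape
STAGING_COST = 15.0      # USD to copy one tape to newer media
TAPE_CAPACITY_TB = 2.5   # native capacity per cartridge
CYCLE_YEARS = 5          # tapes and library replaced together


def cost_per_tape_year():
    """Total cost of one tape slot spread over the replacement cycle."""
    return (TAPE_COST + LIBRARY_COST + STAGING_COST) / CYCLE_YEARS


def cost_per_tb_year():
    """Yearly cost of keeping one terabyte on tape."""
    return cost_per_tape_year() / TAPE_CAPACITY_TB


def annual_cost(archive_tb):
    """Yearly cost of an archive of the given size in terabytes."""
    return archive_tb * cost_per_tb_year()
```

    Calling `annual_cost(100_000)` gives the 100 PB figure; whether that still counts as lost in the noise depends on the organisation.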

    The cost of separating the wanted old data from the dross is probably so much higher than the cost of keeping it all that the best approach will often be to keep all the data.

    On a much smaller scale - where I used to work, the fact that I was a confirmed cynic and digital pack rat saved the company a lot of money when a private backup copy that I had kept of an old project turned out to be the only one left when a modification was needed. Getting rid of old data can be an expensive mistake.

    1. Anonymous Coward

      Re: Old data can be VERY important

      Hmmmmm that teensy little key hole back into history....

      Ahhhhhhh......

      So good, so useful.

  15. Alan Brown Silver badge

    Old tapes

    Can't just be left sitting around. They have to be periodically exercised and checked to see if the data is still intact. (You'd be surprised how many old "archives" are completely unreadable)

    When you're doing that, it's just as easy to migrate to new media.

    As for holding on to old drives: If your recovery policy relies on equipment which can only be sourced or repaired via ebay, then what you have is a liability, not a backup.

    Old drives (disk and tape) left on shelves have a remarkable ability to simply stop working the next time they're switched on. Apart from the obvious stuff like bearings breaking down, simple corrosion and ion migration inside components or on PCBs can easily result in a device being unusable after several years on the shelf without major servicing. If you don't test the devices regularly then you're living in la-la land.

    Again, whilst you're doing that, it's a good opportunity to move the data to new media.

    NB: Bare tar files aren't good enough - no error correction, no redundancy, no checksums, etc. If you must use 'em, make 'em small and make sure you include a list of checksums. I'm pretty pissed off with trying to read tape-spanning tarballs which have a glitch at tape 2 of 20, rendering the whole bloody thing useless.
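
    As a minimal sketch of the "include a list of checksums" advice, the following writes a SHA-256 manifest next to the files and re-checks it later. The manifest name `MANIFEST.sha256` and the function names are assumptions for illustration; the line layout (digest, two spaces, relative path) is chosen to resemble what `sha256sum` prints. Note this only detects damage; it adds no error correction or redundancy.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_name="MANIFEST.sha256"):
    """Write one 'digest  relative/path' line per file under root."""
    root = Path(root)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.name == manifest_name:
            continue
        lines.append(f"{sha256_of(path)}  {path.relative_to(root).as_posix()}")
    manifest = root / manifest_name
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(root, manifest_name="MANIFEST.sha256"):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for line in (root / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel = line.partition("  ")
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel)
    return bad
```

    Run `verify_manifest` whenever the media is exercised; a non-empty result is the cue to restore from the other copy while one still exists.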

  16. Johan Bastiaansen
    Devil

    Interesting article

    But the archiving problem doesn't have to be so dramatic with the biggest companies in the world sitting on data that spans over a century.

    I'm in sales, selling software to engineering and construction companies. They are required to give a 10 year guarantee on their regular projects, so they would like to keep the information available. This worked very well when letters were typed on paper, when drawings were made on blueprints. Now we send emails and use software to make a 3D-model of the project.

    For the Chunnel, or a nuclear power plant, the information would have to be available for the entire lifespan of the project. We're talking 40-50 years here. And now, with BIM promising facility management for hospitals, office buildings and such, we have the same problem for quite ordinary projects.

    This is a big challenge. Imagine that you would want to use information that was first issued 30 years ago.

    Imagine someone designed such a project using AutoCAD 1.3 and handed over the drawings on 5.25" floppies. That would have been a lucky choice; he could have used software that is now completely obsolete. Suppose that every 3 years the drawings were converted to the latest version and media. You would convert to V2.5, rel 10, rel 12, 13, 14, 2002, 2005, 2009 & 2013. The information would be on the hard drive and on tape backup, but to keep it up to date it would be moved back and forth on 5.25" then 3.5" floppy drives, later on some exotic drive, ultimately USB sticks.

    Can you guarantee that the dotted line that is supposed to be there, and not over here, is still there and is still the same dotted line, on the same layer, in the same color? There's no chance of going back to the original information. You don't have a PC that runs AutoCAD 1.3 and you can't read the 5.25" floppies anymore.

    Why would the situation be different for a 3D model we make now and expect people to use 30 years down the road?

    1. JEDIDIAH
      Linux

      Re: Interesting article

      One thing that is different now is that virtual machine technology is commonplace. You can package your entire decoding environment into a self-contained image. It's been done for DOS games from the age of AutoCAD 1.3, so it can probably be done for any other format you care to mention.

      Output to some common standardized format like PS and you've solved a big bulk of the problem right there.

      Having documents in a print-ready format that's not vulnerable to client sabotage is a pretty good first step actually. Those formats are a bit more stable and much more standardized.

  17. chris lively
    Mushroom

    Simply put, in 500 years' time we will be entirely forgotten. Digital records simply won't survive that long. Historians in that time period are going to know more about what happened in 500 BCE than about 2010 CE.

    It's happened before and it will happen again.

  18. Christian Berger

    That's why banks use Microfiche

    They put all their historical data as text on microform. Done correctly, you can simply OCR it and get back your data. If everything else fails you can even read it with simple optical devices. The beauty of this is that it can easily survive decades or centuries of neglect. You can put it into a box, put the box into a cave, and if someone finds it in 100 years it's likely to still contain most of the data. Plus it's much more compact than paper.

    If neglect is no problem (i.e. in a company archive which is going to be irrelevant once the company is gone) the solution is to continuously copy all your data. Whatever the medium is, you need to be prepared to copy all of it within a reasonable amount of time. Data stored on a format you cannot efficiently read and copy is already lost when the medium requires special technology.

    And of course, don't use office documents for long term storage. There is no office software meant to actually store data so it can be used on another computer (that's why fonts are hardly ever embedded). If you want to store such documents, store the text as ASCII or UTF-8 text and the images as archival-grade PDF or TIFF (or uncompressed bitmaps, etc.).

    If you cannot write software from scratch to read your data, your data is lost.

  19. Primus Secundus Tertius

    Many File Types

    Companies are busy enough as it is without trying to reformat documents to a standard for long term archiving. So files exist in every format from MS Word 1 to MS Word 2013. Then there are the Excel, PowerPoint, and Access files, plus whole databases in Oracle etc., at various version levels. Also there are the other formats mentioned above: AutoCAD, scientific/engineering "unformatted" Fortran files, raw data logging formats... (The format of an unformatted Fortran file is more formidable than the format of a formatted Fortran file.)

    The Word documents can be turned to microfiche, but not the others. There are gibberish ASCII-like formats for spreadsheets (SYLK) which in my experience are not compatible between MS Office and Open Office; but the pressure of everyday work means they are saved as xls or xlsx; or as sxc, odc. Where XML traffic exists, for contractual enforcement of data hygiene between subcontractors, it is worth saving.

    It might seem like a good idea to create a virtual machine every few years with well known office and CAD packages. But then the format of the virtual machines changes ...

  20. h3

    The original UNIX answer still stands: use text, use text, no really, use text; do something else only if you absolutely have to. (And understand that you almost certainly don't, or else you are doing it wrong, and take responsibility for it.)

  21. xpusostomos

    Yeah but...

    Surely the frantic evolution in technology will slow down and stop sooner rather than later. As long as it is evolving, those upgrades bring benefits of much more data per tape, which presumably is worth the upgrade. But this can't go on forever. Will there really be 20 more generations of tape? Will there even be tape in 20 years? Who knows.

  22. Jessie James

    Random thoughts of a digital archivist

    It may not be necessary to constantly migrate document formats to keep material readable - a better approach might be to use emulation which was first suggested by Jeff Rothenberg in the 20th century and allows old formats to be read on modern machines.

    A decent tape library will do all the tape housekeeping you would want - checksums, identification of dodgy tapes and migration of formats. I don't think that it is necessary to move to each new generation of LTO tape; we tend to go for every second generation: 2, 4, 6. What is worrying is that at some date in the future, physics will catch up with LTO. The progress of the tape has been remarkable - generation 1 could hold 100GB and LTO 6 can hold 2.5TB; the manufacturers intend that LTO 8 will hold 12.8TB (assuming a 2.5:1 compression ratio), but one day, presumably, they won't be able to cram any more bits onto their polyester.
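
    To put a rough number on that progress, the sketch below computes the average growth per generation from the two capacities quoted above (100 GB for generation 1, 2.5 TB for LTO 6). The function name is made up for illustration and the calculation assumes both figures are native capacities. It works out at roughly 1.9x per generation, or about 3.6x per migration if you only move every second generation.

```python
def per_generation_growth(first_tb, last_tb, steps):
    """Geometric mean growth factor across `steps` generation changes."""
    return (last_tb / first_tb) ** (1.0 / steps)


# LTO-1 to LTO-6 is five generation changes: 0.1 TB -> 2.5 TB.
growth = per_generation_growth(0.1, 2.5, steps=5)

# Skipping every other generation compounds two steps into one migration.
growth_every_second = growth ** 2
```

    Once the per-generation factor falls much below that, the economic case for migrating purely to gain capacity weakens, which is the point about physics catching up.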

    One other random thought - LTO claims an archival life of 15 to 30 years - er, how do they test this; can they get a large enough sample size to determine how long the life is? Also it doesn't matter what the archival life of the tape is if we don't have any machines to read it.

  23. Christian Berger

    Don't forget one thing: Indexing and organisation

    I used to work in an archive, one day we had a call for early CT images from 1980. We looked into our regular archives and there was nothing. Months later while cataloging some basement we found a box with old CT images from that time period.

    An archive which is not properly indexed or cataloged is useless.

  24. Tyou

    Missing the point

    Again, another article that misses the point. Large-scale archives have four functions for data: ingest, presentation, validation and migration. Tape or disk, it doesn't matter. How many data migrations have been done to get from 40 MB disks to the ubiquitous 3 TB disks I'm now using at The Library? Data migration is part of the business; you plan for it.
