British Library tracks rise and fall of file formats

File formats and the software capable of reading them are living longer than previously thought, according to a British Library and UK Web Archive study. Formats over Time: Exploring UK Web History (PDF, slides as PDF) considers 2.5 billion files author Andrew N Jackson retrieved with the help of the Internet Archive and the …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward

    Another reason to thank open sourcery

    Is the heroic effort sometimes invested in compatibility of various sorts, from MAME for venerable video game systems to alternative viewers for legacy formats - I almost cried with joy when I realised I could play an .rm file with VLC rather than polluting my system with that RealPlayer shite.

    Of course it does depend upon people finding the topic of legacy formats sufficiently interesting to invest that heroism but this seems to happen pleasantly often - perhaps because it's basically just puzzle-solving.

    1. Anonymous Coward

      Re: Another reason to thank open sourcery

      > perhaps because it's basically just puzzle-solving.

      I think that a lot of the time it is nose-thumbing at the proprietary companies that create them, not that that's a bad thing mind you :D

    2. Mark 65

      Re: Another reason to thank open sourcery

      Yep, open source and open formats/standards work wonders for longevity of formats. Open formats/standards are by far the best thing and should be, err, standard. After all your product should be more about the value-add than the storage format.

  2. Tom 35

    Not just the file format

    Not long ago I made myself some pocket change by converting some files for a university professor.

    He had 60 discs of documents that he could not access.

    They were Apple II 5 1/4 discs mostly in Apple Writer and some AppleWorks files.

    I happen to have a IIgs with a 5 1/4 and a 3.5 "super disk" that can write MSDOS 1.44 MB discs. All but two of the discs were readable.

    Before he found me through a friend of a friend, he was going to buy an Apple II on eBay, print them all, and have someone retype them.

    Then there was NASA when they found they didn't have a tape drive to read all their old mission data.

    1. ceebee

      Re: Not just the file format

      It is not just the file formats but also the physical media, be it disk or tape or whatever.

      I regularly retrieve material from 1980s Mac formats like MacWrite, MacDraw, MS Word for Mac etc. from 400 KB and 800 KB Mac disks.

      I luckily still have a working 1990 MacPlus which enables me to read the disks and bring the data forward via a USB floppy disk drive to a more recent version of Word or whatever.

      Archivists rightly fear the "Digital Dark Ages" as material written only 10 or 20 years ago is increasingly difficult to recover easily.

      Even MS Word will not (easily) read files created before Word 97 (for security reasons!).

  3. Richard 12 Silver badge

    Ask someone working in any field

    They'll all be able to tell you of formats that are obsolete, and in some cases impossible to open.

    I deal with several completely obsolete formats in my day job - we have several special tools to (mostly) convert them into text-based formats that the modern systems will open with varying degrees of success.

    Assuming you have a working floppy disk drive, serial port and Windows XP emulation that can use them.

    However, while they can extract the really important info, none of them get 100% of the data - usually 95-99% or so.

    Unfortunately there are also some formats with no tools at all, and they are less than 20 years old - with the hardware still in use.

    Because of this, all our current systems have ASCII export built-in from day one - oddly, most of our competitors do not.

    1. anjackson

      Re: Ask someone working in any field

      Can you provide any links or other information about these recent, difficult formats? As part of the research I want to collect genuine examples of problematic formats and document what work-arounds are available, if any. Thanks.

      1. Richard 12 Silver badge

        Re: Ask someone working in any field

        The difficult formats are the proprietary binaries other than word processing. Word processing formats usually have the raw text inside in ASCII or similar well-known representation, so reverse-engineered conversion utils are usually available - even if they only get the text, it's still worthwhile.

        However, once you look at other fields you will find many obsolete and difficult formats.

        For example: Strand SSF - you have to convert to ASCII, using an unsupported and difficult-to-find Win 95 application called Showport. As far as I know the source code for that program is long gone.

        If Strand hadn't written that before going bankrupt then a lot of people would have lost a lot of data.
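
        The point above about word processing files usually carrying their raw text in ASCII can be sketched in a few lines of Python. This is only a rough illustration, not a tool anyone in this thread uses: the function name, the sample bytes and the four-character minimum run length are all assumptions made for the example, and it recovers text only, with no formatting.

```python
import re
from typing import List


def extract_ascii_runs(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII at least min_len bytes long.

    Hypothetical helper: similar in spirit to scanning a binary file
    for readable strings. It knows nothing about any real file format.
    """
    # Printable ASCII is 0x20 (space) through 0x7e (tilde).
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Made-up bytes standing in for an old binary document.
sample = b"\x00\x01WPC\x02\x00Dear Professor,\x00\xff\x10notes for lecture 3\x00\x07"

for run in extract_ascii_runs(sample):
    print(run)
```

        Short runs such as the three-byte "WPC" marker in the sample are dropped by the minimum length, which is a crude way to filter out binary noise that happens to look like text.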

  4. Ole Juul

    vintage buffs

    I'm glad to see that there are posters here who also haven't thrown out all the old kit. As someone involved in vintage computing and part of a very large worldwide community with a similar interest, I find it clear that most old formats are still very accessible. Of particular relevance in that regard, I note that in 2012 one can still buy a brand new punched tape reader.

    1. Anonymous Coward

      Re: vintage buffs

      "Of particular relevance in that regard, I note that in 2012 one can still buy a brand new punched tape reader."

      Even in 1980 it proved impossible to source generic papertape punches and readers with the high speed capability of the 1960s ones. All that the market provided were relatively slow devices for intermittent, very small volume, operation. Data input direct to magnetic tape or disk had removed the major market need for such highly-tuned, and temperamental, mechanical paper handling.

  5. Anonymous Coward

    bits or holes

    My prediction is that the .txt (i.e. plain ascii) file format will be around way longer than any other text file format, before or after.

    It's not the file format that's the issue, it's the storage medium. As technologically advanced as we like to think we are, we don't seem to have an electronic storage medium that can (probably) outlive good old paper. Paper was invented ~2000 years ago and we still have legible artifacts from that time. I doubt any of your good ole 5.25 inch floppy disks will be able to boast the same.

    Therefore, convert your pdfs and wot not into ascii txt files and store them on punched tape or cards.

    1. Anonymous Coward

      Re: bits or holes

      "[...] we don't seem to have an electronic storage medium that can (probably) outlive good old paper."

      Several years ago the British Library pointed out that most of the 19th and 20th century paper is wood-pulp based. Depending on its residual acidic content, it turns brown and then starts to disintegrate. Some of their archives were deteriorating faster than they could be digitised. I have a 1940s Penguin paperback which after only 30 years was already in that state.

      On the other hand their 18th century archives' papers were made from rags and still going strong.

      In my experience the most transient formats were on network protocol analysers or logging devices. The number of undocumented proprietary formats, both data and media, was large. Even the manufacturers' backward-compatibility programs often made imperfect renderings of things like time and status details.

    2. P. Lee

      Re: bits or holes

      +1

      However, all those logos and Visio diagrams make hieroglyphics relatively easy to decipher.

      Bring back text!

      On the flip side, we store incredible amounts of complete rubbish.

      1. philbo

        Re: bits or holes

        >On the flip side, we store incredible amounts of complete rubbish.

        Never a truer word spoken..

    3. Anonymous Coward

      Re: bits or holes

      "Therefore, convert your pdfs and wot not into ascii txt files and store them on punched tape or cards."

      The physical volume of such media is very large and heavy for significant amounts of data. Cards have the additional headache of needing embedded numbers to allow them to be resequenced after an accident.

      Handling significant volumes of data needs high speed readers. Even with contactless hole sensing the media has to be moved mechanically at high speed. That media was fairly fragile when it was brand new. Storage conditions for humidity and temperature could be quite critical. Resplicing snapped paper tapes or trying to clone crumpled, torn cards manually was a daily task at which operators became very skilled.
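
      The embedded numbers mentioned above for resequencing a dropped deck can be illustrated with a small Python sketch. It assumes the common 80-column convention of keeping a sequence number in columns 73-80; the helper names and the sample deck are hypothetical and only there to show the idea.

```python
from typing import List


def make_card(text: str, seq: int) -> str:
    """Build an 80-column card image: 72 columns of text, 8 of sequence number."""
    return f"{text[:72]:<72}{seq:08d}"


def resequence(cards: List[str]) -> List[str]:
    """Restore deck order by sorting on the number in columns 73-80."""
    return sorted(cards, key=lambda card: int(card[72:80]))


lines = ["FIRST CARD", "SECOND CARD", "THIRD CARD"]
# Numbering in steps of 10 leaves room to insert cards later.
deck = [make_card(line, 10 * (i + 1)) for i, line in enumerate(lines)]

# Simulate a dropped deck picked up in the wrong order.
dropped = [deck[2], deck[0], deck[1]]
restored = resequence(dropped)
print(restored == deck)
```

      The cost is visible in the layout: a tenth of every card is spent on bookkeeping rather than data.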

      1. keith_w

        Re: bits or holes

        I remember the Mother-F*ing Card Mangler!

  6. Neil Barnes Silver badge

    ms fnd in a lbry

    http://home.comcast.net/~bcleere/texts/draper.html

    It doesn't matter about the format if you can't locate the information in the first place...

    The point about Word not easily reading earlier formats is well made, but the corollary is rarely noted: what do the new formats offer other than a one-way forced upgrade? There's a difference between a document's contents and its metadata which is most obvious in text documents, but the presence of some of that metadata (fonts, size, colour, position on the page etc) can be critical to the understanding of the document. Text only representations do not always provide unambiguous reading of the material.

    A point has been made many times: if you rely on the software supplier to provide the means to read material created by you, you don't own it; you rely on the goodwill of the software author to continue supporting it. There's an awful lot to be said for publicly available file formats...

  7. Mike 137 Silver badge

    file formats?

    HTML is not a file format - it's a page description language.

    1. Anonymous Coward

      Re: file formats?

      > HTML is not a file format - it's a page description language.

      Maybe not in theory, but it is in practice. A lot of file-based documentation these days is in HTML format.

  8. Jim 59

    Data size

    When you manage to read an old format, it is a shock how small the data is. An entire mid 90s PC can fit into half a gig. Run an emulator and the window is tiny, because of screen resolution increases. Hundreds of floppy disk images add up to so little you hardly notice the space usage.

    I made a DAT tape in 1994 and didn't re-read it until 2007. Because we used text files so much then, even for documents, the data is incredibly small.

    1. Chemist

      Re: Data size

      I remember getting my first 470MB drive. It was like opening the fridge and finding Wembley stadium there.

      Mind now I come back from one holiday with 5GB of video.....

  9. Anonymous Coward

    Apple Quicktake

    Now there's an obsolete format. From Apple's abortive foray into the very early days of digital photography. I have a stack of them from a trip to Australia many years ago. The filetype is .pict but [typically Apple] a proprietary version of .pict that nothing else can read and [also typically Apple] support for the format has been dropped from QuickTime since the pre-OSX days.

    Not much hope for file formats surviving into the future when even their own progenitors abandon them, after a few years!

    1. Oddb0d

      Re: Apple Quicktake

      There is apparently enough info on the web to deal with that format, try here

      http://blog.richardsprague.com/2005/01/quicktake-camera-photo-conversion.html

      or try Image Converter by Bitten Apps, the current changelog specifically states "Fixed a bug where the application would refuse to open QuickTake PICT files."

      http://itunes.apple.com/us/app/image-converter/id437491380?mt=12

  10. Scott Wheeler

    Definition of HTML version?

    Successive HTML versions are almost entirely supersets of the earlier versions, and if you are writing a simple page, you may have no reason to use the more advanced facilities of the later versions. So are the server test pages I put up last week obsolete HTML 2.0 because I use nothing more advanced than posting to a CGI script? Surely not: this has nothing to tell us about information loss and is unrelated to any discussion of old word processor file formats.

