Internet pioneer Vint Cerf predicts the future, fears Word-DOCALYPSE

Big data may turn out to be a big mystery to future generations, godfather of the internet Vint Cerf has warned. The pioneering computer scientist, who helped design the TCP/IP protocol (along with Robert Kahn) before going on to work as chief internet evangelist for Google, has claimed that spreadsheets, documents and various …

COMMENTS

This topic is closed for new posts.


  1. Tanuki
    Thumb Up

    Nothing new...

    Truth is, the number of competing formats has if anything decreased over the years - which should make things easier as time goes on.

    [I was once given the task of recovering data from a set of magtapes, with no information at all as to what system they came from, how they had been written, or how they had 'happened' to come into the particular organisation's possession. Even the brand-name of tape manufacturer had been expunged from the spools. After doing a physical block-dump to disk and a week or so of frobbing-about I determined the format to be an Eastern-European-ICL-1900-series-clone native binary. There was much ensuing happiness. Tapes apparently from the same source would arrive intermittently and unpredictably. If I told you anything more I'd probably still have to kill you].

    1. Anonymous Coward
      Coat

      Re: Nothing new...

      "or how they had 'happened' to come into the particular organisation's possession."

      I'm guessing that trench coats, silver briefcases, and rainy afternoons in public squares were involved...

  2. Andy The Hat Silver badge

    As has already been said, backwards compatibility and data death have been perceived problems for years. Perhaps the Domesday Project was the first 'popular news' item, with the death of the hardware (laser disc) making the media unreadable.

    Not having the software to decode the data is a second problem.

    "Use ODF", "Use XML"? You're missing the entire point folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean it will exist in 50 ... or even 5 years time. Pick some of the 'totally compatible' formats from history and try to read them - for instance WordStar format on a 360 KB disk, used all over the place ... until Word .doc replaced it. I might be able to at the moment, but that's only a 30-year-old data set ...

    So if you can a) find a drive that will take the disk, b) find an OS that will read the disk format and c) find software that will decode the file, you're fine ...

    Perhaps there should be a relatively small, open library of data formats established somewhere - disk structures, storage formats, software encoding techniques et al? These can then be applied to data as required rather than having to preserve the tools themselves...

    I'm now off to read a book - which I just have to open to decode it ...

    1. Steve Knox
      Meh

      "Use ODF", "Use XML"? You're missing the entire point folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean it will exist in 50 ... or even 5 years time.

      No, you missed the point. The open formats have public documents describing the data structures, and XML is designed to be semantically self-consistent. In order for XML and ODF to be completely unreadable in 50 years time, we'd have to destroy everything that uses our current binary model of computing and burn down a few hundred warehouses full of books and paper documents as well.

      Your only significant point is the 360kb disk your Wordstar doc is stored on -- but in a competent modern IT department, that document would have been transferred to current media when the old media were retired.

      Perhaps there should be a relatively small, open library of data formats established somewhere - disk structures, storage formats, software encoding techniques et al?

      IEEE, ECMA, ANSI, W3C, et al. They don't exactly meet your "small" requirement, though.

    2. Michael Wojcik Silver badge

      "Use ODF", "Use XML"? You're missing the entire point folks. However open the format is, just because the data structures exist with software to decode them today doesn't mean it will exist in 50 ... or even 5 years time.

      This is a wild misapprehension of the problem. Try reading some of the actual research in file-format preservation and recovery. There are orders of magnitude differences in the work factor between documented and undocumented formats; between formats that have multiple, open implementations and those that have a single, closed one; between formats that use a handful of straightforward, widely-applied technologies (e.g. XML and zip, in the case of ODF) and those that use one-off encodings; between formats that have high redundancy and human-readable elements and those that are binary gobbledygook. Claiming that all formats are equivalently vulnerable to this problem is like claiming that since all people die, the length of life and manner of death are irrelevant. It's a fallacy of composition of the first water, a sophomoric generalization to the point of absurdity.
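
      The "XML and zip" point can be made concrete with nothing but a standard library. The following is a minimal sketch, not a full ODF reader: `odf_paragraphs` and `make_sample_odt` are illustrative names, and the sample builder produces a stripped-down stand-in for a real .odt file (no manifest, styles or metadata).

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for text elements in an ODF content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odf_paragraphs(data: bytes) -> list[str]:
    """Return the plain text of each <text:p> paragraph in an ODF file."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    return ["".join(p.itertext()) for p in root.iter(f"{{{TEXT_NS}}}p")]


def make_sample_odt(paragraphs: list[str]) -> bytes:
    """Build a stripped-down stand-in for an .odt file (hypothetical helper).

    The paragraphs are assumed to contain no XML special characters.
    """
    body = "".join(f"<text:p>{p}</text:p>" for p in paragraphs)
    xml = (
        '<office:document-content '
        'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
        f'xmlns:text="{TEXT_NS}">'
        f"<office:body><office:text>{body}</office:text></office:body>"
        "</office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", xml)
    return buffer.getvalue()
```

      Even without this code, anyone who unzips the container finds readable XML with the words of the document in it, which is the difference in work factor being described.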

  3. Anonymous Coward

    Total non-issue

    Scientists don't collect data shrouded in proprietary formats like PowerPoint files. If you can't access the raw figures, then you're not doing science!

    1. Geologist

      Re: Total non-issue

      Would that this were so. Most big lab instrumentation (e.g. SEMs, TEMs, XRD etc.) uses proprietary software, and proprietary formats for the analysis, storage and presentation of data. This is annoying at least, and sometimes a big problem (you may not know exactly how data have been processed). The distinction between "raw" and "processed" data from a complex piece of equipment is almost always fuzzy.

      Obviously, most data are then exportable (if you're lucky), or can be hand-retyped (yes - still sometimes necessary - aargh!) into csv, spreadsheet, or other standard or open(ish) formats, but these normally don't have the functionality of the proprietary formats and tools.

      I'm afraid PowerPoint is often used as a cheap n cheerful way of assembling screen dumps and data copied from proprietary software on instruments. Not good, but sometimes the best option...

  4. Seanmon
    Mushroom

    Bring it on.

    If this means getting rid of some of the millions of bytes of "manager"-generated excel shite that fills my inbox daily, the sooner the better, says I.

  5. Anonymous Coward

    rasters

    PNG and TIFF (caveat: use a standardized codec) are decent for long-term storage. Most of the public records I've seen are black-and-white TIFFs with CCITT G4 compression (like faxes). These are things like property deeds and engineering drawings, stamped and signed by a bunch of people then scanned. Everything else gets thrown away after a few years, especially the Word docs and CAD drawings. Pretty decent system.
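
    Checking which codec a scanned TIFF actually uses takes very little code, which is part of the format's appeal for archives. A minimal sketch, assuming a classic (non-BigTIFF) file held in memory; `tiff_compression` is an illustrative name and the table of codec names is deliberately partial.

```python
import struct

COMPRESSION_TAG = 259  # TIFF tag number of the Compression field
COMPRESSION_NAMES = {1: "none", 3: "CCITT Group 3", 4: "CCITT Group 4", 5: "LZW"}


def tiff_compression(data: bytes) -> str:
    """Return the compression scheme named in a TIFF's first image directory."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF: bad byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF: bad magic number")
    (entry_count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        # Each directory entry is 12 bytes: tag, type, count, value/offset.
        start = ifd_offset + 2 + 12 * i
        tag, _field_type, _count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # Compression is a SHORT, stored in the first 2 bytes of the value field.
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return COMPRESSION_NAMES.get(value, f"unknown ({value})")
    return "none"  # an absent Compression tag means uncompressed
```

    Anything that comes back as "unknown" is a hint that the file may not be using one of the standardized codecs the caveat refers to.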

    1. Anonymous Coward

      Re: rasters

      I keep everything in PCX.

  6. Smallbrainfield
    Paris Hilton

    Ha ha, this reminds me. I have

    a drawing database that started life in Paradox, moved on to MS Access and currently resides in MS Excel. Twenty two years that database has been alive in one form or another. It's older than my kids.

    Paris, cos her most famous work has probably been converted into dozens of different formats, ensuring her place in cinematic history.

  7. Ole Juul

    Text

    Yes, ASCII is a "format", but it's a simple one which is likely to survive a long time. I'm still trying to understand what all this other stuff is really good for. Why is everybody so boneheaded about needing all that word processor stuff? I think the answer is that there is much less interest in what a document says than what it looks like.

    1. Anonymous Coward

      Re: Text

      Biggest problem with text is that it doesn't do graphics. So if you want to see charts, diagrams, or relevant photos it's a constant "refer to ven1a7.gif for breakdown showing current trends" or "as detailed in pic049.png" littering the text. Really breaks the flow of reading.

      Bummer if the referenced files aren't included or have gone missing.

  8. Nigel 11

    Encryption is the real danger

    If anyone wanted to read a non-encrypted document in some old format or other, I doubt that they'd need to be qualified to work at GCHQ in order to "break" the (non-)code. It's just reverse-engineering something that actually isn't designed to conceal.

    Strong encryption, on the other hand, means that once the decryption key is gone, so is the document.

    1. Michael Wojcik Silver badge

      Re: Encryption is the real danger

      It's just reverse-engineering something that actually isn't designed to conceal.

      Yes, good luck with that, when you have a single file in some unknown binary format and no record of the software used to process it.

  9. Lee D Silver badge

    Relevance

    If there's a document worth preserving, I'm pretty sure it'll get preserved in a way that future generations can read it.

    The problem is that 99.999% of the stuff we generate today is worthless. Hell, even the design notes of big-hitting games like Prince of Persia on the Atari (I believe) are pretty useless and boring, even to those interested. Same for the Mac paint app that had its original source released recently. There probably exists more useful information in a reverse-engineering of the program itself than anything contained in the design documents that could ever be found and preserved. And this is from an industry where the whole product is digital, not just a document or two about it.

    The fact is that I have email in front of me now going back to 1999. I have archives for a few years before that. I have code I wrote when I was a teenager. I have huge essays and articles and documents I've written over all that time. The number of times I have to refer to anything older than a year? So small as to be worthless, and usually just because of poor organisation or convenience rather than being a vital requirement. I can't imagine that most of what goes through a computer database needs to stay around for that long, and the stuff that does ends up being on paper in archives for a few decades at most. There's just no need for it.

    As Terry Pratchett says: Digital archeologists of the future? Get a real job. DELETE.

    1. Anonymous Coward

      Re: Relevance

      I went looking through some files from 2000 for some marketing materials relating to our business to see if they could help our current efforts. MS Word 97 docs. Opened easily. Weren't worth bothering with - unless to show how not to do it. The market and situation had moved on apace and they were just so terribly dated.

      The only stuff from that time that was useful was printed materials - little dog-eared but readable.

      I should think there's a lot more files in our system from a decade or so ago that aren't much use. Getting rid of them wouldn't save much room but would make it quicker to look through the stuff that was left.

      It must be a human instinct to hoard. On a personal front, I have 9999+ emails in my webmail. I'm certain I don't need past acknowledgements of internet orders, or a conversation with a friend in '04 organising a get together.

      A good clean out is useful every once in a while.

    2. Michael Wojcik Silver badge

      Re: Relevance

      If there's a document worth preserving, I'm pretty sure it'll get preserved in a way that future generations can read it.

      I must have imagined the data-recovery industry, then.

  10. silent_count

    Not for nuthin'

    Amongst the reference material I gathered while learning x86 assembly is a file called "386intel.txt", circa 1986, which is just as readable now as it was then. And, if it for some reason weren't, it would be trivial to write software to make it readable.

    Not that I much care about future generations knowing the inner workings of ancient CPUs, but I guess the moral of the story is: create files in proprietary formats at your own risk, if you care about them being readable in the future, that is.

  11. phuzz Silver badge
    Go

    Either update your documents to newer standards as you upgrade (and given the increase in storage space/decrease in costs you can keep the original as well), or keep a VM with (for example) WinXP and Office 2003, or whatever you need to read it.

  12. weevil

    So Mr Cerf hasn't heard of virtualisation?

    1. Destroy All Monsters Silver badge
      Trollface

      But will your VM format be recognized and be runnable?

  13. reno79

    What about .org files?

  14. gnufrontier

    Distant "fear"

    Relax V. Most species that have ever existed are extinct. The amount of historical data we have prior to the printing press is paltry compared to what was written but we still seem to make up pretty good stories about the past and presume that what we have is more important than what we've lost (which can't be known).

    The premise of the philosophy of progress is that the past is, well, outdated and of little value. We want to know the future, not what has already happened. Knowing the past doesn't prepare us for a future that is always new, always changing. So if you are a progressive type, it's a non-issue.

    This is really a function of human development. As humans get older their attention always turns more to the past than the future primarily because they have little future left. Older people (and I am one) always think that the way it was was better than the way it is or is going to be. Not true. It will suck for people in the future just as it did for us in the past. It'll just suck in different ways.

    Holding on to all this big data just means one ends up with a bigger haystack in which one has to search for the needle.

    As for Doris Goodwin's book on Lincoln, we needed another book on Lincoln like a whole in the head. It's like getting another book on Jesus or Plato. Studying the past (I was a history major) is a great way to escape the present but it doesn't mean beans for the future anymore. When things changed very little for a thousand years, the past may have had some actual use but now, not really.

    1. Anonymous Coward
      Headmaster

      Re: Distant "fear"

      "we needed another book on Lincoln like a whole in the head. "

      Actually, I've only got a half in the head, so I could use a whole one.

      Wait, what?

    2. strum

      Re: Distant "fear"

      >As for Doris Goodwin's book on Lincoln, we needed another book on Lincoln like a whole in the head.

      Actually, Obama used Goodwin's book as a guide to forming his own administration - so Doris's work was a damn sight more valuable than anything appearing on this site (including your message and mine).

      1. gazthejourno (Written by Reg staff)

        Re: Re: Distant "fear"

        Except Reg the Vulture, up in the masthead. He's priceless.

  15. Bronek Kozicki

    it's about science

    He actually gave a good example: "So years from now, when you have a new theory, you won't be able to go back and look at the older data"

    There is a stupendous mountain of scientific data collected every day (e.g. CERN alone collects 15PB per annum) and yet there is little guarantee that this information will be of any use to future generations. Since referring to old experiments to verify a new theory is an established and very useful practice, this is actually important.

  16. Captain DaFt

    Reply to: Citizen 140924376

    Due to a lack of decipherable information (Files in state-sanctioned format Ab19-0-7 are unavailable prior to 2025), we are unable to validate your date of birth, nationality, and any work history prior to 2025.

    Therefore your retirement/disability claim is rejected.

    Failure to produce the relevant files (In state-sanctioned format Ab19-0-7) within 30 days will result in your reclassification as Non-person, and any files on you dated from 2025 to present will be deleted.

    Have a nice day, and remember, The Council Loves And Cares For You.

  17. jake Silver badge
    Pint

    Well, Vint ...

    ... I have the email you & I swapped, discussing bits of TCP/IP (and the code involved) from back in 1975. It's still perfectly legible. The code still compiles, too ;-)

    Insert something about "two cans & a string" here ... IOW, KISS!

    Beer, for the memories ... gawd/ess but I was young & dumb back then!

  18. Anonymous Coward

    It's a harder problem that you might think...

    There are really two problems here: one is the problem of retrieval over a time span of about one human lifetime, say 80 years, and the other is about designing systems that permit information retrieval over really long spans, like 10,000 years.

    We are barely getting to grips with the first one, but we have at least learnt some lessons. For example, binary blobs are really bad, whereas text formats are pretty good. Not perfect (how many people can still read EBCDIC?) but fairly trivially decodable.

    The other problem fairly rapidly descends into a spiral argument even if you assume the existence of a long-lived storage medium. (Aside: clay tablets seem to do remarkably well!!) Let's say you write something in French or English. Who's to say in 10K years anyone will understand that language? And if you write a decoder, what do you write it in? There is a project called Rosetta (see the WP page) that tries to tackle this issue, but it's not easy.

    Finally, and then I'll shut up: the quality *today* of content or items has no bearing of the importance in the *far future* of the same. Think how much we have learnt through archaeologists digging through middens (i.e. sh**-heaps) and turning up bits of junk that tell us so much. Who's to say that a future digital archaeologist won't unearth "Charlie Bit Me" and explain how late 20th century families worked?
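
    On the EBCDIC aside: "fairly trivially decodable" is about right once the code page is known. A minimal sketch, assuming the bytes were written in the US/Canada variant that Python ships as the `cp037` codec; which code page a given archive really used is something to verify, and `ebcdic_to_text` is an illustrative name.

```python
def ebcdic_to_text(raw: bytes, codepage: str = "cp037") -> str:
    """Decode EBCDIC bytes to text.

    "cp037" is one common EBCDIC code page; other variants exist
    (Python also ships "cp500", for example), so treat it as a guess.
    """
    return raw.decode(codepage)


# Five bytes as they might appear in a raw dump of an old tape.
sample = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])
print(ebcdic_to_text(sample))  # prints "HELLO"
```

    The hard part is the metadata, not the decoding: without a note saying "this is EBCDIC, this code page", the bytes look like noise to an ASCII-minded reader.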

    1. jake Silver badge

      Re: It's a harder problem that you might think...

      You don't "read" EBCDIC, per se. Nor ASCII, for that matter. Rather, you read the words that the output device translates for you.

      With that said, I can still read text/code on cards and tape, and sometimes "think" in octal and hex. A partially sighted friend of mine can read punched paper with her fingers, similar to braille. She's one of the best "big iron" debuggers I've ever known ...

      1. jake Silver badge

        Re: It's a harder problem that you might think...

        As a side-note, when my daughter was learning to count (age 4ish), I taught her to count to 15 on four fingers. She added the thumb, and then the other hand, on her own. In high school, she "invented" three extra digits on each extremity, for full 32-bit compatibility ... with her right eye as a carry-bit ;-)

        She's a programmer today ... and Sr. Member of the Technical Staff for a Fortune 250.

        Teach your kids alternates to decimal numbers early and often ...
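
        The finger trick is plain binary: each finger is one bit, so four fingers reach 15, five reach 31 and two hands reach 1023. A small sketch of the arithmetic; the function names and the "least significant finger first" ordering are illustrative choices, not anything from the story.

```python
def fingers_to_number(fingers: list[bool]) -> int:
    """Read raised fingers as binary digits, least significant finger first."""
    return sum(1 << position for position, raised in enumerate(fingers) if raised)


def number_to_fingers(value: int, finger_count: int = 4) -> list[bool]:
    """Return which fingers to raise to show `value` on `finger_count` fingers."""
    if not 0 <= value < (1 << finger_count):
        raise ValueError(f"{value} does not fit on {finger_count} fingers")
    return [bool((value >> position) & 1) for position in range(finger_count)]
```

        For example, `number_to_fingers(5)` raises the first and third fingers, and `fingers_to_number([True] * 10)` gives the two-handed maximum.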

    2. Christian Berger

      Text is essential

      But the format must be as trivial as possible, that's why XML isn't a particularly good solution. It's still somewhat better than binary blobs, but if you have something that is just a table, and you store it in XML it's bad.

      As for different character encodings, that's usually not a problem in long term storage. Just dump it out to microfiche as text and OCR it with the next system. That's what banks are currently doing.

    3. strum

      Re: It's a harder problem that you might think...

      >clay tablets seem to do remarkably well

      Indeed. I've just been reading a piece on the deciphering of Linear B, from clay tablets. Mind you, these tablets had been baked in a major fire. Unfired tablets would have crumbled to dust, long ago.

      But the point is, these were little better than laundry lists - eminently disposable - but the ability to read them now, gave us insight into life in Knossos, that we wouldn't otherwise have had.

    4. DanceMan
      Thumb Up

      Re: It's a harder problem that you might think...

      "the quality *today* of content or items has no bearing of the importance in the *far future* of the same"

      On a personal level, the throwaway photo I shot out the kitchen window of the family home 40 years ago, just to make sure the bulk-loaded film I'd just put into the camera was clean and unfogged, became valuable decades later. It was what you saw standing at the kitchen sink, but who would think to take a picture of it? Time changes values.

  19. Vision Aforethought
    Flame

    Indeed...

    I have priceless Hypercard stacks on my Mac, but cannot view or edit them! No emulators exist at all.

    1. jake Silver badge

      Re: Indeed...

      If they are "priceless" why not look for options? SuperCard comes to mind ... LiveCode would probably be a better option in today's world. There are others.

  20. Christian Berger

    Office formats were always more like memory dumps than archival formats

    Office file formats, no matter what office suite or version, were never meant to be archival formats. They were more like save games, little "memory dumps" allowing you to continue the game where you left off, no more no less. In fact some early systems even just dumped the memory onto diskette (e.g. the Canon Cat). That's why such formats have non-portable options like OLE objects which are nearly impossible to open on another computer. If such a file ever moves from one computer to another you are screwed.

    If you want to have something you want to be able to read in a few years or send to someone else, you must use archival formats. Those formats must be as trivially simple as possible. Possible candidates for archiving "printed" documents are TIFF (bitmap format, supports multiple pages) and archival grade PDF (special PDF without all of those useless features). Be sure to include a dump of the text in a separate text file so it's trivial to search. You don't need to change things in your archive. If you want a newer version re-create it again.

    Never ever ever store data in file formats you cannot read yourself. Complex (binary) file formats are acceptable only as long as they don't have to be backed up. That's why SQL-Servers tend to store their dumps as simple text files.
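
    The text-dump habit is easy to demonstrate. A minimal sketch using SQLite from the Python standard library as a stand-in for a bigger SQL server; `dump_as_text` and `restore_from_text` are illustrative names.

```python
import sqlite3


def dump_as_text(connection: sqlite3.Connection) -> str:
    """Serialize a whole database as plain SQL statements."""
    return "\n".join(connection.iterdump())


def restore_from_text(sql_text: str) -> sqlite3.Connection:
    """Rebuild an in-memory database from a plain-text SQL dump."""
    restored = sqlite3.connect(":memory:")
    restored.executescript(sql_text)
    return restored
```

    The dump is ordinary `CREATE TABLE` and `INSERT` statements, so even if no tool can replay it one day, a person can still read the data straight out of the file.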

    1. ijustwantaneasylife
      Stop

      Re: Office formats were always more like memory dumps than archival formats

      I'm quite surprised that nobody mentions HTML. I'm pretty sure I can open every one of these files/pages since the creation of the web. OK the formatting might not be that pretty, but the content will be there and will be structured in some manner that makes sense (P, H1-H6, etc.).
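
      Getting the content back out of old HTML, whatever happens to the formatting, is a few lines with a forgiving parser. A rough sketch using the Python standard library; `TextExtractor` and `visible_text` are illustrative names, and no attempt is made to skip script or style blocks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, ignoring the tags themselves."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(markup: str) -> list[str]:
    """Return the non-empty text fragments of an HTML document, in order."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.chunks
```

      Unclosed paragraph tags and upper-case element names, both common in early pages, do not stop the text from being recovered.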

  21. ecofeco Silver badge
    Facepalm

    One day?!

    Did I read that right? "One day?" Tried opening any of these lately?

    Apple Pie Editor and Formatter - Hayden Book Company
    Apple Writer III - Apple Computer, Inc.
    Comprehensive Electronic Office - Data General Corporation
    DisplayWrite 2 - IBM Corporation
    Easywriter Professional & II - Information Unlimited Software
    Executive Secretary - Sofsys
    FinalWord - Mark of the Unicorn
    Lazywriter - ABC Sales
    Leading Edge - Leading Edge Products, Inc.
    Microsoft Word - Microsoft Corporation
    MultiMate - MultiMate International
    NBI - NBI
    Omniword - Northern Telecom
    Palantir Tier I & Tier 2 - Designer Software
    Para Text - Para Research
    Peachtext (formerly Magic Wand) - Peachtree Software
    Perfect Writer - Perfect Software
    Samna Word II & III - Samna Corporation
    SCRIPSIT 2.0 - Radio Shack
    Select Word-Processing - Select Information Systems
    Spellbinder - Lexisoft
    Text Wizard - Datasoft
    VisiWord Plus - VisiCorp
    Volkswriter - Lifetime Software, Inc.
    Word-11 - Data Processing Design, Inc.
    WordPerfect - Satellite Software Intl.
    WordStar - MicroPro International
    WordVision - Bruce & James Program Pubs.
    XyWrite - XyQuest

    1. jake Silver badge

      Re: One day?!

      I can open and read all of those.

      Formatting might suffer, depending on the version, but the gist of the subject matter will be immediately obvious. Not all of us throw away all hardware older than two years, and all code[1] older than 9 months. Or maybe I'm just a packrat.

      [1] There is no such thing as "software", so-called "software" is merely the current state of the hardware.
