Want a MEEELLION-year data storage? Use DNA of course

Swiss boffins have used their weird wizardry to devise a way to store data for a million years using DNA, or so they claim. Perhaps DNA should be changed and adopt a new acronymic meaning: Digital Nucleotide Archive. The researchers were led by boss boffin Robert Grass (Die Haupt Boffiner – think that’s the translation, Ed …

  1. Neil Barnes Silver badge

    And once it's in a suitable carrier

    You could graft it onto, say, a human genome?

    Could be a bit embarrassing if in a few years, someone did a genome scan and found the copyright notice.

    1. Buzzword

      Re: And once it's in a suitable carrier

      That is (in part) the plot of Orphan Black.

      1. Graham Marsden
        Coat

        Re: And once it's in a suitable carrier

        But what if someone uses it to write a virus...?

  2. JimmyPage Silver badge
    Thumb Up

    Back in the 80s

    this got discussed when I was studying at Uni - as a theoretical possibility. We were just waiting for the technology to catch up.

    Of course it does raise the possibility that this has already been done elsewhere in the Universe, and *we* are the result ?

    1. Dr. G. Freeman

      Re: Back in the 80s

      "Of course it does raise the possibility that this has already been done elsewhere in the Universe, and *we* are the result ?"

      Maybe thick people are Cat pictures, that's why there are so many of them.

    2. Zog_but_not_the_first
      Unhappy

      Re: Back in the 80s

      So... I'm just somebody's backup.

      1. Anonymous Coward

        Re: So... I'm just somebody's backup.

        Absolutely. But that hairy hominid who you're a backup of only had a wetware printer to work with, and those printers had minds of their own, as did the printed copies. The net result is that the current backup is disappointingly hairless, too tall, and has a weird looking face, and would probably struggle to survive a week out in the original's "real world" - i.e. is somewhat corrupted and of dubious value. So I'm not sure I'd hold out too much hope for your backups, down the line...

      2. GitMeMyShootinIrons

        Re: Back in the 80s

        Brings a whole new meaning to corrupt humans.

        I was obviously written using NT backup as I'm useless at remembering anything of value.

  3. frank ly

    Yes, but

    "... DNA storage density is one million gigabits per cubic millimetre, ..."

    How do you know which tiny group of tiny glass spheres contains the funny cat video?

  4. John H Woods Silver badge

    You could duplicate it ...

    With appropriate checksumming / redundancy, you could always duplicate the DNA before sequencing it. The truly great thing about DNA though, is that we'll always know how to read it!

    1. Warm Braw

      Re: You could duplicate it ...

      It could duplicate itself - that's kind of its raison d'etre...

      1. John H Woods Silver badge

        Re: You could duplicate it ...

        "It could duplicate itself - that's kind of its raison d'etre..." -- Warm Braw

        No, it really can't :-) and it really isn't.

    2. Anonymous Coward

      Re: You could duplicate it ...

      We might always know how to read DNA but we won't always know what encoding scheme was used a few hundred years ago. You can read the bits from a CD but unless you know about the CD-DA format you're not gonna get any music (not to mention that here 4 bits are mapped to 2 so that mapping itself is also important).

      1. Grikath
        FAIL

        Re: You could duplicate it ...

        "we might always know how to read DNA" .....

        Ummmm not so much... It will take only a couple decades' worth of technological setback to render this type of archive useless. Reading DNA pair-for-pair quickly is pretty advanced stuff y'know..

        Even then, the Read/Write rate is, let us say, somewhat awful, and cannot be improved much upon, because physics...

        This kind of thing has always been a nice thought experiment since the 70s, and there's been quite a bit of oddball work around (at least when I was at uni in the 90s) on how you could "organically compute" stuff. But reality dictates that the best speed you could manage would be in the order of magnitude of the old tube-driven beasts of the digital stone age.

        Practicality? Naught.

        1. John H Woods Silver badge

          Re: You could duplicate it ...

          "Reading DNA pair-for-pair quickly is pretty advanced stuff y'know.."

          I do know (was a biologist before reincarnation as an IT guy) -- but surely any sufficiently advanced civilization will be able to read DNA. A couple of decades technology setback would stop you reading it, I agree --- but a couple more decades of technology advance will let you read it again.

          The real challenge is a self-describing code. It's easy to store stuff in binary form that proves intelligence and knowledge - binary encodings of pi, e, the Fibonacci sequence etc would be recognised by any sufficiently advanced intelligence, however alien. But how do you (or even can you) embed some kind of Rosetta Stone that bridges the gap between this material and the advanced content?

          1. Martin Budden Silver badge
            Go

            Re: You could duplicate it ...

            The real challenge is a self-describing code.....

            Forget the DNA stuff, this would be a valuable thought exercise in its own right, something Randall Munroe would probably enjoy.

          2. Michael Wojcik Silver badge

            Re: You could duplicate it ...

            But how do you (or even can you) embed some kind of Rosetta Stone that bridges the gap between this material and the advanced content?

            That's not what the Rosetta Stone provides - the analogy isn't apt. The Rosetta Stone is useful because it presented the same message in multiple encodings (languages), some of which were known to the readers. Obviously it's possible to do that with DNA storage.

            The larger problem - making the message recognizable to readers who aren't familiar with the encoding - is underspecified. More-complete versions are unsolvable, moderately difficult, or relatively easy, depending on the constraints.

            The unsolvable version: Make a message such that arbitrary readers, with just the assumption that they possess technology sufficient to sequence DNA, can with high probability recover a large portion of the message, and interpret it to a meaning that's close (within some epsilon) to the meaning intended by the author. That's simply not achievable. There's no guarantee that such a reader will recognize the DNA structure; that it will have an understanding of communication such that it perceives a message; that it will be capable of comprehending human psychology; etc. It's even conceivable that such a reader wouldn't recognize Reed-Solomon. Any sufficiently alien intelligence is ineffable.

            The simple version: Make a message that can be decoded by a reader with a good understanding of human communication and the culture of the present moment, and access to a broad corpus of cultural products from the present. Reduce the interpretation requirement to recovering at least a substantial portion of the message with high probability and good accuracy. This is simple: write a message using a straightforward encoding such as ASCII,[1] using one of the dominant natural languages (e.g. English). Write it using straightforward prose with a lot of redundancy. You can also employ a Rosetta-Stone mechanism by translating it into other natural languages. Here we can safely assume Reed-Solomon[2] will be recognized. Even if your language(s) of choice is no longer in general use, the corpus of cultural materials should suffice to make the message intelligible.

            Between those two you have the moderately difficult versions. You can assume an alien intelligence with relatively little access to human cultural materials, etc. So add information and redundancy to the message, by including more data and by making better use of the channel. For example, if we assume the reader has, or can recognize, some form of visual two-dimensional communication, we can DNA-encode monochrome line drawings with a procedure like this:

            - Pick two "flag" sequences for delimiting drawings and rows. Make them clearly artificial and low-information-entropy, such as a thousand adenines (let's call this "AK") and a thousand cytosines ("CK"). Those should attract the attention of anyone sequencing the string.

            - Within a picture, use the other two bases (T and G) for your two "colors". (We can assume this is prior to ECC without loss of generality, since we've already assumed the reader recognized and decoded the ECC layer.)

            - Start and end the image with AK.

            - Sequence a row of (one-bit) "pixels" at a time, delimited by CK. All rows are the same length, so that the reader can recognize the rectangular shape.
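
            The four steps above can be sketched in Python roughly as follows. This is only an illustration of the scheme as described in this comment, not anything the researchers did: the function names `encode_drawing` and `decode_drawing`, the `flag_len` parameter, and the choice of G for a set pixel and T for a clear one are all assumptions made for the example, and the ECC layer is left out entirely.

```python
def encode_drawing(rows, flag_len=1000):
    """Encode a monochrome bitmap (list of equal-length rows of 0/1) as a base string.

    Assumed convention for this sketch: 1 -> "G", 0 -> "T".
    "AK" (a run of A) brackets the image, "CK" (a run of C) separates rows.
    """
    if not rows or not rows[0]:
        raise ValueError("drawing must have at least one non-empty row")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must be the same length")

    ak = "A" * flag_len  # image delimiter
    ck = "C" * flag_len  # row delimiter
    body = ck.join("".join("G" if px else "T" for px in row) for row in rows)
    return ak + body + ak


def decode_drawing(strand, flag_len=1000):
    """Recover the bitmap from a strand produced by encode_drawing."""
    ak = "A" * flag_len
    ck = "C" * flag_len
    if len(strand) < 2 * flag_len or not (strand.startswith(ak) and strand.endswith(ak)):
        raise ValueError("strand is not bracketed by the AK flag")

    body = strand[flag_len:-flag_len]
    rows = []
    for chunk in body.split(ck):
        if not chunk or set(chunk) - {"T", "G"}:
            raise ValueError("row contains bases other than T and G")
        rows.append([1 if base == "G" else 0 for base in chunk])
    return rows
```

            Because pixels only ever use T and G, a run of A or C can never appear inside a row, so splitting on the flags is unambiguous. With the default thousand-base flags, even a 3x3 drawing costs about four thousand bases, which is the price of making the delimiters stand out.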

            Now, this image may be interpreted under various affine transformations. The reader won't know which base is "black" and which is "white" (which is irrelevant), or what the orientation of the "rows" is. So the picture may be reflected in either or both dimensions, and so on.[3]

            There are all sorts of ways of increasing the total information entropy and decreasing the entropy density in the message, both of which make it easier to derive an interpretation. Some encodings (pure-binary numbering systems, for example) are more "natural" than others because they don't depend on arbitrary language or cultural features and can be derived from basic mathematical forms, so you use those where feasible. And so on.

            [1] Even if the requirements for this scenario didn't more or less ensure that ASCII is recognizable, it's easy for a reader with access to print text to decode ASCII by symbol frequency.

            [2] Or any other purely mathematical ECC, such as group codes. We wouldn't want to use an ECC mechanism that depends on features of the input, such as a predictive matcher primed for the language in question (which is how humans correct for typographical errors and the like).

            [3] A perverse reader might start with the assumption that it's written boustrophedon. But I think such a reader would also try a consistent row ordering, and pick the correct one based on preferential edge appearance.

  5. Anonymous Coward

    What if people start trying to decode our DNA....

    ...and they find a sequence that just makes sense...?

    Is our DNA someone's encrypted backup?

    The mind boggles...

  6. Filippo Silver badge

    WORO?

    Just stick it into bacteria and let them reproduce. When you want to read it, pick up a few of them. Sure, the reproduction step will cause error rates to skyrocket, especially over a million years, but the information density is so high that you can just go overkill on the ECC algorithms.
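
    As a toy illustration of the "pick up a few of them" step, here is a per-position majority vote across several reads. This is the crudest possible redundancy scheme, not Reed-Solomon or any other real ECC, and the function name `consensus` is made up for the example. It also assumes mutations are substitutions only, so every read has the same length; insertions and deletions would need alignment first.

```python
from collections import Counter


def consensus(reads):
    """Return the most common base at each position across equal-length reads."""
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch handles substitutions only; reads must be equal length")

    # zip(*reads) walks the reads column by column, one position at a time.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))


# Five descendants of the same strand, three of them with one mutation each.
reads = ["GATTACA", "GATTGCA", "CATTACA", "GATTACA", "GATTACT"]
print(consensus(reads))  # prints GATTACA
```

    The vote only works while most reads still agree at each position, so it says nothing about whether the trick survives a million years of drift.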

  7. Kubla Cant

    I'm afraid we've lost your data. Yes, we had a backup, but it evolved legs and ran away.

    1. DropBear

      Are you saying you can't remember it for me wholesale...?

  8. Anonymous Coward

    What's the bandwidth? How would the glass capsule tolerate 1 million years? Could you clone the DNA strand and that way achieve massive parallelism in reading/writing? Interesting article, and I hope a follow up will come in time.

  9. phil dude
    Boffin

    interesting....

    If the nucleotide side chains had a quantum dot attached, the data would be readable by a laser and can remain encased.... Not sure I want my data in solution...!

    DNA looks like food to microbes...

    P.

  10. WelshAl

    Encryption too...

    http://www.theregister.co.uk/2005/05/27/bofh_2005_episode_17/
