Public genome databases can leak identity

Public genome data is a significant risk to individuals, according to research led by Yaniv Erlich, a geneticist at the Whitehead Institute for Biomedical Research. The team that Erlich led was able to de-anonymise genome data using only public information and careful Internet searches. A little chillingly, individuals …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward

    There you have it

    I bet the Government and all 'interested' parties are rubbing their hands with glee.

  2. Cubical Drone

    So DNA information can be used to identify individuals...shocker!!

    1. Bumpy Cat

      Read more closely

      Anonymised DNA data can be used to identify individuals.

  3. Martin Budden Silver badge
    WTF?

    idiots!

    Which numpty thought it would be ok to publish people's fucking NAMES????!!!!!!!!!!

    1. John Smith 19 Gold badge
      Meh

      Re: idiots!

      The phone company?

    2. MondoMan
      Boffin

      Re: idiots!

      The names were NOT published -- the interesting part of the paper is how they were able to INFER the names from the DNA sequence (see below).

  4. Scotty
    WTF?

    Suppose it's easier to trace the inbreeds down than a general sweep of a "normal" city

  5. graeme leggett Silver badge

    not anonymised - by definition

    An NHS trust defines anonymised data as "data concerning an individual from which the identity of the individual cannot be determined"

    "In practice, anonymised data should exclude the name [list of stuff], and any other information which when combined with other information....available to the recipient could allow the individual to be identified."

    Now if the public version of the data replaced every "Smith" and "Jones" with a number, there would be less of a problem. Up until individuals start making their DNA sequence and surnames available (Genebook?) and inadvertently providing a new key.

    As I haven't read the published article, is it the case with Dr Venter that there were only a few individuals in the database from Utah of his age?

  6. John Smith 19 Gold badge
    WTF?

    It's an open access database of "useless"* DNA

    Why?

    *Or rather, DNA whose function cellular biologists have not yet been smart enough to figure out

  7. moonface
    Joke

    I downloaded just Dr Craig Venter's genome data. After a bit of analysis I predict he will be a baldy bloke. I now just need his email address so I can spam him with my toupee services.

  8. MondoMan
    Big Brother

    To clarify...

    The genome data was NOT published along with the person's surname, although it DID include the person's age and state of residence. Of course, if you're given a person's surname, age and state of residence, it's plausible that you might be able to track down the person pretty easily.

    The interesting aspect of the paper is that the authors were often able to figure out the person's surname FROM the genomic DNA sequence. Once they had that and the (misguidedly) database-listed age and state, Bob was their uncle, too.

    How did they find the surname? It turns out that the legions of genealogy enthusiasts worldwide have uploaded enough information to the publicly-available genealogy databases to make this possible in many cases:

    1) Genealogists know that in most English-speaking countries, surnames are inherited from father to son without change.

    2) We know from human genetics that the Y chromosome is inherited from father to son, essentially without change (whereas an X chromosome can come from either parent).

    3) Given the above, if two men share the same Y chromosome version, they are likely to also share the same surname.

    4) You don't have to compare the whole Y chromosomes of two men; it's enough to check just a small number of "fingerprint" locations on the Y to check if they've got the same version. Short tandem repeats (STRs) are a type of genetic marker or fingerprint component scattered throughout the genome, including the Y chromosome. Thus, to compare two Y chromosomes, just compare a few STRs from each Y.

    5) Enthusiasts have uploaded thousands of Y chromosome STR fingerprints to the public genealogy databases, along with the surname matching each. It's very cheap and easy these days to send in a cheek swab and have a company report back with your Y chromosome STR fingerprint.

    6) Since STRs are just DNA sequences, if you have the whole genomic DNA sequence for a man, you can just read out their Y chromosome STR fingerprint from the DNA sequence. Match that with a fingerprint in the Y chromosome/surname database, and you've likely identified the person's surname. Voila!
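
As a rough illustration of steps 4 to 6, here is a minimal Python sketch, not the method used in the paper. It reads repeat counts for a few STR markers out of a DNA string and looks them up in a small surname table. Everything specific in it is an assumption made up for the example: the marker names (`M1` to `M3`), their flanking sequences and repeat motifs, the toy sequence, and the surname database entries. Real Y-STR markers, genealogy databases and genome files are far larger and messier.

```python
import re

# Hypothetical markers: name -> (flanking sequence, repeat motif).
# These are invented for illustration and are not real Y-STR loci.
MARKERS = {
    "M1": ("CCGT", "GATA"),
    "M2": ("TTGC", "TCTA"),
    "M3": ("ACGG", "CA"),
}


def repeat_count(sequence, flank, motif):
    """Count consecutive copies of `motif` that directly follow `flank`."""
    pattern = re.escape(flank) + "((?:" + re.escape(motif) + ")+)"
    match = re.search(pattern, sequence)
    if match is None:
        return None  # marker not found in this sequence
    return len(match.group(1)) // len(motif)


def str_profile(sequence, markers=MARKERS):
    """Read out the STR 'fingerprint': one repeat count per marker."""
    return {name: repeat_count(sequence, flank, motif)
            for name, (flank, motif) in markers.items()}


def match_surnames(profile, database, max_mismatches=0):
    """Return (surname, mismatches) for entries close enough to `profile`."""
    hits = []
    for surname, entry in database:
        # Compare only markers present in both profiles.
        shared = [m for m in profile
                  if profile[m] is not None and m in entry]
        if not shared:
            continue
        mismatches = sum(profile[m] != entry[m] for m in shared)
        if mismatches <= max_mismatches:
            hits.append((surname, mismatches))
    return sorted(hits, key=lambda hit: hit[1])


# Toy stand-in for a stretch of Y chromosome sequence.
toy_sequence = ("AACCGT" + "GATA" * 3 + "GGTTGC" + "TCTA" * 2
                + "CCACGG" + "CA" * 4 + "TT")

# Toy stand-in for a public surname / Y-STR genealogy database.
toy_database = [
    ("Smith", {"M1": 3, "M2": 2, "M3": 4}),
    ("Jones", {"M1": 5, "M2": 2, "M3": 4}),
    ("Smith", {"M1": 3, "M2": 2, "M3": 5}),
]

profile = str_profile(toy_sequence)
print(profile)
print(match_surnames(profile, toy_database))
```

Allowing a mismatch or two (`max_mismatches=1`) is one simple way to tolerate the occasional mutation between distant relatives, at the cost of more candidate surnames to sift through with the age and state information.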

    Clearly, this technique doesn't directly work for genomes without Y chromosomes (i.e. women) and in cultures where surnames are not inherited from father to son. However, as computing power increases to the point where processing thousands to millions of 10GB whole-genome data chunks is cheap and fast, other techniques will become practical. Even now, one could cheaply and easily convert a whole-genome DNA sequence into a synthetic whole-genome DNA fingerprint suitable for matching with the data in law-enforcement DNA fingerprint databases, and match any individuals therein.

    Since the publication of this result, at least some whole-genome databases have stopped releasing the individual's age (or perhaps state) along with the DNA sequence. Other whole genome databases require the names of the individuals to be published along with their DNA sequences, so there's no new privacy risk in those (pre-consented) cases.
