Data-mining technique outs authors of anonymous email

Engineers and computer scientists say they have devised a novel method for identifying authors of anonymous emails that's reliable enough to be used in courts of law. In a series of papers published over the past few years, the researchers from Concordia University in Montreal have described what they say is the first ever data …

COMMENTS

This topic is closed for new posts.
  1. Anonymous Coward
    Anonymous Coward

    Sounds like a field waiting for opening up

    Specifically, it would be easy enough to get a program to add randomness to a lot of these features in order to properly anonymise the writing style of the person sending the email. Loose some spelling acurracy, drop in some, broken sentences. even just dropping the odd capital would help reduce the confidence of the system until it was inadmissable in court.

    Prose style would be far more awkward, but maybe the system could then highlight distinctive phrases and advise that you may very well wish to rephrase them?

  2. Dave 125
    FAIL

    1 in 5 failure

    And a lawyer that can't turn a 20% failure rate into "reasonable doubt" isn't worth his money. Heck, I could probably do it.

  3. Anonymous Coward
    Anonymous Coward

    So, let's see...

    "When finely tuned, the technique identified the author about 80 percent of the time."

    In other words, they think a 20% failure rate is "reliable enough to be used in courts of law"?

    Well, in combination with other evidence it might be, I suppose. But given the "believe anything the computer says" attitude of some people I doubt it.

  4. Adam, Kent
    FAIL

    Correct 80% when finely tuned.

    So, wrong 20% when finely tuned and even more wrong when not in perfect lab conditions.

    So, hanging at least 1 in 5 innocent men is OK then....... FAIL as this should *never* be accepted as evidence in court!

  5. Anonymous Coward
    Unhappy

    Hmmmm

    Would the police in the UK do anything with it ?

    It took me over a week to get a respomse after reporting a crime under the misuse of computers act and the original response was that "if i had lost nothing then it was not a crime" ? I dont know why i bothered supplying them with all the information they required to trace the offender.

    It makes me wonder if this is a nice safe way to make money as the police dont seem to give a crap.

    The majority of coppers are clueless when it comes to anything slightly technical so i cant see this being used to stop all the "Viagra, penis enlargement and SEO" emails i find in my inbox. If they are outside the UK what exactly can the police do anyway ?

    1. Marvin the Martian
      Unhappy

      Not that much trickery needed

      Given that 80% of spam is written at the Lagos School for e-Business, and probably all cut, pasted, and mutated from one original "Hi I've inherited a squazillion of dollars, do you want for a fee to have a fraction of that?" writer. So strictly speaking, correctly identifying the original hand behind all that is not that hard.

      Note that the algorithm is not such that it determines the person, it is such that it determines whether 2 mails probably come from the same person; so it relies on the author kindly identifying him/herself in one of them.
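The pairwise comparison described above (do two mails probably come from the same person?) can be sketched roughly as follows. This is only an illustration, not the researchers' method: the feature set, the `FUNCTION_WORDS` list and the function names are all assumptions made up for the example.

```python
import math
import re
from collections import Counter

# Hypothetical feature set, chosen for illustration only.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "is", "i"]


def style_vector(text):
    """Turn a message into a small vector of crude style features."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return [0.0] * (3 + len(FUNCTION_WORDS))
    counts = Counter(w.lower() for w in words)
    total = len(words)
    features = [
        sum(len(w) for w in words) / total,           # mean word length
        sum(w[0].isupper() for w in words) / total,   # share of capitalised words
        sum(text.count(p) for p in ",;:!?") / total,  # punctuation marks per word
    ]
    # Relative frequency of each function word.
    features += [counts[w] / total for w in FUNCTION_WORDS]
    return features


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_author_score(mail_a, mail_b):
    """Higher means more similar style; it is a similarity score, not a probability."""
    return cosine_similarity(style_vector(mail_a), style_vector(mail_b))
```

A score like this only ranks known candidates against each other, which is the point made above: without a signed sample from the suspect there is nothing to compare against.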

    2. Anonymous John
      Happy

      It works!

      I've just googled on "respomse" and found about 88,800 examples of your work.

    3. It wasnt me
      Thumb Up

      No really......

      ...... you underestimate its accuracy. Apparently they tested it on the message boards of the Daily Mail, and it correctly identified that 87.4% of the postings had been written by the Twat-O-Tron.

      1. Anonymous Coward
        Anonymous Coward

        Thinking about it further

        What is really needed is a system to clone the writing style of one email in another. If you can modify an email to pass this test and appear to have come from one of your rivals for promotion... sorted!

    4. Nigel 11
      Unhappy

      Libel?

      Well, the 20% of the time they get it wrong should create plenty of work for libel lawyers!

    I expect the "re-write in the style of ..." computer program is not far off. Just feed it a novel by (say) Charles Dickens to train it, and off you go.

    5. BristolBachelor Gold badge

      not a crime ?

      "if i had lost nothing then it was not a crime"

      Can you quote that when you are up in front of the beak who has a picture of you from a Gatso?

    6. This post has been deleted by a moderator

    7. Graham Marsden
      Thumb Up

      "this should *never* be accepted as evidence in court!"

      Hear hear!

      An "about 80%" success rate does not equate to "beyond reasonable doubt"!

  6. James Micallef Silver badge
    Thumb Down

    Title

    The 80% (after fine tuning) figure is completely at odds with the phrase "reliable enough to be used in courts of law".

  7. Anonymous Coward
    Anonymous Coward

    "identified the author about 80 percent of the time"

    And that's in cases where the authors were not trying to hide their identity or impersonate someone else. No way could that be "used in courts of law", particularly since that reference to "finely tuned" presumably means that a corrupt police force or expert witness could finely tune it to identify the person they want to incriminate.

  8. Anonymous Coward
    Anonymous Coward

    Reliable?

    20% failure rate is not much better than hearsay. Which lawcourts are going to accept that as evidence? Just wanting to know so I can avoid those territories in the future...

    I bet the hit rate drops significantly if you have always known this analysis will be applied to your writings one day and are subsequently paranoid about consistent phrasing and sentence structure.

  9. John Smith 19 Gold badge
    WTF?

    80% good enough for a court of law.

    Icon says it all.

  10. Joe 1
    FAIL

    Too easy to circumvent

    How hard can it be to pass a human written paragraph through an algorithm and replace like synonyms, reorder sentences, and modify clause order to obfuscate the origin.
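A minimal sketch of the kind of obfuscator suggested above, assuming a toy synonym table (`SYNONYMS` is invented for the example; a real tool would need a proper thesaurus and some care about meaning). It only swaps words and shuffles sentence order, and it does not attempt the clause reordering mentioned.

```python
import random

# Toy synonym table, purely illustrative.
SYNONYMS = {
    "big": ["large", "huge"],
    "quick": ["fast", "rapid"],
    "help": ["assist", "aid"],
}


def obfuscate(text, rng):
    """Swap known words for random synonyms and shuffle the sentence order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rewritten = []
    for sentence in sentences:
        words = [
            rng.choice(SYNONYMS[w.lower()]) if w.lower() in SYNONYMS else w
            for w in sentence.split()
        ]
        rewritten.append(" ".join(words))
    if not rewritten:
        return ""
    # Shuffling can change the meaning of a text whose sentences depend on order.
    rng.shuffle(rewritten)
    return ". ".join(rewritten) + "."


if __name__ == "__main__":
    print(obfuscate("I need big help. Be quick.", random.Random()))
```

Passing in the random generator keeps the function repeatable for testing; whether this level of rewriting would actually defeat a stylometric classifier is untested here.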

  11. Vic

    How's that reliable?

    In testing, they only had 158 possible authors. They had to fine-tune the algorithm, and they still only got an 80% success rate.

    The standard of proof in a criminal case is supposed to be "beyond reasonable doubt". I cannot see how this qualifies...

    Vic.

    1. Anonymous Coward
      Anonymous Coward

      Easy Google Anonomiser (TM)

      Just Google translate it to French and back. Correct the really bad mistakes that mean you are no longer asking for £1M and saying you know where they live.

      I'm not really anonymous if you already know who I am and can find 200k of my emails :)

  12. Herby Silver badge

    Now if they only could...

    ...run this on various SPAM goodies.

    I'm sure that tracing these back to their authors would be beneficial to all. We can only hope.

    p.s. Does El Reg do this with "anonymous coward" emails?

  13. Paul Frankheimer
    Thumb Down

    Finely tuned gives 80%

    So this is not the standard of proof needed for a court of law. Especially when your set of potential senders is not nice and small and you don't have thousands of messages per sender. Why are they overselling this so much?

  14. jake Silver badge

    Yeah, whatever.

    ID this ...

  15. Anonymous Coward
    Anonymous Coward

    Yawn

    been doing this for years on usenet. by eye, at any rate. note the careful lack of capital letters in this post. its not how i normally write.

    1. Anonymous Coward
      Anonymous Coward

      "not how i normally write"

      And yet I think I d0xxed you. The post was rejected though, and I'm not going to try again for fear of angering The Mighty Moderatrix, so we might never know for sure!

  16. Jacob Lipman
    FAIL

    Analysis fail.

    80% is good enough for a court of law? Out of 200k emails, they misidentified 40,000... With a sample set of fewer than 200 people. I don't think that's enough for probable cause to have a warrant issued in the US, nor sufficient for presentation as evidence in a criminal case. Presumably with a larger pool of people, the accuracy goes down further - I can think of very few situations in which the software in its present state would be useful.

  17. Anonymous Coward
    Anonymous Coward

    Words are the new sticks and stones, boffins say.

    And there I was, thinking that "malicious email" was about the payload that would take over your computer when clicked upon.

    So tell me, what is the minimum amount the algorithm needs? Will this be enough:?

    "Hi! How are you?

    I send you this file in order to have your advice!

    See you later. Thanks"

    On a more serious note, this algorithm has, given a sufficiently large corpus and after a sufficient amount of fine-tuning (with a-priori knowledge? how else would they tune?), an 80% chance to pick the right author from a list of suspects. That's not quite lifting the veil of general anonymity. Wonder how well reference humans would do, sufficiently tuned through being familiar with the writing style of (all of) the suspects. Did they perchance do a comparative study?

    Also, what is it that makes this /suitable for court use/? The 80%? Is that it? Is the enron corpus the only one that they used to test this with? Then I wouldn't be so sure they'd get 80% elsewhere too. In theory maybe, but that doesn't automatically scale to the courts. Go find a few other bodies of emails to try this on, and we'll see what happens then.

  18. Anonymous Coward
    Anonymous Coward

    reliable enough

    "When finely tuned, the technique identified the author about 80 percent of the time."

    And that is "reliable enough to be used in courts of law" is it?

    Beyond reasonable doubt? I think not.

    And what does "finely tuned" mean exactly? Might it have something to do with knowing the answer in advance? What is its accuracy when not finely tuned?

  19. Paul Crawford Silver badge
    Big Brother

    Me thinks bullshit

    "...reliable enough to be used in courts of law"

    "When finely tuned, the technique identified the author about 80 percent of the time"

    That is really going to impress a court of law, or has the situation changed where the MD5 sum was considered not reliable enough for electronic evidence in spite of being much, much better than this:

    http://www.schneier.com/blog/archives/2005/08/the_md5_defense.html

    Unless of course you are accused of a crime that attracts witch-hunt like emotions?

  20. Anonymous Coward
    Thumb Down

    Hmm... Not sure I like it...

    Only automating it is actually original. The different components of write print as a method in criminology have been around for a considerable amount of time. A lot of criminals have been caught by their writing style. Some by luck when someone recognizes it (Unabomber). Some through a concerted effort and analysis.

    However, prior to this these were the domain of proper police and used only for very serious cases where paying someone qualified to do a lexical analysis was worth it. Even then, it was classed as circumstantial evidence and used only along with other data.

    The difference here is that this development puts this technique into the hands of people, some of whom are neither qualified to use it nor able to understand the consequences of 80% maximum confidence (on a rather large statistical sample). I may be overly pessimistic here, but the way this is going, "tools" for using this technique with an automated feed from the company Exchange mail server are at most 2-3 years away.

    So the next "anonymous" personnel survey you will be filling may actually be way less anonymous than you think...

  21. WonkoTheSane Silver badge
    Big Brother

    Easily bypassed

    Just pass your text through Google Translate into a random language, then back.

    If you're really paranoid, make a couple of passes through different languages before English.

  22. Dragonfighter
    WTF?

    This should have been mentioned earlier

    80% accuracy good enough for court

  23. Anonymous Coward
    Anonymous Coward

    Cor Blimey, strike a light

    o'course, that's gonna be 'ard 2 phake an-anon email, innit.

  24. SuperTim

    80% of a known base?

    And that's good enough for a court of law to prove that a "suspect" is guilty is it?

  25. Anonymous Coward
    Thumb Down

    reliable enough to be used in court?

    If it is only correct 80% of the time, I'd say that it doesn't go as far as "beyond reasonable doubt".

    Besides, I suspect it wouldn't be difficult to imitate somebody else's style, given a few examples of their emails, which leads to the question : how do you identify somebody if you only have one anonymous email?

    You'd have to have a good idea of where to start looking in the first place : no samples = no match.

  26. Anonymous Coward
    Thumb Up

    "reliable enough to be used in courts of law."

    "...emails written by 158 employees...the technique identified the author about 80 percent of the time."

    So of the 158 people, only 32 of them would be wrongly convicted?

  27. Anonymous Coward
    Joke

    Yeah, not that hard

    H8RS POST EVERYTHING IN UPPER CASE WITH NO PUNCTUATION EXCEPT EXCLAMATION MARKS! OMG! IT MUST BE ME THEY ARE AFTER!

  28. Phil 54

    Easy to fool

    I'd imagine that it would be pretty easy to protect yourself from this technique by using a web translation service to translate your email to another language, and then back to the original. That would probably give enough of a difference in writing style to screw up the results while keeping a relatively legible email that stays true to the original.

    I imagine it would be fairly easy to protect yourself against this technique using a translation service for websites to translate your e-mail to another language, then back to the original. That would probably give enough of a difference in writing style screw results while maintaining a relatively readable e-mail that remains faithful to the original.

    Google translate English to French and then back again

  29. David Hicks

    DOH! Reading fail...

    ... and there I was assuming that these guys had claimed to have tracked down the people releasing "Anonymous" email, not just "anonymous" email.

    This is going to get rapidly more confusing.

  30. mr-tom
    Troll

    Why use this in a court of law?

    It would be far more effective to employ the SAS, and for once we would see some benefits of doing so, plus it would serve as a disincentive to others.

  31. Anonymous Coward
    FAIL

    80% accurate reliable enough for court?!

    With a 200,000 sample and a limited, known user population of 158 the best they got was 20% of them wrong, yet based on this high error rate under optimal conditions, it's good enough for court?! I don't think so!

    Of course people's personal style and content in a corporate environment is going to use a common, person specific format, but an 80% hit rate is nothing to rave about. Come back when you are in the high nineties for a non-MEH response

  32. Buzzword
    FAIL

    Burden of proof

    "Reliable enough to be used in courts of law.... the technique identified the author about 80 percent of the time."

    80% is reliable enough for a court of law? *despair*

  33. The Alpha Klutz

    So it works on 158 authors

    But what about millions?

    Some people are so dumb that they surely don't have a unique writing style (in as much as they can write at all). Even their mistakes are one big exercise in groupthink.

    Though I suppose those people are easy to catch anyway.

  34. Anonymous Coward
    FAIL

    Beyond reasonable doubt

    and not balance of Probabilities.

  35. Ralphe Neill
    Big Brother

    Hmmmm ...

    What's to stop somebody deliberately changing style? We could have a digital copy-and-paste return to those old blackmail notes that used cut-out newspaper headlines.

    The anonymous writer could just copy a "suitable" phrase from one document and one from another and so on. Would that work?

  36. Tom 7 Silver badge

    identified the author about 80 percent of the time.

    That'll be for use in courts run by Aussie hopping marsupials then?

  37. Anonymous Coward
    Big Brother

    Reliable enough?

    "Engineers and computer scientists say they have devised a novel method for identifying authors of anonymous emails that's reliable enough to be used in courts of law."

    Which court of law are we talking about here? IANAL, but have some experience of legal processes. In the UK, to secure a criminal conviction the burden of proof is "beyond reasonable doubt". I do not believe that this technique meets that requirement: "When finely tuned, the technique identified the author about 80 percent of the time" 80% correct = one in five incorrect. I doubt, reasonably enough.

    I also doubt whether it would have much bearing on a civil case where the balance of probabilities is the test, unless backed up by other evidence.

  38. Anonymous Coward
    Thumb Down

    hurr durr

    becaus a clever trole dosn't know how to right in a diferint tone.

    becaus it work on unsuspect non-tech savvies dosn't meen it can work on ppls who r smart enuf to disguis there writin

  39. Neoc

    Clarification please..

    When they stated "identified the author 80 percent of the time" did they mean:

    (1) Given 100 emails it identified authors for 80 of them and they were all correct - the remaining 20 were marked as "unsure"; or

    (2) Given 100 emails it identified authors for 100 of them, but upon examination only 80 of them were correct.

    Option 1 is a decent piece of software detective work. Option 2 is a lousy negative match rate.
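The two readings above give the same headline figure but very different behaviour, which a small sketch can make concrete. The metric names (`coverage`, `precision`, `accuracy`) and the sample data are assumptions for illustration; the article does not say which reading applies.

```python
def score(predictions, truth):
    """Compare predicted authors with true authors.

    predictions: author name, or None when the system abstains ("unsure").
    truth: the actual author of each email.
    """
    if not truth:
        raise ValueError("need at least one email")
    answered = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    correct = sum(p == t for p, t in answered)
    return {
        "coverage": len(answered) / len(truth),   # share of emails given a name
        "precision": correct / len(answered) if answered else 0.0,  # names that were right
        "accuracy": correct / len(truth),         # right names over all emails
    }


truth = ["alice"] * 100
option_1 = ["alice"] * 80 + [None] * 20    # abstains on 20 emails
option_2 = ["alice"] * 80 + ["bob"] * 20   # names the wrong author on 20 emails

print(score(option_1, truth))
print(score(option_2, truth))
```

Both options score 0.8 on accuracy, so "identified the author 80 percent of the time" fits either. Only precision separates them: 1.0 for option 1 and 0.8 for option 2.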

    1. John Robson Silver badge
      Boffin

      Thank you

      For finally asking the question that had been floating round my head since the article...

      1. Anonymous Coward
        Anonymous Coward

        Oops ...

        Downvote button too close to the next page link. Sowwy.

    2. A J Stiles
      Pint

      Indeed

      That is a very important distinction.

      Given that they haven't said anything, I'm going to bet it's #2; because a good spin-doctor could turn #1 into a 100% success rate.

  40. David Hicks

    So 80% of the time it can pick between a known list of 158 people?

    And this is supposed to be good enough for use in court? Holy hell...

    With a false positive rate of 20% on such a small sample it's next to useless for picking people out of the general population, surely? All you could hope to get is "this guy we already suspect writes in a similar style to the release", which has got to qualify for pretty weak circumstantial evidence at best.

  41. Daniel 20
    Pint

    in courts of law ... ?

    and it gets it right 80% of the time ... ?

    Sounds great!

  42. Anonymous Coward
    Anonymous Coward

    Whoa there!

    80 percent is reliable enough for a court of law? 20 percent constitutes reasonable doubt in my mind.

    A thesaurus and a list of common spelling and grammar mistakes tied into a random character / word replacement / transposition script ... I'm fairly sure someone has thought of this already.

    Your move.

  43. Matt 21

    about 80 percent of the time

    Good enough for a court of law? I hope not.

  44. Anonymous Coward
    Anonymous Coward

    80% ?

    Excellent news for spam fighters. Not really feasible for court, I'd say.

  45. Notas Badoff
    Boffin

    Between beginning and end - no meat for the court

    We start with "... that's reliable enough to be used in courts of law."

    And end with "When finely tuned, the technique identified the author about 80 percent of the time." Out of a very limited set of 158 suspects, but with at least 200,000 emails to chew on.

    This has no teeth.

  46. Anonymous Coward
    Anonymous Coward

    Linguist involvement?

    I hope the 'engineers and computer scientists' behind this research involved linguists at an early stage. There are enough debates among experts in that field as to the authorship of various literary works to call this whole idea into question. For example, how much of the work attributed to Shakespeare was really penned by him? I suspect it is far too easy to mimic the writing style of another person to make this stand up in court.

  47. Tim 54
    Thumb Up

    Not so bad

    Whilst 80% for a single document wouldn't be good enough, suppose you have 4? All identified to the same author. That gives you a pretty low chance of the suspect not being the author. Building a case to a "beyond reasonable doubt" level involves multiple levels of evidence. In a harassment case, you're likely to have several emails (otherwise it wouldn't be harassment). That's the point of building a case - if they were all easy, police wouldn't have much to do. It's the same process we use in the brain to decide if something is true or not - we build evidence until we hit above our "truth threshold" and decide it's true.
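The arithmetic behind that argument can be sketched as below, under a strong assumption that is not in the article: each email's attribution is wrong independently of the others. Real errors are likely to be correlated (the same stylistic confusion repeating across emails), so treat the result as an optimistic bound. The function name is made up for the example.

```python
def prob_all_wrong(per_email_accuracy, n_emails):
    """Chance that every one of n independent attributions is wrong.

    Assumes independence between emails, which is an idealisation.
    """
    if not 0.0 <= per_email_accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return (1.0 - per_email_accuracy) ** n_emails


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, prob_all_wrong(0.8, n))
```

With 80% per-email accuracy and four emails this gives 0.2 to the fourth power, or 0.0016, about 1 in 625. That is the chance all four attributions are wrong, not the chance the suspect is innocent; the latter would also depend on how many plausible authors there are.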

  48. Ian Stephenson Silver badge
    Boffin

    I don't see it specifying "criminal court"

    Under criminal law the onus is for the prosecution to prove "beyond reasonable doubt" - I agree that is not met by this 80% accuracy.

    However a civil case it is only "on the balance of probabilities" so anything over 50% is technically acceptable.

    So good enough for a libel or copyright case .....

    Oh shit.

  49. Anonymous Coward
    Anonymous Coward

    80%???

    80% probability is enough for American courts?

    That's pathetic!

  50. Anonymous Coward
    Grenade

    Rorschach Test alert!

    You know that hocus-pocus personality test of symmetrical ink-blots that all look like female genitalia, for which you need to answer what this most reminds you of so that you appear as 'normal' as possible to the testers, while trying not to snigger / not get a boner / not spoil the test-paper?

    Well, the Rorschach Test is still being widely used in the US & Canada. Small wonder then that this email data-mining technique is also treated as credible evidence.

    Based on my previous (named) postings, I wonder if this process would be able to divine my identity from this anonymous post? Ha!

  51. Jess--

    not good enough for court

    I would see this being used as a tool to reduce the number of possibilities for an author

    In the test they did, where it got 80% right out of 200,000, it doesn't sound too useful. But if you apply that figure to trying to identify the author of 5 emails amongst that 200,000, and the system comes back with the same name 4 times and a "no match" or a different name for the fifth, I know where I would concentrate other (human) resources.

    1. Anonymous Coward
      Anonymous Coward

      I would say Oliver 7

      But I'm not sure.

  52. Anonymous Coward
    Anonymous Coward

    Known answer

    So they only get 80% correct when they know the answer, what's the hit rate like before they have manipulated the answers?!

    I have a horse tipping algorithm with a 100% success rate, only problem is that the race has to have already been run in order for it to work.

  53. Anonymous Coward
    Anonymous Coward

    Even better

    They claim they can identify the person talking in an ENCRYPTED VOIP connection!

    http://ncfta.ca/papers/voip.pdf

  54. JaitcH
    Unhappy

    There was a web site somewhere that would take inputted text ...

    process it and then toss it out in whatever language style you wanted.

    Some were really, really different. It's a bit like a word processor with a selectable grammar option - business, casual, etc.

    Bet that would bugger up even that great university, Concordia. Great for untouched e-mails but undoubtedly beatable if need be.

  55. tfewster Silver badge
    FAIL

    Concerns/doubtful/bollocks

    Businesslike:

    I'm concerned that this study doesn't seem to address the fact that people tend to change style depending on the target audience.

    vs casual:

    Doubtful. We speak differently to different people.

    vs. Anonymous:

    Bollocks. Do these retards think I'd post words like this if they knew who I was? *

    * I know, not too hard to discover, but Anon posters often don't think that far ahead.

  56. Anonymous Coward
    Thumb Down

    Useless...

    It can only tell you which of the suspects has a style most like the author of the email. Or put another way, if the author is not known, this method won't unmask them.

    Essentially it could be argued that there is no way to prove that there isn't another person with a similar educational background and a similar writing style.

