Data-mining technique outs authors of anonymous email

Engineers and computer scientists say they have devised a novel method for identifying authors of anonymous emails that's reliable enough to be used in courts of law. In a series of papers published over the past few years, the researchers from Concordia University in Montreal have described what they say is the first ever data …

COMMENTS

This topic is closed for new posts.

Bronze badge

Sounds like a field waiting for opening up

Specifically, it would be easy enough to get a program to add randomness to a lot of these features in order to properly anonymise the writing style of the person sending the email. Loose some spelling acurracy, drop in some, broken sentences. even just dropping the odd capital would help reduce the confidence of the system until it was inadmissable in court.

Prose style would be far more awkward, but maybe the system could then highlight distinctive phrases and advise that you may very well wish to rephrase them?

5
0
Thumb Up

No really......

...... you underestimate its accuracy. Apparently they tested it on the message boards of the Daily Mail, and it correctly identified that 87.4% of the postings had been written by the Twat-O-Tron.

14
0
Bronze badge

Thinking about it further

What is really needed is a system to clone the writing style of one email in another. If you can modify an email to pass this test and appear to have come from one of your rivals for promotion... sorted!

1
0
FAIL

Correct 80% when finely tuned.

So, wrong 20% when finely tuned and even more wrong when not in perfect lab conditions.

So, hanging at least 1 in 5 innocent men is OK then....... FAIL as this should *never* be accepted as evidence in court!

15
0
Bronze badge
Unhappy

Not that much trickery needed

Given that 80% of spam is written at the Lagos School for e-Business, and probably all cut, pasted, and mutated from one original "Hi I've inherited a squazillion of dollars, do you want for a fee to have a fraction of that?" writer, correctly identifying the original hand behind all that is, strictly speaking, not that hard.

Note that the algorithm does not determine the person; it determines whether two mails probably come from the same person, so it relies on the author kindly identifying him/herself in one of them.
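
As a rough illustration of that pairwise framing, here is a minimal sketch, assuming a toy feature set (a few function-word frequencies, average word length, capitalisation rate) compared by cosine similarity. None of this is taken from the Concordia papers; the feature list, the function names and the scoring are hypothetical stand-ins. It only shows that such a comparison yields a similarity between two texts, never a name.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: a handful of common English function words.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "is", "it", "for"]


def style_vector(text):
    """Return a crude style vector for one piece of text (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return [0.0] * (len(FUNCTION_WORDS) + 2)
    counts = Counter(w.lower() for w in words)
    total = len(words)
    # Relative frequency of each function word.
    vec = [counts[w] / total for w in FUNCTION_WORDS]
    # Average word length, scaled down so it does not swamp the other features.
    vec.append(sum(len(w) for w in words) / total / 10.0)
    # Share of words that start with a capital letter.
    vec.append(sum(1 for w in words if w[0].isupper()) / total)
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_author_score(mail_a, mail_b):
    """Higher means 'more alike in style'. It says nothing about who wrote either mail."""
    return cosine_similarity(style_vector(mail_a), style_vector(mail_b))
```

To get from a score like this to a person, you still need a mail whose author is already known, which is the point made above.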

0
0
Silver badge
Thumb Up

"this should *never* be accepted as evidence in court!"

Hear hear!

An "about 80%" success rate does not equate to "beyond reasonable doubt"!

0
0
Anonymous Coward

So, let's see...

"When finely tuned, the technique identified the author about 80 percent of the time."

In other words, they think a 20% failure rate is "reliable enough to be used in courts of law"?

Well, in combination with other evidence it might be, I suppose. But given the "believe anything the computer says" attitude of some people I doubt it.

9
0
Silver badge
Unhappy

Libel?

Well, the 20% of the time they get it wrong should create plenty of work for libel lawyers!

I expect the "re-write in the style of ..." computer program is not far off. Just feed it a novel by (say) Charles Dickens to train it, and off you go.

0
0

This post has been deleted by a moderator

Unhappy

Hmmmm

Would the police in the UK do anything with it ?

It took me over a week to get a respomse after reporting a crime under the misuse of computers act and the original response was that "if i had lost nothing then it was not a crime" ? I dont know why i bothered supplying them with all the information they required to trace the offender.

It makes me wonder if this is a nice safe way to make money as the police dont seem to give a crap.

The majority of coppers are clueless when it comes to anything slightly technical so i cant see this being used to stop all the "Viagra, penis enlargement and SEO" emails i find in my inbox. If they are outside the UK what exactly can the police do anyway ?

0
0
Happy

It works!

I've just googled on "respomse" and found about 88,800 examples of your work.

4
0
Gold badge

not a crime ?

"if i had lost nothing then it was not a crime"

Can you quote that when you are up in front of the beak who has a picture of you from a Gatso?

3
0
FAIL

1 in 5 failure

And a lawyer that can't turn a 20% failure rate into "reasonable doubt" isn't worth his money. Heck, I could probably do it.

4
0
Anonymous Coward

"identified the author about 80 percent of the time"

And that's in cases where the authors were not trying to hide their identity or impersonate someone else. No way could that be "used in courts of law", particularly since that reference to "finely tuned" presumably means that a corrupt police force or expert witness could finely tune it to identify the person they want to incriminate.

3
0
FAIL

Too easy to circumvent

How hard can it be to pass a human-written paragraph through an algorithm that swaps in synonyms, reorders sentences, and modifies clause order to obfuscate the origin?

0
0
Anonymous Coward

Easy Google Anonomiser (TM)

Just Google translate it to French and back. Correct the really bad mistakes that mean you are no longer asking for £1M and saying you know where they live.

I'm not really anonymous if you already know who I am and can find 200k of my emails :)

1
0
Silver badge
Thumb Down

Title

The 80% (after fine tuning) figure is completely at odds with the phrase "reliable enough to be used in courts of law".

2
0
Anonymous Coward

Reliable?

A 20% failure rate is not much better than hearsay. Which law courts are going to accept that as evidence? Just wanting to know so I can avoid those territories in the future...

I bet the hit rate drops significantly if you have always known this analysis will be applied to your writings one day and are subsequently paranoid about consistent phrasing and sentence structure.

0
0
Gold badge
WTF?

80% good enough for a court of law.

Icon says it all.

2
0
Vic
Silver badge

How's that reliable?

In testing, they only had 158 possible authors. They had to fine-tune the algorithm, and they still only got an 80% success rate.

The standard of proof in a criminal case is supposed to be "beyond reasonable doubt". I cannot see how this qualifies...

Vic.

3
0
Bronze badge

Now if they only could...

...run this on various SPAM goodies.

I'm sure that tracing these back to their authors would be beneficial to all. We can only hope.

p.s. Does El Reg do this with "anonymous coward" emails?

0
0
Silver badge

Yeah, whatever.

ID this ...

0
0
Thumb Down

Finely tuned gives 80%

So this is not the standard of proof needed for a court of law. Especially when your set of potential senders is not nice and small and you don't have thousands of messages per sender. Why are they overselling this so much?

0
0
Anonymous Coward

Yawn

been doing this for years on usenet. by eye, at any rate. note the careful lack of capital letters in this post. its not how i normally write.

0
0
Anonymous Coward

"not how i normally write"

And yet I think I d0xxed you. The post was rejected though, and I'm not going to try again for fear of angering The Mighty Moderatrix, so we might never know for sure!

0
0
Anonymous Coward

Words are the new sticks and stones, boffins say.

And there I was, thinking that "malicious email" was about the payload that would take over your computer when clicked upon.

So tell me, what is the minimum amount the algorithm needs? Will this be enough?

"Hi! How are you?

I send you this file in order to have your advice!

See you later. Thanks"

On a more serious note, this algorithm has, given a sufficiently large corpus and after a sufficient amount of fine-tuning (with a-priori knowledge? how else would they tune?), an 80% chance to pick the right author from a list of suspects. That's not quite lifting the veil of general anonymity. Wonder how well reference humans would do, sufficiently tuned through being familiar with the writing style of (all of) the suspects. Did they perchance do a comparative study?

Also, what is it that makes this /suitable for court use/? The 80%? Is that it? Is the enron corpus the only one that they used to test this with? Then I wouldn't be so sure they'd get 80% elsewhere too. In theory maybe, but that doesn't automatically scale to the courts. Go find a few other bodies of emails to try this on, and we'll see what happens then.

0
0
FAIL

Analysis fail.

80% is good enough for a court of law? Out of 200k emails, they misidentified 40,000... With a sample set of fewer than 200 people. I don't think that's enough for probable cause to have a warrant issued in the US, nor sufficient for presentation as evidence in a criminal case. Presumably with a larger pool of people, the accuracy goes down further - I can think of very few situations in which the software in its present state would be useful.
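
For anyone who wants to check the sums, a minimal sketch, assuming the quoted 80% applies uniformly per email across the whole 200,000-email corpus (the article does not confirm that). The variable names are made up for illustration.

```python
# Figures quoted in the article and comments; applying the 80% per email
# across the full corpus is an assumption, not something the article states.
TOTAL_EMAILS = 200_000
AUTHORS = 158
ACCURACY_PERCENT = 80

# Integer arithmetic avoids floating-point rounding on the headline number.
misattributed = TOTAL_EMAILS * (100 - ACCURACY_PERCENT) // 100

# Baseline for comparison: picking one of the 158 authors at random.
chance_percent = 100 / AUTHORS

print(f"Misattributed emails: {misattributed}")
print(f"Random-guess hit rate: {chance_percent:.2f}%")
```

So 80% is a long way above guessing among 158 authors, and still a long way short of anything you would want to convict on.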

0
0
Anonymous Coward

reliable enough

"When finely tuned, the technique identified the author about 80 percent of the time."

And that is "reliable enough to be used in courts of law" is it?

Beyond reasonable doubt? I think not.

And what does "finely tuned" mean exactly? Might it have something to do with knowing the answer in advance? What is its accuracy when not finely tuned?

0
0
Silver badge
Big Brother

Me thinks bullshit

"...reliable enough to be used in courts of law"

"When finely tuned, the technique identified the author about 80 percent of the time"

That is really going to impress a court of law, or has the situation changed where the MD5 sum was considered not reliable enough for electronic evidence in spite of being much, much better than this:

http://www.schneier.com/blog/archives/2005/08/the_md5_defense.html

Unless of course you are accused of a crime that attracts witch-hunt like emotions?

0
0
Thumb Down

Hmm... Not sure I like it...

Only automating it is actually original. The different components of write print as a method in criminology have been around for a considerable amount of time. A lot of criminals have been caught by their writing style. Some by luck when someone recognizes it (Unabomber). Some through a concerted effort and analysis.

However, prior to this these were the domain of proper police and used only for very serious cases where paying someone qualified to do a lexical analysis was worth it. Even then, it was classed as circumstantial evidence and used only along with other data.

The difference here is that this development puts this technique into the hands of people, some of whom are neither qualified to use it nor able to understand the consequences of 80% maximum confidence (on a rather large statistical sample). I may be overly pessimistic here, but the way this is going, "tools" for using this technique with an automated feed from the company Exchange mail server are at most 2-3 years away.

So the next "anonymous" personnel survey you fill in may actually be way less anonymous than you think...

0
0
Thumb Up

"reliable enough to be used in courts of law."

"...emails written by 158 employees...the technique identified the author about 80 percent of the time."

So of the 158 people, only 32 of them would be wrongly convicted?

0
0
Joke

Yeah, not that hard

H8RS POST EVERYTHING IN UPPER CASE WITH NO PUNCTUATION EXCEPT EXCLAMATION MARKS! OMG! IT MUST BE ME THEY ARE AFTER!

0
0
Bronze badge
Big Brother

Easily bypassed

Just pass your text through Google Translate into a random language, then back.

If you're really paranoid, make a couple of passes through different languages before English.

1
0
Anonymous Coward

Cor Blimey, strike a light

o'course, that's gonna be 'ard 2 phake an-anon email, innit.

0
0

Easy to fool

I'd imagine that it would be pretty easy to protect yourself from this technique by using a web translation service to translate your email to another language, and then back to the original. That would probably give enough of a difference in writing style to screw up the results while keeping a relatively legible email that stays true to the original.

I imagine it would be fairly easy to protect yourself against this technique using a translation service for websites to translate your e-mail to another language, then back to the original. That would probably give enough of a difference in writing style screw results while maintaining a relatively readable e-mail that remains faithful to the original.

Google translate English to French and then back again

1
0

80% of a known base?

And that's good enough for a court of law to prove that a "suspect" is guilty is it?

0
0
Thumb Down

reliable enough to be used in court?

If it is only correct 80% of the time, I'd say that it doesn't go as far as "beyond reasonable doubt".

Besides, I suspect it wouldn't be difficult to imitate somebody else's style, given a few examples of their emails, which leads to the question : how do you identify somebody if you only have one anonymous email?

You'd have to have a good idea of where to start looking in the first place : no samples = no match.

0
0
WTF?

This should have been mentioned earlier

http://www.theregister.co.uk/Design/graphics/icons/comment/wtf_32.png 80% accuracy good enough for court

0
0
Thumb Down

hurr durr

becaus a clever trole dosn't know how to right in a diferint tone.

becaus it work on unsuspect non-tech savvies dosn't meen it can work on ppls who r smart enuf to disguis there writin

0
0
Troll

Why use this in a court of law?

It would be far more effective to employ the SAS, and for once we would see some benefits of doing so, plus it would serve as a disincentive to others.

0
0
Silver badge

identified the author about 80 percent of the time.

That'll be for use in courts run by Aussie hopping marsupials then?

0
0
FAIL

80% accurate reliable enough for court?!

With a 200,000 sample and a limited, known user population of 158 the best they got was 20% of them wrong, yet based on this high error rate under optimal conditions, it's good enough for court?! I don't think so!

Of course people's personal style and content in a corporate environment is going to use a common, person specific format, but an 80% hit rate is nothing to rave about. Come back when you are in the high nineties for a non-MEH response.

0
0
Bronze badge
FAIL

Burden of proof

"Reliable enough to be used in courts of law.... the technique identified the author about 80 percent of the time."

80% is reliable enough for a court of law? *despair*

0
0
Big Brother

Hmmmm ...

What's to stop somebody deliberately changing style? We could have a digital copy-and-paste return to those old blackmail notes that used cut-out newspaper headlines.

The anonymous writer could just copy a "suitable" phrase from one document and one from another and so on. Would that work?

0
0

So it works on 158 authors

But what about millions?

Some people are so dumb that they surely don't have a unique writing style (in as much as they can write at all). Even their mistakes are one big exercise in groupthink.

Though I suppose those people are easy to catch anyway.

0
0
Silver badge

DOH! Reading fail...

... and there I was assuming that these guys had claimed to have tracked down the people releasing "Anonymous" email, not just "anonymous" email.

This is going to get rapidly more confusing.

0
0
Big Brother

Reliable enough?

"Engineers and computer scientists say they have devised a novel method for identifying authors of anonymous emails that's reliable enough to be used in courts of law."

Which court of law are we talking about here? IANAL, but have some experience of legal processes. In the UK, to secure a criminal conviction the burden of proof is "beyond reasonable doubt". I do not believe that this technique meets that requirement: "When finely tuned, the technique identified the author about 80 percent of the time" 80% correct = one in five incorrect. I doubt, reasonably enough.

I also doubt whether it would have much bearing on a civil case where the balance of probabilities is the test, unless backed up by other evidence.

0
0
FAIL

Beyond reasonable doubt

and not balance of Probabilities.

0
0
Bronze badge

Clarification please..

When they stated "identified the author 80 percent of the time" did they mean:

(1) Given 100 emails it identified authors for 80 of them and they were all correct - the remaining 20 were marked as "unsure"; or

(2) Given 100 emails it identified authors for 100 of them, but upon examination only 80 of them were correct.

Option 1 is a decent piece of software detective work. Option 2 is a lousy negative match rate.
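
The two readings can be told apart by reporting coverage and precision separately rather than one headline figure. A minimal sketch, assuming a classifier that returns either an author name or None for "unsure"; the summarise function and the "alice"/"bob" data are hypothetical, not anything from the researchers.

```python
def summarise(predictions, truths):
    """Split a headline 'accuracy' into coverage and precision.

    predictions: list of author names, or None where the system said "unsure".
    truths: list of the real author names, same length.
    """
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    total = len(truths)
    if total == 0:
        return {"coverage": 0.0, "precision": 0.0, "overall_accuracy": 0.0}
    # Only the emails where the system actually named someone.
    attempted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in attempted if p == t)
    return {
        "coverage": len(attempted) / total,            # share of emails it named an author for
        "precision": correct / len(attempted) if attempted else 0.0,  # share of those that were right
        "overall_accuracy": correct / total,           # the single headline figure
    }


truths = ["alice"] * 100

# Option (1): names an author for 80 emails, all correct; 20 marked unsure.
option_1 = summarise(["alice"] * 80 + [None] * 20, truths)

# Option (2): names an author for all 100 emails, 20 of them wrongly.
option_2 = summarise(["alice"] * 80 + ["bob"] * 20, truths)

print(option_1)
print(option_2)
```

Both options give the same overall accuracy of 0.8, but option (1) has precision 1.0 at coverage 0.8 while option (2) has precision 0.8 at coverage 1.0, which is exactly why the single "80 percent" figure cannot settle the question.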

4
0
Bronze badge
Boffin

Thank you

For finally asking the question that had been floating round my head since the article...

2
0
