A group of Adelaide researchers has released an open-source tool that helps identify document authorship by comparing texts. While their own test cases – and therefore the headlines – concentrated on identifying the authors of historical documents, it seems to The Register that any number of modern uses of such a tool might …
How is this different than Turnitin? Other than it is free...
Isn't the point that Turnitin identifies when texts are the same (it identifies when text A is pretty much the same as text B, probably because it's been ripped off) but that this software identifies the author of two different texts? In other words, it'll tell you if the same person (probably) wrote Twelfth Night and As You Like It, but not whether As You Like It by T Mangrove is the same as As You Like It by W Shakespeare.
Does anyone have any experience of working with Turnitin etc? Does it work?
In my experience, Turnitin is a steaming pile of horse manure that spews false positives.
Apart from describing my work as plagiarism for using the same page numbering template in the header of a word document as a student from Manchester University it enjoys highlighting my use of three common words together as a form of intellectual theft.
Anyway, I believe they are different technologies. Turnitin checks for roughly the same content, structure, sentences etc. to establish originality whereas this project assesses the style of a text to see if it matches samples of an author's known texts to establish authorship.
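To make the "roughly the same content" side of that distinction concrete, here is a minimal sketch of content matching by overlapping word sequences. It is only an illustration: the function names `shingles` and `overlap_score` are made up for this example, and nothing here is claimed to be how Turnitin actually works.

```python
import re

def shingles(text, n=3):
    """Return the set of n-word sequences ("shingles") found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # range() is empty when the text has fewer than n words
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(a, b, n=3):
    """Jaccard overlap of the two texts' shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps"
copied = "the quick brown fox sleeps"
print(overlap_score(original, copied))
```

A measure like this says nothing about who wrote either text; it only reports shared wording, which is also why a run of three common words can register as a match.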
I have yet to be really convinced of the use of Turnitin. For most of the work it is asked to do in universities, there is always going to be a high level of correlation, because the answer you want, and the references to support it, are going to be the same as thousands of other answers on the same topic. I have found it useful three times for spotting essays that are (literally) nothing but stitched-together collections of copy-and-pasted sections from available web sources (but I'm well-enough acquainted with my subject to have recognised many of them anyway), and for giving evidence to students that they *must* reference what they have used from elsewhere (though Turnitin doesn't recognise when something has been referenced, so both a properly and an improperly referenced piece of work will get the same similarity score).
Where it might be useful is for more freehand topics, such as dissertations, but I haven't done that yet.
Can it read txt?
Re: o rly?
It'll only detect crap writers ...
really great writers are masters at generating their own style ... especially in English, where someone who bothered to study it (rare, I admit) can choose from *at least* 3 sources to achieve an effect (Celtic, Saxon, Latin, French). It can be great fun to rewrite prose, changing Latin words to Saxon, or Saxon to French.
Re: It'll only detect crap writers ...
Gonna b using; grammar, and punctuation also... not just vocabulary. innit?
Even so, I agree that it's almost certainly utterly trivial to defeat, should one feel inclined. That doesn't necessarily make it completely useless. The kiddies would probably find it easier to do their own homework than carefully transcribe a friend's. I can also envisage plenty of filtering applications where the author wouldn't be attempting to conceal their identity.
Re: It'll only detect crap writers ...
"Generating their own style" is precisely what improves the accuracy of the classification models this sort of approach uses - or would be, if that characterization wasn't naive to the point of fallacy.
The point of this sort of work, which is by no means new, is to build a classifier for traits thought to be relatively invariant for a given writer. This gives you a feature vector which, to the extent the model is sensitive and accurate, uniquely identifies an author with a given probability.
It's basically what I Write Like does.
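A minimal sketch of the feature-vector idea, assuming a toy set of function words as the "relatively invariant" traits. The word list, the function names and the nearest-sample rule are all assumptions made for illustration; this is not the Adelaide tool's method nor that of I Write Like.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of function words; real studies use many more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "upon", "by", "while"]

def feature_vector(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_likely_author(disputed, known_samples):
    """Return the author whose known sample is stylistically closest to the disputed text."""
    target = feature_vector(disputed)
    return max(
        known_samples,
        key=lambda author: cosine_similarity(target, feature_vector(known_samples[author])),
    )

samples = {
    "A": "upon the hill upon the road upon the sea",
    "B": "while the sun and the moon while the stars",
}
print(most_likely_author("upon the bridge upon the tower", samples))
```

With samples this short the result means very little; the point is only that the comparison is made on habits of usage rather than on shared content.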
 Prose styles are epiphenomena heavily determined by culture, education, and personal experience; even writers who have expressly set out to create "new" styles (more common in poetry than in prose) can and have been shown to be substantially influenced by identifiable sources and full of intertextual references. This is true even of the most strikingly novel styles, as of the high modernists. Take a look at a scholarly edition of Joyce's Ulysses, for example. Really, I understand the widespread resistance to poststructuralist theory among the middlebrow, but has even structuralism failed to penetrate? I suppose it has.
 The heuristic identification of authorship by textual evidence is one of the oldest areas of textual studies. It was widely researched and practiced back in the days of the philologists around the turn of the twentieth century, and variations on it go back at least as far as the scholastics in the European tradition. Applying IT to the problem is also a well-established field.
 More precisely, the ability of the model to identify authorship with probability P is a function of the model's "recall" (the share of true matches it finds, i.e. one minus the false-negative rate) and "precision" (the share of reported matches that are correct, i.e. one minus the false-discovery rate). These are typically condensed into a single measure such as F_1, the harmonic mean of the two, which weighs recall and precision equally.
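As a worked illustration of those measures, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as their harmonic mean), here is a small sketch. The counts are invented for the example and the function name is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented counts: 8 correct attributions, 2 wrong attributions, 4 missed ones.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here precision is 8/10 and recall is 8/12, so F1 comes to 16/22, or roughly 0.727.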
I'm going to need someone to write some kind of tool to help me make all my anon writings different to my less anon stuff.
Wrong. You only need to make your threatening notes unique
I fix yoo good
Letter to the Hebrews
To compare authorship you surely need to do this in the original language - so how would a translator be involved?
Re: Letter to the Hebrews
Paul can write a letter and, because he's a popular guy, someone may translate it for another audience. Fast forward a couple thousand years and the original may no longer be around, only the translated copy. That doesn't mean he's not responsible for the translated version.
can it tell us who really wrote Bitter Sweet Symphony?
Gee, it's only been 50 years and a bit since the statistician Frederick Mosteller went to work on the authorship of the disputed Federalist Papers: http://www.thecrimson.com/article/1962/3/30/mosteller-joins-federalist-query-ptwelve-of/
This was in use at least 40 years ago for analysing Biblical texts. Other than it being open source, I don't see anything new.
Does it detect the Quran as being the word of god? (Does it think the Bible is by the same hand?)
The Art of War
Itd be prety easy to spoof by incdluding intentional characrterisics of others.
Re: The Art of War
Perhaps not spoof but certainly foil.
Reverse tool in 3..2..1..
I suspect that this would indeed pose a threat to anonymity on the Internet - it depends a bit how much data the tool needs to arrive at a sufficiently acceptable probability (having said that, if the TSA's use of probability is any measure, 1% is probably enough).
However, the real fun comes from defeating such analysis. I have no idea how that would be done, but it strikes me as an interesting exercise. Not for any nefarious reason (although it's easy to dream up some), just because :).
Re: Reverse tool in 3..2..1..
"Reverse tool in 3..2..1.."
Hmmm... AC is using erroneous two-dot ellipses ".." (should be three dots, "..."). Also using posh words "indeed" and "nefarious".
Hey Fred! How are you doing? How's your sister doing?
No words have been elided (nor does the pause imply the absence of words), so that's not an ellipsis, even when spelled with three dots. It's just a pause longer than a full stop, indicated by two full stops.
Also using posh words "indeed" and "nefarious".
Are those posh words? At best they will not appear in the Daily Mail or the Sun on account of being of 2 or more syllables..
and while I'm at it
The Federalist Papers are not part of the U.S. Constitution, though they were written in support of it, and by men much involved in writing it.
the technique which will preserve your anonymity and allow you to preserve all your sock puppets (at least for the time being) is to create your draft in your native language, mince it through one or more translators and then back into your native language. Correct the errors. Post. That's how I did the other posts on this page without anyone spotting me. Oops.
On a more semi-serious note, has anyone got around to running Shakespeare's texts through this software to see if Christopher Marlowe (or any other contenders) show up as suspects?
Re: Preserving Anonymity
has anyone got around to running Shakespeare's texts through this software to see if Christopher Marlowe (or any other contenders) show up as suspects?
Your local research library should have shelves full of books of textual scholarship on Shakespeare. Using this new software - which may not even employ any novel approaches; I haven't read the original article - would be a drop in the bucket.
In any case, contributing to the debate on the authorship of Shakespearean apocrypha like Arden of Faversham would probably be more interesting, though with most of those works (as with many attributed to Shakespeare) the canonical versions are probably the work of multiple people.