"ROMANES EUNT DOMUS"
comprender? ==> comprendéis?
If you want to learn another language, you need to spend time in the country, talk to people, get drunk and attempt to order complex drinks, and eventually read that country's great works of literature – unless you're Google, that is. In a recent paper, three Googlers outlined a new approach to machine-based translation that …
is the requirement for an extremely large data set from which to derive the original statistical relationships, and the necessary hardware to crunch through it all. While this might work using a distributed system, I can't see it as a standalone quite yet...
In some ways it's similar to the statistical n-gram spelling correction Google researchers proposed a few years back: effective, but requiring a huge database to work.
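For the curious, the core idea behind that sort of corrector (a toy sketch, not Google's actual system -- the word counts here are made up, and the real thing needs counts from an enormous corpus) looks something like this:

```python
from collections import Counter

# Toy word-frequency table; the real approach derives these counts
# from a massive corpus of text.
COUNTS = Counter({"translation": 900, "translator": 400, "relation": 300})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the known candidate with the highest corpus count."""
    candidates = [w for w in edits1(word) | {word} if w in COUNTS]
    return max(candidates, key=COUNTS.__getitem__) if candidates else word

print(correct("tranlation"))  # "translation"
```

The only "intelligence" is the frequency table, which is exactly why it needs such a big database to work well.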
Nonetheless, interesting stuff.
You'd be surprised by the size of the dataset. It's much smaller than you would expect.
Right, where should we start - this method is not original. It's the Schliemann method. Schliemann used it to become a successful translator/trade rep for various German traders and manufacturer associations, which allowed him to amass the capital he needed to go and play at amateur archaeology and dig up the lost civilizations of Troy and Mycenae.
The reason the dataset needed is much smaller than you would expect is that most languages belong to a handful of language groups. Example - if you know one language from each Indo-European subgroup, you start understanding the whole group (even if you cannot speak all of the languages properly). A neural network can pick up the similarities very easily, so nothing surprising here.
Didn't god punish humanity by making us all speak different languages?
You don't suppose this project is going to be like the Nine Billion Names Of God when Google make that last entry in their database and everyone can speak to everyone else...
Long live the Norman conquest as well as the infiltration of Latin/Romance terms into English!
(Actually, knowing some English really helped me learn Latin and French, so thanks for that!)
(Title: yes, well, had to stay in line with Romanes eunt domus, right?)
(Icon: no vinum?... well beer it is then)
Usually with this sort of thing it's vectors of features, as detected by, well, various feature detectors that they came up with and found were suitable to describe words.
Pointless to try to describe all of them, probably, but you're right, they could have given us a couple of examples.
Pretty obscure reference, to be fair. I got what they mean by the example "king - man + woman = queen", whereas the silly "graphs" were more confusing than anything.
A vector, by definition, is simply a move through n-dimensional space.
The mind-twister here is that the "dimensionality" of a word is kind of arbitrary, because the component parts of the meaning change from word to word.
The example used of "king" (or "queen") tells us not only gender, but also the importance of the person, the nature of the constitution of the place.
The weirdest thing about vectors in a lot of AI applications is that they've mostly abandoned the idea of axes -- notice that the vector has to subtract "man" as well as adding "woman", because the system doesn't recognise the existence of a gender "axis".
Instead, we have a selection of "features" that are measurable only in terms of presence or absence.
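The much-quoted "king - man + woman = queen" arithmetic can be sketched in a few lines. To be clear, these three-component vectors are entirely made up for illustration -- real systems learn hundreds of dimensions from text, with no labelled axes at all, which is exactly why you have to subtract "man" as well as add "woman":

```python
import numpy as np

# Hand-made toy "feature" vectors; the labelled axes are purely
# illustrative -- learned embeddings have no such labels.
#                royalty, male, female
vecs = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
}

def nearest(target):
    """Word whose vector is closest (by cosine similarity) to target."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(vecs, key=lambda w: cos(vecs[w], target))

result = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(result))  # "queen"
```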
Then again, to make a really good translation, a very highly evolved AI would still be needed first (chicken-and-egg thing?). Proper language is intricately part of the whole consciousness and self-awareness thing, or whatever it is (I as a language construct). Meaning as context (per Donaldson). Without proper context evaluation, language has little meaning beyond boring scientific manuals, which are already written in English in many cases.
It's ambitious as a project and no doubt useful things might be developed but quality translations won't be one of them. Then again, many settle for dumb translations so the solution might also be to create more dumb users with dumb, modest needs and the translation machines will start working better already!
I don't think it will be that hard for Google to develop a way to analyse the context of a text. It's not that hard to tell poetry from a legal document from a love letter, and then apply the correct translation, or the translation most likely to suit that kind of document. I did read once that Google was asking users to describe the document they wanted translated - whether it was legal, poetry, formal, informal - but I've never seen this option on Google Translate.
Handling real-time voice-to-voice translation may be trickier to do.
"It's ambitious as a project and no doubt useful things might be developed but quality translations won't be one of them. Then again, many settle for dumb translations so the solution might also be to create more dumb users with dumb, modest needs and the translation machines will start working better already!"
More quality is better than less. Before Google Translate, machine translations were often literally unusable. As someone who has to communicate with speakers of foreign languages frequently (via e-mail) I don't know how I would get along without Google Translate. You apparently don't have the same business need.
"I wonder if this would help interpret written material in extinct languages where we have a few known words? Though there might not be a big enough data set."
This stuff, along with existing Google Translate technology, relies on a massive monolingual dataset as well as a smaller bilingual one. We don't have enough data.
If you start with an urban polyglot description of last night's Coronation Street (or an interchange between football pundits on Match of the Day) rather than an encyclopedia article on the history of the steam engine, does it produce reasonable copy?
ie at what point do idiom and dialect defeat google's best efforts to reduce us down to the lowest common denominator advertising receptacle?
"...at what point do idiom and dialect defeat Google's best efforts...?"
Idioms and dialects can defeat ANY translator, not only Google Translate. Even native speakers have trouble with them. As for the 'advertising receptacle' part, I partly agree, but just translating ads is not enough, as most ads need to be tailored to their audience, and that includes taking into account lots of cultural differences, not only language.
"Idioms and dialects can defeat ANY translator..."
I was once at a conference in Germany where a particularly long-winded speaker was expressing himself.
The simultaneous English translation went silent about half way through, and just as I thought it had broken, the translator came back on sounding somewhat exasperated.
"For god's sake man - get to the verb!"
"ie at what point do idiom and dialect defeat google's best efforts to reduce us down to the lowest common denominator advertising receptacle?"
Be as cynical as you want, I appreciate that Google Translate lets me communicate (not perfectly, but usually effectively) with speakers of various languages that I don't speak.
What about the lack of similarities? Take Scottish Gaelic.
No indefinite article, no words for "yes" and "no" as such, no present tenses apart from the two verbs for "to be", no verb "to have", a separate set of numbers from two to ten used only for counting people, etc.
And I guess you think Spanish grammar is the same as English? Heh... gender inflected articles and nouns, verb conjugation beyond adding an -s for third person singular, three different kinds of past tense that are all used in different circumstances than our two, a separate set of conjugations for subjunctive and conditional cases, etc.
No languages are going to be translatable just by looking up words verbatim in a dictionary. But it doesn't hurt to be able to do so.
"No indefinite article, no words for "yes" and "no" as such, no present tenses apart from the two verbs for "to be", no verb "to have","
The technique is for guessing at translations of unknown vocabulary. Normally when natural language processing guys talk about vocabulary, they're talking about words with an independent and relatively unambiguous meaning -- so-called "lexical words", eg "cat", "hamburger", "galactic". The other class of words is called "function words", and these are the grammatical glue that has next-to-no meaning outside of its context -- eg "me", "now", "would" etc. Within natural language processing, these are often not even considered "words" because they follow directly from grammatical rules, and there is very little choice when using them.
These "function words" also form a closed set -- compare the number of pronouns in any given language with the number of common nouns. It is therefore efficient to deal with these more explicitly than lexical words, and even if you're doing pure statistical translation, all of the function words in a language are likely to turn up in your training data (and if not, you haven't got enough data) -- so these are never going to be "unknown vocabulary", and the technique doesn't apply to them anyhow.
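Because function words are a closed set, you can literally enumerate them, while lexical words are open-ended. A toy sketch of that split (the word list here is a deliberately tiny, illustrative fragment, not a complete inventory):

```python
# A (partial, illustrative) closed set of English function words.
# Lexical words form an open set, so anything not in the list is
# treated as lexical by default.
FUNCTION_WORDS = {
    "i", "me", "you", "the", "a", "an", "of", "to", "in", "on",
    "is", "am", "are", "would", "now", "and", "not",
}

def split_vocab(text):
    """Partition a sentence into function words and lexical words."""
    words = text.lower().split()
    function = [w for w in words if w in FUNCTION_WORDS]
    lexical = [w for w in words if w not in FUNCTION_WORDS]
    return function, lexical

f, l = split_vocab("the cat would eat a hamburger now")
print(f)  # ['the', 'would', 'a', 'now']
print(l)  # ['cat', 'eat', 'hamburger']
```

Any word the guessing technique has to deal with will come from the open, lexical side of that split.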
To use an example of how vectors would work to translate between very different structures, consider hunger and thirst.
Say the software knows how to translate "I am hungry" to Gaelic, but doesn't know how to translate the word "thirsty" from English to Gaelic.
I am hungry -- tha an t-acras orm (lit. is the hunger on_me)
However, the system does know that the only difference between "hungry" and "thirsty" is that "hungry" is about food and "thirsty" is about drink, so the software can generate a vector (-food, +drink) that given "hungry" as its input/starting point will give "thirsty" as its output/endpoint.
Now that same vector will of course also go from "hunger" to "thirst", so it doesn't matter that the Gaelic equivalent of the phrase uses a noun instead of an adjective.
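That offset trick can be sketched in a few lines. The vectors below are made up for illustration (real embeddings are learned, with no labelled components), but the mechanics are the same: the "hungry" to "thirsty" offset, applied to "hunger", lands on "thirst":

```python
import numpy as np

# Toy vectors with made-up "need"/"food"/"drink" components,
# purely to illustrate the offset idea.
#                  need, food, drink
vecs = {
    "hungry":  np.array([1.0, 1.0, 0.0]),
    "thirsty": np.array([1.0, 0.0, 1.0]),
    "hunger":  np.array([0.5, 1.0, 0.0]),
    "thirst":  np.array([0.5, 0.0, 1.0]),
}

# The (-food, +drink) move, learned from a pair we already know.
offset = vecs["thirsty"] - vecs["hungry"]

def nearest(v):
    """Word whose vector is closest (Euclidean) to v."""
    return min(vecs, key=lambda w: np.linalg.norm(vecs[w] - v))

print(nearest(vecs["hunger"] + offset))  # "thirst"
```

The same offset works whether the endpoints are adjectives or nouns, which is what makes it useful across languages with different structures.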
Very clever stuff.
We tried using Google Translate in a commercial context for user reviews - a lot of them, in fact. The problem was that human translation was too expensive.
At the time Google was about 85% "good". The problem was the 15% "terrible". You can't put content in front of users that has a 1/7 chance of making them laugh out loud for the wrong reasons.
Second point is speculative - will Google apply a discount or penalty to sites that use Google Translate to pretend they have original content in more languages, like user reviews?
So you have a list of words or phrases that mean the same thing. That list represents a concept. Somewhere out in database land that list has an identifier. What happens if you start using the list identifier to communicate? Maybe Google should work up a human-usable symbology for these. Lingua Google...
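A hypothetical sketch of that "Lingua Google" idea -- the concept IDs, word lists, and helper names below are all invented for illustration; nothing like this is described in the paper:

```python
# Hypothetical concept table: each identifier stands for one meaning,
# and every surface form (in any language) maps back to it.
CONCEPTS = {
    "C001": {"hello", "hi", "bonjour", "hallo", "hola"},
    "C002": {"thanks", "merci", "danke", "gracias"},
}

# Invert the table so we can go from a word to its concept ID.
WORD_TO_ID = {w: cid for cid, words in CONCEPTS.items() for w in words}

def encode(word):
    """Turn a surface form into its concept identifier."""
    return WORD_TO_ID.get(word.lower(), "?")

def decode(cid, known_words):
    """Pick whichever surface form of the concept the receiver knows."""
    matches = CONCEPTS.get(cid, set()) & known_words
    return min(matches) if matches else "?"

cid = encode("bonjour")
print(cid)                               # "C001"
print(decode(cid, {"hola", "gracias"}))  # "hola"
```

Communicating in the identifiers themselves would be the "human-usable symbology" - each party decodes back into whatever language they know.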