"ROMANES EUNT DOMUS"
comprender? ==> comprendéis?
If you want to learn another language, you need to spend time in the country, talk to people, get drunk and attempt to order complex drinks, and eventually read that country's great works of literature – unless you're Google, that is. In a recent paper, three Googlers outlined a new approach to machine-based translation that …
comprender? ==> comprendéis?
si, yo comprendo
Shirley: "Romani domum ite" (as a declarative), based on Latin sentence structure.
is the requirement for an extremely large data set from which to derive the original statistical relationships, and the necessary hardware to crunch through it all. While this might work using a distributed system, I can't see it as a standalone quite yet...
In some ways it's similar to the statistical n-gram spelling correction Google researchers proposed a few years back; effective, but requiring a huge database to work.
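The flavour of frequency-based correction being referred to can be sketched in a few lines. This is a minimal illustration in the style of the well-known Norvig spelling corrector, not Google's actual system; the toy corpus and word list are made up, and a real deployment would use web-scale counts:

```python
from collections import Counter

# Toy corpus frequencies; a real system would use web-scale counts.
WORDS = Counter("the cat sat on the mat the cat ate".split())

def edits1(word):
    """All strings one edit (delete/swap/replace/insert) away from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Pick the known candidate with the highest corpus frequency."""
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct("teh"))   # picks "the" as the most frequent close match
```

The point stands either way: the cleverness is cheap, and the heavy lifting is done by the sheer size of the frequency table.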
Nonetheless, interesting stuff.
Good thing they have a huge distributed system, then, I guess?
You would be surprised at the size of the dataset. It is much smaller than you would expect.
Right, where should we start? This method is not original. It's the Schliemann method. Schliemann used it to become a successful translator/trade rep for various German traders and manufacturer associations. This allowed him to collect the capital needed to go and play at amateur archaeology and discover the lost civilizations of Troy and Mycenae.
The reason the dataset needed is much smaller than you would expect is that most languages belong to a handful of language groups. For example, if you know one language from each Indo-European subgroup, you start understanding the whole group (even if you cannot speak all of the languages properly). A neural network can pick up the similarities very easily, so nothing surprising here.
... they are trying to become a .gov, in the Nineteen Eighty-Four sense.
Evil, evil stuff, no matter how you look at it.
Yeah, translating languages - what utter evilness. They should be shot.
Didn't god punish humanity by making us all speak different languages?
You don't suppose this project is going to be like the Nine Billion Names Of God when Google make that last entry in their database and everyone can speak to everyone else...
They have a solution. Google Stars. Somehow it makes sense.
When I'm stuck for a Spanish word, I usually take the English word and add an -o or -a (or -ado, -ando or -amente, depending on context) to the end of it. Most of the time it works out.
Long live the Norman conquest as well as the infiltration of Latin/Romance terms into English!
(Actually, knowing some English really helped me learn Latin and French, so thanks for that!)
(Title: yes, well, had to stay in line with Romanes eunt domus, right?)
(Icon: no vinum?... well beer it is then)
"Romano goando homeamente"
hmm, no, that doesn't really work does it?
... and this is why Google Translate works so well.
> this is why Google Translate works so well
It's not that bad. For some languages, e.g. Spanish and German, you get a decent translation. Some other languages, like Chinese, still have quite a bit to go, though.
Understandable, yes. Decent, I beg to differ. There's still a fair way to go yet.
Have you guys read The Culture books by Iain Banks?
Every time I read about Google's projects, I'm reminded of The Culture.
Is it me, or does the article just start babbling about vectors without explaining what the vectors are meant to represent? It's probably in the papers the Googlers published, but who RTFAs when you can complain?
Usually with this sort of thing it's vectors of features, as detected by, well, various feature detectors that they came up with and found were suitable to describe words.
Pointless to try to describe all of them, probably, but you're right, they could have given us a couple of examples.
Pretty obscure reference, to be fair. I got what they meant from the example "king - man + woman = queen", whereas the silly "graphs" were more confusing than anything.
A vector, by definition, is simply a displacement through n-dimensional space.
The mind-twister here is that the "dimensionality" of a word is kind of arbitrary, because the component parts of the meaning change from word to word.
The example used of "king" (or "queen") tells us not only the gender, but also the importance of the person and the nature of the constitution of the place.
The weirdest thing about vectors in a lot of AI applications is that they've mostly abandoned the idea of axes -- notice that the vector has to subtract "man" as well as add "woman", because the system doesn't recognise the existence of a gender "axis".
Instead, we have a selection of "features" that are measurable only in terms of presence or absence.
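The "king - man + woman = queen" arithmetic described above can be sketched with toy vectors. Note the embeddings here are hand-made for illustration, with labelled dimensions (royalty, maleness, femaleness); real learned vectors have hundreds of unlabelled dimensions derived from co-occurrence statistics:

```python
import numpy as np

# Hand-made toy embeddings. Real systems learn these from text, and the
# dimensions carry no fixed labels; columns here are (royalty, maleness,
# femaleness) purely so the arithmetic is easy to follow.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(v, exclude=()):
    """Word whose vector has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], v))

# king - man + woman lands nearest to queen.
target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # -> queen
```

Excluding the query words themselves is standard practice, since the input word is usually its own nearest neighbour.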
Ah the Gender axis. I remember a sketch a very long time ago (Smith & Jones era?) where someone asked if they had any kids, said "Yes, three - one of each; boy, girl & hairdresser".
However your point about axes is spot on.
Then again, to make a really good translation a very highly evolved AI would still be needed first (chick-egg thing?). Proper language is intricately part of the whole consciousness and self-awareness thing or whatever it is (I as language construct). Meaning as context (per Donaldson). Without proper context evaluation language has little meaning beyond boring scientific manuals which are already written in English in many cases.
It's ambitious as a project and no doubt useful things might be developed but quality translations won't be one of them. Then again, many settle for dumb translations so the solution might also be to create more dumb users with dumb, modest needs and the translation machines will start working better already!
I don't think it will be that hard for Google to develop a way to analyse the context of a text. It's not that hard to tell poetry from a legal document from a love letter, and then apply the correct translation, or the translation most likely to be used for that kind of document. I did read once that Google was asking users to describe the document they wanted translated, whether it was legal, poetry, formal or informal, but I've never seen this option on Google Translate.
Handling real time voice to voice translation may be trickier to do.
"It's ambitious as a project and no doubt useful things might be developed but quality translations won't be one of them. Then again, many settle for dumb translations so the solution might also be to create more dumb users with dumb, modest needs and the translation machines will start working better already!"
More quality is better than less. Before Google Translate, machine translations were often literally unusable. As someone who has to communicate with speakers of foreign languages frequently (via e-mail) I don't know how I would get along without Google Translate. You apparently don't have the same business need.
Computers can be really good at determining context. Remember that IBM's 'Watson' won Jeopardy, which is pretty much all about context.
I wonder if this would help interpret written material in extinct languages where we have a few known words? Though there might not be a big enough data set.
I can imagine that they could run extinct languages against many different modern languages to get some useful information about etymology that humans wouldn't otherwise be able to work out because nobody speaks that many languages.
"I wonder if this would help interpret written material in extinct languages where we have a few known words? Though there might not be a big enough data set."
This stuff, along with existing Google Translate technology, relies on a massive monolingual dataset as well as a smaller bilingual one. We don't have enough data.
This seems to be putting Semantic back into the Statistical
of Statistical Machine Translation
my hovercraft is full of eels.
Hearing that, my nipples explode with delight!
If you start with an urban polyglot description of last night's Coronation Street (or an interchange between football pundits on Match of the Day) rather than an encyclopedia article on the history of the steam engine, does it produce reasonable copy?
ie at what point do idiom and dialect defeat google's best efforts to reduce us down to the lowest common denominator advertising receptacle?
"...at what point do idiom and dialect defeat Google's best efforts...?"
Idioms and dialects can defeat ANY translator, not only Google Translate. Even native speakers have trouble with them. As for the 'advertising receptacle' part, I partly agree, but just translating ads is not enough, as most ads need to be tailored for their audience, and that includes taking into account lots of cultural differences, not only language.
"ie at what point do idiom and dialect defeat google's best efforts to reduce us down to the lowest common denominator advertising receptacle?"
Be as cynical as you want, I appreciate that Google Translate lets me communicate (not perfectly, but usually effectively) with speakers of various languages that I don't speak.
"Idioms and dialects can defeat ANY translator..."
I was once at a conference in Germany where a particularly long-winded speaker was expressing himself.
The simultaneous English translation went silent about half way through, and just as I thought it had broken, the translator came back on sounding somewhat exasperated.
"For god's sake man - get to the verb!"
"The Belgian minister has just made a joke - it would be polite to laugh."
Anything that works for vocabulary can work for idioms, which are as often as not nothing more than multiword "words".
What about the lack of similarities? Take Scottish Gaelic.
No indefinite article, no words for "yes" and "no" as such, no present tenses apart from the two verbs for "to be", no verb "to have", a separate set of numbers from two to ten used only for counting people, etc.
And I guess you think Spanish grammar is the same as English? Heh... gender inflected articles and nouns, verb conjugation beyond adding an -s for third person singular, three different kinds of past tense that are all used in different circumstances than our two, a separate set of conjugations for subjunctive and conditional cases, etc.
No languages are going to be translatable just by looking up words verbatim in a dictionary. But it doesn't hurt to be able to do so.
"No indefinite article, no words for "yes" and "no" as such, no present tenses apart from the two verbs for "to be", no verb "to have","
The technique is for guessing at translations of unknown vocabulary. Normally when natural language processing guys talk about vocabulary, they're talking about words with an independent and relatively unambiguous meaning -- so-called "lexical words", eg "cat", "hamburger", "galactic". The other class of words is called "function words", and these are the grammatical glue that has next-to-no meaning outside of its context -- eg "me", "now", "would" etc. Within natural language processing, these are often not even considered "words" because they follow directly from grammatical rules, and there is very little choice when using them.
These "function words" also form a closed set -- compare the number of pronouns in any given language with the number of common nouns. It is therefore efficient to deal with these more explicitly than with lexical words, and even if you're doing pure statistical translation, all of the function words in a language are likely to turn up in your training data (and if not, you've not got enough data) -- so these things are never going to be "unknown vocabulary", and the technique doesn't apply to them anyhow.
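The closed-set point above is easy to demonstrate: a tiny list of function words covers a large fraction of running text, which is why they always appear in any reasonable training corpus. A toy illustration (the sentence and word list are invented for the example):

```python
from collections import Counter

# Toy illustration: a small closed class of "function words" covers a
# large share of the tokens in running text, so they are guaranteed to
# appear in any training corpus of reasonable size.
text = ("the cat sat on the mat and the dog sat by the door "
        "so i gave it a bone and it ate the bone").split()

FUNCTION_WORDS = {"the", "on", "and", "by", "so", "i", "it", "a"}

counts = Counter(text)
covered = sum(n for w, n in counts.items() if w in FUNCTION_WORDS)
print(f"{covered}/{len(text)} tokens are function words")
```

Here 8 distinct function words account for well over half the tokens, while the lexical words ("cat", "bone", "door"...) form the long, open-ended tail where unknown vocabulary actually lives.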
To use an example of how vectors would work to translate between very different structures, consider hunger and thirst.
Say the software knows how to translate "I am hungry" to Gaelic, but doesn't know how to translate the word "thirsty" from English to Gaelic.
I am hungry -- tha an t-acras oirm (lit. is the hunger on_me)
However, the system does know that the only difference between "hungry" and "thirsty" is that "hungry" is about food and "thirsty" is about drink, so the software can generate a vector (-food, +drink) that given "hungry" as its input/starting point will give "thirsty" as its output/endpoint.
Now that same vector will of course also go from "hunger" to "thirst", so it doesn't matter that the Gaelic equivalent of the phrase uses a noun instead of an adjective.
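The hungry/thirsty transfer described above can be sketched as a nearest-neighbour lookup in a shared vector space. The embeddings are invented for illustration, with labelled dimensions (bodily need, food, drink), which learned vectors would not have; the Gaelic forms "acras" (hunger) and "pathadh" (thirst) are taken from the phrases in the comment:

```python
import numpy as np

# Toy embeddings in a shared space; dimensions (need, food, drink) are
# invented for illustration -- learned vectors are not labelled like this.
en = {"hungry":  np.array([1.0, 1.0, 0.0]),
      "thirsty": np.array([1.0, 0.0, 1.0])}
gd = {"acras":   np.array([1.0, 1.0, 0.0]),   # hunger (noun)
      "pathadh": np.array([1.0, 0.0, 1.0])}   # thirst (noun)

# Derive the (-food, +drink) offset in English...
offset = en["thirsty"] - en["hungry"]

# ...then apply it to the known Gaelic word and take the nearest neighbour.
def nearest(vocab, v):
    return min(vocab, key=lambda w: np.linalg.norm(vocab[w] - v))

print(nearest(gd, gd["acras"] + offset))  # -> pathadh
```

Because the offset is applied in vector space rather than to surface forms, it doesn't matter that English uses an adjective ("thirsty") where the Gaelic idiom uses a noun ("pathadh").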
Very clever stuff.
At least I'll still have FaceBing's hilarious "translations" (in the way McDonalds is "food") for a good giggle.
Sigh, why did Microsoft have to invest in Facebook?
I'm tired of copying and pasting stuff from Facebook into Google Translate in order to understand it.
But will it be effective when our future alien overlords are descending from orbit?
We tried using Google Translate in a commercial context for user reviews - a lot of them, in fact. The problem was that human translation was too expensive.
At the time Google was about 85% "good". The problem was the 15% "terrible". You can't put content in front of users that has a 1/7 chance of making them laugh out loud for the wrong reasons.
Second point is speculative - will Google apply a discount or penalty to sites that use Google Translate to pretend they have original content in more languages, like user reviews?
So you have a list of words or phrases that mean the same thing. That list represents a concept. Somewhere out in database land that list has an identifier. What happens if you start using the list identifier to communicate? Maybe Google should work up a human-usable symbology for these. Lingua Google...
can it translate YouTube comments into English?
sharpening his bat'leth..