Wednesday, December 06, 2006

Me Translate Pretty One Day

There appears to be a breakthrough in machine language translation. Anyone who has used Babelfish knows its problems. A company with a massive dictionary and a smarter approach is producing programs that do better. The recent improvement has been to apply statistical analysis to the output of good translators and learn their rules, as opposed to coding the rules of grammar for each language.
In the most promising method to emerge from the work, called statistical-based MT, algorithms analyze large collections of previous translations, or what are technically called parallel corpora (sessions of the European Union, say, or newswire copy) to divine the statistical probabilities of words and phrases in one language ending up as particular words or phrases in another. A model is then built on those probabilities and used to evaluate new text. A slew of researchers took up IBM's insights, and by the turn of the 21st century the quality of statistical MT research systems had drawn even with five decades of rule-based work.
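The core idea, estimating how likely a source word is to come out as a given target word from aligned sentence pairs, can be sketched in a few lines. This is a toy illustration only: the corpus and function names are invented for the example, and real systems (IBM's models, for instance) refine such counts iteratively rather than using raw co-occurrence.

```python
from collections import defaultdict

# A toy parallel corpus of (source, target) sentence pairs, standing in for
# the EU-proceedings or newswire corpora the article mentions.
corpus = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("la maison bleue", "the blue house"),
]

def cooccurrence_probs(pairs):
    """Estimate p(target_word | source_word) from raw co-occurrence counts.

    Every source word in a pair is counted against every target word in
    the same pair; counts are then normalized per source word.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for src, tgt in pairs:
        for s in src.split():
            for t in tgt.split():
                counts[s][t] += 1.0
    return {s: {t: c / sum(ts.values()) for t, c in ts.items()}
            for s, ts in counts.items()}

probs = cooccurrence_probs(corpus)
# "fleur" co-occurs with "the" and "flower" once each, so each gets 0.5;
# more data would push probability mass toward the true translation.
```

Even this crude estimate starts to separate likely translations from unlikely ones, which is why sheer corpus size matters so much to the approach.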

Since then, researchers have tweaked their algorithms and the Web has spawned an explosion of available parallel text, turning the competition into a rout. The lopsidedness is best seen in the results from the annual MT evaluation put on by the National Institute of Standards and Technology (NIST), which uses a measurement called the BiLingual Evaluation Understudy (BLEU) scale to assess a system's performance in Chinese and Arabic against human translation. A high-quality human translator will likely score between 0.7 and 0.85 out of a possible 1 on the BLEU scale. In 2005, Google's stat-based system topped the NIST evaluation in both Arabic (at 0.51) and Chinese (at 0.35). Systran, the most prominent rule-based system still in operation, languished at 0.11 for Arabic and 0.15 for Chinese.
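For readers curious what a BLEU number actually measures: it is, roughly, a geometric mean of n-gram precisions against a human reference, scaled by a penalty for overly short output. Below is a minimal single-reference sketch of that idea (the real NIST evaluation uses multiple references and corpus-level statistics, so this is illustrative, not the official metric):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, with counts."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference translation.

    Geometric mean of modified n-gram precisions (n = 1..max_n),
    times a brevity penalty for candidates shorter than the reference.
    """
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clip candidate counts by reference counts ("modified" precision),
        # so repeating a correct word cannot inflate the score.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_prec_sum += math.log(overlap / total)
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec_sum / max_n)
```

A perfect match scores 1.0, which is why the human range of 0.7-0.85 and Google's 0.51 for Arabic frame how far the machines still had to go.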
But the new company has come up with an even better machine translation system. It pairs the largest bilingual dictionaries in the world with a new, simple algorithm run over huge databases of language as actually used. It is a fairly simple method of producing a translation, but it requires large databases and fast computers.


Argentum said...

I wonder why the developers of the statistical-based machine translator have chosen such exotic languages as Chinese or Arabic. There are more parallel texts in English and German, or in English and French, and these directions of translation are also very popular. In fact, there are many MT systems for European languages, and the translation they provide, although not quite accurate, is generally acceptable. It's unlikely that the statistical approach could improve the quality of translation, since the rules-based online translators (Babelfish, Altavista and so on) are able to provide more than 50% correct translations.

Gary said...

I think Chinese and Arabic were chosen because they are more difficult, being unrelated to Western languages.