Kevin Knight and Daniel Marcu—research associate professors of computer science in the USC Viterbi School of Engineering and senior research scientists in the USC Information Sciences Institute—are breaking new ground in the field of natural language translation.

Statistical approach to natural language translation

Knight and Marcu develop mathematical models and learning algorithms for natural language applications, with a focus on machine translation. They start with vast amounts of previously translated documents and automatically learn knowledge that can be used to translate arbitrary texts.

The algorithms they use evaluate billions of possible translations to find the ones that make the most sense, and billions of possible configurations of translation parameters to uncover hidden structures specific to the translation processes they model. The HPCC supercomputer provides the computing power this research requires.
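To make the idea of searching for the translation that makes the most sense more concrete, here is a minimal Python sketch that ranks a few candidate translations by combining a word translation score with a fluency score. Everything in it is an assumption made for illustration: the probability tables, the `score` and `best_translation` names, and the tiny Spanish-English example are hypothetical and are not Knight and Marcu's models. A real system searches billions of candidates rather than three.

```python
import math

# Hypothetical word translation probabilities P(source_word | target_word).
TRANSLATION_PROB = {
    ("la", "the"): 0.9,
    ("blanca", "white"): 0.9,
    ("casa", "house"): 0.8,
    ("casa", "home"): 0.2,
}

# Hypothetical fluency probabilities P(word | previous_word) for English.
BIGRAM_PROB = {
    ("<s>", "the"): 0.5,
    ("the", "white"): 0.1,
    ("the", "house"): 0.2,
    ("white", "house"): 0.3,
    ("white", "home"): 0.01,
    ("house", "white"): 0.001,
}

FLOOR = 1e-6  # small probability assigned to anything not in the tables


def score(candidate):
    """Return a log score for one candidate translation.

    A candidate is a list of (target_word, source_word) pairs in target order.
    The score adds the log fluency of the target word sequence to the log
    probability that each target word accounts for its source word.
    """
    total = 0.0
    previous = "<s>"  # sentence-start marker
    for target_word, source_word in candidate:
        total += math.log(BIGRAM_PROB.get((previous, target_word), FLOOR))
        total += math.log(TRANSLATION_PROB.get((source_word, target_word), FLOOR))
        previous = target_word
    return total


def best_translation(candidates):
    """Pick the highest-scoring candidate."""
    return max(candidates, key=score)


# Three hypothetical candidate translations of "la casa blanca".
CANDIDATES = [
    [("the", "la"), ("house", "casa"), ("white", "blanca")],
    [("the", "la"), ("white", "blanca"), ("home", "casa")],
    [("the", "la"), ("white", "blanca"), ("house", "casa")],
]

best = best_translation(CANDIDATES)
best_words = [target_word for target_word, _ in best]
```

With these made-up tables, `best_words` holds the candidate whose word choice and word order both score well. Working in log probabilities avoids numerical underflow when many small probabilities are multiplied.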

Whereas most existing translation software uses hand-coded rules for transposing words and phrases, new software developed by Knight and Marcu takes a statistical approach, building probabilistic rules about words, phrases and syntactic structures.

The pair founded a company called Language Weaver in Los Angeles to sell the software as an automated translation tool. They already offer technology that can translate Arabic, Chinese, French or Spanish into English and vice versa.

The keys to their statistical machine translation software are the translation dictionaries, patterns and rules that the program develops. It builds them by generating many candidate translation parameters from previously translated documents and then ranking them probabilistically.
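One way to picture how a translation dictionary can be learned from previously translated documents is the sketch below. It uses expectation-maximization in the style of IBM Model 1, a standard textbook technique chosen here purely for illustration; the article does not say which models Knight and Marcu's software uses. The function names, the three-sentence corpus and the iteration count are all hypothetical.

```python
from collections import defaultdict


def train_word_probs(pairs, iterations=20):
    """Estimate P(source_word | target_word) from aligned sentence pairs.

    pairs is a list of (source_words, target_words) tuples. No word-level
    alignment is given; EM infers it from co-occurrence across sentences.
    """
    source_vocab = {f for source, _ in pairs for f in source}
    # Start from a uniform guess for every (source, target) word pair.
    t = defaultdict(lambda: 1.0 / len(source_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        # E-step: share each source word among the target words in its sentence.
        for source, target in pairs:
            for f in source:
                norm = sum(t[(f, e)] for e in target)
                for e in target:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: re-estimate probabilities from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return dict(t)


def ranked_sources(t, target_word):
    """List source words for one target word, most probable first."""
    options = [(f, p) for (f, e), p in t.items() if e == target_word]
    return sorted(options, key=lambda item: item[1], reverse=True)


# A tiny hypothetical parallel corpus (Spanish source, English target).
PAIRS = [
    ("la casa".split(), "the house".split()),
    ("la mesa".split(), "the table".split()),
    ("una mesa".split(), "a table".split()),
]

probs = train_word_probs(PAIRS)
top_for_house = ranked_sources(probs, "house")[0][0]
```

Even though "house" appears only alongside both "la" and "casa", the other sentences show that "la" is better explained by "the", so the probability mass for "house" shifts toward "casa". Production systems apply the same kind of reasoning to phrases and syntactic structures over far larger corpora.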

This new approach to machine learning may eventually give computers the ability to produce genuine insights into the structures of various languages, leading to discoveries about linguistics that only a machine could produce by crunching through billions of words.

The translated documents used to teach the translation algorithms can be electronic, on paper, or even in audio files. The system produces higher-accuracy translations than other methods and is better suited to tackling less common languages and the unusual vocabularies found in specialized or technical texts.

