2.2 Contemporary Translation Modeling Overview
Machine translation is a hard problem in part because there is a trade-off between the methods with which we would ideally like to translate and those that we can readily compute. There are several main strategies for attacking the translation problem, but most of them remain out of practical reach.
Warren Weaver viewed languages as:
“[…] tall closed towers, all erected over a common foundation. Thus it may be true that the way to translate from Chinese to Arabic […] is not to attempt the direct route […]. Perhaps the way is to descend, from each language, down to the common base of human communication – the real but as yet undiscovered universal language […].” (Weaver 1949/1955)
Weaver’s notion of decoding linguistic utterances through the fundamental knowledge representation formalism of an interlingua, while ideal, is utterly impractical given the current state of Linguistics and Natural Language Processing. Other ambitious methods include semantic transfer approaches, in which the meaning of the source sentence is derived via semantic parsing and the translation is then generated from that meaning representation. Such approaches would help in cases where a literal translation is too unclear, but their main obstacle is the lack of semantically annotated parallel corpora.
Next are syntactic transfer approaches, such as that of Yamada and Knight (2001), in which the source sentence is first parsed and then transformed into a syntactic tree in the target language; the translated sentence is generated from this tree. Syntax-based approaches produce correctly ordered translations but may miss the semantics of the sentence. The main obstacle, however, is again the requirement for parsing tools, or more precisely the funding to develop them, a requirement that is not yet met for many languages.
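To make the transfer step concrete, the following is a minimal sketch in the spirit of such syntax-based systems: a source parse tree is transformed by reordering the children of each node according to a small rule table and translating the leaf words with a lexicon. The node class, rule table, and toy English–Japanese lexicon are illustrative assumptions, not part of Yamada and Knight's actual model, which also handles word insertion and is trained probabilistically.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    word: str | None = None            # set only on leaf nodes

# Toy child-reordering rules: (parent label, source child order) -> target order.
# Here a single rule turns English verb-object order into Japanese object-verb order.
REORDER_RULES = {("VP", ("VB", "NP")): ("NP", "VB")}
LEXICON = {"he": "kare", "reads": "yomu", "books": "hon"}

def transfer(node: Node) -> Node:
    """Recursively reorder children and translate leaves of a source parse tree."""
    if node.word is not None:                      # leaf: translate the word
        return Node(node.label, word=LEXICON.get(node.word, node.word))
    order = tuple(c.label for c in node.children)
    children = node.children
    rule = REORDER_RULES.get((node.label, order))
    if rule is not None:                           # a reordering rule matches this node
        by_label = {c.label: c for c in node.children}
        children = [by_label[lab] for lab in rule]
    return Node(node.label, [transfer(c) for c in children])

def yield_words(node: Node) -> list:
    """Read the translated sentence off the leaves of the transferred tree."""
    if node.word is not None:
        return [node.word]
    return [w for c in node.children for w in yield_words(c)]

src = Node("S", [Node("NP", word="he"),
                 Node("VP", [Node("VB", word="reads"), Node("NP", word="books")])])
print(yield_words(transfer(src)))                  # ['kare', 'hon', 'yomu']
```

Even this toy version shows why such approaches depend on a parser: without the source-side tree there is nothing for the reordering rules to operate on, and the sketch also omits the insertion of function words that a real system would need.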
Word-level translation models adopt a simple approach. They translate the source sentence word for word into the target language and then reorder the words until they form the most plausible target sentence. This is the essence of the IBM translation models. Word-level models have the disadvantage of failing to capture semantic or syntactic information from the source sentence, which degrades translation quality. Their great advantage, however, is that they can be trained readily from available parallel data and do not require further linguistic tools that may not exist for a given language pair. Because of this, word-level translation models form the basis of statistical machine translation.
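As an illustration of how such models are trained from parallel data alone, the following sketch implements expectation-maximisation estimation of the lexical translation probabilities t(f | e) of IBM Model 1, the simplest of the IBM models. The function name, corpus format, and toy sentence pairs are chosen here for illustration; a real training run would of course use a large parallel corpus.

```python
from collections import defaultdict

NULL = "<null>"   # empty word allowing source words to align to nothing

def train_ibm_model1(corpus, iterations=10):
    """EM estimation of t(f | e) for IBM Model 1.

    corpus: list of (e_sentence, f_sentence) token-list pairs,
    with e the target-side (English) and f the source-side (foreign) sentence.
    """
    f_vocab = {f for _, fs in corpus for f in fs}
    uniform = 1.0 / len(f_vocab)
    t = defaultdict(lambda: uniform)        # t[(e, f)] = t(f | e), uniform at start
    for _ in range(iterations):
        count = defaultdict(float)          # expected counts c(e, f)
        total = defaultdict(float)          # expected counts c(e)
        for es, fs in corpus:
            es_null = [NULL] + es
            for f in fs:
                # E-step: distribute the alignment mass of f over all e positions
                z = sum(t[(e, f)] for e in es_null)
                for e in es_null:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[e] += c
        # M-step: renormalise the expected counts into probabilities
        t = defaultdict(lambda: uniform,
                        {ef: count[ef] / total[ef[0]] for ef in count})
    return t

corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"]),
          (["a", "book"], ["ein", "buch"])]
t = train_ibm_model1(corpus)
print(round(t[("house", "haus")], 3))       # approaches 1.0 as EM converges
```

The sketch makes the trade-off described above explicit: the only inputs are tokenised sentence pairs, with no parser, semantic annotation, or other linguistic tool involved, but the model itself knows nothing about syntax or meaning beyond word co-occurrence.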