2.3 Word-Based Translation Modeling

Most research on statistical machine translation has focused on improving the translation model component of the Fundamental Equation (2.2). We now examine this translation model probability, P(F|E), in more detail.

The translation model probability cannot be reliably estimated by treating each sentence as a unit, due to data sparseness. Instead, the sentences are decomposed into sequences of words. In the IBM models, each word in the target sentence is aligned with the word in the source sentence that generated it. The translation in Figure 2.2 contains an example of a reordered alignment ("black cat" to "chat noir") and of a many-to-one alignment ("fish" to "le poisson").

The black cat likes fish

Le chat noir aime le poisson

Figure 2.2: Example of a word-aligned translation from English to French.
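The alignment in Figure 2.2 can be written down as a simple data structure. The sketch below is illustrative only: the variable names and the 0-based index convention are assumptions, not notation from the source. Each French word stores the position of the English word that generated it.

```python
from collections import Counter

english = ["The", "black", "cat", "likes", "fish"]
french = ["Le", "chat", "noir", "aime", "le", "poisson"]

# alignment[j] is the 0-based index of the English word aligned to french[j].
# These links are read off Figure 2.2 (hypothetical encoding).
alignment = [0, 2, 1, 3, 4, 4]

# Each link as a (French word, English word) pair.
links = [(french[j], english[i]) for j, i in enumerate(alignment)]

# English words linked to more than one French word (many-to-one alignment).
fanout = Counter(alignment)
many_to_one = [english[i] for i, n in fanout.items() if n > 1]

# Adjacent French words whose English positions are out of order (reordering).
reordered = [
    (french[j], french[j + 1])
    for j in range(len(alignment) - 1)
    if alignment[j] > alignment[j + 1]
]

print(links)
print(many_to_one)
print(reordered)
```

Storing one English index per French word mirrors the restriction that each target word is generated by a single source word, which is why "le poisson" to "fish" is representable while the reverse direction would not be.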

2.3.1 Definitions

Suppose we have E, an English sentence with I words, and F, a French sentence with J words.

According to the noisy channel model, a word alignment aligns a French word f_j with the English word e_i that generated it (Brown et al. 1993). We call A the set of word alignments that account for all words in F.

Then the probability of the alignment is the product of individual word alignment probabilities.
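One way to write this product, in a simplified notation that is an assumption here (a_j denotes the index of the English word aligned to f_j, and p(f_j | e_{a_j}) the probability of that single link), is:

```latex
P(F, A \mid E) = \prod_{j=1}^{J} p\left(f_j \mid e_{a_j}\right)
```

The full IBM models include further factors, such as sentence-length and position terms, which this simplified form leaves out.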

P(F, A|E) is called the statistical alignment model. However, there are multiple possible sets A of word alignments for a sentence pair. Therefore, to obtain the probability of a particular translation, P(F|E), we sum P(F, A|E) over the set 𝒜 of all such A.
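As a minimal sketch of this sum, assume a hypothetical toy table t of word translation probabilities and treat P(F, A|E) as a bare product of per-link probabilities (the IBM models include additional alignment terms that are omitted here). P(F|E) can then be computed by brute-force enumeration of every alignment:

```python
from itertools import product


def alignment_prob(french, english, alignment, t):
    """P(F, A | E) as a product of per-link probabilities t[(f, e)]."""
    prob = 1.0
    for j, i in enumerate(alignment):
        prob *= t.get((french[j], english[i]), 0.0)
    return prob


def translation_prob(french, english, t):
    """P(F | E): sum of P(F, A | E) over every possible alignment A."""
    positions = range(len(english))
    return sum(
        alignment_prob(french, english, a, t)
        for a in product(positions, repeat=len(french))
    )


# Hypothetical toy probabilities p(f | e); values are made up for illustration.
t = {
    ("le", "the"): 0.7,
    ("le", "fish"): 0.2,
    ("poisson", "fish"): 0.8,
    ("poisson", "the"): 0.1,
}
english = ["the", "fish"]
french = ["le", "poisson"]

best_link_prob = alignment_prob(french, english, (0, 1), t)
p = translation_prob(french, english, t)
print(best_link_prob, p)
```

With two English and two French words there are four alignments, contributing 0.07, 0.56, 0.02 and 0.16, for a total of 0.81 (up to floating-point rounding). The enumeration grows as I^J, so this form is only practical for illustration.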

2.3.2 Calculating the Translation Model Probability

P(F|E) would be simple to compute if word-aligned bilingual corpora were available. Because they are generally unavailable, word-level alignments are estimated from the sentence-aligned corpora that are accessible for many language pairs.

This alignment extraction is performed with an Expectation Maximization (EM) algorithm (Dempster et al. 1977). EM methods estimate hidden parameter values by maximizing the likelihood of a training set. In this case, EM estimates word translation probabilities (and other hidden variables of the translation probability) by selecting those that maximize the sentence alignment probability of the sentence-aligned parallel corpus. The more closely the training set represents the text the machine translation system will be used on, the more accurate the EM parameter estimates will be. However, EM is not guaranteed to find the globally optimal translation probabilities; it may converge to a local optimum.

For a training corpus of S aligned sentence pairs, the EM algorithm estimates the translation model parameters.
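The EM procedure can be sketched in a few lines. The sketch below assumes IBM Model 1, the simplest of the IBM models, in which the only parameters are word translation probabilities p(f|e) and all alignments are treated as equally likely a priori. The function name, the toy corpus, and the omission of the NULL word are assumptions made for illustration, not details taken from the source.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t[(f, e)] ~ p(f | e) from sentence-aligned pairs with EM.

    corpus: list of (english_words, french_words) pairs.
    """
    french_vocab = {f for _, fs in corpus for f in fs}
    # Start from a uniform distribution over the French vocabulary.
    t = defaultdict(lambda: 1.0 / len(french_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links involving e

        # E-step: distribute each French word over the English words
        # in its sentence, in proportion to the current estimates.
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Hypothetical three-sentence parallel corpus (English, French).
corpus = [
    (["the", "cat"], ["le", "chat"]),
    (["the", "dog"], ["le", "chien"]),
    (["a", "dog"], ["un", "chien"]),
]
t = train_model1(corpus)
for pair in [("le", "the"), ("chat", "the"), ("chien", "dog")]:
    print(pair, round(t[pair], 3))
```

No word-level alignments are supplied; the co-occurrence of "le" with "the" in two sentence pairs is enough for EM to shift probability toward that link on each iteration.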

Given the word alignment probability estimates, the IBM translation models then collectively compute the translation model P(F|E) for a word-based statistical machine translation system.

Word-based statistical machine translation models necessitated some elaborate model parameters (e.g. fertility, the likelihood of a word translating to more than one word) to account for how some real-world translations could be produced. A limitation of word-based models was their handling of reordering, null words, and non-compositional phrases (which cannot be translated as a function of their constituent words). Word-based translation models have since been improved upon by phrase-based approaches to statistical machine translation, which use larger chunks of language as their basis.
