Statistical language modeling

This chapter is an introduction to statistical language models. A statistical language model is a probabilistic way to capture regularities of a particular language, in the form of word-order constraints. In other words, a statistical language model expresses the likelihood that a sequence of words is a fluent sequence in a particular language. Plausible sequences of words are given high probabilities whereas nonsensical ones are given low probabilities.

We briefly mentioned statistical language models in Chapter 2; here we explain them in more detail, with the goal of motivating the use of factored language models for statistical machine translation. Section 3.1 explains the n-gram model and associated smoothing techniques that are the basis for statistical language modeling. A linguistically-motivated approach to language modeling, probabilistic context-free grammars, is briefly described in section 3.2. Section 3.3 details the language model and smoothing scheme used in this thesis: Factored Language Models with Generalized Parallel Backoff. Section 3.4 mentions a useful method for generating factored language models. Finally, Section 3.5 describes the metrics for evaluating statistical language models.

3.1 The N-gram Language Model

The most widespread statistical language model, the n-gram model, was proposed by Jelinek and Mercer (Bahl et al. 1983) and has proved to be simple and robust. Much like phrase-based statistical machine translation, the n-gram language model has dominated the field since its introduction despite disregarding any inherent linguistic properties of the language being modeled. Language is reduced to a sequence of arbitrary symbols with no deep structure or meaning – yet this simplification works.

The n-gram model makes the simplifying assumption that the nth word w depends only on the history h, which consists of the n – 1 preceding words. By neglecting the leading terms of the history, it models language as a Markov chain of order n – 1.
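In standard notation (written here as a sketch for concreteness; the exact formulation is not reproduced from the thesis), the chain rule expands the probability of a word sequence, and the n-gram model truncates each conditioning history to the most recent n – 1 words:

\[
P(w_1, \dots, w_m) \;=\; \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \;\approx\; \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
\]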

The value of n trades off the stability of the estimate (i.e. its variance) against its appropriateness (i.e. bias) (Rosenfeld 2000). A high n provides a more accurate model, but a low n provides more reliable estimates. Despite the simplicity of the n-gram language model, obtaining accurate n-gram probabilities can be difficult because of data sparseness. Given infinite amounts of relevant data, the next word following a given history could be reasonably predicted with just the maximum likelihood estimate (MLE).
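In its usual form (stated here for concreteness; the notation is an assumption rather than a quotation from the thesis), the MLE of the next word w given the history h is

\[
P_{\text{MLE}}(w \mid h) \;=\; \frac{\text{Count}(h, w)}{\text{Count}(h)}
\]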

The Count function simply measures the number of times something was observed in the training corpus. However, for a corpus of any size there will be a value of n beyond which n-grams occur too infrequently to be estimated reliably. Because of this, trigrams are a common choice for n-gram language models based on multi-million-word corpora. Rosenfeld (1996) reports that for a language model trained on 38 million words, one third of test trigrams drawn from the same domain had not been seen in training. By comparison, this thesis uses the Europarl corpus, which is a standard parallel text, yet contains only 14 million words in each language. Rosenfeld (1996) furthermore showed that the majority of observed trigrams appeared only once in the training corpus.

Because the value of n tends to be chosen high enough to improve model accuracy, computing n-gram probabilities directly from counts with the MLE is likely to be highly inaccurate.
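As an illustration of the estimation procedure and the sparseness problem described above, the following is a minimal sketch in Python, assuming a whitespace-tokenized training corpus; the function names, padding symbols, and example sentences are illustrative only and are not taken from any particular toolkit.

from collections import Counter

def train_trigram_mle(sentences):
    """Count trigrams and their bigram histories from tokenized sentences."""
    trigram_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        # Pad with pseudo-tokens so the first real words also have a full history.
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            trigram_counts[history + (padded[i],)] += 1
            bigram_counts[history] += 1
    return trigram_counts, bigram_counts

def trigram_mle(word, history, trigram_counts, bigram_counts):
    """P_MLE(word | history) = Count(history, word) / Count(history)."""
    if bigram_counts[history] == 0:
        return 0.0  # unseen history: the MLE is undefined, returned as 0 here
    return trigram_counts[history + (word,)] / bigram_counts[history]

# Any trigram absent from the training data receives probability 0,
# which is why smoothing is needed in practice.
corpus = [["the", "cat", "sat"], ["the", "cat", "slept"]]
tri, bi = train_trigram_mle(corpus)
print(trigram_mle("sat", ("the", "cat"), tri, bi))     # 0.5
print(trigram_mle("barked", ("the", "cat"), tri, bi))  # 0.0

The zero probabilities assigned to unseen but perfectly plausible trigrams are the motivation for the smoothing techniques discussed in this section.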

