N-gram Language Models

Models that assign probabilities to sequences of words are called language models. An n-gram is a sequence of n words: a 2-gram (or bigram) is a two-word sequence like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence like “please turn your” or “turn your homework”.


To represent the probability of a particular random variable X_i taking on the value “the”, or P(X_i = \text{"the"}), use the simplification P(\text{the}).

A sequence of n words is represented either as w_1, \cdots , w_n or w_1^n.

Probability of a Sequence

For the joint probability of each word in a sequence having a particular value P(X = w_1, Y = w_2, Z = w_3, \cdots , W = w_n) use P(w_1,w_2, \cdots , w_n).

    \begin{equation*} 	P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1}) = \prod_{k=1}^{n}P(w_k|w_1^{k-1}) \end{equation*}

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, approximate the history by just the last few words.


The bigram model, for example, approximates the probability of a word given all the previous words P(w_n|w_1^{n-1}) by using only the conditional probability of the preceding word P(w_n|w_{n-1}).

    \begin{equation*} 	P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-1}) \end{equation*}

This assumption that the probability of a word depends only on the previous word is also called a Markov assumption.

Thus, for a bigram

    \begin{equation*} 	P(w_1^n) \approx \prod_{k=1}^{n}P(w_k|w_{k-1}) \end{equation*}
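Under the bigram assumption, a sentence probability is just the product of the conditional probabilities of each word given its predecessor. A minimal sketch in Python (the bigram probabilities below are made-up illustrative values, not estimates from any corpus):

```python
# Hypothetical bigram probabilities P(w_k | w_{k-1}); values are illustrative only.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def sentence_prob(words, probs):
    """Multiply P(w_k | w_{k-1}) over the sentence (bigram approximation)."""
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= probs[(prev, cur)]
    return p

p = sentence_prob(["<s>", "i", "want", "food", "</s>"], bigram_prob)
print(p)  # 0.25 * 0.33 * 0.05 * 0.68
```

The sentence-boundary markers `<s>` and `</s>` let the first and last words participate in bigrams.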


The general N-gram model (which looks N - 1 words into the past) approximates the probability as follows

    \begin{equation*} 	P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-(N-1)}^{n-1}) \end{equation*}

Parameter Estimation

Maximum likelihood estimation (MLE) is used to compute the parameters of an n-gram model. This process involves getting counts from a corpus and normalizing the counts so that they lie between 0 and 1.

To compute a particular bigram probability of a word y given a previous word x, compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x

    \begin{equation*} 	P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_wC(w_{n-1}w)} \end{equation*}

The sum of all bigram counts that start with a given word w_{n-1} must be equal to the unigram count for that word w_{n-1}.

    \begin{equation*} 	P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} \end{equation*}
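This bigram MLE can be computed directly from corpus counts. A sketch over a tiny toy corpus (the sentences are illustrative):

```python
from collections import Counter

# Toy corpus of pre-tokenized sentences with boundary markers (illustrative).
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
    ["<s>", "i", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (prev, cur) for sent in corpus for prev, cur in zip(sent, sent[1:])
)

def bigram_mle(prev, cur):
    """P(cur | prev) = C(prev cur) / C(prev)."""
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(bigram_mle("<s>", "i"))  # C(<s> i) = 2, C(<s>) = 3 -> 2/3
print(bigram_mle("i", "am"))   # C(i am) = 2, C(i) = 3 -> 2/3
```

Note that normalizing by the unigram count C(w_{n-1}) is exactly the simplification in the equation above: every bigram starting with w_{n-1} contributes to that unigram count.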

For the general case of MLE N-gram parameter estimation:

    \begin{equation*} 	P(w_n|w_{n-(N-1)}^{n-1}) = \frac{C(w_{n-(N-1)}^{n-1}w_n)}{C(w_{n-(N-1)}^{n-1})} \end{equation*}


Perplexity

The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. The higher the conditional probability of the word sequence, the lower the perplexity. Consider the test set W = w_1w_2 \cdots w_N.

    \begin{equation*} 	PP(W) = P(w_1w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1w_2 \cdots w_N)}} = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1 \cdots w_{i-1})}}  \end{equation*}

For a bigram language model

    \begin{equation*} 	PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}} \end{equation*}
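In practice perplexity is computed in log space, since multiplying many small probabilities underflows. A sketch using the same illustrative bigram probabilities as before (the values are made up):

```python
import math

# Hypothetical bigram probabilities; values are illustrative only.
bigram_prob = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "food"): 0.05,
    ("food", "</s>"): 0.68,
}

def perplexity(words, probs):
    """PP(W) = P(W)^(-1/N), computed in log space for numerical stability."""
    log_p = sum(
        math.log(probs[(prev, cur)]) for prev, cur in zip(words, words[1:])
    )
    n = len(words) - 1  # number of predicted tokens
    return math.exp(-log_p / n)

pp = perplexity(["<s>", "i", "want", "food", "</s>"], bigram_prob)
print(pp)
```

Summing logs and exponentiating at the end is algebraically identical to taking the Nth root of the inverse product in the equation above.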

The more information the n-gram gives about the word sequence, the lower the perplexity.


Smoothing

Smoothing is a technique that prevents a language model from assigning zero probability to unseen events.

The unsmoothed Maximum Likelihood Estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N.

    \begin{equation*} 	P(w_i) = \frac{c_i}{N} \end{equation*}

The Laplace (add-one) smoothing technique adds one to all the counts before normalizing. Since there are V words in the vocabulary and each count is incremented, the denominator must be adjusted to account for the extra V observations.

    \begin{equation*} 	P_{\text{Laplace}}(w_i) = \frac{c_{i}+1}{N+V} \end{equation*}

Comparing the two equations above gives an adjusted count c_i^*, defined so that dividing it by N recovers the smoothed estimate

    \begin{equation*} 	c_i^* = (c_i + 1)\frac{N}{N + V} \end{equation*}

The adjusted count c_i^* describes how the smoothing affects the numerator.

The bigram probabilities follow the same technique

    \begin{align*} 	P_{\text{Laplace}}(w_n|w_{n-1}) &= \frac{C(w_{n-1}w_n) + 1}{\sum_w(C(w_{n-1}w) + 1)} = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V} \\ 	c^*(w_n|w_{n-1}) &= \frac{[C(w_{n-1}w_n) + 1] \cdot C(w_{n-1})}{C(w_{n-1}) + V} \end{align*}

An extension of the above technique, called add-k smoothing, adds k instead of 1.

    \begin{equation*} 	P_{\text{Add-k}}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV} \end{equation*}
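The effect of add-k smoothing is easiest to see on an unseen bigram, which would get zero probability under the MLE. A sketch over a toy corpus (sentences and vocabulary are illustrative):

```python
from collections import Counter

# Toy corpus with boundary markers (illustrative).
corpus = [
    ["<s>", "i", "am", "sam", "</s>"],
    ["<s>", "sam", "i", "am", "</s>"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (prev, cur) for sent in corpus for prev, cur in zip(sent, sent[1:])
)
vocab_size = len(unigram_counts)  # V = 5 here: <s>, i, am, sam, </s>

def add_k_prob(prev, cur, k=1.0):
    """P_add-k(cur | prev) = (C(prev cur) + k) / (C(prev) + k * V)."""
    return (bigram_counts[(prev, cur)] + k) / (unigram_counts[prev] + k * vocab_size)

# The bigram "sam sam" never occurs, yet it now gets nonzero probability:
print(add_k_prob("sam", "sam"))  # (0 + 1) / (2 + 5) = 1/7
print(add_k_prob("i", "am"))     # (2 + 1) / (2 + 5) = 3/7
```

With k = 1 this reduces to Laplace smoothing; smaller k values discount the seen counts less aggressively.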


Interpolation

Interpolation is an approach that mixes the probability estimates from all the n-gram estimators. The trigram, bigram, and unigram estimates are weighted and combined. An example of simple linear interpolation is given below

    \begin{equation*} 	\hat P(w_n|w_{n-2}w_{n-1}) = \lambda_1P(w_n|w_{n-2}w_{n-1}) + \lambda_2P(w_n|w_{n-1}) + \lambda_3P(w_n) \end{equation*}

such that

    \begin{equation*} 	\sum_{i}\lambda_i = 1 \end{equation*}
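The interpolated estimate is a simple weighted average of the three component estimates. A sketch with illustrative lambda weights and made-up component probabilities (in practice the lambdas are tuned on held-out data):

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    The weights must sum to 1 so the result remains a valid probability.
    """
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "lambdas must sum to 1"
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Illustrative component estimates for a single word in context:
print(interpolated_prob(0.4, 0.2, 0.05))  # 0.5*0.4 + 0.3*0.2 + 0.2*0.05 = 0.27
```

Because each component is a probability and the weights sum to 1, the mixture is itself a valid probability, and it never collapses to zero as long as the unigram estimate is nonzero.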
