N-gram Language Models

Models that assign probabilities to sequences of words are called language models. An n-gram is a sequence of n words: a 2-gram (or bigram) is a two-word sequence like “please turn”, “turn your”, or “your homework”, and a 3-gram (or trigram) is a three-word sequence like “please turn your” or “turn your homework”.

Notation

To represent the probability of a particular random variable X_i taking on the value “the”, or P(X_i = \text{"the"}), use the simplification P(\text{the}).

A sequence of n words is represented either as w_1, \cdots , w_n or as w_1^n.

Probability of a Sequence

For the joint probability of each word in a sequence having a particular value P(X = w_1, Y = w_2, Z = w_3, \cdots , W = w_n) use P(w_1,w_2, \cdots , w_n).

    \begin{equation*} 	P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2) \cdots P(w_n|w_1^{n-1}) = \prod_{k=1}^{n}P(w_k|w_1^{k-1}) \end{equation*}
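
For example, applying the chain rule to the four-word sequence “please turn your homework” (built from the n-gram examples above) gives

    \begin{equation*} 	P(\text{please turn your homework}) = P(\text{please})P(\text{turn}|\text{please})P(\text{your}|\text{please turn})P(\text{homework}|\text{please turn your}) \end{equation*}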

The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, approximate the history by just the last few words.

Bigram

The bigram model, for example, approximates the probability of a word given all the previous words P(w_n|w_1^{n-1}) by using only the conditional probability of the preceding word P(w_n|w_{n-1}).

    \begin{equation*} 	P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-1}) \end{equation*}

This assumption that the probability of a word depends only on the previous word is also called a Markov assumption.

Thus, for the bigram model, the probability of a complete word sequence is approximated as

    \begin{equation*} 	P(w_1^n) \approx \prod_{k=1}^{n}P(w_k|w_{k-1}) \end{equation*}
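
A minimal sketch of how this product might be computed in code, assuming a hypothetical dictionary bigram_prob that maps (previous word, word) pairs to probabilities estimated elsewhere; a start symbol <s> stands in for w_0 so the first word has a conditioning context.

    # Minimal sketch: score a word sequence under a bigram model.
    # Assumes a hypothetical dict bigram_prob mapping (prev_word, word) -> probability,
    # with a start symbol <s> standing in for w_0.
    def bigram_sequence_probability(words, bigram_prob):
        """Return P(w_1 ... w_n) under the bigram approximation."""
        padded = ["<s>"] + words
        prob = 1.0
        for prev, curr in zip(padded, padded[1:]):
            prob *= bigram_prob.get((prev, curr), 0.0)  # unseen bigrams get probability 0 (no smoothing)
        return prob

    # Usage with made-up probabilities:
    bigram_prob = {("<s>", "please"): 0.2, ("please", "turn"): 0.5,
                   ("turn", "your"): 0.4, ("your", "homework"): 0.3}
    print(bigram_sequence_probability(["please", "turn", "your", "homework"], bigram_prob))  # 0.012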

N-gram

The N-gram model (which looks N - 1 words into the past) approximates this probability as follows

    \begin{equation*} 	P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-(N-1)}^{n-1}) \end{equation*}
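
For example, with N = 3 (a trigram model), each word is conditioned on the two preceding words:

    \begin{equation*} 	P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-2}^{n-1}) = P(w_n|w_{n-2}w_{n-1}) \end{equation*}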

Parameter Estimation

Maximum likelihood estimation (MLE) is used to compute the parameters of an n-gram model. This process involves getting counts from a corpus and normalizing them so that they lie between 0 and 1.

To compute a particular bigram probability of a word y given a previous word x, compute the count of the bigram C(xy) and normalize it by the sum of the counts of all bigrams that share the same first word x.

    \begin{equation*} 	P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_wC(w_{n-1}w)} \end{equation*}

The sum of all bigram counts that start with a given word w_{n-1} must be equal to the unigram count of that word, so the estimate simplifies to

    \begin{equation*} 	P(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n)}{C(w_{n-1})} \end{equation*}
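
A minimal sketch of this estimation on a toy corpus, with whitespace-tokenized sentences and a hypothetical start symbol <s> playing the role of w_0; the corpus itself is made up.

    # Minimal sketch: maximum likelihood bigram estimation from a toy corpus.
    from collections import Counter

    corpus = ["please turn your homework", "please turn in your homework"]  # made-up data

    bigram_counts = Counter()   # C(w_{n-1} w_n)
    unigram_counts = Counter()  # C(w_{n-1})
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        for prev, curr in zip(tokens, tokens[1:]):
            bigram_counts[(prev, curr)] += 1
            unigram_counts[prev] += 1

    def bigram_mle(prev, curr):
        """P(curr | prev) = C(prev curr) / C(prev)."""
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, curr)] / unigram_counts[prev]

    print(bigram_mle("please", "turn"))  # 1.0: "turn" always follows "please" in this corpus
    print(bigram_mle("turn", "your"))    # 0.5: "your" follows "turn" in one of the two cases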

For the general case of MLE N-gram parameter estimation:

    \begin{equation*} 	P(w_n|w_{n-(N-1)}^{n-1}) = \frac{C(w_{n-(N-1)}^{n-1}w_n)}{C(w_{n-(N-1)}^{n-1})} \end{equation*}
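
The same counting scheme extends to any N by using (N - 1)-tuples of words as contexts; a sketch, again under the hypothetical <s>-padding assumption:

    # Minimal sketch: MLE estimation for a general N-gram model.
    # Contexts are (N-1)-tuples; <s> padding supplies the missing history at sentence starts.
    from collections import Counter

    def train_ngram_mle(corpus, N):
        """Return p(word, context) = C(context + word) / C(context)."""
        ngram_counts, context_counts = Counter(), Counter()
        for sentence in corpus:
            tokens = ["<s>"] * (N - 1) + sentence.split()
            for i in range(N - 1, len(tokens)):
                context = tuple(tokens[i - N + 1:i])
                ngram_counts[context + (tokens[i],)] += 1
                context_counts[context] += 1

        def p(word, context):
            if context_counts[context] == 0:
                return 0.0
            return ngram_counts[context + (word,)] / context_counts[context]

        return p

    p = train_ngram_mle(["please turn your homework", "please turn in your homework"], N=3)
    print(p("your", ("please", "turn")))  # 0.5 on this made-up corpus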

Perplexity

The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. The higher the conditional probability of the word sequence, the lower the perplexity. Consider the test set W = w_1w_2 \cdots w_N.

    \begin{equation*} 	PP(W) = P(w_1w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1w_2 \cdots w_N)}} = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1 \cdots w_{i-1})}}  \end{equation*}

For a bigram language model

    \begin{equation*} 	PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_{i-1})}} \end{equation*}
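
A sketch of the bigram perplexity computation, working in log space to avoid numerical underflow on long sequences; bigram_prob is a hypothetical dictionary of already-estimated probabilities, <s> stands in for w_0, and any unseen bigram makes the perplexity infinite (which is why smoothing, below, matters).

    # Minimal sketch: perplexity of a test set under a bigram model.
    import math

    def bigram_perplexity(test_words, bigram_prob):
        """PP(W) = (product over i of 1 / P(w_i | w_{i-1})) ** (1/N) for the N test words."""
        padded = ["<s>"] + test_words
        log_prob = 0.0
        for prev, curr in zip(padded, padded[1:]):
            p = bigram_prob.get((prev, curr), 0.0)
            if p == 0.0:
                return float("inf")  # an unseen bigram zeroes the whole test-set probability
            log_prob += math.log(p)
        return math.exp(-log_prob / len(test_words))

    # Usage with made-up probabilities:
    bigram_prob = {("<s>", "please"): 0.2, ("please", "turn"): 0.5,
                   ("turn", "your"): 0.4, ("your", "homework"): 0.3}
    print(bigram_perplexity(["please", "turn", "your", "homework"], bigram_prob))  # about 3.02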

The more information the n-gram gives about the word sequence, the lower the perplexity.

Smoothing

Smoothing is a technique that prevents a language model from assigning zero probability to unseen events.

The unsmoothed maximum likelihood estimate of the unigram probability of a word w_i is its count c_i normalized by the total number of word tokens N.

    \begin{equation*} 	P(w_i) = \frac{c_i}{N} \end{equation*}

The Laplace (add-one) smoothing technique adds one to all the counts. In the unigram case, since there are V words in the vocabulary and each count was incremented, the denominator needs to be adjusted to take into account the extra V observations.

    \begin{equation*} 	P_{\text{Laplace}}(w_i) = \frac{c_{i}+1}{N+V} \end{equation*}

Comparing the two equations above

    \begin{equation*} 	c_i^* = (c_i + 1)\frac{N}{N + V} \end{equation*}

The adjusted count c_i^* describes the effect of the smoothing in terms of the numerator alone: dividing c_i^* by the original denominator N gives back the smoothed probability, so smoothing can be seen as discounting the raw counts.
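
For instance, with a hypothetical corpus of N = 100000 tokens and a vocabulary of V = 10000 word types, a word observed c_i = 9 times gets

    \begin{equation*} 	c_i^* = (9 + 1)\frac{100000}{100000 + 10000} \approx 9.09, \qquad P_{\text{Laplace}}(w_i) = \frac{9 + 1}{100000 + 10000} \approx 9.09 \times 10^{-5} \end{equation*}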

The bigram probabilities follow the same technique

    \begin{align*} 	P_{\text{Laplace}}(w_n|w_{n-1}) &= \frac{C(w_{n-1}w_n) + 1}{\sum_w(C(w_{n-1}w) + 1)} = \frac{C(w_{n-1}w_n) + 1}{C(w_{n-1}) + V} \\ 	c^*(w_n|w_{n-1}) &= \frac{[C(w_{n-1}w_n) + 1] \cdot C(w_{n-1})}{C(w_{n-1}) + V} \end{align*}

An extension of this technique, called add-k smoothing, adds k (often a fraction such as 0.5 or 0.05) instead of 1.

    \begin{equation*} 	P_{\text{Add-k}}(w_n|w_{n-1}) = \frac{C(w_{n-1}w_n) + k}{C(w_{n-1}) + kV} \end{equation*}
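
A sketch of the smoothed bigram estimate as a function of k, with k = 1 recovering Laplace smoothing; the counts and vocabulary size below are made up for illustration.

    # Minimal sketch: add-k smoothed bigram probability (k = 1 gives Laplace smoothing).
    # bigram_counts and unigram_counts would come from the counting step above;
    # the tiny counts here are hypothetical.
    def add_k_bigram(prev, curr, bigram_counts, unigram_counts, vocab_size, k=1.0):
        """P(curr | prev) = (C(prev curr) + k) / (C(prev) + k * V)."""
        return (bigram_counts.get((prev, curr), 0) + k) / (unigram_counts.get(prev, 0) + k * vocab_size)

    bigram_counts = {("please", "turn"): 2}
    unigram_counts = {"please": 2}
    V = 10  # hypothetical vocabulary size

    print(add_k_bigram("please", "turn", bigram_counts, unigram_counts, V))         # (2+1)/(2+10) = 0.25
    print(add_k_bigram("please", "homework", bigram_counts, unigram_counts, V))     # (0+1)/(2+10) ≈ 0.083
    print(add_k_bigram("please", "turn", bigram_counts, unigram_counts, V, k=0.5))  # (2+0.5)/(2+5) ≈ 0.357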

Interpolation

Interpolation is an approach that mixes the probability estimates from all the n-gram estimators. The trigram, bigram, and unigram estimates are weighted and combined. An example of simple linear interpolation is given below

    \begin{equation*} 	\hat P(w_n|w_{n-2}w_{n-1}) = \lambda_1P(w_n|w_{n-2}w_{n-1}) + \lambda_2P(w_n|w_{n-1}) + \lambda_3P(w_n) \end{equation*}

such that

    \begin{equation*} 	\sum_{i}\lambda_i = 1 \end{equation*}
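
A sketch of simple linear interpolation, assuming the component probabilities P(w_n|w_{n-2}w_{n-1}), P(w_n|w_{n-1}), and P(w_n) have already been estimated (they are passed in as plain numbers here); the lambda weights are made up and, in practice, are typically tuned on held-out data.

    # Minimal sketch: simple linear interpolation of trigram, bigram, and unigram estimates.
    def interpolate(p_trigram, p_bigram, p_unigram, lambdas=(0.6, 0.3, 0.1)):
        """Return lambda_1 * P(w|u,v) + lambda_2 * P(w|v) + lambda_3 * P(w)."""
        l1, l2, l3 = lambdas
        assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "interpolation weights must sum to 1"
        return l1 * p_trigram + l2 * p_bigram + l3 * p_unigram

    # Hypothetical component estimates for a single word in context:
    print(interpolate(p_trigram=0.5, p_bigram=0.4, p_unigram=0.01))  # 0.6*0.5 + 0.3*0.4 + 0.1*0.01 = 0.421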
