Hidden Markov Models Training – The Forward-Backward Algorithm

Learning

Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm. It is an iterative algorithm: it computes an initial estimate of the probabilities, uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.

Backward Probability

The backward probability \beta is the probability of seeing the observations from time t + 1 to the end, given that we are in state i at time t (and given the HMM \lambda):

    \begin{equation*}      \beta_t(i) = P(o_{t+1},o_{t+2}, \cdots, o_T|q_t = i,\lambda)  \end{equation*}

The state q_{t+1} = j at time t + 1 can be any of the N states.

    \begin{align*}      \beta_t(i) & = \sum_{j=1}^{N} P(o_{t+1},o_{t+2}, \cdots, o_T,q_{t+1}=j|q_t = i,\lambda) \\      & = \sum_{j=1}^{N} P(o_{t+2}, \cdots, o_T|o_{t+1},q_{t+1}=j,q_t = i,\lambda)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda)  \end{align*}

Given the state q_{t+1} = j, the observations o_{t+2}, \cdots, o_T are independent of o_{t+1} and of q_t.

    \begin{align*} \beta_t(i) & = \sum_{j=1}^{N} P(o_{t+2}, \cdots, o_T|q_{t+1}=j,\lambda)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1}|q_{t+1}=j,q_t = i,\lambda)P(q_{t+1}=j|q_t = i,\lambda)  \end{align*}

Given the state q_{t+1} = j, the observation o_{t+1} is independent of q_t.

    \begin{align*} \beta_t(i) & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1}|q_{t+1}=j,\lambda)P(q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)b_j(o_{t+1})a_{ij}  \end{align*}

Visualization


Computation of \beta_t(i) by summing all the successive values \beta_{t+1}(j), weighted by their transition probabilities a_{ij} and their observation probabilities b_j(o_{t+1}). Hidden states are in circles, observations in squares. Shaded nodes are included in the probability computation for \beta_t(i).

Initialization

    \begin{equation*} \beta_T(i) = a_{iF} \quad \quad  1 \leq i \leq N \end{equation*}

Recursion

    \begin{equation*} \beta_t(i) = \sum_{j=1}^{N} a_{ij}b_j(o_{t+1})\beta_{t+1}(j) \quad \quad  1 \leq i \leq N; 1 \leq t < T \end{equation*}

Termination

    \begin{equation*} P(O|\lambda) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}b_j(o_{1})\beta_{1}(j) \end{equation*}
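The initialization, recursion, and termination steps above can be sketched in NumPy. This is a minimal illustration, not from the text: the names `backward`, `a_0` (the a_{0j} vector), and `a_F` (the a_{iF} vector) are assumptions, and observations are assumed to be integer-coded.

```python
import numpy as np

def backward(A, B, a_0, a_F, obs):
    """Backward pass.  A: (N, N) transitions among emitting states,
    B: (N, M) emission probabilities, a_0: a_0j start transitions,
    a_F: a_iF end transitions, obs: integer-coded observation sequence."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = a_F                       # initialization: beta_T(i) = a_iF
    for t in range(T - 2, -1, -1):          # recursion, t = T-1 down to 1
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O|lambda) = sum_j a_0j * b_j(o_1) * beta_1(j)
    return beta, np.sum(a_0 * B[:, obs[0]] * beta[0])
```

For a short sequence, the returned likelihood can be checked against brute-force summation over all state paths.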

Probability \bm{\xi_t}

Define the probability \bm{\xi_t} as the probability of being in state i at time t and state j at time t + 1, given the observation sequence and the model.

    \begin{align*}  \xi_t(i, j) & = P(q_t=i,q_{t+1}=j|O,\lambda) \\ & = \frac{P(q_t=i,q_{t+1}=j,O|\lambda)}{P(O|\lambda)} \\ \end{align*}

The observation sequence O is defined as o_1, \cdots, o_t, o_{t+1}, \cdots, o_T.

    \begin{align*}  \xi_t(i, j) & = \frac{P(q_t=s_i, q_{t+1}=s_j,o_1,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j,o_{t+1}|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_{t+1}|o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j,\lambda)P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)}  \end{align*}

The observation o_{t+1}, emitted at state q_{t+1} = s_j, depends only on the state q_{t+1} = s_j. Simplifying:

    \begin{align*} \xi_t(i, j) & = \frac{P(o_{t+1}|q_{t+1}=s_j,\lambda)P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|o_1,\cdots,o_t,q_t=s_i,\lambda)P(o_1,\cdots,o_t,q_t=s_i|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|o_1,\cdots,o_t,q_t=s_i,\lambda)}{P(O|\lambda)} \end{align*}

Given the state q_t = s_i, the observations o_{t+2}, \cdots, o_T and the state q_{t+1} = s_j do not depend on any observation from o_1 to o_t. Simplifying:

    \begin{align*} \xi_t(i, j) & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|q_t=s_i,\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,q_t=s_i,\lambda)P(q_{t+1}=s_j|q_t=s_i,\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,q_t=s_i,\lambda)}{P(O|\lambda)} \end{align*}

Given the state q_{t+1} = s_j, the observations o_{t+2}, \cdots, o_T do not depend on the state q_t = s_i. Simplifying:

    \begin{align*}  \xi_t(i, j) & = \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,\lambda)}{P(O|\lambda)} \\ & =  \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}\beta_{t+1}(j)}{P(O|\lambda)} \end{align*}
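The final expression translates directly into code, assuming the forward probabilities \alpha and backward probabilities \beta have already been computed as (T, N) arrays. The function name `compute_xi` and the array layout are illustrative assumptions, not from the text.

```python
import numpy as np

def compute_xi(alpha, beta, A, B, obs):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda).
    alpha, beta: (T, N) forward/backward probabilities."""
    T, N = alpha.shape
    likelihood = np.sum(alpha[0] * beta[0])   # P(O|lambda), same at every t
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j), over all (i, j)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    return xi / likelihood
```

Since \xi_t(i, j) is a joint distribution over the state pair (i, j) given O, each xi[t] slice sums to 1.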

Visualization



Computation of the joint probability of being in state i at time t and state j at time t + 1. The figure shows the various probabilities that need to be combined to produce P(q_t=i,q_{t+1}=j,O|\lambda): the \alpha and \beta probabilities, the transition probability a_{ij}, and the observation probability b_j(o_{t+1}). Hidden states are in circles, observations in squares. Shaded nodes are included in the probability computation for \xi_t(i, j).

The probability of the observation given the model is simply the forward probability of the whole utterance (or alternatively, the backward probability of the whole utterance):

    \begin{equation*}      P(O|\lambda) = P(o_1,o_2,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda) \end{equation*}

The state q_t = j can be any of N possible states.

    \begin{align*}      P(O|\lambda) & = \sum_{j=1}^{N}P(o_1,o_2,\cdots,o_t,q_t=j,o_{t+1},\cdots,o_T|\lambda) \\ & = \sum_{j=1}^{N}P(o_{t+1},\cdots,o_T|o_1,o_2,\cdots,o_t,q_t=j,\lambda)P(o_1,o_2,\cdots,o_t,q_t=j|\lambda)   \end{align*}

Given the state q_t = j, the observations o_{t+1},\cdots,o_T do not depend on the observations o_1,o_2,\cdots,o_t.

    \begin{align*}  P(O|\lambda) & = \sum_{j=1}^{N}P(o_{t+1},\cdots,o_T|q_t=j,\lambda)P(o_1,o_2,\cdots,o_t,q_t=j|\lambda) \\ & = \sum_{j=1}^{N}\beta_t(j)\alpha_t(j) \\ \end{align*}
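This identity is easy to check numerically: \sum_j \alpha_t(j)\beta_t(j) gives the same value P(O|\lambda) at every time t. The toy model below (two emitting states, explicit start/end transition vectors a_0 for a_{0j} and a_F for a_{iF}) is an illustrative assumption.

```python
import numpy as np

A = np.array([[0.6, 0.3],
              [0.2, 0.5]])          # transitions among emitting states
a_0 = np.array([0.8, 0.2])          # a_0j: transitions out of the start state
a_F = np.array([0.1, 0.3])          # a_iF: transitions into the final state
B = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # b_j(v_k)
obs = [0, 1, 0]
T, N = len(obs), A.shape[0]

# forward pass: alpha_t(j)
alpha = np.zeros((T, N))
alpha[0] = a_0 * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# backward pass: beta_t(i)
beta = np.zeros((T, N))
beta[T - 1] = a_F
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# sum_j alpha_t(j) * beta_t(j), one value per t: all equal P(O|lambda)
likelihoods = (alpha * beta).sum(axis=1)
```

Up to floating-point rounding, every entry of `likelihoods` is the same number.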

Define the probability \gamma_t(j) as the probability of being in state j at time t, given the observation sequence and the model.

    \begin{align*}  \gamma_t(j) & = P(q_t=j|O, \lambda) \\ & = \frac{P(q_t=j,O|\lambda)}{P(O|\lambda)} \\ & = \frac{P(q_t=j,o_1,o_2,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_{t+1},\cdots,o_T|q_t=j,o_1,o_2,\cdots,o_t,\lambda)P(q_t=j,o_1,o_2,\cdots,o_t|\lambda)}{P(O|\lambda)} \\ \end{align*}

Given the state q_t = j, the observations o_{t+1},\cdots,o_T do not depend on the observations o_1,o_2,\cdots,o_t.

    \begin{align*}  \gamma_t(j) & = \frac{P(o_{t+1},\cdots,o_T|q_t=j,\lambda)P(q_t=j,o_1,o_2,\cdots,o_t|\lambda)}{P(O|\lambda)} \\  & = \frac{\beta_t(j)\alpha_t(j)}{P(O|\lambda)} \end{align*}
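In code, \gamma follows from \alpha and \beta in one step. Since P(O|\lambda) = \sum_j \alpha_t(j)\beta_t(j) at every t, dividing each row of the elementwise product by its own sum implements the formula; `compute_gamma` is an illustrative name.

```python
import numpy as np

def compute_gamma(alpha, beta):
    """gamma[t, j] = P(q_t = j | O, lambda) from (T, N) alpha and beta."""
    posterior = alpha * beta                  # alpha_t(j) * beta_t(j)
    # each row sum equals P(O|lambda), so this divides by the likelihood
    return posterior / posterior.sum(axis=1, keepdims=True)
```

Each row of the result is a distribution over the N states, so it sums to 1.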

Estimate \bm{\hat a_{ij}}

    \begin{align*}  \hat a_{ij} & = \frac{\text{expected number of transitions from state $i$ to state $j$}}{\text{expected number of transitions from state $i$}} \\ & = \frac{C(q_t=s_i,q_{t+1}=s_j)}{C(q_t=s_i)} \\ & = \frac{\text{Sum over all $t$ of $\xi$ }}{\text{Sum over all transitions out of state $i$}} \\ & = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N}\xi_t(i,k)} \\ \end{align*}

Estimate \bm{\hat b_{j}(v_k)}

    \begin{align*}  \hat b_{j}(v_k) & = \frac{\text{expected number of times in state $j$ and observing symbol $v_k$}}{\text{expected number of times in state $j$}} \\ & = \frac{C(q_t=s_j,o_t=v_k)}{C(q_t=s_j)} \\ & = \frac{\text{Sum $\gamma_t(j)$ for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$}}{\text{Sum $\gamma_t(j)$ over all time steps $t$}} \\ & = \frac{\sum_{t=1 \text{ s.t. } o_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)} \end{align*}
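The two re-estimation formulas can be sketched together, assuming xi has shape (T-1, N, N), gamma has shape (T, N), obs is integer-coded, and M is the vocabulary size. The function name `reestimate` and the array layout are illustrative assumptions.

```python
import numpy as np

def reestimate(xi, gamma, obs, M):
    """M-step: re-estimate the transition and emission probabilities."""
    N = gamma.shape[1]
    # a_hat[i, j] = sum_t xi_t(i, j) / sum_t sum_k xi_t(i, k)
    A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    # b_hat[j, v] = sum of gamma_t(j) over t with o_t = v, over sum_t gamma_t(j)
    B_hat = np.zeros((N, M))
    for t, o in enumerate(obs):
        B_hat[:, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, None]
    return A_hat, B_hat
```

Both estimates are normalized by construction: each row of `A_hat` and each row of `B_hat` sums to 1.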

The forward-backward algorithm has two steps: the expectation step, or E-step, and the maximization step, or M-step.

In the E-step, the expected state occupancy count \gamma and the expected state transition count \xi are computed from the current A and B probabilities.

In the M-step, \gamma and \xi are used to recompute new A and B probabilities.
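One full iteration (E-step plus M-step) can be sketched self-contained as below. This is an assumption-laden illustration, not a production implementation: like the text, it keeps the start/end transition vectors a_0 (a_{0j}) and a_F (a_{iF}) fixed and re-estimates only A and B.

```python
import numpy as np

def baum_welch_step(A, B, a_0, a_F, obs):
    """One forward-backward iteration; returns updated A, B and P(O|lambda)."""
    T, N = len(obs), A.shape[0]
    # forward pass
    alpha = np.zeros((T, N))
    alpha[0] = a_0 * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward pass
    beta = np.zeros((T, N))
    beta[T - 1] = a_F
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = np.sum(alpha[0] * beta[0])
    # E-step: expected counts gamma and xi
    gamma = alpha * beta / likelihood
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
                   for t in range(T - 1)]) / likelihood
    # M-step: re-estimate A and B from the expected counts
    A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_hat = np.zeros_like(B)
    for t, o in enumerate(obs):
        B_hat[:, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, None]
    return A_hat, B_hat, likelihood
```

In practice this step is repeated until the likelihood stops improving, and long sequences require scaling or log-space arithmetic to avoid underflow.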
