Hidden Markov Models Training – The Forward-Backward Algorithm

Learning

Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

The standard algorithm for HMM training is the forward-backward, or Baum-Welch, algorithm. It is an iterative algorithm: it computes an initial estimate of the probabilities, uses those estimates to compute a better estimate, and so on, iteratively improving the probabilities that it learns.

Backward Probability

The backward probability \beta is the probability of seeing the observations from time t + 1 to the end, given that we are in state i at time t (and given the HMM \lambda):

    \begin{equation*}      \beta_t(i) = P(o_{t+1},o_{t+2}, \cdots, o_T|q_t = i,\lambda)  \end{equation*}

The state q_{t+1} = j at time t + 1 can be any of the N states.

    \begin{align*}      \beta_t(i) & = \sum_{j=1}^{N} P(o_{t+1},o_{t+2}, \cdots, o_T,q_{t+1}=j|q_t = i,\lambda) \\      & = \sum_{j=1}^{N} P(o_{t+2}, \cdots, o_T|o_{t+1},q_{t+1}=j,q_t = i,\lambda)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda)  \end{align*}

Given the state q_{t+1} = j, the observations o_{t+2}, \cdots, o_T are independent of o_{t+1} and of q_t.

    \begin{align*} \beta_t(i) & = \sum_{j=1}^{N} P(o_{t+2}, \cdots, o_T|q_{t+1}=j,\lambda)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1},q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1}|q_{t+1}=j,q_t = i,\lambda)P(q_{t+1}=j|q_t = i,\lambda)  \end{align*}

Given the state q_{t+1} = j, the observation o_{t+1} is independent of q_t.

    \begin{align*} \beta_t(i) & = \sum_{j=1}^{N} \beta_{t+1}(j)P(o_{t+1}|q_{t+1}=j,\lambda)P(q_{t+1}=j|q_t = i,\lambda) \\ & = \sum_{j=1}^{N} \beta_{t+1}(j)b_j(o_{t+1})a_{ij}  \end{align*}

Visualization


Computation of \beta_t(i) by summing all the successive values \beta_{t+1}(j), weighted by their transition probabilities a_{ij} and their observation probabilities b_j(o_{t+1}). Hidden states are in circles, observations in squares. Shaded nodes are included in the probability computation for \beta_t(i).

Initialization

    \begin{equation*} \beta_T(i) = a_{iF} \quad \quad  1 \leq i \leq N \end{equation*}

Recursion

    \begin{equation*} \beta_t(i) = \sum_{j=1}^{N} a_{ij}b_j(o_{t+1})\beta_{t+1}(j) \quad \quad  1 \leq i \leq N; 1 \leq t < T \end{equation*}

Termination

    \begin{equation*} P(O|\lambda) = \beta_1(q_0) = \sum_{j=1}^{N} a_{0j}b_j(o_{1})\beta_{1}(j) \end{equation*}
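The initialization, recursion, and termination steps above can be sketched in NumPy. This is a minimal illustration, not from the text: the names `backward`, `a_0` (the a_{0j} vector), and `a_F` (the a_{iF} vector) are assumptions, and observations are assumed to be integer-coded.

```python
import numpy as np

def backward(A, B, a_0, a_F, obs):
    """Backward pass.  A: (N, N) transitions among emitting states,
    B: (N, M) emission probabilities, a_0: a_0j start transitions,
    a_F: a_iF end transitions, obs: integer-coded observation sequence."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[T - 1] = a_F                       # initialization: beta_T(i) = a_iF
    for t in range(T - 2, -1, -1):          # recursion, t = T-1 down to 1
        # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # termination: P(O|lambda) = sum_j a_0j * b_j(o_1) * beta_1(j)
    return beta, np.sum(a_0 * B[:, obs[0]] * beta[0])
```

For a short sequence, the returned likelihood can be checked against brute-force summation over all state paths.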

Probability \bm{\xi_t}

Define the probability \bm{\xi_t} as the probability of being in state i at time t and state j at time t + 1, given the observation sequence and the model.

    \begin{align*}  \xi_t(i, j) & = P(q_t=i,q_{t+1}=j|O,\lambda) \\ & = \frac{P(q_t=i,q_{t+1}=j,O|\lambda)}{P(O|\lambda)} \\ \end{align*}

The observation sequence O is defined as o_1, \cdots, o_t, o_{t+1}, \cdots, o_T.

    \begin{align*}  \xi_t(i, j) & = \frac{P(q_t=s_i, q_{t+1}=s_j,o_1,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j,o_{t+1}|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_{t+1}|o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j,\lambda)P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)}  \end{align*}

The observation o_{t+1}, emitted at state q_{t+1} = s_j, depends only on the state q_{t+1} = s_j. Simplifying:

    \begin{align*} \xi_t(i, j) & = \frac{P(o_{t+1}|q_{t+1}=s_j,\lambda)P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})P(o_1,\cdots,o_t,q_t=s_i,o_{t+2},\cdots,o_T,q_{t+1}=s_j|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|o_1,\cdots,o_t,q_t=s_i,\lambda)P(o_1,\cdots,o_t,q_t=s_i|\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|o_1,\cdots,o_t,q_t=s_i,\lambda)}{P(O|\lambda)} \end{align*}

Given the state q_t = s_i, the observations o_{t+2}, \cdots, o_T and the state q_{t+1} = s_j do not depend on any observation from o_1 to o_t. Simplifying:

    \begin{align*} \xi_t(i, j) & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T,q_{t+1}=s_j|q_t=s_i,\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,q_t=s_i,\lambda)P(q_{t+1}=s_j|q_t=s_i,\lambda)}{P(O|\lambda)} \\ & = \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,q_t=s_i,\lambda)}{P(O|\lambda)} \end{align*}

Given the state q_{t+1} = s_j, the observations o_{t+2}, \cdots, o_T do not depend on the state q_t = s_i. Simplifying:

    \begin{align*}  \xi_t(i, j) & = \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}P(o_{t+2},\cdots,o_T|q_{t+1}=s_j,\lambda)}{P(O|\lambda)} \\ & =  \frac{b_j(o_{t+1})\alpha_t(i)a_{ij}\beta_{t+1}(j)}{P(O|\lambda)} \end{align*}
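The final expression translates directly into code, assuming the forward probabilities \alpha and backward probabilities \beta have already been computed as (T, N) arrays. The function name `compute_xi` and the array layout are illustrative assumptions, not from the text.

```python
import numpy as np

def compute_xi(alpha, beta, A, B, obs):
    """xi[t, i, j] = P(q_t = i, q_{t+1} = j | O, lambda).
    alpha, beta: (T, N) forward/backward probabilities."""
    T, N = alpha.shape
    likelihood = np.sum(alpha[0] * beta[0])   # P(O|lambda), same at every t
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j), over all (i, j)
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
    return xi / likelihood
```

Since \xi_t(i, j) is a joint distribution over the state pair (i, j) given O, each xi[t] slice sums to 1.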

Visualization



Computation of the joint probability of being in state i at time t and state j at time t + 1. The figure shows the various probabilities that need to be combined to produce P(q_t=i,q_{t+1}=j,O|\lambda): the \alpha and \beta probabilities, the transition probability a_{ij}, and the observation probability b_j(o_{t+1}). Hidden states are in circles, observations in squares. Shaded nodes are included in the probability computation for \xi_t(i, j).

The probability of the observation given the model is simply the forward probability of the whole utterance (or alternatively, the backward probability of the whole utterance):

    \begin{equation*}      P(O|\lambda) = P(o_1,o_2,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda) \end{equation*}

The state q_t = j can be any of N possible states.

    \begin{align*}      P(O|\lambda) & = \sum_{j=1}^{N}P(o_1,o_2,\cdots,o_t,q_t=j,o_{t+1},\cdots,o_T|\lambda) \\ & = \sum_{j=1}^{N}P(o_{t+1},\cdots,o_T|o_1,o_2,\cdots,o_t,q_t=j,\lambda)P(o_1,o_2,\cdots,o_t,q_t=j|\lambda)   \end{align*}

Given the state q_t = j, the observations o_{t+1},\cdots,o_T do not depend on the observations o_1,o_2,\cdots,o_t.

    \begin{align*}  P(O|\lambda) & = \sum_{j=1}^{N}P(o_{t+1},\cdots,o_T|q_t=j,\lambda)P(o_1,o_2,\cdots,o_t,q_t=j|\lambda) \\ & = \sum_{j=1}^{N}\beta_t(j)\alpha_t(j) \\ \end{align*}
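This identity is easy to check numerically: \sum_j \alpha_t(j)\beta_t(j) gives the same value P(O|\lambda) at every time t. The toy model below (two emitting states, explicit start/end transition vectors a_0 for a_{0j} and a_F for a_{iF}) is an illustrative assumption.

```python
import numpy as np

A = np.array([[0.6, 0.3],
              [0.2, 0.5]])          # transitions among emitting states
a_0 = np.array([0.8, 0.2])          # a_0j: transitions out of the start state
a_F = np.array([0.1, 0.3])          # a_iF: transitions into the final state
B = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # b_j(v_k)
obs = [0, 1, 0]
T, N = len(obs), A.shape[0]

# forward pass: alpha_t(j)
alpha = np.zeros((T, N))
alpha[0] = a_0 * B[:, obs[0]]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

# backward pass: beta_t(i)
beta = np.zeros((T, N))
beta[T - 1] = a_F
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

# sum_j alpha_t(j) * beta_t(j), one value per t: all equal P(O|lambda)
likelihoods = (alpha * beta).sum(axis=1)
```

Up to floating-point rounding, every entry of `likelihoods` is the same number.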

Define the probability \gamma_t(j) as the probability of being in state j at time t, given the observation sequence and the model.

    \begin{align*}  \gamma_t(j) & = P(q_t=j|O, \lambda) \\ & = \frac{P(q_t=j,O|\lambda)}{P(O|\lambda)} \\ & = \frac{P(q_t=j,o_1,o_2,\cdots,o_t,o_{t+1},\cdots,o_T|\lambda)}{P(O|\lambda)} \\ & = \frac{P(o_{t+1},\cdots,o_T|q_t=j,o_1,o_2,\cdots,o_t,\lambda)P(q_t=j,o_1,o_2,\cdots,o_t|\lambda)}{P(O|\lambda)} \\ \end{align*}

Given the state q_t = j, the observations o_{t+1},\cdots,o_T do not depend on the observations o_1,o_2,\cdots,o_t.

    \begin{align*}  \gamma_t(j) & = \frac{P(o_{t+1},\cdots,o_T|q_t=j,\lambda)P(q_t=j,o_1,o_2,\cdots,o_t|\lambda)}{P(O|\lambda)} \\  & = \frac{\beta_t(j)\alpha_t(j)}{P(O|\lambda)} \end{align*}
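In code, \gamma follows from \alpha and \beta in one step. Since P(O|\lambda) = \sum_j \alpha_t(j)\beta_t(j) at every t, dividing each row of the elementwise product by its own sum implements the formula; `compute_gamma` is an illustrative name.

```python
import numpy as np

def compute_gamma(alpha, beta):
    """gamma[t, j] = P(q_t = j | O, lambda) from (T, N) alpha and beta."""
    posterior = alpha * beta                  # alpha_t(j) * beta_t(j)
    # each row sum equals P(O|lambda), so this divides by the likelihood
    return posterior / posterior.sum(axis=1, keepdims=True)
```

Each row of the result is a distribution over the N states, so it sums to 1.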

Estimate \bm{\hat a_{ij}}

    \begin{align*}  \hat a_{ij} & = \frac{\text{expected number of transitions from state $i$ to state $j$}}{\text{expected number of transitions from state $i$}} \\ & = \frac{C(q_t=s_i,q_{t+1}=s_j)}{C(q_t=s_i)} \\ & = \frac{\text{Sum over all $t$ of $\xi$ }}{\text{Sum over all transitions out of state $i$}} \\ & = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\sum_{k=1}^{N}\xi_t(i,k)} \\ \end{align*}

Estimate \bm{\hat b_{j}(v_k)}

    \begin{align*}  \hat b_{j}(v_k) & = \frac{\text{expected number of times in state $j$ and observing symbol $v_k$}}{\text{expected number of times in state $j$}} \\ & = \frac{C(q_t=s_j,o_t=v_k)}{C(q_t=s_j)} \\ & = \frac{\text{Sum $\gamma_t(j)$ for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$}}{\text{Sum $\gamma_t(j)$ over all time steps $t$}} \\ & = \frac{\sum_{t=1 \text{ s.t. } o_t=v_k}^{T}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)} \end{align*}
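The two re-estimation formulas can be sketched together, assuming xi has shape (T-1, N, N), gamma has shape (T, N), obs is integer-coded, and M is the vocabulary size. The function name `reestimate` and the array layout are illustrative assumptions.

```python
import numpy as np

def reestimate(xi, gamma, obs, M):
    """M-step: re-estimate the transition and emission probabilities."""
    N = gamma.shape[1]
    # a_hat[i, j] = sum_t xi_t(i, j) / sum_t sum_k xi_t(i, k)
    A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    # b_hat[j, v] = sum of gamma_t(j) over t with o_t = v, over sum_t gamma_t(j)
    B_hat = np.zeros((N, M))
    for t, o in enumerate(obs):
        B_hat[:, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, None]
    return A_hat, B_hat
```

Both estimates are normalized by construction: each row of `A_hat` and each row of `B_hat` sums to 1.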

The forward-backward algorithm has two steps: the expectation step, or E-step, and the maximization step, or M-step.

In the E-step, the expected state occupancy count \gamma and the expected state transition count \xi are computed from the current A and B probabilities.

In the M-step, \gamma and \xi are used to recompute new A and B probabilities.
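One full iteration (E-step plus M-step) can be sketched self-contained as below. This is an assumption-laden illustration, not a production implementation: like the text, it keeps the start/end transition vectors a_0 (a_{0j}) and a_F (a_{iF}) fixed and re-estimates only A and B.

```python
import numpy as np

def baum_welch_step(A, B, a_0, a_F, obs):
    """One forward-backward iteration; returns updated A, B and P(O|lambda)."""
    T, N = len(obs), A.shape[0]
    # forward pass
    alpha = np.zeros((T, N))
    alpha[0] = a_0 * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    # backward pass
    beta = np.zeros((T, N))
    beta[T - 1] = a_F
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = np.sum(alpha[0] * beta[0])
    # E-step: expected counts gamma and xi
    gamma = alpha * beta / likelihood
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]
                   for t in range(T - 1)]) / likelihood
    # M-step: re-estimate A and B from the expected counts
    A_hat = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_hat = np.zeros_like(B)
    for t, o in enumerate(obs):
        B_hat[:, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, None]
    return A_hat, B_hat, likelihood
```

In practice this step is repeated until the likelihood stops improving, and long sequences require scaling or log-space arithmetic to avoid underflow.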
