Hidden Markov Models (HMM) and Bayesian Networks

Updated on

categories - the_mathetics_behind

tags - hmm, bayesian, probability


I was reading about handwriting recognition when I came across the term Hidden Markov Models (HMM). Although I had some idea about Markov models from my high school days, I had never heard of "hidden" Markov models. The term kept coming up in papers on handwriting recognition, and also on speech recognition, so I decided to dive into it.

Motivation

I am a lazy guy, and hence I need motivation. So I will list what made me go towards hidden Markov models, as maybe some of you wonder too. I will be mainly focusing on the use cases in machine learning. HMMs are used to model sequential data, i.e. data which depends on previous data (eh?). What I mean to say is that machine learning techniques train on data, and to represent that data in mathematical models we use various mathematical tools (Poisson, Gaussian, etc.); when the data is generated by a system that is not completely independent of the previously generated data, we can use hidden Markov models to bring that dependence into our model. Some of the sequential data categories are -
  1. Time series data.
  2. Language data banks.
  3. Speech.
  4. The famous Human Gait detection problem.

Some Preliminary Mathematics

In the theory of Markov models, we have good and efficient algorithms to calculate joint and marginal probabilities. Let $\textbf{X}$ be a set of random variables, then
$P(\textbf{X}) = P(X_{1}, X_{2}, X_{3}, ... X_{n})$
Breaking it into conditional probabilities with the chain rule gives -
$P(\textbf{X}) = P(X_{1}).P(X_{2}|X_{1}).P(X_{3}|X_{2},X_{1}) \ldots P(X_{n}|X_{n-1},X_{n-2}, \ldots X_{1})$
$P(\textbf{X}) = \prod^{n}_{i = 1}{P(X_{i}|X_{i-1},X_{i-2}, \ldots X_{1})}$
To model these long conditional probabilities is a really difficult task, as in machine learning the collected data (i.e. the value of $n$) can be of the order of $10^{15}$. Hence we use the Markov assumption, which for $m^{th}$ order is as follows -
$P(\textbf{X}) = \prod^{n}_{i = 1}{P(X_{i}| X_{i-1}, X_{i-2}, \ldots X_{i-m})}$
Which means we are assuming that the current state depends only on the previous $m$ states.
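For example, with $m = 1$ (the first-order assumption that HMMs use below), this reduces to $P(\textbf{X}) = P(X_{1}) \prod^{n}_{i = 2}{P(X_{i}|X_{i-1})}$, so each factor is just a table over pairs of consecutive values instead of the whole history.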

Introduction

HMMs represent probability distributions over sequences of observations. They help model sequential data with few parameters, without imposing a strong Markov assumption on the observations themselves. There are certain assumptions we make when we model a problem using this method. First let's understand the assumptions, then go towards the model and how it makes our lives easy.

Assumptions

  1. The observations are taken in discrete time intervals.
  2. Observation at time $t$ was generated by some state $S_{t}$, which is hidden from the observer.
  3. The states follow the Markov property, i.e. the state at time (or any other spatio-temporal unit) $t$ depends only on the state at $t-1$ and no previous ones.
  4. The hidden state variables are discrete and take values from the set $\{1, 2, 3, \ldots K\}$
Hence because of these assumptions, we can model the joint distribution using the following equation -
$P(S_{1:T}, Y_{1:T}) = P(S_{1}).P(Y_{1}|S_{1}) \prod^{T}_{t=2}{P(S_{t}|S_{t-1}).P(Y_{t}|S_{t})}$
Notice that we are modelling on the $1^{st}$ order Markov assumption. In this equation $t = 1$ is the initial state, and hence it has no previous state to depend upon. From the next state ($t = 2$) onwards, the current state depends only on the previous one, and the output depends only on the current state, as before. To represent this model we need a $K \times K$ matrix holding the transition probabilities between states, and if the outputs take $L$ discrete values, a $K \times L$ matrix for the output probabilities; for the real-valued case it might be a Gaussian model or something else. Although the HMM is a nice trick to model a sequence of states, it is not complete without algorithms like Viterbi (discussed in the next post). The following parameters are associated with a model (there is a small code sketch after this list) -
  1. States - These define the system, the conditions in which agents can be.
  2. Observations (Events) - The outputs we can actually see; each one is generated by the hidden state the agent is in.
  3. Start Probabilities - The probability of the agent(s) residing in each of the states at the beginning.
  4. Transition Probabilities - The probabilities of moving from one hidden state to another.
  5. Emission Probabilities - The probabilities of observing each possible output given the current hidden state.
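
To make these five parameters concrete, here is a minimal sketch in plain Python/NumPy (no HMM library; the two states, three observation symbols and all the numbers are made up purely for illustration). It stores the $K \times K$ transition matrix and $K \times L$ emission matrix mentioned above, and evaluates the joint probability from the equation for a case where we know both the state path and the observations.

```python
import numpy as np

# Illustrative HMM with K = 2 hidden states and L = 3 discrete observation symbols.
start = np.array([0.6, 0.4])            # P(S_1 = k), length K
transition = np.array([[0.7, 0.3],      # P(S_t = j | S_{t-1} = i), K x K
                       [0.4, 0.6]])
emission = np.array([[0.5, 0.4, 0.1],   # P(Y_t = l | S_t = k), K x L
                     [0.1, 0.3, 0.6]])

def joint_probability(states, observations):
    """P(S_{1:T}, Y_{1:T}) for a fully known state/observation sequence,
    i.e. P(S_1) P(Y_1|S_1) * prod_{t=2}^{T} P(S_t|S_{t-1}) P(Y_t|S_t)."""
    p = start[states[0]] * emission[states[0], observations[0]]
    for t in range(1, len(states)):
        p *= transition[states[t - 1], states[t]] * emission[states[t], observations[t]]
    return p

# Example: state path 0 -> 0 -> 1 emitting symbols 0, 1, 2.
print(joint_probability([0, 0, 1], [0, 1, 2]))
```

Of course, in a real HMM problem the state path is hidden, which is exactly why we need the algorithms discussed in the next post.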

Example Time

Let's take an example to understand this. You will find a hell lot of examples on the internet, but I particularly chose the one from the CMU class because it will help us link this section and the next, which discusses the problem formulations and their solutions using HMMs. So the example goes like this -

A casino has 2 dice. One is fair and the other is loaded. The fair die has equal probability associated with each outcome, i.e. $\frac{1}{6}$. The probabilities associated with the loaded one are -
$P(X = 1,2,3,4,5) = \frac{1}{10}, P(X = 6) = \frac{1}{2}$
And the casino player can switch between the dice with a probability of $0.05$.

Now we are given a finite sequence like 434231231654366653546565532.. and we need some insights regarding this sequence, which are -
  1. Evaluation Problem - What is the probability of this sequence occurring in the casino with the given probabilities?
  2. Decoding Problem - Which portion was generated by the fair die, and which by the loaded one?
  3. Learning Problem - How loaded is the loaded die, how fair is the fair die, and how often is the casino player switching between them?
In this problem, our observations are $\{O_{t}\}^{T}_{t=1}$, which give 43423123165436665...., and the hidden sequence is the sequence of dice used to generate these outputs. The start, transition and emission probabilities are also given, hence we can model this problem as a hidden Markov model.
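
To tie the example back to the parameters above, here is how the casino HMM could be written down and sampled in the same NumPy style. The $0.05$ switch probability and the die probabilities come straight from the example; the start probabilities are not stated explicitly, so I am assuming the player is equally likely to begin with either die.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["fair", "loaded"]                  # hidden states: which die is in use
start = np.array([0.5, 0.5])                 # assumed: equally likely starting die
transition = np.array([[0.95, 0.05],         # switch dice with probability 0.05
                       [0.05, 0.95]])
emission = np.array([
    [1/6] * 6,                               # fair die: every face has probability 1/6
    [1/10] * 5 + [1/2],                      # loaded die: faces 1-5 are 1/10, face 6 is 1/2
])

def sample_sequence(T):
    """Generate a hidden state path and the observed die rolls (faces 1..6)."""
    s = rng.choice(2, p=start)
    path, rolls = [], []
    for _ in range(T):
        path.append(states[s])
        rolls.append(rng.choice(6, p=emission[s]) + 1)
        s = rng.choice(2, p=transition[s])
    return path, rolls

path, rolls = sample_sequence(20)
print("rolls :", "".join(map(str, rolls)))
print("dice  :", path)
```

Running this a few times gives sequences much like the one above, along with the hidden die sequence we would normally not get to see - which is exactly what the decoding problem asks us to recover.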

So that's all for this post. I will discuss the algorithms to solve the three problems posed above in the next post. :)