Consider last week's mixture model for the height of an adult, H, in which we specified H|M=0 as N(155, 15^2) and H|M=1 as N(175, 17^2), where M=1 indicates male and M=0 indicates female, with P(M=1)=P(M=0)=1/2. Suppose we measure that some adult's height is 180cm. How likely is it that they are male?
We can answer this by finding the conditional probability that M=1 given the measured height, using Bayes' Rule. But we have a problem. P(M=1|H=180) is a conditional probability in which the event that we condition on, H=180, has zero probability. We had previously said that such a conditional probability is undefined, since its definition involves a division by zero.
However, our measurement doesn't actually tell us that H=180. It has some finite precision, and so tells us something like that H is in the interval (179.9,180.1). So what we actually need to find is
P(M=1 | H in (179.9,180.1)) = P(M=1) P(H in (179.9,180.1) | M=1) / P(H in (179.9,180.1))
which is well defined. And we could actually compute it, using integrals over the normal probability density function.
However, when the precision of a measurement is high compared to the standard deviation of the quantity measured, we will find that these integrals are over intervals where the probability density is almost constant. So the integral is approximately equal to the probability density at the centre of the interval times the width of the interval. For example,
P(H in (179.9,180.1) | M=1) is approximately 0.2 f1,H(180)
where f1,H is the probability density function for the N(175, 17^2) distribution, which is the distribution for H when M=1.
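We can check this approximation numerically with a short sketch. The normal cdf and pdf helpers below are written out by hand (via math.erf) just so the snippet is self-contained; they are not a named library API.

```python
# Check that P(H in (179.9, 180.1) | M=1) is close to 0.2 * f1,H(180),
# where H | M=1 ~ N(175, 17^2), as in the text.
from math import erf, exp, pi, sqrt

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def norm_pdf(x, mu, sigma):
    """Probability density function of N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Exact interval probability, and the width-times-density approximation.
exact = norm_cdf(180.1, 175, 17) - norm_cdf(179.9, 175, 17)
approx = 0.2 * norm_pdf(180, 175, 17)
print(exact, approx)   # the two values agree to several decimal places
```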
When we substitute this into Bayes' Rule, we find that the width of the interval (0.2 in this example) cancels out. We then have something that looks just like Bayes' Rule, but with probability densities for H instead of probabilities.
We can often get away with this trick, treating probability densities almost like probabilities. Note, however, that there certainly are differences - for example, probabilities can't be greater than one, but probability densities can be greater than one.
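Putting this together for the height example, here is a minimal sketch of Bayes' Rule with densities in place of probabilities (the norm_pdf helper is hand-rolled for illustration, not a library function):

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    """Probability density function of N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# P(M=1 | H=180) ~= P(M=1) f1,H(180) / (P(M=0) f0,H(180) + P(M=1) f1,H(180)),
# with the interval width cancelling out as described in the text.
f0 = norm_pdf(180, 155, 15)   # density for H given M=0
f1 = norm_pdf(180, 175, 17)   # density for H given M=1
posterior = 0.5 * f1 / (0.5 * f0 + 0.5 * f1)
print(posterior)   # roughly 0.77, so a height of 180cm makes "male" more probable
```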
X0 ---> X1 ---> X2 ---> X3 ---> ...
In this model, Xi is conditionally independent of Xk given Xj whenever i < j < k. This is called the "Markov property", and the model above is called a "Markov model" or a "Markov chain". The variables X0, X1, X2, ... are often seen as being ordered by "time", measured by integers, so we may use t as the index, writing Xt for one of these variables, although in some applications the variables may be ordered in some other way, such as in space, or by position in a file. We sometimes call the value of Xt the "state" at time t.
Example applications: Markov models arise in many application areas.
Let's suppose that the random variables making up a Markov chain have some finite range, such as { 1, 2, ..., K }. To specify the joint distribution of all the Xt, we will need to specify
The initial probabilities for the state - in other words, P(X0=x) for all x in the range of X0.
The transition probabilities for moving from a state at time t to a state at time t+1 - in other words, P(Xt+1=x' | Xt=x) for all x and x'.
If the transition probabilities are the same for all t, we say the Markov chain is "homogeneous".
For a homogeneous Markov chain, we will write
P0(x) for P(X0=x)
P(1)(x --> x') for P(Xt+1=x' | Xt=x)
Note that specifying a Markov chain with K possible states requires K-1 numbers for the initial probabilities (the last probability is determined from the others by the requirement that they sum to one) and K(K-1) numbers for the transition probabilities.
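As a concrete illustration, a homogeneous chain with K = 3 states might be represented like this (the particular probabilities are made up for the example):

```python
# A homogeneous Markov chain with K = 3 states: an initial distribution
# P0 and a transition matrix T, where T[x][y] stands for P(1)(x --> y).
K = 3
P0 = [0.5, 0.3, 0.2]        # P0(x); must sum to one
T  = [[0.9, 0.1, 0.0],      # each row must sum to one
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]

assert abs(sum(P0) - 1) < 1e-12
assert all(abs(sum(row) - 1) < 1e-12 for row in T)

# Free parameters: K-1 initial probabilities plus K(K-1) transition
# probabilities, since the last entry of each distribution is determined
# by the sum-to-one requirement.
print((K - 1) + K * (K - 1))   # → 8
```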
Suppose we want to find Pn(x) = P(Xn = x).
We know P0(x). We can find P1(x) as follows:
P1(x1) = P(X1 = x1)
= SUM(over x0) P(X1 = x1, X0 = x0)
= SUM(over x0) P(X1 = x1 | X0 = x0) P(X0 = x0)
= SUM(over x0) P(1)(x0 --> x1) P0(x0)
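The step above can be sketched in code, using a made-up three-state chain for illustration; in matrix terms, P1 is the vector-matrix product of P0 with the transition matrix.

```python
# P1(x1) = SUM over x0 of P(1)(x0 --> x1) P0(x0),
# with T[x][y] standing for P(1)(x --> y).  Chain numbers are illustrative.
P0 = [0.5, 0.3, 0.2]
T  = [[0.9, 0.1, 0.0],
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]
K = len(P0)

P1 = [sum(T[x0][x1] * P0[x0] for x0 in range(K)) for x1 in range(K)]
print(P1)                          # a new distribution over the K states
assert abs(sum(P1) - 1) < 1e-9     # it still sums to one
```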
Similarly, we could find P4(x) as
P4(x4) = P(X4 = x4)
       = SUM(over x0, x1, x2, x3) P0(x0) P(1)(x0 --> x1) P(1)(x1 --> x2) P(1)(x2 --> x3) P(1)(x3 --> x4)
But the summation here is over K^4 terms, and in general using this method to compute Pn(x) would take time that grows exponentially with n.
Fortunately, we can instead proceed sequentially, computing P1, P2, P3, ... in turn (of course, we know P0 before we start). At each stage, we build a table of values for Pn, which we can use when computing the next table. To compute Pn when we already have a table of values for Pn-1, we just need to write it as follows:
Pn(xn) = SUM(over x0, ..., xn-1) P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over x0, ..., xn-1) P(Xn = xn | Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) SUM(over x0, ..., xn-2) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) P(Xn-1 = xn-1)
= SUM(over xn-1) P(1)(xn-1 --> xn) Pn-1(xn-1)
Note how the Markov property is crucial in simplifying P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) to P(Xn = xn | Xn-1 = xn-1). The end result is that computing Pn when we already know Pn-1 requires a sum of only K terms for each of the K states, so computing Pn takes time proportional to K^2 n rather than K^n.
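The sequential computation can be sketched as follows, and checked against the brute-force sum over all paths for a small n (the chain's probabilities are made up for illustration):

```python
from itertools import product

# A made-up three-state homogeneous chain: initial distribution and
# transition matrix, with T[x][y] standing for P(1)(x --> y).
P0 = [0.5, 0.3, 0.2]
T  = [[0.9, 0.1, 0.0],
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]
K = len(P0)

def step(P):
    # Pn(xn) = SUM over xn-1 of P(1)(xn-1 --> xn) Pn-1(xn-1)
    return [sum(T[a][b] * P[a] for a in range(K)) for b in range(K)]

def P_seq(n):
    # Sequential method: n vector-matrix products, time proportional to n.
    P = P0
    for _ in range(n):
        P = step(P)
    return P

def P_brute(n):
    # Direct method: enumerate every path x0, x1, ..., xn and add up the
    # joint probabilities.  The number of paths grows exponentially with n.
    P = [0.0] * K
    for x0 in range(K):
        for path in product(range(K), repeat=n):
            pr = P0[x0]
            prev = x0
            for x in path:
                pr *= T[prev][x]
                prev = x
            P[path[-1]] += pr
    return P

print(P_seq(4))
print(P_brute(4))   # the two methods agree up to rounding
```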