Consider last week's mixture model for the height of an adult, H, in which we specified H|M=0 as N(155, 15^2) and H|M=1 as N(175, 17^2), where M=1 indicates male and M=0 indicates female, with P(M=1)=P(M=0)=1/2. Suppose we measure that some adult's height is 180cm. How likely is it that they are male?
We can answer this by finding the conditional probability that M=1 given the measured height, using Bayes' Rule. But we have a problem. P(M=1|H=180) is a conditional probability in which the event that we condition on, H=180, has zero probability. We had previously said that such a conditional probability is undefined, since its definition involves a division by zero.
However, our measurement doesn't actually tell us that H=180. It has some finite precision, and so tells us something like that H is in the interval (179.9,180.1). So what we actually need to find is
P(M=1 | H in (179.9,180.1)) = P(M=1) P(H in (179.9,180.1) | M=1) / P(H in (179.9,180.1))
which is well defined. And we could actually compute it, using integrals over the normal probability density function.
However, when the precision of a measurement is high compared to the standard deviation of the quantity measured, we will find that these integrals are over intervals where the probability density is almost constant. So the integral is approximately equal to the probability density at the centre of the interval times the width of the interval. For example,
P(H in (179.9,180.1) | M=1) is approximately 0.2 f1,H(180)
where f1,H is the probability density function for the N(175, 17^2) distribution, which is the distribution for H when M=1.
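We can check this approximation numerically with a short sketch. The normal cdf and pdf helpers below are written out by hand (via math.erf) just so the snippet is self-contained; they are not a named library API.

```python
# Check that P(H in (179.9, 180.1) | M=1) is close to 0.2 * f1,H(180),
# where H | M=1 ~ N(175, 17^2), as in the text.
from math import erf, exp, pi, sqrt

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2) at x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def norm_pdf(x, mu, sigma):
    """Probability density function of N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Exact interval probability, and the width-times-density approximation.
exact = norm_cdf(180.1, 175, 17) - norm_cdf(179.9, 175, 17)
approx = 0.2 * norm_pdf(180, 175, 17)
print(exact, approx)   # the two values agree to several decimal places
```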
When we substitute this into Bayes' Rule, we find that the width of the interval (0.2 in this example) cancels out. We then have something that looks just like Bayes' Rule, but with probability densities for H instead of probabilities.
We can often get away with this trick, treating probability densities almost like probabilities. Note, however, that there certainly are differences - for example, probabilities can't be greater than one, but probability densities can be greater than one.
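Putting this together for the height example, here is a minimal sketch of Bayes' Rule with densities in place of probabilities (the norm_pdf helper is hand-rolled for illustration, not a library function):

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    """Probability density function of N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# P(M=1 | H=180) ~= P(M=1) f1,H(180) / (P(M=0) f0,H(180) + P(M=1) f1,H(180)),
# with the interval width cancelling out as described in the text.
f0 = norm_pdf(180, 155, 15)   # density for H given M=0
f1 = norm_pdf(180, 175, 17)   # density for H given M=1
posterior = 0.5 * f1 / (0.5 * f0 + 0.5 * f1)
print(posterior)   # roughly 0.77, so a height of 180cm makes "male" more probable
```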
X0 ---> X1 ---> X2 ---> X3 ---> ...
In this model, Xi is conditionally independent of Xk given Xj whenever i < j < k. This is called the "Markov property", and the model above is called a "Markov model" or a "Markov chain". The variables X0, X1, X2, ... are often seen as being ordered by "time", measured by integers, so we may use t as the index, writing Xt for one of these variables, although in some applications the variables may be ordered in some other way, such as in space, or by position in a file. We sometimes call the value of Xt the "state" at time t.
Example applications: Markov models arise in many application areas.
Let's suppose that the random variables making up a Markov chain have some finite range, such as { 1, 2, ..., K }. To specify the joint distribution of all the Xt, we will need to specify
The initial probabilities for the state - in other words, P(X0=x) for all x in the range of X0.
The transition probabilities for moving from a state at time t to a state at time t+1 - in other words, P(Xt+1=x' | Xt=x) for all x and x'.
If the transition probabilities are the same for all t, we say the Markov chain is "homogeneous".
For a homogeneous Markov chain, we will write
P0(x) for P(X0=x)
P(1)(x --> x') for P(Xt+1=x' | Xt=x)
Note that specifying a Markov chain with K possible states requires K-1 numbers for the initial probabilities (the last probability is determined from the others by the requirement that they sum to one) and K(K-1) numbers for the transition probabilities.
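As a concrete illustration, a homogeneous chain with K = 3 states might be represented like this (the particular probabilities are made up for the example):

```python
# A homogeneous Markov chain with K = 3 states: an initial distribution
# P0 and a transition matrix T, where T[x][y] stands for P(1)(x --> y).
K = 3
P0 = [0.5, 0.3, 0.2]        # P0(x); must sum to one
T  = [[0.9, 0.1, 0.0],      # each row must sum to one
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]

assert abs(sum(P0) - 1) < 1e-12
assert all(abs(sum(row) - 1) < 1e-12 for row in T)

# Free parameters: K-1 initial probabilities plus K(K-1) transition
# probabilities, since the last entry of each distribution is determined
# by the sum-to-one requirement.
print((K - 1) + K * (K - 1))   # → 8
```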
Suppose we want to find Pn(x) = P(Xn = x).
We know P0(x). We can find P1(x) as follows:
P1(x1) = P(X1 = x1)
= SUM(over x0) P(X1 = x1, X0 = x0)
= SUM(over x0) P(X1 = x1 | X0 = x0) P(X0 = x0)
= SUM(over x0) P(1)(x0 --> x1) P0(x0)
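The step above can be sketched in code, using a made-up three-state chain for illustration; in matrix terms, P1 is the vector-matrix product of P0 with the transition matrix.

```python
# P1(x1) = SUM over x0 of P(1)(x0 --> x1) P0(x0),
# with T[x][y] standing for P(1)(x --> y).  Chain numbers are illustrative.
P0 = [0.5, 0.3, 0.2]
T  = [[0.9, 0.1, 0.0],
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]
K = len(P0)

P1 = [sum(T[x0][x1] * P0[x0] for x0 in range(K)) for x1 in range(K)]
print(P1)                          # a new distribution over the K states
assert abs(sum(P1) - 1) < 1e-9     # it still sums to one
```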
Similarly, we could find P4(x) as
P4(x4) = P(X4 = x4)
       = SUM(over x0, x1, x2, x3) P0(x0) P(1)(x0 --> x1) P(1)(x1 --> x2) P(1)(x2 --> x3) P(1)(x3 --> x4)
But the summation here is over K^4 terms, and in general using this method to compute Pn(x) would take time that grows exponentially with n.
Fortunately, we can instead proceed sequentially, computing P1, P2, P3, ... in turn (of course, we know P0 before we start). At each stage, we build a table of values for Pn, which we can use when computing the next table. To compute Pn when we already have a table of values for Pn-1, we just need to write it as follows:
Pn(xn) = SUM(over x0, ..., xn-1) P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over x0, ..., xn-1) P(Xn = xn | Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) SUM(over x0, ..., xn-2) P(X0 = x0, ..., Xn-1 = xn-1)
= SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) P(Xn-1 = xn-1)
= SUM(over xn-1) P(1)(xn-1 --> xn) Pn-1(xn-1)
Note how the Markov property is crucial in simplifying P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) to P(Xn = xn | Xn-1 = xn-1). The end result is that computing Pn when we already know Pn-1 requires a sum of only K terms for each of the K states, so computing Pn takes time proportional to K^2 n rather than K^n.
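The sequential computation can be sketched as follows, and checked against the brute-force sum over all paths for a small n (the chain's probabilities are made up for illustration):

```python
from itertools import product

# A made-up three-state homogeneous chain: initial distribution and
# transition matrix, with T[x][y] standing for P(1)(x --> y).
P0 = [0.5, 0.3, 0.2]
T  = [[0.9, 0.1, 0.0],
      [0.2, 0.6, 0.2],
      [0.0, 0.3, 0.7]]
K = len(P0)

def step(P):
    # Pn(xn) = SUM over xn-1 of P(1)(xn-1 --> xn) Pn-1(xn-1)
    return [sum(T[a][b] * P[a] for a in range(K)) for b in range(K)]

def P_seq(n):
    # Sequential method: n vector-matrix products, time proportional to n.
    P = P0
    for _ in range(n):
        P = step(P)
    return P

def P_brute(n):
    # Direct method: enumerate every path x0, x1, ..., xn and add up the
    # joint probabilities.  The number of paths grows exponentially with n.
    P = [0.0] * K
    for x0 in range(K):
        for path in product(range(K), repeat=n):
            pr = P0[x0]
            prev = x0
            for x in path:
                pr *= T[prev][x]
                prev = x
            P[path[-1]] += pr
    return P

print(P_seq(4))
print(P_brute(4))   # the two methods agree up to rounding
```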