STA 414/2104, Spring 2014, Assignment 2 discussion.
With K=1, each digit is modelled by a single component, under which
the pixels are independent given the class, so the model is equivalent
to naive Bayes. The classification error rate on test cases was 0.190.
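Throughout, test cases are classified by Bayes' rule: each image is
assigned the digit whose fitted model gives it the highest posterior
probability. A minimal sketch of that computation in R, assuming
binary pixels, equal prior probabilities for the ten digits, and
made-up names (Xtest, fit, theta, pi, and log_mix_lik are
illustrations, not the assignment's actual code):

  # Log mixture likelihood of each row of the binary matrix X under
  # one digit's fitted model: theta is a K x p matrix of pixel
  # probabilities, pi a vector of K mixing proportions.
  log_mix_lik <- function (X, theta, pi)
  {
    ll <- sapply (1:nrow(theta), function (k)
      as.vector (log(pi[k]) +
        X %*% log(theta[k,]) + (1-X) %*% log(1-theta[k,])))
    m <- apply (ll, 1, max)
    m + log (rowSums (exp (ll-m)))    # log-sum-exp over components
  }

  # With equal priors, pick the digit giving the highest likelihood;
  # fit[[d+1]] is assumed to hold digit d's fitted theta and pi.
  ll <- sapply (0:9, function (d)
    log_mix_lik (Xtest, fit[[d+1]]$theta, fit[[d+1]]$pi))
  pred <- max.col (ll) - 1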
With K=5, 80 iterations of EM seemed sufficient for all ten random
initializations. The resulting models had the following error rates
on the test cases:
0.157 0.151 0.158 0.156 0.166 0.162 0.163 0.159 0.158 0.153
These are all better than the naive Bayes error rate of 0.190, showing
that using more than one mixture component for each digit is beneficial.
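For reference, one EM update for such a mixture, fit to the binary
images of a single digit class, might look roughly like this (a sketch
under assumed names; the +1/+2 smoothing in the M step is my stand-in
for the penalty, not the assignment's actual formula):

  # One EM update for a K-component Bernoulli mixture.  X is an
  # n x p matrix of 0/1 pixels for one digit class; theta is K x p;
  # pi holds the K mixing proportions.
  em_step <- function (X, theta, pi)
  {
    # E step: responsibilities r[i,k], computed via logs for
    # numerical stability.
    lr <- sapply (1:nrow(theta), function (k)
      as.vector (log(pi[k]) +
        X %*% log(theta[k,]) + (1-X) %*% log(1-theta[k,])))
    lr <- lr - apply (lr, 1, max)
    r <- exp(lr) / rowSums(exp(lr))

    # M step: re-estimate mixing proportions and pixel probabilities,
    # smoothed so that theta stays strictly between 0 and 1.
    list (pi = colSums(r) / nrow(X),
          theta = (t(r) %*% X + 1) / (colSums(r) + 2))
  }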
I used the "show_digit" function to display the theta parameters of
the 50 mixture components as pictures (for the run started with the
last random seed). It is clear that the five components for each
digit have generally captured reasonable variations in writing style,
except perhaps for a few with small mixing proportion (given as the
number above the plot), such as the second "1" from the top.
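Lacking show_digit, a similar display can be produced with base R's
image function; this sketch assumes, purely for concreteness, that
each parameter vector reshapes to an 8x8 image and that theta and pi
are lists with one entry per digit:

  # Plot the 50 components in a 10 x 5 grid, one row per digit, with
  # each component's mixing proportion shown above its plot.
  par (mfrow=c(10,5), mar=c(1,1,2,1))
  for (d in 0:9) {
    for (k in 1:5) {
      img <- matrix (theta[[d+1]][k,], 8, 8)
      image (t(img)[,8:1], col=grey(seq(0,1,length=256)),
             axes=FALSE, main=round(pi[[d+1]][k],3))
    }
  }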
Using ensemble predictions (averaging the predictive probabilities for
the ten digits over the ten runs above), the classification error rate
on test cases was
0.139. This is substantially better than the error rate from every
one of the individual runs, showing the benefits of using an ensemble
when there is substantial random variation in the results.
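Concretely, the averaging amounts to something like the following,
where prob is assumed to be a list with one n x 10 matrix of
predictive digit probabilities per run (names are illustrative):

  # Average the ten runs' predictive probabilities, then classify.
  avg  <- Reduce (`+`, prob) / length(prob)
  pred <- max.col (avg) - 1       # columns correspond to digits 0..9
  mean (pred != test_labels)      # ensemble error rate on test cases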
Note that the individual run with highest log likelihood (and also
highest log likelihood + penalty) was the sixth run, whose error rate
of 0.162 was actually the third worst. So at least in this example,
picking a single run based on log likelihood would not have done
better than using the ensemble.
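For comparison, selecting a single run that way is just the following,
with loglik and penalty assumed to be per-run vectors:

  # Pick the run with highest penalized log likelihood and report
  # its test error rate (pred_run[[r]] holds run r's predictions).
  best <- which.max (loglik + penalty)
  mean (pred_run[[best]] != test_labels)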