STA 414/2104, Spring 2014, Assignment #3 discussion
After some experimentation, a learning rate of 0.00025 was found to
give a stable decrease in minus log likelihood (plus penalty) for all
the multilayer perceptron models fit, and 30000 iterations was found
to be enough to reach the best iteration (based on log probability on
the validation cases). Some larger learning rates were tried, which
worked well initially, but produced unstable behaviour in later
iterations. For the models without penalty, fewer iterations would
have been sufficient.
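The training procedure above can be sketched in miniature. This is a minimal illustration, not the assignment's actual MLP code: it trains a logistic model by gradient descent on minus log likelihood, using synthetic data, and records validation minus log probability at each iteration so the "best iteration" can be identified. The function names, data, and the faster learning rate in the usage example are all assumptions for illustration.

```python
import numpy as np

def train_gd(X, y, Xv, yv, lr=0.00025, iters=3000):
    """Gradient descent on minus log likelihood for a logistic model,
    tracking validation minus log probability at every iteration (a
    simplified stand-in for the MLP training described above)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    val_nll = []
    for t in range(iters):
        p = 1 / (1 + np.exp(-X @ w))          # predicted P(class 1)
        w -= lr * X.T @ (p - y)               # gradient of minus log lik
        pv = 1 / (1 + np.exp(-Xv @ w))
        val_nll.append(-np.mean(yv * np.log(pv) + (1 - yv) * np.log(1 - pv)))
    # "best iteration" = the one with lowest validation minus log prob
    return w, int(np.argmin(val_nll)), val_nll

# Illustrative run on synthetic data (larger lr than above, for speed)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = (rng.random(200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)
w, best, nll = train_gd(X[:150], y[:150], X[150:], y[150:], lr=0.01, iters=500)
```

With an unpenalized objective, the curve stored in nll is what one would watch for the late-iteration worsening mentioned below.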
For the models with a penalty, I chose its magnitude so that the
average log probability of validation cases became almost
constant in later iterations. This means that overfitting is being
avoided entirely by the presence of the penalty, not by stopping
training early. (In contrast, with no penalty, the validation log
probability got substantially worse for later iterations.) This is
not necessarily the best strategy - it's possible that a combination
of a smaller penalty with stopping training earlier might work better.
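The penalized objective being minimized can be sketched as follows. This assumes a quadratic (weight decay) penalty on the weights, which is the usual choice; the penalty magnitude lam used in the assignment is not shown, so the value in any call is illustrative only.

```python
import numpy as np

def penalized_nll(w, X, y, lam):
    """Minus average log likelihood for a logistic model, plus a
    quadratic penalty lam * sum(w^2).  The magnitude lam is the
    quantity tuned so that validation log probability levels off
    rather than degrading in later iterations."""
    p = 1 / (1 + np.exp(-X @ w))
    nll = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(w ** 2)
```

A larger lam shrinks the weights more strongly, trading a little fit to the estimation cases for less overfitting.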
I did only one run for each model, using a single random initialization.
It is possible that the results could vary substantially with other
random seeds.
The training behaviour is shown in plots for each model, with blue
being minus the average log probability for estimation cases, and
red being minus the average log probability for validation cases.
I also made pairwise scatterplots of projections of the estimation
cases on PC1 & PC2, PC3 & PC4, and PC39 & PC40. It is clear that some
of these projections contain information about class (which is
distinguished by colour in the plots).
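The projections onto principal components can be computed as below. This is a generic sketch (SVD of the centred data matrix), assuming the usual convention that PCs are ordered by decreasing variance; the plotting itself is omitted.

```python
import numpy as np

def pc_projections(X):
    """Project cases onto principal components via SVD of the centred
    data matrix; column k of the result is the projection on PC(k+1),
    with variance decreasing as k increases."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt.T        # scores: one column per PC

# Pairs plotted above would be columns (0, 1), (2, 3), and (38, 39)
```

Scatterplotting a pair of columns, coloured by class, shows whether that pair of PCs carries class information.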
The results are summarized below:
  MODEL                  #PCs    VALIDATION           TEST
                               err rate  -log pr   err rate  -log pr

  Logistic Regression     10    0.073     0.191     0.064     0.188
  Logistic Regression     20    0.067     0.166     0.061     0.176
  Logistic Regression     40    0.060     0.166     0.053     0.180

  MLP, no penalty         10    0.060     0.171     0.050     0.159
  MLP, no penalty         20    0.040     0.146     0.044     0.150
  MLP, no penalty         40    0.040     0.144     0.042     0.186

  MLP, with penalty       10    0.057     0.166     0.047     0.158
  MLP, with penalty       20    0.053     0.158     0.048     0.153
  MLP, with penalty       40    0.040     0.128     0.041     0.180
Based on average log probability for validation cases, the model
selected would be the last above - the MLP with penalty using 40 PCs.
The error rate on test cases with this model is 0.041, and minus the
average log probability for test cases is 0.180.
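The selection rule is simple enough to state in code: pick the model with the lowest validation minus log probability. The values below are taken directly from the table above; the key names are invented labels for the nine models.

```python
# Validation minus log probability from the table above,
# keyed by (model class, number of PCs)
val_nlp = {
    ("logistic", 10): 0.191, ("logistic", 20): 0.166, ("logistic", 40): 0.166,
    ("mlp-nopen", 10): 0.171, ("mlp-nopen", 20): 0.146, ("mlp-nopen", 40): 0.144,
    ("mlp-pen", 10): 0.166, ("mlp-pen", 20): 0.158, ("mlp-pen", 40): 0.128,
}
best = min(val_nlp, key=val_nlp.get)   # -> ("mlp-pen", 40)
```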
Looking at the results on test cases for all the models, we see that
the chosen model is the best of them all in terms of error rate, but
in terms of average log probability, several other models do better
(the best gives 0.150). There is a pattern for all model classes that
the best performance in terms of error rate on test cases is obtained
using 40 PCs, but the best performance in terms of average log
probability for test cases is obtained using only 20 PCs. It may be
that the 40 PC models overfit a bit (even after trying to control this
with early stopping or a penalty), but that this overfitting is not
too damaging when classifying (where all that matters is whether the
probability of class 1 is greater or less than 0.5).
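The distinction between the two metrics can be made concrete. In this small made-up example (the probabilities and labels are invented for illustration), an overconfident model makes the same classifications as a calibrated one, so the error rate is unchanged, but its average minus log probability is much worse:

```python
import numpy as np

def metrics(p, y):
    """Error rate depends only on whether P(class 1) is above or below
    0.5; average minus log probability also penalizes overconfidence."""
    err = np.mean((p > 0.5) != (y == 1))
    nlp = -np.mean(np.where(y == 1, np.log(p), np.log(1 - p)))
    return err, nlp

y = np.array([1, 1, 0, 0])
calibrated    = np.array([0.80, 0.70, 0.30, 0.60])   # one mistake
overconfident = np.array([0.99, 0.99, 0.01, 0.99])   # same mistake, more confident
```

Both sets of probabilities misclassify only the last case, so the error rates agree, yet the overconfident probabilities give a noticeably larger minus log probability.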
There is no clear winner when comparing results with no penalty (but
with overfitting controlled by early stopping) versus with a penalty
(chosen to be large enough that early stopping is not really
needed). The MLP with no penalty and 40 PCs has an error rate only
slightly higher than the best error rate obtained using a penalty.
The maximum likelihood logistic regression models are clearly inferior
to the MLP models, since the test performance of all logistic
regression models is worse than that of every MLP model, by both error
rate and average log probability. This could either be because there
are substantial non-linear aspects to the relationship of class to
covariates, or because maximum likelihood estimation for these models
overfits (with no attempt to control this), or because of a
combination of these two reasons.