STA 414/2104, Spring 2014, Assignment #1.
I'll first present results on the two training sets for each type of
method, then summarize the main results, and discuss these results
further at the end.
Results with linear models.
With a simple linear model, the average squared error on test cases
was 0.288 for training set 1, and 0.289 for training set 2. The
squared error was less when the squares of the inputs were also used
as covariates (0.283 for training set 1, 0.258 for training set 2),
but when cubes were added as well, the squared error increased for
training set 1 (to 0.311), though it decreased further for training
set 2 (to 0.251).
So although adding such extra covariates derived from the original
inputs can be beneficial, there is clearly a possibility of
overfitting. One can note that some of the regression coefficients on
some of the squared and cubed inputs were significant by the usual
standard, indicating that the relationship of the response to the
inputs is not entirely a simple linear one.
The Gaussian process model with the covariance function designed to
mimic the simple linear model did indeed produce the same average
squared error on test cases, as expected.
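This equivalence can be illustrated with a small Python sketch (synthetic data; the covariance here is a constant term plus a dot-product term, scaled by a large prior variance, which is one way to mimic a linear model with an intercept, though not necessarily the exact form used in the assignment):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)
X_new = rng.normal(size=(10, p))

tau2 = 1e4     # large prior variance on the intercept and slopes
sigma2 = 0.09  # noise variance

def lin_K(X1, X2):
    # Covariance mimicking a linear model with an intercept:
    # a constant term plus a dot-product term.
    return tau2 * (1.0 + X1 @ X2.T)

C = lin_K(X, X) + sigma2 * np.eye(n)
gp_pred = lin_K(X_new, X) @ np.linalg.solve(C, y)

# Ordinary least squares with an intercept, for comparison.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
ls_pred = np.column_stack([np.ones(len(X_new)), X_new]) @ beta
```

As the prior variance tau2 grows, the Gaussian process posterior mean converges to the least-squares predictions, which is why the two methods give identical test errors.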
Results with the squared exponential covariance function, maximizing marginal likelihood.
Without rescaling of inputs 1 and 7, the Gaussian process model with
the covariance function using the exponential of the squared
differences, with hyperparameters set by maximizing marginal
likelihood, gave average squared prediction errors of 0.293 and 0.257
using the two training sets (the same for the two initial values I
tried). This is worse than the simple linear model for training set
1, and worse than the model with squares and cubes for training set 2.
So the Gaussian process model shows little or no advantage in this
context.
However, when inputs 1 and 7 were rescaled so that their range of
variation was similar to the others, the performance of this Gaussian
process model improved, to an average squared error of 0.244 for
training set 1 and 0.245 for training set 2 (for both training sets,
the results were the same for the two initial values I tried). This
is better than any of the models using "lm".
Computation time was about 2 seconds for these Gaussian process
methods (using pqR 2014-02-23, and one core of a 3.33 GHz processor).
This is much greater than for "lm", but is still quite tolerable.
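A minimal Python sketch of this procedure follows (synthetic data; a coarse grid search stands in for the nlm() optimization actually used, and the hyperparameter names are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=n)

def sqexp_K(X1, X2, eta, rho):
    # eta^2 * exp(-(distance/rho)^2): gives smooth (infinitely
    # differentiable) functions.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2 / rho ** 2)

def log_marginal(eta, rho, sigma):
    # Log marginal likelihood of the GP, via a Cholesky factorization.
    C = sqexp_K(X, X, eta, rho) + sigma ** 2 * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha - np.log(np.diag(L)).sum()
            - 0.5 * n * np.log(2 * np.pi))

# Grid search over hyperparameters, in place of nlm() from several
# starting values.
grid = [(eta, rho, sigma)
        for eta in (0.5, 1.0, 2.0)
        for rho in (0.2, 0.5, 1.0, 2.0)
        for sigma in (0.1, 0.2, 0.5)]
best = max(grid, key=lambda h: log_marginal(*h))
```

Rescaling inputs matters here because a single length scale rho is shared across all inputs, so inputs with very different ranges of variation are effectively weighted very differently.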
Results with other covariance functions, maximizing marginal likelihood.
For both training sets, results improved with the covariance function
using the exponential of the absolute values of differences, to an
average squared test error of 0.241 for training set 1 and 0.216 for
training set 2. Both results were the same for the two starting
values I tried. The computation time was again about 2 seconds.
For the covariance function with terms of both kinds, the average
squared error using training set 1 decreased further, to 0.237 (the
same for all initial values I tried). However, for training set 2,
results were worse: an average squared error of 0.236 with one initial
value, and 0.221 for three other initial values (which gave higher
marginal likelihood, and hence would be chosen in preference to the
first). Still, all results were better than when the covariance
function with the exponential of squared distances was used, so when
it is not known which kind of covariance function would be better,
the one with terms of both kinds may be a good choice. It was a bit
slower, however, taking roughly 5 seconds.
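The two kinds of covariance terms, and their sum, can be sketched as follows (illustrative Python; hyperparameter names are made up):

```python
import numpy as np

def sqexp_K(X1, X2, eta, rho):
    # Exponential of squared differences: smooth sample paths.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2 / rho ** 2)

def abs_exp_K(X1, X2, eta, rho):
    # Exponential of (minus) absolute differences: continuous but
    # non-differentiable sample paths.
    d = np.abs(X1[:, None, :] - X2[None, :, :]).sum(axis=-1)
    return eta ** 2 * np.exp(-d / rho)

def two_term_K(X1, X2, eta1, rho1, eta2, rho2):
    # Terms of both kinds, each with its own hyperparameters; a sum of
    # valid covariance functions is itself a valid covariance function.
    return sqexp_K(X1, X2, eta1, rho1) + abs_exp_K(X1, X2, eta2, rho2)

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
K = two_term_K(X, X, 1.0, 0.8, 0.5, 1.5)
```

With both terms present, maximizing marginal likelihood can shrink the magnitude of whichever term fits the data less well, which is why this covariance is a reasonable default when the right degree of smoothness is unknown.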
Results when hyperparameters were chosen by cross validation.
When hyperparameter values were chosen by cross validation, results
were almost uniformly worse or no better than when the hyperparameters
were chosen by maximizing marginal likelihood.
For the covariance function using the exponential of squared distance,
average squared error with the first training set was 0.276 (versus
0.244 when maximizing marginal likelihood), and for the second
training set average squared error was 0.244 (versus 0.245 when
maximizing marginal likelihood). The very slight gain for training
set 2 is offset by a substantial loss for training set 1. Also,
although the average squared errors were the same for two initial
values, the hyperparameter estimates differed somewhat, so the cross
validation may be somewhat sensitive to initial values. Furthermore,
the noise standard deviation hyperparameter was estimated as rather
small (perhaps unrealistically so) when training set 2 was used, which
might be disturbing even if the predictions are OK.
With the covariance function using the exponential of absolute
differences, average squared errors with the two training sets were
0.240 and 0.219, compared with 0.241 and 0.216 when hyperparameters
were set by maximizing marginal likelihood. Here also, although the
average squared errors did not vary with initial values, the
hyperparameters chosen did vary quite a bit.
When the covariance function with two terms was used, average squared
errors with the two training sets were 0.240 and 0.222, slightly worse
than what was obtained when maximizing marginal likelihood.
Since cross validation is considerably slower than maximizing marginal
likelihood (taking between 8 and 79 seconds for these runs), there
seems to be no benefit of cross validation over maximizing marginal
likelihood, at least for this problem.
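One way the cross-validation criterion can be computed cheaply for Gaussian process models is the standard leave-one-out identity, which gives all n held-out residuals from a single matrix inverse rather than n refits. A Python sketch (this is a standard result, not necessarily the cross-validation scheme used for the runs above):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 25
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=n)

def sqexp_K(X1, X2, eta, rho):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2 / rho ** 2)

def loo_sq_error(eta, rho, sigma):
    # Mean leave-one-out squared error, using the identity
    #   y_i - mu_{-i} = [C^{-1} y]_i / [C^{-1}]_{ii}
    # where C is the covariance of the observed responses, so no
    # refitting is needed.
    C = sqexp_K(X, X, eta, rho) + sigma ** 2 * np.eye(n)
    Cinv = np.linalg.inv(C)
    resid = (Cinv @ y) / np.diag(Cinv)
    return np.mean(resid ** 2)
```

This criterion can then be minimized over the hyperparameters in place of the negative log marginal likelihood.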
Results of Bayesian prediction using importance sampling.
Importance sampling took much longer than the other methods (between
134 and 235 seconds for each run, with 1000 points). The results were
good - from slightly worse to slightly better than the best other
methods. However, at least for this problem, one might wonder whether
the extra computation time was worth it.
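The method can be sketched in Python as follows (synthetic data; the log-normal distribution over hyperparameters is an illustrative choice, used as both prior and proposal so that the importance weights are proportional to the marginal likelihood alone; it is not the assignment's actual prior or proposal):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 40
X = rng.uniform(-2, 2, size=(n, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=n)
x_new = np.array([[0.5]])

def sqexp_K(X1, X2, eta, rho):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return eta ** 2 * np.exp(-d2 / rho ** 2)

def log_marginal(eta, rho, sigma):
    C = sqexp_K(X, X, eta, rho) + sigma ** 2 * np.eye(n)
    sign, logdet = np.linalg.slogdet(C)
    return (-0.5 * y @ np.linalg.solve(C, y) - 0.5 * logdet
            - 0.5 * n * np.log(2 * np.pi))

def posterior_mean(eta, rho, sigma):
    C = sqexp_K(X, X, eta, rho) + sigma ** 2 * np.eye(n)
    return (sqexp_K(x_new, X, eta, rho) @ np.linalg.solve(C, y))[0]

# Draw hyperparameters from the prior; since the proposal is the prior
# itself, the importance weights are proportional to the marginal
# likelihood alone.
S = 200
hypers = np.exp(rng.normal(scale=0.5, size=(S, 3)))
log_w = np.array([log_marginal(*h) for h in hypers])
w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
w /= w.sum()
pred = np.sum(w * np.array([posterior_mean(*h) for h in hypers]))
```

Averaging predictions over the hyperparameter posterior in this way accounts for hyperparameter uncertainty, rather than fixing the hyperparameters at a single optimum.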
Summary of results.
Here is a summary of the results with the three covariance functions
when maximizing marginal likelihood, using cross validation, and using
importance sampling, for both training sets:
                    exp sq cov            exp abs cov          two term cov
                   ML    CV    IS        ML    CV    IS       ML    CV    IS
Training set 1:  0.244 0.276 0.246     0.241 0.240 0.240    0.237 0.240 0.236
Training set 2:  0.245 0.244 0.244     0.216 0.219 0.215    0.221 0.222 0.225
Concluding discussion.
We see that the test set performance can vary quite a bit between the
two training sets. Clearly, one could not draw strong conclusions
about which method is better based on only one training set (indeed,
one really needs more than two).
For this problem, using a covariance function with a term that takes
the exponential of the absolute values of differences seems
beneficial. The covariance term with the exponential of squared
differences produces functions that are smooth (differentiable), which
seems not really appropriate for this problem. The
marginal likelihood from importance sampling does favour the
exponential of absolute differences covariance function over the
exponential of squared differences covariance function. However, the
two-term covariance function has much higher marginal likelihood than
either one-term covariance function, so perhaps using both terms is
better.
Cross validation at best does about the same as maximizing marginal
likelihood, and in one case (the covariance function using exponential
of the squared differences, on training set 1), cross validation
produced a substantially worse result than the other methods. Cross
validation therefore seems not very robust, at least judging from this
one problem. It is also rather slow.
Bayesian prediction with importance sampling gave good results, but
was also slow. However, when comparing times, one should keep in mind
that one run maximizing marginal likelihood did find a poor local
maximum, which indicates that for reliably good results, one should
run "nlm" from several initial values. Accounting for this reduces
the speed advantage of maximizing marginal likelihood over importance
sampling.