EXAMPLES OF LEARNING WITH GRADIENT DESCENT, EARLY STOPPING & ENSEMBLES

Although this software is intended primarily to support research in Bayesian methods, I have also implemented traditional gradient-descent learning for neural networks.  This allows easy comparisons of traditional and Bayesian methods, and supports research into variations of traditional methods that may work better.

In particular, the software supports the "early stopping" technique.  When a network is trained to a minimum of the error on the training set (minus the log likelihood), performance on test data is often bad, since the training data has been "overfit".  To avoid this, many people use one of the networks from earlier in the training process, selected based on the error on a separate "validation set".

To do early stopping, the available training data must be partitioned into an "estimation set" and a "validation set".  The network parameters (weights and biases) are randomly initialized to values close to zero, and then gradually changed (by gradient descent) so as to minimize error on the estimation set.  The error on the validation set is computed for the networks found during this process, and the single network with minimum validation error is used to make predictions for future test cases.

The split into estimation and validation sets in this procedure seems a bit arbitrary and wasteful.  To alleviate this problem, one can train several networks using early stopping, based on different splits, and on different random initializations of the weights.  Predictions for test cases are then made by averaging the predictions from the networks selected from each of these runs, a process somewhat analogous to the averaging done when making predictions for a Bayesian model based on a sample from the posterior distribution.

Neural network training using early stopping and ensembles is described by Carl Rasmussen in his thesis (available in Postscript from http://www.cs.utoronto.ca/~carl), and in a paper of mine on ``Assessing relevance determination methods using DELVE'' (see my web page).

To demonstrate how the software can be used to do gradient descent learning, early stopping, and prediction using ensembles, I will show here how these methods can be applied to the binary response problem used as an example earlier (see Ex-netgp-b.doc).  The data and command files for these examples are in the "ex-gdes" directory.


Gradient descent learning for the binary response problem.

First, we will see how to use the software to set the network parameters so as to minimize error (minus the log likelihood) on the entire training set, or at least to get as close to the minimum error as we can using gradient descent optimization.  We can then see how well this network performs on test data.

To start, we need to specify the network architecture.  This is done using the same sort of command as is used for Bayesian networks, except that the prior specifications are simply "+" or "-", indicating whether the corresponding sets of weights are present or absent.  The following command creates a network with two inputs, one layer of 15 hidden units, and one output unit, with input-hidden weights, hidden biases, hidden-output weights, and an output bias:

    > net-spec blog.gd 2 15 1 / - + + - + - +

The "+" is actually translated to a very large number, with the result that the "prior" for these parameters has virtually no effect.  Instead of a "+", one can put a positive number, s, which produces the effect of "weight decay", with a penalty equal to the sum of the squares of these weights times 1/(2*s^2).  More elaborate hierarchical priors are meaningless if training is to be done by gradient descent.
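As a rough sketch of what this specification describes (this code is not part of the software, and the names in it are invented for illustration), the network maps its two inputs through a layer of tanh hidden units to a single output, and replacing a "+" by a positive number s adds a weight-decay penalty of the form given above to the training error:

    import numpy as np

    def network_output(x, W_ih, b_h, W_ho, b_o):
        # 2 inputs -> 15 tanh hidden units -> 1 output unit; the four
        # parameter groups correspond to the input-hidden weights, hidden
        # biases, hidden-output weights, and output bias specified above
        h = np.tanh(x @ W_ih + b_h)        # hidden unit values
        return h @ W_ho + b_o              # network output (before the data model)

    def weight_decay_penalty(weights, s):
        # penalty added to the training error when a positive number s is
        # given in place of "+":  (sum of squared weights) / (2*s^2)
        return np.sum(weights**2) / (2 * s**2)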
Next, we must specify the data model.  For this problem, the response is binary, and we use a model in which the probability of a response of 1 is found by passing the output of the network through the logistic function.  The following command specifies this:

    > model-spec blog.gd binary

For a model with a real-valued response, a noise standard deviation of 1 would conventionally be used, so that minus the log likelihood is just half the squared error.

The location of the data is specified as for Bayesian networks:

    > data-spec blog.gd 2 1 2 / bdata.train .

For these examples, the 300 training cases are stored in bdata.train (one per line, with the two inputs coming first).  The 200 test cases are in bdata.test, but this is not mentioned above.  The reason for doing things this way (rather than putting all the data in one file) will be apparent when we get to the examples using early stopping.

Finally, we can train the network using gradient descent, with the command:

    > net-gd blog.gd 100000 1000 / 0.4 batch

This does 100000 iterations of "batch" gradient descent (ie, with each update based on all training cases), with networks being saved in the log file every 1000 iterations.  The software also supports "on-line" gradient descent, which often converges faster initially, but does not reach the exact optimum.  See net-gd.doc for details.

The stepsize to use for gradient descent learning is specified as well; here it is 0.4.  If learning is unstable (ie, the error sometimes goes up rather than down), the stepsize will have to be reduced.  Net-gd does not try to determine relative stepsizes itself, but stepsizes for groups of parameters can be set individually.

The above command takes 200 seconds on the system used (see Ex-system.doc).  While waiting, you can monitor progress using net-plt.  For example, the progress of the training error can be viewed with the command:

    > net-plt t l blog.gd | plot

Individual networks can be displayed using net-display.  For example,

    > net-display -w blog.gd 1000

displays the network at iteration 1000.  The "-w" option suppresses the hyperparameter values, which are meaningless with this model.

Once training has finished, we can make predictions based on the last network, which should have the lowest training error.  The following command prints a summary of performance at predicting the cases in bdata.test, both in terms of the log probability assigned to the correct target value and in terms of the error rate when guessing the target:

    > net-pred mpa blog.gd 100000 / bdata.test .

    Number of iterations used: 1
    Number of test cases: 200

    Average log probability of targets:    -0.364+-0.066
    Fraction of guesses that were wrong:    0.1450+-0.0250

Performance is substantially worse than that obtained using Bayesian training.  This is due to "overfitting".  The following command illustrates the problem by plotting the change during the run of the training error and the error on the test cases:

    > net-plt t lL blog.gd / bdata.test . | plot

From this plot, it is clear that we would have been better off to stop training earlier than 100000 iterations.  Of course, we can't stop training based on the test error plotted above, since we don't know the test targets when training.
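As a concrete (if much simplified) picture of what the net-gd command above is doing, here is a minimal sketch of batch gradient descent on minus the log likelihood of a binary logistic model.  It uses a plain linear-logistic model in place of the 2-15-1 network, and is in no way the actual implementation, but it shows the essential point: each update is based on all of the training cases, scaled by the chosen stepsize.

    import numpy as np

    def batch_gradient_descent(X, y, iterations, stepsize=0.4):
        # Minimize minus the log likelihood of a binary logistic model by
        # batch gradient descent: every update uses all training cases.
        # (A linear-logistic model stands in here for the 2-15-1 network.)
        rng = np.random.default_rng(1)
        w = 0.01 * rng.standard_normal(X.shape[1])   # start near zero
        for _ in range(iterations):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))       # P(target = 1) for each case
            grad = X.T @ (p - y)                     # gradient of -log likelihood
            w = w - stepsize * grad                  # one "batch" update
        return w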
Gradient descent with early stopping for the binary response problem.

We can try to prevent overfitting by choosing one of the networks found during training according to performance on a subset of the available training cases that we have excluded from the set used for the gradient descent training.  This training scheme can be implemented using the following commands:

    > net-spec blog.gdes 2 15 1 / - + + - + - +
    > model-spec blog.gdes binary
    > data-spec blog.gdes 2 1 2 / bdata.train@1:225 . bdata.train@226:300 .
    > net-gd blog.gdes 100 5 / 0.4 batch
    > net-gd blog.gdes 1000 50 / 0.4 batch
    > net-gd blog.gdes 20000 500 / 0.4 batch
    > net-plt t L blog.gdes | find-min

Of the 300 available training cases, the first three-quarters are used for the gradient descent training, while the last quarter is used to choose a network from those found during training.  These 75 validation cases are listed as "test" cases in the data-spec command above, though they are not true test cases.  This allows the best of the networks according to error on the validation set to be found using the net-plt command, in conjunction with find-min (documented in find-min.doc).  Note that to save computer time, one might wish to actually stop the training once it becomes apparent that further training is unlikely to find a better network, but that is not attempted here.

Three net-gd commands are used above so that networks can be saved to the log file more frequently early in training.  It sometimes happens that the best network according to validation error is from very early in the training process.  We would not wish to miss it as a result of saving too few networks in the early stages of training.

The final find-min command above outputs "3000" as the iteration that gives the best validation error.  We can use this network to make predictions for test cases, as below:

    > net-pred mpa blog.gdes 3000 / bdata.test .

    Number of iterations used: 1
    Number of test cases: 200

    Average log probability of targets:    -0.263+-0.043
    Fraction of guesses that were wrong:    0.1150+-0.0226

As can be seen, performance is considerably better than that obtained in the previous section by training for 100000 iterations.
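The selection performed by find-min in the pipeline above comes down to picking the saved iteration with the smallest validation error.  A minimal sketch of that logic, with made-up numbers (the real errors come from net-plt):

    def select_by_validation_error(saved):
        # saved: (iteration, validation error) pairs, one for each network
        # written to the log file during training
        best_iteration, _ = min(saved, key=lambda pair: pair[1])
        return best_iteration

    # For example, with made-up errors,
    #   select_by_validation_error([(100, 0.41), (3000, 0.24), (20000, 0.31)])
    # returns 3000, the iteration used for prediction above.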
Using an ensemble of networks trained by early stopping.

In the early stopping procedure just described, the use of the first three-quarters of the training data for estimation and the last one-quarter for validation is arbitrary.  Whenever a training procedure involves arbitrary or random choices, it is generally better (on average) to repeat the procedure several times with different choices, and then make predictions by averaging the predictions made by the networks in this "ensemble".  The following commands implement this idea for early stopping, by averaging over both the choice of which quarter of the data to use for validation, and over the random choice of initial weights:

    > net-spec blog.gdese1 2 15 1 / - + + - + - +
    > model-spec blog.gdese1 binary
    > data-spec blog.gdese1 2 1 2 / bdata.train@-226:300 . bdata.train@226:300 .
    > rand-seed blog.gdese1 1
    > net-gd blog.gdese1 100 5 / 0.4 batch
    > net-gd blog.gdese1 1000 50 / 0.4 batch
    > net-gd blog.gdese1 20000 500 / 0.4 batch
    > net-plt t L blog.gdese1 | find-min

    > net-spec blog.gdese2 2 15 1 / - + + - + - +
    > model-spec blog.gdese2 binary
    > data-spec blog.gdese2 2 1 2 / bdata.train@-151:225 . bdata.train@151:225 .
    > rand-seed blog.gdese2 2
    > net-gd blog.gdese2 100 5 / 0.4 batch
    > net-gd blog.gdese2 1000 50 / 0.4 batch
    > net-gd blog.gdese2 20000 500 / 0.4 batch
    > net-plt t L blog.gdese2 | find-min

    > net-spec blog.gdese3 2 15 1 / - + + - + - +
    > model-spec blog.gdese3 binary
    > data-spec blog.gdese3 2 1 2 / bdata.train@-76:150 . bdata.train@76:150 .
    > rand-seed blog.gdese3 3
    > net-gd blog.gdese3 100 5 / 0.4 batch
    > net-gd blog.gdese3 1000 50 / 0.4 batch
    > net-gd blog.gdese3 20000 500 / 0.4 batch
    > net-plt t L blog.gdese3 | find-min

    > net-spec blog.gdese4 2 15 1 / - + + - + - +
    > model-spec blog.gdese4 binary
    > data-spec blog.gdese4 2 1 2 / bdata.train@-1:75 . bdata.train@1:75 .
    > rand-seed blog.gdese4 4
    > net-gd blog.gdese4 100 5 / 0.4 batch
    > net-gd blog.gdese4 1000 50 / 0.4 batch
    > net-gd blog.gdese4 20000 500 / 0.4 batch
    > net-plt t L blog.gdese4 | find-min

The networks selected from the four training runs above as having lowest validation error are at iterations 3000, 950, 3500, and 11000.  We can now make predictions for test cases using these four networks, as follows:

    > net-pred mpa blog.gdese1 3000 blog.gdese2 950 \
               blog.gdese3 3500 blog.gdese4 11000 / bdata.test .

    Number of iterations used: 4
    Number of test cases: 200

    Average log probability of targets:    -0.263+-0.040
    Fraction of guesses that were wrong:    0.1050+-0.0217

The resulting classification performance is slightly better than was found using just the first of the four networks.
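For reference, the averaging that net-pred performs when given several log files can be pictured roughly as below.  This is only a sketch, assuming that each selected network's predictive probability of a target of 1 has already been computed for the test cases; the array name is hypothetical.

    import numpy as np

    def ensemble_predictions(prob_of_1):
        # prob_of_1: array of shape (n_networks, n_test_cases), giving each
        # selected network's probability that the target is 1
        p = prob_of_1.mean(axis=0)          # average the predictive probabilities
        guesses = (p > 0.5).astype(int)     # guess the more probable class
        return p, guesses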