EXAMPLES OF LEARNING WITH GRADIENT DESCENT, EARLY STOPPING & ENSEMBLES

Although this software is intended primarily to support research in Bayesian methods, I have also implemented traditional gradient-descent learning for neural networks. This allows easy comparison of traditional and Bayesian methods, and supports research into variations of traditional methods that may work better.

In particular, the software supports the "early stopping" technique. When a network is trained to a minimum of the error on the training set (minus the log likelihood), performance on test data is often bad, since the training data has been "overfit". To avoid this, many people instead use one of the networks from earlier in the training process, selected based on the error on a separate "validation set".

To do early stopping, the available training data must be partitioned into an "estimation set" and a "validation set". The network parameters (weights and biases) are randomly initialized to values close to zero, and then gradually changed (by gradient descent) so as to minimize error on the estimation set. The error on the validation set is computed for the networks found during this process, and the single network with minimum validation error is used to make predictions for future test cases.

The split into estimation and validation sets in this procedure seems a bit arbitrary and wasteful. To alleviate this problem, one can train several networks using early stopping, based on different splits of the data, and on different random initializations of the weights. Predictions for test cases are then made by averaging the predictions from the networks selected from each of these runs, a process somewhat analogous to the averaging done when making predictions for a Bayesian model based on a sample from the posterior distribution.
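The averaging of ensemble predictions can be sketched in a few lines of Python. This is an illustration of the idea only, not this software's code; the function name and the example probabilities are made up:

```python
# Illustrative sketch: combine an ensemble's predictions by averaging the
# predictive probabilities each network assigns to a response of 1.
def ensemble_predict(prob_lists):
    """prob_lists[k][i] is network k's predicted probability for test case i;
    the ensemble prediction for case i is the average over the k networks."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

ensemble_predict([[0.9, 0.2], [0.7, 0.4]])  # -> [0.8, 0.3] (up to rounding)
```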
Neural network training using early stopping and ensembles is described by Carl Rasmussen in his thesis (available at mlg.eng.cam.ac.uk/pub/pdf/Ras96b.pdf), and in a paper of mine on "Assessing relevance determination methods using DELVE" (see my web page).

To demonstrate how the software can be used to do gradient descent learning, early stopping, and prediction using ensembles, I will show here how these methods can be applied to the binary response problem used as an example earlier (see Ex-netgp-b.doc). The data and command files for these examples are in the "ex-gdes" directory.

Gradient descent learning for the binary response problem.

First, we will see how to use the software to set the network parameters so as to minimize error (minus the log likelihood) on the entire training set, or at least to get as close to the minimum error as we can using gradient descent optimization. We can then see how well this network performs on test data.

To start, we need to specify the network architecture. This is done using the same sort of command as is used for Bayesian networks, except that the prior specifications are simply "+" or "-", indicating whether the corresponding sets of weights are present or absent. The following command creates a network with two inputs, one layer of 15 hidden units, and one output unit, with input-hidden weights, hidden biases, hidden-output weights, and an output bias:

> net-spec blog.gd 2 15 1 / ih=+ bh=+ ho=+ bo=+

The "+" is actually translated to a very large number, with the result that the "prior" for these parameters has virtually no effect. Instead of a "+", one can put a positive number, s, which produces the effect of "weight decay" with penalty equal to the sum of the squares of these weights times 1/(2*s^2). More elaborate hierarchical priors are meaningless if training is to be done by gradient descent.

Next, we must specify the data model.
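As a check on the weight decay formula just given, here is a small Python sketch (an illustration of the penalty, not the software's internal code):

```python
def weight_decay_penalty(weights, s):
    """Penalty added to minus the log likelihood when the prior
    specification for a group of weights is the positive number s:
    the sum of squared weights times 1/(2*s^2)."""
    return sum(w * w for w in weights) / (2 * s * s)

weight_decay_penalty([1.0, -2.0], 1.0)  # -> (1 + 4) / 2 = 2.5
```

Note that a larger s gives a weaker penalty, consistent with "+" (a very large number) having virtually no effect.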
For this problem, the response is binary, and we use a model in which the probability of a response of 1 is found by passing the output of the network through the logistic function. The following command specifies this:

> model-spec blog.gd binary

When the response is real, a noise standard deviation of 1 would conventionally be used, which causes minus the log likelihood to be half the squared error.

The location of the data is specified as for Bayesian networks:

> data-spec blog.gd 2 1 2 / bdata.train .

For these examples, the 300 training cases are stored in bdata.train (one per line, with the two inputs coming first). The 1000 test cases are in bdata.test, but this is not mentioned above. The reason for doing things this way (rather than putting all the data in one file) will be apparent when we get to the examples using early stopping.

Finally, we can train the network using gradient descent, with the command:

> net-gd blog.gd 200000 1000 / 0.4 batch

This does 200000 iterations of "batch" gradient descent (ie, with each update based on all training cases), with networks being saved in the log file every 1000 iterations. The software also supports "on-line" gradient descent, which is often faster to converge initially, but does not reach the exact optimum. See net-gd.doc for details. The stepsize to use for gradient descent learning is specified as well; here it is 0.4. If learning is unstable (ie, the error sometimes goes up rather than down), the stepsize will have to be reduced. Net-gd does not try to determine relative stepsizes itself, but stepsizes for groups of parameters can be set individually.

The above command takes 7.2 seconds on the system used (see Ex-test-system.doc). While waiting (if you have a slower computer), you can monitor progress using net-plt. For example, the progress of the training error can be viewed with the command:

> net-plt t l blog.gd | plot

Individual networks can be displayed using net-display.
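The model and training procedure just described can be sketched in Python (an illustration of the ideas, not of this software's internals): the probability of a response of 1 is the logistic of the network output, minus the log likelihood is the error being minimized, and each batch update moves the parameters down the gradient of that error over all training cases. Here a single-weight "network" with output w*x stands in for the real one:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def minus_log_lik(w, inputs, targets):
    """Minus the log likelihood for the binary model: the probability
    of a response of 1 is the logistic of the network output w*x."""
    total = 0.0
    for x, t in zip(inputs, targets):
        p = logistic(w * x)
        total -= math.log(p) if t == 1 else math.log(1.0 - p)
    return total

def batch_gd_step(w, inputs, targets, stepsize):
    """One batch gradient-descent update, using all training cases.
    The derivative of minus the log likelihood with respect to w is
    the sum over cases of (logistic(w*x) - t) * x."""
    grad = sum((logistic(w * x) - t) * x for x, t in zip(inputs, targets))
    return w - stepsize * grad

# Repeated batch updates reduce the error on these (made-up) cases.
w = 0.0
for _ in range(100):
    w = batch_gd_step(w, [1.0, 2.0, -1.0], [1, 1, 0], 0.4)
```

Too large a stepsize would make such updates overshoot, with the error sometimes going up rather than down, which is the instability mentioned above.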
For example,

> net-display -p blog.gd 1000

displays the network at iteration 1000. The "-p" option shows the parameter values only, suppressing the hyperparameter values, which are meaningless with this model.

Once training has finished, we can make predictions based on the last network, which should have the lowest training error. The following command prints a summary of performance at predicting the cases in bdata.test, both in terms of the log probability assigned to the correct target value and in terms of the error rate when guessing the target:

> net-pred mpa blog.gd 200000 / bdata.test .

Number of iterations used: 1
Number of test cases: 1000
Average log probability of targets:   -0.856+-0.102
Fraction of guesses that were wrong:   0.1320+-0.0107

Performance is substantially worse than that obtained using Bayesian training. This is due to "overfitting". The following command illustrates the problem by plotting how the training error and the error on the test cases changed during the run:

> net-plt t lL blog.gd / bdata.test . | plot

From this plot (viewable in blog-lL.png), it is clear that we would have been better off to stop training earlier than 200000 iterations. Of course, we can't stop training based on the test error, since (in a real application) we don't know the test targets when training.

Gradient descent with early stopping for the binary response problem.

We can try to prevent overfitting by choosing one of the networks found during training according to performance on a subset of the available training cases that we have excluded from the set used for the gradient descent training. This training scheme can be implemented using the following commands:

> net-spec blog.gdes 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdes binary
> data-spec blog.gdes 2 1 2 / bdata.train@1:225 . bdata.train@226:300 .
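The two figures that net-pred reports can be described precisely with a short Python sketch (my illustration of what the numbers mean, not net-pred's code; standard errors are omitted):

```python
import math

def prediction_summary(probs_of_one, targets):
    """For binary targets: the average log probability assigned to the
    true targets, and the error rate when guessing whichever response
    has predicted probability above 1/2."""
    logp = [math.log(p if t == 1 else 1.0 - p)
            for p, t in zip(probs_of_one, targets)]
    wrong = sum((p > 0.5) != (t == 1)
                for p, t in zip(probs_of_one, targets))
    return sum(logp) / len(logp), wrong / len(targets)

prediction_summary([0.9, 0.2], [1, 1])  # second case is guessed wrongly
```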
> net-gd blog.gdes 20000 10 / 0.4 batch
> net-plt t L blog.gdes | find-min

Of the 300 available training cases, the first three-quarters are used for the gradient descent training, while the last quarter is used to choose a network from those found during training. These 75 validation cases are listed as "test" cases in the data-spec command above, though they are not true test cases. This allows the best of the networks according to error on the validation set to be found using the net-plt command, in conjunction with find-min (documented in find-min.doc).

Note that to save computer time, one might wish to actually stop the training once it becomes apparent that further training is unlikely to find a better network, but that is not attempted here. (Stopping as soon as the validation error goes up even a bit is a bad idea, since it might go down again later.)

The final find-min command above outputs "1190" as the iteration that gives the best validation error. We can use this network to make predictions for test cases, as below:

> net-pred mpa blog.gdes 1190 / bdata.test .

Number of iterations used: 1
Number of test cases: 1000
Average log probability of targets:   -0.282+-0.020
Fraction of guesses that were wrong:   0.1270+-0.0105

As can be seen, performance is considerably better than that obtained in the previous section by training for 200000 iterations, though still not as good as with Bayesian training (see Ex-netgp-b.doc).

Using an ensemble of networks trained by early stopping.

In the early stopping procedure just described, the use of the first three-quarters of the training data for estimation and the last one-quarter for validation is arbitrary. Whenever a training procedure involves arbitrary or random choices, it is generally better (on average) to repeat the procedure several times with different choices, and then make predictions by averaging the predictions made by the networks in this "ensemble".
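What the find-min step accomplishes can be sketched as follows (an illustrative stand-in for piping net-plt's output through find-min, not the actual programs):

```python
def find_min_iteration(pairs):
    """Given (iteration, validation error) pairs such as net-plt prints,
    return the iteration at which the validation error is smallest."""
    best_iter, _ = min(pairs, key=lambda p: p[1])
    return best_iter

find_min_iteration([(1180, 0.31), (1190, 0.29), (1200, 0.30)])  # -> 1190
```

The error values here are made up; only networks actually saved in the log file (every 10 iterations, with the net-gd command above) are candidates.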
The following commands implement this idea for early stopping, by averaging over both the choice of which quarter of the data to use for validation, and over the random choice of initial weights:

> net-spec blog.gdese1 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese1 binary
> data-spec blog.gdese1 2 1 2 / bdata.train@-226:300 . bdata.train@226:300 .
> rand-seed blog.gdese1 1
> net-gd blog.gdese1 20000 10 / 0.4 batch
> net-plt t L blog.gdese1 | find-min

> net-spec blog.gdese2 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese2 binary
> data-spec blog.gdese2 2 1 2 / bdata.train@-151:225 . bdata.train@151:225 .
> rand-seed blog.gdese2 2
> net-gd blog.gdese2 20000 10 / 0.4 batch
> net-plt t L blog.gdese2 | find-min

> net-spec blog.gdese3 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese3 binary
> data-spec blog.gdese3 2 1 2 / bdata.train@-76:150 . bdata.train@76:150 .
> rand-seed blog.gdese3 3
> net-gd blog.gdese3 20000 10 / 0.4 batch
> net-plt t L blog.gdese3 | find-min

> net-spec blog.gdese4 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese4 binary
> data-spec blog.gdese4 2 1 2 / bdata.train@-1:75 . bdata.train@1:75 .
> rand-seed blog.gdese4 4
> net-gd blog.gdese4 20000 10 / 0.4 batch
> net-plt t L blog.gdese4 | find-min

The networks selected from the four training runs above as having lowest validation error are at iterations 1190, 880, 1780, and 12720. We can now make predictions for test cases using these four networks, as follows:

> net-pred mpa blog.gdese1 1190 blog.gdese2 880 \
           blog.gdese3 1780 blog.gdese4 12720 / bdata.test .

Number of iterations used: 4
Number of test cases: 1000
Average log probability of targets:   -0.266+-0.018
Fraction of guesses that were wrong:   0.1210+-0.0103

The resulting classification performance is a bit better than was found using just the first of the four networks. But it's still not as good as with Bayesian training.
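The four validation ranges used above follow a simple pattern, which can be generated mechanically. The sketch below (illustrative only; data-spec itself just takes the ranges directly) produces the 1-based inclusive ranges matching the "@first:last" specifications:

```python
def quarter_splits(n, k=4):
    """Hold out each successive k-th of n training cases for validation,
    in the same order as the runs above (last quarter first). Returns
    1-based inclusive (first, last) ranges; the remaining cases form
    the estimation set ("@-first:last" in data-spec notation)."""
    size = n // k
    splits = []
    for i in range(k, 0, -1):
        splits.append(((i - 1) * size + 1, i * size))
    return splits

quarter_splits(300)  # -> [(226, 300), (151, 225), (76, 150), (1, 75)]
```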
Interestingly, one can get slightly better results using networks from 1000 iterations later in each run than those chosen based on the validation error, as illustrated here:

> net-pred mpa blog.gdese1 2190 blog.gdese2 1880 \
           blog.gdese3 2780 blog.gdese4 13720 / bdata.test .

Number of iterations used: 4
Number of test cases: 1000
Average log probability of targets:   -0.265+-0.020
Fraction of guesses that were wrong:   0.1170+-0.0102

This may be because the selection of a single network based on a validation set isn't necessarily optimal when predictions are actually made by an ensemble. (Though at least for this dataset, results using early stopping with just the first split are also a bit better when using the network 1000 iterations past the validation error minimum.)