EXAMPLES OF LEARNING WITH GRADIENT DESCENT, EARLY STOPPING & ENSEMBLES

Although this software is intended primarily to support research in Bayesian methods, I have also implemented traditional gradient-descent learning for neural networks. This allows easy comparison of traditional and Bayesian methods, and supports research into variations of traditional methods that may work better.

In particular, the software supports the "early stopping" technique. When a network is trained to a minimum of the error on the training set (minus the log likelihood), performance on test data is often bad, since the training data has been "overfit". To avoid this, many people instead use one of the networks from earlier in the training process, selected based on the error on a separate "validation set".

To do early stopping, the available training data must be partitioned into an "estimation set" and a "validation set". The network parameters (weights and biases) are randomly initialized to values close to zero, and then gradually changed (by gradient descent) so as to minimize error on the estimation set. The error on the validation set is computed for the networks found during this process, and the single network with minimum validation error is used to make predictions for future test cases.

The split into estimation and validation sets in this procedure seems a bit arbitrary and wasteful. To alleviate this problem, one can train several networks using early stopping, based on different splits of the data, and on different random initializations of the weights. Predictions for test cases are then made by averaging the predictions from the networks selected from each of these runs, a process somewhat analogous to the averaging done when making predictions for a Bayesian model based on a sample from the posterior distribution.
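The averaging of ensemble predictions can be sketched in a few lines of Python. This is an illustration of the idea only, not this software's code; the function name and the example probabilities are made up:

```python
# Illustrative sketch: combine an ensemble's predictions by averaging the
# predictive probabilities each network assigns to a response of 1.
def ensemble_predict(prob_lists):
    """prob_lists[k][i] is network k's predicted probability for test case i;
    the ensemble prediction for case i is the average over the k networks."""
    n = len(prob_lists)
    return [sum(ps) / n for ps in zip(*prob_lists)]

ensemble_predict([[0.9, 0.2], [0.7, 0.4]])  # -> [0.8, 0.3] (up to rounding)
```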
Neural network training using early stopping and ensembles is described by Carl Rasmussen in his thesis (available at mlg.eng.cam.ac.uk/pub/pdf/Ras96b.pdf), and in a paper of mine on "Assessing relevance determination methods using DELVE" (see my web page).

To demonstrate how the software can be used to do gradient descent learning, early stopping, and prediction using ensembles, I will show here how these methods can be applied to the binary response problem used as an example earlier (see Ex-netgp-b.doc). The data and command files for these examples are in the "ex-gdes" directory.

Gradient descent learning for the binary response problem.

First, we will see how to use the software to set the network parameters so as to minimize error (minus the log likelihood) on the entire training set, or at least to get as close to the minimum error as we can using gradient descent optimization. We can then see how well this network performs on test data.

To start, we need to specify the network architecture. This is done using the same sort of command as is used for Bayesian networks, except that the prior specifications are simply "+" or "-", indicating whether the corresponding sets of weights are present or absent. The following command creates a network with two inputs, one layer of 15 hidden units, and one output unit, with input-hidden weights, hidden biases, hidden-output weights, and an output bias:

> net-spec blog.gd 2 15 1 / ih=+ bh=+ ho=+ bo=+

The "+" is actually translated to a very large number, with the result that the "prior" for these parameters has virtually no effect. Instead of a "+", one can put a positive number, s, which produces the effect of "weight decay" with penalty equal to the sum of the squares of these weights times 1/(2*s^2). More elaborate hierarchical priors are meaningless if training is to be done by gradient descent.

Next, we must specify the data model.
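As a check on the weight decay formula just given, here is a small Python sketch (an illustration of the penalty, not the software's internal code):

```python
def weight_decay_penalty(weights, s):
    """Penalty added to minus the log likelihood when the prior
    specification for a group of weights is the positive number s:
    the sum of squared weights times 1/(2*s^2)."""
    return sum(w * w for w in weights) / (2 * s * s)

weight_decay_penalty([1.0, -2.0], 1.0)  # -> (1 + 4) / 2 = 2.5
```

Note that a larger s gives a weaker penalty, consistent with "+" (a very large number) having virtually no effect.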
For this problem, the response is binary, and we use a model in which the probability of a response of 1 is found by passing the output of the network through the logistic function. The following command specifies this:

> model-spec blog.gd binary

When the response is real, a noise standard deviation of 1 would conventionally be used, which causes minus the log likelihood to be half the squared error.

The location of the data is specified as for Bayesian networks:

> data-spec blog.gd 2 1 2 / bdata.train .

For these examples, the 300 training cases are stored in bdata.train (one per line, with the two inputs coming first). The 1000 test cases are in bdata.test, but this is not mentioned above. The reason for doing things this way (rather than putting all the data in one file) will be apparent when we get to the examples using early stopping.

Finally, we can train the network using gradient descent, with the command:

> net-gd blog.gd 200000 1000 / 0.4 batch

This does 200000 iterations of "batch" gradient descent (ie, with each update based on all training cases), with networks being saved in the log file every 1000 iterations. The software also supports "on-line" gradient descent, which is often faster to converge initially, but does not reach the exact optimum. See net-gd.doc for details. The stepsize to use for gradient descent learning is specified as well; here it is 0.4. If learning is unstable (ie, the error sometimes goes up rather than down), the stepsize will have to be reduced. Net-gd does not try to determine relative stepsizes itself, but stepsizes for groups of parameters can be set individually.

The above command takes 7.2 seconds on the system used (see Ex-test-system.doc). While waiting (if you have a slower computer), you can monitor progress using net-plt. For example, the progress of the training error can be viewed with the command:

> net-plt t l blog.gd | plot

Individual networks can be displayed using net-display.
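The model and training procedure just described can be sketched in Python (an illustration of the ideas, not of this software's internals): the probability of a response of 1 is the logistic of the network output, minus the log likelihood is the error being minimized, and each batch update moves the parameters down the gradient of that error over all training cases. Here a single-weight "network" with output w*x stands in for the real one:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def minus_log_lik(w, inputs, targets):
    """Minus the log likelihood for the binary model: the probability
    of a response of 1 is the logistic of the network output w*x."""
    total = 0.0
    for x, t in zip(inputs, targets):
        p = logistic(w * x)
        total -= math.log(p) if t == 1 else math.log(1.0 - p)
    return total

def batch_gd_step(w, inputs, targets, stepsize):
    """One batch gradient-descent update, using all training cases.
    The derivative of minus the log likelihood with respect to w is
    the sum over cases of (logistic(w*x) - t) * x."""
    grad = sum((logistic(w * x) - t) * x for x, t in zip(inputs, targets))
    return w - stepsize * grad

# Repeated batch updates reduce the error on these (made-up) cases.
w = 0.0
for _ in range(100):
    w = batch_gd_step(w, [1.0, 2.0, -1.0], [1, 1, 0], 0.4)
```

Too large a stepsize would make such updates overshoot, with the error sometimes going up rather than down, which is the instability mentioned above.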
For example,

> net-display -p blog.gd 1000

displays the network at iteration 1000. The "-p" option shows the parameter values only, suppressing the hyperparameter values, which are meaningless with this model.

Once training has finished, we can make predictions based on the last network, which should have the lowest training error. The following command prints a summary of performance at predicting the cases in bdata.test, both in terms of the log probability assigned to the correct target value and in terms of the error rate when guessing the target:

> net-pred mpa blog.gd 200000 / bdata.test .

Number of iterations used: 1
Number of test cases: 1000
Average log probability of targets:   -0.856+-0.102
Fraction of guesses that were wrong:   0.1320+-0.0107

Performance is substantially worse than that obtained using Bayesian training. This is due to "overfitting". The following command illustrates the problem by plotting how the training error and the error on the test cases changed during the run:

> net-plt t lL blog.gd / bdata.test . | plot

From this plot (viewable in blog-lL.png), it is clear that we would have been better off to stop training earlier than 200000 iterations. Of course, we can't stop training based on the test error, since (in a real application) we don't know the test targets when training.

Gradient descent with early stopping for the binary response problem.

We can try to prevent overfitting by choosing one of the networks found during training according to performance on a subset of the available training cases that we have excluded from the set used for the gradient descent training. This training scheme can be implemented using the following commands:

> net-spec blog.gdes 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdes binary
> data-spec blog.gdes 2 1 2 / bdata.train@1:225 . bdata.train@226:300 .
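The two figures that net-pred reports can be described precisely with a short Python sketch (my illustration of what the numbers mean, not net-pred's code; standard errors are omitted):

```python
import math

def prediction_summary(probs_of_one, targets):
    """For binary targets: the average log probability assigned to the
    true targets, and the error rate when guessing whichever response
    has predicted probability above 1/2."""
    logp = [math.log(p if t == 1 else 1.0 - p)
            for p, t in zip(probs_of_one, targets)]
    wrong = sum((p > 0.5) != (t == 1)
                for p, t in zip(probs_of_one, targets))
    return sum(logp) / len(logp), wrong / len(targets)

prediction_summary([0.9, 0.2], [1, 1])  # second case is guessed wrongly
```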
> net-gd blog.gdes 20000 10 / 0.4 batch
> net-plt t L blog.gdes | find-min

Of the 300 available training cases, the first three-quarters are used for the gradient descent training, while the last quarter is used to choose a network from those found during training. These 75 validation cases are listed as "test" cases in the data-spec command above, though they are not true test cases. This allows the best of the networks according to error on the validation set to be found using the net-plt command, in conjunction with find-min (documented in find-min.doc).

Note that to save computer time, one might wish to actually stop the training once it becomes apparent that further training is unlikely to find a better network, but that is not attempted here. (Stopping as soon as the validation error goes up even a bit is a bad idea, since it might go down again later.)

The final find-min command above outputs "1190" as the iteration that gives the best validation error. We can use this network to make predictions for test cases, as below:

> net-pred mpa blog.gdes 1190 / bdata.test .

Number of iterations used: 1
Number of test cases: 1000
Average log probability of targets:   -0.282+-0.020
Fraction of guesses that were wrong:   0.1270+-0.0105

As can be seen, performance is considerably better than that obtained in the previous section by training for 200000 iterations, though still not as good as with Bayesian training (see Ex-netgp-b.doc).

Using an ensemble of networks trained by early stopping.

In the early stopping procedure just described, the use of the first three-quarters of the training data for estimation and the last one-quarter for validation is arbitrary. Whenever a training procedure involves arbitrary or random choices, it is generally better (on average) to repeat the procedure several times with different choices, and then make predictions by averaging the predictions made by the networks in this "ensemble".
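What the find-min step accomplishes can be sketched as follows (an illustrative stand-in for piping net-plt's output through find-min, not the actual programs):

```python
def find_min_iteration(pairs):
    """Given (iteration, validation error) pairs such as net-plt prints,
    return the iteration at which the validation error is smallest."""
    best_iter, _ = min(pairs, key=lambda p: p[1])
    return best_iter

find_min_iteration([(1180, 0.31), (1190, 0.29), (1200, 0.30)])  # -> 1190
```

The error values here are made up; only networks actually saved in the log file (every 10 iterations, with the net-gd command above) are candidates.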
The following commands implement this idea for early stopping, by averaging over both the choice of which quarter of the data to use for validation, and over the random choice of initial weights:

> net-spec blog.gdese1 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese1 binary
> data-spec blog.gdese1 2 1 2 / bdata.train@-226:300 . bdata.train@226:300 .
> rand-seed blog.gdese1 1
> net-gd blog.gdese1 20000 10 / 0.4 batch
> net-plt t L blog.gdese1 | find-min

> net-spec blog.gdese2 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese2 binary
> data-spec blog.gdese2 2 1 2 / bdata.train@-151:225 . bdata.train@151:225 .
> rand-seed blog.gdese2 2
> net-gd blog.gdese2 20000 10 / 0.4 batch
> net-plt t L blog.gdese2 | find-min

> net-spec blog.gdese3 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese3 binary
> data-spec blog.gdese3 2 1 2 / bdata.train@-76:150 . bdata.train@76:150 .
> rand-seed blog.gdese3 3
> net-gd blog.gdese3 20000 10 / 0.4 batch
> net-plt t L blog.gdese3 | find-min

> net-spec blog.gdese4 2 15 1 / ih=+ bh=+ ho=+ bo=+
> model-spec blog.gdese4 binary
> data-spec blog.gdese4 2 1 2 / bdata.train@-1:75 . bdata.train@1:75 .
> rand-seed blog.gdese4 4
> net-gd blog.gdese4 20000 10 / 0.4 batch
> net-plt t L blog.gdese4 | find-min

The networks selected from the four training runs above as having lowest validation error are at iterations 1190, 880, 1780, and 12720. We can now make predictions for test cases using these four networks, as follows:

> net-pred mpa blog.gdese1 1190 blog.gdese2 880 \
           blog.gdese3 1780 blog.gdese4 12720 / bdata.test .

Number of iterations used: 4
Number of test cases: 1000
Average log probability of targets:   -0.266+-0.018
Fraction of guesses that were wrong:   0.1210+-0.0103

The resulting classification performance is a bit better than was found using just the first of the four networks. But it's still not as good as with Bayesian training.
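The four validation ranges used above follow a simple pattern, which can be generated mechanically. The sketch below (illustrative only; data-spec itself just takes the ranges directly) produces the 1-based inclusive ranges matching the "@first:last" specifications:

```python
def quarter_splits(n, k=4):
    """Hold out each successive k-th of n training cases for validation,
    in the same order as the runs above (last quarter first). Returns
    1-based inclusive (first, last) ranges; the remaining cases form
    the estimation set ("@-first:last" in data-spec notation)."""
    size = n // k
    splits = []
    for i in range(k, 0, -1):
        splits.append(((i - 1) * size + 1, i * size))
    return splits

quarter_splits(300)  # -> [(226, 300), (151, 225), (76, 150), (1, 75)]
```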
Interestingly, one can get slightly better results using networks from 1000 iterations later in each run than those chosen based on the validation error, as illustrated here:

> net-pred mpa blog.gdese1 2190 blog.gdese2 1880 \
           blog.gdese3 2780 blog.gdese4 13720 / bdata.test .

Number of iterations used: 4
Number of test cases: 1000
Average log probability of targets:   -0.265+-0.020
Fraction of guesses that were wrong:   0.1170+-0.0102

This may be because the selection of a single network based on a validation set isn't necessarily optimal when predictions are actually made by an ensemble. (Though at least for this dataset, results using early stopping with just the first split are also a bit better when using the network 1000 iterations past the validation error minimum.)