EXAMPLES OF LEARNING WITH GRADIENT DESCENT, EARLY STOPPING & ENSEMBLES

Although this software is intended primarily to support research in Bayesian methods, I have also implemented traditional gradient-descent learning for neural networks. This allows easy comparisons of traditional and Bayesian methods, and supports research into variations of traditional methods that may work better.

In particular, the software supports the "early stopping" technique. When a network is trained to a minimum of the error (ie, minus the log likelihood) on the training set, performance on test data is often bad, since the training data has been "overfit". To avoid this, many people use one of the networks from earlier in the training process, selected based on the error on a separate "validation set".

To do early stopping, the available training data must be partitioned into an "estimation set" and a "validation set". The network parameters (weights and biases) are randomly initialized to values close to zero, and then gradually changed (by gradient descent) so as to minimize error on the estimation set. The error on the validation set is computed for the networks found during this process, and the single network with minimum validation error is used to make predictions for future test cases.

The split into estimation and validation sets in this procedure seems a bit arbitrary and wasteful. To alleviate this problem, one can train several networks using early stopping, based on different splits, and on different random initializations of the weights. Predictions for test cases are then made by averaging the predictions from the networks selected from each of these runs, a process somewhat analogous to the averaging done when making predictions for a Bayesian model based on a sample from the posterior distribution.

Neural network training using early stopping and ensembles is described by Carl Rasmussen in his thesis (available in Postscript from http://www.cs.utoronto.ca/~carl), and in a paper of mine on "Assessing relevance determination methods using DELVE" (see my web page).

To demonstrate how the software can be used to do gradient descent learning, early stopping, and prediction using ensembles, I will show here how these methods can be applied to the binary response problem used as an example earlier (see Ex-netgp-b.doc). The data and command files for these examples are in the "ex-gdes" directory.


Gradient descent learning for the binary response problem.

First, we will see how to use the software to set the network parameters so as to minimize error (minus the log likelihood) on the entire training set, or at least to get as close to the minimum error as we can using gradient descent optimization. We can then see how well this network performs on test data.

To start, we need to specify the network architecture. This is done using the same sort of command as is used for Bayesian networks, except that the prior specifications are simply "+" or "-", indicating whether the corresponding sets of weights are present or absent. The following command creates a network with two inputs, one layer of 15 hidden units, and one output unit, with input-hidden weights, hidden biases, hidden-output weights, and an output bias:

    > net-spec blog.gd 2 15 1 / - + + - + - +
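For concreteness, the function computed by such a network can be sketched in a few lines of Python. This is purely illustrative and not part of the software; it assumes tanh hidden units (the activation used for hidden units in this software), and the parameter names are made up for the sketch:

    import numpy as np

    def net_2_15_1(x, W_ih, b_h, w_ho, b_o):
        # Hidden unit values: tanh of the hidden biases plus the
        # weighted inputs (W_ih is 15x2, x has length 2, b_h length 15).
        h = np.tanh(W_ih @ x + b_h)
        # Network output: output bias plus the weighted hidden unit
        # values (w_ho has length 15, b_o is a scalar).
        return float(w_ho @ h + b_o)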
The "+" entries in the net-spec command above are actually translated to a very large number, with the result that the "prior" for these parameters has virtually no effect. Instead of a "+", one can put a positive number, s, which produces the effect of "weight decay", with a penalty equal to the sum of the squares of these weights times 1/(2*s^2). More elaborate hierarchical priors are meaningless if training is to be done by gradient descent.

Next, we must specify the data model. For this problem, the response is binary, and we use a model in which the probability of a response of 1 is found by passing the output of the network through the logistic function. The following command specifies this:

    > model-spec blog.gd binary

When the response is real-valued, a noise standard deviation of 1 would conventionally be used, which makes minus the log likelihood equal to half the squared error (up to an additive constant).

The location of the data is specified as for Bayesian networks:

    > data-spec blog.gd 2 1 2 / bdata.train .

For these examples, the 300 training cases are stored in bdata.train (one per line, with the two inputs coming first). The 200 test cases are in bdata.test, but this is not mentioned above. The reason for doing things this way (rather than putting all the data in one file) will be apparent when we get to the examples using early stopping.

Finally, we can train the network using gradient descent, with the command:

    > net-gd blog.gd 100000 1000 / 0.4 batch

This does 100000 iterations of "batch" gradient descent (ie, with each update based on all training cases), with networks being saved in the log file every 1000 iterations. The software also supports "on-line" gradient descent, which often converges faster initially, but does not reach the exact optimum. See net-gd.doc for details.

The stepsize to use for gradient descent learning is specified as well; here it is 0.4. If learning is unstable (ie, the error sometimes goes up rather than down), the stepsize will have to be reduced. Net-gd does not try to determine relative stepsizes itself, but stepsizes for groups of parameters can be set individually.

The above command takes about 10 minutes on our machine. While waiting, you can monitor progress using net-plt. For example, the progress of the training error can be viewed with the command:

    > net-plt t l blog.gd | plot

Individual networks can be displayed using net-display. For example,

    > net-display -w blog.gd 1000

displays the network at iteration 1000. The "-w" option suppresses the hyperparameter values, which are meaningless with this model.

Once training has finished, we can make predictions based on the last network, which should have the lowest training error. The following command prints a summary of performance at predicting the cases in bdata.test, both in terms of the log probability assigned to the correct target value and in terms of the error rate when guessing the target:

    > net-pred mpa blog.gd 100000 / bdata.test .

    Number of iterations used: 1

    Number of test cases: 200

    Average log probability of targets:     -0.364+-0.066
    Fraction of guesses that were wrong:     0.1450+-0.0250

Performance is substantially worse than that obtained using Bayesian training. This is due to "overfitting". The following command illustrates the problem by plotting the change during the run of the training error and the error on the test cases:

    > net-plt t lL blog.gd / bdata.test . | plot

From this plot, it is clear that we would have been better off to stop training earlier than 100000 iterations. Of course, we can't stop training based on the test error plotted above, since we don't know the test targets when training.
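As an aside, the kind of computation that batch gradient descent performs here can be sketched in Python as follows. This is not the net-gd implementation; for simplicity, a plain logistic model on the inputs stands in for the 2-15-1 network, and the function and variable names are made up for the sketch:

    import numpy as np

    def neg_log_lik_and_grad(w, X, y):
        # Minus the log likelihood for a binary model with a logistic
        # output, and its gradient, summed over all training cases.
        p = 1.0 / (1.0 + np.exp(-(X @ w)))     # P(target = 1) for each case
        nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        return nll, X.T @ (p - y)

    def batch_gradient_descent(w, X, y, stepsize=0.4,
                               iters=100000, save_every=1000):
        # Each update uses the gradient computed from all training cases
        # ("batch").  Parameters are saved every save_every iterations,
        # much as net-gd saves networks to its log file.
        saved = {}
        for t in range(1, iters + 1):
            _, grad = neg_log_lik_and_grad(w, X, y)
            w = w - stepsize * grad
            if t % save_every == 0:
                saved[t] = w.copy()
        return saved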
Gradient descent with early stopping for the binary response problem.

We can try to prevent overfitting by choosing one of the networks found during training according to its performance on a subset of the available training cases that has been excluded from the set used for the gradient descent training. This training scheme can be implemented using the following commands:

    > net-spec blog.gdes 2 15 1 / - + + - + - +
    > model-spec blog.gdes binary
    > data-spec blog.gdes 2 1 2 / bdata.train@1:225 . bdata.train@226:300 .
    > net-gd blog.gdes 100 5 / 0.4 batch
    > net-gd blog.gdes 1000 50 / 0.4 batch
    > net-gd blog.gdes 20000 500 / 0.4 batch
    > net-plt t L blog.gdes | find-min

Of the 300 available training cases, the first three-quarters are used for the gradient descent training, while the last quarter are used to choose a network from those found during training. These 75 validation cases are listed as "test" cases in the data-spec command above, though they are not true test cases. This allows the best of the networks, according to error on the validation set, to be found using the net-plt command in conjunction with find-min (documented in find-min.doc). Note that to save computer time, one might wish to actually stop the training once it becomes apparent that further training is unlikely to find a better network, but that is not attempted here.

Three net-gd commands are used above so that networks are saved to the log file more frequently early in training. It sometimes happens that the best network according to validation error is from very early in the training process, and we would not wish to miss it as a result of saving too few networks in the early stages of training.

The final find-min command above outputs "3000" as the iteration that gives the best validation error. We can use this network to make predictions for test cases, as below:

    > net-pred mpa blog.gdes 3000 / bdata.test .

    Number of iterations used: 1

    Number of test cases: 200

    Average log probability of targets:     -0.263+-0.043
    Fraction of guesses that were wrong:     0.1150+-0.0226

As can be seen, performance is considerably better than that obtained in the previous section by training for 100000 iterations.
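Stripped of the details of the software, the selection rule used above is just "keep the saved network whose validation error is smallest". A minimal Python sketch, with hypothetical names, is:

    def best_by_validation(saved_nets, validation_error):
        # saved_nets maps iteration number -> network parameters;
        # validation_error(params) computes the error (minus the log
        # likelihood) on the validation cases -- cases 226:300 of
        # bdata.train in the run above.  The iteration returned
        # corresponds to what "net-plt t L ... | find-min" reports.
        errors = {it: validation_error(p) for it, p in saved_nets.items()}
        best = min(errors, key=errors.get)
        return best, saved_nets[best]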
Using an ensemble of networks trained by early stopping.

In the early stopping procedure just described, the use of the first three-quarters of the training data for estimation and the last one-quarter for validation is arbitrary. Whenever a training procedure involves arbitrary or random choices, it is generally better (on average) to repeat the procedure several times with different choices, and then make predictions by averaging the predictions made by the networks in this "ensemble". The following commands implement this idea for early stopping, by averaging over both the choice of which quarter of the data to use for validation, and over the random choice of initial weights:

    > net-spec blog.gdese1 2 15 1 / - + + - + - +
    > model-spec blog.gdese1 binary
    > data-spec blog.gdese1 2 1 2 / bdata.train@-226:300 . bdata.train@226:300 .
    > rand-seed blog.gdese1 1
    > net-gd blog.gdese1 100 5 / 0.4 batch
    > net-gd blog.gdese1 1000 50 / 0.4 batch
    > net-gd blog.gdese1 20000 500 / 0.4 batch
    > net-plt t L blog.gdese1 | find-min

    > net-spec blog.gdese2 2 15 1 / - + + - + - +
    > model-spec blog.gdese2 binary
    > data-spec blog.gdese2 2 1 2 / bdata.train@-151:225 . bdata.train@151:225 .
    > rand-seed blog.gdese2 2
    > net-gd blog.gdese2 100 5 / 0.4 batch
    > net-gd blog.gdese2 1000 50 / 0.4 batch
    > net-gd blog.gdese2 20000 500 / 0.4 batch
    > net-plt t L blog.gdese2 | find-min

    > net-spec blog.gdese3 2 15 1 / - + + - + - +
    > model-spec blog.gdese3 binary
    > data-spec blog.gdese3 2 1 2 / bdata.train@-76:150 . bdata.train@76:150 .
    > rand-seed blog.gdese3 3
    > net-gd blog.gdese3 100 5 / 0.4 batch
    > net-gd blog.gdese3 1000 50 / 0.4 batch
    > net-gd blog.gdese3 20000 500 / 0.4 batch
    > net-plt t L blog.gdese3 | find-min

    > net-spec blog.gdese4 2 15 1 / - + + - + - +
    > model-spec blog.gdese4 binary
    > data-spec blog.gdese4 2 1 2 / bdata.train@-1:75 . bdata.train@1:75 .
    > rand-seed blog.gdese4 4
    > net-gd blog.gdese4 100 5 / 0.4 batch
    > net-gd blog.gdese4 1000 50 / 0.4 batch
    > net-gd blog.gdese4 20000 500 / 0.4 batch
    > net-plt t L blog.gdese4 | find-min

The networks selected from the four training runs above as having lowest validation error are at iterations 3000, 950, 3500, and 11000. We can now make predictions for test cases using these four networks, as follows:

    > net-pred mpa blog.gdese1 3000 blog.gdese2 950 \
               blog.gdese3 3500 blog.gdese4 11000 / bdata.test .

    Number of iterations used: 4

    Number of test cases: 200

    Average log probability of targets:     -0.263+-0.040
    Fraction of guesses that were wrong:     0.1050+-0.0217

The resulting classification performance is slightly better than was found using just the first of the four networks.
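The averaging that this ensemble prediction performs can be sketched in Python as follows. This is only an illustration of the idea, with hypothetical names, not how net-pred is implemented:

    import numpy as np

    def ensemble_probability(prob_fns, x):
        # prob_fns: one function per selected network (here, the networks
        # at iterations 3000, 950, 3500, and 11000), each mapping an input
        # x to the predicted probability that the target is 1.  The
        # ensemble prediction is the average of these probabilities; a
        # guess is then made by seeing whether the average exceeds 1/2.
        p = float(np.mean([f(x) for f in prob_fns]))
        return p, (1 if p > 0.5 else 0)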