EXAMPLES OF BAYESIAN NETWORK TRAINING

This section shows how Bayesian training for neural network models can
be done for three simple synthetic problems.

The output shown below was obtained by running the software on our
machine, with ">" at the start of a line indicating a command line that
was input.  It is possible (even likely) that your results will differ,
even if you have installed the software correctly, since small
differences in floating point arithmetic can be magnified into large
differences in the course of the simulation.  However, unless one of
the simulations became stuck in an isolated local mode, the final
predictions you obtain from 'net-pred' should be close to those
reported below.

All the data sets mentioned here are present in the 'examples'
sub-directory, along with the C source of the programs that generated
them.  It is assumed below that you have changed to this directory.
The command sequences for running the simulations that are mentioned
below are also stored in this directory, in shell files with the names
'rcmds', 'bcmds', and 'ccmds'.

Note that the particular network architectures, priors, and Markov
chain sampling options used below are only examples of reasonable
choices.  There are many other possibilities that are also reasonable.
To gain a full understanding of the various possibilities, and their
advantages and disadvantages, you will need to read both my thesis and
the detailed documentation in the ".doc" files.


A simple regression problem

As a first example, we will look at a simple regression problem, in
which there is one real-valued input for each case, and one real-valued
target, whose value is to be predicted.  I generated synthetic data of
this type in which the input variable, x, for each case had a standard
Gaussian distribution and the corresponding target value came from a
Gaussian distribution with standard deviation 0.1 and mean given by

    0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2)

I generated 200 cases in total, stored in the file 'rdata'.  Each case
consists of a line containing first the input value and then the target
value.  The first 100 of these cases are meant for training, and the
second 100 for testing.
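The actual generator program is among the C sources in the 'examples'
directory.  A program along the following lines would produce data with
the same distribution (though not the same file contents, since the
random number streams will differ).  This is just an illustrative
sketch, not the code that was actually used:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    /* Uniform on (0,1), and standard Gaussian via Box-Muller. */

    static double unif (void)
    { return (rand()+1.0) / (RAND_MAX+2.0);
    }

    static double gaussian (void)
    { return sqrt(-2*log(unif())) * cos(2*PI*unif());
    }

    int main (void)
    { int i;
      srand(1);
      for (i = 0; i<200; i++)
      { double x = gaussian();
        double mean = 0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x*x);
        printf ("%9.5f %9.5f\n", x, mean + 0.1*gaussian());
      }
      return 0;
    }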
We will model this data using a multilayer perceptron with one input
unit, one hidden layer of eight tanh units, and a single output unit
whose activation function is the identity.  The value of the output
unit will be taken as the mean of a Gaussian distribution for the
target, with the standard deviation of this Gaussian (the noise level)
being a hyperparameter to be estimated along with the parameters of the
network.  We will also use hyperparameters to express the prior
distributions for the network parameters.  Specifically, we will use
one hyperparameter for the input-to-hidden weights, one for the hidden
unit biases, and one for the hidden-to-output weights.  The output unit
bias will be given a simple Gaussian prior, with no adjustable
hyperparameter.  (The role of hyperparameters is primarily to introduce
dependencies between parameters, so they are usually not used when they
would control just a single parameter.)

The first step in applying this model to the data is to create a log
file containing the specifications for the network architecture and the
priors to use for the network parameters.  This can be done using the
following command:

    > net-spec rlog 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100

Here, 'rlog' is the name of the new log file, and the arguments "1",
"8", and "1" specify the numbers of input, hidden, and output units.
Following the "/", the priors for the various groups of network
parameters are given, with a "-" indicating that a parameter group
should be omitted (equivalent to the parameters being zero).  The order
of the groups is a bit hard to remember, but you can get a summary
easily by just invoking 'net-spec' with no arguments.  The groups in
the above command that are not omitted are the input-hidden weights,
the hidden biases, the hidden-output weights, and the output bias.

In general, the prior specifications used in the net-spec command
consist of a "width" value followed by up to three "alpha" values, with
perhaps an option character tacked on to the front.  For the full
details, see Appendix A of my thesis and prior.doc.  Here, I will just
comment on the particular priors used above.

The prior specification used for the output bias is simply "100", which
means that the bias has a Gaussian prior with mean zero and standard
deviation 100.  The prior specifications of the form "0.05:0.5"
indicate that the parameters in these groups are associated with a
hyperparameter, which gives the standard deviation of a Gaussian prior
for these parameters.  The hyperparameter itself has a rather vague
prior that spans several orders of magnitude around one.  The inverse
gamma priors used are somewhat difficult to visualize, because their
tails are asymmetrical, but some standard choices are often
appropriate.  Here, the "0.5" after the colon controls how vague the
prior is (closer to zero is more vague).  The "0.05" specifies the
location of this vague distribution, but due to the asymmetry of the
tails, it is closer to being the lower limit of the prior than the
centre (for vague priors such as this).

The "x" in front of the prior for the hidden-to-output weights
indicates that the prior should be automatically rescaled based on the
number of hidden units, so as to produce an effect that is independent
of the number of hidden units (in the limit of large numbers).

Once the network has been specified, we need to say how the network
outputs will be used to model the targets (response variables) in the
data set.  We do this with the 'model-spec' command:

    > model-spec rlog real 0.05:0.5

In this case, the targets are real-valued, and are modeled as the
network output plus Gaussian noise, with the noise standard deviation
being a hyperparameter having the prior given by the last argument of
the command.  The syntax of the prior specification is the same as for
the priors on network parameters.

You can view the architecture and prior specifications stored in the
log file by invoking 'net-spec' with just a log file argument.  In this
example, this should give the following result:

    > net-spec rlog

    Network Architecture:

      Size of input layer:    1
      Sizes of hidden layers: 8
      Size of output layer:   1

      Data model: real

    Prior Specifications:

      Hidden Layer 0

        Input-Hidden Weights:    0.050:0.50
        Hidden Biases:           0.050:0.50

      Output Layer

        Hidden0-Output Weights: x0.050:0.50
        Output Biases:          100.000

You can also view the model specification by invoking 'model-spec' with
just one argument giving the log file.
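To make the architecture concrete, here is a sketch (using my own
variable names, not anything taken from the software) of the function
computed by a network with this architecture: eight tanh hidden units
feeding an identity output unit, whose value is then used as the mean
of the Gaussian distribution for the target.

    #include <math.h>

    #define H 8  /* number of hidden units */

    double net_func
    ( double x,         /* the single input         */
      double w_ih[H],   /* input-to-hidden weights  */
      double b_h[H],    /* hidden unit biases       */
      double w_ho[H],   /* hidden-to-output weights */
      double b_o        /* output unit bias         */
    )
    { double o = b_o;
      int j;
      for (j = 0; j<H; j++)
      { o += w_ho[j] * tanh (w_ih[j]*x + b_h[j]);
      }
      return o;  /* taken as the mean of the Gaussian for the target */
    }

Since the output is a sum of H such terms, its prior variance would
grow with the number of hidden units if the width of the prior for the
hidden-to-output weights were held fixed.  The "x" rescaling compensates
for this, shrinking the width as H grows, so that the prior over
computed functions remains sensible as more hidden units are added.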
Once the network and data model have been specified, we need to specify
the data sets to be used for training and (optionally) for testing.  We
do this using the 'data-spec' command:

    > data-spec rlog 1 1 / rdata@1:100 . rdata@101:200 .

    Number of training cases: 100
    Number of test cases: 100

Here, "rlog" is the name of the log file we created with 'net-spec', to
which the data specifications will be appended.  The "1" and "1"
arguments give the numbers of inputs and targets.  These must be
consistent with the network architecture (if not, an error will be
reported later when you try to start the training).  After the "/",
specifications for where to get the training and test data are given.
Each such specification consists of two parts: the source for the
inputs, and the source for the targets.  The specification "rdata@1:100"
means that the training inputs come from the file 'rdata', in lines 1
to 100, while the specification of "rdata@101:200" for the test inputs
indicates that they also come from the file 'rdata', but in lines 101
to 200.  In the above command, the sources for the targets are given as
just ".", which means the target items are on the same lines as the
inputs, following the last input item.  We could have said that the
targets come from a completely different file, however.  It is also
possible to specify exactly where on a line the inputs and targets are
located (and hence to ignore some items in the file).  For
documentation on these and other features, see numin.doc.

Though it is not done above, the 'data-spec' command also allows you to
specify transformations to be applied to the inputs or targets before
they are used.  This is useful, for example, if you wish to use inputs
that have been "normalized" to have mean zero and variance one, based
on the training data.  See data-spec.doc for the details.

In the training runs reported in the thesis, I used a short "initial
phase" to get things started, followed by a long "sampling phase" to
bring the simulation to equilibrium and then produce a sample of
networks from the posterior for use in prediction.  I still use the
same general procedure, but with some changes to how the initial phase
is done.

It seems desirable to start the simulation in a state where the
hyperparameters take on moderate values, and leave them fixed for a few
iterations so that the network parameters will also take on moderate
values.  This can be accomplished using the following commands:

    > net-gen rlog fix 0.5
    > mc-spec rlog repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc rlog 1

The 'net-gen' command stores a network in the log file with index zero,
in which the hyperparameters have values of 0.5, and the network
parameters are zero.  This is the initial state of the simulation run.
The following 'mc-spec' command specifies the Markov chain operations
to be performed in the initial phase.  Here, each iteration consists of
ten repetitions of the following steps:  Gibbs sampling for the noise
level, a heatbath replacement of the momentum variables, and a hybrid
Monte Carlo update with a trajectory 100 leapfrog steps long, using a
window of 10, and a stepsize adjustment factor of 0.2.  Note that the
hyperparameters are not updated, and hence will remain fixed at values
of 0.5.  Finally, a single such iteration is done by calling 'net-mc'
with an iteration limit of 1.

The stepsize adjustment factor of 0.2 used above is typical of what is
needed, but will not be appropriate in all circumstances.
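For readers unfamiliar with hybrid (Hamiltonian) Monte Carlo, the heart
of such an update is a leapfrog simulation of Hamiltonian dynamics, as
in the generic sketch below.  This is only meant to convey the idea; it
is not the software's actual implementation, which also handles the
acceptance test, the "window" of states (the ":10" above), and the
heuristic choice of the stepsizes that the adjustment factor multiplies.

    #include <stdlib.h>

    /* Simulate L leapfrog steps of size eps, starting from position q
       and momentum p (each of dimension n).  Here, grad(n,q,g) is
       assumed to store in g the gradient of the potential energy
       (minus the log posterior density) at q. */

    void leapfrog
    ( int n, double q[], double p[], int L, double eps,
      void (*grad) (int, double[], double[])
    )
    { double *g = malloc (n * sizeof *g);
      int i, l;
      grad (n, q, g);
      for (l = 0; l<L; l++)
      { for (i = 0; i<n; i++) p[i] -= 0.5*eps*g[i];  /* half step, momentum */
        for (i = 0; i<n; i++) q[i] += eps*p[i];      /* full step, position */
        grad (n, q, g);
        for (i = 0; i<n; i++) p[i] -= 0.5*eps*g[i];  /* half step, momentum */
      }
      free (g);
    }

The end-point of the trajectory is then accepted or rejected based on
the change in total energy, which is why too large a stepsize shows up
as a high rejection rate.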
After the 'net-mc' command has finished, the fraction of the 10 hybrid
Monte Carlo updates that were rejected can be determined using the
command 'net-plt t r rlog', which will write the iteration number (of
1) and the rejection rate on standard output.  If the rejection rate is
high (say, over 0.3), a new run should be done using a smaller stepsize
adjustment factor.  In the initial phase, one would generally start by
guessing a value for the stepsize adjustment factor that is on the low
side, since there is no point in optimizing this choice.

At this point, we hope to have a network stored in the log file (with
index 1) that has values for both the parameters and hyperparameters
that are of moderate magnitude, and which have adapted at least
somewhat to the training data.  We can now start serious sampling with
the following commands:

    > mc-spec rlog sample-sigmas heatbath hybrid 1000:10 0.4
    > net-mc rlog 100

The 'mc-spec' command appends a new set of Markov chain operations to
the log file, which will override the previous set.  These operations
are Gibbs sampling for both the hyperparameters and the noise level
(the "sigmas"), a heatbath update for the momentum variables, and a
hybrid Monte Carlo update with a trajectory 1000 leapfrog steps long, a
window of 10, and a stepsize adjustment factor of 0.4.  A long
trajectory length is typically desirable for the sampling phase.  As in
the initial phase, the stepsize adjustment factor of 0.4 used is
typical, but not universally applicable.  It may pay at this stage to
experiment in order to find out how large this factor can be while
keeping the rejection rate low.  The use of a "window" of around 10
states costs little and is often beneficial.

The 100 iterations of the sampling phase started with the command
'net-mc rlog 100' will take a while to complete (about six minutes on
our SGI machine).  While you wait, you can look at the last network
saved in the log file (or any earlier one) using 'net-display'.  For
example, after about a minute, you might see the following:

    > net-display rlog

    Network in file "rlog" with index 15

    Input to Hidden Layer 0 Weights [1]

     3.36  3.36: -1.24 +1.73 -4.04 -0.24 +2.49 -2.95 -1.56 -3.21

    Hidden Layer 0 Biases [2]

           1.90: -1.58 -3.59 +4.42 +0.85 -4.59 -3.69 +2.03 -0.40

    Hidden Layer 0 to Output Weights [3]

     0.76  0.76: -1.23
           0.76: +1.04
           0.76: +0.34
           0.76: -0.38
           0.76: -0.24
           0.76: +0.80
           0.76: +0.52
           0.76: -0.56

    Output Biases [4]

         100.00: +1.10

    Noise levels

     0.08 - 0.08

This display of network parameters and hyperparameters is divided into
sections for different parameter groups.  Within each section, the
numbers before the colons are hyperparameters, those after are
parameters (weights and biases).  There are more hyperparameters shown
than were mentioned earlier, but for this network architecture, the
extra hyperparameters are either fixed in value (the 100 for output
biases), or tied to the value of a higher-level hyperparameter, so they
are effectively not present.

The parameter groups in the 'net-display' output are identified by
numbers in square brackets.  These can be used with the 'h', 'w', and
'W' quantities of 'net-plt'.  For example, to see how the
hyperparameter controlling the hidden-to-output weights has changed
during the simulation (so far), one can use the command

    > net-plt t h3 rlog | plot

where 'plot' is some suitable plot program.  (One can also just invoke
net-plt and look at the numbers printed on standard output.)  Here 'h3'
refers to the top-level hyperparameter for group 3, which is seen in
the output of 'net-display' above to be the hidden-to-output group.
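Any program that plots points read from standard input will do for
'plot'.  For instance, if you happen to have gnuplot installed, the
following (with no separate 'plot' program at all) should display the
trace of this hyperparameter:

    > net-plt t h3 rlog | gnuplot -p -e "plot '-' using 1:2 with lines"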
By looking at plots of the hyperparameters and quantities such as the
squared error on the training set ('b'), one can get an idea of when
the simulation has reached equilibrium.  Networks from that point on
can then be used to make predictions for test cases using the
'net-pred' program.  Often, it will not be completely clear that
equilibrium has been reached until the simulation has been allowed to
proceed for quite a long time, but predictions based on shorter runs
may nevertheless be quite good.

For this problem, let's assume that we have decided to discard the
first 20 iterations as perhaps not coming from the equilibrium
distribution.  The following command will use the networks from the
remaining 80 iterations to produce predictions for all test cases, and
report the average squared error:

    > net-pred itn rlog 21:

    Number of networks used: 80

    Case  Inputs  Targets  Means  Error^2

       1    0.92     1.49   1.59   0.0093
       2    0.71     1.83   1.79   0.0012
       3    0.20     1.72   1.67   0.0021

            (middle lines omitted)

      98   -0.69     0.35   0.35   0.0000
      99   -1.33     0.19   0.36   0.0277
     100   -0.09     1.31   1.24   0.0052

    Average squared error guessing mean:    0.01035+-0.00147

The "itn" options ask for a listing of the inputs ("i") and targets
("t") for each case, along with the mean ("n") output for that case of
the 80 networks used for prediction.  The squared error when using this
mean to predict the target is shown for each case, and the average
squared error for the test cases is shown at the bottom, along with its
standard error with respect to the random selection of test cases.
Considering that the average squared error with optimal prediction is
0.01 (due to the noise of standard deviation 0.1 added when generating
the data), the network model has done quite well, as one would hope it
would on an easy problem such as this.

It is also possible to get predictions for cases that are not in the
test set that was specified with 'data-spec'.  For example:

    > net-pred nb rlog 11: / "%echo 2.3"

       1.37

Here, the options "nb" ask for only the predictive mean, with "bare"
output (no headings).  The argument at the end says that the inputs for
test cases (here, just one case) should be taken from the output of the
Unix command "echo 2.3", which just outputs the number 2.3.


A problem with a binary response

As a second example, I generated a data set with two real-valued
inputs, x1 and x2, and a binary target.  The inputs were drawn
independently from standard Gaussian distributions.  The target was
then set to "1" with the following probability:

    exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 )

ie, the negative exponential of the fourth power of the distance of
(x1,x2) from the point (0.4,-0.3).  I generated 500 cases, and used the
first 300 for training and the remaining 200 for testing.
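As with the regression data, the actual generator is among the C
programs in 'examples'; the following is only a sketch of a program
with the same effect, with unif and gaussian as in the earlier sketch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    static double unif (void)
    { return (rand()+1.0) / (RAND_MAX+2.0);
    }

    static double gaussian (void)
    { return sqrt(-2*log(unif())) * cos(2*PI*unif());
    }

    int main (void)
    { int i;
      srand(1);
      for (i = 0; i<500; i++)
      { double x1 = gaussian(), x2 = gaussian();
        double d2 = (x1-0.4)*(x1-0.4) + (x2+0.3)*(x2+0.3);
        double pr = exp(-d2*d2);  /* d2*d2 is the 4th power of distance */
        printf ("%9.5f %9.5f %d\n", x1, x2, unif()<pr ? 1 : 0);
      }
      return 0;
    }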
For this problem, we can again try using a network with one layer of
hidden units (fifteen for this problem), and a single output unit.  For
a binary target, a Gaussian data model would be inappropriate; instead,
we can use a logistic regression model, in which the probability of the
target being "1" is obtained by passing the value of the output unit
through the logistic function, f(x) = 1/(1+exp(-x)).

The network, model, and data specification commands needed are quite
similar to those used in the regression problem above:

    > net-spec blog 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100
    > model-spec blog binary
    > data-spec blog 2 1 2 / bdata@1:300 . bdata@301:500 .

    Number of training cases: 300
    Number of test cases: 200

The 'net-spec' command differs only in the numbers of input units (2)
and hidden units (15).  The data model used is "binary", with no need
for a noise level prior.  The 'data-spec' command also says that there
are two inputs, and one target.  It also has a third argument of "2"
just before the "/", which indicates that the target must be an integer
with two possible values (which are "0" and "1").

The initial phase commands that were used for the simple regression
problem turn out to be adequate for this problem as well:

    > net-gen blog fix 0.5
    > mc-spec blog repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc blog 1

For the sampling phase, we could also try using commands similar to
those presented above, but we might as well try something different
(the mc-spec command needed here is long, so it is continued by putting
a "\" at the end of the first line):

    > mc-spec blog repeat 10 sample-sigmas heatbath 0.95 \
                   hybrid 100:10 0.3 negate
    > net-mc blog 200

The above 'mc-spec' command specifies the variant of hybrid Monte Carlo
with "persistence", in which relatively short trajectories are used,
but random walks are suppressed by only partially replacing the
momentum in the 'heatbath' step.  Note that for this to work well, the
rejection rate must be quite small.  This alternative method has the
advantage that it allows for more frequent hyperparameter updates (with
'sample-sigmas').
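Concretely, a heatbath operation with a decay factor such as 0.95 only
partially refreshes each momentum variable, along the lines of the
sketch below.  This is an illustration of the idea, not the software's
actual code:

    #include <math.h>

    double gaussian (void);  /* standard Gaussian, as in earlier sketches */

    /* Partial momentum refreshment: decay the old momentum and top it
       up with fresh Gaussian noise, leaving the marginal distribution
       of p unchanged.  With alpha near one, most of the old momentum
       is kept, so successive short trajectories tend to continue in a
       consistent direction rather than performing a random walk. */

    void heatbath_partial (int n, double p[], double alpha)
    { int i;
      for (i = 0; i<n; i++)
      { p[i] = alpha*p[i] + sqrt(1-alpha*alpha)*gaussian();
      }
    }

The final 'negate' operation flips the sign of the momentum at the end
of each iteration; in combination with the behaviour of 'hybrid' on
acceptance and rejection, the overall effect is that motion continues
in the same direction as long as trajectories are accepted, which is
why a small rejection rate is important here.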
On our SGI machine, the 200 iterations requested above take about 70
minutes.  During this time, one can monitor the simulation, for
instance with the command:

    > net-plt t h1h2h3 blog | plot

where again, 'plot' is some appropriate plotting program, which in this
case must be capable of plotting three superimposed graphs, for the
three hyperparameters h1, h2, and h3.  The values for these
hyperparameters exhibit quite a high degree of autocorrelation, which
is why it is advisable to allow the simulation to go for 200
iterations, in order to be more confident that the simulation has
explored all high-probability regions of the posterior distribution.

Once the run has finished, we can make predictions using 'net-pred',
based, say, on the networks from the final 3/4 of the run (which is a
generally reasonable portion).  For a problem such as this, where the
response is binary, we are most likely interested in guessing whether
the target is "0" or "1", with performance measured by how often we are
right.  We can get such predictions as follows:

    > net-pred itm blog 51:

    Number of networks used: 150

    Case  Inputs          Targets  Guesses  Wrong?

       1  -1.56   0.90     0.00       0       0
       2   0.09  -0.13     1.00       1       0
       3  -0.79   0.85     0.00       0       0

          (middle lines omitted)

     198   1.49   0.70     0.00       0       0
     199  -2.23  -0.28     0.00       0       0
     200  -0.91  -0.03     0.00       0       0

    Fraction of guesses that were wrong:    0.0900+-0.0203

The "itm" options ask for output listing the inputs, the targets, and
the guesses based on the mode of the predictive distribution.  A
summary is also printed that reports the fraction of guesses that were
wrong, along with the standard error for this estimate of the error
rate.


A three-way classification problem

As a final example, I created a synthetic data set for a three-way
classification problem.  Data items were generated by first drawing
quantities x1, x2, x3, and x4 independently from distributions uniform
over (0,1).  The class of the item, represented by "0", "1", or "2",
was then selected as follows:  if the two-dimensional Euclidean
distance of (x1,x2) from the point (0.4,0.5) was less than 0.35, the
class was set to "0"; otherwise, if 0.8*x1+1.8*x2 was less than 0.6,
the class was set to "1"; and if neither of these conditions held, the
class was set to "2".  Note that x3 and x4 have no effect on the class.
The class selected by this procedure was the target value for the
network; the inputs available to the network were not x1, x2, x3, and
x4 themselves, however, but rather these four values plus Gaussian
noise of standard deviation 0.1.  I generated 1000 cases in this way,
of which 400 were used for training and 600 for testing.
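Once more, the actual generator for 'cdata' is in the 'examples'
directory; a sketch of a program implementing the above recipe is:

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define PI 3.14159265358979323846

    static double unif (void)
    { return (rand()+1.0) / (RAND_MAX+2.0);
    }

    static double gaussian (void)
    { return sqrt(-2*log(unif())) * cos(2*PI*unif());
    }

    int main (void)
    { int i, c;
      srand(1);
      for (i = 0; i<1000; i++)
      { double x1 = unif(), x2 = unif(), x3 = unif(), x4 = unif();
        if ((x1-0.4)*(x1-0.4) + (x2-0.5)*(x2-0.5) < 0.35*0.35)  c = 0;
        else if (0.8*x1 + 1.8*x2 < 0.6)                         c = 1;
        else                                                    c = 2;
        printf ("%9.5f %9.5f %9.5f %9.5f %d\n",
                x1 + 0.1*gaussian(), x2 + 0.1*gaussian(),
                x3 + 0.1*gaussian(), x4 + 0.1*gaussian(), c);
      }
      return 0;
    }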
This example will illustrate the "softmax" model for target values in a
small set, and the Automatic Relevance Determination (ARD) prior for
input-to-hidden weights.

The network and model specifications used were as follows:

    > net-spec clog 4 8 3 / - 0.05:0.5:0.5 0.05:0.5 - x0.05:0.5 - 0.05:0.5
    > model-spec clog class

This specifies a network with 4 input units, 8 hidden units, and 3
output units.  The output units have identity activation functions, but
their values are used in a "softmax" data model, in which the
probability for target i is

    exp(o_i) / SUM_j exp(o_j)

where the o_j are the values of the output units.

The prior specified for input-to-hidden weights, "0.05:0.5:0.5", has
two "alpha" values, indicating that there is both a high-level
hyperparameter controlling the overall magnitude of input-to-hidden
weights, and lower-level hyperparameters for each input unit, which
control the magnitudes of weights out of each input.  If some of the
inputs are irrelevant (as are the third and fourth inputs in this
problem), we hope that the corresponding hyperparameters will take on
small values, forcing the weights out of these inputs to be close to
zero, and thereby avoiding any damage to predictive performance that
could result from the inclusion of these irrelevant inputs.

We need to use the following data specification for this problem:

    > data-spec clog 4 1 3 / cdata@1:400 . cdata@401:1000 .

    Number of training cases: 400
    Number of test cases: 600

The arguments "4" and "1" are the numbers of input and target values.
The "3" before the "/" says that each target must be one of the
integers "0", "1", or "2".  This "3" must match the number of softmax
output units specified in the network architecture, or an error will
result when training is started.

The initial and sampling phases of the simulation can be done using the
same approach as was used above for the binary data set:

    > net-gen clog fix 0.5
    > mc-spec clog repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc clog 1
    > mc-spec clog repeat 10 sample-sigmas heatbath 0.95 \
                   hybrid 100:10 0.3 negate
    > net-mc clog 200

This takes about 75 minutes on our SGI machine.

We can see the effect of the ARD model by looking at the values of the
low-level input-hidden hyperparameters over the course of the
simulation, with the command:

    > net-plt t h1@ clog | plot

The third and fourth hyperparameters are much smaller than the first
two, indicating that ARD has done its job.

Predictions for test cases based on the predictive mode can now be done
in the same way as for the binary response problem.  If we are
interested only in the estimated classification performance, we can use
the following command:

    > net-pred ma clog 51:

    Number of networks used: 150

    Number of test cases: 600

    Fraction of guesses that were wrong:    0.1367+-0.0140

Here, the "a" option is used to suppress everything except the summary.
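As noted at the start of this section, the full command sequences for
these examples are stored in the 'examples' directory.  The file
'ccmds' for this last example, for instance, is essentially a shell
file collecting the commands shown above (the file as distributed may
differ in minor details):

    net-spec clog 4 8 3 / - 0.05:0.5:0.5 0.05:0.5 - x0.05:0.5 - 0.05:0.5
    model-spec clog class
    data-spec clog 4 1 3 / cdata@1:400 . cdata@401:1000 .

    net-gen clog fix 0.5
    mc-spec clog repeat 10 sample-noise heatbath hybrid 100:10 0.2
    net-mc clog 1

    mc-spec clog repeat 10 sample-sigmas heatbath 0.95 \
      hybrid 100:10 0.3 negate
    net-mc clog 200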