A SIMPLE REGRESSION PROBLEM As a first example, we will look at a simple regression problem, in which there is one real-valued input for each case, and one real-valued target, whose value is to be predicted. I generated synthetic data of this type in which the input variable, x, for each case had a standard Gaussian distribution and the corresponding target value came from a Gaussian distribution with standard deviation 0.1 and mean given by 0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2) I generated 200 cases in total, stored in the file 'rdata'. Each case consists of a line containing first the input value and then the target value. The first 100 of these cases are meant for training, and the second 100 for testing. A neural network model for the regression problem. We will model this data using a multilayer perceptron with one input unit, one hidden layer of eight tanh units, and a single output unit whose activation function is the identity. The value of the output unit will be taken as the mean of a Gaussian distribution for the target, with the standard deviation of this Gaussian (the noise level) being a hyperparameter to be estimated along with the parameters of the network. We will also use hyperparameters to express the prior distributions for the network parameters. Specifically, we will use one hyperparameter for the input-to-hidden weights, one for the hidden unit biases, and one for the hidden-to-output weights. The output unit bias will be given a simple Gaussian prior, with no adjustable hyperparameter. (The role of hyperparameters is primarily to introduce dependencies between parameters, so they are usually not used when they would control just a single parameter.) The first step in applying this model to the data is to create a log file containing the specifications for the network architecture and the priors to use for the network parameters. This can be done using the following command: > net-spec rlog.net 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 Here, "rlog.net" is the name of the new log file, and the arguments "1", "8", and "1", specify the numbers of input, hidden, and output units. Following the "/", the priors for the various groups of network parameters are given, with a "-" indicating that a parameter group should be omitted (equivalent to the parameters being zero). The order of the groups is a bit hard to remember, but you can get a summary easily by just invoking 'net-spec' with no arguments. The groups in the above command that are not omitted are the input-hidden weights, the hidden biases, the hidden-output weights, and the output bias. In general, the prior specifications used in the net-spec command consist of a "width" value followed by up to three "alpha" values, with perhaps an option character tacked on to the front. For the full details, see Appendix A of my thesis and prior.doc. Here, I will just comment on the particular priors used above. The prior specification used for the output bias is simply "100", which means that the bias has a Gaussian prior with mean zero and standard deviation 100. The prior specifications of the form "0.05:0.5" indicate that the parameters in these groups are associated with a hyperparameter, which gives the standard deviation of a Gaussian prior for these parameters. The hyperparameter itself has a rather vague prior that spans several orders of magnitude around one. The inverse gamma priors used are somewhat difficult to visualize, because their tails are asymmetrical, but some standard choices are often appropriate. Here, the "0.5" after the colon controls how vague the prior is (closer to zero is more vague). The "0.05" specifies the location of this vague distribution, but due to the asymmetry of the tails, it is closer to being the lower limit of the prior than the centre (for vague priors such as this). The "x" in front of the prior for the hidden-to-output weights indicates that the prior should be automatically rescaled based on the number of hidden units, so as to produce an effect that is independent of the number of hidden units (in the limit of large numbers). Once the network has been specified, we need to say how the network outputs will be used to model the targets (response variables) in the data set. We do this with the 'model-spec' command: > model-spec rlog.net real 0.05:0.5 In this case, the targets are real-valued, and are modeled as the network output plus Gaussian noise, with the noise standard deviation being a hyperparameter having the prior given by the last argument of the command. The syntax of the prior specification is the same as for the priors on network parameters. You can view the architecture and prior specifications stored in the log file by invoking 'net-spec' with just a log file argument. In this example, this should give the following result: > net-spec rlog.net Network Architecture: Size of input layer: 1 Sizes of hidden layers: 8 Size of output layer: 1 Data model: real Prior Specifications: Hidden Layer 0 Input-Hidden Weights: 0.050:0.50 Hidden Biases: 0.050:0.50 Output Layer Hidden0-Output Weights: x0.050:0.50 Output Biases: 100.000 You can also view the model specification by invoking 'model-spec' with just one argument giving the log file. Once the network and data model have been specified, we need to specify the data sets to be used for training and (optionally) for testing. We do this using the 'data-spec' command: > data-spec rlog.net 1 1 / rdata@1:100 . rdata@101:200 . Number of training cases: 100 Number of test cases: 100 Here, "rlog.net" is the log file we created with 'net-spec', to which the data specifications will be appended. The "1" and "1" arguments give the numbers of inputs and targets. These must be consistent with the network architecture (if not, an error will be reported later when you try to start the training). After the "/", specifications for where to get the training and test data are given. Each such specification consists of two parts: the source for the inputs, and the source for the targets. The specification "rdata@1:100" means that the training inputs come from the file 'rdata', in lines 1 to 100, while the specification of "rdata@101:200" for the test inputs indicates that they also come from the file 'rdata', but in lines 101 to 200. In the above command, the sources for the targets are given as just ".", which means the target items are on the same lines as the inputs, following the last input item. We could have said that the targets come from a completely different file, however. It is also possible to specify exactly where on a line the inputs and targets are located (and hence to ignore some items in the file). For documentation on these and other features, see numin.doc. Though it is not done above, the 'data-spec' command also allows you to specify transformations to be applied to the inputs or targets before they are used. This is useful, for example, if you wish to use inputs that have been "normalized" to have mean zero and variance one, based on the training data. See data-spec.doc for the details. In this example, the data already has approximately mean zero and variance one, so the priors used are sensible without normalization. In the training runs reported in the thesis, I used a short "initial phase" to get things started, followed by a long "sampling phase" to bring the simulation to equilibrium and then produce a sample of networks from the posterior for use in prediction. I still use the same general procedure, but with some changes to how the initial phase is done. It seems desirable to start the simulation in a state where the hyperparameters take on moderate values, and leave them fixed for a few iterations so that the network parameters will also take on moderate values. This can be accomplished using the following commands: > net-gen rlog.net fix 0.5 > mc-spec rlog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2 > net-mc rlog.net 1 The 'net-gen' command stores a network in the log file with index zero, in which the hyperparameters have values of 0.5, and the network parameters are zero. This is the initial state of the simulation run. The following 'mc-spec' command specifies the Markov chain operations to be performed in the initial phase. Here, each iteration consists of ten repetitions of the following steps: Gibbs sampling for the noise level, a heatbath replacement of the momentum variables, and a hybrid Monte Carlo update with a trajectory 100 leapfrog steps long, using a window of 10, and a stepsize adjustment factor of 0.2. Note that the hyperparameters are not updated, and hence will remain fixed at values of 0.5. Finally, a single such iteration is done by calling 'net-mc' with an iteration limit of 1. The stepsize adjustment factor of 0.2 used above is typical of what is needed, but will not be appropriate in all circumstances. After the 'net-mc' command has finished, the number of the 10 hybrid Monte Carlo updates that were rejected can be determined using the command 'net-plt t r rlog.net', which will write the iteration number (of 1) and the rejection rate on standard output. If the rejection rate is high (say, over 0.3), a new run should be done using a smaller stepsize adjustment factor. In the initial phase, one would generally start by guessing a value for the stepsize adjustment factor that is on the low side, since there is no point in optimizing this choice. At this point, we hope to have a network stored in the log file (with index 1) that has values for both the parameters and hyperparameters that are of moderate magnitude, and which have adapted at least somewhat to the training data. We can now start serious sampling with the following commands: > mc-spec rlog.net sample-sigmas heatbath hybrid 1000:10 0.4 > net-mc rlog.net 100 The 'mc-spec' command appends a new set of Markov chain operations to the log file, which will override the previous set. These operations are Gibbs sampling for both the hyperparameters and the noise level (the "sigmas"), a heatbath update for the momentum variables, and a hybrid Monte Carlo update with a trajectory 1000 leapfrog steps long, a window of 10, and a stepsize adjustment factor of 0.4. A long trajectory length is typically desirable for the sampling phase. As in the initial phase, the stepsize adjustment factor of 0.4 used is typical, but not universally applicable. It may pay at this stage to experiment in order to find out how large this factor can be while keeping the rejection rate low. The use of a "window" of around 10 states costs little and is often beneficial. The 100 iterations of the sampling phase started with the command 'net-mc rlog.net 100' will take a little while to complete (about one minute on our 550MHz Pentium III). If you put the command in the background (or add a '&' to the end of the 'net-mc' command), you will be able to monitor progress while you wait. For example, you can look at the last network saved in the log file (or any earlier one) using 'net-display'. After a few seconds, you might see the following: > net-display rlog.net Network in file "rlog.net" with index 15 Input to Hidden Layer 0 Weights [1] 3.57 3.57: -2.75 +2.66 +4.67 +1.67 -2.99 +3.42 -0.94 +2.69 Hidden Layer 0 Biases [2] 3.66: -0.40 +2.96 +9.48 +2.35 -2.66 -3.74 +8.69 -6.43 Hidden Layer 0 to Output Weights [3] 0.57 0.57: -0.73 0.57: -0.59 0.57: +0.26 0.57: +0.07 0.57: -0.51 0.57: -0.57 0.57: +0.64 0.57: +0.82 Output Biases [4] 100.00: +0.51 Noise levels 0.08 - 0.08 This display of network parameters and hyperparameters is divided into sections for different parameter groups. Within each section, the numbers before the colons are hyperparameters, those after are parameters (weight and biases). There are more hyperparameters shown than were mentioned earlier, but for this network architecture, the extra hyperparameters are either fixed in value (the 100 for output biases), or tied to the value of a higher-level hyperparameter, so they are effectively not present. The parameter groups in the 'net-display' output are identified by numbers in square brackets. These can be used with the 'h', 'w', and 'W' quantities of 'net-plt'. For example, to see how the hyperparameter controlling the hidden-to-output weights has changed during the simulation (so far), one can use the command > net-plt t h3 rlog.net | plot where 'plot' is some suitable plot program. (One can also just invoke net-plt and look at the numbers printed on standard output.) Here 'h3' refers to the top-level hyperparameter for group 3, which is seen in the output of 'net-display' above to be the hidden-to-output group. Here's a command that shows how the individual weights in this group change during the run: > net-plt t w3@ rlog.net | plot In this case, the "plot" program must be one (such as xgraph) that is capable of displaying more than one superimposed graph, with the data for each graph being separated by a blank line. Some plot programs may prefer the data to come with multiple values per line, which is the format produced by the 'net-tbl' command: > net-tbl tw3@ rlog.net | plot Note that there is no space between the "t" quantity and the others in this command. By looking at plots of the hyperparameters and quantities such as the squared error on the training set ('b'), one can get an idea of when the simulation has reached equilibrium. Networks from that point on can then be used to make predictions for test case using the 'net-pred' program. Often, it will not be completely clear that equilibrium has been reached until the simulation has been allowed to proceed for quite a long time, but predictions based on shorter runs may nevertheless be quite good. For this problem, let's assume that we have decided to discard the first 20 iterations as perhaps not coming from the equilibrium distribution. The following command will use the networks from the remaining 80 iterations to produce predictions for all test cases, and report the average squared error: > net-pred itn rlog.net 21: Number of networks used: 80 Case Inputs Targets Means Error^2 1 0.92 1.49 1.58 0.0075 2 0.71 1.83 1.79 0.0014 3 0.20 1.72 1.68 0.0016 ( middle lines omitted ) 98 -0.69 0.35 0.35 0.0000 99 -1.33 0.19 0.37 0.0306 100 -0.09 1.31 1.24 0.0056 Average squared error guessing mean: 0.00937+-0.00123 The options "itn" specified ask for a listing of the inputs ("i") and targets ("t") for each case, along with the mean ("n") output for that case of the 80 networks used for prediction. The squared error when using this mean to predict the target is shown for each case, and the average squared error for the test cases is shown at the bottom, along with its standard error with respect to the random selection of test cases. Considering that the average squared error with optimal prediction is 0.01 (due to the noise of standard deviation 0.1 added when generating the data), the network model has done quite well, as one would hope it would on an easy problem such as this. It is also possible to get predictions for cases that are not in the test set that was specified with 'data-spec'. For example: > net-pred nb rlog.net 11: / "%echo 2.3" +1.36491809e+00 Here, the options "nb" ask for only the predictive mean, with "bare" output (no headings, also higher precision, in exponential format). The argument at the end says that the inputs for test cases (here, just one case) should be taken from the output of the Unix command "echo 2.3", which just outputs the number 2.3. The 'net-pred' program can also find the median and the 10% and 90% quantiles of the predictive distribution, as illustrated below: > net-pred itdq rlog.net 21: Number of iterations used: 80 Case Inputs Targets Medians |Error| 10% Qnt 90% Qnt 1 0.92 1.49 1.58 0.0901 1.46 1.70 2 0.71 1.83 1.79 0.0359 1.66 1.91 3 0.20 1.72 1.68 0.0376 1.56 1.80 4 0.19 1.76 1.65 0.1093 1.54 1.77 5 1.18 1.18 1.21 0.0297 1.08 1.34 6 -2.10 0.05 0.05 0.0052 -0.10 0.20 7 -1.34 0.30 0.37 0.0639 0.25 0.49 8 -1.62 0.48 0.42 0.0589 0.28 0.55 9 -2.29 -0.41 -0.27 0.1434 -0.54 0.05 10 -0.23 0.89 0.98 0.0841 0.86 1.09 (some lines omitted) 91 -0.16 1.21 1.10 0.1058 0.99 1.22 92 0.79 1.70 1.72 0.0254 1.60 1.84 93 -1.99 0.19 0.18 0.0099 0.05 0.32 94 0.18 1.71 1.65 0.0608 1.53 1.77 95 1.56 0.84 0.82 0.0167 0.68 0.94 96 -0.32 0.80 0.82 0.0217 0.69 0.95 97 -0.12 1.23 1.17 0.0588 1.05 1.29 98 -0.69 0.35 0.35 0.0021 0.23 0.48 99 -1.33 0.19 0.37 0.1810 0.24 0.50 100 -0.09 1.31 1.24 0.0749 1.12 1.36 Average abs. error guessing median: 0.07723+-0.00581 We see here that all but one of the actual targets for the twenty test cases shown lie between the 10% and 90% quantiles. When the median is used as the "best guess", performance is judged by the average absolute error, not squared error, since this is the error measure that is minimized by the true median. A Gaussian process model for the regression problem. We can also model this data using a Gaussian process. Such a model is similar to a network model with an infinite number of hidden units. The weights in this hypothetical infinite network are not represented explicitly (fortunately, since this would require an infinite amount of memory). Only the hyperparameters are explicitly represented. A Gaussian process model is specified using the gp-spec command, which is analogous to the net-spec command. For the simple regression model, the following is one appropriate specification: > gp-spec rlog.gp 1 1 100 / 0.05:0.5 0.05:0.5 Here, "rlog.gp" is the name of the new log file that will hold the results of the Gaussian process run. The first two arguments following the log file are the numbers of inputs and outputs, respectively, both "1" for this problem. The (optional) argument of "100" that follows is the prior for the constant part of the covariance function used. This corresponds to the prior for the output unit bias in a network model. A specification for a linear part of the covariance could follow (but doesn't here); it would correspond to a prior for direct input-output connections in a network. For reasons of computational accuracy, it is best not to use too vague a prior for the constant part of the covariance, even though that would not usually be a problem from a statistical point of view. The remaining arguments (after the "/") give the priors for the hyperparameters used in an exponential term of the covariance function. These priors correspond to those for the hidden-output and input-hidden weights in a network model. (There is no counterpart here to the prior for the hidden unit biases in a network model.) The first prior is for the scale of this term, which controls the magnitude of the non-linear variation in the function. The second prior is for the relevance of the input, which controls the amount by which the input has to change to produce a change in the non-linear component of the function that is comparable to the overall scale over which this component varies. The prior specifications are in the same form as is used for network specifications (see prior.doc). The specifications of "0.05:0.5" used here are vague, allowing these hyperparameters to take on values over a wide range. The specification can be viewed by invoking 'gp-spec' with just the name of the log file: > gp-spec rlog.gp Number of inputs: 1 Number of outputs: 1 Constant part of covariance: 100.000 Exponential parts of covariance: Scale Relevance Power 0.050:0.50 0.050:0.50 2.000 Once the Gaussian process model for functions has been specified, we can specify how the function values are used to model the targets in the dataset using 'model-spec', in exactly the same was as for a network model: > model-spec rlog.gp real 0.05:0.5 We also say where the training and (optionally) the test data comes from using 'data-spec': > data-spec rlog.gp 1 1 / rdata@1:100 . rdata@101:200 . The model and data specifications can be viewed by invoking these programs with just the name of a log file. We are now ready to sample from the posterior distribution of the hyperparameters for the Gaussian process model. To start, we can fix the hyperparameters at reasonable initial values, using 'gp-gen': > gp-gen rlog.gp fix 0.5 0.1 This fixes the scale hyperparameters to 0.5 and the relevance hyperparameters to 0.1 (linear hyperparameters, if present, would be fixed to the product of these). By default, the hyperparameters are set to the "width" value from their prior specification. Because the priors are often vague (as here), this may not be a very reasonable starting point. We now specify the Markov chain operations to be used in sampling. There are a great many possibilities for these operations. Here is one reasonable method: > mc-spec rlog.gp heatbath hybrid 20:4 0.5 This uses hybrid Monte Carlo, with trajectories 20 leapfrog steps long, with a window of 4 states. The stepsize adjustment factor used is 0.5. If the rejection rate turns out to be too high (as can be checked using the 'gp-plt t r rlog.gp' command), the stepsize should be adjusted downward. To perform these sampling operations 100 times, we use the following command: > gp-mc rlog.gp 100 We can let this run in the background (eg, by adding '&' to the end of the command), and use 'gp-plt' or 'gp-display' to monitor progress. The quantities that can be plotted with 'gp-plt' are similar to those that can be plotted using 'net-plt', except that quantities relating to test cases have been omitted, since they would often take a long time to compute (the 'E' and 'H' quantities, defined in the "mc" module, may also take a long time). See gp-quantities.doc for details. Once the gp-mc run has completed (which takes about half a minute on our 550MHz Pentium III), the iterations from the latter part of the run can be used to make predictions for test cases. This is done using 'gp-pred', which operates much like 'net-pred'. The following command makes predictions for the test cases based on the last 80 of the 100 iterations, and reports the average squared error: > gp-pred na rlog.gp 21: Number of iterations used: 80 Number of test cases: 100 Average squared error guessing mean: 0.00952+-0.00126 This takes less than a second on our machine. Predictions will take longer when the number of training cases is larger, or if the median or log probability are to be found (options "d" or "p" of 'gp-pred'). As can be seen, the performance of the Gaussian process model is quite similar to that of the neural network model for this problem. (The difference in average performance seen is probably not significant, considering the standard errors.) The predictions for test cases made above are found directly from the covariances between the targets in the training cases and the unknown target in a test case. The values of the regression function for the training cases are never explicitly found. Consequently, it is not possible to plot quantities such as the squared error on training cases over the course of the run. To plot such quantities, you will have to ask for the function values for training cases to be generated in each iteration. This takes a significant amount of time, and can potentially cause numerical problems, which is why gp-plt won't just do it as needed. If you want to be able to plot the squared error on training cases (or similar quantities such as case-by-case likelihoods), you will need to change the 'mc-spec' command to the following: > mc-spec rlog.gp2 discard-values heatbath hybrid 20:4 0.5 sample-values The "sample-values" operation at the end generates function values for all the training cases, which will be stored in the log file. These values can later be used to compute the squared error for training cases, which can be plotted with a command such as > gp-plt t b rlog.gp2 | plot The "discard-values" operation throws away the function values (if present) before the operations for updating hyperparameters. Throwing away information may seem wasteful, but it actually improves convergence in this context. Unfortunately, if you make only this change, you will probably get the following error message when you try to run 'gp-mc': Couldn't find Cholesky decomposition of posterior covariance in sample-values! This message is produced when the round-off error in the matrix computations used by "sample-values" is enough to turn the results into nonsense. The problem is due to the poor "conditioning" of the covariance matrix. Roughly speaking, the covariances between neighbouring training cases are so high that knowing all but one function value is enough to determine the remaining function value to a precision comparable to the level of round-off error. To fix this, the conditioning of the covariance matrix must be improved. Changing the 'gp-spec' command as follows is sufficient on our machine: > gp-spec rlog.gp2 1 1 10 - 0.01 / 0.05:0.5 0.05:0.5 There are two changes here from the 'gp-spec' command used before. First, the constant part of the covariance has been reduced from 100 to 10, which makes little difference when the data is centred at about zero, as it is for this problem. Since arithmetic is done in floating point, this increases the effective precision of the covariances. Second, the covariance now includes a "jitter" part of 0.01 (the "-" preceding this indicates that there is still no linear part). Jitter is much like noise, in that it varies independently from one case to another, but it is considered part of the function value, which noise is not. The jitter makes all the function values less predictable, reducing the problem of poor conditioning. Jitter plays a more crucial role for binary and class models.