A SIMPLE REGRESSION PROBLEM As a first example, we will look at a simple regression problem, in which there is one real-valued input for each case, and one real-valued target, whose value is to be predicted. This example will also illustrate some important issues with usage of Markov chain Monte Carlo. I generated synthetic data of this type in which the input variable, x, for each case had a standard Gaussian distribution and the corresponding target value came from a Gaussian distribution with standard deviation 0.1 and mean given by 0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2) I generated 1100 cases in total, stored in the file 'rdata'. Each case consists of a line containing first the input value and then the target value. The first 100 of these cases are meant for training, and the remaining 1000 for testing. The training cases are plotted in rdata-train.png and the test caes are plotted in rdata-test.png. A neural network model for the regression problem. We will model this data using a multilayer perceptron with one input unit, one hidden layer of eight tanh units, and a single output unit whose activation function is the identity. The value of the output unit will be taken as the mean of a Gaussian distribution for the target, with the standard deviation of this Gaussian (the noise level) being a hyperparameter to be estimated along with the parameters of the network. We will also use hyperparameters to express the prior distributions for the network parameters. Specifically, we will use one hyperparameter for the input-to-hidden weights, one for the hidden unit biases, and one for the hidden-to-output weights. The output unit bias will be given a simple Gaussian prior, with no adjustable hyperparameter. (The role of hyperparameters is primarily to introduce dependencies between parameters, so they are usually not used when they would control just a single parameter.) The first step in applying this model to the data is to create a log file containing the specifications for the network architecture and the priors to use for the network parameters. This can be done using the following command: > net-spec rlog.net 1 8 1 / ih=0.05:0.5 bh=0.05:0.5 ho=x0.05:0.5 bo=100 Here, "rlog.net" is the name of the new log file, and the arguments "1", "8", and "1", specify the numbers of input, hidden, and output units. Following the "/", the priors for the various groups of network parameters are given, with the groups identified by abbreviations such as "ih" for "input-to-hidden" weights (see net-spec.doc for the complete list). The groups in the above command that are for the input-hidden weights, the hidden biases, the hidden-to-output weights, and the output bias. In general, the prior specifications used in the net-spec command consist of a "width" value followed by up to three "alpha" values, with perhaps an option character tacked on to the front. For the full details, see net-models.PDF and prior.doc. Here, I will just comment on the particular priors used above. The prior specification used for the output bias is simply "100", which means that the bias has a Gaussian prior with mean zero and standard deviation 100. The prior specifications of the form "0.05:0.5" indicate that the parameters in these groups are associated with a hyperparameter, which gives the standard deviation of a Gaussian prior for these parameters. The hyperparameter itself has a rather vague prior that spans several orders of magnitude around one. The inverse gamma priors used are somewhat difficult to visualize, because their tails are asymmetrical, but some standard choices are often appropriate. Here, the "0.5" after the colon controls how vague the prior is (closer to zero is more vague). The "0.05" specifies the location of this vague distribution, but due to the asymmetry of the tails, it is closer to being the lower limit of the prior than the centre (for vague priors such as this). The "x" in front of the prior for the hidden-to-output weights indicates that the prior should be automatically rescaled based on the number of hidden units, so as to produce an effect that is independent of the number of hidden units (in the limit of large numbers). Once the network has been specified, we need to say how the network outputs will be used to model the targets (response variables) in the data set. We do this with the 'model-spec' command: > model-spec rlog.net real 0.05:0.5 In this case, the targets are real-valued, and are modeled as the network output plus Gaussian noise, with the noise standard deviation being a hyperparameter having the prior given by the last argument of the command. The syntax of the prior specification is the same as for the priors on network parameters. You can view the architecture and prior specifications stored in the log file by invoking 'net-spec' with just a log file argument. In this example, this should give the following result: > net-spec rlog.net Network Architecture: Input layer: size 1 Hidden layer 0: size 8 tanh Output layer: size 1 Prior Specifications: Hidden Layer 0 Input-Hidden Weights: 0.050:0.50 Hidden Biases: 0.050:0.50 Output Layer Hidden0-Output Weights: x0.050:0.50 Output Biases: 100.000 You can also view the model specification by invoking 'model-spec' with just one argument giving the log file. Once the network and data model have been specified, we need to specify the data sets to be used for training and (optionally) for testing. We do this using the 'data-spec' command: > data-spec rlog.net 1 1 / rdata@1:100 . rdata@101: . Number of training cases: 100 Number of test cases: 1000 Here, "rlog.net" is the log file we created with 'net-spec', to which the data specifications will be appended. The "1" and "1" arguments give the numbers of inputs and targets. These must be consistent with the network architecture (if not, an error will be reported later when you try to start the training). After the "/", specifications for where to get the training and test data are given. Each such specification consists of two parts: the source for the inputs, and the source for the targets. The specification "rdata@1:100" means that the training inputs come from the file 'rdata', in lines 1 to 100, while the specification of "rdata@101:" for the test inputs indicates that they also come from the file 'rdata', but in lines 101 to the end of the file. In the above command, the sources for the targets are given as just ".", which means the target items are on the same lines as the inputs, following the last input item. We could have said that the targets come from a completely different file, however. It is also possible to specify exactly where on a line the inputs and targets are located (and hence to ignore some items in the file). For documentation on these and other features, see numin.doc. Though it is not done above, the 'data-spec' command also allows you to specify transformations to be applied to the inputs or targets before they are used. This is useful, for example, if you wish to use inputs that have been "standardized" to have mean zero and variance one, based on the training data. See data-spec.doc for the details. In this example, the data already has approximately mean zero and variance one, so the priors used are sensible without normalization. In the training runs reported in the thesis, I used a short "initial phase" to get things started, followed by a long "sampling phase" to bring the simulation to equilibrium and then produce a sample of networks from the posterior for use in prediction. I still use the same general procedure, but with some changes to how the initial phase is done. It seems desirable to start the simulation in a state where the hyperparameters take on moderate values, and leave them fixed for a few iterations so that the network parameters will also take on moderate values. This can be accomplished using the following commands: > net-gen rlog.net fix 0.5 > mc-spec rlog.net repeat 20 sample-noise heatbath hybrid 100:10 0.2 > net-mc rlog.net 1 The 'net-gen' command stores a network in the log file with index zero, in which the hyperparameters have values of 0.5, and the network parameters are zero. This is the initial state of the simulation run. The following 'mc-spec' command specifies the Markov chain operations to be performed in the initial phase. Here, each iteration consists of 20 repetitions of the following steps: Gibbs sampling for the noise level, a heatbath replacement of the momentum variables, and a hybrid (Hamiltonian) Monte Carlo update with a trajectory 100 leapfrog steps long, using a window of 10, and a stepsize adjustment factor of 0.2. Note that the hyperparameters are not updated, and hence will remain fixed at values of 0.5. Finally, a single such iteration is done by calling 'net-mc' with an iteration limit of 1. The stepsize adjustment factor of 0.2 used above is typical of what is needed, but will not be appropriate in all circumstances. After the 'net-mc' command has finished, the number of the 20 hybrid Monte Carlo updates that were rejected can be determined using the command 'net-plt t r rlog.net', which will write the iteration number (of 1) and the rejection rate on standard output. If the rejection rate is high (say, over 0.3), a new run should be done using a smaller stepsize adjustment factor. In the initial phase, one would generally start by guessing a value for the stepsize adjustment factor that is on the low side, since there is no point in optimizing this choice. At this point, we hope to have a network stored in the log file (with index 1) that has values for both the parameters and hyperparameters that are of moderate magnitude, and which have adapted at least somewhat to the training data. We can now start serious sampling with the following commands: > mc-spec rlog.net sample-sigmas heatbath hybrid 2000:10 0.5 > net-mc rlog.net 1000 The 'mc-spec' command appends a new set of Markov chain operations to the log file, which will override the previous set. These operations are Gibbs sampling for both the hyperparameters and the noise level (the "sigmas"), a heatbath update for the momentum variables, and a hybrid Monte Carlo update with a trajectory 2000 leapfrog steps long, a window of 10, and a stepsize adjustment factor of 0.5. A long trajectory length is typically desirable for the sampling phase. As in the initial phase, the stepsize adjustment factor of 0.5 used is typical, but not universally applicable. It may pay at this stage to experiment in order to find out how large this factor can be while keeping the rejection rate low. The use of a "window" of around 10 states costs little and is often beneficial. The 999 iterations of the sampling phase started with the command 'net-mc rlog.net 1000' take 17 seconds to complete on the the system used (see Ex-test-system.doc). If your computer is slower, you might put the command in the background (or add a '&' to the end of the 'net-mc' command), and then monitor progress while you wait. For example, you can look at the last network saved in the log file (or any earlier one) using 'net-display'. Shortly after starting the run, you might see the following: > net-display rlog.net Network in file "rlog.net" with index 150 Input to Hidden Layer 0 Weights [1] 0.48 0.48: +0.20 +0.51 +0.55 -0.16 +1.04 +1.02 +0.25 -0.87 Hidden Layer 0 Biases [2] 1.97: +0.48 -4.15 -1.84 -0.46 -1.10 +1.32 -0.31 -3.40 Hidden Layer 0 to Output Weights [3] 9.16 9.16: +9.68 9.16:-12.70 9.16: -4.17 9.16: +1.96 9.16: -6.27 9.16: -5.19 9.16:+19.00 9.16: -5.24 Output Biases [4] 100.00:-18.77 Noise levels 0.10 - 0.10 This display of network parameters and hyperparameters is divided into sections for different parameter groups. Within each section, the numbers before the colons are hyperparameters, those after are parameters (weight and biases). There are more hyperparameters shown than were mentioned earlier, but for this network architecture, the extra hyperparameters are either fixed in value (the 100 for output biases), or tied to the value of a higher-level hyperparameter, so they are effectively not present. The parameter groups in the 'net-display' output are identified by numbers in square brackets. These can be used with the 'h', 'w', and 'W' quantities of 'net-plt'. For example, to see how the hyperparameter controlling the hidden-to-output weights has changed during the simulation (so far), one can use the command > net-plt t h3 rlog.net | plot where 'plot' is some suitable plot program, as discussed in Ex-intro.doc. (One can also just invoke net-plt and look at the numbers printed on standard output.) Here 'h3' refers to the top-level hyperparameter for group 3, which is seen in the output of 'net-display' above to be the hidden-to-output group. Here's a command that shows how the individual weights in this group change during the run: > net-plt t w3@ rlog.net | plot In this case, the "plot" program must be one that is capable of displaying more than one superimposed graph, with the data for each graph being separated by a blank line. Some plot programs may prefer the data to come with multiple values per line, which is the format produced by the 'net-tbl' command: > net-tbl tw3@ rlog.net | some-other-plot-program Note that there is no space between the "t" quantity and the others in this command. By looking at plots of the hyperparameters and quantities such as the squared error on the training set ('b'), one can get an idea of when the simulation has reached equilibrium. For this run, a plot of the 'b' quantity can be seen in rlog-b.png, and a plot of the three hyperparameters (on a log scale) in rlog-h1h2h3.png. Note that if you run these commands and produce these plots, you might see something different, due to differences in floating-point roundoff error. See below for a discussion of how the results you see might differ. Networks from a point where equilibrium seems to have been reached can be used to make predictions for test cases using the 'net-pred' program. From the plots above, it seems that equilibrium has been reached after the first 100 iterations (or fewer). So we could discard the first 100 iterations as perhaps not coming from the equilibrium distribution, and use the following command to use the networks from the remaining 900 iterations to produce predictions for all test cases, and report the average squared error: > net-pred itn rlog.net 101: Number of iterations used: 900 Case Inputs Targets Means Error^2 1 0.75 1.74 1.75 0.0002 2 1.21 1.21 1.14 0.0054 3 -0.54 0.36 0.48 0.0157 (middle lines omitted) 998 -1.56 0.48 0.42 0.0033 999 0.15 1.69 1.62 0.0043 1000 0.74 1.77 1.76 0.0002 Average squared error guessing mean: 0.01033+-0.00048 The options "itn" specified ask for a listing of the inputs ("i") and targets ("t") for each case, along with the mean ("n") output for that case of the 900 networks used for prediction. The squared error when using this mean to predict the target is shown for each case, and the average squared error for the test cases is shown at the bottom, along with its standard error with respect to the random selection of test cases. Considering that the average squared error with optimal prediction is 0.01 (due to the noise of standard deviation 0.1 added when generating the data), the network model has done quite well. We can ask for just a summary of performance on test cases as follows: > net-pred npa rlog.net 101: Number of iterations used: 900 Number of test cases: 1000 Average log probability of targets: 0.854+-0.019 Average squared error guessing mean: 0.01033+-0.00048 Here, the 'a' option gives only the averages (not predictions for individual cases), and the 'p' option asks for performance in terms of log probability of targets. It is also possible to get predictions for cases that are not in the test set that was specified with 'data-spec'. For example: > net-pred nb rlog.net 101: / "%echo 2.3" +1.27448149e+00 Here, the options "nb" ask for only the predictive mean, with "bare" output (no headings, also higher precision, in exponential format). The argument at the end says that the inputs for test cases (here, just one case) should be taken from the output of the Unix command "echo 2.3", which just outputs the number 2.3. Since no true value for the target is provided, the squared error with this prediction is not shown. The 'net-pred' program can also find the median and the 10% and 90% quantiles of the predictive distribution. The program limits the number of iterations that can be used when finding medians and quantiles, so (though it's not actually necessary here) the command below uses "%5" to look only at the 180 iterations above 100 with numbers that are multiples of five: > net-pred itdq rlog.net 101:%5 Number of iterations used: 180 Case Inputs Targets Medians |Error| 10% Qnt 90% Qnt 1 0.75 1.74 1.75 0.0132 1.62 1.89 2 1.21 1.21 1.14 0.0737 0.99 1.28 3 -0.54 0.36 0.49 0.1266 0.35 0.62 (middle lines omitted) 998 -1.56 0.48 0.42 0.0579 0.27 0.57 999 0.15 1.69 1.62 0.0693 1.48 1.75 1000 0.74 1.77 1.76 0.0148 1.62 1.90 Average abs. error guessing median: 0.08091+-0.00192 When the median is used as the "best guess", performance is judged by the average absolute error, not squared error, since this is the error measure that is minimized by the true median. These results seem quite good, so we might decide we're finished. However, we seldom can be absolutely sure that our MCMC runs have actually sampled well from the true posterior distribution. To obtain more confidence that our results are correct, we can run the Markov chain for longer. It's possible that this will reveal that we haven't actually reached equilibrium. And even if we are actually sampling from the right distribution, a longer run will provide more accurate Monte Carlo estimates, which would typically improve predictions at least slightly. We can extend the chain to 3000 iterations by just running net-mc again with 3000 as the iteration limit: > net-mc rlog.net 3000 We can then look at the plots of squared error on the training set and of the hyperpameters for this longer run. These plots can be seen in rlog-b-3000.png and rlog-h1h2h3-3000.png. They reveal that shortly after iteration 1000, the chain moves to states in which the squared error on the training set is somewhat larger, and the hyperparameter values are quite different! We might conclude from this that the chain really only reaches equilibrium after about 1050 iterations, and we should discard earlier iterations when making predictions. This gives results as follows: > net-pred npa rlog.net 1051: Number of iterations used: 1950 Number of test cases: 1000 Average log probability of targets: 0.805+-0.027 Average squared error guessing mean: 0.02016+-0.00385 We see that performance is actually a bit worse than using iterations from 101 to 1000. We might decide to run for even longer, as follows: > net-mc rlog.net 50000 Plots of squared error on training cases and of hyperparameter values for this long run are in rlog-b-50000.png and rlog-h1h2h3-50000.png. We now see that the posterior distribution is multimodal. Only one mode was seen in iterations from 101 to 1000. Soon after iteration 1000, we see the only other mode, up through iteration 3000. But in the run for 50000 iterations, we see movement back and forth between the two modes, though more time is spent in one of the modes than in the other. Which mode was seen first in this run will have been due to the particular random initialization done. A different random initialization, or even slightly different roundoff errors when using a different computer system, might have led to the other mode being seen first. Note that states in the mode where the chain spends more time actually have larger training set error on average (and consequently lower posterior probability / higher energy). This is possible if this mode has a larger volume than the other mode. The performance using all but the first 100 of the 50000 iterations is as follows: > net-pred npa rlog.net 101: Number of iterations used: 49900 Number of test cases: 1000 Average log probability of targets: 0.830+-0.022 Average squared error guessing mean: 0.01873+-0.00335 This is a bit better than when using states from iterations 1051 to 3000, but not as good as when using only states from iterations 101 to 1000. The difference is due to predictions at points outside the range of the training data. We can see this using the following command: > ( net-pred inb rlog.net 101:1000 / "%grid -3:3%0.01" > echo " " > net-pred inb rlog.net 1051:3000 / "%grid -3:3%0.01" > ) | plot This finds the predictions for a grid of inputs from -3 to 3 using first the iterations from 101 to 1000 (sampling from the first mode) and then those from iterations 1051 to 3000 (sampling the second mode). The result can be seen in rlog-2-pred.png. We can see that the predictions are very similar in the region where there is training data (see rdata-train.png). The test data (see rdata-test.png) includes some points that are outside this region, where the predictions using states from the two modes differ quite a bit. We can look in detail at the predictions at inputs -3, 0, and +3 using the mode sampled in iterations 101 to 1000, the mode sampled in iterations 1051 to 3000, and using both modes as sampled in iterations 101 to 50000: > net-pred inqQ rlog.net 101:1000 / "%echo -3; echo 0; echo 3" Number of iterations used: 900 Case Inputs Means 10% Qnt 90% Qnt 1% Qnt 99% Qnt 1 -3.00 -1.34 -2.45 -0.35 -3.51 0.51 2 0.00 1.39 1.25 1.53 1.14 1.64 3 3.00 2.27 1.69 2.87 1.13 3.26 > net-pred inqQ rlog.net 1051:3000%2 / "%echo -3; echo 0; echo 3" Number of iterations used: 975 Case Inputs Means 10% Qnt 90% Qnt 1% Qnt 99% Qnt 1 -3.00 -2.19 -2.65 -1.75 -3.13 -1.42 2 0.00 1.38 1.25 1.52 1.13 1.64 3 3.00 3.02 2.69 3.33 2.44 3.62 > net-pred inqQ rlog.net 101:50000%50 / "%echo -3; echo 0; echo 3" Number of iterations used: 998 Case Inputs Means 10% Qnt 90% Qnt 1% Qnt 99% Qnt 1 -3.00 -2.10 -2.61 -1.62 -3.12 -0.44 2 0.00 1.38 1.25 1.52 1.13 1.64 3 3.00 2.96 2.61 3.31 1.81 3.59 Here are the true values of the function at these inputs: > calc x=-3 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)" -1.27494 > calc x=0 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)" 1.4 > calc x=3 "0.3 + 0.4*x + 0.5*Sin(2.7*x) + 1.1/(1+x^2)" 2.09494 The predictions for input of 0 are all very similar, and quite good. The predictions for inputs of -3 and +3 are much better when using iterations from 101 to 1000 - the extrapolation beyond the training data using parameters from this mode happens to better match the true function. This seems to mostly be luck, since absent strong prior knowledge, how to extrapolate is just a guess. When using iterations 1051 to 3000, the extrapolation is not good, and the true values lie outside the interval from the 1% to the 99% quantiles - a sign of overconfidence in the predictions. However, when using all iterations from 101 to 50000, the true values do lie within the interval from the 1% to 99% quantiles, showing that sampling both modes gives a better indication of the true uncertainly (even though the mean prediction is still not very good). A Gaussian process model for the regression problem. We can also model this data using a Gaussian process. Such a model is similar to a network model with an infinite number of hidden units. The weights in this hypothetical infinite network are not represented explicitly (fortunately, since this would require an infinite amount of memory). Only the hyperparameters are explicitly represented. A Gaussian process model is specified using the gp-spec command, which is analogous to the net-spec command. For the simple regression model, the following is one appropriate specification: > gp-spec rlog.gp 1 1 100 / 0.05:0.5 0.05:0.5 Here, "rlog.gp" is the name of the new log file that will hold the results of the Gaussian process run. The first two arguments following the log file are the numbers of inputs and outputs, respectively, both "1" for this problem. The (optional) argument of "100" that follows is the prior for the constant part of the covariance function used. This corresponds to the prior for the output unit bias in a network model. A specification for a linear part of the covariance could follow (but doesn't here); it would correspond to a prior for direct input-output connections in a network. For reasons of computational accuracy, it is best not to use too vague a prior for the constant part of the covariance, even though that would not usually be a problem from a statistical point of view. The remaining arguments (after the "/") give the priors for the hyperparameters used in an exponential term of the covariance function. These priors correspond to those for the hidden-output and input-hidden weights in a network model. (There is no counterpart here to the prior for the hidden unit biases in a network model.) The first prior is for the scale of this term, which controls the magnitude of the non-linear variation in the function. The second prior is for the relevance of the input, which controls the amount by which the input has to change to produce a change in the non-linear component of the function that is comparable to the overall scale over which this component varies. The prior specifications are in the same form as is used for network specifications (see prior.doc). The specifications of "0.05:0.5" used here are vague, allowing these hyperparameters to take on values over a wide range. The specification can be viewed by invoking 'gp-spec' with just the name of the log file: > gp-spec rlog.gp Number of inputs: 1 Number of outputs: 1 Constant part of covariance: 100.000 Exponential parts of covariance: Scale Relevance Power Flags 0.050:0.50 0.050:0.50 2.000 Once the Gaussian process model for functions has been specified, we can specify how the function values are used to model the targets in the dataset using 'model-spec', in exactly the same was as for a network model: > model-spec rlog.gp real 0.05:0.5 We also say where the training and (optionally) the test data comes from using 'data-spec': > data-spec rlog.gp 1 1 / rdata@1:100 . rdata@101: . The model and data specifications can be viewed by invoking these programs with just the name of a log file. We are now ready to sample from the posterior distribution of the hyperparameters for the Gaussian process model. To start, we can fix the hyperparameters at reasonable initial values, using 'gp-gen': > gp-gen rlog.gp fix 0.5 0.1 This fixes the scale hyperparameters to 0.5 and the relevance hyperparameters to 0.1 (linear hyperparameters, if present, would be fixed to the product of these). By default, the hyperparameters are set to the "width" value from their prior specification. Because the priors are often vague (as here), this may not be a very reasonable starting point. We now specify the Markov chain operations to be used in sampling. There are a great many possibilities for these operations. Here is one reasonable method: > mc-spec rlog.gp heatbath hybrid 20:4 0.5 This uses hybrid Monte Carlo, with trajectories 20 leapfrog steps long, with a window of 4 states. The stepsize adjustment factor used is 0.5. If the rejection rate turns out to be too high (as can be checked using the 'gp-plt t r rlog.gp' command), the stepsize should be adjusted downward. To perform these sampling operations 100 times, we use the following command: > gp-mc rlog.gp 100 This takes 0.89 seconds on the system described in Ex-test-system.doc. We can use 'gp-plt' or 'gp-display' to see how things went. The quantities that can be plotted with 'gp-plt' are similar to those that can be plotted using 'net-plt', except that quantities relating to test cases have been omitted, since they would often take a long time to compute (the 'E' and 'H' quantities, defined in the "mc" module, may also take a long time). See gp-quantities.doc for details. The following will plot the hyperparameters being sampled in this run: > gp-plt t nS1R1 rlog.gp | plot Using a plotting option that makes the vertical scale be logarithmic would be useful in this case (this is the -ly option for 'graph'). Iterations from the latter part of the run can be used to make predictions for test cases. This is done using 'gp-pred', which operates much like 'net-pred'. The following command makes predictions for the test cases based on the last 80 of the 100 iterations, and reports the average squared error: > gp-pred na rlog.gp 21: Number of iterations used: 80 Number of test cases: 1000 Average squared error guessing mean: 0.01308+-0.00127 This takes less than 0.1 seconds on the sytem used, but predictions will take longer when the number of training cases is larger, or if the median or log probability are to be found (options "d" or "p" of 'gp-pred'). As can be seen, the performance of the Gaussian process model is similar to that of the neural network model for this problem. The predictions for test cases made above are found directly from the covariances between the targets in the training cases and the unknown target in a test case. The values of the regression function for the training cases are never explicitly found. Consequently, it is not possible to plot quantities such as the squared error on training cases over the course of the run. To plot such quantities, you will have to ask for the function values for training cases to be generated in each iteration. This takes a significant amount of time, and can potentially cause numerical problems, which is why gp-plt won't just do it as needed. If you want to be able to plot the squared error on training cases (or similar quantities such as case-by-case likelihoods), you will need to change the 'mc-spec' command to the following: > mc-spec rlog.gp2 discard-values heatbath hybrid 20:4 0.5 sample-values The "sample-values" operation at the end generates function values for all the training cases, which will be stored in the log file. These values can later be used to compute the squared error for training cases, which can be plotted with a command such as > gp-plt t b rlog.gp2 | plot The "discard-values" operation throws away the function values (if present) before the operations for updating hyperparameters. Throwing away information may seem wasteful, but it actually improves convergence in this context. Unfortunately, if you make only this change, you will probably get the following error message when you try to run 'gp-mc': Couldn't find Cholesky decomposition (2) in sample-values! This message is produced when the round-off error in the matrix computations used by "sample-values" is enough to turn the results into nonsense. The problem is due to the poor "conditioning" of the covariance matrix. Roughly speaking, the covariances between neighbouring training cases are so high that knowing all but one function value is enough to determine the remaining function value to a precision comparable to the level of round-off error. To fix this, the conditioning of the covariance matrix must be improved. Changing the 'gp-spec' command as follows seems to be sufficient: > gp-spec rlog.gp2 1 1 10 - 0.01 / 0.05:0.5 0.05:0.5 There are two changes here from the 'gp-spec' command used before. First, the constant part of the covariance has been reduced from 100 to 10, which makes little difference when the data is centred at about zero, as it is for this problem. Since arithmetic is done in floating point, this increases the effective precision of the covariances. Second, the covariance now includes a "jitter" part of 0.01 (the "-" preceding this indicates that there is still no linear part). Jitter is much like noise, in that it varies independently from one case to another, but it is considered part of the function value, which noise is not. The jitter makes all the function values less predictable, reducing the problem of poor conditioning. Jitter plays a more crucial role for binary and class models.