A PROBLEM WITH A BINARY RESPONSE As a second example, I generated a data set with two real-valued inputs, x1 and x2, and a binary target. The inputs were drawn independently from standard Gaussian distributions. The target was then set to "1" with the following probability: exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 ) ie, the negative exponential of the fourth power of the distance of (x1,x2) from (0.5,-0.3). I generated 500 cases, and used the first 300 for training and the remaining 200 for testing. A neural network model for the binary response problem. For this problem, we can again try using a network with one layer of hidden units (fifteen for this problem), and a single output unit. For a binary target, a Gaussian data model would be inappropriate; instead, we can use a logistic regression model, in which the probability of the target being "1" is obtained by passing the value of the output unit through the logistic function, f(x)=1/(1+exp(-x)). The network, model, and data specification commands needed are quite similar to those used in the regression problem above: > net-spec blog.net 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 > model-spec blog.net binary > data-spec blog.net 2 1 2 / bdata@1:300 . bdata@301:500 . Number of training cases: 300 Number of test cases: 200 The 'net-spec' command differs only in the number of input units (2), the number of hidden units (15). The data model used is "binary", with no need for a noise level prior. The 'data-spec' command also says that there are two inputs, and one target. It also has a third argument of "2" just before the "/", which indicates that the target must be an integer with two possible values (which are "0" and "1"). The initial phase commands that were used for the simple regression problem turn out to be adequate for this problem as well: > net-gen blog.net fix 0.5 > mc-spec blog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2 > net-mc blog.net 1 For the sampling phase, we could also try using commands similar to those presented above, but we might as well try something different (the mc-spec command needed here is long, so it is continued by putting a "\" at the end of the first line): > mc-spec blog.net repeat 10 sample-sigmas heatbath 0.95 \ hybrid 100:10 0.3 negate > net-mc blog.net 200 The above 'mc-spec' command specifies the variant of hybrid Monte Carlo with "persistence" in which relatively short trajectories are used, but random walks are suppressed by only partially replacing the momentum in the 'heatbath' step. Note that for this to work well, the rejection rate must be quite small. This alternative method has the advantage that it allows for more frequent hyperparameter updates (with 'sample-sigmas'). On our SGI machine, the 200 iterations requested above take about twenty minutes. During this time, one can monitor the simulation, for instance with the command: > net-plt t h1h2h3 blog.net | plot where again, 'plot' is some appropriate plotting program, which in this case must be capable of plotting three superimposed graphs, for the three hyperparameters h1, h2, and h3. The values for these hyperparameters exhibit quite a high degree of autocorrelation, which is why it is advisable to allow the simulation to go for 200 iterations, in order to be more confident that the simulation has explored all high-probability regions of the posterior distribution. Once the run has finished, we can make predictions using 'net-pred', based, say, on the networks from the final 3/4 of the run (which is a generally reasonable portion). For a problem such as this, where the response is binary, we are most likely interested in guessing whether the target is "0" or "1", with performance measured by how often we are right. We can get such predictions as follows: > net-pred itm blog.net 51: Number of networks used: 150 Case Inputs Targets Guesses Wrong? 1 -1.56 0.90 0.00 0 0 2 0.09 -0.13 1.00 1 0 3 -0.79 0.85 0.00 0 0 ( midde lines omitted ) 198 1.49 0.70 0.00 0 0 199 -2.23 -0.28 0.00 0 0 200 -0.91 -0.03 0.00 0 0 Fraction of guesses that were wrong: 0.0900+-0.0203 The "itm" options ask for output listing the inputs, the targets, and the guesses based on the mode of the predictive distribution. A summary is also printed that reports the fraction of guesses that were wrong, along with the standard error for this estimate of the error rate. A Gaussian process model for the binary response problem. This binary data can be modeled using a Gaussian process as well. The Gaussian process specifies a distribution over real-valued functions. The function value for the inputs in a particular case can be passed through the logistic function, f(x) = 1/(1+exp(-x)), to give the probability that the target in that case will be '1' rather than '0'. The specifications for a Gaussian process model analogous to the network model used above are as follows: > gp-spec blog.gpl 2 1 1 - 0.1 / 0.05:0.5 0.05:0.5 > model-spec blog.gpl binary > data-spec blog.gpl 2 1 2 / bdata@1:300 . bdata@301:500 . Number of training cases: 300 Number of test cases: 200 The 'gp-spec' command used here does not quite correspond to the 'net-spec' command used above. The following command would give a closer correspondence: > gp-spec blog.gpl 2 1 100 / 0.05:0.5 0.05:0.5 The reason for modifying this, by reducing the constant part of the covariance (from 100 to 1) and introducing a small (0.1) "jitter" part, is to solve computational problems of poor conditioning and slow convergence. The "jitter" part lets the function value vary randomly from case to case, as if a small amount of Gaussian noise were added. This makes the function values less tightly tied to each other, easing the computational difficulties (with the unmodified specification, 'gp-mc' would probably work only if the dataset was very small). These changes have little effect on the statistical properties of the model, at least for this problem. Sampling might now be done using the following commands: > gp-gen blog.gpl fix 0.5 1 > mc-spec blog.gpl repeat 4 scan-values 200 heatbath hybrid 6 0.5 > gp-mc blog.gpl 50 These commands (and variations on them) are similar to those that you might use for a regression problem, except that a "scan-values 200" operation has been included. This operation performs 200 Gibbs sampling scans over the function values for each training case. An explicit representation of function values is essential for a binary model, though not for a simple regression model, because the relationship of function values to targets is not Gaussian. Unfortunately, for the very same reason, the direct "sample-values" operation is not available, so Gibbs sampling must be done. The 50 sampling iterations take about 17 minutes on our SGI machine. While you wait, you can check the progress of the hyperparameters using gp-display: > gp-display blog.gpl GAUSSIAN PROCESS IN FILE "blog.gpl" WITH INDEX 10 HYPERPARAMETERS Constant part: 1.000 Jitter part: 0.1 Exponential parts: 5.767 0.683 : 0.683 0.683 Note that there are only two variable hyperparameters, since the constant and jitter parts are fixed, and the relevance hyperparameters for each input are tied to the higher-level relevance hyperparameter. A time plot of two variable hyperparameters can be obtained as follows: > gp-plt t S1R1 blog.gpl | plot Once sampling has finished, predictions for test cases can be made using 'gp-pred'. The following command prints only the average classification performance: > gp-pred ma blog.gpl 21:%2 Number of iterations used: 15 Number of test cases: 200 Fraction of guesses that were wrong: 0.0900+-0.0203 This takes about about ten seconds on our machine. This is slower than prediction using a network model, since a covariance matrix must be inverted for each iteration used (saving the inverses for each iteration would take up lots of disk space). The accuracy achieved is about the same as the network model, which is not too surprising, given that they both use a logistic function to relate function values to target probabilities. It is possible to define a Gaussian process model that will behave as what is known is statistics as a "probit" model rather than a "logistic" model. This is done by simply setting the amount of jitter to be large compared to the span over which the logistic function varies from nearly 0 to nearly 1. This can be done by changing the 'gp-spec' command to the following, in which the jitter part is 10 (the constant part is also changed to be of the same magnitude): > gp-spec blog.gpp 2 1 10 - 10 / 0.05:0.5 0.05:0.5 One can then proceed as above. The final performance is similar, but the scale hyperparameter takes on much larger values.