A PROBLEM WITH A BINARY RESPONSE In this example, each case consists of two real-valued inputs, x1 and x2, and a binary target. Data was generated by drawing the inputs for a case independently from standard Gaussian distributions. The target was then set to "1" with the following probability: exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 ) ie, the negative exponential of the fourth power of the distance of (x1,x2) from (0.5,-0.3). I generated 500 cases, and used the first 300 for training and the remaining 200 for testing. A neural network model for the binary response problem. We will try modeling this data using a network with one layer of 15 hidden units and a single output unit. For a binary target, a Gaussian data model would be inappropriate; instead, we can use a logistic regression model, in which the probability of the target being "1" is obtained by passing the value of the output unit through the logistic function, f(x)=1/(1+exp(-x)). The network, model, and data specification commands needed are quite similar to those used in the regression problem (see Ex-netgp-r.doc): > net-spec blog.net 2 15 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 > model-spec blog.net binary > data-spec blog.net 2 1 2 / bdata@1:300 . bdata@301:500 . Number of training cases: 300 Number of test cases: 200 The data model used is "binary", with no need for a noise level prior. The 'data-spec' command also says that there are two inputs, and one target. It also has a third argument of "2" just before the "/", which indicates that the target must be an integer with two possible values (which are "0" and "1"). The initial phase commands that were used for the simple regression problem turn out to be adequate for this problem as well: > net-gen blog.net fix 0.5 > mc-spec blog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2 > net-mc blog.net 1 (The sample-noise operation above is actually unnecessary, since there is no noise for this model.) For the sampling phase, we could also try using commands similar to those presented above, but we might as well try something different (the mc-spec command needed here is long, so it is continued by putting a "\" at the end of the first line): > mc-spec blog.net repeat 10 sample-sigmas heatbath 0.95 \ hybrid 100:10 0.3 negate > net-mc blog.net 200 The above 'mc-spec' command specifies the variant of hybrid Monte Carlo with "persistence" in which relatively short trajectories are used, but random walks are suppressed by only partially replacing the momentum in the 'heatbath' step. Note that for this to work well, the rejection rate must be quite small. This alternative method has the advantage that it allows for more frequent hyperparameter updates (with 'sample-sigmas'). The 200 iterations requested above take about 7.5 minutes on the system used (see Ex-system.doc). During this time, one can monitor the simulation, for instance with the command: > net-plt t h1h2h3 blog.net | plot where again, 'plot' is some appropriate plotting program, which in this case must be capable of plotting three superimposed graphs, for the three hyperparameters h1, h2, and h3. The values for these hyperparameters exhibit quite a high degree of autocorrelation, which is why it is advisable to allow the simulation to go for 200 iterations, in order to be more confident that the simulation has explored all high-probability regions of the posterior distribution. Once the run has finished, we can make predictions using 'net-pred', based, say, on the networks from the final 3/4 of the run (which is a generally reasonable portion). For a problem such as this, where the response is binary, we might be interested in guessing whether the target is '0' or '1', with performance measured by how often we are right. We can get such predictions as follows: > net-pred itmp blog.net 51: Number of iterations used: 150 Case Inputs Targets Log Prob Guesses Wrong? 1 -1.56 0.90 0.00 -0.001 0 0 2 0.09 -0.13 1.00 -0.057 1 0 3 -0.79 0.85 0.00 -0.004 0 0 (some lines omitted ) 99 0.14 0.86 1.00 -2.594 0 1 100 0.21 0.18 1.00 -0.103 1 0 101 -1.32 -0.19 0.00 -0.006 0 0 (some lines omitted ) 198 1.49 0.70 0.00 -0.030 0 0 199 -2.23 -0.28 0.00 -0.007 0 0 200 -0.91 -0.03 0.00 -0.055 0 0 Average log probability of targets: -0.244+-0.034 Fraction of guesses that were wrong: 0.0850+-0.0198 The "itmp" options ask for output listing the inputs, the targets, the the guesses based on the mode of the predictive distribution, and the log of the probability of the correct target in the predictive distribution. A summary is also printed showing the average log probability of the correct target and the fraction of guesses that were wrong, along with the standard errors for these as estimates of performance on the population of cases. The average log probability is an overall indication of how well the model has done. The classification error rate might be more directly relevant for some applications. A Gaussian process model for the binary response problem. This binary data can be modeled using a Gaussian process as well. The Gaussian process specifies a distribution over real-valued functions. The function value for the inputs in a particular case can be passed through the logistic function, f(x) = 1/(1+exp(-x)), to give the probability that the target in that case will be '1' rather than '0'. The specifications for a Gaussian process model analogous to the network model used above are as follows: > gp-spec blog.gpl 2 1 1 - 0.1 / 0.05:0.5 0.05:0.5 > model-spec blog.gpl binary > data-spec blog.gpl 2 1 2 / bdata@1:300 . bdata@301:500 . Number of training cases: 300 Number of test cases: 200 The 'gp-spec' command used here does not quite correspond to the 'net-spec' command used above. The following command would give a closer correspondence: > gp-spec blog.gpl 2 1 100 / 0.05:0.5 0.05:0.5 The reason for modifying this, by reducing the constant part of the covariance (from 100 to 1) and introducing a small (0.1) "jitter" part, is to solve computational problems of poor conditioning and slow convergence. The "jitter" part lets the function value vary randomly from case to case, as if a small amount of Gaussian noise were added. This makes the function values less tightly tied to each other, easing the computational difficulties (with the unmodified specification, 'gp-mc' would probably work only if the dataset was very small). These changes have little effect on the statistical properties of the model, at least for this problem. Sampling might now be done using the following commands: > gp-gen blog.gpl fix 0.5 1 > mc-spec blog.gpl repeat 4 scan-values 200 heatbath hybrid 8 0.3 > gp-mc blog.gpl 50 These commands (and variations on them) are similar to those that you might use for a regression problem, except that a "scan-values 200" operation has been included. This operation performs 200 Gibbs sampling scans over the function values for each training case. An explicit representation of function values is essential for a binary model, though not for a simple regression model, because the relationship of function values to targets is not Gaussian. Unfortunately, for the very same reason, the direct "sample-values" operation is not available, so one must use Gibbs sampling or Metropolis-Hastings updates (the met-values or mh-values operations). The 50 sampling iterations take 5.9 minutes on the system used (see Ex-system.doc). While you wait, you can check the progress of the hyperparameters using gp-display: > gp-display blog.gpl GAUSSIAN PROCESS IN FILE "blog.gpl" WITH INDEX 10 HYPERPARAMETERS Constant part: 1.000 Jitter part: 0.1 Exponential parts: 5.973 0.620 : 0.620 0.620 Note that there are only two variable hyperparameters, since the constant and jitter parts are fixed, and the relevance hyperparameters for each input are tied to the higher-level relevance hyperparameter. A time plot of two variable hyperparameters can be obtained as follows: > gp-plt t S1R1 blog.gpl | plot Once sampling has finished, predictions for test cases can be made using 'gp-pred'. The following command prints only the average classification performance: > gp-pred ma blog.gpl 21:%2 Number of iterations used: 15 Number of test cases: 200 Fraction of guesses that were wrong: 0.0950+-0.0208 This takes 2.5 seconds on the system used (see Ex-system.doc). This is slower than prediction using a network model, since the Cholesky decomposition of the covariance matrix must be found for each iteration used (saving these would take up lots of disk space). The accuracy achieved is about the same as for the network model, which is not too surprising, given that they both use a logistic function to relate function values to target probabilities. It is possible to define a Gaussian process model that will behave as what is known is statistics as a "probit" model rather than a "logistic" model. This is done by simply setting the amount of jitter to be large compared to the span over which the logistic function varies from nearly 0 to nearly 1. This can be done by changing the 'gp-spec' command to the following, in which the jitter part is 10 (the constant part is also changed to be of the same magnitude): > gp-spec blog.gpp 2 1 10 - 10 / 0.05:0.5 0.05:0.5 One can then proceed much as above (though it turns out that a larger stepsize can be used). The final performance is similar, but the scale hyperparameter takes on much larger values.