A PROBLEM WITH A BINARY RESPONSE

In this example, each case consists of two real-valued inputs, x1 and
x2, and a binary target.  Data was generated by drawing the inputs for
a case independently from standard Gaussian distributions.  The target
was then set to "1" with the following probability:

    exp ( - ((x1-0.4)^2 + (x2+0.3)^2) ^ 2 )

ie, the negative exponential of the fourth power of the distance of
(x1,x2) from (0.5,-0.3).  I generated 1300 cases, and used the first
300 for training and the remaining 1000 for testing.  The 300 training
cases can be seen in bdata-train.png (class 0 red, class 1 green).


A neural network model for the binary response problem.

We will try modeling this data using a network with one layer of 15
hidden units and a single output unit.  For a binary target, a
Gaussian data model would be inappropriate; instead, we can use a
logistic regression model, in which the probability of the target
being "1" is obtained by passing the value of the output unit through
the logistic function, f(x)=1/(1+exp(-x)).

The network, model, and data specification commands needed are quite
similar to those used in the regression problem (see Ex-netgp-r.doc):

    > net-spec blog.net 2 15 1 / ih=0.05:0.5 bh=0.05:0.5 ho=x0.05:0.5 bo=100 
    > model-spec blog.net binary
    > data-spec blog.net 2 1 2 / bdata@1:300 . bdata@301: .
    Number of training cases: 300
    Number of test cases: 1000

The data model used is "binary", with no need for a noise level prior.

The 'data-spec' command also says that there are two inputs, and one
target.  It also has a third argument of "2" just before the "/",
which indicates that the target must be an integer with two possible
values (which are "0" and "1").

We can use commands for the initial sampling phase similar to those
used for the simple regression problem:

    > net-gen blog.net fix 0.5
    > mc-spec blog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc blog.net 1

(Note that the sample-noise operation above is actually unnecessary
here, since there is no noise for classification models.)  For the
sampling phase, we could also try using commands similar to those
presented above, but we might as well try something different (the
mc-spec command needed here is long, so it is continued by putting a
"\" at the end of the first line):

    > mc-spec blog.net repeat 10 sample-sigmas heatbath 0.95 \
                                 hybrid 100:10 0.3 negate
    > net-mc blog.net 200

The above 'mc-spec' command specifies the variant of hybrid Monte
Carlo with "persistence" in which relatively short trajectories are
used, but random walks are suppressed by only partially replacing the
momentum in the 'heatbath' step.  Note that for this to work well, the
rejection rate must be quite small.  This alternative method has the
advantage that it allows for more frequent hyperparameter updates
(with 'sample-sigmas').

The 200 iterations requested above take 7.0 seconds on the system used
(see Ex-test-system.doc).  During this time, one can monitor the
simulation, for instance with the command:

    > net-plt t h1h2h3 blog.net | plot

where again, 'plot' is some appropriate plotting program, which in
this case must be capable of plotting three superimposed graphs, for
the three hyperparameters h1, h2, and h3.  The values for these
hyperparameters exhibit quite a high degree of autocorrelation, which
is why it is advisable to allow the simulation to go for 200
iterations (or even more), in order to be more confident that the
simulation has explored all high-probability regions of the posterior
distribution.

Once the run has finished, we can make predictions using 'net-pred',
based, say, on the networks from the final 3/4 of the run (which is a
generally reasonable portion).  For a problem such as this, where the
response is binary, we might be interested in guessing whether the
target is '0' or '1', with performance measured by how often we are
right.  We can get such predictions as follows:

    > net-pred itmp blog.net 51: 

    Number of iterations used: 150

    Case  Inputs        Targets  Log Prob  Guesses  Wrong?

       1   -0.01  -1.03    1.00    -0.708      0       1 
       2    0.29  -0.18    1.00    -0.025      1       0 
       3    0.34   0.63    1.00    -0.851      0       1 
       4    0.20  -0.16    1.00    -0.029      1       0 
       5    0.03   0.51    0.00    -0.572      0       0 
       6   -0.50   0.77    0.00    -0.006      0       0 

                     (some lines omitted )

    Average log probability of targets:    -0.237+-0.017
    Fraction of guesses that were wrong:    0.1040+-0.0097

The "itmp" options ask for output that lists the inputs, the targets,
the the guesses based on the mode of the predictive distribution (and
whether they are wrong), and the logs of the probabilities of the
correct targets in the predictive distribution.  A summary is also
printed showing the average log probability of the correct target and
the fraction of guesses that were wrong, along with the standard
errors for these as estimates of performance on the population of
cases.  The average log probability is an overall indication of how
well the model has done.  The classification error rate might be more
directly relevant for some applications.


A Gaussian process model for the binary response problem.

This binary data can be modeled using a Gaussian process as well.  The
Gaussian process specifies a distribution over real-valued functions.
The function value for the inputs in a particular case can be passed
through the logistic function, f(x) = 1/(1+exp(-x)), to give the
probability that the target in that case will be '1' rather than '0'.

The specifications for a Gaussian process model analogous to the
network model used above are as follows:

    > gp-spec blog.gpl 2 1 1 - 0.1 / 0.05:0.5 0.05:0.5
    > model-spec blog.gpl binary
    > data-spec blog.gpl 2 1 2 / bdata@1:300 . bdata@301: .
    Number of training cases: 300
    Number of test cases: 200

The 'gp-spec' command used here does not quite correspond to the
'net-spec' command used above.  The following command would give a
closer correspondence:

    > gp-spec blog.gpl 2 1 100 / 0.05:0.5 0.05:0.5

The reason for modifying this, by reducing the constant part of the
covariance (from 100 to 1) and introducing a small (0.1) "jitter"
part, is to solve computational problems of poor conditioning and slow
convergence.  The "jitter" part lets the function value vary randomly
from case to case, as if a small amount of Gaussian noise were added.
This makes the function values less tightly tied to each other, easing
the computational difficulties (with the unmodified specification,
'gp-mc' would probably work only if the dataset was very small).
These changes have little effect on the statistical properties of the
model, at least for this problem.

Sampling might now be done using the following commands:

    > gp-gen blog.gpl fix 0.5 1
    > mc-spec blog.gpl repeat 4 scan-values 200 heatbath hybrid 8 0.3
    > gp-mc blog.gpl 50

These commands (and variations on them) are similar to those that you
might use for a regression problem, except that a "scan-values 200"
operation has been included.  This operation performs 200 Gibbs
sampling scans over the function values for each training case.  An
explicit representation of function values is essential for a binary
model, though not for a simple regression model, because the
relationship of function values to targets is not Gaussian.
Unfortunately, for the very same reason, the direct "sample-values"
operation is not available, so one must use Gibbs sampling or
Metropolis-Hastings updates (the met-values or mh-values operations).

The 50 sampling iterations take 25 seconds on the system used (see
Ex-test-system.doc).  While you wait, you can check the progress of
the hyperparameters using gp-display.  For example, you might see:

    > gp-display blog.gpl

    GAUSSIAN PROCESS IN FILE "blog.gpl" WITH INDEX 10
    
    HYPERPARAMETERS
    
    Constant part:
    
          1.000
    
    Jitter part:
    
            0.1
    
    Exponential parts:
    
          5.555
          0.657 :      0.657      0.657

Note that there are only two variable hyperparameters, since the
constant and jitter parts are fixed, and the relevance hyperparameters
for each input are tied to the higher-level relevance hyperparameter.
A time plot of two variable hyperparameters can be obtained as follows:

    > gp-plt t S1R1 blog.gpl | plot

Once sampling has finished, predictions for test cases can be made
using 'gp-pred'.  The following command prints only the average
classification performance:

    > gp-pred ma blog.gpl 21:

    Number of iterations used: 30

    Number of test cases: 1000

    Fraction of guesses that were wrong:    0.1050+-0.0097

This takes 1.26 seconds on the system used (see Ex-test-system.doc).
This is slower than prediction using a network model, since the
Cholesky decomposition of the covariance matrix must be found for each
iteration used (saving these would take up lots of disk space).  The
accuracy achieved is about the same as for the network model, which is
not too surprising, given that they both use a logistic function to
relate function values to target probabilities.

It is possible to define a Gaussian process model that will behave as
what is known is statistics as a "probit" model rather than a
"logistic" model.  This is done by simply setting the amount of jitter
to be large compared to the span over which the logistic function
varies from nearly 0 to nearly 1.  This can be done by changing the
'gp-spec' command to the following, in which the jitter part is 10
(the constant part is also changed to be of the same magnitude):

    > gp-spec blog.gpp 2 1 10 - 10 / 0.05:0.5 0.05:0.5

One can then proceed much as above (though it turns out that a larger
stepsize can be used).  The final performance is similar, but the
scale hyperparameter takes on much larger values.