A THREE-WAY CLASSIFICATION PROBLEM

I also created a synthetic data set for a three-way classification
problem.  Data items were generated by first drawing quantities x1,
x2, x3, and x4 independently from distributions uniform over (0,1).
The class of the item, represented by "0", "1", or "2", was then
selected as follows: if the two-dimensional Euclidean distance of
(x1,x2) from the point (0.4,0.5) was less than 0.35, the class was set
to "0"; otherwise, if 0.8*x1+1.8*x2 was less than 0.6, the class was
set to "1"; and if neither of these conditions held, the class was set
to "2".  Note that x3 and x4 have no effect on the class.  The class
selected by this procedure was the target value for the network; the
inputs available for use in predicting the class were not x1, x2, x3,
and x4 themselves, however, but rather these four values plus Gaussian
noise of standard deviation 0.1.  I generated 1000 cases in this way,
of which 400 were used for training and 600 for testing.


A neural network model for the three-way classification problem.

This example will illustrate the "softmax" model for target values in
a small set, and the Automatic Relevance Determination (ARD) prior for
input-to-hidden weights.  The network and model specifications used
were as follows:

    > net-spec clog.net 4 8 3 / ih=x0.2:0.5:0.5 bh=0.05:0.5 \
                                ho=x0.05:0.5 bo=0.05:0.5
    > model-spec clog.net class

This specifies a network with 4 input units, 8 hidden units, and 3
output units.  The output units have identity activation functions,
but their values are used in a "softmax" data model, in which the
probability for target i is exp(o_i) / SUM_j exp(o_j), where the o_j
are the values of the output units. 

The prior specified for input-to-hidden weights, "x0.2:0.5:0.5", has
two "alpha" values, indicating that there is both a high-level
hyperparameter controlling the overall magnitude of input-to-hidden
weights, and lower-level hyperparameters for each input unit, which
control the magnitudes of weights out of each input.  If some of the
inputs are irrelevant (as are the third and fourth inputs in this
problem), we hope that the corresponding hyperparameters will take on
small values, forcing the weights out of these inputs to be close to
zero, and thereby avoiding any damage to predictive performance that
could result from the inclusion of these irrelevant inputs. The prefix
of "x" causes the width of this prior to be automatically scaled
according to the number of inputs.  This is appropriate if the number
of highly relevant inputs is thought not to depend on the total number
of inputs.

We need to use the following data specification for this problem:

    > data-spec clog.net 4 1 3 / cdata@1:400 . cdata@401:1000 .
    Number of training cases: 400
    Number of test cases: 600

The arguments "4" and "1" are the numbers of input and target values.
The "3" before the "/" says that each target must be one of the
integers "0", "1", or "2".  This "3" must match the number of softmax
output units specified in the network architecture, or an error will
result when training is started.

The initial and sampling phases of the simulation can be done using
the same approach as was used above for the binary data set (as noted
there, the "sample-noise" operation doesn't actually do anything for
classification models like this):

    > net-gen clog.net fix 0.5
    > mc-spec clog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc clog.net 1
    > mc-spec clog.net repeat 10 sample-sigmas heatbath 0.95 \
                                 hybrid 100:10 0.3 negate
    > net-mc clog.net 200

This takes 8.7 minutes on the system used (see Ex-system.doc).

We can see the effect of the ARD model by looking at the values of
the low-level input-hidden hyperparameters over the course of the
simulation, with the command:

    > net-plt t h1@ clog.net | plot

The third and fourth hyperparameters are much smaller than the first
two, indicating that ARD has done its job.

Predictions for test cases based on the predictive mode can now be
done in the same way as for the binary response problem.  If we are
interested only in the estimated classification performance, we can
use the following command:

    > net-pred ma clog.net 51:

    Number of networks used: 150

    Number of test cases: 600

    Fraction of guesses that were wrong:  0.1400+-0.0142

Here, the "a" option is used to suppress everything except the summary.

If we are interested in the probabilities for each class, we can use
the "n" option, as follows:

    > net-pred tn clog.net 51:

    Number of iterations used: 150

    Case Targets   Means               Error^2

       1    2.00    0.19   0.00   0.81  0.0726
       2    2.00    0.23   0.01   0.76  0.1073
       3    2.00    0.12   0.00   0.87  0.0319
       4    0.00    0.82   0.04   0.13  0.0524
                     
                       . . .

     597    2.00    0.04   0.00   0.96  0.0027
     598    1.00    0.33   0.65   0.01  0.2346
     599    0.00    0.23   0.00   0.77  1.1964
     600    2.00    0.03   0.00   0.97  0.0016

    Average squared error guessing mean:   0.20556+-0.01623

The "Means" columns give the predictive probabilities for the three
classes for each case.  The "Error^2" column gives the squared error
found by summing the squared differences of these probabilities with
the indicator variables that are zero for the wrong classes and one
for the correct class.  


A Gaussian process model for the three-way classification problem.

Gaussian process models can also be used for multi-way classification
problems, in much the same way as for a binary response.  Here are
commands that implement a "probit" style model for the three-way
classification problem, using "persistent" hybrid Monte Carlo:

    > gp-spec clog.gpp 4 3 10 - 10 / 0.05:0.5 x0.2:0.5:1
    > model-spec clog.gpp class
    > data-spec clog.gpp 4 1 3 / cdata@1:400 . cdata@401:1000 .

    > gp-gen clog.gpp fix 0.5 1
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.9 \
                                hybrid 1 0.3 negate
    > gp-mc clog.gpp 5
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.98 \
                                hybrid 1 0.3 negate
    > gp-mc clog.gpp 100

For the first five iterations, the persistence is fairly small (0.9),
so that the high initial energy will be dissipated rapidly.  The
remaining iterations use a larger persistence (0.98) to suppress
random walks.  This run takes 19 minutes on the system used (see
Ex-system.doc).  The cubic scaling of the computations for the
Gaussian process models leads to their becoming slower relative to
neural network models as the number of training cases increases.

One can observe how the relevance hyperparameters change during the
run as follows, assuming that the "plot" program can display four
superimposed graphs:

    > gp-plt t R1@ clog.gpp | plot

Once the run is finished, predictions for test cases can be made
as illustrated below:

    > gp-pred itm clog.gpp 21:%5

    Number of iterations used: 16
    
    Case  Inputs                       Targets  Guesses  Wrong?
    
       1    0.29   0.90   0.74   0.33    2.00       2       0
       2    0.70   0.27   0.17   0.89    2.00       2       0
       3   -0.11   0.76   0.11  -0.02    2.00       2       0
       4    0.54   0.24   0.39   0.78    0.00       0       0

                      (middle lines omitted)

     597   -0.10   0.93   0.27   0.28    2.00       2       0
     598    0.14   0.19   0.08   0.70    1.00       1       0
     599    0.66   0.79   0.03   0.09    0.00       2       1
     600    0.76   0.90   0.73  -0.03    2.00       2       0
    
    Fraction of guesses that were wrong:  0.1350+-0.0140
    
This takes 27 seconds on the system used (see Ex-system.doc).  The "n"
option can be used to get the predictive probability for each class,
as illustrated above for net-pred.

A "logistic" style model can be obtained by decreasing the jitter,
using a 'gp-spec' command such as the following:

    > gp-spec clog.gpl 4 3 1 - 0.1 / 0.05:0.5 x0.2:0.5:1

A smaller stepsize is necessary with this model (0.15 rather than
0.3).  Performance is similar.