A THREE-WAY CLASSIFICATION PROBLEM

I also created a synthetic data set for a three-way classification
problem.  Data items were generated by first drawing quantities x1,
x2, x3, and x4 independently from distributions uniform over (0,1).
The class of the item, represented by "0", "1", or "2", was then
selected as follows: if the two-dimensional Euclidean distance of
(x1,x2) from the point (0.4,0.5) was less than 0.35, the class was set
to "0"; otherwise, if 0.8*x1+1.8*x2 was less than 0.6, the class was
set to "1"; and if neither of these conditions held, the class was set
to "2".  Note that x3 and x4 have no effect on the class.  The class
selected by this procedure was the target value for the network; the
inputs available for use in predicting the class were not x1, x2, x3,
and x4 themselves, however, but rather these four values plus Gaussian
noise of standard deviation 0.1.  I generated 1000 cases in this way,
of which 400 were used for training and 600 for testing.


A neural network model for the three-way classification problem.

This example will illustrate the "softmax" model for target values in
a small set, and the Automatic Relevance Determination (ARD) prior for
input-to-hidden weights.  The network and model specifications used
were as follows:

    > net-spec clog.net 4 8 3 / - x0.2:0.5:0.5 0.05:0.5 - \
                                x0.05:0.5 - 0.05:0.5
    > model-spec clog.net class

This specifies a network with 4 input units, 8 hidden units, and 3
output units.  The output units have identity activation functions,
but their values are used in a "softmax" data model, in which the
probability for target i is exp(o_i) / SUM_j exp(o_j), where the o_j
are the values of the output units. 

The prior specified for input-to-hidden weights, "x0.2:0.5:0.5", has
two "alpha" values, indicating that there is both a high-level
hyperparameter controlling the overall magnitude of input-to-hidden
weights, and lower-level hyperparameters for each input unit, which
control the magnitudes of weights out of each input.  If some of the
inputs are irrelevant (as are the third and fourth inputs in this
problem), we hope that the corresponding hyperparameters will take on
small values, forcing the weights out of these inputs to be close to
zero, and thereby avoiding any damage to predictive performance that
could result from the inclusion of these irrelevant inputs. The prefix
of "x" causes the width of this prior to be automatically scaled
according to the number of inputs.  This is appropriate if the number
of highly relevant inputs is thought not to depend on the total number
of inputs.

We need to use the following data specification for this problem:

    > data-spec clog.net 4 1 3 / cdata@1:400 . cdata@401:1000 .
    Number of training cases: 400
    Number of test cases: 600

The arguments "4" and "1" are the numbers of input and target values.
The "3" before the "/" says that each target must be one of the
integers "0", "1", or "2".  This "3" must match the number of softmax
output units specified in the network architecture, or an error will
result when training is started.

The initial and sampling phases of the simulation can be done using
the same approach as was used above for the binary data set (but the
"sample-noise" operation doesn't actually do anything here):

    > net-gen clog.net fix 0.5
    > mc-spec clog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc clog.net 1
    > mc-spec clog.net repeat 10 sample-sigmas heatbath 0.95 \
                                 hybrid 100:10 0.3 negate
    > net-mc clog.net 200

This takes about 25 minutes on our SGI machine.

We can see the effect of the ARD model by looking at the values of
the low-level input-hidden hyperparameters over the course of the
simulation, with the command:

    > net-plt t h1@ clog.net | plot

The third and fourth hyperparameters are much smaller than the first
two, indicating that ARD has done its job.

Predictions for test cases based on the predictive mode can now be
done in the same way as for the binary response problem.  If we are
interested only in the estimated classification performance, we can
use the following command:

    > net-pred ma clog.net 51:

    Number of networks used: 150

    Number of test cases: 600

    Fraction of guesses that were wrong:  0.1367+-0.0140

Here, the "a" option is used to suppress everything except the summary.


A Gaussian process model for the three-way classification problem.

Gaussian process models can also be used for multi-way classification
problems, in much the same way as for a binary response.  Here are
commands that implement a "probit" style model for the three-way
classification problem, using "persistent" hybrid Monte Carlo:

    > gp-spec clog.gpp 4 3 10 - 10 / 0.05:0.5 x0.2:0.5:1
    > model-spec clog.gpp class
    > data-spec clog.gpp 4 1 3 / cdata@1:400 . cdata@401:1000 .

    > gp-gen clog.gpp fix 0.5 1
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.9 \
                                hybrid 1 0.3 negate
    > gp-mc clog.gpp 5
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.99 \
                                hybrid 1 0.3 negate
    > gp-mc clog.gpp 100

For the first five iterations, the persistence is fairly small (0.9),
so that the high initial energy will be dissipated rapidly.  The
remaining iterations use a larger persistence (0.99) to suppress
random walks.  The run takes about 90 minutes on our machine.  The
n-cubed scaling of the computations for the Gaussian process models
leads to their becoming slower relative to neural network models as
the number of training cases increases.

One can observe how the relevance hyperparameters change during the
run as follows, assuming that the "plot" program can display four
superimposed graphs:

    > gp-plt t R1@ clog.gpp | plot

Once the run is finished, predictions for test cases can be made
as illustrated below:

    > gp-pred itm clog.gpp 51:%5

    Number of iterations used: 10
    
    Case  Inputs                       Targets  Guesses  Wrong?
    
       1    0.29   0.90   0.74   0.33    2.00       2       0
       2    0.70   0.27   0.17   0.89    2.00       2       0
       3   -0.11   0.76   0.11  -0.02    2.00       2       0
       4    0.54   0.24   0.39   0.78    0.00       0       0

                      (middle lines omitted)

     597   -0.10   0.93   0.27   0.28    2.00       2       0
     598    0.14   0.19   0.08   0.70    1.00       1       0
     599    0.66   0.79   0.03   0.09    0.00       2       1
     600    0.76   0.90   0.73  -0.03    2.00       2       0
    
    Fraction of guesses that were wrong:  0.1317+-0.0138
    
This takes about three minutes on our machine.

A "logistic" style model can be obtained by decreasing the jitter,
using a 'gp-spec' command such as the following:

    > gp-spec clog.gpl 4 3 1 - 0.1 / 0.05:0.5 x0.2:0.5:1

This model appears to convergence less rapidly than the probit model,
and requires a smaller stepsize, but gives similar results.