A THREE-WAY CLASSIFICATION PROBLEM

This example uses a synthetic data set for a three-way classification
problem.  Data items were generated by first drawing quantities x1, x2,
x3, and x4 independently from distributions uniform over (0,1).  The
class of the item, represented by "0", "1", or "2", was then selected
as follows: if the two-dimensional Euclidean distance of (x1,x2) from
the point (0.4,0.5) was less than 0.35, the class was set to "0";
otherwise, if 0.8*x1+1.8*x2 was less than 0.6, the class was set to
"1"; and if neither of these conditions held, the class was set to
"2".  Note that x3 and x4 have no effect on the class.

The class selected by this procedure was the target value for the
network; the inputs available for use in predicting the class were not
x1, x2, x3, and x4 themselves, however, but rather these four values
plus Gaussian noise of standard deviation 0.1.  I generated 1400 cases
in this way, of which 400 were used for training and 1000 for testing.
The values of x1, x2, and the class (red=0, green=1, blue=2) for the
training data can be seen in cdata-train-x1x2.png.

A neural network model for the three-way classification problem.

This example will illustrate the "softmax" model for target values in
a small set, and the Automatic Relevance Determination (ARD) prior for
input-to-hidden weights.  The network and model specifications used
were as follows:

    > net-spec clog.net 4 8 3 / ih=x0.2:0.5:1 bh=0.05:0.5 \
                               ho=x0.05:0.5 bo=0.05:0.5
    > model-spec clog.net class

This specifies a network with 4 input units, 8 hidden units (with the
default tanh activation function), and 3 output units.  The output
units have identity activation functions, but their values are used in
a "softmax" data model, in which the probability for target i is
exp(o_i) / SUM_j exp(o_j), where the o_j are the values of the output
units.
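The generation procedure described above can be sketched as follows.
This is not the original program used to make cdata; the random seed
and the structure of the code are assumptions for illustration, but
the class rule and noise level follow the description exactly.

```python
# Sketch of the synthetic data generation described above.
import math
import random

def true_class(x1, x2):
    """Class rule: depends only on x1 and x2; x3 and x4 are ignored."""
    if math.hypot(x1 - 0.4, x2 - 0.5) < 0.35:
        return 0                       # inside the circle -> class 0
    if 0.8 * x1 + 1.8 * x2 < 0.6:
        return 1                       # below the line -> class 1
    return 2                           # everything else -> class 2

def gen_case(rng):
    xs = [rng.random() for _ in range(4)]         # x1..x4 ~ Uniform(0,1)
    target = true_class(xs[0], xs[1])
    inputs = [x + rng.gauss(0, 0.1) for x in xs]  # noisy versions the net sees
    return inputs, target

rng = random.Random(1)                 # seed is an arbitrary choice
cases = [gen_case(rng) for _ in range(1400)]      # 400 train + 1000 test
```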
The prior specified above for input-to-hidden weights, "x0.2:0.5:1",
has two "alpha" values, indicating that there is both a high-level
hyperparameter controlling the overall magnitude of input-to-hidden
weights and lower-level hyperparameters, one per input unit, that
control the magnitudes of the weights out of that input.  If some of
the inputs are irrelevant (as the third and fourth inputs are in this
problem), we hope that the corresponding hyperparameters will take on
small values, forcing the weights out of those inputs to be close to
zero, and thereby avoiding any damage to predictive performance that
could result from the inclusion of these irrelevant inputs.  The
prefix of "x" causes the width of this prior to be automatically
scaled according to the number of inputs.  This is appropriate if the
number of highly relevant inputs is thought not to depend on the total
number of inputs.

We need to use the following data specification for this problem:

    > data-spec clog.net 4 1 3 / cdata@1:400 . cdata@401: .
    Number of training cases: 400
    Number of test cases: 1000

The arguments "4" and "1" are the numbers of input and target values.
The "3" before the "/" says that each target must be one of the
integers "0", "1", or "2".  This "3" must match the number of softmax
output units specified in the network architecture, or an error will
result when training is started.

The initial and sampling phases of the simulation can be done using
the same approach as was used above for the binary data set (as noted
there, the "sample-noise" operation doesn't actually do anything for
classification models like this):

    > net-gen clog.net fix 0.5
    > mc-spec clog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc clog.net 1
    > mc-spec clog.net repeat 10 sample-sigmas heatbath 0.95 \
                hybrid 100:10 0.3 negate
    > net-mc clog.net 200

This takes 10.4 seconds on the system used (see Ex-test-system.doc).
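The hierarchical structure of the ARD prior can be sketched as
follows.  This is a conceptual illustration only, not FBM's actual
implementation: the lognormal form of the hyperprior and the parameter
values here are assumptions chosen to show the two-level structure, in
which one relevance hyperparameter per input sets the width of the
prior on all weights leaving that input.

```python
# Conceptual sketch of an ARD-style two-level prior (hyperprior shape
# is an assumption for illustration; FBM's actual priors differ).
import random

def sample_ard_weights(n_inputs, n_hidden, rng, overall_scale=0.2):
    weights = []
    for i in range(n_inputs):
        # One relevance hyperparameter per input unit: it controls the
        # magnitudes of all weights out of that input.
        sigma_i = overall_scale * rng.lognormvariate(0, 1)
        weights.append([rng.gauss(0, sigma_i) for _ in range(n_hidden)])
    return weights

rng = random.Random(0)
w = sample_ard_weights(4, 8, rng)
# If the posterior drives sigma_i toward zero for an irrelevant input
# (such as the third and fourth inputs here), every weight in row i is
# forced toward zero as well, effectively pruning that input.
```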
We can see the effect of the ARD model by looking at the values of the
low-level input-hidden hyperparameters over the course of the
simulation, with the command:

    > net-plt t h1@ clog.net | plot

Example output for this plot is in cplt-h1.png.  The third and fourth
hyperparameters are much smaller than the first two, indicating that
ARD has done its job.

Predictions for test cases can now be made in the same way as for the
binary response problem.  If we are interested only in the estimated
classification performance, we can use the following command:

    > net-pred ma clog.net 51:

    Number of networks used: 150
    Number of test cases: 1000

    Fraction of guesses that were wrong:  0.1980+-0.0126

Here, the "a" option is used to suppress everything except the
summary.  If we are interested in the probabilities for each class, we
can use the "n" option, as follows:

    > net-pred tn clog.net 51:

    Number of iterations used: 150

    Case  Targets      Means            Error^2

       1    0.00    0.78  0.00  0.22    0.0953
       2    2.00    0.36  0.01  0.63    0.2691
       3    2.00    0.02  0.00  0.98    0.0008
       4    0.00    0.67  0.00  0.33    0.2183
       5    0.00    0.92  0.02  0.06    0.0094
       6    1.00    0.20  0.70  0.09    0.1364

            (some lines omitted)

    Average squared error guessing mean:  0.27445+-0.01381

The "Means" columns give the predictive probabilities for the three
classes for each case.  The "Error^2" column gives the squared error,
found by summing the squared differences between these probabilities
and the indicator variables that are zero for the wrong classes and
one for the correct class.

A Gaussian process model for the three-way classification problem.

Gaussian process models can also be used for multi-way classification
problems, in much the same way as for a binary response.
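The Error^2 computation just described can be sketched directly.  Note
that applying it to the printed probabilities only approximates the
printed Error^2 values, since the output above shows probabilities
rounded to two decimal places.

```python
# Sketch of the Error^2 column: squared distance between the
# predictive probability vector and the one-hot indicator of the
# true class.
def squared_error(probs, target):
    indicator = [1.0 if k == target else 0.0 for k in range(len(probs))]
    return sum((p - t) ** 2 for p, t in zip(probs, indicator))

# For case 1 above (target 0), using the rounded printed probabilities:
e = squared_error([0.78, 0.00, 0.22], 0)
# e is about 0.0968, close to the printed 0.0953; the gap is due to
# the two-decimal rounding of the probabilities shown in the output.
```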
Here are commands that implement a "probit" style model for the
three-way classification problem, using "persistent" hybrid Monte
Carlo:

    > gp-spec clog.gpp 4 3 10 - 10 / 0.05:0.5 x0.2:0.5:1
    > model-spec clog.gpp class
    > data-spec clog.gpp 4 1 3 / cdata@1:400 . cdata@401: .
    > gp-gen clog.gpp fix 0.5 1
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.9 \
                hybrid 1 0.3 negate
    > gp-mc clog.gpp 5
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.98 \
                hybrid 1 0.3 negate
    > gp-mc clog.gpp 100

For the first five iterations, the persistence is fairly small (0.9),
so that the high initial energy will be dissipated rapidly.  The
remaining iterations use a larger persistence (0.98) to suppress
random walks.  This run takes 73 seconds on the system used (see
Ex-test-system.doc).  The cubic scaling of the computations for the
Gaussian process models leads to their becoming slower relative to
neural network models as the number of training cases increases.

One can observe how the relevance hyperparameters change during the
run as follows, assuming that the "plot" program can display four
superimposed graphs:

    > gp-plt t R1@ clog.gpp | plot

Once the run is finished, predictions for test cases can be made as
illustrated below:

    > gp-pred itm clog.gpp 21:%5

    Number of iterations used: 16

    Case        Inputs            Targets  Guesses  Wrong?

       1   0.26  0.70  0.50  0.63   0.00      0       0
       2  -0.01  0.65  0.61  0.76   2.00      2       0
       3   0.74  0.90  0.18  0.17   2.00      2       0
       4   0.32  0.78  0.49  0.36   0.00      0       0
       5   0.26  0.51  0.44  0.75   0.00      0       0
       6  -0.05  0.15  0.07  0.96   1.00      1       0

            (some lines omitted)

    Fraction of guesses that were wrong:  0.1950+-0.0125

This takes 3.3 seconds on the system used (see Ex-test-system.doc).
The "n" option can be used to get the predictive probability for each
class, as illustrated above for net-pred.
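The role of the persistence parameter can be sketched with the
standard partial momentum refresh used in persistent hybrid Monte
Carlo.  This is a generic illustration of the idea, not FBM's actual
heatbath code.

```python
# Sketch of a partial momentum refresh with persistence alpha.  The
# update keeps N(0,1) as the stationary distribution of the momentum:
# alpha near 1 retains most of the old momentum between trajectories,
# suppressing random-walk behaviour; alpha near 0 refreshes it almost
# completely, dissipating energy quickly.
import math
import random

def partial_refresh(p, alpha, rng):
    return alpha * p + math.sqrt(1 - alpha ** 2) * rng.gauss(0, 1)

rng = random.Random(0)
p = 5.0                                  # artificially high initial momentum
for _ in range(5):
    p = partial_refresh(p, 0.9, rng)     # small persistence: decays fast
```

This matches the schedule above: persistence 0.9 for the first five
iterations to dissipate the high initial energy, then 0.98 for the
remaining iterations to suppress random walks.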
A "logistic" style model can be obtained by decreasing the jitter,
using a gp-spec command such as the following:

    > gp-spec clog.gpl 4 3 1 - 0.1 / 0.05:0.5 x0.2:0.5:1

A smaller stepsize is necessary with this model (0.15 rather than
0.3).  Performance is similar.
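That performance is similar is unsurprising, since the probit and
logistic link functions are nearly identical after rescaling.  The
sketch below is a generic illustration of this closeness, unrelated to
FBM itself; 1.702 is the standard constant for matching the two
curves.

```python
# The probit link (Gaussian CDF) and the logistic function differ by
# less than about 0.01 everywhere once the input is rescaled by 1.702.
import math

def probit(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def logistic(x):
    return 1 / (1 + math.exp(-x))

grid = [i / 100.0 for i in range(-600, 601)]
gap = max(abs(probit(x) - logistic(1.702 * x)) for x in grid)
```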