A THREE-WAY CLASSIFICATION PROBLEM

I also created a synthetic data set for a three-way classification problem. Data items were generated by first drawing quantities x1, x2, x3, and x4 independently from distributions uniform over (0,1). The class of the item, represented by "0", "1", or "2", was then selected as follows: if the two-dimensional Euclidean distance of (x1,x2) from the point (0.4,0.5) was less than 0.35, the class was set to "0"; otherwise, if 0.8*x1+1.8*x2 was less than 0.6, the class was set to "1"; and if neither of these conditions held, the class was set to "2". Note that x3 and x4 have no effect on the class. The class selected by this procedure was the target value for the network; the inputs available for use in predicting the class were not x1, x2, x3, and x4 themselves, however, but rather these four values plus Gaussian noise of standard deviation 0.1. I generated 1000 cases in this way, of which 400 were used for training and 600 for testing.

A neural network model for the three-way classification problem.

This example will illustrate the "softmax" model for target values in a small set, and the Automatic Relevance Determination (ARD) prior for input-to-hidden weights. The network and model specifications used were as follows:

    > net-spec clog.net 4 8 3 / - x0.2:0.5:0.5 0.05:0.5 - \
                                  x0.05:0.5 - 0.05:0.5
    > model-spec clog.net class

This specifies a network with 4 input units, 8 hidden units, and 3 output units. The output units have identity activation functions, but their values are used in a "softmax" data model, in which the probability for target i is exp(o_i) / SUM_j exp(o_j), where the o_j are the values of the output units.

The prior specified for input-to-hidden weights, "x0.2:0.5:0.5", has two "alpha" values, indicating that there is both a high-level hyperparameter controlling the overall magnitude of input-to-hidden weights, and lower-level hyperparameters for each input unit, which control the magnitudes of weights out of each input. If some of the inputs are irrelevant (as are the third and fourth inputs in this problem), we hope that the corresponding hyperparameters will take on small values, forcing the weights out of these inputs to be close to zero, and thereby avoiding any damage to predictive performance that could result from the inclusion of these irrelevant inputs. The prefix of "x" causes the width of this prior to be automatically scaled according to the number of inputs. This is appropriate if the number of highly relevant inputs is thought not to depend on the total number of inputs.

We need to use the following data specification for this problem:

    > data-spec clog.net 4 1 3 / cdata@1:400 . cdata@401:1000 .
    Number of training cases: 400
    Number of test cases: 600

The arguments "4" and "1" are the numbers of input and target values. The "3" before the "/" says that each target must be one of the integers "0", "1", or "2". This "3" must match the number of softmax output units specified in the network architecture, or an error will result when training is started.

The initial and sampling phases of the simulation can be done using the same approach as was used above for the binary data set (but the "sample-noise" operation doesn't actually do anything here):

    > net-gen clog.net fix 0.5
    > mc-spec clog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc clog.net 1
    > mc-spec clog.net repeat 10 sample-sigmas heatbath 0.95 \
                hybrid 100:10 0.3 negate
    > net-mc clog.net 200

This takes about 65 minutes on our SGI machine.
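As a concrete reference, the data-generation procedure described at the start of this example can be sketched in Python as follows. This is only a minimal sketch, not the program actually used to produce the data; the file name "cdata", the random seed, and the output format are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(1)                     # arbitrary seed

    with open("cdata", "w") as f:
        for _ in range(1000):
            x = rng.uniform(0.0, 1.0, size=4)          # x1, x2, x3, x4
            if np.hypot(x[0] - 0.4, x[1] - 0.5) < 0.35:
                c = 0                                  # within 0.35 of (0.4,0.5)
            elif 0.8*x[0] + 1.8*x[1] < 0.6:
                c = 1                                  # below the line 0.8*x1 + 1.8*x2 = 0.6
            else:
                c = 2
            noisy = x + rng.normal(0.0, 0.1, size=4)   # inputs actually seen by the network
            f.write(" ".join("%8.5f" % v for v in noisy) + "  %d\n" % c)

The first 400 lines of such a file would serve as the training cases and the remaining 600 as the test cases, as in the data-spec command above.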
We can see the effect of the ARD model by looking at the values of the low-level input-hidden hyperparameters over the course of the simulation, with the command:

    > net-plt t h1@ clog.net | plot

The third and fourth hyperparameters are much smaller than the first two, indicating that ARD has done its job.

Predictions for test cases, based on the mode of the predictive distribution, can now be made in the same way as for the binary response problem. If we are interested only in the estimated classification performance, we can use the following command:

    > net-pred ma clog.net 51:

    Number of networks used: 150
    Number of test cases: 600

    Fraction of guesses that were wrong: 0.1367+-0.0140

Here, the "a" option is used to suppress everything except the summary.

A Gaussian process model for the three-way classification problem.

Gaussian process models can also be used for multi-way classification problems, in much the same way as for a binary response. Here are commands that implement a "probit" style model for the three-way classification problem, using "persistent" hybrid Monte Carlo:

    > gp-spec clog.gpp 4 3 10 - 10 / 0.05:0.5 x0.2:0.5:1
    > model-spec clog.gpp class
    > data-spec clog.gpp 4 1 3 / cdata@1:400 . cdata@401:1000 .

    > gp-gen clog.gpp fix 0.5 1
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.9 \
                hybrid 1 0.4 negate
    > gp-mc clog.gpp 5
    > mc-spec clog.gpp repeat 5 scan-values 100 heatbath 0.99 \
                hybrid 1 0.4 negate
    > gp-mc clog.gpp 100

For the first five iterations, the persistence is fairly small (0.9), so that the high initial energy will be dissipated rapidly. The remaining iterations use a larger persistence (0.99) to suppress random walks. The run takes about 220 minutes on our SGI machine. The n-cubed scaling of the computations for the Gaussian process models leads to their becoming slower relative to neural network models as the number of training cases increases.

Once the run is finished, predictions for test cases can be made as illustrated below:

    > gp-pred itm clog.gpp 51:%5

    Number of iterations used: 10

    Case  Inputs                       Targets  Guesses  Wrong?

       1   0.29  0.90  0.74  0.33       2.00       2        0
       2   0.70  0.27  0.17  0.89       2.00       2        0
       3  -0.11  0.76  0.11 -0.02       2.00       2        0
       4   0.54  0.24  0.39  0.78       0.00       0        0

             (middle lines omitted)

     597  -0.10  0.93  0.27  0.28       2.00       2        0
     598   0.14  0.19  0.08  0.70       1.00       1        0
     599   0.66  0.79  0.03  0.09       0.00       2        1
     600   0.76  0.90  0.73 -0.03       2.00       2        0

    Fraction of guesses that were wrong: 0.1333+-0.0139

This takes about 11 minutes on our SGI machine.

A "logistic" style model can be obtained by decreasing the jitter, using a 'gp-spec' command such as the following:

    > gp-spec clog.gpl 4 3 1 - 0.1 / 0.05:0.5 x0.2:0.5:1

This model appears to converge less rapidly than the probit model, but gives similar results.
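To make explicit how the class probabilities and the summary line in the output above arise, here is a minimal Python sketch; the function names and array layout are assumptions made for illustration, not code from the software. For each network (or iteration) used, output values are turned into class probabilities by the softmax formula given earlier; these probabilities are averaged, the guess for a case is the most probable class (the predictive mode), and the "+-" figure is the binomial standard error sqrt(p(1-p)/n) of the observed error fraction over the n test cases.

    import numpy as np

    def softmax(o):
        # Class probabilities exp(o_i) / SUM_j exp(o_j); the maximum is
        # subtracted first only for numerical stability.
        e = np.exp(o - o.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def classification_summary(outputs, targets):
        # outputs: array of shape (n_iterations, n_cases, n_classes)
        #          holding output-unit values for each test case.
        # targets: array of shape (n_cases,) with the true classes.
        probs = softmax(outputs).mean(axis=0)        # average probabilities over iterations
        guesses = probs.argmax(axis=1)               # predictive mode: most probable class
        p = np.mean(guesses != targets)              # fraction of guesses that were wrong
        se = np.sqrt(p * (1 - p) / len(targets))     # standard error of that fraction
        print("Fraction of guesses that were wrong: %.4f+-%.4f" % (p, se))

For instance, an error fraction of 0.1333 over 600 test cases gives a standard error of about 0.0139, matching the figure printed above.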