A REGRESSION PROBLEM WITH OUTLIERS Finally, we will go back to the simple regression problem we started with, but now some of the cases will be "outliers", for which the noise is much greater than for normal cases. In this synthetic data, the input variable, x, again had a standard Gaussian distribution and the corresponding target value came from a distribution with mean given by 0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2) For most cases, the distribution about this mean was Gaussian with standard deviation 0.1. However, with probability 0.05, a case is an "outlier", for which the standard deviation was 1.0 instead. I generated 200 cases in total, stored in the file 'odata'. The first 100 of these cases are meant for training, the second 100 for testing. It is also possible to test on 'rdata', to see how well the function learned predicts data that is never corrupted by high noise. A neural network model for regression with outliers. One way to model data with "outliers" is to let the noise level vary from one case to another. If the noise for the outlier cases is set to be higher, they will end up having less influence on the function learned, as is desirable. The software allows the noise variance for a case to vary according to an inverse gamma distribution. This is effectively the same as letting the noise have a t-distribution rather than a Gaussian distribution. The commands used to do this are as follows: > net-spec olog.net 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 > model-spec olog.net real 0.05:0.5::4 > data-spec olog.net 1 1 / odata@1:100 . odata@101:200 . > net-gen olog.net fix 0.5 > mc-spec olog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2 > net-mc olog.net 1 > mc-spec olog.net sample-sigmas heatbath hybrid 1000:10 0.4 > net-mc olog.net 100 The crucial difference is in the 'model-spec' command, where the noise prior of 0.05:0.5::4 specifies that the per-case noise precision (inverse variance) follows a gamma distribution with shape parameter of 4. When this is integrated over, a t-distribution with 4 degrees of freedom results. This t-distribution is by no means an exact model of the way the noise was actually generated, but its fairly heavy tails are enough to prevent the model from paying undue attention to the outliers. The resulting model can be tested on data from the same source using net-pred: > net-pred na olog.net 31: Number of networks used: 70 Number of test cases: 100 Average squared error guessing mean: 0.01943+-0.01129 One can also see how well the model does on the uncorrupted data that was used originally: > net-pred na olog.net 31: / rdata@101:200 . Number of networks used: 70 Number of test cases: 100 Average squared error guessing mean: 0.00881+-0.00120 This is actually better than the results obtained earlier with the model trained on uncorrupted data, though the observed difference is probably not statistically significant. In constrast, the results are substantially worse when the data with outliers is used to train a standard model where the noise is Gaussian, with the same variance for each case. A Gaussian process model for regression with outliers. Gaussian process regression can also use a t-distribution for the noise, specified using 'model-spec', as above. Implementation of this model requires sampling for function values in training cases, so a small amount of "jitter" will almost always have to be included in the covariance function. A "sample-variances" operation must also be specified in 'mc-spec', to allow the case-by-case noise variances to be sampled. The following commands illustrate how this is done: > gp-spec olog.gpt 1 1 1 - 0.001 / 0.05:0.5 0.05:0.5 > model-spec olog.gpt real 0.05:0.5::4 > data-spec olog.gpt 1 1 / odata@1:100 . odata@101:200 . > gp-gen olog.gpt fix 0.5 0.1 > mc-spec olog.gpt sample-variances heatbath hybrid 20:4 0.5 > gp-mc olog.gpt 200 This takes about six minutes on our SGI machine. The progress of the run can be monitored by examining the case-by-case noise standard deviations in (say) the first 8 training cases, as follows: > gp-plt t v@0:7 olog.gpt | plot Once the run has converged, a few of these standard deviations (for cases that are outliers) should be much bigger than the others. The noise standard deviations can also be examined using the "-n" option of 'gp-display'. Predictions can be made using 'gp-pred': > gp-pred na olog.gpt 101:%5 Number of iterations used: 20 Number of test cases: 100 Average squared error guessing mean: 0.01995+-0.01186 This performance is very similar to that of the network model.