A REGRESSION PROBLEM WITH OUTLIERS

Finally, we will go back to the simple regression problem we started
with, but now some of the cases will be "outliers", for which the
noise is much greater than for normal cases.

In this synthetic data, the input variable, x, again had a standard
Gaussian distribution and the corresponding target value came from a
distribution with mean given by

         0.3 + 0.4*x + 0.5*sin(2.7*x) + 1.1/(1+x^2)

For most cases, the distribution about this mean was Gaussian with
standard deviation 0.1.  However, with probability 0.05, a case is an
"outlier", for which the standard deviation was 1.0 instead.

I generated 200 cases in total, stored in the file 'odata'.  The first
100 of these cases are meant for training, the second 100 for testing.
It is also possible to test on 'rdata', to see how well the function
learned predicts data that is never corrupted by high noise.


A neural network model for regression with outliers.

One way to model data with "outliers" is to let the noise level vary
from one case to another.  If the noise for the outlier cases is set
to be higher, they will end up having less influence on the function
learned, as is desirable.  The software allows the noise variance for
a case to vary according to an inverse gamma distribution.  This is
effectively the same as letting the noise have a t-distribution rather
than a Gaussian distribution.

The commands used to do this are as follows:

    > net-spec olog.net 1 8 1 / - 0.05:0.5 0.05:0.5 - x0.05:0.5 - 100 
    > model-spec olog.net real 0.05:0.5::4
    > data-spec olog.net 1 1 / odata@1:100 . odata@101:200 .

    > net-gen olog.net fix 0.5
    > mc-spec olog.net repeat 10 sample-noise heatbath hybrid 100:10 0.2
    > net-mc olog.net 1

    > mc-spec olog.net sample-sigmas heatbath hybrid 1000:10 0.4
    > net-mc olog.net 100

The crucial difference is in the 'model-spec' command, where the noise
prior of 0.05:0.5::4 specifies that the per-case noise precision
(inverse variance) follows a gamma distribution with shape parameter
of 4.  When this is integrated over, a t-distribution with 4 degrees
of freedom results.  This t-distribution is by no means an exact model
of the way the noise was actually generated, but its fairly heavy
tails are enough to prevent the model from paying undue attention to
the outliers.

The resulting model can be tested on data from the same source using
net-pred:

    > net-pred na olog.net 31:

    Number of networks used: 70

    Number of test cases: 100

    Average squared error guessing mean:   0.01943+-0.01129

One can also see how well the model does on the uncorrupted data that
was used originally:

    > net-pred na olog.net 31: / rdata@101:200 .

    Number of networks used: 70

    Number of test cases: 100

    Average squared error guessing mean:   0.00881+-0.00120

This is actually better than the results obtained earlier with the
model trained on uncorrupted data, though the observed difference is
probably not statistically significant.

In constrast, the results are substantially worse when the data with
outliers is used to train a standard model where the noise is Gaussian, 
with the same variance for each case.


A Gaussian process model for regression with outliers.

Gaussian process regression can also use a t-distribution for the
noise, specified using 'model-spec', as above.  Implementation of this
model requires sampling for function values in training cases, so a
small amount of "jitter" will almost always have to be included in the
covariance function.  A "sample-variances" operation must also be
specified in 'mc-spec', to allow the case-by-case noise variances to
be sampled.  The following commands illustrate how this is done:

    > gp-spec olog.gpt 1 1 1 - 0.001 / 0.05:0.5 0.05:0.5
    > model-spec olog.gpt real 0.05:0.5::4
    > data-spec olog.gpt 1 1 / odata@1:100 . odata@101:200 .

    > gp-gen olog.gpt fix 0.5 0.1
    > mc-spec olog.gpt sample-variances heatbath hybrid 20:4 0.5
    > gp-mc olog.gpt 200

This takes about six minutes on our SGI machine.  The progress of the
run can be monitored by examining the case-by-case noise standard
deviations in (say) the first 8 training cases, as follows:

    > gp-plt t v@0:7 olog.gpt | plot

Once the run has converged, a few of these standard deviations (for
cases that are outliers) should be much bigger than the others.  The
noise standard deviations can also be examined using the "-n" option
of 'gp-display'.

Predictions can be made using 'gp-pred':

    > gp-pred na olog.gpt 101:%5

    Number of iterations used: 20

    Number of test cases: 100

    Average squared error guessing mean:   0.01995+-0.01186

This performance is very similar to that of the network model.