A LINEAR REGRESSION MODEL

I generated 100 cases of synthetic data for this example, each case
consisting of two input (predictor) values and one target (response)
value.  The first input (i0) was randomly generated from the normal
distribution with mean zero and standard deviation one.  The second
input (i1) was set to the first input plus an offset that was normally
distributed with mean zero and standard deviation 0.1.  The two inputs
were therefore highly correlated.  The target value (t) was set to 
0.5 + 2.5*i0 - 0.5*i1 + e, where e was normally distributed noise with
mean zero and standard deviation 0.1.  This data was stored in the
file "rdata".


Specifying the model.

We can specify a Bayesian linear model for this data with a command
such as the following:

    > dist-spec rlog.met \
    >   "u ~ Normal(0,10^2) + w0 ~ Normal(0,10^2) + w1 ~ Normal(0,10^2) \
    >      + v ~ ExpGamma(1,0.2)" \
    >   "t ~ Normal (u + w0i0 + w1i1, Exp(-v))"

Here, the "\" characters at the ends of the lines tell the Unix shell
that the command is continued on the next line.  The "rlog.met"
argument of 'dist-spec' is the name of the log file to use.  The next
argument (which takes up two lines) is the prior specification.  It
must be put in quotes to keep the shell from interpreting the special
characters used.  The last argument is the likelihood specification,
which will also usually have to be quoted.

Let's first examine the likelihood above:

    t ~ Normal (u + w0i0 + w1i1, Exp(-v))

This says that the target value for a case, t, is modeled by a normal
distribution with mean given by u + w0i0 + w1i1 and variance given by
Exp(-v).  Here, u, w0, w1, and v are parameters of the model.  Model
parameters must have names starting with "u", "v", "w", "x", "y", or
"z", possibly followed by a single digit.  The two inputs for the case
are called i0 and i1.  If there were several targets, they would be
called t0, t1, etc., but here the name "t" is used, which is a synonym
for "t0" (similarly, "i" is a synonym for "i0").

The notation in the likelihood specification using "~" is actually an
abbreviation for the following:

    Log(2*Pi Exp(-v))/2 + [t - (u+w0i0+w1i1)]^2 / [2Exp(-v)]

which is minus the log of the normal probability density for t with
the given mean and variance.  In general, the likelihood is specified
by a formula giving minus the log of the probability or probability
density for the target or targets in one case, given the values of the
inputs.  Since the cases are assumed to be independent, minus the
total log likelihood is found by just adding these terms for all
cases.
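
To make this correspondence concrete, the following Python sketch
(illustrative only, not how the software evaluates formulas; it
assumes numpy and scipy are available) checks that the expression
above equals minus the log of the normal density for t with mean
u + w0*i0 + w1*i1 and variance Exp(-v):

    import numpy as np
    from scipy.stats import norm

    def minus_log_lik(t, i0, i1, u, w0, w1, v):
        mean = u + w0*i0 + w1*i1
        var  = np.exp(-v)
        return np.log(2*np.pi*var)/2 + (t - mean)**2 / (2*var)

    # Compare with scipy's normal log density at arbitrary values.
    t, i0, i1, u, w0, w1, v = 1.3, 0.4, 0.5, 0.5, 2.5, -0.5, 4.6
    print(minus_log_lik(t, i0, i1, u, w0, w1, v))
    print(-norm.logpdf(t, loc=u + w0*i0 + w1*i1, scale=np.exp(-v/2)))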

The prior specification above is constructed with similar notation:

    u ~ Normal(0,10^2) + w0 ~ Normal(0,10^2) + w1 ~ Normal(0,10^2) 
      + v ~ ExpGamma(1,0.2)

This gives each of u, w0, and w1 an independent normal prior with mean
of 0 and standard deviation of 10 (ie, variance of 10^2).  The prior
specification is a formula for minus the log of the prior density, so
terms pertaining to different parameters are simply added together.
The terms involving "~" are abbreviations for minus the log prior
density for a known distribution.  Here, all the parameters are
independent under the prior, but it would be possible to make one
parameter depend on another, for instance by changing the term above
involving w1 to w1 ~ Normal(w0,10^2).
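
Since the prior specification is also just a formula for minus a log
density, it can be written as a sum of one term per parameter.  The
following Python sketch is illustrative only; in particular, the
(shape, rate) parameterization of the Gamma distribution is my
assumption, chosen to match the prior mean of 5 for the precision
mentioned below:

    import numpy as np
    from math import lgamma

    def minus_log_normal(x, mean, var):
        return np.log(2*np.pi*var)/2 + (x - mean)**2 / (2*var)

    def minus_log_expgamma(v, shape, rate):
        # Minus log density of v when Exp(v) ~ Gamma(shape, rate); assumed form.
        return lgamma(shape) - shape*np.log(rate) - shape*v + rate*np.exp(v)

    def minus_log_prior(u, w0, w1, v):
        return (minus_log_normal(u,  0, 10**2)
              + minus_log_normal(w0, 0, 10**2)
              + minus_log_normal(w1, 0, 10**2)
              + minus_log_expgamma(v, 1, 0.2))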

The "v" parameter is the log of the "precision" (inverse variance) for
the error term in the model.  The Markov chain sampling procedures
assume that the range of all variables being sampled is the entire
real line.  Since the precision is constrained to be positive, we must
represent it with some unconstrained variable that is transformed to
produce the actual precision.  An exponential transformation is often
convenient, and here it probably also makes the Markov chain sampling
more efficient.

Here, the prior for the precision is specified to be Gamma with shape
parameter of 1 and scale of 0.2 (which gives a mean value of 5).
Since we will generally want to represent Gamma variables by their
logs, the built-in ExpGamma distribution used here is for a variable
whose exponential has a Gamma distribution with the indicated shape
and scale parameters.


Specifying the data source.

After specifying the model with 'dist-spec', we need to say where the
data is found using 'data-spec', as follows:

    > data-spec rlog.met 2 1 / rdata .

The arguments after the name of the log file say that there are two
input values and one target value.  After the slash, the files where
the inputs and targets are found are given.  The "." says that the
targets are found in the same file as the inputs, following them on
the same line.  See data-spec.doc for more details and other options.


Sampling with Metropolis updates.

We can now sample from the posterior distribution of the model
parameters, after specifying how we want this to be done.  Here we'll
try sampling by Metropolis updates to all variables simultaneously:

    > mc-spec rlog.met repeat 45 metropolis 0.02 
    > dist-mc rlog.met 500

Each iteration will consist of 45 Metropolis updates, done using a
proposal distribution that adds a normally-distributed offset to each
variable with standard deviation 0.02.  The 'dist-mc' command does 500
such iterations (for a total of 22500 Metropolis updates).  This takes
0.32 seconds on the system used (see Ex-test-system.doc).
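
The kind of update being done here can be sketched in a few lines of
Python (a simplified illustration, not the software's implementation):
a proposal adds independent normal offsets with standard deviation
0.02 to all four parameters at once, and is accepted or rejected based
on the change in minus the log posterior (the "energy"):

    import numpy as np

    def metropolis_step(x, energy, stepsize, rng):
        # Propose a normally distributed offset to every variable, then
        # accept or reject based on the change in minus the log posterior.
        proposal = x + rng.normal(0, stepsize, size=x.shape)
        if rng.uniform() < np.exp(energy(x) - energy(proposal)):
            return proposal, True
        return x, False

    # Toy usage with a standard-normal energy; for the regression model the
    # energy would be minus the log prior plus minus the log likelihood
    # summed over all 100 cases.
    rng = np.random.default_rng(1)
    x = np.zeros(4)
    for _ in range(45):
        x, accepted = metropolis_step(x, lambda z: 0.5*np.sum(z**2), 0.02, rng)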

We can see what the average rejection rate was for these updates as
follows:

    > dist-tbl r rlog.met | series m

    Number of realizations: 1  Total points: 500

    Mean: 0.708044

A rejection rate of 0.7 is about right, so the stepsize of 0.02 looks
to be about optimal.  If you had started by trying a much larger or
smaller stepsize, you would need to adjust it by trial and error to
get a rejection rate in the range of about 0.3 to 0.8.

We can look at how well the chain is sampling by plotting the four
state variables over time:

    > dist-plt t uw0w1v rlog.met | plot

From this plot, it appears that the equilibrium distribution is
reached by about iteration 100, and that the chain is sampling
somewhat adequately from then on, though with a substantial degree of
autocorrelation.  We can quantify the inefficiency in estimation of
the log precision, v, that results from these autocorrelations as
follows:

    > dist-tbl v rlog.met 100: | series mac 15

    Number of realizations: 1  Total points: 401

    Sample mean: 4.456369  S.E. from correlations: 0.025878
    
      Lag  Autocorr.  Cum. Corr.
    
        1   0.889664    2.779328
        2   0.780966    4.341261
        3   0.699370    5.740000
        4   0.625771    6.991542
        5   0.563729    8.119000
        6   0.528770    9.176540
        7   0.464563   10.105667
        8   0.395203   10.896072
        9   0.329091   11.554254
       10   0.267998   12.090250
       11   0.208399   12.507047
       12   0.157683   12.822413
       13   0.108200   13.038813
       14   0.037211   13.113235
       15  -0.027520   13.058195

The "Cum. Corr." column is one plus twice the sum of the
autocorrelations up to the given lag, which estimates the factor by
which the variance of a sample mean is inflated relative to using
independent points.  The cumulative correlation up to lag 15, where
the autocorrelation is approximately zero, indicates that about 13
times as many points will be needed to get an estimate of a given
accuracy as would be needed if the points were independent.  This
inefficiency results from the high dependency between w0 and w1, which
can be seen in the following plot:

    > dist-plt w0 w1 rlog.met 100: | plot-points

This dependency is why the stepsize must be so small - with larger
proposed changes, the new point would usually be incompatible with the
strong dependency between w0 and w1, and hence would usually be
rejected.
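
Returning to the autocorrelations: the inefficiency factor reported by
'series' above can be estimated from any scalar series of sampled
values.  The following Python sketch (illustrative only, not the
'series' program's exact method) computes one plus twice the sum of
the estimated autocorrelations up to a given lag:

    import numpy as np

    def autocorrelation_time(x, max_lag):
        # One plus twice the sum of autocorrelations up to max_lag; roughly
        # how many dependent points are needed per effectively independent one.
        x = np.asarray(x) - np.mean(x)
        var = np.mean(x**2)
        rho = [np.mean(x[:len(x)-k] * x[k:]) / var for k in range(1, max_lag + 1)]
        return 1 + 2*np.sum(rho)

    # Toy usage with an AR(1) series of 401 points (autocorrelation 0.9):
    rng = np.random.default_rng(1)
    x = np.zeros(401)
    for i in range(1, 401):
        x[i] = 0.9*x[i-1] + rng.normal()
    print(autocorrelation_time(x, 15))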

Estimates for various functions of state can be found using
'dist-est'.  For example, we can predict the target value for input
values of i0=1.9 and i1=2.1 as follows:

    > dist-est "u + w0*1.9 + w1*2.1" rlog.met 100:

    Number of sample points: 401

    Estimates for u + w0*1.9 + w1*2.1:

      Mean:    4.17296  (standard error 0.00179421)
      Std.dev: 0.0359291

    NOTE: The standard errors assume points are independent

Note that since the points aren't close to being independent, the
standard error shown is not correct.  Note also that the standard
deviation shown is the posterior standard deviation of the linear fit
at this point.  The prediction for a target value in a case with these
inputs would have additional uncertainty due to the noise.
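
Since the quantity being estimated is just a function of the sampled
parameter values, the same estimate could also be formed directly from
the stored values of u, w0, and w1 (eg, as extracted with 'dist-tbl').
A Python sketch, using hypothetical sample arrays:

    import numpy as np

    def predict_fit(u, w0, w1, i0, i1):
        # One value of the linear fit per sample point.
        f = u + w0*i0 + w1*i1
        return f.mean(), f.std()     # posterior mean and std. dev. of the fit

    # Hypothetical arrays of 401 sampled values (they would come from the run above).
    u, w0, w1 = (np.zeros(401) for _ in range(3))
    print(predict_fit(u, w0, w1, 1.9, 2.1))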


Sampling with hybrid Monte Carlo.

The posterior dependency between w0 and w1 makes Markov chain sampling
somewhat difficult for this problem.  One could try to eliminate this
dependency by a reparameterization, which would certainly be possible
for this fairly simple problem, but in general this may be tedious or
infeasible.  However, the inefficiencies caused by dependencies may be
reduced by suppressing the random walk aspect of the sampling, which
can be done by using hybrid (Hamiltonian) Monte Carlo rather than the
simple Metropolis updates used above.

After 'dist-spec' and 'data-spec' commands that are the same as those
described above (apart from the name of the log file), we can sample
using hybrid Monte Carlo with commands such as the following:

    > mc-spec rlog.hmc heatbath hybrid 25 0.01
    > dist-mc rlog.hmc 500

One hybrid Monte Carlo update is done each iteration, consisting of a
"heatbath" operation that picks new values for the momentum variables,
followed by a dynamical trajectory of 25 leapfrog steps, with a
stepsize of 0.01.  This stepsize was picked by trial and error, and
results in a rejection rate of about 0.10.  In general, the stepsize
for this standard form of hybrid Monte Carlo should be chosen so that the
rejection rate is somewhere between about 0.05 and 0.25.  The 500
iterations done take 0.37 seconds on the system used (see
Ex-test-system.doc), about the same as the Metropolis run above.
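
The structure of one such update can be sketched in Python as follows
(a simplified illustration, not the software's implementation): fresh
Gaussian momenta are drawn, a trajectory of leapfrog steps is
simulated, and its end point is accepted or rejected based on the
change in the total energy:

    import numpy as np

    def hmc_update(x, energy, grad_energy, n_steps, stepsize, rng):
        p = rng.normal(size=x.shape)                  # "heatbath": new momenta
        x_new, p_new = x.copy(), p.copy()
        p_new -= 0.5*stepsize*grad_energy(x_new)      # initial half step for momenta
        for i in range(n_steps):
            x_new += stepsize*p_new                   # full step for positions
            g = grad_energy(x_new)
            # Full momentum step, except a half step at the end of the trajectory.
            p_new -= stepsize*g if i < n_steps - 1 else 0.5*stepsize*g
        old_H = energy(x) + 0.5*np.sum(p**2)
        new_H = energy(x_new) + 0.5*np.sum(p_new**2)
        return x_new if rng.uniform() < np.exp(old_H - new_H) else x

    # Toy usage mirroring "hybrid 25 0.01", with a standard-normal energy:
    rng = np.random.default_rng(1)
    x = np.zeros(4)
    x = hmc_update(x, lambda z: 0.5*np.sum(z**2), lambda z: z, 25, 0.01, rng)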

From examining plots of the state variables over time, it seems that
the hybrid Monte Carlo chain reaches equilibrium much more quickly
than the Metropolis chain, and samples more efficiently once
equilibrium is reached.  For example, here are the autocorrelations
for the log precision parameter, v, which we also looked at for the
Metropolis chain:

    > dist-tbl v rlog.hmc 20: | series mac 5

    Number of realizations: 1  Total points: 481

    Sample mean: 4.440102  S.E. from correlations: 0.007704

      Lag  Autocorr.  Cum. Corr.

        1   0.079840    1.159681
        2   0.038386    1.236452
        3   0.051053    1.338557
        4   0.010604    1.359765
        5   0.099925    1.559616

The estimated autocorrelations above do not differ significantly from
zero, and the standard error for the mean of v is about 0.0077, versus
about 0.026 for the Metropolis chain, with similar run times.  It
therefore appears that hybrid Monte Carlo is about ten times more
efficient than the simple Metropolis algorithm for this problem.


Hybrid Monte Carlo with windows.

Many other Markov chain methods could be applied to this problem, but
I will mention just one more - the variation of hybrid Monte Carlo in
which acceptance is based on "windows" of states at the start and at
the end of a trajectory.  This variation is described in my paper on
"An improved acceptance procedure for the hybrid Monte Carlo
algorithm" (Journal of Computational Physics, vol. 111, pp. 194-203,
1994), and more briefly in my book on Bayesian Learning for Neural
Networks.

This method can be applied to this problem as follows (after suitable
'dist-spec' and 'data-spec' commands):

    > mc-spec rlog.whmc heatbath hybrid 25:2 0.01
    > dist-mc rlog.whmc 500

The only difference compared to the specification for standard hybrid
Monte Carlo is the ":2" after the number of leapfrog steps (25).  This
specifies that windows of two states at the start and end of the
trajectory should be used.  Acceptance is based on the difference in
the total probability of all states in the window at the start and in
the window at the end.  The next state is picked from the end window,
if the trajectory was accepted, or from the start window, if the
trajectory was rejected, with this state selected from the appropriate
window according to the relative state probabilities.

Note that when windows are used the state may change even when there
was a "rejection", as long as one of the other states in the start
window is of comparable probability to the starting state.  This
provides some insurance against getting stuck at a point where the
rejection rate is very high.  The use of windows also tends to reduce
the rejection rate.  In this problem, the rejection rate with windows
is only 0.04, less than the rate of 0.10 seen above for the standard
hybrid Monte Carlo algorithm, while the time is only slightly greater
(due to the small overhead of using windows).  The benefit is small
here, since the standard hybrid Monte Carlo method was doing very well
in any case, but use of windows can make a bigger difference for some
other problems, and is almost cost-free if the window size is a small
fraction of the trajectory length.