PRIOR:  Form of hierarchical prior specifications.

Priors for parameters such as network weights, biases, and offsets, or
for the noise levels in a regression model are specified using a
common syntax, described here.  These priors are also used for
Gaussian process models, in which the parameters are implicit; the
priors above the parameter level are still analogous.

The prior for a group of parameters is hierarchical.  At the lowest
level, each parameter is picked either from a Gaussian distribution
with mean zero and with some precision (inverse of the variance), or
from a two point distribution concentrated at the corresponding plus
or minus standard deviation points (the latter is meant primarily for
debugging).  At the next level, the precision is picked from a Gamma
distribution with a specified shape parameter and with a mean given by
a hyperparameter common to all parameters of the same sub-group (eg,
weights out of the same unit).  This common value for the precision of
parameters in a sub-group is in turn picked from a Gamma distribution
with another specified shape parameter and with a mean given by a
hyperparameter common to all parameters in the group.  Finally, this
top-level hyperparameter is picked from a Gamma distribution with a
specified mean, and with yet another specified shape parameter.

Priors for groups without sub-groups (eg, network biases and offsets)
are similar, but go down only two levels.
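
As an illustration only, the hierarchy described above can be sketched
in Python as follows.  The function names and arguments are
hypothetical and are not part of the software; the sketch assumes that
an infinite shape parameter means "use the mean itself", and that a
group without sub-groups can be imitated by a single sub-group whose
shape parameter is infinite.

```python
import math
import random

def draw_precision(mean, alpha, rng):
    """Draw a precision from a Gamma prior with the given mean and shape
    parameter Alpha.  An infinite Alpha returns the mean itself."""
    if math.isinf(alpha):
        return mean
    # Density proportional to p^(Alpha/2-1) * exp(-p*Alpha/(2*mean)) is a
    # Gamma distribution with shape Alpha/2 and scale 2*mean/Alpha.
    return rng.gammavariate(alpha / 2.0, 2.0 * mean / alpha)

def sample_group(width, alpha_group, alpha_sub, alpha_param,
                 n_sub, n_per_sub, rng, two_point=False):
    """Sample all parameters in one group (hypothetical helper).

    Returns a list of n_sub lists, each holding n_per_sub parameters."""
    top_mean = 1.0 / width ** 2            # no "x" scaling in this sketch
    group_prec = draw_precision(top_mean, alpha_group, rng)
    params = []
    for _ in range(n_sub):
        sub_prec = draw_precision(group_prec, alpha_sub, rng)
        row = []
        for _ in range(n_per_sub):
            prec = draw_precision(sub_prec, alpha_param, rng)
            sd = 1.0 / math.sqrt(prec)
            if two_point:
                # Debugging form: plus or minus one standard deviation.
                row.append(sd if rng.random() < 0.5 else -sd)
            else:
                row.append(rng.gauss(0.0, sd))
        params.append(row)
    return params
```

With all three shape parameters infinite and two_point set, every
sampled value has magnitude equal to Width, which is a convenient
check that the levels are wired together as intended.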

These priors are specified using the following syntax:
 
     [x]Width[:[Alpha-group][:[Alpha-sub-group][:[Alpha-parameter]]]][!]

The Width part of the specification is used to specify the mean of the
precision at the top level, as described below.  The Alpha parts give
the shape parameters of the Gamma distributions.  Alpha-group is used
when picking the common precision for all parameters in the group,
Alpha-sub-group is used when picking the precision common to all
parameters in one of the sub-groups, and Alpha-parameter is used when
picking the precision for a single parameter.  If an Alpha is omitted,
it is taken to be infinite, giving a distribution for the precision
concentrated at the mean.
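
A minimal Python sketch of how a specification of this form might be
split into its parts is shown below.  It is not the parser used by the
software; the function name and the keys of the returned dictionary
are assumptions made for illustration.  Omitted Alphas are represented
as infinity, and a Width of "+" is mapped to 1e10, as this document
describes.

```python
import math

def parse_prior(spec):
    """Split a prior specification into its parts (hypothetical helper)."""
    two_point = spec.endswith("!")
    if two_point:
        spec = spec[:-1]
    scaled = spec.startswith("x")
    if scaled:
        spec = spec[1:]
    fields = spec.split(":")
    if len(fields) > 4 or fields[0] == "":
        raise ValueError("bad prior specification")
    # A Width of "+" stands for 1e10 (infinity for most purposes).
    width = 1e10 if fields[0] == "+" else float(fields[0])
    # Order is Alpha-group, Alpha-sub-group, Alpha-parameter.
    alphas = [math.inf, math.inf, math.inf]
    for i, field in enumerate(fields[1:]):
        if field != "":
            alphas[i] = float(field)
    return {
        "scaled": scaled,
        "width": width,
        "alpha_group": alphas[0],
        "alpha_sub_group": alphas[1],
        "alpha_parameter": alphas[2],
        "two_point": two_point,
    }
```

For example, under these assumptions "x0.5:2::1.5!" would be read as a
scaled Width of 0.5, an Alpha-group of 2, an omitted (infinite)
Alpha-sub-group, an Alpha-parameter of 1.5, and the two point form at
the lowest level.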

A Gamma distribution with given mean and Alpha has a density
proportional to p^{Alpha/2-1} * exp(-p*Alpha/(2*mean)), where p is the
precision (always positive).
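
In the more common shape and scale notation, this density corresponds
to a Gamma distribution with shape Alpha/2 and scale 2*mean/Alpha, so
that the mean is as stated and the variance is 2*mean^2/Alpha.  The
short Python sketch below (with hypothetical function names) records
this correspondence.

```python
import math

def gamma_shape_scale(mean, alpha):
    """Return (shape, scale) of the Gamma distribution whose density is
    proportional to p^(alpha/2-1) * exp(-p*alpha/(2*mean))."""
    return alpha / 2.0, 2.0 * mean / alpha

def log_density_unnormalized(p, mean, alpha):
    """Log of the unnormalized density given in the text (p must be > 0)."""
    return (alpha / 2.0 - 1.0) * math.log(p) - p * alpha / (2.0 * mean)
```

A smaller Alpha therefore gives a more diffuse distribution for the
precision, while a large Alpha concentrates it near the mean.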

Width specifies the top-level mean for the precision as follows.  If
an "x" is not present, the mean precision is 1/Width^2.  A prior with
"x" is meaningful only when the prior is used in a context where there
are some number of "inputs", such as source units.  When an "x" is 
present in such a situation, the mean precision is determined as
follows:

   Alpha infinite:   N / Width^2
   Alpha > 2:        N * (Alpha/(Alpha-2)) / Width^2
   Alpha = 2:        N * log(N) / Width^2
   Alpha < 2:        N^(2/Alpha) / Width^2

Here N is the number of "inputs", such as the number of units in a
source layer for a particular type of weight in a network.  Alpha is
Alpha-parameter, if that is finite, and is otherwise Alpha-sub-group.
This scheme is intended to give proper scaling behaviour as N goes to
infinity (but it may not work when both Alpha-sub-group and
Alpha-parameter are finite).  For Alpha >= 2, convergence is to a
Gaussian process; for Alpha < 2, it is to a stable process of index Alpha.
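
The rules above for the top-level mean precision can be written out as
the following Python sketch.  The function and argument names are
hypothetical; the table is applied literally, so the case N = 1 with
Alpha = 2 (where log(N) is zero) is not treated specially here.

```python
import math

def effective_alpha(alpha_sub_group, alpha_parameter):
    """Alpha used for scaling: Alpha-parameter if finite, else Alpha-sub-group."""
    if math.isfinite(alpha_parameter):
        return alpha_parameter
    return alpha_sub_group

def mean_precision(width, n_inputs=None, alpha=math.inf, scaled=False):
    """Top-level mean precision implied by Width (hypothetical helper).

    scaled corresponds to the presence of "x" in the specification, and
    n_inputs to N, the number of "inputs"."""
    if not scaled:
        return 1.0 / width ** 2
    n = float(n_inputs)
    if math.isinf(alpha):
        factor = n
    elif alpha > 2:
        factor = n * (alpha / (alpha - 2.0))
    elif alpha == 2:
        factor = n * math.log(n)
    else:
        factor = n ** (2.0 / alpha)
    return factor / width ** 2
```

For instance, with Width 1 and 16 inputs, an infinite Alpha gives a
mean precision of 16, while an Alpha of 4 gives 32.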

If the prior specification ends with "!", the lowest-level
distribution is concentrated at the two standard deviation points,
rather than being Gaussian.  This is primarily a debugging tool - it
makes it easy to see whether each parameter was derived from the
correct hyperparameter.

The width part can be just a plus sign ("+"), rather than a number, in
which case the value used is 1e10 (infinity for most purposes).  This
may be useful when, for instance, a network is to be trained by
traditional minimization of error, with no weight penalty, but "prior"
specifications are still needed to say which groups of weights are
present.

A prior specification for the level of Gaussian noise in a regression
model has the same syntax as that used to specify priors for model
parameters.  The Width part determines the mean precision (inverse of the
noise variance) at the top of the hierarchical prior.  Alpha-group
gives the shape parameter for picking a precision with this mean that
is common to all outputs.  Alpha-sub-group gives the shape parameter
for picking precisions for each target with the common precision as
mean.  Alpha-parameter gives the shape parameter for picking a
precision for a particular output in a particular case, using that
target's precision as the mean.  Specifying a finite value for
Alpha-parameter has the effect of changing the distribution of the
noise for a particular output in a particular case from a Gaussian to
a t distribution.

In network models, the distributions for weights may be modified by an
"adjustment" sigma value associated with the destination unit for the
weight.  This adjustment multiplies the sigma that would otherwise
apply to that weight.  The precisions for the adjustments themselves
are drawn from a Gamma distribution with mean one and shape parameter
given by an alpha associated with the unit's layer.
  
Note that hyperparameters are generally input and displayed in terms
of the square root of the inverse of the precision (a 'sigma' value),
not in terms of the precisions themselves, even though those are the
values in terms of which the priors are mathematically expressed.
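
The conversion between the two forms is simple; a Python sketch with
hypothetical function names is:

```python
import math

def sigma_from_precision(precision):
    """Sigma value corresponding to a precision: 1/sqrt(precision)."""
    return 1.0 / math.sqrt(precision)

def precision_from_sigma(sigma):
    """Precision corresponding to a sigma value: 1/sigma^2."""
    return 1.0 / sigma ** 2
```

A displayed sigma of 0.5, for example, corresponds to a precision of 4.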

            Copyright (c) 1995-2004 by Radford M. Neal