A BIVARIATE DENSITY ESTIMATION PROBLEM

As a second illustration of the mixture model and Dirichlet diffusion
tree software, I generated bivariate real data from a mixture of two
component distributions, with probabilities 0.3 and 0.7.  These two
distributions were not exactly Gaussian, and the two real variables
were not exactly independent within one of these components.
Accordingly, modeling this data well with a mixture of Gaussian
distributions will require more than two components in the mixture.

For the exact distribution used, see the source of the generation
program, in ex-mixdft/rgen.c.  I generated 1000 cases with this
program, stored in 'rdata', of which the first 500 are used for
training, and the rest for testing.  A plot of the training cases can
be seen in rdata-mix-train.png.

The data files also contain a third number on each line, which
indicates which underlying component was used.  These indicators are
ignored in the examples below of bivariate density estimation.


A two-component mixture model for the density estimation problem.

We can first see what happens when we model this data with a mixture
of two Gaussians - even though we know the data cannot be perfectly
modeled in this way.  We specify this two-component model using the
'mix-spec' and 'model-spec' commands, as follows:

    > mix-spec rlog.2 0 2 2 / 1 0.05:0.5:0.2 10
    > model-spec rlog.2 real 0.05:0.5:0.5:1

The 'mix-spec' command creates the log file "rlog.2".  The arguments
following the log file name are the number of input attributes in a
case (always 0 at present), the number of target attributes (2 for
this bivariate problem), and the number of mixture components to use
(2 for this model).  

The Dirichlet concentration parameter follows the "/".  In this model,
its value is 1 (unscaled, since there's no 'x'), which produces a
uniform prior for the single number determining the probabilities of
the two components.

The "offset" parameters of the two components represent the Gaussian
means when modeling real data.  Hyperparameters determine the prior
means and standard deviations of these offsets (separately for the two
target attributes); priors for these hyperparameters are specified in
'mix-spec'.  In the above command, the prior for the mean of an offset
is Gaussian with standard deviation 10 (the last argument).  The
standard deviations for the offsets are given a hierarchical prior,
with a higher-level hyperparameter common to both the lower-level
standard deviations.  The top-level precision (standard deviation to
the power -2) is given a Gamma prior with mean 0.05 and shape
parameter 0.5; the precisions for the lower-level hyperparameters have
Gamma priors with mean given by the higher-level precision, and shape
parameter 0.2.  This is all specified by the second-to-last argument
of 'mix-spec'.

A similar hierarchical scheme is used for the "noise" standard
deviations (the standard deviations of the Gaussian distributions in
the mixture), except that this scheme has three levels - a top-level
hyperparameter, a hyperparameter for each target attribute, and
hyperparameters for each target for each component.  The 'model-spec'
command gives the top-level mean, and the shape parameters for the
Gamma priors going down the hierarchy.

We next specify where the data comes from, with 'data-spec':

    > data-spec rlog.2 0 2 / rdata@1:500 . rdata@501:1000 .

This says that there are 0 input attributes and 2 target attributes.

For this finite model, we can specify that all the Markov chain
updates should be done with Gibbs sampling, as follows:

    > mc-spec rlog.2 repeat 20 gibbs-indicators gibbs-params gibbs-hypers

The "repeat 20" just repeats these operations in a single iteration,
to reduce the volume of data stored in the log file.

Finally, we run the Markov chain simulation for 100 iterations:

    > mix-mc rlog.2 100

This takes only 0.1 seconds on the system used (Ex-test-system.doc).
We can then look at the hyperparameters and component parameters at
various iterations.  The last iteration should look something like the
following:

    > mix-display rlog.2

    MIXTURE MODEL IN FILE "rlog.2" WITH INDEX 100
    
    HYPERPARAMETERS

    Standard deviations for component offsets:
    
        0.585:     4.108   40.500
    
    Means for component offsets:
    
                  -0.549   -2.258
    
    Standard deviations for Gaussian target distributions:
    
        1.464:     0.959    3.693
    
    
    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE
    
       1: 0.712   -2.046  +11.371
    
                   1.068    5.440
    
       2: 0.288   +1.757  +20.956
    
                   0.949    7.198

We see above that the two components are associated with fractions of
approximately 0.7 and 0.3 of the training cases, as expected from the
way the data was generated.  For each component, the two offset
parameters, giving the Gaussian means, are shown on the first line,
and the standard deviation parameters on the following line.  These
component parameters are approximately what we would expect from
looking at a plot of the data, but of course the two Gaussian
components cannot perfectly model the actual distribution.

A quantitative measure of how well the model fits the data can be
found by looking at the average log probability density of the test
cases:

    > mix-pred pa rlog.2 21:

    Number of iterations used: 80

    Number of test cases: 500

    Average log probability of targets:    -5.172+-0.048


An infinite mixture model for the density estimation problem.

To more closely approximate the true distribution, we can use a
mixture model with a countably infinite number of Gaussian components.
An infinite mixture is used if we simply omit the argument giving the
number of components in the 'mix-spec' command.  We must also change
the specification for the Dirichlet concentration parameter, preceding
it with an 'x' to indicate that it should be scaled so as to produce a
sensible infinite limit.  In the specification below, the moderate
value of 5 is chosen for this specification in order to indicate that
we believe that a fairly large number of components will have
substantial probability (the other prior specifications are the same
as before):

    > mix-spec rlog.inf 0 2 / x5 0.05:0.5:0.2 10

The 'model-spec' and 'data-spec' commands are the same as before:

    > model-spec rlog.inf real 0.05:0.5:0.5:1
    > data-spec rlog.inf 0 2 / rdata@1:500 . rdata@501:1000 .

The 'mc-spec' command must be altered, however, since it is not
possible to do Gibbs sampling for component indicators when there are
an infinite number of components.  The met-indicators operation is
used instead, with 10 changes being proposed to every indicator:

    > mc-spec rlog.inf repeat 20 met-indicators 10 gibbs-params gibbs-hypers

We can now run the simulation for 100 iterations:

    > mix-mc rlog.inf 100

This takes 2.1 seconds on the system used (see Ex-test-system.doc).
If we now examine the state with 'mix-display', we will find that
quite a few (around 20) mixture components are associated with
training cases - though fewer components account for the bulk of the
cases.

This model appears to fit the data slightly better than the
two-component model, as can be seen below:

    > mix-pred pa rlog.inf 21:

    Number of iterations used: 80

    Number of test cases: 500

    Average log probability of targets:    -5.103+-0.048

We can also see how well the model has captured the true distribution
by generating a sample of cases from a distribution drawn from the
posterior, as represented by the state at a particular iteration.  We
do this as follows:

    > mix-cases rlog.inf 100 new-mix 500

This command generates 500 new cases based on iteration 100, and
stores them (one per line) in the file "new-mix".  We can now use a
plot program to view a scatterplot of the data in "new", and compare
it with a scatterplot of data from the actual distribution - as in
new-mix.png and rdata-mix-train.png.  Note that the data in "new-mix"
is taken jointly from one example of a distribution from the posterior
distribution.  If 'mix-cases' is called for another iteration, it will
produce data from a different distribution from the posterior, which
in general could be quite different.  This variation represents the
uncertainly regarding the true distribution that remains when only a
finite amount of training data is available.  A representation of the
predictive distribution for a single new data point, which is the
average of distributions drawn from the posterior, could be obtained
by combining the output of 'mix-cases' for a number of iterations.


Dirichlet diffusion tree models for the density estimation problem.

We can also model this data with a Dirichlet diffusion tree model.
The Dirichlet diffusion tree can either model the data points
directly, or model latent value to which noise is added to produce the
data points.  Different divergence functions can also be used, as well
as various schemes for Markov chain sampling.

We can start with a model in which the data points are directly
produced by the Dirichlet diffusion tree.  We start by specifying
the model and the priors as follows:

    > dft-spec rlog.dft1a 0 2 / 0.5:0.5:0.5 - 0.1:0.5

This command specifies that there are 0 input variables (as required
at the moment) and 2 target variables.  The first argument after the
"/" specifies a hierarchical prior for the diffusion standard
deviations.  The first level of this prior is for a value common to
both target variables, with the base sigma of 0.5 being followed by
the shape parameter of 0.5.  The third 0.5 is the shape parameter for
the next level of the hierarchy, in which a sigma for each variable is
linked to the common sigma.  The next two arguments specify the values
of coefficients in the divergence function, which has the form

   a(t) = c0 + c1/(1-t) + c2/(1-t)^2

The "-" indicates that c0 is fixed at zero, the next argument gives a
prior for c1, and the absense of an argument after tha indicates that
c2 is fixed at zero.

To indicate that the Dirichlet diffusion tree models the data directly,
we simply do not include any 'model-spec' command.  We specify the
source of the data as follows:

    > data-spec rlog.dft1a 0 2 / rdata@1:500 . rdata@501:1000 .

Finally, we need to specify how the Markov chain sampling is to be done:

    > mc-spec rlog.dft1a repeat 25 slice-positions met-terminals \
    >                              gibbs-sigmas slice-div

The sampling will be done without explicit representation of node
locations, since none of the operations in this specification create
node locations.  The 'slice-positions' and 'met-terminals' operations
update the tree structure and divergence times for nonterminal nodes.
The 'gibbs-sigmas' and 'slice-div' operations update the diffusion
standard deviations for the two target variables (as well as the
common standard deviation linking them) and the c1 parameter of the
divergence function.

We can now run the Markov chain with these operations in order to
sample from the posterior distribution of trees and hyperparameters
given the data:

    > dft-mc rlog.dft1a 100

This takes 4.8 seconds on the system used (see Ex-test-system.doc).
After it has finished, we can look at the result with 'dft-display':

    > dft-display rlog.dft1a       

    DIFFUSION TREE MODEL IN FILE "rlog.dft1a" WITH INDEX 100
    
    
    PARAMETERS OF TREE 1
    
    Standard deviation parameters for diffusion process
    
        1.665:     2.121   13.638
    
    Divergence function parameters: - 1.6233 -

This lists the value at iteration 100 of the common standard
deviation, the diffusion standard deviations for the two target
variables, and the c1 parameter of the divergence function.

Other options to 'dft-display' allow examination of the tree
structure.  For instance,

    > dft-display -n rlog.dft1a 

    DIFFUSION TREE MODEL IN FILE "rlog.dft1a" WITH INDEX 100
    
    
    NON-TERMINAL NODES IN TREE 1
    
      Node  Children  Points  Div.Time
    
        -1 -336  337       6  0.603746
        -2  247  370       2  0.913668
        -3   23  181       2  0.995280
        -4 -409  334       3  0.944195
        -5 -211  374       4  0.995491
        -6 -385 -291      26  0.947063
        -7   91  209       2  0.926064
        -8 -493 -149       7  0.848791
        -9  139  355       2  0.997983
       -10 -275 -258       5  0.868474
       -11 -119  171      58  0.834055
       -12 -400   -1      17  0.583088
         
                    etc.

This shows the tree structure in terms of the children of each
nonterminal node, with nonterminal nodes being given negative numbers,
and terminal nodes (corresponding to data points) being given positive
numbers.  The "Points" column is the number of data points descended
from that node.

For diagnosing convergence, plots over time are more useful.  For
example, we can look at how the divergence times for the last common
ancestors of pairs of data points change over the course of the
simulation:

    > dft-plt t a1003123a14521 rlog.dft1a | plot

This plots the divergence time for the last common ancestor of
training cases 3 and 123 (numbered starting at 1) and the divergence
time of the last common ancestor of training cases 45 and 21.  The
syntax is a bit tricky.  The "a" indicator is followed by the number
of the tree, which is always 1 for this model, but more complex models
can have several trees.  This is followed by the indexes for the two
training cases, which MUST be written using the same number of digits
(with leading zeros added if necessary).

We can see how well the model has captured the distribution by seeing
how well it does at predicting test cases, using iterations from the
point when convergence seems to have been reached:

    > dft-pred tp rlog.dft1a 21:

    Number of iterations used: 80
    
    Case Targets         Log Prob
    
       1   -1.22   5.83    -4.336
       2   -2.73  16.33    -4.721
       3   -1.96   6.04    -3.990
       4   -0.77   9.24    -4.542
       5   -0.56   4.78    -5.072
    
         (middle lines omitted)

     496   -3.88  18.14    -6.200
     497    1.38   3.62    -8.357
     498   -0.40  26.50    -7.663
     499   -2.79  11.23    -3.938
     500   -2.28   4.53    -4.768
    
    Average log probability of targets:    -5.088+-0.047

This takes 0.73 seconds on the system used (see Ex-test-system.doc).
Predictive performance is a bit better than for the infinite mixture
model above.

Another approach to Markov chain sampling for this model is to keep
node locations as part of the state of the chain.  Updates to the tree
structure can then be done faster, but the number of iterations needed
for convergence and subsequent sampling may be greater.  Here is an
'mc-spec' command for this approach:

    > mc-spec   rlog.dft1b repeat 10 sample-locations \
                                     slice-positions met-terminals \
                                     gibbs-sigmas slice-div

Performing 100 iterations with these operations takes only 1.6 seconds
on the system used (see Ex-test-system.doc), and for this example,
predictive performance is nearly identical to that above.

One can also use a model with a different divergence function, as in
the following specification:

    > dft-spec  rlog.dft2a 0 2 / 0.5:0.5:0.5 0.01:0.5 - 0.01:0.5

The last three arguments of this command give priors for the c0 and c2
parameters of the divergence function, and specify that the c1
parameter is zero.  The divergence function therefore has the form
a(t) = c0 + c1/(1-t)^2.  This prior produces smoother density
functions, which is more appropriate for this example, although it
turns out that the difference in predictive performance is negligible.