A PROBABILITY ESTIMATION PROBLEM WITH BINARY DATA

As a first illustration of the mixture model and Dirichlet diffusion
tree software, I generated a dataset in which each case was composed
of ten binary attributes.  The distribution of these binary vectors
was a mixture of four component distributions, in each of which the
ten attributes were independent.  The four components each had
probability 0.25 of being chosen to generate a case.  The
probabilities for each component for each binary attributes were as
shown below:

          1    2    3    4    5    6    7    8    9   10

     1   0.1  0.2  0.2  0.2  0.2  0.8  0.8  0.8  0.8  0.7
     2   0.1  0.8  0.8  0.8  0.8  0.2  0.2  0.2  0.2  0.7
     3   0.9  0.2  0.2  0.8  0.8  0.2  0.2  0.8  0.8  0.7
     4   0.9  0.8  0.8  0.2  0.2  0.8  0.8  0.2  0.2  0.7

Each row gives the probabilities of each of the attributes being '1'
for one component of the mixture.  The columns are for the ten binary
attributes in each case.  The vectors generated in this way can be
seen as coming from one of four "patterns": 0000011111, 0111100001,
1001100111, and 1110011001, but with each bit of the chosen pattern
having a small probability of being switched (ranging from 0.1 to 0.3)
in any particular case.

I generated 1000 cases from this distribution, which are stored in the
file 'bdata'.  The first 500 are used for training, the rest for
testing.


A finite mixture model for the binary data.

We will first see what happens when we model this data as a mixture of
four components, which is the true number of components for this
problem.  Each component of the mixture will give each target
attribute a certain probability of being '1' rather than '0'.  These
probabilities are determined from the "offset" parameters for each
target, by means of the "logistic" function:

   Pr(target=1)  =  1 / (1 + exp(-offset))

The offset parameters for each attribute for each component are given
Gaussian prior distributions, with means and standard deviations that
are hyperparameters.  These hyperparameters could be fixed, but here
we will make them variable, separately for each target attribute, but
with each of them linked to a top-level hyperparameter, common to all
targets.

We set up this mixture model with the 'mix-spec' command, as follows:

    > mix-spec blog.4 0 10 4 / x1 0.05:0.5:0.2 10

Here, "blog.4" is the name of the log file used for this run, which is
created by the 'mix-spec' command.  The arguments of 'mix-spec' that
follow the name of the log file are the number of input attributes
(always 0 for mixture models at present), the number of target
attributes (10 for this problem), and the number of mixture components
(set to the true value of 4 in the command above).  The specifications
for priors are given following these arguments, after a "/" separator.

The first prior specification gives the concentration parameter of the
Dirichlet distribution for the mixing proportions for the components.
If this specification starts with "x", this value is automatically
divided by the number of components, which is the scaling required for
the model to reach a sensible limit as the number of components goes
to infinity.  The specification of "x1" above mildly favours unequal
mixing proportions (recall that the true proportions are equal).

The second argument after the "/" specifies the prior for the standard
deviations of the "offsets" for components (which determine the
probabilities for the binary target attributes).  In the hierarchical
scheme specified here (described in general in prior.doc), there is a
top-level standard deviation hyperparameter, given a prior in which
the corresponding "precision" (standard deviation to the power -2) has
a gamma distribution with mean 0.05 and shape parameter 0.5.  There is
a lower-level standard deviation hyperparameter for each target; for
each of these, the corresponding precision has a gamma distribution
with mean given by the higher-level precision, and shape parameter of
0.2.  Both these priors are rather vague (but the shape parameter of
0.2 is vaguer than 0.5), so these hyperparameters can adapt to the
structure of the data.

The last argument of 'mix-spec' is the standard deviation for the
means of the offsets for each of the targets, here set to 10.
Currently, this standard deviation for the mean offset is the same for
all targets, and is fixed in the 'mix-spec' command (it cannot be a
variable hyperparameter).

We can look at these specifications by calling 'mix-spec' with just
the log file as an argument:

    > mix-spec blog.4

    Number of inputs:      0
    Number of targets:     10

    Number of components:  4

    Dirichlet concentration parameter: x1.000
    Prior for SD hyperparameters:       0.050:0.50:0.20
    Prior for mean component offsets:   10.000

After the 'mix-spec' command, we need to specify that the targets are
binary using the 'model-spec' command, as follows:

    > model-spec blog.4 binary

We can then specify where the data comes from using 'data-spec':

    > data-spec blog.4 0 10 2 / bdata@1:500 . bdata@501:1000 .

The number of inputs is here specified to be 0 (as it must be at
present for mixture models), and the number of targets is specified to
be 10.  These must match the values given to 'mix-spec'.  We also
specify that the targets must be the integers 0 or 1 putting in a
following argument of 2 (meaning that there are 2 possible values,
which start at 0).  After a slash, we say where the inputs and targets
come from.  Note that we have to say where the inputs come from even
though there are 0 of them, but fortunately, we can then say that the
targets come from the same place by just using the "." argument.  In
the specification above, the data comes from the first 500 lines of
the file 'bdata', with one case per line.  The remaining 500 lines are
specified to be used for testing.  See data-spec.doc for more details.

Finally, we need to specify how Markov chain sampling is to be done
for this model.  At present, none of the standard Markov chain methods
are allowed for mixture models, only the specialized procedures
documented in mix-mc.doc.  Each of these procedures updates one of the
three parts of the state - the values of the hyperparameters, the
values of the parameters for the various mixture components, and the
values of the indicator variables saying which mixture component is
currently associated with each training case.  The values in each of
these parts of the state can be updated by Gibbs sampling, using the
corresponding procedure.  The following call of 'mc-spec' sets this
up:

    > mc-spec blog.4 repeat 20 gibbs-indicators gibbs-params gibbs-hypers

Here, the three Gibbs sampling operations are repeated 20 times each
iteration, just to cut down on the number of states saved in the log
file.

We can now run the Markov chain simulation for 100 iterations with the
following command:

    > mix-mc blog.4 100

The simulation starts with a state in which all the training cases are
associated with the same component, whose parameters are set to their
means, as specified by hyperparameter values that are also set to
their prior means.  The 'mix-gen' program could be used to initialize
things differently (see mix-gen.doc).

The above run takes 0.75 seconds on the system used (see
Ex-test-system.doc).  If it took longer, you could monitor its
progress with the 'mix-display' and 'mix-plt' programs.  For example,
if you're quick, you might see the following:

    > mix-display blog.4

    MIXTURE MODEL IN FILE "blog.4" WITH INDEX 50

    HYPERPARAMETERS
    
    Standard deviations for component offsets:

        0.342:     1.815    4.273    3.556    3.410    1.767
                   4.219    2.153    1.191    4.576    0.196

    Means for component offsets:

                  +0.924   +1.093   -0.684   +1.170   -1.183
                  +0.949   -1.397   +0.131   -1.657   +1.074


    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE

       1: 0.274   -1.693   +0.815   +1.113   +1.512   +1.024
                  -1.555   -0.914   -0.761   -1.035   +1.163

       2: 0.270   +1.995   +1.008   +1.841   -2.204   -1.222
                  +2.118   +1.706   -1.676   -1.341   +0.990

       3: 0.252   -1.307   -1.388   -2.530   -1.468   -1.207
                  +1.603   +2.047   +1.941   +2.373   +1.032

       4: 0.204   +2.098   -1.174   -1.541   +1.916   +1.663
                  -1.531   -2.646   +1.010   +1.317   +1.349

This displays the state at the last iteration computed so far (here
iteration 50).  The hyperparameter values are shown first, with
top-level hyperparameters on the left, before the colon.  The
lower-level hyperparameters follow, ordered by the target attribute
that they pertain to.  Values for the component parameters follow,
sorted by the fraction of training cases that they are currently
associated with.  These fractions are the first numbers after 1:, 2:,
etc.; note that the actual mixing probabilities are not explicitly
represented, but are instead always integrated over.  Any components
that are not associated with any training case are omitted from the
output, and are not explicitly represented in the state.  The other
numbers shown for each component are the "offset" parameters for each
target attribute.

In this example, the simulation appears to have learned the essence of
the distribution by the iteration shown above.  Recall that a positive
offset corresponds to the probability of a 1 being greater than 1/2, a
negative offset to the probability of a 1 being less than 1/2.  The
four components shown above can thus be seen to correspond to the four
patterns used to generate the cases, as described earlier.  Note that
the last of the offset standard deviation hyperparameters (for the
last target attribute) is quite small.  This indicates that the model
has "learned" that the probabilities for the last target are the same
for all components, and hence can be modeled most efficiently at the
hyperparameter level, by the mean for that offset, rather than
separately for each component.

The indicators of which components are associated with each training
case can also be examined with 'mix-display', and of course iterations
other than the last can be viewed.  See mix-display.doc for details.

One can also look at the progress of the simulation using 'mix-plt'.
For example, the following will produce a time-plot of the cumulative
proportions of the training cases associated with the four components
(as sorted by decreasing frequency):

    > mix-plt t C1C2C3C4 blog.4 | plot

Here, 'plot' is some suitable plotting program.  One can also just let
the output of 'mix-plt' go to standard output, and then examine the
numbers manually.  Other quantities can also be plotted, as described
in mix-quantities.doc.

As well as examining quantities with 'mix-display' and 'mix-plt', one
can also produce a sample of cases from the predictive distribution
using 'mix-cases'.  Predictive probabilities for test cases can be
found using 'mix-pred'.  The following command gives the average log
probability for all the test cases, using the last 80 iterations:

    > mix-pred pa blog.4 21:

    Number of iterations used: 80

    Number of test cases: 500

    Average log probability of targets:    -6.099+-0.069

For comparison, the average log probability for these test cases
using the true mixture distribution is -6.034.


A countably infinite mixture model for the binary data.

Even though the true distribution for this example is a mixture of
four components, good results can nevertheless be obtained using a
mixture with a countably infinite number of components.  The prior for
mixture proportions used with such a model is designed so that a few
of the infinite number of components have substantial probability, so
the model does not result in an "overfitted" solution, in which every
training case is "explained" as coming from a different component.

We specify a countably infinite mixture model by simply omitting the
argument to 'mix-spec' that gives the number of components.  For this
example, we change the 'mix-spec' command used above to the following:

    > mix-spec blog.inf 0 10 / x1 0.05:0.5:0.2 10

The Dirichlet prior specification for mixing proportions of "x1" is
the same as for the mixture model with four components.  The "x"
specifies scaling downward with the number of components, which
produces a sensible limit as the number of components goes to
infinity, as here.  The "x" is therefore required for infinite
mixtures.

The other arguments of 'mix-spec' are as described for the finite
mixture model.  The 'model-spec' and 'data-spec' commands used are
also the same:

    > model-spec blog.inf binary
    > data-spec blog.inf 0 10 2 / bdata@1:500 . bdata@501:1000 .

We must change the 'mc-spec' command, however, since it is impossible
to do Gibbs sampling for the component indicators associated with
training cases - since their number is infinite, we can't compute all
the required conditional probabilities.  However, we can instead use a
specialized Metropolis-Hastings update, in which a new component to go
with a training case is proposed with probability determined by the
frequencies with which components are associated with other training
cases.  The proposal is accepted or rejected based on the resulting
change in likelihood.  The process can then be repeated any desired
number of times, to produce an approximation to Gibbs sampling (the
"approximation" is only with respect to convergence speed, the answer
is exact, asymptotically).  With this change, we can use the following
'mc-spec' command for this model:

    > mc-spec blog.inf repeat 20 met-indicators 10 gibbs-params gibbs-hypers

This is the same as for the finite mixture model, except that 10
repetitions of the Metropolis-Hastings update for the indicators are
specified using 'met-indicators 10', in place of 'gibbs-indicators'.

As before, we can now run the simulation with a command such as:

    > mix-mc blog.inf 100

We can examine the states with 'mix-display'.  For example, once the
'mix-mc' command completes, which takes 2.8 seconds on the system used
(see Ex-test-system.doc), the state should be something like the
following:

    > mix-display blog.inf
    
    MIXTURE MODEL IN FILE "blog.inf" WITH INDEX 100
    
    HYPERPARAMETERS
    
    Standard deviations for component offsets:
    
        0.652:     3.090    1.238    1.875    1.723    2.406
                   3.148    5.278    1.911    2.290    0.277
    
    Means for component offsets:
    
                  +0.322   +0.008   -1.202   -0.365   +0.911
                  -1.460   +0.081   +1.179   +0.458   +1.089
    
    
    PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE
    
       1: 0.292   -1.781   +0.988   +1.278   +1.351   +0.670
                  -1.254   -1.021   -0.865   -1.030   +1.493
    
       2: 0.252   -1.900   -1.600   -2.164   -1.237   -1.618
                  +1.506   +1.906   +1.830   +1.822   +0.981
    
       3: 0.248   +2.619   +0.830   +2.038   -2.486   -1.930
                  +1.892   +2.153   -1.425   -1.929   +0.947
    
       4: 0.190   +2.327   -1.354   -1.353   +1.696   +1.724
                  -1.741   -3.573   +0.967   +1.332   +0.943
    
       5: 0.014   +1.957   -2.620   +1.094   +2.526   +3.792
                  -2.249   +5.191   +0.864   +4.884   +0.975
    
       6: 0.002   -5.484   -0.456   -2.541   +0.298   +3.347
                  +1.349   -4.903   +1.000   -0.070   +1.466
    
       7: 0.002   -1.659   -0.611   -3.036   -3.022   -0.820
                  -3.068   -1.275   +1.828   +2.542   +0.534
    
The output is similar to that seen for the mixture model with four
components, except that seven components are associated with at least
one training case at iteration 100, as shown above.  (The infinite
number of remaining components are not explicitly represented, and are
not displayed by 'mix-display'.)  Three of these components are
associated with very few training cases, however (as seen from the
fractions of the training set after the component number).  This is to
be expected when the true distribution can be expressed using only
four components.  The number of such low-frequency components will
vary from iteration to iteration, as will other aspects of the state,
in accordance with the variability in the posterior distribution.

Since there are exactly four components in the real distribution, one
would expect that the model with four components would perform better
than a model with an infinite number of components.  Any penalty is
quite small in this example, however, as can be seen by looking at the
predictive probabilities:

    > mix-pred pa blog.inf 21:

    Number of iterations used: 80

    Number of test cases: 500

    Average log probability of targets:    -6.116+-0.070

This is almost the same as the average log probability for the four
component model.

A number of other ways of sampling for the indicators for countably
infinite mixture models are also available.  Here are two:

    > mc-spec blog.inf2 repeat 20 \
                gibbs-ext-indicators 2 gibbs-params gibbs-hypers

    > mc-spec blog.inf3 repeat 20 \
                met1-indicators gibbs1-indicators gibbs-params gibbs-hypers

These methods may be more efficient than using the "met-indicators"
operation for some models.  See mix-mc.doc for descriptions of these
and other Markov chain operations for mixture models.


A Dirichlet diffusion tree model for the binary data.

Binary data can also be modeled using Dirichlet diffusion trees.  This
is not the simplest example of how Dirichlet diffusion trees can be
used, since the trees naturally produce real valued data.  To model
binary data, these real values are treated as latent values, which are
put through the logistic function to produce the probabilities for the
binary data to be "1".  The binary data example here does not have the
sort of hierarchical structure that would make use of a Dirichlet
diffusion tree model desirable.  For both these reasons, the following
example of how to model binary data with a Dirichlet diffusion tree
may not be the best introduction to such models - you might want to
read the real-valued example in Ex-mixdft-r.doc first.

Nevertheless, this example does demonstrate how binary data can be
modeled, and shows that even when there is no hierarchical structure
to the data, a Dirichlet diffusion tree model can do quite well.

We start by specifying a Dirichlet diffusion tree model, as follows:

    > dft-spec blog.dft 0 10 / 2 - 1

As for a mixture model specification, this says that there are 0
"input" variables (as there must be at present), and 10 "target"
variables.  The argument of "2" is the standard deviation for the
diffusion process that produces the latent variables used to model the
binary variables (this could have been a prior specification rather
than a fixed value).  The last two arguments are the parameters of the
divergence function used to determine (stochastically) when the
diffusion paths for different cases diverge.  The general form of the
divergence function is

   a(t) = c0 + c1/(1-t) + c2/(1-t)^2

The parameters c0, c1, c2 are given at the end of the 'dft-spec'
command, with "-" being equivalent to zero.  Any zero parameters at
the end can just be omitted.  The specification above therefore
corresponds to the divergence function a(t) = 1/(1-t).  Any of c0, c1,
and c2 can be a prior specification, instead of a constant.  See
dft-spec.doc for details, and Ex-mixdft-r.doc for an example.

We next specify the data model, which is "binary", indicating binary
data modeled using the logistic function:

    > model-spec blog.dft binary

We go on to specify the source of the data, as for the mixture example
above:

    > data-spec blog.dft 0 10 2 / bdata@1:500 . bdata@501:1000 .

Finally, we need to specify how to do Markov chain sampling for the
posterior distribution of the trees and latent values that underlie
the data.

    > mc-spec blog.dft repeat 10 gibbs-latent slice-positions met-terminals

Since the hyperparameters are all fixed for this model, there is no
need for Markov chain operations to update them.  The operations above
will lead to the state of the Markov chain being only the latent
values and the tree structure, with divergence times, but without the
locations of non-terminal nodes (which will be integrated over).  The
'gibbs-latent' operation does a Gibbs sampling scan over the latent
values associated with the targets in training cases.  (Initially, the
latent values are taken to be +1 for targets of "1", and -1 for
targets of "0".)  The 'slice-positions' and 'met-terminals' operations
both update the tree structure and divergence times.  See dft-mc.doc
for details.

We can run the Markov chain for 100 iterations with the following
command:

    > dft-mc blog.dft 100

We can see the progress of the Markov chain by examining various
quantities, such as the latent values associated with training cases.
The following command will plot the latent values for the first target
for the first six training cases:

    > dft-plt t o1@1:6 blog.dft | plot

The first four of these cases have "0" for their first target value,
the last two have "1".  The plot should show that the latent values
for the first four cases tend to be negative, whereas the latent
values for the last two cases tend to be positive, at least once the
Markov chain has approached convergence.

The 'dft-mc' command above takes 19 seconds on the system used (see
Ex-test-system.doc).  Once it is done, we can see how well the model
predicts new data using the following command, which looks at the
Dirichlet diffusion tree models produced by the last 60 iterations of
the Markov chain:

    > dft-pred pa blog.dft 41:

    Number of iterations used: 60

    Number of test cases: 500

    Average log probability of targets:    -6.110+-0.066

This takes 9.1 seconds on the system used.  Performance is not much
worse than for the four-component mixture model above, which is based
on the true structure of the data.