A PROBABILITY ESTIMATION PROBLEM WITH BINARY DATA As a first illustration of the mixture model software, I generated a dataset in which each case was composed of ten binary attributes. The distribution of these binary vectors was a mixture of four component distributions, in each of which the ten attributes were independent. The four components each had probability 0.25 of being chosen to generate a case. The probabilities for each component for each binary attributes were as shown below: 1 2 3 4 5 6 7 8 9 10 1 0.1 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.7 2 0.1 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.7 3 0.9 0.2 0.2 0.8 0.8 0.2 0.2 0.8 0.8 0.7 4 0.9 0.8 0.8 0.2 0.2 0.8 0.8 0.2 0.2 0.7 Each row gives the probabilities of each of the attributes being '1' for one component of the mixture. The columns are for the ten binary attributes in each case. The vectors generated in this way can be seen as coming from one of four "patterns": 0000011111, 0111100001, 1001100111, and 1110011001, but with each bit of the chosen pattern having a small probability of being switched (ranging from 0.1 to 0.3) in any particular case. I generated 1000 cases from this distribution, which are stored in the file 'bdata'. The first 500 are used for training (the others are not used at all at the moment). A finite mixture model for the binary data. We will first see what happens when we model this data as a mixture of four components, which is the true number of components for this problem. Each component of the mixture will give each target attribute a certain probability of being '1' rather than '0'. These probabilities are determined from the "offset" parameters for each target, by means of the "logistic" function: Pr(target=1) = 1 / (1 + exp(-offset)) The offset parameters for each attribute for each component are given Gaussian prior distributions, with means and standard deviations that are hyperparameters. These hyperparameters could be fixed, but here we will make them variable, separately for each target attribute, but with each of them linked to a top-level hyperparameter, common to all targets. We set up this mixture model with the 'mix-spec' command, as follows: > mix-spec blog.4 0 10 4 / x1 0.05:0.5:0.2 10 Here, "blog.4" is the name of the log file used for this run, which is created by the 'mix-spec' command. The arguments of 'mix-spec' that follow the name of the log file are the number of input attributes (always 0 for mixture models at present), the number of target attributes (10 for this problem), and the number of mixture components (set to the true value of 4 in the command above). The specifications for priors are given following these arguments, after a "/" separator. The first prior specification gives the concentration parameter of the Dirichlet distribution for the mixing proportions for the components. If this specification starts with "x", this value is automatically divided by the number of components, which is the scaling required for the model to reach a sensible limit as the number of components goes to infinity. The specification of "x1" above mildly favours unequal mixing proportions (recall that the true proportions are equal). The second argument after the "/" specifies the prior for the standard deviations of the "offsets" for components (which determine the probabilities for the binary target attributes). In the hierarchical scheme specified here (described in general in prior.doc), there is a top-level standard deviation hyperparameter, given a prior in which the corresponding "precision" (standard deviation to the power -2) has a gamma distribution with mean 0.05 and shape parameter 0.5. There is a lower-level standard deviation hyperparameter for each target; for each of these, the corresponding precision has a gamma distribution with mean given by the higher-level precision, and shape parameter of 0.2. Both these priors are rather vague (but the shape parameter of 0.2 is vaguer than 0.5), so these hyperparameters can adapt to the structure of the data. The last argument of 'mix-spec' is the standard deviation for the means of the offsets for each of the targets, here set to 10. Currently, this standard deviation for the mean offset is the same for all targets, and is fixed in the 'mix-spec' command (it cannot be a variable hyperparameter). We can look at these specifications by calling 'mix-spec' with just the log file as an argument: > mix-spec blog.4 Number of inputs: 0 Number of targets: 10 Number of components: 4 Dirichlet concentration parameter: x1.000 Prior for SD hyperparameters: 0.050:0.50:0.20 Prior for mean component offsets: 10.000 After the 'mix-spec' command, we need to specify that the targets are binary using the 'model-spec' command, as follows: > model-spec blog.4 binary We can then specify where the data comes from using 'data-spec': > data-spec blog.4 0 10 2 / bdata@1:500 . The number of inputs is here specified to be 0 (as it must be at present for mixture models), and the number of targets is specified to be 10. These must match the values given to 'mix-spec'. We also specify that the targets must be the integers 0 or 1 putting in a following argument of 2 (meaning that there are 2 possible values, which start at 0). After a slash, we say where the inputs and targets come from. Note that we have to say where the inputs come from even though there are 0 of them, but fortunately, we can then say that the targets come from the same place by just using the "." argument. In the specification above, the data comes from the first 500 lines of the file 'bdata', with one case per line. See data-spec.doc for more details. Finally, we need to specify how Markov chain sampling is to be done for this model. At present, none of the standard Markov chain methods are allowed for mixture models, only the specialized procedures documented in mix-mc.doc. Each of these procedures updates one of the three parts of the state - the values of the hyperparameters, the values of the parameters for the various mixture components, and the values of the indicator variables saying which mixture component is currently associated with each training case. The values in each of these parts of the state can be updated by Gibbs sampling, using the corresponding procedure. The following call of 'mc-spec' sets this up: > mc-spec blog.4 repeat 20 gibbs-indicators gibbs-params gibbs-hypers Here, the three Gibbs sampling operations are repeated 20 times each iteration, just to cut down on the number of states saved in the log file. We can now run the Markov chain simulation for 100 iterations with the following command: > mix-mc blog.4 100 The simulation starts with a state in which all the training cases are associated with the same component, whose parameters are set to their means, as specified by hyperparameter values that are also set to their prior means. The 'mix-gen' program could be used to initialize things differently (see mix-gen.doc). The above run takes a bit over one minute on our machine. It can be run in the background, in which case progress can be monitored with the 'mix-display' and 'mix-plt' programs. For example, after a few seconds, we might see the following: > mix-display blog.4 MIXTURE MODEL IN FILE "blog.4" WITH INDEX 17 HYPERPARAMETERS Standard deviations for component offsets: 0.198: 4.056 0.718 1.579 1.702 1.830 1.548 2.504 1.957 4.120 0.184 Means for component offsets: +1.816 +0.022 +0.182 +0.037 +0.500 +1.322 -0.018 +1.375 +2.802 +0.990 PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE 1: 0.310 -2.451 +0.738 +1.342 +1.093 +0.605 -0.960 -1.146 -0.441 -0.798 +0.903 2: 0.250 -2.121 -1.530 -1.999 -0.917 -1.393 +1.727 +1.862 +1.603 +3.118 +1.101 3: 0.238 +3.950 +0.915 +1.372 -1.960 -0.940 +1.307 +1.793 -1.268 -2.102 +1.035 4: 0.202 +2.710 -0.978 -1.015 +1.900 +1.915 -1.802 -2.903 +1.268 +1.452 +0.827 This displays the state at the last iteration computed so far (here iteration 17). The hyperparameter values are shown first, with top-level hyperparameters on the left, before the colon. The lower-level hyperparameters follow, ordered by the target attribute that they pertain to. Values for the component parameters follow, sorted by the fraction of training cases that they are currently associated with. These fractions are the first numbers after 1:, 2:, etc.; note that the actual mixing probabilities are not explicitly represented, but are instead always integrated over. Any components that are not associated with any training case are omitted from the output, and are not explicitly represented in the state. The other numbers shown for each component are the "offset" parameters for each target attribute. In this example, the simulation appears to have learned the essence of the distribution by the iteration shown above. Recall that a positive offset corresponds to the probability of a 1 being greater than 1/2, a negative offset to the probability of a 1 being less than 1/2. The four components shown above can thus be seen to correspond to the four patterns used to generate the cases, as described earlier. Note that the last of the offset standard deviation hyperparameters (for the last target attribute) is quite small. This indicates that the model has "learned" that the probabilities for the last target are the same for all components, and hence can be modeled most efficiently at the hyperparameter level, by the mean for that offset, rather than separately for each component. The indicators of which components are associated with each training case can also be examined with 'mix-display', and of course iterations other than the last can be viewed. See mix-display.doc for details. One can also look at the progress of the simulation using 'mix-plt'. For example, the following will produce a time-plot of the cumulative proportions of the training cases associated with the four components (as sorted by decreasing frequency): > mix-plt t C1C2C3C4 blog.4 | plot Here, 'plot' is some suitable plotting program. One can also just let the output of 'mix-plt' go to standard output, and then examine the numbers manually. Other quantities can also be plotted, as described in mix-quantities.doc. As well as examining quantities with 'mix-display' and 'mix-plt', one can also produce a sample of cases from the predictive distribution using 'mix-cases'. Unfortunately, these are about the only things one can do with the results at the moment. There are no programs that find predictive probabilities, fill in missing values in partial cases, or do any of the other useful things that are quite possible with these models, but which haven't been implemented. See mix-extensions.doc for a wish list of such facilities. A countably infinite mixture model for the binary data. Even though the true distribution for this example is a mixture of four components, good results can nevertheless be obtained using a mixture with a countably infinite number of components. The prior for mixture proportions used with such a model is designed so that a few of the infinite number of components have substantial probability, so the model does not result in an "overfitted" solution, in which every training case is "explained" as coming from a different component. We specify a countably infinite mixture model by simply omitting the argument to 'mix-spec' that gives the number of components. For this example, we change the 'mix-spec' command used above to the following: > mix-spec blog.inf 0 10 / x1 0.05:0.5:0.2 10 The Dirichlet prior specification for mixing proportions of "x1" is the same as for the mixture model with four components. The "x" specifies scaling downward with the number of components, which produces a sensible limit as the number of components goes to infinity, as here. The "x" is therefore required for infinite mixtures. The other arguments of 'mix-spec' are as described for the finite mixture model. The 'model-spec' and 'data-spec' commands used are also the same: > model-spec blog.inf binary > data-spec blog.inf 0 10 2 / bdata@1:500 . We must change the 'mc-spec' command, however, since it is impossible to do Gibbs sampling for the component indicators associated with training cases - since their number is infinite, we can't compute all the required conditional probabilities. However, we can instead use a specialized Metropolis-Hastings update, in which a new component to go with a training case is proposed with probability determined by the frequencies with which components are associated with other training cases. The proposal is accepted or rejected based on the resulting change in likelihood. The process can then be repeated any desired number of times, to produce an approximation to Gibbs sampling. With this change, we can use the following 'mc-spec' command for this model: > mc-spec blog.inf repeat 20 met-indicators 10 gibbs-params gibbs-hypers This is the same as for the finite mixture model, except that 10 repetitions of the Metropolis-Hastings update for the indicators are specified using 'met-indicators 10', in place of 'gibbs-indicators'. As before, we can now run the simulation with a command such as: > mix-mc blog.inf 100 We can examine the states with 'mix-display'. For example, once the 'mix-mc' command completes, the state should be something like the following: > mix-display blog.inf MIXTURE MODEL IN FILE "blog.inf" WITH INDEX 100 HYPERPARAMETERS Standard deviations for component offsets: 0.294: 2.090 1.637 2.191 2.686 2.195 2.299 3.737 2.555 1.565 0.133 Means for component offsets: +1.456 -0.883 -0.760 -0.798 -0.455 -0.482 -1.981 -0.859 +0.352 +0.943 PARAMETERS AND FREQUENCIES FOR COMPONENTS OF THE MIXTURE 1: 0.328 -1.592 +0.810 +0.710 +1.011 +0.551 -0.881 -1.333 -0.470 -0.783 +1.147 2: 0.262 +1.549 +1.284 +1.459 -2.410 -1.312 +2.129 +2.131 -1.709 -1.162 +0.768 3: 0.212 -1.539 -1.378 -2.675 -1.357 -1.244 +1.930 +2.561 +1.802 +1.866 +0.942 4: 0.170 +2.130 -1.917 -1.421 +1.789 +1.926 -1.626 -2.642 +1.403 +1.302 +0.841 5: 0.018 -3.048 -1.125 -3.540 -0.104 -2.428 +0.796 -6.877 +3.549 +0.400 +1.079 6: 0.006 +1.207 -4.213 -0.463 +1.713 +4.755 -1.497 +0.838 +3.890 +0.436 +1.154 7: 0.004 +2.252 -4.192 +0.727 +2.935 -5.049 -4.338 -3.355 +1.462 -0.389 +0.886 The output is similar to that seen for the mixture model with four components, except that seven components are associated with at least one training case at iteration 100, as shown above. (The infinite number of remaining components are not explicitly represented, and are not displayed by 'mix-display'.) Several of these components are associated with very few training cases, however (as seen from the fractions of the training set after the component number). This is to be expected when the true distribution can be expressed using only four components. The number of such low-frequency components will vary from iteration to iteration, as will other aspects of the state, in accordance with the variability in the posterior distribution.