EXAMPLE OF CLASSIFICATION WITH A DIRICHLET DIFFUSION TREE JOINT MODEL.

Rather than classify items with a neural network or Gaussian process
model for the conditional distribution of the class given inputs, we
can instead model the joint distribution of the inputs and the class,
from which we can then derive the conditional distribution of the
class given the inputs.  Here, this is done using a Dirichlet
diffusion tree model for the joint distribution.  One advantage of
this approach is that unlabelled data (with the class missing) can be
used to help learn the classifier.

The commands used here are in the rbcmds.dft file in ex-mixdft.

The data is the same as that used for the example of modeling a
bivariate density (see Ex-mixdft-r.doc), except that we now also look
at the 0/1 indicator of which component each data point was generated
from, which was previously ignored.  The full data file (in ex-mixdft)
can be used to create a training set of 500 cases in which only the
last 10 cases have class labels, as follows:

    > head -500 rdata | sed "1,490s/.\$/?/" >rdata.t

Don't worry if this is gibberish to you - all that matters is the
final result, in which the first 490 cases have the class indicator
replaced by "?", which indicates a missing value.

The following specifications set up a Dirichlet diffusion tree model
for the two inputs and the class (all regarded as "targets" for this
model):

    > dft-spec rblog.dft 0 3 / 0.5:0.5:0.5 0.01:0.5 - 0.01:0.5
    > model-spec rblog.dft real 0.1 last-binary
    > data-spec rblog.dft 0 3 / rdata.t@1:500 .

Note the "last-binary" option of model-spec.  This says that although
the targets are generally real-valued, the very last target is binary.

We can now sample from the posterior distribution for the tree and the
parameters of the model as follows:

    > mc-spec rblog.dft repeat 15 gibbs-latent slice-positions \
                                  met-terminals gibbs-sigmas slice-div
    > dft-mc rblog.dft 1000

This takes 109 seconds on the system used (see Ex-test-system.doc).
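
For readers who would like to see what the sed command that produced
rdata.t does, the following is a small sketch run on three made-up
cases rather than on rdata.  The function name mask_labels and the
sample values are hypothetical, introduced only for illustration; the
sed expression itself is the one used above, with the number of cases
to mask passed as an argument.

```shell
# Replace the last character of each of the first N lines with "?".
# mask_labels is a hypothetical helper name, not part of the software.
mask_labels () {
  # $1 = number of leading cases whose class indicator is blanked out
  sed "1,${1}s/.\$/?/"
}

# Three made-up cases: two real-valued inputs followed by a 0/1 class.
printf '0.12 1.30 0\n-0.45 0.88 1\n1.07 -0.21 1\n' | mask_labels 2
```

The first two cases come out with "?" in place of the class, and the
third is left unchanged, which mirrors what happens to the first 490
of the 500 training cases.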

We can use iterations from the end of this run to evaluate the
predictive density for some new vector of targets.  In order to make a
prediction for the class of some test case in which only the two
real-valued targets are known, we need to evaluate the predictive
density for the test case with 0 filled in for the class and for the
test case with 1 filled in for the class.  Two files of test cases
(the last 500 in rdata) with the actual classes replaced by 0 and by 1
can be created as follows (again, don't worry if the details don't
make sense to you):

    > tail -500 rdata | sed "1,\$s/.\$/0/" >rdata.0
    > tail -500 rdata | sed "1,\$s/.\$/1/" >rdata.1

The following commands find the log probability densities for these
test cases, based on every fifth iteration after iteration 400 from
the log file:

    > dft-pred pb rblog.dft 405:%5 / rdata.0 . >rdata.lp0
    > dft-pred pb rblog.dft 405:%5 / rdata.1 . >rdata.lp1

The following commands convert the log probability densities into
probability densities:

    > sed "s/e/E/" <rdata.lp0 | sed "s/.*/calc \"Exp(&)\"/" \
        | bash | sed "s/ */p0=/" >rdata.up0
    > sed "s/e/E/" <rdata.lp1 | sed "s/.*/calc \"Exp(&)\"/" \
        | bash | sed "s/ */p1=/" >rdata.up1

The ratio of the probability density of a test case with the class set
to 1 to the probability density of the same test case with the class
set to 0 can be used to find the conditional probability of class 1,
as follows:

    > combine rdata.up0 rdata.up1 | sed "s/.*/calc & \"p1\\/(p0+p1)\"/" \
        | bash >rdata.p1

The final result, in the file rdata.p1, is the predictive probability
of class 1 for each of the 500 test cases.
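
The same conditional probability can also be computed directly from
the two log densities, since p1/(p0+p1) = 1/(1+exp(lp0-lp1)), where
lp0 and lp1 are the log densities with the class set to 0 and to 1.
The sketch below does this with the standard paste and awk utilities
instead of calc and combine.  It is an alternative to the commands
above, not the method used in this example, and it assumes that each
of the two files holds one log density per line.  The function name
class1_prob is hypothetical.

```shell
# Conditional probability of class 1 from two files of log densities.
# class1_prob is a hypothetical helper name; it assumes one number per
# line in each file, with lines in the same test-case order.
class1_prob () {
  # $1 = log densities with class 0 filled in
  # $2 = log densities with class 1 filled in
  paste "$1" "$2" |
    awk '{ printf "%.6f\n", 1 / (1 + exp($1 - $2)) }'
}
```

Under those assumptions, it could be applied to the files produced
above as "class1_prob rdata.lp0 rdata.lp1 >rdata.p1".  Working with
the difference of the log densities means the individual densities
never have to be exponentiated on their own.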

We can now guess that the class is 1 if this probability is greater
than 0.5:

    > sed "s/0.[56789].*/1/" <rdata.p1 | sed "s/...*/0/" >rdata.guess

and compare with the true class label:

    > tail -500 rdata | sed "s/.* //" >rdata.true
    > combine rdata.true rdata.guess | fgrep "0 1" | wc
    > combine rdata.true rdata.guess | fgrep "1 0" | wc

There are 14 test cases with true class 0 where the model guessed 1,
and none where the true class was 1 but the model guessed 0, giving an
error rate of 14/500 = 2.8%.  The asymmetry in errors is probably due
to 0 being the less common class, so the model tends to guess 1 in
ambiguous cases.

This error rate is likely much better than we could achieve with any
method that looks only at the 10 training cases for which the class
was provided for training.
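
The thresholding and error counting can also be done in one step,
comparing each probability numerically with 0.5 rather than matching
its printed digits.  The following is a minimal sketch of that idea
using paste and awk; it is not the procedure used above.  It assumes
one true class (0 or 1) per line in the first file and one probability
of class 1 per line in the second.  The function name count_errors is
hypothetical.

```shell
# Count both kinds of misclassification and report the error rate.
# count_errors is a hypothetical helper name, not part of the software.
count_errors () {
  # $1 = true classes, $2 = predictive probabilities of class 1
  paste "$1" "$2" | awk '
    { guess = ($2 > 0.5) ? 1 : 0 }       # guess class 1 if probability > 0.5
    $1 == 0 && guess == 1 { fp++ }       # true class 0, guessed 1
    $1 == 1 && guess == 0 { fn++ }       # true class 1, guessed 0
    END {
      if (NR > 0)
        printf "0->1: %d  1->0: %d  error rate: %.1f%%\n",
               fp, fn, 100 * (fp + fn) / NR
    }'
}
```

Under those assumptions, "count_errors rdata.true rdata.p1" would
report the two error counts and the overall error rate on one line.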