Density Modeling and Clustering Using Dirichlet Diffusion Trees

Radford M. Neal, Dept. of Statistics and Dept. of Computer Science, University of Toronto

I introduce a family of prior distributions over multivariate distributions, based on the use of a "Dirichlet diffusion tree" to generate exchangeable data sets. These priors can be viewed as generalizations of Dirichlet processes and of Dirichlet process mixtures, but unlike simple mixtures, they can capture the hierarchical structure present in many distributions by means of the latent diffusion tree underlying the data. This latent tree also provides a hierarchical clustering of the data, which, unlike ad hoc clustering methods, comes with probabilistic indications of uncertainty. The relevance of each variable to the clustering can also be determined. Although Dirichlet diffusion trees are defined in terms of a continuous-time process, posterior inference involves only finite-dimensional quantities, allowing computation to be performed by reasonably efficient Markov chain Monte Carlo methods. The methods are demonstrated on problems of modeling a two-dimensional density and of clustering gene expression data.

In J. M. Bernardo et al. (editors), Bayesian Statistics 7, pp. 619-629 (proceedings of the 7th Valencia conference).

The models described in this paper are implemented as part of my software for flexible Bayesian modeling.

I was awarded the Lindley Prize for this paper.

Associated references: This paper includes parts (but not all) of the following technical report:
Neal, R. M. (2001) "Defining priors for distributions using Dirichlet diffusion trees", Technical Report No. 0104, Dept. of Statistics, University of Toronto, 25 pages.