## Density Modeling and Clustering Using Dirichlet Diffusion Trees

**Radford M. Neal,
Dept. of Statistics and Dept. of Computer Science, University of Toronto**

I introduce a family of prior distributions over multivariate
distributions, based on the use of a "Dirichlet diffusion tree" to
generate exchangeable data sets. These priors can be viewed as
generalizations of Dirichlet processes and of Dirichlet process
mixtures, but unlike simple mixtures, they can capture the
hierarchical structure present in many distributions, by means of the
latent diffusion tree underlying the data. This latent tree also
provides a hierarchical clustering of the data, which, unlike *ad
hoc* clustering methods, comes with probabilistic indications of
uncertainty. The relevance of each variable to the clustering can
also be determined. Although Dirichlet diffusion trees are defined in
terms of a continuous-time process, posterior inference involves only
finite-dimensional quantities, allowing computation to be performed by
reasonably efficient Markov chain Monte Carlo methods. The methods
are demonstrated on problems of modeling a two-dimensional density and
of clustering gene expression data.

In J. M. Bernardo, *et al.* (editors) *Bayesian Statistics 7*,
pp. 619-629 (proceedings of the 7th Valencia conference):
postscript,
pdf.

The models described in this paper are implemented as part of my
software for flexible Bayesian modeling.

I was awarded the
Lindley Prize for this paper.

**Associated references:**
This paper includes parts (but not all) of the following technical report:
Neal, R. M. (2001) "Defining priors for distributions using Dirichlet
diffusion trees", Technical Report No. 0104, Dept. of Statistics,
University of Toronto, 25 pages:
abstract,
postscript,
pdf,
associated software.