FACILITIES PROVIDED BY THIS SOFTWARE

This software implements flexible Bayesian models for regression, classification, and probability or density estimation applications. The regression and classification models are based on multilayer perceptron neural networks or on Gaussian processes. The probability and probability density models are based on finite or countably infinite mixture models; the infinite models are also known as Dirichlet process mixture models. Bayesian inference for these models is done using Markov chain Monte Carlo methods. Software modules that support Markov chain sampling are included in the distribution, and may be useful in other applications.

Note that I am distributing this software to facilitate research in this area. Potential users should make note of the copyright notice at the beginning of this document (or accessible via the first hypertext link). You must obtain permission from me before using this software for purposes other than research or education. You should also note that the software may have bugs, particularly in recently added or experimental features.

The neural network models are described in my thesis, "Bayesian Learning for Neural Networks", which has now been published by Springer-Verlag (ISBN 0-387-94724-8). The neural network models implemented are essentially as described in the Appendix of this book.

The Gaussian process models are in many ways analogous to the network models. The Gaussian process models implemented in this software, and the computational methods used, are described in my technical report entitled "Monte Carlo implementation of Gaussian process models for Bayesian regression and classification", available in compressed Postscript at URL ftp://ftp.cs.utoronto.ca/pub/radford/mc-gp.ps.Z. The Gaussian process models for regression are similar to those evaluated by Carl Rasmussen in his thesis, "Evaluation of Gaussian Processes and other Methods for Non-Linear Regression", available from his home page at the URL http://www.cs.utoronto.ca/~carl/; his thesis also discusses neural network models. To understand how to use the software implementing these models, it is essential that you have read at least one of these references.

The neural network software supports Bayesian learning for regression problems, classification problems, and survival analysis (experimental), using models based on networks with any number of hidden layers, with a wide variety of prior distributions for network parameters and hyperparameters. The Gaussian process software supports regression and classification models that are similar to neural network models with an infinite number of hidden units, using Gaussian priors. The advantages of Bayesian learning for both types of model include the automatic determination of "regularization" hyperparameters, without the need for a validation set, the avoidance of overfitting when using large networks, and the quantification of uncertainty in predictions. The software implements the Automatic Relevance Determination (ARD) approach to handling inputs that may turn out to be irrelevant (developed with David MacKay); a conceptual sketch of an ARD prior is given below.

For problems and networks of moderate size (e.g., 200 training cases, 10 inputs, 20 hidden units), fully training a neural network model (to the point where one can be reasonably sure that the correct Bayesian answer has been found) typically takes several hours to a day on our SGI machine. However, quite good results, competitive with other methods, are often obtained after training for under an hour.
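To illustrate the idea behind ARD: each input is given its own hyperparameter controlling the width of the prior for the weights out of that input, so that sampling can drive the hyperparameter for an irrelevant input toward zero, shrinking all of that input's weights. The following minimal C sketch of such a prior's log density is my own illustration of the concept, not code from this distribution; the function name and the flat layout of the weight array are assumptions made for the example.

   #include <math.h>

   /* Log prior density for input-to-hidden weights under an ARD-style
      prior (hypothetical sketch, not code from this distribution).
      The weight from input i to hidden unit j is stored in
      w[i*n_hidden+j], and has a Gaussian prior with mean zero and
      standard deviation sigma[i], shared by all weights out of input
      i.  A small sigma[i] suppresses input i entirely. */

   double ard_log_prior
   ( int n_inputs,     /* number of input units */
     int n_hidden,     /* number of hidden units */
     double *w,        /* weights, laid out as described above */
     double *sigma     /* per-input prior standard deviations */
   )
   {
     double lp;
     int i, j;

     lp = 0;
     for (i = 0; i<n_inputs; i++)
     { for (j = 0; j<n_hidden; j++)
       { double wij = w[i*n_hidden+j];
         lp += - log(sigma[i]) - 0.5*log(2*M_PI)
               - 0.5 * (wij*wij) / (sigma[i]*sigma[i]);
       }
     }

     return lp;
   }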
The time required to train the Gaussian process models depends a lot on the number of training cases, since the dominant cost is manipulating the covariance matrix over the training cases, which grows roughly as the cube of their number. For 100 cases, these models may take only a few minutes to train (again, to the point where one can be reasonably sure that convergence to the correct answer has occurred). For 1000 cases, however, training might well require a day of computation.

The finite mixture models are similar to those that have been used by many people; for example, Lavine and West (Canadian Journal of Statistics, vol. 20, pp. 451-461, 1992) fit similar models using similar Markov chain Monte Carlo methods. The countably infinite mixture models are equivalent to Dirichlet process mixtures. Markov chain sampling for these models has been described by Escobar and West (Journal of the American Statistical Association, vol. 90, pp. 577-588, 1995). Both finite and infinite mixture models (for binary data) are described in my tech report, "Bayesian mixture modeling by Monte Carlo simulation", available by anonymous ftp at the URL ftp://ftp.cs.utoronto.ca/pub/radford/bmm.ps.Z. The models and Markov chain methods used in the software are not identical to those described in any of these references, however; the details can be found only in the software documentation. For this reason, the mixture software may be a bit difficult to figure out until such time as I get around to writing up a paper describing this implementation. This part of the software is rather preliminary in other respects as well.

The software consists of a number of programs and modules. Five major components are included in this distribution, each with its own directory:

   util   Modules and programs of general utility.

   mc     Modules and programs that support sampling using Markov chain
          Monte Carlo methods, using modules from util.

   net    Modules and programs that implement Bayesian inference for
          models based on multilayer perceptrons, using the modules
          from util and mc.

   gp     Modules and programs that implement Bayesian inference for
          models based on Gaussian processes, using the modules from
          util and mc.

   mix    Modules and programs that implement Bayesian inference for
          finite and infinite mixture models, using modules from util
          and mc.

In addition, the 'bvg' directory contains modules and programs for sampling from a bivariate Gaussian distribution, as a simple demonstration of the capabilities of the Markov chain Monte Carlo facilities (a conceptual sketch of such a sampler is given at the end of this section). Other than by providing this example, and the detailed documentation on various commands, I have not attempted to document how you might go about using the Markov chain Monte Carlo modules for another application. The 'examples' directory contains the data sets that are used in the tutorial examples, along with shell scripts containing the commands used.

It is possible to use this software to do learning and prediction without any knowledge of how the programs are written (assuming that the software can be installed as described below without any problems). However, the complete source code is included so that researchers can modify the programs to try out their own ideas.

The software is written in ANSI C, and is meant to be run in a UNIX environment. Specifically, it was developed on an SGI machine running IRIX Release 5.3. It also seems to run OK on a SPARC machine running SunOS 5, using the 'gcc' C compiler, and on DEC Alpha machines.
As far as I know, the software does not depend on any peculiarities of these environments (except perhaps for the use of the drand48 pseudo-random number generator), but you may nevertheless have problems getting it to work in substantially different environments, and I can offer little or no assistance in this regard. There is no dependence on any particular graphics package or graphical user interface. (The 'xxx-plt' programs are designed to allow their output to be piped directly into the 'xgraph' plotting program, but other plotting programs can be used instead, or the numbers can be examined directly.)
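For readers curious about what the bivariate Gaussian demonstration mentioned above involves, here is a self-contained sketch of Gibbs sampling from a bivariate Gaussian. It is my own illustration of the technique, not the code in the 'bvg' directory; the correlation value, iteration count, and seed are arbitrary choices for the example. It uses the drand48 generator mentioned above, with standard normal deviates obtained by the Box-Muller method. For a bivariate Gaussian with zero means, unit variances, and correlation rho, each coordinate's full conditional given the other is Gaussian with mean rho times the other coordinate and variance 1-rho^2.

   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   /* Generate a standard normal deviate by the Box-Muller method,
      using the drand48 uniform generator. */

   static double std_normal (void)
   {
     double u1, u2;
     do { u1 = drand48(); } while (u1==0);   /* avoid log(0) */
     u2 = drand48();
     return sqrt(-2*log(u1)) * cos(2*M_PI*u2);
   }

   /* Gibbs sampling from a bivariate Gaussian with zero means, unit
      variances, and correlation rho.  Each coordinate is drawn in
      turn from its full conditional given the other, which is
      Gaussian with mean rho times the other coordinate and variance
      1-rho*rho.  A sketch of the kind of demonstration done by the
      programs in 'bvg', not the actual code from that directory. */

   int main (void)
   {
     double rho = 0.9;                /* arbitrary illustrative value */
     double sd  = sqrt(1-rho*rho);    /* conditional std. deviation   */
     double x1 = 0, x2 = 0;           /* arbitrary initial state      */
     int t;

     srand48(1);

     for (t = 0; t<1000; t++)
     { x1 = rho*x2 + sd*std_normal();
       x2 = rho*x1 + sd*std_normal();
       printf("%f %f\n", x1, x2);     /* one sampled point per line */
     }

     return 0;
   }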