Bayesian inference begins with a prior distribution for model parameters that is meant to capture prior beliefs about the relationship being modeled. For multilayer perceptron networks, where the parameters are the connection weights, the prior lacks any direct meaning - what matters is the prior over functions computed by the network that is implied by this prior over weights. In this paper, I show that priors over weights can be defined in such a way that the corresponding priors over functions reach reasonable limits as the number of hidden units in the network goes to infinity. When using such priors, there is thus no need to limit the size of the network in order to avoid "overfitting". The infinite network limit also provides insight into the properties of different priors. A Gaussian prior for hidden-to-output weights results in a Gaussian process prior for functions, which can be smooth, Brownian, or fractional Brownian, depending on the hidden unit activation function and the prior for input-to-hidden weights. Quite different effects can be obtained using priors based on non-Gaussian stable distributions. In networks with more than one hidden layer, a combination of Gaussian and non-Gaussian priors appears most interesting.
Technical Report CRG-TR-94-1 (March 1994), 22 pages: postscript, pdf.
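As a small illustration of the Gaussian process limit described in the abstract, the sketch below samples one-hidden-layer tanh networks whose weights are drawn from zero-mean Gaussian priors, with the hidden-to-output weights scaled by 1/sqrt(H). The function name, hyperparameter values, and evaluation points are illustrative choices, not taken from the report; the point is only that the covariance of function values settles down as the number of hidden units H grows, consistent with convergence to a Gaussian process prior over functions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_f(H, xs, n_nets=5000,
                 sigma_u=5.0, sigma_a=5.0, sigma_v=1.0, sigma_b=0.1):
        """Draw f(xs) for n_nets random one-hidden-layer tanh networks.

        All weights have zero-mean Gaussian priors; the hidden-to-output
        weights have standard deviation sigma_v / sqrt(H), which is the
        scaling that gives a well-defined limit as H goes to infinity.
        (The specific sigma values here are illustrative, not from the paper.)
        xs: 1-D array of scalar inputs at which each network is evaluated.
        Returns an (n_nets, len(xs)) array of function values.
        """
        xs = np.asarray(xs, dtype=float)
        u = rng.normal(0.0, sigma_u, size=(n_nets, H))               # input-to-hidden weights
        a = rng.normal(0.0, sigma_a, size=(n_nets, H))               # hidden-unit biases
        v = rng.normal(0.0, sigma_v / np.sqrt(H), size=(n_nets, H))  # hidden-to-output weights
        b = rng.normal(0.0, sigma_b, size=(n_nets, 1))               # output bias
        # Hidden activations for every network and every input: (n_nets, H, len(xs))
        h = np.tanh(u[:, :, None] * xs[None, None, :] + a[:, :, None])
        return np.einsum('nh,nhx->nx', v, h) + b

    xs = np.array([-1.0, 0.0, 1.0])
    for H in (1, 10, 100, 1000):
        f = sample_f(H, xs)
        cov = np.cov(f, rowvar=False)
        print(f"H = {H:5d}   cov(f(-1), f(1)) = {cov[0, 2]:+.4f}   "
              f"var(f(0)) = {cov[1, 1]:.4f}")

For small H the prior over functions is far from Gaussian (with H = 1 it is just a scaled tanh), but as H increases the empirical covariances stabilize and the marginal distributions of the function values become close to Gaussian.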
Neal, R. M. (1994) Bayesian Learning for Neural Networks, Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 195 pages: abstract, postscript, pdf, associated references, associated software.

A revised version of this thesis, with some new material, was published by Springer-Verlag:
Neal, R. M. (1996) Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, New York: Springer-Verlag: blurb, associated references, associated software.