SAMPLING FROM A RING DISTRIBUTION IN THREE DIMENSIONS

I will here demonstrate a variety of Markov chain sampling methods using as an example a distribution that forms a ring in a space of three dimensions, parameterized by variables called "x", "y", and "z". For details on the various methods described, see mc-spec.doc. A table comparing the performance of the methods on these examples is included at the end of this section.

Specifying the distribution.

The "ring" distribution is specified by a 'dist-spec' command such as the following (the "\" says the command continues on the next line):

> dist-spec glog \
    "x^2/2 + y^2/2 + z^2/2 + (x+y+z)^2 + 10000/(1+x^2+y^2+z^2)"

Recall that the formula given to 'dist-spec' is the "energy function", which is minus the log probability density, plus any constant. If the energy function above consisted of only the first three terms, the distribution would be multivariate normal, with x, y, and z being independent, each having mean zero and variance one. The fourth term, (x+y+z)^2, leaves the distribution still normal, but squashes it in the direction where x, y, and z increase equally. The final term is large near the origin, and from there decreases to zero symmetrically in all directions. It has the effect of making a hole in the centre of what would otherwise have been a normal distribution, leaving a ring shape.

The examples below are assumed to start with a 'dist-spec' command of the above form, except that "glog" is replaced with the name of the log file used for the method being demonstrated.

Multivariate Metropolis updates.

We will first see how to sample from this distribution using a variation of the Metropolis algorithm in which all three state variables are changed simultaneously. In the proposal distribution used, new values for the state variables are chosen independently, each from a normal distribution with mean equal to the present value and a specified standard deviation. The following command specifies that 50 Metropolis operations of this sort should be done for each full iteration:

> mc-spec glog.met,1 repeat 50 metropolis 1 end

Here, "glog.met,1" is the name of the log file to use, which would have been created using a 'dist-spec' command like the one above. The "repeat 50 ... end" construction causes the enclosed operations to be repeated the given number of times. This saves space in the log file, compared to doing one metropolis operation for 50 times as many iterations, and is appropriate when many such operations will be needed to get to a substantially different point. A similar result can be obtained using an extra argument to 'dist-mc' (see xxx-mc.doc), but using "repeat" has the advantage that one can also easily look at the rejection rate over the 50 repetitions.

We can now sample for 10000 Markov chain iterations (a total of 500000 metropolis updates) with the following command:

> dist-mc glog.met,1 10000

This takes about half a second on the test system (Ex-test-system.doc), as will all the other sampling commands in this section.
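As a minimal sketch of what one such update amounts to (this is my own illustration in Python, not code from this software, and the names are my own), a multivariate Metropolis update on this distribution can be written as:

    import numpy as np

    # Illustrative sketch only - not code from this software.
    rng = np.random.default_rng(1)

    def energy(s):
        # The energy function given to 'dist-spec' above.
        x, y, z = s
        return x**2/2 + y**2/2 + z**2/2 + (x+y+z)**2 \
               + 10000/(1 + x**2 + y**2 + z**2)

    def metropolis(s, stepsize):
        # Propose new values for all three variables at once, each
        # normal with mean equal to the current value.
        prop = s + stepsize * rng.standard_normal(3)
        # Accept with probability min(1, exp(E(old) - E(new))).
        if rng.random() < np.exp(min(0.0, energy(s) - energy(prop))):
            return prop, True
        return s, False

    state = np.zeros(3)
    for _ in range(50):                 # one iteration = 50 updates
        state, accepted = metropolis(state, stepsize=1.0)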
We can look at what happened to the three state variables during these 10000 iterations with a command such as

> dist-plt t xyz glog.met,1 | plot

Or you might look at only the first 200 iterations:

> dist-plt t xyz glog.met,1 :200 | plot

Depending on your plot program's requirements, you might instead use a command such as

> dist-tbl txyz glog.met,1 | plot

or you might have to plot the variables one at a time, with commands such as

> dist-plt t x glog.met,1 | plot

From these plots, you can see that the chain quite rapidly reached the equilibrium distribution - maybe even by the end of the first iteration (ie, within the first 50 metropolis updates). Just to be sure, however, let's discard the first 10 iterations as "burn-in". We can now take a look at the "ring" distribution with commands like

> dist-plt x y glog.met,1 11: | plot-points

If "plot-points" plots points rather than lines, this will produce a scatterplot of the distribution for "x" and "y" (with "z" ignored), which will look like a flattened ring. The ring is actually circular, but it's tilted with respect to the axes, so you'll be able to see it as a circle only if you have a three-dimensional plotting program.

However, the 9990 points plotted are not independent, as can also be seen from the plots of "x", "y", and "z" versus "t". The effect of this can be seen more clearly by looking at only points up to iteration 1000:

> dist-plt x y glog.met,1 11:1000 | plot-points

This plot will probably not show points distributed perfectly uniformly around the ring. Instead, there will be clumps here and there, which result from inadequate sampling. So to get a sample of points that are a good representation of the distribution, this chain needs to be run for quite a few iterations.

We can get a quantitative idea of how poor the sampling is with the following command:

> dist-tbl x glog.met,1 11: | series mac 50

    Number of realizations: 1  Total points: 9990

    Sample mean: 0.403307  S.E. from correlations: 0.435419

      Lag  Autocorr.  Cum. Corr.

        1   0.951543    2.903085
        2   0.908311    4.719707
        3   0.866078    6.451863
        4   0.826428    8.104720
        5   0.790122    9.684965
        6   0.755438   11.195842
        7   0.722632   12.641105
        8   0.691039   14.023184
        9   0.661355   15.345894
       10   0.631852   16.609599
       11   0.604368   17.818335
       12   0.579568   18.977472
       13   0.555770   20.089013
       14   0.533468   21.155949
       15   0.510170   22.176290
       16   0.488388   23.153066
       17   0.468346   24.089758
       18   0.449704   24.989165
       19   0.429849   25.848863
       20   0.413143   26.675149
       21   0.396623   27.468396
       22   0.381805   28.232006
       23   0.366428   28.964862
       24   0.351633   29.668128
       25   0.336837   30.341802
       26   0.323512   30.988826
       27   0.308738   31.606303
       28   0.294569   32.195441
       29   0.282383   32.760207
       30   0.270926   33.302059
       31   0.260250   33.822559
       32   0.248232   34.319023
       33   0.235130   34.789283
       34   0.221089   35.231461
       35   0.209061   35.649584
       36   0.198239   36.046063
       37   0.189368   36.424798
       38   0.181272   36.787342
       39   0.174378   37.136097
       40   0.167159   37.470416
       41   0.159599   37.789614
       42   0.152854   38.095322
       43   0.145695   38.386712
       44   0.139347   38.665405
       45   0.131296   38.927997
       46   0.124033   39.176064
       47   0.118028   39.412119
       48   0.111391   39.634901
       49   0.106981   39.848864
       50   0.103838   40.056539

The 'dist-tbl' command above outputs a list of values for "x" for iterations from 11 on. The 'series' command with options "mac" finds the mean of these numbers, their autocorrelations (out to lag 50 here), and the cumulative correlations.
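The "Cum. Corr." column can be understood as follows: the cumulative correlation at lag L is 1 + 2*(rho_1 + ... + rho_L), where rho_k is the estimated autocorrelation at lag k. For example, 1 + 2*0.951543 is (up to rounding) the 2.903085 shown at lag 1. A rough Python sketch of this computation (my own, using the usual simple autocorrelation estimator, not the software's actual code) is:

    import numpy as np

    # Illustrative sketch only - not code from this software.
    def cumulative_correlations(xs, max_lag):
        xs = np.asarray(xs, dtype=float) - np.mean(xs)
        var = np.mean(xs * xs)
        # Autocorrelation estimates at lags 1..max_lag.
        rho = [np.mean(xs[:-k] * xs[k:]) / var
               for k in range(1, max_lag + 1)]
        # Cumulative correlation: 1 + 2*(rho_1 + ... + rho_L).
        return 1 + 2 * np.cumsum(rho)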
The cumulative correlation at the earliest lag by which the autocorrelations have fallen to about zero indicates the factor by which sampling is made inefficient by the correlations (here, about 40); see Ex-dist-n.doc for more details.

From symmetry, we know that the true mean for "x" is zero. The estimate of 0.403307 above is consistent with this, in view of the estimated standard error of +-0.435419. (Note that differences from the true value of up to about twice the standard error are plausible.) We can also get estimates using 'dist-est', but as discussed in Ex-dist-n.doc, the standard errors it produces do not account for autocorrelation.

We might try to improve the efficiency of sampling by changing the standard deviation of the Metropolis proposal distribution - which is also known as the "stepsize" for the operation. One indication of whether the stepsize is appropriate is the rejection rate for the Metropolis operations, which can be viewed with a command such as

> dist-plt t r glog.met,1 | plot

Here, the rejection rate is about 0.75, which is acceptable. Very low or very high rejection rates are usually an indication that sampling would work better with a different stepsize. Although the stepsize of 1 that was used above appears to be OK, we could try a smaller stepsize with the following commands (following a 'dist-spec' command):

> mc-spec glog.met,0.2 repeat 50 metropolis 0.2 end
> dist-mc glog.met,0.2 10000

Or we might try a larger stepsize:

> mc-spec glog.met,5 repeat 50 metropolis 5 end
> dist-mc glog.met,5 10000

If enough iterations are done, the same estimates should be obtained from all these chains, but some stepsizes will produce a more efficient chain than others. By examining plots of how the state variables change, and looking at the autocorrelations with 'series', one can conclude that sampling is much less efficient with a stepsize of 0.2 than with a stepsize of 1 (about ten times less efficient, based on cumulative correlations). With a stepsize of 5, the sampling is about as good as (or even better than) with a stepsize of 1, even though the rejection rate is quite high. This is a phenomenon that occurs only in problems with an effective dimensionality of three or less - for higher-dimensional problems, a rejection rate close to one is generally an indication of poor sampling.

Single-variable Metropolis updates.

We can also try sampling using Metropolis updates that change only one variable at a time. This is done using "met-1" operations, specified as follows:

> mc-spec glog.met1,1 repeat 18 met-1 1 end

As with "metropolis" operations, we specify a "stepsize", which is the standard deviation of the proposed change to a variable. Each "met-1" operation tries to change each variable in turn, accepting or rejecting the change based on the change in energy that results from making the proposed change to just that variable. Since there are three state variables for this distribution, a single "met-1" operation must therefore calculate the energy three times, and hence takes about three times as long as a "metropolis" operation. To facilitate comparisons, the repeat count is correspondingly smaller in this specification. As before, we can now sample for 10000 iterations using a 'dist-mc' command:

> dist-mc glog.met1,1 10000

You can see how well this method samples in the same ways as discussed above. You could also try sampling using "met-1" with stepsizes of 0.2 and 5. You should see that the rejection rate with "met-1" is lower than with "metropolis" operations using the same stepsize. Nevertheless, sampling from this distribution seems to be less efficient with "met-1" than with "metropolis". This is not always so, however. For distributions where at least some of the variables are close to being independent, updating one variable at a time can be more efficient. It is also sometimes possible to save computation time when recomputing the energy after a change to just one variable, though that possibility is not presently exploited by this software.
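For illustration, here is a Python sketch (again mine, not the software's actual code) of a single "met-1" operation, updating each variable in turn with its own accept/reject decision; "energy" is assumed to be the energy function defined in the earlier sketch:

    import numpy as np

    # Illustrative sketch only - not code from this software.
    rng = np.random.default_rng(1)

    def met_1(s, stepsize, energy):
        s = s.copy()
        for i in range(len(s)):
            old, e_old = s[i], energy(s)
            # Propose a change to just this one variable.
            s[i] = old + stepsize * rng.standard_normal()
            # Accept or reject based on the resulting change in energy.
            if rng.random() >= np.exp(min(0.0, e_old - energy(s))):
                s[i] = old          # rejected: restore the old value
        return s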
Single-variable slice sampling.

Variables can also be updated one at a time using single-variable slice sampling, which is described in my tech report on "Markov chain Monte Carlo methods based on `slicing' the distribution" (available from my web page), or in the published paper, called "Slice sampling". Several variations on this procedure are implemented in this software. The method in which the slice is found by "stepping out" can be done as follows:

> mc-spec glog.slc1,1 repeat 4 slice-1 1 end
> dist-mc glog.slc1,1 10000

This does single-variable slice sampling using an initial interval of size 1, which is extended in steps of the same size until both ends are outside the slice. The "doubling" procedure is also implemented, but is not illustrated here.

The "e" quantity records the average number of energy function evaluations done in the slice sampling updates for one iteration. We can find the average of this quantity over all iterations with a command such as

> dist-tbl e glog.slc1,1 | series m

    Number of realizations: 1  Total points: 10000

    Sample mean: 5.381167

Note that 5.381167 is the average number of evaluations for updating one variable, not for updating all three of them. As with the Metropolis methods, performance varies with the stepsize chosen. However, one advantage of single-variable slice sampling is that it is a bit less sensitive to the choice of stepsize than the single-variable Metropolis algorithm.

Multivariate slice sampling.

We can also use variations of slice sampling in which all variables are updated simultaneously (as described in my "Slice sampling" paper). The simplest such scheme randomly places a hyperrectangle containing the current point, picks points randomly from it, and shrinks it when the point chosen is outside the slice, until a point inside the slice is finally found. This can be done with an 'mc-spec' command such as the following:

> mc-spec glog.slc,5 repeat 13 slice 5 end

This works about as well as multivariate Metropolis with a stepsize of 1 or 5.

One can also specify that shrinkage is to occur only in the coordinate direction where the product of the energy gradient and the dimension of the hyperrectangle is greatest in magnitude. The following command does this, with the number of repetitions set so that the time per iteration is about the same:

> mc-spec glog.slcg,5 repeat 6 slice -g 5 end

For this problem, using the gradient information with -g (or -G, another variant) gives little or no advantage, after accounting for the extra time needed to compute the gradient. However, for problems where variables have greatly differing scales (not compensated for by differing stepsizes), the -g and -G options can be very beneficial.

One can also try multivariate slice sampling with Gaussian "crumbs" rather than hyperrectangles:

> mc-spec glog.sgau,5 repeat 13 slice-gaussian -e 5 end

The -e option results in the Gaussian distribution being shrunk on the basis of the energy of the rejected trial point.
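To make the hyperrectangle scheme concrete, here is a Python sketch of one such update (my own illustration, under the same assumptions as the earlier sketches; the real "slice" operation has options this ignores):

    import numpy as np

    # Illustrative sketch only - not code from this software.
    rng = np.random.default_rng(1)

    def slice_update(x0, w, energy):
        # Slice level in energy terms: E must stay below e_y, where e_y
        # is the current energy plus an Exponential(1) draw (ie,
        # log y = log f(x0) - Exp(1)).
        e_y = energy(x0) + rng.exponential()
        # Randomly position a hyperrectangle of side w so it contains x0.
        low = x0 - w * rng.random(len(x0))
        high = low + w
        while True:
            x1 = low + (high - low) * rng.random(len(x0))
            if energy(x1) < e_y:
                return x1           # inside the slice: accept this point
            # Outside the slice: shrink the hyperrectangle toward x0.
            for i in range(len(x0)):
                if x1[i] < x0[i]:
                    low[i] = x1[i]
                else:
                    high[i] = x1[i]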
Sampling with Hamiltonian dynamics.

It is possible to sample much more efficiently by suppressing the random walk behaviour that the methods above all exhibit. One way of doing this is by adding "momentum" variables, which keep the state moving in the same direction for an extended period of time. The original "position" variables along with these "momentum" variables can be updated by applying Hamiltonian dynamics for some period of fictitious time, implemented by performing some number of "leapfrog" steps with a specified stepsize. To produce an ergodic Markov chain, the momentum should also be changed using "heatbath" updates, but it should not be changed too quickly, as this will re-introduce random walk behaviour. For a detailed discussion of this "stochastic dynamics" method, see my book, Bayesian Learning for Neural Networks, my review paper, Probabilistic Inference Using Markov Chain Monte Carlo Methods, or my later review, MCMC Using Hamiltonian Dynamics.

We can try out this dynamical method as follows:

> mc-spec glog.dyn,0.3 repeat 40 heatbath 0.98 dynamic 1 0.3 end
> dist-mc glog.dyn,0.3 10000

The argument of "heatbath" is the factor by which to multiply the old momentum variables (after which noise is added). A value of 1-d results in random walks being suppressed for around 1/d iterations.

If you now look at the state variables with a command such as

> dist-plt t xyz glog.dyn,0.3 | plot

you will see that they are initially far from their stationary distribution, but after about 50 iterations they settle down, and from there on the chain samples very well. The initial behaviour of the chain can be understood by looking at what is happening to the "kinetic energy", which is half the sum of squares of the momentum variables:

> dist-plt t K glog.dyn,0.3 | plot

Initially, the kinetic energy (as well as the "potential" energy, which is what is specified in 'dist-spec') is very large. It is only slowly dissipated, as a result of the "heatbath" updates of the momentum variables. Eventually, however, the kinetic energy reaches its equilibrium distribution (around a value of 3/2), and the chain samples from approximately the desired distribution. This method is not exact, however, because the Hamiltonian dynamics is simulated inexactly, biasing the results. The hybrid Monte Carlo (also called Hamiltonian Monte Carlo) method eliminates this bias by using an acceptance test.

Hybrid/Hamiltonian Monte Carlo.

Several variations of the hybrid/Hamiltonian Monte Carlo method are supported by this software. In the "standard" method, each iteration starts by picking completely new values for the momentum variables with the "heatbath" operation. Hamiltonian dynamics is then simulated for some number of leapfrog steps, using some specified stepsize, and the end-point of the simulated trajectory is accepted or rejected based on the change in the total energy.

The following commands apply the standard hybrid Monte Carlo method, using trajectories of 45 leapfrog steps, done with a stepsize of 0.3. To avoid problems with large initial energies, a few Metropolis updates are done at the beginning.

> mc-spec glog.hmc,0.3 repeat 50 metropolis 1
> dist-mc glog.hmc,0.3 1
> mc-spec glog.hmc,0.3 heatbath hybrid 45 0.3
> dist-mc glog.hmc,0.3 10000

The length of the trajectory should be chosen based on the number of steps for which we want to suppress random walks - longer for more difficult problems where it takes many steps to get from one end of the distribution to the other.
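A Python sketch of one such iteration may help make this concrete (again my own illustration, not this software's code; "grad_energy" is an assumed function returning the gradient of the energy):

    import numpy as np

    # Illustrative sketch only - not code from this software.
    rng = np.random.default_rng(1)

    def hmc(q, n_steps, eps, energy, grad_energy):
        p = rng.standard_normal(len(q))       # heatbath: fresh momentum
        H0 = energy(q) + 0.5 * np.dot(p, p)   # total energy at start
        qn, pn = q.copy(), p.copy()
        pn -= 0.5 * eps * grad_energy(qn)     # initial half step for momentum
        for i in range(n_steps):
            qn += eps * pn                    # full step for position
            if i < n_steps - 1:
                pn -= eps * grad_energy(qn)   # full step for momentum
        pn -= 0.5 * eps * grad_energy(qn)     # final half step for momentum
        H1 = energy(qn) + 0.5 * np.dot(pn, pn)
        # Accept or reject the end-point based on the change in total energy.
        return qn if rng.random() < np.exp(min(0.0, H0 - H1)) else q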
As with the Metropolis methods, you can check the rejection rate with a command such as the following (the "2:" causes the first iteration, which consisted of Metropolis updates, to be excluded):

> dist-tbl r glog.hmc,0.3 2: | series m

    Number of realizations: 1  Total points: 9999

    Sample mean: 0.039004

It is also useful to look at the changes in energy on which the decisions to accept or reject were made:

> dist-plt t D glog.hmc,0.3 2: | plot-points

If most of these points are much greater than one, the rejection rate will be high. With a stepsize of 0.3, the change in energy is seldom greater than 0.5, but if the stepsize is increased to 0.8, much larger changes are often seen, and with a stepsize of 1, almost no trajectories are accepted.

We can see how well the chain is sampling by plotting the state variables, as described above, and by looking at the autocorrelations, with a command such as:

> dist-tbl x glog.hmc,0.3 10: | series mac 10

    Number of realizations: 1  Total points: 9991

    Sample mean: -0.023125  S.E. from correlations: 0.126909

      Lag  Autocorr.  Cum. Corr.

        1   0.538886    2.077772
        2   0.282877    2.643526
        3   0.146972    2.937470
        4   0.066028    3.069527
        5   0.033698    3.136924
        6   0.023167    3.183258
        7   0.020455    3.224167
        8   0.029720    3.283607
        9   0.025383    3.334374
       10   0.019557    3.373488

From the cumulative correlations, we can estimate that using points from this chain to estimate the expectation of "x" is a factor of only about 3.4 less efficient than using independent points. This is about ten times better than the best of the chains described above. On the other hand, we have seen that some care is needed to pick an appropriate stepsize for hybrid Monte Carlo, and it is also often necessary to start at a reasonably good point, as was done here by doing a few Metropolis updates at the beginning.

The "persistent" form of hybrid Monte Carlo (described in Bayesian Learning for Neural Networks) is also implemented. Here are the appropriate commands:

> mc-spec glog.phmc,0.18 repeat 50 metropolis 1
> dist-mc glog.phmc,0.18 1
> mc-spec glog.phmc,0.18 repeat 35 heatbath 0.99 hybrid 1 0.18 negate end
> dist-mc glog.phmc,0.18 10000

The use of a "heatbath" operation with a non-zero "decay" causes the momentum to change only slowly. Because of this, even though only a single leapfrog update is done, random walk behaviour will still be suppressed. The "negate" operation negates the momentum, which normally undoes a negation at the end of the "hybrid" operation. If the "hybrid" update was rejected, however, the first negation will not have been done, and movement will be reversed, undermining the random walk suppression. To avoid this, a smaller stepsize is used here, to keep the rejection rate very small. Performance is a bit worse than for the standard hybrid Monte Carlo method.

Reflective slice sampling.

Another way of avoiding random walks is to apply slice sampling to all variables at once, and to sample from within the multivariate slice by a sort of dynamical method that proceeds at a constant speed while reflecting off the boundaries of the slice. We would rather not compute these boundaries exactly, however. Instead, we can proceed in steps of some size, and when we cross the boundary, reflect at either the last inside point or the first outside point. These reflection operations are based on the gradient of the energy at that point.
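The reflection itself is simple: if p is the momentum and g is the gradient of the energy at the reflection point, the reflected momentum is p - 2 (p.g / g.g) g. In Python (my own notation, for illustration only):

    import numpy as np

    # Illustrative sketch only - not code from this software.
    def reflect(p, g):
        # Reflect p in the hyperplane orthogonal to the gradient g.
        return p - 2 * (np.dot(p, g) / np.dot(g, g)) * g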
Slice sampling with reflection from the last inside point is done as follows:

> mc-spec glog.slci,0.3 heatbath slice-inside 35 0.3
> dist-mc glog.slci,0.3 10000

The "heatbath" operation picks a random momentum vector, which sets the speed of motion and the initial direction. We then find the slice and take 35 steps within the slice with that velocity, using a stepsize of 0.3 (ie, the change in the state at each step is 0.3 times the momentum vector). If a step takes us from inside the slice to outside the slice, we backtrack, and change the momentum vector by reflection based on the gradient. It is necessary to check that this reflection would also occur in the reversed trajectory; if not, we must reject by negating the momentum, causing further steps to retrace the trajectory. Rejections will be less frequent if the stepsize is small.

Slice sampling with outside reflection is done as follows:

> mc-spec glog.slco,0.3 heatbath slice-outside 45 0.3
> dist-mc glog.slco,0.3 10000

This method operates as described for inside reflection, except that when a step ends at a point outside the slice, we stay there, but reflect based on the gradient at that point. If the final point is inside the slice, we accept it. Otherwise we reject, and return to the point from which the trajectory started.

Both inside and outside reflection work reasonably well for this distribution - not quite as well as hybrid Monte Carlo, but better than the other methods.

Overrelaxed slice sampling.

Random walks can also be suppressed for some distributions by using single-variable slice sampling with "overrelaxation", in which variables are changed from their current point to the corresponding point on the "other side" of their conditional distribution. This can be done with the following commands:

> mc-spec glog.slcv,0.3 slice-over 8 0.1 0.3
> dist-mc glog.slcv,0.3 10000

The first argument of "slice-over" is the number of "refinements" used to more accurately pin down the end-points of the slice (using bisection). More refinements result in an update that is closer to an ideal overrelaxed update, and also reduce the probability of rejection (caused by the overrelaxed point being outside the slice). The second argument specifies the probability of doing an ordinary single-variable slice sampling update rather than an overrelaxed one. Occasional ordinary updates are necessary in order to ensure ergodicity, but they should not be done too often if random walks are to be suppressed. The final argument is the stepsize to use when finding the endpoints of the slice initially.

For this distribution, overrelaxation does not work as well as the dynamical methods or reflective slice sampling, but it is effective for some other distributions.
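The overrelaxed update for one variable is easy to state: if the slice through the current point x0 has endpoints "left" and "right", the new point is x0 reflected through the middle of the slice. It is accepted only if it lies inside the slice, which it may not when the endpoints have been located only approximately. A one-line Python sketch (my own, ignoring the stepping-out and bisection refinements):

    # Illustrative sketch only - not code from this software.
    def overrelax(x0, left, right):
        # Reflect x0 through the midpoint of the slice [left, right].
        return left + right - x0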
Performance summary.

The following table summarizes the inefficiency factor for estimating the expectation of the "x" variable using the various methods (the autocorrelation time), determined from estimates of the cumulative correlations up to when the autocorrelations are near zero. The repetition counts for the methods were set so that they all required about half a second of computation time for 10000 iterations on the machine described in Ex-test-system.doc (between 0.42 and 0.52 seconds, except as noted below). Hence the autocorrelation time shown below is a good guide to their relative efficiency.

    METHOD           STEPSIZE   AUTOCORR.   NOTES
                                  TIME

    metropolis         0.2        450
                       1           40
                       5           25

    met-1              0.2        550
                       1           90
                       5          180

    slice-1            0.2        125      Computation time was 0.58 seconds
                       1           80
                       5          100

    slice              5           40
    slice -g           5           30
    slice -G           5           55
    slice-gaussian     5           65

    dynamic            0.3          3.0    This method is not exact
    hybrid (HMC)       0.3          3.2
    persistent HMC     0.18         8.9

    slice-inside       0.3         11
    slice-outside      0.3          4.6
    slice-over         2           30

The relative performance of the different methods will of course be somewhat different for different distributions, but large gains from using hybrid Monte Carlo (HMC) or other methods that suppress random walks are common.