MODELING PROBABILITIES FOR CATEGORICAL DATA

As an example of a model for categorical data, I will show here how to model
probabilities for targets that come from the set {0,1,2}, with the prior
distribution for these probabilities being uniform over the simplex of valid
probabilities.  This example illustrates one way of representing
probabilities using unbounded real parameters.  It can easily be generalized
to arbitrary Dirichlet priors.  I also show how to compute the marginal
likelihood for this model by using Annealed Importance Sampling, and also by
using simple importance sampling (which can be done as a degenerate case of
AIS).

Specifying the model.

The probabilities for the three possible values of the target will be
represented using the parameters w0, w1, and w2, with the probability of the
value "0" being Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)], and similarly for the
other values.  These probabilities will be positive and sum to one for any
values of the parameters.  This is necessary if we are to use this software,
since it doesn't allow state variables to be constrained.

With this representation, the model can be specified by the following
somewhat formidable command:

    > dist-spec plog.met \
        "w0~ExpGamma(1,1) + w1~ExpGamma(1,1) + w2~ExpGamma(1,1)" \
        "-Delta(t-0)*(w0-LogSumExp(w0,w1,w2)) \
         -Delta(t-1)*(w1-LogSumExp(w0,w1,w2)) \
         -Delta(t-2)*(w2-LogSumExp(w0,w1,w2))"

These priors for w0, w1, and w2 will produce a uniform prior for the
resulting probabilities.  More generally, if the priors for w0, w1, etc. are
ExpGamma(a1,1), ExpGamma(a2,1), etc., the prior distribution for the
probabilities Exp(w0)/[Exp(w0)+...], Exp(w1)/[Exp(w0)+...], and so forth
will be Dirichlet with parameters a1, a2, etc.

The terms in the likelihood are constructed using the "Delta" function,
which has the value one when its argument is zero, and the value zero
otherwise.  Since the target, t, is in the set {0,1,2}, exactly one of the
terms in the likelihood will be non-zero for each case.  When the target has
the value "0", this minus-log-likelihood term should be minus the log of the
probability of the target being "0", that is

    - Log { Exp(w0) / [ Exp(w0) + Exp(w1) + Exp(w2) ] }

In the specification above, this is written as

    - (w0 - LogSumExp(w0,w1,w2))

using the "LogSumExp" function, which computes the log of the sum of the
exponentials of its arguments.  This is shorter and faster.  More
importantly, the LogSumExp function is guaranteed to produce a valid result
even when computing Exp(w0), Exp(w1), or Exp(w2) would result in floating
point overflow or underflow.  This is probably not crucial for this example,
but it is important in some similar contexts.

Specifying the data source.

The data is stored one item per line in the file "pdata".  This source is
specified with the following command:

    > data-spec plog.met 0 1 3 / pdata .

The arguments after the log file are the number of input variables (0), the
number of target variables (1), and an argument (3) that indicates that the
targets are categorical, with three possible values, represented by the
integers 0, 1, and 2.  This argument could have been left out, but including
it causes the data to be checked to ensure that each item is in the set
{0,1,2}.  The remaining arguments give the files where the inputs (all zero
of them) and the targets come from, with the "." for the latter indicating
that the targets come from the same place as the inputs.  The file "pdata"
contains 17 items, of which 6 are "0", 8 are "1", and 3 are "2".
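Before moving on to sampling, here is a small Python sketch (not part of
this software, and assuming NumPy and SciPy are available) that ties the
pieces above together: it computes the class probabilities from (w0,w1,w2)
as a softmax, and evaluates minus the log likelihood of the 17 items in
"pdata" (6 zeros, 8 ones, 3 twos) using the same log-sum-exp trick as in the
specification above.

    import numpy as np
    from scipy.special import logsumexp

    counts = np.array([6, 8, 3])        # "pdata": 6 zeros, 8 ones, 3 twos

    def class_probs(w):
        # softmax of (w0,w1,w2): positive and summing to one for any real w;
        # subtracting logsumexp(w) avoids overflow/underflow from Exp(w) itself
        return np.exp(w - logsumexp(w))

    def minus_log_lik(w):
        # sum over the 17 cases of -(w_t - LogSumExp(w0,w1,w2)), grouped by value
        return -np.sum(counts * (w - logsumexp(w)))

    w = np.array([0.2, -1.0, 0.5])      # arbitrary parameter values
    print(class_probs(w), minus_log_lik(w))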
Sampling with the Metropolis algorithm.

We can sample from the posterior distribution using any of various Markov
chain methods.  Here's one example using the Metropolis algorithm:

    > mc-spec plog.met repeat 10 metropolis 1
    > dist-mc plog.met 1000

This takes about eight seconds on our machine.  Convergence is quite rapid,
and the sampling efficiency is reasonably good.  We can estimate the
probabilities of the three possible values as follows:

    > dist-est "Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]" plog.met 100:

    Number of sample points: 901

    Estimates for Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.349174  (standard error 0.00342273)
      Std.dev: 0.102739

    NOTE: The standard errors assume points are independent

    > dist-est "Exp(w1)/[Exp(w0)+Exp(w1)+Exp(w2)]" plog.met 100:

    Number of sample points: 901

    Estimates for Exp(w1)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.450742  (standard error 0.00354029)
      Std.dev: 0.106268

    NOTE: The standard errors assume points are independent

    > dist-est "Exp(w2)/[Exp(w0)+Exp(w1)+Exp(w2)]" plog.met 100:

    Number of sample points: 901

    Estimates for Exp(w2)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.200084  (standard error 0.00290151)
      Std.dev: 0.0870937

    NOTE: The standard errors assume points are independent

Note that since the points are not entirely independent, the standard errors
are not quite right.  For comparison, the exactly correct posterior means of
the three probabilities (found analytically) are 0.35, 0.45, and 0.2.

Computing the marginal likelihood with Annealed Importance Sampling.

We can use Annealed Importance Sampling (AIS) to find the prior probability
of the observed data given this model - sometimes called the "marginal
likelihood".  This is the normalizing constant for the posterior
distribution, if it is specified as the product of the prior and the
likelihood.  Annealed Importance Sampling (which is described in a technical
report of this name, available from my web page) can also help in handling
isolated modes, though this should not be a problem for this model.

To use AIS, we specify the model and the data source as above, and then use
the following commands:

    > mc-temp-sched plog.ais 0.1 0.3 0.6
    > mc-spec plog.ais AIS repeat 10 metropolis 0.5
    > dist-mc plog.ais 1000

The 'mc-temp-sched' command specifies a "tempering schedule": a series of
distributions that the annealing run will pass through.  For a Bayesian
model, these distributions are created by multiplying the log likelihood
part of the "energy" by various "inverse temperatures", listed as arguments
to 'mc-temp-sched'.  A final distribution with inverse temperature of 1 is
assumed at the end of the schedule.  This final distribution is the
posterior, which we wish to sample from, and to find the normalizing
constant for.

The chain proceeds sequentially through these distributions.  The "AIS"
operation moves to the next distribution in the schedule, and updates a
"weight" according to the ratio of probabilities under the new and old
distributions.  If the chain is at the last distribution (at inverse
temperature 1), it moves to the first distribution (here, at inverse
temperature 0.1), after first drawing a new state from the distribution at
inverse temperature 0, which for a Bayesian model is the prior.  There are
four temperatures in the tempering schedule here (0.1, 0.3, 0.6, and the
final 1), and each iteration does one "AIS" operation, so the 1000
iterations will produce 250 annealing runs.  Markov chain operations such as
"metropolis" operate as usual, except that the "energy function" is modified
by the inverse temperature.
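To make the mechanics concrete, here is a minimal Python sketch of the same
annealing scheme, written without this software: 250 independent runs, each
starting with a draw from the prior and passing through inverse temperatures
0.1, 0.3, 0.6, and 1, with the log of the weight updated at each step by the
change in inverse temperature times the log likelihood.  The sketch assumes
that ExpGamma(1,1) is the distribution of the log of a Gamma(1,1) variate
(consistent with the Dirichlet correspondence noted earlier), and it
substitutes a simple random-walk Metropolis update for the software's
'metropolis' operation; the step size and random seed are illustrative
choices only.

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(1)
    counts = np.array([6, 8, 3])              # the data: 6 zeros, 8 ones, 3 twos
    betas = [0.1, 0.3, 0.6, 1.0]              # inverse temperatures, ending at 1

    def log_lik(w):                           # log likelihood of all 17 items
        return np.sum(counts * (w - logsumexp(w)))

    def log_prior(w):                         # ExpGamma(1,1) log density: w - exp(w)
        return np.sum(w - np.exp(w))

    def sample_prior():                       # log of a Gamma(1,1) variate, per parameter
        return np.log(rng.gamma(1.0, 1.0, size=3))

    def metropolis(w, beta, reps=10, step=0.5):
        # random-walk Metropolis leaving prior * likelihood^beta invariant
        for _ in range(reps):
            prop = w + step * rng.standard_normal(3)
            a = (log_prior(prop) + beta * log_lik(prop)) \
                - (log_prior(w) + beta * log_lik(w))
            if np.log(rng.uniform()) < a:
                w = prop
        return w

    log_weights = []
    for run in range(250):                    # 250 independent annealing runs
        w = sample_prior()                    # start from the prior (beta = 0)
        lw, prev = 0.0, 0.0
        for beta in betas:
            lw += (beta - prev) * log_lik(w)  # weight update on moving to this beta
            w = metropolis(w, beta)           # then update the state at this beta
            prev = beta
        log_weights.append(lw)

    log_weights = np.array(log_weights)
    print("log marginal likelihood estimate:",
          logsumexp(log_weights) - np.log(len(log_weights)))

The mean of the weights (computed above on the log scale) estimates the
marginal likelihood, which is what 'dist-est' reports below.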
For AIS to operate, it must be possible to sample from the prior.  If the
prior is specified entirely as a sum of "~" terms, with no circular
references, the software knows how to sample from it.  Otherwise, you must
provide a separate program to do the sampling.  For how this is done, see
the documentation on the -read-prior option in dist-spec.doc, and also the
information in dist-mc.doc.

The commands above take about nine seconds on our machine.  Once they have
finished, we can estimate functions of state with 'dist-est', which will
automatically look at only the iterations that pertain to the final
distribution of interest.  It also computes weighted estimates, using the
weights found during the annealing runs.  Here is an example:

    > dist-est "Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]" plog.ais

    Number of sample points: 250

    Variance of normalized weights: 1.21148
    Adjusted sample size: 113.0  (reduction factor 0.452)

    Estimates for importance weights:

      Mean of weights: 2.85948e-09  (standard error 1.99056e-10)
      Log of mean:     -19.6726  (standard error 0.0696127)

    Standard estimates for Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.342344  (standard error 0.00925806)
      Std.dev: 0.100527

      Effective sample size: 117.9  (reduction factor 0.472)

    Jacknife estimates for Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.342353  (standard error 0.00934127)

      Effective sample size: 115.8  (reduction factor 0.463)

    NOTE: The standard errors assume points are independent

Note that there is no need to discard any of the early iterations, since
each annealing run is independent.  Because of this independence, the
standard errors computed are valid, subject to the usual caveats regarding
the possibility that some important part of the distribution has been missed
entirely.

Along with an estimate for the expectation of the specified function,
'dist-est' outputs an estimate for the mean of the weights, which for a
Bayesian model will be the marginal likelihood, provided the likelihood
specification included all terms, even those that are constant.  The log of
this estimate is also output, since the value will often be extremely small
or large.

For this model, the marginal likelihood can be computed exactly by
multiplying successive predictive probabilities.  For this data, it is

       6! 8! 3!
      ----------  =  2.86378e-9,   whose log is -19.67112
       19! / 2!

The estimates and estimated standard errors above are consistent with this.
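As a quick check of this arithmetic (again as a small Python sketch, outside
this software):

    from math import factorial, log

    ml = factorial(6) * factorial(8) * factorial(3) / (factorial(19) / factorial(2))
    print(ml, log(ml))    # approximately 2.86378e-9 and -19.67112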
This problem is actually easy enough that the marginal likelihood can be
found by simple importance sampling.  This can be viewed as a degenerate
form of Annealed Importance Sampling in which the tempering schedule has
only one distribution - the final one, at inverse temperature 1, which is
the posterior.  We can do simple importance sampling with the following
commands:

    > mc-temp-sched plog.is -
    > mc-spec plog.is AIS
    > dist-mc plog.is 1000

The 'mc-temp-sched' command with the argument "-" sets up the null tempering
schedule.  The 'mc-spec' command has just an "AIS" operation, with no Markov
chain operations.  This causes a new state to be drawn from the prior every
iteration, and the appropriate weight to be computed based on the
likelihood.  Estimates for expectations and for the marginal likelihood can
be found using 'dist-est'.  For example:

    > dist-est "Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]" plog.is

    Number of sample points: 1000

    Variance of normalized weights: 3.85409
    Adjusted sample size: 206.0  (reduction factor 0.206)

    Estimates for importance weights:

      Mean of weights: 2.83221e-09  (standard error 1.75827e-10)
      Log of mean:     -19.6822  (standard error 0.0620814)

    Standard estimates for Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.344737  (standard error 0.00530831)
      Std.dev: 0.104735

      Effective sample size: 389.3  (reduction factor 0.389)

    Jacknife estimates for Exp(w0)/[Exp(w0)+Exp(w1)+Exp(w2)]:

      Mean:    0.344731  (standard error 0.00532859)

      Effective sample size: 386.3  (reduction factor 0.386)

    NOTE: The standard errors assume points are independent

Doing importance sampling takes only a second on our machine, which is
faster than Annealed Importance Sampling.  However, importance sampling
quickly becomes infeasible for problems with more parameters or with more
data.  Annealed Importance Sampling is more feasible for realistic problems,
though some work is needed to set up a good tempering schedule, which may
have to include many more distributions than were needed for this easy
problem (often hundreds or more).  See mc-ais.doc for documentation on how
to obtain information that is helpful in adjusting the tempering schedule.

The software also supports the related methods of simulated tempering,
tempered transitions, and tempered hybrid Monte Carlo.  Annealed importance
sampling and the tempering schemes are also supported for neural network
models (but not for Gaussian process and mixture models).  There is no
tutorial documentation on these features, however, just the detailed command
documentation, and the command files in the 'ex-ais' directory that
implement the tests done in my "Annealed Importance Sampling" paper.