EXAMPLES OF BAYESIAN NEURAL NETWORK IMAGE MODELS

One prominent area of application for neural network models is classification of images. Here, a fairly simple artificial example of such an image classification problem is presented. Several Bayesian neural network models are applied to this task, including ones that use convolutional connections. Fitting such models by simple gradient descent is also demonstrated.

These examples also illustrate how the training data can be divided in several ways into a set of cases used for fitting, and a set of validation cases, which can be used to assess performance. Although Bayesian methods do not require a validation set in theory, in practice some check on whether good performance has been achieved is advisable before actually using predictions from a Bayesian model, to guard against statistical mistakes, poor convergence of MCMC methods, or, if nothing else, out-and-out bugs in the scripts or programs used.

The computation times given below were obtained running on a CPU, with arithmetic done in single precision. Running the calls of net-mc on a GPU might reduce the time considerably.

The command files for this example are in the ex-image sub-directory.


The image classification problem used in these examples.

The program in igen.c generates artificial 6x6 monochrome images in which one of the symbols "+", "X", "O", or "H" appears in some 3x3 patch of the image. The classification task is to predict from the 36 pixels of an image which symbol appears in the image (the location of the patch could be any of sixteen possibilities). The file 'idata' has 21536 generated cases, the first 1536 of which are used for training, and the remaining 20000 for testing.

Generation of an image starts by randomly generating background pixel values from a Gaussian distribution with mean zero and standard deviation one, independently for each of the 36 pixels. A location for a symbol is then randomly generated from the 16 possible positions. (Note that the centre position of the symbol cannot be in the first or last row, or the first or last column, leaving four possible row positions and four possible column positions.) The identity of the symbol is also picked randomly from "+", "X", "O", and "H" (coded as 0, 1, 2, or 3), with each symbol having probability 1/4. The pixels in the 3x3 patch where the symbol is located are then replaced by the following patterns for each symbol:

         "+"         "X"         "O"         "H"

      -1 +1 -1    +1 -1 +1    +1 +1 +1    +1 -1 +1
      +1 +1 +1    -1 +1 -1    +1 -1 +1    +1 +1 +1
      -1 +1 -1    +1 -1 +1    +1 +1 +1    +1 -1 +1

Finally, Gaussian noise with mean zero and standard deviation 0.5 is added to all pixels, and the pixel values are rounded to three decimal places.

When passed an argument, the igen program prints each image, with the true location and identity of the symbol, as well as the probabilities for each symbol that can be inferred from the pixels using the true model of how these images are generated.
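Conceptually, the generation process amounts to something like the following Python sketch. (This is for illustration only - the actual generator is the C program igen.c, and this sketch makes no attempt to match its random number stream or output format.)

    import numpy as np

    PATTERNS = {          # 3x3 pixel patterns for the four symbols
        0: [[-1, +1, -1], [+1, +1, +1], [-1, +1, -1]],   # "+"
        1: [[+1, -1, +1], [-1, +1, -1], [+1, -1, +1]],   # "X"
        2: [[+1, +1, +1], [+1, -1, +1], [+1, +1, +1]],   # "O"
        3: [[+1, -1, +1], [+1, +1, +1], [+1, -1, +1]],   # "H"
    }

    def gen_image(rng):
        image = rng.normal(0.0, 1.0, size=(6, 6))   # background pixels
        row = rng.integers(1, 5)                    # centre row, one of 1..4
        col = rng.integers(1, 5)                    # centre column, one of 1..4
        symbol = rng.integers(0, 4)                 # class, one of 0..3
        image[row-1:row+2, col-1:col+2] = PATTERNS[symbol]  # place the symbol
        image += rng.normal(0.0, 0.5, size=(6, 6))  # add noise to all pixels
        return np.round(image, 3), symbol, (row, col)

    rng = np.random.default_rng(1)
    image, symbol, centre = gen_image(rng)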
Here are three of the images generated:

  Case 2, Class +(0), Centred at 1,3, PP: +0.999 X0.000 O0.001 H0.000

          0      1      2      3      4      5        0 1 2 3 4 5

    0  -0.626 +1.056 -0.845 +0.762 -0.477 -0.837     O # O #   O
    1  +0.723 +0.971 +1.134 +1.040 +1.059 +0.147     # # # # #
    2  -1.095 -2.091 -0.671 +0.971 -0.649 +0.715     O O O # O #
    3  +0.787 -1.511 +1.770 +1.620 +0.221 -0.582     # O # #   O
    4  +0.535 -1.675 -0.308 -0.815 -0.786 +0.273     # O   O O
    5  -0.215 +0.335 -1.835 -0.286 +1.338 +0.336         O   #

  Case 11, Class X(1), Centred at 2,4, PP: +0.157 X0.843 O0.000 H0.000

          0      1      2      3      4      5        0 1 2 3 4 5

    0  -0.818 +0.064 -0.260 +0.159 +0.272 +0.009     O
    1  +0.236 +1.215 -0.756 -0.263 -1.089 +1.030       # O   O #
    2  -1.443 -1.099 +0.870 -0.965 +1.381 -1.830     O O # O # O
    3  -1.044 +1.340 +0.445 +1.512 -1.571 +0.600     O #   # O #
    4  -0.776 +0.444 +1.656 -1.832 +0.346 -0.823     O   # O   O
    5  -0.431 +0.744 +1.261 +3.750 -1.659 +1.955       # # # O #

  Case 60, Class O(2), Centred at 4,1, PP: +0.000 X0.000 O0.418 H0.582

          0      1      2      3      4      5        0 1 2 3 4 5

    0  +1.576 -1.255 +1.379 +1.325 -0.615 +0.105     # O # # O
    1  +0.734 -0.416 -1.230 -0.001 -1.100 +0.948     #   O   O #
    2  +0.837 -1.332 +0.800 -2.698 +1.139 +0.417     # O # O #
    3  +1.236 +0.820 +1.372 -0.339 +0.924 +0.754     # # #   # #
    4  +0.508 -1.105 +0.611 +2.562 -1.724 -1.193     # O # # O O
    5  +0.715 +1.017 +0.755 -0.284 -1.175 +1.185     # # #   O #

The diagrams at the right show the pixels of the image thresholded at values less than -0.5 (O), greater than +0.5 (#), and between -0.5 and +0.5 (space). For the second image (case 11), noise in the pixels has made the symbol less certain (probability 0.843 for the correct symbol) than for the first image (case 2, with probability 0.999 for the correct symbol), and for the third image (case 60), the correct symbol has lower probability (0.418) than one of the incorrect symbols (0.582). All these probabilities are computed assuming the true generation model is known.

The igen program computes the performance on test cases when using the true model, which is an upper limit on what can be accomplished when learning a model from the training cases (unless one is just lucky). Here is the summary:

    Error rate on test cases with true model:                0.069
    Average squared error for test cases with true model:    0.100
    Average log probability for test cases with true model: -0.174


A fully-connected Bayesian neural network model.

We can first try a neural network model that has no built-in knowledge of how the 36 pixel values are arranged in a 6x6 image, or any specific knowledge that the symbol to be detected is in a 3x3 patch of this image. The model simply feeds the 36 inputs into a hidden layer with 80 "softplus" units, which in turn feed into an output layer with 4 units, which are used to define the class probabilities (using the "softmax" scheme), as in the classification example in Ex-netgp-c.doc. This model can be specified with the following commands:

    net-spec $log 36 80 softplus 4 \
             / ih=0.05:3::1.5 bh=0.3:2 ho=0.1:3::4 bo=0.1:2

    model-spec $log class

Here, $log represents the name of the log file. The net-spec command specifies that there are 36 inputs, 80 hidden units with the "softplus" activation function, and 4 output units, which are used to produce class probabilities using the "softmax" scheme.
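For concreteness, here is a Python sketch of the function such a network computes for a single input image. (The parameter names W_ih, b_h, W_ho, b_o, and the random values given to them below, are just for illustration - in the Bayesian approach these parameters are not fixed, but are averaged over their posterior distribution.)

    import numpy as np

    def softplus(a):
        return np.logaddexp(0, a)       # log(1+exp(a)), computed stably

    def class_probabilities(x, W_ih, b_h, W_ho, b_o):
        h = softplus(x @ W_ih + b_h)    # values of the 80 hidden units
        o = h @ W_ho + b_o              # values of the 4 output units
        p = np.exp(o - o.max())         # "softmax": exponentiate the outputs
        return p / p.sum()              #   and normalize to sum to one

    rng = np.random.default_rng(1)
    x = rng.normal(size=36)             # an input image, as a vector
    p = class_probabilities(x, rng.normal(size=(36, 80)),
                            rng.normal(size=80),
                            rng.normal(size=(80, 4)),
                            rng.normal(size=4))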
The prior specification for input-to-hidden weights, of 0.05:3::1.5, does not allow for different inputs to have different relevance (there is no number between the second and third colons, so there is no variation between input groups), since we will assume it is known that the inputs are all almost equally relevant (as the symbol to be recognized might occur anywhere in the image). The final 1.5 in the prior specification gives the input-to-hidden weights a heavy-tailed prior (a t distribution with 1.5 degrees of freedom), allowing for the possibility that the input-to-hidden connections are effectively sparse, with a hidden unit looking mostly at only a few of the pixel values. The hidden-to-output weights also have a (less extreme) heavy-tailed prior, allowing some of the hidden units to have small weights to some of the output units.

A scheme allowing cross-validation assessments of performance is used, in which four training runs are done, each on 3/4 of the training data, with the omitted 1/4 allowing performance to be monitored (to see, for example, whether it is advisable to do further MCMC iterations). Of course, making such assessments based on the test data would be "cheating", since in real life the test targets, and probably the test inputs as well, are not available when training. The data specification used is therefore as follows:

    data-spec $log 36 1 4 / idata-train@-$start:$end . idata-train@$start:$end .

Here, $start and $end represent the range of training cases comprising the validation set (here specified as the "test" set), with the remainder of the 1536 training cases used for fitting the model. In the four runs that are done, $start and $end will be set to 1 and 384, 385 and 768, 769 and 1152, and 1153 and 1536. (These runs will use log files ilog1.net1, ilog1.net2, ilog1.net3, and ilog1.net4.) When final predictions for test cases are made, the "test" data specified in this data-spec command will be overridden by the actual test data, which is in idata-test. The numbers in the data-spec command specify that there are 36 inputs and 1 target, with the target having 4 possible values (coded as 0, 1, 2, and 3).

With the network model and data specified, training is done as follows:

    rand-seed $log $run

    net-gen $log fix 0.1

    mc-spec $log repeat 120 heatbath hybrid 200:20 0.05
    net-mc $log 1

    mc-spec $log repeat 60 sample-sigmas heatbath hybrid 400:40 0.07
    net-mc $log 2

    mc-spec $log repeat 24 sample-sigmas heatbath 0.9 hybrid 1000:100 0.09 negate
    net-mc $log 1500

Here, $run is 1, 2, 3, or 4, identifying which of the four runs this is, which affects the random number seed used (and the log file name). The hyperparameters are initialized to the fairly small value of 0.1, and 120 HMC updates are done with the hyperparameters fixed at this value, so that the parameters will take on reasonable values for the data before any attempt is made to update the hyperparameters. After this, 60 HMC updates are done with the hyperparameters updated before each HMC update, using a moderate number (400) of leapfrog steps. This is intended to get the hyperparameters to reasonable values (given how well the data has been fitted so far). The main sampling iterations are then done, in which updates for the hyperparameters alternate with HMC updates using a longer trajectory of 1000 leapfrog steps. Each of the 1498 saved states from this main sampling phase is the result of 24 such pairs of hyperparameter updates and HMC updates for the parameters.
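In outline, each "hybrid" operation requested above is a Hamiltonian (hybrid) Monte Carlo update: the heatbath draws a fresh momentum, a leapfrog trajectory of the given number of steps is simulated with the given stepsize, and the end point is accepted or rejected. The following generic Python sketch conveys the idea only - it omits, among other things, the "windowed" acceptance procedure and the persistent momentum variant discussed below, which the actual net-mc updates use.

    import numpy as np

    def hmc_update(q, log_post, grad_log_post, n_steps, stepsize, rng):
        p = rng.normal(size=q.shape)                  # heatbath: fresh momentum
        H0 = -log_post(q) + 0.5 * np.sum(p**2)        # initial "energy"
        q_new = q.copy()
        p_new = p + 0.5 * stepsize * grad_log_post(q_new)   # initial half step
        for i in range(n_steps):                      # leapfrog trajectory
            q_new = q_new + stepsize * p_new
            if i < n_steps - 1:
                p_new = p_new + stepsize * grad_log_post(q_new)
        p_new = p_new + 0.5 * stepsize * grad_log_post(q_new)  # final half step
        H1 = -log_post(q_new) + 0.5 * np.sum(p_new**2)
        if rng.uniform() < np.exp(min(0.0, H0 - H1)): # accept or reject based
            return q_new, True                        #   on change in energy
        return q, False

    # Example use: sampling from a standard bivariate Gaussian.
    rng = np.random.default_rng(1)
    q = np.zeros(2)
    for i in range(100):
        q, acc = hmc_update(q, lambda q: -0.5 * np.sum(q**2),
                            lambda q: -q, 20, 0.1, rng)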
The HMC updates all use the "windowed" method, to increase the acceptance rate, with the windows being 1/10 of the trajectory. Relatively small leapfrog stepsizes are used initially, to guard against rejection rates being high when starting at an atypical location. To further reduce random walk behaviour, without reducing the number of hyperparameter updates, the "persistent momentum" version of HMC is used for the final sampling phase.

Each full iteration of this sampling procedure takes about 27 seconds on the system used (Ex-test-system.doc), for a total of about 690 minutes (about 11.5 hours) for the 1500 iterations done. The command script in ex-image/icmds1.net does all four runs in parallel, so this will also be the total wall-clock time if the system used has at least four processor cores.

Various plots can be made to check whether these runs are sampling (or have sampled) well, or whether more iterations (or a better scheme) are needed. The performance of the HMC updates can be checked by plotting the rejection rate for each iteration:

    net-plt t r ilog1.net? | plot-points

As seen in ilog1-r.png (with different colours for each run), most of the iterations have rejection rates below 0.2, but not very close to 0, with no apparent trend, so the stepsize seems reasonable. The average rejection rate can be found as follows:

    net-tbl r ilog1.net? | series m

which gives a mean of 0.032225, small enough that there should not be much problem with trajectory reversals from rejections when using persistent momentum with factor 0.9 (which randomizes the momentum on a time scale of about 1/(1-0.9) = 10 iterations, less than 1/0.032225).

The fit (squared error) for the training data can be checked with

    net-plt t b ilog1.net? | plot-points

As seen in ilog1-b.png, all four runs seem to reach the equilibrium distribution for this value in 100 iterations or less. (The equilibrium distributions differ slightly between runs because they have different training sets.) The squared error for each iteration on the validation cases can be seen with

    net-plt t B ilog1.net? | plot-points

As seen in ilog1-BB.png, this quantity also seems to have reached equilibrium in less than 100 iterations.

For both the 'b' and 'B' quantities, one can see some long-lag autocorrelation, indicating that longer runs would be needed to get really good samples from the posterior distributions. But the run length done here seems adequate for the purpose of producing predictions.

Note that although the average squared error on validation cases is useful in seeing how well the Markov chain is sampling, the error from a single iteration does NOT tell one how good the predictions on these validation cases found by averaging many iterations would be. Poor predictions from a single iteration may be the result of the model producing high-confidence predictions at each iteration that vary from iteration to iteration, so that the averaged prediction has much lower confidence (and perhaps much better performance). So there is not necessarily anything wrong when the error on validation cases from individual networks is high.
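A toy computation illustrates this point: two networks that each make a confident but conflicting prediction average to a much less confident prediction, which can nevertheless score far better than a confident prediction does when it is wrong.

    import numpy as np

    p1 = np.array([0.97, 0.01, 0.01, 0.01])   # prediction from one network
    p2 = np.array([0.01, 0.97, 0.01, 0.01])   # prediction from another network
    avg = (p1 + p2) / 2                        # the averaged prediction

    # If the true class is class 0, the log probability scores are:
    print(np.log(p1[0]))    # about -0.03  (confident and right)
    print(np.log(p2[0]))    # about -4.61  (confident and wrong)
    print(np.log(avg[0]))   # about -0.71  (averaged prediction)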
We can look at the main hyperparameters (standard deviations for input-hidden weights, hidden biases, and hidden-output weights) at each iteration of the first run as follows:

    net-plt t h1h2h3 ilog1.net1 | plot-points-log-scale

This data is best plotted on a log scale, because the three hyperparameters differ greatly in magnitude. The plot is in ilog1-h1h2h3.png, in which h1 (input-to-hidden) is red, h2 (hidden biases) is green, and h3 (hidden-to-output) is blue. From this plot, equilibrium seems to have been reached after about 300 iterations.

Once all runs have finished, we can assess performance on the actual test set as follows (with the first 500 iterations discarded as "burn-in", which seems more than sufficient given the plots above):

    net-pred mnpa ilog1.net1 501: ilog1.net2 501: \
                  ilog1.net3 501: ilog1.net4 501: / idata-test .

The output is as follows:

    Number of iterations used: 4000
    Number of test cases: 20000

    Average log probability of targets:  -1.058+-0.004
    Fraction of guesses that were wrong:  0.4704+-0.0035
    Average squared error guessing mean:  0.58624+-0.00240

Only slightly worse results would be obtained using runs only one tenth as long (150 iterations, taking about an hour). The command

    net-pred mnpa ilog1.net1 71:150 ilog1.net2 71:150 \
                  ilog1.net3 71:150 ilog1.net4 71:150 / idata-test .

produces the following output:

    Number of iterations used: 320
    Number of test cases: 20000

    Average log probability of targets:  -1.061+-0.004
    Fraction of guesses that were wrong:  0.4733+-0.0035
    Average squared error guessing mean:  0.58784+-0.00240

Note that guessing by ignoring the inputs and assigning equal probabilities to the four classes gives an average log probability of -1.386, an error rate of 0.75, and an average squared error of 0.75. So this model performs significantly better than random guessing. However, the error rate of 47% is considerably more than the 6.9% error rate that is achievable with knowledge of the true generating process, so there is much room for improvement.


A convolutional Bayesian neural network model.

Supposing that we know that the symbol to be recognized appears in a 3 pixel by 3 pixel patch of a 6 by 6 image, and that this patch may appear at any position in the image, it makes sense to use a convolutional network, which uses several filters looking at 3x3 patches of the image. Each such filter is defined by the 9 weights on connections from a 3x3 patch at some position in the image to a hidden unit, with the weights being the same for all possible positions. We might expect that four filters, one for each class, would be sufficient, but the network defined here will allow for five filters (computing five output channels), which may make it easier to find the four needed, since a "spare" filter may allow only-locally-optimal modes to be more easily escaped.

The simplest model of this type has a single hidden layer with 80 hidden units (5 filters/channels, each needing 16 units for all possible symbol positions), which connect to the four output units used to define class probabilities. Such a network model can be specified as follows:

    net-spec $log 36 80 softplus 4 \
             / ih=0.2:2 config:iconfig bh=0.3:2 ho=0.1:3::4 bo=0.1:2

    model-spec $log class

The "config:iconfig" option after the specification of the prior for input-to-hidden weights says that the configuration of these weights is specified in the file "iconfig", whose contents are as follows:

    # Configuration for convolutional layer connections for model of test images.

    D=6   # image width and height
    P=3   # patch width and height
    F=5   # number of filters

    [ 1 1 1 ]

    D-P+1{                        # Loop over vertical positions of patch
     D-P+1{                       # Loop over horizontal positions of patch
      P{                          # Loop over rows of patch
       P{                         # Loop over pixels within row of patch
        = = =  F-1{ = + + }       # Connections for all filters
       [ + = +F ] }
      [ +D = = ]  P[ = = +F ] }
     [ + +F 1 ] }
    [ +D = 1 ]  D-P+1[ = +F 1 ] }

See net-config.doc for an explanation of the specification language used above. The effect of this specification is to set up the convolutional connections, with 45 weights defining 5 filters (each looking at 9 pixels), which are shared by 16x9x5=720 connections (fewer than the 36x80=2880 possible connections).
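In conventional terms, the computation these configured connections perform is a 2D convolution of the image with five 3x3 filters (and no padding), as in the following Python sketch. (The layout of the 80 hidden units used here, with filters varying fastest, is just one choice made for illustration, and need not match the ordering that the configuration file produces.)

    import numpy as np

    def conv_hidden_inputs(image, filters):   # image: 6x6, filters: 5x3x3
        inputs = np.empty((4, 4, 5))          # 16 patch positions x 5 filters
        for r in range(4):                    # vertical patch positions
            for c in range(4):                # horizontal patch positions
                patch = image[r:r+3, c:c+3]
                for f in range(5):            # same 9 weights per filter
                    inputs[r, c, f] = np.sum(patch * filters[f])
        return inputs.reshape(80)             # summed inputs of 80 hidden units

    rng = np.random.default_rng(1)
    h_in = conv_hidden_inputs(rng.normal(size=(6, 6)),
                              rng.normal(size=(5, 3, 3)))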
The other commands used are almost the same as for the fully-connected network model above, except that the stepsizes have been increased a bit. Each full iteration takes about 32 seconds on the system used (Ex-test-system.doc), for a total of 13.4 hours (with the four runs done in parallel). This is actually a bit slower than the time for the fully-connected network - there are fewer input-to-hidden connections to handle, but in this case that advantage is more than offset by the overhead of processing the non-contiguous configuration of connections.

As above, we can plot the 'r' (rejection rate), 'b' (squared error on training data), and 'B' (squared error on validation data) quantities for the four runs using net-plt. The results can be seen in ilog2-r.png, ilog2-b.png, and ilog2-BB.png. The main hyperparameter values for the first run can also be plotted, as above, giving the result in ilog2-h1h2h3.png.

Once all runs have finished, we can assess performance on the actual test set as follows:

    net-pred mnpa ilog2.net1 501: ilog2.net2 501: \
                  ilog2.net3 501: ilog2.net4 501: / idata-test .

Here, the first 500 iterations are discarded as "burn-in". This seems mostly sufficient given the plots above, except that the fourth run may not have reached equilibrium until around iteration 700. The output is as follows:

    Number of iterations used: 4000
    Number of test cases: 20000

    Average log probability of targets:  -0.731+-0.006
    Fraction of guesses that were wrong:  0.2661+-0.0031
    Average squared error guessing mean:  0.39534+-0.00321

Only slightly worse results would be obtained using runs only half as long. The command

    net-pred mnpa ilog2.net1 301:750 ilog2.net2 301:750 \
                  ilog2.net3 301:750 ilog2.net4 301:750 / idata-test .

produces the following output:

    Number of iterations used: 1800
    Number of test cases: 20000

    Average log probability of targets:  -0.743+-0.005
    Fraction of guesses that were wrong:  0.2698+-0.0031
    Average squared error guessing mean:  0.40237+-0.00298

These results are considerably better than those obtained above with the fully-connected network, demonstrating the advantage of using a convolutional network when, as here, it is known to fit the structure of the problem.

The role of the convolutional connections can be confirmed by looking at the weights for each of the five filters as 3x3 "receptive fields". This can be done with the "idisp" shell file (which uses R).
The output for the last iteration of the first run is as follows:

    , , 1

         [,1] [,2] [,3]
    [1,]  4.0  3.4  4.5
    [2,]  5.3 -5.3  4.5
    [3,]  4.1  5.7  5.2

    , , 2

         [,1] [,2] [,3]
    [1,]  5.0 -6.0  3.4
    [2,]  3.9  3.5  4.7
    [3,]  4.4 -2.0  4.1

    , , 3

         [,1] [,2] [,3]
    [1,] -4.3  4.8 -4.2
    [2,]  3.7  5.3  3.4
    [3,] -3.2  3.7 -2.7

    , , 4

         [,1] [,2] [,3]
    [1,]  6.9 -2.4  5.1
    [2,] -5.5  4.0 -3.8
    [3,]  4.4 -4.2  2.7

    , , 5

         [,1] [,2] [,3]
    [1,]  0.8 -3.6  3.2
    [2,] -3.9  1.5 -3.6
    [3,]  3.9 -1.9  3.0

One can see that the first filter is sensitive to the "O" symbol, the second to the "H" symbol, the third to the "+" symbol, and the fourth and fifth to the "X" symbol.


Adding another hidden layer, with more assumptions.

The 26.6% error rate for the model with one convolutional hidden layer is still not close to the 6.9% error rate obtained using the true generative model. We can try adding a second hidden layer, to allow more complex computations that could give closer-to-optimal performance. (Though note that performance is also limited by the amount of training data, not just by the capabilities of the network used.)

The architecture used has convolutional connections from the inputs to the first hidden layer, as in the model above, with this hidden layer then fully connected to a second hidden layer of 16 tanh units. The second hidden layer connects to the output units. In this model, the biases on the first layer of hidden units will be constrained to be the same for all positions in the image (ie, there are only 5 bias parameters, one for each of the filters). Similarly, the connections from units in the first hidden layer to the second hidden layer will be the same for all positions. These constraints are based on the assumption that the pattern being looked for is equally likely to occur anywhere in the image.

This network model is specified as follows:

    net-spec $log 36 80 softplus 16 4 \
             / ih=0.2:2 config:iconfig bh0=0.3:2 config:bconfig \
               hh=0.1:2 config:h2config bh1=0.3:2 ho=0.1:2 bo=0.1:2

The bconfig file sets up the biases to be the same for units using the same filter, as follows:

    # Configuration for convolutional layer biases for model of test images.

    D=6   # image width and height
    P=3   # patch width and height
    F=5   # number of filters

    D-P+1(                        # Loop over vertical positions of patch
     D-P+1(                       # Loop over horizontal positions of patch
      + 1  F-1( + + )             # biases for all filters
     )
    )

The h2config file specifies that weights on connections from the first to second hidden layer are shared for units in the first hidden layer that are for the same filter:

    # Configuration for hidden-hidden connections for model of test images.

    D=6   # image width and height
    P=3   # patch width and height
    F=5   # number of filters
    U=16  # number of units in second hidden layer

    D-P+1(                        # Loop over vertical positions of patch
     D-P+1(                       # Loop over horizontal positions of patch
      [ = 0 0 ]
      F(                          # Loop over filters
       [ + 0 = ]
       U( = + + )                 # Loop over units in next hidden layer
      )
     )
    )
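One way to understand the sharing that h2config sets up: since a weight from a first-layer hidden unit to a second-layer unit depends only on which filter the unit belongs to, not its position, the summed input to each second-layer unit can equivalently be computed by first summing each filter's hidden unit values over all 16 positions, and then applying an ordinary 5-by-16 weight matrix. The following Python sketch shows this equivalent form (the array shapes and names are assumptions made for illustration):

    import numpy as np

    def second_layer_inputs(h1, w, b):    # h1: 16x5, w: 5x16, b: 16
        pooled = h1.sum(axis=0)           # per-filter sum over the 16 positions
        return pooled @ w + b             # summed inputs of the 16 tanh units

    rng = np.random.default_rng(1)
    h2_in = second_layer_inputs(rng.normal(size=(16, 5)),
                                rng.normal(size=(5, 16)),
                                rng.normal(size=16))

This makes clear how the constraint encodes the assumption that the symbol's position does not matter: the second layer sees only position-invariant summaries of the filter responses.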
The mc-spec and other commands are the same as for the previous two networks, except for stepsize adjustments. The 1500 iterations for the four runs take 20.1 hours on the system used (Ex-test-system.doc). Plots of the rejection rate, training set error, validation set error, and hyperparameter values (for the first run) can be seen in ilog3-r.png, ilog3-b.png, ilog3-BB.png, and ilog3-h1h2h3h4h5.png.

The results on the test set, using iterations 501 to 1500 of the four runs, are as follows:

    Number of iterations used: 4000
    Number of test cases: 20000

    Average log probability of targets:  -0.554+-0.005
    Fraction of guesses that were wrong:  0.2198+-0.0029
    Average squared error guessing mean:  0.30820+-0.00323

The 22.0% error rate seen here is a considerable improvement over the 26.6% error rate with the simple convolutional network. Almost identical results would be obtained using runs only a tenth as long (taking about two hours). Using iterations 71 to 150 of each run produces these results:

    Number of iterations used: 320
    Number of test cases: 20000

    Average log probability of targets:  -0.556+-0.006
    Fraction of guesses that were wrong:  0.2178+-0.0029
    Average squared error guessing mean:  0.30891+-0.00324