Data set 1.
On this dataset, I found that a learning rate of 0.0002 for 30000
iterations worked well. Runs with two random seeds are plotted, and
produce similar results. The one-dimensional structure of the data is
captured well. The auto-encoder output for each data point is near
the closest point that the network could produce (though the
reconstruction is not perfect in this respect, with some points
reconstructed with larger error than seems necessary).
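The training setup above can be sketched as follows. This is a minimal
NumPy sketch, not the actual program used: the layer sizes, tanh hidden
units, and full-batch gradient descent are my assumptions (only the
learning rate, iteration count, and the one-unit bottleneck implied by
the one-dimensional structure come from the text).

```python
import numpy as np

def train_autoencoder(X, lr=0.0002, iters=30000, n_hidden=20, seed=0):
    """2-D -> 1-D -> 2-D auto-encoder trained by full-batch gradient
    descent on squared reconstruction error.  Layer sizes and tanh
    units are assumptions; lr/iters defaults match the text."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W1 = rng.normal(0, 0.5, (n_in, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, (n_hidden, 1));    b2 = np.zeros(1)
    W3 = rng.normal(0, 0.5, (1, n_hidden));    b3 = np.zeros(n_hidden)
    W4 = rng.normal(0, 0.5, (n_hidden, n_in)); b4 = np.zeros(n_in)
    for _ in range(iters):
        H1 = np.tanh(X @ W1 + b1)          # encoder hidden layer
        Z  = H1 @ W2 + b2                  # one linear bottleneck unit
        H3 = np.tanh(Z @ W3 + b3)          # decoder hidden layer
        Y  = H3 @ W4 + b4                  # linear reconstruction
        dY  = 2 * (Y - X) / n              # grad of mean squared error
        dH3 = (dY @ W4.T) * (1 - H3 ** 2)
        dZ  = dH3 @ W3.T
        dH1 = (dZ @ W2.T) * (1 - H1 ** 2)
        W4 -= lr * (H3.T @ dY);  b4 -= lr * dY.sum(0)
        W3 -= lr * (Z.T @ dH3);  b3 -= lr * dH3.sum(0)
        W2 -= lr * (H1.T @ dZ);  b2 -= lr * dZ.sum(0)
        W1 -= lr * (X.T @ dH1);  b1 -= lr * dH1.sum(0)
    H1 = np.tanh(X @ W1 + b1)
    Z = H1 @ W2 + b2
    Y = np.tanh(Z @ W3 + b3) @ W4 + b4
    return Z, Y          # bottleneck values, reconstructed points
```

Running this twice with different values of seed and scatter-plotting
the reconstructions Y over the data X gives the kind of paired plots
described.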
Data set 2.
Although this data set is similar to data set 1, I found that a
smaller learning rate of 0.00005 was needed for stability; I increased
the number of iterations to 120000 to compensate for the smaller
learning rate. Runs with two random seeds are plotted. Both show
that the one-dimensional structure is captured well on the right side
of the plot (x1 greater than -0.5), but not so well on the left side.
The first seed produces worse results, with what looks like a spurious
one-dimensional structure. The second seed gives better results, but
there is still a large error in the reconstruction of some points.
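The smaller learning rate was found by trial and error. The kind of
search involved could be automated with a helper along these lines (a
hypothetical sketch, not part of the actual program; train_fn stands
for any routine that trains the network at a given rate and returns
the final mean squared reconstruction error):

```python
import numpy as np

def pick_stable_lr(train_fn, X, lrs=(2e-4, 1e-4, 5e-5, 2.5e-5)):
    """Try learning rates from largest to smallest and return the
    first whose final reconstruction error is finite and beats the
    trivial predict-the-mean baseline, i.e. training did not blow up."""
    baseline = np.mean((X - X.mean(axis=0)) ** 2)
    for lr in lrs:
        final_loss = train_fn(X, lr=lr)
        if np.isfinite(final_loss) and final_loss < baseline:
            return lr, final_loss
    return None, None   # no tried rate was stable
```

This trades extra runs for not having to guess a rate by hand, at the
cost of more total training time.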
Data set 3 (zip code images).
For this data, I used a learning rate of 0.000025, for 50000
iterations, trying two random seeds. Some instability is still
apparent in the later part of each run, but it doesn't look serious.
For both runs, the values of the bottleneck units for the digits 0, 1,
2, and 6 seem to be sufficient to usually distinguish them from the
other digits. However, 4 and 9 are mixed together, as are 3 and 8.
The 5s are perhaps better separated from the other digits in the first
run than in the second run (even though the 5s are quite near other
digits in the first run, they seem to not overlap as much). Some of
the 7s are well-separated from other digits, but not all of them.
Using more than two bottleneck units might improve the digit
separations. (The main reason for using two in this exercise is to
allow for an easily interpretable 2D plot.) It's also possible that
longer training time and/or more hidden units in layers 1 and 3 might
help. Finally, the images in the actual MNIST dataset have twice the
resolution and there are 60000 training cases rather than 600, which
ought to help.
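The visual impressions above (4 mixed with 9, 3 with 8) could also be
checked numerically. A hypothetical sketch, assuming the bottleneck
values and digit labels are available as arrays: for each image, find
its nearest neighbour in the 2-D bottleneck space and count how often
that neighbour is a different digit.

```python
import numpy as np

def mixing_rate(Z, labels):
    """Per digit, the fraction of images whose nearest neighbour in
    the 2-D bottleneck space is a *different* digit.  High values for
    4/9 and 3/8 would confirm the overlap seen in the plots.
    Z: (n, 2) bottleneck values; labels: (n,) digit classes."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # ignore self-matches
    nn_label = labels[d2.argmin(axis=1)]  # label of nearest neighbour
    return {int(d): float((nn_label[labels == d] != d).mean())
            for d in np.unique(labels)}
```

A digit with a rate near zero is well separated in the bottleneck
representation; rates near one would correspond to the thoroughly
mixed pairs described above.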