STA 414/2104: Statistical Methods for Machine Learning and Data Mining (Jan-Apr 2012)

ANNOUNCEMENTS: Solutions for all assignments and test 3 are below. You can pick up remaining work from my office (SS 6026A) on May 2 from 1:10-2:00 or May 3 from 1:10-2:00.

Instructor:

Radford Neal, Office: SS6026A, Phone: (416) 978-4970, Email: radford@stat.utoronto.ca

Office Hours: Fridays, 1:10pm to 2:00pm, in SS6026A.

Lectures:

Tuesdays 12:10pm to 2:00pm in BA 1220; Thursdays 12:10 to 1:00pm in GB 244. The first lecture is January 10. The last lecture is April 5. There are no lectures February 21 and 23 (Reading Week).

Evaluation:

55% Four assignments, worth 10%, 15%, 15%, and 15%.
45% Three 50-minute tests, each worth 15%, held in lecture time on February 9, March 15, and April 5.

The assignments are to be done by each student individually. Any discussion of the assignments with other students should be about general issues only, and should not involve giving or receiving written, typed, or emailed notes.

Textbook:

There is no textbook for this course. I will be posting lecture slides and links to online references.

Computing:

Assignments will be done in R. Statistics graduate students will use the Statistics research computing system. Undergraduates and graduate students from other departments will use CQUEST. If you're an undergraduate student in this course, you can request a CQUEST account yourself; if you're a grad student, you need to fill out a form.

You can also use R on your home computer by downloading it for free from www.R-project.org. From that site, here is the Introduction to R.

Lecture slides:

Note that slides are updated as mistakes are corrected or as the amount of material covered in a week becomes apparent.

Week 1 (Introduction)
Week 2 (Linear basis functions, penalties, cross-validation)
Week 3 (Introduction to Bayesian methods)
Week 4 (Conjugate priors, Bayesian linear basis function models)
Week 5 (More on Bayesian linear basis function models)
Week 6 (Multilayer perceptron neural networks, early stopping)
Week 7 (Bayesian neural networks, Gaussian process models)
Week 8 (Classification and loss functions, generative models, discriminative models, large margin classifiers)
Week 9 (Support vector machines)
Week 10 (Clustering, mixture models, EM algorithm, Bayesian mixture models)
Week 11 (Dimensionality reduction, PCA, Factor analysis, auto-encoders)
Week 12 (Kernel PCA)

Tests:

Questions and answers for Test 1. Marks will be adjusted by the formula m' = 100*(1 - ((75-m)/75)^1.4). The mark written on the test paper is the unadjusted mark. Here is the summary of the mark distribution, after adjustment:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   23.0    62.5    76.0    70.3    85.5   100.0 

Test 2 marks will be adjusted by the formula m' = 100*(1 - ((80-m)/80)^1.4).

Questions and answers for Test 3. Test 3 marks will be adjusted by the formula m' = 100*(1 - ((90-m)/90)^1.4).
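For concreteness, here is a small R sketch of the adjustment; the function name adjust_mark, the cap argument, and the capping of raw marks at the listed maximum are illustrative assumptions, not part of the course materials.

# Sketch of the mark adjustment m' = 100*(1 - ((cap-m)/cap)^1.4),
# with cap = 75, 80, or 90 for tests 1, 2, and 3 respectively.
# Capping raw marks at 'cap' is an assumption made for illustration.
adjust_mark <- function (m, cap)
{
  m <- pmin(m, cap)                   # marks at or above the cap map to 100
  100 * (1 - ((cap - m)/cap)^1.4)
}

adjust_mark(c(23, 50, 75), cap=75)    # example: test 1 adjustment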

Assignments:

Assignment 1: handout
Data set 1: training data, test data.
Data set 2: training data, test data.
Here are the hints for using R.
Solution: R functions, R test script, output, discussion.
Here is the summary of the mark distribution:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  64.00   79.50   86.00   84.61   91.50   96.00 

Assignment 2: handout
Data set 1: training data, test data.
Data set 2: training set 1, training set 2, test data.
All the data sets have headers (so read them with header=TRUE). The response is the first variable, with the remaining variables being the inputs.
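For those new to R, a minimal sketch of reading one of these data sets is below; the file names are placeholders for whatever the downloaded files are actually called.

# Read a training and a test set that have header lines.
train <- read.table("train1.txt", header=TRUE)
test  <- read.table("test1.txt",  header=TRUE)

# The response is the first column; the remaining columns are the inputs.
y_train <- train[,1]
X_train <- as.matrix(train[,-1])
y_test  <- test[,1]
X_test  <- as.matrix(test[,-1])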

Solution: Modified R functions,
Data set 1: script, output, training plot (noadj), training plot (adj), comparison plot,
Data set 2: script, output, training plot (noadj1), training plot (adj1), training plot (noadj2), training plot (adj2),
Discussion.

Assignment 3: handout
Artificial data 1: training data, test data.
Artificial data 2: training data, test data.
Ozone data: training data, test data.
All the data sets have headers (so read them with header=TRUE). The response is the first variable, with the remaining variables being the inputs.

Solution: GP R functions, R script, output, discussion.

Assignment 4: handout
Artificial data 1: estimation data, validation data, test data. There are no headers (so read with header=FALSE).
Artificial data 2: estimation data, validation data, test data. There are no headers (so read with header=FALSE).
Gene expression data set: estimation data, validation data, test data.
This data set was derived from the data reported in this paper. I did some pre-processing, selected a subset of 1000 genes, and merged all their data before randomly dividing into estimation, validation, and test sets. For your interest (not required for the assignment), the classification into two types of cancer for the cases is as follows: estimation, validation, test.

Note that I've changed the penalty definition slightly from what I talked about in the preview in class, and I no longer ask you to discuss variations on the penalty. Also, as discussed in class, in the M step you don't need to find the exact maximum (just improve things), and so can use the previous iteration's sigma when maximizing for mu. This also means that you need to initialize sigma (the sample standard deviations would be appropriate).

It's possible that some responsibilities will underflow to zero, which is OK, except that if a component has zero responsibility for all data items, you will need to skip re-estimation of its parameters to avoid getting NaN as a result.
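To illustrate the skip, here is a generic R sketch of one M step for a one-dimensional Gaussian mixture, ignoring the assignment's penalty term; the function and variable names are mine, and the updates shown are the standard unpenalized ones, not the assignment's.

# Generic sketch of one M step for a one-dimensional Gaussian mixture
# (no penalty).  'x' is the data vector, 'r' is the n-by-K matrix of
# responsibilities from the E step, and 'mu' and 'sigma' hold the
# previous iteration's parameter values.
m_step <- function (x, r, mu, sigma)
{
  K <- ncol(r)
  for (k in 1:K)
  { Rk <- sum(r[,k])
    if (Rk == 0) next              # skip components with zero total
                                   # responsibility, to avoid 0/0 -> NaN
    mu[k] <- sum(r[,k]*x) / Rk     # a penalized update for mu could use the
                                   # previous sigma[k] here, as noted above
    sigma[k] <- sqrt(sum(r[,k]*(x-mu[k])^2) / Rk)
  }
  list(mu=mu, sigma=sigma)
}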

Solution: EM function, script for data set 1, script for data set 2, script for data set 3, output for data set 1, output for data set 2, output for data set 3, discussion.

Example R programs:

Week 2 lecture example (linear basis function models): script, functions.
Week 5 lecture example (Bayesian linear basis function models): script, functions.
Week 6 lecture example (multilayer perceptron networks): script, data, functions.
Week 7 lecture example (sampling from a network prior): script.
Week 8 lecture example (sampling from a Gaussian process classification model): script.
Week 10 lecture example (EM algorithm for mixture): function.
Week 12 lecture example (Kernel PCA): function, example.

Practice problems:

Practice problem set #1.
Practice problem set #2, and the answers.
Practice problem set #3, and the answers.

Some useful on-line references

Information Theory, Inference, and Learning Algorithms, by David MacKay.

David MacKay's thesis.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition), by Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

Gaussian Processes for Machine Learning, by Carl Edward Rasmussen and Christopher K. I. Williams.

Proceedings of the International Conference on Machine Learning (ICML)

Proceedings of the annual conference on Neural Information Processing Systems (NIPS)

Web pages for past related courses:

STA 414/2104 (Spring 2011)
STA 414/2104 (Spring 2007)
STA 414/2104 (Spring 2006)
CSC 411 (Fall 2006)
STA 410/2102 (Spring 2004) - has many examples of R programs