STA 414/2104: Statistical Methods for Machine Learning and Data Mining (Jan-Apr 2006)

Note: There was a typo in my script for computing final marks, and correcting it has changed some people's marks. Fortunately, none of the changes are drastic. My apologies for this!

All course work has been marked and can now be picked up. Phone 978-4970 first to check that I'm in my office.

Instructor: Radford Neal, Office: SS6016A, Phone: (416) 978-4970, Email: radford@stat.utoronto.ca

Office hours: Mondays 2:30-3:30 and Wednesdays 11:30-12:30, in SS6016A.

Lectures:

Tuesdays, Thursdays, and Fridays, 1:10pm to 2:00pm, in SS 2111. The first lecture is January 10; the last is April 13.
There are no classes during Reading Week, from February 20 to 24.

Assessment:

For graduate students (in STA 2104):
    Two tests: 10% each
    Three assignments: 17% each
    Project: 29%
For undergraduate students (in STA 414):
    Either the same as for the graduate students, or:
    Two tests: 10% each
    Four assignments: 20% each

Undergraduates who wish to do three assignments and a project must begin the project at the same time as the graduate students, but may later switch to doing four assignments if they wish. However, they can't hand in both a project and the fourth assignment.

The assignments are to be done by each student individually. Any discussion of the assignments with other students should be about general issues only, and should not involve giving or receiving written or typed notes.

Projects may be done individually or in groups of two (possibly more than two, with special permission). More will be expected of a group project than an individual project.

Course Text:

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
The web page for the book contains errata, datasets, and other information.

Computing:

The assignments (and possibly the project) will involve writing small programs. I recommend that you write these in either R or Matlab, though other languages are also possible.

If you don't have a home computer (or don't want to use it), you can get an account on the CQUEST computer system. If you're an undergraduate registered in STA 414, you should be able to get an account by clicking on "Request an Account" in the upper left of the main CQUEST page. If you're a graduate student in STA 2104 without other computing access, you need to fill out a form you can get from me to get a CQUEST account.

If you have a home computer, you can download R for free from www.r-project.org. There should be compiled versions for Windows and Linux, as well as source that can be compiled for Unix/Linux systems. The download comes with documentation, including An Introduction to R.

Other options for home use are Matlab, which costs money, and Octave, a free Matlab look-alike available from www.octave.org. I have only limited experience with Octave, however, and it appears to be less well supported than R.

Here are some notes on R and some notes on Matlab that may be useful.

What to read in the text:

Chapter 1
Chapter 2
Chapter 3 (except 3.4.6)
Chapter 4 (except 4.2)
Chapter 5 (except 5.8 and 5.9)
Chapter 7 (except 7.8 and 7.11)
Chapter 14 (sections 14.1 to 14.3)

Other useful references:

Notes by Nancy Reid for an earlier version of this course.

Slides for my NIPS*2004 tutorial on Bayesian methods for machine learning, in Postscript or PDF.

A web page about Gaussian process models.

My paper on Dirichlet diffusion trees.

Assignments:

Assignment 1: Postscript, PDF.
Data to test on: Training inputs, Training responses, Test inputs, Test responses.
Here are some notes on R and some notes on Matlab that may be useful.
Here is the solution (in R): functions, test script, output, discussion.

Assignment 2: Postscript, PDF.
Here is the data: Training inputs, Training classes, Test inputs, Test classes.
Here are some notes on R and some notes on Matlab that may be useful.
Clarification: The assignment handout doesn't specify exactly how to do the principal component computation. You should subtract the mean (on training cases) from each of the 200 variables, but do not standardize by dividing by the standard deviation. (Of course, you can try it with standardization too if you're interested.)
Here is the solution (in R): main script, PCA functions, LDA functions, output.
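The preprocessing described in the clarification above can be sketched as follows. This is only an illustration in plain Python, not the posted R solution: the function names and the tiny two-variable data set are made up for the example, and power iteration is used here merely as a simple way to find the first principal direction. The point it shows is that the means come from the training cases only, are subtracted from both training and test inputs, and nothing is divided by a standard deviation.

```python
def column_means(rows):
    """Mean of each column of a list-of-lists matrix."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows, means):
    """Subtract the given means from every row (no scaling by standard deviation)."""
    return [[x - m for x, m in zip(row, means)] for row in rows]

def first_component(centered, iters=200):
    """Leading principal direction of centered data, found by power iteration."""
    p = len(centered[0])
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        # Multiply v by X'X without forming the p-by-p matrix explicitly.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(p)]
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]
    return v

# Hypothetical toy data: 3 training cases and 1 test case, 2 variables each.
train = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
test = [[4.0, 8.0]]

means = column_means(train)           # computed on training cases only
train_centered = center(train, means)
test_centered = center(test, means)   # test inputs use the training means

direction = first_component(train_centered)
test_scores = [sum(x * w for x, w in zip(row, direction)) for row in test_centered]
```

Standardizing as well would amount to dividing each centered column by its training standard deviation before calling `first_component`.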

Assignment 3: Postscript, PDF.
Data to test on: Training inputs, Training responses, Test inputs, Test responses, Inputs on 51x51 grid.
Here are some notes on R and some notes on Matlab that may be useful.
Here is the solution (in R): main script, spline functions, output, plots.

Assignment 4: Postscript, PDF.
Data to test on: Training inputs, Training responses, Test inputs, Test responses, Inputs on 51x51 grid.
You are also supposed to test on the data for Assignment 3, available above.
Here are some notes on R for this assignment.
Here is the solution (in R): main script, Gaussian process functions, output, plots.

Tests:

Test 1 was held during class on Tuesday, February 28.
It covered material presented in lectures through February 17, or in the textbook Chapter 1, Chapter 2, Chapter 3 (except 3.4.6), and Chapter 4 (except 4.2 and 4.5).
Here are stem plots of the mark distributions.

Test 2 was held during class on Friday, April 7.
Here are stem plots of the mark distributions.

Projects:

Here is some more information on projects, including some suggested topics. Projects are due on April 24.

Lecture slides:
           Tuesday             Thursday                  Friday
Week 1     Postscript, PDF     Postscript, PDF           Postscript, PDF
Week 2     Postscript, PDF     Postscript, PDF           No slides (R tutorial)
Week 3     Postscript, PDF     No slides (R tutorial)    Postscript, PDF
Week 4     Postscript, PDF     No slides (R tutorial)    Postscript, PDF
Week 5     Postscript, PDF     Postscript, PDF           Postscript, PDF
Week 6     Postscript, PDF     Postscript, PDF           Postscript, PDF
Week 7     No slides (Test)    Postscript, PDF           Postscript, PDF
Week 8     Postscript, PDF     Postscript, PDF           Postscript, PDF
Week 9     Postscript, PDF     Postscript, PDF           No slides
Week 10    Postscript, PDF     Postscript, PDF           Postscript, PDF
Week 11    No slides           Postscript, PDF           No slides
Week 12    Postscript, PDF     Postscript, PDF           No slides (Test)
Week 13    No slides           Postscript, PDF

Example R programs:

An implementation of 1-NN: The one.nn function, A script to test it, The data needed for the test
You can also find lots of example R programs in the web page for my version of STA 410.
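
For a quick look at the idea behind 1-NN without opening the R files, here is a minimal sketch in Python. It is not a translation of the one.nn function linked above. The name `one_nn`, its arguments, and the toy data are assumptions made for illustration, and it assumes squared Euclidean distance with ties going to the earliest training case.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length input vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_nn(train_x, train_y, test_x):
    """Predict each test case's response as that of its nearest training case."""
    predictions = []
    for t in test_x:
        # Index of the training case closest to t; ties go to the earliest case.
        nearest = min(range(len(train_x)),
                      key=lambda i: squared_distance(train_x[i], t))
        predictions.append(train_y[nearest])
    return predictions

# Hypothetical toy data: three training cases with class labels, two test cases.
train_x = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_y = ["a", "a", "b"]
test_x = [[0.2, 0.1], [4.0, 4.5]]

predictions = one_nn(train_x, train_y, test_x)
```

The same function works for numeric responses, since it simply copies the response of the nearest training case.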