STA 414/2104, Spring 2014, Hints on using R for Assignment 1
Reading the data.
You can read the x values from the files "train1x", "train2x", and
"testx" using 'read.table', with the header=FALSE argument, to tell it
that there is no header line. This will give you a "data frame"
containing the covariates. You should convert this to a matrix using
the 'as.matrix' function, since otherwise accessing its elements
will be rather slow.
You can read the y values with either 'read.table' or 'scan'.
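For example (the y-value file names "train1y", "train2y", and "testy"
are guesses here - check them against the actual assignment files):

    train1x <- as.matrix (read.table ("train1x", header=FALSE))
    train2x <- as.matrix (read.table ("train2x", header=FALSE))
    testx   <- as.matrix (read.table ("testx", header=FALSE))

    train1y <- scan ("train1y")   # assumed file name
    train2y <- scan ("train2y")   # assumed file name
    testy   <- scan ("testy")     # assumed file name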
Fitting a linear model.
You should fit the simple linear model by least squares using R's 'lm'
function. It takes as its argument a "formula", such as y ~ x1+x2,
which means that the responses in the vector y are modeled by a linear
model with covariates x1 and x2, which may also be vectors, or which
may be matrices, in which case each column of the matrix is a
covariate. If you converted the covariates from the data files to a
matrix, you can therefore say that all covariates are to be used in
the model by just specifying the matrix of training inputs. An
intercept is included in the model by default (as desired for this
assignment).
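For example, using the variable names from the reading step above:

    fit <- lm (train1y ~ train1x)   # all columns of train1x as covariates,
                                    # with an intercept included by default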
The 'lm' function returns a model object, which you should assign to a
variable. You can then look at the results (including regression
coefficients and estimated standard deviation for the residuals) using
the 'summary' function. You can get the regression coefficients
(including the intercept as the first one) with the 'coef' function.
You'll need the coefficients to make predictions for the test cases.
The 'mean' function will be useful in computing the average squared
error on the test cases.
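Predictions and the average squared test error might then be computed
as in this sketch:

    summary (fit)                    # coefficients, residual s.d., etc.
    b <- coef (fit)                  # b[1] is the intercept
    pred <- b[1] + testx %*% b[-1]   # predictions for the test cases
    mean ((testy - pred)^2)          # average squared error on test cases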
To create matrices that have squares and cubes of the original
covariates, it may be convenient to use the 'cbind' function, which
pastes together two or more matrices side-by-side.
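For instance, matrices with the original covariates plus their squares
and cubes could be built as follows:

    train1x3 <- cbind (train1x, train1x^2, train1x^3)
    testx3   <- cbind (testx, testx^2, testx^3)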
Defining covariance functions.
To use the Gaussian process functions from the week 6 demo, you will
need to define suitable covariance functions, each of which will take
as arguments a single covariate vector, a matrix of covariates for a
set of cases (with each row being the covariate vector for one case),
and a vector of hyperparameter values. Your covariance functions
should ignore the first hyperparameter, the noise standard deviation,
since it is not used in defining the noise-free covariance function.
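For example, a covariance function of squared-exponential form might
look like the sketch below. The roles assumed for h[2] (overall scale)
and h[3] (length scale) are guesses - match them to whatever
hyperparameter layout the week 6 demo actually uses.

    cov.se <- function (x, xm, h)
    {
      # Covariances between a case with covariate vector x and each row
      # of xm; h[1], the noise standard deviation, is deliberately ignored.
      apply (xm, 1, function (x2) h[2]^2 * exp (-sum((x-x2)^2) / h[3]^2))
    }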
One R trick that may be useful is that you're allowed to add or
subtract a vector and a matrix, if the length of the vector is equal
to the number of rows in the matrix. The result of such an operation
is found by adding or subtracting the vector to or from each column of
the matrix. Combining this with the transpose operation ('t' in R)
may be useful. You may also find the rowSums and colSums functions
useful - they take a matrix as their argument, and return a vector
containing the sum of the elements in each row or in each column,
respectively.
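Using these operations, the covariance function sketched above can be
written without the explicit loop over cases:

    cov.se <- function (x, xm, h)
    {
      # x - t(xm) subtracts x from each column of t(xm), so colSums of
      # its square gives the squared distance from x to each row of xm.
      h[2]^2 * exp (-colSums((x - t(xm))^2) / h[3]^2)
    }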
Cross validation.
You'll need to extend the week 6 example functions with one to find
hyperparameters based on 10-fold cross validation.
You can minimize the cross-validation error with the 'nlm' function,
which was used in the week 6 example functions to maximize the
likelihood. Since 'nlm' minimizes, and here you want to minimize the
cross-validation error directly, you won't need the minus sign used
there to convert the maximization into a minimization. You'll need to
pass 'nlm' a function for computing the cross-validation error.
cross validation error. The 'seq' function may be useful in dividing
the training data into 10 parts. You can find examples of cross
validation in the week 2 example program, as well as in example
programs and assignment solutions from previous years.
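Here is one way the pieces might fit together. The gp.pred function
below is a minimal self-contained predictive mean for a zero-mean
Gaussian process, written just for this sketch - it is not the week 6
demo's code, so substitute your actual prediction function. The fold
division assumes the number of training cases is a multiple of 10 and
that cases are in random order, and the starting hyperparameter values
are placeholders.

    # Predictive means for test rows xt, given training data (x,y),
    # covariance function covf, and hyperparameters h, with h[1] the
    # noise standard deviation.
    gp.pred <- function (xt, x, y, covf, h)
    {
      n <- nrow(x)
      K <- matrix (NA, n, n)
      for (i in 1:n) K[i,] <- covf (x[i,], x, h)
      b <- solve (K + diag(h[1]^2,n), y)   # (K + noise variance I)^{-1} y
      apply (xt, 1, function (xs) sum (covf(xs,x,h) * b))
    }

    # Average squared error from 10-fold cross validation, as a function
    # of the logs of the hyperparameters (so nlm can search over all real
    # values while the hyperparameters themselves stay positive).
    gp.cv.error <- function (lh, x, y, covf)
    {
      h <- exp(lh)
      n <- nrow(x)
      err <- 0
      for (i in 1:10)
      { val <- seq ((i-1)*n/10 + 1, i*n/10)   # indices of validation fold i
        pred <- gp.pred (x[val,,drop=FALSE], x[-val,,drop=FALSE],
                         y[-val], covf, h)
        err <- err + sum ((y[val]-pred)^2)
      }
      err / n
    }

    # nlm minimizes, so no minus sign is needed (unlike the likelihood case).
    lh.opt <- nlm (gp.cv.error, log(c(0.1,1,1)),   # placeholder start values
                   x=train1x, y=train1y, covf=cov.se) $ estimate

Exponentiating lh.opt then gives the selected hyperparameter values.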
Importance sampling.
You'll also need to extend the week 6 example functions with one to
make predictions using Bayesian averaging, done with importance
sampling.
You can generate a vector of k independent normal random variates with
means given by the vector mu and standard deviations given by the
vector sd with the expression rnorm(k,mu,sd). You'll need to do this
to generate values from the prior distribution.
It is good practice to set the random number seed to some constant
before starting random number generation, with a call such as
set.seed(1). This way, you get the same results if you run the
program again, which makes finding bugs a lot easier.
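For example, with independent normal priors on the logs of the three
hyperparameters (the prior means, standard deviations, and number of
samples below are all placeholders):

    set.seed(1)              # fixed seed, so results are reproducible
    R <- 1000                # number of values sampled from the prior
    pr.mu <- c(0,0,0)        # placeholder prior means (log scale)
    pr.sd <- c(1,1,1)        # placeholder prior standard deviations
    lhs <- matrix (NA, R, 3) # each row: one sampled log hyperparameter vector
    for (r in 1:R) lhs[r,] <- rnorm (3, pr.mu, pr.sd)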
When doing prediction by importance sampling from the prior, the
weight given to predictions with a sampled parameter value is the
probability density of the training data given that parameter value.
Such a probability density for an entire training set is often very
small or very large - so small or large that underflow or overflow may
occur, in which case the number can't be represented on the computer
(using the usual floating-point scheme). The way around this is to
work in terms of the log of the probability density, without ever
computing the probability density itself (there are tricks for doing
this). However,
for this assignment, you can use the actual probability densities,
since they don't underflow or overflow with the data sets you are
using.
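Continuing this sketch (and using the gp.pred function from the
cross-validation sketch), the weights and averaged predictions might
be computed as follows. The gp.data.density function is again a
minimal self-contained stand-in (the density of the responses under a
zero-mean Gaussian process), not the week 6 demo's code:

    # Density of the training responses y given hyperparameters h, under
    # a zero-mean GP with covariance function covf and noise s.d. h[1].
    gp.data.density <- function (x, y, covf, h)
    {
      n <- nrow(x)
      K <- matrix (NA, n, n)
      for (i in 1:n) K[i,] <- covf (x[i,], x, h)
      C <- K + diag (h[1]^2, n)                  # add noise variance
      q <- sum (y * solve(C,y))                  # y^T C^{-1} y
      ld <- as.numeric (determinant(C)$modulus)  # log |C|
      exp (-0.5 * (q + ld + n*log(2*pi)))
    }

    # Importance sampling from the prior: each sampled hyperparameter
    # vector gets weight proportional to the data density given it.
    w <- numeric(R)
    pred <- matrix (NA, R, nrow(testx))
    for (r in 1:R)
    { h <- exp (lhs[r,])                         # back from the log scale
      w[r] <- gp.data.density (train1x, train1y, cov.se, h)
      pred[r,] <- gp.pred (testx, train1x, train1y, cov.se, h)
    }
    w <- w / sum(w)                              # normalized weights
    pred.avg <- colSums (w * pred)               # rows of pred weighted
                                                 # by w, then summed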
You may find it useful to return several quantities from the
importance sampling function. This can be done by putting them
together with the 'list' function.
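For example, the importance sampling function might end with

    # Return several related quantities together as a named list.
    list (pred=pred.avg, weights=w)

If the result is assigned to r, the components can then be accessed as
r$pred and r$weights.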