ST3189 Machine learning
1. (a) Indicate whether the following statements are true or false. Briefly justify your answers.
i. The decision boundary of a support vector machine is determined only by support vectors. [3 marks]
ii. Linear discriminant analysis can only produce a linear decision boundary. [3 marks]
iii. If we train a linear regression estimator on only half of the data, the bias of the estimator will be smaller than if we train it on the entire dataset. [3 marks]
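As a quick numerical illustration of statement (i), here is a minimal scikit-learn sketch (the toy data is hypothetical): refitting a linear SVM on its support vectors alone should reproduce the same hyperplane, since points with zero dual weight do not affect the solution.

```python
# Sketch: the separating hyperplane of a linear SVM depends only on the
# support vectors. Toy data below is hypothetical, not from the exam.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Refit using only the support vectors; the optimum is unchanged because
# the dropped points satisfied the margin constraints inactively.
sv = clf.support_  # indices of the support vectors
clf_sv = SVC(kernel="linear", C=1.0).fit(X[sv], y[sv])

print(clf.coef_, clf.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)  # should match up to numerical tolerance
```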
(b) Consider k-nearest neighbours classification using the Euclidean distance on the dataset shown in Figure 1.
Figure 1: For Question 1 (b).
i. What is the Leave-One-Out Cross Validation (LOOCV) error when using 1-nearest neighbour? [3 marks]
ii. What is the LOOCV error when using 7-nearest neighbours? [3 marks]
iii. Sketch the 3-nearest neighbour decision boundary and identify regions classified as “+” and “-”, respectively. [6 marks]
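The coordinates in Figure 1 are not reproduced here, but the following sketch (with hypothetical stand-in points and labels) shows how LOOCV errors of this kind can be checked numerically:

```python
# Sketch: LOOCV error for k-NN with Euclidean distance.
# The points and labels below are hypothetical stand-ins for Figure 1.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [2, 2], [3, 1], [2, 0],
              [6, 5], [7, 6], [8, 5], [7, 4]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # "+" = 0, "-" = 1 (hypothetical)

for k in (1, 7):
    knn = KNeighborsClassifier(n_neighbors=k)           # Euclidean by default
    acc = cross_val_score(knn, X, y, cv=LeaveOneOut())  # one fold per point
    print(f"k={k}: LOOCV error = {1 - acc.mean():.3f}")
```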
(c) The stepwise and best subset selection procedures can be used for variable selection. Discuss the main advantage and the main disadvantage of the stepwise procedure compared with best subset selection. [4 marks]
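For a sense of the computational gap between the two procedures, the counts below follow the standard bookkeeping: best subset fits all 2^p submodels, while forward stepwise fits roughly 1 + p(p+1)/2 models.

```python
# Sketch: number of models fitted by best subset vs. forward stepwise
# selection with p candidate predictors.
for p in (10, 20, 30):
    best_subset = 2 ** p             # all subsets of p predictors
    stepwise = 1 + p * (p + 1) // 2  # null model + (p - k) fits at step k
    print(f"p={p}: best subset {best_subset}, forward stepwise {stepwise}")
```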
2. Consider a linear regression setting where the response variable is y = (y1, . . . , yn) and there is a single feature, or predictor, x = (x1, . . . , xn), with xi > 0 for all i = 1, . . . , n. We are interested in fitting the following model
yi = β log xi + εi , i = 1, . . . , n,
where the error terms εi are independent and distributed according to the Normal distribution with mean 0 and known variance σ2. Equivalently, we can write that, given x, each yi is independent and distributed according to the Normal distribution with mean β log xi and known variance σ2.
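Since the yi are conditionally independent Normals, the likelihood asked for in part (a) is the product of the corresponding densities; as a sketch,
\[
L(\beta) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left\{ -\frac{(y_i - \beta \log x_i)^2}{2\sigma^2} \right\}
\propto \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta \log x_i)^2 \right\}.
\]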
(a) Derive the likelihood function for the unknown parameter β . [3 marks]
(b) Derive the Jeffreys prior for β. Use it to obtain the corresponding posterior distribution. [6 marks]
(c) Consider the Normal distribution prior for β with zero mean and variance ω2.
Use it to obtain the corresponding posterior distribution. [6 marks]
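Both posteriors follow from the standard Normal–Normal conjugate update. As a sketch, writing β̂ = Σi yi log xi / Σi (log xi)² and s² = σ² / Σi (log xi)², the likelihood is proportional, as a function of β, to a Normal density with mean β̂ and variance s². The Jeffreys prior here is flat, since the Fisher information does not depend on β, giving β | y ∼ N(β̂, s²); the N(0, ω²) prior instead gives
\[
\beta \mid y \sim \mathcal{N}\!\left( \frac{s^{-2}\,\hat\beta}{s^{-2} + \omega^{-2}},\; \frac{1}{s^{-2} + \omega^{-2}} \right).
\]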
(d) Consider the least squares criterion
\[
\ell(\beta) = \sum_{i=1}^{n} (y_i - \beta \log x_i)^2 \qquad (1)
\]
and show that the estimator of β that minimises equation (1) also maximises the likelihood function derived in part (a). Derive this estimator and, in addition, consider the following penalised least squares criterion
\[
\ell_\lambda(\beta) = \sum_{i=1}^{n} (y_i - \beta \log x_i)^2 + \lambda \beta^2 \qquad (2)
\]
given a λ > 0. Derive the estimator of β that minimises equation (2) and compare it with the one that minimises equation (1). [5 marks]
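Assuming the ridge form of the penalty in (2), both minimisers follow by setting derivatives to zero; as a sketch,
\[
\frac{d}{d\beta} \sum_{i=1}^{n} (y_i - \beta \log x_i)^2
= -2 \sum_{i=1}^{n} \log x_i \,(y_i - \beta \log x_i) = 0
\;\Rightarrow\;
\hat\beta = \frac{\sum_i y_i \log x_i}{\sum_i (\log x_i)^2},
\]
while adding λβ² shifts the normal equation to
\[
\hat\beta_\lambda = \frac{\sum_i y_i \log x_i}{\sum_i (\log x_i)^2 + \lambda},
\]
which shrinks the least squares estimator towards zero.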
(e) Provide a Bayes estimator for each of the posteriors in parts (b) and (c) and compare them with the estimators of part (d). [5 marks]
3. (a) Consider the regression task of predicting the variable Y based on the variable X given the following training sample:
Apply the recursive binary splitting algorithm to produce a regression tree. The objective is to minimise the residual sum of squares (RSS)
\[
\mathrm{RSS} = \sum_{m=1}^{M} \sum_{i:\, X_i \in R_m} (Y_i - c_m)^2,
\]
where cm is the prediction for Yi corresponding to the region Rm of the tree. The stopping criterion, used to find the regions Rm of the tree, requires all nodes to contain fewer than 4 observations. Provide the splitting rules, the regions Rm and a diagram of the tree, as well as your calculations in detail. [13 marks]
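The training sample itself is not reproduced above; as a sketch of the mechanics (the (X, Y) pairs below are hypothetical), one step of recursive binary splitting chooses the cut point that minimises the total RSS of the two resulting regions:

```python
# Sketch: one step of recursive binary splitting on a single feature.
# The (x, y) pairs are hypothetical stand-ins for the missing table.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([5.0, 5.5, 4.8, 9.9, 10.2, 10.0, 15.1, 14.8])

def rss(v):
    # RSS of a region whose prediction is the region mean.
    return ((v - v.mean()) ** 2).sum() if len(v) else 0.0

# Candidate cut points: midpoints between consecutive sorted x values.
cuts = (x[:-1] + x[1:]) / 2
best = min(cuts, key=lambda s: rss(y[x <= s]) + rss(y[x > s]))
print("first split at x <=", best)
```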
(b) Suppose we wish to perform k-means clustering with k = 2 on the following data set containing five observations and one variable: X = (−2, −3, 3, 4, 6). Suppose that our random initialisation ends up with two cluster centres at the following locations: Cluster Centre 1: X = 2; Cluster Centre 2: X = 5.
i. Show how the k-means algorithm will work from this point on. You need to indicate what the initial cluster assignments will be, how the cluster centres and assignments change at each step, as well as the final cluster assignments and centres. Note that you should only need to do this for a few iterations before you get the final solution. [8 marks]
ii. What would happen in the k-means algorithm if the observation X = 3 had actually been recorded incorrectly and its correct value was 2? [4 marks]
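A minimal sketch of the iteration in (i), coded directly from the data and the initialisation given in the question, printing the assignments and centres at each step until they stabilise:

```python
# Sketch: k-means (k = 2) on X = (-2, -3, 3, 4, 6) starting from
# centres 2 and 5, as specified in the question.
import numpy as np

X = np.array([-2.0, -3.0, 3.0, 4.0, 6.0])
centres = np.array([2.0, 5.0])

while True:
    # Assignment step: each point goes to its nearest centre.
    labels = np.abs(X[:, None] - centres[None, :]).argmin(axis=1)
    # Update step: each centre becomes the mean of its cluster.
    new_centres = np.array([X[labels == j].mean() for j in (0, 1)])
    print("assignments:", labels, "centres:", new_centres)
    if np.allclose(new_centres, centres):
        break
    centres = new_centres
```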
4. (a) Suppose that we have five observed points, each with four features. We present the Euclidean distance between any two observations with measurements on these four features in the following matrix.
Use the matrix with Euclidean distances to perform hierarchical clustering, using average linkage. [13 marks]
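The distance matrix itself is not reproduced above; as a sketch of the procedure (the 5×5 distances below are hypothetical stand-ins), SciPy's average-linkage agglomeration can be run directly on a condensed form of such a matrix:

```python
# Sketch: average-linkage hierarchical clustering from a distance matrix.
# The distances below are hypothetical stand-ins for the missing matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([
    [0.0, 0.3, 0.4, 0.7, 0.8],
    [0.3, 0.0, 0.5, 0.8, 0.9],
    [0.4, 0.5, 0.0, 0.6, 0.7],
    [0.7, 0.8, 0.6, 0.0, 0.2],
    [0.8, 0.9, 0.7, 0.2, 0.0],
])

Z = linkage(squareform(D), method="average")
print(Z)  # each row: the two clusters merged, their distance, new cluster size
```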
(b) Consider the classification tree problem with two inputs and a training dataset displayed in Figure 2. For the first split point, we consider the following two candidate splits:
i. R1 = {X : X1 ≤ 0.25} and R2 = {X : X1 > 0.25};
ii. R1 = {X : X1 ≤ 0.75} and R2 = {X : X1 > 0.75}.
Compute the misclassification errors for the splits in (i) and (ii) above. Among these two candidate splits, where would we make the first split? [6 marks]
Figure 2: Training dataset with inputs X1, X2 and two classes for Y: “red circle” and “blue square”.
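For part (b), the quantity to compute in each region is the standard misclassification error; as a sketch, with p̂mk denoting the proportion of class-k training observations in region Rm,
\[
E_m = 1 - \max_k \hat p_{mk},
\]
and a candidate split is scored by the observation-weighted sum of E1 and E2 over its two regions.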
(c) Consider the following binary classification problem with Y = k, k ∈ {1, 2}. At a data point x, P(Y = 1 | X = x) = 0.6. Let x′ be the nearest neighbour of x and P(Y = 1 | X = x′) = p > 0. What are the values of p such that the 1-nearest-neighbour error at x is at least 0.5? [6 marks]
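One way to set up part (c): the 1-nearest-neighbour rule predicts the label observed at x′, so, treating the labels at x and x′ as independent, the error at x decomposes over the two ways the labels can disagree,
\[
\mathrm{Err}(x) = P(Y = 1 \mid X = x)\,(1 - p) + P(Y = 2 \mid X = x)\,p
= 0.6(1 - p) + 0.4p = 0.6 - 0.2p,
\]
and the requirement Err(x) ≥ 0.5 then reduces to a linear inequality in p.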