Introduction to Machine Learning M146
Spring Quarter 2025
Homework #1
Due: 22nd April 2025, Tuesday, before 11:59 pm
Problem 1 (Perceptron)
Suppose we have a training set with 8 samples, where each sample has a feature vector in R^2:
| #  | 1     | 2     | 3     | 4       | 5      | 6     | 7     | 8     |
| X  | [4,0] | [1,1] | [0,1] | [-2,-2] | [-2,1] | [1,0] | [5,2] | [3,0] |
| y  | 1     | -1    | -1    | 1       | -1     | 1     | -1    | -1    |
We are going to implement the perceptron algorithm to train a linear classifier with a 2-dimensional weight vector w ∈ R^2 (no bias term). We initialize the weight vector to the first sample in our dataset, i.e. w1 = x1. Note that when wᵀx = 0, the algorithm predicts +1.
To simplify the calculation, you only need to test (and possibly update on) each sample once, in the given order. You may carry out the algorithm by hand or implement it in code.
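For reference, the single pass described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under the stated conventions (no bias term, predict +1 when wᵀx = 0); the names X, y, and w are placeholders, not part of any provided starter code.

    import numpy as np

    # One pass of the perceptron over the samples in the given order.
    # Convention from the problem: predict +1 when w @ x >= 0.
    def perceptron_single_pass(X, y, w):
        w = w.copy()
        for x_n, y_n in zip(X, y):
            y_hat = 1 if w @ x_n >= 0 else -1   # tie (w @ x == 0) predicts +1
            if y_hat != y_n:                    # mistake: update the weights
                w = w + y_n * x_n
        return w

    # Example usage with the dataset from the table above; w is initialized to x1.
    X = np.array([[4, 0], [1, 1], [0, 1], [-2, -2], [-2, 1], [1, 0], [5, 2], [3, 0]], dtype=float)
    y = np.array([1, -1, -1, 1, -1, 1, -1, -1])
    w_final = perceptron_single_pass(X, y, w=X[0])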
(a) Is the data linearly separable? Will our algorithm converge if we run it several times over the same sequence? Explain.
(b) Regardless of whether the dataset is linearly separable or not, calculate the updates of the weight vector on this sequence for one round over the entire dataset. Follow the order of the index for the samples and show your calculations.
(c) Provide closed-form prediction functions for the perceptron, the voted perceptron, and the average perceptron, using the weight vector(s) derived in part (b).
(d) Using the functions derived in part (c), compare the errors of the perceptron, the voted perceptron, and the average perceptron predictors across the entire dataset: for each point in the dataset, report the label assigned by each classifier, and report the error over the dataset. (A generic prediction sketch for the voted and average variants follows this problem.)
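As referenced in part (d), here is a hedged sketch of how voted- and average-perceptron predictions are typically formed from per-update survival counts. The list `history` of (weight vector, count) pairs is a hypothetical structure you would record while running the updates in part (b); it is not defined by the problem.

    import numpy as np

    def voted_predict(history, x):
        """Voted perceptron: each surviving weight vector casts `count` votes."""
        votes = sum(c * (1 if w @ x >= 0 else -1) for w, c in history)
        return 1 if votes >= 0 else -1

    def average_predict(history, x):
        """Average perceptron: predict with the count-weighted average of the weights."""
        w_avg = sum(c * w for w, c in history)
        return 1 if w_avg @ x >= 0 else -1

    # `history` would hold (weight vector, number of samples it survived) pairs,
    # e.g. history = [(np.array([4., 0.]), 1), ...] collected during part (b).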
Problem 2 (Locally Weighted Linear Regression)
Consider a linear regression problem in which we want to “weight” different training instances differently because some of the instances are more important than others. Specifically, suppose we want to minimize
J(w0, w1) = Σ_{n=1}^{N} αn (yn − (w0 + w1 xn))^2   (1)
Here αn > 0. In class, we worked out what happens for the case where all the weights (the αn’s) are the same. In this problem, we will generalize some of those ideas to the weighted setting.
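To build intuition for the objective in Eq. (1), here is a minimal sketch that evaluates J(w0, w1) numerically and checks a candidate gradient by finite differences. The data arrays and helper names are hypothetical, introduced only for illustration.

    import numpy as np

    def weighted_loss(w0, w1, x, y, alpha):
        """J(w0, w1) = sum_n alpha_n * (y_n - (w0 + w1 * x_n))^2, as in Eq. (1)."""
        residual = y - (w0 + w1 * x)
        return np.sum(alpha * residual**2)

    def finite_difference_grad(w0, w1, x, y, alpha, eps=1e-6):
        """Numerically approximate (dJ/dw0, dJ/dw1); useful for checking part (a)."""
        dw0 = (weighted_loss(w0 + eps, w1, x, y, alpha) - weighted_loss(w0 - eps, w1, x, y, alpha)) / (2 * eps)
        dw1 = (weighted_loss(w0, w1 + eps, x, y, alpha) - weighted_loss(w0, w1 - eps, x, y, alpha)) / (2 * eps)
        return dw0, dw1

    # Hypothetical toy data to try the check:
    x = np.array([0.0, 1.0, 2.0])
    y = np.array([1.0, 2.0, 2.5])
    alpha = np.array([1.0, 2.0, 0.5])   # per-sample importance weights, all > 0
    print(weighted_loss(0.5, 1.0, x, y, alpha), finite_difference_grad(0.5, 1.0, x, y, alpha))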
(a) Calculate the gradient by computing the partial derivatives of J with respect to each of the parameters (w0, w1).
(b) Prove that Eq. (1) has a global optimal solution.
Problem 3 (Modified Logistic Regression with alternative labels)
In class, we have seen logistic regression when the labels are in {0, 1}. In this question, you will derive logistic regression when the labels are instead in {−1, 1}.
In class, we considered the dataset D = {(x1, y1), . . . , (xn, yn)} with n samples where xi ∈ Rd and labels yi ∈ {0, 1} for all i ∈ [n]. The prediction function hw(x) studied in class is given by
hw(x) = 1 / (1 + e^(−wᵀx))   (2)
Moreover, the objective studied in class to minimize for logistic regression was:
P2.1:   min_w  − Σ_{i=1}^{n} [ yi log hw(xi) + (1 − yi) log(1 − hw(xi)) ]   (3)
We want to modify this objective function so that tanhw(x) = tanh(wᵀx) is the activation function and the labels are in {−1, 1}.
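As a sanity check on the relationship explored in part (a) below, here is a small NumPy sketch comparing the sigmoid prediction function from Eq. (2) with tanh(wᵀx); the function names and the test points are illustrative only.

    import numpy as np

    def sigmoid(z):
        """h_w(x) from Eq. (2), written in terms of z = w^T x."""
        return 1.0 / (1.0 + np.exp(-z))

    def tanh_activation(z):
        """tanh_w(x), written in terms of z = w^T x."""
        return np.tanh(z)

    # Compare 2 * sigmoid(2z) - 1 with tanh(z) on a few test values of z = w^T x.
    z = np.linspace(-5, 5, 11)
    print(np.allclose(2 * sigmoid(2 * z) - 1, tanh_activation(z)))  # expected: True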
(a) Show that
tanhw(x) = 2hw′(x) − 1,   where w′ = 2w.
(b) What are the asymptotic values of the function tanhw(x) as wᵀx → ∞ and wᵀx → −∞? Roughly draw the graph of this function with respect to wᵀx. What decision criterion can you choose for predicting the labels −1 or 1?
(c) Using your answer in part (b), argue that we cannot directly replace hw(xi) with tanhw(xi) in the optimization problem P2.1 (3).
(d) When the labels are yi ∈ {−1, 1}, show, using your answer in part (a), that the optimization problem P2.1 is equivalent to:
P2.2:   min_w  Σ_{i=1}^{n} log(1 + e^(−yi wᵀxi))   (4)
(e) Compute the gradient of the loss function in P2.2 (4) for a single sample xi. Consider the two cases yi = 1 and yi = −1 separately.
Problem 4 (Programming Exercise: Binary Classification)
In this exercise, you will work through a family of binary classification methods. Our data consists of inputs xn ∈ R^(1×d) and labels yn ∈ {−1, 1} for n ∈ {1, ..., N}. We will work on a subset of the Fashion-MNIST dataset, focusing on classifying whether an image shows a Dress (y = 1) or a Shirt (y = −1). Your goal is to learn a classifier based on the linear predictor hw(x) = wᵀx. Let
X ∈ R^(N×d) be the matrix whose n-th row is xn, and y = (y1, ..., yN)ᵀ ∈ {−1, 1}^N.   (5)
The main file is the Jupyter notebook Notebook.ipynb.
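Several parts below ((b), (c), (h), (i)) ask for the percentage of misclassified points. A minimal sketch of that computation, assuming a weight vector w and data arrays X, y with the shapes above (the names are placeholders, not the starter code's API):

    import numpy as np

    def misclassification_percentage(w, X, y):
        """Percentage of points whose predicted sign disagrees with the label."""
        scores = X @ w                          # linear predictor h_w(x) = w^T x for every row
        preds = np.where(scores >= 0, 1, -1)    # ties broken toward +1
        return 100.0 * np.mean(preds != y)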
(a) (Visualization): Visualize a sample of the training data. What are the dimensions of X_train and X_test?
(b) (Perceptron): Implement the Perceptron algorithm to classify your training data. Let the maximum number of iterations of the algorithm be numiter = N (the number of training samples). At each iteration, compute the percentage of misclassified points in the training dataset and save it into a Loss_hist array. Plot the history of the loss function (Loss_hist). What are the final value of the loss function and the squared ℓ2 norm of the weight vector? Looking at the loss function, can you comment on whether the Perceptron algorithm converges?
(c) (Perceptron test error): Compute the percentage of misclassified points in the test data for the trained Perceptron.
(d) (Logistic Regression): In this part, we will implement logistic regression for binary classification. Recall that logistic regression attempts to minimize the objective function
J(w) = (1/N) Σ_{n=1}^{N} [ 1{yn = 1} log(1 + e^(−hw(xn))) + 1{yn = −1} log(1 + e^(hw(xn))) ]   (6)
where xn = (1, xn), and 1_A = 1 if A is true and 0 otherwise. Moreover, hw(xn) = wᵀxn. First, we will add an additional feature to each instance and set it to one; this is equivalent to adding an additional first column to X and setting it to all ones.
Modify get_features() in the Logistic.py file to create the matrix X for the logistic regression model.
(e) Complete predict() in the Logistic.py file to predict y from X and w.
(f) Complete the function loss_and_grad() to compute the loss function and its gradient with respect to w for a dataset X with labels y at the given weights w. Test your results by running the code in the main file Notebook.ipynb. If you implement everything correctly, you should get a loss of around 0.7 and a squared ℓ2 norm of the gradient of around 1.8 × 10^5.
(g) Complete the function train_LR() to train the logistic regression model with learning rate η = 10^−6, batch size 100, and number of iterations numiters = 5000. Plot the history of the loss function (Loss_hist). What are the final value of the loss function and the squared ℓ2 norm of the weight vector? (A generic mini-batch training sketch follows part (i) below.)
(h) (Logistic Regression test error): Compute the percentage of misclassified points in the test data for the trained Logistic Regression.
(i) (Logistic Regression and Batch Size): Train the logistic regression model with different batch sizes b ∈ {1, 50, 100, 200, 300}, learning rate η = 10^−5, and number of iterations numiter = 6000/b. Train each model 50 or 100 times and average the test error for each value of the batch size. Plot the test error as a function of the batch size. Which batch size gives the minimum test error?
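The sketch below shows one common way to structure the mini-batch gradient-descent loop used in parts (g) and (i). It assumes a loss_and_grad(w, X, y)-style helper returning the averaged loss and its gradient, roughly matching the interface described in part (f); everything else (names, argument order, shapes) is a placeholder rather than the actual starter code.

    import numpy as np

    def train_logistic_sgd(X, y, loss_and_grad, eta=1e-6, batch_size=100, num_iters=5000, seed=0):
        """Mini-batch gradient descent sketch: sample a batch, step along the negative gradient."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        w = np.zeros(d)
        loss_hist = []
        for _ in range(num_iters):
            idx = rng.choice(N, size=min(batch_size, N), replace=False)  # random mini-batch
            loss, grad = loss_and_grad(w, X[idx], y[idx])                # assumed signature
            w = w - eta * grad                                           # gradient step
            loss_hist.append(loss)
        return w, loss_hist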
Problem 5 (Programming Exercise: Polynomial Regression)
In this exercise, you will work through linear and polynomial regression. Our data consists of inputs xn ∈ R and outputs yn ∈ R, n ∈ {1,..., N}, which are related through a target function y = f(x). Your goal is to learn a linear predictor hw(x) that best approximates f(x).
Code and data
• code: regression.py, Notebook.ipynb
• data: regression_train.csv, regression_test.csv
Visualization
As we learned last week, it is often useful to understand the data through visualizations. For this data set, you can use a scatter plot to visualize the data since it has only two properties to plot (x and y).
(a) Visualize the training and test data using the plot_data(...) function. What do you observe? For example, can you make an educated guess about the effectiveness of linear regression in predicting the data?
Linear Regression
Recall that linear regression attempts to minimize the objective function
J(w) = Σ_{n=1}^{N} (hw(xn) − yn)^2.
In this problem, we will use the matrix-vector form, where
y = (y1, y2, ..., yN)ᵀ,   X = (x1, x2, ..., xN)ᵀ,
and each instance xn = (1, xn,1, ..., xn,D)ᵀ.
For this dataset, the number of input features is D = 1.
Rather than working with this fully generalized, multivariate case, let us start by considering a simple linear regression model:
hw(x) = wᵀx = w0 + w1 x1
regression.py contains the skeleton code for the class Regression. Objects of this class can be instantiated as model = Regression(m), where m is the degree of the polynomial feature vector, so that the feature vector for instance n is (1, xn,1, xn,1^2, ..., xn,1^m)ᵀ. Setting m = 1 instantiates an object where the feature vector for instance n is (1, xn,1)ᵀ.
(b) Note that to take into account the intercept term (w0), we can add an additional “feature” to each instance and set it to one, e.g. xi,0 = 1. This is equivalent to adding an additional first column to X and setting it to all ones. Modify get_poly_features() in Regression.py for the case m = 1 to create the matrix X for a simple linear model.
(c) Before tackling the harder problem of training the regression model, complete predict() in Regression.py to predict y from X and w.
(d) One way to solve linear regression is through gradient descent (GD).
Recall that the parameters of our model are the wj values. These are the values we will adjust to minimize J(w). In gradient descent, each iteration performs the update
wj ← wj − η (∂J(w)/∂wj)   (simultaneously for all j),
where η is the learning rate (step size). With each step of gradient descent, we expect our updated parameters wj to come closer to the parameters that achieve the lowest value of J(w). (A sketch of the gradient-descent loop appears after this part.)
• As we perform gradient descent, it is helpful to monitor the convergence by computing the loss, i.e., the value of the objective function J. Complete loss_and_grad() to calculate J(w) and its gradient. Test your results by running the code in the main file Notebook.ipynb. If you implement everything correctly, you should get a loss of around 4 and a gradient of approximately [-3.2, -10.5].
We will use the following specifications for the gradient descent algorithm:
– We run the algorithm for 10,000 iterations.
– We will use a fixed step size.
• So far, you have used a default learning rate (or step size) of η = 0.01. Try different values η = 10^−4, 10^−3, 10^−1, and make a table of the coefficients and the final value of the objective function. How do the coefficients compare?
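As referenced above, here is a minimal sketch of the gradient-descent loop under these specifications (fixed step size, loss recorded at every iteration). It assumes a loss_and_grad(w, X, y)-style helper like the one completed in this part; the names and argument order are illustrative, not the actual Regression API.

    import numpy as np

    def gradient_descent(loss_and_grad, X, y, d, eta=0.01, num_iters=10_000):
        """Fixed-step-size gradient descent; records the loss at every iteration."""
        w = np.zeros(d)
        loss_hist = []
        for _ in range(num_iters):
            loss, grad = loss_and_grad(w, X, y)  # assumed helper: returns J(w) and its gradient
            w = w - eta * grad                   # update every coordinate simultaneously
            loss_hist.append(loss)
        return w, loss_hist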
(e) In class, we learned that the closed-form solution to linear regression is
w = (XᵀX)^(−1) Xᵀy.
Using this formula, you will get an exact solution in one calculation: there is no “loop until convergence” as in gradient descent. (A sketch of this computation follows part (e).)
• Implement the closed-form solution closed_form().
• What is the closed-form solution? How do the coefficients and the cost compare to those obtained by GD? How quickly does the algorithm run compared to GD?
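As referenced above, a hedged sketch of the closed-form computation from part (e). In practice, solving the normal equations with np.linalg.solve (or np.linalg.lstsq) is preferred over forming an explicit inverse; X and y are placeholders with the shapes defined earlier.

    import numpy as np

    def closed_form_weights(X, y):
        """Solve (X^T X) w = X^T y, equivalent to w = (X^T X)^(-1) X^T y but more stable."""
        return np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent alternative using least squares (handles rank-deficient X as well):
    # w, *_ = np.linalg.lstsq(X, y, rcond=None)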
Polynomial Regression
Now let us consider the more complicated case of polynomial regression, where our hypothesis is
hw(x) = wᵀφ(x) = w0 + w1 x + w2 x^2 + ... + wm x^m.
(f) Recall that polynomial regression can be considered as an extension of linear regression in which we replace our input matrix X with
Φ = (φ(x1), φ(x2), ..., φ(xN))ᵀ,
where φ(x) is a function such that φj(x) = x^j for j = 0, ..., m.
Update gen_poly_features() for the case when m ≥ 2. (One way to build these features is sketched below.)
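As referenced above, a minimal sketch of generating the polynomial feature matrix for a 1-D input, assuming a NumPy array x of shape (N,). This is an illustration of the feature map φ, not the required get_poly_features()/gen_poly_features() implementation.

    import numpy as np

    def poly_features(x, m):
        """Return the N x (m+1) matrix whose n-th row is (1, x_n, x_n^2, ..., x_n^m)."""
        x = np.asarray(x, dtype=float)
        return np.vander(x, N=m + 1, increasing=True)   # columns are x^0, x^1, ..., x^m

    # Example: m = 3 polynomial features for three inputs.
    print(poly_features([0.0, 1.0, 2.0], m=3))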
(g) For m ∈ {0, ..., 10}, use the closed-form solver to determine the best-fit polynomial regression model on the training data, and with this model, calculate the loss on both the training data and the test data. Generate a plot depicting how the loss varies with model complexity (polynomial degree); generate a single plot with both training and test error, and include this plot in your writeup. Which degree polynomial would you say best fits the data? Was there evidence of under/overfitting the data? Use your plot to justify your answer.
Regularization
Finally, we will explore the role of regularization. For this problem, we will use ℓ2-regularization, so that our regularized objective function is
J(w) = Σ_{n=1}^{N} (hw(xn) − yn)^2 + λ ‖w‖^2,
again optimizing over the parameters w.
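A hedged sketch of how the ℓ2 penalty changes the loss and gradient computation in part (h), assuming the unregularized objective is the sum of squared errors written above; the function name and argument order are illustrative, not the Regression class API.

    import numpy as np

    def regularized_loss_and_grad(w, X, y, lam):
        """Sum-of-squares loss plus lam * ||w||^2, together with its gradient."""
        residual = X @ w - y
        loss = np.sum(residual**2) + lam * np.sum(w**2)
        grad = 2 * X.T @ residual + 2 * lam * w
        return loss, grad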
(h) Modify loss_and_grad() to incorporate ℓ2-regularization.
(i) Use your updated solver to find the coefficients that minimize the error for a tenth-degree polynomial (m = 10) given regularization factor λ ∈ {0, 10^−8, 10^−7, ..., 10^−1, 10^0}. Now use these coefficients to calculate the (unregularized) loss on both the training data and the test data as a function of λ. Generate a plot depicting how the loss varies with λ (for your x-axis, let x = [1, 2, ..., 10] correspond to λ = [0, 10^−8, 10^−7, ..., 10^0] so that λ is on a logarithmic scale, with regularization increasing as x increases). Which λ value appears to work best?