Assignment 3
Big Data and Machine Learning for Economics and Finance
Submission Rules:
Provide an HTML document that is generated by R Markdown and that contains
the R code,
the R output,
and your comments on the output.
Comment each line of your R code as well. Give thorough explanations throughout.
Please note that the function set.seed() may not be used at any time in the assignment.
Please note that, when providing your answers, you may not use any extra packages other than the ones explicitly mentioned in each exercise. For example, if the question says "the only extra packages allowed are ISLR2 and boot", then you may type library(ISLR2) and library(boot) when writing your answers to the questions in that exercise, but you may not type library(MASS) or library(any other package) anywhere in your submission.
When asked to carry out a certain task (such as, for example, fitting a certain model or running a certain algorithm), it must be determined first whether that task is feasible or not, and when feasible, whether it can be carried out exactly as prescribed in the question or whether it can only be approximately carried out.
Exercise 1. (40 points) The only extra packages allowed in this exercise are tree and boot. Consider the following data generating mechanism: X is a uniform random variable on the interval [0;100] and
Y = 1{30 > X + U} + 1{X + U > 90}
where U is a standard normal random variable that is independent of X.
Assuming that X is the input variable and Y is the output variable, we are interested in comparing predictions from classification trees with ones based on Logistic Regressions.
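As an illustration of the mechanism above, the following minimal sketch draws one sample using base R only. The function name simulate_dgp1 and the column names x and y are hypothetical choices made for this sketch, not part of the assignment, and no seed is set, so every run gives a different sample.

```r
# Hypothetical helper (name not from the assignment): draw n observations of (X, Y).
simulate_dgp1 <- function(n) {
  x <- runif(n, min = 0, max = 100)             # X ~ Uniform on [0;100]
  u <- rnorm(n)                                 # U ~ N(0, 1), drawn independently of X
  s <- x + u                                    # noisy index X + U used by both indicators
  y <- as.integer(s < 30) + as.integer(s > 90)  # Y = 1{30 > X+U} + 1{X+U > 90}
  data.frame(x = x, y = y)                      # return the sample as a data frame
}

sample1 <- simulate_dgp1(1000)                  # sample size used in part 1
table(sample1$y)                                # counts of Y = 0 and Y = 1
```

Because the events {X+U < 30} and {X+U > 90} cannot occur together, Y only takes the values 0 and 1, with Y = 1 at both ends of the range of X.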
1. Generate a sample of size n=1000 from that model.
a. Using R, produce a scatterplot of X vs. Y.
b. Produce another plot representing the different observations of X, where each of the observations is given a different color depending on the value of Y. Are the colors separable using a hyperplane?
c. We are interested in giving predictions for Y when x=10, 50 or 90 using a classification tree. After fitting a tree to the data, show how to give predictions both using the function predict, and by arguing based on a graphical representation of the tree.
d. Run logistic regression and give predictions for the same 3 values of x.
e. Compare the prediction performance of the two methods.
2. Attempt to reproduce the results in the following figure using R.
Figure 1. Monte Carlo Experiment
3. Based on your knowledge, examine all classification methods learned in this course and establish which methods would perform well on samples drawn using the data generating mechanism described in this exercise.
Make a table where on the left you write down the name of the method(s), and on the right you explain if you believe it would perform well while justifying your answer. Your answer and justification for each method should not be more than 10 words long. Please note that this part of the exercise should not be answered with any R coding.
Exercise 2. (30 points) The only extra packages allowed in this exercise are tree and boot. An applied data analyst is interested in assessing the performance of supervised learning when applied to the following data generation scheme
Z = Y^2 + U
X = exp(Z) + V
where Y, U and V are independent standard normal random variables.
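The scheme above can be sketched in base R as follows. This sketch reads the first equation as Z = Y^2 + U, which is an assumption about the intended exponent; the function name simulate_dgp2 and the column names are likewise hypothetical. No seed is set, so results vary from run to run.

```r
# Hypothetical helper (name not from the assignment): draw n observations of (Y, X).
simulate_dgp2 <- function(n) {
  y <- rnorm(n)             # Y ~ N(0, 1), the input variable
  u <- rnorm(n)             # U ~ N(0, 1), independent of Y
  v <- rnorm(n)             # V ~ N(0, 1), independent of Y and U
  z <- y^2 + u              # Z = Y^2 + U (assumed reading of the first equation)
  x <- exp(z) + v           # X = exp(Z) + V, the output variable
  data.frame(y = y, x = x)  # return the sample as a data frame
}

sample2 <- simulate_dgp2(10000)  # sample size used in part 1
summary(sample2$x)               # X is heavily right-skewed because of exp(Z)
```

Under this reading, X depends on Y only through Y^2, so the relationship between the two variables is symmetric around y = 0.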
1. Generate a sample of size n=10000 from that model. Assuming that Y is the input variable and X is the output variable, we are interested in comparing CART and a Generalized Linear Model.
a. Construct a tree and show how to use it in order to make a prediction for X when y=1. Use both the predict function and a plot of the tree to make the prediction.
b. Run a generalized linear model and use it to give a prediction for X corresponding to the same value of the input variable as in the previous question.
c. Compare the prediction performance of the two methods.
2. Another applied data analyst looks at a sample of size n generated from the same data generation scheme and concludes that a supervised learning prediction exercise does not make sense as the X and Y variables are seemingly uncorrelated.
a. Using the bootstrap, show whether the applied data analyst is correct in their conclusion regarding the correlation.
b. Do you believe that the applied data analyst is right in believing that a supervised learning exercise does not make sense in this particular case?
3. Based on your knowledge, examine all supervised learning methods learned in this course and establish which methods would perform well on samples drawn using the data generation scheme described in this exercise.
Make a table where on the left you write down the name of the method(s), and on the right you explain if you believe it would perform well while justifying your answer. Your answer and justification for each method should not be more than 10 words long.
Please note that this part of the exercise should not be answered with any R coding.
Exercise 3. (30 points) No extra packages are allowed in this exercise.
1. Consider the following dendrogram:
Figure 2. 10 observations
We are interested in clustering the data into three groups. What are the three groups obtained from the dendrogram? Which group contains the closest two points in the sample? Is the dendrogram well balanced?
2. Consider the following scatter plot representing the data on two variables X1 and X2:
Figure 3. 4 observations
Going from left to right, the same 4 observations have the following values for a third variable Y=(1;2;3;0).
a. If Y is considered as the output variable in a supervised learning setting and (X1;X2) are considered the input variables, would this be a classification or a regression task?
b. If we were to fit a single-split tree stump to this dataset, how many possible configurations are there?
c. Construct the optimal tree stump.