代写SUMMATIVE ASSIGNMENT 2 – BUSI4AY15 Business Analytics代做回归

SUMMATIVE ASSIGNMENT 2 – BUSI4AY15

Business Analytics

Masters Programmes 2024/25

For this assignment, you will be provided a data set in Excel and an Excel answer sheet. At the bottom of this assignment, you will find a list of exercises to execute on the data set provided. You are to input your answers into your Excel answer sheet and submit it on Blackboard. 

SUBMISSION INSTRUCTIONS

A penalty will be applied for work uploaded after 11:59am as detailed in the Late submission policy.  You must leave sufficient time to fully complete the upload process before the deadline and check that you have received a receipt. At peak periods, it can take up to 30 minutes for a receipt to be generated.

FORMAT

You are to submit the Excel answer sheet with the file name adequately changed as instructed. You are not to alter the structure of the Excel answer sheet. The Excel file should be kept in the .xlsx format.

MARKING GUIDELINES

The number of marks carried by each question in the exercise is indicated clearly in this assignment.

PLAGIARISM AND COLLUSION

Note that your data set is unique to you, correspondingly, the answers that you will obtain will also be unique to you.

Students suspected of plagiarism, either of published work or the work of other students, or of collusion will be dealt with according to School and University guidelines.

SPECIFIC INSTRUCTIONS

1. You will be able to find your data set for this assignment in the “Data Sets” folder. All of the files in this folder are named “yourZnumber_Number1_Number2.xlsx”. You are to find the data set corresponding to your Z number. Download this data set. 

2. You are strongly advised to save a copy of the data set in a safe location, in the unlikely but potential event that you accidentally overwrite the data during the process of your analysis.

3. Note that all of your colleagues have been provided different datasets, and consequently will arrive at different correct answers for the assignment. As such, please ensure that you use the data set that corresponds to your Z number and not the data sets of any of your colleagues.

4. In the assignment folder, you will also find an Excel file that is labelled “yourZnumber_SA2.xlsx”. This is your answer sheet, which you use to fill your responses in for each question and which you will submit on Blackboard. Change yourZnumber in the file name to your Z number right now.

5. As your assignment is machine-graded, any minute error in your file name will render it unreadable by the machine and will lead to a complete loss of marks, so please ensure that the file name is correct.

6. The Excel answer file contains two columns. The first column lists the question numbers to which an answer is expected in the Excel answer file. In the second column, you are to key in your answer to the corresponding question there. Do not alter the structure of the Excel file, namely, do not add new rows or columns or key in any values outside of the demarcated area.

7. You will be required to key different types of answers in a specific form.

a. For numerical answers, please leave your answers with at least 3 significant figures (in other words, with at least 3 non-zero digits, e.g. you may reflect “12.345” as “12.3” and “0.01234” as “0.0123”). If your numerical answer is a whole number, leave them as such (e.g. you may leave “3” as “3” as opposed to “3.00” that is in 3 significant figures).

b. If you are required to report probabilities, for example, p-values, if the numerical value is smaller than 0.0001 or 1e-4, please report the value as 0. If you are asked to report probabilities or proportion or percentages, please reflect the value in decimals (e.g. report 78% as 0.78). Never reflect your answer as a fraction (e.g. for 1/3, instead use 0.333 in 3 significant figures).

c. For multiple choice questions in this assignment, feel free to leave your responses in any of large or small capitals.

8. Wrong answers have a potential to carry partial marks.

9. This assignment comprises a total of 43 questions, amounting to a total of 100 points.

ASSIGNMENT QUESTIONS

You will be able to download your data set from Blackboard. Your data set will contain a total of 600 data points. Please split the data set into a training and testing data set. The training data set will comprise the first 500 data points and the testing data set will comprise the last 100 data points. Unless explicitly stated that it is for testing, the training data set should always be used.

You are strongly advised to save a copy of the data set in a safe location, in the unlikely but potential event that you accidentally overwrite the data during the process of your analysis.

In this data frame, you will find 10 columns. The column ‘y’ is to be assumed as the outcome / dependent variable. The other columns labelled from ‘x1’ to ‘x9’ are to be assumed as the predictors / independent variables.  

To verify that you have downloaded the correct data set and split the training and testing data correctly, please calculate the mean and variance of y for both the training and testing data sets. In the file name of your data set, you see the following “yourZnumber_Number1_Number2_Number3_Number4.xlsx”. Numbers 1 and 2 would be the mean and variance for y in the training data set; and numbers 3 and 4 are the mean and variances for the testing data set.

Do not proceed if these numbers are not correct! Verify with your tutor if they are incorrect.

For this assignment, unless otherwise stated, we will use a significance level of 5%.

Question 1

1 point

Find the mean of ‘y’.

Question 2

3 points

Compute the correlations of all 9 predictors ‘x1’ to ‘x9’ against the outcome ‘y’. Which of these predictors has the strongest correlation with ‘y’? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.

Question 3

2 points

What is the consequence of answering Question 2 in the context of building a linear regression model?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A Predictors with a high correlation would indicate multi-collinearity and should not be included as predictors in the linear regression model.

B Predictors with a high correlation would indicate multi-collinearity and should  be included as predictors in the linear regression model.

C Predictors with a low correlation would indicate a weak relationship with the outcome ‘y’ and should not be included as predictors in the linear regression model.

D Predictors with a low correlation would indicate a weak linear relationship with the outcome ‘y’ and should still be included as predictors in the linear regression model, as we cannot rule out deeper non-linear relationships with the outcome.

Question 4

3 points

Build a linear regression model with ‘y’ as the outcome and all of the ‘x’ variables as the predictors. Let us call this Model A. Report the adjusted R-squared value.

Question 5

1 point

What percentage of the variance in ‘y’ is not explained by your current linear regression model in Question 4?

Question 6

1 point

What is the p-value for the variable ‘x4’ in your linear regression model?

Question 7

2 points

Depending on your answer for Question 6, what can you conclude about whether or not there is a relationship between the outcome variable ‘y’ and the feature ‘x4’?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A We do not have enough data or evidence to say that there is a relationship between outcome ‘y’ and the feature ‘x4’.

B We do not have enough evidence to say that there is definitely no relationship between outcome ‘y’ and the feature ‘x4’, but we can at least say that we do not have enough data to conclude that this relationship is linear.

C There is a linear relationship between outcome ‘y’ and feature ‘x4’.

D There is a relationship between outcome ‘y’ and feature ‘x4’, but we do not yet have enough data to conclude if this relationship is linear.

Question 8

3 points

Build a linear regression model with ‘y’ as the outcome, but with only the top 5 ‘x’ variables with the highest magnitude of correlation with ‘y’. Let us call this Model B. Report the adjusted R-squared value.

Question 9

2 points

Run a test that verifies if the R-squared value for Model B is significantly different from the R-squared value for Model A. Report the associated p-value for this test.

Question 10

2 points

Depending on your answer for Question 9, what can you conclude about whether or not it is a good idea to omit the ‘x’ variables with small magnitude of correlation with the outcome ‘y’?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A The p-value is small, thus indicating that there is no significant difference between the models. As such, we are justified in omitting the ‘x’ variables.

B The p-value is small, thus indicating that there is a significant difference between the models. As such, we are not justified in omitting the ‘x’ variables.

C The p-value is large, thus indicating that there is no significant difference between the models. As such, we are justified in omitting the ‘x’ variables.

D The p-value is large, thus indicating that there is no significant difference between the models. However, we should not rush to the conclusion that these ‘x’ variables play no part in the outcome ‘y’.

Question 11

1 point

You would notice that there is an ‘x’ variable that has a negative correlation with the outcome ‘y’ and with magnitude at least 0.2. Which variable is that? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.

Question 12

3 points

Using your analysis from Model A, plot a scatter plot of the residuals against the variable you identified in Question 11. What can you conclude from the plot?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A There are non-linearities in the model.

B There is multi-collinearity in the model.

C There is an interaction effect involving this variable in the model.

D None of the above assumptions of linear regression has been violated.

Question 13

3 points

In your data set, create a new variable that is the exponential of the variable you identified in Question 11. Run a linear regression model with all the ‘x’ variables plus this variable. Let us call this Model C. Report the adjusted R-squared value.

Question 14

1 point

Report the p-value for this new variable in Model C.

Question 15

2 points

Run a test that verifies if the R-squared value for Model C is significantly different from the R-squared value for Model A. Report the associated p-value for this test.

Question 16

1 point

Report the p-value for the variable you identified in Question 11 in Model C.

Question 17

1 point

Based on your answers from Questions 14 to 16, what can you conclude?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A Adding the new variable led to a significantly better model. However, we should not keep the original variable from Question 11, but keep only the new variable in the regression model.

B Adding the new variable led to a significantly better model. Moreover, both the original variable from Question 11 and the new variable should be retained in the regression model.

C Adding the new variable did not lead to a significantly better model. However, we should keep the new variable in the regression model.

D Adding the new variable did not lead to a significantly better model. Moreover, we should delete the variable in Question 11 and the new variable from the regression model.

Question 18

3 points

Plot two scatter plots as follows:

(I) The outcome ‘y’ against the predictor ‘x9’ only for the data points where ‘x1’ is larger or equal to 0

(II) The outcome ‘y’ against the predictor ‘x9’ only for the data points where ‘x1’ is less than 0

Compare the two scatter plots. What can you conclude from them?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A There are non-linearities in the model.

B There is multi-collinearity in the model.

C There is an interaction effect involving this variable in the model.

D None of the above assumptions of linear regression has been violated.

Question 19

5 points

Create a new variable that addresses the problem in Question 18. Now, modify Model C to include this new variable. Let us call this Model D. Report the adjusted R-squared value.

Question 20

2 points

Run a test that verifies if the R-squared value for Model D is significantly different from the R-squared value for Model C. Report the associated p-value for this test.

Question 21

3 points

Compute the correlations of all the other 8 predictors against the predictor ‘x8’. Which of these predictors has the strongest correlation with ‘x8’? Your answer should be in the format: ‘x<a number>’, e.g. ‘x1’.

Question 22

1 point

What might Question 21 be indicative of?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A There are non-linearities in the model.

B There is multi-collinearity in the model.

C There is an interaction effect involving this variable in the model.

D None of the above assumptions of linear regression has been violated.

Question 23

5 points

Using a more precise test than correlations, re-perform. the analysis in Question 21 using Model D. If the cut-off for the test is set at 5, how many variables should be omitted from the model?

Question 24

4 points

Using your analysis from Question 23, modify Model D to arrive at a new model, but keeping variable ‘x8’. Let us call this Model E. At this point, remove all non-significant variables from Model E. You should be left with 5 predictors. Report the adjusted R-squared value.

Question 25

2 points

Run a test to verify that Model E is indeed better than Model D. What can you conclude from your test?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A Removing the variables led to a significantly worse model, because the p-value of the test was significant.

B Removing the variables led to a significantly worse model, because the p-value of the test was not significant.

C Removing the variables did not lead to a significantly better model, because the p-value of the test was significant.

D Removing the variables did not lead to a significantly better model, because the p-value of the test was not significant.

Question 26

2 points

On the training data set, calculate the root mean squared error for Model E.

Question 27

2 points

On the testing data set, calculate the root mean squared error for Model E.

Checkpoint

Before proceeding onto the next part of this exercise, verify that you are now indeed left with just 5 predictors in Model E. These might not be ‘x1’ to ‘x9’ and might be some transformation of them. If you are not left with 5 predictors, you might want to verify your working. Additionally, you might want to remove any non-significant variables that are still left in Model E.

Copy the 5 predictors in Model E into a new sheet. To this new sheet, create a new variable that is 1 if ‘y’ is greater or equal to 0 and 0 if ‘y’ is less than 0. From now on, we will refer to this new binary variable as ‘z’, and it will be treated as the outcome variable.

Question 28

3 points

Build a logistic regression model with ‘z’ as the outcome variable and these 5 predictors. Call this Model F. Report the AUC.

Question 29

2 points

What is the significance of the value in Question 28?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A It represents the correlation between the outcome ‘z’ and 5 predictors, so the larger the value the better the model.

B It should not be less than 0.5, otherwise the model is no better than predicting every point as yes or no.

C It is an objective measure of the goodness of the model without having to resort to a testing data set; a high value will guarantee good performance even in a testing data set.

D A high value shows that there is a high trade-off between sensitivity and specificity, indicating that the model is predicting poorly.

Question 30

1 point

Set the cut-off probability to 0.5. What is the in-sample accuracy of Model F?

Question 31

1 point

Set the cut-off probability to 0.5. What is the out-of-sample accuracy of Model F?

Question 32

2 points

Based on your answer to Questions 30 and 31, what can you conclude about Model F?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A Model F seems to generalize out-of-sample as its out-of-sample accuracy is comparable to its in-sample accuracy.

B Model F does not generalize out-of-sample as its out-of-sample accuracy is not comparable to its in-sample accuracy.

C Model F seems to generalize out-of-sample as its out-of-sample accuracy is higher than its in-sample accuracy.

D Model F does not generalize out-of-sample as its out-of-sample accuracy is lower than its in-sample accuracy.

Question 33

1 point

Set the cut-off probability to 0.5. What is the in-sample recall of Model F?

Question 34

1 point

Set the cut-off probability to 0.5. What is the in-sample specificity of Model F?

Question 35

2 points

If the cut-off probability is changed to 0.4, how do we expect the answers to Questions 33 and 34 to change?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A The recall will increase and the specificity will decrease.

B The recall will increase and the specificity will increase.

C The recall will decrease and the specificity will decrease.

D The recall will decrease and the specificity will increase.

Question 36

5 points

In this example, we verify what happens if we use Model E to solve the classification problem of whether or not outcome ‘y’ is greater or equal to 0.

With Model E, form. predictions according to your final model with the same 5 predictors as in Model F on the testing data set. If the outcome or prediction was greater or equal to 0, then we record that as a positive, and vice versa. Based on this, form. a confusion matrix for Model E.

Report the out-of-sample accuracy for Model E.

Question 37

3 points

Compare your answers between Questions 31 and 36. What can you conclude?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A The accuracy for the linear regression model is higher because it is forms continuous predictions, which enables the decision maker to better draw the line dividing the regions where outcome ‘y’ is larger than 0 or not.

B The accuracy for the logistic regression model is higher because it solves the simpler problem of simply deciding if outcome ‘y’ is larger than 0 or not, rather than predicting the actual value of ‘y’. Thus, it can afford to be more precise.

C There is numerically a difference, but this difference is not significant, as both models use the same underlying linear structure to form. predictions and should therefore arrive at similar results.

D There is no reasonable way to conclude on the differences between these two models; it depends on how the predictors relate to the outcome.

Question 38

2 points

Build a classification tree model with ‘z’ as the outcome variable and the same 5 predictors. Call this Model G. Use default parameters of depth = 4, cp = 0.01, minsplit = 20, minbucket = 5. Report the out-of-sample accuracy.

Question 39

3 points

Compare your answers between Questions 31 and 38. What can you conclude?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A We can conclusively decide which model is better based on the accuracy, because this is an out-of-sample statistic.

B Even if the regression tree model posted a higher accuracy, it is too hasty to conclude that it has a better performance than the logistic regression model as one could potentially shift the cut-off probabilities in the latter to arrive at a better model.

C Even if the regression tree model posted a lower accuracy, it is too hasty to conclude that it has a poorer performance than the logistic regression model as one could still change the parameters of the tree.

D It is unreasonable to compare the accuracies as they arise from different models.

Question 40

2 points

Find the leaf that has a proportion of class ‘1’ that is closest to 50%. Report this proportion.

Question 41

3 points

What is the significance of your answer to Question 39?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A This proportion forms our benchmark. Our out-of-sample accuracy should minimally beat this proportion.

B If this proportion is very close to 50%, then we should reduce the depth of the tree so that the data points in this leaf can be diluted with other leaves.

C If this proportion is very far away from 50%, then this might be a sign of overfitting in the tree.

D If this proportion is very close to 50%, then it indicates the presence of leaves where the model is very much undecided on the correct outcome and might indicate poor performance.

Question 42

5 points

Build a hierarchical clustering model with the 5 predictors in Models E, F and G using the training data set. Use the setting of 4 clusters.

Using only training data points in the largest cluster, build a logistic regression model with the 5 predictors and outcome ‘z’. Call this Model H. Report the AUC.

Question 43

3 points

How would you expect your answer to Question 42 to differ from Question 28?

Choose the most appropriate answer from the following responses and key in one of A, B, C or D into your Excel file.

A The AUC in Model H is higher than the AUC in Model F, because it has fewer data points.

B The AUC in Model H is higher than the AUC in Model F, because the data points used in Model H belong to the same cluster, thus exhibit a more consistent relationship.

C The AUC in Model H is lower than the AUC in Model F, because it has fewer data points.

D The AUC in Model H is lower than the AUC in Model F, because the way data points are selected into Model H is not based on the outcome variable ‘z’.

 

热门主题

课程名

mktg2509 csci 2600 38170 lng302 csse3010 phas3226 77938 arch1162 engn4536/engn6536 acx5903 comp151101 phl245 cse12 comp9312 stat3016/6016 phas0038 comp2140 6qqmb312 xjco3011 rest0005 ematm0051 5qqmn219 lubs5062m eee8155 cege0100 eap033 artd1109 mat246 etc3430 ecmm462 mis102 inft6800 ddes9903 comp6521 comp9517 comp3331/9331 comp4337 comp6008 comp9414 bu.231.790.81 man00150m csb352h math1041 eengm4100 isys1002 08 6057cem mktg3504 mthm036 mtrx1701 mth3241 eeee3086 cmp-7038b cmp-7000a ints4010 econ2151 infs5710 fins5516 fin3309 fins5510 gsoe9340 math2007 math2036 soee5010 mark3088 infs3605 elec9714 comp2271 ma214 comp2211 infs3604 600426 sit254 acct3091 bbt405 msin0116 com107/com113 mark5826 sit120 comp9021 eco2101 eeen40700 cs253 ece3114 ecmm447 chns3000 math377 itd102 comp9444 comp(2041|9044) econ0060 econ7230 mgt001371 ecs-323 cs6250 mgdi60012 mdia2012 comm221001 comm5000 ma1008 engl642 econ241 com333 math367 mis201 nbs-7041x meek16104 econ2003 comm1190 mbas902 comp-1027 dpst1091 comp7315 eppd1033 m06 ee3025 msci231 bb113/bbs1063 fc709 comp3425 comp9417 econ42915 cb9101 math1102e chme0017 fc307 mkt60104 5522usst litr1-uc6201.200 ee1102 cosc2803 math39512 omp9727 int2067/int5051 bsb151 mgt253 fc021 babs2202 mis2002s phya21 18-213 cege0012 mdia1002 math38032 mech5125 07 cisc102 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 math21112 efim20036 mn-3503 fins5568 110.807 bcpm000028 info6030 bma0092 bcpm0054 math20212 ce335 cs365 cenv6141 ftec5580 math2010 ec3450 comm1170 ecmt1010 csci-ua.0480-003 econ12-200 ib3960 ectb60h3f cs247—assignment tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 math350-real math2014 eec180 stat141b econ2101 msinm014/msing014/msing014b fit2004 comp643 bu1002 cm2030
联系我们
EMail: 99515681@qq.com
QQ: 99515681
留学生作业帮-留学生的知心伴侣!
工作时间:08:00-21:00
python代写
微信客服:codinghelp
站长地图