ITEC 320
FINAL EXAM (PRACTICE)
Part 1: Multiple Choice Questions
1. When applied to new data points, logistic regression provides a column in the RapidMiner output called “Confidence(1).” What does the number in that column tell us?
A) The probability that the new data point is similar to what we’ve observed in the original dataset
B) The probability that the outcome for the new data point will be 1
C) The accuracy of the logistic regression model
D) The probability that logistic regression was the correct model
2. When comparing different predictive methods for numeric outcomes, how do we determine which is the most accurate?
A) Select the method with the highest root mean squared error
B) Select the method with the lowest root mean squared error
C) Select the method with the highest classification accuracy
D) Select the method with the lowest classification accuracy
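For reference when reviewing this question, the following is a minimal sketch of how root mean squared error could be computed for two competing methods. The function name `rmse` and all of the data values are hypothetical and chosen only for illustration; they are not taken from the course dataset or from RapidMiner.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only.
actual = [10.0, 12.0, 14.0]
method_x = [11.0, 12.0, 13.0]  # predictions from one method
method_y = [13.0, 9.0, 17.0]   # predictions from another method

print("Method X RMSE:", rmse(actual, method_x))
print("Method Y RMSE:", rmse(actual, method_y))
```

Because RMSE is expressed in the same units as the outcome, the two printed numbers can be compared directly.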
3. A binary independent variable called SpecialOrder in a linear regression model for predicting ProcessingTime of orders (measured in days) has a coefficient of 3.36. What does that number mean?
A) Each additional special order leads to an average increase in processing time of 3.36 days.
B) The level of significance of SpecialOrder is 3.36.
C) Special orders require an average of 3.36 days to process.
D) On average, special orders have processing times that are 3.36 days longer than regular orders.
4. When trying to figure out what predictive method will work best, all of the following are benefits of using cross validation EXCEPT:
A) Cross validation is often the best predictive method.
B) Cross validation enables each method to produce the same accuracy or error metric.
C) Cross validation provides measures of predictive accuracy rather than measures of fit.
D) Cross validation helps prevent overfitting.
5. The table below shows the performance of a classification model on our dataset. What percentage of the model’s “1” predictions turned out to be correct?
A) 74.88%
B) 38.53%
C) 23.86%
D) 5.25%
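For reference when reviewing this question, the share of "1" predictions that turned out to be correct can be sketched as below. The function name and the counts are hypothetical, since the performance table itself is not reproduced here; substitute the counts from the table's "predicted 1" row.

```python
def share_of_correct_one_predictions(true_positives, false_positives):
    """Fraction of the model's "1" predictions whose actual outcome was 1."""
    predicted_ones = true_positives + false_positives
    if predicted_ones == 0:
        raise ValueError("the model made no '1' predictions")
    return true_positives / predicted_ones


# Hypothetical counts, for illustration only.
tp = 30  # predicted 1, actually 1
fp = 10  # predicted 1, actually 0

print(f"{share_of_correct_one_predictions(tp, fp):.2%}")
```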
6. Which operator in RapidMiner should be used to create a forecasting model for the time series shown in this line chart?
A) Exponential Smoothing
B) Apply Forecast
C) Holt-Winters
D) Decision Tree
Part 2: Problems
1. (10 pts.) Why is it better to use a 5-period moving average to make predictions than it would be to either A) use the most recent value as your prediction, or B) use the average value for the whole time series as your prediction?
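For reference when reviewing this question, the three forecasting rules it compares can be sketched as follows. The function names and the series values are hypothetical and serve only to show how each rule produces its next-period prediction.

```python
def naive_forecast(series):
    """Use the most recent value as the prediction."""
    return series[-1]


def overall_mean_forecast(series):
    """Use the average of the whole time series as the prediction."""
    return sum(series) / len(series)


def moving_average_forecast(series, window=5):
    """Use the average of the last `window` values as the prediction."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical time series, for illustration only.
series = [20, 22, 21, 23, 30, 31, 29, 32, 30, 38]

print("Most recent value:      ", naive_forecast(series))
print("Whole-series average:   ", overall_mean_forecast(series))
print("5-period moving average:", moving_average_forecast(series))
```

Comparing how each printed value responds to the last observation and to the older part of the series may help in framing an answer.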
2. (10 pts.) The classification tree below is used to predict whether or not a charity’s request for donations by mail will be successful (indicated by a 1). The following independent variables are used:
previous_donor: a binary variable equal to 1 if the person has given to this charity before, and 0 if not
months_since_last_donation: for previous donors, the number of months since their last donation
income: the average household income of the person’s neighborhood
a) (5 pts.) Does the classification tree predict that the following person will donate?
previous_donor = 1
months_since_last_donation = 6
income = $127,500
b) (5 pts.) Briefly (1-2 sentences) explain the logic that this tree is using to make predictions.
3. (15 pts.) A publishing company is analyzing a dataset of its published books to try to figure out characteristics of a book that make it more or less likely to become a bestseller. They have run a logistic regression model using four of these attributes as independent variables, and obtained the following results (the dependent variable is 1 if the book was a bestseller, and 0 if it was not):
a) (5 pts.) Which two of these attributes were significant?
b) (5 pts.) If a book has lots of action verbs, what effect does that have on the estimated probability that the book will be a bestseller?
c) (5 pts.) What does this logistic regression output tell us about the effect of the length of the book (in pages) on the probability that the book will be a bestseller?
4. (25 pts.) This problem is based on analysis of a dataset from a non-profit called Connect the Planet, which aims to develop infrastructure and help individual countries plan to improve their citizens’ internet access. They believe that the two primary factors associated with a country’s internet usage are its economic productivity (GDP per capita) and its adult literacy rate, and are trying to develop a predictive model to capture these relationships. The attribute being predicted is the country’s number of frequent internet users per 100 people.
a) (5 pts.) The screenshot below shows the subprocess within the Cross Validation operator. Why are we getting an error? What needs to be done to fix it?
b) (5 pts.) After fixing the issue from part a, we ran the process and got this result:
What does that 13.626 number mean (conceptually, not mathematically), and what should we do with it?
We have created the following linear regression model using this dataset, used in the next two questions:
c) (5 pts.) What is the relationship between a country’s adult literacy rate and its number of frequent internet users per 100 people?
d) (5 pts.) This regression model would predict that a country with a per capita GDP of $0 and an adult literacy rate of 0% would have -24.331 frequent internet users per 100 people. Why does it give us an obviously incorrect prediction?
e) (5 pts.) RapidMiner’s linear regression output omits several pieces of information that we get when using Excel. Identify one such number, and explain what it means.
5. (15 pts.) This problem is based on a telecom company’s dataset containing all of its mobile plan customers from last month whose plans were due to expire at the end of the month. The dataset includes, for each customer, the monthly cost of the customer’s plan (in $), the total quantity of data the customer used last month (in GB), and a binary variable indicating whether or not the customer still has a mobile plan with the company (1=Yes, 0=No). If a customer still has a mobile plan with the company, it means that either they renewed their previous plan, or they changed to a different plan. The company would like to be able to predict more accurately which customers are likely to remain and which are likely to leave.
a) (5 pts.) We ran cross validation using k-nearest neighbors with k=5, k=10, and k=30. The overall accuracies of the models were:
k = 5: 68.04%
k = 10: 72.33%
k = 30: 73.46%
Of these three models, which is best at predicting whether customers will stay?
The company applied one of the models from the previous question to five customers whose plans are due to expire soon, and obtained the results shown below, used in parts b and c:
b) (5 pts.) How many of the five customers does the model predict will stay?
c) (5 pts.) A manager at the company believes that customers are likely to leave if they have low-cost plans and high data usage, because the company slows down these customers’ download speeds once their data usage exceeds a given threshold. Do the results from applying k-nearest neighbors to these five customers support the manager’s claim? Why or why not?