MATHS 7107 Data Taming
Assignment 3
Trimester 3 2025
1 Background
Source: Bing Copilot.
The South Australian Tourism Council was happy with your work on their previous problem about magpie swooping for their new advertising campaign, so they’d like you to help with the next part of their project. Based on extensive research they have decided to target their advertising campaign at the public in Iceland, as they believe that Icelanders are the least likely to already know about the danger of Australian magpies.
They have invited the Icelandic Prime Minister Sigmundur Olafsson to visit South Australia this coming November, and bring his whole family (all expenses paid). It turns out that the Prime Minister’s 8-year-old son Thor is quite a precocious (some might say “spoilt”) child, who is extremely popular with the Icelandic population. He is also an aspiring Olympic cyclist, and he will be doing some cycling during their trip. He insists on cycling through parklands by himself at exactly 5am Reykjavik time (he refuses to change the time on his TAG Heuer Carrera Plasma when travelling). If Thor were to be swooped while riding, it would be a disaster for the Tourism Council, and possibly even trigger a diplomatic incident between Australia and Iceland.
The Director of the Tourism Council is somewhat concerned about this, since it is known that Australian spring (1st September – 30th November) is when the magpies are at their most aggressive. But the Prime Minister’s schedule is so full that he cannot come at any other time. So the Council has been collecting data from several different areas around Australia. They arranged for cameras to be set up at various locations to monitor people going past, and record if they were swooped by a magpie. The cameras recorded data between 6am and 6pm each day. The video from the cameras was fed into some image recognition software to determine some facts about the person who went past. This data is collected in 4 different data sets.
Using this data, try to help the Director determine if dear little Thor is likely to be swooped by a magpie. As with your previous report, the CIO uses R and R Markdown, and even completed Data Taming in the past. So make sure you only use commands from the course, so that the CIO can easily see what analysis you’ve done. In your R Markdown code chunks, make sure that you do not set echo = FALSE, so that they can see what R code you used to generate your output. But of course, they don’t want to see irrelevant warnings or messages, so you need to make sure they are suppressed.
But remember that your report is for the Director of the Tourism Council, who is not really a technical person, and who certainly doesn’t know R. So make sure you include descriptions allowing the average person to understand what you are doing and what the output means.
1.1 Number of digits
When writing your own text, or USING the output from R:
• For integer results, report the whole integer.
• For non-integers with absolute value > 1: use 2 decimal places
• For non-integers with absolute value < 1: use 3 significant figures.
For example:
◦ 135.5681 ≈ 135.57
◦ −0.0004586 ≈ −0.000459
Exceptions:
• If you’re just PRINTING the output from R, then just keep the output as it is.
– But if you have R do the rounding for you then you need to conform to these two conventions listed above.
• If your data has fewer digits of precision than specified above (e.g. because of the way it was stored in the original data, or because of the way it was calculated) then only report that level of precision.
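The two rounding conventions above can be sketched in base R as follows. The helper name report_number is hypothetical and is not part of the course material; it only wraps the standard round() and signif() functions.

```r
# Hypothetical helper: apply the reporting conventions to a single number.
report_number <- function(x) {
  if (x == round(x)) {
    x                 # integers are reported whole
  } else if (abs(x) > 1) {
    round(x, 2)       # two decimal places
  } else {
    signif(x, 3)      # three significant figures
  }
}

big   <- report_number(135.5681)    # 135.57
small <- report_number(-0.0004586)  # -0.000459
whole <- report_number(42)          # 42
```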
2 The data
The company has four datasets labelled council 0 .csv, ..., council 3 .csv. Each dataset contains 7 columns:
• SWOOPED: If the person in the video was swooped by a magpie this is a 1, otherwise 0.
• Age: The estimated age of the person. This is estimated by the software and uses machine precision.
• Date: The date the video was recorded.
• Mins: The number of minutes after 06:00 that the person was recorded. The minutes are rounded to the nearest 15 seconds.
• LOCATION: The location the video was recorded, either in a park or on a street.
• MOT: This is the mode of transport that the person in the video was using. The modes were categorised into one of: walk, run, bicycle, other (which includes unicycles).
• Group: A “yes” or “no” recording whether the person was in a group or not.
Each dataset has data on 40,000 potential swoop incidents. Luckily, the data itself has already been cleaned and so there should not be any missing or erroneous rows in the data. (If you do detect any errors in the data, then let the Tourism Council know immediately, so that their data cleaner can be fired.)
3 Your job
To help the Tourism Council, we will analyse the data of swoops, and then make a prediction about how dear little Thor will fare on his ride.
Note
Make sure you write text to explain what you are doing at each point and why you are doing it. You need to justify all the things you do or claim. Also describe the results. This report is for the Director, so aim your explanations at the average person and avoid jargon wherever possible.
1. Load the correct dataset and save it as a tibble. Output the first 10 lines of the dataset and the dimensions of the data set.
2. Using dot points, identify what types of variables we now have in our data set, i.e., “Quantitative Discrete”, “Quantitative Continuous”, “Categorical Nominal”, “Categorical Ordinal”. (Don’t just describe what data type they are in the data set — you need to think about the type of variable in the context of the meaning of the data.) Make sure you provide some justification for your choice of variable types.
• Don’t just provide vague statements, but be very concrete about describing this particular set of data.
3. Now it’s time to tame our data. But since we are going to fit a logistic regression model, we need to modify our requirements a little bit.
• Tame all variable names.
• Convert the status of being swooped to a <fct> data type, with y for 1 and n for 0.
• Treat the age and number of minutes as quantitative continuous variables. (This is because we want to fit some geometric objects, which assumes that the predictors are continuous.)
• If you have identified any Categorical Ordinal variables, store them as a .
• Make the remaining variables conform to the Tame Data conventions in Module 2 (page 3). Output the first 10 rows of your data and the dimensions of the data set.
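As a rough illustration of the recoding step, the sketch below uses base R on a made-up four-row data frame. The values are invented, lower-case names are only an assumption about what "tame" names look like, and the course's own tidyverse commands may differ.

```r
# Toy data with made-up values; the real data has more columns and rows.
toy <- data.frame(SWOOPED = c(1, 0, 0, 1), Age = c(34, 8, 61, 25))

# Assumption: tame names are lower case.
names(toy) <- tolower(names(toy))

# Recode 0/1 as a factor with labels n (for 0) and y (for 1).
toy$swooped <- factor(toy$swooped, levels = c(0, 1), labels = c("n", "y"))

str(toy)
```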
4. (a) Replace the date column with a new column called season, containing the elements “summer”, “winter”, “autumn” and “spring”, i.e. it should go in the same position as the date column. (Use the month() command to do this.)
(b) Describe what type of variable this column represents (“Quantitative Discrete”, “Quantitative Continuous”, “Categorical Nominal”, “Categorical Ordinal”). Is the data type correct in the tibble? (Explain your answer.) If not, make sure you change it.
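One possible way to map a month number to a season is sketched below in base R. The dates are invented and assumed to be in ISO format. Spring is September to November as stated in the Background, while the other three seasons follow the usual Australian convention, which is an assumption here. The month number is extracted with format() so that the sketch runs without extra packages; the month() command named in the question is expected to supply the same number.

```r
# Invented example dates, assumed to be in ISO (yyyy-mm-dd) format.
dates <- as.Date(c("2024-01-15", "2024-04-02", "2024-07-30",
                   "2024-09-01", "2024-12-25"))

# Month number (1 to 12) for each date.
month_number <- as.integer(format(dates, "%m"))

# Hypothetical helper: Southern Hemisphere seasons by month number.
season_of <- function(m) {
  ifelse(m %in% c(12, 1, 2), "summer",
  ifelse(m %in% 3:5,         "autumn",
  ifelse(m %in% 6:8,         "winter", "spring")))
}

season <- season_of(month_number)
season
```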
5. Setting the correct seed, split your data into a training set (with 33,000 rows) and a testing set, with the remaining rows. Output the first 10 lines of each dataset and the dimensions of each data set.
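The idea of a seeded split can be sketched in base R on a toy data frame of 40 rows. The seed 2025 and the 33/7 split are placeholders chosen for illustration only; the course may use a dedicated splitting command instead of sample().

```r
set.seed(2025)  # placeholder seed, not the seed required by the assignment

# Toy data: 40 rows standing in for the 40,000 in the real data.
toy <- data.frame(id = 1:40, x = rnorm(40))

# Randomly choose which rows go into the training set.
n_train    <- 33
train_rows <- sample(nrow(toy), n_train)

train <- toy[train_rows, ]
test  <- toy[-train_rows, ]

dim(train)
dim(test)
```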
6. Fit a logistic regression model to your training data, with the swooped status as the response and all other variables as the predictors. (Just use them individually, don’t include any interaction terms.) Output the summary of the model.
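A minimal sketch of a main-effects logistic regression using glm() on simulated data is given below. The data, the coefficient values used to simulate it, and the variable names are all invented. With a factor response, glm() treats the first level (here n) as failure and models the probability of the other level.

```r
set.seed(1)
n <- 200

# Simulated predictors (invented values).
toy <- data.frame(
  age      = runif(n, 5, 80),
  location = factor(sample(c("park", "street"), n, replace = TRUE))
)

# Simulated response: log-odds depend on age and location.
log_odds    <- -2 + 0.03 * toy$age + 0.8 * (toy$location == "park")
toy$swooped <- factor(ifelse(runif(n) < plogis(log_odds), "y", "n"),
                      levels = c("n", "y"))

# Response against all other columns, with no interaction terms.
fit <- glm(swooped ~ ., data = toy, family = binomial)
summary(fit)
```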
7. Since we are using general linear models, the model summary in Question 6 describes some geometric objects, where the dimension of each geometric object is determined by the number of quantitative continuous predictors.
(a) What sort of geometric objects do we have in our model in Question 6?
(b) How many of these objects are described by the model in Question 6?
(You must give some valid justification for your answers to get any marks.)
8. Now it is time to get serious with our data. There may be some interactions between the variables in the data set, so fit a new model to your training set using all the individual variables and all the second-order interaction terms. Use the Analysis of Variance to find the p-values for each of the variables. Identify all interaction terms that meet the 98% significance level (make sure you explicitly quote the p-values for the variables that you identify as significant).
• (Hint: if you have three predictors x1 , x2 , x3 , then the second order interaction terms are x1 x2 , x1 x3 , x2 x3 . There is an easy way and a hard way to do this — see the Reminder sheet for the easy way.)
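The hint above can be illustrated with R's formula syntax, where `.^2` expands to every individual variable plus every pairwise interaction. Whether this is the "easy way" on the Reminder sheet is an assumption. The data below is simulated with invented names, and reading the 98% significance level as a p-value below 0.02 is also an assumption.

```r
set.seed(2)
n <- 300

# Simulated data with three predictors (invented values).
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = factor(sample(c("a", "b"), n, replace = TRUE))
)
toy$y <- factor(ifelse(runif(n) < plogis(0.5 * toy$x1 - 0.5 * toy$x2), "y", "n"),
                levels = c("n", "y"))

# All main effects and all second-order interactions.
fit2 <- glm(y ~ .^2, data = toy, family = binomial)

# Terms included in the model.
term_labels <- attr(terms(fit2), "term.labels")
term_labels

# Sequential analysis of deviance with chi-squared p-values.
anova(fit2, test = "Chisq")
```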
9. We’ll now apply backwards stepwise regression. As we learned in Module 7, best practice is to only remove terms one-by-one starting with the least significant. However, to shortcut the process, we’ll start by ignoring the group status variable (since we have some divine premonition that it shouldn’t be significant).
(a) So ignore anything to do with the group status, and fit a new model with all the remaining individual variables and interactions. Show the Analysis of Variance output.
(b) Now continue with step-by-step backwards stepwise regression to find a model where all terms meet the 98% significance level. At each step, identify the term that you will remove, and why you will choose that one. Then show the resulting Analysis of Variance after you fit each model.
• Remember the “principle of marginality”: a variable shouldn’t appear in an interaction term if we don’t have the variable appear by itself.
10. (a) Which interaction terms are significant (at the 98% level) in your final model? (Make sure you specifically write the p-values.)
(b) Thinking about the context of the data, provide some reasonable hypotheses for why those interaction terms might represent real effects (and are not just statistical noise).
11. So we have now fit a logistic regression model for the log-odds, which has the general form.
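For reference, a logistic regression with k predictors is usually written as below. The symbols are assumed notation rather than taken from the course notes: p is the probability of being swooped, x_1, ..., x_k are the predictors (including any indicator variables and interaction products), and β_0, ..., β_k are the coefficients.

```latex
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k
```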
Write down the general form of this equation for your final model in Question 9. Keep the coefficients as pronumerals for now. Be sure to define all variables in your equation (you don’t need to define the coefficients).
12. Looking at Question 11, the geometric situation is now different to that in Question 7.
(a) What sort of geometric objects does your final model describe?
(b) How many of these geometric objects does your final model describe?
(c) Are they all parallel?
(You must give some valid justification for your answers to get any marks.)
13. Now output the summary of your final model showing the estimated coefficients, and use that to write the specific equation for the log-odds, with all the estimated coefficients replacing the $\hat{\beta}_j$ pronumerals.
14. What is our estimate for the log-odds of somebody being swooped if:
(a) they are walking with a group of friends, on a street during April, at 9am?
(b) they are riding through a park by themselves, at 3:30pm on a sunny September day?
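Since the data records time as minutes after 06:00, clock times need converting before they are substituted into the model. A small sketch follows, using a hypothetical helper name and a 24-hour clock.

```r
# Hypothetical helper: minutes elapsed since 06:00 for a 24-hour clock time.
mins_after_six <- function(hour, minute = 0) {
  (hour - 6) * 60 + minute
}

morning   <- mins_after_six(9)       # 9:00am  -> 180
afternoon <- mins_after_six(15, 30)  # 3:30pm  -> 570
```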
15. Now apply your final model to the testing data. Produce a new tibble containing the true classes, the predicted classes and the prediction probabilities. Output the first 10 lines of this tibble and the dimensions of the data set.
16. Now we need to evaluate our model.
(a) Find the confusion matrix and the accuracy of the model.
(b) If “being swooped” is classified as a success, find the sensitivity and specificity of our model.
(c) Plot all four possible ROC curves, adding the following code to your autoplot():
+ geom_vline(xintercept = ...) + geom_hline(yintercept = ...)
Use these plots to identify which options are required for the correct ROC curve.
(d) What is the AUC of the correct ROC curve?
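The definitions behind parts (a) and (b) can be checked by hand on a small example. The sketch below uses base R and ten made-up predictions, with y ("swooped") treated as the success class; the course's own commands for these measures may differ.

```r
# Made-up true and predicted classes for ten people.
truth <- factor(c("y", "y", "y", "n", "n", "n", "n", "y", "n", "n"),
                levels = c("y", "n"))
pred  <- factor(c("y", "n", "y", "n", "n", "y", "n", "y", "n", "n"),
                levels = c("y", "n"))

# Confusion matrix: rows are predictions, columns are the truth.
cm <- table(predicted = pred, truth = truth)
cm

# Proportion of all predictions that are correct.
accuracy <- sum(diag(cm)) / sum(cm)

# Proportion of people truly swooped who were predicted as swooped.
sensitivity <- cm["y", "y"] / sum(cm[, "y"])

# Proportion of people truly not swooped who were predicted as not swooped.
specificity <- cm["n", "n"] / sum(cm[, "n"])
```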
17. Now, let’s try to answer the Director’s question. Based on your model, do you predict that Thor will be swooped by a magpie? Write some text to interpret your results for the Director, and make sure you give the probabilities of your predicted class. Also, make a suggestion (based on some data) for a single change that could be made so that he is less likely to be swooped.
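To turn a log-odds estimate into a probability that a non-technical reader can interpret, the inverse logit can be applied. In base R this is plogis(). The log-odds values below are invented, and the 0.5 cut-off for the predicted class is an assumption.

```r
# Invented log-odds values, for illustration only.
log_odds <- c(-2, 0, 1.5)

# Inverse logit: probability = exp(l) / (1 + exp(l)).
prob <- plogis(log_odds)
prob

# Assumed rule: predict "y" when the probability exceeds 0.5.
predicted <- ifelse(prob > 0.5, "y", "n")
predicted
```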
4 Submission
You must submit your assignment via MyUni. Do not email it to the teaching staff. Detailed instructions are on the assignment submission page in MyUni. Make sure that all your output is relevant to the questions being asked.
5 Deliverable Specifications (DS)
Before you submit your assignment, make sure you have met all the criteria in the Deliverable Specifications (DS). The client will not be happy if you do not deliver your results in the format that they’ve asked for.