MATHS 7107 Data Taming
Assignment 2
Trimester 1, 2025
1 Background
The Australian magpie was named after the European magpie, which has a similar appearance, but is actually quite distantly related. It is possible that the bird on the South Australian coat-of-arms is a magpie, although it could also be a magpie lark, which looks remarkably similar to an Australian magpie. When the coat-of-arms was designed, the bird was called a “piping shrike” which is not a scientific designation, and the word “piping” could be used to describe the call of either bird. So it is not clear if the designer meant the magpie or the magpie lark.
The Australian magpie is found all over the country, and in some small populations in New Zealand and New Guinea. It is only found in the southern hemisphere. While they are all the same species, there is considerable variation in the appearance of birds from around Australia. In particular, the magpies from southern Australia tend to have more white plumage on their back than the magpies from northern Australia.
Magpies are extremely intelligent birds and have adapted to live amongst humans, often in clans of up to 20–30 birds. Many magpies are found in cities and towns, with some birds even developing relationships with humans. They are excellent problem solvers and can readily learn new things from other magpies, other non-magpie birds and from humans. They have even been known to figure out how to remove tracking devices placed on them by scientists. Another well known aspect of the magpies is their distinctive call, which is very complex and musical. These attributes led to magpies being rated Australia’s favourite bird in the inaugural Australian Bird of the Year survey, and ranking highly every year since.
But an important attribute of Australian magpies is that they are quite territorial, and can be extremely aggressive, particularly when the magpie’s clan has young chicks. It is very common for male magpies to swoop down at humans and either threaten them by flying very close, or even strike them with their beaks. They are one of the few bird species known to have killed humans: three people have been confirmed to have died from magpie attacks. Australian news reports regularly feature warnings about the dangers of magpies.
There are many suspected risk factors that can increase the chance of being swooped. For example, it seems that young children and older adults are at more risk than the rest of the population. The most common trigger seems to be riding a bicycle (although your lecturer has strong evidence that they hate unicyclists even more than cyclists), and there are some techniques that cyclists use to deter the birds from swooping.
The South Australian Tourism Council is hoping to encourage more visitors to the south-east of the state, which is typically much more densely forested than the north and west of the state. However, their research has indicated that potential overseas tourists are concerned about dangerous Australian animals, which is keeping tourists away. It doesn’t help that one of the magpie subspecies found in south-east SA is called tyrannica. To combat this, the Tourism Council has a new advertising slogan:
“Visit south-east SA. At least the magpies are pretty safe!”
which will be part of their new advertising campaign. This campaign will run for two years. As part of this campaign, they are planning to sponsor a group of 500 tourists to live in a small rural camp called Safety Hill, which is set in a heavily forested area of south-east SA, for the duration of the campaign. The land area of the camp is only 40,000 m², but it contains approximately 1,200 trees.
To support their new advertising campaign, they would like you to do some data analysis. They have provided four sets of data from different regions in Australia and they would like to know how likely it is that the sponsored tourists would be swooped while living in Safety Hill. The data sets are labelled suburbs_0.csv, ..., suburbs_3.csv. You’ll need to provide intervals around your estimates, and the Tourism Council wants the intervals at the 93% level. They would also like your opinion on whether the Tourism Council would be better off with the alternative slogan:
“Visit south-east SA. At least we don’t have cassowaries!”
Conveniently for you, they have just started using R and R Markdown, so they want your report as a PDF generated using R Markdown. The Council’s CIO studied Data Taming last trimester, so she wants you to only use commands from the course, so that she can easily see what analysis you’ve done. In your R Markdown code chunks, make sure that you do not set echo = FALSE, so that she can see what R code you used to generate your output. But of course, she doesn’t want to see irrelevant warnings or messages.
But remember that your report is for the Director of the Tourism Council, who is not really a technical person, and who certainly doesn’t know R. So make sure you include descriptions allowing the average person to understand what you are doing and what the output means.
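As a minimal sketch of one way to meet the code-display requirements above (assuming the standard knitr chunk options; the course notes may set this up differently), a setup chunk near the top of the R Markdown file could contain:

```r
# Show the R code in the report, but hide warnings and package start-up messages
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

Setting these options once in a setup chunk applies them to every later chunk, so they do not need to be repeated chunk by chunk.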
1.1 Number of digits
When writing your own text, or USING the output from R:
• For integer results, report the whole integer.
• For non-integers with absolute value > 1: use 2 decimal places
• For non-integers with absolute value < 1: use 3 significant figures.
For example:
◦ 135.5681 ≈ 135.57
◦ −0.0004586 ≈ −0.000459
Exceptions:
• If you’re just PRINTING the output from R, then just keep the output as it is.
- But if you have R do the rounding for you then you need to conform to these two conventions listed above.
• If your data has fewer digits of precision than specified above (eg. because of the way it was stored in the original data, or because of the way it was calculated) then only report that level of precision.
• In some cases, the question may specify a different level of precision — in which case, do what the question says.
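If you have R do the rounding, the two examples above can be reproduced with base R's round() and signif(). This is only a sketch; the variable names are made up for illustration.

```r
# Non-integer with absolute value > 1: two decimal places
x_large <- round(135.5681, 2)
x_large   # 135.57

# Non-integer with absolute value < 1: three significant figures
x_small <- signif(-0.0004586, 3)
x_small   # -0.000459
```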
2 The data
The Tourism Council has four datasets labelled suburbs_0.csv, suburbs_1.csv, suburbs_2.csv and suburbs_3.csv, with data from different areas of Australia, collected at the end of the year 2029. Each dataset contains 10 columns:
• REGION: The name of the region where the data was collected.
• POST_CODE: The postcode of the region. These identifiers are managed by Australia Post. While the initial development of the postcode system tried to assign the numbers in a sequential pattern, based on geographical location, this has been abandoned. So there is no necessary relationship between two regions and their postcodes.
• STATE: The Australian state or territory in which the region is located. They are denoted by: SA, VIC, WA, NSW, NT, QLD, ACT, TAS.
• ELV: The average elevation of the region (measured in metres).
• POP: The number of people living in the region.
• AREA: The surface area of the region (measured in square kilometres, km²).
• LAT: The latitude of the region’s centre.
• LONG: The longitude of the region’s centre.
• SWOOPS: The number of swoops recorded in the region for all years 2025–2029.
• TREES: The number of trees in the region.
Each dataset has data on 720 regions. There are likely to be some errors in the data, so make sure you clean it before you do any analysis.
3 Data cleaning
IMPORTANT!
Make sure you only remove data that you must remove. Do not just delete data because it is inconvenient. You must have specific instructions from the client, or it must be an impossible value, before you remove any data from your analysis. Even then, you need to describe why it was removed.
Only perform cleaning operations if you know there is a problem. (Performing unnecessary operations on data is a good way to accidentally introduce errors.) So make sure you have clearly identified the unclean piece of data before you clean it, and explain it in your report.
Instructions:
• There may be some duplicated rows, in which case remove the row higher in the list (i.e. the one with the smaller row number) and keep the later copy.
• Some test data may have been left in. Remove it.
• If there are any values that are impossible then remove the entire row.
• There may be some other typos, so fix them if possible. If they’re not possible to fix, then delete the entire row.
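The duplicate rule above keeps the later copy, which is the opposite of what many default tools do. Here is a minimal sketch on a made-up tibble, assuming dplyr is available and that these commands are ones you are permitted to use. The id column is hypothetical and only there to show which rows survive, and the negative POP is an invented example of an impossible value, not a claim about the real data.

```r
library(dplyr)

# Made-up data: rows 1 and 3 are identical apart from the hypothetical id column
toy <- tibble(
  id     = 1:5,
  REGION = c("A", "B", "A", "C", "D"),
  POP    = c(100, 200, 100, -5, 300)
)

# fromLast = TRUE flags the EARLIER copy of each duplicate, so the later copy is kept
earlier_copy <- duplicated(select(toy, -id), fromLast = TRUE)

toy_clean <- toy %>%
  filter(!earlier_copy) %>%  # drop the copy higher in the list
  filter(POP >= 0)           # drop rows with an impossible (negative) population

toy_clean$id   # 2 3 5
```

Whatever approach you take, check the flagged rows by eye before removing them, and report why each one was removed.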
4 Your job
Note
Make sure you write text to explain what you are doing at each point and why you are doing it. You need to justify all the things you do or claim. Also describe the results. This report is for the Director, so aim your explanations at the average person and avoid jargon wherever possible.
1. Load the correct dataset directly as a tibble (don’t load it as a general dataset and then convert it, as that can introduce errors). Output the first 10 lines of the dataset and the dimensions of the data set.
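For Q1, one common way to load a CSV directly as a tibble is readr's read_csv(), assuming this is among the commands covered in the course. The sketch below writes a tiny made-up file to a temporary location so that it runs on its own; the file contents and names are illustrative only.

```r
library(readr)

# Toy stand-in for the real file: write a tiny CSV to a temporary location
toy_path <- tempfile(fileext = ".csv")
writeLines(c("REGION,STATE,POP", "Toyville,SA,1200", "Sampleton,VIC,850"), toy_path)

# read_csv() returns a tibble directly (base read.csv() returns a data.frame)
toy <- read_csv(toy_path, show_col_types = FALSE)

head(toy, 10)  # first rows (at most 10)
dim(toy)       # number of rows, then number of columns
```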
2. We want to clean up our data, but first we’ll put in an extra column of row numbers, so we can track some changes we’ve made to the data.
• Add a column at the far left of the dataset called RN that contains the row numbers.
Output the first 10 rows of the dataset.
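For Q2, a possible sketch using tibble's rowid_to_column(), which places the new column at the far left. The toy data is made up, and the course may use a different command for the same job.

```r
library(tibble)

toy <- tibble(
  REGION = c("Toyville", "Sampleton", "Demo Creek"),
  POP    = c(1200, 850, 430)
)

# Adds an integer column of row numbers as the first column
toy_rn <- rowid_to_column(toy, var = "RN")

head(toy_rn, 10)
```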
3. Using dot points, identify what types of variables we now have in our data set, i.e., “Quantitative Discrete”, “Quantitative Continuous”, “Categorical Nominal”, “Categorical Ordinal”. (Don’t just describe what data type they are in the tibble — you need to think about the type of variable in the context of the meaning of the data.) Make sure you provide some justification for your choice of variable types.
• Don’t just provide vague statements, but be very concrete about describing this particular set of data.
4. Now clean the data. Make sure you justify every step of cleaning that you do. Then display the first 10 rows of the dataset, and the dimensions of the dataset.
Note
If you discover any problems with the data in the following questions then you should come back and redo this question before you submit. Your data should be clean and shiny from this point.
5. Now it’s time to tame our data.
• Make your data set correspond to the Tame Data conventions on page 3 of Module 2. You’ll need to use your answers to Q3.
• Also make sure the R data types in your tibble match the variable types that you identified in Q3.
• (Reminder: Your data should already be clean by this point. You may want to check here if there is any more cleaning required. If so, go back to Q4 and try again.)
Output the first 10 rows, and the dimensions, of your clean, tidy and tame data set.
6. Making sure you set the seed correctly, choose a random sample of 500 regions from the dataset, and order them by the row numbers that we introduced in Q2. Then output the first 10 lines of the dataset and the dimensions of the data set.
Note
Use this random subset from Q6 for the remainder of the assignment.
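For Q6, a sketch of the set-seed, sample, re-order pattern on made-up data, assuming dplyr's slice_sample() is permitted. The seed value here is hypothetical; use the one you have been told to use.

```r
library(dplyr)

set.seed(2025)  # hypothetical seed, for illustration only

toy <- tibble(RN = 1:20, POP = 101:120)

toy_sample <- toy %>%
  slice_sample(n = 5) %>%  # random rows, without replacement by default
  arrange(RN)              # restore the original row-number order

head(toy_sample, 10)
dim(toy_sample)
```

Setting the seed immediately before the sampling step means the same rows are chosen every time the document is knitted.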
7. (a) Now let’s get on with some analysis. Add two new columns to the data set:
• spt: the number of swoops per one thousand people in the region.
• tph: the number of trees per hectare in each region. (Note that you may want to perform an intermediate calculation here, to get to the final answer.)
and remove the columns containing the post codes and the number of trees. Output the first 10 rows, and the dimensions, of the data set.
(b) Describe what type of variable these two new columns represent (“Quantitative Discrete”, “Quantitative Continuous”, “Categorical Nominal”, “Categorical Ordinal”). Are the data types correct in the tibble? (Explain your answer.) If they are not correct, make sure you change them.
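The intermediate calculation mentioned in Q7(a) is a unit conversion: one hectare is 10,000 m², so 1 km² = 100 ha. A sketch on made-up numbers, assuming dplyr; the helper column area_ha is a hypothetical name that is dropped again at the end.

```r
library(dplyr)

toy <- tibble(
  REGION = c("Toyville", "Sampleton"),
  POP    = c(2000, 500),
  AREA   = c(4, 0.5),      # km^2
  SWOOPS = c(10, 3),
  TREES  = c(8000, 1500)
)

toy_rates <- toy %>%
  mutate(
    area_ha = AREA * 100,          # 1 km^2 = 100 hectares
    spt     = SWOOPS / POP * 1000, # swoops per 1000 people
    tph     = TREES / area_ha      # trees per hectare
  ) %>%
  select(-TREES, -area_ha)         # drop columns that are no longer needed

toy_rates
```

With these toy values, spt is 5 and 6, and tph is 20 and 30.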
8. Report the following statistics:
(a) The region with the highest number of swoops per 1000 people.
(b) The region with the lowest number of swoops per 1000 people.
(c) The state with the highest number of swoops per 1000 people, averaged over all regions in the state.
(d) The state with the lowest number of swoops per 1000 people, averaged over all regions in the state.
(Make sure you write some text for your answer; don’t just present some code output.)
9. Generate side-by-side boxplots for each state, with the number of swoops per 1000 people on the y-axis.
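For Q9, a sketch of side-by-side boxplots with ggplot2 on made-up values. The data frame and its numbers are invented for illustration; only the column names STATE and spt follow the assignment.

```r
library(ggplot2)

toy <- data.frame(
  STATE = rep(c("SA", "VIC", "NSW"), each = 4),
  spt   = c(2.1, 3.4, 2.8, 5.0,  1.2, 1.9, 2.2, 1.5,  4.1, 3.8, 6.3, 4.9)
)

# One box per state, with swoops per 1000 people on the vertical axis
p <- ggplot(toy, aes(x = STATE, y = spt)) +
  geom_boxplot() +
  labs(x = "State", y = "Swoops per 1000 people")

p
```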
10. (a) Now make a scatterplot to see if tph is related to spt. Put the independent/explanatory variable on the horizontal axis, and explain why this is the explanatory variable. Include a straight line of best fit on your plot.
(b) Does it look like there is a linear relationship between the two variables? Explain why/why not, and also give some reason (based on the context of the data) for why the data might be in this shape.
11. We would like to fit a linear model of spt against tph for this data. But we will first apply a Box-Cox transformation.
(a) Use the Box-Cox algorithm described on page 7 of Module 5 to obtain an estimate of λ. (Extend the range of the search to -4 ≤ λ ≤ 4, in steps of 0.075.) What is the estimated λ?
(b) Apply the transformation to create a new column called spt_bc on the right of your dataset. Output the first 10 rows, and the dimensions, of the data set.
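The algorithm on page 7 of Module 5 is not reproduced here. As a hedged sketch of the general idea, the boxcox() function from the MASS package can search a grid of λ values for the one with the highest log-likelihood. The data below is simulated, and the variable names are made up. Calling MASS::boxcox() without library(MASS) avoids masking dplyr's select().

```r
set.seed(1)

# Simulated positive, right-skewed response (illustration only)
x <- runif(200, 1, 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.3))

# Profile log-likelihood over the grid -4, -3.925, ..., up to at most 4
bc <- MASS::boxcox(lm(y ~ x), lambda = seq(-4, 4, by = 0.075), plotit = FALSE)

# The estimate is the grid value with the largest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat

# Apply the transformation (the log is the limiting case as lambda approaches 0)
y_bc <- if (abs(lambda_hat) < 1e-8) log(y) else (y^lambda_hat - 1) / lambda_hat
```

The Box-Cox transformation is only defined for strictly positive responses, so check for zeros before applying it to real data.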
12. Produce the following plots:
(a) a scatterplot of the Box-Cox transformed data (with a line of best fit),
(b) a histogram of the Box-Cox transformed data, and the corresponding skewness,
(c) a histogram of the non-Box-Cox transformed data, and the corresponding skewness.
Write 2–3 sentences about this output and how it compares to the untransformed data.
13. We will now try to fit a linear model to spt_bc using tph as the predictor.
(a) Write down the general equation for the true linear model when fit to the entire population. Make sure you define all of the notation you introduce. (Hint: this equation should include the error terms, and contain the true parameters.)
(b) Now write down the equation for the line of best fit for a sample of the population, with all the estimated parameters. Make sure you use the correct notation and define it.
14. Build a linear model in R, and use the model summary to find estimates for the model parameters. Use these estimates to rewrite your equation from Q13 giving the line of best fit for your model. Also write down the estimated distribution for the errors.
15. Before we use our model for anything, we need to check if it satisfies the 4 assumptions for a linear model (as described on pages 12–15 of Module 6). So now check if our model satisfies these assumptions. Importantly, the client needs to understand the implications, so as part of your answer:
• Describe in your own words what each assumption means in terms of the specific linear model you have fitted.
• If you refer to any graphs make sure you describe (in plain language) what is on the horizontal and vertical axes of those graphs. Your description should be specific to the data used in this assignment, not generic statements.
• Give an explanation of why each assumption is, or is not, satisfied.
• Make sure you identify at least one possible problem with the Independence assumption.
(Note that we are going to use a linear model regardless of any problems that you find in the assumptions, but it is always good to highlight any shortcomings of the model so the client knows about them.)
16. (a) Use your model to find the mean number of swoops (per thousand people) over five years in a region with the mean number of trees per hectare. You also need the correct interval.
(b) Then use your model to predict the actual number of swoops that will happen to the tourists at Safety Hill during the campaign. Again, you also need to provide the correct interval. Pay attention to the units.
(Hint: you will need to transform your predictions and intervals back into the scale of the original variables.)
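The hint above can be sketched as follows, using simulated data, base R's predict() and a hypothetical helper inv_boxcox() that inverts the Box-Cox transformation. The value of lambda is made up. Both interval types are shown at the 93% level; deciding which one suits which part of Q16 is left to you.

```r
set.seed(1)

lambda <- 0.5  # hypothetical Box-Cox parameter, for illustration only

# Simulated data already on the transformed scale
toy <- data.frame(x = runif(100, 1, 10))
toy$y_bc <- 2 + 0.5 * toy$x + rnorm(100, sd = 0.3)

# Hypothetical helper: inverse of z = (y^lambda - 1) / lambda (log when lambda = 0)
inv_boxcox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

fit   <- lm(y_bc ~ x, data = toy)
new_x <- data.frame(x = mean(toy$x))

# Interval for the mean response, and interval for a single new observation
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.93)
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.93)

# Back-transform the estimate and both interval limits to the original scale
inv_boxcox(conf_int, lambda)
inv_boxcox(pred_int, lambda)
```

The prediction interval is always wider than the confidence interval at the same point, because it also accounts for the variation of an individual observation around the mean.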
17. There is some suggestion that black-backed magpies are more aggressive swoopers than white-backed magpies. Make a scatterplot to check if your data confirms or contradicts this claim. Explain how your plot shows this.
18. Write a paragraph or two describing what you have found, and how that is likely to affect the Tourism Council’s campaign. You can also discuss any observations or conjectures that you have. (Everybody’s data will be different so there is no right or wrong answer here, as long as you justify your claim with reasonable arguments.) As part of your answer, give a concrete recommendation for whether the Tourism Council should instead adopt the alternative slogan.
That’s enough for this report. We might investigate this data further in Assignment 3. (Then we can charge the client more money for another deliverable!)
5 Submission
You must submit your assignment via MyUni. Do not email it to the teaching staff. Detailed instructions are on the assignment submission page in MyUni. Make sure that all your output is relevant to the questions being asked.
6 Deliverable Specifications (DS)
Before you submit your assignment, make sure you have met all the criteria in the Deliverable Specifications (DS). The client will not be happy if you do not deliver your results in the format that they’ve asked for.