MAST10010: Data Analysis
Assignment 3
Due Date: Friday October 18th, 11.59pm.
Instructions
Software:
You must use Minitab to produce any graphs, tables and descriptive statistics.
Graphs:
● must include your name/student number, which can be added by Editing the graph, right-clicking and selecting Add → Footnote or Add → Subtitle.
● must be relevant. You may look at many graphs, but you should only include the most relevant graph for each question.
● should be clear: ensure that labels and titles are correct and appropriate; you can add gridlines/change symbols/colour as appropriate to make the graph clearer. There are some marks awarded for improving upon the default from Minitab.
● Mac Users: you will need to use myUniApps in order to edit the graphs as required above.
Statistics:
● must be relevant: you will be penalised for including statistics which are not relevant to the questions asked.
Comments:
● must be in the context of the data.
● should be supported by relevant statistics where possible.
● should be concise and informative. Word limits, where given, must be strictly adhered to (all word limits are maximums; you will be penalised for exceeding them). You may use dot-points.
The data
All of the data for this assignment can be downloaded from the LMS in a single file: Asst3 2024 data.csv.
The data file contains data for three different studies, each explained in the relevant question: a study on cholesterol for Question 1 (columns C1 and C2); a study on speech sounds for Question 2 (columns C4 and C5); and a study on AI image detection for Question 3 (columns C7 and C8).
Question 1: Cholesterol [3 + 4 + 2 = 9 marks]
Oats and almonds contain beta-glucan, which is meant to help reduce cholesterol levels in humans. The television program “Michael Mosley’s: Trust Me, I’m a Doctor” investigated the effectiveness of various diets on reducing total cholesterol (in mmol/L). As part of this investigation, they considered ten people who ate 75g of oats per day, and ten people who ate 30g of almonds per day, both in addition to their usual diet, for 2 weeks. The differences in their cholesterol (final − initial) were recorded.
The data are available as Asst3 2024 data.csv on the LMS in the Assignment 3 page; this question relates to columns C1 (‘Almond’) and C2 (‘Oat’).
(a). Perform the most appropriate hypothesis test, and include the Minitab output in your assignment. You should justify any choices you make about the assumptions when deciding on the appropriate analysis.
(b). Write the results of the analysis in the style of a research report.
Note: you are being assessed here on how you present the results; you can receive full marks for this part even if you chose an incorrect analysis in part (a).
(c). After reading this research, it is decided to design a new (more precise) study. It is decided that a difference in cholesterol of at least 0.5 mmol/L over this time period would be clinically important, and the desired power is 0.9 with a significance level of α = 0.05. You should use the larger standard deviation from the study in this question as the planning value. What is the minimum sample size the new study should have, based on this information? You should include a suitable graph along with the required sample size.
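As a rough sanity check on the Minitab result for part (c), the usual normal-approximation formula for a two-sided comparison of two means can be sketched in Python. This is only a sketch under stated assumptions: it assumes two independent groups of equal size, the function name is hypothetical, and the planning value sigma = 0.6 is a placeholder rather than a value from the assignment data. Minitab's power and sample size calculation may give a slightly different answer, and the Minitab output is still what must be submitted.

```python
from math import ceil
from statistics import NormalDist


def two_sample_n_per_group(delta, sigma, alpha=0.05, power=0.9):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, using the normal approximation
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2, rounded up."""
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


# sigma = 0.6 is a placeholder planning value, NOT taken from the data file.
print(two_sample_n_per_group(delta=0.5, sigma=0.6))
```

Replacing the placeholder with the larger sample standard deviation from the study gives an approximate per-group size to compare against the Minitab output.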
Question 2: Recognising sounds in speech [6 + 1 + 2 + 3 + 3 + 4 = 19 marks]
This question is based on part of the study by Massimiliano Canzi and Tamara Rathcke (2023) ‘Unmasking the truf: Impact of community masks on the perception of voiceless fricatives in English’, Proceedings of the International Congress on Phonetic Sciences, 27–31. You can find this article linked on the LMS.
You DO NOT need information from this article to answer the questions; it is provided for interest only.
The study examined differences in understanding of common phonetic sounds, based on the primary language spoken by the participant. All participants spoke English and were classified by their primary language: English (ENG), Greek (GRE) or Korean (KOR). Participants were shown video recordings of people saying one of four sounds (‘SS’, ‘SH’, ‘FF’ or ‘TH’) in an English word, which they then had to identify (multiple choice, always with these four sounds as options).
For this question, we will only consider the time taken to identify the ‘SS’ sound in the word ‘PASS’. The response variable is the reaction time (in milliseconds). The data are a random subset of the participants, restricted to those who correctly identified the sound.
The data are available as Asst3 2024 data.csv on the LMS in the Assignment 3 page; this question relates to columns C4 (‘ResponseTime’) and C5 (‘Language’).
(a). It is proposed to analyse these data using a one-way ANOVA. Write the mathematical formulations for both models being compared. You must clearly define all variables and subscripts used.
(b). Perform a one-way ANOVA, and include the Minitab output. You should include the ANOVA table and summary, but do not need to include any other output.
(c). It is likely that the normality assumption is not met. Give a reason why this is NOT a reason for discontinuing the ANOVA.
(d). Interpret the R² value from the ANOVA in the context of this study.
(e). Produce a diagram summarising any differences between the groups. You must produce this yourself (either by hand or formatting a table appropriately in software), but may use Minitab to calculate all the relevant information.
(f). Compare and contrast the benefits of using LSD (Fisher) and HSD (Tukey) intervals for analysis after completing an ANOVA. Your answer should include at least one commonality and at least one difference; you will be assessed on both the correctness and depth of your response.
Your comments must be less than 60 words.
Question 3: Identifying AI Images [2 + 4 + 3 = 9 marks]
This question is inspired by the same study as Assignment 2: Zeyu Lu, Di Huang, Lei Bai, Jingjing Qu, Chengyue Wu, Xihui Liu, and Wanli Ouyang (2024) ‘Seeing is not always believing: benchmarking human and model perception of AI-generated images’, Advances in Neural Information Processing Systems, 36.
You can find this article at:
https://proceedings.neurips.cc/paper_files/paper/2023/file/505df5ea30f630661074145149274af0-Paper-Datasets_and_Benchmarks.pdf, also linked on the LMS.
You DO NOT need information from this article to answer the questions; it is provided for context only.
This question considers 120 AI-generated images. For each image, two proportions were recorded: the proportion of AI-detectors which correctly identified it as AI (C7 ‘AI’) and the proportion of humans who correctly identified it as AI (C8 ‘Human’). We are primarily interested in whether it is possible to predict the proportion of humans who will be able to correctly identify AI images, based on the AI-detectors.
(a). Produce a suitable plot for these data.
(b). It is proposed to perform a linear regression for these data. State all of the necessary assumptions, and check them where possible.
You should answer the following part (c) using a linear regression, even if you thought the assumptions in (b) were not satisfied.
(c). Interpret a 95% confidence interval for the slope, β.
Relevance, Formatting & Submission [2 marks]
You can gain an additional 2 marks by:
● only including relevant material;
● submitting a clearly legible assignment (e.g. all pages in the correct orientation);
● selecting correct page(s) for each part of each question (when you upload your assignment to Gradescope, it will ask you to select pages: you can select multiple pages for a question part, you can also select the same page for multiple parts).