FIT1006 Business Information Analysis
Assignment 3 (Final assignment)
1st Semester 2025
This assignment is worth 28% of your final mark (subject to the hurdles described in the FIT1006 handbook entry, FIT1006 week 1 Lecture 1 and links therein). Among other things (see below), note the need to hit the `Submit’ button - and the possible requirement of an interview between student and one or more members of teaching staff.
Due Date: Wednesday 11th June 2025, 11:55 pm
Method of submission: Your submission should consist of 1 file:
1. A text-based .pdf file named as: FamilyName-StudentId-1stSem2025FIT1006Asst3.pdf
The file must be uploaded to the FIT1006 Moodle site by the due date and time.
The text-based .pdf file will undergo a similarity check by Turnitin at the time you submit to Moodle. If you have any relevant output from Microsoft Excel and/or from SYSTAT then make sure to include that in the appropriate place(s) in your .pdf file. Please read submission instructions here and elsewhere carefully regarding the use of Moodle.
Total available marks: 34 + 17 + 21 + 24 = 96 marks.
Note 1: Please recall the available support (see https://www.monash.edu/student-academic-success), the Academic Integrity rules and the `Welcome to FIT1006’ post in Ed Discussion. This is an individual assignment.
In submitting this assignment, you acknowledge both that you are familiar with the relevant policies, rules and regulations regarding Academic Integrity (including, e.g., doing your own work, not sharing your work, only using generative AI as instructed) and also that you are familiar with the consequences of being deemed to be in contravention of these policies. The only place that generative AI is to be used in this assignment is Qu 4(g).
Note 2: And a reminder not to post even a hint of part of a proposed partial solution - or a hint of how you might be proposing to do a certain question - to a forum or other public location. This includes when you are seeking clarification of a question.
If you are seeking clarification of a question or help with how to do a question then please follow instructions from the `Welcome to FIT1006’ Ed Discussion forum post.
If you seek clarification on an Assignment question then – bearing in mind the above – word your question very carefully and/or (if necessary) send private e-mail.
If you are seeking to understand a concept better, then try to word your question both so that it is a long way removed from the Assignment and in an Ed Discussion category that does not pertain to assessment. If you do this properly then it will both allow your fellow students to reply to the post and allow for a more prompt response.
If you wish to ask about an assessment clarification and also about a technical clarification - i.e., about two separate matters - then you’ll probably be doing everyone a favour if you split it into two separate posts.
Taking the care to follow these instructions will help to get the desired responses more smoothly and more promptly.
When instructions are overlooked or ignored, we do not enjoy having to hide (make private) or, in worse cases, remove and delete posts (and, in more extreme cases, report the person who posted).
As per the Welcome post in FIT1006 Ed Discussion (and please see the relevant text), where we deem assessment items to be hard, we reserve the right to seek to have the marks adjusted upwards.
Students are reminded that Monash University takes academic integrity very seriously. Thank you for your consideration.
Note 3: As previously advised, it is your responsibility to be familiar with the special consideration policies and special consideration process – as well as academic integrity.
Students should be familiar with the special consideration policies and the process for applying. As has been stated several times, (unless you are explicitly instructed otherwise by the university’s Special Considerations team or the FIT1006 teaching team) such applications should not be sent to any of the FIT1006 teaching team.
On those occasions that you obtain a two-day extension from the Special Considerations team and the FIT1006 teaching team subsequently grants an across-the-board extension, please do not assume that the Special Considerations team will automatically carry the two-day extension across again.
Note 4: As a general rule, don’t just give a number or an answer like `Yes’ or `No’ without at least some clear and sufficient explanation - or, otherwise, you risk being awarded 0 marks for the relevant exercise. Make it easy for the person/people marking your work to follow your reasoning. Without clear explanation, there is the possibility that any such exercise will be awarded 0 marks.
On the issue of significant figures and decimal places, try to give at least 2 decimal places and at least 3 significant figures.
Re-iterating a point above, for each and every question, sub-question and exercise, clearly explain your answer, state any assumptions (and why you’re making them), and clearly show any working.
Note 5: All of your submitted work should be in machine readable form, and none of your submitted work should be hand-written. Nor should your submitted work include a file obtained from an electronic sketch.
Note 6: If you wish for your work to be marked and not to accrue (possibly considerable) late penalties, then make sure to upload the correct files (and not to leave your files as Draft). You then need to check that you have all files uploaded and that you are ready to hit `Submit’. Once you hit `Submit’, you give consent for us to begin marking your work. If you hit `Submit’ without all files uploaded then you will probably be deemed not to have followed the instructions from the Notes above. If you leave your work as Draft and have not hit `Submit’ then we have not received it, and it can accrue late penalties once the deadline passes. In short, make sure to hit ‘Submit’ at the appropriate time to make sure that your work is submitted. Late penalties will be as per Monash University Faculty of IT and Monash University policies (see, e.g., https://publicpolicydms.monash.edu/Monash/documents/1935752 and, e.g., sec. 1.11). It is expected that any work submitted 10 or more calendar days after the deadline may automatically be given a mark of 0.
Note 7: Save your work regularly.
Note 8: Refer to posts and discussions in the Ed Discussion forum, including anything pertaining to (e.g.) comments to the original post announcing Assignment 3 and/or (e.g.) Assignment 3 updates. Make sure to have done this prior to submission.
Note 9: Clearly and explicitly state that you have adhered to submission instructions, and that all work is yours, and that you have not shared your work with any other, and that you have done your work with academic integrity. Also, click any provided relevant submission box to that effect.
Note 10: After you submit (and, again, in accordance with academic integrity), please do not post any material about what you have done. Likewise, after you submit, please do not ask about how you will be marked or what late penalties you might get - rather, please then wait until marking is done.
Some Questions and Answers – further to the above
What help am I entitled to have with this assignment?
Academic integrity is an important concern. As such, you must write your work yourself, without collaborating with other students nor anyone else – nor using generative AI (e.g., ChatGPT) except as specified. (In this assignment, you may use generative AI in Qu 4(g) and nowhere else.) This includes doing your own reading of any references.
Are there any other matters that relate to academic integrity?
Yes. You must be honest in reporting the results.
Introduction
There are many data-sets which are collected by whatever means, and there are many ways to analyse these. Many data sources were mentioned in the introduction to this semester’s FIT1006 Assignment 2. W S Gosset’s (or Student’s) original work on the t distribution was motivated by his work in the brewery.
Data can come from a variety of sources.
A venue for publishing scientific data is https://www.nature.com/sdata/research-articles .
Throughout this Assignment, recall all notes and instructions - including showing reasoning, calculations and working.
Qu 1 ((4 + [2 + 2 + 2 + 2] + 4) + (2 + 8 + 4) + 4 = (4 + 8 + 4) + (2 + 8 + 4) + 4 = 16 + 14 + 4 = 34 marks)
Evaluating an Engagement-Detection Model
Dataset: student_engagement.xlsx (50 records).
Background
The learning-analytics team for an online first-year Statistics unit built a rule-based classifier that tries to decide, at the end of a given session, whether a student was “Engaged” or “Not Engaged.” The decision is based on time-on-task, number of attempts, and the correctness of the final submission.
A small validation study was run. Two experienced tutors (TAs) independently coded each session and agreed on the ground-truth label. The spreadsheet data you have been given contains the following columns:
Column | Description
Student ID | Student identifier ID
Time | “< 30 sec”, “30–120 sec”, “> 120 sec”
Attempts | Integer submissions in the given session
Last Action | “Correct” or “Incorrect”
Engaged? (GT) | Ground-truth label
Predicted Engaged? | Model’s prediction
Gender | Self-reported (F / M)
Table 1
The data for Table 1 is supplied at the FIT1006 Moodle site from which you can access Assignment 3.
Tasks
1. First you need to evaluate the overall model.
a. Construct a 2 × 2 confusion matrix for Engaged? (GT) vs Predicted Engaged?.
b. From the matrix compute the following:
(i) Accuracy,
(ii) Precision (positive = Engaged),
(iii) Recall,
(iv) F1-score.
c. Briefly explain what each metric means in this context.
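As a reminder of how the quantities in Task 1 relate to one another, the following is a minimal Python sketch that tallies a 2 × 2 confusion matrix and derives the four metrics from it. The label strings ("Engaged", "Not Engaged") and the six records are made up for illustration and are not the assignment data; the function names are hypothetical. Where a denominator is zero the sketch returns 0.0, which is one common convention and an assumption here.

```python
from collections import Counter


def confusion_counts(truth, predicted, positive="Engaged"):
    """Tally TP, FP, FN, TN, treating `positive` as the positive class."""
    counts = Counter()
    for t, p in zip(truth, predicted):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts["TP"], counts["FP"], counts["FN"], counts["TN"]


def metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 (0.0 if a denominator is zero)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only; NOT the assignment data.
truth = ["Engaged", "Engaged", "Engaged",
         "Not Engaged", "Not Engaged", "Not Engaged"]
predicted = ["Engaged", "Engaged", "Not Engaged",
             "Engaged", "Not Engaged", "Not Engaged"]

tp, fp, fn, tn = confusion_counts(truth, predicted)
result = metrics(tp, fp, fn, tn)
print("TP, FP, FN, TN =", tp, fp, fn, tn)
print(result)
```

Any such code is only a check on your arithmetic; you still need to show your working and explain each metric in words.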
2. Next you need to analyse the model's fairness by gender.
a. Split the data into Female and Male subsets. (We note that Female/Male is an artificial binary and that people exist outside of that - but, for the purposes of this exercise, all identified as either Female or Male.)
b. Repeat Task 1b (from above) for each subset. Present the results in a table.
c. Is the model equally good at detecting engagement for both groups? Which metrics show the biggest gap?
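The per-group comparison in Task 2 can be sketched as below. This is an illustration only: the record layout (dictionaries with the hypothetical keys "gender", "truth" and "predicted"), the label strings and the eight records are all assumptions and are not the assignment data. The sketch computes the four metrics for each gender and reports the metric with the largest absolute gap.

```python
def group_metrics(records, positive="Engaged"):
    """Compute accuracy, precision, recall and F1 separately for each gender."""
    counts = {}
    for r in records:
        c = counts.setdefault(r["gender"], {"TP": 0, "FP": 0, "FN": 0, "TN": 0})
        actual = r["truth"] == positive
        flagged = r["predicted"] == positive
        if flagged:
            c["TP" if actual else "FP"] += 1
        else:
            c["FN" if actual else "TN"] += 1

    table = {}
    for group, c in counts.items():
        total = sum(c.values())
        precision = c["TP"] / (c["TP"] + c["FP"]) if (c["TP"] + c["FP"]) else 0.0
        recall = c["TP"] / (c["TP"] + c["FN"]) if (c["TP"] + c["FN"]) else 0.0
        denom = precision + recall
        table[group] = {
            "accuracy": (c["TP"] + c["TN"]) / total,
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / denom if denom else 0.0,
        }
    return table


# Made-up records for illustration only; NOT the assignment data.
pairs = [
    ("F", "Engaged", "Engaged"), ("F", "Engaged", "Engaged"),
    ("F", "Not Engaged", "Not Engaged"), ("F", "Not Engaged", "Engaged"),
    ("M", "Engaged", "Not Engaged"), ("M", "Engaged", "Engaged"),
    ("M", "Not Engaged", "Not Engaged"), ("M", "Not Engaged", "Not Engaged"),
]
records = [{"gender": g, "truth": t, "predicted": p} for g, t, p in pairs]

table = group_metrics(records)
gaps = {m: abs(table["F"][m] - table["M"][m]) for m in table["F"]}
largest_gap = max(gaps, key=gaps.get)
for group, row in table.items():
    print(group, row)
print("Largest gap:", largest_gap, gaps[largest_gap])
```

In this made-up example the two groups have the same accuracy but quite different recall, which is the kind of gap that an overall figure can hide.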
3. Finally, you need to make a recommendation to the learning analytics team.
Write a note to the learning analytics team summarising the model’s quality, any fairness concerns, and some next steps, such as suggesting attributes or features that may help the model improve.
Qu 2 (1 + 1 + 1 + 2 + 5 + 7 = 17 marks)
Leading into part (a), count the number of words in your unredacted Enron document from Assignment 1 Qu 4 and Assignment 2 Qu 3. Call this w.
Let W = min{100, w}. I.e., if w >= 100 then W = 100, and if w <= 100 then W = w. If W < 80 then, at the first opportunity, please
(i) advise the FIT1006 teaching team of this, with evidence, and either
(ii a) suggest a nearby document in the Enron database that you would like to use or (ii b) say why you think all will be okay with the current document of length W.
(a) Include your Enron document (copy and paste), and state the value of w, and state the value of W.
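The relationship between w and W can be sketched as follows. This is a minimal illustration which assumes that a "word" is any run of characters separated by whitespace; that definition is an assumption, so state whichever definition you actually use. The function name is hypothetical and the texts are made up.

```python
def word_totals(text, cap=100, minimum=80):
    """Return (w, W, too_short), where W = min(cap, w) and too_short means W < minimum."""
    w = len(text.split())  # assumption: words are whitespace-separated tokens
    W = min(cap, w)
    return w, W, W < minimum


# Made-up texts for illustration only.
long_text = " ".join(["word"] * 120)
short_text = " ".join(["word"] * 50)

print(word_totals(long_text))   # 120 words, capped at 100
print(word_totals(short_text))  # 50 words, below the threshold of 80
```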
Leading into part (b), state the 4th last digit (i.e., 5th digit) of your StudentId and consider the following 5 documents, numbered (1) to (5).
We can get to R J Solomonoff (1967) by going https://RaySolomonoff.com --> https://RaySolomonoff.com/publications/pubs.html --> https://RaySolomonoff.com/publications/67.pdf .
(1) R J Solomonoff (1967) https://RaySolomonoff.com/publications/67.pdf sec. 6 The Problem of the Ambitious Subordinate
(2) https://www.AmStat.org/asa/files/pdfs/P-ValueStatement.pdf
(3) sec. 1 More from Chris of https://doi.org/10.1093/comjnl/bxm117 (2008a)
(4) the transcript at https://www.ABC.net.au/listen/programs/scienceshow/hedy-lamarr-actress-inventor-and-amateur-engineer/104462346
(5) https://www.monash.edu/indigenous-australians/news-and-events/news/80th-memorial-of-william-cooper
The 5th digit of your StudentId gives you two of the abovementioned datasets as follows.
5th digit of StudentId | Document numbers
0 | (1), (2)
1 | (1), (3)
2 | (1), (4)
3 | (1), (5)
4 | (2), (3)
5 | (2), (4)
6 | (2), (5)
7 | (3), (4)
8 | (3), (5)
9 | (4), (5)
Table 2
Take the first W words of each of the two documents given to you by the 5th digit of your StudentId.
Also take W words from the start of your unredacted Enron document from Assignment 1 Qu 4 and Assignment 2 Qu 3.
You now have 3 data sets (or documents), all of size W.
(b) State the 5th (4th last) digit of your StudentId, and also state which three datasets/documents (each of length W) you are using.
(c) Copy and paste the contents of these 3 datasets (each of length W) into your assignment. Make sure that there is clear separation between these 3 datasets.
For what is to follow, the words `a' and `an' are treated as equivalent.
(In a grammatical sense, they are both the indefinite article, and `a' is typically used when preceding a consonant, and `an' is typically used when preceding a vowel.)
An analyst doing forensics is interested in the possibility that various words occur with the same frequency.
For each of these three documents of length W, count the number of times that the word `the' occurs. Call these numbers {a1, a2, a3}.
For each of these three documents of length W, count the number of times that the word `and' occurs. Call these numbers {b1, b2, b3}.
For each of these three documents of length W, count the number of times that the word `a' or the word `an' occurs - and add these two numbers up (the number of times the document of length W has `a’ plus the number of times the document of length W has `an’). Call these numbers {c1, c2, c3}.
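The counting described above can be sketched as below. This is an illustration under stated assumptions: words are whitespace-separated tokens, matching ignores case, and leading or trailing ASCII punctuation is stripped before comparison (so `The' and `the,' both count as `the'). None of these conventions is prescribed above, so state the ones you use. The function names and the sample sentence are made up.

```python
import string


def first_words(text, W):
    """Return the first W whitespace-separated tokens of text."""
    return text.split()[:W]


def count_targets(words):
    """Count `the', `and', and `a'/`an' combined, ignoring case and edge punctuation."""
    # Assumption: only ASCII punctuation is stripped; curly quotes are left in place.
    cleaned = [w.strip(string.punctuation).lower() for w in words]
    return {
        "the": cleaned.count("the"),
        "and": cleaned.count("and"),
        "a/an": cleaned.count("a") + cleaned.count("an"),
    }


# Made-up sentence for illustration only; NOT one of the assignment documents.
sample = "The cat and the dog saw an owl and a fox."
words = first_words(sample, 10)  # keeps everything up to and including "a"
counts = count_targets(words)
print(words)
print(counts)
```

Applying the same function to each of your three documents of length W gives one value of each count per document.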
(d) Give the values for `the': {a1, a2, a3}.
Give the values for `and': {b1, b2, b3}.
Give the values for `a'/`an': {c1, c2, c3}.
State these clearly.
As above, an analyst doing forensics is interested in the possibility that various words occur with the same frequency.
(e) State a suitable null hypothesis and a suitable alternative hypothesis.
(f) Show your analysis of this hypothesis.
Give a significance level at which you could reject this.
In the tables available to you from this subject and elsewhere (e.g., binomial, cumulative binomial, Poisson, cumulative Poisson, chi-squared, F, t, z, etc.), try to give the most significant level (smallest value of alpha, α) at which you could reject this. If possible, give the largest value of alpha (or α) at which you could not reject this.
Reminder throughout Qu 2 and throughout Assignment 3: As always, make your work clear to the person marking and also to anyone else reading.
As always, as per Note 2, please do not post even a hint of a proposed partial solution in public to the forum or other public location. So, as one example, if you plan to do a certain test and need a table with some value of significance level or sample size or degrees of freedom or whatever but the FIT1006 Moodle web site doesn't have what you need, then please do your best web searching to find what you need and then document your answer.
Again, please don't post about your intentions to do such-and-such a test and how you need such-and-such a table to do such-and-such a question. One of the possible consequences would be that such a post is deleted.
As always, please see the Welcome post (in FIT1006 Ed Discussion) about guidelines about what to post - and what not to post - when there are open assessment items.
Qu 3 (0.25 + 0.25 + 0.5 + 4 + 6 + 4 + 6 = 21 marks)
(a) Write out your StudentId.
(b) Write out the 8 digits of your StudentId in order from highest to lowest.
You will then split these ordered 8 digits into four 2-digit numbers.
E.g., if your StudentId is 63920178 then write it as 98763210 and split it as 98, 76, 32, 10 and a = 98, b = 76, c = 32, d = 10.
And, e.g., if your StudentId is 45590609 then write it as 99655400 and split it as 99, 65, 54, 00 and a = 99, b = 65, c = 54, d = 00 = 0.
And, e.g., if your StudentId is 40410440 then write it as 44441000 and split it as 44, 44, 10, 00 and a = 44, b = 44, c = 10, d = 00 = 0.
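The ordering and splitting in part (b) can be checked with a short sketch such as the following. The function name is hypothetical, and the StudentId is passed as a string so that any leading zeros are kept. Applied to the three example StudentIds above, it gives (98, 76, 32, 10), (99, 65, 54, 0) and (44, 44, 10, 0).

```python
def split_student_id(student_id):
    """Sort the 8 digits from highest to lowest and split them into a, b, c, d."""
    if len(student_id) != 8 or not student_id.isdigit():
        raise ValueError("expected a StudentId of exactly 8 digits")
    ordered = "".join(sorted(student_id, reverse=True))
    # Consecutive pairs of digits become the four 2-digit numbers.
    a, b, c, d = (int(ordered[i:i + 2]) for i in range(0, 8, 2))
    return a, b, c, d


# The three example StudentIds from the text above.
for sid in ("63920178", "45590609", "40410440"):
    print(sid, "->", split_student_id(sid))
```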
(c) Write the values of a, b, c and d resulting from your StudentId.
Now consider an intervention (such as, e.g., a vaccination) and the question of whether or not people's condition(s) improved as a result of the intervention.
Let a be the number infected with the intervention.
Let b be the number infected without the intervention.
Let c be the number not infected who had the intervention.
Let d be the number not infected who did not have the intervention.
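It can help to lay these four counts out as a 2 × 2 table before stating any hypotheses. The sketch below only tabulates the counts, their totals and the observed proportion infected in each arm; it does not carry out any test. The function names are hypothetical, and the values a = 98, b = 76, c = 32, d = 10 are those of the first example StudentId above, used purely for illustration.

```python
def two_by_two(a, b, c, d):
    """Lay out the counts as a 2 x 2 table and compute the marginal totals."""
    table = {
        "infected": {"intervention": a, "no intervention": b},
        "not infected": {"intervention": c, "no intervention": d},
    }
    totals = {
        "intervention": a + c,
        "no intervention": b + d,
        "infected": a + b,
        "not infected": c + d,
        "all": a + b + c + d,
    }
    return table, totals


def infection_rates(a, b, c, d):
    """Observed proportion infected with and without the intervention."""
    with_rate = a / (a + c) if (a + c) else None      # None if that arm is empty
    without_rate = b / (b + d) if (b + d) else None
    return with_rate, without_rate


# Illustrative values from the first example StudentId (63920178).
a, b, c, d = 98, 76, 32, 10
table, totals = two_by_two(a, b, c, d)
with_rate, without_rate = infection_rates(a, b, c, d)
print(table)
print(totals)
print(round(with_rate, 4), round(without_rate, 4))
```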
(d) State a suitable null hypothesis and a suitable alternative hypothesis.
(e) Show your analysis of this hypothesis.
Give a significance level at which you could reject this.
In the tables available to you from this subject and elsewhere, try to give the most significant level (smallest value of alpha, or α) at which you could reject this. If possible, give the largest value of alpha (or α) at which you could not reject this.
We now go back to the end of part (c) and we re-interpret the numbers. We will use the numbers a, b, c, and d again.
Leading into part (f), we examine two machine learning algorithms, Alg1 and Alg2. They are both fed two-class classification problems from the same probability of being in Class1 and Class2, and we consider the possibility that they are the same algorithm or at least will have identical performance.
a now refers to the number of cases when Alg1 and Alg2 are both correct.
b now refers to the number of cases where Alg1 is correct and Alg2 is incorrect.
c now refers to the number of cases where Alg1 is incorrect and Alg2 is correct.
d now refers to the number of cases when Alg1 and Alg2 are both incorrect.
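To make the re-interpretation concrete, the sketch below shows how counts of this kind would arise from the paired outputs of two classifiers on the same cases. It is an illustration only: the function name is hypothetical, and the six true labels and predictions are made up and are unrelated to your StudentId. The sketch only tallies a, b, c and d; it does not carry out any test.

```python
def paired_outcome_counts(truth, pred1, pred2):
    """Tally (a, b, c, d): both correct, only Alg1 correct, only Alg2 correct, both incorrect."""
    a = b = c = d = 0
    for t, p1, p2 in zip(truth, pred1, pred2):
        ok1, ok2 = (p1 == t), (p2 == t)
        if ok1 and ok2:
            a += 1
        elif ok1:
            b += 1
        elif ok2:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Made-up labels for illustration only (1 = Class1, 2 = Class2).
truth = [1, 1, 2, 2, 1, 2]
alg1 = [1, 1, 2, 1, 2, 1]
alg2 = [1, 2, 2, 2, 2, 1]

counts = paired_outcome_counts(truth, alg1, alg2)
print(counts)
```

Note that each case contributes to exactly one of the four counts, so a + b + c + d is the number of cases.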
(f) State a suitable null hypothesis and a suitable alternative hypothesis.
(g) Show your analysis of this hypothesis from part (f).
Give an appropriate significance level (preferably as small as possible) at which you could reject this null hypothesis.
Give an appropriate significance level (preferably as large as possible) at which you could not reject this null hypothesis.