FIT5196-S1-2025 assessment 1 (35%)
This is a group assessment and worth 35% of your total mark for FIT5196.
Due date: 11:55 PM, Friday, 11 April 2025
Text documents, such as those derived from web crawling, typically consist of topically coherent segments. Within each topically coherent segment, word usage exhibits a more consistent lexical distribution than across the entire dataset. For text analysis tasks such as passage retrieval in information retrieval (IR), document summarization, recommender systems, and learning-to-rank methods, a linear partitioning of texts into topic segments is effective. In this assessment, your group is required to successfully complete all of the tasks listed below to achieve full marks.
Task 1: Parsing Raw Files (7/35)
This task touches on the very first step of analysing textual data, i.e., extracting data from semi-structured text files.
Allowed libraries: re, json, pandas, datetime, os
Input Files:
● group<group_number>.txt
● group<group_number>.xlsx
(all input files are in the student_group<group_number>.zip file)
Output Files (submission):
● task1_<group_number>.json
● task1_<group_number>.csv
● task1_<group_number>.ipynb
● task1_<group_number>.py
(the <group_number> is zero-padded, e.g. 001, 010, …)
Your group is provided with Amazon product review data (ratings, text, images, etc.). Please use the input data files that match your group number, i.e. student_group<group_number>.zip in the Google Drive folder (student_data).
Note: Using a wrong input dataset will result in ZERO marks for ‘Output’ as in A1 marking rubrics. Please double check that you have the correct input data files!
Your dataset is a modified version of Amazon Product Reviews. Each review is encapsulated in a record that contains 11 attributes. Please check with the sample input files (sample_input) for all the available attributes.
Your task is to extract the data from all of your input files (15 files in total: 1 Excel file and 14 text (txt) files that follow a mis-structured XML format). You are asked to extract and transform the data into a CSV file and a JSON file. The format requirements are listed as follows.
For the csv file, you are required to produce an output with the following columns:
● parent_product_id: Parent ID of the product. - output format: string
● review_count: the number of total reviews for a parent product ID. - output format: int
● review_text_count: the number of reviews that contain a text body (excluding 'none'). - output format: int
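As a minimal sketch of how the three CSV columns could be derived with pandas, assuming you first flatten all extracted reviews into one row per review (the column names and toy values here are placeholders, not part of the spec):

```python
import pandas as pd

# Hypothetical flat table of extracted reviews (one row per review).
reviews = pd.DataFrame({
    "parent_product_id": ["p1", "p1", "p2"],
    "review_text": ["great product", "none", "works fine"],
})

# review_count: all reviews per parent; review_text_count: reviews whose
# text is not the placeholder 'none'.
summary = (
    reviews.groupby("parent_product_id")
    .agg(
        review_count=("review_text", "size"),
        review_text_count=("review_text", lambda s: (s != "none").sum()),
    )
    .reset_index()
)
# p1 -> review_count 2, review_text_count 1; p2 -> 1 and 1
```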
For the JSON file, you are required to produce an output file with the following fields (all fields in the output file should be of type 'string'):
● parent_product_id
● reviews: a root element with one or more reviews, containing the fields:
○ category - Category of the product
○ reviewer_id - ID of the reviewer
○ rating - Rating of the product
○ review_title - Title of the user review
○ review_text - Text body of the user review.
○ attached_images - Images that users post after they have received the product
○ product_id - ID of the product
○ review_timestamp - Time of the review (unix time) - output format: UTC time as a string in 'YYYY-MM-DD HH:MM:SS'
○ is_verified_purchase - User purchase verification
○ helpful_votes - Helpful votes of the review
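The review_timestamp conversion can be sketched with the allowed datetime library. The millisecond heuristic below is an assumption: check your own data to see which unit the raw timestamps use.

```python
from datetime import datetime, timezone

def unix_to_utc_string(ts):
    """Convert a unix timestamp to the required 'YYYY-MM-DD HH:MM:SS'
    UTC string. Treating very large values as milliseconds is an
    assumption, not part of the spec."""
    ts = float(ts)
    if ts > 1e12:          # heuristic: values this large are milliseconds
        ts /= 1000.0
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# e.g. unix_to_utc_string(0) == '1970-01-01 00:00:00'
```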
VERY IMPORTANT NOTE:
1. All the tag names are case-sensitive in the output JSON file. You can refer to the sample output for the correct JSON file structure. Your output CSV and JSON files MUST follow the attribute lists above to avoid losing marks.
2. The sample output files are provided only so you can understand the structure of the required output; the correctness of their content in Task 1 is not guaranteed. Please do not try to reverse-engineer the sample outputs, as that will not produce the correct content.
Task 1 Guidelines
To complete the above task, please follow the steps below:
Step 0: Study the sample files
● Open and check your input .txt files and try to find any ‘potential interesting’ patterns for different data elements
Step 1: Txt file parsing and excel file parsing
● Load the input files
● Use regular expression (Regex) to extract the required attributes and their values as listed from the txt files
● Extract necessary data from the excel file
● Combine all data together
Step 2: Further process the extracted text from Step 1
● Remove any duplicates
● Replace empty values with ‘none’ across all variables
● Convert all text to lowercase
● Further process the extracted data
● Note for review_texts: they must be transformed into lowercase, with no HTML tags, no emojis, only valid UTF-8 characters, and be entirely in English. To ensure this:
○ To remove emojis, make sure your text data is in utf-8 format
○ Remove all HTML tags while keeping the content intact
○ Remove all emoji symbols and non-UTF-8 characters, including unreadable symbols (e.g., ◆ , □) and invalid Unicode sequences
○ If a review text does not contain enough English letters (this determination is based on the proportion of English letters in the text, with a minimum threshold of 1), it will be labelled as 'none'
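The cleaning rules above can be sketched as a single helper, a non-authoritative sketch only: the 0.5 ratio is a placeholder (the spec leaves the exact proportion decision to your group), and filtering to printable ASCII is one assumed way to drop emojis and unreadable symbols.

```python
import re

def clean_review_text(text, min_english_ratio=0.5):
    """Sketch of the Step 2 cleaning rules. The ratio threshold is a
    placeholder assumption, not the official value."""
    if not text:
        return "none"
    # Remove HTML tags while keeping their inner content intact.
    text = re.sub(r"<[^>]+>", "", text)
    # Keep printable ASCII (plus whitespace), which drops emojis and
    # symbols such as the diamond/box glyphs mentioned in the spec.
    text = "".join(
        ch for ch in text if (ch.isascii() and ch.isprintable()) or ch.isspace()
    )
    text = text.lower().strip()
    letters = sum(ch.isalpha() for ch in text)
    if not text or letters / max(len(text), 1) < min_english_ratio:
        return "none"
    return text
```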
Step 3: file output
● Output the required files based on the structures specified above, and make sure your data is UTF-8 encoded.
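The encoding requirement can be sketched as below; the file name (group number 001) and the record structure are placeholders for illustration only.

```python
import json

# Placeholder records matching the JSON structure described earlier.
records = [
    {"parent_product_id": "p1", "reviews": [{"rating": "5", "review_text": "great"}]}
]

with open("task1_001.json", "w", encoding="utf-8") as f:
    # ensure_ascii=False writes real UTF-8 characters instead of \uXXXX escapes
    json.dump(records, f, ensure_ascii=False, indent=2)
```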
Submission Requirements
You need to submit the following four files:
● A task1_<group_number>.json file containing the correct review information with all the elements listed above.
● A task1_<group_number>.csv file containing the correct review information with all the elements listed above.
● A Python notebook named task1_<group_number>.ipynb containing a well-documented report that will demonstrate your solution to Task 1. You need to present the methodology clearly, i.e., including the entire step-by-step process of your solution with appropriate comments and explanations. You should follow the suggested steps in the guideline above. Please keep this notebook easy-to-read. You will lose marks if it is hard to read and understand (make sure you PRINT OUT your cell output).
● A task1_<group_number>.py file. This file will be used for plagiarism check (make sure you clear your cell output before exporting).
In Google Colab, you can export the notebook as a .py file via File > Download > Download .py.
Requirements on the Python notebook (report)
● Methodology - 35%
○ You need to demonstrate your solution using correct regular expressions. Results from each step would help to demonstrate your solution better and be easier to understand.
○ You should present your solution in a proper way including all the required steps. Skipping any steps will cause a penalty on marks/grades.
○ You need to select and use the appropriate Python functions for input, process and output.
○ Your solution should be computationally efficient without redundant operations, and without unnecessary data (read and write) operations.
● Report organisation and writing - 15%
○ The report should be organised in a clear, well-structured way that presents your Task 1 solutions. Make sure you include clear and meaningful titles for sections (or subsections/sub-subsections) if needed.
○ Each step in your solution should be clearly described. For example, you should explain your solution idea, any specific settings, and the reasons for using any particular functions, etc.
○ Explanation of your results including all the intermediate steps is required. This can help the marking team to understand your solution and give partial marks even if the final results are not fully correct.
○ All your code needs to be properly commented. Focus on writing concise and precise comments (not excessive, lengthy, or inaccurate paragraphs).
○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.
Task 2: Text Pre-Processing (10/35)
This task involves the next step in textual data analysis: converting extracted text into a numerical representation for downstream modelling tasks. You are required to write Python code to preprocess Amazon product reviews text from Task 1 and transform it into numerical representations. These numerical representations are the standard format for text data, suitable for input into NLP systems such as recommender systems, information retrieval algorithms, and machine translation. The most fundamental step in natural language processing (NLP) tasks is converting words into numbers to enable machines to understand and decode patterns within a language. This step, although iterative, is crucial in determining the features for your machine learning models and algorithms.
Allowed libraries: ALL
Input Files:
● task1_<group_number>.json
Output Files (submission):
● <group_number>_vocab.txt
● <group_number>_countvec.txt
● task2_<group_number>.ipynb
● task2_<group_number>.py
In this task you are required to continue working with the data from task1.
You are asked to use the review text from all reviews of parent products that have at least 50 text reviews. Then pre-process the review text and generate a vocabulary list and a numerical representation for the corresponding text, which will be used in model training by your colleagues. The information regarding the output files is listed below:
● <group_number>_vocab.txt comprises unique stemmed tokens sorted alphabetically, presented in the format token:token_index
● <group_number>_countvec.txt includes numerical representations of all tokens, organised by parent_product_id and token index, following the format parent_product_id,token_index:frequency.
Carefully examine the sample output files (here) for detailed information about the output structure. For further details, please refer to the subsequent sections.
VERY IMPORTANT NOTE: The sample outputs are provided only so you can understand the structure of the required output; the correctness of their content in Task 2 is not guaranteed. Please do not try to reverse-engineer the sample outputs, as that will not produce the correct content.
Task 2 Guideline
To complete the above task, please follow the steps below:
Step 1: Text extraction
● You are required to extract the review text from the output of task 1.
● You are only required to extract the vocab and countvec lists for reviews from parent products that have at least 50 text reviews (excluding 'none')
Step 2: Generate the unigram and bigram lists and output as vocab.txt
● The following steps must be performed (not necessarily in the same order) to complete the assessment. Please note that the order of preprocessing matters and will result in different vocabulary and hence different count vectors. It is part of the assessment to figure out the correct order of preprocessing which makes the most sense as we learned in the unit. You are encouraged to ask questions and discuss them with the teaching team if in doubt.
a. The word tokenization must use the following regular expression, "[a-zA-Z]+"
b. The context-independent and context-dependent stopwords must be removed from the vocabulary.
■ For context-independent stopwords, the provided stop words list (i.e., stopwords_en.txt) must be used.
■ For context-dependent stopwords, you must set the threshold to words that appear in more than 95% of the parent products that have at least 50 text reviews.
c. Tokens should be stemmed using the Porter stemmer.
d. Rare tokens must be removed from the vocab (with the threshold set to words that appear in less than 5% of the parent products that have at least 50 text reviews).
e. Tokens with a length less than 3 should be removed from the vocab.
f. The first 200 meaningful bigrams (i.e., collocations) must be included in the vocab using the PMI measure; make sure the two words of each collocation co-occur within the same review.
g. Calculate the vocabulary containing both unigrams and bigrams.
● Combine the unigrams and bigrams, sort the list alphabetically in an ascending order and output as vocab.txt
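Some of the building blocks above can be sketched with NLTK as follows. This is a non-authoritative sketch on a toy document: the variable names are placeholders, and the order shown is NOT necessarily the correct preprocessing order, which the assessment asks you to work out yourselves.

```python
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical mapping: parent_product_id -> concatenated review text.
docs = {"p1": "The battery life is great and battery life matters"}

tokenizer = RegexpTokenizer(r"[a-zA-Z]+")   # the required tokenization regex
stemmer = PorterStemmer()                   # the required Porter stemmer

tokens = {pid: tokenizer.tokenize(text.lower()) for pid, text in docs.items()}

# Top bigrams by PMI (here on one toy document; in the assignment, ensure
# each bigram's words co-occur within the same review).
finder = BigramCollocationFinder.from_words(tokens["p1"])
top_bigrams = finder.nbest(BigramAssocMeasures.pmi, 200)

stemmed = [stemmer.stem(t) for t in tokens["p1"]]
```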
Step 3: Generate the sparse numerical representation and output as countvec.txt
1. Generate the sparse representation using scikit-learn's CountVectorizer() OR directly count the frequencies using NLTK's FreqDist().
2. Output the sparse numerical representation into txt file with the following format:
parent_product_id1,token1_index:token1_frequency,token2_index:token2_frequency,token3_index:token3_frequency, …
parent_product_id2,token2_index:token2_frequency,token5_index:token5_frequency,token7_index:token7_frequency, …
parent_product_id3,token6_index:token6_frequency,token9_index:token9_frequency,token12_index:token12_frequency, …
Note: the token_index comes from the vocab.txt and make sure you are counting bigrams
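The output format above can be sketched as follows. The vocab mapping and token lists are placeholders, and representing a bigram as its words joined by an underscore is an assumption to check against the sample vocab file.

```python
from collections import Counter

# Placeholder vocab (token -> index) and per-parent token lists; in the
# real task these come from vocab.txt and your preprocessing pipeline.
vocab = {"battery": 0, "battery_life": 1, "life": 2}
doc_tokens = {"p1": ["battery", "life", "battery", "battery_life"]}

lines = []
for pid, toks in sorted(doc_tokens.items()):
    counts = Counter(t for t in toks if t in vocab)   # zero counts never appear
    pairs = ",".join(
        f"{vocab[t]}:{c}"
        for t, c in sorted(counts.items(), key=lambda kv: vocab[kv[0]])
    )
    lines.append(f"{pid},{pairs}")
# e.g. lines[0] == "p1,0:2,1:1,2:1"
```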
Submission Requirements
You need to submit the following four files:
1. A <group_number>_vocab.txt file that contains the unigram and bigram tokens in the following format: token:token_index. Words in the vocabulary must be sorted in alphabetical order.
2. A <group_number>_countvec.txt file, in which each line contains the sparse representation of one parent product id in the following format:
parent_product_id1,token1_index:token1_frequency,token2_index:token2_frequency,token3_index:token3_frequency, …
Please note: the tokens with zero word count should NOT be included in the sparse representation.
3. A task2_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)
4. A task2_<group_number>.py file for plagiarism checks. (make sure you clear your cell outputs)
Requirements on the Python notebook (report)
● Methodology - 35%
○ You need to demonstrate your solution using correct regular expressions.
○ You should present your solution in a proper way including all required steps.
○ You need to select and use the appropriate Python functions for input, process and output.
○ Your solution should be computationally efficient without redundant operations and unnecessary data read/write operations.
● Report organisation and writing - 15%
○ The report should be organised in a clear, well-structured way that presents your Task 2 solutions. Make sure you include clear and meaningful titles for sections (or subsections/sub-subsections) if needed.
○ Each step in your solution should be clearly described. For example, you should explain your solution idea, any specific settings, and the reasons for using any particular functions, etc.
○ Explanation of your results including all the intermediate steps is required. This can help the marking team to understand your solution and give partial marks even if the final results are not fully correct.
○ All your code needs to be properly commented. Focus on writing concise and precise comments (not excessive, lengthy, or inaccurate paragraphs).
○ You can refer to the notebook templates provided as a guideline for a properly formatted notebook report.
Task 3: Data Exploratory Analysis (15/35)
In this task, you are asked to conduct a comprehensive exploratory data analysis (EDA) on the provided Amazon product review data. The goal is to uncover interesting insights that can be useful for further analysis or decision-making.
Allowed libraries: ALL
Input Files:
● task1_<group_number>.json
● task1_<group_number>.csv
Output Files (submission):
● task3_<group_number>.ipynb
● task3_<group_number>.py
Task 3 Guideline
To complete the above task, please follow the steps below:
Step 1: Understand the Amazon product review data:
● Review and try to understand the data.
● Summarise the key features and variables included in the dataset.
● Identify any initial patterns and trends
Step 2: Data Analysis:
● Perform an exploratory data analysis to investigate and uncover interesting insights.
● You are required to investigate and present at least 5 insights from your data analysis.
Example of a basic insight
● Question: What is the distribution of ratings in the selected category?
● Visualisation: A simple bar chart showing the percentage of each rating (1 to 5 stars) in the dataset.
● Interpretation: This reveals general user satisfaction trends for products in this category … which means there could be potential chances to improve profits by … . For future suggestions, the owner could …
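The example insight above could be produced with a short snippet like this (the ratings values are made up for illustration; in the assignment they would come from your Task 1 output):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripts
import matplotlib.pyplot as plt

# Hypothetical ratings column; replace with your task1 data.
ratings = pd.Series([5, 5, 4, 3, 5, 1, 4, 5])
pct = ratings.value_counts(normalize=True).sort_index() * 100

ax = pct.plot(kind="bar")
ax.set_xlabel("Rating (stars)")
ax.set_ylabel("Percentage of reviews")
ax.set_title("Distribution of ratings")
plt.tight_layout()
plt.savefig("rating_distribution.png")
```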
You are strongly recommended to read the detailed grading guidelines in the marking rubric.
Submission Requirements
You need to submit 2 files:
1. A task3_<group_number>.ipynb file that contains your report explaining the code and the methodology. (make sure you PRINT OUT your cell outputs)
2. A task3_<group_number>.py file for plagiarism check. (make sure you clear your cell outputs)
Task 4: Video presentation for Task 3 (2/35)
Create a video presentation (5-8 minutes) to effectively communicate the findings from your exploratory data analysis (EDA) on the Amazon product review data. The goal is to present your methodology and insights in a clear, concise, and engaging manner.
Output Files (submission): your recorded video presentation.
Submission Requirements
Here are the key components you need to include in your submission:
Introduction:
● Please briefly introduce yourself, including your student ID, and provide context for the analysis.
● Explain the purpose of the EDA and the datasets used (Amazon product review data).
Methodology:
● Describe the steps taken during the data analysis process.
Insights:
● Present at least 5 insights uncovered from the analysis.
● Use visual aids such as charts, graphs, or tables to support your insights.
● Explain the significance of each insight and how it can be applied or interpreted.
Conclusion:
● Summarise the key findings and their potential implications.
● Discuss any limitations of the analysis and suggest areas for further research.