代写FIT5202 - Data processing for Big Data 2025 SSB Assignment 2代做Python编程-留学生作业帮

代写FIT5202 - Data processing for Big Data 2025 SSB Assignment 2代做Python编程

FIT5202 - Data processing for Big Data 2025 SSB

Assignment 2: Building Models for Realtime Food Delivery Prediction

Weight: 30% (15% A2A+15% A2B)

Background

Food delivery services have become an integral part of modern society, revolutionising the way we consume meals and interact with the food industry. These platforms, accessible through websites and mobile apps, provide a convenient bridge between restaurants and consumers, allowing users to browse menus, place orders, and have food delivered directly to their doorstep with just a few taps. In today's fast-paced world, where time is a precious commodity, food delivery services offer an invaluable solution, catering to busy lifestyles, limited mobility, and the ever-present desire for convenience.

In the food delivery industry, accurate on-time delivery prediction is paramount. Big data processing allows companies to achieve this by analysing vast datasets encompassing order details, driver performance, real-time traffic, and even weather.

Objective of the Project

In the first assignment (A1), we performed data analysis with the datasets to uncover key trends and patterns related to delivery times, order volumes, and other crucial metrics.

In assignment 2A, we will harness the power of Apache Spark's MLLib to construct and train machine learning models. We will focus on accurately and efficiently predicting delivery times.

Finally, assignment 2B will utilise Apache Spark Structured Streaming and our ML model from 2A to process live data streams and dynamically make predictions.

Key Information

This is a two-part assignment (A2A and A2B) that requires staged submissions. In part A2A, you are going to use the provided dataset, complete the assignment tasks, and build your

ML model; then, in part A2B, the trained ML model will be used in combination with streaming data to make real-time predictions.

A2A Due Date: (23:55 Friday 31/Jan/2025, End of Week 5)

A2B Due Date: (23:55 Wednesday 5/Feb/2025, Mid of Week 6) Submission links can be found in Moodle.

Weight: 30% of Final Marks (15% each for 2A and 2B) A2A and A2B will be marked separately.

A2B has a compulsory interview/demo component, which will be conducted during the last lab. The teaching team only marks A2B submissions during your demo session. Failure to attend this demo will result in 0 marks (for A2B).

(Please pay attention to the unit announcement in the final teaching week. If you have an extension/special consideration, more demo sessions will be arranged.)

The Datasets:

- order.csv

- driver.csv

- delivery_address.csv

- restaurant.csv

- new_order.csv (for A2B)

What you need to achieve

Use Case 1 (A2A)	Based on the historical dataset, build a ML model that can predict food delivery time.	Regression
Use Case 2 (A2B)	Use streaming data to perform. real-time prediction and visualise the results	Spark Structured Streaming

Architecture

The following figure represents the overall architecture of the assignment setup. Part A of the assignment consists of preparing the data, performing data exploration and extracting features, and building and persisting the machine learning models.

Fig 1: Overall Architecture for Assignment 2

In both parts, you must implement the solutions using PySpark DataFrame/MLlib for the data pre-processing and machine learning pipelines. Excessive use of Pandas for data processing is discouraged. Please follow the steps to document the processes and write the codes in your Jupyter Notebook.

Getting Started

● Download the datasets from Moodle.

● Download a template file for submission purposes:

● A2A_template.ipynb file in Jupyter Notebook to write your solution. Rename it into the format (for example, A2A_xxxx0000.ipynb. xxxx0000 is your authcate ID.

● You will use Python 3+ and PySpark 3.5.0+ for this assignment (This environment is the same as we used in labs.)

IMPORTANT:

Please answer each question in your Jupyter Notebook file using code/markdown cells. Acknowledge any ideas or codes you referenced from others in the markdown cell or reference list.

If you use generative AI tools, all prompts you use should also be included in the reference section or appendix.

A2A Part 1: Data Loading, Transformation and Exploration (40%)

In this section, you must load the given datasets into PySpark DataFrames and use DataFrame. functions to process the data. Spark SQL usage is discouraged, and you can only use pandas to format results. For plotting, various visualisation packages can be used, but please ensure that you have included instructions to install the additional packages, and that the installation will be successful in the provided docker container (in case your marker needs to clear the notebook and rerun it).

1.1 Data Loading (5%)

1. Write the code to create a SparkSession. Please use a SparkConf object to configure the Spark app with a proper application name, to ensure the maximum partition size does not exceed 16MB, and to run locally with 4 CPU cores on your machine. (2%)

2. Write code to define the schemas for the datasets, following the data types suggested in the metadata. Then, using predefined schemas, write code to load the CSV files into separate data frames. Print the schemas of all data frames. (3%)

1.2 Data Transformation to Create Features (10%)

Feature engineering involves transforming, combining or extracting information from the raw data to create more informative and relevant features that improve the performance of your ML models. In our food delivery use case, the order_ts is not very useful when it is treated as a timestamp. However, it provides more information if you perform. transformation and extract valuable information from it, for example, extracting the day of the week (it may tell you how busy a restaurant is) or hours (peak hours may have bad traffic conditions).

(Note: Some tasks may overlap with A1, feel free to use/reuse your own code/UDF from A1.)

Perform. the following tasks based on the loaded data frames and create a new one. We will refer to this as feature_df, but feel free to use your own naming. (2% each) Please print 5 rows from the feature_df after each step.

1. Extract the day of the week (Monday-Sunday) and hour of the day (0-23) from order_ts, and store the extract information in 2 columns.

2. Create a new boolean column (isPeak) to indicate peak/non-peak hours. (Peak hours are defined as 7-9 and 16-18 in 24-hour format.)

3. Join the geolocation data frame of the restaurant and delivery location, get suburb information and add two columns.

4. Join data frames to add restaurant information to the feature_df: primary_cuisine, latitude, longitude, suburb and postcode.

5. Add columns you deem necessary from the dataset (at least one column is required). (hint: delivery driver’s vehicle type may affect the delivery time.)

1.3 Exploring the data (25%)

1. With the feature_df, write code to show the basic statistics: a) For each numeric column, show count, mean, stddev, min, max, 25 percentile, 50 percentile, 75 percentile; b) For each non-numeric column, display the top-5 values and the corresponding counts; c) For each boolean column, display the value and count. (5%)

2. Explore the dataframe and write code to present two plots, describe your plots and discuss the findings from the plots. (20%)

○ One of the plots must be related to our use case (predicting delivery time).

○ Hint 1: You can use basic plots (e.g., histograms, line charts, scatter plots) to show the relationship between a column and the label or use more advanced plots like correlation plots.

○ Hint 2: If your data is too large for plotting, consider using sampling before plotting.

○ 150 words max for each plot’s description and discussion

○ Feel free to use any plotting libraries: matplotlib, seabon, plotly, etc.

A2A Part 2. Feature extraction and ML training (50%)

In this section, you must use PySpark DataFrame. functions and ML packages for data preparation, model building, and evaluation. Other ML packages, such as scikit-learn, should not be used to process the data; however, it’s fine to use them to display the result or evaluate your model.

2.1 Discuss the feature selection and prepare the feature columns (10%)

1. Based on the data exploration from 1.2 and considering the use case, discuss the importance of those features (For example, which features may be useless and should be removed, which feature has a significant impact on the label column, which should be transformed), which features you are planning to use? Discuss the reasons for selecting them and how you plan to create/transform them.

○ 300 words max for the discussion

○ Please only use the provided data for model building

○ You can create/add additional features based on the dataset

○ Hint - Use the insights from the data exploration/domain knowledge/statistical models to consider whether to create more feature columns, whether to remove some columns

2. Write code to create/transform. the columns based on your discussion above.

2.2 Preparing Spark ML Transformers/Estimators for features, labels, and models (10%)

1. Write code to create Transformers/Estimators for transforming/assembling the columns you selected above in 2.1 and create ML model Estimators for Random Forest (RF) and Gradient-boosted tree (GBT) model.

o Please DO NOT fit/transform. the data yet.

2. Write code to include the above Transformers/Estimators into two pipelines.

o Please DO NOT fit/transform. the data yet.

(Some students may be confused about the differences between 2.1.2 and 2.2.1. Task 2.1.2 is for the new or customised feature you discussed and created, while 2.2.1 is for the “standard” features in the dataset like road_condition or weather_condition, which obviously affect the delivery time.)

2.3 Preparing the training data and testing data (5%)

1. Write code to split the data for training and testing, using 2025 as the random seed. You can decide the train/test split ratio based on the resources available on your laptop. Note: Due to the large dataset size, you can use random sampling (say 20% of the dataset).

2.4 Training and evaluating models (25%)

1. Write code to use the corresponding ML Pipelines to train the models on the training data from 2.3. And then use the trained models to predict the testing data from 2.3

2. For both models (RF and GBT): with the test data, decide on which metrics to use for model evaluation and discuss which one is the better model (no word limit; please keep it concise). You may also use a plot for visualisation (not mandatory).

3. Save the better model (you’ll need it for A2B).

(Note: You may need to go through a few training loops or use more data to create a better-performing model.)

A2A Part 3. Hyperparameter Tuning and Model Optimisation (10%)

3.1 Apply the techniques you have learnt from the labs, for example, CrossValidator, TrainValidationSplit, ParamGridBuilder, etc., to perform further hyperparameter tuning and model optimisation.

The assessment is based on the quality of your work/process, not the quality of your model. Please include your thoughts/ideas/discussions.

(A2 Part B Specification: To be released in Week 5)

Submission A2A

You should submit your final version of the assignment solution online via Moodle. You must submit the files created:

- Your jupyter notebook file A2A_authcate.ipynb

- A pdf file saved from Jupyter Notebook with all output following the file naming format as follows: A2A_authcate.pdf

- A2A Due date: (23:55 Friday 31/Jan/2025)

Note that both submitted (ipynb and pdf) files will be scanned using plagiarism detection software. The highest similarity score among students may be interviewed to prove the originality of the task.

Submission A2B

You should submit your final version of the assignment solution via Moodle. You must submit the following:

● A zip file named based on your authcate name (e.g. abcd1234). The zip file should contain

○ Assignment-2B-Task1_producer_authcate.ipynb

○ Assignment-2B-Task2_spark_streaming_authcate.ipynb

○ Assignment-2B-Task3_consumer_authcate.ipynb

The file in submission should be a ZIP file and not any other kind of

compressed folder (e.g. .rar, .7zip, .tar). Please do not include the data files in the ZIP file.

The A2B due date is 23:55 Wednesday 5/Feb/2025

课程名

mktg2509 csci 2600 38170 lng302 csse3010 phas3226 77938 arch1162 engn4536/engn6536 acx5903 comp151101 phl245 cse12 comp9312 stat3016/6016 phas0038 comp2140 6qqmb312 xjco3011 rest0005 ematm0051 5qqmn219 lubs5062m eee8155 cege0100 eap033 artd1109 mat246 etc3430 ecmm462 mis102 inft6800 ddes9903 comp6521 comp9517 comp3331/9331 comp4337 comp6008 comp9414 bu.231.790.81 man00150m csb352h math1041 eengm4100 isys1002 08 6057cem mktg3504 mthm036 mtrx1701 mth3241 eeee3086 cmp-7038b cmp-7000a ints4010 econ2151 infs5710 fins5516 fin3309 fins5510 gsoe9340 math2007 math2036 soee5010 mark3088 infs3605 elec9714 comp2271 ma214 comp2211 infs3604 600426 sit254 acct3091 bbt405 msin0116 com107/com113 mark5826 sit120 comp9021 eco2101 eeen40700 cs253 ece3114 ecmm447 chns3000 math377 itd102 comp9444 comp(2041|9044) econ0060 econ7230 mgt001371 ecs-323 cs6250 mgdi60012 mdia2012 comm221001 comm5000 ma1008 engl642 econ241 com333 math367 mis201 nbs-7041x meek16104 econ2003 comm1190 mbas902 comp-1027 dpst1091 comp7315 eppd1033 m06 ee3025 msci231 bb113/bbs1063 fc709 comp3425 comp9417 econ42915 cb9101 math1102e chme0017 fc307 mkt60104 5522usst litr1-uc6201.200 ee1102 cosc2803 math39512 omp9727 int2067/int5051 bsb151 mgt253 fc021 babs2202 mis2002s phya21 18-213 cege0012 mdia1002 math38032 mech5125 07 cisc102 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 math21112 efim20036 mn-3503 fins5568 110.807 bcpm000028 info6030 bma0092 bcpm0054 math20212 ce335 cs365 cenv6141 ftec5580 math2010 ec3450 comm1170 ecmt1010 csci-ua.0480-003 econ12-200 ib3960 ectb60h3f cs247—assignment tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 math350-real math2014 eec180 stat141b econ2101 msinm014/msing014/msing014b fit2004 comp643 bu1002 cm2030

联系我们

EMail: 99515681@qq.com

QQ: 99515681

留学生作业帮-留学生的知心伴侣！

工作时间：08:00-21:00

微信客服：codinghelp

热门主题

课程名