
ASSIGNMENT TWO

Semester 2 - 2022


PAPER NAME: Data Mining and Machine Learning
PAPER CODE: COMP809
DUE DATE: Sunday 30 Oct 2022 at midnight
TOTAL MARKS: 100
Students’ Names: ………………………………………………………….……………………………………………….

Students’ IDs: ………………………………………………………….………………………………….………………….

Due date: 30 Oct 2022 midnight NZ time.
Late penalty: submissions will be accepted up to 24 hours after the due date; a 5% late
penalty will be applied to late submissions.
Include your actual code (not screenshots) in an appendix, with appropriate comments for
each task.
Note: This assignment should be completed by a group of two students.
Submission: a soft copy must be submitted through the Canvas assessment link.


INSTRUCTIONS:
1. The following actions may be deemed to constitute a breach of the General Academic
Regulations, Part 7: Academic Discipline:
Communicating or collaborating with another person regarding the Assignment
Copying from any other student's work for your Assignment
Copying from any third-party websites unless it is an open-book Assignment
Using any other unfair means
2. Please email DCT.EXAM@AUT.AC.NZ immediately if you have any technical issues with your
submission on Canvas.
3. Attach your code for all the datasets in the appendix section.

Part A: Clustering Methods (40 marks)
For this question, you will explore the clustering methods you have learnt in this course. You have
been given datasets from three very different application environments and you are required to
explore three widely used clustering algorithms and deploy each of them on the different datasets.
The three algorithms that you have decided to explore are 1) K-Means, 2) DBSCAN, and 3)
Agglomerative clustering.
The three datasets that you have been given are:
Dow Jones Index
Facebook Live Sellers in Thailand
Sales Transactions

You need to complete the following three tasks as detailed below.
Task 1
For each activity in this task, you must explain each dataset, perform data exploration and data
pre-processing, and apply a suitable feature selection algorithm before deploying each clustering
algorithm. Your clustering results should include the following measures: the time taken, the Sum of
Squared Errors (SSE), and the Cluster Silhouette Measure (CSM). You may use the Davies-Bouldin
score as an alternative to SSE. Submit the Python code used for parts a) to c) below; you only need to
submit the code for one of the three datasets. A minimal code sketch of this evaluation workflow is
shown after part c).
a) Run the K-Means algorithm on each of the three datasets. Obtain the best value of K using
SSE and/or CSM. Tabulate your results in a 3 by 3 table, with each row corresponding
to a dataset and each column corresponding to one of the three measures mentioned above.
Display the CSM plot for the best value of the K parameter for each dataset. [5 marks]

b) Repeat the same activity for the DBSCAN algorithm and tabulate your results once again, just as
you did for part a). Display the CSM plot and the 3 by 3 table for each dataset. [5 marks]

c) Finally, use the Agglomerative algorithm and document your results as you did for parts a)
and b). Display the CSM plot and the 3 by 3 table for each dataset. [5 marks]
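
The sketch below illustrates one way to carry out the evaluation described in parts a) to c). It is a
minimal, non-prescriptive example: it uses scikit-learn's KMeans, DBSCAN and AgglomerativeClustering,
a synthetic make_blobs array as a stand-in for a pre-processed dataset, and illustrative parameter values
(the K range, eps and min_samples are assumptions, not required settings).

    import time
    from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score, davies_bouldin_score

    # Placeholder data: in the assignment this would be a pre-processed, scaled dataset.
    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

    def evaluate(name, model, X):
        # Fit a clustering model and report time taken, SSE (K-Means only), CSM and DBI.
        start = time.time()
        labels = model.fit_predict(X)
        elapsed = time.time() - start
        sse = getattr(model, "inertia_", None)                 # only K-Means exposes SSE
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        csm = silhouette_score(X, labels) if n_clusters > 1 else float("nan")
        dbi = davies_bouldin_score(X, labels) if n_clusters > 1 else float("nan")
        print(f"{name}: time={elapsed:.3f}s, SSE={sse}, CSM={csm:.3f}, DBI={dbi:.3f}")
        return csm

    # K-Means: choose the best K by CSM (the elbow method on SSE is an alternative).
    best_k = max(range(2, 11),
                 key=lambda k: evaluate(f"K-Means k={k}",
                                        KMeans(n_clusters=k, n_init=10, random_state=0), X))

    # DBSCAN and Agglomerative clustering with illustrative parameter values.
    evaluate("DBSCAN", DBSCAN(eps=0.5, min_samples=5), X)
    evaluate("Agglomerative", AgglomerativeClustering(n_clusters=best_k), X)

The CSM plot for the chosen parameters can then be built from the per-sample silhouette values
(sklearn.metrics.silhouette_samples).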

Task 2
a) For each dataset identify which clustering algorithm performed best. Justify your answer.
In the event that no single algorithm performs best on all three performance measures, you
will need to carefully consider how you will rate each of the measures and then decide how
you will produce an overall measure that will enable you to rank the algorithms. [5 marks]

b) For each winning algorithm and each dataset, explain why it produced the best value for the
CSM measure. This explanation must refer directly to the conceptual design details of the
algorithm. There is no need to produce any further experimental evidence for this part of the
question. [5 marks]

c) Based on what you produced in a) above, which clustering algorithm would you consider to
be the overall winner (i.e., after taking into consideration performance across all three
datasets)? Justify your answer. [5 marks]

Part B: Predictions of Particulate Matter (PM2.5 or PM10) [60 marks]
Air pollution causes serious damage to public health and, based on existing research, particulate matter
(PM) smaller than 2.5 μm (PM2.5) is currently considered to have the strongest correlation with the effects
of cardiovascular disease. Therefore, making accurate predictions of PM2.5 is a crucial task. In this part,
you are required to build prediction models based on multi-layer perceptron (MLP) and long short-term
memory (LSTM).

Dataset
The dataset for this experiment can be downloaded from the Environmental Auckland Data Portal. Your
dataset includes PM2.5 or PM10 (output) and different predictors such as air pollution, the Air Quality Index
(AQI), and meteorological data collected on an hourly basis from only one of the air quality monitoring
stations listed below:
Penrose Station (ID:7)
Takapuna Station (ID:23)
Two PM lag measurements, lag1 and lag2, should be included in your dataset. For example, lag1 for PM2.5
is the measurement for the previous hour (h-1) and lag2 is the PM2.5 concentration for h-2.
Download the relevant PM concentration, air pollution data (SO2, NO, NO2), and meteorological data
(Solar Radiation (W/m2), Air Temperature (°C), Relative Humidity (%), Wind Direction (°), and
Wind Speed (m/s)). The dataset should consist of hourly measurements from January 2017 to
December 2021 (5 years).
Note 1: Not all mentioned independent variables are collected at these monitoring stations.
Note 2: The unit of measurement for PM and air pollution data should be (μg/m3).
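
The lag variables described above can be added with pandas shift(). The sketch below is a minimal
example in which the file name penrose_hourly.csv, the Timestamp index column, and the PM2.5 column
name are assumptions about how the downloaded data is organised.

    # Minimal sketch: adding lag1 and lag2 features for hourly PM2.5 with pandas.
    # The file name, 'Timestamp' index column and 'PM2.5' column name are assumptions.
    import pandas as pd

    df = pd.read_csv("penrose_hourly.csv", parse_dates=["Timestamp"], index_col="Timestamp")
    df = df.sort_index()                                  # chronological order before shifting
    df["PM2.5_lag1"] = df["PM2.5"].shift(1)               # concentration at hour h-1
    df["PM2.5_lag2"] = df["PM2.5"].shift(2)               # concentration at hour h-2
    df = df.dropna(subset=["PM2.5_lag1", "PM2.5_lag2"])   # the first two rows have no lags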
Data Pre-processing [5 marks]
Make sure all your data has the same temporal resolution (i.e., hourly measurements). Perform data
exploration and identify missing data and outliers (data that are out of the expected range). For example,
an unusual air temperature measurement of 40 °C for Auckland, Relative Humidity measurements
above 100%, and negative or unexplained high concentrations are outliers.
Provide attribute-specific information about outliers and missing data. How can these affect
dataset quality?
Based on this analysis, decide on and justify your approach to data cleaning. Once your dataset
is cleaned, move to the next step for feature selection.
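
A minimal cleaning sketch, continuing from the DataFrame df built above, is shown below. The column
names, physical-range thresholds, and the choice of time-based interpolation are illustrative assumptions;
you should justify your own rules.

    # Minimal data-cleaning sketch; thresholds, column names and the interpolation
    # strategy are illustrative choices, not values prescribed by the assignment.
    import numpy as np

    df = df.resample("H").mean()                           # enforce a common hourly resolution
    print(df.isna().sum())                                 # attribute-specific missing-data counts

    # Flag out-of-range values as missing (example physical-range rules).
    df.loc[df["Relative Humidity"] > 100, "Relative Humidity"] = np.nan
    df.loc[df["Air Temperature"] > 35, "Air Temperature"] = np.nan   # implausible for Auckland
    for col in ["PM2.5", "PM2.5_lag1", "PM2.5_lag2", "SO2", "NO", "NO2"]:
        df.loc[df[col] < 0, col] = np.nan                  # negative concentrations are invalid

    # One possible repair: interpolate short gaps in time, then drop rows still incomplete.
    df = df.interpolate(method="time", limit=3).dropna()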
Feature Selection [5 marks]
Choose the five attributes of your dataset that have the highest correlation with PM2.5 concentration, using
Pearson correlation or any other feature selection method of your choice, with justification.
Provide the correlation plot (or the results of any other feature selection method of your choice)
and elaborate on the rationale for your selection.
Describe your chosen attributes and their influence on PM concentration.
Provide a graphical visualisation of the variation in PM concentration.
Provide summary statistics of the PM concentration.
Provide summary statistics, in tabular format, of the predictors of your choice that have the highest
correlation.
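
The following sketch shows one way to carry out Pearson-correlation-based selection and produce the
correlation plot and summary statistics. It assumes the cleaned, all-numeric DataFrame df from the
previous step with PM2.5 as the target column; seaborn is used only for the heatmap and is an optional
choice.

    # Minimal Pearson-correlation feature-selection sketch; 'PM2.5' is the assumed target.
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.corr(method="pearson")                        # pairwise Pearson correlations
    top5 = corr["PM2.5"].drop("PM2.5").abs().nlargest(5)    # five strongest predictors of PM2.5
    print(top5)

    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")   # correlation plot
    plt.title("Pearson correlation matrix")
    plt.tight_layout()
    plt.show()

    print(df["PM2.5"].describe())           # summary statistics of the PM concentration
    print(df[top5.index].describe().T)      # summary statistics of the selected predictors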

Experimental Methods
Use 70% of the data for training and the rest for testing the MLP and LSTM models. Use a Workflow
diagram to illustrate the process of predicting PM concentrations using the MLP and LSTM models.
[5 marks]

For both models, provide the root mean square error (RMSE), Mean Absolute Error (MAE), and coefficient
of determination (R2) to quantify the prediction performance of each model.
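
A minimal sketch of the 70/30 split and the three performance measures is given below. It keeps the data
in chronological order (a common choice for time-series prediction, though not mandated here) and
assumes df and the selected predictors top5 from the feature-selection sketch.

    # Minimal sketch of a chronological 70/30 split and the three evaluation metrics.
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

    features = list(top5.index)                # the five selected predictors (assumed above)
    X, y = df[features].values, df["PM2.5"].values

    split = int(len(df) * 0.7)                 # keep time order: first 70% train, last 30% test
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    def report(y_true, y_pred, name):
        # RMSE, MAE and R2 for one model on one dataset.
        rmse = np.sqrt(mean_squared_error(y_true, y_pred))
        mae = mean_absolute_error(y_true, y_pred)
        r2 = r2_score(y_true, y_pred)
        print(f"{name}: RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")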

Multilayer Perceptron (MLP)
1) In your own words, describe multilayer perceptron (MLP). You may use one diagram in your
explanation (one page). [5 marks]

2) Use sklearn.neural_network.MLPRegressor with a single hidden layer of k = 25 neurons. Keep
default values for all other parameters and experimentally
determine the learning rate that gives the highest performance on the testing dataset. Use this
as a baseline for comparison in later parts of this question. [5 marks]
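
One possible implementation of this baseline is sketched below. It assumes the X_train/X_test split and
report() helper from the earlier sketch; the candidate learning rates and the use of feature standardisation
are illustrative choices, and all other MLPRegressor parameters stay at their defaults.

    # Minimal sketch of the single-hidden-layer MLP baseline (k = 25 neurons).
    from sklearn.neural_network import MLPRegressor
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)      # MLPs are sensitive to feature scale
    X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)

    for lr in [0.0001, 0.001, 0.01, 0.1]:       # illustrative candidate learning rates
        mlp = MLPRegressor(hidden_layer_sizes=(25,), learning_rate_init=lr, random_state=0)
        mlp.fit(X_tr, y_train)
        report(y_test, mlp.predict(X_te), f"MLP baseline, lr={lr}")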

3) Experiment with two hidden layers and experimentally determine the split of the number of
neurons across the two layers that gives the highest accuracy. In part 2, all k
neurons were in a single layer; in this part, transfer neurons from the first hidden layer to the
second iteratively in steps of 1. Thus, for example, in the first iteration the first hidden layer
will have k-1 neurons whilst the second layer will have 1; in the second iteration, k-2 neurons
will be in the first layer with 2 in the second, and so on. [5 marks]
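
A minimal sketch of the neuron-split loop is shown below, assuming the scaled arrays and report() helper
from the baseline sketch; best_lr is a placeholder for the learning rate selected in part 2.

    # Minimal sketch of the two-hidden-layer neuron-split experiment (k = 25 in total).
    from sklearn.neural_network import MLPRegressor

    k = 25
    best_lr = 0.001            # placeholder: substitute the best learning rate found in part 2
    for second in range(1, k):                  # iteration i uses (k - i, i) neurons
        first = k - second
        mlp = MLPRegressor(hidden_layer_sizes=(first, second),
                           learning_rate_init=best_lr, random_state=0)
        mlp.fit(X_tr, y_train)
        report(y_test, mlp.predict(X_te), f"MLP layers=({first},{second})")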

4) From the results in part 3 of this question, you will observe a variation in the obtained
performance metrics with the split of neurons across the two layers. Explain possible
reasons for this variation and state which architecture gives the best performance.
[5 marks]

Long Short-Term Memory (LSTM)
1) Describe the LSTM architecture, including the gates and state functions. How does LSTM differ
from MLP? Discuss how the number of neurons and the batch size affect the performance of
the network. [5 marks]

2) To create the LSTM model, apply Adaptive Moment Estimation (ADAM) to train the networks.
Identify an appropriate cost function to measure model performance based on the training samples
and the related prediction outputs. To find the best epoch based on your cost function
results, complete 30 runs, keeping the learning rate and the batch size
constant at 0.01 and 4 respectively. Provide a line plot of the test and train cost function
scores for each epoch. Report the summary statistics (Mean, Standard Deviation,
Minimum and Maximum) of the cost function as well as the run time for each epoch.
Choose the best epoch with justification. [5 marks]
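
A minimal Keras sketch of such a model is shown below (TensorFlow/Keras is an assumed choice of
library; the assignment does not mandate one). MSE is used as the cost function, Adam with learning rate
0.01 and batch size 4 as specified; 50 LSTM neurons and 100 epochs are illustrative values, and the
30-run repetition would wrap this single fit.

    # Minimal LSTM sketch; MSE is the chosen cost function, Adam with learning rate 0.01
    # and batch size 4 as specified; 50 neurons and 100 epochs are illustrative values.
    import matplotlib.pyplot as plt
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense
    from tensorflow.keras.optimizers import Adam

    # Reshape the tabular features into the (samples, timesteps, features) shape LSTM expects.
    X_tr3 = X_tr.reshape((X_tr.shape[0], 1, X_tr.shape[1]))
    X_te3 = X_te.reshape((X_te.shape[0], 1, X_te.shape[1]))

    model = Sequential([
        LSTM(50, input_shape=(X_tr3.shape[1], X_tr3.shape[2])),
        Dense(1),
    ])
    model.compile(optimizer=Adam(learning_rate=0.01), loss="mse")

    history = model.fit(X_tr3, y_train, validation_data=(X_te3, y_test),
                        epochs=100, batch_size=4, verbose=0)

    # Line plot of the train and test cost-function scores for each epoch.
    plt.plot(history.history["loss"], label="train MSE")
    plt.plot(history.history["val_loss"], label="test MSE")
    plt.xlabel("Epoch"); plt.ylabel("MSE"); plt.legend(); plt.show()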

3) Investigate the impact of varying the batch size: complete 30 runs, keeping the
learning rate constant at 0.01, and use the best number of epochs obtained in step 2.
Report the summary statistics (Mean, Standard Deviation, Minimum and Maximum) of the cost
function as well as the run time for each batch size. Choose the best batch size with
justification. [5 marks]
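
The 30-run experiment can be organised as in the sketch below, which continues from the LSTM sketch
above; the candidate batch sizes and the best_epochs placeholder are assumptions. The same loop
structure applies to the neuron sweep in part 4.

    # Minimal sketch of the 30-run batch-size sweep; candidate batch sizes and the
    # best_epochs placeholder are assumptions, not prescribed values.
    import time
    import numpy as np

    def build_model(neurons=50):
        m = Sequential([LSTM(neurons, input_shape=(X_tr3.shape[1], X_tr3.shape[2])), Dense(1)])
        m.compile(optimizer=Adam(learning_rate=0.01), loss="mse")
        return m

    best_epochs = 50                       # placeholder: the best epoch chosen in step 2
    for batch_size in [2, 4, 8, 16, 32]:   # illustrative candidates
        scores, times = [], []
        for run in range(30):
            model = build_model()
            start = time.time()
            model.fit(X_tr3, y_train, epochs=best_epochs, batch_size=batch_size, verbose=0)
            times.append(time.time() - start)
            scores.append(model.evaluate(X_te3, y_test, verbose=0))   # test-set MSE
        print(f"batch={batch_size}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}, "
              f"min={np.min(scores):.3f}, max={np.max(scores):.3f}, time={np.mean(times):.1f}s")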

4) Investigate the impact of varying the number of neurons in the hidden layer while keeping the
epoch (step 2) and batch size (step 3) constant, for 30 runs. Report the summary statistics (Mean,
Standard Deviation, Minimum and Maximum) of the cost function as well as the run time.
Discuss how the number of neurons affects performance and what the optimal number of
neurons is in your experiment. [5 marks]
Model Comparison
1) Plot the actual and predicted PM concentrations for each model to visually compare model
performance. What do you observe? [2.5 marks]

2) Compare the performance of both MLP and LSTM using RMSE. Which model performed
better? Justify your finding. [2.5 marks]
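
A minimal sketch of the visual and RMSE comparison is given below, assuming mlp and model hold the
best fitted MLP and LSTM from the earlier sketches and report() is the metrics helper defined earlier.

    # Minimal sketch of the actual-vs-predicted comparison; mlp and model are assumed
    # to hold the best fitted MLP and LSTM from the earlier parts.
    import matplotlib.pyplot as plt

    pred_mlp = mlp.predict(X_te)
    pred_lstm = model.predict(X_te3).ravel()

    plt.figure(figsize=(12, 4))
    plt.plot(y_test, label="Actual PM2.5")
    plt.plot(pred_mlp, label="MLP predicted")
    plt.plot(pred_lstm, label="LSTM predicted")
    plt.xlabel("Test sample (hour)"); plt.ylabel("PM2.5 (μg/m3)")
    plt.legend(); plt.tight_layout(); plt.show()

    report(y_test, pred_mlp, "MLP")        # RMSE, MAE and R2 from the earlier helper
    report(y_test, pred_lstm, "LSTM")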

Report Presentation [5 marks]
