
ASSIGNMENT TWO

Semester 1 - 2023

PAPER NAME: Data Mining and Machine Learning

PAPER CODE: COMP809

DUE DATE: Sunday 9th Jun 2024 at midnight

TOTAL MARKS: 100

Part A Assessment Tasks                      [20 marks]

The objective of this assignment is to conduct preliminary research on data mining methods used in various application domains. The survey is intended to assist you in establishing a suitable framework (application area, tools, algorithms) on which your mining project will be based.

To achieve this objective, you need to follow the steps below:

1.   Select a topic from the application domains listed in Section 2, or propose your own topic, which must be confirmed with the teaching team.

2.   Read and analyse recent peer-reviewed papers (minimum of 6 articles) on your specific topic.

3.   From your research, identify at least two themes and discuss them by comparing how the various papers you reviewed address each one. Some examples of themes are listed below:

.    Approaches/algorithms to solve the problem

.    Scientific results from experimentation

.    Perspectives on an issue

.    Advantages/disadvantages

4.   Express your own opinions, e.g., new ideas, proposed approaches/models, or how to extend the existing work. Present your opinions on machine learning and data mining related issues.

5.   Write the report using LaTeX or Word. It must be a minimum of 4 pages (including references) and no more than 6 pages, in two-column IEEE proceedings format.

2. Topics

You can pick one of the following topics or come up with a topic of your interest.

.    Healthcare

.    Banking and Finance

.    Retail, Customer Relationship Management, Product Recommendation

.    Computer Vision

.    Fraud Analysis

3. Layout for Research Report

The research report must include:

.     Title

.     Abstract

.     Introduction

.     Background/motivation

.     Comparison of related work (from peer-reviewed sources)

.     Your opinion: new ideas, proposed approaches/models, how to extend the existing work

.     Conclusion and future issues

.     References

Part B: Predictions of Particulate Matter (PM2.5 or PM10)               [80 marks]

Air pollution causes serious damage to public health, and based on existing research, particulate matter (PM) with a diameter smaller than 2.5 μm (PM2.5) is currently considered to have the strongest correlation with cardiovascular disease effects. Therefore, making accurate predictions of PM2.5 is a crucial task. In this part, you are required to build prediction models based on a regression model, a multi-layer perceptron (MLP), and long short-term memory (LSTM).

Dataset: The dataset for this experiment can be downloaded from the Environmental Auckland Data Portal. Your dataset includes PM2.5 / PM10 (output) and different predictors such as air pollution, the Air Quality Index (AQI), and meteorological data collected on an hourly basis from only one of the air quality monitoring stations listed below:

.    Penrose Station (ID:7)

.    Takapuna Station (ID:23)

Two PM_lag measurements, lag1 and lag2, should be included in your dataset. For example, lag1 for PM2.5 is the measurement for the previous hour (h-1), and lag2 is the PM2.5 concentration for h-2.
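The two lag columns can be built with a simple pandas sketch (the column names, index and sample values here are illustrative, not part of the supplied dataset):

```python
import pandas as pd

# Hypothetical hourly frame with a "PM2.5" column, indexed by timestamp.
df = pd.DataFrame(
    {"PM2.5": [12.0, 15.0, 14.0, 18.0, 20.0]},
    index=pd.date_range("2019-01-01", periods=5, freq="h"),
)

# lag1 = concentration at h-1, lag2 = concentration at h-2.
df["PM2.5_lag1"] = df["PM2.5"].shift(1)
df["PM2.5_lag2"] = df["PM2.5"].shift(2)

# The first two rows have no complete lag history and are dropped.
df = df.dropna()
```

Because `shift` works on row order, the frame must be sorted chronologically before the lags are taken.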

Download the relevant PM concentration data, air pollution data (SO2, NO, NO2), and meteorological data (Solar Radiation (W/m²), Air Temperature (°C), Relative Humidity (%), Wind Direction (°), and Wind Speed (m/s)). The dataset should consist of hourly measurements from January 2019 to December 2023 (5 years).

Note 1:  Not all mentioned independent variables are collected at these monitoring stations.

Note 2: The unit of measurement for PM and air pollution data should be μg/m³.

Introduction and Data Pre-processing                                                                                      [10 marks]

Make sure all attributes in your dataset have the same temporal resolution (i.e. hourly measurements). Perform data exploration and identify missing data and outliers (values outside the expected range). For example, an air temperature of 40 °C in Auckland, relative humidity measurements above 100%, and negative or unexplainably high concentrations are outliers.

.    Introduce the problem being addressed in this assignment.

.    Provide attribute-specific information about outliers and missing data. How can these affect dataset quality?

.    Based on this analysis, decide on and justify your approach for data cleaning. Once your dataset is cleaned, move to the next step for feature selection.
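One possible starting point is a rule-based cleaning pass along the lines of the examples above; the column names, thresholds, and interpolation settings below are assumptions to adapt to your own exploration:

```python
import numpy as np
import pandas as pd

def clean_air_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Mask physically implausible values as missing, then interpolate.

    Column names and thresholds are illustrative; adjust them to the
    attributes actually available at your station.
    """
    df = df.copy()
    # Rule-based outlier masks from the data exploration step.
    if "AirTemp" in df:
        df.loc[df["AirTemp"] > 40, "AirTemp"] = np.nan        # implausible for Auckland
    if "RelHumidity" in df:
        df.loc[df["RelHumidity"] > 100, "RelHumidity"] = np.nan
    for col in ("PM2.5", "SO2", "NO", "NO2"):
        if col in df:
            df.loc[df[col] < 0, col] = np.nan                 # negative concentrations
    # Short gaps: linear interpolation is one defensible choice for hourly data.
    return df.interpolate(method="linear", limit=3)

# Tiny demo frame with one outlier in each column.
demo = pd.DataFrame({"AirTemp": [15.0, 45.0, 17.0], "PM2.5": [10.0, -2.0, 12.0]})
cleaned = clean_air_quality(demo)
```

Whatever rules you choose, report them explicitly, since the cleaning decisions are part of the marked justification.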

Data Exploration and Feature Selection                                                                                   [10 marks]

Choose the five attributes of your dataset that have the highest correlation with PM2.5 or PM10 concentration, using Pearson correlation or any other feature selection method of your choice, with justification.

.    Provide the correlation plot (or results of any other feature selection method of your choice) and elaborate on the rationale for your selection.

.    Describe your chosen attributes and their influence on PM concentration.

.    Provide a graphical visualisation of the variation in PM concentration.

.    Provide summary statistics of the PM concentration.

.    Provide summary statistics, in tabular format, of the chosen predictors with the highest correlation.
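If you use Pearson correlation, ranking predictors by absolute correlation with the target is a one-liner in pandas; the demo frame and column names below are purely illustrative:

```python
import pandas as pd

def top_k_predictors(df: pd.DataFrame, target: str, k: int = 5) -> list[str]:
    """Rank predictors by absolute Pearson correlation with the target."""
    corr = df.corr()[target].drop(target)        # pairwise Pearson correlations
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Tiny illustrative frame; in the assignment, df holds the cleaned hourly data.
demo = pd.DataFrame({
    "PM2.5": [10, 12, 14, 16, 18],
    "NO2":   [5, 6, 7, 8, 9],      # perfectly correlated in this toy example
    "Wind":  [3, 1, 4, 1, 5],      # weakly correlated
})
selected = top_k_predictors(demo, "PM2.5", k=2)
```

Taking the absolute value matters: a strongly negative correlation (e.g. wind speed dispersing PM) is just as informative as a positive one.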

Experimental Methods

Use 70% of the data for training and the rest for testing the MLP and LSTM models. Use a workflow diagram to illustrate the process of predicting PM concentrations using the MLP and LSTM models. [5 marks]

For both models, report the root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²) to quantify the prediction performance of each model.
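A chronological 70/30 split and the three metrics can be sketched as follows (RMSE is taken as the square root of the MSE so the snippet works across sklearn versions):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def chronological_split(X, y, train_frac=0.70):
    """70/30 split preserving temporal order; never shuffle a time series."""
    cut = int(len(X) * train_frac)
    return X[:cut], X[cut:], y[:cut], y[cut:]

def report_metrics(y_true, y_pred):
    """RMSE, MAE and R² for a set of predictions."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    return rmse, mae, r2

# Worked toy example: one prediction is off by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
rmse, mae, r2 = report_metrics(y_true, y_pred)   # 1.0, 0.5, 0.2
```

The ordered split is deliberate: shuffling hourly PM data would leak near-duplicate neighbouring hours between train and test sets and inflate the scores.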

Multilayer Perceptron (MLP)

1)   In your own words, describe the multilayer perceptron (MLP). You may use one diagram in your explanation (one page).        [5 marks]

2)   Use sklearn.neural_network.MLPRegressor with a single hidden layer of k = 25 neurons and default values for all other parameters. Experimentally determine the learning rate that gives the highest performance on the testing dataset. Use this as a baseline for comparison in later parts of this question.                           [5 marks]

3)   Experiment with two hidden layers and experimentally determine the split of the k neurons across the two layers that gives the highest accuracy. In part 2, all k neurons were in a single layer; in this part, transfer neurons from the first hidden layer to the second iteratively in steps of 1. For example, in the first iteration the first hidden layer will have k-1 neurons and the second layer will have 1; in the second iteration, k-2 neurons will be in the first layer and 2 in the second, and so on.                  [5 marks]
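The neuron-transfer loop can be sketched as follows; the demo data and max_iter are illustrative, and in the assignment you would pass the best learning rate found in part 2:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

def best_two_layer_split(X_train, y_train, X_test, y_test,
                         k=25, lr=1e-2, max_iter=500):
    """Move neurons from layer 1 to layer 2 one at a time:
    (k-1, 1), (k-2, 2), ..., (1, k-1), keeping the total budget at k."""
    results = {}
    for second in range(1, k):
        layers = (k - second, second)
        mlp = MLPRegressor(hidden_layer_sizes=layers,
                           learning_rate_init=lr,
                           random_state=0, max_iter=max_iter)
        mlp.fit(X_train, y_train)
        rmse = float(np.sqrt(mean_squared_error(y_test, mlp.predict(X_test))))
        results[layers] = rmse
    return min(results, key=results.get), results

# Tiny demo on synthetic data; use your cleaned dataset in the assignment.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X.sum(axis=1)
best_layers, results = best_two_layer_split(X[:84], y[:84], X[84:], y[84:],
                                            max_iter=100)
```

Keeping the full `results` dict lets you tabulate or plot RMSE against the split for the discussion in part 4.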

4)   From the results in part 3 of this question, you will observe variation in the obtained performance metrics as the split of neurons across the two layers changes. Give possible explanations for this variation and state which architecture gives the best performance.   [5 marks]

Long Short-Term Memory (LSTM)

1)   Describe the LSTM architecture, including the gates and state functions. How does LSTM differ from MLP? Discuss how the number of neurons and the batch size affect the performance of the network.          [5 marks]
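To make the gate equations concrete, here is a single forward step of one LSTM cell in plain NumPy; it illustrates the mathematics only, not the library implementation you will train with, and the weight-stacking order is an assumption of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D) input weights, U: (4H, H) recurrent
    weights, b: (4H,) biases, stacked as [input, forget, cell, output]."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information enters
    f = sigmoid(z[H:2*H])        # forget gate: how much old state is kept
    g = np.tanh(z[2*H:3*H])      # candidate cell state
    o = sigmoid(z[3*H:4*H])      # output gate: how much state is exposed
    c = f * c_prev + i * g       # new cell state (long-term memory)
    h = o * np.tanh(c)           # new hidden state (short-term output)
    return h, c

# Sanity check: with all-zero weights every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved each step.
D, H = 3, 2
h, c = lstm_step(np.zeros(D), np.zeros(H), np.ones(H),
                 np.zeros((4*H, D)), np.zeros((4*H, H)), np.zeros(4*H))
```

The additive update `c = f * c_prev + i * g` is the key contrast with an MLP: it gives the network an explicit memory that persists across time steps.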

2)   To create the LSTM model and determine the optimal architecture, apply Adaptive Moment Estimation (ADAM) to train the networks. Identify an appropriate cost function to measure model performance based on the training samples and the related prediction outputs. To find the best number of epochs, based on your cost function results, complete up to 30 runs keeping the learning rate and the batch size constant (e.g. at 0.01 and 4, respectively). Provide a line plot of the test and train cost function scores for each epoch. Report the summary statistics (mean, standard deviation, minimum and maximum) of the cost function as well as the run time for each epoch. Choose the best epoch with justification. [5 marks]

3)   Investigate the impact of varying the batch size: complete 30 runs keeping the learning rate constant at 0.01, and use the best number of epochs obtained in step 2. Report the summary statistics (mean, standard deviation, minimum and maximum) of the cost function as well as the run time for each batch size. Choose the best batch size with justification.       [5 marks]

4)   Investigate the impact of varying the number of neurons in the hidden layer while keeping the number of epochs (step 2) and the batch size (step 3) constant, for 30 runs. Report the summary statistics (mean, standard deviation, minimum and maximum) of the cost function as well as the run time. Discuss how the number of neurons affects performance and what the optimal number of neurons is in your experiment.                                                             [5 marks]
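The summary statistics requested in steps 2 to 4 can all be produced with one small helper; the sample values below are illustrative, and in practice the inputs come from your own 30 runs:

```python
import numpy as np

def summarize_runs(costs, runtimes):
    """Mean, standard deviation, min and max of the cost over repeated runs,
    plus the mean run time, as required for each configuration."""
    costs = np.asarray(costs, dtype=float)
    return {
        "mean": float(costs.mean()),
        "std": float(costs.std(ddof=1)),   # sample std. deviation over runs
        "min": float(costs.min()),
        "max": float(costs.max()),
        "mean_runtime_s": float(np.mean(runtimes)),
    }

# Illustrative values for three runs of one configuration.
stats = summarize_runs([0.9, 1.1, 1.0], [2.0, 2.2, 2.1])
```

Note `ddof=1`: with only 30 runs per configuration, the sample (not population) standard deviation is the appropriate estimate of run-to-run variability.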

Model Comparison

1)   Plot the model-specific actual and predicted PM values to visually compare model performance. What do you observe?            [2.5 marks]

2)   Compare the performance of the MLP and LSTM using RMSE. Which model performed better? Justify your finding.        [2.5 marks]

Report Presentation                                                                                                                 [10 marks]







