代做COMP20008、代写algorithm、代做Python、Python编程语言调试代做数据库SQL|代做Java程序

COMP20008 Project 2
V1.0: September 29, 2020
DUE DATE
The assignment is worth 25 marks, worth (25% of subject grade) and is due 8:00AM on
Wednesday 21st October 2020. Submission is via the LMS. Late penalty structure is
described at the end of this document.
Introduction
This project is designed to give you practice at solving some practical data science tasks. You
will need to implement and evaluate a data linkage system and a classification system using
sound data science principles. We intentionally do not prescribe the exact steps required as
a key part of the assessment is to evaluate your ability to exercise your own judgement in
deciding how to solve these practical data science problems, rather than simply implementing
a pre-defined algorithm. This is a big part of any real-world data science task, as there
are usually many approaches that can be taken and many irregularities in the data that are
impossible to exhaustively predefine in a task description. A data scientist needs to be able
to select the best approach for resolving these issues and justify it in order for their results
to be convincing.
For this project, you will perform a data linkage on two real-world datasets (Part 1) and
explore different classification algorithms (Part 2).
The project is to be coded in Python 3. Eight (8) relevant data sets can be downloaded from
LMS and are available on Juptyer Hub:
❼ datasets for Part 1: Data Linkage
– abt.csv
– buy.csv
– abt buy truth.csv
– abt small.csv
– buy small.csv
– abt buy truth small.csv
❼ datasets for Part 2: Classification:
– life.csv
– world.csv
1
Part 1 - Data Linkage (12 marks)
Abt and Buy both have product databases. In order to perform joint market research they
need to link the same products in their databases. Therefore the research team has manually
linked a subset of the data in order to build a linkage algorithm to automate the remaining
items to be linked. This manually linked data is what you will base your work on in this
assignment. However the real dataset is unknown to you, as it would be reality and this
unknown data is what you will be assessed on.
Na¨ıve data linkage without blocking (4 marks)
For this part, data linkage without blocking is performed on two smaller data sets:
abt small.csv and buy small.csv.
Task - 1A: Using abt small.csv and buy small.csv, implement the linkage between the
two data sets.
Your code for this question is to be contained in a single Python file called task1a.py and
produce a single csv file task1a.csv containing the following two column headings:
idAbt,idBuy
Each row in the datafile must contain a pair of matched products. For example, if your
algorithm only matched product 10102 from the Abt dataset with product
203897877 from the Buy dataset your output task1a.csv would be as follows:
idAbt, idBuy
10102,203897877
The performance is evaluated in terms of recall and precision and the marks in this section
will be awarded based on the two measures of your algorithm.
recall = tp/(tp + fn)
precision = tp/(tp + f p)
where tp (true-positive) is the number of true positive pairs, f p the number of false positive
pairs, tn the number of true negatives, and fn the number of false negative pairs.
Note: The python package textdistance implements many similarity functions for strings
(https://pypi.org/project/textdistance/). You can use this package for the similarity
calculations for strings. You may also choose to implement your own similarity functions.
2
Blocking for efficient data linkage (4 marks)
Blocking is a method to reduce the computational cost for record linkage.
Task - 1B: Implement a blocking method for the linkage of the abt.csv and buy.csv data
sets.
Your code is be contained in a single Python file called task1b.py and must produce two
csv files abt blocks.csv and buy blocks.csv, each containing the following two column
headings:
block_key, product_id
The product id field corresponds to the idAbt and idBuy of the abt.csv and buy.csv files
respectively. Each row in the output files matches a product to a block. For example, if your
algorithm placed product 10102 from the Abt dataset in blocks with block keys x & y, your
abt blocks.csv would be as follows:
block_key, product_id
x,10102
y,10102
A block is uniquely identified by the block key. The same block key in the two block-files
(abt blocks.csv and buy blocks.csv) indicates that the corresponding products co-occur
in the same block.
For example, if your algorithm placed the Abt product 10102 in block x and placed Buy
product 203897877 in block x, your abt blocks.csv and buy blocks.csv would be as follows
respectively:
abt_blocks.csv:
block_key, product_id
x,10102
buy_blocks.csv:
block_key, product_id
x,203897877
The two products co-occur in the same block x.
To measure the quality of blocking, we assume that when comparing a pair of records, the
pair are always 100% similar and are a match. A pair of records are categorised as follows:
❼ a record-pair is a true positive if the pair are found in the ground truth set and also the
pair co-occur in the same block.
❼ a record-pair is a false positive if the pair co-occur in some block but are not found in
the ground truth set.
❼ a record-pair is a false negative if the pair do not co-occur in any block but are found
in the ground truth set
3
❼ a record-pair is a true negative if the pair do not co-occur in any block and are also not
found in the ground truth set.
Then, the quality of blocking can be evaluated using the following two measures:
P C (pair completeness) = tp/(tp + fn)
RR (reduction ratio) = 1 − (tp + fp)/n
where n is the total number of all possible record pairs from the two data sets
(n = f p + fn + tp + tn).
The marks in this section will be awarded based on the pair completeness and reduction ratio
of your blocking algorithm.
Note: The time taken to produce your blocking implementation must be linear in the number
of items in the dataset. This means that you cannot, for example, compare each product to
every other product in the dataset in order to allocate it to a block. Implementations that
do so will be severely penalised.
Report on the Data Linkage project(4 marks)
Task - 1C Write a one-page report describing your algorithms and implementations of tasks
1a and 1b. You should discuss:
❼ How your product comparison works, including your choice of similarity functions and
final scoring function and threshold.
❼ An evaluation of the overall performance of your product comparison and what opportunities
exist to improve it.
❼ How your blocking implementation works.
❼ An evaluation of the overall performance of your blocking method, how the method
relates to the performance measures and what opportunities exist to improve it.
Your report for this task should be contained in a single file called task1c.pdf or task1c.docx.
4
Part 2 - Classification (13 marks)
Each year, the World Bank publishes the World Development Indicators which provide high
quality and international comparable statistics about global development and the fight against
poverty [1]. As data scientists, we wish to understand how the information can be used to
predict average lifespan in different countries. To this end, we have provided the world.csv
file, which contains some of the World Development Indicators for each country and the
life.csv file containing information about the average lifespan for each country (based on
data from the World Health Organization) [2]. Each data file also contains a country name,
country code and year as identifiers for each record. These may be used to link the two
datasets but should not be considered features.
Comparing Classification Algorithms (3 marks)
Task - 2A Compare the performance of the following 3 classification algorithms: k-NN (k=3
and k=7) and Decision tree (with a maximum depth of 3) on the provided data. You may use
sklearn’s KNeighborsClassifier and DecisionTreeClassifier functions for this task. To ensure
consistency, please ensure that all functions that can be called with a random state parameter
use a random state of 200. For the k-NN classifier, all parameters other than k should be
kept at their defaults. For the Decision tree classifier, all parameters other than the maximum
depth and random state should be kept at their defaults. Use each classification algorithm to
predict the class feature life expectancy at birth(years) of the data (Low, Medium and
High life expectancy) using the remaining features.
Organise and verify your data: Before you begin you should ensure that your dataset is
sorted in ascending alphabetical order by country and that any countries not present in both
world.csv and life.csv are discarded.
For each of the algorithms, fit a model with the following processing steps:
❼ Split the dataset into a training set comprising 70% of the data and a test set comprising
the remaining 30% using the train test split function with a random state of 200.
❼ Perform the same imputation and scaling to the training set:
– For each feature, perform median imputation to impute missing values.
– Scale each feature by removing the mean and scaling to unit variance.
❼ Train the classifiers using the training set
❼ Test the classifiers by applying them to the test set.
Your code must produce a CSV file called task2a.csv describing the median used for imputation
for each feature, as well as the mean and variance used for scaling, all rounded to
three decimal places. The CSV file must have one row corresponding to each feature. The
first three lines of the output should be as follows (where x is a number calculated by your
program):
5
feature, median, mean, variance
Access to electricity, rural (% of rural population) [EG.ELC.ACCS.RU.ZS], x, x, x
Adjusted savings: particulate emission damage (% of GNI) [NY.ADJ.DPEM.GN.ZS], x, x, x
Your code must print the classification accuracy of each classifier to standard output. Your
output should look as follows (where the # symbol is replaced by the accuracy of each
algorithm, rounded to 3 decimal places):
Accuracy of decision tree: #
Accuracy of k-nn (k=3): #
Accuracy of k-nn (k=7): #
Your code for this question should be contained in a single Python file called task2a.py
Feature Engineering and Selection(6 marks)
Task - 2B This task will focus on k-NN with k=3 (from here on referred to as 3-NN). In
order to achieve higher prediction accuracy for 3-NN, one can investigate the use of feature
engineering and selection to predict the class feature of the data. Feature generation involves
the creation of additional features. Two possible methods are:
❼ Interaction term pairs. Given a pair of features f1 and f2, create a new feature f12 =
f1 × f2. All possible pairs can be considered.
❼ Clustering labels: apply k-means clustering to the data in world and then use the
resulting cluster labels as the values for a new feature fclusterlabel. You will need to
decide how many clusters to use. At test time, a label for a testing instance can be
created by assigning it to its nearest cluster.
Given a set of N features (the original features plus generated features), feature selection
involves selecting a smaller set of n features (n < N).
An alternative method of performing feature engineering & selection is to use Principal Component
Analysis (PCA). The first n principal components can be used as features.
Your task in this question is to evaluate how the above methods for feature engineering and
selection affect the prediction accuracy compared to using 3-NN on a subset of the original
features in world. You should:
❼ Implement feature engineering using interaction term pairs and clustering labels. This
should produce a dataset with 211 features (20 original features, 190 features generated
by interaction term pairs and 1 feature generated by clustering). You should (in some
principled manner) select 4 features from this dataset and perform 3-NN classification.
❼ Implement feature engineering and selection via PCA by taking the first four principal
components. You should use only these four features to perform 3-NN classification.
❼ Take the first four features (columns D-G, if the dataset is opened in Excel) from the
original dataset as a sample of the original 20 features. Perform 3-NN classification.
Your output for this question should include:
6
❼ Any text, numbers, or other numerical data you reference in your report, printed to
standard output
❼ Any graphs or charts as png files with the prefix task2b (e.g. task2bgraph1.png,
task2bgraph2.png)
❼ The classification accuracy for the test set for of the three methods in the following
format, as the last three lines printed to standard output (where the # symbol is
replaced by the accuracy of 3-NN using each feature set, rounded to 3 decimal places):
Accuracy of feature engineering: #
Accuracy of PCA: #
Accuracy of first four features: #
Note: For Task 2B you do not need to use a random state of 200.
Your code for this question should be contained in a single Python file called task2b.py
Your Report (4 marks)
Task - 2C Write a 1-2 page report describing your implementations. You should discuss:
❼ Which algorithm (decision trees or k-nn) in Task-2A performed better on this dataset?
For k-nn, which value of k performed better? Explain your experiments and the results.
❼ A description of the precise steps you took to perform the analysis in Task-2B.
❼ The method you used to select the number of clusters for the clustering label feature
generation and a justification for the number of clusters you selected.
❼ The method you used to select four features from the generated dataset of 211 features
for your analysis and a justification for this method.
❼ Which of the three methods investigated in Task-2B produced the best results for classification
using 3-NN and why this was the case.
❼ What other techniques you could implement to improve classification accuracy with this
data.
❼ How reliable you consider the classification model to be.
Your report for this task should be contained in a single file called task2c.pdf or task2c.docx.
Resources
The following are some useful resources, for refreshing your knowledge of Python, and for
learning about functionality of pandas.
❼ Python tutorial
❼ Python beginner reference
7
❼ pandas 10min tutorial
❼ Official pandas documentation
❼ Official mathplotlib tutorials
❼ Python pandas Tutorial by Tutorialspoint
❼ pandas: A Complete Introduction by Learn Data Sci
❼ pandas quick reference sheet
❼ sklearn library reference
❼ NumPy library reference
❼ Textdistance library reference
❼ Python Data Analytics by Fabio Nelli (available via University of Melbourne sign on)
Submission Instructions
Your code should be contained in four Python files, task1a.py, task1b.py, task2a.py and
task2b.py which are to be submitted to the LMS by the due date. Your two reports task1c
and task2c should be also be submitted to the LMS by the due date.
Other
Extensions and Late Submission Penalties: All requests for extensions must be made by
email, sent to Chris Ewin. If requesting an extension due to illness, please attach a medical
certificate or other supporting evidence. All extension requests must be made at least 24
hours before the due date. Late submissions without an approved extension will attract the
following penalties
❼ 0 < hourslate <= 24 (2 marks deduction)
❼ 24 < hourslate <= 48 (4 marks deduction)
❼ 48 < hourslate <= 72: (6 marks deduction)
❼ 72 < hourslate <= 96: (8 marks deduction)
❼ 96 < hourslate <= 120: (10 marks deduction)
❼ 120 < hourslate <= 144: (12 marks deduction)
❼ 144 < hourslate: (25 marks deduction)
where hourslate is the elapsed time in hours (or fractions of hours).
Silent Policy: A silent policy will take effect 48 hours before this assignment is due. This
means that no question about the assignment will be answered, whether it is asked on the
8
discussion forum or via email. By this stage, you should have a clear idea of what the assignment
is about and have figured out any issues that require staff input.
This project is expected to require 30-35 hours work.
Academic Honesty
You are expected to follow the academic honesty guidelines on the University website
https://academichonesty.unimelb.edu.au
Further Information
A project discussion forum has also been created on the Ed forum. Please use this in the
first instance if you have questions, since it will allow discussion and responses to be seen by
everyone. The teaching team will support the discussions on the forum until 24 hours prior
to the project deadline. There will also be a list of frequently asked questions on the project
page.
References
[1] The World Bank, “World development indicators,” 2016, data retrieved
from World Development Indicators, https://databank.worldbank.org/source/
world-development-indicators.
[2] World Health Organization, “Life expectancy and healthy life expectancy data by country,”
2016, data retrieved from Global Health Observatory data repository, https://apps.
who.int/gho/data/node.main.688?lang=en.

热门主题

课程名

mktg2509 csci 2600 38170 lng302 csse3010 phas3226 77938 arch1162 engn4536/engn6536 acx5903 comp151101 phl245 cse12 comp9312 stat3016/6016 phas0038 comp2140 6qqmb312 xjco3011 rest0005 ematm0051 5qqmn219 lubs5062m eee8155 cege0100 eap033 artd1109 mat246 etc3430 ecmm462 mis102 inft6800 ddes9903 comp6521 comp9517 comp3331/9331 comp4337 comp6008 comp9414 bu.231.790.81 man00150m csb352h math1041 eengm4100 isys1002 08 6057cem mktg3504 mthm036 mtrx1701 mth3241 eeee3086 cmp-7038b cmp-7000a ints4010 econ2151 infs5710 fins5516 fin3309 fins5510 gsoe9340 math2007 math2036 soee5010 mark3088 infs3605 elec9714 comp2271 ma214 comp2211 infs3604 600426 sit254 acct3091 bbt405 msin0116 com107/com113 mark5826 sit120 comp9021 eco2101 eeen40700 cs253 ece3114 ecmm447 chns3000 math377 itd102 comp9444 comp(2041|9044) econ0060 econ7230 mgt001371 ecs-323 cs6250 mgdi60012 mdia2012 comm221001 comm5000 ma1008 engl642 econ241 com333 math367 mis201 nbs-7041x meek16104 econ2003 comm1190 mbas902 comp-1027 dpst1091 comp7315 eppd1033 m06 ee3025 msci231 bb113/bbs1063 fc709 comp3425 comp9417 econ42915 cb9101 math1102e chme0017 fc307 mkt60104 5522usst litr1-uc6201.200 ee1102 cosc2803 math39512 omp9727 int2067/int5051 bsb151 mgt253 fc021 babs2202 mis2002s phya21 18-213 cege0012 mdia1002 math38032 mech5125 07 cisc102 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 math21112 efim20036 mn-3503 fins5568 110.807 bcpm000028 info6030 bma0092 bcpm0054 math20212 ce335 cs365 cenv6141 ftec5580 math2010 ec3450 comm1170 ecmt1010 csci-ua.0480-003 econ12-200 ib3960 ectb60h3f cs247—assignment tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 math350-real math2014 eec180 stat141b econ2101 msinm014/msing014/msing014b fit2004 comp643 bu1002 cm2030
联系我们
EMail: 99515681@qq.com
QQ: 99515681
留学生作业帮-留学生的知心伴侣!
工作时间:08:00-21:00
python代写
微信客服:codinghelp
站长地图