
COMP20008 Elements of Data Processing
Project 1
August 27, 2020
Due date
The assignment is worth 25 marks (25% of the subject grade) and is due at 8:00am on Monday
21st September 2020, Australia/Melbourne time.
Background
A web server has been set up at http://comp20008-jh.eng.unimelb.edu.au:9889/main/
containing a number of media reports on Rugby games. As data scientists, we would like to
extract information from those reports and use that information to improve our understanding
of team performance.
Rugby scores
Understanding the rugby scoring system is important in order to be able to extract scores
from match reports. A rugby score is listed as x-y where x and y are the number of points
obtained by each team. For example, the following are all valid scores:
10-8
16-0
4-12
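As an illustration of the x-y format above, here is a minimal sketch of a regular expression that picks such scores out of free text. The pattern and the helper name find_scores are assumptions, not part of the specification: scores are taken to be two runs of one to three digits joined by a hyphen, and the lookarounds are there so that longer hyphenated numbers such as dates are skipped.

```python
import re

# Assumption: each side of a score has 1-3 digits, and a score is not
# embedded in a longer run of digits or hyphens (e.g. a date like 2020-09-21).
SCORE_RE = re.compile(r"(?<![\d-])(\d{1,3})-(\d{1,3})(?![\d-])")


def find_scores(text):
    """Return every x-y score found in text as a list of (x, y) integer pairs."""
    return [(int(x), int(y)) for x, y in SCORE_RE.findall(text)]
```

Real match reports may contain other hyphenated numbers (ages, times, minute ranges), so a pattern like this would likely need tuning against the actual articles.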
Learning outcomes
The learning objectives of this assignment are:
To gain practical experience in written communication skills for data science projects.
To practice a selection of processing and exploratory analysis techniques through visualisation
discussed in lectures and workshops.
To practice crawling and scraping data from the Internet.
To practice using widely used Python libraries for data processing and gain experience
using library functions which may be unfamiliar and which require consultation of additional
documentation from resources on the Web.
COMP20008 2020 SM2
Your tasks
You are to perform a small data science project including some data processing and analysis
using Python. Your responses to Tasks 1-5 must be contained in a single .py file. Specifically,
you have the following tasks:
Task 1 (2 marks)
Crawl the http://comp20008-jh.eng.unimelb.edu.au:9889/main/ website to find a complete
list of articles available.
Produce a csv file containing the URL and headline of each of the articles your crawler has found.
The CSV file should have two column headings url and headline and be called task1.csv.
Note: You might want to test your crawling implementation on a smaller website first
(http://comp20008-jh.eng.unimelb.edu.au:9889/sample/).
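One possible shape for the crawler in Task 1 is a breadth-first traversal that follows every link under the seed URL and records a headline per page. The sketch below uses only the standard library and rests on several assumptions that the specification does not confirm: the headline is the text of the first h1 element, every page other than the seed is an article, and all article URLs start with the seed URL. The names PageParser, fetch, crawl and write_task1 are hypothetical.

```python
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collect href targets and the text of the first <h1> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headline = ""
        self._in_h1 = False
        self._h1_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1" and not self._h1_done:
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self._in_h1:
            self._in_h1 = False
            self._h1_done = True

    def handle_data(self, data):
        if self._in_h1:
            self.headline += data


def fetch(url):
    """Download a page; UTF-8 is an assumed encoding."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, fetch_page=fetch):
    """Breadth-first crawl from seed_url; return a list of (url, headline)."""
    seen = {seed_url}
    queue = deque([seed_url])
    articles = []
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(fetch_page(url))
        # Assumption: the seed page is an index, every other page is an article.
        if url != seed_url:
            articles.append((url, parser.headline.strip()))
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            # Stay under the seed path so the crawl cannot leave the site.
            if link.startswith(seed_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return articles


def write_task1(articles, path="task1.csv"):
    """Write the (url, headline) pairs with the required column headings."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "headline"])
        writer.writerows(articles)
```

Passing the page loader in as fetch_page keeps the traversal logic testable against a small in-memory set of pages before it is pointed at the sample site.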
Task 2 (4 marks)
For each article found in Task 1,
a) extract the name of the first team mentioned in the article. You can find a list of team
names as part of the rugby.json file provided. We will assume the article is written
about that team (and only that team). (2 marks)
Note: Your implementation must make use of the list of teams in rugby.json. We
will run your program with a different rugby.json file and expect to find all the articles
that refer to the teams listed in the modified file. The file we use will follow the same
format, but may have different teams.
b) extract the largest match score identified in the article. You will need to use regular
expressions to accomplish this. We will assume this score relates to the first named
team in the article. (2 marks)
Produce a csv file containing the URL, headline, first team mentioned and first complete
match score of each of the articles your crawler has found. The csv file should have four column
headings url, headline, team and score and be called task2.csv.
Note: Some articles may not contain a team name and/or a match score. These articles can
be discarded.
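For part a) of Task 2, one way to find the first team mentioned is to search the article text for every name in rugby.json and keep the one that occurs earliest. The sketch below is illustrative only: the layout of rugby.json is not shown here, so the {"teams": [{"name": ...}]} structure read by load_team_names is an assumption, and both function names are hypothetical.

```python
import json
import re


def load_team_names(path="rugby.json"):
    """Read team names from the JSON file.

    Assumption: the file looks like {"teams": [{"name": "..."}, ...]}.
    Adjust the keys to match the real file.
    """
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    return [team["name"] for team in data["teams"]]


def first_team(text, team_names):
    """Return the team name that occurs earliest in text, or None if absent."""
    best_name, best_pos = None, len(text)
    for name in team_names:
        # Word boundaries avoid matching a name inside a longer word.
        match = re.search(r"\b" + re.escape(name) + r"\b", text)
        if match and match.start() < best_pos:
            best_name, best_pos = name, match.start()
    return best_name
```

Because the names are loaded at run time rather than hard-coded, the same code keeps working when a different rugby.json with the same format is supplied. Articles for which first_team returns None would be among those discarded.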
Task 3 (1 mark)
For each article used in Task 2, compute the absolute difference between the two values in
its match score. E.g. a 14-6 score and a 5-13 score both have a difference of 8. This value
is referred to as the game difference.
Produce a csv file containing the team name and average game difference for each team that
at least one article has been written about. The csv file should have two column headings
team and avg game difference and be called task3.csv.
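The arithmetic for Task 3 can be sketched as follows, assuming the Task 2 output is available as (team, score) pairs where score is an x-y string. The function name average_game_difference and the input shape are assumptions made for illustration.

```python
from collections import defaultdict


def average_game_difference(rows):
    """Map each team to the mean of |x - y| over its (team, "x-y") rows."""
    totals = defaultdict(lambda: [0, 0])  # team -> [sum of differences, count]
    for team, score in rows:
        first, second = (int(part) for part in score.split("-"))
        totals[team][0] += abs(first - second)
        totals[team][1] += 1
    return {team: total / count for team, (total, count) in totals.items()}
```

With the example scores from the task text, a team with one 14-6 article and one 5-13 article has an average game difference of (8 + 8) / 2 = 8.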
Task 4 (2 marks)
Generate a suitable plot showing the five teams that articles are most frequently written about
and the number of times an article is written about that team.
Save this plot as a png file called task4.png.
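The data behind the Task 4 plot is a frequency count of the team column from Task 2. A minimal sketch of that counting step, assuming the team names are available as a plain list (the function name top_teams is hypothetical):

```python
from collections import Counter


def top_teams(teams, n=5):
    """Return the n most frequently occurring teams as (team, count) pairs."""
    return Counter(teams).most_common(n)
```

The resulting pairs could then be handed to a bar chart, for example matplotlib's plt.bar followed by plt.savefig("task4.png"), if matplotlib is the plotting library in use.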
Task 5 (2 marks)
Generate a suitable plot comparing the average game difference for each team with their
game difference. Ignore any teams that have no articles written about them.
Save this plot as a png file called task5.png.
Task 6 (14 marks)
Write a 3-4 page report to communicate the process and activities undertaken in the project,
the analysis, and some limitations. Specifically, the report should contain the following information:
A description of the crawling method and a brief summary of the output for Task 1.
(2 marks)
A description of how you scraped data from each page, including any regular expressions
used for Task 2 and a brief summary of the output. (3 marks)
An analysis of the information shown in the two plots produced for Tasks 4 & 5, including
a brief summary of the data used. The plots are to be shown (included) along
with your analysis. (4 marks)
A discussion of the appropriateness of associating the first named team in the article
with the first match score. (2 marks)
At least two suggested methods for how you could figure out from the contents of the
article whether the first named team won or lost the match being reported on and a
comment on the advantages and disadvantages of each approach. (2 marks)
A discussion of what other information could be extracted from the articles to better
understand team performance and a brief suggestion for how this could be done.
(1 mark)
Submission instructions
Your responses to Tasks 1 - 5 must be contained in a single Python script (.py) file. As the
output of this file will be verified automatically, it is essential that the program runs without
producing errors. For this assignment you may NOT install any additional packages that
aren’t present on the JupyterHub server, e.g. by using the pip install command. Doing so
will cause your submission to fail our marking scripts.
Submission is via the LMS. Two submission links will be provided, one for the .py file
containing your responses to Tasks 1 - 5 and a second for a .pdf or .docx file containing
your response to Task 6.
Extensions and late submission penalties
If requesting an extension due to illness, please submit a medical certificate to the lecturer.
If there are any other exceptional circumstances, please contact the lecturer with plenty of
notice. Late submissions without an approved extension will attract the following penalties:
0 < hourslate <= 24: (2 marks deduction)
24 < hourslate <= 48: (4 marks deduction)
48 < hourslate <= 72: (6 marks deduction)
72 < hourslate <= 96: (8 marks deduction)
96 < hourslate <= 120: (10 marks deduction)
120 < hourslate <= 144: (12 marks deduction)
144 < hourslate: (25 marks deduction)
where hourslate is the elapsed time in hours (or fractions of hours).
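The penalty schedule above amounts to 2 marks for each started 24-hour block up to 144 hours, and the full 25 marks beyond that. As a worked illustration only (the function name late_penalty is hypothetical and this is not an official calculator):

```python
import math


def late_penalty(hours_late):
    """Return the marks deducted for a submission that is hours_late hours late."""
    if hours_late <= 0:
        return 0  # on time: no deduction
    if hours_late > 144:
        return 25  # more than six days late: all marks deducted
    # 2 marks for each started 24-hour block (fractions of hours count).
    return 2 * math.ceil(hours_late / 24)
```

For example, a submission 24 hours late loses 2 marks, while one 24.5 hours late falls into the next band and loses 4.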
This project is expected to require 30-35 hours of work.
Academic honesty
You are expected to follow the academic honesty guidelines on the University website
https://academichonesty.unimelb.edu.au
Further information
A project discussion forum has also been created on the Ed forum. Please use this in the
first instance if you have questions, since it will allow discussion and responses to be seen by
everyone. There will also be a list of frequently asked questions on the project page.
