QBUS2820 Predictive Analytics
Semester 2, 2018
Assignment 2
Key information
Required submissions: Written report (Word or PDF format, through Turnitin submission)
and Jupyter Notebook (through Ed). The group leader must submit both the written report and
the Jupyter Notebook.
Due date: Saturday 3rd November 2018, 2pm (report and Jupyter Notebook submission).
The late penalty for the assignment is 10% of the assigned mark per day, starting after 2pm
on the due date. The closing date Saturday 10th November 2018, 2pm is the last date on
which an assessment will be accepted for marking.
Weight: 30 out of 100 marks in your final grade.
Groups: You can complete the assignment in groups of up to three students. There are no
exceptions: if there are more than three you need to split the group.
Length: The main text of your report (including Task 1 and Task 2) should have a
maximum of 20 pages. Especially for Task 2, you should write a complete report. You may
refer to Assignment 1-Task 2 as reference for the structure of the report.
If you wish to include additional material, you can do so by creating an appendix. There is
no page limit for the appendix. Keep in mind that making good use of your audience’s time
is an essential business skill. Every sentence, table and figure has to count. Extraneous
and/or wrong material will reduce your mark no matter the quality of the assignment.
Anonymous marking: In line with the University's anonymous marking policy, please include
only your student ID and group ID in the submitted report, and do NOT include your
name. The file name of your report should follow the format below. Replace "123" with
your group SID. Example: Group123Qbus2820Assignment2S22018.
Presentation of the assignment is part of the assignment. Markers might assign up to 10%
of the mark for clarity of writing and presentation. Numbers with decimals should be
reported to the third decimal point.
Key rules:
Carefully read the requirements for each part of the assignment.
Please follow any further instructions announced on Canvas, particularly for submissions.
You must use Python for the assignment.
Reproducibility is fundamental in data analysis, so you are required to submit a
Jupyter Notebook that generates your results. Unfortunately, Turnitin does not accept
multiple files, so you will submit the notebook through Ed instead. Not submitting your code will
lead to a loss of 50% of the assignment marks.
Failure to read information and follow instructions may lead to a loss of marks.
Furthermore, note that it is your responsibility to be informed of the University of Sydney
and Business School rules and guidelines, and follow them.
Referencing: Harvard Referencing System. (You may find the details at:
http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)
Task 1 (35 Marks)
Part A: Logistic Regression (15 Marks)
Use Logistic Regression to predict the diagnosis of breast cancer patients on the Breast Cancer
Wisconsin (Diagnostic) Dataset “wdbc.data”. See the section “About the dataset” for a detailed
data description.
(a) Write Python code to load the data. For the target feature Diagnosis, change its literal M
(malignant) to 1 and B (benign) to 0.
Then define and train a logistic regression model with intercept by using scikit-learn’s
LogisticRegression model, using default parameter values.
Based on the estimated parameters from your model, calculate the probability of sample ID
8510426 (20th sample) having a benign diagnosis.
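The steps in (a) can be sketched as follows. This is only an illustration under stated assumptions: the helper name load_wdbc is hypothetical, the file is assumed to be headerless with the ID in the first column and the diagnosis in the second, and a tiny made-up table with two features stands in for "wdbc.data" so that the block runs on its own. With the real file you would pass the path to load_wdbc and look up ID 8510426 instead of the toy ID.

```python
import io

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def load_wdbc(source):
    """Read a headerless wdbc-style file (assumed layout: ID, diagnosis, features)."""
    df = pd.read_csv(source, header=None)
    ids = df.iloc[:, 0].to_numpy()
    # Recode the diagnosis: M (malignant) -> 1, B (benign) -> 0
    y = df.iloc[:, 1].map({"M": 1, "B": 0}).to_numpy()
    X = df.iloc[:, 2:].to_numpy(dtype=float)
    return ids, X, y


# Made-up stand-in for "wdbc.data" (two features instead of thirty)
toy_file = io.StringIO(
    "101,M,20.1,0.30\n102,B,11.2,0.08\n103,M,18.5,0.25\n"
    "104,B,12.0,0.10\n105,B,10.4,0.07\n106,M,21.3,0.28\n"
)
ids, X, y = load_wdbc(toy_file)

# Default parameter values; an intercept is fitted by default
model = LogisticRegression()
model.fit(X, y)

# Pick one sample by its ID (toy ID here, not the assignment's ID)
row = np.where(ids == 102)[0][0]

# Columns of predict_proba follow model.classes_, here [0, 1] = [benign, malignant]
p_benign = model.predict_proba(X[row:row + 1])[0, 0]

# Same probability computed directly from the estimated parameters
z = model.intercept_[0] + X[row] @ model.coef_[0]
p_benign_manual = 1.0 - 1.0 / (1.0 + np.exp(-z))

print(round(p_benign, 3), round(p_benign_manual, 3))
```

Computing the probability by hand from intercept_ and coef_ is what "based on the estimated parameters" asks for; predict_proba is shown only as a cross-check.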
(b) Based on slides 26 to 31 of Lecture 9, write your own Python code to implement the
gradient ascent algorithm for logistic regression with an intercept.
You may use the following defined logistic function.
def logistic_function(reg_input):
    return np.exp(reg_input) / (1 + np.exp(reg_input))
Using the given data, write Python code that starts from the initial values β = [0, 0, …, 0]
and runs the gradient ascent algorithm to maximize the log-likelihood function of logistic
regression with respect to the parameters.
Find the optimal learning rate and the resulting estimate of β. Then redo task (a): calculate the
probability of sample ID 8510426 (20th sample) having a benign diagnosis. Compare the results and
explain the major reasons why your answers may differ from those of scikit-learn.
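A minimal sketch of gradient ascent for logistic regression with an intercept is given below. It is not the lecture's exact formulation: the function names, the synthetic data, the fixed number of iterations and the choice to divide the gradient by the sample size (so that one learning rate works across data sizes) are all assumptions made for illustration. The logistic function is the one given in the task.

```python
import numpy as np


def logistic_function(reg_input):
    return np.exp(reg_input) / (1 + np.exp(reg_input))


def add_intercept(X):
    """Prepend a column of ones so that beta[0] acts as the intercept."""
    return np.column_stack([np.ones(len(X)), X])


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of logistic regression with intercept."""
    z = add_intercept(X) @ beta
    return np.sum(y * z - np.log1p(np.exp(z)))


def gradient_ascent(X, y, beta0, learning_rate, n_iter=5000):
    """Maximize the log-likelihood by repeated steps along its gradient."""
    X1 = add_intercept(X)
    beta = np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        # Gradient of the log-likelihood: X^T (y - p)
        gradient = X1.T @ (y - logistic_function(X1 @ beta))
        # Assumption: scale by n, i.e. ascend the average log-likelihood
        beta += learning_rate * gradient / len(y)
    return beta


# Synthetic data standing in for the (standardised) wdbc features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_beta = np.array([-0.5, 1.0, -2.0])
y = (rng.random(200) < logistic_function(add_intercept(X) @ true_beta)).astype(float)

beta_from_zeros = gradient_ascent(X, y, np.zeros(3), learning_rate=0.5)
beta_from_ones = gradient_ascent(X, y, np.ones(3), learning_rate=0.5)
print(beta_from_zeros, beta_from_ones)
```

Because the log-likelihood is concave, both starting points should reach essentially the same estimate when the learning rate is small enough and enough iterations are run. On the real data, features on very different scales can make np.exp overflow or force a tiny learning rate, which is one reason to consider standardising the inputs. Note also that scikit-learn's default LogisticRegression applies L2 regularisation, so its estimates are not the plain maximum likelihood estimates.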
Now change the initial values to β = [1, 1, …, 1], redo the above tasks and report
your results and findings.
About the dataset:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names
Part B: YouTube Comment Spam Classification (20 Marks)
Some questions in Task 2 require some self-learning, e.g., exploring how to build
features for text data using a bag of words. You should discuss with your group members
how to approach the problem and do the necessary self-learning, which is an important ability
to have for your future study and career.
Your goal is to build a Random Forest (RF) classifier that classifies whether a YouTube
comment is spam or not.
Use the ytube_spam dataset. We have already split the data into train and test sets:
"ytube_spam_trainset.csv" and "ytube_spam_testset.csv".
General instructions:
1. "CLASS" in the data is the target variable y.
2. Use 3-fold cross validation if needed.
3. Make sure to set your random number generator seed to 0 for this question:
"np.random.seed(0)".
(a) Self-study and use the following Python package:
from sklearn.feature_extraction.text import TfidfVectorizer
Build a bag of words representation of the data with:
Max 1000 features
Remove the top 1% of frequently occurring words
A word must occur at least twice to be included as a feature
Remove common English words
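One possible mapping of these four requirements onto TfidfVectorizer arguments is sketched below. The mapping is an interpretation, not the required answer: max_df=0.99 drops terms that appear in more than 99% of the comments, and min_df=2 counts the number of comments containing a term rather than total occurrences. The four short comments are made up so that the block runs without the data files.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up comments standing in for the text column of the training set
comments = [
    "check out my channel video please subscribe",
    "subscribe to my channel for free video gifts",
    "great song love this video",
    "love this video song so much",
]

vectorizer = TfidfVectorizer(
    max_features=1000,     # keep at most 1000 features
    max_df=0.99,           # drop terms present in more than 99% of comments
    min_df=2,              # a term must appear in at least two comments
    stop_words="english",  # remove common English words
)
X_train_text = vectorizer.fit_transform(comments)

# vocabulary_ maps each retained term to its column index
print(sorted(vectorizer.vocabulary_))
print(X_train_text.shape)
```

In this toy corpus "video" appears in every comment and is removed by max_df, while "gifts" appears in only one comment and is removed by min_df. The test set should be transformed with the already fitted vectorizer (transform, not fit_transform) so that both sets share the same columns.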
(b) Build a random forest classifier and use cross validation to optimise the parameters of the
random forest. You need to at least optimise the number of trees in the random forest and can
explore and optimise other parameters as well.
Use the following Python packages:
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV
With your CV-selected optimal parameter values, re-train the RF on the full training set and
produce your best performing model.
Test your best performing model on the test set, and you must achieve an average score ("avg
/ total") of at least 0.96 for precision, recall and f1-score of "sklearn classification_report".
Report "sklearn classification_report" output.
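The workflow in (b) might look like the sketch below. Synthetic data from make_classification replaces the TF-IDF features, and the grid of tree counts is an arbitrary example; both are assumptions. With the default refit=True, GridSearchCV re-trains the best configuration on the full training set, which matches the re-training step described above.

```python
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(0)

# Synthetic stand-in for the bag-of-words features and the CLASS target
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Example grid only; other parameters (e.g. max_depth) can be added
param_grid = {"n_estimators": [10, 50, 100]}

search = GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                     # 3-fold cross validation
    return_train_score=True,  # needed later for "mean_train_score"
)
search.fit(X_train, y_train)

# refit=True (the default) re-trains the best setting on all training data
best_rf = search.best_estimator_
print(search.best_params_)
print(classification_report(y_test, best_rf.predict(X_test), digits=3))
```

Setting return_train_score=True at this stage avoids re-running the search when the training scores are needed for plotting.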
(c) Based on your cross validation results from GridSearchCV, plot the "mean_test_score"
and "mean_train_score" vs number of trees on the same Figure.
If you optimised other parameters, then fix these parameters to their optimal values.
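A sketch of the plot in (c), again on synthetic data with an arbitrary grid of tree counts (both assumptions). The "Agg" backend line is only there so the block runs without a display; it is not needed inside a Jupyter Notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; unnecessary in a notebook
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and target
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

search = GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    {"n_estimators": [5, 10, 25, 50]},
    cv=3,
    return_train_score=True,  # without this, "mean_train_score" is not stored
)
search.fit(X, y)
results = search.cv_results_

# Read the tree counts back from the results so x and y stay aligned
tree_counts = [params["n_estimators"] for params in results["params"]]

fig, ax = plt.subplots()
ax.plot(tree_counts, results["mean_test_score"], marker="o", label="mean_test_score")
ax.plot(tree_counts, results["mean_train_score"], marker="o", label="mean_train_score")
ax.set_xlabel("Number of trees")
ax.set_ylabel("Mean CV accuracy")
ax.legend()
```

If other parameters were part of the grid, the rows of cv_results_ would first need to be filtered to those where the other parameters equal their optimal values.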
(d) Report your random forest settings that achieve the best classification.
(e) Produce a histogram of the depths of the trees of your best performing model.
(f) Report the top 10 most important text features of your best performing model.
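For (e) and (f), the depth of each tree and the feature importances can be read from a fitted forest as sketched below. The data are synthetic and the feature names are hypothetical placeholders; with the real model the names would come from the fitted vectorizer. The depth counts are tabulated with NumPy here, and the same depths array could be passed to a plotting function to draw the histogram.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the text features
X, y = make_classification(
    n_samples=300, n_features=30, n_informative=5, random_state=0
)
# Hypothetical feature names in place of the vectorizer's vocabulary
feature_names = np.array([f"word_{i}" for i in range(X.shape[1])])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# (e) Depth of every tree in the forest, then counts per depth value
depths = np.array([tree.tree_.max_depth for tree in rf.estimators_])
depth_values, depth_counts = np.unique(depths, return_counts=True)
for depth, count in zip(depth_values, depth_counts):
    print(f"depth {depth}: {count} trees")

# (f) Indices of the ten largest importances, in descending order
top10 = np.argsort(rf.feature_importances_)[::-1][:10]
for name, importance in zip(feature_names[top10], rf.feature_importances_[top10]):
    print(f"{name}: {importance:.3f}")
```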
Task 2 (25 Marks)
1. Problem description
Rossmann is a German drug store chain that operates over 3000 stores in 7 European
countries. In this assignment, you will use “Rossman_Sales.csv” data to forecast six weeks
of daily sales following the last period in the dataset.
Your objective in this assignment is to develop univariate forecasting models, i.e. models
using only the historical sales, to address this problem.
We focus on the sales forecasting of store 1. You can download the dataset
“Rossman_Sales.csv” from Canvas.
2. Report and requirements
a. The purpose of the report is to discuss the business context, exploratory data
analysis, methodology, model diagnostics, model validation and present
forecasts and conclusions for six weeks of daily sales following the last
period in the dataset.
b. Your group must identify at least 1 simple benchmark model and at least 2 different
forecasting methods or models that can be used to forecast sales.
c. The report should also include an analysis of monthly sales (with the limitation
that the sample size is small at this frequency).
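As an illustration of a simple benchmark and of the monthly aggregation mentioned above, the sketch below applies a seasonal naive forecast to a made-up daily series. The function name seasonal_naive, the weekly season length of 7 and the synthetic series are assumptions; the real analysis would use store 1 from “Rossman_Sales.csv”, whose column names are not given here.

```python
import numpy as np
import pandas as pd


def seasonal_naive(history, horizon, season_length=7):
    """Forecast by repeating the last observed season over the horizon."""
    history = np.asarray(history, dtype=float)
    last_season = history[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]


# Made-up daily sales with a weekly pattern, standing in for store 1
dates = pd.date_range("2015-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
values = 5000 + 500 * np.sin(2 * np.pi * np.arange(120) / 7) + rng.normal(0, 100, 120)
sales = pd.Series(values, index=dates)

# Six weeks (42 days) of daily forecasts following the last observation
horizon = 42
forecast_index = pd.date_range(dates[-1] + pd.Timedelta(days=1), periods=horizon, freq="D")
forecast = pd.Series(seasonal_naive(sales.to_numpy(), horizon), index=forecast_index)

# Monthly totals for the lower-frequency analysis
monthly_sales = sales.groupby(sales.index.to_period("M")).sum()

print(forecast.head(7))
print(monthly_sales)
```

A benchmark like this gives the other forecasting methods something to beat in validation; any method that cannot outperform it on held-out data is hard to justify.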
3. Further analysis for bonus marks
The group can earn up to 2 bonus marks (in the final mark for the unit) by developing a
system to automatically generate forecasts for all stores. In order to obtain the bonus
marks, you should present interesting results based on this tool (use the appendix and refer
to it in the main text of the report). The ability to summarise information and be concise is
essential here.
