QBUS2820 Predictive Analytics

Individual Assignment 1
Key information

1. Required submissions (through Canvas/Assignments/Individual Assignment 1)
a. ONE written report (Word or PDF format)
b. ONE Jupyter Notebook .ipynb
Please upload both files to Canvas in the SAME submission, as separate files (NO
zip file).
2. Due date/time and closing date/time: See Canvas. The late penalty for the
assignment is 5% of the assigned mark per day, starting after 23:59 on the due
date.
3. Weight: 30% of the total mark of the unit.
4. Length: The main text of your report should have a maximum of 10 pages with the
usual font size 11-12. You should write a complete report including sections such as
business context, problem formulation, data processing, Exploratory Data Analysis
(EDA), methodology, analysis, conclusions and limitations, etc.
5. If you wish to include additional material, you can do so by creating an appendix.
There is no page limit for the appendix. Keep in mind that making good use of your
audience’s time is an essential business skill. Every sentence, table and figure has
to count. Extraneous and/or wrong material will reduce your mark no matter the
quality of the assignment.
6. Anonymous marking: In line with the University's anonymous marking policy, please
include only your student ID in the submitted report, and do NOT include your
name. The file names of your report and code file should follow the format below.
Replace "SID" with your Student ID. Example: SID_Qbus2820_Assignment1.
7. Presentation/clarity is part of the assignment. Markers will allocate 10% of the marks for
clarity of writing and presentation. Numbers with decimals should be reported to
four decimal places.

Key rules:
Carefully read the requirements for each part of the assignment.
Please follow any further instructions announced on Canvas.
You must use Python for the assignment. Use "random_state=1" when needed, e.g.
when using the "train_test_split" function. For all other parameters that are not
specified in the questions, use the default values of the corresponding Python
functions.
Reproducibility is fundamental in data analysis, so you are required to submit a
Jupyter Notebook that generates your results. Not submitting your code will lead to a
loss of 50% of the assignment marks.
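
As an illustration of the reproducibility rule, the following minimal sketch shows how fixing random_state makes a split repeatable. The data here is synthetic and purely hypothetical; it stands in for the assignment dataset, whose file names are not given in this brief.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the assignment data (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=100)

# Fixing random_state makes the split identical on every run.
# All other parameters are left at their defaults (test_size is 25%).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
X_train_again, _, _, _ = train_test_split(X, y, random_state=1)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, X_train_again))
```

Running the cell twice, or after Restart & Run All, should give the same partition, which is what allows the notebook output to match the report.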

The notebook must run without errors and produce results consistent with the report
when accessed through Kernel -> Restart & Run All from the Jupyter menu, assuming
that the train and test datasets are in the same folder as the notebook. Failure to do so
can result in a loss of up to 50% of the assignment marks.
Failure to read information and follow instructions may lead to a loss of marks.
Furthermore, note that it is your responsibility to be informed of the University of
Sydney and Business School rules and guidelines, and follow them.

The Task

You will work on the Houses Data set. This is a dataset about residential property sales in
the US, gathered from 2006 to 2010. The dataset consists of multiple variables measuring
properties of the houses.
The assignment consists of applying models and model selection methodologies to arrive at
models that predict the price of the sale of the houses, given some of the other variables
measured.

1. Problem description

The primary goal is to find a model that accurately predicts the prices of the houses
when they are sold. The accuracy of the predictions is measured by the Mean Absolute Error
(MAE).
A secondary goal is to understand the main factors that drive prices
according to the model. This requires that at least one of the models uses few
variables, or that you can build a coherent explanation from one of the models if all use
many variables.
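
For reference, the MAE is the average of the absolute differences between observed and predicted values. The short sketch below computes it by hand and with scikit-learn; the prices are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up observed and predicted sale prices (illustration only).
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# MAE = mean of |observed - predicted|.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(f"{mae_manual:.4f}")   # 13333.3333
print(f"{mae_sklearn:.4f}")  # 13333.3333
```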

Select three models, one from each model family, to predict the target variable 'SalePrice'.
These model families are:
a linear regression model,
a kNN regression model,
a third model. This model can be any model of your choice that is neither linear
regression nor kNN (it may even be a model not covered in the QBUS2820 unit). This
is to encourage you to self-explore and self-study, since the ability to self-study is
critical in the field of machine learning, which is evolving rapidly.

All the models need to be fine-tuned with hyperparameter search (when appropriate) and
potentially variable selection. The methodology should maximize the predictive accuracy
and the. When the three models have been tuned, you will compute an accurate estimate
of the prediction error of these models and make a final decision among the three. In the
conclusions, you also have to explain the driving factors of house prices. If
the chosen model is not explainable, then use another (or several) and carefully justify the
tradeoffs.
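
One possible way to tune a hyperparameter against the MAE is a cross-validated grid search, sketched below for a kNN regressor. This is only an illustration under assumptions: the data is synthetic, the candidate values of k and the 5-fold setting are arbitrary choices, and standardizing the predictors before kNN is a design decision of this sketch rather than a requirement of the brief.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which avoids leaking information from the validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])
param_grid = {"knn__n_neighbors": [1, 3, 5, 10, 20]}  # arbitrary candidates

# scikit-learn maximizes scores, so the MAE is passed in negated form.
search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
cv_mae = -search.best_score_
print(best_k, f"{cv_mae:.4f}")
```

The test split is left untouched during the search, so it remains available for a final estimate of the prediction error once tuning is finished.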
The model selection exercise, comprising:
intro/business context/problem formulation
exploratory data analysis
the three models
the conclusions section
represents the main body of the report and makes up 85% of the grade of the assignment.

In addition to the model selection above, complete the following short exercises. Create a section for
each of the questions and remember to explain and discuss the methodology in the report
as in the main body.

(5%) Find the best predictive model that uses a single predictor (only one variable).
(5%) Instead of optimizing for the mean absolute error, how would you change your
methodology to optimize for the median error? This is a theoretical question:
answer with a proposed methodology; there is no need to code it in the notebook.
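
For the single-predictor exercise, one possible approach is to fit a model on each variable in turn and compare cross-validated MAE. The sketch below does this with linear regression on synthetic data; the column names feature_a, feature_b and feature_c are hypothetical placeholders, and restricting the comparison to linear regression is an assumption of this sketch (other model families could be compared in the same loop).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with hypothetical column names; only feature_b drives the target.
rng = np.random.default_rng(1)
data = pd.DataFrame(rng.normal(size=(200, 3)),
                    columns=["feature_a", "feature_b", "feature_c"])
target = 5.0 * data["feature_b"] + rng.normal(size=200)

# Cross-validated MAE of a one-variable model, for each candidate variable.
cv_mae = {}
for name in data.columns:
    scores = cross_val_score(LinearRegression(), data[[name]], target,
                             scoring="neg_mean_absolute_error", cv=5)
    cv_mae[name] = -scores.mean()

best_name = min(cv_mae, key=cv_mae.get)
print(best_name)  # feature_b
```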

Select 3 houses at random from the dataset and:

(5%) Predict sale prices for those three houses for the year 2022. Justify your
answer. You can use any of the three candidate models or use a new model.

Bonus question: Approximate the bias and variance components of the Expected Prediction
Error of the most accurate model chosen in the main body of the report. The
approximation is for a dataset of 70% of the size of the original dataset. Comment on the
limitations of your solution. The bonus exercise is worth an extra 5% on top of the grade of the
assignment. It can be used to counteract errors but cannot take the total grade for the
assignment over 100%.
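
A rough way to approach this kind of approximation is repeated subsampling: fit the model on many random subsets, each 70% of the size of the full dataset, and examine how the predictions at fixed evaluation points vary. The sketch below is one such scheme on synthetic data, not a prescribed solution. The number of repeats, the use of linear regression and the hold-out split are all assumptions of the sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); the true relation is nonlinear in X[:, 1].
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

# Hold out fixed evaluation points; subsamples are drawn from the remaining pool.
X_pool, X_eval, y_pool, y_eval = train_test_split(X, y, random_state=1)

n_repeats = 200                      # arbitrary choice
subsample_size = int(0.7 * len(X))   # 70% of the original dataset size
predictions = np.empty((n_repeats, len(X_eval)))
for r in range(n_repeats):
    idx = rng.choice(len(X_pool), size=subsample_size, replace=False)
    model = LinearRegression().fit(X_pool[idx], y_pool[idx])
    predictions[r] = model.predict(X_eval)

# Variance: spread of predictions across refits, averaged over evaluation points.
variance = predictions.var(axis=0).mean()
# Squared bias cannot be separated from irreducible noise here, because only
# noisy observations of the target are available at the evaluation points.
mean_prediction = predictions.mean(axis=0)
bias_sq_plus_noise = np.mean((mean_prediction - y_eval) ** 2)

print(f"{variance:.4f}", f"{bias_sq_plus_noise:.4f}")
```

Two limitations of this sketch are worth noting: the subsamples overlap heavily, so the variance is likely to be underestimated, and the bias term is mixed with noise as explained in the comments.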

The main marks come from the report. That is, you can have a ‘perfect’ notebook, but
if there is no explanation in the report then it will not be given marks.
Always give a reasoned answer. Why did you choose a particular variable selection
method and not another? Why did you choose a particular ‘third’ model? Why did you
choose a particular method for estimating the errors? Etc.
You do not need to ‘re-state’ the properties of the models, but you need to critically
justify what the models add to the analysis, and what the benefits and drawbacks of those
models are. The same goes for other decisions in the
You might need to make ‘suboptimal’ decisions due to, for example, computation
times, failure to meet assumptions of the models, etc. In this case remember to
state the reason for the decision and the potential problems.

The grading of the assignment will be based on the methodology and justifications,
removing points for methodological errors, incomplete sections, etc. There is no ‘minimum’
predictive accuracy to be reached, but you need to apply a good methodology.

The dataset is a popular one in data analysis, described in:
https://doi.org/10.1080/10691898.2011.11889627
or http://jse.amstat.org/v19n3/decock.pdf
It is used in a practice Kaggle competition:

2. Written report

The purpose of the report is to describe, explain, and justify your solution. Be concise and
objective. Find ways to say more with less. When in doubt, put it in the appendix. Below are
some guidelines on how to work on the Task.
Preparation. You read and understood the assignment requirements and are aware that
this is part of the assessment. You understand that machine learning is grounded in
rigorous logic and theory that should inform your practical analysis. You understand that
there is no single right solution and that trying different approaches and discovering
empirically what works best for a particular problem is natural and desirable in this type of
analysis.

Business context and problem formulation. The report includes a discussion of the context
for the analysis, the problem and questions/hypotheses to be addressed, and how you plan
to measure the success of your proposed solutions.

Data processing. You make sure that the dataset is free of errors and correctly processed
for your analysis. You handle missing values and other issues appropriately. You describe
the data processing steps in a clear and concise way.
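
As a small illustration of handling missing values, the sketch below counts missing entries and imputes them with the median learned from the training data only. The tiny data frames and the column names area and rooms are hypothetical, and median imputation is just one option among several; the appropriate treatment depends on why the values are missing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny hypothetical train and test frames with missing entries.
train = pd.DataFrame({"area": [100.0, np.nan, 140.0, 120.0],
                      "rooms": [3.0, 4.0, np.nan, 5.0]})
test = pd.DataFrame({"area": [np.nan, 130.0],
                     "rooms": [2.0, np.nan]})

# Count missing values per column before deciding how to treat them.
missing_counts = train.isna().sum()
print(missing_counts)

# Learn the medians on the training data only, then apply them to both sets,
# so that no information from the test set leaks into the processing step.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```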

Exploratory data analysis (EDA). Your report describes your EDA process, presenting only
selected results. You studied key variables individually. You note any features of the data
that are relevant for model building (some variables might be ‘invalid’ for predictive
purposes). You note the presence of outliers and any other anomalies that can affect the
analysis. You explain the relevance of the EDA results to your subsequent modelling. Your
EDA section in the report is concise, leaving additional figures and tables to the appendix if
needed. Outliers should be clear (e.g. negative values for counting variables). EDA is not the
place to do variable selection, and outliers of an unclear nature (e.g. very large values)
should either not be removed or be further analyzed using the predictive model performance.
The dataset has many variables and you are not expected to report on all of them
individually; just report your methodology and main findings.

Variable selection. You describe and explain your process for variable selection. Your
choices are justified by data analysis and/or trial and error. Other than potentially invalid
variables from the dataset, the decision should be driven by the performance of the models,
not based on opinions (you are free to comment on the disagreements between your
background knowledge and the models).

Methodology and modelling. You clearly describe and justify the models, methods, and
algorithms in your analysis. The choice of methods is logically related to the assignment
requirements, the substantive problem, underlying theoretical knowledge, and data
analysis. This may involve systematic trial and error, but the report should focus on your
final solutions. Your methodology pays attention to statistical variability. You report all
crucial assumptions and check them as relevant via formal and informal diagnostics. You
clearly recognize when an assumption is not satisfied or questionable. Some problems may
be unfixable given the available data and methods. In this case you can identify what
additional information or methodology could allow you to fix these problems.

Analysis and conclusions. Your analysis is rich. You correctly interpret the results and
discuss how they address the substantive question. The reasoning from methodology and
results to your conclusions is logical and convincing. You are not misled by overfitting. Your
analysis pays attention to statistical variability. You make no claims for which you have no
evidence. You do not make statements that imply causation when discussing associations.
You explicitly acknowledge when limitations of the data or methods lead to uncertainty
about your answer to the substantive question.

Writing. Your writing is concise, clear, precise, and free of grammatical and spelling errors.
You use appropriate technical terminology. Your paragraphs and sentences follow a clear
logic and are well connected. There is a clear distinction between the essential parts of the
report and less important material (use the appendix). Your text refers to meaningful names
for variables and subjects. If you use an abbreviation or label, you first have to define it.

Report. Your report is well organized and professionally presented and formatted, as if it
had been prepared for a client later in your career. There are clear divisions between
sections and paragraphs.

Tables. Your tables are appropriately formatted and have a clear layout. The tables have
informative row and column labels. As far as possible, the tables are easy to
understand on their own (in the real world, a significant part of your audience will skim-read
by going straight to the tables). The tables do not contain information which is irrelevant to
the discussion in your report. Your tables are not images. The tables are placed near the
relevant discussion in your report. There is no text around your tables.

Figures. Your figures are easy to understand and have informative titles, captions, labels,
and legends. The figures are well formatted and laid out. The figures are placed near the
relevant discussion in your report and are referenced from the text of the report. Your
figures have appropriate resolution and were directly saved from Python into an image file
format. There is no text around your figures.

Numbers. All numerical results are reported to four decimal places.
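
For example, Python format specifiers can be used to print results to four decimal places. The values below are made up for illustration.

```python
# Made-up results (illustration only).
mae = 13333.333333
r_squared = 0.8765432

# The ".4f" format specifier rounds to four decimal places.
formatted_mae = f"{mae:.4f}"
formatted_r_squared = f"{r_squared:.4f}"

print("MAE:", formatted_mae)        # MAE: 13333.3333
print("R^2:", formatted_r_squared)  # R^2: 0.8765
```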

Referencing. You add citations for your sources. The references follow a recognizable style
(e.g. the Harvard Referencing System, MLA, APA, Vancouver, etc.)

Python code. The code is presented in a neat and compact way. The code uses meaningful
variable names and can be easily followed by someone with training in Python and statistics.
Someone should be able to run your code and reproduce all the results that appear in your
report. Your code has comments that clearly indicate which parts correspond to which
sections of your report. You explicitly acknowledge when you borrow pieces of code from
sources other than the lecture and tutorial materials.
