代做COMP809 – Data Mining & Machine Learning Semester 1, 2024 Assignment 1调试Python程序

COMP809 – Data Mining & Machine Learning

Semester 1, 2024

Assignment 1

Maximum Marks: 100

22 March 2024

Paper Description: Data Mining & Machine Learning

Paper Code: COMP809

Total Marks: 100

Date: 22 March 2024

Deadline: 12 April 2024

INSTRUCTIONS:

1. This is an individual assignment.

2. Only documents in pdf or html will be accepted. You can generate the document directly on Jupyter Notebook.

3. Submit the pdf or html file via Canvas.

4. Formats other than pdf will be ignored and the author will be asked to re-submit the assignment. Resubmissions will be subject to the late policy outlined in the study guide (i.e., 5% per day up to 5 days).

5. The python code required to complete this assignment, which includes code to support your conclusions & answers, must be embedded in the document in the corresponding answer as text (not image), unless otherwise specified. This code will be marked. Unsolicited python scripts submitted separately will not be marked.

6. Read carefully and answer all the questions as requested. Any material or information unrelated to the correct answer may result in a significant reduction of marks for that question.

7. Do not forget to fill in and sign the cover sheet which must be the very first page in the pdf. Use software such as Adobe Acrobat Pro on the Uni computers to include the file at the start of your document.

8. If you need an extension or if your performance has been impacted by some extenuating circumstances, then you must complete a special consideration form. on Canvas.

9. Only the techniques studied in this course will be accepted.

10. The comprehension of the questions is part of the assignment.

Grade table:

Question:

1

2

3

Total

Points:

30

35

35

100

Score:





QUESTIONS:

1. Nitrogen Oxides are a family of poisonous, highly reactive gases. These gases form. when fuel is burned at high temperatures. NOx pollution is emitted by automobiles, trucks and various non-road vehicles (e.g., construction equipment, boats, etc.) as well as industrial sources such as power plants, industrial boilers, cement kilns, and turbines. NOx often appears as a brownish gas. It is a strong oxidizing agent and plays a major role in the atmospheric reactions with volatile organic compounds (VOC) that produce ozone (smog) on hot summer days. Source: United States Environmental Protection Agency−EPA.

The NOxEmissions.csv file contains 8088 observations on the following 4 variables:

• julday: day number, a factor with levels 373 ... 730, typically with 24 hourly measurements.

• LNOx : log of hourly mean of NOx concentration in ambient air [ppb] next to a highly frequented motorway.

• LNOxEm: log of hourly sum of NOx emission of cars on this motorway in arbitrary units.

• sqrtWS: Square root of wind speed [m/s].

In this study, we are interested in modelling the LNOx concentration as a function of the other variables. The time dependency will be omitted. The same with the space dependency if there is any.

Total for Question 1: 30

(a) Describe the data preprocessing. Justify your answers. [3]

(b) Describe the distribution of the variable LNOx. [4]

(c) Fit a linear model to explain the variable LNOx as a function of LNOxEm and sqrtWS. Comment on the model. Justify your answer. [15]

(d) Discuss the relationship between the dependent and independent variables. Interpret in a way that someone who is not familiar with the field can understand the parameter associated to the predictor LNOxEm. [4]

(e) Predict the Nitrogen Oxides concentration for a LNOxEm = 7.5 and sqrtWS = 1.3. Interpret your results in a way that someone who is not familiar with linear models can understand. [4]

2. The nassCDS.csv file contains data on Airbag and other influences on accident fatalities. The data has been collected in US, for 1997-2002, from police-reported car crashes in which there is a harmful event (people or property), and from which at least one vehicle was towed. Data are restricted to front-seat occupants, include only a subset of the variables recorded, and are restricted in other ways also. The data has 26,217 observations on the following 15 variables:

• dvcat: ordered factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54, 55+.

• weight: Observation weights, albeit of uncertain accuracy, designed to account for varying sampling probabilities.

• dead: factor with levels alive dead.

• airbag: a factor with levels none airbag.

• seatbelt: a factor with levels none belted.

• frontal: a numeric vector; 0 = non-frontal, 1=frontal impact.

• sex : a factor with levels f m.

• ageOFocc: age of occupant in years.

• yearacc: year of accident.

• yearVeh: Year of model of vehicle; a numeric vector.

• abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels deploy nodeploy unavail.

• occRole:a factor with levels driver pass.

• deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or more bags deployed.

• injSeverity: a numeric vector; 0:none, 1:possible injury, 2:no incapacity, 3:incapacity, 4:killed; 5:unknown, 6:prior death.

• caseid: character, created by pasting together the populations sampling unit, the case number, and the vehicle number. Within each year, use this to uniquely identify the vehicle.

In this project we are interested in explaining the probability of surviving (variable dead) given airbag, seatbelt, frontal, sex, ageOFocc, yearVeh, and deploy. For this fit a proper generalised linear regression model.

Total for Question 2: 35

(a) Describe the data preprocessing. Justify your answers. [3]

(b) Is the use of the seat belt independent of whether the passenger survives or not? Justify your answer. Use only the variables related to this question in your analysis. [6]

(c) Is there a mean age difference between the following injury severity (injSeverity) groups: none, possible injury, no incapacity, incapacity, and killed? Justify your answer. Use only the variables related to this question in your analysis. [6]

(d) Fit a model that explains the dependent variable as a function of airbag, seatbelt, frontal, sex, ageOFocc, yearVeh, and deploy. Use 70% to train the model and 30% to test it. Comment on the performance of the model in a way that someone who is not familiar with the concepts can understand. [12]

(e) Interpret the parameter associated to seatbelt and ageOFocc in a way that anybody can understand.  [4]

(f) Predict the odds of not surviving for the following two scenarios:

1. There is no airbag, the passenger is not wearing seatbelt, it is a frontal impact, the passenger is female 70 years old, and the airbag is not deployed.

2. There is airbag, the passenger is wearing seatbelt, it is a frontal impact, the passenger is female 70 years old, and the airbag is deployed.

Interpret your results. [4]

3. The dataset is being used in a project that aims to study the international student flow and its relationship with the level of globalisation of countries and their national higher educational systems in 2019. The data were extracted and merged from three sources: UNESCO, Swiss KOF Economic Institute Dataset, QS World University Rankings. The data q3.xlsx file contains many variables, but in this study, we are interested in the following ones:

• InboundRatio: Inbound mobility rate for international students.

• InternationalStudentsNO: Total number of inbound international students.

• KOFPoGI: Political Globalization index.

• KOFEcGI: Economic Globalisation index.

• KOFSoGI: Social Globalization index.

• ISCED5 Percentage: Participation in short cycle tertiary education.

• ISCED6 Percentage: Participation in Bachelor’s level.

• ISCED7 Percentage: Participation in Master’s level.

• ISCED8 Percentage: Participation in Doctoral level.

• top_50 count: Number of universities ranked in the top 50.

• top_100 count: Number of universities ranked in the top 100.

• top_500 count: Number of universities ranked in the top 500.

• top_1000 count: Number of universities ranked in the top 1000.

• WESP: World Economic Situation and Prospects 2021.

• country_x : country.

Notes: ISCEDs variables refer to the participation percentage by level of higher education. It is calculated by the number of students enrolled in the corresponding level divided by the population of the official age for tertiary education, multiplied by 100. About KOFPoGI, KOFEcGI, and KOFSoGI indexes, the higher the better. The inbound mobility rate is the number of students from abroad studying in a given country, expressed as a percentage of total tertiary enrolment in that country.

Total for Question 3: 35

(a) Describe the data preprocessing. Justify your answers. [3]

(b) Perform. an exhaustive K-mean cluster analysis on the variables of interest. How many clusters do you propose? Justify your answer. [15]

(c) Perform. an agglomerative cluster analyses. How many clusters do you propose? Justify your answer. [12]

(d) What do you conclude? Provide an interesting remark(s). Justify your answers. (3-4 sentences). [5]




热门主题

课程名

int2067/int5051 bsb151 babs2202 mis2002s phya21 18-213 cege0012 mgt253 fc021 mdia1002 math39512 math38032 mech5125 cisc102 07 mgx3110 cs240 11175 fin3020s eco3420 ictten622 comp9727 cpt111 de114102d mgm320h5s bafi1019 efim20036 mn-3503 comp9414 math21112 fins5568 comp4337 bcpm000028 info6030 inft6800 bcpm0054 comp(2041|9044) 110.807 bma0092 cs365 math20212 ce335 math2010 ec3450 comm1170 cenv6141 ftec5580 ecmt1010 csci-ua.0480-003 econ12-200 ectb60h3f cs247—assignment ib3960 tk3163 ics3u ib3j80 comp20008 comp9334 eppd1063 acct2343 cct109 isys1055/3412 econ7230 msinm014/msing014/msing014b math2014 math350-real eec180 stat141b econ2101 fit2004 comp643 bu1002 cm2030 mn7182sr ectb60h3s ib2d30 ohss7000 fit3175 econ20120/econ30320 acct7104 compsci 369 math226 127.241 info1110 37007 math137a mgt4701 comm1180 fc300 ectb60h3 llp120 bio99 econ7030 csse2310/csse7231 comm1190 125.330 110.309 csc3100 bu1007 comp 636 qbus3600 compx222 stat437 kit317 hw1 ag942 fit3139 115.213 ipa61006 econ214 envm7512 6010acc fit4005 fins5542 slsp5360m 119729 cs148 hld-4267-r comp4002/gam cava1001 or4023 cosc2758/cosc2938 cse140 fu010055 csci410 finc3017 comp9417 fsc60504 24309 bsys702 mgec61 cive9831m pubh5010 5bus1037 info90004 p6769 bsan3209 plana4310 caes1000 econ0060 ap/adms4540 ast101h5f plan6392 625.609.81 csmai21 fnce6012 misy262 ifb106tc csci910 502it comp603/ense600 4035 csca08 8iar101 bsd131 msci242l csci 4261 elec51020 blaw1002 ec3044 acct40115 csi2108–cryptographic 158225 7014mhr econ60822 ecn302 philo225-24a acst2001 fit9132 comp1117b ad654 comp3221 st332 cs170 econ0033 engr228-digital law-10027u fit5057 ve311 sle210 n1608 msim3101 badp2003 mth002 6012acc 072243a 3809ict amath 483 ifn556 cven4051 2024 comp9024 158.739-2024 comp 3023 ecs122a com63004 bms5021 comp1028 genc3004 phil2617
联系我们
EMail: 99515681@qq.com
QQ: 99515681
留学生作业帮-留学生的知心伴侣!
工作时间:08:00-21:00
python代写
微信客服:codinghelp
站长地图