
SOCS0055 Advanced Computational Techniques for Data Science

Summative Assessment 100% - Autumn Term 2025-2026

1.   There are four tasks in this assessment, which together add up to 100 marks.

2.   There is no word limit. We advise you to write concise, focused answers; otherwise, you may run out of time.

3.   To achieve full marks, no additional references/bibliography are required or expected. However, if you use them, list them at the end of the assessment following an accepted referencing style (check these UCL guidelines)

4.   AI can be used as an assistive tool. However, you are not permitted to use it to generate any part of your exam. Any use of AI must be acknowledged at the end of the exam.

5.   The assessment is split into two parts, A and B. For Part A, students must create their own wrangled version of the UKHLS dataset. For Part B, a cleaned version of the dataset is provided.

a.   Instructions for downloading the full USoc data from the UK Data Service are provided in an Appendix. Note that these instructions will download data for both USoc and its predecessor, the British Household Panel Survey (BHPS), but we only want you to use USoc data in this assessment.

b.   For Part B, a cleaned version of the dataset is provided in the Moodle assessment section.

The specific outputs required differ between Part A and B. You must submit:

 

•    Coversheet.docx

Part A

•    “partA_task1.qmd” (Quarto file containing code)

•    “partA_task2.qmd” (Quarto file containing code)

•    “partA_task2_table.html” (Output table as an HTML file)

•    “partA_task3.qmd” (Quarto file containing code)

Part B

•    “partB_task4.html” (HTML file with your explanation, your code and your output)

•    “partB_task4.qmd” (Quarto file of your task)

Each individual Quarto file needs to be self-contained and must be capable of running in full, from scratch, in a new, ‘clean’ session (i.e., from loading packages and data to performing analyses). The code should also be well formatted, with comments, sections, sensible variable names, and so on.

Your final submission must be a zip folder containing the above files and everything necessary to run the code (except the data, which your examiners will have access to already). Please name this file “[CandidateNumber].zip”, where you replace [CandidateNumber] with your anonymous candidate number (not your student ID). You can use the course materials as an example.

Part A

Task 1 (25% of Marks)

Create a “long” dataset with one row per pidp x wave combination from all of the adult interview (*_indresp.dta) files. The dataset should have rows only for those participants who appeared in the a_indresp.dta file (i.e., completed an adult interview in Wave 1 of USoc).

The dataset should have cleaned versions of current smoking status (binary, derived from *_smever and *_smnow), obesity (binary, BMI ≥ 30 kg/m², derived from *_bmi_dv), SF-12 mental and physical component summary scores (*_sf12mcs_dv and *_sf12pcs_dv), overall life satisfaction, age at interview (*_age_dv), date of interview (month-year), sex (*_sex_dv), whether the participant has a degree or above level education (binary, derived from *_hiqual_dv), and employment status (binary, derived from *_jbhas and *_jboff).

Include a step to save the cleaned dataset using the saveRDS() function. You will need to load and use this dataset in Tasks 2 and 3.

The stubs of most of the source variables needed to create these variables are provided above, but for overall life satisfaction and date of interview you are required to find the relevant variables yourself. Note that some of the variables are available in only some of the waves. See this website for a helpful tool which provides variable names and waves of collection. You can alternatively use the labelled::lookfor() function.

All that is required for this question is the code used to complete this task, provided in a Quarto (.qmd) document. Marks will be awarded to students who reduce the amount of code written to complete this task, compared with writing out all instructions manually (e.g., using functions called repeatedly or combining data to clean variables in one fell swoop), and to students who include helpful comments explaining their code.

Task 2 (10% of marks)

Load the dataset from Task 1 and create a ‘Table 1’ which shows descriptive statistics for each variable in the dataset. This table should report statistics separated by study wave (i.e., in separate columns). For categorical variables, report sample sizes and proportions (as % of non-missing) in each category. For continuous variables, report mean and standard deviation.

Format the table so that variable names are descriptive and human interpretable rather than repeating the R column names. Also remove any variables that are not helpful to report.

Save the table as an .html file. You can use whatever package you wish to create the table (e.g., gt, gtsummary, flextable, and so on).

Along with the code used to complete this task (and helpful explanations) in a Quarto (.qmd) file, we require the saved table file(s). Marks will be awarded for completing the task successfully (e.g., displaying the correct descriptive statistics), but also for producing a table that is ‘publication-ready’ – i.e., well formatted, aesthetically pleasing, and understandable.

Task 3 (15% of marks)

Load the dataset from Task 1 and run a series of cross-sectional regressions for each (valid) combination of:

-    Outcome: smoking status, SF-12 MCS, SF-12 PCS, obesity, and life satisfaction

-    Exposure: employment status and degree qualification

-    Control variables: age, age-squared and date of interview

-    Wave: 1-14, separately.

-    Sex: males, females, and males and females combined

Store the results of the regressions in a single tibble containing coefficients, standard errors, and upper and lower confidence intervals for the exposure of interest, as well as details on the parameters of the regression (outcome, exposure, … , sex used). For each regression, use OLS regression (lm()), regardless of whether the outcome variable is continuous or binary.

As with Task 1, all that is required for this question is the code used to complete this task, provided in a Quarto (.qmd) document. Marks will again be awarded to students who reduce the amount of code written to complete this task, compared with writing out all instructions manually, and to students who include helpful comments explaining their code.

Part B

Task 4 (50% of marks)

The datafile “Brexit.RData” contains data from Understanding Society, wave 8. It includes approx. 25,000 individuals and information on the following characteristics:

•   pidp: Person ID

•   bornuk_dv: Born in UK

•   gor_dv: Region of residence

•   brexit_leave: Intention to vote "leave" (1 means intention to vote “leave”)

•   age_dv: Age

•   sex_male: Male

•   marstat_dv: Marital status

•   migback_gen: Migration background

•   ethn_dv: Ethnic group

•   hiqual_dv: Highest level of education

•   nkids_dv: Number of kids

•   jobstat: Employment status

•   unemployed: Unemployed

•   fihhmnnet1_dv: Household labour income

•   hh_inc_oecd: Household equivalence income

•   ind_lab_inc: Individual labour income

•   tenure_dv: Housing tenure

•   area_rural: Lives in rural area

•   lkmove: Likes to move to a new place of residence

•   financial_diff_fut: Subjective financial situation - future (Looking ahead, how do you think you will be financially a year from now, will you be)

•   financial_diff_now: Subjective financial situation - current (How well would you say you yourself are managing financially these days? Would you say you are)

•   health: General health

•   distress: Level of distress (0 indicating the least amount of distress; 36 indicating the greatest amount of distress)

•   lifesat: Satisfaction with life overall (1 Completely dissatisfied; 7 Completely satisfied)

•   deprivation: Household-level material deprivation (0 no deprivation - 100 highest level of deprivation)

•   problems_bills: Having trouble paying the bills

•   problems_counciltax: Having trouble paying the council tax

•   nbh_deprivation: Neighbourhood deprivation percentile (1 lowest level of deprivation, 10 highest level of deprivation)

•   nbh_foreign: Neighbourhood share of foreigners (1 lowest percent of foreign residents, 10 highest percent of foreign residents)

•   nbh_above65: Neighbourhood share of residents above 65 (1 lowest percent of residents above 65, 10 highest percent of residents above 65)

Factors have labels in the data. For binary indicators, 1 means yes and 0 means no.

Here is your task: Build a machine learning algorithm to predict the intention to vote “Leave” in the Brexit referendum (brexit_leave). Your main aim is to maximise the out-of-sample accuracy. You can use any algorithm that we have covered in our module, and you can also use different algorithms if you can justify your decision.

Your submission should include an HTML document that clearly addresses the following:

A) Data Preparation

•  Describe any recoding or preprocessing steps you performed.

B)  Sample Size

•  Report the final sample size used in your analysis.

•  If it differs from the original sample, explain why.

C)  Train/Test Split

•  Explain how you split the data into training and test sets.

D) Hyperparameter Tuning

•  Describe how you selected or tuned hyperparameters for your model(s). If you used cross-validation, briefly explain what you did. The code for cross-validation needs to be executable.

E)  Prediction Methods

•  Briefly explain which machine learning methods you used to generate predictions and why you selected those. You can also compare the performance of multiple algorithms to make your final choice — if so, include the comparison in your code.

F)  Performance Evaluation

•  Select a final model that is your preferred choice for predicting the outcome. Your final model must be recognizable in the code – so that we can use it for prediction with our hold-out sample.

•  Report your final performance metric (e.g., accuracy, AUC, MSE).

G) Feature Importance

•  Provide a summary of feature importance and a short explanation of the key findings from your preferred model.

Your document should make clear how you selected your final prediction model and provide evidence of its out-of-sample performance using the provided data.

Note: The primary goal is out-of-sample prediction accuracy.

For Part B of your coursework, please submit:

•  A document with a narrative explanation of your approach (HTML / DOC / PDF)

•  Your Quarto / R script with fully replicable code (it must run on a different device)

This code should be self-contained and be capable of running in full from scratch – i.e., from a clean R session, to loading packages, wrangling the provided data, and producing and saving relevant outputs (e.g., plots or tables).

We will test your model’s performance on hold-out data that is not included in the dataset you received. Extra marks will be awarded to submissions that achieve a high prediction accuracy on this hold-out data (defined as being within a reasonable range of the best-performing model).


