STAT0045. In-course Assessment 2 (2024/25 Session)

Department of Statistical Science

General Instructions

· This assessment is classified as Coursework as defined in the UCL Student Regulations for Exams and Assessments. It contributes 40% to the overall mark for this module.

· The release date for this assessment is 12:00 (UK time) on Tuesday, 11 March 2025.

· The submission deadline is 16:00 (UK time) on Tuesday, 18 March 2025.

· Individual extensions to the submission deadline can only be granted where a student has been issued with a Summary of Reasonable Adjustments (SoRA), has used a Delayed Assessment Permit (if the assessment is eligible), or has made a valid claim for Extenuating Circumstances. The standard extension length for this assessment type is five working days.

。 If you have a SoRA, your extension should be set up automatically and you should see it reflected in the deadline displayed in the submission portal. If you think that your SoRA adjustment has not been applied, please contact the module lead at the earliest opportunity.

。 Delayed Assessment Permits and Extenuating Circumstances claims should be submitted through Portico. The module lead will be notified and will act on extensions approved via these routes, but the deadline displayed in the submission portal will not update instantly.

· In preparation for this assessment, please ensure that you are familiar with the Department of Statistical Science’s guidance on academic integrity. When submitting your work, you will be required to make a declaration that you have read and understood this guidance.

· Parts of your submission may be scanned using similarity detection software. If any breach of the assessment regulations is suspected, it will be investigated in accordance with UCL’s Student Academic Misconduct Procedure.

· To facilitate anonymous marking, you should not write your name anywhere on your work, including in file names or file descriptions requested as part of the submission process.

· You must only submit your work via the designated portal in Moodle. If you try to submit via email or any other channel this will not count as a submission and will not be marked.

· There are strict, non-negotiable penalties for late submission, which for coursework are as follows.

。 Up to 2 working days late: deduction of 10 percentage points, but no lower than the pass mark.

。 2-5 working days late: capped at the pass mark.

。 More than 5 working days late: mark of 1.00%.

· If the module lead becomes aware of a significant technical issue or outage affecting Moodle during the assessment, a message will be circulated to explain what has happened and the steps being taken to mitigate the issue. If you do not receive notification of a more widespread issue and you experience technical difficulties, you should refer to the Help & Support resources provided by UCL’s central IT service. However, last-minute technical issues will not be considered as valid grounds for missing the deadline, so ensure that you leave plenty of time to prepare, upload and check your submission.

· Non-submission (in the absence of any valid Extenuating Circumstances) will mean that your mark for this component is recorded as 0.00% and you will be deemed to have made an attempt.

· You should expect to receive feedback on this assessment within 20 working days of the submission deadline. In the event of a delay, the module lead will contact students directly with details of the revised timeline.

The assessment

· This is an individual assessment; you must work alone.

· This assessment consists of two parts. For Part A, you can submit scanned/photographed handwritten solutions. Make sure that scanned work can be read clearly. Note the UCL advice on submitting scanned/photographed work (link: https://www.ucl.ac.uk/news/2020/apr/seven-simple-steps-submit-handwritten-answers-moodle-exams-or-assessments). For Part B you are required to write a report and this report should be typed. Include a word count for this part.

· The relevant course material for this assessment is all the material up to and including Section 4.6. Exercise Sheet 7 is not included.

· Keep your answers concise. Answers that are unnecessarily elaborate or include information that is not asked for will be penalised.

· Part A and Part B are both marked on a scale 0-100, and are equally weighted for the final mark. For Part A, marks for the constituent parts are listed in bold face. Marks are given for correct answers, but also for succinctness and clarity of explanation.

· To ensure anonymous marking, only provide your Student ID number at the top of Part A and B (and not your name). Part A and B should be submitted together in one PDF file. Submit the file with your Student ID as name; for example, if your ID is 20001234, use the name 20001234.pdf .

· You can use R for the questions in Part A, but do not hand in R code. R code is information that is not asked for; see above.

· For Part B, you are allowed to use an AI tool (such as ChatGPT), but you should acknowledge the use of this and explain the way you used it.

· You can use the forum to raise queries during the assessment, but only if the queries concern clarification of tasks in the assessment. This option will only be available till 12 noon on March 17th, 2025. From that time onward the forum will be read-only till March 19th.

Part A

Question 1

For this question, you have to download a data set that is identified by your Student ID number.

· You can find the data in the Section ICA 2 on Moodle. Your data set is identified by your student ID number. Be careful to identify your data specifically. Marking is partly based on student-specific data analysis.

· If your ID is 20001234 for example, then select and download the text file 20001234.txt and put the file in the working directory of your R session.

· Read in the file in your R session by the command dta <- read.csv(file="20001234.txt") , and have a look at the data. Example of using R for this:

> dta <- read.csv(file="20001234.txt")
> head(dta)
      y x
1  8.27 1
2  5.06 1
3 12.14 1
4  4.92 1
5  6.30 1
6  9.81 1

· If you cannot read in the data in R, contact the module lead as soon as possible via email: [email protected].

· The data are created in the format of a 100 × 2 table. The first column ( y ) is for response Y, and the second column ( x ) identifies the level of the treatment variable.

Your data concern a one-way ANOVA experiment regarding the height of a flower plant. Response Y is the response in centimeters, and x identifies the five treatment levels. Values x = 1, 2, 3, and 4 correspond with the use of four different fertilisers. The value x = 5 corresponds with the no-fertiliser treatment. The aim of the experiment is to establish how the height of the plant is affected by choice regarding fertilisers.
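A quick structural check after reading in the file can confirm that the data match this description. The lines below are only a sketch: the file is a simulated stand-in written to a temporary location, the equal group sizes of 20 are an assumption made for illustration, and none of the values are assessment data.

```r
# Hypothetical stand-in for a student file: 100 rows, columns y and x.
# The values are simulated; they are NOT the assessment data, and the
# equal group sizes (20 per level) are an assumption for illustration.
set.seed(1)
demo <- data.frame(y = round(rnorm(100, mean = 8, sd = 2), 2),
                   x = rep(1:5, each = 20))
demo_file <- tempfile(fileext = ".txt")
write.csv(demo, demo_file, row.names = FALSE)

dta <- read.csv(file = demo_file)

# Structural checks against the description of the data
stopifnot(identical(dim(dta), c(100L, 2L)))      # 100 x 2 table
stopifnot(identical(names(dta), c("y", "x")))    # response, then treatment
stopifnot(setequal(unique(dta$x), 1:5))          # five treatment levels

# read.csv stores x as an integer; convert it to a factor so that R
# treats it as a categorical treatment variable rather than a number
dta$x <- factor(dta$x)
table(dta$x)   # group sizes per treatment level
```

With your own file, replace demo_file by the name of the downloaded text file; the checks should pass unchanged if the file was read correctly.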

For the statistical inference, use a significance level of 5%.

(a) Consider the hypothetical case where data for this experiment are collected by someone who uses her garden. Say she collected the data by using the front and the back garden as follows: plants with fertiliser x = 1 and x = 2 in the front garden, and the plants with the other treatment levels in the back garden. Explain briefly and in simple terms (without using the word “randomisation”) why this is not a good way to collect the data. [3]

(b) Potassium is a common ingredient in fertilisers. Consider the hypothetical case that fertiliser x = 1 has twice the amount of potassium compared to fertilisers identified by x = 2, 3, and 4. Would this undermine your statistical inference? Explain your answer. [3]

(c) Define a one-way linear ANOVA model for response Y with an intercept. Define the model such that the intercept can be estimated by the mean of the observed values for Y under the no-fertiliser treatment. Write down the model equation and specify this equation completely for your data. [8]

(d) Fit the model in (c) to your data and report the ANOVA table with clearly defined rows and columns. Using the model definition in (c), define the hypothesis for testing whether all five treatment group means are equal. Test this hypothesis using the ANOVA table. Be explicit about the distribution you use for this test. [4]

(e) Provide the point estimates for all the model parameters in (c). [6]

(f) Define the estimator of the intercept in the model in (c) as a function of response mean(s) and derive the variance of this estimator as a function of the error variance σ² and the sample sizes for the treatment levels. Clearly explain your derivations. [8]

Question 2

(a) Show how to derive

in Section 2.6.3.4. Do not derive anything that is already shown in the lecture notes; solve for a using the equations that are available in the notes. Mind that a previous version of the lecture notes contained a typo in the expression for a. [6]

Consider stratified sampling with the following specifications:

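Table columns: Stratum, Stratum size, Strata variance. Only the values 2000, 5000, 3000 and 2000 survive from this table; their row and column positions, and the remaining entries, are not recoverable.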

(b) Suppose that the variance of the stratified sample mean Ȳ_ST is fixed at 1/10. Assume that costs are defined by [cost function missing from the source], where c0 = 100, and (c1, c2, c3, c4, c5) = (1, 2, 1, 2, 4). Derive the optimal strata sample sizes nℓ, for ℓ = 1, 2, 3, 4, 5. Explain your derivation. Hint: you may want to use R for the computation, but do not include R code in your answer. [14]

Question 3

Consider the following randomised response (RR) design for a yes-or-no question that asks respondents whether or not they have committed fraud. Instead of answering the question directly, the respondent throws a dice and keeps the outcome of the throw hidden from the interviewer.

· If the outcome of the throw is 1 or 2, then the respondent answers yes.

· If the outcome of the throw is 6, then the respondent answers no.

· If the outcome of the throw is 3, 4, or 5, then the respondent answers yes or no in line with whether or not he or she committed fraud.

Assume that respondents follow the RR design. Let π₁ denote the probability that a respondent has committed fraud.
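The three rules of the design can be mimicked in a small simulation, which may help in picturing how a latent status turns into an observed answer. This is a sketch only: the sample size n, the seed, and the fraud probability pi1_demo are hypothetical values chosen for illustration and are not part of the assessment.

```r
# Simulate the dice-based randomised response design described above.
# n and pi1_demo are hypothetical values chosen only for illustration.
set.seed(45)
n <- 10000
pi1_demo <- 0.2

latent <- rbinom(n, size = 1, prob = pi1_demo)   # 1 = committed fraud
dice   <- sample(1:6, size = n, replace = TRUE)  # hidden outcome of the throw

# Apply the three rules: 1 or 2 -> forced yes; 6 -> forced no;
# 3, 4 or 5 -> answer according to the latent status
observed <- ifelse(dice %in% c(1, 2), "yes",
            ifelse(dice == 6, "no",
            ifelse(latent == 1, "yes", "no")))

# Cross-tabulate the latent status against the observed answer
table(latent = latent, observed = observed)
```

The cross-tabulation shows that an observed answer alone never reveals the latent status of an individual respondent, which is the purpose of the design.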

(a) For this RR design, give the values of the conditional probabilities P(observed yes|latent yes) and P(observed no|latent no), where observed refers to the data collected, and latent refers to the unknown status regarding fraud. [7]

(b) The observed data are given by 300 yes-answers, and 500 no-answers. Estimate π₁ and calculate the standard error for this estimate. Explain your answers. [8]

Consider the RR design by Warner as specified on Slide 109. Define Yi = 1 when respondent i used illegal drugs, and Yi = 0 otherwise. Say there are two non-overlapping groups of respondents in the RR survey; Group A and Group B. Consider the logistic regression model

where xi = 0 when respondent i belongs to Group A, and xi = 1 when i belongs to Group B. Using the RR design, assume that the probability of observing a yes-response in Group A is estimated by λ̂A.

(c) Define an estimate of β0 as a function of λ̂A. Explain your derivation. [10]

Question 4

(a) Consider Theorem 2.3 in Chapter 2 of the lecture notes. Using the notation in Section 2.8.2 provide the final details of the proof; that is, show that

is indeed an unbiased estimator of Var(Ȳ_CL). Provide the details of your derivation. Do not explain the existing equations in the proof of Theorem 2.3 in the lecture notes, but be clear which of the equations you use in your derivation. [9]

There are ten schools in a particular area. As part of an investigation into teaching standards, an inspection team proposes to visit three of the schools and administer a test to all of the 14-year-old students in each school visited. The school sizes (in hundreds of pupils) are as follows:

School Size

(b) Three pseudo-random numbers, distributed uniformly on (0, 1) , have been obtained using R. They are 0.821, 0.228 and 0.307. Use these to select a PPS sample of three schools, explaining your procedure clearly. [8]

(c) Suppose that schools 4, 7 and 2 were selected (note that these are not necessarily the schools that would be chosen using the random numbers provided above), and that the average test results for these three schools were 14.5, 16.7 and 13.6 respectively. Use these data to estimate the average test result across all ten schools. Provide an estimated standard error for your estimate. [6]

Part B

For this part you are required to write a short report discussing aspects of data ethics for a given scenario.

The scenario: In the UK, housing benefit can help you pay your rent if you are unemployed or on a low income. If you receive benefit, then you need to report a change of circumstances for you and anyone else in your house. Examples of housing benefit fraud are not reporting all income or not reporting a change of income.

In a large city in the UK, the manager who deals with housing benefit in the city wants to use data science to help identify fraud.

The manager’s idea is to use data from past receivers of housing benefit who were investigated for fraud. Assume that for these people individual information is available on whether or not fraud was detected.

The manager envisages using the data to define a statistical prediction model and next to use this model to identify current receivers of housing benefit who are likely to commit fraud.

You are asked to lead this project. The main statistical parts of the project are: collecting relevant data, data analysis, defining a model that can be used for prediction, and using the model to make a prediction for current receivers of housing benefit.

Assume that the chosen prediction model is a logistic regression model for a binary response variable with value 1 for fraud and value 0 otherwise.

Instructions and guidelines for the report:

· Write a report that discusses the scenario with a focus on data ethics. Limit the scope of data ethics to the material that is discussed in STAT0045.

· You should explicitly use the following terms in the report (and reflect on the concepts attached to these terms): “data subject”, “model subject”, “fairness”, and “transparency”.

· You should discuss to some extent the importance of GDPR in this project and give at least one concrete example of a measure that you would implement to warrant that GDPR guidelines are followed. In the discussion of GDPR you should explicitly use the term “personal data” in the report.

· Assume the reader knows the logistic regression model; do not discuss standard aspects; for example, how the model is defined or how to estimate model parameters.

· Give the report a title.

· Type the report in a text editor and add the word count at the end of the report. Use font size 12.

· Write the report in paragraphs and complete sentences. Using a few bullet points is OK, but do not write the report as a list of bullet points.

· Maximum word count for the report (including the title) is 700 words. Reports longer than 700 words will be penalised.

· If you use an AI tool (see instructions), then use an appendix to acknowledge this use. This appendix does not count towards the maximum word count.

· You can add literature references to the report. References do not count towards the maximum word count. No need to add references to the STAT0045 course material.

Hints:

· There is no specific need to use AI tools for this report. Mind the danger of using AI tools; see the slides on Use of AI Tools in Chapter 1.

· Although it is fine to refer to literature beyond the course material, there is no specific need to do so.

· The aim of this assignment is to see whether you are able to critically reflect on aspects of data ethics in a practical scenario. Do not just enumerate definitions or aspects of data ethics, focus instead on some of the aspects and explain why they are important in this scenario.

· You are not asked to solve potential problems in this scenario, or provide details of specific actions. The report should focus on potential issues with respect to data ethics - not knowing how the issues can be addressed in detail is OK.

Marking criteria: adherence to the above instructions and guidelines, and the quality of the presentation (readability, structure, language). [100]
