STATS 763 (10/06/2021 17:30) Adv Regression Methodology (Exam)
Question 1: [Total: 30 marks]
The relationship between BMI and various other factors was investigated by Noh et al. (2016) using 2012 data from the Korean Longitudinal Study of Aging. Data on 7730 individuals aged 51 and over was available.
A BMI of 25 or more corresponds to an Overweight status, and a BMI of 30 or more to an Obese status. We are interested in the effect of Age on the Overweight and Obese indicators. The relationship between Age and BMI by biological Sex is depicted in the figure below.
a) [10 marks] We are interested in the effect of Age for each biological Sex (i.e. the effect of Age*Sex) on the probability of being Overweight. We assume that adjusting for unemployment status (Unemployed = 1) and Marital status accounts for all confounding.
The following model is fitted using all the data in data frame df; note the identity link:
mod.ow <- glm(Overweight~Age*Sex+Unemployed+`Marital status`, data=df, family=quasibinomial(link="identity"))
Model summary:

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   0.5593657  0.0504992  11.077  < 2e-16 ***
Age                          -0.0055935  0.0007615  -7.346 2.27e-13 ***
Sexfemale                    -0.2507854  0.0673329  -3.725 0.000197 ***
UnemployedYes                 0.0527118  0.0113543   4.642 3.50e-06 ***
`Marital status`Not married  -0.0166579  0.0133623  -1.247 0.212573
Age:Sexfemale                 0.0042259  0.0009950   4.247 2.19e-05 ***

(Dispersion parameter for quasibinomial family taken to be 1.00081)

Null deviance: 7846.2 on 7229 degrees of freedom
Residual deviance: 7777.5 on 7224 degrees of freedom
Estimated variance matrix

                                                  Unemployed  `Marital status`
               (Intercept)        Age  Sexfemale         Yes       Not married  Age:Sexfemale
(Intercept)       2.55e-03  -3.79e-05  -2.47e-03    1.56e-04          9.96e-06       3.50e-05
Age              -3.79e-05   5.80e-07   3.60e-05   -3.16e-06         -4.02e-07      -5.18e-07
Sexfemale        -2.47e-03   3.60e-05   4.53e-03   -1.01e-04          1.92e-04      -6.61e-05
UnemployedYes     1.56e-04  -3.16e-06  -1.01e-04    1.29e-04         -1.93e-06       1.02e-06
Not married       9.96e-06  -4.02e-07   1.92e-04   -1.93e-06          1.79e-04      -3.52e-06
Age:Sexfemale     3.50e-05  -5.18e-07  -6.61e-05    1.02e-06         -3.52e-06       9.90e-07

Describe precisely the estimated effect of a 10-year increase in Age in a female, unemployed, married person on the probability of being Overweight according to the above model. In your answer, include an estimate, a Wald 95% confidence interval and a p-value testing the null hypothesis that the effect is 0.
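The calculation this part calls for can be sketched in R as follows. Under the identity link, the Age slope for a female is the sum of the Age and Age:Sexfemale coefficients, and the Unemployed and Marital status terms cancel when two ages are compared within the same person profile. The sketch uses the rounded values printed above and a standard normal reference distribution, which is an assumption (the printed summary reports t statistics, although with 7224 residual degrees of freedom the difference is negligible). The helper name wald_sum is hypothetical.

```r
# Coefficient estimates and covariance entries copied from the printed output.
b_age     <- -0.0055935   # Age
b_age_sex <-  0.0042259   # Age:Sexfemale
v_age     <-  5.80e-07    # Var(Age)
v_age_sex <-  9.90e-07    # Var(Age:Sexfemale)
c_age_sex <- -5.18e-07    # Cov(Age, Age:Sexfemale)

# Hypothetical helper: Wald inference for scale * (b1 + b2).
wald_sum <- function(b1, b2, v1, v2, c12, scale = 1, level = 0.95) {
  est  <- scale * (b1 + b2)
  # Var(b1 + b2) = Var(b1) + Var(b2) + 2 Cov(b1, b2)
  se   <- abs(scale) * sqrt(v1 + v2 + 2 * c12)
  crit <- qnorm(1 - (1 - level) / 2)   # normal approximation (assumption)
  stat <- est / se
  list(estimate = est,
       se       = se,
       ci       = est + c(-1, 1) * crit * se,
       p_value  = 2 * pnorm(-abs(stat)))
}

# Effect of a 10-year increase in Age for a female.
res <- wald_sum(b_age, b_age_sex, v_age, v_age_sex, c_age_sex, scale = 10)
print(res)
```

The estimate is a change in probability (a risk difference), so it is reported on the probability scale rather than as an odds ratio.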
b) [3 marks] We fit a model as above but without the Unemployed variable. The (quasi-)deviance difference is 19.6 with a p-value of 9.6 × 10^-6; the Age estimate is -0.0041 with an s.e. of 0.0007 and p-value of 1.4 × 10^-8, and the Age:Sex estimate is 0.0036 with an s.e. of 0.001 and p-value of 0.003. What among these statistics, if anything, is consistent with Unemployed being a confounder of the relationship between Age*Sex and Overweight?
c) [5 marks] There are 98 individuals classified as Obese in the data. We form a new data set newdf with these 98 observations and a random sample of 98 non-Obese individuals from the rest of the data.
Explain in detail how you can fit a quasibinomial generalised linear model to these data (with an identity link) to obtain unbiased risk difference estimates for the effects of the Age by Sex interaction, or explain in detail why it is not possible to do so.
d) [Total: 12 marks] Consider the following fitted model and derived objects:
mod.new <- glm(Overweight~Age*Sex+Unemployed+`Marital status`, family=quasi(variance = mu^2,link="log"),data=df)
# Derived objects:
X <- model.matrix(mod.new) # model matrix
n <- nrow(X) # number of observations
p <- ncol(X) # number of linear parameters
muhat <- fitted(mod.new) # fitted values
k <- df$Overweight-muhat # ?
q <- k^2/muhat^2 # ?
z <- sum(q)/(n-p) # ?
i. [1 mark] How can we interpret the fitted parameters from this model?
ii. [3 marks] By what names do we usually call the objects k, q and z?
iii. [4 marks] Express the naive variance estimate (corresponding to vcov(mod.new)) only in terms of the derived objects above.
iv. [4 marks] Write down an expression for a sandwich estimator of the variance in terms of the derived objects above.
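One way to explore parts iii and iv is to rebuild the derived objects on simulated data and compare them with what glm() reports. The sketch below is an illustration under stated assumptions, not a marking scheme: the data frame sim, its coefficients and the seed are invented, and the covariates are reduced to Age and Sex. It relies on the fact that with a log link and variance function mu^2 the working weights equal 1, so the model-based variance reduces to z times the inverse of X'X. The sandwich shown is the basic form with no small-sample correction.

```r
set.seed(763)  # arbitrary seed for the simulated illustration

# Simulated stand-in for df (hypothetical values, not the study data).
n_sim <- 400
sim <- data.frame(
  Age = runif(n_sim, 51, 90),
  Sex = factor(sample(c("male", "female"), n_sim, replace = TRUE),
               levels = c("male", "female"))
)
sim$Overweight <- rbinom(n_sim, 1, exp(-1 - 0.01 * (sim$Age - 51)))

fit <- glm(Overweight ~ Age * Sex, data = sim,
           family = quasi(link = "log", variance = "mu^2"))

# Same derived objects as in the question.
X     <- model.matrix(fit)
n     <- nrow(X)
p     <- ncol(X)
muhat <- fitted(fit)
k     <- sim$Overweight - muhat
q     <- k^2 / muhat^2
z     <- sum(q) / (n - p)

# Model-based ("naive") variance: z * (X'X)^{-1}, since the working weights are 1.
bread <- solve(crossprod(X))
naive <- z * bread

# Basic sandwich: (X'X)^{-1} X' diag(q) X (X'X)^{-1}.
meat     <- crossprod(X * sqrt(q))   # scales row i of X by sqrt(q[i])
sandwich <- bread %*% meat %*% bread

# The naive form should agree with vcov() up to numerical error.
print(all.equal(naive, vcov(fit), check.attributes = FALSE, tolerance = 1e-6))
```

Comparing sqrt(diag(naive)) with sqrt(diag(sandwich)) shows how much the standard errors depend on the assumed variance function.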
Question 2: [Total: 30 marks]
In April 2020, during New Zealand’s Level 4 Covid lockdown, the number of cars on the road dropped to about 15% of its usual value for April, and the number of road deaths was 12, compared to an average in recent years of 32 during April.
a) [5 marks] Given daily data on traffic (in millions of driver-km), deaths, and an indicator for lockdown, give a call to glm() specifying a suitable model for estimating the effect of lockdown on the rate of deaths per million driver-km.
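A rate model of this kind is commonly written as a log-linear count model with the logarithm of exposure as an offset. The sketch below shows one such call on simulated data. The data frame roads, its column names, the baseline rate and the lockdown effect of 0.5 on the log scale are all arbitrary assumptions made for the illustration, and quasipoisson is one possible family choice among several (it allows for over-dispersion in the daily counts).

```r
set.seed(2020)  # arbitrary seed for the simulated illustration

# Simulated daily data (hypothetical values, not real New Zealand records).
days  <- 120
roads <- data.frame(lockdown = rep(c(0, 1), times = c(90, 30)))
# Traffic in millions of driver-km; much lower during lockdown.
roads$traffic <- ifelse(roads$lockdown == 1, 18, 120) * runif(days, 0.9, 1.1)
# Deaths per million driver-km; the 0.5 log rate ratio is arbitrary.
true_rate    <- 0.009 * exp(0.5 * roads$lockdown)
roads$deaths <- rpois(days, true_rate * roads$traffic)

# Log link with offset(log(traffic)): the linear predictor models
# log(expected deaths / traffic), i.e. the log rate.
fit_rate <- glm(deaths ~ lockdown + offset(log(traffic)),
                family = quasipoisson(link = "log"), data = roads)

# exp(coefficient) is the estimated rate ratio, lockdown vs. no lockdown.
rate_ratio <- exp(coef(fit_rate)["lockdown"])
print(rate_ratio)
```

Because the offset has a fixed coefficient of 1, the lockdown coefficient compares deaths per unit of traffic rather than raw death counts.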
b) [3 marks] Based on the information given, will the coefficient for lockdown in your model be positive or negative?
c) [12 marks] One explanation for the difference in rates during lockdown is that reducing traffic allows drivers to go faster. Another is that reducing traffic makes drivers less careful. And another is that essential workers, who were more likely to be driving at that time, are more likely to be young, and young people are less safe drivers. Draw a causal graph that allows for these possible relationships.
d) [5 marks] Assuming you have data on the average age and the average speed (in addition to traffic density, deaths, and lockdown status), describe how you would estimate the proportion of the lockdown effect on the rate of deaths per million driver-km that acts through traffic density.
e) [5 marks] If traffic density is measured with error that has approximately zero mean, what will be the effect on your estimate in the previous part of this question?
Question 3: [Total: 30 marks]
You are asked to build a predictive model for house prices, using a set of prices paid in Auckland during the first half of 2021, and information on the property itself (e.g. size, house size, rooms, bathrooms, building age, type of heating and cooling, kitchen age) and the neighbourhood (e.g. schools, average income, ethnicity distribution, public transport availability, distance to shops, travel time to city centre, crime rate, age distribution). Auckland real estate prices are the highest in the country, and are expected to increase further.
a) [3 marks] Why would using all the variables in a predictive model not be expected to give the lowest prediction error for new data?
b) [8 marks] Describe two ways of regularising the model that might be expected to give better predictions.
c) [9 marks] Explain one way of obtaining an approximately unbiased estimate of the prediction error from one of these model selection strategies.
d) [5 marks] Suppose the model is applied to predict real estate prices in Auckland in the second half of 2021. What systematic differences, if any, would you expect between the model predictions and the actual prices? Explain.
e) [5 marks] Suppose the model is applied to predict real estate prices in Hamilton in the second half of 2021. What systematic differences, if any, would you expect between the model predictions and the actual prices? Explain.