STA303 Winter 2025 Practice Exam
Wednesday 26 March 2025
1 Short answer
For each of the statistical problems below (itemized with letters), assign a phrase from the second list (itemized with numbers). The second (numbered) list is longer than the first, so some items in the second list do not have a partner in the first.
Problems:
a. How do you evaluate the conditional density and likelihood of a generalized linear mixed model?
b. Why do matched case control studies work?
c. How can I fit a GLM?
d. I want a CI for a parameter when the MLE is on a boundary. A score test can be used here; how does a score test work?
e. Why are spline functions (B-splines, cubic splines or thin plate splines) used in generalized additive models?
Solutions:
1. The true parameter should be close to the MLE, so the derivative evaluated at the true parameter should be close to zero.
2. Conditioning on the total within a stratum, independent Poissons conditional on sums are multinomial, normalizing Poisson means makes some unwanted terms cancel.
3. Taking a second order Taylor series expansion evaluated at the maximum, first derivative is zero, log density is quadratic, treat things as Gaussian.
4. It is a method for finding patterns in data and using them to make predictions or decisions without being explicitly programmed for each specific task.
5. Fitting a model to your data, then repeatedly simulating new datasets from that model to estimate the variability or uncertainty of your estimates.
6. Set model parameters so that the expected moments calculated from the model are the same as the average moments from the dataset.
7. A smooth curve or integrated Wiener process can be approximated by a linear combination of basis functions.
8. Writing a function to evaluate the likelihood, using automatic differentiation (e.g. with TMB) to get derivatives, using a numerical optimizer to find the maximum.
2 Roads
Figure 1: Monthly deaths of passengers on motorcycles in the UK
Below are data on road accidents in the UK from
https://data.gov.uk/dataset/road-accidents-safety-data,
which contains 10.5 million individual records. In this analysis, cases are defined as passengers on motorcycles (not the drivers) killed in a traffic accident. Figure 1 shows the number of monthly deaths over time, separately for males (M) and females (F). It is hypothesized that most motorcycle passengers are driven by their partner or spouse, who is in most cases someone of the opposite sex (so male passengers were typically riding behind female drivers, and vice versa). Deaths have been falling, which is believed to be due to better safety technology in road vehicles.
The bikes object is a data.frame with the numbers of deaths for Males and Females. The date column gives the month (more precisely, the first day of the month) corresponding to the death count.
bikes[1:4, ]
date Male Female
1 1979-01-01 0 0
2 1979-02-01 4 1
3 1979-03-01 5 1
4 1979-04-01 10 3
Here’s a bunch of code to format time variables. bikes$date is a Date object, which stores dates as the number of days since 1 January 1970.
bikes$Ndays = Hmisc::monthDays(bikes$date)
bikes$logDays = log(bikes$Ndays)
bikes$timeNumeric = as.numeric(bikes$date)
bikes$timeScaled = (2 * pi/365.25) * bikes$timeNumeric
bikes$sin12 = sin(bikes$timeScaled)
bikes$cos12 = cos(bikes$timeScaled)
bikes$sin6 = sin(2 * bikes$timeScaled)
bikes$cos6 = cos(2 * bikes$timeScaled)
bikes[1:3, ]
date Male Female Ndays logDays timeNumeric timeScaled sin12
1 1979-01-01 0 0 31 3.433987 3287 56.54437 -0.004300593
2 1979-02-01 4 1 28 3.332205 3318 57.07764 0.504648295
3 1979-03-01 5 1 31 3.433987 3346 57.55931 0.847173337
cos12 sin6 cos6
1 0.9999908 -0.008601106 0.9999630
2 0.8633250 0.871351004 0.4906602
3 0.5313166 0.900234525 -0.4354053
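As a quick check of the time variables above, the sketch below reproduces the first timeNumeric value and confirms that a sine term built this way repeats every 365.25 days, while the doubled-frequency term repeats every half year. The names dayOne and scaleTime are illustrative and not part of the original analysis.

```r
# Dates are stored as the number of days since 1 January 1970
dayOne = as.numeric(as.Date("1979-01-01"))
dayOne  # 3287, matching timeNumeric in the first row of bikes

# hypothetical helper: convert days to radians so one year is one full cycle
scaleTime = function(days) (2 * pi / 365.25) * days

# 12-month term: shifting by one year leaves the value unchanged
all.equal(sin(scaleTime(dayOne)), sin(scaleTime(dayOne + 365.25)))

# 6-month term: shifting by half a year leaves the value unchanged
all.equal(sin(2 * scaleTime(dayOne)), sin(2 * scaleTime(dayOne + 365.25 / 2)))
```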
The following code has been used to fit a model to the data.
library("mgcv")
resM = gam(Male ~ cos12 + sin12 + cos6 + sin6 + offset(logDays) +
s(timeNumeric, k=100),
data=bikes, family=nb(link="log"))
resF = gam(Female ~ cos12 + sin12 + cos6 + sin6 + offset(logDays) +
s(timeNumeric, k=100),
data=bikes, family=nb(link="log"))
1. Write down a statistical model corresponding to the model fit and explain the terms and variables in the model. Write in human language and mathematics, not R.
2. Write down a sub-caption for each panel (from a to f) and an overall caption for Figure 2, using (where appropriate) mathematical notation from your model description.
3. Write a paragraph summarizing what you have learned about trends in motorcycle accidents from this analysis. Use non-technical language and refer to figures and tables as appropriate.
Code for Figure 2
bsMat = predict(resM, type = "lpmatrix")
simCoefM = mvtnorm::rmvnorm(20, coef(resM), vcov(resM))
seasonVar = grep("^(sin|cos)", colnames(simCoefM))
simCoefF = mvtnorm::rmvnorm(20, coef(resF), vcov(resF))
matplot(bikes$date, exp(tcrossprod(bsMat, simCoefM)), type = "l",
lty = 1, log = "y")
matplot(bikes$date, exp(tcrossprod(bsMat, simCoefF)), type = "l",
lty = 1, log = "y")
matplot(bikes$date, exp(tcrossprod(bsMat[, -seasonVar],
simCoefM[, -seasonVar])), type = "l", lty = 1, log = "y")
matplot(bikes$date, exp(tcrossprod(bsMat[, -seasonVar],
simCoefF[, -seasonVar])), type = "l", lty = 1, log = "y")
matplot(bikes$date, exp(tcrossprod(bsMat[, seasonVar], simCoefM[,
seasonVar])), xlim = as.Date(c("2009/12/15", "2011/1/15")),
type = "l", lty = 1, log = "y")
matplot(bikes$date, exp(tcrossprod(bsMat[, seasonVar], simCoefF[,
seasonVar])), xlim = as.Date(c("2009/12/15", "2011/1/15")),
type = "l", lty = 1, log = "y")
Figure 2: Stuff
3 Smoking
The 2014 American National Youth Tobacco Survey was analyzed to explore how urban/rural differences, age and sex affect smoking tobacco from a pipe. The research hypotheses to be investigated using this survey are as follows.
1. The group of high school students most likely to smoke tobacco from a hookah or waterpipe is older white boys in urban areas; and
2. Urban girls of any age are much less likely to have smoked from a waterpipe than urban boys of a comparable age.
Your task is to fill in the “methods” and “results” sections of a short report on smoking waterpipes in high school. I asked Deepseek to write a one-paragraph introduction and it said:
The use of waterpipes (also known as hookahs) for smoking tobacco has become an increasing public health concern, particularly among adolescents. While cigarette smoking has declined in some youth populations, waterpipe use remains prevalent, potentially due to misconceptions about its safety and social acceptability. Understanding demographic patterns in waterpipe smoking (such as differences by urban/rural location, age, and sex) is critical for targeted prevention efforts.
This report analyzes data from the 2014 National Youth Tobacco Survey (NYTS) to investigate two key hypotheses: (1) that older white boys in urban areas are the most likely subgroup to smoke tobacco from a waterpipe, and (2) that urban girls are significantly less likely to engage in this behavior compared to urban boys of the same age. By examining these trends, the report aims to inform policies and interventions aimed at reducing youth waterpipe use.
Your task is to write a “methods” and “results” section as follows:
• a couple of paragraphs on ‘methods’ giving the statistical models used (in mathematical notation, not R syntax) and explaining why they are appropriate (or why they are not if you believe that to be the case); and
• a ‘results’ paragraph or two where the results are described and interpreted. You can refer to imaginary tables and figures; if you do, write captions for them after your text.
I’ll ask Deepseek to write a short conclusion, then send your report off to the US Department of Education with an invoice for a $250k consulting fee. I’ll give you 10% of whatever they end up paying me.
The report will be assessed in terms of:
• clarity of presentation,
• demonstration of an understanding of the statistical models used, and
• drawing conclusions which are consistent with the analysis.
You may refer to the output below, if you deem it to be useful. The variable ever_hookah_or_waterpipe is coded as true if the respondent had ever smoked tobacco from a hookah or water pipe. The baseline race is white.
smokeSub = as.data.frame(smoke[smoke$Age > 10 & !is.na(smoke$Race),
])
smokeSub$ageFac = factor(smokeSub$Age)
smokeSub$Sex = relevel(smokeSub$Sex, "F")
smokeSub[1:3, c("Sex", "ageFac", "Race", "RuralUrban", "ever_hookah_or_waterpipe")]
Sex ageFac Race RuralUrban ever_hookah_or_waterpipe
1 M 16 hispanic Rural FALSE
2 M 14 hispanic Rural FALSE
3 M 14 hispanic Rural FALSE
library(glmmTMB)
resSmoke = glmmTMB(ever_hookah_or_waterpipe ~ ageFac * Sex +
Sex * RuralUrban + Race + (1 | school), data = smokeSub,
family = binomial())
knitr::kable(confint(resSmoke, full = TRUE), digits = 2)
Use the predict function to get the probability, in percent, that a 15 year old has smoked a hookah by sex/urban group for the baseline race. The argument re.form=NA specifies that random effects are not included in the predictions.
toPredictR = na.omit(expand.grid(Sex = unique(smokeSub$Sex), RuralUrban = unique(smokeSub$RuralUrban), ageFac = "15",
Race = "white"))
smokePredR = predict(resSmoke, toPredictR, se.fit = TRUE,
re.form = NA)
smokePredR = cbind(toPredictR[, c("Sex", "RuralUrban")],
100 * exp(data.frame(est = smokePredR$fit, lower = smokePredR$fit -
2 * smokePredR$se.fit, upper = smokePredR$fit +
2 * smokePredR$se.fit)))
knitr::kable(smokePredR, digits = 2)
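The predictions above are exponentiated before being reported in percent. Assuming predict returns values on the logit (link) scale for this binomial model, exp gives odds rather than probabilities, and the two are close only when the probability is small. The sketch below uses made-up linear predictor values (not output from resSmoke) to compare the two transformations; the names eta, prob and odds are illustrative.

```r
# hypothetical linear predictor values on the logit scale (not from resSmoke)
eta = qlogis(c(0.02, 0.10, 0.30))

# the inverse logit gives probabilities; exp gives odds, p / (1 - p)
prob = plogis(eta)
odds = exp(eta)

# compare the two in percent: they diverge as the probability grows
round(100 * cbind(prob, odds), 1)
```

If probabilities are wanted exactly rather than approximately, plogis can be applied to the estimate and to each interval limit in place of exp.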
Estimate and plot probability of using a pipe as a function of age.
toPredictAge = na.omit(expand.grid(Sex = unique(smokeSub$Sex), ageFac = unique(smokeSub$ageFac), RuralUrban = "Urban",
Race = "white"))
smokePredAge = predict(resSmoke, toPredictAge, se.fit = TRUE,
re.form = NA)
smokePredAge = data.frame(age = toPredictAge$ageFac, sex = toPredictAge$Sex,
est = smokePredAge$fit, lower = smokePredAge$fit - 2 *
smokePredAge$se.fit, upper = smokePredAge$fit +
2 * smokePredAge$se.fit)
The smokePredAge data frame is reformatted and 100*exp(smokePredAge) is plotted with matplot to produce the graph below.
A likelihood ratio test
resSmoke2 = glmmTMB(ever_hookah_or_waterpipe ~ ageFac *
Sex + Sex + RuralUrban + Race + (1 | school), data = smokeSub,
family = binomial())
lmtest::lrtest(resSmoke, resSmoke2)
Likelihood ratio test
Model 1: ever_hookah_or_waterpipe ~ ageFac * Sex + Sex * RuralUrban + Race + (1 | school)
Model 2: ever_hookah_or_waterpipe ~ ageFac * Sex + Sex + RuralUrban +
Race + (1 | school)
  #Df  LogLik Df  Chisq Pr(>Chisq)
1  26 -4187.0
2  25 -4187.1 -1 0.0094     0.9226
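The p-value in the last row can be reproduced from the reported statistic, assuming the usual chi-squared reference distribution with degrees of freedom equal to the difference in the number of parameters (here 26 - 25 = 1). The values below are copied from the rounded output, so the result agrees with the table only to about three decimal places; the names chisqStat, dfDiff and pValue are illustrative.

```r
# values copied from the rounded lrtest output above
chisqStat = 0.0094
dfDiff = 26 - 25

# upper tail probability of a chi-squared distribution with dfDiff degrees of freedom
pValue = pchisq(chisqStat, df = dfDiff, lower.tail = FALSE)
pValue
```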