Final Project
Stat 428
In the lecture, we discussed the distance correlation for the independence testing problem. Suppose the observed data are $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_i \in \mathbb{R}^{d_X}$ and $Y_i \in \mathbb{R}^{d_Y}$ are multivariate random vectors. Here, $(X_1, Y_1), \ldots, (X_n, Y_n)$ are drawn from a joint distribution $F$. The marginal distribution of $X$ is $F_X$ and the marginal distribution of $Y$ is $F_Y$. The hypothesis of interest in the independence testing problem is
$$H_0: F = F_X F_Y \quad \text{vs.} \quad H_1: F \neq F_X F_Y.$$
Besides the distance correlation test, we also consider three other independence tests: the Hilbert-Schmidt independence criterion test, the sum of rank correlations test, and the maxima of rank correlations test.
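• The Hilbert-Schmidt independence criterion (HSIC) test statistic is defined as
$$\mathrm{HSIC} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} B_{ij},$$
where $A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$ and $B_{ij} = b_{ij} - \bar{b}_{i\cdot} - \bar{b}_{\cdot j} + \bar{b}_{\cdot\cdot}$. Here $a_{ij} = k_X(X_i, X_j)$ and $b_{ij} = k_Y(Y_i, Y_j)$.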
Here $k_X(\cdot, \cdot)$ and $k_Y(\cdot, \cdot)$ are two kernels. For example, a commonly used kernel is the Gaussian kernel
$$k(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right).$$
• The sum of rank correlations test statistic is defined as
$$\mathrm{SRC} = \sum_{l=1}^{d_X} \sum_{m=1}^{d_Y} \rho_{l,m}^2,$$
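where $\rho_{l,m}$ is Spearman's rank correlation coefficient between the $l$th coordinate of $X$ and the $m$th coordinate of $Y$. Spearman's rank correlation coefficient between $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$ is defined as
$$\rho = \frac{\sum_{i=1}^{n} (R_{x,i} - \bar{R}_x)(R_{y,i} - \bar{R}_y)}{\sqrt{\sum_{i=1}^{n} (R_{x,i} - \bar{R}_x)^2 \sum_{i=1}^{n} (R_{y,i} - \bar{R}_y)^2}},$$
where $R_{x,i}$ is the rank of $x_i$ among $x_1, \ldots, x_n$, $R_{y,i}$ is the rank of $y_i$ among $y_1, \ldots, y_n$, and $\bar{R}_x$ and $\bar{R}_y$ are the means of $R_{x,1}, \ldots, R_{x,n}$ and $R_{y,1}, \ldots, R_{y,n}$, respectively.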
• The maxima of rank correlations test statistic is defined as
$$\mathrm{MRC} = \max_{1 \le l \le d_X,\ 1 \le m \le d_Y} |\rho_{l,m}|,$$
where $\rho_{l,m}$ is Spearman's rank correlation coefficient between the $l$th coordinate of $X$ and the $m$th coordinate of $Y$.
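The three statistics defined above can be sketched in R as follows. This is only a minimal illustration under stated assumptions: the helper names (`gauss.kernel`, `double.center`, `hsic.stat`, `spearman.matrix`, `src.stat`, `mrc.stat`) are hypothetical, the Gaussian kernel is assumed to use the $2\sigma^2$ scaling, HSIC is assumed to use the $1/n^2$ normalization, and the rank statistics are assumed to aggregate squared and absolute Spearman correlations, respectively.

```r
# Gaussian kernel matrix for the rows of x; sigma2 is the bandwidth parameter.
# Assumed form: k(u, v) = exp(-||u - v||^2 / (2 * sigma2)).
gauss.kernel <- function(x, sigma2 = 1) {
  d2 <- as.matrix(dist(x))^2            # squared Euclidean distances
  exp(-d2 / (2 * sigma2))
}

# Double centering: A_ij = a_ij - rowmean_i - colmean_j + grandmean.
double.center <- function(a) {
  sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
}

# HSIC statistic from two precomputed kernel matrices (assumed 1/n^2 scaling).
hsic.stat <- function(kernelx, kernely) {
  n <- nrow(kernelx)
  sum(double.center(kernelx) * double.center(kernely)) / n^2
}

# Matrix of Spearman correlations: entry (l, m) pairs column l of x
# with column m of y.
spearman.matrix <- function(x, y) {
  rx <- apply(x, 2, rank)               # rank each coordinate separately
  ry <- apply(y, 2, rank)
  cor(rx, ry)                           # Pearson correlation of the ranks
}

# Assumed aggregations: sum of squares and maximum absolute value.
src.stat <- function(x, y) sum(spearman.matrix(x, y)^2)
mrc.stat <- function(x, y) max(abs(spearman.matrix(x, y)))
```

The rank-based helper computes the Pearson correlation of the ranks, which matches the displayed definition of Spearman's coefficient; it can be checked against `cor(x, y, method = "spearman")`.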
You need to submit both the Rmd and PDF files for Questions 1-4. Do NOT zip them together, as zip files cannot be previewed in Canvas. You may be penalized if the wrong format is submitted.
Question 1 Test Implementation (15 points)
In this question, you are required to implement these four independence test methods from scratch: the distance correlation test, the Hilbert-Schmidt independence criterion test, the sum of rank correlations test, and the maxima of rank correlations test. Specifically, you need to implement two functions for each method: one computes the test statistic; the other makes the decision using a permutation test.
• For the distance correlation test, you need to implement DCOR(distx, disty) and DCOR.perm(distx, disty, alpha, B), where:
– distx is the distance matrix of X,
– disty is the distance matrix of Y,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
DCOR(distx, disty) returns the value of the test statistic and DCOR.perm(distx, disty, alpha, B) returns the decision on whether the null hypothesis is rejected.
• For the Hilbert-Schmidt independence criterion test, you need to implement HSIC(kernelx, kernely) and HSIC.perm(kernelx, kernely, alpha, B), where:
– kernelx is the kernel matrix of X,
– kernely is the kernel matrix of Y,
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
HSIC(kernelx, kernely) returns the value of the test statistic and HSIC.perm(kernelx, kernely, alpha, B) returns the decision on whether the null hypothesis is rejected.
• For the sum of rank correlations test, you need to implement SRC(x, y) and SRC.perm(x, y, alpha, B), where:
– x is the data matrix of X (each row is an observation),
– y is the data matrix of Y (each row is an observation),
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
SRC(x, y) returns the value of the test statistic and SRC.perm(x, y, alpha, B) returns the decision on whether the null hypothesis is rejected.
• For the maxima of rank correlations test, you need to implement MRC(x, y) and MRC.perm(x, y, alpha, B), where:
– x is the data matrix of X (each row is an observation),
– y is the data matrix of Y (each row is an observation),
– alpha is the significance level,
– and B is the number of replicates in the permutation test.
MRC(x, y) returns the value of the test statistic and MRC.perm(x, y, alpha, B) returns the decision on whether the null hypothesis is rejected.
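As an illustration of the statistic/decision pairing described above, here is a minimal sketch for the distance correlation test. It assumes the sample distance correlation is computed from double-centered distance matrices, that the permutation step permutes the observations of Y (rows and columns of disty together), and that the p-value uses the common (1 + count) / (B + 1) convention; none of these details are fixed by the assignment text. The other three tests could follow the same pattern with their own statistics.

```r
# Sample distance correlation from two n x n distance matrices.
DCOR <- function(distx, disty) {
  center <- function(a) sweep(sweep(a, 1, rowMeans(a)), 2, colMeans(a)) + mean(a)
  A <- center(distx)
  B <- center(disty)
  dcov2 <- max(mean(A * B), 0)          # squared distance covariance (guard rounding)
  dvarx <- mean(A * A)                  # squared distance variance of X
  dvary <- mean(B * B)                  # squared distance variance of Y
  sqrt(dcov2 / sqrt(dvarx * dvary))
}

# Permutation decision: returns TRUE when H0 (independence) is rejected.
DCOR.perm <- function(distx, disty, alpha, B) {
  n <- nrow(distx)
  t.obs <- DCOR(distx, disty)
  t.perm <- replicate(B, {
    idx <- sample(n)                    # random relabeling of the Y observations
    DCOR(distx, disty[idx, idx])
  })
  p.value <- (1 + sum(t.perm >= t.obs)) / (B + 1)
  p.value <= alpha
}

# Usage on simulated, strongly dependent data.
set.seed(428)
x <- matrix(rnorm(100), 50, 2)
y <- x + matrix(rnorm(100, sd = 0.1), 50, 2)
dx <- as.matrix(dist(x))
dy <- as.matrix(dist(y))
decision <- DCOR.perm(dx, dy, alpha = 0.05, B = 99)
```

Permuting the index of Y breaks any dependence between X and Y while keeping both marginal distributions intact, which is what makes the permuted statistics a reference distribution under the null hypothesis.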
Question 2 Choice of Tuning Parameter and Distance (10 points)
Several parts of these four tests can be customized. In this question, you need to use simulation experiments to make recommendations for the choices of tuning parameters and distances. Specifically, we consider the following tuning parameters and distances:
• In the distance correlation test, we can consider different distances. In particular, we can consider the $\ell_p$ distance
$$\|x - y\|_p = \left( \sum_{k=1}^{d} |x_k - y_k|^p \right)^{1/p}.$$
What $p$ should we use?
• In the Hilbert-Schmidt independence criterion test, we can use the Gaussian kernel with different choices of $\sigma^2$. How should we choose $\sigma^2$?
You need to show some numerical experiments as your evidence.
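One possible way to set up the candidate distances and bandwidths for such experiments is sketched below. The helper names are hypothetical, the Gaussian kernel is assumed to use the $2\sigma^2$ scaling, and the median heuristic for $\sigma^2$ is a commonly used rule of thumb brought in here as an assumption, not something the assignment prescribes.

```r
# l_p distance matrix of the rows of x; dist() calls this the Minkowski distance.
lp.dist <- function(x, p = 2) as.matrix(dist(x, method = "minkowski", p = p))

# Gaussian kernel matrix from a Euclidean distance matrix (assumed 2 * sigma2 scaling).
gauss.from.dist <- function(d, sigma2) exp(-d^2 / (2 * sigma2))

# Median heuristic (an assumption): sigma2 = median squared pairwise distance.
median.sigma2 <- function(d) median(d[upper.tri(d)]^2)

set.seed(428)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)

# Candidate p values and the corresponding distance matrices.
p.grid <- c(1, 2, 3)
dist.list <- lapply(p.grid, function(p) lp.dist(x, p))

# Candidate sigma2 values centered on the median heuristic.
d2 <- lp.dist(x, 2)
s2.grid <- median.sigma2(d2) * c(0.1, 1, 10)
kern.list <- lapply(s2.grid, function(s2) gauss.from.dist(d2, s2))
```

Each element of `dist.list` or `kern.list` can then be passed to the corresponding permutation test, and the rejection rates compared across the grid.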
Question 3 Test Comparisons (15 points)
In this question, you are required to use simulation experiments to make recommendations for choosing among these four independence tests. In particular, you need to answer the following questions:
• Which test is more suitable for low-dimensional data sets (i.e., $d_X$ and $d_Y$ are small)? Which is better for high-dimensional data sets (i.e., $d_X$ and $d_Y$ are large)?
• Which test is more sensitive to different choices of the specific distributions $F$, $F_X$, and $F_Y$?
• Are these tests able to control the type I error?
• Which test is more powerful?
• Which test is more computationally efficient?
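A simulation study for these questions could be organized around a small helper that estimates the rejection rate of a test, as in the sketch below. All names here are hypothetical, the two data generators are only illustrative choices, and `toy.test` is a stand-in (Spearman's test on the first coordinates via `cor.test`) so that the sketch runs on its own; in practice it would be replaced by wrappers around the four permutation tests.

```r
# Estimate the rejection rate of a test over n.sim simulated data sets.
# test.fun(x, y) must return TRUE when H0 is rejected;
# gen.fun(n) must return list(x = ..., y = ...).
rejection.rate <- function(test.fun, gen.fun, n, n.sim = 200) {
  mean(replicate(n.sim, {
    dat <- gen.fun(n)
    test.fun(dat$x, dat$y)
  }))
}

# Illustrative generator under H0: X and Y independent standard normals.
gen.null <- function(n, dx = 2, dy = 2) {
  list(x = matrix(rnorm(n * dx), n, dx), y = matrix(rnorm(n * dy), n, dy))
}

# Illustrative generator under H1: first coordinates linearly related.
gen.alt <- function(n, dx = 2, dy = 2) {
  x <- matrix(rnorm(n * dx), n, dx)
  y <- matrix(rnorm(n * dy), n, dy)
  y[, 1] <- x[, 1] + y[, 1]
  list(x = x, y = y)
}

# Stand-in test so the sketch is self-contained.
toy.test <- function(x, y, alpha = 0.05) {
  cor.test(x[, 1], y[, 1], method = "spearman")$p.value <= alpha
}

set.seed(428)
type1 <- rejection.rate(toy.test, gen.null, n = 50)   # estimated type I error
power <- rejection.rate(toy.test, gen.alt, n = 50)    # estimated power
elapsed <- system.time(rejection.rate(toy.test, gen.null, n = 50))["elapsed"]
```

Varying `dx`, `dy`, `n`, and the generators addresses the dimension and distribution questions, while `system.time` gives a rough comparison of computational cost.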
Question 4 Application to Real Data Set (10 points)
We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). There are 10 variables measured on these 97 men:
1. lpsa: log PSA score
2. lcavol: log cancer volume
3. lweight: log prostate cancer weight
4. age: age of patient
5. lbph: log of the amount of benign prostatic hyperplasia
6. svi: seminal vesicle invasion
7. lcp: log of capsular penetration
8. gleason: Gleason score
9. pgg45: percent of Gleason scores 4 or 5
10. train: whether the observation belongs to the training data set
To load this prostate cancer data set and store it as a data frame pros.data, we can do the following:
pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")
Based on this data set, we are interested in whether (log PSA score, log cancer volume, log prostate cancer weight) is independent of (age of patient, log of the amount of benign prostatic hyperplasia, log of capsular penetration). We can split the data set into two parts:
X=pros.data[, c('lpsa','lcavol','lweight')]
Y=pros.data[, c('age','lbph','lcp')]
Then, you can apply these four tests (with the best choice of tuning parameters and distances) to X and Y. What conclusion can you draw?
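Because read.table returns a data frame, X and Y need to be converted before they are passed to functions that expect matrices, distance matrices, or kernel matrices. The sketch below shows one way to do this; `prep.inputs` is a hypothetical helper, the Euclidean distance and the Gaussian kernel with $2\sigma^2$ scaling are assumptions, and the small data frame `toy` contains made-up values that only stand in for pros.data so the example runs without network access.

```r
# Convert two data-frame blocks into the inputs used by the four tests.
prep.inputs <- function(X, Y, sigma2x = 1, sigma2y = 1) {
  x <- as.matrix(X)                     # numeric matrix, one row per observation
  y <- as.matrix(Y)
  distx <- as.matrix(dist(x))           # Euclidean distance matrices (assumed choice)
  disty <- as.matrix(dist(y))
  list(x = x, y = y, distx = distx, disty = disty,
       kernelx = exp(-distx^2 / (2 * sigma2x)),   # Gaussian kernel matrices
       kernely = exp(-disty^2 / (2 * sigma2y)))
}

# Made-up stand-in with the same column names as the real data set.
set.seed(428)
toy <- data.frame(lpsa = rnorm(5), lcavol = rnorm(5), lweight = rnorm(5),
                  age = c(50, 58, 74, 58, 62), lbph = rnorm(5), lcp = rnorm(5))
inp <- prep.inputs(toy[, c("lpsa", "lcavol", "lweight")],
                   toy[, c("age", "lbph", "lcp")])
```

Since the variables are on different scales (for example, age versus the log-transformed measurements), standardizing the columns with `scale()` before computing distances may be worth considering.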
Question 5 Report (20 points)
A retail corporation would like to test whether its customers' purchase records are associated with their demographic and economic status. The manager wants to choose one independence testing method among the four mentioned above and asks for your opinion on which is best. Could you prepare a report to provide some suggestions to this manager? In this report, you need to summarize all your findings from Questions 1-4. The report should be limited to one page. You are encouraged to use figures to deliver your messages. The manager who reads your report has only a minimal statistical background, so you may want to avoid technical terminology.
Question 6 Presentation and Slides (30 points)
Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions? Assume your audience is the manager from this retail corporation, who has only a very limited statistical background. Try to avoid technical terminology. In this question, you need to submit a video (I need to see you in this video) and your slides (you need to use R Markdown and submit both the Rmd and PDF files).
Question 7 R package (Bonus question: extra 10 points for the final project, and the total points of the final project may not exceed 100 points)
Could you prepare an R package that includes all four of your independence testing methods, along with a manual that explains how to use them? To complete this question, you need to submit a compressed R package.
Submission Check List
• A report for Questions 1-4 (Rmd and pdf), which can be long and technical.
• A short report for Question 5 (Rmd and pdf), which is limited to one page.
• A video presentation (I need to see you in this video)
• Presentation slides for Question 6 (Rmd and pdf)
• A compressed R package for Question 7 (optional)