Final Project

Stat 428

In the lecture, we discussed the distance correlation for the independence test problem. Suppose the data we observe are (X1, Y1), . . . , (Xn, Yn), where Xi ∈ R^{dX} and Yi ∈ R^{dY} are multivariate random vectors. Here, (X1, Y1), . . . , (Xn, Yn) are drawn from a joint distribution F. The marginal distribution of X is FX and the marginal distribution of Y is FY. The hypothesis of interest in the independence test problem is

H0 : F = FXFY vs H1 : F ≠ FXFY .

Besides the distance correlation test, we also consider three other independence tests: the Hilbert-Schmidt independence criterion test, the sum of rank correlations test, and the maxima of rank correlations test.

• The Hilbert-Schmidt independence criterion (HSIC) test statistic is defined as

where $A_{ij} = a_{ij} - \bar{a}_{i\cdot} - \bar{a}_{\cdot j} + \bar{a}_{\cdot\cdot}$ and $B_{ij} = b_{ij} - \bar{b}_{i\cdot} - \bar{b}_{\cdot j} + \bar{b}_{\cdot\cdot}$. Here

aij = kX(Xi , Xj ) and bij = kY (Yi , Yj ).

Here kX(·, ·) and kY(·, ·) are two kernels. For example, a commonly used kernel is the Gaussian kernel, which has a bandwidth parameter σ2.

• Sum of rank correlations test is defined in the following way.

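where ρl,m is the Spearman's rank correlation coefficient between the lth coordinate of X and the mth coordinate of Y. The Spearman's rank correlation coefficient between (x1, . . . , xn) and (y1, . . . , yn) is defined as

$$\rho = \frac{\sum_{i=1}^{n} (R_{x,i} - \bar{R}_x)(R_{y,i} - \bar{R}_y)}{\sqrt{\sum_{i=1}^{n} (R_{x,i} - \bar{R}_x)^2 \, \sum_{i=1}^{n} (R_{y,i} - \bar{R}_y)^2}}$$

where $R_{x,i}$ and $R_{y,i}$ are the ranks of $x_i$ and $y_i$ among x1, . . . , xn and y1, . . . , yn, respectively, and $\bar{R}_x$ and $\bar{R}_y$ are the means of $R_{x,1}, \ldots, R_{x,n}$ and $R_{y,1}, \ldots, R_{y,n}$.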

• Maxima of rank correlations test is defined in the following way.

where ρl,m is the Spearman's rank correlation coefficient between the lth coordinate of X and the mth coordinate of Y.
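The Spearman coefficient used by both rank-based tests can be sketched in R directly from its definition as the Pearson correlation of the ranks. This is only an illustration: the names spearman_rho and rank_cor_matrix are hypothetical, and ties are handled with R's default average ranks, which is an assumption not stated in the text.

```r
# Spearman's rank correlation written out from the definition:
# the Pearson correlation of the ranks (average ranks for ties, an assumption).
spearman_rho <- function(x, y) {
  rx <- rank(x)
  ry <- rank(y)
  dx <- rx - mean(rx)
  dy <- ry - mean(ry)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# dX x dY matrix whose (l, m) entry is the coefficient between
# column l of x and column m of y (rows are observations).
rank_cor_matrix <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  rho <- matrix(0, ncol(x), ncol(y))
  for (l in seq_len(ncol(x))) {
    for (m in seq_len(ncol(y))) {
      rho[l, m] <- spearman_rho(x[, l], y[, m])
    }
  }
  rho
}

set.seed(1)
x <- matrix(rnorm(60), 20, 3)
y <- matrix(rnorm(40), 20, 2)
rho <- rank_cor_matrix(x, y)
```

The result can be checked against the built-in cor(x, y, method = "spearman"), which should agree entry by entry.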

You need to submit both the Rmd and pdf files for Questions 1-4. Do NOT zip them together, as a zip file cannot be previewed in Canvas. You may be penalized if you submit the wrong format.

Question 1 Test Implementation (15 points)

In this question, you are required to implement these four independence test methods from scratch: the distance correlation test, the Hilbert-Schmidt independence criterion test, the sum of rank correlations test, and the maxima of rank correlations test. Specifically, you need to implement two functions for each method: one calculates the test statistic; the other makes the decision by a permutation test.

• For the distance correlation test, you need to implement DCOR(distx, disty) and DCOR.perm(distx, disty, alpha, B), where:

– distx is the distance matrix of X,

– disty is the distance matrix of Y,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

DCOR(distx, disty) returns the value of the test statistic and DCOR.perm(distx, disty, alpha, B) returns the decision on whether the null hypothesis is rejected.
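One possible shape for this pair of functions is sketched below. It assumes the usual sample distance correlation (double-centered distance matrices, averaged entrywise products) and a permutation p-value of the form (1 + #{permuted ≥ observed}) / (B + 1); the lecture's exact definition may differ, and the helper double_center and the default values for alpha and B are hypothetical.

```r
# Double-center a square matrix: subtract row means and column means,
# then add back the grand mean.
double_center <- function(d) {
  d <- as.matrix(d)
  sweep(sweep(d, 1, rowMeans(d)), 2, colMeans(d)) + mean(d)
}

# Sample distance correlation from two n x n distance matrices
# (assumed definition; check it against the lecture notes).
DCOR <- function(distx, disty) {
  A <- double_center(distx)
  B <- double_center(disty)
  dcov2 <- mean(A * B)            # squared distance covariance
  dvarx <- mean(A * A)
  dvary <- mean(B * B)
  if (dvarx * dvary <= 0) return(0)
  sqrt(max(dcov2, 0) / sqrt(dvarx * dvary))
}

# Permutation test: relabel the observations of Y by permuting the rows
# and columns of disty together. Returns TRUE when H0 is rejected.
DCOR.perm <- function(distx, disty, alpha = 0.05, B = 199) {
  n <- nrow(distx)
  stat <- DCOR(distx, disty)
  perm_stats <- replicate(B, {
    idx <- sample(n)
    DCOR(distx, disty[idx, idx])
  })
  p_value <- (1 + sum(perm_stats >= stat)) / (B + 1)
  p_value <= alpha
}

set.seed(428)
x <- matrix(rnorm(100), 50, 2)
y <- x^2 + matrix(rnorm(100, sd = 0.1), 50, 2)   # dependent on x
distx <- as.matrix(dist(x))
disty <- as.matrix(dist(y))
stat <- DCOR(distx, disty)
decision <- DCOR.perm(distx, disty, alpha = 0.05, B = 199)
```

Permuting rows and columns of the precomputed matrix avoids recomputing distances in every replicate, which matters because the permutation loop dominates the running time.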

• For the Hilbert-Schmidt independence criterion test, you need to implement HSIC(kernelx, kernely) and HSIC.perm(kernelx, kernely, alpha, B), where:

– kernelx is the kernel matrix of X,

– kernely is the kernel matrix of Y,

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

HSIC(kernelx, kernely) returns the value of the test statistic and HSIC.perm(kernelx, kernely, alpha, B) returns the decision on whether the null hypothesis is rejected.
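A rough sketch of the HSIC pair follows. The normalization of the statistic is not reproduced in this text, so the code assumes the average of Aij Bij over all pairs (a 1/n² factor); the Gaussian kernel in the usage lines assumes the parametrization exp(−‖u − v‖² / (2σ²)). Both are assumptions to replace with the course definitions, and the helpers double_center and gauss_kernel are hypothetical names.

```r
# Double-center a kernel matrix: A_ij = a_ij - rowmean_i - colmean_j + grandmean.
double_center <- function(k) {
  k <- as.matrix(k)
  sweep(sweep(k, 1, rowMeans(k)), 2, colMeans(k)) + mean(k)
}

# HSIC statistic from two n x n kernel matrices.
# ASSUMPTION: normalized as (1 / n^2) * sum(A * B).
HSIC <- function(kernelx, kernely) {
  A <- double_center(kernelx)
  B <- double_center(kernely)
  mean(A * B)
}

# Permutation test: permute rows and columns of kernely together.
# Returns TRUE when H0 is rejected at level alpha.
HSIC.perm <- function(kernelx, kernely, alpha = 0.05, B = 199) {
  n <- nrow(kernelx)
  stat <- HSIC(kernelx, kernely)
  perm_stats <- replicate(B, {
    idx <- sample(n)
    HSIC(kernelx, kernely[idx, idx])
  })
  (1 + sum(perm_stats >= stat)) / (B + 1) <= alpha
}

# Hypothetical helper: Gaussian kernel matrix with an assumed parametrization.
gauss_kernel <- function(z, sigma2 = 1) {
  exp(-as.matrix(dist(z))^2 / (2 * sigma2))
}

set.seed(428)
x <- matrix(rnorm(80), 40, 2)
y <- matrix(rnorm(80), 40, 2)     # generated independently of x
kx <- gauss_kernel(x)
ky <- gauss_kernel(y)
stat <- HSIC(kx, ky)
decision <- HSIC.perm(kx, ky, alpha = 0.05, B = 199)
```

Because the decision depends only on how the observed statistic ranks among the permuted ones, a different constant in front of the sum would change the value returned by HSIC but not the outcome of HSIC.perm.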

• For the sum of rank correlations test, you need to implement SRC(x, y) and SRC.perm(x, y, alpha, B), where:

– x is the matrix of X (each row is an observation),

– y is the matrix of Y (each row is an observation),

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

SRC(x, y) returns the value of the test statistic and SRC.perm(x, y, alpha, B) returns the decision on whether the null hypothesis is rejected.
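As an illustration only, the sum of rank correlations pair might look like the sketch below. The exact form of the sum is not reproduced in this text; the code assumes the sum of squared Spearman coefficients over all coordinate pairs, so that positive and negative correlations do not cancel. Replace that line with the course definition if it differs. The helper rank_cor is a hypothetical name.

```r
# Spearman coefficients for all coordinate pairs: rank each column,
# then take Pearson correlations of the ranks.
rank_cor <- function(x, y) {
  rx <- apply(as.matrix(x), 2, rank)
  ry <- apply(as.matrix(y), 2, rank)
  cor(rx, ry)
}

# ASSUMPTION: the statistic is the sum of squared coefficients.
SRC <- function(x, y) {
  sum(rank_cor(x, y)^2)
}

# Permutation test: shuffle the rows of y to break the pairing with x.
# Returns TRUE when H0 is rejected at level alpha.
SRC.perm <- function(x, y, alpha = 0.05, B = 199) {
  y <- as.matrix(y)
  stat <- SRC(x, y)
  perm_stats <- replicate(B, SRC(x, y[sample(nrow(y)), , drop = FALSE]))
  (1 + sum(perm_stats >= stat)) / (B + 1) <= alpha
}

set.seed(428)
x <- matrix(rnorm(80), 40, 2)
y <- x + matrix(rnorm(80, sd = 0.5), 40, 2)   # monotone dependence on x
stat <- SRC(x, y)
decision <- SRC.perm(x, y, alpha = 0.05, B = 199)
```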

• For the maxima of rank correlations test, you need to implement MRC(x, y) and MRC.perm(x, y, alpha, B), where:

– x is the matrix of X (each row is an observation),

– y is the matrix of Y (each row is an observation),

– alpha is the significance level,

– and B is the number of replicates in the permutation test.

MRC(x, y) returns the value of the test statistic and MRC.perm(x, y, alpha, B) returns the decision on whether the null hypothesis is rejected.
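The maxima version can reuse the same permutation logic with a different statistic. The sketch below assumes the statistic is the largest absolute Spearman coefficient over all coordinate pairs (the formula is not reproduced in this text), and factors the permutation step into a hypothetical helper perm_reject that accepts any statistic function.

```r
# ASSUMPTION: the statistic is the maximum absolute Spearman coefficient.
MRC <- function(x, y) {
  rx <- apply(as.matrix(x), 2, rank)
  ry <- apply(as.matrix(y), 2, rank)
  max(abs(cor(rx, ry)))
}

# Hypothetical generic helper: permutation decision for any statistic
# stat_fun(x, y) where large values indicate dependence.
perm_reject <- function(stat_fun, x, y, alpha, B) {
  y <- as.matrix(y)
  stat <- stat_fun(x, y)
  perm_stats <- replicate(B, stat_fun(x, y[sample(nrow(y)), , drop = FALSE]))
  (1 + sum(perm_stats >= stat)) / (B + 1) <= alpha
}

# Returns TRUE when H0 is rejected at level alpha.
MRC.perm <- function(x, y, alpha = 0.05, B = 199) {
  perm_reject(MRC, x, y, alpha, B)
}

set.seed(428)
x <- matrix(rnorm(120), 40, 3)
y <- matrix(rnorm(80), 40, 2)     # generated independently of x
stat <- MRC(x, y)
decision <- MRC.perm(x, y, alpha = 0.05, B = 199)
```

Writing the permutation step once and passing the statistic as an argument keeps the four .perm functions consistent with each other, which makes the later comparisons easier to trust.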

Question 2 Choice of Tuning Parameter and Distance (10 points)

Several parts in these four tests can be customized. In this question, you need to use simulation experiments to make recommendations for the choices of tuning parameters and distances. Specifically, we consider the following tuning parameters and distances:

• In the distance correlation test, we can consider different distances. In particular, we can consider the ℓp distance

$$\|x - y\|_p = \Big( \sum_{k=1}^{d} |x_k - y_k|^p \Big)^{1/p}.$$

What p should we use?

• In the Hilbert-Schmidt independence criterion test, we can use the Gaussian kernel with different choices of σ2. How should we choose σ2?

You need to show some numerical experiments as your evidence.
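The inputs to be varied in these experiments can be generated along the following lines. This is a sketch under assumptions: lp_dist, gauss_kernel and median_sigma2 are hypothetical helper names, the Gaussian kernel is assumed to be exp(−‖u − v‖² / (2σ²)), and the median of squared pairwise distances is just one common starting point for a σ² grid, not a recommendation from the text.

```r
# l_p (Minkowski) distance matrix between the rows of z.
lp_dist <- function(z, p = 2) {
  as.matrix(dist(z, method = "minkowski", p = p))
}

# Gaussian kernel matrix; ASSUMED parametrization exp(-d^2 / (2 * sigma2)).
gauss_kernel <- function(z, sigma2) {
  exp(-as.matrix(dist(z))^2 / (2 * sigma2))
}

# Median of squared pairwise Euclidean distances, used here only to
# centre a grid of candidate sigma2 values.
median_sigma2 <- function(z) {
  median(as.vector(dist(z))^2)
}

set.seed(428)
x <- matrix(rnorm(90), 30, 3)

# Candidate distances and bandwidths to compare in a simulation study.
p_grid <- c(1, 2, 3)
dist_list <- lapply(p_grid, function(p) lp_dist(x, p))

sigma2_grid <- median_sigma2(x) * c(0.1, 1, 10)
kernel_list <- lapply(sigma2_grid, function(s2) gauss_kernel(x, s2))
```

Each matrix in dist_list or kernel_list would then be passed to the corresponding test, and the rejection rates compared across the grid.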

Question 3 Test Comparisons (15 points)

In this question, you are required to use simulation experiments to make recommendations for the choice of these four independence tests. In particular, you need to answer the following questions:

• Which test is more suitable for low-dimensional data sets (i.e., dX and dY are small)? Which is better for high-dimensional data sets (i.e., dX and dY are large)?

• Which test is more sensitive to different choices of the specific distribution of F, FX, and FY ?

• Are these tests able to control type I error?

• Which test is more powerful?

• Which test is more computationally efficient?
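The questions above all come down to estimating rejection rates and running times over repeated simulated data sets. A minimal harness might look like the sketch below. Every name in it (rejection_rate, gen_null, gen_alt, toy.perm) is hypothetical, and toy.perm is only a small stand-in so the block runs on its own; in practice one of the four .perm functions would be passed instead.

```r
# Fraction of simulated data sets on which a permutation test rejects H0.
# Under a null generator this estimates the type I error; under an
# alternative generator it estimates the power.
rejection_rate <- function(perm_test, gen_data, n_sim = 100, ...) {
  mean(replicate(n_sim, {
    d <- gen_data()
    perm_test(d$x, d$y, ...)
  }))
}

# Independent X and Y (H0 true).
gen_null <- function(n = 30, dx = 2, dy = 2) {
  list(x = matrix(rnorm(n * dx), n), y = matrix(rnorm(n * dy), n))
}

# Y depends on X through a non-monotone (quadratic) relation (H0 false).
gen_alt <- function(n = 30, dx = 2, dy = 2) {
  x <- matrix(rnorm(n * dx), n)
  y <- matrix(rowSums(x^2), n, dy) + matrix(rnorm(n * dy, sd = 0.5), n)
  list(x = x, y = y)
}

# Stand-in permutation test with the same interface as the .perm functions.
toy.perm <- function(x, y, alpha = 0.05, B = 99) {
  stat_fun <- function(x, y) max(abs(cor(x, y, method = "spearman")))
  stat <- stat_fun(x, y)
  perm <- replicate(B, stat_fun(x, y[sample(nrow(y)), , drop = FALSE]))
  (1 + sum(perm >= stat)) / (B + 1) <= alpha
}

set.seed(428)
type1 <- rejection_rate(toy.perm, gen_null, n_sim = 50, alpha = 0.05, B = 99)
power <- rejection_rate(toy.perm, gen_alt, n_sim = 50, alpha = 0.05, B = 99)

# Wall-clock time of a single test call, for the efficiency comparison.
d <- gen_null()
elapsed <- system.time(toy.perm(d$x, d$y))["elapsed"]
```

Changing n, dx and dy in the generators, or swapping in other distributions, covers the dimension and distribution-sensitivity questions with the same harness. With only 50 simulated data sets the estimates are noisy, so a real study would use a larger n_sim.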

Question 4 Application to Real Data Set (10 points)

We’re going to look at a data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). There are 10 variables measured on these 97 men:

1. lpsa: log PSA score

2. lcavol: log cancer volume

3. lweight: log prostate cancer weight

4. age: age of patient

5. lbph: log of the amount of benign prostatic hyperplasia

6. svi: seminal vesicle invasion

7. lcp: log of capsular penetration

8. gleason: Gleason score

9. pgg45: percent of Gleason scores 4 or 5

10. train: whether the observation belongs to the training data set

To load this prostate cancer data set and store it as pros.data (read.table returns a data frame), we can do the following:

pros.data = read.table("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data")

Based on this data set, we are interested in whether (log PSA score, log cancer volume, log prostate cancer weight) is independent of (age of patient, log of the amount of benign prostatic hyperplasia, log of capsular penetration). We can split the data set into two parts:

X=pros.data[, c('lpsa','lcavol','lweight')]

Y=pros.data[, c('age','lbph','lcp')]

Then, you can apply these four tests (with the best choice of tuning parameters and distances) to X and Y. What conclusion can you make?

Question 5 Report (20 points)

A retail corporation would like to test whether its customers' purchase records are associated with their demographic and economic status. The manager wants to choose one independence testing method among the four mentioned above, and asks for your opinion on which is best. Could you prepare a report to provide some suggestions to this manager? In this report, you need to summarize all your findings in Questions 1-4. The report should be limited to one page. You are encouraged to use figures to deliver your message. The manager who reads your report has only a minimal statistical background, so you may want to avoid technical terminology.

Question 6 Presentation and Slides (30 points)

Based on your report, could you prepare a 3-5 minute presentation to summarize your findings and suggestions? Assume your audience is the manager from this retail corporation, who has only a very limited statistical background. Try to avoid technical terminology. In this question, you need to submit a video (I need to see you in this video) and your slides (you need to use R Markdown and submit both the Rmd and pdf files).

Question 7 R package (Bonus question: extra 10 points for the final project, and the total points of the final project may not exceed 100 points)

Could you prepare an R package to include all your four independence testing methods and a manual that explains how to use these methods? To complete this question, you need to submit a compressed R package.
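One way to get a starting directory layout is base R's utils::package.skeleton, sketched below. The package name indeptests and the placeholder DCOR body are hypothetical; the real functions would be listed instead, and the generated help files still need to be filled in by hand before the package is built.

```r
# Placeholder standing in for one of the four test functions
# (hypothetical body; list the real implementations instead).
DCOR <- function(distx, disty) NA_real_

# Create the skeleton in a temporary directory so nothing is overwritten.
pkg_dir <- tempfile("pkg")
dir.create(pkg_dir)

utils::package.skeleton(
  name = "indeptests",          # hypothetical package name
  list = "DCOR",                # names of objects to include
  environment = environment(),  # where those objects are looked up
  path = pkg_dir
)

# Inspect what was generated (DESCRIPTION, NAMESPACE, R/, man/, ...).
created <- list.files(file.path(pkg_dir, "indeptests"))
```

After editing DESCRIPTION and the files under man/, running R CMD build on the package directory from a shell produces the compressed .tar.gz archive.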

Submission Check List

• A report for Question 1-4 (Rmd and pdf), which can be long and technical.

• A short report for Question 5 (Rmd and pdf), which is limited to one page.

• A video presentation (I need to see you in this video)

• Presentation slides for Question 6 (Rmd and pdf)

• A compressed R package for Question 7 (optional)
