End-of-Semester Data Analysis Assignment
Data and Problem Overview
An avid fan of the PGA TOUR with limited statistical background is asking for your help in answering one of the age-old questions in golf: how important is each aspect of the game, relative to the others, in determining average prize money in professional golf?
The data needed on the top 196 Tour players in 2006 can be found in the file pgatour2006.csv.
The meaning of the variables is as follows:
PrizeMoney (y): Average prize money per tournament.
DrivingAccuracy (x1): Driving accuracy is the percentage of time a player is able to hit the fairway with his tee shot.
GIR (x2): Greens in regulation (GIR) is the percentage of time a player was able to hit the green in regulation (greens hit in regulation / holes played).
PuttingAverage (x3): Putting average measures putting performance on those holes where the green is hit in regulation (GIR). By using greens hit in regulation, the effects of chipping close and one-putting are eliminated.
BirdieConversion (x4): Birdie conversion is the percentage of time a player makes a birdie or better after hitting the green in regulation.
SandSaves (x5): Sand saves is the percentage of time a player was able to get up and down once in a greenside sand bunker.
Scrambling (x6): Scrambling is the percentage of time that a player misses the green in regulation but still makes par or better.
PuttsPerRound (x7): Putts per round is the average total number of putts per round.
There are other variables in the data set, but they are not considered further in the Questions section.
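As a starting point, the data could be read in and restricted to the variables listed above. This is only a minimal sketch: it assumes that pgatour2006.csv sits in the working directory and that its column names match the variable names given above, neither of which is guaranteed by this handout. The object names path, vars, and golf are illustrative.

```r
# Assumption: the file is in the working directory and its column names
# match the variable names listed in the overview.
path <- "pgatour2006.csv"
vars <- c("PrizeMoney", "DrivingAccuracy", "GIR", "PuttingAverage",
          "BirdieConversion", "SandSaves", "Scrambling", "PuttsPerRound")

if (file.exists(path)) {
  golf <- read.csv(path)
  # Keep only the listed variables that are actually present in the file
  golf <- golf[, intersect(vars, names(golf)), drop = FALSE]
  str(golf)      # check variable types and the number of rows
  summary(golf)  # quick look at the range of each variable
} else {
  message("File not found: ", path)
}
```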
Instructions & Assessment
Use R Markdown to prepare your answers to the questions posed in the parts below. Unlike a usual homework assignment, where an answer to a question might include some R output and numerical values from calculations, most questions below require written responses in sentence/paragraph form. For these questions, you will not receive full credit for simply providing R output or the result of calculations: you need to clearly describe what you have done and provide appropriate discussion and interpretation.
The assignment will be graded out of 100 total points. Ninety of the 100 points are allocated across the parts below; the remaining 10 points will be awarded based on the quality of your write-up. Your write-up should be easy to read and appropriately formatted; plots and graphs should be appropriately sized, with easy-to-read labels and symbols; numeric results should be presented in a way that is easy to read. Please use a maximum of 7 pages (in .pdf format), including figures and tables.
As indicated in the course syllabus, this assignment is worth 15% of your final grade for the semester.
Questions
The following questions build on each other and will guide you through the steps of building a regression model that realistically represents the average prize money in professional golf.
1. A statistician from Australia suggests to the analyst that they should not transform any of the covariates, but that they should apply the log transformation to y. Do you agree with this suggestion? Justify your answer. 18 points
For the rest of the assignment, use the log transformations you decided on. To facilitate grading, make sure to use natural (base e) logarithms, not any other base, when transforming the variables.
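In R, log() uses base e by default, whereas log10() and log2() use other bases. The sketch below illustrates this on a small made-up data frame; the values and the names toy and logPrizeMoney are placeholders and have nothing to do with the Tour data.

```r
# Made-up placeholder values, NOT taken from pgatour2006.csv
toy <- data.frame(PrizeMoney = c(12000, 45000, 8000, 150000, 30000))

# log() with no base argument is the natural (base e) logarithm
toy$logPrizeMoney <- log(toy$PrizeMoney)

# Sanity check: exp() undoes the natural log
all.equal(exp(toy$logPrizeMoney), toy$PrizeMoney)

# For comparison only: base-10 logs differ by the constant factor log(10)
all.equal(log10(toy$PrizeMoney) * log(10), toy$logPrizeMoney)
```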
2. Develop a regression model that contains all seven of the potential covariates listed above. If relevant, explore variable transformations (polynomials, logarithms, etc.) and comment on the results. Explain your reasoning. 18 points
3. The golf fan wants to remove all covariates with “insignificant” t values ($\hat{\beta}_k / \hat{\sigma}_{\hat{\beta}_k}$) from the entire model in a single step. Explain why you do not recommend this approach. What alternatives would you recommend? A verbal answer suffices here. 18 points
4. Based on your answer to Question 3, create a final regression model that realistically represents the average prize money in professional golf, and justify the choice of this specific model. 18 points
5. Diagnose your model by assessing how well the assumptions of the linear regression model are met. 18 points
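For reference, the t value in Question 3 is the estimated coefficient divided by its estimated standard error, which is what R reports in the "t value" column of summary() for an lm fit. The sketch below checks this on simulated data; the variables x1, x2, and y are generated here purely for illustration and are not the golf variables.

```r
# Simulated illustration only: these are NOT the golf variables
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + rnorm(n)  # x2 has no true effect in this simulation

fit   <- lm(y ~ x1 + x2)
coefs <- summary(fit)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)

# t value = estimate / standard error, computed by hand
t_manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Should agree with the column reported by summary()
all.equal(t_manual, coefs[, "t value"])
```

Each of these t values is computed with all other covariates in the model, so it can change when another covariate is added or removed.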