7SSGN110 Environmental Data Analysis | Practical 3 | Hypothesis testing

1. Introduction

1.1. About this practical

This practical will help you develop your skills using R to analyse data using inferential statistics. It will also require you to manipulate data, building on what you did in Practical 1  and Practical 2. At various points in the instructions it will be assumed that you can look back to those practical instructions to understand what to do. The practical is composed of two sections:

1. Pebbles: Introduction to the basics of running inferential statistics in R & testing for differences between two groups.

2. Forests: Introduction to comparing 3 or more groups

1.2. Practical data

You can access the files for this practical, ‘Pebbles.csv’ and ‘Scot_Forests.csv’, via KEATS.

1.3. Getting started with R

Before you start the data analysis in R, ensure that you have:

• Created a folder on your computer for this week’s practical

• Set the working directory for the practical work in R Studio

• Loaded the required packages. We will be using the packages tidyr, ggplot2 & car

• Read the ‘Pebbles.csv’ and ‘Scot_Forests.csv’ files into R.

Hint: Refer to Practical 2, section 4.1 for guidance on how to do these things.
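As a reminder, the setup steps above follow the usual pattern below. The folder path is an assumption (use your own), and because the real files live on KEATS, this sketch demonstrates the read.csv() call on a throwaway file:

```r
# Setup sketch - the path is a hypothetical example; use your own folder.
# setwd("~/EDA/practical3")
# library(tidyr); library(ggplot2); library(car)  # load once installed

# The read.csv() pattern, demonstrated on a throwaway file since the
# real Pebbles.csv must be downloaded from KEATS:
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(beach = c("Sandstone", "Limestone"),
                     pebblesize = c(52, 38)),
          tmp, row.names = FALSE)
pebbles.demo <- read.csv(tmp, header = TRUE)
str(pebbles.demo)  # 2 obs. of 2 variables: beach (chr), pebblesize (int)
```

Once the real files are in your working directory, the same call with "Pebbles.csv" in place of the temporary file will create your pebbles.data dataframe.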

2. Testing for differences between two groups

2.1. Background

A coastal geomorphologist postulates that bedrock geology is one of the fundamental controls over the size of the material making up beaches. To test this, they wish to determine whether pebble size differs between beaches in sandstone vs. limestone regions. They take a sample of 50 pebbles from each beach and measure the length of the B-axis for each pebble, in millimetres. They wish to use a t-test to analyse whether there is a significant difference in the mean pebble size between the two beaches.

Q1: What would be the null and alternative hypothesis of the two-tailed t-test?

2.2. Descriptive statistics

Double click on the pebbles.data object shown in the Environment window (top right-hand side of the screen). This will open the pebbles.data dataframe in a new tab. You will see that there are two columns: ‘beach’ describes which beach the data are from and ‘pebblesize’ gives us all the measured pebble sizes.

Q2. What type of data formatting is being used - wide or long?

Let’s first calculate descriptive statistics for the data set. We can do this using the summary() function:

summary(pebbles.data)

##     beach             pebblesize   
##  Length:100         Min.   :15.00  
##  Class :character   1st Qu.:38.00  
##  Mode  :character   Median :44.00  
##                     Mean   :45.17  
##                     3rd Qu.:52.00  
##                     Max.   :70.00

You will notice that the summary statistics produced are for the pebble data set as a whole, and do not take into account the different geologies.

To produce per-beach summary statistics using the summary() function we need to transform our data into wide format. We can do this using the spread() function from the tidyr package.

pebbles.data$row <- 1:nrow(pebbles.data) # adds a column to the pebbles data so that spread will work properly.
pebbles.wide <- spread(pebbles.data, key = beach, value = pebblesize) # spreads the data from long to wide
summary(pebbles.wide) # produces the summary statistics for each column

##       row           Limestone       Sandstone    
##  Min.   :  1.00   Min.   :15.00   Min.   :24.00  
##  1st Qu.: 25.75   1st Qu.:36.00   1st Qu.:43.00  
##  Median : 50.50   Median :40.50   Median :49.00  
##  Mean   : 50.50   Mean   :41.06   Mean   :49.28  
##  3rd Qu.: 75.25   3rd Qu.:47.50   3rd Qu.:57.75  
##  Max.   :100.00   Max.   :65.00   Max.   :70.00  
##                   NA's   :50      NA's   :50
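As an aside, the same per-beach summaries can be read straight off the long data with base R’s tapply(), with no reshaping needed. Simulated data stands in for Pebbles.csv in this sketch:

```r
# Simulated stand-in for the pebbles data (two beaches, 50 pebbles each)
set.seed(42)
pebbles.sim <- data.frame(
  beach = rep(c("Limestone", "Sandstone"), each = 50),
  pebblesize = c(rnorm(50, mean = 41, sd = 9),
                 rnorm(50, mean = 49, sd = 9))
)

# tapply() applies a function to one column, split by a grouping column
tapply(pebbles.sim$pebblesize, pebbles.sim$beach, mean)    # group means
tapply(pebbles.sim$pebblesize, pebbles.sim$beach, summary) # full summaries
```

This avoids the row-id workaround that spread() needs, although the wide format is still useful later for functions that expect one vector per group.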

Q3. What is the mean pebble size for the Limestone & Sandstone beach locations?

2.3. Plotting data

Before we conduct any statistical tests, it is important to graph our data to identify any patterns or trends. To do this, create boxplots and histograms of the data. We are going to use the package ggplot2 to do this.

We are first going to produce a boxplot of our data. ggplot2 uses data in long format, so we will need to use the original pebbles.data dataframe to do this.

Q4. Complete the following code to produce a boxplot of the pebbles data. You need to insert the correct variables in our data frame, so the boxplot shows the distribution of pebble size at each location. (Hint: refer to practical 2 if you get stuck)

plot1 <- ggplot(pebbles.data, aes(x = INSERT VARIABLE HERE, y = INSERT VARIABLE HERE))
plot1 + geom_boxplot() + ylab("Pebblesize (mm)")

Your completed boxplot should look like this:

[Figure: boxplot of pebble size (mm) for each beach]

Q5. Complete the following code to produce a histogram of the pebbles data. You need to insert the correct variable and an appropriate binwidth so the distribution of the pebble size for each location is shown. (Hint: refer to practical 2 if you get stuck)

plot2 <- ggplot(pebbles.data, aes(INSERT VARIABLE HERE))
plot2 + geom_histogram(binwidth = INSERT BINWIDTH HERE) + facet_grid(beach ~ .)

Your completed histogram may look something like this:

[Figure: faceted histograms of pebble size for each beach]

Q6. From looking at the boxplots, do you think there is any difference in pebble size between the sites?

Q7. From looking at the histograms, do you think the data are normally distributed?

2.3.1. Checking assumptions

Before we run any statistical tests we should also explore the data to check if the assumptions of the tests are valid. The primary assumption of the t-test is that the data being analysed are normally distributed.

We can first check for normality visually using QQ plots.

The base R functions qqnorm() and qqline() can be used to produce these plots:

qqnorm(): produces a normal Q-Q plot of the variable; qqline(): adds a ‘normal’ reference line

qqnorm(pebbles.wide$Limestone, pch = 1) # Vector of data to plot, plot symbol
qqline(pebbles.wide$Limestone, col = "steelblue") # Vector of data to add reference line to, colour of line

[Figure: normal Q-Q plot of the Limestone pebble data]

Q8. Produce a Q-Q plot for the sandstone data. Based on the Q-Q plots, do you think the pebble size data are normally distributed?
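For intuition, here is what Q-Q plots look like for simulated normal versus clearly skewed data (the simulated vectors are stand-ins, not the pebble data): points from normal data hug the qqline() reference, while skewed data bow away from it at the tails.

```r
set.seed(1)
normal.sim <- rnorm(50, mean = 45, sd = 9)  # normal stand-in for pebble sizes
skewed.sim <- rexp(50, rate = 0.1)          # clearly non-normal

par(mfrow = c(1, 2))                        # two plots side by side
qqnorm(normal.sim, main = "Normal data", pch = 1)
qqline(normal.sim, col = "steelblue")
qqnorm(skewed.sim, main = "Skewed data", pch = 1)
qqline(skewed.sim, col = "steelblue")
par(mfrow = c(1, 1))                        # reset the plot layout
```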

We can now test for normality using statistical methods. In R the Shapiro-Wilk test for normality is run on vectors of data. We have already converted our data into wide format (pebbles.wide). However, an alternative approach is to use the subset() function to split the data into individual vectors for each group.

In the code below we use the subset() function with an expression that tells R which data to retain in the new data set. The operator == means “is equal to” (note it is two = signs together). We also need to put the value from the beach column between quotation marks ("). So the command tells R to retain all rows of data in which the value in the ‘beach’ column is equal to ‘Sandstone’. Similarly, select = "pebblesize" means retain only the pebblesize column.

sandstone <- subset(pebbles.data, beach == "Sandstone", select = "pebblesize")

We can check the format of the subsetted data by using the str() command. This command tells us about the structure of the data of interest - how it is organised and the type of data stored (character, factor, numeric, etc.).

str(sandstone)

## 'data.frame':    50 obs. of  1 variable:
##  $ pebblesize: int  52 57 43 62 45 65 61 45 41 45 ...

The output of str() tells us that the subsetted data is still in dataframe format. We can convert it into a vector by using the following code:

sandstone.vec <- sandstone$pebblesize
str(sandstone.vec)

##  int [1:50] 52 57 43 62 45 65 61 45 41 45 ...

Q9: Repeat the code above to subset the limestone beach data.

To test for normality in R you can use the shapiro.test() function to run a Shapiro-Wilk test. The shapiro.test() function should be passed a numeric vector of data values (e.g. a column from your ‘wide’ data frame) and can be used with functions like apply().

Before we start using this function, we need to state our null and alternative hypotheses and select the alpha value (the cut-off: if the p-value falls below alpha, you reject the null hypothesis).

Q10: State the null and alternative hypotheses and the alpha value for the Shapiro-Wilk test

We can run the Shapiro-Wilk test in three ways:

shapiro.test(pebbles.wide$Sandstone)
shapiro.test(sandstone.vec) # Using the vector we created through subsetting the data
apply(pebbles.wide[, 2:3], 2, shapiro.test) #Using the apply function to select wide columns to run the test on

The result of the Shapiro-Wilk test for sandstone is as follows:

shapiro.test(pebbles.wide$Sandstone)

##
##  Shapiro-Wilk normality test
##
## data:  pebbles.wide$Sandstone
## W = 0.98197, p-value = 0.6375
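Note that shapiro.test() returns an “htest” list, so the W statistic and p-value can be pulled out by name rather than read off the console - handy if you loop over several columns. A sketch, using simulated normal data as a stand-in for the sandstone column:

```r
set.seed(7)
sizes.sim <- rnorm(50, mean = 49, sd = 9)  # simulated stand-in data
sw <- shapiro.test(sizes.sim)
sw$statistic  # the W statistic
sw$p.value    # compare against your chosen alpha (e.g. 0.05)
```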

Q11: Do the results of the test for normality of Sandstone data indicate they are normally distributed?

Q12: Repeat the test for limestone. Do the results of the test for normality indicate they are normally distributed?

Next, we will test the assumption of homogeneity of variances using Levene’s test. Levene’s test is similar in spirit to an F-test but is more robust to departures from normality, making it a sensible default for samples like ours. We are going to use the ‘car’ package to do this, so make sure it is installed and loaded into R Studio.

To conduct a Levene’s test, run the following code:

leveneTest(pebblesize ~ beach, data = pebbles.data, center = mean)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  1  1.1046 0.2959
##       98

Q13: Do the results of the Levene’s test indicate homogeneity of variances?

You can also run an F test using the function var.test().
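A sketch of that F test on simulated samples (stand-ins for the two beaches): var.test() compares the ratio of two sample variances, under a null hypothesis that the true ratio is 1.

```r
set.seed(3)
limestone.sim <- rnorm(50, mean = 41, sd = 9)  # simulated stand-in data
sandstone.sim <- rnorm(50, mean = 49, sd = 9)

# H0: the two variances are equal (ratio = 1)
var.test(sandstone.sim, limestone.sim)
```

Keep in mind the classical F test is sensitive to non-normality, which is why leveneTest() is used in the main text.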

2.4. Conducting a t-test

We will assume that the data are normally distributed and that variances are homogenous. To test whether mean pebble sizes are statistically different between the two beaches, we will use an independent t-test.

To run a two-tailed t-test in R, you need to specify the two groups of data you would like to compare, and whether we are assuming equal variances. As we are assuming equal variances for our pebble data, we add the argument var.equal = TRUE to our function call. If we had unequal variances and therefore wanted to conduct a Welch’s t-test, we would instead use var.equal = FALSE.

Before we run the test, we need to state our null and alternative hypothesis and select an alpha value. Do this now.

To run the t-test, use the following line of code:

t.test(pebbles.wide$Sandstone, pebbles.wide$Limestone, var.equal = TRUE)

##
##  Two Sample t-test
##
## data:  pebbles.wide$Sandstone and pebbles.wide$Limestone
## t = 4.1006, df = 98, p-value = 8.513e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   4.241983 12.198017
## sample estimates:
## mean of x mean of y
##     49.28     41.06
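Like shapiro.test(), t.test() returns an “htest” list, so the pieces asked about in Q14-Q16 can be extracted in code. A sketch with simulated samples standing in for the wide columns:

```r
set.seed(5)
sandstone.sim <- rnorm(50, mean = 49, sd = 9)  # simulated stand-in data
limestone.sim <- rnorm(50, mean = 41, sd = 9)

tt <- t.test(sandstone.sim, limestone.sim, var.equal = TRUE)
tt$statistic  # the t statistic
tt$p.value    # compare against your chosen alpha
tt$conf.int   # 95% confidence interval for the difference in means
```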

Q14: What was the t statistic?

Q15: Which hypothesis would you accept (null or alternative) and why?

Q16: What can we conclude from these findings?

3. Working with 3 or more groups

3.1. Background

In this part of the practical, you will analyse ecological data from the Scottish Highlands to examine how different land uses have affected ecosystem function. The dataset we are using is from Warner et al. (2021), who were interested in examining the differences in ecosystem function between native reforestation, unforested and mature forest plots in the Scottish Highlands. To do this the researchers measured a series of environmental variables in matched plots across 16 reforestation sites. Data were collected using a 10 × 10 m plot in each reforestation site. The same data were collected in a matched plot in an unforested, grazed area associated with each site and in a matched plot in an unforested area within the fenced site.

Warner et al. (2021) collected measurements from each plot for a range of variables. Today we are going to look at 4 of these:

1. Topsoil carbon. Carbon content estimate for topsoil in plot, calculated from % carbon and bulk density (in kg/m2)

2. Topsoil nitrogen. Nitrogen content estimate for topsoil in plot, calculated from % nitrogen and bulk density (in kg/m2)

3. Shrub height. Mean height of the shrub layer (mean of shrub height measured across the three quadrats per plot, in cm).

4. Moss cover. Mean depth of the moss layer (mean of moss-layer depth measured across the three quadrats per plot, in cm).

The aims of the project are:

· To assess the response of the ecosystem function of reforested sites in relation to the unforested state and the target state (mature forests)

· To assess how the ecosystem function differs in areas of differing grazing status

Before you start, think about how forest and grazing status may affect the variables above. What might be your provisional hypotheses for these data? What statistical tests could we use to address these hypotheses?

3.2. Import the data into R

Before we decide on the type of data analysis to undertake, we first need to import the data into R Studio. As we are now working with a different data set, create a new script in R: go to File > New File > R Script.

In your new script, enter code to set the working directory. Then load the data into R Studio:

forest.data <- read.csv("Scot_Forests.csv", header = T)

We first want to look at the dataset to see how the data are organised and whether we need to do any dataframe manipulation before we start our analysis. Look at the data structure and summary:

head(forest.data)

##             Name        Forest  Grazing Carbon Nitrogen Shrub.Height Moss
## 1 Athnamulloch 1 Reforestation Ungrazed  13.71     0.56        16.13 9.93
## 2 Athnamulloch 1    Unforested   Grazed  26.93     0.91        15.47 4.00
## 3 Athnamulloch 1    Unforested Ungrazed   8.04     0.30        20.33 8.40
## 4 Athnamulloch 2 Reforestation Ungrazed  27.47     1.70        27.00 3.87
## 5 Athnamulloch 2    Unforested   Grazed  16.70     0.47        13.53 6.60
## 6 Athnamulloch 2    Unforested Ungrazed  42.47     1.86        28.27 3.40

summary(forest.data)

##      Name              Forest            Grazing              Carbon     
##  Length:52          Length:52          Length:52          Min.   : 5.67  
##  Class :character   Class :character   Class :character   1st Qu.:14.42  
##  Mode  :character   Mode  :character   Mode  :character   Median :27.20  
##                                                           Mean   :26.87  
##                                                           3rd Qu.:38.02  
##                                                           Max.   :52.06  
##     Nitrogen       Shrub.Height        Moss       
##  Min.   :0.2100   Min.   : 1.07   Min.   : 0.270  
##  1st Qu.:0.5975   1st Qu.:16.13   1st Qu.: 2.635  
##  Median :1.0000   Median :24.43   Median : 4.670  
##  Mean   :1.0642   Mean   :24.15   Mean   : 5.404  

##  3rd Qu.:1.4125   3rd Qu.:30.53   3rd Qu.: 7.567  
##  Max.   :2.6900   Max.   :55.13   Max.   :16.330

You will see that the data are in wide format and there are many variables.

We are first interested in how the environmental variables differ by forest status. At this stage the ‘Name’ and ‘Grazing’ grouping variables are rather superfluous. We can remove these by subsetting the dataframe.

In the code below, we are telling R to drop the Name and Grazing variables. The ‘-’ sign indicates dropping variables:

forest.sub<-subset(forest.data, select = -c(Name,Grazing))
head(forest.sub)

##          Forest Carbon Nitrogen Shrub.Height Moss
## 1 Reforestation  13.71     0.56        16.13 9.93
## 2    Unforested  26.93     0.91        15.47 4.00
## 3    Unforested   8.04     0.30        20.33 8.40
## 4 Reforestation  27.47     1.70        27.00 3.87
## 5    Unforested  16.70     0.47        13.53 6.60
## 6    Unforested  42.47     1.86        28.27 3.40
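As an aside, the same drop can be done with base R’s negative column selection, a common alternative to subset(select = -c(...)). A sketch on a small simulated frame:

```r
# One-row stand-in for forest.data
df <- data.frame(Name = "Athnamulloch 1", Forest = "Mature",
                 Grazing = "Grazed", Carbon = 30.1)

# Keep every column whose name is NOT in the drop list
df[, !(names(df) %in% c("Name", "Grazing")), drop = FALSE]
```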

We now have a dataframe that includes forest status information for each plot and associated measurements of topsoil carbon, topsoil nitrogen, shrub height and moss cover.

3.3. Testing for differences between 3 or more groups

We are interested in how the ecological variables differ between reforestation, unforested and mature forest plots, so our grouping variable (forest status) has three levels. We will therefore need statistical approaches that allow 3 or more groups to be compared - ANOVA or Kruskal-Wallis.

We first want to determine if there is a difference in topsoil carbon content between plots of differing forest status. Because we are only interested in the carbon content for the moment, we can also drop the other variables so we have a dataframe that just contains the forest status and topsoil carbon content data:

forest.carbon<-subset(forest.sub, select = c(Forest,Carbon))
head(forest.carbon)

##          Forest Carbon
## 1 Reforestation  13.71
## 2    Unforested  26.93
## 3    Unforested   8.04
## 4 Reforestation  27.47
## 5    Unforested  16.70
## 6    Unforested  42.47

Now that only the grouping and measurement columns remain, we can put this data back into wide format using the spread() function:

forest.carbon$row <- 1:nrow(forest.carbon) # adds a row-id column to the forest data so that spread will work properly.
forest.carbon.wide <- spread(forest.carbon, key = Forest, value = Carbon) # spreads the data from long to wide
summary(forest.carbon.wide) # produces the summary statistics for each column

##       row            Mature      Reforestation     Unforested   
##  Min.   : 1.00   Min.   :12.74   Min.   : 5.67   Min.   : 8.04  
##  1st Qu.:13.75   1st Qu.:33.47   1st Qu.:10.64   1st Qu.:17.50  
##  Median :26.50   Median :36.15   Median :16.64   Median :26.54  
##  Mean   :26.50   Mean   :34.64   Mean   :20.87   Mean   :27.10  
##  3rd Qu.:39.25   3rd Qu.:39.75   3rd Qu.:27.56   3rd Qu.:36.47  
##  Max.   :52.00   Max.   :52.06   Max.   :51.68   Max.   :49.48  
##                  NA's   :42      NA's   :38      NA's   :24

Before we decide which statistical test to use, we first need to check whether the data meet the test assumptions.

Q17: Adapting the code from pebble size data analysis undertaken earlier, determine whether the soil carbon content is normally distributed

Hint: If you chose to plot the data graphically using ggplot2, the data should be in long format.

We then need to check for homogeneity of variances:

leveneTest(Carbon ~ Forest, data = forest.carbon, center = mean)

## Levene's Test for Homogeneity of Variance (center = mean)
##       Df F value Pr(>F)
## group  2  0.6855 0.5086
##       49

The results of this analysis indicate that the groups have equal variances; combined with the normality checks from Q17, this means we can use an ANOVA to compare the means of each group.

Q18: What will the null and alternative hypothesis be for this test?

The R function aov() is used to run a one-way ANOVA. We can summarize the results of this using the summary.aov() function:

carbon.aov <- aov(Carbon ~ Forest, data = forest.carbon)
summary(carbon.aov)

##             Df Sum Sq Mean Sq F value Pr(>F)  
## Forest       2   1108   554.0   3.403 0.0413 *
## Residuals   49   7978   162.8                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
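The ANOVA table printed above can also be queried in code: summary() on an aov fit returns a list whose first element is the table. A sketch with simulated three-group data standing in for the carbon measurements:

```r
set.seed(9)
forest.sim <- data.frame(
  Forest = rep(c("Mature", "Reforestation", "Unforested"),
               times = c(10, 14, 28)),
  Carbon = c(rnorm(10, 35, 12), rnorm(14, 21, 12), rnorm(28, 27, 12))
)

fit <- aov(Carbon ~ Forest, data = forest.sim)
tab <- summary(fit)[[1]]   # the ANOVA table as a data frame
tab[1, "F value"]          # F statistic for the Forest term
tab[1, "Pr(>F)"]           # its p-value
```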

Q19: What was the F statistic?

Q20: Which hypothesis would you accept (null or alternative) and why?

Q21: What could we suggest about how topsoil carbon varies by forest status?

Q22: Run the above analyses again for nitrogen content, shrub cover and moss cover.

You may find that your data does not meet the assumptions needed to run ANOVA. In this case you may need to use the non-parametric equivalent, the Kruskal-Wallis test. Example code for this is:

kruskal.test(MEASUREMENT VARIABLE ~ GROUPING VARIABLE, data = DATAFRAME)

You may also find that the results of the ANOVA or Kruskal-Wallis test indicate there are differences between groups. In this case, you would need to undertake post-hoc pairwise comparison of groups. For data that meet the assumptions of ANOVA you can use Tukey’s HSD via the function TukeyHSD(), and for non-parametric data you can use the function pairwise.wilcox.test().
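The omnibus and post-hoc options just named can be sketched on simulated three-group data (group B is shifted upwards, so the tests should flag a difference):

```r
set.seed(11)
sim <- data.frame(
  group = rep(c("A", "B", "C"), each = 15),
  value = c(rnorm(15, 10), rnorm(15, 14), rnorm(15, 10))
)

kruskal.test(value ~ group, data = sim)     # non-parametric omnibus test
TukeyHSD(aov(value ~ group, data = sim))    # parametric pairwise comparisons
pairwise.wilcox.test(sim$value, sim$group)  # non-parametric pairwise comparisons
```

The pairwise outputs report one comparison per pair of groups, which tells you *which* groups differ - something the omnibus test alone cannot.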

Following the analyses above, what can you conclude about how the ecosystem function of reforested sites differs in relation to the unforested state and the target state (mature forests)?

3.4. Analysing grazing status

The second aim of this study was to assess how the ecosystem function differs in areas of differing grazing status. The plots used in this study have also been grouped into ‘grazed’ and ‘ungrazed’. Therefore, we can look at how the environmental variables differ between these two groups and make inferences about why this may be.

You can manipulate the code from the preceding sections of the practical to undertake this analysis. The steps you will need to take generally are as follows:

· Manipulate your dataset so it contains the appropriate data and is in the correct format for analysis

· Check the data visually (e.g. using boxplots, histograms, etc.)

· Check assumptions for tests (e.g., normality, equality of variance)

· Decide what tests to run (e.g., ANOVA, Kruskal-Wallis, t-test, Wilcoxon-Mann-Whitney)

· Set your null and alternative hypothesis based on your chosen test

· Run the test on each environmental variable

· Draw conclusions about the influence of grazing status on each variable

Hint: You may come across data that does not meet the assumptions for a t-test. In this case, you can use a Wilcoxon test (wilcox.test()). If two samples (vectors) are supplied, this is equivalent to the Mann-Whitney U test.
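A sketch of that two-sample Wilcoxon (Mann-Whitney) test on simulated skewed samples, where a t-test’s normality assumption would be doubtful:

```r
set.seed(2)
grazed.sim   <- rexp(20, rate = 0.25)  # skewed stand-in measurements
ungrazed.sim <- rexp(20, rate = 0.10)

# Two vectors supplied, so this is the Mann-Whitney U test
wilcox.test(grazed.sim, ungrazed.sim)
```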

Q23: What inferences can you make about grazing status and ecosystem responses?



