Summarising responses for groups and mutating variables

This dataset includes heights and diameters of trees grown under different fertilisation regimes over a measurement period of more than four years. Fertiliser treatments were applied at the plot level, with measurements made on multiple trees within each plot. We wish to calculate growth increments after summarising plot averages for diameter on each measurement date.

First we load libraries, read the data and convert variables to appropriate classes.

# load libraries
library(tidyverse)
library(zoo)

# read in tree data and look at first few rows
hfetree <- read.csv('data/HFEIFbytree.csv')
head(hfetree)
##     ID plotnr treat       Date height diameter
## 1 7445     17     C 2008-01-01   2.88     2.18
## 2 9451     25     C 2008-01-01   2.52     1.55
## 3 7066     17     C 2008-01-01   2.34     1.52
## 4 3194      9     C 2008-01-01   2.32     1.56
## 5 2108      7     C 2008-01-01   2.28     1.34
## 6 7120     17     C 2008-01-01   2.26     1.45
# convert dates to 'Date' format
hfetree$Date <- as.Date(hfetree$Date)
unique(hfetree$Date)
##  [1] "2008-01-01" "2008-03-01" "2008-04-01" "2008-05-01" "2008-07-01"
##  [6] "2008-09-01" "2008-10-01" "2008-12-01" "2009-01-01" "2009-03-01"
## [11] "2009-07-01" "2009-12-01" "2010-02-01" "2010-03-01" "2010-09-01"
## [16] "2011-06-01" "2012-04-15"
length(unique(hfetree$Date))
## [1] 17
# convert plot numbers to 'factor')
hfetree$plotnr <- factor(hfetree$plotnr)
length(levels(hfetree$plotnr))
## [1] 16

Now that we have our dataframe ready we can specify groups and calculate group means.

# calculate average diameter for each plot and date
hfeplot <- hfetree %>% 
  group_by(Date, treat, plotnr) %>% 
  summarise(diameter=mean(diameter))
hfeplot
## # A tibble: 272 x 4
## # Groups:   Date, treat [68]
##    Date       treat plotnr diameter
##    <date>     <fct> <fct>     <dbl>
##  1 2008-01-01 C     7         0.902
##  2 2008-01-01 C     9         1.07 
##  3 2008-01-01 C     17        1.25 
##  4 2008-01-01 C     25        0.961
##  5 2008-01-01 F     3         1.58 
##  6 2008-01-01 F     8         0.738
##  7 2008-01-01 F     14        0.46 
##  8 2008-01-01 F     22        1.02 
##  9 2008-01-01 I     2         1.12 
## 10 2008-01-01 I     10        0.852
## # … with 262 more rows
# 17 measurement dates so 17 observations for each plot
summary(hfeplot)
##       Date            treat       plotnr       diameter     
##  Min.   :2008-01-01   C :68   1      : 17   Min.   : 0.460  
##  1st Qu.:2008-07-01   F :68   2      : 17   1st Qu.: 2.744  
##  Median :2009-01-01   I :68   3      : 17   Median : 5.497  
##  Mean   :2009-05-25   IL:68   7      : 17   Mean   : 6.697  
##  3rd Qu.:2010-02-01           8      : 17   3rd Qu.:10.311  
##  Max.   :2012-04-15           9      : 17   Max.   :17.073  
##                               (Other):170

Measurements were taken at intervals with different durations, so we need to calculate increments for both diameter and Date and then divide the former by the latter. We can do this using the lag function from the dplyr package (note that this masks the function of the same name from the stats package). lag works by shifting a vector by certain number of observations (k) – k = 1 means that the vector is shifted so that it starts one observation earlier.

# calculate growth increment between measurement dates
hfegrowth <- hfeplot %>% 
  group_by(plotnr) %>% 
  # remove groups where n <= 1 since can't calculate lags for these
  filter(n() > 1) %>%  
  # sort by Date to ensure calculations correct
  arrange(Date) %>%  
  # calculate increments for diameter and day relative to the current observation, 
  # then calculate rate per day based on increments
  mutate(incr.diam = diameter - lag(diameter, 1), 
         incr.days = as.numeric(Date - lag(Date, 1)), 
         growth.rate = incr.diam / incr.days)

# NAs in the new columns are for the first observations in the data 
# (no available lagged observations)
summary(hfegrowth)
##       Date            treat       plotnr       diameter     
##  Min.   :2008-01-01   C :68   1      : 17   Min.   : 0.460  
##  1st Qu.:2008-07-01   F :68   2      : 17   1st Qu.: 2.744  
##  Median :2009-01-01   I :68   3      : 17   Median : 5.497  
##  Mean   :2009-05-25   IL:68   7      : 17   Mean   : 6.697  
##  3rd Qu.:2010-02-01           8      : 17   3rd Qu.:10.311  
##  Max.   :2012-04-15           9      : 17   Max.   :17.073  
##                               (Other):170                   
##    incr.diam         incr.days       growth.rate       
##  Min.   :-0.4104   Min.   : 28.00   Min.   :-0.007286  
##  1st Qu.: 0.3137   1st Qu.: 31.00   1st Qu.: 0.005069  
##  Median : 0.7988   Median : 61.00   Median : 0.009891  
##  Mean   : 0.8554   Mean   : 97.88   Mean   : 0.011299  
##  3rd Qu.: 1.3340   3rd Qu.:129.75   3rd Qu.: 0.016591  
##  Max.   : 2.5737   Max.   :319.00   Max.   : 0.044474  
##  NA's   :16        NA's   :16       NA's   :16

Joining compatible dataframes

There is a weather station at the experimental site, which gathers data that might be useful for explaining variation in growth rates on each date. These data are measured ona half-hourly basis. HFEmet2008.csv includes weather data from 2008.

# read in weather data and look at summary
hfemet <- read.csv('data/HFEmet2008.csv')
summary(hfemet)
##            DateTime          Tair           AirPress           RH       
##  1/1/2008 0:00 :    1   Min.   :-3.048   Min.   : 95.3   Min.   :14.85  
##  1/1/2008 0:30 :    1   1st Qu.:11.730   1st Qu.:101.3   1st Qu.:57.27  
##  1/1/2008 1:00 :    1   Median :16.690   Median :101.8   Median :79.80  
##  1/1/2008 1:30 :    1   Mean   :16.291   Mean   :101.7   Mean   :73.51  
##  1/1/2008 10:00:    1   3rd Qu.:20.800   3rd Qu.:102.3   3rd Qu.:93.20  
##  1/1/2008 10:30:    1   Max.   :37.740   Max.   :103.4   Max.   :97.60  
##  (Other)       :17562   NA's   :1        NA's   :67      NA's   :1      
##       VPD               PAR              Rain               wind       
##  Min.   :0.01453   Min.   :   0.0   Min.   : 0.00000   Min.   :0.0000  
##  1st Qu.:0.10679   1st Qu.:   0.0   1st Qu.: 0.00000   1st Qu.:0.0000  
##  Median :0.36148   Median :   6.4   Median : 0.00000   Median :0.2380  
##  Mean   :0.66822   Mean   : 352.3   Mean   : 0.05188   Mean   :0.6471  
##  3rd Qu.:0.98770   3rd Qu.: 559.0   3rd Qu.: 0.00000   3rd Qu.:1.1140  
##  Max.   :4.76178   Max.   :2428.7   Max.   :21.20000   Max.   :7.6900  
##  NA's   :1         NA's   :67       NA's   :1          NA's   :1       
##  winddirection    
##  Min.   :  0.086  
##  1st Qu.:112.600  
##  Median :164.300  
##  Mean   :171.106  
##  3rd Qu.:225.900  
##  Max.   :354.500  
##  NA's   :1
# convert dates and times and look at summary, unique dates
# Note, lubridate functions do not work well with %>%
hfemet$DateTime <- mdy_hm(hfemet$DateTime)
hfemet$Date <- as.Date(hfemet$DateTime)
summary(hfemet)
##     DateTime                        Tair           AirPress    
##  Min.   :2008-01-01 00:00:00   Min.   :-3.048   Min.   : 95.3  
##  1st Qu.:2008-04-01 11:52:30   1st Qu.:11.730   1st Qu.:101.3  
##  Median :2008-07-01 23:45:00   Median :16.690   Median :101.8  
##  Mean   :2008-07-01 23:45:00   Mean   :16.291   Mean   :101.7  
##  3rd Qu.:2008-10-01 11:37:30   3rd Qu.:20.800   3rd Qu.:102.3  
##  Max.   :2008-12-31 23:30:00   Max.   :37.740   Max.   :103.4  
##                                NA's   :1        NA's   :67     
##        RH             VPD               PAR              Rain         
##  Min.   :14.85   Min.   :0.01453   Min.   :   0.0   Min.   : 0.00000  
##  1st Qu.:57.27   1st Qu.:0.10679   1st Qu.:   0.0   1st Qu.: 0.00000  
##  Median :79.80   Median :0.36148   Median :   6.4   Median : 0.00000  
##  Mean   :73.51   Mean   :0.66822   Mean   : 352.3   Mean   : 0.05188  
##  3rd Qu.:93.20   3rd Qu.:0.98770   3rd Qu.: 559.0   3rd Qu.: 0.00000  
##  Max.   :97.60   Max.   :4.76178   Max.   :2428.7   Max.   :21.20000  
##  NA's   :1       NA's   :1         NA's   :67       NA's   :1         
##       wind        winddirection          Date           
##  Min.   :0.0000   Min.   :  0.086   Min.   :2008-01-01  
##  1st Qu.:0.0000   1st Qu.:112.600   1st Qu.:2008-04-01  
##  Median :0.2380   Median :164.300   Median :2008-07-01  
##  Mean   :0.6471   Mean   :171.106   Mean   :2008-07-01  
##  3rd Qu.:1.1140   3rd Qu.:225.900   3rd Qu.:2008-10-01  
##  Max.   :7.6900   Max.   :354.500   Max.   :2008-12-31  
##  NA's   :1        NA's   :1
length(unique(hfemet$Date))
## [1] 366

The half-hourly measures are too fine a resolution for trying to explain variation in growth, estimates of average temperature or total precipitation over a longer timespan prior to growth measurements are more valuable here. We can calculate estimates for rolling means and sums using functions in the zoo library.

# calculate a rolling average for Tair, rolling sum for Rain for each month
hferoll <- hfemet %>% 
  # first generate a summary for Tair (mean) and Rain (sum) by Date
  # to ensure that there is only one observation per date, otherwise rolling estimates
  # will be based on 'k' half hourly measurements
  group_by(Date) %>% 
  summarise(Tair=mean(Tair), 
            Rain=sum(Rain)) %>% 
  # then calculate rolling average/sum over a 30-day window prior to the observation date
  # align='right' ensures averages are based on prior observations
  arrange(Date) %>% 
  mutate(Tair.30=rollmean(Tair, 30, fill=NA, na.rm=T, align='right'), 
         Rain.30=rollsum(Rain, 30, fill=NA, na.rm=T, align='right')) %>%
  select(Date, Tair.30, Rain.30)

# NAs in the new columns are for the first 29 observations in the data 
# (no available lagged observations)
summary(hferoll)
##       Date               Tair.30          Rain.30      
##  Min.   :2008-01-01   Min.   : 8.727   Min.   :  1.40  
##  1st Qu.:2008-04-01   1st Qu.:12.187   1st Qu.: 30.80  
##  Median :2008-07-01   Median :16.462   Median : 65.00  
##  Mean   :2008-07-01   Mean   :15.766   Mean   : 74.94  
##  3rd Qu.:2008-09-30   3rd Qu.:20.044   3rd Qu.: 94.20  
##  Max.   :2008-12-31   Max.   :22.739   Max.   :289.60  
##                       NA's   :29       NA's   :29

Now we have two dataframes containing unique information but that can be joined one (or more) variables that are shared between the two. The ‘dplyr’ package has a series of ‘join’ functions that can do this, the choice of function depends on the desired output. Here we are only interested in matching weather data from dates relevant to growth measurements.

# merge growth and weather data using 'join'
# we don't need to keep all observations from 'hferoll' so we use 'left_join'
# and enter 'hfegrowth' as the table on the left 
dfr <- left_join(hfegrowth, hferoll, by='Date')
summary(dfr)
##       Date            treat       plotnr       diameter     
##  Min.   :2008-01-01   C :68   1      : 17   Min.   : 0.460  
##  1st Qu.:2008-07-01   F :68   2      : 17   1st Qu.: 2.744  
##  Median :2009-01-01   I :68   3      : 17   Median : 5.497  
##  Mean   :2009-05-25   IL:68   7      : 17   Mean   : 6.697  
##  3rd Qu.:2010-02-01           8      : 17   3rd Qu.:10.311  
##  Max.   :2012-04-15           9      : 17   Max.   :17.073  
##                               (Other):170                   
##    incr.diam         incr.days       growth.rate           Tair.30      
##  Min.   :-0.4104   Min.   : 28.00   Min.   :-0.007286   Min.   : 9.487  
##  1st Qu.: 0.3137   1st Qu.: 31.00   1st Qu.: 0.005069   1st Qu.:12.539  
##  Median : 0.7988   Median : 61.00   Median : 0.009891   Median :15.129  
##  Mean   : 0.8554   Mean   : 97.88   Mean   : 0.011299   Mean   :16.021  
##  3rd Qu.: 1.3340   3rd Qu.:129.75   3rd Qu.: 0.016591   3rd Qu.:20.030  
##  Max.   : 2.5737   Max.   :319.00   Max.   : 0.044474   Max.   :20.310  
##  NA's   :16        NA's   :16       NA's   :16          NA's   :160     
##     Rain.30      
##  Min.   : 36.60  
##  1st Qu.: 40.00  
##  Median : 68.40  
##  Mean   : 88.11  
##  3rd Qu.:101.00  
##  Max.   :243.60  
##  NA's   :160
# clean up the data, we don't need all variables and NAs are not informative
dfr <- dfr %>% 
  drop_na %>% 
  # naming some variables strategically means that we can use 'tidyselect' helpers
  select(Date, treat, plotnr, growth.rate, ends_with('.30')) %>% 
  # then rename variables for ease of use
  rename(Tair=Tair.30, Rain=Rain.30)
dfr
## # A tibble: 112 x 6
## # Groups:   plotnr [16]
##    Date       treat plotnr growth.rate  Tair  Rain
##    <date>     <fct> <fct>        <dbl> <dbl> <dbl>
##  1 2008-03-01 C     7           0.0143  20.3  244.
##  2 2008-03-01 C     9           0.0194  20.3  244.
##  3 2008-03-01 C     17          0.0195  20.3  244.
##  4 2008-03-01 C     25          0.0165  20.3  244.
##  5 2008-03-01 F     3           0.0229  20.3  244.
##  6 2008-03-01 F     8           0.0199  20.3  244.
##  7 2008-03-01 F     14          0.0136  20.3  244.
##  8 2008-03-01 F     22          0.0238  20.3  244.
##  9 2008-03-01 I     2           0.0206  20.3  244.
## 10 2008-03-01 I     10          0.0130  20.3  244.
## # … with 102 more rows

Now we can plot relationships between growth rate, air temperature and rainfall. Growth rate appears to be positively correlated with both.

# plot growth rate by temperature and treatment
ggplot(dfr, aes(Tair, growth.rate, col=treat)) + 
  geom_point() + 
  xlab(expression(Mean~ ~air~ ~ temperature~ ~(''^'o'*C))) + 
  ylab(expression(Growth~ ~increment~ ~(mm~d^-1)))

ggplot(dfr, aes(Rain, growth.rate, col=treat)) + 
  geom_point() + 
  xlab('Sum rainfall (mm)') + 
  ylab(expression(Growth~ ~increment~ ~(mm~d^-1)))