4 Section 3 Overview

Section 3 introduces you to summarizing with dplyr.

After completing Section 3, you will:

understand the importance of summarizing data in exploratory data analysis.
be able to use the “summarize” verb in dplyr to facilitate summarizing data.
be able to use the “group_by” verb in dplyr to facilitate summarizing data.
be able to access values using the dot placeholder.
be able to use “arrange” to examine data after sorting.

4.1 dplyr

The textbook for this section is available here.

Key points

summarize from the dplyr/tidyverse package computes summary statistics from the data frame. It returns a data frame whose column names are defined within the function call.
summarize can compute any summary function that operates on vectors and returns a single value, but it cannot operate on functions that return multiple values.
Like most dplyr functions, summarize is aware of variable names within data frames and can use them directly.

Code

# compute average and standard deviation for males
s <- heights %>%
    filter(sex == "Male") %>%
    summarize(average = mean(height), standard_deviation = sd(height))
    
# access average and standard deviation from summary table
s$average

## [1] 69.31475

s$standard_deviation

## [1] 3.611024

# compute median, min and max
heights %>%
    filter(sex == "Male") %>%
    summarize(median = median(height),
                       minimum = min(height),
                       maximum = max(height))

##   median minimum  maximum
## 1     69      50 82.67717

# alternative way to get min, median, max in base R
quantile(heights$height, c(0, 0.5, 1))

##       0%      50%     100% 
## 50.00000 68.50000 82.67717

# generates an error: summarize can only take functions that return a single value
heights %>%
    filter(sex == "Male") %>%
    summarize(range = quantile(height, c(0, 0.5, 1)))

4.2 The Dot Placeholder

The textbook for this section is available here.

Note that a common replacement for the dot operator is the pull function. Here is the textbook section on the pull function.

Key points

The dot operator allows you to access values stored in data that is being piped in using the %>% character. The dot is a placeholder for the data being passed in through the pipe.
The dot operator allows dplyr functions to return single vectors or numbers instead of only data frames.
us_murder_rate %>% .$rate is equivalent to us_murder_rate$rate.
Note that an equivalent way to extract a single column using the pipe is us_murder_rate %>% pull(rate). The pull function will be used in later course material.

Code

murders <- murders %>% mutate(murder_rate = total/population*100000)
summarize(murders, mean(murder_rate))

##   mean(murder_rate)
## 1          2.779125

# calculate US murder rate, generating a data frame
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate

##       rate
## 1 3.034555

# extract the numeric US murder rate with the dot operator
us_murder_rate %>% .$rate

## [1] 3.034555

# calculate and extract the murder rate with one pipe
us_murder_rate <- murders %>%
    summarize(rate = sum(total) / sum(population * 100000)) %>%
    .$rate

4.3 Group By

The textbook for this section is available here.

Key points

The group_by function from dplyr converts a data frame to a grouped data frame, creating groups using one or more variables.
summarize and some other dplyr functions will behave differently on grouped data frames.
Using summarize on a grouped data frame computes the summary statistics for each of the separate groups.

Code

# compute separate average and standard deviation for male/female heights
heights %>%
    group_by(sex) %>%
    summarize(average = mean(height), standard_deviation = sd(height))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 3
##   sex    average standard_deviation
##   <fct>    <dbl>              <dbl>
## 1 Female    64.9               3.76
## 2 Male      69.3               3.61

# compute median murder rate in 4 regions of country
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)
murders %>%
    group_by(region) %>%
    summarize(median_rate = median(murder_rate))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##   region        median_rate
##   <fct>               <dbl>
## 1 Northeast            1.80
## 2 South                3.40
## 3 North Central        1.97
## 4 West                 1.29

4.4 Sorting Data Tables

The textbook for this section is available here.

Key points

The arrange function from dplyr sorts a data frame by a given column.
By default, arrange sorts in ascending order (lowest to highest). To instead sort in descending order, use the function desc inside of arrange.
You can arrange by multiple levels: within equivalent values of the first level, observations are sorted by the second level, and so on.
The top_n function shows the top results ranked by a given variable, but the results are not ordered. You can combine top_n with arrange to return the top results in order.

Code

# set up murders object
murders <- murders %>%
    mutate(murder_rate = total/population * 100000)

# arrange by population column, smallest to largest
murders %>% arrange(population) %>% head()

##                  state abb        region population total murder_rate
## 1              Wyoming  WY          West     563626     5   0.8871131
## 2 District of Columbia  DC         South     601723    99  16.4527532
## 3              Vermont  VT     Northeast     625741     2   0.3196211
## 4         North Dakota  ND North Central     672591     4   0.5947151
## 5               Alaska  AK          West     710231    19   2.6751860
## 6         South Dakota  SD North Central     814180     8   0.9825837

# arrange by murder rate, smallest to largest
murders %>% arrange(murder_rate) %>% head()

##           state abb        region population total murder_rate
## 1       Vermont  VT     Northeast     625741     2   0.3196211
## 2 New Hampshire  NH     Northeast    1316470     5   0.3798036
## 3        Hawaii  HI          West    1360301     7   0.5145920
## 4  North Dakota  ND North Central     672591     4   0.5947151
## 5          Iowa  IA North Central    3046355    21   0.6893484
## 6         Idaho  ID          West    1567582    12   0.7655102

# arrange by murder rate in descending order
murders %>% arrange(desc(murder_rate)) %>% head()

##                  state abb        region population total murder_rate
## 1 District of Columbia  DC         South     601723    99   16.452753
## 2            Louisiana  LA         South    4533372   351    7.742581
## 3             Missouri  MO North Central    5988927   321    5.359892
## 4             Maryland  MD         South    5773552   293    5.074866
## 5       South Carolina  SC         South    4625364   207    4.475323
## 6             Delaware  DE         South     897934    38    4.231937

# arrange by region alphabetically, then by murder rate within each region
murders %>% arrange(region, murder_rate) %>% head()

##           state abb    region population total murder_rate
## 1       Vermont  VT Northeast     625741     2   0.3196211
## 2 New Hampshire  NH Northeast    1316470     5   0.3798036
## 3         Maine  ME Northeast    1328361    11   0.8280881
## 4  Rhode Island  RI Northeast    1052567    16   1.5200933
## 5 Massachusetts  MA Northeast    6547629   118   1.8021791
## 6      New York  NY Northeast   19378102   517   2.6679599

# show the top 10 states with highest murder rate, not ordered by rate
murders %>% top_n(10, murder_rate)

##                   state abb        region population total murder_rate
## 1               Arizona  AZ          West    6392017   232    3.629527
## 2              Delaware  DE         South     897934    38    4.231937
## 3  District of Columbia  DC         South     601723    99   16.452753
## 4               Georgia  GA         South    9920000   376    3.790323
## 5             Louisiana  LA         South    4533372   351    7.742581
## 6              Maryland  MD         South    5773552   293    5.074866
## 7              Michigan  MI North Central    9883640   413    4.178622
## 8           Mississippi  MS         South    2967297   120    4.044085
## 9              Missouri  MO North Central    5988927   321    5.359892
## 10       South Carolina  SC         South    4625364   207    4.475323

# show the top 10 states with highest murder rate, ordered by rate
murders %>% arrange(desc(murder_rate)) %>% top_n(10)

## Selecting by murder_rate

##                   state abb        region population total murder_rate
## 1  District of Columbia  DC         South     601723    99   16.452753
## 2             Louisiana  LA         South    4533372   351    7.742581
## 3              Missouri  MO North Central    5988927   321    5.359892
## 4              Maryland  MD         South    5773552   293    5.074866
## 5        South Carolina  SC         South    4625364   207    4.475323
## 6              Delaware  DE         South     897934    38    4.231937
## 7              Michigan  MI North Central    9883640   413    4.178622
## 8           Mississippi  MS         South    2967297   120    4.044085
## 9               Georgia  GA         South    9920000   376    3.790323
## 10              Arizona  AZ          West    6392017   232    3.629527

4.5 Assessment - Summarizing with dplyr

To practice our dplyr skills we will be working with data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960’s.

Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and then they complete the health examination component of the survey. Part of this dataset is made available via the NHANES package which can be loaded this way:

if(!require(NHANES)) install.packages("NHANES")

## Loading required package: NHANES

library(NHANES)
data(NHANES)

The NHANES data has many missing values. Remember that the main summarization function in R will return NA if any of the entries of the input vector is an NA. Here is an example:

data(na_example)
mean(na_example)

## [1] NA

sd(na_example)

## [1] NA

To ignore the NAs, we can use the na.rm argument:

mean(na_example, na.rm = TRUE)

## [1] 2.301754

sd(na_example, na.rm = TRUE)

## [1] 1.22338

Try running this code, then let us know you are ready to proceed with the analysis.

Let’s explore the NHANES data. We will be exploring blood pressure in this dataset.

First let’s select a group to set the standard. We will use 20-29 year old females. Note that the category is coded with 20-29, with a space in front of the 20! The AgeDecade is a categorical variable with these ages.

To know if someone is female, you can look at the Gender variable.

## fill in what is needed
tab <- NHANES %>% filter(AgeDecade == " 20-29" & Gender == "female")
head(tab)

## # A tibble: 6 x 76
##      ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work  Weight Length HeadCirc Height   BMI
##   <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>     <fct>         <fct>          <int>   <dbl>     <int> <fct>   <fct>  <dbl>  <dbl>    <dbl>  <dbl> <dbl>
## 1 51710 2009_10  female    26 " 20-29"        319 White <NA>  College … Married       "75000-…       87500    5            4 Own     Work…   64.9     NA       NA   176.  21  
## 2 51731 2009_10  female    28 " 20-29"        346 Black <NA>  High Sch… NeverMarried  " 0-499…        2500    0.07         4 Rent    Work…  134.      NA       NA   170.  46.7
## 3 51741 2009_10  female    21 " 20-29"        253 Black <NA>  Some Col… NeverMarried  "45000-…       50000    2.04         4 Rent    Work…  104.      NA       NA   166.  37.3
## 4 51741 2009_10  female    21 " 20-29"        253 Black <NA>  Some Col… NeverMarried  "45000-…       50000    2.04         4 Rent    Work…  104.      NA       NA   166.  37.3
## 5 51760 2009_10  female    27 " 20-29"        334 Hisp… <NA>  9 - 11th… NeverMarried   <NA>             NA    0.72         4 Rent    NotW…   69.9     NA       NA   166.  25.5
## 6 51764 2009_10  female    29 " 20-29"        357 White <NA>  College … NeverMarried  "10000-…       12500    1.11         4 Rent    Work…   87.8     NA       NA   164.  32.5
## # … with 55 more variables: BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>,
## #   BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
## #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>, Age1stBaby <int>, SleepHrsNight <int>,
## #   SleepTrouble <fct>, PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>, TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>, Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>, RegularMarij <fct>,
## #   AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>, SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>

Now we will compute the average and standard deviation for the subgroup we defined in the previous exercise (20-29 year old females), which we will use reference for what is typical.

You will determine the average and standard deviation of systolic blood pressure, which are stored in the BPSysAve variable in the NHANES dataset.

## complete this line of code.
ref <- NHANES %>% filter(AgeDecade == " 20-29" & Gender == "female") %>% summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
ref

## # A tibble: 1 x 2
##   average standard_deviation
##     <dbl>              <dbl>
## 1    108.               10.1

Now we will repeat the exercise and generate only the average blood pressure for 20-29 year old females.

For this exercise, you should review how to use the place holder . in dplyr or the pull function.

## modify the code we wrote for previous exercise.
ref_avg <- NHANES %>%
  filter(AgeDecade == " 20-29" & Gender == "female") %>%
  summarize(average = mean(BPSysAve, na.rm = TRUE), 
            standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% .$average
ref_avg

## [1] 108.4224

Let’s continue practicing by calculating two other data summaries: the minimum and the maximum.

Again we will do it for the BPSysAve variable and the group of 20-29 year old females.

## complete the line
NHANES %>%
      filter(AgeDecade == " 20-29"  & Gender == "female") %>% summarize(minbp = min(BPSysAve, na.rm = TRUE), 
            maxbp = max(BPSysAve, na.rm=TRUE))

## # A tibble: 1 x 2
##   minbp maxbp
##   <int> <int>
## 1    84   179

Now let’s practice using the group_by function.

What we are about to do is a very common operation in data science: you will split a data table into groups and then compute summary statistics for each group.

We will compute the average and standard deviation of systolic blood pressure for females for each age group separately. Remember that the age groups are contained in AgeDecade.

##complete the line with group_by and summarize
NHANES %>%
      filter(Gender == "female") %>% group_by(AgeDecade) %>% summarize(average = mean(BPSysAve, na.rm = TRUE), 
            standard_deviation = sd(BPSysAve, na.rm=TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 9 x 3
##   AgeDecade average standard_deviation
##   <fct>       <dbl>              <dbl>
## 1 " 0-9"       100.               9.07
## 2 " 10-19"     104.               9.46
## 3 " 20-29"     108.              10.1 
## 4 " 30-39"     111.              12.3 
## 5 " 40-49"     115.              14.5 
## 6 " 50-59"     122.              16.2 
## 7 " 60-69"     127.              17.1 
## 8 " 70+"       134.              19.8 
## 9  <NA>        142.              22.9

Now let’s practice using group_by some more.

We are going to repeat the previous exercise of calculating the average and standard deviation of systolic blood pressure, but for males instead of females.

This time we will not provide much sample code. You are on your own!

NHANES %>%
      filter(Gender == "male") %>% group_by(AgeDecade) %>% summarize(average = mean(BPSysAve, na.rm = TRUE), 
            standard_deviation = sd(BPSysAve, na.rm=TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 9 x 3
##   AgeDecade average standard_deviation
##   <fct>       <dbl>              <dbl>
## 1 " 0-9"       97.4               8.32
## 2 " 10-19"    110.               11.2 
## 3 " 20-29"    118.               11.3 
## 4 " 30-39"    119.               12.3 
## 5 " 40-49"    121.               14.0 
## 6 " 50-59"    126.               17.8 
## 7 " 60-69"    127.               17.5 
## 8 " 70+"      130.               18.7 
## 9  <NA>       136.               23.5

We can actually combine both of these summaries into a single line of code.

This is because group_by permits us to group by more than one variable.

We can use group_by(AgeDecade, Gender) to group by both age decades and gender.

NHANES %>% group_by(AgeDecade, Gender) %>% summarize(average = mean(BPSysAve, na.rm = TRUE), 
            standard_deviation = sd(BPSysAve, na.rm=TRUE))

## `summarise()` regrouping output by 'AgeDecade' (override with `.groups` argument)

## # A tibble: 18 x 4
## # Groups:   AgeDecade [9]
##    AgeDecade Gender average standard_deviation
##    <fct>     <fct>    <dbl>              <dbl>
##  1 " 0-9"    female   100.                9.07
##  2 " 0-9"    male      97.4               8.32
##  3 " 10-19"  female   104.                9.46
##  4 " 10-19"  male     110.               11.2 
##  5 " 20-29"  female   108.               10.1 
##  6 " 20-29"  male     118.               11.3 
##  7 " 30-39"  female   111.               12.3 
##  8 " 30-39"  male     119.               12.3 
##  9 " 40-49"  female   115.               14.5 
## 10 " 40-49"  male     121.               14.0 
## 11 " 50-59"  female   122.               16.2 
## 12 " 50-59"  male     126.               17.8 
## 13 " 60-69"  female   127.               17.1 
## 14 " 60-69"  male     127.               17.5 
## 15 " 70+"    female   134.               19.8 
## 16 " 70+"    male     130.               18.7 
## 17  <NA>     female   142.               22.9 
## 18  <NA>     male     136.               23.5

Now we are going to explore differences in systolic blood pressure across races, as reported in the Race1 variable.

We will learn to use the arrange function to order the outcome acording to one variable.

Note that this function can be used to order any table by a given outcome. Here is an example that arranges by systolic blood pressure.

NHANES %>% arrange(BPSysAve)

If we want it in descending order we can use the desc function like this:

NHANES %>% arrange(desc(BPSysAve))

In this example, we will compare systolic blood pressure across values of the Race1 variable for males between the ages of 40-49.

NHANES %>% filter(AgeDecade == " 40-49" & Gender == "male") %>% group_by(Race1) %>% summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% arrange(average)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 5 x 3
##   Race1    average standard_deviation
##   <fct>      <dbl>              <dbl>
## 1 White       120.               13.4
## 2 Other       120.               16.2
## 3 Hispanic    122.               11.1
## 4 Mexican     122.               13.9
## 5 Black       126.               17.1