4 Section 3 Overview
Section 3 introduces you to summarizing with dplyr.
After completing Section 3, you will:
- understand the importance of summarizing data in exploratory data analysis.
- be able to use the “summarize” verb in dplyr to facilitate summarizing data.
- be able to use the “group_by” verb in dplyr to facilitate summarizing data.
- be able to access values using the dot placeholder.
- be able to use “arrange” to examine data after sorting.
4.1 dplyr
The textbook for this section is available here.
Key points
summarize
from the dplyr/tidyverse package computes summary statistics from the data frame. It returns a data frame whose column names are defined within the function call.summarize
can compute any summary function that operates on vectors and returns a single value, but it cannot operate on functions that return multiple values.- Like most dplyr functions,
summarize
is aware of variable names within data frames and can use them directly.
Code
# compute average and standard deviation for males
heights %>%
s <- filter(sex == "Male") %>%
summarize(average = mean(height), standard_deviation = sd(height))
# access average and standard deviation from summary table
$average s
## [1] 69.31475
$standard_deviation s
## [1] 3.611024
# compute median, min and max
%>%
heights filter(sex == "Male") %>%
summarize(median = median(height),
minimum = min(height),
maximum = max(height))
## median minimum maximum
## 1 69 50 82.67717
# alternative way to get min, median, max in base R
quantile(heights$height, c(0, 0.5, 1))
## 0% 50% 100%
## 50.00000 68.50000 82.67717
# generates an error: summarize can only take functions that return a single value
%>%
heights filter(sex == "Male") %>%
summarize(range = quantile(height, c(0, 0.5, 1)))
4.2 The Dot Placeholder
The textbook for this section is available here.
Note that a common replacement for the dot operator is the pull function. Here is the textbook section on the pull function.
Key points
- The dot operator allows you to access values stored in data that is being piped in using the
%>%
character. The dot is a placeholder for the data being passed in through the pipe. - The dot operator allows
dplyr
functions to return single vectors or numbers instead of only data frames. us_murder_rate %>% .$rate
is equivalent tous_murder_rate$rate
.- Note that an equivalent way to extract a single column using the pipe is
us_murder_rate %>% pull(rate)
. The pull function will be used in later course material.
Code
murders %>% mutate(murder_rate = total/population*100000)
murders <-summarize(murders, mean(murder_rate))
## mean(murder_rate)
## 1 2.779125
# calculate US murder rate, generating a data frame
murders %>%
us_murder_rate <- summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
## rate
## 1 3.034555
# extract the numeric US murder rate with the dot operator
%>% .$rate us_murder_rate
## [1] 3.034555
# calculate and extract the murder rate with one pipe
murders %>%
us_murder_rate <- summarize(rate = sum(total) / sum(population * 100000)) %>%
.$rate
4.3 Group By
The textbook for this section is available here.
Key points
- The
group_by
function from dplyr converts a data frame to a grouped data frame, creating groups using one or more variables. summarize
and some other dplyr functions will behave differently on grouped data frames.- Using
summarize
on a grouped data frame computes the summary statistics for each of the separate groups.
Code
# compute separate average and standard deviation for male/female heights
%>%
heights group_by(sex) %>%
summarize(average = mean(height), standard_deviation = sd(height))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## sex average standard_deviation
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
# compute median murder rate in 4 regions of country
murders %>%
murders <- mutate(murder_rate = total/population * 100000)
%>%
murders group_by(region) %>%
summarize(median_rate = median(murder_rate))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 2
## region median_rate
## <fct> <dbl>
## 1 Northeast 1.80
## 2 South 3.40
## 3 North Central 1.97
## 4 West 1.29
4.4 Sorting Data Tables
The textbook for this section is available here.
Key points
- The
arrange
function from dplyr sorts a data frame by a given column. - By default,
arrange
sorts in ascending order (lowest to highest). To instead sort in descending order, use the functiondesc
inside ofarrange
. - You can
arrange
by multiple levels: within equivalent values of the first level, observations are sorted by the second level, and so on. - The
top_n
function shows the top results ranked by a given variable, but the results are not ordered. You can combinetop_n
witharrange
to return the top results in order.
Code
# set up murders object
murders %>%
murders <- mutate(murder_rate = total/population * 100000)
# arrange by population column, smallest to largest
%>% arrange(population) %>% head() murders
## state abb region population total murder_rate
## 1 Wyoming WY West 563626 5 0.8871131
## 2 District of Columbia DC South 601723 99 16.4527532
## 3 Vermont VT Northeast 625741 2 0.3196211
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Alaska AK West 710231 19 2.6751860
## 6 South Dakota SD North Central 814180 8 0.9825837
# arrange by murder rate, smallest to largest
%>% arrange(murder_rate) %>% head() murders
## state abb region population total murder_rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Hawaii HI West 1360301 7 0.5145920
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Iowa IA North Central 3046355 21 0.6893484
## 6 Idaho ID West 1567582 12 0.7655102
# arrange by murder rate in descending order
%>% arrange(desc(murder_rate)) %>% head() murders
## state abb region population total murder_rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
# arrange by region alphabetically, then by murder rate within each region
%>% arrange(region, murder_rate) %>% head() murders
## state abb region population total murder_rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Maine ME Northeast 1328361 11 0.8280881
## 4 Rhode Island RI Northeast 1052567 16 1.5200933
## 5 Massachusetts MA Northeast 6547629 118 1.8021791
## 6 New York NY Northeast 19378102 517 2.6679599
# show the top 10 states with highest murder rate, not ordered by rate
%>% top_n(10, murder_rate) murders
## state abb region population total murder_rate
## 1 Arizona AZ West 6392017 232 3.629527
## 2 Delaware DE South 897934 38 4.231937
## 3 District of Columbia DC South 601723 99 16.452753
## 4 Georgia GA South 9920000 376 3.790323
## 5 Louisiana LA South 4533372 351 7.742581
## 6 Maryland MD South 5773552 293 5.074866
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Missouri MO North Central 5988927 321 5.359892
## 10 South Carolina SC South 4625364 207 4.475323
# show the top 10 states with highest murder rate, ordered by rate
%>% arrange(desc(murder_rate)) %>% top_n(10) murders
## Selecting by murder_rate
## state abb region population total murder_rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Missouri MO North Central 5988927 321 5.359892
## 4 Maryland MD South 5773552 293 5.074866
## 5 South Carolina SC South 4625364 207 4.475323
## 6 Delaware DE South 897934 38 4.231937
## 7 Michigan MI North Central 9883640 413 4.178622
## 8 Mississippi MS South 2967297 120 4.044085
## 9 Georgia GA South 9920000 376 3.790323
## 10 Arizona AZ West 6392017 232 3.629527
4.5 Assessment - Summarizing with dplyr
To practice our dplyr skills we will be working with data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960’s.
Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and then they complete the health examination component of the survey. Part of this dataset is made available via the NHANES package which can be loaded this way:
if(!require(NHANES)) install.packages("NHANES")
## Loading required package: NHANES
library(NHANES)
data(NHANES)
The NHANES data has many missing values. Remember that the main summarization function in R will return NA
if any of the entries of the input vector is an NA
. Here is an example:
data(na_example)
mean(na_example)
## [1] NA
sd(na_example)
## [1] NA
To ignore the NA
s, we can use the na.rm
argument:
mean(na_example, na.rm = TRUE)
## [1] 2.301754
sd(na_example, na.rm = TRUE)
## [1] 1.22338
Try running this code, then let us know you are ready to proceed with the analysis.
- Let’s explore the
NHANES
data. We will be exploring blood pressure in this dataset.
First let’s select a group to set the standard. We will use 20-29 year old females. Note that the category is coded with 20-29
, with a space in front of the 20
! The AgeDecade
is a categorical variable with these ages.
To know if someone is female, you can look at the Gender
variable.
## fill in what is needed
NHANES %>% filter(AgeDecade == " 20-29" & Gender == "female")
tab <-head(tab)
## # A tibble: 6 x 76
## ID SurveyYr Gender Age AgeDecade AgeMonths Race1 Race3 Education MaritalStatus HHIncome HHIncomeMid Poverty HomeRooms HomeOwn Work Weight Length HeadCirc Height BMI
## <int> <fct> <fct> <int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int> <dbl> <int> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51710 2009_10 female 26 " 20-29" 319 White <NA> College … Married "75000-… 87500 5 4 Own Work… 64.9 NA NA 176. 21
## 2 51731 2009_10 female 28 " 20-29" 346 Black <NA> High Sch… NeverMarried " 0-499… 2500 0.07 4 Rent Work… 134. NA NA 170. 46.7
## 3 51741 2009_10 female 21 " 20-29" 253 Black <NA> Some Col… NeverMarried "45000-… 50000 2.04 4 Rent Work… 104. NA NA 166. 37.3
## 4 51741 2009_10 female 21 " 20-29" 253 Black <NA> Some Col… NeverMarried "45000-… 50000 2.04 4 Rent Work… 104. NA NA 166. 37.3
## 5 51760 2009_10 female 27 " 20-29" 334 Hisp… <NA> 9 - 11th… NeverMarried <NA> NA 0.72 4 Rent NotW… 69.9 NA NA 166. 25.5
## 6 51764 2009_10 female 29 " 20-29" 357 White <NA> College … NeverMarried "10000-… 12500 1.11 4 Rent Work… 87.8 NA NA 164. 32.5
## # … with 55 more variables: BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>, BPDia2 <int>, BPSys3 <int>,
## # BPDia3 <int>, Testosterone <dbl>, DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>, UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
## # HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>, LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>, Age1stBaby <int>, SleepHrsNight <int>,
## # SleepTrouble <fct>, PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>, TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
## # AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>, Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>, RegularMarij <fct>,
## # AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>, SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>, SexOrientation <fct>, PregnantNow <fct>
- Now we will compute the average and standard deviation for the subgroup we defined in the previous exercise (20-29 year old females), which we will use reference for what is typical.
You will determine the average and standard deviation of systolic blood pressure, which are stored in the BPSysAve
variable in the NHANES dataset.
## complete this line of code.
NHANES %>% filter(AgeDecade == " 20-29" & Gender == "female") %>% summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm = TRUE))
ref <- ref
## # A tibble: 1 x 2
## average standard_deviation
## <dbl> <dbl>
## 1 108. 10.1
- Now we will repeat the exercise and generate only the average blood pressure for 20-29 year old females.
For this exercise, you should review how to use the place holder .
in dplyr or the pull
function.
## modify the code we wrote for previous exercise.
NHANES %>%
ref_avg <- filter(AgeDecade == " 20-29" & Gender == "female") %>%
summarize(average = mean(BPSysAve, na.rm = TRUE),
standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% .$average
ref_avg
## [1] 108.4224
- Let’s continue practicing by calculating two other data summaries: the minimum and the maximum.
Again we will do it for the BPSysAve
variable and the group of 20-29 year old females.
## complete the line
%>%
NHANES filter(AgeDecade == " 20-29" & Gender == "female") %>% summarize(minbp = min(BPSysAve, na.rm = TRUE),
maxbp = max(BPSysAve, na.rm=TRUE))
## # A tibble: 1 x 2
## minbp maxbp
## <int> <int>
## 1 84 179
- Now let’s practice using the
group_by
function.
What we are about to do is a very common operation in data science: you will split a data table into groups and then compute summary statistics for each group.
We will compute the average and standard deviation of systolic blood pressure for females for each age group separately. Remember that the age groups are contained in AgeDecade
.
##complete the line with group_by and summarize
%>%
NHANES filter(Gender == "female") %>% group_by(AgeDecade) %>% summarize(average = mean(BPSysAve, na.rm = TRUE),
standard_deviation = sd(BPSysAve, na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
## AgeDecade average standard_deviation
## <fct> <dbl> <dbl>
## 1 " 0-9" 100. 9.07
## 2 " 10-19" 104. 9.46
## 3 " 20-29" 108. 10.1
## 4 " 30-39" 111. 12.3
## 5 " 40-49" 115. 14.5
## 6 " 50-59" 122. 16.2
## 7 " 60-69" 127. 17.1
## 8 " 70+" 134. 19.8
## 9 <NA> 142. 22.9
- Now let’s practice using
group_by
some more.
We are going to repeat the previous exercise of calculating the average and standard deviation of systolic blood pressure, but for males instead of females.
This time we will not provide much sample code. You are on your own!
%>%
NHANES filter(Gender == "male") %>% group_by(AgeDecade) %>% summarize(average = mean(BPSysAve, na.rm = TRUE),
standard_deviation = sd(BPSysAve, na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
## AgeDecade average standard_deviation
## <fct> <dbl> <dbl>
## 1 " 0-9" 97.4 8.32
## 2 " 10-19" 110. 11.2
## 3 " 20-29" 118. 11.3
## 4 " 30-39" 119. 12.3
## 5 " 40-49" 121. 14.0
## 6 " 50-59" 126. 17.8
## 7 " 60-69" 127. 17.5
## 8 " 70+" 130. 18.7
## 9 <NA> 136. 23.5
- We can actually combine both of these summaries into a single line of code.
This is because group_by
permits us to group by more than one variable.
We can use group_by(AgeDecade, Gender)
to group by both age decades and gender.
%>% group_by(AgeDecade, Gender) %>% summarize(average = mean(BPSysAve, na.rm = TRUE),
NHANES standard_deviation = sd(BPSysAve, na.rm=TRUE))
## `summarise()` regrouping output by 'AgeDecade' (override with `.groups` argument)
## # A tibble: 18 x 4
## # Groups: AgeDecade [9]
## AgeDecade Gender average standard_deviation
## <fct> <fct> <dbl> <dbl>
## 1 " 0-9" female 100. 9.07
## 2 " 0-9" male 97.4 8.32
## 3 " 10-19" female 104. 9.46
## 4 " 10-19" male 110. 11.2
## 5 " 20-29" female 108. 10.1
## 6 " 20-29" male 118. 11.3
## 7 " 30-39" female 111. 12.3
## 8 " 30-39" male 119. 12.3
## 9 " 40-49" female 115. 14.5
## 10 " 40-49" male 121. 14.0
## 11 " 50-59" female 122. 16.2
## 12 " 50-59" male 126. 17.8
## 13 " 60-69" female 127. 17.1
## 14 " 60-69" male 127. 17.5
## 15 " 70+" female 134. 19.8
## 16 " 70+" male 130. 18.7
## 17 <NA> female 142. 22.9
## 18 <NA> male 136. 23.5
- Now we are going to explore differences in systolic blood pressure across races, as reported in the
Race1
variable.
We will learn to use the arrange
function to order the outcome acording to one variable.
Note that this function can be used to order any table by a given outcome. Here is an example that arranges by systolic blood pressure.
%>% arrange(BPSysAve) NHANES
If we want it in descending order we can use the desc
function like this:
%>% arrange(desc(BPSysAve)) NHANES
In this example, we will compare systolic blood pressure across values of the Race1
variable for males between the ages of 40-49.
%>% filter(AgeDecade == " 40-49" & Gender == "male") %>% group_by(Race1) %>% summarize(average = mean(BPSysAve, na.rm = TRUE), standard_deviation = sd(BPSysAve, na.rm=TRUE)) %>% arrange(average) NHANES
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## Race1 average standard_deviation
## <fct> <dbl> <dbl>
## 1 White 120. 13.4
## 2 Other 120. 16.2
## 3 Hispanic 122. 11.1
## 4 Mexican 122. 13.9
## 5 Black 126. 17.1