7 Section 6 Overview
In Section 6, you will learn about election forecasting, building on what you’ve learned in the previous sections about statistical modeling and Bayesian statistics.
After completing Section 6, you will be able to:
- Understand how pollsters use hierarchical models to forecast the results of elections.
- Incorporate multiple sources of variability into a mathematical model to make predictions.
- Construct confidence intervals that better model deviations such as those seen in election data using the t-distribution.
7.1 Election Forecasting
The textbook for this section is available here.
Key points
- In our model:
- The spread \(d \sim N(\mu, \tau)\) describes our best guess in the absence of polling data. We set \(\mu = 0\) and \(\tau = 0.035\) using historical data.
- The average of observed data \(\bar{X} \mid d \sim N(d, \sigma)\) describes randomness due to sampling and the pollster effect.
- Because the posterior distribution is normal, we can report a 95% credible interval that has a 95% chance of containing the parameter, computed from \(\mbox{E}(p \mid Y)\) and \(\mbox{SE}(p \mid Y)\).
- Given an estimate of \(\mbox{E}(p \mid Y)\) and \(\mbox{SE}(p \mid Y)\), we can use pnorm to compute the probability that \(d > 0\).
- It is common to see a general bias that affects all pollsters in the same way. This bias cannot be predicted or measured before the election. We will include a term in later models to account for this variability.
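For reference, the posterior quantities referred to above can be written out explicitly. With observed average \(Y\), the normal prior and normal likelihood above give \(\mbox{E}(d \mid Y) = B\mu + (1-B)Y\) with \(B = \sigma^2/(\sigma^2 + \tau^2)\), and \(\mbox{SE}(d \mid Y) = \sqrt{1/(1/\sigma^2 + 1/\tau^2)}\). These are exactly the quantities computed as posterior_mean and posterior_se in the code below.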
Code: Definition of results object
This code defines the results object used for empirical Bayes election forecasting.
library(tidyverse)
library(dslabs)

polls <- polls_us_election_2016 %>%
  filter(state == "U.S." & enddate >= "2016-10-31" &
           (grade %in% c("A+", "A", "A-", "B+") | is.na(grade))) %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

one_poll_per_pollster <- polls %>% group_by(pollster) %>%
  filter(enddate == max(enddate)) %>%
  ungroup()

results <- one_poll_per_pollster %>%
  summarize(avg = mean(spread), se = sd(spread)/sqrt(length(spread))) %>%
  mutate(start = avg - 1.96*se, end = avg + 1.96*se)
Code: Computing the posterior mean, standard error, credible interval and probability
Note that to compute an exact 95% credible interval, we would use qnorm(.975)
instead of 1.96.
mu <- 0
tau <- 0.035
sigma <- results$se
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)

posterior_mean <- B*mu + (1-B)*Y
posterior_se <- sqrt(1 / (1/sigma^2 + 1/tau^2))

posterior_mean
## [1] 0.02808534
posterior_se
## [1] 0.006149604
# 95% credible interval
posterior_mean + c(-1.96, 1.96)*posterior_se
## [1] 0.01603212 0.04013857
# probability of d > 0
1 - pnorm(0, posterior_mean, posterior_se)
## [1] 0.9999975
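As the note above points out, the interval based on 1.96 is approximate; a minimal sketch of the exact version, reusing the posterior quantities just computed:
posterior_mean + qnorm(c(0.025, 0.975)) * posterior_se
The result is essentially identical to the interval reported above because qnorm(.975) is approximately 1.96.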
7.2 Mathematical Representations of Models
The textbook for this section is available here.
Key points
- If we collect several polls with measured spreads \(X_1, \dots, X_J\), each with sample size \(N\), these random variables have expected value \(d\) and standard error \(2\sqrt{p(1-p)/N}\).
- We represent each measurement as \(X_{i,j} = d + b + h_i + \epsilon_{i,j}\) where:
- The index \(i\) represents the different pollsters
- The index \(j\) represents the different polls
- \(X_{i,j}\) is the \(j\)th poll by the \(i\)th pollster
- \(d\) is the actual spread of the election
- \(b\) is the general bias affecting all pollsters
- \(h_i\) represents the house effect for the \(i\)th pollster
- \(\epsilon_{i,j}\) represents the random error associated with the \(i,j\)th poll.
- The sample average is now \(\bar{X} = d + b + \frac{1}{N} \sum\limits_{i=1}^{N} \epsilon_i\) with standard deviation \(\mbox{SE}(\bar{X}) = \sqrt{\sigma^2/N + \sigma_b^2}\).
- The standard error of the general bias \(\sigma_b\) does not get reduced by averaging multiple polls, which increases the variability of our final estimate. A simulation sketch that includes \(b\) follows the code blocks below.
Code: Simulated data with \(X_j = d + \epsilon_j\)
J <- 6
N <- 2000
d <- .021
p <- (d + 1)/2
X <- d + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
Code: Simulated data with \(X_{i,j} = d + \epsilon_{i,j}\)
I <- 5
J <- 6
N <- 2000
d <- .021
p <- (d + 1)/2
X <- sapply(1:I, function(i){
  d + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
})
Code: Simulated data with \(X_{i,j} = d + h_i + \epsilon_{i,j}\)
I <- 5
J <- 6
N <- 2000
d <- .021
p <- (d + 1)/2
h <- rnorm(I, 0, 0.025)    # assume standard error of pollster-to-pollster variability is 0.025
X <- sapply(1:I, function(i){
  d + h[i] + rnorm(J, 0, 2*sqrt(p*(1-p)/N))
})
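None of the simulations above includes the general bias term \(b\). Below is a minimal sketch (not from the textbook) that adds a bias shared by all polls, reusing d, p, and N from above and assuming \(\sigma_b = 0.025\); it shows that the standard deviation of \(\bar{X}\) stays close to \(\sqrt{\sigma^2/(IJ) + \sigma_b^2}\) instead of shrinking as more polls are averaged.
n_sim <- 10000    # number of simulated elections (illustrative choice, not from the source)
I <- 5; J <- 6; N <- 2000
d <- .021
p <- (d + 1)/2
sigma <- 2*sqrt(p*(1-p)/N)
X_bar <- replicate(n_sim, {
  b <- rnorm(1, 0, 0.025)               # one general bias shared by all I*J polls
  mean(d + b + rnorm(I*J, 0, sigma))    # average of the simulated spreads
})
sd(X_bar)    # close to sqrt(sigma^2/(I*J) + 0.025^2), about 0.025, not sigma/sqrt(I*J), about 0.004
Averaging more polls shrinks only the first term; the \(\sigma_b^2\) term remains, which is the point of the last key point above.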
Code: Calculating probability of \(d > 0\) with general bias
Note that sigma
now includes an estimate of the variability due to general bias \(\sigma_b = .025\).
mu <- 0
tau <- 0.035
sigma <- sqrt(results$se^2 + .025^2)
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)

posterior_mean <- B*mu + (1-B)*Y
posterior_se <- sqrt(1 / (1/sigma^2 + 1/tau^2))

1 - pnorm(0, posterior_mean, posterior_se)
## [1] 0.8174373
7.3 Predicting the Electoral College
The textbook for this section is available here.
Key points
- In the US election, each state has a certain number of votes that are won all-or-nothing based on the popular vote result in that state (with minor exceptions not discussed here).
- We use the left_join() function to combine the number of electoral votes with our poll results.
- For each state, we apply a Bayesian approach to generate an Election Day prediction of \(d\). We keep our prior simple by assuming an expected value of 0 and a standard deviation based on recent history of 0.02.
- We can run a Monte Carlo simulation that for each iteration simulates poll results in each state using that state’s average and standard deviation, awards electoral votes for each state to Clinton if the spread is greater than 0, then compares the number of electoral votes won to the number of votes required to win the election (over 269).
- If we run a Monte Carlo simulation for the electoral college without accounting for general bias, we overestimate Clinton’s chances of winning at over 99%.
- If we include a general bias term, the estimated probability of Clinton winning decreases significantly.
Code: Top 5 states ranked by electoral votes
The results_us_election_2016 object is defined in the dslabs package:
head(results_us_election_2016)
## state electoral_votes clinton trump others
## 1 California 55 61.7 31.6 6.7
## 2 Texas 38 43.2 52.2 4.5
## 3 Florida 29 47.8 49.0 3.2
## 4 New York 29 59.0 36.5 4.5
## 5 Illinois 20 55.8 38.8 5.4
## 6 Pennsylvania 20 47.9 48.6 3.6
results_us_election_2016 %>% arrange(desc(electoral_votes)) %>% top_n(5, electoral_votes)
## state electoral_votes clinton trump others
## 1 California 55 61.7 31.6 6.7
## 2 Texas 38 43.2 52.2 4.5
## 3 Florida 29 47.8 49.0 3.2
## 4 New York 29 59.0 36.5 4.5
## 5 Illinois 20 55.8 38.8 5.4
## 6 Pennsylvania 20 47.9 48.6 3.6
Code: Computing the average and standard deviation for each state
results <- polls_us_election_2016 %>%
  filter(state != "U.S." &
           !grepl("CD", state) &
           enddate >= "2016-10-31" &
           (grade %in% c("A+", "A", "A-", "B+") | is.na(grade))) %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
  group_by(state) %>%
  summarize(avg = mean(spread), sd = sd(spread), n = n()) %>%
  mutate(state = as.character(state))
## `summarise()` ungrouping output (override with `.groups` argument)
# 10 closest races = battleground states
results %>% arrange(abs(avg))
## # A tibble: 47 x 4
## state avg sd n
## <chr> <dbl> <dbl> <int>
## 1 Florida 0.00356 0.0163 7
## 2 North Carolina -0.00730 0.0306 9
## 3 Ohio -0.0104 0.0252 6
## 4 Nevada 0.0169 0.0441 7
## 5 Iowa -0.0197 0.0437 3
## 6 Michigan 0.0209 0.0203 6
## 7 Arizona -0.0326 0.0270 9
## 8 Pennsylvania 0.0353 0.0116 9
## 9 New Mexico 0.0389 0.0226 6
## 10 Georgia -0.0448 0.0238 4
## # … with 37 more rows
# joining electoral college votes and results
results <- left_join(results, results_us_election_2016, by = "state")

# states with no polls: note Rhode Island and District of Columbia = Democrat
results_us_election_2016 %>% filter(!state %in% results$state)
## state electoral_votes clinton trump others
## 1 Rhode Island 4 54.4 38.9 6.7
## 2 Alaska 3 36.6 51.3 12.2
## 3 Wyoming 3 21.9 68.2 10.0
## 4 District of Columbia 3 90.9 4.1 5.0
# assigns sd to states with just one poll as median of other sd values
results <- results %>%
  mutate(sd = ifelse(is.na(sd), median(results$sd, na.rm = TRUE), sd))
Code: Calculating the posterior mean and posterior standard error
mu <- 0
tau <- 0.02
results %>% mutate(sigma = sd/sqrt(n),
                   B = sigma^2 / (sigma^2 + tau^2),
                   posterior_mean = B*mu + (1-B)*avg,
                   posterior_se = sqrt(1 / (1/sigma^2 + 1/tau^2))) %>%
  arrange(abs(posterior_mean))
## # A tibble: 47 x 12
## state avg sd n electoral_votes clinton trump others sigma B posterior_mean posterior_se
## <chr> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Florida 0.00356 0.0163 7 29 47.8 49 3.2 0.00618 0.0871 0.00325 0.00590
## 2 North Carolina -0.00730 0.0306 9 15 46.2 49.8 4 0.0102 0.207 -0.00579 0.00909
## 3 Iowa -0.0197 0.0437 3 6 41.7 51.1 7.1 0.0252 0.614 -0.00762 0.0157
## 4 Ohio -0.0104 0.0252 6 18 43.5 51.7 4.8 0.0103 0.210 -0.00823 0.00916
## 5 Nevada 0.0169 0.0441 7 6 47.9 45.5 6.6 0.0167 0.410 0.00995 0.0128
## 6 Michigan 0.0209 0.0203 6 16 47.3 47.5 5.2 0.00827 0.146 0.0179 0.00764
## 7 Arizona -0.0326 0.0270 9 11 45.1 48.7 6.2 0.00898 0.168 -0.0272 0.00820
## 8 New Mexico 0.0389 0.0226 6 5 48.3 40 11.7 0.00921 0.175 0.0321 0.00836
## 9 Georgia -0.0448 0.0238 4 16 45.9 51 3.1 0.0119 0.262 -0.0331 0.0102
## 10 Pennsylvania 0.0353 0.0116 9 20 47.9 48.6 3.6 0.00387 0.0361 0.0341 0.00380
## # … with 37 more rows
Code: Monte Carlo simulation of Election Night results (no general bias)
mu <- 0
tau <- 0.02
clinton_EV <- replicate(1000, {
  results %>% mutate(sigma = sd/sqrt(n),
                     B = sigma^2 / (sigma^2 + tau^2),
                     posterior_mean = B*mu + (1-B)*avg,
                     posterior_se = sqrt(1 / (1/sigma^2 + 1/tau^2)),
                     simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
                     clinton = ifelse(simulated_result > 0, electoral_votes, 0)) %>%    # award votes if Clinton wins state
    summarize(clinton = sum(clinton)) %>%    # total votes for Clinton
    .$clinton + 7    # 7 votes for Rhode Island and DC
})

mean(clinton_EV > 269)    # over 269 votes wins election
## [1] 0.998
# histogram of outcomes
data.frame(clinton_EV) %>%
ggplot(aes(clinton_EV)) +
geom_histogram(binwidth = 1) +
geom_vline(xintercept = 269)
Code: Monte Carlo simulation including general bias
mu <- 0
tau <- 0.02
bias_sd <- 0.03
clinton_EV_2 <- replicate(1000, {
  results %>% mutate(sigma = sqrt(sd^2/n + bias_sd^2),    # added bias_sd term
                     B = sigma^2 / (sigma^2 + tau^2),
                     posterior_mean = B*mu + (1-B)*avg,
                     posterior_se = sqrt(1 / (1/sigma^2 + 1/tau^2)),
                     simulated_result = rnorm(length(posterior_mean), posterior_mean, posterior_se),
                     clinton = ifelse(simulated_result > 0, electoral_votes, 0)) %>%    # award votes if Clinton wins state
    summarize(clinton = sum(clinton)) %>%    # total votes for Clinton
    .$clinton + 7    # 7 votes for Rhode Island and DC
})

mean(clinton_EV_2 > 269)    # over 269 votes wins election
## [1] 0.832
7.4 Forecasting
The textbook for this section is available here.
Key points
- In poll results, \(p\) is not fixed over time. Variability within a single pollster comes from time variation.
- In order to forecast, our model must include a bias term \(b_t\) to model the time effect.
- Pollsters also try to estimate \(f(t)\), the trend of \(p\) given time \(t\) using a model like:
\(Y_{i, j, t} = d + b + h_j + b_t + f(t) + \epsilon_{i, j, t}\)
- Once we decide on a model, we can use historical data and current data to estimate the necessary parameters to make predictions.
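The notes do not include code for this full forecasting model. The sketch below simulates only the time components, \(Y_t = d + b_t + f(t) + \epsilon_t\), with an assumed smooth trend \(f(t)\) and a weekly time bias \(b_t\); the functional form of \(f\) and all standard deviations are illustrative assumptions, not estimates from the data.
set.seed(2)
d <- .021
days <- 100                                      # days of polling before the election (assumed)
f <- function(t) 0.01 * sin(2*pi*t/days)         # assumed smooth trend in the spread
b_t <- rep(rnorm(ceiling(days/7), 0, 0.01), each = 7)[1:days]   # assumed weekly time bias
Y <- d + b_t + f(1:days) + rnorm(days, 0, 0.02)  # one simulated poll per day, sampling sd 0.02 (assumed)
plot(1:days, Y, xlab = "day", ylab = "simulated spread")
lines(1:days, d + f(1:days), lwd = 2)            # the underlying trend d + f(t)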
Code: Variability across one pollster
# select all national polls by one pollster
one_pollster <- polls_us_election_2016 %>%
  filter(pollster == "Ipsos" & state == "U.S.") %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

# the observed standard error is higher than theory predicts
se <- one_pollster %>%
  summarize(empirical = sd(spread),
            theoretical = 2*sqrt(mean(spread)*(1-mean(spread))/min(samplesize)))
se
## empirical theoretical
## 1 0.04025194 0.03256719
# the distribution of the data is not normal
one_pollster %>% ggplot(aes(spread)) +
  geom_histogram(binwidth = 0.01, color = "black")
Code: Trend across time for several pollsters
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate >= "2016-07-01") %>%
group_by(pollster) %>%
filter(n() >= 10) %>%
ungroup() %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
ggplot(aes(enddate, spread)) +
geom_smooth(method = "loess", span = 0.1) +
geom_point(aes(color = pollster), show.legend = FALSE, alpha = 0.6)
## `geom_smooth()` using formula 'y ~ x'
Code: Plotting raw percentages across time
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate >= "2016-07-01") %>%
select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
gather(candidate, percentage, -enddate, -pollster) %>%
mutate(candidate = factor(candidate, levels = c("Trump", "Clinton"))) %>%
group_by(pollster) %>%
filter(n() >= 10) %>%
ungroup() %>%
ggplot(aes(enddate, percentage, color = candidate)) +
geom_point(show.legend = FALSE, alpha = 0.4) +
geom_smooth(method = "loess", span = 0.15) +
scale_y_continuous(limits = c(30, 50))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
7.5 Assessment - Election Forecasting
- For each poll in the polling data set, use the CLT to create a 95% confidence interval for the spread.
Create a new table called cis
that contains columns for the lower and upper limits of the confidence intervals.
# Create a table called `polls` that filters by state, date, and reports the spread
polls <- polls_us_election_2016 %>%
  filter(state != "U.S." & enddate >= "2016-10-31") %>%
  mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)

# Create an object called `cis` that has the columns indicated in the instructions
cis <- polls %>%
  mutate(X_hat = (spread + 1)/2, se = 2*sqrt(X_hat*(1-X_hat)/samplesize),
         lower = spread - qnorm(0.975)*se, upper = spread + qnorm(0.975)*se) %>%
  select(state, startdate, enddate, pollster, grade, spread, lower, upper)
- You can add the final result to the cis table you just created using the left_join function as shown in the sample code.
Now determine how often the 95% confidence interval includes the actual result.
# Add the actual results to the `cis` data set
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")

# Create an object called `p_hits` that summarizes the proportion of confidence intervals that contain the actual value. Print this object to the console.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>% summarize(proportion_hits = mean(hit))
p_hits
## proportion_hits
## 1 0.66133
- Now find the proportion of hits for each pollster.
Show only pollsters with at least 5 polls and order them from best to worst. Show the number of polls conducted by each pollster and the FiveThirtyEight grade of each pollster.
# The `cis` data have already been loaded for you
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")

# Create an object called `p_hits` that summarizes the proportion of hits for each pollster that has at least 5 polls.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>% group_by(pollster) %>% filter(n() >= 5) %>% summarize(proportion_hits = mean(hit), n = n(), grade = grade[1]) %>% arrange(desc(proportion_hits))
## `summarise()` ungrouping output (override with `.groups` argument)
p_hits
## # A tibble: 13 x 4
## pollster proportion_hits n grade
## <fct> <dbl> <int> <fct>
## 1 Quinnipiac University 1 6 A-
## 2 Emerson College 0.909 11 B
## 3 Public Policy Polling 0.889 9 B+
## 4 University of New Hampshire 0.857 7 B+
## 5 Ipsos 0.807 119 A-
## 6 Mitchell Research & Communications 0.8 5 D
## 7 Gravis Marketing 0.783 23 B-
## 8 Trafalgar Group 0.778 9 C
## 9 Rasmussen Reports/Pulse Opinion Research 0.774 31 C+
## 10 Remington 0.667 9 <NA>
## 11 Google Consumer Surveys 0.588 102 B
## 12 SurveyMonkey 0.577 357 C-
## 13 YouGov 0.544 57 B
- Repeat the previous exercise, but instead of pollster, stratify by state. Here we can’t show grades.
# The `cis` data have already been loaded for you
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")

# Create an object called `p_hits` that summarizes the proportion of hits for each state that has more than 5 polls.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>% group_by(state) %>% filter(n() >= 5) %>% summarize(proportion_hits = mean(hit), n = n()) %>% arrange(desc(proportion_hits))
## `summarise()` ungrouping output (override with `.groups` argument)
p_hits
## # A tibble: 51 x 3
## state proportion_hits n
## <chr> <dbl> <int>
## 1 Connecticut 1 13
## 2 Delaware 1 12
## 3 Rhode Island 1 10
## 4 New Mexico 0.941 17
## 5 Washington 0.933 15
## 6 Oregon 0.929 14
## 7 Illinois 0.923 13
## 8 Nevada 0.923 26
## 9 Maine 0.917 12
## 10 Montana 0.917 12
## # … with 41 more rows
- Make a barplot based on the result from the previous exercise.
# The `p_hits` data have already been loaded for you. Use the `head` function to examine it.
head(p_hits)
## # A tibble: 6 x 3
## state proportion_hits n
## <chr> <dbl> <int>
## 1 Connecticut 1 13
## 2 Delaware 1 12
## 3 Rhode Island 1 10
## 4 New Mexico 0.941 17
## 5 Washington 0.933 15
## 6 Oregon 0.929 14
# Make a barplot of the proportion of hits for each state
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
  ggplot(aes(state, proportion_hits)) +
  geom_bar(stat = "identity") +
  coord_flip()
- Even if a forecaster's confidence intervals are incorrect, the overall predictions will do better if they correctly call the winner.
Add two columns to the cis
table by computing, for each poll, the difference between the predicted spread and the actual spread, and define a column hit
that is true if the signs are the same.
# The `cis` data have already been loaded. Examine it using the `head` function.
head(cis)
## state startdate enddate pollster grade spread lower upper
## 1 New Mexico 2016-11-06 2016-11-06 Zia Poll <NA> 0.02 -0.001331221 0.0413312213
## 2 Virginia 2016-11-03 2016-11-04 Public Policy Polling B+ 0.05 -0.005634504 0.1056345040
## 3 Iowa 2016-11-01 2016-11-04 Selzer & Company A+ -0.07 -0.139125210 -0.0008747905
## 4 Wisconsin 2016-10-26 2016-10-31 Marquette University A 0.06 0.004774064 0.1152259363
## 5 North Carolina 2016-11-04 2016-11-06 Siena College A 0.00 -0.069295191 0.0692951912
## 6 Georgia 2016-11-06 2016-11-06 Landmark Communications B -0.03 -0.086553820 0.0265538203
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - ci_data$actual_spread, hit = sign(spread) == sign(ci_data$actual_spread))
# Examine the last 6 rows of `errors`
tail(errors)
## state startdate enddate pollster grade spread lower upper error hit
## 807 Utah 2016-10-04 2016-11-06 YouGov B -0.0910 -0.1660704570 -0.01592954 0.0890 TRUE
## 808 Utah 2016-10-25 2016-10-31 Google Consumer Surveys B -0.0121 -0.1373083389 0.11310834 0.1679 TRUE
## 809 South Dakota 2016-10-28 2016-11-02 Ipsos A- -0.1875 -0.3351563485 -0.03984365 0.1105 TRUE
## 810 Washington 2016-10-21 2016-11-02 Ipsos A- 0.0838 -0.0004028265 0.16800283 -0.0782 TRUE
## 811 Utah 2016-11-01 2016-11-07 Google Consumer Surveys B -0.1372 -0.2519991224 -0.02240088 0.0428 TRUE
## 812 Oregon 2016-10-21 2016-11-02 Ipsos A- 0.0905 -0.0019261469 0.18292615 -0.0195 TRUE
- Create an object called
p_hits
that contains the proportion of instances when the sign of the actual spread matches the predicted spread for states with 5 or more polls.
Make a barplot based on the result from the previous exercise that shows the proportion of times the sign of the spread matched the actual result for the data in p_hits
.
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - ci_data$actual_spread, hit = sign(spread) == sign(ci_data$actual_spread))

# Create an object called `p_hits` that summarizes the proportion of hits for each state that has 5 or more polls
p_hits <- errors %>% group_by(state) %>% filter(n() >= 5) %>% summarize(proportion_hits = mean(hit), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
# Make a barplot of the proportion of hits for each state
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
  ggplot(aes(state, proportion_hits)) +
  geom_bar(stat = "identity") +
  coord_flip()
- In the previous graph, we see that most states’ polls predicted the correct winner 100% of the time.
Only a few states' polls were incorrect more than 25% of the time. Wisconsin got every single poll wrong. In Pennsylvania and Michigan, more than 90% of the polls had the signs wrong.
Make a histogram of the errors. What is the median of these errors?
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
## state startdate enddate pollster grade spread lower upper error hit
## 1 New Mexico 2016-11-06 2016-11-06 Zia Poll <NA> 0.02 -0.001331221 0.0413312213 -0.063 TRUE
## 2 Virginia 2016-11-03 2016-11-04 Public Policy Polling B+ 0.05 -0.005634504 0.1056345040 -0.004 TRUE
## 3 Iowa 2016-11-01 2016-11-04 Selzer & Company A+ -0.07 -0.139125210 -0.0008747905 0.024 TRUE
## 4 Wisconsin 2016-10-26 2016-10-31 Marquette University A 0.06 0.004774064 0.1152259363 0.067 FALSE
## 5 North Carolina 2016-11-04 2016-11-06 Siena College A 0.00 -0.069295191 0.0692951912 0.036 FALSE
## 6 Georgia 2016-11-06 2016-11-06 Landmark Communications B -0.03 -0.086553820 0.0265538203 0.021 TRUE
# Generate a histogram of the error
hist(errors$error)
# Calculate the median of the errors. Print this value to the console.
median(errors$error)
## [1] 0.037
- We see that, at the state level, the median error was slightly in favor of Clinton. The distribution is not centered at 0, but at 0.037. This value represents the general bias we described in an earlier section.
Create a boxplot to examine if the bias was general to all states or if it affected some states differently. Filter the data to include only pollsters with grades B+ or higher.
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
## state startdate enddate pollster grade spread lower upper error hit
## 1 New Mexico 2016-11-06 2016-11-06 Zia Poll <NA> 0.02 -0.001331221 0.0413312213 -0.063 TRUE
## 2 Virginia 2016-11-03 2016-11-04 Public Policy Polling B+ 0.05 -0.005634504 0.1056345040 -0.004 TRUE
## 3 Iowa 2016-11-01 2016-11-04 Selzer & Company A+ -0.07 -0.139125210 -0.0008747905 0.024 TRUE
## 4 Wisconsin 2016-10-26 2016-10-31 Marquette University A 0.06 0.004774064 0.1152259363 0.067 FALSE
## 5 North Carolina 2016-11-04 2016-11-06 Siena College A 0.00 -0.069295191 0.0692951912 0.036 FALSE
## 6 Georgia 2016-11-06 2016-11-06 Landmark Communications B -0.03 -0.086553820 0.0265538203 0.021 TRUE
# Create a boxplot showing the errors by state for polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
  mutate(state = reorder(state, error)) %>%
  ggplot(aes(state, error)) +
  geom_boxplot() +
  geom_point()
- Some of these states only have a few polls.
Repeat the previous exercise to plot the errors for each state, but only include states with five good polls or more.
# The `errors` data have already been loaded. Examine them using the `head` function.
head(errors)
## state startdate enddate pollster grade spread lower upper error hit
## 1 New Mexico 2016-11-06 2016-11-06 Zia Poll <NA> 0.02 -0.001331221 0.0413312213 -0.063 TRUE
## 2 Virginia 2016-11-03 2016-11-04 Public Policy Polling B+ 0.05 -0.005634504 0.1056345040 -0.004 TRUE
## 3 Iowa 2016-11-01 2016-11-04 Selzer & Company A+ -0.07 -0.139125210 -0.0008747905 0.024 TRUE
## 4 Wisconsin 2016-10-26 2016-10-31 Marquette University A 0.06 0.004774064 0.1152259363 0.067 FALSE
## 5 North Carolina 2016-11-04 2016-11-06 Siena College A 0.00 -0.069295191 0.0692951912 0.036 FALSE
## 6 Georgia 2016-11-06 2016-11-06 Landmark Communications B -0.03 -0.086553820 0.0265538203 0.021 TRUE
# Create a boxplot showing the errors by state for states with at least 5 polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
  group_by(state) %>% filter(n() >= 5) %>% ungroup() %>%
  mutate(state = reorder(state, error)) %>%
  ggplot(aes(state, error)) +
  geom_boxplot() +
  geom_point()
7.6 The t-Distribution
The textbook for this section is available here.
Key points
- In models where we must estimate two parameters, \(p\) and \(\sigma\), the Central Limit Theorem can result in overconfident confidence intervals for sample sizes smaller than approximately 30.
- If the population data are known to follow a normal distribution, theory tells us how much larger to make confidence intervals to account for estimation of \(\sigma\).
- Given \(s\) as an estimate of \(\sigma\), then \(Z = \frac{\bar{X} - d}{s/\sqrt{N}}\) follows a t-distribution with \(N - 1\) degrees of freedom.
- Degrees of freedom determine the weight of the tails of the distribution. Small values of degrees of freedom lead to increased probabilities of extreme values.
- We can determine confidence intervals using the t-distribution instead of the normal distribution by calculating the desired quantile with the function qt().
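To see the effect of the degrees of freedom on the tails directly, here is a quick sketch (not part of the course code) overlaying t densities on the standard normal:
curve(dnorm(x), -5, 5, ylab = "density")
curve(dt(x, df = 3), add = TRUE, lty = 2)     # 3 degrees of freedom: noticeably heavier tails
curve(dt(x, df = 15), add = TRUE, lty = 3)    # 15 degrees of freedom: much closer to the normal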
Code: Calculating 95% confidence intervals with the t-distribution
z <- qt(0.975, nrow(one_poll_per_pollster) - 1)
one_poll_per_pollster %>%
  summarize(avg = mean(spread), moe = z*sd(spread)/sqrt(length(spread))) %>%
  mutate(start = avg - moe, end = avg + moe)
## # A tibble: 1 x 4
## avg moe start end
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0290 0.0134 0.0156 0.0424
# quantile from t-distribution versus normal distribution
qt(0.975, 14) # 14 = nrow(one_poll_per_pollster) - 1
## [1] 2.144787
qnorm(0.975)
## [1] 1.959964
7.7 Assessment - The t-Distribution
- We know that, with a normal distribution, only 5% of values are more than 2 standard deviations away from the mean.
Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when the degrees of freedom are 3.
# Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when 'df = 3'.
1 - pt(2, df = 3) + pt(-2, df = 3)
## [1] 0.139326
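For comparison (not part of the graded answer), the same two-sided calculation under the normal distribution gives about 0.046, so with 3 degrees of freedom the t-distribution places roughly three times as much probability beyond ±2:
1 - pnorm(2) + pnorm(-2)    # about 0.0455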
- Now use
sapply
to compute the same probability for degrees of freedom from 3 to 50.
Make a plot and notice when this probability converges to the normal distribution’s 5%.
# Generate a vector 'df' that contains a sequence of numbers from 3 to 50
df <- c(3:50)

# Make a function called 'pt_func' that calculates the probability that a value is more than |2| for any degrees of freedom
pt_func <- function(f) {
  1 - pt(2, f) + pt(-2, f)
}

# Generate a vector 'probs' that uses the `pt_func` function to calculate the probabilities
probs <- sapply(df, pt_func)
# Plot 'df' on the x-axis and 'probs' on the y-axis
plot(df, probs)
- In a previous section, we repeatedly took random samples of 50 heights from a distribution of heights.
We noticed that about 95% of the samples had confidence intervals spanning the true population mean.
Re-do this Monte Carlo simulation, but now instead of \(N = 50\), use \(N = 15\). Notice what happens to the proportion of hits.
# Use the sample code to generate 'x', a vector of male heights
x <- heights %>% filter(sex == "Male") %>%
  .$height

# Create variables for the mean height 'mu', the sample size 'N', and the number of times the simulation should run 'B'
mu <- mean(x)
N <- 15
B <- 10000

# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)

# Generate a logical vector 'res' that contains the results of the simulations
res <- replicate(B, {
  X <- sample(x, N, replace = TRUE)
  interval <- mean(X) + c(-1, 1)*qnorm(0.975)*sd(X)/sqrt(N)
  between(mu, interval[1], interval[2])
})

# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
p_hit <- mean(res)
p_hit
## [1] 0.9331
- \(N = 15\) is not that big. We know that heights are normally distributed, so the t-distribution should apply.
Repeat the previous Monte Carlo simulation using the t-distribution instead of using the normal distribution to construct the confidence intervals.
What are the proportion of 95% confidence intervals that span the actual mean height now?
# The vector of filtered heights 'x' has already been loaded for you. Calculate the mean.
mu <- mean(x)

# Use the same sampling parameters as in the previous exercise.
set.seed(1)
N <- 15
B <- 10000

# Generate a logical vector 'res' that contains the results of the simulations using the t-distribution
res <- replicate(B, {
  X <- sample(x, N, replace = TRUE)
  interval <- mean(X) + c(-1, 1)*qt(0.975, N - 1)*sd(X)/sqrt(N)
  between(mu, interval[1], interval[2])
})

# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
p_hit <- mean(res)
p_hit
## [1] 0.9512
- Why did the t-distribution confidence intervals work so much better?
- A. The t-distribution takes the variability into account and generates larger confidence intervals.
- B. Because the t-distribution shifts the intervals in the direction towards the actual mean.
- C. This was just a chance occurrence. If we run it again, the CLT will work better.
- D. The t-distribution is always a better approximation than the normal distribution.