3 Section 2 Overview
In Section 2, you will look at the Central Limit Theorem in practice.
After completing Section 2, you will be able to:
- Use the Central Limit Theorem to calculate the probability that a sample estimate \(\bar{X}\) is close to the population proportion \(p\).
- Run a Monte Carlo simulation to corroborate theoretical results built using probability theory.
- Estimate the spread based on estimates of \(\bar{X}\) and \(\hat{\mbox{SE}}(\bar{X})\).
- Understand why bias can mean that larger sample sizes aren’t necessarily better.
3.1 The Central Limit Theorem in Practice
The textbook for this section is available here.
Key points
- Because \(\bar{X}\) is the sum of random draws divided by a constant, the distribution of \(\bar{X}\) is approximately normal.
- We can convert \(\bar{X}\) to a standard normal random variable \(Z\):
\(Z = \frac{\bar{X} - \mbox{E}(\bar{X})}{\mbox{SE}(\bar{X})}\)
- The probability that \(\bar{X}\) is within .01 of the actual value of \(p\) is:
\(\mbox{Pr}(Z \le .01/\sqrt {p(1 - p)/N}) - \mbox{Pr}(Z \le -.01/\sqrt {p(1 - p)/N})\)
- The Central Limit Theorem (CLT) still works if \(\bar{X}\) is used in place of \(p\). This is called a plug-in estimate. Hats over values denote estimates. Therefore:
\(\hat{\mbox{SE}}(\bar{X}) = \sqrt{\bar{X}(1 - \bar{X})/N}\)
Using the CLT, the probability that \(\bar{X}\) is within .01 of the actual value of \(p\) is:
\(\mbox{Pr}(Z \le .01/\sqrt {\bar{X}(1 - \bar{X})/N}) - \mbox{Pr}(Z \le -.01/\sqrt {\bar{X}(1 - \bar{X})/N})\)
Code: Computing the probability of \(\bar{X}\) being within .01 of \(p\)
X_hat <- 0.48
se <- sqrt(X_hat*(1-X_hat)/25)
pnorm(0.01/se) - pnorm(-0.01/se)
## [1] 0.07971926
3.2 Margin of Error
The textbook for this section is available here.
Key points
- The margin of error is defined as 2 times the standard error of the estimate \(\bar{X}\).
- There is about a 95% chance that \(\bar{X}\) will be within two standard errors of the actual parameter \(p\).
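Code: Margin of error for the running example (a minimal sketch, reusing the hypothetical \(\bar{X} = 0.48\) and \(N = 25\) from the code in section 3.1)
X_hat <- 0.48
N <- 25
se_hat <- sqrt(X_hat*(1-X_hat)/N)   # plug-in estimate of SE(X_bar)
2*se_hat                            # margin of error: 2 standard errors
pnorm(2) - pnorm(-2)                # probability of falling within 2 SEs, about 95%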
3.3 A Monte Carlo Simulation for the CLT
The textbook for this section is available here.
Key points
- We can run Monte Carlo simulations to compare with theoretical results assuming a value of \(p\).
- In practice, \(p\) is unknown. We can corroborate theoretical results by running Monte Carlo simulations with one or several values of \(p\).
- One practical choice for \(p\) when modeling is \(\bar{X}\), the observed sample average.
Code: Monte Carlo simulation using a set value of p
p <- 0.45    # unknown p to estimate
N <- 1000

# simulate one poll of size N and determine x_hat
x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
x_hat <- mean(x)

# simulate B polls of size N and determine average x_hat
B <- 10000    # number of replicates
N <- 1000     # sample size per replicate
x_hat <- replicate(B, {
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
  mean(x)
})
Code: Histogram and QQ-plot of Monte Carlo results
if(!require(gridExtra)) install.packages("gridExtra")
library(gridExtra)
p1 <- data.frame(x_hat = x_hat) %>%
  ggplot(aes(x_hat)) +
  geom_histogram(binwidth = 0.005, color = "black")
p2 <- data.frame(x_hat = x_hat) %>%
  ggplot(aes(sample = x_hat)) +
  stat_qq(dparams = list(mean = mean(x_hat), sd = sd(x_hat))) +
  geom_abline() +
  ylab("X_hat") +
  xlab("Theoretical normal")
grid.arrange(p1, p2, nrow = 1)
3.4 The Spread
The textbook for this section is available here.
Key points
- The spread between two outcomes with probabilities \(p\) and \(1 - p\) is \(2p - 1\).
- Our estimate of the spread is \(2\bar{X} - 1\).
- The standard error of the spread is \(2 \hat{\mbox{SE}}(\bar{X})\).
- The margin of error of the spread is 2 times the margin of error of \(\bar{X}\).
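Code: Estimating the spread, its standard error, and its margin of error (a minimal sketch, again using the hypothetical values \(\bar{X} = 0.48\) and \(N = 25\) from section 3.1)
X_hat <- 0.48
N <- 25
se_hat <- sqrt(X_hat*(1-X_hat)/N)   # estimated standard error of X_bar
2*X_hat - 1                         # estimate of the spread
2*se_hat                            # standard error of the spread
2*(2*se_hat)                        # margin of error of the spread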
3.5 Bias: Why Not Run a Very Large Poll?
The textbook for this section is available here.
Key points
- An extremely large poll would theoretically be able to predict election results almost perfectly.
- These sample sizes are not practical. In addition to cost concerns, polling doesn’t reach everyone in the population (eventual voters) with equal probability, and it also may include data from outside our population (people who will not end up voting).
- These systematic errors in polling are called bias. We will learn more about bias in the future.
Code: Plotting margin of error in an extremely large poll over a range of values of p
N <- 100000
p <- seq(0.35, 0.65, length = 100)
SE <- sapply(p, function(x) 2*sqrt(x*(1-x)/N))
data.frame(p = p, SE = SE) %>%
  ggplot(aes(p, SE)) +
  geom_line()
3.6 Assessment - Introduction to Inference
- Write a function called `take_sample` that takes the proportion of Democrats \(p\) and the sample size \(N\) as arguments and returns the sample average of Democrats (1) and Republicans (0).
Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.
# Write a function called `take_sample` that takes `p` and `N` as arguments and returns the average value of a randomly sampled population.
take_sample <- function(p, N) {
  x <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
  return(mean(x))
}
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# Call the `take_sample` function to determine the sample average of `N` randomly selected people from a population containing a proportion of Democrats equal to `p`. Print this value to the console.
take_sample(p, N)
## [1] 0.46
- Assume the proportion of Democrats in the population \(p\) equals 0.45 and that your sample size \(N\) is 100 polled voters.
The `take_sample` function you defined previously generates our estimate, \(\bar{X}\).
Replicate the random sampling 10,000 times and calculate \(p - \bar{X}\) for each random sample. Save these differences as a vector called `errors`. Find the average of `errors` and plot a histogram of the distribution.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Create an object called `errors` that replicates subtracting the result of the `take_sample` function from `p` for `B` replications
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the errors. Print this value to the console.
mean(errors)
## [1] -4.9e-05
hist(errors)
- In the last exercise, you made a vector of differences between the actual value for \(p\) and an estimate, \(\bar{X}\).
We called these differences between the actual and estimated values `errors`.
The `errors` object has already been loaded for you. Use the `hist` function to plot a histogram of the values contained in the vector `errors`. Which statement best describes the distribution of the errors?
- A. The errors are all about 0.05.
- B. The errors are all about -0.05.
- C. The errors are symmetrically distributed around 0.
- D. The errors range from -1 to 1.
- The error \(p - \bar{X}\) is a random variable.
In practice, the error is not observed because we do not know the actual proportion of Democratic voters, \(p\). However, we can describe the size of the error by constructing a simulation.
What is the average size of the error if we define the size by taking the absolute value \(|p - \bar{X}|\)?
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the absolute value of each simulated error. Print this value to the console.
mean(abs(errors))
## [1] 0.039267
- The standard error is related to the typical size of the error we make when predicting. We say size because, as we just saw, the errors are centered around 0. In that sense, the typical error is 0. For mathematical reasons related to the central limit theorem, we actually use the standard deviation of `errors` rather than the average of the absolute values.
As we have discussed, the standard error is the square root of the average squared distance, \(\sqrt{\mbox{E}[(\bar{X} - p)^2]}\); the standard deviation of `errors` is computed the same way, as the square root of the average of the squared errors.
Calculate the standard deviation of `errors`.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the standard deviation of `errors`
sqrt(mean(errors^2))
## [1] 0.04949939
- The theory we just learned tells us what this standard deviation is going to be because it is the standard error of \(\bar{X}\).
Estimate the standard error given an expected value of 0.45 and a sample size of 100.
# Define `p` as the expected value equal to 0.45
p <- 0.45
# Define `N` as the sample size
N <- 100
# Calculate the standard error
sqrt(p*(1-p)/N)
## [1] 0.04974937
- In practice, we don’t know \(p\), so we construct an estimate of the theoretical prediction by plugging in \(\bar{X}\) for \(p\). Calculate the standard error of the estimate: \(\hat{\mbox{SE}}(\bar{X})\).
# Define `p` as a proportion of Democratic voters to simulate
p <- 0.45
# Define `N` as the sample size
N <- 100
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `X` as a random sample of `N` voters with a probability of picking a Democrat ('1') equal to `p`
X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1-p, p))
# Define `X_bar` as the average sampled proportion
X_bar <- mean(X)
# Calculate the standard error of the estimate. Print the result to the console.
se <- sqrt(X_bar*(1-X_bar)/N)
se
## [1] 0.04983974
- The standard error estimates obtained from the Monte Carlo simulation, the theoretical prediction, and the estimate of the theoretical prediction are all very close, which tells us that the theory is working.
This gives us a practical approach to estimating the typical error we will make when we predict \(p\) with \(\bar{X}\). The theoretical result gives us an idea of how large a sample size is required to obtain the precision we need. Earlier we learned that the largest standard errors occur for \(p = 0.5\).
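As a quick side-by-side check (assuming the `errors` vector and the `p`, `N`, and `X_bar` objects from the previous exercises are still defined), the three standard error values can be printed together:
sqrt(mean(errors^2))      # Monte Carlo estimate: 0.04949939
sqrt(p*(1-p)/N)           # theoretical prediction: 0.04974937
sqrt(X_bar*(1-X_bar)/N)   # plug-in estimate: 0.04983974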
Create a plot of the largest standard error for \(N\) ranging from 100 to 5,000.
N <- seq(100, 5000, len = 100)
p <- 0.5
se <- sqrt(p*(1-p)/N)
plot(N, se)
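A numeric companion to the plot (a sketch using the `N`, `p`, and `se` vectors just defined): printing the smallest sample size in the grid whose standard error is at or below 1% answers the question below directly.
min(N[se <= 0.01])   # smallest N in the grid with standard error <= 1%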
Based on this plot, how large does the sample size have to be to have a standard error of about 1%?
- A. 100
- B. 500
- C. 2,500
- D. 4,000
- For \(N = 100\), the central limit theorem tells us that the distribution of \(\bar{X}\) is…
- A. practically equal to \(p\).
- B. approximately normal with expected value \(p\) and standard error \(\sqrt{p(1 − p)/N}\).
- C. approximately normal with expected value \(\bar{X}\) and standard error \(\sqrt{\bar{X}(1 - \bar{X})/N}\).
- D. not a random variable.
- We calculated a vector `errors` that contained, for each simulated sample, the difference between the actual value \(p\) and our estimate, \(\bar{X}\).
The errors \(\bar{X} − p\) are:
- A. practically equal to 0.
- B. approximately normal with expected value 0 and standard error \(\sqrt{p(1 − p)/N}\).
- C. approximately normal with expected value p and standard error \(\sqrt{p(1 − p)/N}\).
- D. not a random variable.
- Make a qq-plot of the `errors` you generated previously to see if they follow a normal distribution.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Generate a qq-plot of `errors` with a qq-line showing a normal distribution
qqnorm(errors)
qqline(errors)
- If \(p = 0.45\) and \(N = 100\), use the central limit theorem to estimate the probability that \(\bar{X} > 0.5\).
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# Calculate the probability that the estimated proportion of Democrats in the population is greater than 0.5. Print this value to the console.
1-pnorm(0.5, p, sqrt(p*(1-p)/N))
## [1] 0.1574393
- Assume you are in a practical situation and you don’t know \(p\).
Take a sample of size \(N = 100\) and obtain a sample average of \(\bar{X} = 0.51\).
What is the CLT approximation for the probability that your error size is equal to or larger than 0.01?
# Define `N` as the number of people polled
N <- 100
# Define `X_hat` as the sample average
X_hat <- 0.51
# Define `se_hat` as the standard error of the sample average
se_hat <- sqrt(X_hat*(1-X_hat)/N)
# Calculate the probability that the error is 0.01 or larger
1-pnorm(0.01,0,se_hat) + pnorm(-0.01,0,se_hat)
## [1] 0.8414493