Chapter 10: Confidence Interval Estimation

The aim of taking a sample, as we discussed in Chapter 9, is to obtain an accurate reading of the population. However, a sample will rarely be as accurate as a measurement of the entire population. This is an important aspect of randomization control.

In this chapter, we will explore the calculations behind sampling. Inferential statistics draws its conclusions about a population from sample data. There are many contexts in which inference is desirable, and there are many approaches to performing inference.

A number calculated from the entire population and a number calculated from a sample are given different names: parameter and statistic.

Parameter: A number that describes the population. It is fixed, but we rarely know it. Examples include the true proportion of all American adults who visit the library to browse for a new book or the true mean of all residents of New York City who have a library card.

Statistic: A number that describes the sample. This value is known since it is produced by our sample data, but it can vary from sample to sample. For example, if we calculated the mean of a random sample of 1,000 members of the New York City Public Library, this mean would most likely differ from the mean calculated from another random sample of 1,000 members.
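The distinction can be illustrated with a small simulation. The sketch below builds a hypothetical population (the values are made up and are not library data), computes its mean as the parameter, and then draws two random samples of 1,000 to show that the statistic changes from sample to sample while the parameter stays fixed.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 made-up values (for example, ages of card holders).
population = [random.gauss(40, 12) for _ in range(100_000)]

# Parameter: describes the whole population; fixed, but rarely known in practice.
parameter = statistics.mean(population)

# Statistics: each describes one random sample of 1,000 and varies between samples.
sample_a = random.sample(population, 1_000)
sample_b = random.sample(population, 1_000)
statistic_a = statistics.mean(sample_a)
statistic_b = statistics.mean(sample_b)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic from sample A:     {statistic_a:.2f}")
print(f"statistic from sample B:     {statistic_b:.2f}")
```

The two sample means should land close to the population mean without matching it or each other exactly.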

We can estimate a population parameter in two ways:
(1) A point estimate is a single number used as the best guess for the parameter.
(2) A confidence interval is a range of values that provides additional information about the variability of the estimate.

A confidence interval gives an estimated range of values that is likely to include an unknown population parameter of interest. The range is calculated from a given set of sample data.

Confidence intervals are based on a confidence level, such as 95%, selected by the analyst. A 95% level means that if the same population is sampled on numerous occasions and an interval estimate is made on each occasion, the resulting intervals would bracket the true population parameter in approximately 95% of the cases. A confidence level stated as 1 − α can be thought of as the complement of a significance level, α.
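The repeated-sampling idea can be checked with a short simulation. The sketch below is illustrative only: the population mean, the standard deviation, and the sample size are hypothetical values, and the population standard deviation is assumed to be known so that a normal critical value can be used.

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_MEAN = 50.0   # hypothetical population parameter (unknown in real studies)
SIGMA = 10.0       # hypothetical population standard deviation, assumed known
N = 100            # size of each sample
ALPHA = 0.05       # significance level, so the confidence level is 1 - ALPHA
REPEATS = 2_000    # number of repeated samples

# Two-sided critical value from the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(1 - ALPHA / 2)
margin = z * SIGMA / N ** 0.5

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = statistics.mean(sample)
    # Count the interval if it brackets the true parameter.
    if mean - margin <= TRUE_MEAN <= mean + margin:
        covered += 1

coverage = covered / REPEATS
print(f"critical value: {z:.3f}")
print(f"share of intervals containing the true mean: {coverage:.3f}")
```

The printed share should be close to 0.95, though it will rarely equal it exactly, because the simulation itself is a finite sample of intervals.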

Suppose we want to estimate a parameter (e.g., population proportion, population average, etc.). The first thing to notice is that it would be impossible to exactly pinpoint the value with 100% accuracy without sampling every single member of the population, since there would always be some uncertainty. As a result, the best we can do is to make a guess at the true value, and then include a margin of error based on a certain level of confidence we have in our results.

What does a 95% confidence interval tell us?

Let’s assume we conducted interviews with 100 members of a local library and asked them if they have read the latest novel from J. K. Rowling. If α = 0.05, then the confidence level is 1 − α = 0.95, or 95%. Suppose the 95% confidence interval for a population mean μ computed from the sample runs from 72.85 to 107.15. These two values are the lower and upper limits of the interval, so there is good reason to believe that the population mean lies between them: 72.85 < μ < 107.15. If repeated samples were taken and a 95% confidence interval was computed for each sample, about 95% of the intervals would contain the population mean and about 5% would not.

How do we interpret a confidence interval?

Let’s return to the earlier example of library members and the latest novel from J. K. Rowling, but now suppose we want to estimate the percentage of members who own the book. Of our sample of 100 members, 15 have already purchased it. This gives a point estimate of 15%, or 0.15, for the population proportion.

Suppose also that we use a critical value of 1.645, which corresponds to a 90% confidence level, and a standard error of 0.0357. In this case, the confidence interval (CI) will be:
CI = (Point Estimate) ± (Margin of Error) = (Point Estimate) ± (Critical Value) × (Standard Error)
CI = 0.15 ± (1.645 × 0.0357) = 0.15 ± 0.0587

This gives the interval (0.0913, 0.2087). The following statements are equivalent ways to interpret this interval:

1. We are 90% confident that the true percentage of all public library members who own the latest book by J. K. Rowling is between 9.13% and 20.87%.

2. If we repeatedly took different samples consisting of 100 members and computed a CI for each of those samples, about 90% of the computed intervals would cover the true percentage of all public library members who own the book.

3. There is a 90% chance that an interval computed this way, such as (0.0913, 0.2087), covers the true percentage of all public library members who own J. K. Rowling’s new novel.
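The arithmetic of this example can be reproduced in a few lines. The sketch below takes the sample size, the count of 15, and the critical value of 1.645 from the text. The standard error formula, the square root of p(1 − p)/n, is not stated in the text; it is the usual formula for a sample proportion and is assumed here because it reproduces the quoted value of 0.0357.

```python
import math

n = 100                  # sample size from the example
successes = 15           # sampled members who purchased the book
critical_value = 1.645   # critical value quoted in the text

p_hat = successes / n    # point estimate: 0.15

# Assumed formula: standard error of a sample proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = critical_value * standard_error

lower = p_hat - margin_of_error
upper = p_hat + margin_of_error

print(f"standard error:  {standard_error:.4f}")
print(f"margin of error: {margin_of_error:.4f}")
print(f"interval:        ({lower:.4f}, {upper:.4f})")
```

Rounded to four decimal places, the results match the standard error, margin of error, and interval limits given in the example.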

A general note: Depending on the parameter you want to estimate, formulas for the point estimate, critical value, and standard error will change. However, the format of a confidence interval is always the same.
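One way to picture this shared format is a single function that builds the interval, with the standard error supplied by a formula that depends on the parameter. The sketch below is a minimal illustration rather than a complete toolkit: the function names are hypothetical, the two standard error formulas are the usual ones for a proportion and a mean, and the data in the second example are made up.

```python
import math
import statistics


def confidence_interval(point_estimate, critical_value, standard_error):
    """Return (lower, upper) as point estimate ± critical value × standard error."""
    margin = critical_value * standard_error
    return point_estimate - margin, point_estimate + margin


def proportion_standard_error(p_hat, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


def mean_standard_error(sample):
    """Standard error of a sample mean, using the sample standard deviation."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


# Proportion: 15 of 100 sampled members, critical value 1.645 as in the example above.
print(confidence_interval(0.15, 1.645, proportion_standard_error(0.15, 100)))

# Mean: made-up counts of books borrowed by 10 members. The value 2.262 is the
# t critical value for 95% confidence with 9 degrees of freedom.
books_borrowed = [3, 5, 2, 8, 4, 6, 1, 7, 5, 4]
print(confidence_interval(statistics.mean(books_borrowed), 2.262,
                          mean_standard_error(books_borrowed)))
```

Only the inputs change between the two calls; the interval itself is assembled in the same way each time.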

A Primer for Using Open Source R Software for Accessibility and Visualization