The Pennsylvania State University, Spring 2021 Stat 415-001, Hyebin Song

Interval Estimation


Introduction to Interval Estimation

Learning objectives

 

Interval estimators and interval estimates

 

 

 

 

 

Steps to construct an interval estimator for $\theta$ using a pivotal quantity

 

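Recall that the goal is to find $L$ and $U$ such that

$$P(L \le \theta \le U) = 1 - \alpha$$

given a confidence coefficient $1 - \alpha$.

Basic idea: use the distribution of a "good" estimator $\hat\theta$ of $\theta$ to decide a "margin" around the point estimate.

The probability that $\hat\theta$ is within distance $c$ of $\theta$ is the same as the probability that $\theta$ is within distance $c$ of $\hat\theta$.

Choose $c$ such that $P(|\hat\theta - \theta| \le c) = 1 - \alpha$. Then,

$$P(\hat\theta - c \le \theta \le \hat\theta + c) = 1 - \alpha.$$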

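Example We have a random sample $X_1, \dots, X_n$ such that $X_i \sim N(\mu, \sigma^2)$ with $\sigma$ known.

$\bar{X}$ is a good estimator of $\mu$. The distribution of $\bar{X}$ is $N(\mu, \sigma^2/n)$.

The probability that $\bar{X}$ is within $2\sigma/\sqrt{n}$ of $\mu$ is 95.45%.

Equivalently, the probability that $\mu$ is within $2\sigma/\sqrt{n}$ of $\bar{X}$ is 95.45%.

An interval estimator for $\mu$: $[\bar{X} - 2\sigma/\sqrt{n},\ \bar{X} + 2\sigma/\sqrt{n}]$.

Let $\bar{x}$ be a realization of $\bar{X}$. The interval estimate for $\mu$ (a 95.45% confidence interval) is $[\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}]$.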
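The 95.45% figure can be checked numerically. The sketch below uses only Python's standard library (the course itself works in R, so this is an illustration rather than the original listing) and computes $P(|Z| \le 2)$ for a standard normal $Z$, which is the probability that a normal sample mean falls within two standard errors of $\mu$. The helper name `prob_within` is made up for this example.

```python
from statistics import NormalDist

def prob_within(k: float) -> float:
    """Probability that a standard normal variable lies within k sd of 0."""
    z = NormalDist()  # standard normal: mean 0, sd 1
    return z.cdf(k) - z.cdf(-k)

coverage_2se = prob_within(2.0)
print(f"P(|Z| <= 2) = {coverage_2se:.4f}")  # prints 0.9545
```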

 

Here are general steps:

  1. Pick a good estimator $\hat\theta$ of $\theta$.

    • Often, the use of "good" estimators results in good confidence intervals
  2. Find or approximate the distribution of the estimator $\hat\theta$ for $\theta$.

    • In particular, we use one of the following three methods:

      • use the exact distribution of the estimator $\hat\theta$
      • use an asymptotic distribution of $\hat\theta$
      • use a numerical method to approximate the distribution of $\hat\theta$
    • Often, we work with the distribution of a function of the estimator $\hat\theta$ and the parameter $\theta$, $Q(\hat\theta, \theta)$, whose distribution is known.

      • Such a $Q$ is called a pivotal quantity ($Q$ should not depend on any unknown parameters other than $\theta$).
  3. Choose $c$ such that $P(|\hat\theta - \theta| \le c) = 1 - \alpha$ based on the distribution of $\hat\theta$ or $Q$.

  4. An interval estimator with confidence coefficient $1 - \alpha$ is $[\hat\theta - c,\ \hat\theta + c]$.

 

Confidence intervals for one mean

Learning objectives

 

Confidence intervals for one mean

1. Confidence interval for $\mu$ when $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ known.

Example: Let $X$ equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume that the distribution of $X$ is $N(\mu, \sigma^2)$ with $\sigma^2$ known. A random sample of $n$ bulbs is tested until they burn out, yielding a sample mean of $\bar{x}$ hours. Find 90% and 95% confidence intervals for $\mu$.


First, we need to find an interval estimator with the confidence coefficient $1 - \alpha$. That is, find $L$ and $U$ such that

$$P(L \le \mu \le U) = 1 - \alpha.$$

Then, the realized interval $[l, u]$ is a $100(1 - \alpha)\%$ confidence interval for $\mu$. How can we find such $L$ and $U$?

 

We follow the four steps to construct an interval estimator:

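  1. The statistic $\bar{X}$ is a good estimator for $\mu$.

  2. Since $X_i \sim N(\mu, \sigma^2)$, by the additive property of the normal distribution, $\bar{X} \sim N(\mu, \sigma^2/n)$.

    A pivotal quantity: $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ (recall that $\sigma$ is known).

  3. To choose $c$ such that $P(|\bar{X} - \mu| \le c) = 1 - \alpha$, find the number $z_{\alpha/2}$ such that

    $$P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha.$$

  4. Then, we have $P\left(\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n} \le \mu \le \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha$.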

 

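Therefore, 90% and 95% confidence intervals for the average length of life of a 60-watt light bulb are

$$\bar{x} \pm 1.645\,\frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}},$$

since $z_{0.05} \approx 1.645$ and $z_{0.025} \approx 1.96$.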
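As a rough illustration of the known-variance interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, here is a small Python sketch. The function name `z_interval` and the input values are hypothetical (they are not the numbers from the light bulb example, which are not reproduced here); only the formula comes from the steps above.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar: float, sigma: float, n: int, conf: float) -> tuple[float, float]:
    """Two-sided interval xbar +/- z_{alpha/2} * sigma / sqrt(n), sigma known."""
    alpha = 1.0 - conf
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 critical value
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical inputs, chosen only so that sigma / sqrt(n) = 8.
xbar, sigma, n = 1500.0, 40.0, 25
ci90 = z_interval(xbar, sigma, n, 0.90)
ci95 = z_interval(xbar, sigma, n, 0.95)
print("90%:", ci90)
print("95%:", ci95)
```

The 95% interval is wider than the 90% interval because a larger critical value is needed to capture more probability.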

 

Remark In practice, one can never "know" whether the data are from a normal distribution or not. Often, people perform some sort of normality check to determine whether an observed data set can be well modeled by a normal distribution. One quick and useful informal check is to plot a histogram of the observed data and see whether its shape resembles a bell curve.

 

2. Confidence interval for $\mu$ when $X_i \sim N(\mu, \sigma^2)$ with $\sigma^2$ unknown.

Example: Let $X$ equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. Assume that the distribution of $X$ is normal. A random sample of $n$ bulbs is tested until they burn out, yielding a sample mean of $\bar{x}$ hours and a sample standard deviation of $s$ hours. Construct 90% and 95% confidence intervals for $\mu$.


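We construct a confidence interval for $\mu$ when $\sigma^2$ is unknown. Following the four steps,

  1. We use $\bar{X}$ as an estimator of $\mu$.

  2. Since each $X_i \sim N(\mu, \sigma^2)$ i.i.d., we have $\bar{X} \sim N(\mu, \sigma^2/n)$.

    We have $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$. This is a function of $\bar{X}$, $\mu$, and the unknown $\sigma$ (no longer a function of $\bar{X}$ and $\mu$ alone). We replace the unknown population variance $\sigma^2$ with the sample variance $S^2$.

    Thus a pivotal quantity is $T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$.

     

    Lemma: $T \sim t(n-1)$, the "t-distribution" with $n - 1$ degrees of freedom.

    proof.

    (Theorem 5.5-3 in HTZ) If $Z \sim N(0, 1)$, $U \sim \chi^2(r)$, and $Z$ and $U$ are independent, then $Z/\sqrt{U/r} \sim t(r)$.

     

    (Theorem 5.5-2 in HTZ) For $X_i \sim N(\mu, \sigma^2)$ i.i.d., we have,

    1. $\bar{X}$ and $S^2$ are independent.
    2. $(n-1)S^2/\sigma^2 \sim \chi^2(n-1)$.

    In other words, $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$ and $U = (n-1)S^2/\sigma^2 \sim \chi^2(n-1)$ are independent.

     

    Combining two theorems,

    $$T = \frac{Z}{\sqrt{U/(n-1)}} = \frac{(\bar{X} - \mu)/(\sigma/\sqrt{n})}{\sqrt{S^2/\sigma^2}} = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1).$$

  3. Find the number $t_{\alpha/2}(n-1)$ such that

    $$P\left(-t_{\alpha/2}(n-1) \le T \le t_{\alpha/2}(n-1)\right) = 1 - \alpha.$$

    • The number $t_{\alpha/2}(n-1)$ is the $1 - \alpha/2$ quantile of a t distribution with $n - 1$ degrees of freedom; i.e., the number such that $P(T \le t_{\alpha/2}(n-1)) = 1 - \alpha/2$.

      • You can find the number in Table VI in Appendix B of HTZ for some commonly used values.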

      • You can use computer software, such as R, to look up the value of $t_{\alpha/2}(n-1)$. For example, the script

        returns .

       

     

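  4. Therefore, $P\left(-t_{\alpha/2}(n-1) \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2}(n-1)\right) = 1 - \alpha$. In other words,

    $$P\left(\bar{X} - t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}(n-1)\,\frac{S}{\sqrt{n}}\right) = 1 - \alpha.$$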

 

 

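Therefore, 90% and 95% confidence intervals for the average length of life of a 60-watt light bulb are

$$\bar{x} \pm t_{0.05}(n-1)\,\frac{s}{\sqrt{n}} \quad \text{and} \quad \bar{x} \pm t_{0.025}(n-1)\,\frac{s}{\sqrt{n}}.$$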

 

Remark: we see that $t_{0.05}(n-1) > z_{0.05}$ and $t_{0.025}(n-1) > z_{0.025}$, and thus, for comparable values of $s$ and $\sigma$, the confidence intervals when $\sigma$ is unknown are longer than the confidence intervals when $\sigma$ is known. This is because the t distribution has thicker tails than the standard normal distribution.

In fact, any t distribution with finite degrees of freedom has thicker tails than the standard normal distribution, and as the degrees of freedom increase, the t distribution becomes more similar to a normal distribution. Usually, when the degrees of freedom exceed 30, the t distribution is quite similar to the standard normal distribution.
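To see how the t critical values approach the normal one, the sketch below computes $t_{0.025}(r)$ for a few degrees of freedom using only Python's standard library. The helpers `t_cdf` and `t_quantile` are illustrative stand-ins written for this example (numerical integration of the t density plus bisection); in practice one would use a t table or a statistics package.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """CDF of t(df) at x >= 0: Simpson's rule on [0, x] plus symmetry about 0."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / steps
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 + total * h / 3

def t_quantile(p: float, df: int) -> float:
    """Quantile of t(df) for p > 0.5, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)  # z_{0.025}
for df in (5, 10, 30, 100):
    print(f"df = {df:3d}: t = {t_quantile(0.975, df):.3f}, z = {z:.3f}")
```

The printed t values decrease toward the normal critical value as the degrees of freedom grow, which is the behavior described in the remark.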

 

 

3. Confidence interval for $\mu$ with an unknown underlying distribution and large $n$

Example: Let $X$ equal the length of life of a 60-watt light bulb marketed by a certain manufacturer. A random sample of $n$ bulbs is tested until they burn out, yielding a sample mean of $\bar{x}$ hours. From a previous study, the value of $\sigma$ is known. Construct approximate 90% and 95% confidence intervals for $\mu$.


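Since we do not know the distribution of $X$, it would be hopeless to find the exact distribution of $\bar{X}$. Luckily, we have a large sample size $n$, and thus we can use the CLT to obtain

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{\text{approx}}{\sim} N(0, 1). \tag{1}$$

Note: even when we know the distribution of $X$, it is often hard to obtain the exact distribution of $\bar{X}$. In such cases, we can use this normal approximation instead.

 

From (1), we have,

$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) \approx 1 - \alpha.$$

Then,

$$P\left(\bar{X} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right) \approx 1 - \alpha.$$

Thus we have that $\left[\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}\right]$ is an asymptotically correct interval estimator for $\mu$ with confidence coefficient $1 - \alpha$.

Remark: Here, an "asymptotically correct" interval estimator means the random interval covers the parameter approximately $100(1 - \alpha)$% of the time when $n$ is large. In other words, the approximate equality $P\left(\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n} \le \mu \le \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}\right) \approx 1 - \alpha$

is "correct" when $n$ is large ("asymptotic").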
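One way to get a feel for "asymptotically correct" is a small simulation. The sketch below draws repeated samples from a skewed, non-normal distribution (an Exponential with mean 1 and standard deviation 1, chosen purely for illustration) and estimates how often the interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ covers the true mean. The function name, seed, and simulation sizes are all arbitrary choices for this example.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(n: int, reps: int = 2000, conf: float = 0.95, seed: int = 415) -> float:
    """Fraction of simulated intervals xbar +/- z * sigma / sqrt(n) covering mu."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    mu, sigma = 1.0, 1.0        # Exponential(rate 1): mean 1, sd 1
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

cov = coverage(n=100)
print(f"estimated coverage with n = 100: {cov:.3f}")
```

The estimate should land near the nominal 95%, up to simulation noise, even though the underlying data are not normal.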

 

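Therefore, approximate 90% and 95% confidence intervals for the average length of life of a 60-watt light bulb are

$$\bar{x} \pm 1.645\,\frac{\sigma}{\sqrt{n}} \quad \text{and} \quad \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}.$$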

 

Suppose the previous knowledge of the population variance was unavailable. In that case, we replace the population variance $\sigma^2$ with the sample variance $S^2$ and use the following approximation:

$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \overset{\text{approx}}{\sim} N(0, 1);$$

this is due to the fact that $S^2 \approx \sigma^2$ when $n$ is large (more precisely, $S^2$ is a consistent estimator for $\sigma^2$).

Then, following similar steps as before, we have,

$$P\left(\bar{X} - z_{\alpha/2}\,\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + z_{\alpha/2}\,\frac{S}{\sqrt{n}}\right) \approx 1 - \alpha.$$

In particular, we can use the same form of approximate 90% and 95% confidence intervals for the average length of life of a 60-watt light bulb, with $\sigma$ replaced by the sample standard deviation $s$.

 

Summary

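When we have the observed sample $x_1, \dots, x_n$ from a random sample $X_1, \dots, X_n$,

 

Settings | $100(1-\alpha)\%$ confidence interval
$X_i \sim N(\mu, \sigma^2)$, $\sigma^2$ known | $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$
$X_i \sim N(\mu, \sigma^2)$, $\sigma^2$ unknown | $\bar{x} \pm t_{\alpha/2}(n-1)\,s/\sqrt{n}$
Any distribution with mean $\mu$ and variance $\sigma^2$, $\sigma^2$ known, large $n$ | $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ (approximate)
Any distribution with mean $\mu$ and variance $\sigma^2$, $\sigma^2$ unknown, large $n$ | $\bar{x} \pm z_{\alpha/2}\,s/\sqrt{n}$ (approximate)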

 

Remark: the normal approximation of the distribution of $\bar{X}$ is in general quite good, especially when the distribution of $X_i$ is symmetric, unimodal, and of the continuous type. When the distribution of $X_i$ is highly skewed, a larger sample size is needed for the approximation to be reasonably accurate.

In particular, when the sample size is small and $X_i$ is not normally distributed (especially when heavily skewed), it is preferable to use a numerical method to directly approximate the distribution of $\bar{X}$.

 

Confidence Intervals for the difference of two means

Learning objectives

 

Two independent random samples vs a paired random sample

Suppose a researcher wants to study whether lack of sleep impacts cognitive performance.

Protocol 1: The researcher recruited 20 participants and divided them into 2 groups. The participants in the first group are allowed to have a normal sleep, but people in the second group are kept awake for 24 hours. Results of cognitive tests are recorded for each participant. Let $X_i$ be the test score for the $i$th participant from the first group, and $Y_i$ be the test score for the $i$th participant from the second group.

Protocol 2: The researcher recruited 10 participants. Each participant is asked to take the tests twice: once after a normal sleep and once after being kept awake for 24 hours. Let $X_i$ be the first test score for the $i$th participant, and $Y_i$ be the second test score for the $i$th participant, $i = 1, \dots, 10$.

 

In the first protocol, there is no association between the $i$th subjects from the first and second groups, and therefore $X_i$ and $Y_i$ are independent. On the other hand, in the second protocol, $X_i$ and $Y_i$ are dependent, since the test results are from the same $i$th person. In fact, we have a paired random sample, since each $X_i$ and $Y_i$ should be paired. Therefore, Protocol 1 yields two independent random samples, whereas Protocol 2 yields a paired random sample $(X_1, Y_1), \dots, (X_{10}, Y_{10})$.

 

Confidence intervals for the difference of two means

 

1. Confidence interval for the difference of the population means from two independent normal random samples with known variances

Example The researcher followed the first protocol and obtained the following data:

Group | Scores
Group 1 (normal sleep) | 8.4, 9.2, 8.2, 10.6, 9.3, 8.2, 9.5, 9.7, 9.6, 8.7
Group 2 (awake for 24 hours) | 8.5, 7.4, 6.4, 4.8, 8.1, 7.0, 7.0, 7.9, 7.8, 7.6

Let $X$ and $Y$ equal the test score of a participant in the first and second group, respectively. Based on previous studies, the researcher decided that it is reasonable to assume that the distribution of test scores in each group is normal with a known variance $\sigma^2$. The researcher wants to form a 95% confidence interval for the difference of test scores between the two groups.


 

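Let $\mu_X$ and $\mu_Y$ be the unknown population means of groups 1 and 2, with sample sizes $n$ and $m$ and known variances $\sigma_X^2$ and $\sigma_Y^2$. We need to find an interval estimator with the confidence coefficient $1 - \alpha$. That is, find $L$ and $U$ such that

$$P(L \le \mu_X - \mu_Y \le U) = 1 - \alpha.$$

We follow the same steps to find an interval estimator:

  1. We use the estimator $\bar{X} - \bar{Y}$ for $\mu_X - \mu_Y$.

  2. The distribution of $\bar{X} - \bar{Y}$ is $N\left(\mu_X - \mu_Y,\ \sigma_X^2/n + \sigma_Y^2/m\right)$ (using independence of $\bar{X}$ and $\bar{Y}$).

  3. The pivotal quantity: $Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \sim N(0, 1)$.

  4. Find $z_{\alpha/2}$ such that $P(-z_{\alpha/2} \le Z \le z_{\alpha/2}) = 1 - \alpha$.

    Therefore, for a 95% interval, $z_{0.025} = 1.96$.

  5. Rearranging the terms in the event,

    $$P\left(\bar{X} - \bar{Y} - z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}} \le \mu_X - \mu_Y \le \bar{X} - \bar{Y} + z_{\alpha/2}\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}\right) = 1 - \alpha.$$

    Therefore, $\bar{X} - \bar{Y} \pm z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}$ is an interval estimator for $\mu_X - \mu_Y$ with confidence coefficient $1 - \alpha$.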

 

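A 95% confidence interval for the difference of test scores between group 1 and group 2 is

$$\bar{x} - \bar{y} \pm 1.96\sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{m}}, \qquad \bar{x} - \bar{y} = 9.14 - 7.25 = 1.89, \quad n = m = 10.$$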
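The calculation can be sketched in Python with the standard library (the original notes use R for this step). The data are the two groups from the table above. The value `sigma2 = 1.0` is an assumed placeholder for the known common variance, since the value used in the example is not reproduced here; substitute the actual value.

```python
from math import sqrt
from statistics import NormalDist, mean

group1 = [8.4, 9.2, 8.2, 10.6, 9.3, 8.2, 9.5, 9.7, 9.6, 8.7]  # normal sleep
group2 = [8.5, 7.4, 6.4, 4.8, 8.1, 7.0, 7.0, 7.9, 7.8, 7.6]   # awake for 24 hours

sigma2 = 1.0  # ASSUMED known common variance (placeholder, not from the notes)
z = NormalDist().inv_cdf(0.975)  # z_{0.025}

diff = mean(group1) - mean(group2)
margin = z * sqrt(sigma2 / len(group1) + sigma2 / len(group2))
ci = (diff - margin, diff + margin)
print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```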

In R,

 

2. Confidence interval for the difference of the population means from two independent normal random samples with unknown variances

 

Example: Consider the previous sleep study example. However, now the researcher is unsure whether it is valid to assume that the population variance of each group equals the previously assumed value $\sigma^2$. The prior studies indicate that the variances of the scores from each group are likely to be the same. The researcher wants to form a 95% confidence interval for the difference of test scores between the two groups.


 

2-1. Common variance

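In Step 3, we used

$$Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma^2/n + \sigma^2/m}}$$

as a pivotal quantity. However, in this setting where $\sigma^2$ is unknown, it is no longer a function of only the data and the parameter of interest. We replace $\sigma^2$ with a pooled estimator of the common variance

$$S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n + m - 2}.$$

Lemma:

$$T = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{S_p\sqrt{1/n + 1/m}} \sim t(n + m - 2).$$

proof.

Let $Z = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sigma\sqrt{1/n + 1/m}} \sim N(0, 1)$, and $U = \frac{(n-1)S_X^2}{\sigma^2} + \frac{(m-1)S_Y^2}{\sigma^2} = \frac{(n+m-2)S_p^2}{\sigma^2} \sim \chi^2(n + m - 2)$, such that

$$T = \frac{Z}{\sqrt{U/(n + m - 2)}}.$$

$Z$ and $U$ are independent, because the sample means are independent of the sample variances. By theorem 5.5-3 in HTZ, $T \sim t(n + m - 2)$.

 

Following similar steps,

$$P\left(\bar{X} - \bar{Y} - t_{\alpha/2}(n+m-2)\,S_p\sqrt{\frac{1}{n} + \frac{1}{m}} \le \mu_X - \mu_Y \le \bar{X} - \bar{Y} + t_{\alpha/2}(n+m-2)\,S_p\sqrt{\frac{1}{n} + \frac{1}{m}}\right) = 1 - \alpha.$$

 

Therefore, a 95% confidence interval for the difference of test scores between group 1 and group 2 is

$$\bar{x} - \bar{y} \pm t_{0.025}(18)\,s_p\sqrt{\frac{1}{10} + \frac{1}{10}}.$$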
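A possible way to carry out the pooled-variance computation in Python (standard library only; the notes themselves use R) is sketched below, using the sleep-study data from the earlier table. The critical value `t_crit = 2.101` is the rounded table value of $t_{0.025}(18)$, hard-coded here because the standard library has no t quantile function.

```python
from math import sqrt
from statistics import mean, variance

group1 = [8.4, 9.2, 8.2, 10.6, 9.3, 8.2, 9.5, 9.7, 9.6, 8.7]  # normal sleep
group2 = [8.5, 7.4, 6.4, 4.8, 8.1, 7.0, 7.0, 7.9, 7.8, 7.6]   # awake for 24 hours
n, m = len(group1), len(group2)

# Pooled estimate of the common variance (variance() uses the n - 1 divisor).
sp2 = ((n - 1) * variance(group1) + (m - 1) * variance(group2)) / (n + m - 2)

t_crit = 2.101  # t_{0.025}(18), rounded value from a standard t table
diff = mean(group1) - mean(group2)
margin = t_crit * sqrt(sp2) * sqrt(1 / n + 1 / m)
ci = (diff - margin, diff + margin)
print(f"s_p^2 = {sp2:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```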

In R,

 

2-2. Uncommon variances

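When the two samples do not share a common variance, we can no longer use the pooled estimator. We replace $\sigma_X^2$ and $\sigma_Y^2$ with $S_X^2$ and $S_Y^2$ in the pivotal quantity $Z$ from Step 3 and use

$$W = \frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_X^2/n + S_Y^2/m}}$$

as a pivotal quantity.

Although $W$ no longer has an exact t distribution, it approximately follows a t distribution with $r$ degrees of freedom (Welch, 1949) where

$$r = \frac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2}{\dfrac{1}{n-1}\left(\dfrac{s_x^2}{n}\right)^2 + \dfrac{1}{m-1}\left(\dfrac{s_y^2}{m}\right)^2}.$$

Using Welch's approximation,

$$P\left(\bar{X} - \bar{Y} - t_{\alpha/2}(r)\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}} \le \mu_X - \mu_Y \le \bar{X} - \bar{Y} + t_{\alpha/2}(r)\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}\right) \approx 1 - \alpha.$$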
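The degrees of freedom in Welch's approximation are easy to compute directly. The sketch below applies the formula to the sleep-study data from the earlier table; the function name `welch_df` is made up for this example.

```python
from statistics import variance

def welch_df(x: list[float], y: list[float]) -> float:
    """Welch's approximate degrees of freedom for two independent samples."""
    n, m = len(x), len(y)
    a, b = variance(x) / n, variance(y) / m  # s_x^2 / n and s_y^2 / m
    return (a + b) ** 2 / (a ** 2 / (n - 1) + b ** 2 / (m - 1))

group1 = [8.4, 9.2, 8.2, 10.6, 9.3, 8.2, 9.5, 9.7, 9.6, 8.7]  # normal sleep
group2 = [8.5, 7.4, 6.4, 4.8, 8.1, 7.0, 7.0, 7.9, 7.8, 7.6]   # awake for 24 hours
r = welch_df(group1, group2)
print(f"Welch degrees of freedom: {r:.2f}")
```

The result is generally not an integer and always lies between $\min(n, m) - 1$ and $n + m - 2$; when using a t table, a common convention is to round it down.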

 

3. Confidence interval for the difference of the population means from two independent random samples with unknown distributions

 

Example The researcher decided to follow the first protocol to study the effect of sleep deprivation. The researcher first looked at some of the previous literature and found that some of the scores from previous studies had a skewed distribution. Worried about possible non-normality of test scores, the researcher recruited 100 participants (50 participants per group) and carried out experiments. After plotting histograms of test scores from each group, the researcher concluded that test scores are not likely to be from normal distributions. The researcher wants to form a 95% confidence interval for the difference of test scores between the two groups.

Group | Sample mean | Sample variance
Group 1 | 9.114 | 0.811
Group 2 | 6.963 | 0.918

 


 

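We have two independent random samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$ from unknown (non-normal) distributions, with sufficiently large $n$ and $m$.

From an application of a version of the CLT, we have

$$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{\sigma_X^2/n + \sigma_Y^2/m}} \overset{\text{approx}}{\sim} N(0, 1)$$

and

$$\frac{(\bar{X} - \bar{Y}) - (\mu_X - \mu_Y)}{\sqrt{S_X^2/n + S_Y^2/m}} \overset{\text{approx}}{\sim} N(0, 1),$$

since $S_X^2 \approx \sigma_X^2$ and $S_Y^2 \approx \sigma_Y^2$ when $n$ and $m$ are sufficiently large.

 

Using these two approximations (the second one when the variances are unknown),

$$P\left(\bar{X} - \bar{Y} - z_{\alpha/2}\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}} \le \mu_X - \mu_Y \le \bar{X} - \bar{Y} + z_{\alpha/2}\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}\right) \approx 1 - \alpha.$$

 

Therefore, an approximate 95% confidence interval for the difference of test scores between group 1 and group 2 is

$$(9.114 - 6.963) \pm 1.96\sqrt{\frac{0.811}{50} + \frac{0.918}{50}} \approx [1.79,\ 2.52].$$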
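The arithmetic for this large-sample interval can be reproduced with a few lines of Python. The sample sizes, means, and variances are the summary statistics from the table in the example; the variable names are chosen for this sketch only.

```python
from math import sqrt
from statistics import NormalDist

n = m = 50                      # participants per group
xbar, s2_x = 9.114, 0.811       # group 1 sample mean and variance
ybar, s2_y = 6.963, 0.918       # group 2 sample mean and variance

z = NormalDist().inv_cdf(0.975)  # z_{0.025}
diff = xbar - ybar
margin = z * sqrt(s2_x / n + s2_y / m)
ci = (diff - margin, diff + margin)
print(f"difference = {diff:.3f}, approx. 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```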

4. Confidence interval for the difference of the population means from a paired random sample

 

Example A researcher wants to study whether lack of sleep impacts cognitive performance. The researcher recruited 10 participants. Each participant is asked to take the tests twice: one after a normal sleep and the other after being kept awake for 24 hours.

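Participant | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
First test (normal sleep) | 8.1 | 9.5 | 7.2 | 11.6 | 9.9 | 7.3 | 10 | 10.7 | 10.4 | 8.5
Second test (awake for 24 hours) | 7.0 | 8.6 | 6.3 | 10.7 | 8.8 | 6.3 | 8.9 | 9.1 | 9.0 | 7.5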

Suppose it is reasonable to assume that the difference of test scores is normally distributed.


 

Since $X_i$ and $Y_i$ are paired, we cannot use the previous intervals, because $X_i$ and $Y_i$ are dependent (observe that a participant who scored high in the first test tended to score high in the second test).

The key idea is to consider another random variable $D_i = X_i - Y_i$, which is the difference between $X_i$ and $Y_i$. Note that $E(D_i) = \mu_X - \mu_Y =: \mu_D$. Therefore the problem becomes finding a confidence interval for $\mu_D$ where we have a random sample $D_1, \dots, D_n$, and we can use the results from the previous lecture. Since it is assumed that $D_i \sim N(\mu_D, \sigma_D^2)$ and $\sigma_D^2$ is unknown, we can use a confidence interval based on the t distribution.

Therefore, the 95% confidence interval for the difference is

$$\bar{d} \pm t_{0.025}(n-1)\,\frac{s_d}{\sqrt{n}}, \qquad n = 10,$$

where $\bar{d}$ and $s_d^2$ are the sample mean and variance of the differences of test scores.
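The paired interval can be computed from the table above as follows. This is a Python sketch using only the standard library (the notes use R here). The critical value `t_crit = 2.262` is the rounded table value of $t_{0.025}(9)$, hard-coded because the standard library does not provide t quantiles.

```python
from math import sqrt
from statistics import mean, stdev

first = [8.1, 9.5, 7.2, 11.6, 9.9, 7.3, 10.0, 10.7, 10.4, 8.5]  # normal sleep
second = [7.0, 8.6, 6.3, 10.7, 8.8, 6.3, 8.9, 9.1, 9.0, 7.5]    # awake for 24 hours

# Work with the within-participant differences D_i = X_i - Y_i.
d = [x - y for x, y in zip(first, second)]
n = len(d)

t_crit = 2.262  # t_{0.025}(9), rounded value from a standard t table
d_bar, s_d = mean(d), stdev(d)
margin = t_crit * s_d / sqrt(n)
ci = (d_bar - margin, d_bar + margin)
print(f"d_bar = {d_bar:.2f}, s_d = {s_d:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the differences vary much less than the raw scores, this interval is far narrower than the ones obtained by treating the samples as independent.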

In R,

 

Summary

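When we have the observed samples $x_1, \dots, x_n$ and $y_1, \dots, y_m$ from two random samples $X_1, \dots, X_n$ and $Y_1, \dots, Y_m$,

Settings | $100(1-\alpha)\%$ confidence interval
$X_i \sim N(\mu_X, \sigma_X^2)$, $Y_i \sim N(\mu_Y, \sigma_Y^2)$, independent, known $\sigma_X^2, \sigma_Y^2$ | $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}$
$X_i \sim N(\mu_X, \sigma_X^2)$, $Y_i \sim N(\mu_Y, \sigma_Y^2)$, independent, unknown $\sigma_X^2 = \sigma_Y^2$ | $\bar{x} - \bar{y} \pm t_{\alpha/2}(n+m-2)\,s_p\sqrt{1/n + 1/m}$
$X_i \sim N(\mu_X, \sigma_X^2)$, $Y_i \sim N(\mu_Y, \sigma_Y^2)$, independent, unknown $\sigma_X^2 \ne \sigma_Y^2$ | $\bar{x} - \bar{y} \pm t_{\alpha/2}(r)\sqrt{s_x^2/n + s_y^2/m}$, where $r$ is the df from Welch's approximation
Two independent random samples, large $n$ and $m$ (approximate) | $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\sigma_X^2/n + \sigma_Y^2/m}$, or $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{s_x^2/n + s_y^2/m}$
Paired samples (dependent, $D_i = X_i - Y_i$), when $D_i \sim N(\mu_D, \sigma_D^2)$, unknown $\sigma_D^2$ | $\bar{d} \pm t_{\alpha/2}(n-1)\,s_d/\sqrt{n}$
Paired samples (dependent, $D_i = X_i - Y_i$), large $n$ (approximate) | $\bar{d} \pm z_{\alpha/2}\,s_d/\sqrt{n}$

where

$$s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n + m - 2}, \qquad r = \frac{\left(s_x^2/n + s_y^2/m\right)^2}{\frac{1}{n-1}\left(s_x^2/n\right)^2 + \frac{1}{m-1}\left(s_y^2/m\right)^2}.$$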

 

Confidence intervals for proportions

Learning objectives

 

Confidence interval for one proportion

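Suppose we have a random sample $Y_1, \dots, Y_n \sim \text{Bernoulli}(p)$. The goal is to construct an approximate confidence interval for $p$ when $n$ is large. Note that $E(Y_i) = p$ and $\text{Var}(Y_i) = p(1-p) < \infty$ (for any $p$). By the CLT, with $\hat{p} = \bar{Y}$, we have,

$$Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \overset{\text{approx}}{\sim} N(0, 1).$$

Note $Z$ is (asymptotically) pivotal, since it depends only on $\hat{p}$ and $p$ (and does not depend on any other unknown parameter).

In step 3, we choose $z_{\alpha/2}$ such that

$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha.$$

Unlike previous cases, both numerator and denominator depend on $p$.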

 

1. Wald Confidence Intervals:

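Since $\hat{p}(1 - \hat{p}) \approx p(1-p)$ when $n$ is large ($\hat{p}$ is a consistent estimator of $p$),

$$\frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \overset{\text{approx}}{\sim} N(0, 1).$$

Therefore, we have,

$$P\left(-z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \le z_{\alpha/2}\right) \approx 1 - \alpha.$$

In other words,

$$P\left(\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \le p \le \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right) \approx 1 - \alpha.$$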

 

 

2. Wilson Confidence Intervals:

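Rewriting the event $\left|\frac{\hat{p} - p}{\sqrt{p(1-p)/n}}\right| \le z_{\alpha/2}$, we get

$(\hat{p} - p)^2 \le z_{\alpha/2}^2\,\frac{p(1-p)}{n}$, and

$$\left(1 + \frac{z_{\alpha/2}^2}{n}\right)p^2 - \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right)p + \hat{p}^2 \le 0.$$

In other words,

$$P\left(\left(1 + \frac{z_{\alpha/2}^2}{n}\right)p^2 - \left(2\hat{p} + \frac{z_{\alpha/2}^2}{n}\right)p + \hat{p}^2 \le 0\right) \approx 1 - \alpha.$$

By the quadratic formula, we can show that the inequality holds exactly when $p$ lies in the interval with endpoints

$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}.$$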
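To compare the two constructions, here is a short Python sketch of the Wald and Wilson intervals. The function names and the counts in the usage lines (2 successes out of 20) are hypothetical and chosen to show a small-sample case where the two intervals differ visibly.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(y: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = y / n
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def wilson_interval(y: int, n: int, conf: float = 0.95) -> tuple[float, float]:
    """Wilson interval: roots of the quadratic inequality in p."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_hat = y / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin

# Hypothetical data: 2 successes in 20 trials.
wald = wald_interval(2, 20)
wilson = wilson_interval(2, 20)
print("Wald:  ", wald)
print("Wilson:", wilson)
```

With these counts the Wald lower endpoint falls below 0, which is not a valid proportion, while the Wilson interval stays inside $[0, 1]$.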

Remarks:

 

Confidence interval for the difference of proportions

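Suppose we now have two independent random samples $X_1, \dots, X_{n_1} \sim \text{Bernoulli}(p_1)$ and $Y_1, \dots, Y_{n_2} \sim \text{Bernoulli}(p_2)$. The goal is to construct an approximate confidence interval for the difference $p_1 - p_2$ when $n_1, n_2$ are large.

From an application of a version of the CLT, with $\hat{p}_1 = \bar{X}$ and $\hat{p}_2 = \bar{Y}$, we have

$$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2}} \overset{\text{approx}}{\sim} N(0, 1)$$

and

$$\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}} \overset{\text{approx}}{\sim} N(0, 1).$$

Therefore,

$$P\left(\hat{p}_1 - \hat{p}_2 - z_{\alpha/2}\,\widehat{\text{se}} \le p_1 - p_2 \le \hat{p}_1 - \hat{p}_2 + z_{\alpha/2}\,\widehat{\text{se}}\right) \approx 1 - \alpha, \qquad \widehat{\text{se}} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$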

 

 

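Example A random sample of $n_1$ men produced a total of $y_1$ who favored a controversial local issue. An independent random sample of $n_2$ women produced a total of $y_2$ who favored the issue. Assume that $p_1$ is the true underlying proportion of men who favor the issue and that $p_2$ is the true underlying proportion of women who favor the issue. Find a 95% confidence interval for $p_1 - p_2$.

Let $X_i$ be the preference of the $i$th man, and $Y_i$ be the preference of the $i$th woman.

We have $X_i \sim \text{Bernoulli}(p_1)$ and $Y_i \sim \text{Bernoulli}(p_2)$. The observed sample means are $\hat{p}_1 = y_1/n_1$ and $\hat{p}_2 = y_2/n_2$.

An approximate 95% confidence interval for $p_1 - p_2$ is

$$\hat{p}_1 - \hat{p}_2 \pm 1.96\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}.$$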
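The interval for a difference of proportions can be sketched in Python as follows. The function name and the counts in the usage line are hypothetical (they are not the survey counts from the example, which are not reproduced here).

```python
from math import sqrt
from statistics import NormalDist

def diff_prop_interval(y1: int, n1: int, y2: int, n2: int,
                       conf: float = 0.95) -> tuple[float, float]:
    """Approximate interval for p1 - p2 from two independent Bernoulli samples."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p1, p2 = y1 / n1, y2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # estimated standard error
    d = p1 - p2
    return d - z * se, d + z * se

# Hypothetical counts: 120 of 200 in sample 1, 125 of 250 in sample 2.
ci = diff_prop_interval(120, 200, 125, 250)
print(f"approx. 95% CI for p1 - p2: ({ci[0]:.4f}, {ci[1]:.4f})")
```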