Linear Models


Learning Objectives

Introduction to Linear Models

Often, we are interested in understanding the relationship between one variable, called the dependent variable or response, and a number of other variables, called the independent variables or predictors. The dependent variable is typically denoted by $Y$, and the independent variables are denoted as $x_1, x_2, \ldots, x_p$.

 

Example: Suppose we are interested in predicting the final score in the stat 415 class based on a number of predictors.

Response: 415 final score ($Y$)

Predictors: hours of study per week ($x_1$), number of previous math/stat classes ($x_2$), number of cups of coffee before the exam ($x_3$), final percentage of stat/math 414 ($x_4$), etc.

Goal: predict a 415 final score based on given predictor values. For example, what would be the expected 415 final score when $x_1 = 3$, $x_2 = 5$, $x_3 = 2$, and $x_4 = 89\%$?

 

Suppose
\[
Y = f(x_1, x_2, x_3, x_4) + \epsilon,
\]
where $\epsilon$ is a random error with $E[\epsilon] = 0$. The goal is to learn $f$ from data. Once we know what $f$ is, we can predict the expected value of $Y$ given different predictor values $(x_1, x_2, x_3, x_4)$. In other words,
\[
E[Y] = f(x_1, x_2, x_3, x_4).
\]
For example, the expected final score when hours of study per week = 3, number of previous math/stat classes = 5, number of cups of coffee before the exam = 2, and final percentage of stat/math 414 = 89% is $f(3, 5, 2, 89\%)$.

 

A linear model assumes that $f$ is linear. In other words,
\[
Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon.
\]
Then the goal is to learn $\beta_0, \beta_1, \ldots, \beta_p$ from data. When $p = 1$ (one explanatory variable), the linear model is called a simple linear regression model. Otherwise, the model is called a multiple linear regression model. In this class, we will mostly focus on simple linear regression.
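For reference, both kinds of models can be fit in R with lm(). The sketch below assumes a hypothetical data frame exams with columns score, hours, classes, coffee and prev414 standing in for the stat 415 example; these names are placeholders, not part of the original example.

    # Assumes a data frame exams with hypothetical columns for the stat 415 example
    fit_simple   <- lm(score ~ hours, data = exams)                  # p = 1: simple linear regression
    fit_multiple <- lm(score ~ hours + classes + coffee + prev414,
                       data = exams)                                 # p > 1: multiple linear regression
    coef(fit_simple)     # estimated intercept and slope
    coef(fit_multiple)   # estimated beta_0, beta_1, ..., beta_4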

The method of least squares

Suppose in the previous example only hours of study per week ($x$) matter in predicting the final score ($Y$). That is, $Y = f(x) + \epsilon$. Furthermore, the data suggest that the relationship between $x$ and $Y$ is linear. We let $f$ be linear and assume $E[\epsilon] = 0$, i.e.,

Final Score ($Y$) = $\beta_0$ + $\beta_1 \times$ Hours of study per week ($x$) + error ($\epsilon$).

[Figure: lm1]

Now the question is how to find the best $\beta_0$ and $\beta_1$ based on the data $(x_1, y_1)$, $(x_2, y_2)$, ..., $(x_n, y_n)$, where $y_i$ is a measurement of $Y$ at $x = x_i$.

 

Remark: depending on the nature of the data, we may treat an independent variable either as random or as deterministic. For example, we may consider $(X_1, Y_1), \ldots, (X_n, Y_n)$ to be independent and identically distributed (i.i.d.) pairs of random variables (a.k.a. random design), or we may regard $x_1, \ldots, x_n$ as fixed numbers (a.k.a. fixed design). In the fixed design case, $Y_1, \ldots, Y_n$ are independent but not identically distributed, since each $Y_i$ has a different mean value which depends on $x_i$.

In regression analysis, we are interested in estimating the expectation of $Y$ at a given value of the predictor. In the case of fixed design, estimating the regression line is equivalent to estimating the expectation of each $Y_i$. On the other hand, in the random design case, the regression line is the conditional expectation of $Y$ given $X = x$.

In this class, we will always assume deterministic predictors (fixed design), and to emphasize this fact, we will use lower-case letters to denote predictors.

 

Suppose each $Y_i$ has the following representation:
\[
Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \ldots, n, \tag{1}
\]
where the errors $\epsilon_i$ are i.i.d. with $E[\epsilon_i] = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$, and each $x_i$ is a fixed number. Note this implies that $E[Y_i] = \beta_0 + \beta_1 x_i$. Given $(x_1, y_1), \ldots, (x_n, y_n)$, how can we find $\beta_0$ and $\beta_1$?

 

The basic idea is to choose a function of the predictors, $f(x)$, so that the sum of squared distances from the data $y_i$ to the function values $f(x_i)$ is minimized. Since $f$ is linear, we find $(\beta_0, \beta_1)$ such that
\[
\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
\]
is minimized.

 

Let $Q(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. Then the minimizers $(\hat{\beta}_0, \hat{\beta}_1)$ of $Q$ are called the least squares estimates of $\beta_0$ and $\beta_1$. The corresponding statistics (i.e., replace the observed $y_i$ with the random $Y_i$) are called the least squares estimators (LSE).

 

Theorem (LSE of $\beta_0$ and $\beta_1$ in simple linear regression) The least squares estimators of $(\beta_0, \beta_1)$ are
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x},
\]
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$.

Remark: It is important to note that both $\hat{\beta}_0$ and $\hat{\beta}_1$ are linear functions of the $Y_i$.

With some algebra, we can show
\[
\hat{\beta}_1 = \sum_{i=1}^{n} c_i Y_i,
\]
where $c_i = \dfrac{x_i - \bar{x}}{S_{xx}}$, $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$; and then $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$ is also a linear function of the $Y_i$.
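To make the formulas concrete, the following R sketch (with made-up data, not from the notes) computes the least squares estimates directly from the formulas above and checks them against lm().

    # Made-up illustration data: x = hours of study, y = final score
    x <- c(1, 2, 3, 4, 5, 6)
    y <- c(62, 70, 74, 79, 85, 92)
    Sxx       <- sum((x - mean(x))^2)
    beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / Sxx   # slope estimate
    beta0_hat <- mean(y) - beta1_hat * mean(x)              # intercept estimate
    c(beta0_hat, beta1_hat)
    coef(lm(y ~ x))   # should agree with the hand computation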

Proof. Following the standard procedure for minimizing a function, we find the partial derivatives
\[
\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i),
\qquad
\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i).
\]
Setting both to zero, we get the normal equations
\[
\sum_{i=1}^{n} y_i = n \beta_0 + \beta_1 \sum_{i=1}^{n} x_i,
\qquad
\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2.
\]
From these, we solve for $\beta_0$ and $\beta_1$, and treat the solutions as the estimates (i.e., put hats on them).
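Since the normal equations are a 2-by-2 linear system in $(\beta_0, \beta_1)$, they can also be solved numerically. A minimal sketch, reusing the made-up data from above:

    # Solving the normal equations as a linear system
    x <- c(1, 2, 3, 4, 5, 6)
    y <- c(62, 70, 74, 79, 85, 92)
    A <- rbind(c(length(x), sum(x)),
               c(sum(x),    sum(x^2)))   # coefficient matrix of the normal equations
    b <- c(sum(y), sum(x * y))           # right-hand side
    solve(A, b)                          # returns (beta0_hat, beta1_hat)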

 

 

 

 

 

Remark: the same least squares method can be applied when we have more than one predictor variable. Namely, when $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \epsilon_i$, the least squares estimates of $(\beta_0, \beta_1, \ldots, \beta_p)$ are the minimizers of
\[
\sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip} \right)^2.
\]
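In matrix notation the minimizer has the closed form $\hat{\beta} = (X^\top X)^{-1} X^\top Y$, where $X$ is the design matrix. The R sketch below (made-up data) illustrates this and checks it against lm().

    # Multiple regression by the matrix least squares formula (illustrative data)
    set.seed(1)
    n  <- 30
    x1 <- runif(n, 0, 10)
    x2 <- runif(n, 0, 5)
    y  <- 60 + 2 * x1 - 1.5 * x2 + rnorm(n, sd = 3)
    X  <- cbind(1, x1, x2)                     # design matrix with an intercept column
    beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # least squares estimates
    beta_hat
    coef(lm(y ~ x1 + x2))                      # same values via lm()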

 

Theorem (MLE of $\beta_0$, $\beta_1$ and $\sigma^2$ in simple linear regression) Assume Normally distributed errors $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$. The maximum likelihood estimates of $\beta_0$ and $\beta_1$ are the same as the least squares estimates. The maximum likelihood estimator of $\sigma^2$ is given by
\[
\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2.
\]
(The proof follows by writing the Normal likelihood of $Y_1, \ldots, Y_n$ and maximizing the log-likelihood over $\beta_0$, $\beta_1$ and $\sigma^2$.)

 

 

 

 

 

 

 

 

 

 

 

Theorem (Distribution of the LSE of $\beta_0$ and $\beta_1$) Under the simple linear model (1) with Normally distributed errors $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$,
\[
\hat{\beta}_0 \sim N\!\left( \beta_0, \; \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \right),
\qquad
\hat{\beta}_1 \sim N\!\left( \beta_1, \; \frac{\sigma^2}{S_{xx}} \right),
\]
where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$.
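The theorem can be checked by simulation. The sketch below uses assumed values $\beta_0 = 1$, $\beta_1 = 2$ and $\sigma = 1$ (not from the notes) and compares the empirical standard deviation of $\hat{\beta}_1$ with $\sqrt{\sigma^2 / S_{xx}}$.

    # Monte Carlo check of the sampling distribution of beta1_hat
    set.seed(415)
    x     <- 1:20                          # fixed design points
    Sxx   <- sum((x - mean(x))^2)
    sigma <- 1
    beta1_hats <- replicate(5000, {
      y <- 1 + 2 * x + rnorm(length(x), sd = sigma)   # model with beta0 = 1, beta1 = 2
      coef(lm(y ~ x))[2]
    })
    mean(beta1_hats)        # close to 2 (unbiasedness)
    sd(beta1_hats)          # close to sqrt(sigma^2 / Sxx)
    sqrt(sigma^2 / Sxx)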

 

 

 

 

 

 

 

 

Regarding optimality of least squares estimators

Confidence intervals and hypothesis testing related to regression models

Recall the distribution of the LSE $\hat{\beta}_0$ and $\hat{\beta}_1$ under the linear model with $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$:
\[
\hat{\beta}_0 \sim N\!\left( \beta_0, \; \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \right),
\qquad
\hat{\beta}_1 \sim N\!\left( \beta_1, \; \frac{\sigma^2}{S_{xx}} \right).
\]
Thus, the standardized regression coefficients are $N(0, 1)$ random variables, i.e.,
\[
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}} \sim N(0, 1),
\qquad
\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1). \tag{2}
\]
Unfortunately, $\sigma^2$ is unknown. Therefore, we replace $\sigma^2$ with a sample variance estimator defined as
\[
\hat{\sigma}^2 = \frac{1}{n - 2} \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 = \frac{SSE}{n - 2}.
\]
We note that the distribution of the residual sum of squares divided by $\sigma^2$ is $\chi^2_{n-2}$, i.e.,
\[
\frac{SSE}{\sigma^2} = \frac{(n - 2)\, \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}.
\]
In particular, $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.
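In R, the unbiased estimator $\hat{\sigma}^2 = SSE / (n - 2)$ is the square of the residual standard error reported by summary(); the MLE from the earlier theorem divides by $n$ instead. A short sketch with made-up data:

    # Estimating sigma^2 from the residuals (illustrative data)
    set.seed(2)
    x   <- 1:15
    y   <- 5 + 0.8 * x + rnorm(length(x), sd = 2)
    fit <- lm(y ~ x)
    n   <- length(y)
    SSE <- sum(residuals(fit)^2)
    SSE / n                    # MLE of sigma^2 (divides by n)
    SSE / (n - 2)              # unbiased estimator sigma_hat^2 (divides by n - 2)
    summary(fit)$sigma^2       # lm's residual standard error, squared; matches SSE / (n - 2)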

 

 

 

Moreover, we can show that $(\hat{\beta}_0, \hat{\beta}_1)$ and $\hat{\sigma}^2$ are independent.

Replacing $\sigma^2$ with $\hat{\sigma}^2$ in (2), we have
\[
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}} \sim t_{n-2},
\qquad
\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2},
\]
since a $N(0, 1)$ random variable divided by the square root of an independent $\chi^2_{n-2}$ random variable over its degrees of freedom, $n - 2$, has a $t_{n-2}$ distribution.

 

 

 

These results allow us to construct confidence intervals for $\beta_0$ and $\beta_1$, and to carry out hypothesis tests using t-tests.

 

Confidence intervals for the regression parameters

Using
\[
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}} \sim t_{n-2}
\qquad \text{and} \qquad
\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2}
\]
as pivotal quantities, a $100(1 - \alpha)\%$ confidence interval for $\beta_1$ is
\[
\hat{\beta}_1 \pm t_{\alpha/2,\, n-2} \sqrt{\hat{\sigma}^2 / S_{xx}},
\]
and a $100(1 - \alpha)\%$ confidence interval for $\beta_0$ is
\[
\hat{\beta}_0 \pm t_{\alpha/2,\, n-2} \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}.
\]
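These are the intervals that confint() returns for an lm fit. The sketch below (made-up data) builds the interval for $\beta_1$ by hand and compares it with confint().

    # 95% confidence interval for the slope, by hand and via confint()
    set.seed(3)
    x   <- 1:15
    y   <- 5 + 0.8 * x + rnorm(length(x), sd = 2)
    fit <- lm(y ~ x)
    n   <- length(y)
    Sxx <- sum((x - mean(x))^2)
    s2  <- sum(residuals(fit)^2) / (n - 2)
    tq  <- qt(0.975, df = n - 2)
    coef(fit)[2] + c(-1, 1) * tq * sqrt(s2 / Sxx)   # hand-built CI for beta_1
    confint(fit, level = 0.95)                      # intervals for beta_0 and beta_1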

 

A confidence interval for the average response value at $x = x^*$

Often we are interested in estimating the average response value at a particular $x = x^*$, i.e., estimating $E[Y \mid x = x^*]$. Let $\mu^* = \beta_0 + \beta_1 x^*$. We want to construct a confidence interval for $\mu^*$.

The estimator of the expected value of $Y$ at $x = x^*$ is $\hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$.

The distribution of the estimator of $\mu^*$ is
\[
\hat{\mu}^* \sim N\!\left( \mu^*, \; \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right) \right).
\]
Standardization gives
\[
\frac{\hat{\mu}^* - \mu^*}{\sqrt{\sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right)}} \sim N(0, 1),
\]
and by replacing the unknown $\sigma^2$ with $\hat{\sigma}^2$, we obtain the following pivotal quantity:
\[
\frac{\hat{\mu}^* - \mu^*}{\sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right)}} \sim t_{n-2}.
\]
Therefore, we can construct a (two-sided) $100(1 - \alpha)\%$ confidence interval for $\mu^*$ as
\[
\hat{\mu}^* \pm t_{\alpha/2,\, n-2} \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right)}.
\]
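In R, this interval is produced by predict() with interval = "confidence". A minimal sketch at a hypothetical new value $x^* = 3$ (made-up data):

    # Confidence interval for the mean response at x* = 3
    set.seed(4)
    x   <- runif(20, 0, 10)
    y   <- 60 + 5 * x + rnorm(20, sd = 4)
    fit <- lm(y ~ x)
    predict(fit, newdata = data.frame(x = 3),
            interval = "confidence", level = 0.95)   # fit, lower and upper for E[Y | x = 3]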

 

A prediction interval for a new response value at $x = x^*$

We are sometimes interested in predicting a new value of $Y$ given $x = x^*$. Where can we expect to see the next data point sampled? Note that even if we know the true regression line, we cannot perfectly predict the new value of $Y$ because of the existence of a random error. In particular, when $x = x^*$,
\[
Y^{new} = \beta_0 + \beta_1 x^* + \epsilon^{new},
\]
where $\epsilon^{new} \sim N(0, \sigma^2)$.

Suppose we know the true regression line and also $\sigma^2$. Since $Y^{new} \sim N(\beta_0 + \beta_1 x^*, \sigma^2)$, we can predict that 95% of the time, the next value will be within $(\beta_0 + \beta_1 x^*) \pm 1.96\,\sigma$.

Of course $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown, and we have to estimate these unknown quantities. The variation in the prediction of a new response depends on two components:

  1. the variation due to estimating the mean $\mu^* = \beta_0 + \beta_1 x^*$ with $\hat{\mu}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$.
  2. the variation due to the new random error $\epsilon^{new}$.

Adding the two variance components, we obtain
\[
\mathrm{Var}\!\left( Y^{new} - \hat{\mu}^* \right)
= \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right).
\]
A $100(1 - \alpha)\%$ prediction interval for $Y^{new}$ is
\[
\hat{\mu}^* \pm t_{\alpha/2,\, n-2} \sqrt{\hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right)}.
\]
If we sampled responses at $x = x^*$ many times, we would expect the next value to lie within the 95% prediction interval in 95% of the samples.
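The corresponding R call uses interval = "prediction". The sketch below (made-up data) computes both intervals at the same $x^*$ so their widths can be compared.

    # Prediction interval vs. confidence interval at x* = 3
    set.seed(5)
    x   <- runif(20, 0, 10)
    y   <- 60 + 5 * x + rnorm(20, sd = 4)
    fit <- lm(y ~ x)
    new <- data.frame(x = 3)
    predict(fit, newdata = new, interval = "confidence")   # interval for the mean response
    predict(fit, newdata = new, interval = "prediction")   # wider interval for a new observation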

 

Remark: comparing confidence intervals and prediction intervals at the same $x^*$, the prediction interval is always wider, since it accounts for the variability of the new error $\epsilon^{new}$ in addition to the uncertainty in estimating the mean response $\mu^*$.

Hypothesis tests for the regression parameters

Suppose we are interested in testing whether the slope $\beta_1$ is the same as a hypothesized value $\beta_1^{(0)}$ or not (usually we are interested in testing whether the slope is zero or not, i.e., $\beta_1^{(0)} = 0$).

  1. $H_0: \beta_1 = \beta_1^{(0)}$ vs. $H_a: \beta_1 \neq \beta_1^{(0)}$.

  2. Test statistic: $T = \dfrac{\hat{\beta}_1 - \beta_1^{(0)}}{\sqrt{\hat{\sigma}^2 / S_{xx}}}$.

  3. Under $H_0$, $T \sim t_{n-2}$.

  4. Reject $H_0$ if the observed test statistic is in the rejection region. For example, when $\alpha = 0.05$, reject $H_0$ if $t > t_{0.025,\, n-2}$ or $t < -t_{0.025,\, n-2}$.

The same steps are applied for testing about $\beta_0$ ($H_0: \beta_0 = \beta_0^{(0)}$), with the test statistic
\[
T = \frac{\hat{\beta}_0 - \beta_0^{(0)}}{\sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}}.
\]
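The t test for $H_0: \beta_1 = 0$ is exactly what summary() reports for an lm fit. A minimal sketch (made-up data) computing the statistic by hand and comparing:

    # t test for H0: beta1 = 0
    set.seed(6)
    x   <- 1:25
    y   <- 10 + 0.3 * x + rnorm(25, sd = 3)
    fit <- lm(y ~ x)
    n   <- length(y)
    Sxx <- sum((x - mean(x))^2)
    s2  <- sum(residuals(fit)^2) / (n - 2)
    t_obs <- coef(fit)[2] / sqrt(s2 / Sxx)     # observed test statistic
    t_obs
    2 * pt(-abs(t_obs), df = n - 2)            # two-sided p-value
    summary(fit)$coefficients                  # same t value and p-value in the x row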

 

Simulated example

By hand
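The worked numbers originally shown here are not preserved, so the following is only a sketch of the kind of computation this subsection carried out: simulate a small data set from a known line (assumed values $\beta_0 = 2$, $\beta_1 = 3$, $\sigma = 1$) and reproduce the estimates from the summary statistics by hand.

    # Simulate data from a known line: beta0 = 2, beta1 = 3, sigma = 1 (assumed values)
    set.seed(415)
    n <- 10
    x <- 1:n
    y <- 2 + 3 * x + rnorm(n, sd = 1)
    # "By hand" computation from the summary statistics
    xbar <- mean(x); ybar <- mean(y)
    Sxx  <- sum((x - xbar)^2)
    Sxy  <- sum((x - xbar) * (y - ybar))
    beta1_hat  <- Sxy / Sxx
    beta0_hat  <- ybar - beta1_hat * xbar
    sigma2_hat <- sum((y - beta0_hat - beta1_hat * x)^2) / (n - 2)
    c(beta0_hat, beta1_hat, sigma2_hat)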

 

 

 

 

 

 

R output
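The original R output is not preserved. For the same simulated data as in the by-hand sketch above, the corresponding commands would be along these lines; the printed coefficient estimates should match the by-hand values.

    # Fit the same simulated data with lm() and inspect the output
    fit <- lm(y ~ x)
    summary(fit)       # coefficient estimates, standard errors, t tests, residual SE
    confint(fit)       # confidence intervals for beta_0 and beta_1
    predict(fit, newdata = data.frame(x = 5),
            interval = "prediction")   # prediction interval at x* = 5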