• STAT 331 Applied Linear Models

      by Catherine Zhou

      Copyright Note:
This is just a rearranged and edited copy of the STAT 331 lecture notes. All rights belong to Professor Leilei Zeng of the Department of Statistics and Actuarial Science, University of Waterloo.

       


      Introduction

Regression is a statistical technique commonly used to quantify the relationship between a variable of interest (the response variable) and some other variables (the explanatory variables).

      We imagine we can explain Y in terms of x1,,xp using some function, s.t.

      (1)y=f(x1,,xp)

      Application of Regression

Field | Response (Y) | Explanatory Variables (x1,…,xp)
Public health | Lung function | weight, height, sex, age, COVID infection
Engineering | Circuit delay | resistance, temperature
Finance | Stock index | unemployment rate, money supply, government policy, etc.
Economics | Unemployment rate | consumer price index, inflation

       

A general form of a "linear regression model":

(2) y = β0 + β1x1 + ⋯ + βpxp (deterministic part) + ε (random part)

The objective is to determine the form of the linear function β0 + β1x1 + ⋯ + βpxp, more specifically to determine the values of the β's. This is done empirically by "fitting" models to data and identifying the model that describes the relationship the "best"!

      Topics

       


      1. Simple Linear Regression

A simple linear model is used to study the relationship between a response variable and a single explanatory variable.

      Example - Low Birthweight Infant Data

      Data consists of pairs (xi,yi),i=1,,n.

      A scatter plot of data reveals a linear relationship between X and Y.

      Pearson Correlation Coefficient

      For all random variables X and Y,

(3) ρ = Cov(X,Y) / √(Var(X)Var(Y))

      Given data, the sample correlation coefficient is

(4) r = Σ_{i=1}^n (xi−x̄)(yi−ȳ) / √( Σ_{i=1}^n (xi−x̄)² · Σ_{i=1}^n (yi−ȳ)² )

where Sxy = Σ(xi−x̄)(yi−ȳ), Sxx = Σ(xi−x̄)², and Syy = Σ(yi−ȳ)², so that r = Sxy/√(SxxSyy), and r ∈ [−1, 1].

This is useful for quantifying the strength and direction of the relationship, but it cannot be used for prediction; for that we consider linear regression models.
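As a quick illustration, here is a minimal R sketch on simulated data (the course datasets are not reproduced here; variable names are illustrative only), computing r from Sxy, Sxx, Syy and checking it against cor():

```r
set.seed(331)
x <- rnorm(50, mean = 29, sd = 2)        # a gestational-age-like predictor
y <- 4 + 0.8 * x + rnorm(50, sd = 1.5)   # a response linearly related to x
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
c(manual = Sxy / sqrt(Sxx * Syy), builtin = cor(x, y))  # the two values agree
```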

      1.1 Model Formulation

      A simple linear model:

      (5)y=β0+β1x+ϵ

      Suppose we observe pairs {(xi,yi); i=1,,n} from a random sample of n subjects, we write the model as

      (6)yi=β0+β1xi+ϵi,i=1,,n

      and we assume ϵiiidN(0,σ2).

      The normality of random error ϵi implies the normality of response yi such that

      (7)yiN(β0+β1xi,σ2)

      Here

(8) E(yi) = E[β0+β1xi+ϵi] = β0+β1xi + E(ϵi) = β0+β1xi   (β0+β1xi is fixed and E(ϵi)=0)
      (9)Var(yi)=Var(β0+β1xi+ϵi)=Var(ϵi)=σ2

      Interpretation of parameters β0 and β1

       

      1.2 Least Squares Estimation

      The "best" line has values of β0 and β1 that minimize the errors between yi and β0+β1xi.

The Least Squares Estimates (LSEs) of β0 and β1 are the values that minimize the sum of squares of errors

      (10)argminβ0,β1i=1n[yi(β0+β1xi)]2

To obtain the LSEs, take derivatives

(11) ∂/∂β0 Σ_{i=1}^n [yi−β0−β1xi]² = −2Σ(yi−β0−β1xi)
(12) ∂/∂β1 Σ_{i=1}^n [yi−β0−β1xi]² = −2Σ(yi−β0−β1xi)xi

      Solving the system of the two equations, we let both the derivatives be 0 and

(13) Σ(yi−β0−β1xi) = Σyi − nβ0 − β1Σxi = 0   (1)
(14) Σ(yi−β0−β1xi)xi = Σxiyi − β0·nx̄ − β1Σxi² = 0   (2)

We can rewrite (1) as

(15) β0 = (1/n)Σyi − (β1/n)Σxi = ȳ − β1x̄

Substituting this into (2) gives us

(16) β1 = (Σxiyi − nx̄ȳ) / (Σxi² − nx̄²)

The LSEs for β0 and β1 are

(17) β1^ = (Σxiyi − nx̄ȳ) / (Σxi² − nx̄²),   β0^ = ȳ − β1^x̄

Aside: since Σ(xi−x̄) = 0, we have

(18) Sxx = Σ(xi−x̄)² = Σ(xi−x̄)xi − x̄Σ(xi−x̄) = Σxi² − nx̄²
(19) Sxy = Σ(xi−x̄)(yi−ȳ) = Σ(xi−x̄)yi − ȳΣ(xi−x̄) = Σxiyi − nx̄ȳ

Therefore, the LSE β1^ can also be written as Sxy/Sxx.

Note: we should check that β0^ and β1^ (as given above) indeed minimize the sum of squares of errors (second derivative test).
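A minimal sketch on simulated data (not the course data) checking these closed-form LSEs against R's lm():

```r
set.seed(331)
x <- rnorm(50, 29, 2)
y <- 4 + 0.8 * x + rnorm(50, sd = 1.5)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1 <- Sxy / Sxx                 # beta1-hat = Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)    # beta0-hat = ybar - beta1-hat * xbar
fit <- lm(y ~ x)                # least squares fit in R
rbind(manual = c(b0, b1), lm = coef(fit))   # the two rows agree
```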

      1.2.1 Low Birth Weight Infant Data Example

Let yi be the head circumference (cm) and xi the gestational age (weeks), i=1,…,100.

(20) Σxi = 2889, Σyi = 2645, Σxi² = 84099, Σxiyi = 76910
(21) β1^ = (Σxiyi − nx̄ȳ)/(Σxi² − nx̄²) = (76910 − 100(2889/100)(2645/100)) / (84099 − 100(2889/100)²) ≈ 0.78
(22) β0^ = ȳ − β1^x̄ = 2645/100 − 0.78·(2889/100) ≈ 3.91

      Estimated/fitted regression line is then 3.91+0.78x

      Fitted value: yi^=β0^+β1^xi, i=1,,100

      Residual: ri=yiyi^, i=1,,100

Interpretation: the head circumference is expected to increase by 0.78 cm when gestational age increases by 1 week.

      Recall that LSEs β0^ and β1^ satisfy

(23) Σ(yi−β0^−β1^xi) = Σri = 0,   Σ(yi−β0^−β1^xi)xi = 0

      Therefore, residuals have following properties:

       

      1.3 Estimation of Variance σ2

      σ2 represents variability in random errors, and hence variability in responses.

Consider sampling from a single population, y1,…,yn ∼ N(μ,σ²). The sample variance 1/(n−1) Σ_{i=1}^n (yi−ȳ)² is an unbiased estimator of the population variance σ²; it is divided by n−1 because 1 df is lost by using the sample mean ȳ to estimate the population mean μ.

For simple linear regression models, the same logic applies, but recognize that yi ∼ N(β0+β1xi, σ²), that is, the yi come from different distributions with means that depend on xi.

       

1.3.1 Mean Squared Error (MSE)

(24) S² = 1/(n−2) Σ_{i=1}^n [yi − (β0^+β1^xi)]² = 1/(n−2) Σri²

is an unbiased estimator for σ²; it is divided by n−2 as 2 df's are lost due to the estimation of β0 and β1.

       

      1.3.2 Rocket Propellant Data Example

A rocket motor is manufactured by bonding an igniter propellant to a sustainer propellant inside a metal housing.

The shear strength of the bonding (a quality characteristic) depends on the age of the sustainer propellant:

20 observations of pairs (Xi,Yi), i=1,…,20; the data are stored in an Excel file "rocket.xls".

      Data Analysis using R (see Lecture 3 Note)

The estimated/fitted line is 2627.822 − 37.154x; with β1^ = −37.154 < 0, the older the propellant, the weaker the bonding strength.

      Fitted value for the propellant that is 8 weeks old in the sample (obs #3)

(25) y3^ = 2627.822 − 37.154(8) = 2330.59

Observed value: y3 = 2316

Residual: r3 = y3 − y3^ = −14.59 (over-estimate)
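A sketch of the corresponding R workflow; the file name comes from the notes, but the column names age and strength are assumptions:

```r
library(readxl)                            # for reading the .xls file
rocket <- read_excel("rocket.xls")         # 20 observations (age, strength assumed)
fit <- lm(strength ~ age, data = rocket)
coef(fit)        # intercept ~ 2627.822, slope ~ -37.154 (lecture values)
fitted(fit)[3]   # fitted strength for obs #3 (8 weeks old), ~ 2330.59
resid(fit)[3]    # residual for obs #3, ~ -14.59
```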

       

      1.4 Maximum Likelihood Method

      Suppose a r.v. Y has a probability density/mass function f(y;θ), we wish to estimate parameter θ that characterizes the distribution.

      In simple linear regression, yi,,yn are independent and yiN(β0+β1xi,σ2), i=1,,n.

      Likelihood function for θ=(β0,β1,σ2)

(27) L(θ) = Π_{i=1}^n f(yi;θ) = Π_{i=1}^n (2πσ²)^{−1/2} exp{ −(yi−β0−β1xi)² / (2σ²) }

      log-likelihood:

(28) l(θ) = Σ_{i=1}^n [ −(1/2)log(2πσ²) − (1/(2σ²))(yi−β0−β1xi)² ]

Score Functions:

(29) ∂l/∂β0 = (1/σ²) Σ(yi−β0−β1xi) = 0   (1)
∂l/∂β1 = (1/σ²) Σ(yi−β0−β1xi)xi = 0   (2)
∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ(yi−β0−β1xi)² = 0   (3)

Note that (1) and (2) are the same as the equations used to solve for β0 and β1 in the least squares approach. Thus the MLEs and LSEs for β0 and β1 are identical!

      However, for σ2, solving (3) gives

      (30)σ2^MLE=1ni=1n(yiβ0^β1^xi)2=1ni=1nri2

      which is different from σ2^=1n2i=1nri2 (MSE).

It can be shown that the MSE is an unbiased estimator for σ² (later); the MLE is biased and therefore not used in practice.

       

      1.5 Properties of Least Square Estimates

Simple linear model: y=β0+β1x+ϵ

      Sample: (x1,y1),,(xn,yn); each pair (xi,yi) satisfies yi=β0+β1xi+ϵi

       

      1.5.1 Least Squares Estimators (LSEs)

      (31)β1^=i=1n(xix¯)(yiy¯)i=1n(xix¯)2,β0^=y¯β1^x¯

      Responses y1,,yn are random variables, thus these estimators are also random variables.

      "Estimator" vs. "Estimate":

      Assumptions on the random error ϵ:

      Implication for the response variable y:

      Proposition: The LSE β0^ and β1^ are unbiased, that is E[β1^]=β1, E[β0^]=β0.

      Proof:

      (32)β1^=(xix¯)(yiy¯)(xix¯)2=(xix¯)yiy¯(xix¯)(xix¯)2=(xix¯)yi(xix¯)2E[β1^]=(xix¯)E(yi)(xix¯)2=(xix¯)(β0+β1xi)(xix¯)2=β0(xix¯)+β1(xix¯)xi(xix¯)2=β1E[β0^]=E[y¯β1^x¯]=1nE(yi)x¯E(β1^)=1ni=1n(β0+β1xi)x¯β1=β0+x¯β1x¯β1=β0

Proposition: The variances of β0^ and β1^ are Var(β1^) = σ²/Σ(xi−x̄)², Var(β0^) = σ²Σxi² / (nΣ(xi−x̄)²).

      Proof:

      (33)Var(β1^)=Var[(xix¯)yi(xix¯)2]=(1Sxx)2i=1nVar[(xix¯)yi]=1Sxx2i=1n(xix¯)2Var(yi)=σ2SxxVar(β0^)=Var(y¯β1^x¯)=Var(y¯)+Var(β1^x¯)+2Cov(y¯,β1^x¯)=Var(y¯)+x¯2Var(β1^)2x¯Cov(y¯,β1^)

      It can be shown that

      (34)Cov(y¯,β1^)=Cov(1ni=1nyi,(xix¯)yi(xix¯)2)=0

      We will prove in assignment 1 that Cov(aixi,bjyj)=ijaibjCov(xi,yj).

      Thus,

      (35)Var(β0^)=σ2n+x¯2σ2Sxx=Sxx+nx¯2nSxxσ2=xi2nx¯2+nx¯2nSxxσ2=σ2xi2nSxx

Proposition: The covariance between β0^ and β1^ is Cov(β0^,β1^) = −σ²x̄ / Σ(xi−x̄)²

      This will be proven in assignment 1.

      Assumption for the random error ϵ: ϵi's are independent and normally distributed.

      Implication:

      1.5.2 Summary

      Assumptions ϵiiidN(0,σ2) implies

A practical matter: the variance σ² is usually unknown, so it is necessary to estimate it.

      The MSE: S2=1n2i=1n(yiyi^)2 is an unbiased estimator for σ2, i.e. E(S2)=σ2

       

      1.5.3 Standard Error

(36) SE(β1^) = √(S²/Sxx),   SE(β0^) = √( S²Σxi² / (nSxx) )

      Instead of standard normal distribution, we have the following t-distribution results:

Aside: the pdf of a t_q-distribution (q is the df)

(37) f(t) = Γ((q+1)/2) / (√(πq) Γ(q/2)) · (1 + t²/q)^{−(q+1)/2}, t ∈ ℝ, where Γ(q) = ∫₀^∞ z^{q−1} e^{−z} dz

      Knowing the sampling distribution of the LSEs, i.e. t-distribution results, enables us to conduct inferences, e.g.

       

      1.6 Inference for Regression Parameters

      We wish to know the sampling distribution of LSEs β0^ and β1^, so that we can conduct inferences:

1. find a range of plausible values for the β's, e.g. a confidence interval for β1
      2. test if β's take some particular values, e.g. test H0:β1=0

      Given ϵiiidN(0,σ2), i=1,,n, we have showed that

(38) β1^ ∼ N(β1, σ²/Sxx),   β0^ ∼ N(β0, σ²Σxi²/(nSxx))

      We have standard error:

(39) SE(β1^) = √(S²/Sxx),   SE(β0^) = √(S²Σxi²/(nSxx))

      and (Student) t-distribution:

(40) (β1^−β1)/SE(β1^) ∼ t_{n−2},   (β0^−β0)/SE(β0^) ∼ t_{n−2}

      These results are then used for inference.

       

      1.6.1 Confidence Interval (CI)

We want a 100(1−α)% CI for β1, where 1−α is the confidence level. Given (β1^−β1)/SE(β1^) ∼ t_{n−2}, we have

(41) P( −t_{α/2,n−2} < (β1^−β1)/SE(β1^) < t_{α/2,n−2} ) = 1−α

      Thus, a 100(1α)% CI for β1 is

(42) {β1: β1^ − t_{α/2,n−2}SE(β1^) < β1 < β1^ + t_{α/2,n−2}SE(β1^)}

      or β1^±tα/2,n2SE(β1^).

      The percentage point of t-distribution can be obtained from a t-distribution table or using R.

Low Birth Weight Infant Data (cont.)

We find the standard error: SE(β1^) = √(S²/Sxx) = √(2.529/635.79) ≈ 0.063.

      A 95% CI for β1:

      (43)β1^±t0.025,98SE(β1^)=0.780±1.98(0.063)=(0.655,0.905)

We are 95% confident that the interval (0.655, 0.905) contains β1.
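In R, the percentage point and the CI can be obtained directly; this sketch assumes a fitted SLR object fit for the infant data, e.g. fit <- lm(headcirc ~ gestage):

```r
qt(0.975, df = 98)          # t_{0.025, 98}, approximately 1.98
confint(fit, level = 0.95)  # 95% CIs for beta0 and beta1
# manual version for beta1, using the lecture values:
0.780 + c(-1, 1) * qt(0.975, df = 98) * 0.063
```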

       

      1.6.2 Hypothesis Testing

Suppose we wish to test whether the slope β1 takes a specific value, say β. We write the hypotheses as follows:

(44) H0 (null hypothesis): β1 = β   vs.   Ha (alternative hypothesis): β1 ≠ β

      The t-test:

       

      Significance Test of β1

      A special case, we wish to test if there is a linear relationship between X and Y (i.e. if β1=0).

(47) H0: β1 = 0  vs.  Ha: β1 ≠ 0,   t = (β1^ − 0)/SE(β1^) = β1^/SE(β1^)

Reject H0 if |t| = |β1^/SE(β1^)| > t_{α/2,n−2}; we then say that the coefficient β1 is statistically significant.

      Some Remarks:

Rocket Propellant Data (cont.)

      Question: is there a significant (linear) relationship between propellant age (X) and bonding strength (Y)?

      We wish to test hypothesis:

      (50)H0:β1=0 vs. Ha:β10

The t-statistic value: t = β1^/SE(β1^) = −37.154/2.889 ≈ −12.86

Let α=0.05; t ∼ t18 if H0 is true. Since |t| = |−12.86| > 2.101 (t_{0.025,18} = 2.101), reject H0 and conclude that there is strong evidence of a linear relationship between bonding strength and propellant age.

Alternatively, calculate the p-value = 2P(T > |−12.86|) where T ∼ t18; the p-value = 1.64e−10 < α, so reject H0.
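A small R sketch of the same test, assuming the rocket fit from the earlier sketch; summary() reports the t statistic and two-sided p-value for each coefficient, and they can also be computed by hand:

```r
summary(fit)$coefficients      # row "age": estimate, SE, t value, Pr(>|t|)
2 * pt(-abs(-12.86), df = 18)  # two-sided p-value by hand, ~ 1.6e-10
qt(0.975, df = 18)             # critical value t_{0.025,18} = 2.101
```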

       

      1.7 Estimation of Mean Response

We wish to estimate the mean response at a given x value, say x = xp; according to the simple linear model we have

      (51)μ=E[y|xp]=β0+β1xp

      We will estimate it with μ^=β0^+β1^xp where β0^ and β1^ are LSEs.

      To make inference about μ (e.g. construct a 95% CI for μ), we need to know the sampling distribution of the estimator μ^.

Rocket Data Example (cont.)

What is the average bonding strength for rocket motors made from a batch of sustainer propellant that is 10 weeks old?

Recall the fitted line: y^ = 2627.822 − 37.154x

Estimated mean response:

(53) μ^ = 2627.822 − 37.154(10) = 2256.282,   SE(μ^) = √( S²[1/n + (10−x̄)²/Sxx] ) = 23.584

      A 95% CI for the mean μ at x=10:

      (54)μ^±t0.025,18SE(μ^)=(2206.731,2305.823)
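In R this is a single call to predict() with interval = "confidence" (again assuming the rocket fit with a predictor named age):

```r
predict(fit, newdata = data.frame(age = 10),
        interval = "confidence", level = 0.95)
# fit ~ 2256.28, lwr ~ 2206.73, upr ~ 2305.82
```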

A one-sided test scenario

      Estimate average bonding strength at x=16 weeks:

      (55)μ^=β0^+β1^(16)=2033.365

Note: this is a point estimate; the fact that μ^ > 2000 cannot be used to draw the conclusion that μ > 2000.

      Test hypothesis:

(56) H0: μ ≤ 2000  vs.  Ha: μ > 2000;   μ^ = 2033.365,  SE(μ^) = 22.801,  t = (μ^ − 2000)/SE(μ^) = 1.463

A large observed t-value implies that H0 is very unlikely to be true, so we reject H0 if t > t_{α,n−2} (one-sided). Since t = 1.463 < t_{0.05,18} = 1.734, do not reject H0.

There is not enough evidence to reject H0, so these propellants may not be safe to use.

       

      1.8 Prediction

      Here we consider prediction of a single response yp, for a "new" subject with an explanatory variable value x=xp.

      The (true) future response value is

      (57)yp=β0+β1xp+ϵp

      where ϵp is a future random error.

Naturally, we replace β0 and β1 by their LSEs, and replace the random error ϵp by its expectation, 0.

      We predict yp by yp^=β0^+β1^xp, and prediction error is ypyp^.

Some results about the prediction error yp − yp^:

1. E[yp − yp^] = 0

Proof:

(58) E[yp − yp^] = E(yp) − E(yp^) = E(β0+β1xp+ϵp) − E(β0^+β1^xp) = 0

2. Var(yp − yp^) = Var(ϵp − (β0^+β1^xp)) = Var(ϵp) + Var(β0^+β1^xp) = σ² + σ²[1/n + (xp−x̄)²/Sxx]

  β0^ and β1^ are functions of the data at hand, while the future error ϵp is not related to the data. Hence, ϵp is independent of β0^ and β1^.

3. It is also true that (yp − yp^ − 0)/SE(yp − yp^) ∼ t_{n−2}

      A 100(1α)% prediction interval for yp:

      (59)yp^±tα/2,n2SE(ypyp^)

      (it is wider than a 100(1α)% CI for mean response at x=xp)

      Rocket Data

Find the predicted value and a 95% prediction interval for the bonding strength of a new rocket motor made with a propellant that is 10 weeks old.

Prediction at x = 10:

(60) yp^ = β0^ + β1^(10) = 2256.282,   SE(yp − yp^) = √( S²[1 + 1/n + (10−x̄)²/Sxx] ) = 98.774

      A 95% prediction interval:

(61) yp^ ± t_{0.025,18} SE(yp − yp^) = (2048.476, 2463.524)

      A 95% CI for μ:

      (62)μ^±t0.025,18SE(μ^)=(2206.731,2305.833)
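The same predict() call with interval = "prediction" gives the (wider) prediction interval; again a sketch assuming the rocket fit and predictor name age:

```r
predict(fit, newdata = data.frame(age = 10),
        interval = "prediction", level = 0.95)   # wider than the CI above
```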

      1.9 Analysis of Variance (ANOVA)

      We consider regression analysis from a new perspective called analysis of variance. The approach of analysis of variance is based on the partitioning of total variability in responses.

      Total Sum of Squares (SST)

      (63)SST=i=1n(yiy¯)2

      Error Sum of Squares (SSE)

      (64)SSE=i=1n(yiyi^)2=i=1nri2

      Regression Sum of Squares (SSR)

      (65)SSR=i=1n(yi^y¯)2

      Partition of Total Sum of Squares:

(66) SST = Σ(yi−ȳ)² = Σ(yi−yi^+yi^−ȳ)² = Σ(yi−yi^)² + Σ(yi^−ȳ)² + 2Σ(yi−yi^)(yi^−ȳ); the cross term 2Σri(yi^−ȳ) = 2Σri yi^ − 2ȳΣri = 0, so SST = Σ(yi−yi^)² + Σ(yi^−ȳ)² = SSE + SSR

      1.9.1 Breakdown of Degrees of Freedom

      ANOVA Table

Source | SS | df | MS (mean squares)
Regression | SSR = Σ(yi^−ȳ)² | 1 | MSR = SSR/1 = SSR
Error | SSE = Σ(yi−yi^)² | n−2 | MSE = SSE/(n−2)
Total | SST = Σ(yi−ȳ)² | n−1 |

      Mean Squares:

Given the above results, we expect MSR to be larger than MSE if β1 ≠ 0.

      This gives an idea for testing significance of β1, H0:β1=0, we want to reject H0 if the observed value of the ratio MSRMSE is large, but how large is large enough? It would be nice to know the sampling distribution of MSRMSE.

Under H0: β1 = 0, the ratio

(70) F = MSR/MSE ∼ F_{1,n−2}

an F-distribution with df 1 and n−2.

We use an F-test for this hypothesis and reject H0 if the observed value of the F-statistic exceeds F_{α;1,n−2} (the upper α percentage point of the F_{1,n−2} distribution).

      Coefficient of Determination:

(71) R² = SSR/SST
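For a fitted SLR object fit, the ANOVA table and R² are available directly (a sketch):

```r
anova(fit)              # rows: the predictor (regression) and Residuals (error)
summary(fit)$r.squared  # R^2 = SSR / SST
```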

       


      2. Multiple Linear Regression

      2.1 Random Vectors and Multivariate Normal Distribution

      Random Vector

      For any set of r.v.'s y1,,yn, Y=(y1,,yn)T is a n×1 random vector.

      Mean of Y

      (72)E(Y)=[E(y1),,E(yn)]T=(μ1,,μn)T=μn×1

      Variance-Covariance Matrix of Y

Var(Y) = E[(Y−E(Y))(Y−E(Y))^T]
= [ Var(y1)   Cov(y1,y2)  …  Cov(y1,yn)
                Var(y2)    …  Cov(y2,yn)
                           ⋱
                               Var(yn) ]
= [ σ1²  σ12  …  σ1n
          σ2²  …  σ2n
                ⋱
                    σn² ]
= Σ (a symmetric n×n matrix)

Basic Properties

For an n×1 random vector Y,

E(AY+b) = A E(Y) + b,   Var(Y+b) = Var(Y),   Var(AY+b) = A Var(Y) A^T

(A and b are a matrix and a vector of constants).

      2.1.1 Multivariate Normal Distribution

      A random vector Y=(y1,,yn)T has a multivariate normal distribution if its density function takes the form

(73) f(y1,…,yn) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2)(Y−μ)^T Σ^{−1} (Y−μ) }

      where E(Y)=μn×1 and Var(Y)=Σn×n.

      We also write YMVN(μ,Σ).

      Some useful results about multivariate normal distribution:

1. If Y ∼ MVN(μ,Σ), then

  (74) AY + b ∼ MVN(Aμ+b, AΣA^T)

  (invariance under linear transformation)

2. If Y ∼ MVN(μ,Σ), then each element of Y has a normal distribution, i.e.

        (75)yiN(μi,σi2), i=1,,n

        More generally, for any partition of Y

        (76)Y=[y1ykyk+1yn]=[Y1Y2],μ=[μ1μ2],Σ=[Σ11Σ12Σ21Σ22]

        then Y1MVN(μ1,Σ11),Y2MVN(μ2,Σ22) (marginal normality!)

      3. If yiN(μi,σi2), and y1,,yn are independent, then

        (77)Y=(y1,,yn)TMVN(μ,Σ)

        where Σ=diag{σ12,,σn2}.

      4. For multivariate normal random vector Y

(78) Var(Y) is diagonal ⟺ y1,…,yn are independent
5. If Y ∼ MVN(μ,Σ), let V = AY and W = BY; then

  (79) V and W are independent ⟺ AΣB^T = 0

       

      2.2 Multiple Linear Models

      We can write the model in scalar form:

      (80)yi=β0+β1xi1++βpxip+ϵi

      Alternatively, we write the model in matrix form:

      (81)[y1y2yn]n×1=[1x11x1p1x21x2p1xn1xnp]n×(p+1)[β0β1βp](p+1)×1+[ϵ1ϵ2ϵn]n×1Yn×1=Xn×(p+1)β(p+1)×1+ϵn×1

      X is called design matrix.

      Interpretation of β Parameters:

Since ϵi ∼ iid N(0,σ²), the random error vector ϵ ∼ MVN(0, σ²I), where I is the n×n identity matrix.

According to the model Y = Xβ + ϵ, the response vector Y is a linear function of ϵ, thus Y ∼ MVN(Xβ, σ²I).

       

      2.3 Parameters Estimation

      2.3.1 Least Squares Approach

      Same as in simple linear regression

      (82)β^=argminβi=1n[yi(β0+β1xi1++βpxip)]2

      We want to find values of β vector that minimizes sum of squares.

      In matrix form,

      (83)β^=argminβ[(YXβ)T(YXβ)]

      Taking derivative w.r.t. vector β

∂/∂β [(Y−Xβ)^T(Y−Xβ)] = ∂/∂β [Y^TY − Y^TXβ − β^TX^TY + β^TX^TXβ] = −(Y^TX)^T − X^TY + 2X^TXβ = −2X^TY + 2X^TXβ

      Aside,

Setting the derivative to 0 and solving for β gives

−2X^TY + 2X^TXβ = 0  ⟹  X^TXβ = X^TY  ⟹  β = (X^TX)^{−1}X^TY

      Least Square Estimator for β:

      (84)β^=(XTX)1XTY

      Properties of LSE β^:

      1. E(β^)=β (unbiased)

        (85)E(β^)=E[(XTX)1XTY]=(XTX)1XTE(Y)=(XTX)1XTXβ=β
      2. Var(β^)=σ2(XTX)1

        Var(β^)=Var((XTX)1XTAY)=(XTX)1XTVar(Y)((XTX)1XT)T=(XTX)1XTσ2IX(XTX)1=σ2(XTX)1
      3. β^MVN(β,σ2(XTX)1)

        Since YMVN, then the linear transformation β^=(XTX)1XTY also has a MVN distribution.

4. βj^ ∼ N(βj, σ²[(X^TX)^{−1}]jj), where [(X^TX)^{−1}]jj is the (j,j) element of the matrix (X^TX)^{−1}
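A minimal sketch on simulated data showing the matrix formula and its agreement with lm(); vcov() returns the estimated σ²(X^TX)^{−1}:

```r
set.seed(331)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                      # design matrix, n x (p+1)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
fit <- lm(y ~ x1 + x2)
cbind(beta_hat, coef(fit))                  # identical estimates
vcov(fit)                                   # estimate of sigma^2 (X'X)^{-1}
```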

       

      2.4 Fitted Values and Residuals

      Fitted Value Vector:

(86) Y^ = Xβ^ = X(X^TX)^{−1}X^T Y = HY

      Hat Matrix:

      (87)H=X(XTX)1XT

      Residual Vector:

      (88)r=YY^=YHY=(IH)Y

       

      Some Other Results about Residuals:

      Proof:

(89) X^T r = (Σri, Σri xi1, …, Σri xip)^T, and X^T r = X^T(Y−Y^) = X^TY − X^TX(X^TX)^{−1}X^TY = 0
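A sketch verifying these identities numerically (simulated data as in the earlier sketch):

```r
set.seed(331)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
r  <- y - H %*% y                        # residual vector (I - H) y
max(abs(t(X) %*% r))                     # X'r = 0 up to rounding error
sum(diag(H))                             # tr(H) = p + 1 = 3 (used in Section 2.5)
```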

      2.5 Estimation of σ2

      We still want to use sum of squares of residuals SSE=i=1nri2 to estimate the variance of random error, σ2.

      Question: What is the df of SSE in MLR?

Result: SSE/σ² ∼ χ²_{n−(p+1)}

Proof: The idea is to re-write SSE/σ² = (1/σ²) r^T r as a sum of independent χ²₁-distributed random variables.

Recall r ∼ MVN(0, σ²(I−H)); (I−H) is symmetric and idempotent, with spectral decomposition

(90) (I−H) = PΛP^T

where P is an n×n orthogonal matrix (P^T = P^{−1}), and Λ is an n×n diagonal matrix with 0's and 1's on the diagonal.

      Now we define a new random vector

(91) Z = (1/σ) P^T r,   Z^TZ = (1/σ²) r^T P P^T r = (1/σ²) r^T r

then Z ∼ MVN(0, Λ), as Z is a linear transform of r:

(92) E(Z) = (1/σ) P^T E(r) = 0,   Var(Z) = (1/σ²) P^T Var(r) P = (1/σ²) P^T σ²(I−H) P = P^T(PΛP^T)P = Λ

      since P is orthogonal, PTP=I.

      We know vector ZMVN(0,Λ), here Λ is diagonal with value 0 or 1 for diagonal elements. To find the degree of freedom, we need to find how many diagonal elements in Λ is 1, and

tr(Λ) = tr(I−H) = tr(I_{n×n}) − tr(H) = n − tr(X(X^TX)^{−1}X^T) = n − tr((X^TX)^{−1}(X^TX)) = n − tr(I_{(p+1)×(p+1)}) = n − (p+1)

This implies that the vector Z has n−(p+1) non-degenerate elements, and they are iid N(0,1) random variables.

      Therefore,

(93) SSE/σ² = (1/σ²) r^T r = Z^TZ ∼ χ²_{n−(p+1)}

Moreover,

(94) E(SSE/σ²) = n−(p+1),  so  E(SSE) = σ²(n−(p+1))

So the MSE

(95) MSE = SSE / (n−(p+1))

      is an unbiased estimator for σ2.

       

      2.6 Inference in Multiple Regression

      Some useful results:

      1. β^MVN(β,σ2(XTX)1)

      2. βj^N(βj,σ2[(XTX)1]jj) or equivalently

        (96)βj^βjσ2[(XTX)1]jjN(0,1)
3. As σ² is unknown, we replace it by its estimator S² = SSE/(n−(p+1)) and obtain a t-distribution result as in simple linear regression:

  (97) (βj^ − βj) / SE(βj^) ∼ t_{n−(p+1)},  where SE(βj^) = √( S²[(X^TX)^{−1}]jj )

  Here, the df is n−(p+1), where p denotes the number of explanatory variables.

       

      For an individual coefficient βj:

      Question: Can we fit a linear model when the number of model parameters (or number of explanatory variables) is larger than the number of observations? (i.e. p>n)

LSE: β^ = (X^TX)^{−1}X^TY. For X^TX to be invertible, the design matrix X (n×(p+1)) needs to have full column rank: the number of rows must be at least the number of columns, that is n ≥ p+1, and rank(X) = p+1.

      2.6.1 Inference for Linear Combinations of Coefficients: cTβ

      In multiple linear regression, we wish to estimate the mean response at some given values of explanatory variables, e.g. c=(1,x1,,xp)T, a vector of constants.

      Estimated Mean Response at c=(1,x1,,xp)T

      (99)μ^c=cTβ^

      Recall that β^MVN(β,σ2(XTX)1), then

      (100)μ^cN(cTβ,cT[σ2(XTX)1]c)

Following the same argument as before, replace σ² by its estimator σ²^ = SSE/(n−(p+1)); we have

      (101)cTβ^cTβcT[σ2^(XTX)1]cSE(cTβ^)tn(p+1)

      This is used to conduct inference about μc=cTβ.

      (102)μ^c±tα/2,n(p+1)SE(μ^c)

      is the (1α)100% CI for μc.
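A sketch of both routes in R, assuming a fitted MLR object fit with predictors x1 and x2 (so c = (1, 0.5, −1)^T for the illustrative values below):

```r
new <- data.frame(x1 = 0.5, x2 = -1)
predict(fit, newdata = new, interval = "confidence", level = 0.95)
# manual version with c = (1, 0.5, -1):
cvec <- c(1, 0.5, -1)
est  <- sum(cvec * coef(fit))
se   <- sqrt(as.numeric(t(cvec) %*% vcov(fit) %*% cvec))
est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se
```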

       

      2.7 Prediction in Multiple Linear Regression

      For a new subject with explanatory variable values c=(1,x1,,xp)T, we wish to predict the outcome and obtain a prediction interval.

      Future response: according to the model

      (103)yp=cTβ+ϵp

       

      2.8 Geometric Interpretation of Least Squares

      Let

From a geometric point of view, each n×1 vector above represents a point in the n-dimensional space ℝⁿ.

[Figure: the vectors Y, Xβ, and Xβ^ in ℝⁿ]

       

      2.9 ANOVA for Multiple Linear Regression

      Consider a multiple linear model

      (109)Yi=β0+β1xi1++βpxip+ϵi, i=1,,n

      We still have following source of variations:

      Sum of squares and degrees of freedom add up as before:

(110) SST = SSR + SSE,   dfT = dfR + dfE

      The value of SST and dfT remains the same:

      (111)SST=i=1n(yiy¯)2,dfT=n1

      But the values of SSR and SSE and their df's are different from SLR.

SSR = Σ(yi^−ȳ)², dfR = p;   SSE = Σ(yi−yi^)², dfE = n−(p+1)

      ANOVA Table

Source | SS | df | MS
Regression | SSR = Σ(yi^−ȳ)² | p | MSR = SSR/p
Error | SSE = Σ(yi−yi^)² | n−(p+1) | MSE = SSE/(n−p−1)
Total | SST = Σ(yi−ȳ)² | n−1 |

      2.9.1 F-test of "overall" significance

      Suppose we want to evaluate the "overall" significance of a multiple linear model, e.g. we wish to test hypotheses:

(113) H0: β1 = β2 = ⋯ = βp = 0 (all slope coefficients are 0)   vs.   Ha: at least one βj ≠ 0

      F-statistic:

(114) F = (SSR/p) / (SSE/(n−p−1)) = MSR/MSE

We reject H0 if F is large (that is, the model explains much more variation in the response than the random error), i.e.

(115) F > F_{α,p,n−p−1} (critical value)

Why α instead of α/2? Because we reject only for large positive values of F (a one-sided rejection region).

(116) p-value = P(F_{p,n−p−1} > observed F)

      If H0 is rejected, we conclude that at least one of the explanatory variables is important or related to Y.

      2.9.2 Coefficient of Determination

(117) R² = SSR/SST

Adjusted R²

(118) R²adj = 1 − (n−1)/(n−p−1) · (1−R²)

      Example: Low Birthweight Infant Data

Fit summary: lm(headcirc ~ gestage + birthwt)

      Residual standard error: 1.274 on 97 df

      Multiple R2: 0.753, Adjusted R2: 0.747

      F-statistic: 147.1 on 2 and 97 df.

Test H0: β1 = β2 = 0 vs. Ha: at least one of β1 and β2 is not 0.

(119) F = MSR/MSE = 147.1

Since F = 147.1 > F_{0.05,2,97} = 3.09 (equivalently, p-value = P(F_{2,97} > 147.1) ≈ 0), reject H0 and conclude that the model is significant.

       


      3. Specification Issues in Regression Models

       

      3.1 Categorical Variables

In linear regression models, the response variable has to be numerical; however, the explanatory variables can be either numerical or categorical.

      A categorical variable takes a value that falls into one of several categories, e.g.

      Approach: recode into indicator variables or treat as numerical if it makes sense to do so.

      Example: Occupational Prestige

      Reponse: prestige score

      Explanatory: education (yr), type: blue collar, white collar, professional

Occupation | prestige (Y) | education (x1) | type (x2)
Computer operations | 47.7 | 11.36 | wc
Construction labourers | 26.5 | 7.52 | bc
Office clerks | 35.6 | 11.00 | wc
Nurses | 64.7 | 12.46 | prof

      How to code variable "type"?

      e.g, let

(120) xi,2 = 0 if bc, 1 if wc, 2 if prof

This approach is generally not appropriate unless the explanatory variable is ordinal and the effect on the response is linear in this coding.

A more flexible approach: introduce indicator/dummy variables

(121) xi,2 = 1 if prof, 0 otherwise;   xi,3 = 1 if wc, 0 otherwise

If instead we introduced an indicator for each of the three levels (prof, wc, bc), together with the intercept, the design matrix would be

(122) X = [ 1 11.36 0 1 0
            1  7.52 0 0 1
            1 11.00 0 1 0
            1 12.46 1 0 0 ]

with rank(X) = 4: the three indicator columns sum to the column of 1's, so the design matrix X is not of full column rank and X^TX is not invertible!

      In general, if there are k levels, just need k1 indicators for the categorical variable.

      Model:

(123) yi = β0 + β1xi,1 + β2xi,2 + β3xi,3 + ϵi, where xi,2 and xi,3 are the indicators for the variable "type"

      Interpretation:

      We know the LSE β^MVN(β,σ2(XTX)1)
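In R, declaring type as a factor makes lm() build the k−1 indicator columns automatically; a sketch using the four rows of the prestige example above (variable names are assumptions):

```r
prestige <- data.frame(
  score = c(47.7, 26.5, 35.6, 64.7),
  educ  = c(11.36, 7.52, 11.00, 12.46),
  type  = factor(c("wc", "bc", "wc", "prof"))   # baseline level is "bc"
)
model.matrix(score ~ educ + type, data = prestige)  # intercept, educ, typeprof, typewc
fit <- lm(score ~ educ + type, data = prestige)     # k - 1 = 2 indicators for 3 levels
```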

       

      3.2 Interaction Effects

      Interaction effects exist when the effect of one variable depends on the value of another variable.

      Suppose we have two numerical variables x1,x2, an interaction is the product of these two variables, e.g.

      (124)y=β0+β1x1+β2x2+β3x1x2+ϵ

      We call β1 and β2 the main effects of x1 and x2, and β3 the interaction effect between x1 and x2.

      General Interpretation:

As we can see, by adding the interaction x1x2, we allow the effect of x1 to depend on the value of the other variable x2.

Now consider the case when one of the two variables, say x2, is binary.

      Example: low birthweight infant

(128) y (head circ) = β0 + β1·x1 (gest age) + β2·x2 (toxemia, 0/1) + β3·x1x2 + ϵ

      Question: is the relationship between head circ (y) and gest age (x1) the same for these two groups? We can test significance of interaction effect β3:

      (129)H0:β3=0,vs.Ha:β30
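A minimal simulated sketch: in R the formula x1 * x2 expands to both main effects plus the interaction, and the interaction row of the coefficient table gives the t-test of H0: β3 = 0 (variable names are illustrative, not the course data):

```r
set.seed(331)
gestage  <- rnorm(100, 29, 2)
toxemia  <- rbinom(100, 1, 0.3)                                      # binary indicator
headcirc <- 4 + 0.8 * gestage - 0.5 * toxemia + rnorm(100, sd = 1.5) # no true interaction here
fit <- lm(headcirc ~ gestage * toxemia)          # gestage + toxemia + gestage:toxemia
summary(fit)$coefficients["gestage:toxemia", ]   # t test of H0: beta3 = 0
```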

       

      4. Testing General Linear Hypothesis

      Model:

      (130)yi=β0+β1xi1++βpxip+ϵi

      We have discussed

      General Linear Hypothesis:

      (131)H0:Aβ=0vs.Ha:Aβ0

      Here A is a l×(p+1) matrix of some constants.

      Example: Prestige Data

      "Full" model:

(132) yi (prestige score) = β0 + β1·xi1 (education) + β2·xi2 (indicator for prof) + β3·xi3 (indicator for wc) + ϵi

where xi2 and xi3 together represent the type variable. We want to test whether β2 = β3 = 0, meaning that type does not matter overall.

      1. H0:β2=β3=0, tests overall effect of type

(133) A = [ 0 0 1 0
            0 0 0 1 ],   H0: Aβ = 0 (2×1)

        If H0 is true, the "full" model reduces to yi=β0+β1xi1+ϵi, which is the "reduced" model.

2. H0: β1 = 0, β2 = β3; this tests that education has no effect and that there is no difference between prof and wc.

  (134) A = [ 0 1 0  0
              0 0 1 −1 ],   H0: Aβ = 0 (2×1)

  If H0 is true, the "full" model reduces to yi = β0 + β2(xi2+xi3) + ϵi, where xi2+xi3 = 1 if the type is prof or wc.

In general, with l constraints, A is an l×(p+1) matrix and rank(A) = l.

      Principle of Extra Sum of Squares:

      4.1 F-test for General Linear Hypothesis: H0:Aβ=0

      F-statistic:

(141) F = ( ‖Y^ − Y^A‖²/l ) / ( ‖r‖²/(n−p−1) ) = ( (SSE_A − SSE)/l ) / ( SSE/(n−p−1) ) ∼ F_{l,n−p−1} under H0

      Note: l is the number of constraints, rank(A)=l.

      A sketch of proof:

1. SSE/σ² ∼ χ²_{n−p−1} (needs ϵi ∼ iid N(0,σ²))
2. (SSE_A − SSE)/σ² ∼ χ²_l under H0 — write it as a sum of squares of standard normal variables (needs ϵi ∼ iid N(0,σ²) and H0 true)
3. SSE and SSE_A − SSE are independent.

      Therefore, combining 1, 2, and 3,

(142) F = [ ((SSE_A − SSE)/σ²) / l ] / [ (SSE/σ²) / (n−p−1) ] ∼ F_{l,n−p−1} under H0

      Example: Prestige Score Data

      Full model: education (x1), type (x2=I(type=prof),x3=I(type=wc)), education × type (interaction)

      (143)y=β0+β1x1+β2x2+β3x3+β4x1x2+β5x1x3+ϵ

      Question: is the effect of education x1 on prestige score y the same for different types?

      We wish to test the significance of interaction effects, e.g.

(144) H0: β4 = β5 = 0  vs.  Ha: at least one is not 0;   A = [ 0 0 0 0 1 0
                                                                0 0 0 0 0 1 ],   Aβ = (β4, β5)^T = 0

Reduced model (when H0: Aβ = 0 is true):

      (145)yA=β0+β1x1+β2x2+β3x3+ϵ

      In R, fit both full and reduced models.

For the full model, the residual standard error is σ^ = √MSE = 7.827 on 92 df, so SSE = (7.827)²·92.

For the reduced model, the residual standard error is σ^A = √MSE_A = 7.814 on 94 df, so SSE_A = (7.814)²·94.

F-statistic for H0: β4 = β5 = 0:

(146) F = ( (SSE_A − SSE)/l ) / ( SSE/(n−p−1) ) = [ (7.814)²·94 − (7.827)²·92 ] / 2 / (7.827)² = (103.41/2)/(7.827)² ≈ 0.844

Under H0, F ∼ F_{2,92}; the p-value = P(F_{2,92} > 0.844) = 0.433, so do not reject H0: we have no evidence that the effect of education differs between types.

In R, the anova() function can be used to carry out this F-test.
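A sketch of that workflow on simulated data with prestige-style variable names (assumptions): fit the reduced and full models and pass both to anova():

```r
set.seed(331)
n     <- 100
educ  <- rnorm(n, 11, 3)
type  <- factor(sample(c("bc", "wc", "prof"), n, replace = TRUE))
score <- 10 + 4 * educ + 5 * (type == "prof") + 2 * (type == "wc") + rnorm(n, sd = 7)
full    <- lm(score ~ educ * type)   # includes the educ:type interaction terms
reduced <- lm(score ~ educ + type)   # model under H0: no interaction
anova(reduced, full)                 # extra-sum-of-squares F test, l = 2 constraints
```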

       


       

      5. Model Assumptions and Residual Analysis

      5.1. Introduction

      A linear regression model is specified under several assumptions:

1. The relationship between y and x1,…,xp is linear, i.e. y = β0 + β1x1 + ⋯ + βpxp (a linear function) + ϵ
      2. ϵi has a mean of 0, E[ϵi]=0
      3. ϵi's have a constant variance, Var(ϵi)=σ2
      4. ϵi's are normally distributed
      5. ϵi's are independent

      Random errors ϵi's are not directly observable, thus it seems hard to examine their distributional properties.

      However, we can obtain residuals, ri=yiyi^, and we argue residuals behave similarly as random errors.

      Relationship between Residuals and Random Errors

      Idea: check assumptions about ϵi's using diagnostic plots of residuals, ri's.

      5.2 Residual Analysis

      Residual: ri=yiyi^

Standardized Residual: di = ri / σ^

Studentized Residual: ei = ri / √( σ^²(1−hii) )

      Residual Plots

      Plot of Residuals vs. Fitted Values

      A plot of residual ri or any of scaled residuals (di or ei) against corresponding fitted value (yi^)

1. If the residuals fluctuate randomly around 0 inside a horizontal band, then there are no visible defects.

2. If the residuals can be contained in an opening funnel ("fan" shape), this indicates that Var(ϵi) is not constant.

3. If the residuals follow a curved pattern, this indicates nonlinearity (the relationship between y and some x variables is not linear, or some other explanatory variables are needed).

      Plots of Residuals vs. Explanatory Variable

      A plot of residuals against the values of j-th explanatory variable, xij's.

The interpretation of this plot is the same as for the plot of residuals vs. yi^'s.

1. Horizontal band ⇒ no visible defects

2. Funnel/fan shape ⇒ variance is nonconstant

3. Curvature ⇒ nonlinearity (may suggest adding xj² to the model)

      Partial Residual Plots

Most useful for investigating the relationship between the response y and an explanatory variable xj.

Partial residual for xj, j=1,…,p: ri(j) = ri + βj^xij, where ri is the residual based on all p explanatory variables; this adds the effect of the xj variable back into the residual.

      Plot Partial Residuals ri(j) vs. xij

1. Linear trend ⇒ xj enters the model linearly

2. Curvature ⇒ higher order terms in xj may be helpful

      Q-Q Plot for Normal Distribution

This is a graphical technique for detecting substantive departures from normality. Plot the ordered standardized residuals d(1) < d(2) < ⋯ < d(n) against the theoretical quantiles of N(0,1). If normality holds, the ordered residuals should align with the normal quantiles. (An R sketch follows the list below.)

1. Points lie approximately on a straight line ⇒ underlying distribution is normal

2. Sharp upward and downward curves at both extremes ⇒ heavy-tailed

3. Flattening at the extremes ⇒ light-tailed

4. Sharp upward change in the trend from the middle ⇒ positively skewed (right skewed)

5. Sharp downward change in the trend from the middle ⇒ negatively skewed (left skewed)
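A sketch of these diagnostic plots for a fitted model object fit:

```r
plot(fitted(fit), rstandard(fit),              # standardized residuals vs fitted values
     xlab = "fitted values", ylab = "standardized residuals")
abline(h = 0, lty = 2)
qqnorm(rstandard(fit)); qqline(rstandard(fit)) # normal Q-Q plot of standardized residuals
# plot(fit) produces a similar set of diagnostic plots directly.
```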

      Remarks:

      5.3 Addressing Model Assumption Problems

      If residual plots reveal problems with assumptions, we might be able to address via data transformation.

1. Transformation of Response to Stabilize Variance

        This can help address non-constant variance identified by the plot r vs. y^ (or r vs. xj)

        Idea: apply function g and fit regression model on transformed g(yi)

        (147)g(yi)=β0+β1xi1++βpxip+ϵi

  Rationale: the variance of the response might be a function of the mean μi = E(yi), i.e.

  (148) Var(yi) = Var(ϵi) = h(μi)σ²  (for some h(·) > 0)

  in which case we want Var(g(yi)) ≈ σ²

        By first order Taylor Expansion, we have

  (149) g(yi) ≈ g(μi) + (yi−μi)g′(μi)

  Then,

  (150) Var(g(yi)) ≈ [g′(μi)]² Var(yi) ≈ [g′(μi)]² h(μi) σ²

  Thus, we need [g′(μi)]² ∝ 1/h(μi).

        Examples:

  1. h(μi) = μi, i.e. Var(yi) = μiσ².

    We need g′(μi) ∝ 1/√μi, so g(μi) = √μi will work.

    Thus, we can apply g(yi) = √yi to obtain approximately constant variance.

  2. h(μi) = μi², i.e. Var(yi) = μi²σ².

    We need g′(μi) ∝ 1/μi, so g(μi) = ln(μi) will work.

    Thus, we can apply g(yi) = ln(yi) to obtain approximately constant variance.

        3. Box-cox Power Transformation

    (151) g(yi) = (yi^λ − 1)/λ if λ ≠ 0;  log(yi) if λ = 0

    Note that

    (152) g′(μi) = μi^{λ−1} if λ ≠ 0;  1/μi if λ = 0,  so  h(μi) ∝ 1/[g′(μi)]² = μi^c

          where c is some arbitrary power. Thus, Box-Cox transformation can help address the non-constant variance of the form

          (153)Var(yi)=μicσ2

          Some special cases:

          • λ = 1/2: the square-root transformation
          • λ = 0: the log transformation
          • λ = 1: the identity transformation (no transformation)
          • λ = −1: the reciprocal transformation

          One can automatically try a sequence of values of λ and find the "best" choice, the one that gives the largest log-likelihood value (see the R sketch after this list).

          Note that interpreting βj can be less intuitive as a result of transformation, since now increasing xj by 1 unit corresponds to a change of βj in g(yi).

          For log transformation: g(yi)=log(yi), βj represents the change in mean response in the log scale,

          (154)yi=eβ0+β1xi1++βpxip+ϵi
          • eβj is the multiplicative change applied to the original response.

          • every 1 unit increase in xj is associated with a 100(e^{βj} − 1) percent change in the mean response on the original scale.

          • 95% CI for mean response at x=(1,x1,,xp)T

            (155)exp(xTβ^ ± tα/2,np1SE(xTβ^))

          For arbitrary λ, transformation leads to less interpretable results.

2. Transforming/Adding Explanatory Variables

          This might be applied if y (or g(yi)) has a clear non-linear relation with some xj, for example revealed by plots of r vs. xj or partial residual plots.

          1. We can consider transforming xj using power transformation, e.g. log(xj)

          2. We could add polynomial terms, such as x2,x3,, e.g.

    (156) yi = β0 + β1xi + β2xi² + ϵi

            which is still a linear model (linear in β's)
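The automatic λ search for the Box-Cox transformation mentioned above can be done with boxcox() from the MASS package; a minimal sketch on simulated data where a log transformation is appropriate:

```r
library(MASS)
set.seed(331)
x <- runif(80, 1, 10)
y <- exp(0.3 + 0.2 * x + rnorm(80, sd = 0.2))     # multiplicative errors => log scale
fit <- lm(y ~ x)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # plots the profile log-likelihood
bc$x[which.max(bc$y)]                             # "best" lambda (near 0 here)
```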

       

      6. Effects of Individual Observations

Sometimes some of the observations have an unduly large influence on the fit of the model.

      Outliers

      An outlier is defined as a particular observation (yi,xi1,,xip) that differs substantially from the majority of the observations in the dataset.

      e.g. SLR (simple linear regression):

      What causes outliers?

      Generally we do not recommend removing outliers from dataset, unless we have good reason to believe the observation is an error or the subject does not belong to the target population.

      Outliers may change the result of model fitting, e.g. observation C pulls the fitted line closer to it, observation A or B may change the slope a bit.

Therefore, it is useful to investigate to what extent the outliers influence our fitted models.

      How to detect outliers?

      6.1 Studentized Residuals

(157) ei = ri / √( σ^²(1−hii) )

       

Recall r ∼ MVN(0, σ²(I−H)) (hii is the ith diagonal element of H), and ei ∼ N(0,1) approximately.

Rule of Thumb: if ei is very large relative to N(0,1), e.g. |ei| > 3, then observation i could be considered an outlier in the sense of an extreme value of the response yi.

      e.g. plot of studentized residual (ei) vs. fitted value (yi^) can be used to identify outliers in response.

      6.2 Leverage

      Recall hii is the ith diagonal element in the hat matrix H=X(XTX)1XT, we call hii the leverage of observation i. Note that

      (158)Y^=Xβ^=X(XTX)1XTY=HY

      So

(159) yi^ = (hi1,…,hin)(y1,…,yn)^T = Σ_{j=1}^n hij yj = hii yi + Σ_{j≠i} hij yj

      The leverage hii characterizes the influence of observation yi on its corresponding fitted value yi^.

      Recall H=X(XTX)1XT, it only involves explanatory variables, but not response.

      The leverage hii is small for observation with (xi1,xi2,,xip) near the centroid (x¯1,,x¯p), and it is large if (xi1,,xip) is far away from the centroid.

e.g. in SLR, X = [1 x1; …; 1 xn], and

(X^TX)^{−1} = (1/(nSxx)) [ Σxi²  −nx̄
                            −nx̄    n ],
so hii = (1, xi)(X^TX)^{−1}(1, xi)^T = 1/n + (xi−x̄)²/Sxx

      (This generalizes to MLR)

Thus, the leverage is useful for identifying outliers in the sense of having explanatory variables with extreme values.

      6.3 Influence

An observation is influential if its presence in fitting the regression considerably changes the estimates of β compared to when it is not used to fit the model.

      Start by fitting a model with observations Y=Xβ+ϵ, and obtain LSE β^ as usual.

      Let β^(i) denote LSE based on fitting the same model but the ith observation removed from the data set. If β^(i) is quite different than β^, then observation i is highly influential.

      Cook's Distance (between β^(i) and β^)

(162) Di = (β^ − β^(i))^T [σ^²(X^TX)^{−1}]^{−1} (β^ − β^(i)) / (p+1) = (β^ − β^(i))^T X^TX (β^ − β^(i)) / ( σ^²(p+1) )

It can be shown that Cook's Di can be written as

(163) Di = ei² · hii/(1−hii) · 1/(p+1)

where we can see that Di depends on both ei (the studentized residual ei = ri/√(σ^²(1−hii))) and hii (the leverage).

      Observation i is most influential when both |ei| and hii are large values.
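R provides all three diagnostics for a fitted model object fit; the cutoffs in the comments are common rules of thumb rather than values from the notes:

```r
e <- rstandard(fit)       # studentized residuals e_i (rstudent() gives the externally studentized version)
h <- hatvalues(fit)       # leverages h_ii
d <- cooks.distance(fit)  # Cook's distances D_i
which(abs(e) > 3)         # possible outliers in the response
which(h > 2 * mean(h))    # high-leverage points, cutoff 2(p+1)/n
which(d > 4 / length(d))  # one common cutoff for influential points
```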

Proof: For the ith observation, let vi = (1, xi1, …, xip)^T denote the vector of its explanatory variable values.

      (164)X(i)=[v1TviTvnT](delete ith viT)Y(i)=[y1yiyn](delete yi)

      Without observation i, the LSE of β becomes

      (165)β^(i)=[X(i)TX(i)]1X(i)TY(i)

      Note without loss of generality,

      X=[X(i)viT],Y=[Y(i)yi]XTX=X(i)TX(i)+viviT,XTY=X(i)TY(i)+viyiX(i)TX(i)=XTXviviT,X(i)TY(i)=XTYviyi

Substituting these into the expression for β^(i):

      β(i)^=[XTXviviT]1(XTYviyi)=[(XTX)1+(XTX)1viviT(XTX)11viT(XTX)1vi](XTYviyi)=(XTX)1XTYβ^(XTX)1viyi+(XTX)1viviT(XTX)1XTYβ^(XTX)1viviT(XTX)1vihiiyi1viT(XTX)1vihii=β^(XTX)1viyi+(XTX)1viviTβ^(XTX)1vihiiyi1hii=β^(XTX)1vi1hii[yi(1hii)viTβ^y^i+hiiyi]=β^(XTX)1vi1hii(yiyi^)β(i)^=β^(XTX)1viri1hiiβ^β(i)^=(XTX)1viri1hii

Note that [A − aa^T]^{−1} = A^{−1} + A^{−1}aa^TA^{−1} / (1 − a^TA^{−1}a)

      Substitute this into the expression for Cook's Di:

      Di=β^β(i)^XTX(β^β(i)^)σ^2(p+1)=ri2(1hii)2viT(XTX)1XTX(XTX)1viσ^2(p+1)=ri2hiiσ^2(1hii)21p+1=[riσ^2(1hii)]2hii1hii1p+1=ei2hii1hii1p+1

      So we have Cook's Distance Di=ei2hii1hii1p+1

      Rules of Thumb:

      6.4 Summary

       

      7. Model Selection

In many applications, there may be a rather large pool of explanatory variables measured along with the response, and we have very little idea of which ones are important.

In terms of model fitting:

Suppose that, given a full set of p explanatory variables, we wish to find the subset of k ≤ p variables that gives us the "best" model:

Two key ingredients of Model Selection:

      1. Selection Criterion (for comparing different models)
      2. Search Strategy (which model to fit?)

      7.1 Selection Criterion

      Some common criteria:

      Adjusted R2

R²adj = 1 − [SSE/(n−k−1)] / [SST/(n−1)] = 1 − (n−1)/(n−k−1) · (1 − SSR/SST) = 1 − (n−1)/(n−k−1) · (1−R²)   (k is the number of predictors)

      e.g. We have the following data:

n=35 | Model 1 (k=4) | Model 2 (k=8)
SSR | 31.2 | 32
SSE | 8.8 | 8
SST | 40 | 40

(166) Criterion | Model 1 | Model 2
R² | 31.2/40 = 0.78 | 32/40 = 0.80
R²adj | 1 − (8.8/30)/(40/34) = 0.751 | 1 − (8/26)/(40/34) = 0.738

We expect R² for the 8-predictor model to be larger than the one for the 4-predictor model; however, R²adj for the 4-predictor model is larger, and hence it is preferred.

      AIC (Akaike Information Criterion)

Let n be the sample size and q the number of parameters in the model (in MLR, q = k+2, consisting of k predictor coefficients, an intercept, and the variance σ²).

(167) AIC = −2[ln L(θ^) − q] = 2q − 2 ln L(θ^)

where L(θ^) is the likelihood evaluated at the parameter estimates θ^.

      BIC (Bayesian Information Criterion)

(168) BIC = q ln(n) − 2 ln L(θ^)
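Both criteria are available directly for lm objects (a sketch, for a fitted model fit):

```r
AIC(fit)   # 2q - 2 log L(theta-hat); q counts the coefficients plus sigma^2
BIC(fit)   # q log(n) - 2 log L(theta-hat)
```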

      Remark:

      7.2 Search Strategy

      All Subsets Regression

      With p predictors, each one may or may not enter the model, thus there are

      (169)j=0p(pj)=2p

      possible models to fit.

      In theory, we can fit these models and choose the "best" one based on some selection criterion (e.g. Radj2, AIC, BIC).

      This is appealing as we can for sure find the optimal model!

It can be computationally intensive, or even infeasible (when p is very large).

We may want to find a "good" (useful), not necessarily optimal, model in reasonable computation time. Many strategies focus on adding/removing one variable at a time sequentially; an R sketch using step() follows the three procedures below.

      Forward Selection (add one variable at a time)

      1. Start with model that only has intercept β0

      2. Fit p SLR models, i.e. yi=β0+β1xij+ϵi,j=1,,p

      3. Pick the best of p models with 1-predictor according to the chosen criterion, and add that variable xj to the model

      4. Fit p1 models containing xj and another variable

  1. if none of the p−1 models improves the criterion, STOP
  2. otherwise, pick the best of the p−1 models according to the criterion, so we now have 2 variables in the model
5. Continue adding one variable at a time until no additional variable improves the criterion

      Backward Elimination (remove one variable at a time)

1. Start with the full model that has all p predictors
2. Fit the p models resulting from removing one predictor from the regression (each model has p−1 predictors)
3. Pick the best of the p models according to the criterion and eliminate that predictor xj from the model
4. Fit the p−1 models that remove xj and one other variable from the model; continue until removing an additional variable no longer improves the criterion

      Forward-Backward Stepwise

      1. Start with forward selection

      2. If we have k variables in the model:

  1. Backwards: fit k models with k−1 variables; if any of these improves the criterion, remove the variable whose removal improves it the most
  2. Forwards: fit p−k models with k+1 variables; if any of these improves the criterion, add the variable that improves it the most
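A sketch of these three strategies using R's AIC-based step() function; dat is an assumed data frame containing the response y and the candidate predictors:

```r
null_fit <- lm(y ~ 1, data = dat)    # intercept-only model
full_fit <- lm(y ~ ., data = dat)    # model with all p predictors
step(null_fit, scope = formula(full_fit), direction = "forward")
step(full_fit, direction = "backward")
step(null_fit, scope = formula(full_fit), direction = "both")   # forward-backward stepwise
```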

       

      8. Building Predictive Model

Suppose our goal is to find the model that best predicts the outcome y (we do not care much about finding associations between the explanatory variables and the outcome, or about interpretation).

Previously we talked about selection criteria (R²adj, AIC, BIC) computed on fitted models, which assess the explanatory power of a model on the data used to fit it (training data). For predictive modeling, we want metrics/criteria that evaluate how well a model predicts the outcome on new data (validation data).

      Mean Squared Prediction Error (MSPE)

      (170)MSPE=1υi=1υ(yiy^i)2

Root mean squared error: RMSE = √MSPE

      Mean Absolute Error: 1υi=1υ|yiy^i|

      Ideally we have a very large dataset and split it into three mutually exclusive sets

Training set (n obs) | Validation set (υ obs) | Test set (t obs)
y1,…,yn | yn+1,…,yn+υ | yn+υ+1,…,yn+υ+t
x1,…,xn | xn+1,…,xn+υ | xn+υ+1,…,xn+υ+t

      where xi=(xi1,,xip)

      8.1 Different Datasets

      1. Training set

        • Fit candidate models, as many as you want.
        • Obtain estimates of coefficients β^
      2. Validation set

    • used to evaluate the performance of the models fitted on the training data
    • estimate the prediction error, e.g. MSPE, for each model
    • perform model selection, e.g. choose the model with the best (lowest) MSPE
      3. Test set

    • we don't get access to it until the very end; it is used for the final assessment of the chosen model

The MSPE on the validation set should approximate the MSPE on the test set, since neither set was used to fit the model.

      Implementation:

      Cross-validation with k-Folds

With data (Y, X1,…,Xp) for training/validation, the procedure is as follows (see the R sketch after this list):

      1. Divide data for training/validation into K roughly equal-sized sets (folds) randomly

      2. For CV fold k, use data in fold k for validation and train on the rest of data

      3. Estimate prediction error for a given model

    1. Fit the model K times, each time treating the data in fold k as the validation set (and the remaining data as the training set), and obtain an estimate of the prediction error

      (171) MSPE_k = (1/υk) Σ_{i=1}^{υk} (yi − y^i)²,  k = 1,…,K

      where yi and y^i are the observed and predicted responses of the ith subject in fold k.

        2. Calculate the average prediction error across all folds

      (172) (1/K) Σ_{k=1}^K MSPE_k
      4. Choose the "best" model as the one with the lowest average prediction error

5. Some common choices: K = 5, 10, or n, where n is the total number of observations (K = n is leave-one-out CV).
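A minimal self-contained K-fold CV sketch for comparing two candidate lm formulas; the simulated data and model formulas are placeholders for the real data and candidate models:

```r
set.seed(331)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)              # x2 is actually irrelevant
K    <- 5
fold <- sample(rep(1:K, length.out = n))        # random fold labels
cv_mspe <- function(form) {
  mspe <- numeric(K)
  for (k in 1:K) {
    fit  <- lm(form, data = dat[fold != k, ])          # train on the other K-1 folds
    pred <- predict(fit, newdata = dat[fold == k, ])   # predict fold k
    mspe[k] <- mean((dat$y[fold == k] - pred)^2)       # MSPE_k
  }
  mean(mspe)                                    # average prediction error across folds
}
c(model1 = cv_mspe(y ~ x1), model2 = cv_mspe(y ~ x1 + x2))  # choose the smaller
```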