Copyright Note: This is just a rearranged and edited copy of STAT331 lecture notes. All rights belong to Professor Leilei Zeng from the Department of Statistics and Actuarial Science, University of Waterloo.
Introduction
Regression is a statistical technique commonly used to quantify the relationship between a variable of interest (the response variable) and some other variables (explanatory variables).
Response Variable ($Y$):
the study variable; one is interested in evaluating how it changes depending on other variables.
also called the dependent variable / outcome variable
Explanatory Variables ($x$):
the variables that influence/affect the response variable
also called independent variables, predictors, covariates, features
We imagine we can explain $Y$ in terms of $x$ using some function $f$, s.t. $Y \approx f(x)$.
Application of Regression
| Field | Response ($Y$) | Explanatory Variables ($x$) |
|---|---|---|
| Public health | Lung function | weight, height, sex, age, COVID infection |
| Engineering | Circuit delay | resistance, temperature |
| Finance | Stock index | unemployment rate, money supply, government policy, etc. |
| Economics | Unemployment rate | consumer price index, inflation |
A general form of a "linear regression model" is
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon$$
$Y$ is the value of the response variable
$x_1, \ldots, x_p$ are values of the explanatory variables (fixed)
$\beta_0, \beta_1, \ldots, \beta_p$ are called model parameters
$\epsilon$ is a random error; it represents the unexplained variation in $Y$. We assume $E(\epsilon) = 0$.
The objective is to determine the form of the linear function, more specifically to determine the values of the $\beta$'s. This is done empirically by "fitting" models to data and identifying the model that describes the relationship the "best"!
Topics
parameter estimation and inference
model interpretation
prediction
model assumption checking, methods to deal with violations
variable selection
applied issues: outliers, influential data points, multicollinearity
1. Simple Linear Regression
Simple linear model is used to study the relationship between a response variable and a single explanatory variable.
Example - Low Birthweight Infant Data
Response variable $y$: head circumference (cm)
Explanatory variable $x$: gestational age (weeks)
Data consist of pairs $(x_i, y_i)$, $i = 1, \ldots, n$.
A scatter plot of the data reveals a linear relationship between $x$ and $y$.
Pearson Correlation Coefficient
For any random variables $X$ and $Y$,
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
Given data, the sample correlation coefficient is
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
where $S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$, $S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$, and $S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$; it satisfies $-1 \le r \le 1$.
$r > 0$: positive linear relationship (if $r = 1$, then it is a perfect line)
$r < 0$: negative linear relationship (if $r = -1$, then it is a perfect line)
$r = 0$: no linear relationship
This is useful for quantifying the strength & direction of the relationship, but cannot be used for prediction; for prediction, consider linear regression models.
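The sample correlation coefficient above can be computed directly from its definition. A minimal Python sketch (the course uses R; Python here is just for illustration, and the data values are made up):

```python
# Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

r = Sxy / math.sqrt(Sxx * Syy)
print(round(r, 4))  # close to 1: strong positive linear relationship
```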
1.1 Model Formulation
A simple linear model:
$$Y = \beta_0 + \beta_1 x + \epsilon$$
$Y$ is the value of the response variable, which is viewed as a continuous random variable
$x$ is the value of the explanatory variable, which can be any type and its value is fixed/controlled (hence non-random)
$\beta_0$ is the intercept and $\beta_1$ is the slope
$\epsilon$ is the random error and $\epsilon \sim N(0, \sigma^2)$
Suppose we observe pairs $(x_i, y_i)$, $i = 1, \ldots, n$, from a random sample of $n$ subjects; we write the model as
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n$$
and we assume $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$.
The normality of random errors implies the normality of responses, such that
$$Y_i \sim N(\mu_i, \sigma^2)$$
Here $\mu_i = E(Y_i) = \beta_0 + \beta_1 x_i$.
Interpretation of parameters $\beta_0$ and $\beta_1$
$\beta_0$ - expected/average value of the response variable when $x = 0$
$\beta_1$ - expected/average amount of change in the response when the value of the explanatory variable changes by 1 unit
$\beta_1 = 0$ - variable $x$ does not influence $Y$ (linearly), no linear relationship
1.2 Least Squares Estimation
The "best" line has values of $\beta_0$ and $\beta_1$ that minimize the errors between the observed $y_i$ and $\beta_0 + \beta_1 x_i$.
Least Squares Estimates (LSEs) of $\beta_0$ and $\beta_1$ are those values minimizing the sum of squares of errors
$$S(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$$
To obtain LSEs, take derivatives
$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i), \qquad \frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i)$$
Solving the system of the two equations, we set both derivatives to 0:
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \quad (1), \qquad \sum_{i=1}^n x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 \quad (2)$$
We can rewrite (1) as $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.
Substituting this into (2) gives us $\sum_{i=1}^n x_i \left( (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right) = 0$.
LSEs for $\beta_0$ and $\beta_1$ are
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Aside, since $\sum_{i=1}^n (x_i - \bar{x}) = 0$, then $S_{xy} = \sum_{i=1}^n (x_i - \bar{x}) y_i$.
Therefore, the LSE can also be written as $\hat{\beta}_1 = \sum_{i=1}^n (x_i - \bar{x}) y_i / S_{xx}$.
Note: we should check that $\hat{\beta}_0$ and $\hat{\beta}_1$ (as given above) minimize the sum of squares of errors (second derivative test).
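The closed-form LSEs above are easy to compute by hand. A minimal Python sketch with made-up data (the course uses R; this is only for illustration):

```python
# Least squares estimates for simple linear regression:
# b1 = Sxy / Sxx, b0 = ybar - b1 * xbar.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx            # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
# The normal equations imply sum(e_i) = 0 and sum(x_i * e_i) = 0.
print(b0, b1, sum(residuals))
```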
1.2.1 Low Birth Weight Infant Data Example
Let $y$ be the head circumference and $x$ be the gestational age.
Estimated/fitted regression line is then $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Fitted value: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Residual: $e_i = y_i - \hat{y}_i$
Interpretation: the head circumference is expected to increase by 0.78 cm when gestational age increases by 1 week.
Recall that LSEs $\hat{\beta}_0$ and $\hat{\beta}_1$ satisfy the normal equations (1) and (2).
Therefore, residuals have the following properties: $\sum_{i=1}^n e_i = 0$ and $\sum_{i=1}^n x_i e_i = 0$.
1.3 Estimation of Variance
$\sigma^2$ represents the variability in random errors, and hence the variability in responses.
Consider sampling from a single population, $Y_i \sim N(\mu, \sigma^2)$; the sample variance $s^2 = \sum_{i=1}^n (y_i - \bar{y})^2/(n-1)$ is an unbiased estimator of the population variance $\sigma^2$. It is divided by $n-1$ because 1 df is lost by using the sample mean $\bar{y}$ to estimate the population mean $\mu$.
For simple linear regression models, the same logic applies, but recognize that $Y_i \sim N(\mu_i, \sigma^2)$; that is, the $Y_i$ come from different distributions with means that depend on $x_i$.
1.3.1 Mean Squares Errors (MSE)
$$\text{MSE} = \frac{\text{SSE}}{n-2} = \frac{\sum_{i=1}^n e_i^2}{n-2}$$
is an unbiased estimator for $\sigma^2$; it is divided by $n-2$ as 2 df's are lost due to estimation of $\beta_0$ and $\beta_1$.
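Continuing the earlier made-up data, the MSE is just the residual sum of squares divided by $n - 2$. A minimal Python sketch for illustration:

```python
# Unbiased estimate of sigma^2 via MSE = SSE / (n - 2).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

SSE = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
MSE = SSE / (n - 2)   # two df lost estimating beta0 and beta1
print(SSE, MSE)
```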
1.3.2 Rocket Propellant Data Example
A rocket motor is manufactured by bonding an igniter propellant to a sustainer propellant inside a metal housing.
The shear strength of the bonding (a quality characteristic) depends on the age of the sustainer propellant:
response (): shear strength of bonding
explanatory (): propellant age
20 observations of pairs $(x_i, y_i)$; the data are stored in an Excel file "rocket.xls".
We are 95% confident that the interval contains $\beta_1$.
1.6.2 Hypothesis Testing
Suppose we wish to test whether the slope takes a specific value, say $b$. We write the hypotheses as follows:
$$H_0: \beta_1 = b \quad \text{vs} \quad H_a: \beta_1 \neq b$$
The t-test:
t-statistic: $t = \dfrac{\hat{\beta}_1 - b}{se(\hat{\beta}_1)}$
$t \sim t_{n-2}$ if $H_0$ is true.
decision rule: Reject $H_0$ if $|t| > t_{n-2,\alpha/2}$
Given a random sample, a large observed value of the t-statistic (on either tail) indicates strong discrepancy between the data and $H_0$, hence strong evidence to reject $H_0$.
Set significance level $\alpha$, so we choose an extreme tail value $t_{n-2,\alpha/2}$, and reject $H_0$ if $|t| > t_{n-2,\alpha/2}$.
p-value
$$\text{p-value} = P(|T| \geq |t|)$$
where $T$ is a random variable such that $T \sim t_{n-2}$.
It is the probability of observing a t-statistic value at least as extreme as what was actually observed when $H_0$ is true; the smaller the p-value, the less likely $H_0$ is true.
Reject $H_0$ if p-value $< \alpha$ (significance level).
Significance Test of $\beta_1$
A special case: we wish to test if there is a linear relationship between $x$ and $Y$ (i.e. if $\beta_1 = 0$), with $t = \hat{\beta}_1 / se(\hat{\beta}_1)$.
Reject $H_0: \beta_1 = 0$ if $|t| > t_{n-2,\alpha/2}$; we say that the coefficient $\beta_1$ is statistically significant.
Some Remarks:
Failing to reject $H_0$ does not mean being in favour of or "accepting" $H_0$.
Confidence intervals can be used to make decisions about the hypothesis $H_0: \beta_1 = b$.
Recall in the t-test, we do not reject $H_0$ if $|\hat{\beta}_1 - b| \leq t_{n-2,\alpha/2}\, se(\hat{\beta}_1)$.
That is, we do not reject $H_0$ if $b$ lies within the $100(1-\alpha)\%$ CI $\hat{\beta}_1 \pm t_{n-2,\alpha/2}\, se(\hat{\beta}_1)$.
Our primary interest often lies in the inference about slope , but not for intercept .
The procedures can be easily modified to produce a one-sided test, e.g. $H_0: \beta_1 = b$ vs $H_a: \beta_1 > b$.
Reject $H_0$ if $t > t_{n-2,\alpha}$; p-value $= P(T \geq t)$.
Choosing a one-sided test for the sole purpose of attaining significance is not appropriate. Choosing a one-sided test after running a two-sided test that failed to reject the null is not appropriate.
$\mathrm{Var}(\hat{\beta}_1) = \sigma^2 / S_{xx}$: this means that we can make the variance small by making $S_{xx}$ large, i.e. by spreading out the $x_i$'s.
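This point can be illustrated numerically: two designs with the same center but different spread give very different slope variances. A Python sketch with made-up $x$ values and an assumed $\sigma^2 = 1$:

```python
# Var(b1) = sigma^2 / Sxx: spreading out the x's (larger Sxx) gives a
# more precise slope estimate. sigma2 = 1 is an assumed value.
sigma2 = 1.0

def Sxx(x):
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)

x_narrow = [4.0, 4.5, 5.0, 5.5, 6.0]   # x's bunched together
x_spread = [1.0, 3.0, 5.0, 7.0, 9.0]   # same center, spread out

var_narrow = sigma2 / Sxx(x_narrow)
var_spread = sigma2 / Sxx(x_spread)
print(var_narrow, var_spread)  # the spread design has smaller variance
```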
What is the average bonding strength for the rocket motors made from a batch of sustainer propellants that is 10 weeks old?
Recall - Fitted line:
Estimate mean response at $x = 10$: $\hat{\mu} = \hat{\beta}_0 + \hat{\beta}_1 (10)$
A 95% CI for the mean at $x = 10$: $\hat{\mu} \pm t_{n-2, 0.025} \sqrt{\text{MSE}\left(\frac{1}{n} + \frac{(10 - \bar{x})^2}{S_{xx}}\right)}$
A one-sided test scenario
Estimate the average bonding strength at the given age in weeks:
Note: this is a point estimate; it alone cannot be used to draw a conclusion about the true mean.
Test hypothesis:
A large observed t-value implies it is very unlikely that $H_0$ is true, so reject $H_0$ if $t > t_{n-2,\alpha}$. Since the observed $t$ does not exceed this critical value, we do not reject $H_0$.
There is not enough evidence to reject $H_0$, so these propellants may not be safe to use.
1.8 Prediction
Here we consider prediction of a single response $y_{new}$ for a "new" subject with explanatory variable value $x_{new}$.
The (true) future response value is
$$y_{new} = \beta_0 + \beta_1 x_{new} + \epsilon_{new}$$
where $\epsilon_{new}$ is a future random error.
Naturally, we replace $\beta_0$ and $\beta_1$ by their LSEs, and replace the random error by its expectation, 0.
We predict $y_{new}$ by $\hat{y}_{new} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$, and the prediction error is $y_{new} - \hat{y}_{new}$.
Some results about the prediction error:
$$E(y_{new} - \hat{y}_{new}) = 0, \qquad \mathrm{Var}(y_{new} - \hat{y}_{new}) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right)$$
Proof:
$\hat{\beta}_0$ and $\hat{\beta}_1$ are functions of the data at hand; the future error $\epsilon_{new}$ is not related to the data. Hence, $\epsilon_{new}$ is not related to $\hat{\beta}_0$ and $\hat{\beta}_1$.
It is also true that
$$\frac{y_{new} - \hat{y}_{new}}{\sqrt{\text{MSE}\left(1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right)}} \sim t_{n-2}$$
A $100(1-\alpha)\%$ prediction interval for $y_{new}$:
$$\hat{y}_{new} \pm t_{n-2,\alpha/2}\sqrt{\text{MSE}\left(1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right)}$$
(it is wider than a $100(1-\alpha)\%$ CI for the mean response at $x_{new}$)
Rocket Data
Find predicted value and a 95% prediction interval for bonding strength of a new rocket motor made with a propellant that is 16 weeks old.
Prediction :
A 95% prediction interval:
A 95% CI for :
1.9 Analysis of Variance (ANOVA)
We consider regression analysis from a new perspective called analysis of variance. The approach of analysis of variance is based on the partitioning of total variability in responses.
Total Sum of Squares (SST): $\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$
measures deviation of responses from the sample mean
$\text{SST} = 0$ implies no variation
the greater SST, the greater the variation
Error Sum of Squares (SSE): $\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
measures deviation of responses from the fitted values on the regression line
Regression Sum of Squares (SSR): $\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$
represents deviation of fitted values from the sample mean
Partition of Total Sum of Squares:
$$\text{SST} = \text{SSR} + \text{SSE}$$
1.9.1 Breakdown of Degrees of Freedom
SST has $n - 1$ df
1 df is lost because of estimation of the population mean using $\bar{y}$
SSE has $n - 2$ df
2 df are lost because two parameters $\beta_0$ and $\beta_1$ are estimated to obtain fitted values
SSR has 1 df
Although there are $n$ terms $(\hat{y}_i - \bar{y})^2$, all fitted values are from the same regression line that involves two parameters $\hat{\beta}_0$ and $\hat{\beta}_1$. Hence, it is associated with 2 df, and 1 df is lost because of the constraint $\sum_{i=1}^n (\hat{y}_i - \bar{y}) = 0$.
ANOVA Table
| Source | SS | df | MS (mean squares) |
|---|---|---|---|
| Regression | SSR | 1 | MSR = SSR/1 |
| Error | SSE | $n-2$ | MSE = SSE/$(n-2)$ |
| Total | SST | $n-1$ | |
Mean Squares: $E(\text{MSE}) = \sigma^2$ and $E(\text{MSR}) = \sigma^2 + \beta_1^2 S_{xx}$.
Proof: note that we can write $\text{SSR} = \hat{\beta}_1^2 S_{xx}$.
Therefore, $E(\text{MSR}) = E(\hat{\beta}_1^2)\, S_{xx} = \left(\mathrm{Var}(\hat{\beta}_1) + \beta_1^2\right) S_{xx} = \sigma^2 + \beta_1^2 S_{xx}$.
Given the above results, we expect MSR to be larger than MSE if $\beta_1 \neq 0$.
This gives an idea for testing the significance of $\beta_1$, $H_0: \beta_1 = 0$: we want to reject $H_0$ if the observed value of the ratio MSR/MSE is large, but how large is large enough? It would be nice to know the sampling distribution of MSR/MSE.
Under $H_0$, the ratio
$$F = \frac{\text{MSR}}{\text{MSE}} \sim F_{1,\,n-2}$$
an F-distribution with df 1 and $n-2$.
We will do an F-test for the hypothesis $H_0: \beta_1 = 0$ and reject $H_0$ if the observed value of the F-statistic $> F_{1,n-2,\alpha}$ (the upper $\alpha$ percentage point of the $F_{1,n-2}$ distribution).
Coefficient of Determination: $R^2 = \dfrac{\text{SSR}}{\text{SST}} = 1 - \dfrac{\text{SSE}}{\text{SST}}$
represents the proportion of total variation in responses that can be explained by the linear regression model
The bigger the value of $R^2$, the better the fit
$R^2 = 1$: perfect fit, $\text{SSE} = 0$; $R^2 = 0$: $\text{SSR} = 0$ and $\hat{\beta}_1 = 0$.
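The partition SST = SSR + SSE and the resulting $R^2$ can be verified numerically. A Python sketch using the same made-up data as earlier (for illustration only):

```python
# ANOVA decomposition SST = SSR + SSE and R^2 = SSR/SST.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))
SSR = sum((yh - ybar) ** 2 for yh in yhat)

R2 = SSR / SST
print(R2, SST, SSR + SSE)   # the last two agree: SST = SSR + SSE
```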
2. Multiple Linear Regression
2.1 Random Vectors and Multivariate Normal Distribution
Random Vector
For any set of r.v.'s $X_1, \ldots, X_p$, $X = (X_1, \ldots, X_p)^T$ is a random vector.
Mean of $X$: $\mu = E(X) = (E(X_1), \ldots, E(X_p))^T$
Variance-Covariance Matrix of $X$: $\Sigma = \mathrm{Var}(X) = E\left[(X - \mu)(X - \mu)^T\right]$, with $(i, j)$ element $\mathrm{Cov}(X_i, X_j)$
$\Sigma$ is a symmetric and positive definite matrix.
If $X_1, \ldots, X_p$ are independent, $\Sigma$ is diagonal (not necessarily true the other way around).
Basic Properties
For a random vector $X$ with mean $\mu$ and variance $\Sigma$:
$E(AX + b) = A\mu + b$ and $\mathrm{Var}(AX + b) = A\Sigma A^T$ ($A$ and $b$ are a matrix and a vector of constants).
2.1.1 Multivariate Normal Distribution
A random vector $X = (X_1, \ldots, X_p)^T$ has a multivariate normal distribution if its density function takes the form
$$f(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
where $\mu = E(X)$ and $\Sigma = \mathrm{Var}(X)$.
We also write $X \sim \text{MVN}(\mu, \Sigma)$.
Some useful results about multivariate normal distribution:
If $X \sim \text{MVN}(\mu, \Sigma)$, then
(Transformation invariance) $AX + b \sim \text{MVN}(A\mu + b, A\Sigma A^T)$
If $X \sim \text{MVN}(\mu, \Sigma)$, then each element of $X$ has a normal distribution, i.e. $X_i \sim N(\mu_i, \Sigma_{ii})$
More generally, for any partition of $X$ into sub-vectors $X_{(1)}$ and $X_{(2)}$ (with corresponding partitions of $\mu$ and $\Sigma$),
then $X_{(1)} \sim \text{MVN}(\mu_{(1)}, \Sigma_{11})$ (marginal normality!)
If $X_{(1)} \sim \text{MVN}(\mu_{(1)}, \Sigma_{(1)})$ and $X_{(2)} \sim \text{MVN}(\mu_{(2)}, \Sigma_{(2)})$, and $X_{(1)}$ and $X_{(2)}$ are independent, then
$X_{(1)} + X_{(2)} \sim \text{MVN}(\mu, \Sigma)$, where $\mu = \mu_{(1)} + \mu_{(2)}$ and $\Sigma = \Sigma_{(1)} + \Sigma_{(2)}$.
For a multivariate normal random vector:
If $Z = (Z_1, \ldots, Z_p)^T \sim \text{MVN}(0, I)$, let $Q = Z^T Z = \sum_{i=1}^p Z_i^2$, then $Q \sim \chi^2_p$.
2.2 Multiple Linear Models
We can write the model in scalar form:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, \ldots, n$$
$x_{i1}, \ldots, x_{ip}$ are explanatory variables
$\beta_1, \ldots, \beta_p$ are regression coefficients representing the effect of each $x_j$ on the response.
Random error $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
Alternatively, we write the model in matrix form:
$$y = X\beta + \epsilon$$
$X$ is called the design matrix.
Interpretation of Parameters:
Intercept $\beta_0$: mean response when all of $x_1, \ldots, x_p$ are 0.
Regression coefficient $\beta_j$: change in mean response associated with a 1 unit increase in $x_j$, while holding all the other explanatory variables fixed.
$\beta_j = 0$: $x_j$ is not (linearly) related to the response, given all the other explanatory variables in the model.
As $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, the random error vector $\epsilon \sim \text{MVN}(0, \sigma^2 I_n)$; here $I_n$ is an $n \times n$ identity matrix.
According to the model $y = X\beta + \epsilon$, the response vector is a linear function of $\epsilon$, thus $y \sim \text{MVN}(X\beta, \sigma^2 I_n)$.
2.3 Parameter Estimation
2.3.1 Least Squares Approach
Same as in simple linear regression
We want to find the value of the vector $\beta$ that minimizes the sum of squares
$$S(\beta) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$$
In matrix form,
$$S(\beta) = (y - X\beta)^T (y - X\beta)$$
Taking the derivative w.r.t. the vector $\beta$:
$$\frac{\partial S(\beta)}{\partial \beta} = -2X^T(y - X\beta)$$
Aside, for vectors $a$, $b$ and a symmetric matrix $A$:
if $f(b) = a^T b$, then $\partial f / \partial b = a$
if $f(b) = b^T b$, then $\partial f / \partial b = 2b$
if $f(b) = b^T A b$, then $\partial f / \partial b = 2Ab$
Setting the derivative to 0 gives the normal equations $X^T X \beta = X^T y$; solving for $\beta$ gives the
Least Squares Estimator for $\beta$:
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
Properties of LSE $\hat{\beta}$:
$E(\hat{\beta}) = \beta$ (unbiased)
$\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$
Since $y \sim \text{MVN}(X\beta, \sigma^2 I)$, the linear transformation $\hat{\beta} = (X^TX)^{-1}X^Ty$ also has a MVN distribution: $\hat{\beta} \sim \text{MVN}(\beta, \sigma^2(X^TX)^{-1})$.
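For a design matrix with just an intercept column and one covariate, $X^TX$ is $2 \times 2$ and $\hat{\beta} = (X^TX)^{-1}X^Ty$ can be worked out with the closed-form $2 \times 2$ inverse. A minimal Python sketch with made-up data, as a sanity check that the matrix LSE matches the SLR formulas:

```python
# Matrix LSE beta_hat = (X^T X)^{-1} X^T y for X = [1, x].
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

# Entries of X^T X:
a = float(n)                  # sum of 1*1
b = sum(x)                    # sum of x_i
d = sum(xi * xi for xi in x)  # sum of x_i^2
det = a * d - b * b

# Entries of X^T y:
t0 = sum(y)
t1 = sum(xi * yi for xi, yi in zip(x, y))

# (X^T X)^{-1} X^T y via the closed-form 2x2 inverse:
b0 = (d * t0 - b * t1) / det
b1 = (a * t1 - b * t0) / det
print(b0, b1)  # same as b0 = ybar - b1*xbar, b1 = Sxy/Sxx
```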
2.4 Fitted Values and Residuals
Fitted Value Vector: $\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy$
Hat Matrix: $H = X(X^TX)^{-1}X^T$
Symmetric: $H^T = H$.
Idempotent: $HH = H$.
Residual Vector: $e = y - \hat{y} = (I - H)y$
$E(e) = 0$.
$\mathrm{Var}(e) = \sigma^2(I - H)$.
Since $H$ is symmetric and idempotent, so is $I - H$.
As $e$ is a linear transformation of $y$, and $y \sim \text{MVN}$, then $e \sim \text{MVN}(0, \sigma^2(I - H))$.
Some Other Results about Residuals: $X^T e = 0$ (hence $\sum_{i=1}^n e_i = 0$ and $\sum_{i=1}^n x_{ij} e_i = 0$ for each $j$).
Proof: $X^T e = X^T(I - H)y = \left(X^T - X^T X (X^TX)^{-1} X^T\right) y = (X^T - X^T)y = 0$.
2.5 Estimation of $\sigma^2$
We still want to use the sum of squares of residuals, $\text{SSE} = e^Te$, to estimate the variance of the random error, $\sigma^2$.
Question: What is the df of SSE in MLR?
Result: $\text{SSE}/\sigma^2 \sim \chi^2_{n-(p+1)}$
Proof: The idea is to re-write $\text{SSE}/\sigma^2$ as a sum of independent $\chi^2_1$ distributed random variables.
Recall $e = (I - H)\epsilon$, so $\text{SSE} = \epsilon^T(I - H)\epsilon$; $I - H$ is symmetric and idempotent, with spectral decomposition
$$I - H = Q \Lambda Q^T$$
where $Q$ is an orthogonal matrix ($Q^TQ = QQ^T = I$), and $\Lambda$ is a diagonal matrix with values 0 or 1 on the diagonal.
Now we define a new random vector $Z = Q^T \epsilon / \sigma$;
then $Z \sim \text{MVN}(0, I)$, as $Z$ is a linear transform of $\epsilon \sim \text{MVN}(0, \sigma^2 I)$;
since $Q$ is orthogonal, $\mathrm{Var}(Z) = Q^T Q = I$.
We then have $\text{SSE}/\sigma^2 = Z^T \Lambda Z$, where $\Lambda$ is diagonal with 0/1 diagonal elements. To find the degrees of freedom, we need to find how many diagonal elements of $\Lambda$ are 1, and
$$\mathrm{tr}(\Lambda) = \mathrm{tr}(I - H) = n - \mathrm{tr}(H) = n - (p + 1)$$
This implies that $\Lambda$ has $n - (p+1)$ non-zero elements on the diagonal, and $Z^T\Lambda Z$ is a sum of $n - (p+1)$ independent $\chi^2_1$ random variables.
Therefore, $\text{SSE}/\sigma^2 \sim \chi^2_{n-(p+1)}$.
Moreover, $E(\text{SSE}) = (n - p - 1)\sigma^2$.
So the MSE
$$\text{MSE} = \frac{\text{SSE}}{n - (p+1)}$$
is an unbiased estimator for $\sigma^2$.
2.6 Inference in Multiple Regression
Some useful results: $\hat{\beta} \sim \text{MVN}(\beta, \sigma^2(X^TX)^{-1})$,
or equivalently $\hat{\beta}_j \sim N(\beta_j, \sigma^2 v_{jj})$, where $v_{jj}$ is the $j$-th diagonal element of $(X^TX)^{-1}$.
As $\sigma^2$ is unknown, we replace it by its estimator MSE, and obtain a t-distribution result as in simple linear regression:
$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\text{MSE}\cdot v_{jj}}} \sim t_{n-(p+1)}$$
Here, the df is $n - (p+1)$; $p$ denotes the number of explanatory variables.
For an individual coefficient $\beta_j$:
CI is $\hat{\beta}_j \pm t_{n-(p+1),\alpha/2}\, se(\hat{\beta}_j)$, where $se(\hat{\beta}_j) = \sqrt{\text{MSE}\cdot v_{jj}}$
Test $H_0: \beta_j = 0$ with $t = \hat{\beta}_j / se(\hat{\beta}_j)$
Question: Can we fit a linear model when the number of model parameters (or the number of explanatory variables) is larger than the number of observations? (i.e. $p + 1 > n$)
LSE: $\hat{\beta} = (X^TX)^{-1}X^Ty$. For $X^TX$ to be invertible, the design matrix $X$ needs to be of full column rank, so the number of rows must be at least the number of columns, that is, $n \geq p + 1$.
2.6.1 Inference for Linear Combinations of Coefficients:
In multiple linear regression, we wish to estimate the mean response at some given values of the explanatory variables, e.g. $a^T\beta$ with $a = (1, x_{01}, \ldots, x_{0p})^T$, a vector of constants.
Estimated Mean Response at $a$: $a^T\hat{\beta}$
Recall that $\hat{\beta} \sim \text{MVN}(\beta, \sigma^2(X^TX)^{-1})$, then
$$a^T\hat{\beta} \sim N\left(a^T\beta,\ \sigma^2 a^T(X^TX)^{-1}a\right)$$
Following the same argument as before, replacing $\sigma^2$ by its estimator MSE, we have
$$\frac{a^T\hat{\beta} - a^T\beta}{\sqrt{\text{MSE}\, a^T(X^TX)^{-1}a}} \sim t_{n-(p+1)}$$
This is used to conduct inference about $a^T\beta$.
$a^T\hat{\beta} \pm t_{n-(p+1),\alpha/2}\sqrt{\text{MSE}\, a^T(X^TX)^{-1}a}$ is the $100(1-\alpha)\%$ CI for $a^T\beta$.
2.7 Prediction in Multiple Linear Regression
For a new subject with explanatory variable values $x_{new} = (1, x_{new,1}, \ldots, x_{new,p})^T$, we wish to predict the outcome and obtain a prediction interval.
Future response: according to the model, $y_{new} = x_{new}^T\beta + \epsilon_{new}$
Prediction: $\hat{y}_{new} = x_{new}^T\hat{\beta}$
Prediction error: $y_{new} - \hat{y}_{new}$, with $\mathrm{Var}(y_{new} - \hat{y}_{new}) = \sigma^2\left(1 + x_{new}^T(X^TX)^{-1}x_{new}\right)$
Standard error: $se = \sqrt{\text{MSE}\left(1 + x_{new}^T(X^TX)^{-1}x_{new}\right)}$
t-distribution: $(y_{new} - \hat{y}_{new})/se \sim t_{n-(p+1)}$
A 95% prediction interval: $\hat{y}_{new} \pm t_{n-(p+1),0.025}\, se$
here $t_{n-(p+1),0.025}$ is the upper 2.5% point of the $t_{n-(p+1)}$ distribution.
Caution: avoid prediction beyond the range of data!
2.8 Geometric Interpretation of Least Squares
Let
$y = (y_1, \ldots, y_n)^T$ be the response vector,
𝟙 be a vector of 1's,
$x_j = (x_{1j}, \ldots, x_{nj})^T$ be the vector of values of the $j$th explanatory variable, $j = 1, \ldots, p$.
Then, $X = [𝟙, x_1, \ldots, x_p]$ is the design matrix.
From a geometric point of view, each vector above represents a point in an $n$-dimensional space $\mathbb{R}^n$.
The response vector $y$ is represented by an arrow pointing to a point in the space $\mathbb{R}^n$.
$\text{span}(X)$ is the set of all linear combinations of the vectors 𝟙, $x_1, \ldots, x_p$, e.g.
$b_0 𝟙 + b_1 x_1 + \cdots + b_p x_p = Xb$
It is then a subspace of $\mathbb{R}^n$, spanned by the vectors 𝟙, $x_1, \ldots, x_p$ (denoted by the oval region).
$Xb$ is an arrow in the subspace, and the difference $y - Xb$ is the distance between the arrows $y$ and $Xb$.
The shortest distance is achieved when we choose $b = \hat{\beta}$ such that the vector $y - X\hat{\beta}$ is orthogonal to the subspace $\text{span}(X)$, so it is orthogonal to all the vectors 𝟙, $x_1, \ldots, x_p$ in this subspace.
Thus, the estimator satisfies $X^T(y - X\hat{\beta}) = 0$, i.e. the normal equations.
$\hat{y} = Hy$: the matrix $H$ is an orthogonal projection matrix which projects the response vector $y$ onto the subspace $\text{span}(X)$.
The residual vector $e = y - \hat{y}$ is orthogonal to $\text{span}(X)$, so $X^Te = 0$.
2.9 ANOVA for Multiple Linear Regression
Consider a multiple linear model $y = X\beta + \epsilon$ with $p$ explanatory variables.
We still have the following sources of variation:
regression: $\text{SSR} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$
error: $\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$
total: $\text{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$
Sum of squares and degrees of freedom add up as before: $\text{SST} = \text{SSR} + \text{SSE}$ and $(n-1) = p + (n - p - 1)$.
The value of SST and its df remain the same.
But the values of SSR and SSE and their df's are different from SLR:
$\hat{y}_i$ is the fitted value now based on MLR,
the df of SSR has increased from 1 to $p$ (the number of explanatory variables),
the df of SSE becomes $n - p - 1$.
ANOVA Table
| Source | SS | df | MS |
|---|---|---|---|
| Regression | SSR | $p$ | MSR = SSR/$p$ |
| Error | SSE | $n-p-1$ | MSE = SSE/$(n-p-1)$ |
| Total | SST | $n-1$ | |
2.9.1 F-test of "overall" significance
Suppose we want to evaluate the "overall" significance of a multiple linear model, i.e. we wish to test the hypotheses:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs} \quad H_a: \text{at least one } \beta_j \neq 0$$
F-statistic: $F = \text{MSR}/\text{MSE} \sim F_{p,\,n-p-1}$ under $H_0$
We reject $H_0$ if $F$ is large (that is, the model explains more variation in the response than the random error), i.e. $F > F_{p,\,n-p-1,\,\alpha}$
Why only use the upper tail value $F_{\alpha}$ instead of both tails? We only look at large positive values of the ratio.
If $H_0$ is rejected, we conclude that at least one of the explanatory variables is important or related to $Y$.
2.9.2 Coefficient of Determination
$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}$$
It still represents the proportion of variation in the response explained by all the explanatory variables (or the model).
$R^2$ always increases as more explanatory variables are added to the model.
A "little" model is nested within a "big" model; the infimum of $S(\beta)$ over a bigger set of candidate models is no larger than the infimum over a smaller set, so SSE for the bigger model is no larger than SSE for the little model. When the model is saturated, i.e. the number of $\beta$'s is the same as the number of data points, the model fits perfectly and $R^2 = 1$.
As $R^2$ increases with the number of explanatory variables, it alone cannot be used as a meaningful comparison of models with very different numbers of explanatory variables. A bigger $R^2$ may not always mean a "better" model.
Adjusted $R^2$
$$R^2_{adj} = 1 - \frac{\text{SSE}/(n-p-1)}{\text{SST}/(n-1)}$$
higher the better in general
no longer a proportion (not a percentage)
Example: Low Birthweight Infant Data
Fit summary of lm(headcirc ~ gestage + birthwt):
Residual standard error: 1.274 on 97 df
Multiple $R^2$: 0.753, Adjusted $R^2$: 0.747
F-statistic: 147.1 on 2 and 97 df.
Test $H_0: \beta_1 = \beta_2 = 0$ vs. $H_a$: at least one of $\beta_1$ and $\beta_2$ is not 0.
Since $F = 147.1 > F_{2,97,\alpha}$ (or the p-value $< \alpha$), reject $H_0$, and conclude that the model is significant.
3. Specification Issues in Regression Models
3.1 Categorical Variables
In linear regression models, the response variable has to be numerical; however, the explanatory variables can be either numerical or categorical.
A categorical variable takes a value that falls into one of several categories, e.g.
binary: toxaemia yes/no, coded by 1 and 0
ordered: mild, moderate, severe
not ordered: red, blue, green
Approach: recode into indicator variables or treat as numerical if it makes sense to do so.
Example: Occupational Prestige
Response: prestige score
Explanatory: education (yr), type: blue collar (bc), white collar (wc), professional (prof)
| Occupation | prestige ($y$) | education ($x_1$) | type |
|---|---|---|---|
| Computer operations | 47.7 | 11.36 | wc |
| Construction labourers | 26.5 | 7.52 | bc |
| Office clerks | 35.6 | 11.00 | wc |
| Nurses | 64.7 | 12.46 | prof |
How to code the variable "type"?
e.g. let type $= 1$ (bc), $2$ (wc), $3$ (prof) and treat it as numerical.
This approach is not generally appropriate unless the explanatory variable is ordinal and the response is linear according to this coding scheme.
More flexible approach: introduce indicator/dummy variables, e.g.
$$x_2 = \begin{cases} 1 & \text{type = prof} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{type = wc} \\ 0 & \text{otherwise} \end{cases}$$
Then "bc" is the baseline category ($x_2 = x_3 = 0$).
If we instead included an indicator for every level together with the intercept, the indicator columns would sum to the intercept column: the design matrix would not be of full column rank, and $X^TX$ would not be invertible!
In general, if there are $k$ levels, we just need $k - 1$ indicators for the categorical variable.
Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
Interpretation:
mean prestige score if education $x_1 = 0$ and type = bc is $\beta_0$
mean prestige score if education $x_1 = 0$ and type = prof is $\beta_0 + \beta_2$
mean prestige score if education $x_1 = 0$ and type = wc is $\beta_0 + \beta_3$
$\beta_2$ is the difference between prof and bc in mean response
$\beta_3$ is the difference between wc and bc in mean response
$\beta_2 - \beta_3$ is the difference between prof and wc in mean response
We know the LSE $\hat{\beta} \sim \text{MVN}(\beta, \sigma^2(X^TX)^{-1})$.
To test the difference between prof and bc, e.g. $H_0: \beta_2 = 0$,
or to test the difference between wc and bc, e.g. $H_0: \beta_3 = 0$,
we use $t = \hat{\beta}_j / se(\hat{\beta}_j)$.
To test the difference between prof and wc, e.g. $H_0: \beta_2 = \beta_3$ (i.e. $\beta_2 - \beta_3 = 0$),
we use $t = (\hat{\beta}_2 - \hat{\beta}_3)/se(\hat{\beta}_2 - \hat{\beta}_3)$, where $se(\hat{\beta}_2 - \hat{\beta}_3) = \sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_2) + \widehat{\mathrm{Var}}(\hat{\beta}_3) - 2\widehat{\mathrm{Cov}}(\hat{\beta}_2, \hat{\beta}_3)}$.
3.2 Interaction Effects
Interaction effects exist when the effect of one variable depends on the value of another variable.
Suppose we have two numerical variables $x_1, x_2$; an interaction term is the product of these two variables, e.g.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon$$
We call $\beta_1$ and $\beta_2$ the main effects of $x_1$ and $x_2$, and $\beta_3$ the interaction effect between $x_1$ and $x_2$.
General Interpretation:
mean response given $x_1$ and $x_2$: $\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$
mean response when $x_1$ increases by 1 ($x_1 + 1$), $x_2$ fixed: $\mu' = \beta_0 + \beta_1(x_1 + 1) + \beta_2 x_2 + \beta_3 (x_1 + 1) x_2$
change in mean response as $x_1$ increases by 1: take the difference, and we get $\mu' - \mu = \beta_1 + \beta_3 x_2$
It now depends on the value of $x_2$ through $\beta_3$ (interaction effect).
As we can see, by adding the interaction $x_1 x_2$, we allow the effect of $x_1$ on $Y$ to depend on the other variable $x_2$.
When one of the two variables, say $x_2$, is binary:
Example: low birthweight infants; $x_1$ = gestational age, $x_2$ = toxemia (1 = yes, 0 = no)
For the group of infants whose mothers did not have toxemia, $x_2 = 0$: mean head circ $= \beta_0 + \beta_1 x_1$
For the group of infants whose mothers had toxemia, $x_2 = 1$: mean head circ $= (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$
$\beta_2$ (main effect of $x_2$, binary) represents the difference in intercepts between the two groups
$\beta_3$ (interaction effect) represents the difference in slopes between the two groups
Question: is the relationship between head circ ($y$) and gest age ($x_1$) the same for these two groups? We can test the significance of the interaction effect: $H_0: \beta_3 = 0$.
4. Testing General Linear Hypothesis
Model: $y = X\beta + \epsilon$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$
We have discussed
tests for a single coefficient, e.g. t-test for $H_0: \beta_j = 0$
tests for all coefficients, e.g. F-test for $H_0: \beta_1 = \cdots = \beta_p = 0$
General Linear Hypothesis: $H_0: A\beta = 0$
Here $A$ is a matrix of some constants.
Example: Prestige Data
"Full" model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
where $x_2$ (prof) and $x_3$ (wc) together form the type variable. We want to test if $\beta_2 = \beta_3 = 0$, meaning that type itself does not matter overall.
$H_0: \beta_2 = \beta_3 = 0$ tests the overall effect of type:
$$A = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \qquad A\beta = \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix} = \mathbb{0}$$
If $H_0$ is true, the "full" model reduces to $y = \beta_0 + \beta_1 x_1 + \epsilon$, which is the "reduced" model.
$H_0: \beta_1 = 0,\ \beta_2 = \beta_3$: this tests no effect of education and no difference between prof and wc.
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}, \qquad A\beta = \begin{pmatrix} \beta_1 \\ \beta_2 - \beta_3 \end{pmatrix} = \mathbb{0}$$
If $H_0$ is true, the "full" model reduces to $y = \beta_0 + \beta_2 x^* + \epsilon$, where $x^* = 1$ if type is prof or wc.
In general, $H_0: A\beta = 0$ imposes $c$ constraints; $A$ is a $c \times (p+1)$ matrix with $\mathrm{rank}(A) = c$.
Principle of Extra Sum of Squares:
Fit the "full" model and the "reduced" model with constraints $A\beta = 0$.
Rewrite the response vector as
$$y = \hat{y}_{red} + (\hat{y}_{full} - \hat{y}_{red}) + (y - \hat{y}_{full})$$
$\hat{y}_{full}$, the fitted values from the full model, and
$\hat{y}_{red}$, the fitted values from the reduced model.
The response vector is a sum of orthogonal vectors, so their squared lengths add:
$y^Ty$, the unadjusted total sum of squares
$(y - \hat{y}_{full})^T(y - \hat{y}_{full}) = \text{SSE(full)}$, the sum of squared residuals of the full model
Note that $\text{SSE(reduced)} \geq \text{SSE(full)}$; then
$$\text{ExtraSS} = \text{SSE(reduced)} - \text{SSE(full)}$$
is the "additional sum of squares explained by the full model vs. the reduced one with constraint $A\beta = 0$"
If $H_0$ is true, we expect $\text{SSE(reduced)}$ to be close to $\text{SSE(full)}$, and ExtraSS to be small relative to the sum of squares of errors $\text{SSE(full)}$.
4.1 F-test for General Linear Hypothesis:
$A$ is a $c \times (p+1)$ matrix of constants; $c$ is the number of constraints, and $p+1$ is the number of $\beta$'s.
e.g. $H_0: \beta_2 = \beta_3 = 0$ corresponds to $c = 2$.
Principle of Extra Sum of Squares: $\text{ExtraSS} = \text{SSE(reduced)} - \text{SSE(full)}$
If ExtraSS is large relative to SSE(full), the null is less likely to be true, and we want to reject $H_0$.
Note: $\text{SSE(reduced)} \geq \text{SSE(full)}$.
F-statistic:
$$F = \frac{\text{ExtraSS}/c}{\text{SSE(full)}/(n-p-1)} \sim F_{c,\,n-p-1} \text{ under } H_0$$
Note: $c$ is the number of constraints, $c = \mathrm{rank}(A)$.
A sketch of proof:
1. $\text{SSE(full)}/\sigma^2 \sim \chi^2_{n-p-1}$ (need the full model to be correct)
2. $\text{ExtraSS}/\sigma^2 \sim \chi^2_c$ - write it as a sum of squares of standard normal variables (need $H_0: A\beta = 0$ to be true)
3. ExtraSS and SSE(full) are independent.
Therefore, combining 1, 2, and 3, $F = \frac{\text{ExtraSS}/c}{\text{SSE(full)}/(n-p-1)} \sim F_{c,\,n-p-1}$.
Reject $H_0$ if $F > F_{c,\,n-p-1,\,\alpha}$
Example: Prestige Score Data
Full model: education ($x_1$), type ($x_2$ = prof, $x_3$ = wc), education $\times$ type (interaction):
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon$$
$\beta_1$: slope for "bc" (baseline type)
$\beta_4$: difference in slopes between "prof" and "bc"
$\beta_5$: difference in slopes between "wc" and "bc"
Question: is the effect of education on prestige score the same for different types?
We wish to test the significance of the interaction effects, e.g. $H_0: \beta_4 = \beta_5 = 0$
Reduced model ($H_0$ is true): $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$
In R, fit both full and reduced models.
```r
fit <- lm(prestige ~ education + factor(type) + education*factor(type), ...) # this is our full model
summary(fit)
# coefficients
# ...
# Residual standard error: 7.827 on 92 df
```
The residual standard error is $\sqrt{\text{SSE(full)}/92} = 7.827$, so $\text{SSE(full)} = 7.827^2 \times 92 \approx 5636.1$.
```r
fit_A <- lm(prestige ~ education + factor(type), ...) # this is the reduced model
summary(fit_A)
# ...
# Residual standard error: 7.814 on 94 df
```
The residual standard error is $\sqrt{\text{SSE(reduced)}/94} = 7.814$, so $\text{SSE(reduced)} = 7.814^2 \times 94 \approx 5739.5$.
F-statistic for $H_0: \beta_4 = \beta_5 = 0$:
$$F = \frac{(\text{SSE(reduced)} - \text{SSE(full)})/2}{\text{SSE(full)}/92} = \frac{(5739.5 - 5636.1)/2}{5636.1/92} \approx 0.84$$
Since $F \approx 0.84 < F_{2,92,0.05}$, we do not reject $H_0$; we have no evidence that the effect of education is different between types.
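The F-statistic above can be reproduced from the two reported residual standard errors alone, using $\text{SSE} = \text{RSE}^2 \times df$. A quick Python sketch of the arithmetic (the course itself uses R):

```python
# Extra-sum-of-squares F-statistic from residual standard errors.
rse_full, df_full = 7.827, 92      # full model (with interaction)
rse_red,  df_red  = 7.814, 94      # reduced model (no interaction)

sse_full = rse_full ** 2 * df_full
sse_red  = rse_red ** 2 * df_red

c = df_red - df_full               # number of constraints (here 2)
F = ((sse_red - sse_full) / c) / (sse_full / df_full)
print(round(F, 3))
```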
In R, the anova() function can be used for the F-test procedure.
```r
anova(fit_A, fit)
## Analysis of Variance Table
# Model 1: prestige ~ edu + factor(type)
# Model 2: prestige ~ edu + factor(type) + edu * factor(type)
```
A linear regression model is specified under several assumptions:
the relationship between $Y$ and $x$ is linear, i.e. $E(Y_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$
$\epsilon_i$ has a mean of 0, $E(\epsilon_i) = 0$
$\epsilon_i$'s have a constant variance, $\mathrm{Var}(\epsilon_i) = \sigma^2$
$\epsilon_i$'s are normally distributed
$\epsilon_i$'s are independent
Random errors $\epsilon_i$ are not directly observable, thus it seems hard to examine their distributional properties.
However, we can obtain residuals, $e_i = y_i - \hat{y}_i$, and we argue that residuals behave similarly to random errors.
Relationship between Residuals and Random Errors
$e = (I - H)\epsilon$, where $E(e) = 0$ and $\mathrm{Var}(e) = \sigma^2(I - H)$, as shown before.
If $H$ is small relative to the identity matrix $I$, then $e \approx \epsilon$ and $\mathrm{Var}(e) \approx \sigma^2 I$.
Idea: check assumptions about the $\epsilon_i$'s using diagnostic plots of the residuals, $e_i$'s.
5.2 Residual Analysis
Residual: $e_i = y_i - \hat{y}_i$
deviation between the data and the fit
can be viewed as the "observed" value of the random error
if there is any departure from assumptions on the errors $\epsilon_i$, it should show up in the residuals $e_i$.
Standardized Residual:
standardize the residual by subtracting its mean, $E(e_i) = 0$, and dividing by its standard deviation
Recall $\mathrm{Var}(e_i) = \sigma^2(1 - h_{ii}) \approx \sigma^2$ (approximate variance)
Scaling of $e_i$ by $\sqrt{\text{MSE}}$:
$$d_i = \frac{e_i}{\sqrt{\text{MSE}}}$$
with $E(d_i) = 0$ and $\mathrm{Var}(d_i) \approx 1$.
Studentized Residual:
Scaling of $e_i$ by using $\sqrt{\text{MSE}(1 - h_{ii})}$, where $\sigma^2(1 - h_{ii})$ is the exact/true variance; $h_{ii}$ is the $i$-th diagonal element of $H$:
$$r_i = \frac{e_i}{\sqrt{\text{MSE}(1 - h_{ii})}}$$
with $E(r_i) = 0$ and $\mathrm{Var}(r_i) = 1$.
Residual Plots
Graphical display of residuals, an effective way to detect departures from model assumptions.
Various types of plots for different assumptions
Typically studentized residuals are plotted.
Plot of Residuals vs. Fitted Values
A plot of residuals $e_i$ or any of the scaled residuals ($d_i$ or $r_i$) against the corresponding fitted values $\hat{y}_i$
If residuals fluctuate randomly around 0 inside a horizontal band, then there are no visible defects.
If the residuals can be contained in an opening funnel ("fan" shape), it indicates that $\mathrm{Var}(\epsilon_i)$ is not constant.
If the residuals are contained inside a curved band, then it indicates nonlinearity (the relationship between $Y$ and some variables is not linear, or some other explanatory variables are needed)
Plots of Residuals vs. Explanatory Variable
A plot of residuals against the values of the $j$-th explanatory variable, $x_{ij}$'s.
Same interpretation of the plot as in the case of the plot of residuals vs. $\hat{y}_i$'s:
Horizontal band: no visible defects
Funnel/fan shape: variance is nonconstant
Curvature: nonlinearity (may suggest the need for $x_j^2$ in the model)
Partial Residual Plots
Most useful in investigating the relationship between the response and an explanatory variable $x_j$.
Partial Residual for $x_j$: $e_i + \hat{\beta}_j x_{ij}$, where $e_i$ is the residual based on all explanatory variables; this adds the effect of variable $x_j$ back into the residual.
Plot Partial Residuals vs. $x_{ij}$:
Linear trend: $x_j$ enters the model linearly
Curvature: higher order terms in $x_j$ may be helpful
Q-Q Plot for Normal Distribution
This is a graphical technique for detecting substantive departures from normality. Plot the ordered standardized residuals against the theoretical quantiles of $N(0, 1)$. If normality holds, then the ordered residuals should align with the quantiles of the normal distribution.
Points lie approximately on a straight line: underlying distribution is normal
Sharp upward and downward curves at both extremes: heavy-tailed
Flattening at extremes: light-tailed
Sharp change in the trend in an upward direction from the middle: positively skewed (right skewed)
Sharp change in the trend in a downward direction from the middle: negatively skewed (left skewed)
Remarks:
Do not check normality until the other assumptions have been checked and fixed.
If skewed, a transformation of $y$ may help.
Light-tailed: can ignore
Heavy-tailed: problematic
5.3 Addressing Model Assumption Problems
If residual plots reveal problems with assumptions, we might be able to address via data transformation.
Transformation of Response to Stabilize Variance
This can help address non-constant variance identified by the plot of $e_i$ (or $r_i$) vs. $\hat{y}_i$
Idea: apply a function $h(\cdot)$ and fit the regression model on the transformed response $h(y)$
Rationale: the variance of the response might be a function of its mean $\mu$, i.e. $\mathrm{Var}(Y) = g(\mu)$
in which case we want $\mathrm{Var}(h(Y))$ to be (approximately) constant.
By a first order Taylor expansion, we have $h(Y) \approx h(\mu) + h'(\mu)(Y - \mu)$
Then, $\mathrm{Var}(h(Y)) \approx [h'(\mu)]^2 g(\mu)$
Thus, we need $h'(\mu) \propto 1/\sqrt{g(\mu)}$.
Examples:
$\mathrm{Var}(Y) = g(\mu) = \mu$.
We need $h'(\mu) \propto \mu^{-1/2}$, so $h(y) = \sqrt{y}$ will work
Thus, we can apply $\sqrt{y}$ to obtain approximately constant variance.
$\mathrm{Var}(Y) = g(\mu) = \mu^2$.
We need $h'(\mu) \propto \mu^{-1}$, so $h(y) = \log y$ may work
Thus, we can apply $\log y$ to obtain approximately constant variance.
Box-Cox Power Transformation
$$h(y) = \frac{y^\lambda - 1}{\lambda}$$
Note that $h'(\mu) \propto \mu^{\lambda - 1}$,
where $\lambda$ is some arbitrary power. Thus, the Box-Cox transformation can help address non-constant variance of the form $g(\mu) \propto \mu^{2(1-\lambda)}$
Some special cases:
$\lambda = 1/2$: we have the square-root transformation
$\lambda \to 0$: this is the log transformation
$\lambda = 1$: we have the identity transformation (no transformation)
$\lambda = -1$: we have the reciprocal transformation
One can automatically try a sequence of values of $\lambda$ and find the "best" choice that gives the largest log-likelihood value.
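The special cases above can be checked numerically; in particular, $(y^\lambda - 1)/\lambda \to \log y$ as $\lambda \to 0$. A small Python sketch for illustration:

```python
# The Box-Cox family (y^lambda - 1)/lambda, with log as the limiting case.
import math

def boxcox(y, lam):
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam

y = 2.5
# near lambda = 0 the transform approaches log(y):
print(boxcox(y, 1e-8), math.log(y))
# lambda = 1 is a shifted identity transformation:
print(boxcox(y, 1.0), y - 1.0)
```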
Note that interpreting $\beta_1$ can be less intuitive as a result of transformation, since now increasing $x$ by 1 unit corresponds to a change of $\beta_1$ in $h(y)$.
For the log transformation, $\log(y) = \beta_0 + \beta_1 x + \epsilon$: $\beta_1$ represents the change in mean response on the log scale,
$e^{\beta_1}$ is the multiplicative change applied to the original response.
Every 1 unit increase in $x$ is associated with a $(e^{\beta_1} - 1) \times 100\%$ change in the mean response on the original scale.
A 95% CI for the mean response on the original scale can be obtained by exponentiating the endpoints of the CI on the log scale.
For arbitrary $\lambda$, the transformation leads to less interpretable results.
Transforming/Adding Explanatory Variables
This might be applied if $Y$ (or $e$) has a clear non-linear relation with some $x_j$, for example revealed by plots of $e$ vs. $x_j$ or partial residual plots.
We can consider transforming $x_j$ using a power transformation, e.g. $\sqrt{x_j}$ or $\log x_j$
We could add polynomial terms, such as $x_j^2$, e.g.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon$$
which is still a linear model (linear in the $\beta$'s)
6. Effects of Individual Observations
Sometimes some of the observations have an unduly large influence on the fit of the model.
Outliers
An outlier is defined as a particular observation that differs substantially from the majority of the observations in the dataset.
e.g. SLR (simple linear regression):
What causes outliers?
data entry or measurement error
sampling problem: an extraordinary subject or a subject not part of the target population
natural variation: some unusual condition may occur during experiments, which may affect the outcome
Generally we do not recommend removing outliers from dataset, unless we have good reason to believe the observation is an error or the subject does not belong to the target population.
Outliers may change the result of model fitting, e.g. observation C pulls the fitted line closer to it, observation A or B may change the slope a bit.
Therefore, it is useful to investigate to what extent the outliers influence our fitted models.
How to detect outliers?
6.1 Studentized Residuals
Recall $r_i = e_i / \sqrt{\text{MSE}(1 - h_{ii})}$ ($h_{ii}$ is the $i$-th diagonal element of $H$), and $r_i \sim N(0, 1)$ approximately.
Rule of Thumb: if $|r_i|$ is very large according to $N(0,1)$, e.g. $|r_i| > 2$ or 3, the observation could be considered an outlier with an extreme value of the response $y$.
e.g. a plot of studentized residuals ($r_i$) vs. fitted values can be used to identify outliers in the response.
6.2 Leverage
Recall $h_{ii}$ is the $i$-th diagonal element of the hat matrix $H$; we call $h_{ii}$ the leverage of observation $i$. Note that $\hat{y} = Hy$,
so $\hat{y}_i = \sum_{j=1}^n h_{ij} y_j = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j$
The leverage characterizes the influence of observation $y_i$ on its corresponding fitted value $\hat{y}_i$.
$0 \leq h_{ii} \leq 1$.
$H$ is idempotent.
$H$ is also symmetric, i.e. $h_{ij} = h_{ji}$.
So, $h_{ii} = \sum_{j=1}^n h_{ij}^2 = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2$
As $h_{ii}$ gets larger and closer to 1, the off-diagonal elements $h_{ij}$ get closer to 0.
That is, if the leverage $h_{ii}$ is large relative to the $h_{ij}$'s ($j \neq i$), then the fitted value $\hat{y}_i$ is mostly determined by the observation $y_i$; we say that observation $i$ has a high leverage on the fit.
Rule of Thumb: if an observation has leverage higher than twice the average leverage, i.e. $h_{ii} > 2(p+1)/n$ (since $\sum_{i=1}^n h_{ii} = \mathrm{tr}(H) = p+1$), it is considered a high-leverage case.
Recall $H = X(X^TX)^{-1}X^T$: it only involves the explanatory variables, not the response.
The leverage is small for an observation with covariate values near the centroid ($\bar{x}$), and it is large if the covariate values are far away from the centroid.
e.g. SLR: $h_{ii} = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{xx}}$
(This generalizes to MLR)
if $x_i$ is close to $\bar{x}$, $h_{ii}$ is small
if $x_i$ is far from $\bar{x}$, $h_{ii}$ is large
Thus, leverage is useful to help identify outliers in the sense of having explanatory variables with extreme values.
6.3 Influence
An observation is influential if its presence in fitting the regression considerably changes the estimates of $\beta$ compared to when it is not used to fit the model.
Start by fitting a model with all $n$ observations, and obtain the LSE $\hat{\beta}$ as usual.
Let $\hat{\beta}_{(i)}$ denote the LSE based on fitting the same model but with the $i$-th observation removed from the data set. If $\hat{\beta}_{(i)}$ is quite different from $\hat{\beta}$, then observation $i$ is highly influential.
Cook's Distance (between $\hat{\beta}$ and $\hat{\beta}_{(i)}$)
The difference $\hat{\beta} - \hat{\beta}_{(i)}$ is a vector, so we want to look at the sum of squares of the differences in all elements.
The magnitude of the difference should be scaled by the variance of $\hat{\beta}$, $\sigma^2(X^TX)^{-1}$:
$$D_i = \frac{(\hat{\beta} - \hat{\beta}_{(i)})^T X^TX (\hat{\beta} - \hat{\beta}_{(i)})}{(p+1)\,\text{MSE}}$$
Thus $D_i$ is the Euclidean distance between $X\hat{\beta}$ and $X\hat{\beta}_{(i)}$ up to a scaling factor.
The bigger the value of $D_i$, the bigger the difference between $\hat{\beta}$ and $\hat{\beta}_{(i)}$; thus removing observation $i$ has a stronger influence on the parameter estimates.
It can be shown that Cook's $D_i$ can be written as
$$D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}$$
where we can see that $D_i$ depends on both $r_i$ (studentized residual) and $h_{ii}$ (leverage).
Observation $i$ is most influential when both $r_i$ and $h_{ii}$ are large.
Proof: For the $i$-th observation, denote by $x_i$ the vector of its explanatory variable values.
Without observation $i$, the LSE of $\beta$ becomes $\hat{\beta}_{(i)}$, and it can be shown that
$$\hat{\beta} - \hat{\beta}_{(i)} = \frac{(X^TX)^{-1} x_i\, e_i}{1 - h_{ii}}$$
Substitute this into the expression for $D_i$.
Note that $x_i^T (X^TX)^{-1} x_i = h_{ii}$.
Substituting this into the expression for Cook's $D_i$:
$$D_i = \frac{e_i^2\, h_{ii}}{(p+1)\,\text{MSE}\,(1 - h_{ii})^2} = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}$$
If $r_i$ is large but $h_{ii}$ is small (e.g. $i$ may be an outlier in the response but not in the explanatory variables), Cook's $D_i$ can be small and the observation not influential.
If $h_{ii}$ is large but $r_i$ is small (e.g. $i$ may be an outlier in the explanatory variables but not in the response), Cook's $D_i$ can also be small and the observation not influential.
When both $r_i$ and $h_{ii}$ are large, Cook's $D_i$ will be large and observation $i$ is influential.
Rules of Thumb:
$D_i > 0.5$: worth investigation, may be influential
$D_i > 1$: quite likely to be influential
a $D_i$ much larger than the other values may also indicate that the observation is influential
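A minimal numerical check, on simulated data, that the leave-one-out definition of Cook's distance agrees with the one-fit shortcut $D_i = \frac{r_i^2}{p+1}\cdot\frac{h_{ii}}{1-h_{ii}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
h = np.diag(X @ XtX_inv @ X.T)
s2 = e @ e / (n - p - 1)                  # estimate of sigma^2
r = e / np.sqrt(s2 * (1 - h))             # studentized residuals

for i in range(n):
    # Definition: scaled distance between beta_hat and the leave-one-out LSE
    Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
    beta_i = np.linalg.inv(Xi.T @ Xi) @ Xi.T @ yi
    d = beta_hat - beta_i
    D_def = d @ (X.T @ X) @ d / ((p + 1) * s2)
    # Shortcut that needs only a single model fit
    D_short = r[i] ** 2 / (p + 1) * h[i] / (1 - h[i])
    assert np.isclose(D_def, D_short)
```

The shortcut matters in practice: it gives all $n$ Cook's distances from one fit instead of $n+1$ refits.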
6.4 Summary
Studentized residual: $|r_i| > 2$ suggests an outlier in the response.
Leverage: $h_{ii} > \frac{2(p+1)}{n}$ indicates a high-leverage case, potentially an outlier in the explanatory variables.
Cook's Distance: $D_i > 0.5$ (worth checking) or $D_i > 1$ (quite likely) indicates an influential observation.
7. Model Selection
In many applications, there may be a rather large pool of explanatory variables measured along with the response, and we have very little idea of which ones are important.
In terms of model fitting:
too few predictors (under-specified model): biased estimates/predictions
too many predictors (over-specified model): modelling spurious relationships that are actually noise, less precise estimates/predictions
just right with correct terms: no bias and most precise estimation/prediction
Suppose that, given a full set of $p$ explanatory variables, we wish to find the subset of variables that gives us the "best" model:
goodness of fit
interpretability
prediction performance
Two key ingredients of Model Selection:
Selection Criterion (for comparing different models)
Search Strategy (which model to fit?)
7.1 Selection Criterion
Some common criteria:
Adjusted $R^2$
$$R^2_{adj} = 1 - \frac{SSE/(n-p-1)}{SST/(n-1)}$$
accounts for the number of predictors in the model, penalizing the inclusion of unimportant predictors.
While $R^2$ always increases with more predictors, $R^2_{adj}$ may decrease if adding more predictors does not improve $R^2$ (or SSE) enough.
$R^2_{adj}$ loses the usual interpretation of $R^2$ (proportion of variation explained), but it can be used as a measure of goodness of fit and a model selection criterion.
The higher the $R^2_{adj}$, the better the model.
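A sketch on simulated data: adding a pure-noise predictor can only increase $R^2$, while $R^2_{adj}$ often goes down (all data below are made up):

```python
import numpy as np

def r2_pair(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X includes the intercept."""
    n, k = X.shape                       # k = p + 1 columns incl. intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst, 1 - (sse / (n - k)) / (sst / (n - 1))

rng = np.random.default_rng(1)
n = 25
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([X1, rng.normal(size=n)])    # add a noise predictor

r2_small, adj_small = r2_pair(X1, y)
r2_big, adj_big = r2_pair(X2, y)
assert r2_big >= r2_small    # R^2 never decreases when predictors are added
print(adj_small, adj_big)    # adjusted R^2 may drop for the bigger model
```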
e.g. We have the following data:

|     | Model 1 ($p = 4$) | Model 2 ($p = 8$) |
| --- | --- | --- |
| SSR | 31.2 | 32 |
| SSE | 8.8 | 8 |
| SST | 40 | 40 |

As expected, $R^2$ for the 8-predictor model ($32/40 = 0.80$) is larger than that of the 4-predictor model ($31.2/40 = 0.78$); however, the $R^2_{adj}$ for the 4-predictor model is larger, and hence it is preferred.
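A quick computation of $R^2_{adj}$ for the two models in the table, under a hypothetical sample size of $n = 20$ (the notes do not give $n$, so this value is assumed):

```python
def adj_r2(sse, sst, n, p):
    """Adjusted R^2 = 1 - (SSE/(n-p-1)) / (SST/(n-1))."""
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

n = 20                          # assumed sample size, not given in the notes
m1 = adj_r2(8.8, 40, n, 4)      # Model 1: SSE = 8.8, 4 predictors
m2 = adj_r2(8.0, 40, n, 8)      # Model 2: SSE = 8.0, 8 predictors
assert m1 > m2                  # the smaller model wins despite its lower R^2
print(m1, m2)
```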
AIC (Akaike Information Criterion)
Let $n$ be the sample size and $k$ be the number of parameters in the model (in MLR, $k = p + 2$: $p$ predictors, an intercept, and the variance $\sigma^2$).
$$AIC = -2\log L(\hat{\theta}) + 2k$$
where $L(\hat{\theta})$ is the likelihood evaluated at the parameter estimates $\hat{\theta}$.
Recall the LSE is also the MLE for $\beta$.
AIC can be viewed as a penalized maximum log-likelihood: more parameters will increase $\log L(\hat{\theta})$, but this is offset by the penalty $2k$.
A model with lower AIC is preferred.
BIC (Bayesian Information Criterion)
$$BIC = -2\log L(\hat{\theta}) + k\log n$$
BIC is similar in setup to AIC, but its penalty also involves the sample size $n$, so it penalizes the inclusion of more predictors more strongly.
A model with lower BIC is better in general.
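A sketch of AIC and BIC for a Gaussian linear model, where the MLE of the variance is $SSE/n$; the SSE values reuse the table above and $n = 20$ is again a hypothetical sample size:

```python
import numpy as np

def aic_bic(sse, n, p):
    """AIC and BIC for a Gaussian linear model with p predictors."""
    k = p + 2                    # p slopes + intercept + variance parameter
    # Maximized Gaussian log-likelihood with sigma^2_hat = SSE/n:
    loglik = -0.5 * n * (np.log(2 * np.pi * sse / n) + 1)
    return -2 * loglik + 2 * k, -2 * loglik + k * np.log(n)

aic1, bic1 = aic_bic(sse=8.8, n=20, p=4)   # n = 20 is assumed
aic2, bic2 = aic_bic(sse=8.0, n=20, p=8)
assert aic1 < aic2 and bic1 < bic2   # both prefer the smaller model here
# BIC's penalty gap grows with n, so it favours small models more strongly:
assert (bic2 - aic2) > (bic1 - aic1)
```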
Remark:
$R^2_{adj}$, AIC, and BIC are not interpretable by themselves, but they can be used for comparing different models.
They all penalize the number of predictors in the model.
They are all related to SSE.
For models of the same size (number of predictors), $R^2_{adj}$, AIC, and BIC will all pick the same best model: the one with the smallest SSE.
7.2 Search Strategy
All Subsets Regression
With $p$ predictors, each one may or may not enter the model, thus there are $2^p$ possible models to fit.
In theory, we can fit all $2^p$ models and choose the "best" one based on some selection criterion (e.g. $R^2_{adj}$, AIC, BIC).
This is appealing as we are sure to find the optimal model!
However, it can be computationally intensive, or even infeasible (when $p$ is very large).
We may want to find a "good" (useful), not necessarily optimal, model in reasonable computation time. Many strategies focus on adding/removing one variable at a time sequentially.
Forward Selection (add one variable at a time)
Start with the model that only has an intercept.
Fit the $p$ SLR models, i.e. $y = \beta_0 + \beta_j x_j + e$ for $j = 1, \dots, p$.
Pick the best of the $p$ 1-predictor models according to the chosen criterion, and add that variable to the model.
Fit the $p - 1$ models containing the chosen variable and one other variable:
if none of the $p - 1$ models improves the criterion, STOP;
otherwise, pick the best of the $p - 1$ models according to the criterion, so we have 2 variables in the model.
Continue adding 1 variable at a time until no additional variable improves the criterion.
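The forward-selection steps above can be sketched in code, here greedily minimizing AIC (up to a constant) on simulated data; the variable names and data are made up:

```python
import numpy as np

def fit_sse(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - X @ beta) ** 2)

def aic(sse, n, k):
    # Gaussian AIC up to an additive constant: n*log(SSE/n) + 2k,
    # with k = number of mean parameters (intercept + predictors)
    return n * np.log(sse / n) + 2 * k

def forward_select(cols, y):
    n = len(y)
    ones = np.ones((n, 1))
    chosen, remaining = [], list(cols)
    best = aic(fit_sse(ones, y), n, 1)          # intercept-only model
    while remaining:
        scores = []
        for name in remaining:
            X = np.column_stack([ones] + [cols[c] for c in chosen + [name]])
            scores.append((aic(fit_sse(X, y), n, len(chosen) + 2), name))
        score, name = min(scores)               # best single addition
        if score >= best:                       # no candidate improves AIC
            break
        best = score
        chosen.append(name)
        remaining.remove(name)
    return chosen

rng = np.random.default_rng(2)
n = 100
cols = {f"x{j}": rng.normal(size=n) for j in range(5)}
y = 3 * cols["x0"] - 2 * cols["x3"] + rng.normal(size=n)
print(forward_select(cols, y))   # x0 and x3 carry real signal
```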
Backward Elimination (remove one variable at a time)
Start with the full model that has all $p$ predictors.
Fit the $p$ models resulting from removing one predictor from the regression (each model has $p - 1$ predictors).
Pick the best of the $p$ models according to the criterion and eliminate the corresponding variable from the model.
Fit the models that remove one further variable; continue until removing an additional variable doesn't improve the criterion.
Forward-Backward Stepwise
Start with forward selection
If we have $k$ variables in the model:
Backward step: fit the models with $k - 1$ variables; if any of these improves the criterion, remove the variable whose removal improves it the most.
Forward step: fit the models with $k + 1$ variables; if any of these improves the criterion, add the variable that improves it the most.
8. Building Predictive Model
Suppose our goal is to find the model that best predicts the outcome (we do not care much about finding associations between explanatory variables and the outcome, or about interpretation).
Previously we talked about selection criteria ($R^2_{adj}$, AIC, BIC) computed on fitted models, which assess the explanatory power of a model on the data used to fit the model (training data). In the case of predictive modelling, we want metrics/criteria that evaluate how well a model performs in predicting the outcome on new data given the predictors (validation data).
Mean Squared Prediction Error (MSPE)
$$MSPE = \frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2$$
$y_i$ is the observed outcome in the validation/testing data (of size $m$)
$\hat{y}_i$ is the predicted outcome based on a model fit on the training data
MSPE involves a validation/testing set that is not part of model fitting; it measures how far the predictions are from the observed validation/testing data.
Root Mean Squared Error: $RMSE = \sqrt{MSPE}$
Mean Absolute Error: $MAE = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i|$
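A minimal sketch computing all three metrics, with made-up observed and predicted validation values:

```python
import numpy as np

y_val = np.array([3.0, 5.0, 7.0, 9.0])     # observed validation outcomes
y_pred = np.array([2.5, 5.5, 6.0, 10.0])   # predictions from a trained model

err = y_val - y_pred
mspe = np.mean(err ** 2)            # (1/m) * sum (y_i - yhat_i)^2
rmse = np.sqrt(mspe)                # same units as y
mae = np.mean(np.abs(err))          # less sensitive to large errors than MSPE
print(mspe, rmse, mae)              # 0.625, ~0.79, 0.75
```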
Ideally we have a very large dataset and split it into three mutually exclusive sets:
Training set ($n_{train}$ obs)
Validation set ($n_{valid}$ obs)
Test set ($n_{test}$ obs)
where $n_{train} + n_{valid} + n_{test} = n$.
8.1 Different Datasets
Training set
Fit candidate models, as many as you want.
Obtain estimates of the coefficients $\hat{\beta}$
Validation set
used to evaluate the performance of the models fitted on the training data
estimate prediction error, e.g. MSPE, for each model
perform model selection, e.g. choose the model with best/lowest MSPE
Test set
we don't get access to it until the very end; it is used for the final assessment of the chosen model
MSPE on the validation set should approximate the MSPE on the test set, since neither set is used to fit the model.
Implementation:
simplest: randomly divide data into training/validation, e.g. 80/20 split, but we don't get to use all data for training
Idea: randomly split data into two sets
Training Set
Validation Set
Fit candidate models on the training set and get parameter estimates, i.e. $\hat{\beta}$, for each model
Estimate the prediction error on the validation set, e.g. $MSPE = \frac{1}{n_{valid}}\sum_i (y_i - \hat{y}_i)^2$. Choose the "best" model, e.g. the one with the smallest MSPE
Problems: not using all the data for training; different partitions of the data may lead to different results.
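The simple split-and-compare idea can be sketched as follows, comparing two candidate models by validation MSPE on simulated data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(size=n)

idx = rng.permutation(n)
train, valid = idx[:160], idx[160:]          # 80/20 random split

def mspe(cols_train, cols_valid):
    """Fit OLS on the training rows, return MSPE on the validation rows."""
    X_tr = np.column_stack([np.ones(len(train))] + cols_train)
    X_va = np.column_stack([np.ones(len(valid))] + cols_valid)
    beta = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]
    return np.mean((y[valid] - X_va @ beta) ** 2)

m0 = mspe([], [])                            # intercept-only model
m1 = mspe([x[train]], [x[valid]])            # model using the real predictor
assert m1 < m0    # the model with the real predictor predicts better
```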
better: use cross-validation (CV) scheme
Cross-validation with $K$ Folds
Procedure:
Randomly divide the data into $K$ roughly equal-sized sets (folds).
For CV fold $k = 1, \dots, K$, use the data in fold $k$ for validation and train on the rest of the data.
To estimate the prediction error for a given model:
Fit the model $K$ times, each time treating the data in fold $k$ as the validation set (and the remaining data as the training set), and obtain an estimate of the prediction error
$$MSPE_k = \frac{1}{n_k}\sum_{i=1}^{n_k}(y_{ik} - \hat{y}_{ik})^2$$
where $y_{ik}$ and $\hat{y}_{ik}$ are the observed and fitted responses of the $i$th subject in fold $k$ (of size $n_k$).
Calculate the average prediction error across all $K$ folds:
$$CV = \frac{1}{K}\sum_{k=1}^{K} MSPE_k$$
Choose the "best" model as the one with the lowest average prediction error.
Some common choices are $K = 5$, $K = 10$, or $K = n$, where $n$ is the total number of observations (leave-one-out CV).
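The $K$-fold procedure can be sketched as follows, estimating the CV prediction error of a simple linear model on simulated data:

```python
import numpy as np

def kfold_mspe(X, y, K=5, seed=0):
    """Average validation MSPE over K random folds for an OLS model."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, K)           # K roughly equal-sized folds
    errs = []
    for k in range(K):
        va = folds[k]                        # fold k is the validation set
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
        errs.append(np.mean((y[va] - X[va] @ beta) ** 2))
    return np.mean(errs)                     # average over the K folds

rng = np.random.default_rng(4)
n = 100
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(size=n)
print(kfold_mspe(X, y))    # roughly the noise variance (1) in this setup
```

Setting `K = n` gives leave-one-out CV, where each fold contains a single observation.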