• STAT 331 Applied Linear Models

      by Catherine Zhou

      Copyright Note:
This is just a rearranged and edited copy of the STAT 331 lecture notes. All rights belong to Professor Leilei Zeng of the Department of Statistics and Actuarial Science, University of Waterloo.

       


      Introduction

Regression is a statistical technique commonly used to quantify the relationship between a variable of interest (the response variable) and some other variables (the explanatory variables).

      We imagine we can explain Y in terms of x1,,xp using some function, s.t.

      (1)y=f(x1,,xp)

      Application of Regression

Field | Response (Y) | Explanatory Variables (x1,…,xp)
Public health | Lung function | weight, height, sex, age, COVID infection
Engineering | Circuit delay | resistance, temperature
Finance | Stock index | unemployment rate, money supply, government policy, etc.
Economics | Unemployment rate | consumer price index, inflation

       

A general form of a "linear regression model":

(2) y = β0 + β1x1 + ⋯ + βpxp (deterministic part) + ε (random part)

The objective is to determine the form of the linear function β0 + β1x1 + ⋯ + βpxp, more specifically to determine the values of the β's. This is done empirically by "fitting" models to data and identifying the model that describes the relationship the "best"!

      Topics

       


      1. Simple Linear Regression

A simple linear model is used to study the relationship between a response variable and a single explanatory variable.

      Example - Low Birthweight Infant Data

      Data consists of pairs (xi,yi),i=1,,n.

      A scatter plot of data reveals a linear relationship between X and Y.

      Pearson Correlation Coefficient

      For all random variables X and Y,

(3) ρ = Cov(X,Y) / √(Var(X)Var(Y))

      Given data, the sample correlation coefficient is

(4) r = Σ_{i=1}^n (xi−x̄)(yi−ȳ) / √( Σ_{i=1}^n (xi−x̄)² · Σ_{i=1}^n (yi−ȳ)² )

where Sxy = Σ(xi−x̄)(yi−ȳ), Sxx = Σ(xi−x̄)², and Syy = Σ(yi−ȳ)², so that r = Sxy/√(SxxSyy), and r ∈ [−1, 1].

This is useful for quantifying the strength and direction of the relationship, but it cannot be used for prediction; for that we consider linear regression models.
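As a quick illustration, here is a minimal R sketch on simulated data (the course datasets are not reproduced here; variable names are illustrative only), computing r from Sxy, Sxx, Syy and checking it against cor():

```r
set.seed(331)
x <- rnorm(50, mean = 29, sd = 2)        # a gestational-age-like predictor
y <- 4 + 0.8 * x + rnorm(50, sd = 1.5)   # a response linearly related to x
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
Syy <- sum((y - mean(y))^2)
c(manual = Sxy / sqrt(Sxx * Syy), builtin = cor(x, y))  # the two values agree
```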

      1.1 Model Formulation

      A simple linear model:

      (5)y=β0+β1x+ϵ

      Suppose we observe pairs {(xi,yi); i=1,,n} from a random sample of n subjects, we write the model as

      (6)yi=β0+β1xi+ϵi,i=1,,n

      and we assume ϵiiidN(0,σ2).

      The normality of random error ϵi implies the normality of response yi such that

      (7)yiN(β0+β1xi,σ2)

      Here

(8) E(yi) = E[β0+β1xi+ϵi] = β0+β1xi + E(ϵi) = β0+β1xi   (β0+β1xi is fixed and E(ϵi)=0)
      (9)Var(yi)=Var(β0+β1xi+ϵi)=Var(ϵi)=σ2

      Interpretation of parameters β0 and β1

       

      1.2 Least Squares Estimation

      The "best" line has values of β0 and β1 that minimize the errors between yi and β0+β1xi.

The Least Squares Estimates (LSEs) of β0 and β1 are the values that minimize the sum of squares of errors

      (10)argminβ0,β1i=1n[yi(β0+β1xi)]2

To obtain the LSEs, take derivatives

(11) ∂/∂β0 Σ_{i=1}^n [yi−β0−β1xi]² = −2Σ(yi−β0−β1xi)
(12) ∂/∂β1 Σ_{i=1}^n [yi−β0−β1xi]² = −2Σ(yi−β0−β1xi)xi

      Solving the system of the two equations, we let both the derivatives be 0 and

(13) Σ(yi−β0−β1xi) = Σyi − nβ0 − β1Σxi = 0   (1)
(14) Σ(yi−β0−β1xi)xi = Σxiyi − β0·nx̄ − β1Σxi² = 0   (2)

We can rewrite (1) as

(15) β0 = (1/n)Σyi − (β1/n)Σxi = ȳ − β1x̄

Substituting this into (2) gives us

(16) β1 = (Σxiyi − nx̄ȳ) / (Σxi² − nx̄²)

The LSEs for β0 and β1 are

(17) β1^ = (Σxiyi − nx̄ȳ) / (Σxi² − nx̄²),   β0^ = ȳ − β1^x̄

Aside: since Σ(xi−x̄) = 0, we have

(18) Sxx = Σ(xi−x̄)² = Σ(xi−x̄)xi − x̄Σ(xi−x̄) = Σxi² − nx̄²
(19) Sxy = Σ(xi−x̄)(yi−ȳ) = Σ(xi−x̄)yi − ȳΣ(xi−x̄) = Σxiyi − nx̄ȳ

Therefore, the LSE β1^ can also be written as Sxy/Sxx.

Note: we should check that β0^ and β1^ (as given above) indeed minimize the sum of squares of errors (second derivative test).
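A minimal sketch on simulated data (not the course data) checking these closed-form LSEs against R's lm():

```r
set.seed(331)
x <- rnorm(50, 29, 2)
y <- 4 + 0.8 * x + rnorm(50, sd = 1.5)
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
b1 <- Sxy / Sxx                 # beta1-hat = Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)    # beta0-hat = ybar - beta1-hat * xbar
fit <- lm(y ~ x)                # least squares fit in R
rbind(manual = c(b0, b1), lm = coef(fit))   # the two rows agree
```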

      1.2.1 Low Birth Weight Infant Data Example

Let yi be the head circumference (cm) and xi the gestational age (weeks), i=1,…,100.

(20) Σxi = 2889, Σyi = 2645, Σxi² = 84099, Σxiyi = 76910
(21) β1^ = (Σxiyi − nx̄ȳ)/(Σxi² − nx̄²) = (76910 − 100(2889/100)(2645/100)) / (84099 − 100(2889/100)²) ≈ 0.78
(22) β0^ = ȳ − β1^x̄ = 2645/100 − 0.78·(2889/100) ≈ 3.91

      Estimated/fitted regression line is then 3.91+0.78x

      Fitted value: yi^=β0^+β1^xi, i=1,,100

      Residual: ri=yiyi^, i=1,,100

Interpretation: the head circumference is expected to increase by 0.78 cm when gestational age increases by 1 week.

      Recall that LSEs β0^ and β1^ satisfy

(23) Σ(yi−β0^−β1^xi) = Σri = 0,   Σ(yi−β0^−β1^xi)xi = 0

      Therefore, residuals have following properties:

       

      1.3 Estimation of Variance σ2

      σ2 represents variability in random errors, and hence variability in responses.

Consider sampling from a single population, y1,…,yn ∼ N(μ,σ²). The sample variance 1/(n−1) Σ_{i=1}^n (yi−ȳ)² is an unbiased estimator of the population variance σ²; it is divided by n−1 because 1 df is lost by using the sample mean ȳ to estimate the population mean μ.

For simple linear regression models, the same logic applies, but recognize that yi ∼ N(β0+β1xi, σ²), that is, the yi come from different distributions with means that depend on xi.

       

1.3.1 Mean Squared Error (MSE)

(24) S² = 1/(n−2) Σ_{i=1}^n [yi − (β0^+β1^xi)]² = 1/(n−2) Σri²

is an unbiased estimator for σ²; it is divided by n−2 as 2 df's are lost due to the estimation of β0 and β1.

       

      1.3.2 Rocket Propellant Data Example

A rocket motor is manufactured by bonding an igniter propellant to a sustainer propellant inside a metal housing.

The shear strength of the bonding (a quality characteristic) depends on the age of the sustainer propellant:

20 observations of pairs (Xi,Yi), i=1,…,20; the data are stored in an Excel file "rocket.xls".

      Data Analysis using R (see Lecture 3 Note)

The estimated/fitted line is 2627.822 − 37.154x; with β1^ = −37.154 < 0, the older the propellant, the weaker the bonding strength.

      Fitted value for the propellant that is 8 weeks old in the sample (obs #3)

(25) y3^ = 2627.822 − 37.154(8) = 2330.59

Observed value: y3 = 2316

Residual: r3 = y3 − y3^ = −14.59 (over-estimate)
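A sketch of the corresponding R workflow; the file name comes from the notes, but the column names age and strength are assumptions:

```r
library(readxl)                            # for reading the .xls file
rocket <- read_excel("rocket.xls")         # 20 observations (age, strength assumed)
fit <- lm(strength ~ age, data = rocket)
coef(fit)        # intercept ~ 2627.822, slope ~ -37.154 (lecture values)
fitted(fit)[3]   # fitted strength for obs #3 (8 weeks old), ~ 2330.59
resid(fit)[3]    # residual for obs #3, ~ -14.59
```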

       

      1.4 Maximum Likelihood Method

      Suppose a r.v. Y has a probability density/mass function f(y;θ), we wish to estimate parameter θ that characterizes the distribution.

      In simple linear regression, yi,,yn are independent and yiN(β0+β1xi,σ2), i=1,,n.

      Likelihood function for θ=(β0,β1,σ2)

(27) L(θ) = Π_{i=1}^n f(yi;θ) = Π_{i=1}^n (2πσ²)^{−1/2} exp{ −(yi−β0−β1xi)² / (2σ²) }

      log-likelihood:

(28) l(θ) = Σ_{i=1}^n [ −(1/2)log(2πσ²) − (1/(2σ²))(yi−β0−β1xi)² ]

Score Functions:

(29) ∂l/∂β0 = (1/σ²) Σ(yi−β0−β1xi) = 0   (1)
∂l/∂β1 = (1/σ²) Σ(yi−β0−β1xi)xi = 0   (2)
∂l/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ(yi−β0−β1xi)² = 0   (3)

Note that (1) and (2) are the same as the equations used to solve for β0 and β1 in the least squares approach. Thus the MLEs and LSEs for β0 and β1 are identical!

      However, for σ2, solving (3) gives

      (30)σ2^MLE=1ni=1n(yiβ0^β1^xi)2=1ni=1nri2

      which is different from σ2^=1n2i=1nri2 (MSE).

It can be shown that the MSE is an unbiased estimator for σ² (later); the MLE is biased and therefore not used in practice.

       

      1.5 Properties of Least Square Estimates

Simple linear model: y=β0+β1x+ϵ

      Sample: (x1,y1),,(xn,yn); each pair (xi,yi) satisfies yi=β0+β1xi+ϵi

       

      1.5.1 Least Squares Estimators (LSEs)

      (31)β1^=i=1n(xix¯)(yiy¯)i=1n(xix¯)2,β0^=y¯β1^x¯

      Responses y1,,yn are random variables, thus these estimators are also random variables.

      "Estimator" vs. "Estimate":

      Assumptions on the random error ϵ:

      Implication for the response variable y:

      Proposition: The LSE β0^ and β1^ are unbiased, that is E[β1^]=β1, E[β0^]=β0.

      Proof:

      (32)β1^=(xix¯)(yiy¯)(xix¯)2=(xix¯)yiy¯(xix¯)(xix¯)2=(xix¯)yi(xix¯)2E[β1^]=(xix¯)E(yi)(xix¯)2=(xix¯)(β0+β1xi)(xix¯)2=β0(xix¯)+β1(xix¯)xi(xix¯)2=β1E[β0^]=E[y¯β1^x¯]=1nE(yi)x¯E(β1^)=1ni=1n(β0+β1xi)x¯β1=β0+x¯β1x¯β1=β0

Proposition: The variances of β0^ and β1^ are Var(β1^) = σ²/Σ(xi−x̄)², Var(β0^) = σ²Σxi² / (nΣ(xi−x̄)²).

      Proof:

      (33)Var(β1^)=Var[(xix¯)yi(xix¯)2]=(1Sxx)2i=1nVar[(xix¯)yi]=1Sxx2i=1n(xix¯)2Var(yi)=σ2SxxVar(β0^)=Var(y¯β1^x¯)=Var(y¯)+Var(β1^x¯)+2Cov(y¯,β1^x¯)=Var(y¯)+x¯2Var(β1^)2x¯Cov(y¯,β1^)

      It can be shown that

      (34)Cov(y¯,β1^)=Cov(1ni=1nyi,(xix¯)yi(xix¯)2)=0

      We will prove in assignment 1 that Cov(aixi,bjyj)=ijaibjCov(xi,yj).

      Thus,

      (35)Var(β0^)=σ2n+x¯2σ2Sxx=Sxx+nx¯2nSxxσ2=xi2nx¯2+nx¯2nSxxσ2=σ2xi2nSxx

Proposition: The covariance between β0^ and β1^ is Cov(β0^,β1^) = −σ²x̄ / Σ(xi−x̄)²

      This will be proven in assignment 1.

      Assumption for the random error ϵ: ϵi's are independent and normally distributed.

      Implication:

      1.5.2 Summary

      Assumptions ϵiiidN(0,σ2) implies

A practical matter: the variance σ² is usually unknown, so it is necessary to estimate it.

      The MSE: S2=1n2i=1n(yiyi^)2 is an unbiased estimator for σ2, i.e. E(S2)=σ2

       

      1.5.3 Standard Error

(36) SE(β1^) = √(S²/Sxx),   SE(β0^) = √( S²Σxi² / (nSxx) )

      Instead of standard normal distribution, we have the following t-distribution results:

Aside: the pdf of a t_q-distribution (q is the df)

(37) f(t) = Γ((q+1)/2) / (√(πq) Γ(q/2)) · (1 + t²/q)^{−(q+1)/2}, t ∈ ℝ, where Γ(q) = ∫₀^∞ z^{q−1} e^{−z} dz

      Knowing the sampling distribution of the LSEs, i.e. t-distribution results, enables us to conduct inferences, e.g.

       

      1.6 Inference for Regression Parameters

      We wish to know the sampling distribution of LSEs β0^ and β1^, so that we can conduct inferences:

1. find a range of plausible values for the β's, e.g. a confidence interval for β1
      2. test if β's take some particular values, e.g. test H0:β1=0

      Given ϵiiidN(0,σ2), i=1,,n, we have showed that

(38) β1^ ∼ N(β1, σ²/Sxx),   β0^ ∼ N(β0, σ²Σxi²/(nSxx))

      We have standard error:

(39) SE(β1^) = √(S²/Sxx),   SE(β0^) = √(S²Σxi²/(nSxx))

      and (Student) t-distribution:

(40) (β1^−β1)/SE(β1^) ∼ t_{n−2},   (β0^−β0)/SE(β0^) ∼ t_{n−2}

      These results are then used for inference.

       

      1.6.1 Confidence Interval (CI)

We want a 100(1−α)% CI for β1, where 1−α is the confidence level. Given (β1^−β1)/SE(β1^) ∼ t_{n−2}, we have

(41) P( −t_{α/2,n−2} < (β1^−β1)/SE(β1^) < t_{α/2,n−2} ) = 1−α

      Thus, a 100(1α)% CI for β1 is

(42) {β1: β1^ − t_{α/2,n−2}SE(β1^) < β1 < β1^ + t_{α/2,n−2}SE(β1^)}

      or β1^±tα/2,n2SE(β1^).

      The percentage point of t-distribution can be obtained from a t-distribution table or using R.

Low Birth Weight Infant Data (cont.)

We find the standard error: SE(β1^) = √(S²/Sxx) = √(2.529/635.79) ≈ 0.063.

      A 95% CI for β1:

      (43)β1^±t0.025,98SE(β1^)=0.780±1.98(0.063)=(0.655,0.905)

We are 95% confident that the interval (0.655, 0.905) contains β1.
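In R, the percentage point and the CI can be obtained directly; this sketch assumes a fitted SLR object fit for the infant data, e.g. fit <- lm(headcirc ~ gestage):

```r
qt(0.975, df = 98)          # t_{0.025, 98}, approximately 1.98
confint(fit, level = 0.95)  # 95% CIs for beta0 and beta1
# manual version for beta1, using the lecture values:
0.780 + c(-1, 1) * qt(0.975, df = 98) * 0.063
```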

       

      1.6.2 Hypothesis Testing

Suppose we wish to test whether the slope β1 takes a specific value, say β. We write the hypotheses as follows:

(44) H0 (null hypothesis): β1 = β   vs.   Ha (alternative hypothesis): β1 ≠ β

      The t-test:

       

      Significance Test of β1

      A special case, we wish to test if there is a linear relationship between X and Y (i.e. if β1=0).

(47) H0: β1 = 0  vs.  Ha: β1 ≠ 0,   t = (β1^ − 0)/SE(β1^) = β1^/SE(β1^)

Reject H0 if |t| = |β1^/SE(β1^)| > t_{α/2,n−2}; we then say that the coefficient β1 is statistically significant.

      Some Remarks:

Rocket Propellant Data (cont.)

      Question: is there a significant (linear) relationship between propellant age (X) and bonding strength (Y)?

      We wish to test hypothesis:

      (50)H0:β1=0 vs. Ha:β10

The t-statistic value: t = β1^/SE(β1^) = −37.154/2.889 ≈ −12.86

Let α=0.05; t ∼ t18 if H0 is true. Since |t| = |−12.86| > 2.101 (t_{0.025,18} = 2.101), reject H0 and conclude that there is strong evidence of a linear relationship between bonding strength and propellant age.

Alternatively, calculate the p-value = 2P(T > |−12.86|) where T ∼ t18; the p-value = 1.64e−10 < α, so reject H0.
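A small R sketch of the same test, assuming the rocket fit from the earlier sketch; summary() reports the t statistic and two-sided p-value for each coefficient, and they can also be computed by hand:

```r
summary(fit)$coefficients      # row "age": estimate, SE, t value, Pr(>|t|)
2 * pt(-abs(-12.86), df = 18)  # two-sided p-value by hand, ~ 1.6e-10
qt(0.975, df = 18)             # critical value t_{0.025,18} = 2.101
```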

       

      1.7 Estimation of Mean Response

We wish to estimate the mean response at a given x value, say x = xp; according to the simple linear model we have

      (51)μ=E[y|xp]=β0+β1xp

      We will estimate it with μ^=β0^+β1^xp where β0^ and β1^ are LSEs.

      To make inference about μ (e.g. construct a 95% CI for μ), we need to know the sampling distribution of the estimator μ^.

Rocket Data Example (cont.)

What is the average bonding strength for rocket motors made from a batch of sustainer propellant that is 10 weeks old?

Recall the fitted line: y^ = 2627.822 − 37.154x

Estimated mean response:

(53) μ^ = 2627.822 − 37.154(10) = 2256.282,   SE(μ^) = √( S²[1/n + (10−x̄)²/Sxx] ) = 23.584

      A 95% CI for the mean μ at x=10:

      (54)μ^±t0.025,18SE(μ^)=(2206.731,2305.823)
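In R this is a single call to predict() with interval = "confidence" (again assuming the rocket fit with a predictor named age):

```r
predict(fit, newdata = data.frame(age = 10),
        interval = "confidence", level = 0.95)
# fit ~ 2256.28, lwr ~ 2206.73, upr ~ 2305.82
```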

A one-sided test scenario

      Estimate average bonding strength at x=16 weeks:

      (55)μ^=β0^+β1^(16)=2033.365

Note: this is a point estimate; the fact that μ^ > 2000 cannot be used to draw the conclusion that μ > 2000.

      Test hypothesis:

(56) H0: μ ≤ 2000  vs.  Ha: μ > 2000;   μ^ = 2033.365,  SE(μ^) = 22.801,  t = (μ^ − 2000)/SE(μ^) = 1.463

A large observed t-value implies that H0 is very unlikely to be true, so we reject H0 if t > t_{α,n−2} (one-sided). Since t = 1.463 < t_{0.05,18} = 1.734, do not reject H0.

There is not enough evidence to reject H0, so these propellants may not be safe to use.

       

      1.8 Prediction

      Here we consider prediction of a single response yp, for a "new" subject with an explanatory variable value x=xp.

      The (true) future response value is

      (57)yp=β0+β1xp+ϵp

      where ϵp is a future random error.

Naturally, we replace β0 and β1 by their LSEs, and replace the random error ϵp by its expectation, 0.

      We predict yp by yp^=β0^+β1^xp, and prediction error is ypyp^.

Some results about the prediction error yp − yp^:

1. E[yp − yp^] = 0

Proof:

(58) E[yp − yp^] = E(yp) − E(yp^) = E(β0+β1xp+ϵp) − E(β0^+β1^xp) = 0

2. Var(yp − yp^) = Var(ϵp − (β0^+β1^xp)) = Var(ϵp) + Var(β0^+β1^xp) = σ² + σ²[1/n + (xp−x̄)²/Sxx]

  β0^ and β1^ are functions of the data at hand, while the future error ϵp is not related to the data. Hence, ϵp is independent of β0^ and β1^.

3. It is also true that (yp − yp^ − 0)/SE(yp − yp^) ∼ t_{n−2}

      A 100(1α)% prediction interval for yp:

      (59)yp^±tα/2,n2SE(ypyp^)

      (it is wider than a 100(1α)% CI for mean response at x=xp)

      Rocket Data

Find the predicted value and a 95% prediction interval for the bonding strength of a new rocket motor made with a propellant that is 10 weeks old.

Prediction at x = 10:

(60) yp^ = β0^ + β1^(10) = 2256.282,   SE(yp − yp^) = √( S²[1 + 1/n + (10−x̄)²/Sxx] ) = 98.774

      A 95% prediction interval:

(61) yp^ ± t_{0.025,18} SE(yp − yp^) = (2048.476, 2463.524)

      A 95% CI for μ:

      (62)μ^±t0.025,18SE(μ^)=(2206.731,2305.833)
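The same predict() call with interval = "prediction" gives the (wider) prediction interval; again a sketch assuming the rocket fit and predictor name age:

```r
predict(fit, newdata = data.frame(age = 10),
        interval = "prediction", level = 0.95)   # wider than the CI above
```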

      1.9 Analysis of Variance (ANOVA)

      We consider regression analysis from a new perspective called analysis of variance. The approach of analysis of variance is based on the partitioning of total variability in responses.

      Total Sum of Squares (SST)

      (63)SST=i=1n(yiy¯)2

      Error Sum of Squares (SSE)

      (64)SSE=i=1n(yiyi^)2=i=1nri2

      Regression Sum of Squares (SSR)

      (65)SSR=i=1n(yi^y¯)2

      Partition of Total Sum of Squares:

(66) SST = Σ(yi−ȳ)² = Σ(yi−yi^+yi^−ȳ)² = Σ(yi−yi^)² + Σ(yi^−ȳ)² + 2Σ(yi−yi^)(yi^−ȳ); the cross term 2Σri(yi^−ȳ) = 2Σri yi^ − 2ȳΣri = 0, so SST = Σ(yi−yi^)² + Σ(yi^−ȳ)² = SSE + SSR

      1.9.1 Breakdown of Degrees of Freedom

      ANOVA Table

Source | SS | df | MS (mean squares)
Regression | SSR = Σ(yi^−ȳ)² | 1 | MSR = SSR/1 = SSR
Error | SSE = Σ(yi−yi^)² | n−2 | MSE = SSE/(n−2)
Total | SST = Σ(yi−ȳ)² | n−1 |

      Mean Squares:

Given the above results, we expect MSR to be larger than MSE if β1 ≠ 0.

      This gives an idea for testing significance of β1, H0:β1=0, we want to reject H0 if the observed value of the ratio MSRMSE is large, but how large is large enough? It would be nice to know the sampling distribution of MSRMSE.

Under H0: β1 = 0, the ratio

(70) F = MSR/MSE ∼ F_{1,n−2}

an F-distribution with df 1 and n−2.

We use an F-test for this hypothesis and reject H0 if the observed value of the F-statistic exceeds F_{α;1,n−2} (the upper α percentage point of the F_{1,n−2} distribution).

      Coefficient of Determination:

(71) R² = SSR/SST
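For a fitted SLR object fit, the ANOVA table and R² are available directly (a sketch):

```r
anova(fit)              # rows: the predictor (regression) and Residuals (error)
summary(fit)$r.squared  # R^2 = SSR / SST
```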

       


      2. Multiple Linear Regression

      2.1 Random Vectors and Multivariate Normal Distribution

      Random Vector

      For any set of r.v.'s y1,,yn, Y=(y1,,yn)T is a n×1 random vector.

      Mean of Y

      (72)E(Y)=[E(y1),,E(yn)]T=(μ1,,μn)T=μn×1

      Variance-Covariance Matrix of Y

Var(Y) = E[(Y−E(Y))(Y−E(Y))^T]
= [ Var(y1)   Cov(y1,y2)  …  Cov(y1,yn)
                Var(y2)    …  Cov(y2,yn)
                           ⋱
                               Var(yn) ]
= [ σ1²  σ12  …  σ1n
          σ2²  …  σ2n
                ⋱
                    σn² ]
= Σ (a symmetric n×n matrix)

Basic Properties

For an n×1 random vector Y,

E(AY+b) = A E(Y) + b,   Var(Y+b) = Var(Y),   Var(AY+b) = A Var(Y) A^T

(A and b are a matrix and a vector of constants).

      2.1.1 Multivariate Normal Distribution

      A random vector Y=(y1,,yn)T has a multivariate normal distribution if its density function takes the form

(73) f(y1,…,yn) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2)(Y−μ)^T Σ^{−1} (Y−μ) }

      where E(Y)=μn×1 and Var(Y)=Σn×n.

      We also write YMVN(μ,Σ).

      Some useful results about multivariate normal distribution:

1. If Y ∼ MVN(μ,Σ), then

  (74) AY + b ∼ MVN(Aμ+b, AΣA^T)

  (invariance under linear transformation)

2. If Y ∼ MVN(μ,Σ), then each element of Y has a normal distribution, i.e.

        (75)yiN(μi,σi2), i=1,,n

        More generally, for any partition of Y

        (76)Y=[y1ykyk+1yn]=[Y1Y2],μ=[μ1μ2],Σ=[Σ11Σ12Σ21Σ22]

        then Y1MVN(μ1,Σ11),Y2MVN(μ2,Σ22) (marginal normality!)

      3. If yiN(μi,σi2), and y1,,yn are independent, then

        (77)Y=(y1,,yn)TMVN(μ,Σ)

        where Σ=diag{σ12,,σn2}.

      4. For multivariate normal random vector Y

(78) Var(Y) is diagonal ⟺ y1,…,yn are independent
5. If Y ∼ MVN(μ,Σ), let V = AY and W = BY; then

  (79) V and W are independent ⟺ AΣB^T = 0

       

      2.2 Multiple Linear Models

      We can write the model in scalar form:

      (80)yi=β0+β1xi1++βpxip+ϵi

      Alternatively, we write the model in matrix form:

      (81)[y1y2yn]n×1=[1x11x1p1x21x2p1xn1xnp]n×(p+1)[β0β1βp](p+1)×1+[ϵ1ϵ2ϵn]n×1Yn×1=Xn×(p+1)β(p+1)×1+ϵn×1

      X is called design matrix.

      Interpretation of β Parameters:

Since ϵi ∼ iid N(0,σ²), the random error vector ϵ ∼ MVN(0, σ²I), where I is the n×n identity matrix.

According to the model Y = Xβ + ϵ, the response vector Y is a linear function of ϵ, thus Y ∼ MVN(Xβ, σ²I).

       

      2.3 Parameters Estimation

      2.3.1 Least Squares Approach

      Same as in simple linear regression

      (82)β^=argminβi=1n[yi(β0+β1xi1++βpxip)]2

      We want to find values of β vector that minimizes sum of squares.

      In matrix form,

      (83)β^=argminβ[(YXβ)T(YXβ)]

      Taking derivative w.r.t. vector β

∂/∂β [(Y−Xβ)^T(Y−Xβ)] = ∂/∂β [Y^TY − Y^TXβ − β^TX^TY + β^TX^TXβ] = −(Y^TX)^T − X^TY + 2X^TXβ = −2X^TY + 2X^TXβ

      Aside,

Setting the derivative to 0 and solving for β gives

−2X^TY + 2X^TXβ = 0  ⟹  X^TXβ = X^TY  ⟹  β = (X^TX)^{−1}X^TY

      Least Square Estimator for β:

      (84)β^=(XTX)1XTY

      Properties of LSE β^:

      1. E(β^)=β (unbiased)

        (85)E(β^)=E[(XTX)1XTY]=(XTX)1XTE(Y)=(XTX)1XTXβ=β
      2. Var(β^)=σ2(XTX)1

        Var(β^)=Var((XTX)1XTAY)=(XTX)1XTVar(Y)((XTX)1XT)T=(XTX)1XTσ2IX(XTX)1=σ2(XTX)1
      3. β^MVN(β,σ2(XTX)1)

        Since YMVN, then the linear transformation β^=(XTX)1XTY also has a MVN distribution.

4. βj^ ∼ N(βj, σ²[(X^TX)^{−1}]jj), where [(X^TX)^{−1}]jj is the (j,j) element of the matrix (X^TX)^{−1}
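A minimal sketch on simulated data showing the matrix formula and its agreement with lm(); vcov() returns the estimated σ²(X^TX)^{−1}:

```r
set.seed(331)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                      # design matrix, n x (p+1)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X'X)^{-1} X'y
fit <- lm(y ~ x1 + x2)
cbind(beta_hat, coef(fit))                  # identical estimates
vcov(fit)                                   # estimate of sigma^2 (X'X)^{-1}
```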

       

      2.4 Fitted Values and Residuals

      Fitted Value Vector:

(86) Y^ = Xβ^ = X(X^TX)^{−1}X^T Y = HY

      Hat Matrix:

      (87)H=X(XTX)1XT

      Residual Vector:

      (88)r=YY^=YHY=(IH)Y

       

      Some Other Results about Residuals:

      Proof:

(89) X^T r = (Σri, Σri xi1, …, Σri xip)^T, and X^T r = X^T(Y−Y^) = X^TY − X^TX(X^TX)^{−1}X^TY = 0
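A sketch verifying these identities numerically (simulated data as in the earlier sketch):

```r
set.seed(331)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)
H  <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
r  <- y - H %*% y                        # residual vector (I - H) y
max(abs(t(X) %*% r))                     # X'r = 0 up to rounding error
sum(diag(H))                             # tr(H) = p + 1 = 3 (used in Section 2.5)
```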

      2.5 Estimation of σ2

      We still want to use sum of squares of residuals SSE=i=1nri2 to estimate the variance of random error, σ2.

      Question: What is the df of SSE in MLR?

Result: SSE/σ² ∼ χ²_{n−(p+1)}

Proof: The idea is to re-write SSE/σ² = (1/σ²) r^T r as a sum of independent χ²₁-distributed random variables.

Recall r ∼ MVN(0, σ²(I−H)); (I−H) is symmetric and idempotent, with spectral decomposition

(90) (I−H) = PΛP^T

where P is an n×n orthogonal matrix (P^T = P^{−1}), and Λ is an n×n diagonal matrix with 0's and 1's on the diagonal.

      Now we define a new random vector

(91) Z = (1/σ) P^T r,   Z^TZ = (1/σ²) r^T P P^T r = (1/σ²) r^T r

then Z ∼ MVN(0, Λ), as Z is a linear transform of r:

(92) E(Z) = (1/σ) P^T E(r) = 0,   Var(Z) = (1/σ²) P^T Var(r) P = (1/σ²) P^T σ²(I−H) P = P^T(PΛP^T)P = Λ

      since P is orthogonal, PTP=I.

      We know vector ZMVN(0,Λ), here Λ is diagonal with value 0 or 1 for diagonal elements. To find the degree of freedom, we need to find how many diagonal elements in Λ is 1, and

tr(Λ) = tr(I−H) = tr(I_{n×n}) − tr(H) = n − tr(X(X^TX)^{−1}X^T) = n − tr((X^TX)^{−1}(X^TX)) = n − tr(I_{(p+1)×(p+1)}) = n − (p+1)

This implies that the vector Z has n−(p+1) non-degenerate elements, and they are iid N(0,1) random variables.

      Therefore,

(93) SSE/σ² = (1/σ²) r^T r = Z^TZ ∼ χ²_{n−(p+1)}

Moreover,

(94) E(SSE/σ²) = n−(p+1),  so  E(SSE) = σ²(n−(p+1))

So the MSE

(95) MSE = SSE / (n−(p+1))

      is an unbiased estimator for σ2.

       

      2.6 Inference in Multiple Regression

      Some useful results:

      1. β^MVN(β,σ2(XTX)1)

      2. βj^N(βj,σ2[(XTX)1]jj) or equivalently

        (96)βj^βjσ2[(XTX)1]jjN(0,1)
3. As σ² is unknown, we replace it by its estimator S² = SSE/(n−(p+1)) and obtain a t-distribution result as in simple linear regression:

  (97) (βj^ − βj) / SE(βj^) ∼ t_{n−(p+1)},  where SE(βj^) = √( S²[(X^TX)^{−1}]jj )

  Here, the df is n−(p+1), where p denotes the number of explanatory variables.

       

      For an individual coefficient βj:

      Question: Can we fit a linear model when the number of model parameters (or number of explanatory variables) is larger than the number of observations? (i.e. p>n)

LSE: β^ = (X^TX)^{−1}X^TY. For X^TX to be invertible, the design matrix X (n×(p+1)) needs to have full column rank: the number of rows must be at least the number of columns, that is n ≥ p+1, and rank(X) = p+1.

      2.6.1 Inference for Linear Combinations of Coefficients: cTβ

      In multiple linear regression, we wish to estimate the mean response at some given values of explanatory variables, e.g. c=(1,x1,,xp)T, a vector of constants.

      Estimated Mean Response at c=(1,x1,,xp)T

      (99)μ^c=cTβ^

      Recall that β^MVN(β,σ2(XTX)1), then

      (100)μ^cN(cTβ,cT[σ2(XTX)1]c)

Following the same argument as before, replace σ² by its estimator σ²^ = SSE/(n−(p+1)); we have

      (101)cTβ^cTβcT[σ2^(XTX)1]cSE(cTβ^)tn(p+1)

      This is used to conduct inference about μc=cTβ.

      (102)μ^c±tα/2,n(p+1)SE(μ^c)

      is the (1α)100% CI for μc.
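A sketch of both routes in R, assuming a fitted MLR object fit with predictors x1 and x2 (so c = (1, 0.5, −1)^T for the illustrative values below):

```r
new <- data.frame(x1 = 0.5, x2 = -1)
predict(fit, newdata = new, interval = "confidence", level = 0.95)
# manual version with c = (1, 0.5, -1):
cvec <- c(1, 0.5, -1)
est  <- sum(cvec * coef(fit))
se   <- sqrt(as.numeric(t(cvec) %*% vcov(fit) %*% cvec))
est + c(-1, 1) * qt(0.975, df = fit$df.residual) * se
```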

       

      2.7 Prediction in Multiple Linear Regression

      For a new subject with explanatory variable values c=(1,x1,,xp)T, we wish to predict the outcome and obtain a prediction interval.

      Future response: according to the model

      (103)yp=cTβ+ϵp

       

      2.8 Geometric Interpretation of Least Squares

      Let

From a geometric point of view, each n×1 vector above represents a point in the n-dimensional space ℝⁿ.

[Figure: the vectors Y, Xβ, and Xβ^ in ℝⁿ]

       

      2.9 ANOVA for Multiple Linear Regression

      Consider a multiple linear model

      (109)Yi=β0+β1xi1++βpxip+ϵi, i=1,,n

      We still have following source of variations:

      Sum of squares and degrees of freedom add up as before:

(110) SST = SSR + SSE,   dfT = dfR + dfE

      The value of SST and dfT remains the same:

      (111)SST=i=1n(yiy¯)2,dfT=n1

      But the values of SSR and SSE and their df's are different from SLR.

SSR = Σ(yi^−ȳ)², dfR = p;   SSE = Σ(yi−yi^)², dfE = n−(p+1)

      ANOVA Table

Source | SS | df | MS
Regression | SSR = Σ(yi^−ȳ)² | p | MSR = SSR/p
Error | SSE = Σ(yi−yi^)² | n−(p+1) | MSE = SSE/(n−p−1)
Total | SST = Σ(yi−ȳ)² | n−1 |

      2.9.1 F-test of "overall" significance

      Suppose we want to evaluate the "overall" significance of a multiple linear model, e.g. we wish to test hypotheses:

(113) H0: β1 = β2 = ⋯ = βp = 0 (all slope coefficients are 0)   vs.   Ha: at least one βj ≠ 0

      F-statistic:

(114) F = (SSR/p) / (SSE/(n−p−1)) = MSR/MSE

We reject H0 if F is large (that is, the model explains much more variation in the response than the random error), i.e.

(115) F > F_{α,p,n−p−1} (critical value)

Why α instead of α/2? Because we reject only for large positive values of F (a one-sided rejection region).

(116) p-value = P(F_{p,n−p−1} > observed F)

      If H0 is rejected, we conclude that at least one of the explanatory variables is important or related to Y.

      2.9.2 Coefficient of Determination

(117) R² = SSR/SST

Adjusted R²

(118) R²adj = 1 − (n−1)/(n−p−1) · (1−R²)

      Example: Low Birthweight Infant Data

Fit summary: lm(headcirc ~ gestage + birthwt)

      Residual standard error: 1.274 on 97 df

      Multiple R2: 0.753, Adjusted R2: 0.747

      F-statistic: 147.1 on 2 and 97 df.

Test H0: β1 = β2 = 0 vs. Ha: at least one of β1 and β2 is not 0.

(119) F = MSR/MSE = 147.1

Since F = 147.1 > F_{0.05,2,97} = 3.09 (equivalently, p-value = P(F_{2,97} > 147.1) ≈ 0), reject H0 and conclude that the model is significant.

       


      3. Specification Issues in Regression Models

       

      3.1 Categorical Variables

In linear regression models, the response variable has to be numerical; however, the explanatory variables can be either numerical or categorical.

      A categorical variable takes a value that falls into one of several categories, e.g.

      Approach: recode into indicator variables or treat as numerical if it makes sense to do so.

      Example: Occupational Prestige

      Reponse: prestige score

      Explanatory: education (yr), type: blue collar, white collar, professional

Occupation | prestige (Y) | education (x1) | type (x2)
Computer operations | 47.7 | 11.36 | wc
Construction labourers | 26.5 | 7.52 | bc
Office clerks | 35.6 | 11.00 | wc
Nurses | 64.7 | 12.46 | prof

      How to code variable "type"?

      e.g, let

(120) xi,2 = 0 if bc, 1 if wc, 2 if prof

This approach is generally not appropriate unless the explanatory variable is ordinal and the effect on the response is linear in this coding.

A more flexible approach: introduce indicator/dummy variables

(121) xi,2 = 1 if prof, 0 otherwise;   xi,3 = 1 if wc, 0 otherwise

If instead we introduced an indicator for each of the three levels (prof, wc, bc), together with the intercept, the design matrix would be

(122) X = [ 1 11.36 0 1 0
            1  7.52 0 0 1
            1 11.00 0 1 0
            1 12.46 1 0 0 ]

with rank(X) = 4: the three indicator columns sum to the column of 1's, so the design matrix X is not of full column rank and X^TX is not invertible!

      In general, if there are k levels, just need k1 indicators for the categorical variable.

      Model:

(123) yi = β0 + β1xi,1 + β2xi,2 + β3xi,3 + ϵi, where xi,2 and xi,3 are the indicators for the variable "type"

      Interpretation:

      We know the LSE β^MVN(β,σ2(XTX)1)
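In R, declaring type as a factor makes lm() build the k−1 indicator columns automatically; a sketch using the four rows of the prestige example above (variable names are assumptions):

```r
prestige <- data.frame(
  score = c(47.7, 26.5, 35.6, 64.7),
  educ  = c(11.36, 7.52, 11.00, 12.46),
  type  = factor(c("wc", "bc", "wc", "prof"))   # baseline level is "bc"
)
model.matrix(score ~ educ + type, data = prestige)  # intercept, educ, typeprof, typewc
fit <- lm(score ~ educ + type, data = prestige)     # k - 1 = 2 indicators for 3 levels
```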

       

      3.2 Interaction Effects

      Interaction effects exist when the effect of one variable depends on the value of another variable.

      Suppose we have two numerical variables x1,x2, an interaction is the product of these two variables, e.g.

      (124)y=β0+β1x1+β2x2+β3x1x2+ϵ

      We call β1 and β2 the main effects of x1 and x2, and β3 the interaction effect between x1 and x2.

      General Interpretation:

As we can see, by adding the interaction x1x2, we allow the effect of x1 to depend on the value of the other variable x2.

Now consider the case when one of the two variables, say x2, is binary.

      Example: low birthweight infant

(128) y (head circ) = β0 + β1·x1 (gest age) + β2·x2 (toxemia, 0/1) + β3·x1x2 + ϵ

      Question: is the relationship between head circ (y) and gest age (x1) the same for these two groups? We can test significance of interaction effect β3:

      (129)H0:β3=0,vs.Ha:β30
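A minimal simulated sketch: in R the formula x1 * x2 expands to both main effects plus the interaction, and the interaction row of the coefficient table gives the t-test of H0: β3 = 0 (variable names are illustrative, not the course data):

```r
set.seed(331)
gestage  <- rnorm(100, 29, 2)
toxemia  <- rbinom(100, 1, 0.3)                                      # binary indicator
headcirc <- 4 + 0.8 * gestage - 0.5 * toxemia + rnorm(100, sd = 1.5) # no true interaction here
fit <- lm(headcirc ~ gestage * toxemia)          # gestage + toxemia + gestage:toxemia
summary(fit)$coefficients["gestage:toxemia", ]   # t test of H0: beta3 = 0
```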

       

      4. Testing General Linear Hypothesis

      Model:

      (130)yi=β0+β1xi1++βpxip+ϵi

      We have discussed

      General Linear Hypothesis:

      (131)H0:Aβ=0vs.Ha:Aβ0

      Here A is a l×(p+1) matrix of some constants.

      Example: Prestige Data

      "Full" model:

(132) yi (prestige score) = β0 + β1·xi1 (education) + β2·xi2 (indicator for prof) + β3·xi3 (indicator for wc) + ϵi

where xi2 and xi3 together represent the type variable. We want to test whether β2 = β3 = 0, meaning that type does not matter overall.

      1. H0:β2=β3=0, tests overall effect of type

(133) A = [ 0 0 1 0
            0 0 0 1 ],   H0: Aβ = 0 (2×1)

        If H0 is true, the "full" model reduces to yi=β0+β1xi1+ϵi, which is the "reduced" model.

2. H0: β1 = 0, β2 = β3; this tests that education has no effect and that there is no difference between prof and wc.

  (134) A = [ 0 1 0  0
              0 0 1 −1 ],   H0: Aβ = 0 (2×1)

  If H0 is true, the "full" model reduces to yi = β0 + β2(xi2+xi3) + ϵi, where xi2+xi3 = 1 if the type is prof or wc.

In general, with l constraints, A is an l×(p+1) matrix and rank(A) = l.

      Principle of Extra Sum of Squares:

      4.1 F-test for General Linear Hypothesis: H0:Aβ=0

      F-statistic:

(141) F = ( ‖Y^ − Y^A‖²/l ) / ( ‖r‖²/(n−p−1) ) = ( (SSE_A − SSE)/l ) / ( SSE/(n−p−1) ) ∼ F_{l,n−p−1} under H0

      Note: l is the number of constraints, rank(A)=l.

      A sketch of proof:

1. SSE/σ² ∼ χ²_{n−p−1} (needs ϵi ∼ iid N(0,σ²))
2. (SSE_A − SSE)/σ² ∼ χ²_l under H0 — write it as a sum of squares of standard normal variables (needs ϵi ∼ iid N(0,σ²) and H0 true)
3. SSE and SSE_A − SSE are independent.

      Therefore, combining 1, 2, and 3,

(142) F = [ ((SSE_A − SSE)/σ²) / l ] / [ (SSE/σ²) / (n−p−1) ] ∼ F_{l,n−p−1} under H0

      Example: Prestige Score Data

      Full model: education (x1), type (x2=I(type=prof),x3=I(type=wc)), education × type (interaction)

      (143)y=β0+β1x1+β2x2+β3x3+β4x1x2+β5x1x3+ϵ

      Question: is the effect of education x1 on prestige score y the same for different types?

      We wish to test the significance of interaction effects, e.g.

(144) H0: β4 = β5 = 0  vs.  Ha: at least one is not 0;   A = [ 0 0 0 0 1 0
                                                                0 0 0 0 0 1 ],   Aβ = (β4, β5)^T = 0

Reduced model (when H0: Aβ = 0 is true):

      (145)yA=β0+β1x1+β2x2+β3x3+ϵ

      In R, fit both full and reduced models.

For the full model, the residual standard error is σ^ = √MSE = 7.827 on 92 df, so SSE = (7.827)²·92.

For the reduced model, the residual standard error is σ^A = √MSE_A = 7.814 on 94 df, so SSE_A = (7.814)²·94.

F-statistic for H0: β4 = β5 = 0:

(146) F = ( (SSE_A − SSE)/l ) / ( SSE/(n−p−1) ) = [ (7.814)²·94 − (7.827)²·92 ] / 2 / (7.827)² = (103.41/2)/(7.827)² ≈ 0.844

Under H0, F ∼ F_{2,92}; the p-value = P(F_{2,92} > 0.844) = 0.433, so do not reject H0: we have no evidence that the effect of education differs between types.

In R, the anova() function can be used to carry out this F-test.
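A sketch of that workflow on simulated data with prestige-style variable names (assumptions): fit the reduced and full models and pass both to anova():

```r
set.seed(331)
n     <- 100
educ  <- rnorm(n, 11, 3)
type  <- factor(sample(c("bc", "wc", "prof"), n, replace = TRUE))
score <- 10 + 4 * educ + 5 * (type == "prof") + 2 * (type == "wc") + rnorm(n, sd = 7)
full    <- lm(score ~ educ * type)   # includes the educ:type interaction terms
reduced <- lm(score ~ educ + type)   # model under H0: no interaction
anova(reduced, full)                 # extra-sum-of-squares F test, l = 2 constraints
```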

       


       

      5. Model Assumptions and Residual Analysis

      5.1. Introduction

      A linear regression model is specified under several assumptions:

1. The relationship between y and x1,…,xp is linear, i.e. y = β0 + β1x1 + ⋯ + βpxp (a linear function) + ϵ
      2. ϵi has a mean of 0, E[ϵi]=0
      3. ϵi's have a constant variance, Var(ϵi)=σ2
      4. ϵi's are normally distributed
      5. ϵi's are independent

      Random errors ϵi's are not directly observable, thus it seems hard to examine their distributional properties.

      However, we can obtain residuals, ri=yiyi^, and we argue residuals behave similarly as random errors.

      Relationship between Residuals and Random Errors

      Idea: check assumptions about ϵi's using diagnostic plots of residuals, ri's.

      5.2 Residual Analysis

      Residual: ri=yiyi^

Standardized Residual: di = ri / σ^

Studentized Residual: ei = ri / √( σ^²(1−hii) )

      Residual Plots

      Plot of Residuals vs. Fitted Values

      A plot of residual ri or any of scaled residuals (di or ei) against corresponding fitted value (yi^)

1. If the residuals fluctuate randomly around 0 inside a horizontal band, then there are no visible defects.

2. If the residuals can be contained in an opening funnel ("fan" shape), this indicates that Var(ϵi) is not constant.

3. If the residuals follow a curved pattern, this indicates nonlinearity (the relationship between y and some x variables is not linear, or some other explanatory variables are needed).

      Plots of Residuals vs. Explanatory Variable

      A plot of residuals against the values of j-th explanatory variable, xij's.

The interpretation of this plot is the same as for the plot of residuals vs. yi^'s.

1. Horizontal band ⇒ no visible defects

2. Funnel/fan shape ⇒ variance is nonconstant

3. Curvature ⇒ nonlinearity (may suggest adding xj² to the model)

      Partial Residual Plots

Most useful for investigating the relationship between the response y and an explanatory variable xj.

Partial residual for xj, j=1,…,p: ri(j) = ri + βj^xij, where ri is the residual based on all p explanatory variables; this adds the effect of the xj variable back into the residual.

      Plot Partial Residuals ri(j) vs. xij

1. Linear trend ⇒ xj enters the model linearly

2. Curvature ⇒ higher order terms in xj may be helpful

      Q-Q Plot for Normal Distribution

This is a graphical technique for detecting substantive departures from normality. Plot the ordered standardized residuals d(1) < d(2) < ⋯ < d(n) against the theoretical quantiles of N(0,1). If normality holds, the ordered residuals should align with the normal quantiles. (An R sketch follows the list below.)

1. Points lie approximately on a straight line ⇒ underlying distribution is normal

2. Sharp upward and downward curves at both extremes ⇒ heavy-tailed

3. Flattening at the extremes ⇒ light-tailed

4. Sharp upward change in the trend from the middle ⇒ positively skewed (right skewed)

5. Sharp downward change in the trend from the middle ⇒ negatively skewed (left skewed)
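A sketch of these diagnostic plots for a fitted model object fit:

```r
plot(fitted(fit), rstandard(fit),              # standardized residuals vs fitted values
     xlab = "fitted values", ylab = "standardized residuals")
abline(h = 0, lty = 2)
qqnorm(rstandard(fit)); qqline(rstandard(fit)) # normal Q-Q plot of standardized residuals
# plot(fit) produces a similar set of diagnostic plots directly.
```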

      Remarks:

      5.3 Addressing Model Assumption Problems

      If residual plots reveal problems with assumptions, we might be able to address via data transformation.

1. Transformation of Response to Stabilize Variance

        This can help address non-constant variance identified by the plot r vs. y^ (or r vs. xj)

        Idea: apply function g and fit regression model on transformed g(yi)

        (147)g(yi)=β0+β1xi1++βpxip+ϵi

  Rationale: the variance of the response might be a function of the mean μi = E(yi), i.e.

  (148) Var(yi) = Var(ϵi) = h(μi)σ²  (for some h(·) > 0)

  in which case we want Var(g(yi)) ≈ σ²

        By first order Taylor Expansion, we have

  (149) g(yi) ≈ g(μi) + (yi−μi)g′(μi)

  Then,

  (150) Var(g(yi)) ≈ [g′(μi)]² Var(yi) ≈ [g′(μi)]² h(μi) σ²

  Thus, we need [g′(μi)]² ∝ 1/h(μi).

        Examples:

  1. h(μi) = μi, i.e. Var(yi) = μiσ².

    We need g′(μi) ∝ 1/√μi, so g(μi) = √μi will work.

    Thus, we can apply g(yi) = √yi to obtain approximately constant variance.

  2. h(μi) = μi², i.e. Var(yi) = μi²σ².

    We need g′(μi) ∝ 1/μi, so g(μi) = ln(μi) will work.

    Thus, we can apply g(yi) = ln(yi) to obtain approximately constant variance.

        3. Box-cox Power Transformation

    (151) g(yi) = (yi^λ − 1)/λ if λ ≠ 0;  log(yi) if λ = 0

    Note that

    (152) g′(μi) = μi^{λ−1} if λ ≠ 0;  1/μi if λ = 0,  so  h(μi) ∝ 1/[g′(μi)]² = μi^c

          where c is some arbitrary power. Thus, Box-Cox transformation can help address the non-constant variance of the form

          (153)Var(yi)=μicσ2

          Some special cases:

          • λ = 1/2: the square-root transformation
          • λ = 0: the log transformation
          • λ = 1: the identity transformation (no transformation)
          • λ = −1: the reciprocal transformation

          One can automatically try a sequence of values of λ and find the "best" choice, the one that gives the largest log-likelihood value (see the R sketch after this list).

          Note that interpreting βj can be less intuitive as a result of transformation, since now increasing xj by 1 unit corresponds to a change of βj in g(yi).

          For log transformation: g(yi)=log(yi), βj represents the change in mean response in the log scale,

          (154)yi=eβ0+β1xi1++βpxip+ϵi
          • eβj is the multiplicative change applied to the original response.

          • every 1 unit increase in xj is associated with a 100(e^{βj} − 1) percent change in the mean response on the original scale.

          • 95% CI for mean response at x=(1,x1,,xp)T

            (155)exp(xTβ^ ± tα/2,np1SE(xTβ^))

          For arbitrary λ, transformation leads to less interpretable results.

2. Transforming/Adding Explanatory Variables

          This might be applied if y (or g(yi)) has a clear non-linear relation with some xj, for example revealed by plots of r vs. xj or partial residual plots.

          1. We can consider transforming xj using power transformation, e.g. log(xj)

          2. We could add polynomial terms, such as x2,x3,, e.g.

    (156) yi = β0 + β1xi + β2xi² + ϵi

            which is still a linear model (linear in β's)
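The automatic λ search for the Box-Cox transformation mentioned above can be done with boxcox() from the MASS package; a minimal sketch on simulated data where a log transformation is appropriate:

```r
library(MASS)
set.seed(331)
x <- runif(80, 1, 10)
y <- exp(0.3 + 0.2 * x + rnorm(80, sd = 0.2))     # multiplicative errors => log scale
fit <- lm(y ~ x)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))  # plots the profile log-likelihood
bc$x[which.max(bc$y)]                             # "best" lambda (near 0 here)
```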

       

      6. Effects of Individual Observations

Sometimes some of the observations have an unduly large influence on the fit of the model.

      Outliers

      An outlier is defined as a particular observation (yi,xi1,,xip) that differs substantially from the majority of the observations in the dataset.

      e.g. SLR (simple linear regression):

      What causes outliers?

      Generally we do not recommend removing outliers from dataset, unless we have good reason to believe the observation is an error or the subject does not belong to the target population.

      Outliers may change the result of model fitting, e.g. observation C pulls the fitted line closer to it, observation A or B may change the slope a bit.

Therefore, it is useful to investigate to what extent the outliers influence our fitted models.

      How to detect outliers?

      6.1 Studentized Residuals

(157) ei = ri / √( σ^²(1−hii) )

       

Recall r ∼ MVN(0, σ²(I−H)) (hii is the ith diagonal element of H), and ei ∼ N(0,1) approximately.

Rule of Thumb: if ei is very large relative to N(0,1), e.g. |ei| > 3, then observation i could be considered an outlier in the sense of an extreme value of the response yi.

      e.g. plot of studentized residual (ei) vs. fitted value (yi^) can be used to identify outliers in response.

      6.2 Leverage

      Recall hii is the ith diagonal element in the hat matrix H=X(XTX)1XT, we call hii the leverage of observation i. Note that

      (158)Y^=Xβ^=X(XTX)1XTY=HY

      So

(159) yi^ = (hi1,…,hin)(y1,…,yn)^T = Σ_{j=1}^n hij yj = hii yi + Σ_{j≠i} hij yj

      The leverage hii characterizes the influence of observation yi on its corresponding fitted value yi^.

      Recall H=X(XTX)1XT, it only involves explanatory variables, but not response.

      The leverage hii is small for observation with (xi1,xi2,,xip) near the centroid (x¯1,,x¯p), and it is large if (xi1,,xip) is far away from the centroid.

e.g. in SLR, X = [1 x1; …; 1 xn], and

(X^TX)^{−1} = (1/(nSxx)) [ Σxi²  −nx̄
                            −nx̄    n ],
so hii = (1, xi)(X^TX)^{−1}(1, xi)^T = 1/n + (xi−x̄)²/Sxx

      (This generalizes to MLR)

Thus, the leverage is useful for identifying outliers in the sense of having explanatory variables with extreme values.

      6.3 Influence

An observation is influential if its presence in fitting the regression considerably changes the estimates of β compared to when it is not used to fit the model.

      Start by fitting a model with observations Y=Xβ+ϵ, and obtain LSE β^ as usual.

      Let β^(i) denote LSE based on fitting the same model but the ith observation removed from the data set. If β^(i) is quite different than β^, then observation i is highly influential.

      Cook's Distance (between β^(i) and β^)

(162) Di = (β^ − β^(i))^T [σ^²(X^TX)^{−1}]^{−1} (β^ − β^(i)) / (p+1) = (β^ − β^(i))^T X^TX (β^ − β^(i)) / ( σ^²(p+1) )

It can be shown that Cook's Di can be written as

(163) Di = ei² · hii/(1−hii) · 1/(p+1)

where we can see that Di depends on both ei (the studentized residual ei = ri/√(σ^²(1−hii))) and hii (the leverage).

      Observation i is most influential when both |ei| and hii are large values.
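R provides all three diagnostics for a fitted model object fit; the cutoffs in the comments are common rules of thumb rather than values from the notes:

```r
e <- rstandard(fit)       # studentized residuals e_i (rstudent() gives the externally studentized version)
h <- hatvalues(fit)       # leverages h_ii
d <- cooks.distance(fit)  # Cook's distances D_i
which(abs(e) > 3)         # possible outliers in the response
which(h > 2 * mean(h))    # high-leverage points, cutoff 2(p+1)/n
which(d > 4 / length(d))  # one common cutoff for influential points
```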

Proof: For the ith observation, let vi = (1, xi1, …, xip)^T denote the vector of its explanatory variable values.

      (164)X(i)=[v1TviTvnT](delete ith viT)Y(i)=[y1yiyn](delete yi)

      Without observation i, the LSE of β becomes

      (165)β^(i)=[X(i)TX(i)]1X(i)TY(i)

      Note without loss of generality,

      X=[X(i)viT],Y=[Y(i)yi]XTX=X(i)TX(i)+viviT,XTY=X(i)TY(i)+viyiX(i)TX(i)=XTXviviT,X(i)TY(i)=XTYviyi

Substituting these into the expression for β^(i):

      β(i)^=[XTXviviT]1(XTYviyi)=[(XTX)1+(XTX)1viviT(XTX)11viT(XTX)1vi](XTYviyi)=(XTX)1XTYβ^(XTX)1viyi+(XTX)1viviT(XTX)1XTYβ^(XTX)1viviT(XTX)1vihiiyi1viT(XTX)1vihii=β^(XTX)1viyi+(XTX)1viviTβ^(XTX)1vihiiyi1hii=β^(XTX)1vi1hii[yi(1hii)viTβ^y^i+hiiyi]=β^(XTX)1vi1hii(yiyi^)β(i)^=β^(XTX)1viri1hiiβ^β(i)^=(XTX)1viri1hii

Note that [A − aa^T]^{−1} = A^{−1} + A^{−1}aa^TA^{−1} / (1 − a^TA^{−1}a)

      Substitute this into the expression for Cook's Di:

      Di=β^β(i)^XTX(β^β(i)^)σ^2(p+1)=ri2(1hii)2viT(XTX)1XTX(XTX)1viσ^2(p+1)=ri2hiiσ^2(1hii)21p+1=[riσ^2(1hii)]2hii1hii1p+1=ei2hii1hii1p+1

      So we have Cook's Distance Di=ei2hii1hii1p+1

      Rules of Thumb:

      6.4 Summary

       

      7. Model Selection

In many applications, there may be a rather large pool of explanatory variables measured along with the response, and we have very little idea of which ones are important.

In terms of model fitting:

Suppose that, given a full set of p explanatory variables, we wish to find the subset of k ≤ p variables that gives us the "best" model:

Two key ingredients of Model Selection:

      1. Selection Criterion (for comparing different models)
      2. Search Strategy (which model to fit?)

      7.1 Selection Criterion

      Some common criteria:

      Adjusted R2

R²adj = 1 − [SSE/(n−k−1)] / [SST/(n−1)] = 1 − (n−1)/(n−k−1) · (1 − SSR/SST) = 1 − (n−1)/(n−k−1) · (1−R²)   (k is the number of predictors)

      e.g. We have the following data:

n=35 | Model 1 (k=4) | Model 2 (k=8)
SSR | 31.2 | 32
SSE | 8.8 | 8
SST | 40 | 40

(166) Criterion | Model 1 | Model 2
R² | 31.2/40 = 0.78 | 32/40 = 0.80
R²adj | 1 − (8.8/30)/(40/34) = 0.751 | 1 − (8/26)/(40/34) = 0.738

We expect R² for the 8-predictor model to be larger than the one for the 4-predictor model; however, R²adj for the 4-predictor model is larger, and hence it is preferred.

      AIC (Akaike Information Criterion)

Let n be the sample size and q the number of parameters in the model (in MLR, q = k+2, consisting of k predictor coefficients, an intercept, and the variance σ²).

(167) AIC = −2[ln L(θ^) − q] = 2q − 2 ln L(θ^)

where L(θ^) is the likelihood evaluated at the parameter estimates θ^.

      BIC (Bayesian Information Criterion)

(168) BIC = q ln(n) − 2 ln L(θ^)
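Both criteria are available directly for lm objects (a sketch, for a fitted model fit):

```r
AIC(fit)   # 2q - 2 log L(theta-hat); q counts the coefficients plus sigma^2
BIC(fit)   # q log(n) - 2 log L(theta-hat)
```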

      Remark:

      7.2 Search Strategy

      All Subsets Regression

      With p predictors, each one may or may not enter the model, thus there are

      (169)j=0p(pj)=2p

      possible models to fit.

      In theory, we can fit these models and choose the "best" one based on some selection criterion (e.g. Radj2, AIC, BIC).

      This is appealing as we can for sure find the optimal model!

It can be computationally intensive, or even infeasible (when p is very large).

We may want to find a "good" (useful), not necessarily optimal, model in reasonable computation time. Many strategies focus on adding/removing one variable at a time sequentially; an R sketch using step() follows the three procedures below.

      Forward Selection (add one variable at a time)

      1. Start with model that only has intercept β0

      2. Fit p SLR models, i.e. yi=β0+β1xij+ϵi,j=1,,p

      3. Pick the best of p models with 1-predictor according to the chosen criterion, and add that variable xj to the model

      4. Fit p1 models containing xj and another variable

  1. if none of the p−1 models improves the criterion, STOP
  2. otherwise, pick the best of the p−1 models according to the criterion, so we now have 2 variables in the model
5. Continue adding one variable at a time until no additional variable improves the criterion

      Backward Elimination (remove one variable at a time)

1. Start with the full model that has all p predictors
2. Fit the p models resulting from removing one predictor from the regression (each model has p−1 predictors)
3. Pick the best of the p models according to the criterion and eliminate that predictor xj from the model
4. Fit the p−1 models that remove xj and one other variable from the model; continue until removing an additional variable no longer improves the criterion

      Forward-Backward Stepwise

      1. Start with forward selection

      2. If we have k variables in the model:

  1. Backwards: fit k models with k−1 variables; if any of these improves the criterion, remove the variable whose removal improves it the most
  2. Forwards: fit p−k models with k+1 variables; if any of these improves the criterion, add the variable that improves it the most
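A sketch of these three strategies using R's AIC-based step() function; dat is an assumed data frame containing the response y and the candidate predictors:

```r
null_fit <- lm(y ~ 1, data = dat)    # intercept-only model
full_fit <- lm(y ~ ., data = dat)    # model with all p predictors
step(null_fit, scope = formula(full_fit), direction = "forward")
step(full_fit, direction = "backward")
step(null_fit, scope = formula(full_fit), direction = "both")   # forward-backward stepwise
```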

       

      8. Building Predictive Model

Suppose our goal is to find the model that best predicts the outcome y (we do not care much about finding associations between the explanatory variables and the outcome, or about interpretation).

Previously we talked about selection criteria (R²adj, AIC, BIC) computed on fitted models, which assess the explanatory power of a model on the data used to fit it (training data). For predictive modeling, we want metrics/criteria that evaluate how well a model predicts the outcome on new data (validation data).

      Mean Squared Prediction Error (MSPE)

      (170)MSPE=1υi=1υ(yiy^i)2

Root mean squared error: RMSE = √MSPE

      Mean Absolute Error: 1υi=1υ|yiy^i|

      Ideally we have a very large dataset and split it into three mutually exclusive sets

Training set (n obs) | Validation set (υ obs) | Test set (t obs)
y1,…,yn | yn+1,…,yn+υ | yn+υ+1,…,yn+υ+t
x1,…,xn | xn+1,…,xn+υ | xn+υ+1,…,xn+υ+t

      where xi=(xi1,,xip)

      8.1 Different Datasets

      1. Training set

        • Fit candidate models, as many as you want.
        • Obtain estimates of coefficients β^
      2. Validation set

    • used to evaluate the performance of the models fitted on the training data
    • estimate the prediction error, e.g. MSPE, for each model
    • perform model selection, e.g. choose the model with the best (lowest) MSPE
      3. Test set

    • we don't get access to it until the very end; it is used for the final assessment of the chosen model

The MSPE on the validation set should approximate the MSPE on the test set, since neither set was used to fit the model.

      Implementation:

      Cross-validation with k-Folds

With data (Y, X1,…,Xp) for training/validation, the procedure is as follows (see the R sketch after this list):

      1. Divide data for training/validation into K roughly equal-sized sets (folds) randomly

      2. For CV fold k, use data in fold k for validation and train on the rest of data

      3. Estimate prediction error for a given model

    1. Fit the model K times, each time treating the data in fold k as the validation set (and the remaining data as the training set), and obtain an estimate of the prediction error

      (171) MSPE_k = (1/υk) Σ_{i=1}^{υk} (yi − y^i)²,  k = 1,…,K

      where yi and y^i are the observed and predicted responses of the ith subject in fold k.

        2. Calculate the average prediction error across all folds

      (172) (1/K) Σ_{k=1}^K MSPE_k
      4. Choose the "best" model as the one with the lowest average prediction error

5. Some common choices: K = 5, 10, or n, where n is the total number of observations (K = n is leave-one-out CV).
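A minimal self-contained K-fold CV sketch for comparing two candidate lm formulas; the simulated data and model formulas are placeholders for the real data and candidate models:

```r
set.seed(331)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)              # x2 is actually irrelevant
K    <- 5
fold <- sample(rep(1:K, length.out = n))        # random fold labels
cv_mspe <- function(form) {
  mspe <- numeric(K)
  for (k in 1:K) {
    fit  <- lm(form, data = dat[fold != k, ])          # train on the other K-1 folds
    pred <- predict(fit, newdata = dat[fold == k, ])   # predict fold k
    mspe[k] <- mean((dat$y[fold == k] - pred)^2)       # MSPE_k
  }
  mean(mspe)                                    # average prediction error across folds
}
c(model1 = cv_mspe(y ~ x1), model2 = cv_mspe(y ~ x1 + x2))  # choose the smaller
```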