After calculating β_{0} and β_{1} to determine the line that best fits the data, we want to quantify how well that line actually fits. It may be the best line, but how good is it?

### Sum of Squares

Looking at the graph of the data, we could say that without any modeling or regression at all, we would expect the y-value for any given x to be the mean y, ybar. Most of the observations, of course, will not equal the mean. We can measure how far the observations are from the mean by taking the difference between each y_{i} and ybar, squaring each difference, and summing the squares. We call this the total sum of squares, SST.

You probably remember that the sample variance we discussed much earlier in the course is just this sum of squares divided by n-1.
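As a quick sketch of that connection, here is SST and the variance computed by hand in Python on some made-up y-values (the data are hypothetical, chosen only for illustration), checked against the standard library's `statistics.variance`:

```python
import statistics

# Hypothetical observations
y = [2.0, 3.5, 4.0, 5.5, 7.0]
ybar = sum(y) / len(y)

# Total sum of squares: squared deviations of each y_i from ybar
sst = sum((yi - ybar) ** 2 for yi in y)

# Sample variance is SST divided by n - 1
variance = sst / (len(y) - 1)

assert abs(variance - statistics.variance(y)) < 1e-12
```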

The total sum of squares splits into two parts: the part explained by the regression (yhat-ybar) and the part by which each observation differs from the regression (y_{i}-yhat). Squaring each of these and summing them gives the regression sum of squares, SSR, and the error sum of squares, SSE, and for a least-squares fit SST = SSR + SSE.
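A minimal sketch of the decomposition, using made-up x and y values and the usual least-squares formulas for the slope and intercept (the data here are assumptions for illustration, not from the lecture):

```python
# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.5, 4.0, 5.5, 7.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least-squares slope (b1) and intercept (b0)
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# The three sum-of-squares terms
sst = sum((yi - ybar) ** 2 for yi in y)            # total
ssr = sum((yh - ybar) ** 2 for yh in yhat)         # explained by the regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # left unexplained

# The decomposition SST = SSR + SSE holds for a least-squares fit
assert abs(sst - (ssr + sse)) < 1e-9
```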

### Coefficient of Determination

The three sum-of-squares terms, SST, SSR, and SSE, don't tell us much by themselves. If our observations are measured in large units, these terms may be relatively large even when the scatter about the line is small. On the other hand, if the units of measurement are small, the sum-of-squares terms may be small even when the departure from linearity is great.

Therefore, the objective statistic we use to assess how well the regression fits the data is the ratio of the regression sum of squares, SSR, to the total sum of squares, SST. We call this statistic the coefficient of determination, r^{2}. Since SSR can be no larger than SST, r^{2} ranges from 0 to 1, with values near 1 indicating a good fit.

r^{2} = SSR / SST
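Continuing the sketch above (same hypothetical data, same least-squares formulas), the ratio is just one more line:

```python
# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 3.5, 4.0, 5.5, 7.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares fit and fitted values
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)
ssr = sum((yh - ybar) ** 2 for yh in yhat)

# Coefficient of determination: fraction of total variation explained
r2 = ssr / sst
print(round(r2, 4))  # → 0.9796
```

Here about 98% of the variation in y is explained by the regression, which is what an r^{2} close to 1 tells us.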

## Tuesday, February 26, 2008

### Lecture 7 - Coefficient of Determination
