Monday, February 25, 2008

Lecture 7 - Simple Linear Regression

Linear Regression essentially means creating a linear model that describes the relationship between two variables.

Our type of linear regression is often referred to as simple linear regression. The "simple" part refers to the fact that we don't consider other factors in the relationship - just the two variables. When we model how several variables may determine another variable, it's called multiple regression - the topic for a more advanced course (or chapter 13 in our text).

For example, we may think that the total sales at various stores are proportional to the number of square feet of space in each store. If we collect data from a number of stores and plot them in an XY scatter plot, we would probably find that the data points don't lie on a perfectly straight line. However, they may be "more or less" linear to the naked eye. Linear regression involves finding a single line that approximates the relationship. With this line, we can estimate the expected sales at a new store, given the number of square feet it will have.
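To make this concrete, here is a minimal sketch in Python (using numpy and matplotlib, with made-up store data - none of these numbers come from the lecture) that plots sales against square footage and fits a line through the points:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: square footage and annual sales (in $1000s) for ten stores
sqft  = np.array([1200, 1500, 1800, 2000, 2300, 2600, 3000, 3200, 3600, 4000])
sales = np.array([ 310,  380,  420,  450,  540,  570,  640,  650,  730,  800])

# Fit a degree-1 polynomial (a straight line); polyfit returns the slope first
b1, b0 = np.polyfit(sqft, sales, 1)
print("slope:", b1, "intercept:", b0)

# Plot the points and the fitted line
plt.scatter(sqft, sales)
plt.plot(sqft, b0 + b1 * sqft)
plt.xlabel("square feet")
plt.ylabel("sales ($1000s)")
plt.show()

# Use the line to estimate sales for a new 2800-square-foot store
print("estimated sales:", b0 + b1 * 2800)
```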

In mathematical terms, linear regression means finding values for β0 and β1 in the equation

y = β0 + β1x
such that the resulting equation fits the data points as closely as possible.

The equation above may look more familiar to you in this form:
y = mx + b
That's the form we learned in algebra. m is the slope and b is the y-intercept. Similarly, in our statistical form, β1 is the slope and β0 is the y-intercept.

How Close is Close?
We said that we want to find a line that fits the data "as closely as possible". How close is that? Well, for any given β0 and β1, we can calculate how far off we are by taking each x-value, computing the linear estimate according to our regression equation, and comparing that to the actual observed y-value. The difference is the error between our regression estimate and the observation. Clearly, we want to find the line that minimizes the total error.

Minimizing the total error is done in practice by minimizing the sum of the squares of the errors. If we used the raw error terms, and not their squares, positive and negative errors would cancel each other out. We don't use the absolute value of the error term because we will need to differentiate it, and the absolute value function is not differentiable at 0.
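Here is a sketch of the quantity being minimized (again in Python, on the same made-up store data; the β0 and β1 below are just a guess, not the best-fitting values):

```python
import numpy as np

# Same made-up store data as above
sqft  = np.array([1200, 1500, 1800, 2000, 2300, 2600, 3000, 3200, 3600, 4000])
sales = np.array([ 310,  380,  420,  450,  540,  570,  640,  650,  730,  800])

# A candidate line - these coefficients are just a guess, not the least-squares fit
b0, b1 = 100.0, 0.17

predicted = b0 + b1 * sqft        # what the candidate line predicts for each store
errors    = sales - predicted     # observed minus predicted, one error per store
sse       = np.sum(errors ** 2)   # the quantity we want to make as small as possible

print("errors:", errors)
print("sum of squared errors:", sse)
```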

Generating the regression coefficients β0 and β1 for the linear model by minimizing the sum of the squares of the errors is known as the least-squares method.
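Working out that minimization with calculus gives closed-form formulas for the coefficients: β1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and β0 = ȳ − β1·x̄, where x̄ and ȳ are the sample means. The short sketch below just applies those formulas to the same made-up store data as above; it should give the same answer as a built-in routine like numpy's polyfit:

```python
import numpy as np

# Same made-up store data as above
sqft  = np.array([1200, 1500, 1800, 2000, 2300, 2600, 3000, 3200, 3600, 4000])
sales = np.array([ 310,  380,  420,  450,  540,  570,  640,  650,  730,  800])

x_bar, y_bar = sqft.mean(), sales.mean()

# Least-squares slope and intercept from the closed-form formulas
b1 = np.sum((sqft - x_bar) * (sales - y_bar)) / np.sum((sqft - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print("slope (b1):", b1)
print("intercept (b0):", b0)
```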
