Friday, January 11, 2008

More on covariance and correlation coefficient

So, what do the covariance and correlation coefficient mean?

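In theory, the covariance is a measure of how two data sets vary together: its sign shows the direction of the linear relationship between them. (How closely the data follow a line is what the correlation coefficient, discussed below, measures.)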

At a high level:
If covar(x,y) > 0 then we can say there's a positive correlation. I.e., as x increases, y tends to increase.
If covar(x,y) < 0 then there's a negative correlation. I.e., as x increases, y tends to decrease.
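
The two cases above can be sketched in a few lines of Python. This is only an illustration: the `covariance` helper, the choice of the sample formula (dividing by n - 1), and the data values are all my own assumptions, not something from the discussion.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (assumes the n - 1 form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum the products of the paired deviations from each mean.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up data: one set that tends to rise with x, one that tends to fall.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 6]
falling = [6, 4, 5, 4, 2]

cov_rising = covariance(x, rising)
cov_falling = covariance(x, falling)
print(cov_rising > 0)   # positive covariance: y tends to increase with x
print(cov_falling < 0)  # negative covariance: y tends to decrease with x
```

Each product in the sum is positive when x and y are on the same side of their means and negative when they are on opposite sides, which is where the sign of the result comes from.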

It was noted that the covariance isn't very useful on its own because it's influenced by the magnitude of the values. For example, if the observations in both sets are all multiplied by 10 (which could happen if you simply change the units of measure from, say, centimeters to millimeters), the covariance becomes 100 times larger even though the actual correlation of the data has not changed at all.
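
Here is a quick check of that factor of 100, as a minimal sketch. The `covariance` helper (sample form, dividing by n - 1) and the data are hypothetical; any data set with a nonzero covariance should behave the same way.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length lists (assumes the n - 1 form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up observations, then the same observations with every value times 10.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]
x10 = [10 * v for v in x]
y10 = [10 * v for v in y]

# Each deviation grows by 10, so each product of deviations grows by 10 * 10.
ratio = covariance(x10, y10) / covariance(x, y)
print(ratio)
```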

Note: We didn't discuss why, on an intuitive level, the formula for the covariance is a measure of the linear correlation of the data sets. I'll leave that for further thought later on.

In any case, the covariance is standardized by dividing it by the product of the standard deviations of the two data sets, which gives the correlation coefficient. (Again, why this works on an intuitive level is not clear to me. But it does!) Covariance is therefore a pretty useless statistic by itself, but it is part of the calculation of the correlation coefficient, which is very useful.
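
One way to see the standardization at work is to compute it directly. This is a sketch under my own assumptions: the helper names, the sample (n - 1) formulas, and the data are all made up for illustration. The point it tries to show is that multiplying a data set by a positive constant multiplies both the covariance and that set's standard deviation by the same constant, so the factor cancels in the ratio.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (assumes the n - 1 form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

def std_dev(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(covariance(xs, xs))

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 6.0]
x10 = [10 * v for v in x]
y10 = [10 * v for v in y]

r = correlation(x, y)
r_scaled = correlation(x10, y10)
print(r, r_scaled)  # the two values should agree up to rounding
```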

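The range of values for the correlation coefficient is from -1 to 1. A correlation coefficient of 0 indicates no linear correlation, for example because the data are wildly scattered. It does not by itself show that the data sets are independent: a strong nonlinear relationship (such as a symmetric curve) can also give 0. If the data lie exactly on a horizontal or vertical line, one of the standard deviations is 0, so the correlation coefficient is undefined rather than 0. The closer the correlation coefficient is to 1 or -1, the stronger the linear relationship between the data sets.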
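
Two edge cases are worth a small sketch, since a correlation coefficient of 0 is easy to over-read. The helpers, the data, and the choice to return `None` when a standard deviation is zero are my own assumptions. In the first case y is completely determined by x (y = x squared) yet the coefficient comes out as 0; in the second, y is constant, so the division is by zero and the coefficient is not defined.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (assumes the n - 1 form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

def correlation(xs, ys):
    """Correlation coefficient, or None when a standard deviation is zero."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        return None  # undefined: would be a division by zero
    return covariance(xs, ys) / (sd_x * sd_y)

x = [-2, -1, 0, 1, 2]

# y depends entirely on x, but not linearly.
parabola = [v * v for v in x]
r_parabola = correlation(x, parabola)

# y is the same for every x: the points lie on a horizontal line.
flat = [3, 3, 3, 3, 3]
r_flat = correlation(x, flat)

print(r_parabola)  # zero linear correlation despite total dependence
print(r_flat)      # None: the coefficient is undefined here
```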
