Tuesday, January 8, 2008

Lecture 1 - Ch 3 - Descriptive Statistics

We now discuss some basic concepts we can use to quantitatively describe a set of values. Rather than describing the set by painting a picture with a graph, chart or table as we did in the previous blog post, we paint the picture in a more precise and analytical way using numbers.

One of the most fundamental ways to describe a set of values is to identify the central tendency of the set. We recognize that when looking at some sample, the statistic that we're interested in won't have the same value for every member of the set. But we're still interested in the central value for the set as a whole. We commonly call this central value the "average". However, there are several ways to calculate this central value which differ slightly:

(Arithmetic) Mean - commonly called the average
Median - the middle value of the sorted data (or the average of the two middle values if there is an even number of observations). The position of the median in the sorted data is (n+1)/2.
Mode - the most common value in the data set
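As a quick illustration of the three measures above, here is a minimal sketch using Python's standard `statistics` module. The data values are made up for the example and are not from the lecture.

```python
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)      # sum of the values divided by n
median = statistics.median(data)  # n is even, so this averages the two middle values
mode = statistics.mode(data)      # the most frequently occurring value

# Position of the median in the sorted data: (n + 1) / 2
median_position = (len(data) + 1) / 2

print(mean, median, mode, median_position)
```

With six observations the median position is 3.5, which is why the median falls halfway between the 3rd and 4th sorted values.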

We also introduce the following definitions:
Quartiles (Q1, Q2, Q3) - divide the sorted data into 4 equal parts in the same way the median divides it into 2. Again, if a quartile falls between two data points, its value is worked out from the 2 closest values. Q2 is always the same as the median.
Position of Q1 = (n+1)/4
Position of Q2 = (n+1)/2
Position of Q3 = 3(n+1)/4
Look up the values at those positions in the sorted data to get Q1, Q2 and Q3.

If the positions of Q1, Q2 and/or Q3 are not integers, interpolate between the two closest data points to get the value of the quartile.
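The position formulas and the interpolation step can be sketched in a few lines of Python. This is only one way to do it, assuming linear interpolation between the two neighbouring data points; the function names `value_at_position` and `quartiles` and the sample data are made up for this example.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based position, interpolating linearly
    between the two neighbouring data points when it is not an integer."""
    n = len(sorted_data)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)                # 1-based index of the lower neighbour
    fraction = position - lower
    if fraction == 0:
        return sorted_data[lower - 1]
    below = sorted_data[lower - 1]
    above = sorted_data[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Q1, Q2, Q3 at positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    s = sorted(data)
    n = len(s)
    return tuple(value_at_position(s, k * (n + 1) / 4) for k in (1, 2, 3))


# Hypothetical sample with n = 6, so the positions are 1.75, 3.5 and 5.25
q1, q2, q3 = quartiles([7, 15, 36, 39, 40, 41])
print(q1, q2, q3)
```

Note that other textbooks and software packages use slightly different position rules, so quartile values can differ a little from one tool to another.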

Then a couple of statistical concepts to describe the dispersion (variation) of a set of values:
Range = maxvalue - minvalue
Interquartile Range = Q3-Q1. Use box-and-whiskers diagram to represent this graphically.
Variance - the sum of the squared differences between each value and the mean, divided by n-1
Standard Deviation - square root of the variance
Coefficient of variation - (stddev / mean) * 100
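To make the dispersion measures concrete, here is a small sketch in Python. The data are invented for illustration. It assumes Python 3.8 or later for `statistics.quantiles`, whose default method, as far as I can tell, uses the same (n+1)-based positions as above.

```python
import math
import statistics

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Range = maxvalue - minvalue
data_range = max(data) - min(data)

# Interquartile range = Q3 - Q1
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Sample variance: sum of squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

# Coefficient of variation, as a percentage of the mean
coeff_var = std_dev / mean * 100

print(data_range, iqr, variance, std_dev, coeff_var)
```

The hand-computed variance should agree with `statistics.variance(data)`, which also divides by n-1.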

A few questions came up about the Variance:
1. Why do we square the deviation from the mean when calculating the variance?
Answer: So that positive and negative deviations don't cancel each other out. Squaring ensures that every term is non-negative.

Follow-up question: Why not just take the absolute value?
Answer: Because we'll often want to differentiate this function (for example, to find where it is smallest), and the absolute value function is not differentiable everywhere: it has a "kink" at 0 where the derivative is undefined.

2. Why do we divide by n-1 and not n?
Answer: If we were looking at the entire population, we would indeed divide by N. But with a sample, we divide by n-1. I'm still not so clear on why that is.
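One way to get a feel for the n-1 question is to try it on a tiny example. The sample mean is calculated from the same values we measure the deviations against, so the deviations come out a little smaller on average than they would around the true population mean. Dividing by n-1 instead of n compensates for that. The sketch below is an illustration rather than a proof: it takes a made-up population of four values, lists every possible sample of size 2 (drawn with replacement), and averages the two versions of the variance over all of those samples.

```python
from itertools import product

# Hypothetical tiny population, chosen only for illustration
population = [1, 2, 3, 4]
N = len(population)
mu = sum(population) / N
population_variance = sum((x - mu) ** 2 for x in population) / N

n = 2
# Every possible sample of size n, drawn with replacement (16 samples)
samples = list(product(population, repeat=n))


def sum_of_squares(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


# Average of the two candidate formulas over all possible samples
avg_dividing_by_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)

print(population_variance, avg_dividing_by_n, avg_dividing_by_n_minus_1)
```

In this example, dividing by n gives an average that is too small, while dividing by n-1 averages out to exactly the population variance.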
