Sunday, February 10, 2008

Lecture 5 - Ch 6b - Checking for Normality

The cumulative distribution function (CDF) gives the area under the probability density function from -infinity to x, i.e. the probability of observing a value of x or less. For the standard normal distribution, the CDF is an s-shaped curve that rises from 0 on the left to 1 on the right.
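To make the definition concrete, here is a minimal Python sketch (not something from the lecture, which uses Minitab) that evaluates the standard normal CDF at a few points. The function name standard_normal_cdf is my own; the formula uses the error function from Python's math module.

```python
import math

def standard_normal_cdf(x: float) -> float:
    """Area under the standard normal density from -infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Walk across the curve: values start near 0, pass 0.5 at x = 0, and approach 1.
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}  CDF = {standard_normal_cdf(x):.4f}")
```

The printed values climb slowly in the tails and quickly near the middle, which is where the s-shape comes from.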
You can use the CDF to check a data set for normality, i.e. how closely it fits the normal distribution, by examining how closely the cumulative proportion of the data tracks the normal CDF. (Question: why compare to the CDF and not the probability density function (PDF) itself?) Minitab has two functions that support checking against the CDF. Both are under the Graph menu.

The first is the Empirical CDF option, which produces a graph like the one to the right. The smooth s-shaped line is the normal distribution CDF. The sample dataset (rubber.mtw) is represented by the step function overlaid on it. It appears to fit the normal distribution relatively well.
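For anyone without Minitab handy, the step function is easy to build by hand. The sketch below is only an illustration: the sample values are made up (they are not the rubber.mtw data), the helper name empirical_cdf is mine, and I am assuming the smooth curve is a normal CDF fitted with the sample mean and standard deviation.

```python
from statistics import NormalDist, mean, stdev

# Made-up measurements, for illustration only (not the rubber.mtw data).
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.5, 5.6, 5.2, 4.9, 5.4]

def empirical_cdf(data):
    """Return (x, height) pairs: the fraction of points <= each sorted value.

    Assumes no tied values, so each step rises by exactly 1/n.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]

# Assumption: the smooth curve is a normal CDF using the sample mean and stdev.
fitted = NormalDist(mean(sample), stdev(sample))

for x, step_height in empirical_cdf(sample):
    print(f"x = {x:.1f}  empirical = {step_height:.2f}  normal = {fitted.cdf(x):.2f}")
```

If the two columns stay close to each other all the way up, the step function hugs the s-shaped curve, which is what the graph shows visually.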


The second is the Probability Plot option, which rescales the y-axis so that the normal CDF becomes a straight line rather than an s-shaped curve, and which can also show a confidence interval around it. The graph of the same data as above with a 95% confidence interval is shown at right.
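The axis trick can be sketched in a few lines of Python: push each point's cumulative fraction through the inverse of the standard normal CDF, and normal data should then fall near a straight line. This is a rough sketch under my own assumptions, not Minitab's exact method: the data are made up, the helper name probability_plot_points is mine, and the plotting position (i - 0.5)/n is just one common convention. It does not attempt the confidence interval.

```python
from statistics import NormalDist, mean, stdev

# Made-up measurements, for illustration only (not the rubber.mtw data).
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.5, 5.6, 5.2, 4.9, 5.4]

def probability_plot_points(data):
    """Pair each sorted value with the standard normal quantile of its plotting position."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # (i - 0.5) / n stays strictly between 0 and 1, where inv_cdf is defined.
    # Assumption: this plotting position is one convention among several.
    return [(x, std_normal.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

m, s = mean(sample), stdev(sample)
for x, z in probability_plot_points(sample):
    # On the rescaled axis, the fitted normal CDF is the straight line (x - m) / s.
    print(f"x = {x:.1f}  rescaled axis = {z:+.2f}  fitted line = {(x - m) / s:+.2f}")
```

The closer the two columns agree, the closer the plotted points sit to the straight line.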

The question that remains is how we quantify how well a dataset fits the normal distribution - or any distribution for that matter. I mean, it's nice to eyeball either of the two graphs above and say "yeah, that looks pretty close", but that's not very precise. If the data represented something in which I had a business or financial stake, I would want to know more precisely how well the normal distribution model fits the data.
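One candidate number, offered as a sketch and not necessarily the measure this course or Minitab goes on to use, is the largest vertical gap between the data's step function and the fitted normal CDF (the distance used by the Kolmogorov-Smirnov test). The data below are made up and the helper name max_cdf_gap is mine. The code only computes the distance; it does not turn it into a p-value.

```python
from statistics import NormalDist, mean, stdev

# Made-up measurements, for illustration only (not the rubber.mtw data).
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.5, 5.6, 5.2, 4.9, 5.4]

def max_cdf_gap(data):
    """Largest vertical distance between the data's step function and a fitted normal CDF."""
    ordered = sorted(data)
    n = len(ordered)
    # Assumption: the normal model uses the sample mean and standard deviation.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    gap = 0.0
    for i, x in enumerate(ordered):
        model = fitted.cdf(x)
        # The step function jumps from i/n to (i+1)/n at x, so check both sides.
        gap = max(gap, abs((i + 1) / n - model), abs(model - i / n))
    return gap

print(f"largest gap = {max_cdf_gap(sample):.3f}")
```

A gap near 0 means the step function hugs the curve; a badly skewed dataset would produce a much larger value.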
