GSB420 - Business Statistics: March 2008

Friday, March 28, 2008

ECO 509 - Spring Quarter 2008

Next quarter (Spring 2008) I'll be taking ECO 509 - Business Conditions Analysis (aka Macroeconomics) with Professor Jaejoon Woo (Wed night section).

You can find the blog for ECO 509 at http://eco509.blogspot.com.

Friday, March 14, 2008

Final Exam Recap

Well, it's finally over! Here's a recap of my random thoughts on the final exam:

In general, the final was harder than I expected. I'm pretty sure I did well, but it was definitely harder than the midterm and harder than I thought it would be.
He hit us with 10 straight questions from Chapter 12 right out of the block! I expected to ease into it. The previous quarter's final was pretty linear - starting at chapter 7, then chapter 8, etc and not hitting chapter 12 until the last questions. I knew chapter 12 would be a big chunk of the exam, but I didn't expect him to lead off with it.
I think we were all pretty stumped by that one question (was it 9 or 10?) that had us calculate the intercept, b₀. I kept coming up with 100 for an answer, but it wasn't one of the choices. I saw Azzam go up and ask about something and figured it might be that. So I went up and asked also. I think we all breathed a sigh of relief when he made the change to the last two choices. BTW, I think he could have just changed Σ Y to be 100 and that would have made the intercept 40, which was one of the choices.
Probability of Type I errors? Sheesh! I didn't see that coming. The answer is that it's alpha, α.
I was surprised at the "regular" question that had us work out the regression from the raw data. I'm almost positive I made some arithmetic mistakes in calculating the Σ(X_i-Xbar)(Y_i-Ybar) or Σ(X_i-Xbar)² or one of the other calculations.
My answer for the "west nile" question was that we do not reject the null hypothesis, H₀ that the average # of cases is different than 3.
On the last question, part A, you had to assume (or somehow know through ESP or divine vision) that the level of confidence to use is 95%. You can calculate the t-score easily enough (I think I got something like 2.414), but to draw any conclusions, you need to calculate the critical value for t_{α/2, n-1}, which requires the level of confidence.
I knew there would be a Durbin-Watson question on the exam! My answer for that one was that there was not evidence of autocorrelation since the DW stat was greater than d_U and less than 2+d_L. I'm not sure I was using the right d_U because I wasn't sure if I should use α or α/2 on the DW table. I used α (which I think was 0.05).
As predicted, there were a few questions with Minitab output. No big surprises there.
In at least 2 of the questions (one multiple choice and one "regular"), he gave us the variance rather than the standard deviation. Tricky! I almost fell for that one.
One question asked us to determine sample size, given a confidence level, margin of error and standard deviation (or maybe variance). Using the formula, you calculate n=74.3 (something like that - not a round number). You had to know to round up, not truncate the decimal.
Higher confidence levels need wider intervals. You had to figure he would ask about that.
If p is low, H₀ must go! You could use that to answer one of the multiple choice questions on the p-value approach in hypothesis testing.

All in all, not a terrible test. Just harder than last quarter's final, IMHO. I'm anxious to see how I did. I think he said he may have them graded by Monday. The multiple choice is easy to grade. I think he gives the "regular" questions to a Teaching Assistant. HW5 grades have not been posted to Blackboard yet.

Thursday, March 13, 2008

Final Exam Study Guide - Last Minute Notes

Just a couple of last minute thoughts:

There were no example problems that used the Durbin-Watson statistic. That doesn't mean it won't be on the exam! The D-W table is part of the formula sheet, so I'm expecting a question on it.
Remember that if DW is less than d_L, there's evidence of positive autocorrelation. If it's between d_L and d_U, it's inconclusive. If it's between d_U and 4-d_U, there's no evidence of autocorrelation. I doubt we'll be asked about the upper range (above 4-d_U), which mirrors the lower one for negative autocorrelation.
Review how to read those Minitab outputs! There's bound to be at least one on the exam. Remember that in Minitab output, SS stands for Sum of Squares. S stands for S_YX. The Coefficient of the Intercept is b₀. The Coefficient of the other (independent) variable is b₁. SE stands for standard error.
There weren't any practice problems on the confidence interval for mean/individual Y. I would expect one of those since the formulas are on the sheet. He'll probably give us h_i and S_YX. Remember to use n-2 when looking up the value for t in this case.
Remember that most of the answers can be derived from the data in the question and the formula sheet. You're not really expected to memorize very much. If you can't figure it out, look at the formula sheet.
Remember to bring a copy of the formula sheet, a calculator and a #2 pencil. Don't laugh! I forgot a pencil for the midterm and ran out to Walgreen's a half hour before the exam.
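
To keep the Durbin-Watson ranges straight, here's a rough Python sketch of the decision rule as I understand it from the standard table layout. The function name and the d_L/d_U values in the example are my own picks for illustration, not numbers from the exam or the formula sheet.

```python
def durbin_watson_decision(d, d_lower, d_upper):
    """Classify a Durbin-Watson statistic d (between 0 and 4).

    d_lower and d_upper are the d_L and d_U values read from the D-W table.
    """
    if d < d_lower:
        return "positive autocorrelation"
    if d <= d_upper:
        return "inconclusive"
    if d < 4 - d_upper:
        return "no evidence of autocorrelation"
    if d <= 4 - d_lower:
        return "inconclusive"
    return "negative autocorrelation"


# Illustrative table values only (assumed, not from the exam).
d_L, d_U = 1.08, 1.36
for d in (0.9, 1.2, 2.0, 2.8, 3.5):
    print(d, "->", durbin_watson_decision(d, d_L, d_U))
```

The upper half of the scale is just the mirror image of the lower half around 2, which is why only d_L and d_U appear in the table.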

Wednesday, March 12, 2008

Final Exam Study Guide - Practice Questions - Part 2

In this post, I'll go over the answers to the "regular" questions from the last quarter's final. I'll also note which chapter the question is from.

Question 1 (Chapter 12): You would like to estimate the income of a person based on his age. The following data shows the yearly income (in $1,000) and age of a sample of seven individuals.
Income (in $1,000) Age
20                 18
24                 20
24                 23
25                 34
26                 24
27                 27
34                 27
a. Develop the least squares regression equation.
b. Estimate the yearly income of a 30-year-old individual.

Answer:
a. In order to calculate b₀ and b₁, we need to first calculate the mean of X (age) and Y (income). For xbar, I calculated 24.71 and for ybar, I got 25.71. To calculate b₁, we need to calculate x_i-xbar and y_i-ybar for each i:

Income  Age  x_i-xbar  y_i-ybar  (x_i-xbar)(y_i-ybar)  (x_i-xbar)²

20      18   -6.71     -5.71     38.31                 45.02
24      20   -4.71     -1.71      8.05                 22.18
24      23   -1.71     -1.71      2.92                  2.92
25      34    9.29     -0.71     -6.60                 86.30
26      24   -0.71      0.29     -0.21                  0.50
27      27    2.29      1.29      2.95                  5.24
34      27    2.29      8.29     18.98                  5.24

The sum of the (x_i-xbar)(y_i-ybar) is 64.4. The sum of the (x_i-xbar)² is 167.4. Therefore, b₁ is 64.4/167.4 = 0.38.
We can also calculate b₀ = ybar - b₁xbar = 25.71 - (0.3848)(24.71) = 16.2 (using the unrounded slope; plugging in the rounded 0.38 gives 16.3).
Therefore, the regression equation is y = 16.2 + 0.38x.

b. Use the equation to estimate y for x=30:
y = 16.2 + 0.38(30) = 27.6, which is $27,600 annual income.
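
If you want to check the arithmetic without a calculator, here's a small Python sketch of the same least squares computation. The variable names and the `predict` helper are my own; the data is straight from the question.

```python
# Age (x) and income in $1,000 (y) from the question's table.
ages = [18, 20, 23, 34, 24, 27, 27]
incomes = [20, 24, 24, 25, 26, 27, 34]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(incomes) / n

# Sum of cross-products and sum of squared deviations of x.
ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, incomes))
ssx = sum((x - x_bar) ** 2 for x in ages)

b1 = ssxy / ssx            # slope
b0 = y_bar - b1 * x_bar    # intercept


def predict(age):
    """Point estimate of income (in $1,000) for a given age."""
    return b0 + b1 * age


print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, estimate at age 30 = {predict(30):.2f}")
```

Because the code carries full precision, its estimate at age 30 comes out a little higher (about 27.75) than the 27.6 I got by hand with the rounded coefficients.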

Question 2 (Chapter 12): Below you are given a partial computer output based on a sample of 8 observations, relating an independent variable (x) and a dependent variable (y).
              Coefficient Standard Error
Intercept     13.251      10.77
X             0.803       0.385

Analysis of Variance
SOURCE            SS
Regression
Error (Residual)  41.674
Total             71.875
a. Develop the estimated regression line.
b. At α = 0.05, test for the significance of the slope.
c. Determine the coefficient of determination (R²).

Answer:
a. This one's a lot easier than #1. No calculations necessary, just the ability to pull b₀ and b₁ out of the computer output. They're the coefficients of the intercept and X. So the regression equation becomes:
y = 13.251 + 0.803x

b. The t score for the slope is t = b₁/s_b₁.
From part a, we know that b₁ = 0.803.
s_b₁ is given in the computer output as the standard error of x = 0.385.
Therefore, t = 0.803/0.385 = 2.086.
Looking at the t distribution table for n-2=6 and α/2=0.025, we find a critical t value of 2.447. Since the t score of 2.086 is less than 2.447, we do not reject the null hypothesis that there is no linear relationship.

c. r² = SSR/SST. But SSR was conveniently removed from the computer output. We need to calculate it from SSR = SST-SSE = 71.875-41.674 = 30.201.
Therefore, r² = 30.201/71.875 = 0.42.
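
Here's the same work as a quick Python sketch. The numbers come from the computer output in the question; the critical value is typed in from the t table rather than computed, and the variable names are mine.

```python
# Values read from the partial computer output.
b1 = 0.803        # coefficient of X (slope)
se_b1 = 0.385     # standard error of the slope
sse = 41.674      # error (residual) sum of squares
sst = 71.875      # total sum of squares

# Critical value from the t table: n-2 = 6 df, 0.025 in each tail.
t_crit = 2.447

t_stat = b1 / se_b1
reject_h0 = abs(t_stat) > t_crit   # two-tailed test of "no linear relationship"

ssr = sst - sse                    # the regression SS that was left out
r_squared = ssr / sst

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}, r^2 = {r_squared:.2f}")
```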

Question 3 (Chapter 9): A sample of 81 account balances of a credit company showed an average balance of $1,200 with a standard deviation of $126.
a. Formulate the hypotheses that can be used to determine whether the mean of all account balances is significantly different from $1,150.
b. Let α = .05. Using the critical value approach what is your conclusion?

Answer:
a. Since we want to know if the mean is "significantly different" from $1,150, the null hypothesis is that it is $1,150.
H₀: μ = 1150
H₁: μ ≠ 1150

b. Since we don't have the population standard deviation, use the t test statistic.
t = (xbar-μ₀)/(s/√n)
= (1200-1150)/(126/√81)
= 50/14
= 3.57
The critical value for t for 80 degrees of freedom and α/2=0.025 is 1.990.
Since the t-value=3.57 is greater than the critical value of 1.990, we reject H₀ and conclude that the mean is significantly different from $1,150.
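
A minimal Python sketch of the critical value approach for this question. The helper name `t_statistic` is my own, and the critical value is copied from the t table rather than computed.

```python
import math


def t_statistic(x_bar, mu0, s, n):
    """t test statistic for a sample mean when sigma is unknown."""
    return (x_bar - mu0) / (s / math.sqrt(n))


t_stat = t_statistic(x_bar=1200, mu0=1150, s=126, n=81)

# From the t table: 80 degrees of freedom, 0.025 in each tail.
t_crit = 1.990

# Two-tailed test: reject when the statistic is beyond either critical value.
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```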

Question 4 (Chapter 8): A statistician selected a sample of 16 accounts receivable and determined the mean of the sample to be $5,000 with a sample standard deviation of $400. He reported that the sample information indicated the mean of the population ranges from $4,739.80 to $5,260.20. He neglected to report what confidence level (1-α) he had used. Based on the above information, determine the confidence level that was used.

Answer: The statistician is reporting a confidence interval of 5000 ± 260.20. He only mentions the sample standard deviation (not the population std dev), so he must be using the t-distribution and the formula: xbar ± t_{n-1, α/2}(s/√n).

So we have:
260.2 = t(s/√n)
260.2 = t (400/√16)
260.2 = 100t
t = 2.602

We look to the t distribution table and find that t_{15, α/2} = 2.602 is true for α/2 = 0.01. So α = 0.02 and the confidence level is 1-0.02 = 0.98 = 98%.
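
Working backward like this is easy to sketch in Python. The little dictionary below is just a few entries I copied from the 15-degrees-of-freedom row of a t table (upper-tail area mapped to critical value); it's a stand-in for the table lookup, not a real inverse t function.

```python
import math

# Values from the question.
lower, upper = 4739.80, 5260.20
s, n = 400, 16

margin = (upper - lower) / 2             # half-width of the interval
t_value = margin / (s / math.sqrt(n))    # solve margin = t * s/sqrt(n) for t

# A few entries from the t table row for 15 degrees of freedom:
# upper-tail area -> critical t value.
t_table_df15 = {0.10: 1.341, 0.05: 1.753, 0.025: 2.131, 0.01: 2.602, 0.005: 2.947}

# Pick the tail area whose critical value is closest to our t.
tail_area = min(t_table_df15, key=lambda a: abs(t_table_df15[a] - t_value))
confidence = 1 - 2 * tail_area

print(f"t = {t_value:.3f}, alpha/2 = {tail_area}, confidence = {confidence:.0%}")
```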

Question 5 (Chapter 12): The director of graduate studies at a college of business would like to predict the grade point index (GPI) of students in an MBA program based on their GMAT scores. A sample of 20 students is selected. The result of the regression is summarized in the following Minitab output.
Regression Analysis: GPI versus GMAT

The regression equation is
GPI = 0.300 + 0.00487 GMAT

Predictor         Coef         SE Coef         T
Constant        0.3003          0.3616      0.83
GMAT         0.0048702           [ N ]     [ M ]

S = 0.155870 R-Sq = 79.8%

Analysis of Variance

Source             DF         SS         MS         F         P
Regression          1     1.7257     1.7257     71.03     0.000
Residual Error     18     0.4373     0.0243
Total              19     2.1631
a) Given that Σ(X_i-xbar)² = 72757.2 , where X = GMAT, compute N.
b) Compute M and interpret the result. In particular do we reject the underlying hypothesis (which hypothesis) or not?

Answer:
a. N is what we usually call the standard error of the slope, s_b₁. (This is the hardest part of the problem - figuring out what's missing in the Minitab output.) From the formula sheet, we know:
s_b₁ = S_YX/√SSX

We're given SSX, but we need S_YX. Minitab reports it directly as S = 0.155870, or we can calculate it from the formula:
S_YX = √(SSE/(n-2)).

We have SSE from the output: SSE = 0.4373. So,
S_YX = √(0.4373/18) = 0.156

Therefore,
s_b₁ = 0.156/√72757.2 = 0.156/269.7 = 0.00058

b. M is the t-score for the slope which is given by:
t = b₁/s_b₁
= 0.0048702/0.00058
= 8.4

The critical value for t for 18 degrees of freedom and α/2=0.005 is 2.878. Therefore, since our t-score is greater than the critical t-value, we would reject the null hypothesis, H₀: β₁=0, and conclude that there is a significant linear relationship between GMAT and GPI.
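
Here's a short Python sketch that fills in the two blanks from the Minitab output. The inputs are the values printed in the question; the variable names are mine. As a sanity check, the t-score for the slope should match the square root of the F statistic in the ANOVA table (√71.03 ≈ 8.43).

```python
import math

# Values from the question and the Minitab output.
n = 20
sse = 0.4373          # Residual Error SS
ssx = 72757.2         # sum of (X_i - xbar)^2, given in part a
b1 = 0.0048702        # GMAT coefficient
f_stat = 71.03        # F from the ANOVA table

std_err_estimate = math.sqrt(sse / (n - 2))      # Minitab's "S"
se_slope = std_err_estimate / math.sqrt(ssx)     # the missing [ N ]
t_slope = b1 / se_slope                          # the missing [ M ]

print(f"S = {std_err_estimate:.4f}, N = {se_slope:.5f}, M = {t_slope:.2f}")
print(f"sqrt(F) = {math.sqrt(f_stat):.2f}")
```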

Monday, March 10, 2008

Final Exam Study Guide - Practice Questions

Question 1: A population has a standard deviation of 16. If a sample of size 64 is selected from this population, what is the probability that the sample mean will be within ±2 of the population mean?
a. 0.6826
b. 0.3413
c. -0.6826
d. Since the mean is not given, there is no answer to this question.

Answer:
We need to calculate the z-score for the ±2 interval. In order to do that, we need the standard error of the mean, σ/√n = 16/sqrt(64) = 2.
So when we're asked for the probability that the sample mean is ±2 from the population mean, it's asking for the probability of the mean being within 1 standard error. Even without looking it up in the table, we know that the answer must be A - both from our experience that 68% of the data fall within 1 std dev, and because the other answers are unreasonable.
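
If you don't trust the 68% rule of thumb, here's a quick Python check. The `normal_cdf` helper is my own, built on the standard library's `math.erf`, and stands in for the cumulative normal table.

```python
import math


def normal_cdf(z):
    """Cumulative standard normal probability P(Z < z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


sigma, n = 16, 64
margin = 2

standard_error = sigma / math.sqrt(n)    # 16 / 8 = 2
z = margin / standard_error              # how many standard errors is +/-2?

prob_within = normal_cdf(z) - normal_cdf(-z)
print(f"z = {z}, P(within +/-{margin}) = {prob_within:.4f}")
```

The table answer of 0.6826 differs from the computed value only in the fourth decimal place, because the table entries are rounded.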

Question 2: The fact that the sampling distribution of sample means can be approximated by a normal probability distribution whenever the sample size is large is based on the
a. central limit theorem
b. fact that we have tables of areas for the normal distribution
c. assumption that the population has a normal distribution
d. None of these alternatives is correct.

Answer: There's not much to say here. The statement is essentially the definition of the Central Limit Theorem, see page 213. As a rule of thumb, a sample size of at least 30 is enough for this to hold for most population distributions.

Question 3: A population has a mean of 53 and a standard deviation of 21. A sample of 49 observations will be taken. The probability that the sample mean will be greater than 57.95 is
a. 0
b. .0495
c. .4505
d. .9505

Answer: Find the z-score of this mean: (57.95-53)/(21/sqrt(49)) = 4.95/3 = 1.65. So the question becomes: What's the probability of the sample mean being more than 1.65 standard errors above the population mean? You know it can't be much, but it's greater than 0. Answer B is the only logical one. Of course, when we go to the cumulative normal distribution table, we find that 1.65 has 0.9505 area, so the area to the right of 1.65 is 0.0495.
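
Another way to see this one is to model the sampling distribution of the mean directly. Here's a sketch using Python's built-in `statistics.NormalDist` (available in Python 3.8 and later) in place of the z table; the variable names are mine.

```python
import math
from statistics import NormalDist

mu, sigma, n = 53, 21, 49
threshold = 57.95

# The sample mean is approximately normal with the same mean
# and a standard deviation equal to the standard error.
standard_error = sigma / math.sqrt(n)
sampling_dist = NormalDist(mu=mu, sigma=standard_error)

z = (threshold - mu) / standard_error
prob_greater = 1 - sampling_dist.cdf(threshold)

print(f"z = {z:.2f}, P(xbar > {threshold}) = {prob_greater:.4f}")
```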

Question 4: Suppose a sample of n = 50 items is drawn from a population of manufactured products and the weight, X, of each item is recorded. Prior experience has shown that the weight has a probability distribution with mu = 6 ounces and sigma = 2.5 ounces. Which of the following is true about the sampling distribution of the sample mean if a sample of size 50 is selected?
a) The mean of the sampling distribution is 6 ounces.
b) The standard deviation of the sampling distribution is 2.5 ounces.
c) The shape of the sample distribution is approximately normal.
d) All of the above are correct.

Answer:
A is true. The mean of any single sample is not necessarily equal to the population mean, but the mean of the sampling distribution (over all possible samples) equals the population mean, 6 ounces, for any sample size.
B is not true. The standard deviation of the sampling distribution is the standard error, σ/√n = 2.5/√50 ≈ 0.35 ounces, which is smaller than the population standard deviation by a factor of 1/√n.
C is not true. The central limit theorem tells us that when the sample size is ≥30, the distribution of the sample mean is approximately normal. However, the shape of the sample distribution itself (the 50 individual weights) is not necessarily normal.
D is clearly not true since B and C are not true. So answer A is correct.

Question 5: The owner of a fish market has an assistant who has determined that the weights of catfish are normally distributed, with mean of 3.2 pounds and standard deviation of 0.8 pound. If a sample of 25 fish yields a mean of 3.6 pounds, what is the Z-score for this observation?
a) 18.750
b) 2.500
c) 1.875
d) 0.750

Answer:
When evaluating the sample mean,
z = (xbar-μ)/(σ/√n) Note: This formula is not on the sheet.
= (3.6-3.2)/(0.8/√25)
= 0.4/0.16
= 2.5
So, answer B is correct.

Question 6: A 95% confidence interval for a population mean is determined to be 100 to 120. If the confidence coefficient is reduced to 0.90, the interval for mu
a. becomes narrower
b. becomes wider
c. does not change
d. becomes 0.1

Answer: No calculations are necessary here. It's completely conceptual. The general rule is: A higher level of confidence requires a wider confidence interval. Therefore, if we reduce the level of confidence to 90%, the confidence interval can be narrower. Answer A is the correct answer.

Exhibit 8-3
The manager of a grocery store has taken a random sample of 100 customers. The average length of time it took these 100 customers to check out was 3.0 minutes. It is known that the standard deviation of the population of checkout times is 1 minute.

Question 7: Refer to Exhibit 8-3. The standard error of the mean equals
a. 0.001
b. 0.010
c. 0.100
d. 1.000

Answer: The standard error of the mean is:
σ/√n = 1/√100 = 1/10 = 0.1
The correct answer is C.

Question 8: Refer to Exhibit 8-3. With a .95 probability, the sample mean will provide a margin of error of
a. 1.96
b. 0.10
c. 0.196
d. 1.64

Answer: The margin of error is the plus/minus term in the confidence interval. In this case, since we know the population standard deviation, the margin of error term is:
z_α/2(σ/√n)
From the z-table, we find that z_0.025 = 1.96
Therefore,
margin of error, E = 1.96(1/√100) = 0.196
Answer C is correct.
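
Questions 7 and 8 chain together nicely, so here's a small Python sketch of both steps. I'm using `statistics.NormalDist().inv_cdf` (Python 3.8+) as a stand-in for looking up z in the table; the variable names are my own.

```python
import math
from statistics import NormalDist

# Exhibit 8-3: known population standard deviation, sample of 100.
sigma, n = 1.0, 100
confidence = 0.95

standard_error = sigma / math.sqrt(n)          # Question 7

alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2}, about 1.96
margin_of_error = z * standard_error           # Question 8

print(f"SE = {standard_error}, z = {z:.2f}, E = {margin_of_error:.3f}")
```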

Question 12: When the following hypotheses are being tested at a level of significance of α
H₀: μ ≥ 100 H_a: μ < 100
the null hypothesis will be rejected if the p-value is
a. < α
b. > α
c. > α/2
d. < α/2

Answer: First, we notice that this is a one-tailed hypothesis test. The rejection region is entirely to one side of the mean.
Our general rule is If p is low, H₀ must go. So, if p is less than α, we reject the null hypothesis. Answer A is correct.

Question 13: In order to test the following hypotheses at an α level of significance
H₀: μ ≤ 100 H_a: μ > 100
the null hypothesis will be rejected if the test statistic Z is
a. > Z_α
b. < Z_α
c. < -Z_α
d. > Z_α/2

Answer: We've got a one-tailed hypothesis again. This time, the rejection region is in the right-hand tail. Therefore, we reject H₀ if the test statistic is more extreme (i.e. further to the right) than the Z_α. So answer A is correct.

Question 14: Your investment executive claims that the average yearly rate of return on the stocks she recommends is more than 10.0%. She takes a sample to prove her claim. The correct set of hypotheses is
a. H₀: μ = 10.0% H_a: μ ≠ 10.0%
b. H₀: μ ≤ 10.0% H_a: μ > 10.0%
c. H₀: μ ≥ 10.0% H_a: μ < 10.0%

Answer: I don't really like this question because it sounds like she's making a claim based on a status quo of the return rate being > 10%. Since the null hypothesis is about the status quo, I'm tempted to pick answer C. Unfortunately, that's not the right way to look at it in this case.

Rather, since her claim is that the return is greater than 10%, which does not contain an equal sign, that must be the alternative hypothesis, H_a. Therefore, the null hypothesis, H₀, is μ ≤ 10%. Answer B is correct.

Question 15: A soft drink filling machine, when in perfect adjustment, fills the bottles with 12 ounces of soft drink. Any over filling or under filling results in the shutdown and readjustment of the machine. To determine whether or not the machine is properly adjusted, the correct set of hypotheses is
a. H₀: μ > 12 H_a: μ ≤ 12
b. H₀: μ ≤ 12 H_a: μ > 12
c. H₀: μ = 12 H_a: μ ≠ 12

Answer: This one's a gimme. The null hypothesis H₀ is that the machine is continuing to work properly and μ = 12. The alternative hypothesis, H_a is that it is filling with some other mean volume and μ ≠ 12. Correct answer is C.

Question 16: A two-tailed test is performed at 95% confidence. The p-value is determined to be 0.11.
The null hypothesis
a. must be rejected
b. should not be rejected
c. could be rejected, depending on the sample size
d. has been designed incorrectly

Answer: Since the level of significance is 5%, the combined area of the two-tailed rejection region is 0.05. I.e., 0.025 in either tail. The p-value is 0.11. We remember our mantra: If p is low, H₀ must go! But p is not lower than 0.05. Therefore, we do not reject H₀ and answer B is correct.

Question 17: For a one-tailed hypothesis test (upper tail) the p-value is computed to be 0.034. If the test is being conducted at 95% confidence, the null hypothesis
a. could be rejected or not rejected depending on the sample size
b. could be rejected or not rejected depending on the value of the mean of the sample
c. is not rejected
d. is rejected

Answer: Level of significance is 5% = 0.05. p is 0.034. Repeat after me: If p is low, H₀ must go! In this case, yes, p is lower than the level of significance and therefore H₀ is rejected. Answer D is correct.

Note: If this had been a two-tailed test, then the 0.05 rejection region would have been split between the two tails, each having 0.025. In that case, it's not clear whether p = 0.034 is lower than 0.025 unless we know whether p was calculated on one side (as we did in class) or on both sides (as is done in the textbook). I asked Prof. Selcuk about this in an email and he replied that he would avoid such ambiguous cases on the final exam.

Exhibit 9-1
n = 36
xbar = 24.6
S = 12
H₀: μ ≤ 20
H_a: μ > 20

Question 18: Refer to Exhibit 9-1. The test statistic (t-score of xbar) is
a. 2.3
b. 0.38
c. -2.3
d. -0.38

Answer: The formula (on the formula sheet) for the t test statistic is:
t = (xbar - μ₀)/(s/√n)
= (24.6-20)/(12/√36)
= 4.6/2 = 2.3
A is the correct answer.

Question 19: Refer to Exhibit 9-1. If the test is done at 95% confidence, the null hypothesis should
a. not be rejected
b. be rejected
c. Not enough information is given to answer this question.
d. None of these alternatives is correct.

Answer: The hypotheses in Exhibit 9-1 (H_a: μ > 20) make this a one-tail test, i.e. the entire rejection region is in the upper tail. Refer to the t distribution table and look up the t value for 35 degrees of freedom and a 0.05 area in the tail. We find that t value to be approximately 1.69. Our t test statistic is 2.3 which is greater than 1.69, indicating that we should reject the null hypothesis, H₀.

Just to be sure, let's check what would happen if it were a two-tail test, so the rejection region is only 0.025 on each side. Referring to the t distribution table again, we find the t value for 35 degrees of freedom and a 0.025 area is approximately 2.03. Again, our t test statistic is more extreme than the critical t value, so we would reject the null hypothesis, H₀, either way.

Answer B is correct.

Question 20: In regression analysis if the dependent variable is measured in dollars, the independent variable
a. must also be in dollars
b. must be in some units of currency
c. can be any units
d. can not be in dollars

Answer: This is entirely conceptual. The units of the dependent and independent variables have nothing to do with each other. Think of the site.mtw example that we were using extensively in class. The dependent variable was store sales (measured in dollars) and the independent variable was the size of the store (measured in square feet). The correct answer is C - the independent variable can be in any units.

Question 21: In a regression analysis, if SST=4500 and SSE=1575, then the coefficient of determination (R²) is
a. 0.35
b. 0.65
c. 2.85
d. 0.45

Answer: Since SST=SSE+SSR, SSR=4500-1575=2925. And R²=SSR/SST=2925/4500=0.65. Therefore, answer B is correct.

Question 22: Regression analysis was applied between sales (Y in $1,000) and advertising (X in $100), and the following estimated regression equation was obtained.
Y-hat = 80 + 6.2 X
Based on the above estimated regression line, if advertising is $10,000, then the point estimate for sales (in dollars) is
a. $62,080
b. $142,000
c. $700
d. $700,000

Answer: When a question is this easy, you know there's some sort of trick. Watch your units!! Since X is in hundreds of dollars, plug in 100 in the regression equation. Y = 80 + 6.2(100) = 700. Y is in thousands of dollars. Therefore, the point estimate for sales in dollars is $700,000 - answer D.
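
To make the unit conversions explicit, here's a tiny Python sketch. The function name is my own; the coefficients and units are from the question.

```python
def predicted_sales_dollars(advertising_dollars):
    """Point estimate of sales in dollars for advertising given in dollars."""
    x = advertising_dollars / 100    # X is measured in hundreds of dollars
    y_hat = 80 + 6.2 * x             # Y-hat is in thousands of dollars
    return y_hat * 1000              # convert back to dollars


print(predicted_sales_dollars(10_000))
```

Skipping the first conversion (plugging in 10,000 for X) or the last one (reporting 700 as dollars) is how you end up at one of the wrong choices.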

Question 23: If the coefficient of correlation is a positive value, then
a. the intercept must also be positive
b. the coefficient of determination (R²) can be either negative or positive, depending on the value of the slope
c. the regression equation could have either a positive or a negative slope
d. the slope of the line must be positive

Answer: We learned about the coefficient of correlation way back in Chapter 3. It's a measure of the strength of the linear relationship between x and y. Its values range from -1 to 1. Values close to -1 or 1 indicate a strong linear relationship, either negative or positive.

Answer A is incorrect because the coefficient of correlation tells us nothing about the intercept.
Answer B is incorrect because the coefficient of determination (r²) can never be negative. r² = SSR/SST and neither SSR nor SST can be negative (since they're both sums of squares), so r² can't be negative either.
Answer C is incorrect because a positive coefficient of correlation indicates a positive relationship which would be modeled with a positive slope.
Answer D is correct.

Exhibit 14-10
The following information regarding a dependent variable Y and an independent variable X is provided.
∑ X = 16 ∑ (x-xbar)(y-ybar) = -8
∑ Y = 28 ∑ (x-xbar)² = 8
n = 4

Question 24: Refer to Exhibit 14-10. The slope of the regression function is
a. -1
b. 1.0
c. 11
d. 0.0

Answer: On the formula sheet we have the formula for the regression slope, b₁:
b₁ = ∑ (x-xbar)(y-ybar) / ∑ (x-xbar)² = -8/8 = -1.
So answer A is correct.

Question 25: Refer to Exhibit 14-10. The intercept of the regression line is
a. -1
b. 1.0
c. 11
d. 0.0

Answer: Again, the formula sheet gives us the computation for the intercept, b₀:
b₀ = ybar - b₁xbar = (28/4) - (-1)(16/4) = 7 + 4 = 11.
So answer C is correct.
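
Questions 24 and 25 only need the summary sums, so the whole thing fits in a few lines of Python. The variable names are mine; the numbers are from Exhibit 14-10.

```python
# Summary values from Exhibit 14-10.
n = 4
sum_x = 16
sum_y = 28
sum_cross_products = -8     # sum of (x - xbar)(y - ybar)
sum_sq_dev_x = 8            # sum of (x - xbar)^2

x_bar = sum_x / n
y_bar = sum_y / n

b1 = sum_cross_products / sum_sq_dev_x    # slope (Question 24)
b0 = y_bar - b1 * x_bar                   # intercept (Question 25)

print(f"slope = {b1}, intercept = {b0}")
```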

More answers to sample problems to come. (I'm kinda jumping around for now.)

Final Exam Study Guide - Analysis of Prior Exam Questions

Looking at last quarter's exam gives us some insight as to what to expect on our final. The most important piece is that it provides practice questions at the level we'll be expected to perform. It is extremely worthwhile to do these problems on your own and make sure you understand the answer.*

Another interesting insight that we gain from the sample exam is the distribution of questions. Here's what I came up with for the number of questions per chapter and the number of points associated with those questions:


          Mult    Short   Total
Chapter   Choice  Answer  Points
7         6       0       12
8         5       1       20
9         8       1       26
12        6       3       42

We'll probably have a few questions from Chapter 6 thrown in, but those will probably be relatively easy compared to the more advanced material. These numbers tell me one thing for sure: Chapter 12 is really important!

*For the record: If you noticed that I didn't stick around for the in-class review of the sample final on Thursday, it's not because I think I know all this stuff! Just the opposite. Almost all of this material is new to me and I wanted to work through all the questions on my own without having heard the answer already solved by someone else.

Final Exam Study Guide - Outline

Final Exam Study Guide
There's a lot of material to review for our final exam. In order to study for the final, I'm going through all the chapters that will be covered (6, 7, 8, 9 and 12) and pulling out the important points from each one. I've basically written them up as "learning objectives" for each chapter. Also since we didn't cover every section of every chapter, I've listed the sections that we did cover.

Here's my outline as it stands so far:

Chapter 6 - The Normal Distribution
6.1
Understand the concept of a continuous probability distribution and the difference between continuous and discrete probability distributions.

6.2
Understand the normal and standard normal distributions.
Calculate the z score for any given X.
Read the standard normal distribution table and answer questions of the form:
P(X<a)
P(X>a)
P(a<X<b)

6.3
Use the normal probability plot to evaluate normality of data.

Chapter 7 - Sampling Distributions

7.1
Understand the concept of a sampling distribution.

7.2
Calculate z-scores for xbar using the standard error of the mean: σ/√n
Understand the Central Limit Theorem.

Chapter 8 - Confidence Intervals

8.1
Construct a confidence interval for the population mean, given a sample mean, population standard deviation, sample size and level of confidence.
Know that a higher level of confidence requires a wider confidence interval.

8.2
Construct a confidence interval for the population mean, given the sample mean, sample standard deviation, sample size and level of confidence.
Know that for the t statistic, the degrees of freedom is n-1.
Read the t-table to find the critical value for a given level of confidence and degrees of freedom.

8.4
Calculate the sample size required for a given margin of error and level of confidence.
Know that a smaller margin of error requires a larger sample size.
Know that a higher level of confidence requires a larger sample size.

Chapter 9 – Hypothesis Testing

9.1
Understand the concept of the null and alternative hypotheses.
Construct null and alternative hypotheses based on a description of the test.
Understand the concepts of rejection and non-rejection regions.
Understand the level of significance, alpha, of a hypothesis test.

9.2
Know the difference between a one-tailed and two-tailed hypothesis test.
Calculate critical values for the rejection and non-rejection regions for both one-tailed and two-tailed tests.
Calculate the z test statistic and compare to critical values to make a decision whether or not to reject the null hypothesis.
Calculate the p-value and compare to the level of significance to make a decision whether or not to reject the null hypothesis.

9.3
Create null and alternative hypotheses for one-tailed testing.

9.4
Use the t test statistic to conduct one and two-tailed hypothesis tests when σ is not known.

Chapter 12 - Simple Linear Regression

12.1
Understand the basic concepts of independent and dependent variables, intercept and slope.
Understand the concept of simple linear regression.
Regression: modeling a relationship between variables with a curve
Linear Regression: the curve in the relationship is a straight line (not some sort of arc)
Simple Linear Regression: only consider one independent variable as the predictor of the dependent variable
Understand the simple linear regression model formula: Y_i = β₀ + β₁X_i + ε_i

12.2
Understand the method of least squares.
Apply the computation formulas of the least squares method to compute the Y intercept b₀ and the slope b₁.
Know how to read and interpret partial computer output (Minitab) and develop the regression line based on it.

12.3
Understand the sum of squares terms SST, SSR and SSE for the measures of variation in regression.
Calculate any of the sum of squares terms, given the other two.
Understand the coefficient of determination, r².
Calculate r² given any two sum of square terms.
Know how to read and interpret partial computer output (Minitab) and calculate sum of squares terms and r² based on it.
Understand the standard error of the estimate and calculate it, given SSE or SST and SSR.

12.4
Know the four assumptions necessary to use the method of least squares in simple linear regression.

12.5
Know how to use residual analysis to validate the four assumptions.

12.6
Understand the Durbin-Watson statistic.
Know how to interpret the Durbin-Watson statistic to detect autocorrelation.

12.7
Calculate the standard error of the slope, S_b₁.
Calculate the t test statistic for the slope and determine whether there is a significant linear relationship.
Know that when comparing the t test statistic for the slope to the critical t value, you use n-2 degrees of freedom.
Construct a confidence interval for the slope.

12.8
Construct a prediction interval for an individual response Y.
Construct a confidence interval for the mean of Y.
(This is as far as I got so far. More to come!)

Tuesday, March 4, 2008

Lecture 8 - Confidence Interval for Ŷ

Confidence Interval for Ŷ
Once we've calculated our regression coefficients, b₀ and b₁, we can estimate the value of Y at any given X with the formula:
Ŷ = b₀ + b₁X

This is known as a point estimate. It estimates Ŷ to a point. However, since it's just an estimate, it's logical to ask for a confidence interval around Ŷ.

In addition to constructing a confidence interval for an individual Ŷ, we can also construct a confidence interval for an average Ŷ at X.

The difference is best illustrated with an example. In site.mtw we have data on square footage of stores and their annual sales. In general, sales increase linearly with increasing square footage. We perform a regression analysis and determine the regression coefficients. Now we could ask two questions:
1. If I build a single new store with 4,000 square feet, what does the regression predict for its annual sales? The answer can be expressed as a confidence interval for an individual Ŷ, because we're making a prediction for an individual new store.
2. If I build 10 new stores, each with 4,000 square feet, what does the regression predict for the average annual sales of those stores? The answer to this question can be expressed as a confidence interval for an average Ŷ, since we're making a prediction about the average sales at many new stores.

Confidence Interval for Average Ŷ
The confidence interval for the average Ŷ (question #2 above) takes the common form:
Ŷ ± t_{α/2, n-2} S_Ŷ
where S_Ŷ is the standard error of Ŷ. Note: n-2 is used in looking up the t value in the t table.

We are told that
S_Ŷ = S_XY × SQRT(h_i)
where h_i is given by:
h_i = 1/n + (X_i-Xbar)²/SSX
That's all there is to it! We'll see in a minute that Minitab can calculate the standard error term for us, so constructing the interval is just a matter of looking up the value in the t table and then doing the arithmetic.

Confidence Interval for Individual Ŷ
If we're constructing the confidence interval for an individual Ŷ (question #1 above), the calculations are very similar except that we use a 1+h_i term in place of h_i. So that term becomes:
S_XY × SQRT(1+h_i)
Other than the standard error term, everything else is the same as calculating for an average Ŷ.
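Here's a rough sketch of both standard error terms computed from raw data. It assumes the usual textbook form h_i = 1/n + (X_i-Xbar)²/SSX and S_XY = SQRT(SSE/(n-2)). The function name and the toy data are my own inventions, not anything from class.

```python
import math


def interval_std_errors(xs, ys, x_new):
    """Return (y_hat, SE for the average Y, SE for an individual Y) at x_new."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    ssxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ssxy / ssx
    b0 = y_bar - b1 * x_bar
    # SSE = sum of squared residuals around the fitted line
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_xy = math.sqrt(sse / (n - 2))
    h = 1 / n + (x_new - x_bar) ** 2 / ssx
    y_hat = b0 + b1 * x_new
    se_average = s_xy * math.sqrt(h)          # uses h_i
    se_individual = s_xy * math.sqrt(1 + h)   # uses 1 + h_i
    return y_hat, se_average, se_individual


# Made-up data for illustration only
y_hat, se_avg, se_ind = interval_std_errors([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x_new=3)
print(y_hat, se_avg, se_ind)
```

The individual standard error is always the larger of the two, which is why the interval for a single new store is wider than the interval for the average of many stores.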

Using Minitab to calculate the confidence interval
Here's how to get the info we need out of Minitab:
1. Load up your data in a worksheet. (We use the site.mtw file as usual.)
2. Select Stat-Regression-Regression from the menubar.
3. Put the independent variable (square feet) in the Predictor box. Put the dependent variable (annual sales) in the Response box.
4. Click the Options button. Enter 4 in the Prediction interval for new observations box. This tells Minitab that we want a prediction of annual sales at 4,000 square feet (the units of our data are thousands of square feet).
5. Check the Confidence Limit checkbox if you want a confidence interval for an average Y. Check the Prediction Limit checkbox if you want a confidence interval for an individual Y.
6. Click OK in the Options and the main Regression windows. The results appear in the session window. Here's the relevant information:

Predicted Values for New Observations

New
Obs    Fit  SE Fit      95% CI          95% PI
  1  7.644   0.309  (6.971, 8.317)  (5.433, 9.854)

The prediction is 7.64 (the "fit"). The SE Fit term is the standard error for the average Y (the one with just h_i, not 1+h_i).

I suspect that we'll probably be expected to construct the confidence interval for the average Y, given the Fit and SE Fit output. Don't forget: You still need to look up the t value (at n-2!) and multiply the SE Fit value by it.
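To check the arithmetic, here's a small sketch that rebuilds both intervals from the Minitab numbers. Two assumptions that aren't stated in the output: the t value of 2.1788 is t_{0.025, 12}, which presumes n = 14 and 95% confidence, and S = 0.966380 is the standard error of the estimate from the regression output for the same data. The individual standard error uses the fact that S_XY² × (1+h_i) = S_XY² + (SE Fit)².

```python
import math

fit = 7.644       # "Fit" from the Minitab output
se_fit = 0.309    # "SE Fit": standard error for the average Y
s = 0.966380      # "S" from the regression output (standard error of the estimate)
t_crit = 2.1788   # ASSUMPTION: t table value for alpha/2 = 0.025 and n-2 = 12


def interval(center, std_err, t):
    """Return (lower, upper) for center +/- t * std_err."""
    half_width = t * std_err
    return center - half_width, center + half_width


# Confidence interval for the average Y
ci = interval(fit, se_fit, t_crit)

# Prediction interval for an individual Y: S_XY * sqrt(1 + h_i)
se_individual = math.sqrt(s ** 2 + se_fit ** 2)
pi = interval(fit, se_individual, t_crit)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")  # 95% CI: (6.971, 8.317)
print(f"95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The CI matches Minitab's output, and the PI agrees to within rounding of the inputs.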

Lecture 8 - Inferences About the Regression Slope - Part 2

Confidence Interval for b₁
The second question that we ask when evaluating the regression is: What is the confidence interval for b₁?

Like any confidence interval, this one will take the form:
b₁ ± t_{α/2, n-2} S_b₁

In the last blog post, we found that S_b₁ is:
S_b₁ = S_XY/SQRT(SSX)

Knowing S_b₁, we can look up t in the t-table and construct the confidence interval relatively easily.

Note that for the confidence interval for b₁ we use n-2 in looking up the t-score.

Using Minitab to evaluate the regression
We won't be expected to calculate S_XY and S_b₁ by hand for the final (or so we were told). But we will likely be asked to create a confidence interval for b₁ given a snippet of Minitab output. So it's worthwhile to take a look at it:

We used the site.mtw dataset and ran the standard regression analysis and got this:

Predictor      Coef  SE Coef      T      P
Constant     0.9645   0.5262   1.83  0.092
Square Feet  1.6699   0.1569  10.64  0.000
S = 0.966380   R-Sq = 90.4%   R-Sq(adj) = 89.6%

The S_b₁ value is calculated for us, but it's not obvious where it is. It's the SE Coef term in the Square Feet row (0.1569). b₁ itself is the Coef term in the same row (1.6699). With those two numbers and a t-table, you can construct a confidence interval for b₁. Just remember to use n-2 in the t-table.

With these two values you can also determine the t statistic for hypothesis testing β₁=0 from the previous blog post by dividing b₁/S_b₁. But the truth is, you don't have to do that! The t-value for b₁ is right there in the Minitab output also, in the T column of the Square Feet row (10.64). The number in the P column of that row (0.000) is the p-value for b₁. Minitab reports a two-tailed p-value, so if that number is less than α, then you can reject the hypothesis that β₁ is 0.
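Here's a quick sketch of that arithmetic using the Coef and SE Coef values from the Square Feet row. The t value of 2.1788 is an assumption on my part: it's t_{0.025, 12}, which presumes n = 14 and a 95% confidence level.

```python
b1 = 1.6699       # Coef for Square Feet
s_b1 = 0.1569     # SE Coef for Square Feet
t_crit = 2.1788   # ASSUMPTION: t table value for alpha/2 = 0.025 and n-2 = 12

# Confidence interval for the slope: b1 +/- t * S_b1
half_width = t_crit * s_b1
slope_ci = (b1 - half_width, b1 + half_width)

# t statistic for testing beta1 = 0
t_stat = b1 / s_b1

print(f"95% CI for slope: ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
print(f"t = {t_stat:.2f}")  # t = 10.64
```

The t statistic matches the T column in the Minitab output, and since the interval doesn't contain 0, it leads to the same conclusion as the hypothesis test.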

Lecture 8 - Inferences About the Regression Slope

After we use the method of least-squares to calculate regression coefficients (b₀ and b₁) and we validate the LINE assumptions, we next turn to evaluating the regression, specifically the slope, b₁ and ask two questions:
1. Is it statistically significant?
2. What is the confidence interval for b₁?

The first question (we actually covered this after the second question in class), whether b₁ is statistically significant, is determined by asking: Is it any better than a flat horizontal line through the data?

We answer this question by making a hypothesis that the true relationship slope, β₁ is 0 and using our skills at hypothesis testing to determine whether we should reject that hypothesis.

H₀: β₁ = 0
H₁: β₁ ≠ 0

The t statistic that we use to test the hypothesis is:
t = (b₁-β₁)/S_b₁
where S_b₁ is the standard error of the slope.

In our case, β₁ is 0 according to our hypothesis, so t reduces to:
t = b₁/S_b₁

The standard error of the slope, S_b₁, is defined as:
S_b₁ = S_XY/SQRT(SSX)
where S_XY is the standard error of the estimate.

The standard error of the estimate, S_XY, is defined as:
S_XY = SQRT(SSE/(n-2))

So, if we have our calculations of SSX and SSE, we can do the math and find S_b₁ and the t-score for b₁.

We finish our hypothesis testing by comparing the t-score for b₁ to t_{α/2, n-2}, where α is our level of significance.
If t is beyond t_{α/2, n-2} (either on the positive or negative end), we conclude that the hypothesis, H₀, must be rejected.
We could also make the conclusion based on the p-value of the t-score. If the two-tailed p-value (which is what Minitab reports) is less than α, then we reject H₀.
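The whole chain from SSE and SSX to the reject/don't-reject decision can be sketched like this. The function name and the input numbers are hypothetical, and the critical value is passed in because it still has to come from the t table (here 2.1199, which I'm taking as t_{0.025, 16}).

```python
import math


def slope_t_test(b1, sse, ssx, n, t_crit):
    """Test H0: beta1 = 0. Returns (S_XY, S_b1, t, reject_h0)."""
    s_xy = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s_xy / math.sqrt(ssx)      # standard error of the slope
    t = b1 / s_b1                     # t statistic when the hypothesized beta1 is 0
    reject_h0 = abs(t) > t_crit       # two-tailed test
    return s_xy, s_b1, t, reject_h0


# Hypothetical inputs chosen so the arithmetic is easy to follow
s_xy, s_b1, t, reject = slope_t_test(b1=2.0, sse=16.0, ssx=25.0, n=18, t_crit=2.1199)
print(s_xy, s_b1, t, reject)  # 1.0 0.2 10.0 True
```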

**Confidence interval for b₁ will be covered in the next blog post.**

Lecture 8 - Residual Analysis - Checking Independence of Errors

Checking the Independence of Errors Assumption
The "I" in the LINE mnemonic stands for Independence of Errors. This means that the distribution of errors is random and not influenced by or correlated to the errors in prior observations. The opposite of independence is called autocorrelation.

Clearly, we can only check for independence/autocorrelation when we know the order in which the observations were made and the data points were collected.

We check for independence/autocorrelation in two ways. First, we can plot the residuals vs. the sequential number of the data point. If we notice a pattern, we say that there is an autocorrelation effect among the residuals and the independence assumption is not valid. The plot at right of residuals vs. observation week shows a clear up and down pattern of the residuals and indicates that the residuals are not independent.

The second test of independence/autocorrelation is a more quantitative measure. (All the methods that we've used up to this point for checking assumptions have been graphical/visual.) This test involves calculating the Durbin-Watson Statistic. The D-W statistic is defined as:
D = Σ_{i=2..n} (e_i - e_{i-1})² / Σ_{i=1..n} e_i²
It's the sum of the squares of the differences between consecutive errors divided by the sum of the squares of all errors.

Another way to look at the Durbin-Watson Statistic is the approximation:

D ≈ 2(1-ρ)

where ρ (the Greek letter rho - lower case) = the correlation between consecutive errors.

Looking at it that way, there are 3 important values for D:
D=0: This means that ρ=1, indicating a positive correlation.
D=2: In this case, ρ=0, indicating no correlation.
D=4: ρ=-1, indicating a negative correlation
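The statistic itself is easy to compute once you have the residuals in time order. A minimal sketch, with a function name and residual lists that are made up for illustration:

```python
def durbin_watson(residuals):
    """Sum of squared differences between consecutive residuals,
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Residuals that flip sign every period: D lands well above 2 (toward 4)
print(durbin_watson([1, -1, 1, -1, 1, -1]))

# Residuals that stay on one side for a while: D lands well below 2 (toward 0)
print(durbin_watson([1, 1, 1, -1, -1, -1]))
```

With only six residuals these don't reach the extremes of 4 and 0, but they show which direction each pattern pushes D.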

In order to assess whether there is independence, we check to see if D is close to 2 (in which case we say there is no correlation and errors are independent) or if it's closer to one of the extreme values of 0 or 4 (in which case we say that the independence assumption is not valid). There is also a grey area between 0 and 2 and another between 2 and 4. If D falls there, the Durbin-Watson statistic does not give us enough information to make a determination: it is inconclusive.

To determine the boundaries for when the Durbin-Watson statistic is relevant and when it's inconclusive, we turn to table E.9, which provides us with lower and upper bounds, d_L and d_U.

Reading the Durbin-Watson Critical Values Table
The critical values are dependent on the sample size, n, the number of independent variables in the regression model, k, and the level of significance, α. In the case of simple linear regression, there's always only 1 independent variable. (That's the simple part.) The level of significance is usually 0.01 (99% confidence) or 0.05 (95% confidence).

So, to read the table:
1. Locate the large section of the table for your level of significance, α.
2. Find the two columns, d_L and d_U, for k=1 (assuming it's simple).
3. Go down the column to the row with your sample size, n.
4. Read the two values for d_L and d_U

Interpreting the Durbin-Watson Statistic
0 < D < d_L: There is positive autocorrelation
d_L < D < d_U: Inconclusive
d_U < D < 4-d_U: No autocorrelation
4-d_U < D < 4-d_L: Inconclusive
4-d_L < D < 4: There is negative autocorrelation
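Here's a sketch of the decision rule as a function. It uses the standard symmetric cut-offs on the upper side (4-d_U and 4-d_L). The function name is mine, and the d_L and d_U values in the example are illustrative only; the real ones come from the table for your n, k and α.

```python
def interpret_durbin_watson(d, d_l, d_u):
    """Classify a Durbin-Watson statistic given the table bounds d_L and d_U."""
    if not 0 <= d <= 4:
        raise ValueError("The Durbin-Watson statistic must be between 0 and 4")
    if d < d_l:
        return "positive autocorrelation"
    if d <= d_u:
        return "inconclusive"
    if d < 4 - d_u:
        return "no autocorrelation"
    if d <= 4 - d_l:
        return "inconclusive"
    return "negative autocorrelation"


# Illustrative bounds (look up the real ones in the critical values table)
d_l, d_u = 1.08, 1.36
for d in (0.5, 1.2, 2.0, 2.8, 3.5):
    print(d, interpret_durbin_watson(d, d_l, d_u))
```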

Graphically, it can be represented like this:

Note: Positive autocorrelation is somewhat common. Negative autocorrelation is very uncommon and our book does not deal with it.

Sunday, March 2, 2008

Lecture 8 - Residual Analysis - Checking the Equal Variance Assumption

Homoscedasticity (Not that there's anything wrong with that.)
We now turn to checking the assumption of equal variance of errors, the "E" in our LINE mnemonic. This assumption states that not only is the error at each x-value distributed normally, but the variance in the error is equal at each point.

Equal variance of errors is known as homoscedasticity. Unequal variance of errors is called heteroscedasticity.

For this analysis we turn again to the plot of residuals vs. the independent variable (x) that we used when we validated the linearity assumption. For linearity, we were just looking to see if the residuals were evenly distributed above and below the x-axis. To check for equal variance of errors, we check to see if there's any pattern in the distribution of the residuals around the x-axis.

Running the residual plot versus x in Minitab:
1. Load up your data.
2. Select Stat-Regression-Regression from the menu bar.
3. Put Annual Sales in the Response box and Square Feet in the Predictor box.
4. Click the Graphs button. Put Square Feet in the Residuals versus the variables box.
5. Click OK in both the Graphs and Regression dialogs. The residual plot appears.

Review the graph and ask yourself: Is there any pattern in the residuals? Do they get increasingly larger or smaller as x changes? If so, then you have a case of heteroscedasticity. But if the residuals are distributed evenly and consistently around the x-axis, then you can conclude that the variances are consistent and the assumption of equality of variances is valid.

In our example, I'd be somewhat concerned with the fact that the residuals are closer to the x-axis for small values of x, but broaden out for larger values. The variance does seem to taper off as x gets very large, which is an indication that the variances are equal for x>2 or so.

Lecture 8 - Residual Analysis - Checking the Normality Assumption

The next assumption in the LINE mnemonic after Linearity is Independence of Errors. We skipped that one momentarily because it's a bit more complex than the others. So we saved it for last. In the meantime, we looked at the next assumption: Normality of Error.

Checking the Normality Assumption
This assumption states that the error in the observation is distributed normally at each x-value. A larger error is less likely than a smaller error and the distribution of errors at any x follows the normal distribution.

Although we typically only have one observation at each x, if we assume that the distribution of the errors is the same at each x, we can simply plot all the errors (residuals) and check if they follow the normal distribution. We do this by running a normal probability plot of the residuals. Fortunately for us, Minitab has a built-in normal probability plot function.

Checking Normality Using Minitab
1. Open up your data worksheet. As usual, we'll use the site.mtw file for our example.
2. Select Stat-Regression-Regression from the menu bar.
3. Put Annual Sales in the Response box since it's the dependent (response) variable and put Square Feet in the Predictors box since it's the independent (predictor) variable.
4. Click the Graphs button and under Residual Plots, check the Normal plot of residuals checkbox.
5. Click OK in the Graphs and the Regression dialogs.

Minitab creates the normal probability plot of the residuals. The y-axis of this graph is adjusted so that if the data are distributed normally, they will fall on a straight line on the graph. Minitab even draws a line through the residuals for us (presumably using the method of least-squares).
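If you want to see what goes into a plot like this, here's a rough sketch that pairs each sorted residual with the z-score it would have if the data were perfectly normal. The plotting positions (i - 0.5)/n are an assumption on my part (Minitab may use a slightly different formula), and the function name and residuals are made up.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (theoretical z-score, residual) pairs for a normal probability plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # ASSUMPTION: plotting position for the i-th smallest residual is (i - 0.5) / n
    return [
        (std_normal.inv_cdf((i - 0.5) / n), e)
        for i, e in enumerate(ordered, start=1)
    ]


# Made-up residuals for illustration
for z, e in normal_plot_points([0.3, -1.2, 0.8, -0.1, 0.2]):
    print(f"{z:6.3f}  {e:5.2f}")
```

If the residuals are roughly normal, plotting the second column against the first gives points that fall close to a straight line.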

Drawing a conclusion from the graph
Review this graph and ask yourself: Do the residual points fall more-or-less on a straight line in the normal probability plot? If they do, you can conclude that the errors are distributed normally and the normality of errors assumption is valid. In our example, the normality plot of the residuals is pretty much linear, but I would be concerned about the upward trend at the far right end of the graph.
