Wednesday, January 30, 2008

The Birthday Problem

A relatively well-known exercise in probability is to determine the probability that two (or more) people share a birthday in a group of n people.

The answer is pretty surprising to most people. It turns out that in a group of 23 people, there's more than a 50% chance that two (or more) people have the same birthday.

The way to solve this problem is to first look at the reverse problem. I.e. what is the probability of everyone having different birthdays. Well, let's look at the people individually. The first person can have any birthday, so there are 365 possibilities for him. The second person can't have the same as the first one's bday, so there are 364 possibilities for the second person. The 363 for the third person and 362 for the fourth and so on.

Since the events (the selection of birthdays for each person) are independent, the number of successful outcomes is:
365 x 364 x 363 x ... x (365-n+1) where n is the number of people in the group

The total number of possible birthday combinations is 365^n.

So the probability of 23 people all having different birthdays is:

P(23) = (365 x 364 x 363 x … x 343) / (365^23)

And the probability of two (or more) people having the same birthday is just the inverse:

P’(23) = 1 - P(23)

The calculation of P(23) is pretty ugly, but can be computed relatively accurately by most scientific calculators and personal computers. P(23) = 0.493, so P'(23) is 50.7%.

Taking it a step further

This is the point where most people stop, but a few follow-up questions come up:

1. Is there a way to simplify the formula, even with an approximation, that would make the computation quicker and easier?

2. At the same time, can we generalize the formula beyond the 365 possible birthdays and develop a way to calculate (or estimate) the probability of selecting n items all of different type from an infinite supply of items with x different types. Similarly, what is the probability of two items having the same type within a selection of n items.

It turns out that these questions have actually been studied quite a bit. In 1966, E. H. McKinney of Ball State University wrote a piece in American Mathematical Monthly (v. 73, No. 4, pp. 385-387) where he derived a formula for the problem of finding the smallest value of n such that the probability is great than or equal to 1/2 that at least r people have the same birthday.

McKinney did the computation for r = 2, 3 and 4 and reports the answers as 23, 88 and 187 respectively. What I really found interesting was the following note:

The author did not carry out the computation for r=5 since the machine time [on an IBM 7090 computer] for one computation of [the formula] for a given n was estimated at two hours.

I located the IBM history site on the 7090 Data Processing System. It's really fascinating and worth a read. But what really caught my eye was the bottom line:

The IBM 7090, which is manufactured at IBM's Poughkeepsie, N. Y. plant, sells for $2,898,000 and rents for $63,500 a month in a typical configuration.

Wow! For almost $3 million, you got a computer that would take 2 hours to calculate what you could probably do in a fraction of a second today. I say "probably", because I don't totally understand McKinney's formula yet and I haven't implemented it. I'm still working on that part. :)

More on approximations of the birthday probability later...

No comments: