Guest post by Andrew Boyd
Today, the people on the bus. The University of Houston's College of Engineering presents this series about the machines that make our civilization run, and the people whose ingenuity created them.
Suppose we want to find out how often people ride the bus. More specifically, among people who ride the bus, what's the average number of days they ride each week?
Here's one way we might answer the question. We hop on a bus and start asking people how often they ride. The first person says he rides two days per week. The next says five. And so on. We'll do the same thing for a lot of different people on a lot of different buses, then take the average. This should give us a pretty good estimate. Or should it?
Let's look more closely. The field of statistics tells us that if we take a sufficiently large, random sample of bus riders, calculating an average is actually a very good way to get an estimate. In fact, a sample of just a few thousand people would probably be sufficient. That's why political pollsters are able to give relatively accurate estimates on 'what the country thinks' about an issue without actually asking everyone in the country what they think.
In our case, the problem isn't with the act of sampling per se, but with the way we're sampling. By randomly sampling riders on the bus, we're not randomly sampling among riders.
How's that again? We can glean insight from a small example. Consider a bus system that operates a single bus and suppose only two people ride the bus. One person rides just one day a week and the other rides all five. So among the two people who ride the bus, the average is three days a week.
But suppose we gathered our data in the way described earlier, by riding the bus for a week and sampling the riders we encountered. On one day we'd find both riders. But on the other four days we'd find only one rider, and over the week he'd tell us five different times that he rides the bus five days a week. Our sample would hold six answers: a single one and five fives. When we calculate the average from our sample, it works out to 26 divided by 6, or about 4.3, a number much larger than three.
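The arithmetic in this small example can be checked with a few lines of Python. This is only a sketch of the two-rider case: the variable names are made up for illustration, and it assumes we survey everyone on board on every one of the five days, so a person who rides d days a week gets asked d times.

```python
from fractions import Fraction

# Days per week each person rides (the two riders in the example).
riders = [1, 5]

# The number we actually want: the average over riders.
true_mean = Fraction(sum(riders), len(riders))

# Assumption: we survey everyone on the bus every day, so a person
# who rides d days a week shows up in our sample d times.
responses = [d for d in riders for _ in range(d)]

# The naive estimate: the simple average of the answers we collected.
sample_mean = Fraction(sum(responses), len(responses))

print(true_mean)           # 3
print(sample_mean)         # 13/3
print(float(sample_mean))  # roughly 4.33
```

The frequent rider accounts for five of the six answers, which is what pulls the simple average up.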
What's happening is that when sampling from riders on the bus, we're more likely to bump into frequent riders than occasional riders. And this gives us a number bigger than the number we're actually looking for. Interestingly, we can get the number we want from the bus samples. The calculation isn't difficult, but it's not as easy as taking a simple average.
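The episode doesn't spell the calculation out, so what follows is one common approach rather than necessarily the one the author has in mind. If we assume the chance of meeting a rider is proportional to how many days he rides, we can undo the bias by giving each answer a weight of one over the number of days reported. The function name and the sample data below are illustrative only.

```python
from fractions import Fraction

def estimate_mean_rides(responses):
    """Estimate average days per week among riders from on-bus answers.

    Assumes each answer d (a positive whole number of days) turns up in
    the sample in proportion to d, so it is weighted by 1/d. The result
    is the harmonic mean of the answers.
    """
    total_weight = sum(Fraction(1, d) for d in responses)
    return len(responses) / total_weight

# Answers collected over one week in the two-rider example:
# the occasional rider once, the daily rider five times.
on_bus_responses = [1, 5, 5, 5, 5, 5]

corrected = estimate_mean_rides(on_bus_responses)
print(corrected)  # 3
```

Each rider's answers now add up to a total weight of one, however often we happened to meet him, and in the two-rider example that recovers the average of three days a week.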
The bus rider sampling problem shows up in many forms in all sorts of unexpected places. And it's just one of many problems that help remind us to remain intellectually vigilant. For what appears obvious at first glance may not, in the end, be that obvious. Accepting an argument too quickly, or oversimplifying a problem, can easily lead us into trouble.
I'm Andy Boyd at the University of Houston, where we're interested in the way inventive minds work.
Notes and references:
A. Drake. Fundamentals of Applied Probability Theory. New York: McGraw-Hill, 1967.
The bus passengers photograph is a royalty-free stock image from Stock.XCHNG. The abstract people were created by E. A. Boyd.