Culture. Nurture. Tincture. Enrapture.

You're bad at statistics and that's ok

We all like to think of ourselves as reasonable, logical and rational beings. Yet, left unchecked, our intuitions lead us astray far more often than we expect. This is not an issue in itself, as the human brain has no obligation to be rational in the slightest. Through evolution, we have developed unconscious biases that hasten decision-making but also distort our perception. In a world relying more and more on data, computers and the like, such tuning can have a tremendous impact on our lives. In the following, I want to point out some common misfires we make with statistics. Being aware of them is the first step to fighting those biases and making better-informed decisions every day.

Feeling random IS NOT random

Here, I really can't say it more plainly: we don't know what random is. In truth, our brains are pattern recognition experts. Given any series of events, we will automatically search for trends and explanations even when they don't actually exist. Very useful for quickly identifying threats, less so in our modern world. As such, given a completely random sequence, we inevitably find some designed purpose in it. Take for example simple coin flips. What is the probability of a fair coin coming up heads? I am sure you said 50%, which is correct. But what if I told you that I have flipped 4 heads in a row before? What would you feel the probability of another heads would be? It is, of course, 50% again. Still, I am sure that a fifth heads feels wrong somehow, as if the probability of it happening were lower than 50%. Technically, the probability of each coin flip DOES NOT depend on what came before, even if we feel like it does. The probability of the coin sequence HHHHH is the same as HHHHT, or HTHTH for that matter. This bias is known as the gambler's fallacy. We have an innate tendency to expect successive outcomes to balance out.
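If you find the 50% claim hard to trust, you can check it empirically. Below is a minimal Python sketch (the simulation sizes are arbitrary choices of mine) that flips a virtual coin a million times and measures how often heads follows a streak of 4 heads:

```python
import random

def prob_heads_after_streak(n_flips=1_000_000, streak=4, seed=0):
    """Empirical P(heads) on flips that immediately follow `streak` heads in a row."""
    rng = random.Random(seed)
    run = 0            # current run of consecutive heads
    seen = heads = 0
    for _ in range(n_flips):
        flip = rng.random() < 0.5   # True = heads
        if run >= streak:           # this flip comes right after `streak` heads
            seen += 1
            heads += flip
        run = run + 1 if flip else 0
    return heads / seen

print(prob_heads_after_streak())    # hovers around 0.5, streak or no streak
```

The estimate lands within a fraction of a percent of 0.5: the coin simply does not remember its streak.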

In fact, the gambler's fallacy can be used to determine whether a random sequence was man-made or not. Say I ask you to write me a sequence of a hundred coin flips. I could spot your sequence among dozens of computer-generated ones by looking at the longest run of consecutive heads. In a hundred coin flips, what do you expect the probability of a run of 7 heads to be? 1%? 5%? Actually, it is a bit more than 30%. I expect none of you would have written 4 heads in a row in your sequence, let alone 7.

Involuntary cherry-picking

Time to tackle a crucial element of statistics that we intuitively ignore. In short: cherry-picking. Whenever we observe a series of events, remember that these generally stand for a broader group of occurrences. Critically, we often forget that what we see could be skewed and NOT be representative of the phenomenon as a whole. Say you ask people at an ice cream convention if they like chocolate ice cream. Would your tally really be representative of people AS A WHOLE (including those that would never go to such a convention or flat-out dislike ice cream)? No. That is an example of what we call selection bias. While sampling anything, there is always a risk that we exclude, distort or overrepresent specific trends within it. This can be shaped by HOW we take our sample (as in the previous example) and sometimes by insidious factors we don't think about. This is at the heart of a pressing crisis in research today. Over the years, researchers have realized that published results are heavily skewed towards positive outcomes. This comes down to a variety of factors, including how much easier it is for researchers to secure grants when they publish apparent impact. As a consequence, almost any trend visible in the research literature has to be carefully re-evaluated. For our part, we can fight back by paying attention to how we select our sample and what groups we discourage or flat-out exclude from it.
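To see just how strongly a biased sample can distort a tally, here is a small simulation of the convention scenario. All the numbers (40% chocolate lovers overall, the attendance rates) are invented for illustration:

```python
import random

rng = random.Random(42)

# Simulate a population where liking chocolate ice cream is correlated
# with attending the convention (assumed rates, purely illustrative).
population = []
for _ in range(100_000):
    likes_chocolate = rng.random() < 0.40   # 40% of everyone likes it
    attends = rng.random() < (0.30 if likes_chocolate else 0.02)
    population.append((likes_chocolate, attends))

overall = sum(likes for likes, _ in population) / len(population)
attendees = [likes for likes, attends in population if attends]
convention_rate = sum(attendees) / len(attendees)

print(f"true rate: {overall:.2f}, convention tally: {convention_rate:.2f}")
```

The convention tally comes out around 90% even though only 40% of the population likes chocolate: the sampling process itself manufactured the trend.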

Man left standing

Samples are impacted by how we select them, but they are also greatly distorted by how we use them. I have to reference a favourite story of mine here. During World War II, the United States Navy analyzed aircraft returning from the front to estimate which parts of these planes should be reinforced. Initially, the navy's scientists proposed to add armour to the zones where more bullet holes were identified. Seemed intuitive enough. A few months later, when statistician Abraham Wald delved into the same data, he arrived at a quite different conclusion, urging the navy to change course. Wald realized that it wasn't the regions where bullet holes were found that needed reinforcing. After all, these planes were able to come back. It was the regions where bullet holes WEREN'T found that needed more armour. The absence of damage in those regions meant that a plane shot there had a very low chance of coming back. The navy had overlooked that looking only at planes that came back was itself a selection process, one that excluded the most critical planes to study. This is called survivorship bias. Survivorship bias is a known culprit in historical studies. After all, it is easy to overestimate the importance of buildings, institutions and works that are still around today to the detriment of those that did not survive. In essence, this is also what the famous Bayes' theorem informs us about. This theorem is our best tool for reasoning about conditional probabilities. Typical examples consider the chances of surviving a treatment given some particular characteristic of the patient. In other words, Bayes' theorem recognizes that filtering processes affect our datasets and their statistics. Recognizing those filters is how we make better decisions.
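The theorem itself fits in a few lines. Here is a sketch with hypothetical numbers, asking what fraction of treatment survivors carry some marker (the 10%, 90% and 50% figures are made up for the example):

```python
def bayes(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B) via Bayes' theorem, expanding P(B) over A and not-A."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: 10% of patients carry a marker, survival is 90%
# with the marker and 50% without. What fraction of survivors carry it?
p = bayes(p_b_given_a=0.90, p_a=0.10, p_b_given_not_a=0.50)
print(round(p, 3))   # ≈ 0.167
```

Even a rare trait becomes noticeably over-represented once you condition on survival, which is exactly the filtering effect Wald spotted in the navy's data.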

Size does matter

But even ignoring the effects of selection and filtering, our samples and how we see them can still be distorted. One feature of sampling that we don't necessarily appreciate is how the size of our sample shapes it. Think of a perfect sampling process as picking marbles of different colours randomly from a bag (another classic example). Of course, taking only a single marble won't be representative of what's in the bag, as not all marbles are of that colour. But not many people realize that we still encounter the same problem when picking more marbles. Can you guarantee that by picking all but three marbles from a bag you know all the colours within it? What if, by chance, the marbles left are of a completely different colour? In truth, only by selecting every single marble can you be sure of exactly what your bag contains. For the large datasets we see every day, this is, of course, impractical. Therefore, we need to remember that every sample is but an incomplete picture of the situation.
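The "all but three marbles" scenario can be simulated directly. In the sketch below (the sizes are arbitrary), the bag holds a single blue marble among 99 red ones, and we check how often a 97-marble sample misses it entirely:

```python
import random

def missed_rare_marble(trials=100_000, bag_size=100, sample_size=97, seed=3):
    """How often does a sample of all-but-three marbles miss the lone blue one?"""
    rng = random.Random(seed)
    bag = ["blue"] + ["red"] * (bag_size - 1)
    misses = 0
    for _ in range(trials):
        sample = rng.sample(bag, sample_size)   # draw without replacement
        if "blue" not in sample:
            misses += 1
    return misses / trials

print(missed_rare_marble())   # ≈ 0.03
```

Three times in a hundred, a near-exhaustive sample still reports a bag with no blue in it at all.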

What's more, the smaller your sample is, the more vulnerable to large variations it will be. As each draw's probability is linked to those before it (counter to our coin flip example), selecting more marbles balances out the proportions in the sample until we reach those of the bag itself. For a real-world example, this is what happened during a large cancer survey in the United States in the 2000s. Scientists gathered thousands of cancer cases over numerous months and cities. Analysis showed that people from southern rural towns where education is less accessible had the highest cancer rates. They also found that people from southern rural towns where education is less accessible had the LOWEST cancer rates. Predictably, this confounded the scientists in charge of the project. Besides some other problems in the study (including some we talked about, like selection bias), it was limited by the difference in sample size between cities. As it happened, small southern rural towns were far less populated than bigger northern cities, and so were the samples drawn from them. In that way, samples from smaller towns were more susceptible to extreme results. In the end, remember that size does matter.
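The effect is easy to reproduce: give every town and every city an identical true rate, and the extreme observed rates will still come from the small places. The populations and the 1% rate below are invented for illustration:

```python
import random

def observed_rates(n_places, population, true_rate=0.01, seed=4):
    """Min and max observed rate when every place shares the same true rate."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_places):
        cases = sum(rng.random() < true_rate for _ in range(population))
        rates.append(cases / population)
    return min(rates), max(rates)

small = observed_rates(200, population=100)      # 200 small towns
large = observed_rates(200, population=10_000)   # 200 big cities

print("towns :", small)    # wild spread, from 0% up to several %
print("cities:", large)    # tightly clustered around 1%
```

Both the "highest" and the "lowest" rates come from the small towns, purely because of sample size, even though nothing about the places themselves differs.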

The die is cast

Wrapping up, struggling with probabilities is not due to a lack of intelligence. After all, this is deeply rooted in our biology. Does it limit our decision-making? Yes, but we should accept that nature and prepare accordingly. After all, in my mind, it is always better to rely on conscious planning than on innate talent. What's more, armed with your new knowledge, you've taken the hardest step leading to better decisions. See and enjoy how your world looks with realistic expectations. Ok: that was a bad statistics joke. I'll see myself out.

See you tomorrow nonetheless.