Answering Questions With Data

No, not that Data. (Source: Wikipedia)

When we try to apply statistics to sports, or to anything else for that matter, what we are doing is trying to answer a question with data. The data part is obvious; what is usually less clear is the question. But there is always something we are trying to know or understand more clearly. This process is at the heart of all research science. Deciding which questions can or should be meaningfully asked is actually very difficult, and it often takes years of experience to do it well enough that your data are not fooling you. This is why research articles in science journals are written so weirdly and often with so much jargon: there are a lot of tiny distinctions that we easily conflate in everyday language but that are vitally important when doing research. This is true of any data-based research, including the analysis of sporting statistics, and part of the conflict between old- and new-school statistics really boils down to misunderstandings about which questions the data are actually answering.

One of the most common distinctions that gets lost in sport (and in everyday life, really) is the difference between statistics that tell you how often something has happened in the past and the odds that something will happen in the future. All sporting statistics are the former. If a batter in baseball has a .300 average it tells us the rate at which he or she has got a hit so far in a season or career, specifically three times out of ten. If a cricketer averages 45 with the bat it is the same thing: in the past that cricketer has averaged 45 runs for every dismissal at the relevant level.
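To make the point concrete, here is a minimal Python sketch of how these two averages are computed. The function names and the career totals are made up purely for illustration; the only thing taken from the definitions above is that each figure is a ratio of things that have already been counted.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Baseball batting average: hits per official at bat so far."""
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return hits / at_bats


def cricket_average(runs: int, dismissals: int) -> float:
    """Cricket batting average: runs scored per dismissal so far."""
    if dismissals <= 0:
        raise ValueError("dismissals must be positive")
    return runs / dismissals


# Hypothetical totals, chosen only to reproduce the figures in the text.
print(f"{batting_average(150, 500):.3f}")   # 0.300
print(f"{cricket_average(2250, 50):.1f}")   # 45.0
```

Nothing in either function looks forward. Both simply summarise a record of past events.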

Such stats tell stories, often extremely effectively. If I tell a baseball fan that a hitter had a .287 batting average, 12 home runs and 58 RBIs, that immediately gives a sense of the hitter, albeit an incomplete one. I could add to the story by saying they scored 93 runs, or had an OPS of .671 or some such. All those tell us about the player without ever having to watch a single plate appearance. Extremely importantly in the ongoing (and probably never-ending) debate of old- versus new-school statistics, all of them fall into this same category of frequentist statistics. One person might understand a .287 batting average better than a .671 OPS and people might (okay, do) disagree about which is more important, but they both tell you about things that already happened.

The problem is, the question of what happened in the past isn’t usually the question to which we want to know the answer. It’s great to know that our hypothetical hitter has got a hit in 28.7% of official at bats for the year, but usually what we want to know is something like the likelihood of said hitter getting a hit in their next at bat, or at what rate they will get a hit next year. It’s fairly obvious that the latter is not the same, but it’s less obvious (but just as true) that the former is not the same either. And this is why I started with the importance of knowing what question we are asking of the data; it is extremely common to see people, not just in sport but when dealing with probability in general, assume that the frequency of an event happening in the past is the same as the likelihood of that event happening in the future.
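One way to see the gap between a past frequency and an underlying likelihood is to run the argument backwards. The sketch below assumes, purely for illustration, that a hitter has a fixed and known chance of a hit in every at bat and that at bats are independent. Neither assumption holds for a real player, and the 500 at bats are a made-up season length. Even in that idealised world, the average we would observe over a season wanders noticeably around the true figure.

```python
import math


def observed_average_spread(true_rate: float, at_bats: int) -> float:
    """Standard deviation of the observed average over `at_bats`
    independent at bats, each with the same chance `true_rate` of a hit
    (a simple binomial model, assumed here for illustration)."""
    return math.sqrt(true_rate * (1 - true_rate) / at_bats)


true_rate = 0.287  # assumed "real" chance of a hit; unknowable in practice
at_bats = 500      # hypothetical season

sd = observed_average_spread(true_rate, at_bats)
low, high = true_rate - 2 * sd, true_rate + 2 * sd
print(f"observed average would typically land between {low:.3f} and {high:.3f}")
```

Under these assumptions the range runs from roughly .247 to .327, so two seasons from an identical hitter could look quite different on paper. Reading a single observed .287 as "the" probability of a hit ignores that spread.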

This distinction is the impetus behind a lot of advanced stats and Sabermetrics. I do have some issues with Sabermetrics, but on the whole I quite like it. (This surprises a lot of people, but it is true.) Part of that is just having a natural affinity for playing with huge datasets from a sport I love, but also most people behind Sabermetrics understand this distinction and a lot of other important scientific principles of working with data. They are very good. The issue is that most people in the media and even a lot in front offices don’t understand that distinction, and then completely misapply advanced statistics.

This is why I have started here. In practice people are almost always going to use frequentist statistics to approximate likelihoods. The alternative is building a proper Bayesian formulation, and whilst that is increasingly feasible, it’s still well beyond what most people can do, or would even find worth doing. But what’s important is understanding when we are using frequentist statistics and what their limitations are.
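For a flavour of what a Bayesian formulation looks like, here is about the simplest possible version: a Beta prior on the hit rate, updated with the observed hits and outs. This is a toy sketch, not the kind of full model the paragraph above has in mind. The prior (worth 75 hits in 300 at bats, i.e. a .250 hitter) and the player totals are hypothetical numbers picked for illustration.

```python
def posterior_mean_hit_rate(
    hits: int, at_bats: int, prior_hits: float, prior_outs: float
) -> float:
    """Posterior mean of the hit rate under a Beta(prior_hits, prior_outs)
    prior and a binomial model for the observed at bats."""
    return (hits + prior_hits) / (at_bats + prior_hits + prior_outs)


# Hypothetical prior belief: "a typical hitter", worth 300 at bats at .250.
PRIOR_HITS, PRIOR_OUTS = 75, 225

# Two hypothetical players with the same observed .400 average.
early = posterior_mean_hit_rate(20, 50, PRIOR_HITS, PRIOR_OUTS)
late = posterior_mean_hit_rate(200, 500, PRIOR_HITS, PRIOR_OUTS)

print(f"20 for 50:   observed .400, estimate {early:.3f}")   # 0.271
print(f"200 for 500: observed .400, estimate {late:.3f}")    # 0.344
```

Both players have the same past frequency, but the estimate of what happens next differs because one record is ten times longer than the other. The raw average cannot express that difference. The catch is that the answer depends on the prior, and choosing that sensibly is a large part of the work.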
