In our current infatuation with data, data, and lots more data, it can be discomforting to realize that a lot of it is crap. And I’m not just talking about useless metrics, or dirty data that needs tidying up before you can make sense of it. I’m talking about data that is false, fraudulent, improbable, or even impossible. Bad data happens surprisingly often in the social sciences (see the replication crisis). It probably happens just about as often in communications measurement and evaluation.
To be valid, your work must be based on valid data. A lot of bad research got that way by building on bad data. Just the other day our United States of America collected some very important data that helps determine how and by whom our country is run. How do we ensure that our research, or our elections, are based on valid information?
Some of the more interesting techniques to check for erroneous, fraudulent, or impossible data rely on testing for certain mathematical characteristics of typical, or atypical, data. Two of these are especially quirky and surprisingly fun to think about: the GRIM test and Benford’s law.
The GRIM test. If you want to do a rough check on the validity of averages of integer data (like average age, or average number of tweets per month), and you know N, the number of values averaged together, then the recently developed GRIM test provides an easy way to do it. It checks whether a given average is mathematically possible. (See “There’s a Good Chance Your Public Relations Research Is Faulty: Here’s the GRIM Test, an Easy Way to Check Your Data.”)
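Here’s a minimal sketch of the arithmetic in Python; the function name and the sample numbers are mine, just for illustration:

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM check: can a mean of n integer values round to reported_mean?

    The sum of n integers is itself an integer, so the true mean must be
    (some whole number) / n. Test the integer totals closest to
    reported_mean * n and see whether any of them rounds back to the
    reported value.
    """
    nearest_total = round(reported_mean * n)
    for total in (nearest_total - 1, nearest_total, nearest_total + 1):
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

# A reported mean of 5.19 from 28 whole-number responses is impossible:
print(grim_consistent(5.19, 28))  # False
# ...while 5.18 is fine (a total of 145 gives 145 / 28 = 5.1786 -> 5.18):
print(grim_consistent(5.18, 28))  # True
```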
Benford’s Law
This year, at the 2018 Summit on the Future of Communications Measurement, I spoke on Benford’s Law. Here’s a recap of the presentation:
Back in 1880, a Canadian astronomer named Simon Newcomb noticed that his book of logarithmic tables (which in pre-computer times were used to simplify calculations) showed a strange pattern of wear: more wear on the pages for numbers whose leading digit was 1 or 2, and much less wear on the pages for numbers that began with higher digits, like 7, 8, or 9. He came up with the notion that numbers found in naturally occurring data, like those looked up in log tables or collected in most research databases, are much more likely to begin with the smaller digits and less likely to begin with the larger ones. That is, the leading digit of a number is more likely to be low (1 or 2, say) than high (8 or 9, for instance).
Some 50 years later, an American named Frank Benford put this concept into the form of a law (sometimes called the Newcomb-Benford law, the law of anomalous numbers, or the first-digit law). For you mathematicians, the distribution is Prob(d) = log10(d+1) – log10(d), where d is the leading digit; equivalently, Prob(d) = log10(1 + 1/d). It’s displayed in the chart up above (which I found here, with a good explanation as well).
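If you’d rather see the formula as code than as math, a few lines of Python spell out the expected shares (just a sketch of the arithmetic above):

```python
import math

# Benford's law: Prob(d) = log10(d + 1) - log10(d) for leading digit d = 1..9
# (equivalently, log10(1 + 1/d))
benford = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
# 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%,
# 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%
```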
And, So, Why Is This Even Interesting?
The big deal here is that the first (left-most) digit of numbers in naturally occurring data sets is not randomly distributed. Yes, I’m sure that boggles your mind, just as it did mine when I first heard it. Try it this way: Consider any set of naturally occurring numbers, like lengths of rivers, numbers on the first page of the WSJ, or the numbers reported in a tax return. You’d think that the digits 1, 2, 3, 4, 5, 6, 7, 8, and 9 would be equally likely to appear as the first digit of those numbers. But they aren’t. 1 is far more likely, at about 30%, declining logarithmically down to 9, at about 4.6%.
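If you want to try this on your own numbers, a few lines of Python will do the tally (the sample values below are placeholders; swap in whatever data you have on hand):

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """Return the first significant digit of a nonzero number."""
    return int(f"{abs(x):e}"[0])   # e.g. '6.650000e+03' -> 6

# Placeholder values; replace with river lengths, invoice amounts, etc.
values = [6650, 1203, 892, 147, 23, 1.7, 310, 45, 1280, 96]

counts = Counter(leading_digit(v) for v in values)
for d in range(1, 10):
    print(f"{d}: {counts.get(d, 0) / len(values):.0%}")
```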
To have some fun learning more about Benford’s Law, watch this very energetic guy apply it to the front page of the Financial Times:
Benford’s Law does not work for data that is constrained within a narrow range, like the heights of people, or prices. It does work for a remarkable range of naturally occurring data and numbers.
What’s it Used For?
Since about 1989, Benford’s Law has been used to detect fraud in financial statements, and more recently in demographic data and research reports. It is admissible as evidence in federal, state, and local courts of law. It has been used to test regression coefficients in scientific papers, to detect anomalies in health care data, and to demonstrate probable fraud in Iranian election data. Also, several years after Greece joined the European Union, Benford’s Law was applied to the financial data Greece had submitted in its application, and that data was found to be highly improbable.
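To give a rough idea of how such a screen works in practice, here is a sketch: tally the first digits of the numbers you’re auditing and compare them to Benford’s expected shares with a chi-square statistic (the counts below are made up for illustration):

```python
import math

def benford_chi_square(digit_counts: dict) -> float:
    """Chi-square statistic comparing observed first-digit counts to Benford's law."""
    n = sum(digit_counts.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * (math.log10(d + 1) - math.log10(d))
        observed = digit_counts.get(d, 0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Hypothetical first-digit counts from a ledger of 1,000 entries.
counts = {1: 295, 2: 180, 3: 128, 4: 95, 5: 80, 6: 67, 7: 58, 8: 52, 9: 45}

print(f"chi-square = {benford_chi_square(counts):.1f}")
# With 8 degrees of freedom, a statistic much above ~15.5 (the 5% critical
# value) means the digits stray far from Benford's law and deserve a closer look.
```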
But Why Does it Happen Like That?
This strongly skewed distribution of initial digits seems surprising, even weird. Why shouldn’t they all be equally likely? Why should lower numbers be so much more common?
I scoured the web in an effort to find a simple, clear explanation, but failed. It’s just not an easy thing to explain. Perhaps a clue is that many common natural phenomena are measured on logarithmic scales, like decibels for sound and the Richter magnitude scale for earthquakes. And so one of the most readily comprehensible explanations, given in terms of the growth of a country’s population, makes reasonable sense: If the population grows exponentially (as many natural phenomena do), then the number that represents the size of the population spends much more time, as it grows, beginning with a 1 than with the later leading digits. Growing from 1,000,000 to 2,000,000 means doubling, while growing from 9,000,000 to 10,000,000 takes only about an 11% increase, so at a steady growth rate the population lingers far longer in the 1s than in the 9s. That’s why numbers with smaller leading digits are more common.
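You can watch this happen in a toy simulation; the 3% growth rate, starting size, and 500-year horizon below are arbitrary choices of mine:

```python
from collections import Counter

# Grow a hypothetical population by 3% a year and record the leading digit
# of its size each year.
years = 500
population = 1_000.0
digits = Counter()
for _ in range(years):
    digits[int(str(int(population))[0])] += 1
    population *= 1.03

for d in range(1, 10):
    print(f"{d}: {digits[d] / years:.0%}")
# The shares come out close to Benford's 30.1%, 17.6%, ..., 4.6%.
```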
Final note: In a way, data is news at its most basic. So if we can discover ways to identify fake data, we can probably discover how to identify fake news. The GRIM test and Benford’s Law are high school algebra. Just imagine what higher level math combined with big data can do. ∞