This is one of three related articles on how to avoid and deal with bad data.
by Katie Paine

Warning: I’m getting up on my Bad Data soapbox here. Still, all the soap in the world is not going to clean up the gigantic pile of dirty data that has been landing on my desktop in the past few weeks. Each day has brought another round of “You have got to be freakin’ kidding me!”
Okay, yes, dealing with a certain amount of bad data is not unusual around here. One of my jobs as “Measurement Sherpa” for my clients is to make sure that the data on which they base decisions is valid, accurate, and reliable. So we spend a lot of time digging in the dirt that is modern data. Here are a few horror stories that you may very well recognize from your own quest for good data:
1. Four Vendors, Four Very Different Sets of Data
For one client we are monitoring four different vendors to see which one has the most accurate sentiment analysis tool. My conclusion: Who the hell knows?
Each vendor provides results that are about 20 percentage points different from the others. One reports a “positive sentiment score” of 58%. Another says it’s 11.95%, and a third puts it right in between at 33.12%. The “neutral” scores are even further apart. One scores 78% of all mentions neutral; another says it’s 30%. Negatives, in case you’re wondering, range from 9.46% to 26.53%.
The weirdest thing is that two of them use human coders and two are entirely machine coded, and when I use my own coders to do a validity check, they tend to agree more with the machine coding than the other humans. Go figure.
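If you want to run the same kind of validity check on your own data, the basic move is to hand an identical sample of mentions to two coders, human or machine, and measure how often they agree beyond what chance alone would produce. Here is a minimal Python sketch using percent agreement and Cohen’s kappa; the ten labels at the bottom are invented purely for illustration.

```python
from collections import Counter

def percent_agreement(labels_a, labels_b):
    """Share of items on which two coders assign the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Probability that both coders pick the same label by chance
    expected = sum((freq_a[lab] / n) * (freq_b[lab] / n)
                   for lab in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Invented labels: my coders vs. one vendor's machine coding on ten mentions
human   = ["pos", "neu", "neu", "neg", "pos", "neu", "pos", "neu", "neg", "neu"]
machine = ["pos", "neu", "pos", "neg", "pos", "neu", "neu", "neu", "neg", "neu"]

print(f"Agreement: {percent_agreement(human, machine):.0%}")   # 80%
print(f"Cohen's kappa: {cohens_kappa(human, machine):.2f}")    # 0.68
```

As a rough rule of thumb, many researchers treat a kappa below about 0.6 as unreliable agreement.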
2. The Case of the Positive Prom Dress
For another project we are testing which vendor’s “alerts” are actually alerting the company to the right stuff. One day my editorial assistant started laughing at an alert, and I knew we were in trouble. Apparently someone with the same last name as the manufacturing company the vendor was monitoring had tweeted a picture of herself in her prom dress. The vendor’s data rated it as “positive” towards the manufacturer. Wouldn’t you love to see those coding instructions? (And if you want really good, well-tested coding instructions, look no further than here.)
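For what it’s worth, the standard defense against this kind of false positive is a disambiguation rule: only count a name match as a brand mention when some brand context appears alongside it. The sketch below is hypothetical from end to end; the company name and the context terms are all made up.

```python
import re

# Hypothetical: the client is "Smithfield Tools." A tweet from someone who
# merely *is* a Smithfield should not be coded as a mention of the brand.
BRAND_CONTEXT = re.compile(r"\b(tools?|drill|hardware|smithfieldtools)\b",
                           re.IGNORECASE)

def is_brand_mention(text: str) -> bool:
    """Count a 'smithfield' match only if brand context appears with it."""
    if "smithfield" not in text.lower():
        return False
    return bool(BRAND_CONTEXT.search(text))

print(is_brand_mention("Loving my new Smithfield drill!"))         # True
print(is_brand_mention("Me in my prom dress! #smithfield #prom"))  # False
```

Any real rule set needs far more context terms than this, but the principle is the point: a surname match alone should never code as brand sentiment.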
3. Lost in Bad Travel Data
Then there was the travel destination data that contained duplicate articles when it arrived. Easy enough to fix, I thought: Just run our de-duping routine and poof! Except that one copy had been coded positive, and the other had been coded neutral. Which do I keep? I then had to go in and reread and recode every duplicate. What was the vendor thinking?
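A de-duping routine that has to cope with vendor-coded data is safer if it flags disagreements instead of silently keeping one copy. Here’s a minimal Python sketch of that idea; the field names (“text”, “sentiment”) and the exact-text duplicate key are assumptions for illustration, and real pipelines match duplicates far more fuzzily.

```python
from collections import defaultdict

def dedupe_flag_conflicts(articles):
    """Collapse duplicates; queue any whose copies disagree on sentiment.

    Each article is a dict with (assumed) 'text' and 'sentiment' keys.
    Returns (clean, needs_recoding).
    """
    groups = defaultdict(list)
    for article in articles:
        # Naive duplicate key; real pipelines use fuzzier matching
        groups[article["text"].strip().lower()].append(article)

    clean, needs_recoding = [], []
    for copies in groups.values():
        if len({c["sentiment"] for c in copies}) == 1:
            clean.append(copies[0])        # copies agree: keep one
        else:
            needs_recoding.extend(copies)  # copies disagree: a human rereads
    return clean, needs_recoding
```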
No sooner had I gotten that bit of confusion cleaned up than I noticed that coverage for my client had dropped by 50% from the previous quarter. Turns out the vendor was using such narrow search strings that they captured only about half of the relevant items. So we asked them to loosen some of the search term restrictions. They did, but then, of course, they delivered dozens of calendar listings, traffic and police reports. 60% of the new content was irrelevant.
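What happened is the classic precision/recall tradeoff: narrow search strings return only relevant items but miss half of them, while loose strings find everything and bury it in junk. Here’s a quick worked example in Python, with counts invented to match the rough percentages in this story:

```python
def precision_recall(retrieved_relevant, retrieved_total, relevant_total):
    return retrieved_relevant / retrieved_total, retrieved_relevant / relevant_total

# Narrow search strings: everything returned is on-topic, but half is missed
print(precision_recall(500, 500, 1000))    # (1.0, 0.5)

# Loosened strings: full coverage, but 60% of the feed is calendar listings
print(precision_recall(1000, 2500, 1000))  # (0.4, 1.0)
```

There’s no free lunch here; the usual fix is a broad query followed by a human or automated filtering pass.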
4. Who Needs 6.8 Trillion Impressions?
About a year ago, I was called in to an agency that had been laughed out of a CMO’s office for reporting that it had generated 6.8 trillion impressions the previous year. The CMO had pointed out that the agency was implying that everyone on the planet saw news of the company nearly 1,000 times. Turns out, the agency’s “impressions” counted every single Facebook posting as “2.5 billion impressions,” since there were 850 million people signed up on Facebook at the time, and they were using a 3x multiplier. Bigger isn’t always better.
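The sanity check the agency skipped takes thirty seconds. Here it is in Python, with deliberately rough round numbers (the world population of about 7 billion is my assumption, not a figure from the story); if the answer comes out as “every human on Earth saw this hundreds of times,” the methodology, not the planet, is the problem.

```python
WORLD_POPULATION = 7.0e9   # rough world population at the time (assumption)
FACEBOOK_USERS = 850e6     # every user assumed to see every post
MULTIPLIER = 3             # the agency's 3x multiplier

per_post = FACEBOOK_USERS * MULTIPLIER
print(f"Impressions per post: {per_post:,.0f}")   # 2,550,000,000

reported_total = 6.8e12
print(f"Views per human on Earth: {reported_total / WORLD_POPULATION:,.0f}")  # 971
```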
5. Tweet or Be Damned
Beware of a certain social monitoring vendor that prominently features “impressions” on its reporting dashboard. Once I asked the right questions, they admitted that they report impressions only for Twitter, despite claiming to monitor dozens of different sites.
For more on dirty data and how to clean it up, see “Dirty Data Dooms Measurement: Here Are 5 Tidy Up Techniques.” ∞