My Big Dirty Data
Today we lift the hood on data and peer into its murky depths. There we find that clean data is an oxymoron and survey responses should be taken with a healthy dose of skepticism.
George Box, one of the founding fathers of modern statistics, famously said: "All models are wrong but some are useful." This phrase is extremely apt in the age of big, dirty data. Most data sets are riddled with incorrect or misleading data. The models we build upon these data sets will always be approximations and simplifications of the real world; this does not dilute their value.
So yes, there is no such thing as perfect data. Similarly, there is no data collection scheme that guarantees perfect measurement, transcription or storage.
My big dirty data

Data is inevitably 'dirty' for a number of reasons. These include:
- aliasing, where information about two distinct entities has been merged in error (e.g. when two people share the same name)
- multiple entry, where information about the same entity has been split up (e.g. when the same person's name has been spelled differently in different records)
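As a toy sketch of what these two failure modes look like in practice, the snippet below flags name pairs that need a closer look. The records and the 0.85 similarity threshold are invented for illustration; it uses Python's standard-library difflib for fuzzy matching.

```python
from difflib import SequenceMatcher

# Invented customer records: two distinct "John Smith"s (an aliasing risk)
# and one person entered twice with different spellings (multiple entry).
records = [
    {"id": 1, "name": "John Smith", "city": "Leeds"},
    {"id": 2, "name": "John Smith", "city": "Cardiff"},   # different person, same name
    {"id": 3, "name": "Ann Jones", "city": "Bristol"},
    {"id": 4, "name": "Anne Jones", "city": "Bristol"},   # same person, misspelt name
]

def similar(a, b, threshold=0.85):
    """True if two names are identical or nearly so, character-wise."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Pairs that a human (or extra fields, like city) must disambiguate:
suspects = [
    (r1["id"], r2["id"])
    for i, r1 in enumerate(records)
    for r2 in records[i + 1:]
    if similar(r1["name"], r2["name"])
]
print(suspects)
```

Note that the algorithm can only surface candidate pairs: deciding whether a match is aliasing (split it) or multiple entry (merge it) still needs context beyond the name itself.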
Does this mean we should discount big data? Absolutely not. If we were to abandon dirty data we would have almost no data left. Rather than discarding ‘dirty data’, we recognise that to avoid making invalid conclusions, we need to first clean the data by finding and correcting errors.
Computer-executed algorithms are a critical tool in a data analyst's toolkit; they help us clean 'dirty data' by identifying patterns characteristic of incompleteness and inconsistency. You may not realise it, but similar algorithms are at work every time you run a Google search. We've all encountered Google's "did you mean…", which broadens our field of search, and the related algorithms that correct misspelt words.
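A minimal illustration of the idea behind "did you mean…", again using Python's standard-library difflib, with an invented four-word vocabulary standing in for a real search index:

```python
from difflib import get_close_matches

# A toy vocabulary standing in for a search engine's dictionary.
vocabulary = ["statistics", "stationery", "systematic", "stochastic"]

# For a misspelt query, suggest the closest known word.
suggestion = get_close_matches("statistcs", vocabulary, n=1)
print(suggestion)  # ['statistics']
```

Real spell-correction systems are far more sophisticated (they weigh keyboard layout, word frequency, and query context), but the core step of ranking candidates by string similarity is the same.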
As data analysts, we recognise that when looking at 'dirty data' it's useful to distinguish between "useful data" and "less useful data". The "trick" is to filter out data that is misleading or unnecessary, while retaining data that contains valuable information.
Not all errors are equal

Now on to random and systematic errors, and why it's important to make a distinction between them.
Random errors are individual errors that vary randomly, and the reasons behind them are effectively unknown. The good thing about random errors is that they tend to cancel each other out when averaged. Most modern analytics is designed to account for them.
An example would be making assumptions about a security company's response time based on the response times for a single day. Averaging response times from only one day's operations will probably not give a true picture of performance; response times vary from call to call. Calculating an average over a longer period would provide a more accurate picture. The point here is that you don't need better data. You need more data.
Systematic errors are more difficult to detect and can also lead to wrong conclusions. Here, the challenge is that as you collect more data, the errors reinforce one another. Painstaking effort is required to identify the source of a systematic error and correct for its bias.
For example, a telephone survey of political views may collect responses that are systematically biased if respondents are answering questions on a controversial topic such as 'will you vote for Trump?' Here, additional data collection will not correct the systematic error. Rather, an alternative survey design - such as postal or email - might reduce the propensity for people to give misleading responses.
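A sketch of why more data cannot fix this. Assume (purely for illustration) that half the population genuinely supports the candidate, but one in five supporters misreports over the phone:

```python
import random

random.seed(0)

TRUE_SUPPORT = 0.50  # assumed true share of supporters
MISREPORT = 0.20     # assumed share of supporters who say "no" on the phone

def phone_poll(n):
    """Each respondent supports with prob TRUE_SUPPORT, but a supporter
    only admits it with prob 1 - MISREPORT: a systematic bias."""
    yes = 0
    for _ in range(n):
        supporter = random.random() < TRUE_SUPPORT
        if supporter and random.random() >= MISREPORT:
            yes += 1
    return yes / n

for n in (100, 10_000, 1_000_000):
    print(n, round(phone_poll(n), 3))
# The estimates converge to 0.40 (= 0.50 * 0.80), not 0.50:
# more data just pins down the wrong number more precisely.
```

Contrast this with the response-time example: there, growing the sample shrank the error; here, it merely shrinks the uncertainty around a biased answer.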
By creating communities of like-minded people, social media platforms can create 'echo chambers' that are themselves riddled with statistical bias.
Start at the beginning

The starting point is to ensure that data collection strategies are well designed. If the input data is biased from the beginning, there is almost no way of correcting the subsequent errors. Questionnaires and surveys need to be well designed and take human behaviour into account.
Even when your data is clean, problems can be introduced if separate clean sources are poorly integrated. Using a unique identifier such as an ID number to link datasets is critical and helps avoid errors.
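A minimal sketch of ID-based linking, with invented records: joining on a stable identifier sidesteps the fragile business of matching on names whose spellings may differ.

```python
# Two clean sources describing the same customers, keyed by a shared
# customer ID (all records invented for illustration).
billing = {
    "C001": {"name": "John Smith", "balance": 120.0},
    "C002": {"name": "Anne Jones", "balance": 0.0},
}
support = {
    "C001": {"open_tickets": 2},
    "C002": {"open_tickets": 0},
}

# Merge on the ID; records absent from `support` simply gain no fields.
merged = {cid: {**billing[cid], **support.get(cid, {})} for cid in billing}
print(merged["C001"])
```

Had these sources been joined on the name column instead, a single misspelling ("Ann" vs "Anne") would silently split one customer into two.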