The Self-Actualised Datum
Articles and opinion about the definition of data science flood the digital airwaves. Is data analytics the same as data science? If you can achieve a result with an Excel pivot table, is it still data science? If there is no statistical analysis, does it still qualify? What is the ‘big’ in big data? Is machine learning a subset or superset of data science, or is it another thing entirely?
This seems to me to be the age-old discussion of angels dancing on the head of a pin; valiant attempts at classification that obfuscate the larger picture, which is that data science seeks to illuminate data in ways that give new insight. In the previous sentence, if you leave the terms ‘data’ and ‘new insight’ undefined, well, then the subject has extremely wide boundaries, and almost anything applies.
So classification is probably a good idea, if for no other reason than to constrain the discipline. But I’d rather not jump into the classification debate. There are many other commentators with better-considered propositions than my own.
But I came across a different model that resonated strongly with me. It does not so much try to define the subject as define what we do with the data. And it is built in a neat triangle and labelled similarly to Maslow’s famous model of human needs. I love it. It hints humorously at the self-actualisation that tech-nerds like me can find in data science. It was created by Monica Rogati, a data science educator and entrepreneur.
While it is probably a creative overreach to try to compare Maslow’s seminal work to machine learning algorithms (um, sexual intimacy from Maslow compares to what in data science? On second thoughts, don’t answer that), some analogies are apposite.
Maslow’s hierarchy rises from south to north in order of greater and more wide-reaching levels of satisfaction. To stagnate at any level implies that the person has not reached her full potential. Something similar can be said of data. Data at its most basic is a mere collection of bits, perhaps living blissfully and ignorantly in a table or flat file, or (perish the thought) as a long stream of neglected letters, numbers and words, tumbling endlessly through cyberspace, miserable, lonely and unconnected. That all changes when we apply the Data Science hierarchy.
At the bottom, Monica Rogati tables COLLECT. In my experience, there might even be a lower level than that, which is FIND. Many companies have collected data, and then lost it in nooks and crannies of the corporate spread. Finding lost data is as much an art as a technical process, especially where institutional memory is thin. Remember, for any company in business for more than a decade, there are many hard drives and old servers sitting in storage closets.
Her second level is MOVE/STORE. This should be planned like a military operation. There may be hundreds of data stores. The ETL (or other process) has to be scheduled, checked, executed and signed off, until the data from every site sits in an accessible and addressable data store, even if it is in terrible shape.
Her third level is EXPLORE/TRANSFORM. This is the bath in which the disparate data buckets are scrubbed, patched, polished and returned to a usable state, in whatever table/structure/nostructure makes sense. It is at this point that the real fun stuff begins. The grunge work is over.
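For the curious, here is a minimal sketch of what a little scrubbing and patching might look like. It is only an illustration, not anything from Rogati's article: the field names (name, city), the sample records and the clean_records helper are all invented for the example, and real data will need far more care than this.

```python
def clean_records(records):
    """Trim whitespace, normalise case, patch missing cities, drop duplicates.

    The 'name' and 'city' fields are hypothetical, chosen for illustration.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Scrub: strip stray whitespace and normalise the casing.
        name = (rec.get("name") or "").strip().title()
        # Patch: a missing or blank city becomes the placeholder 'unknown'.
        city = (rec.get("city") or "").strip().lower() or "unknown"
        if not name:
            continue  # nothing to identify the record by, so set it aside
        key = (name, city)
        if key in seen:
            continue  # same person and city already kept
        seen.add(key)
        cleaned.append({"name": name, "city": city})
    return cleaned


# Invented sample data: one duplicate, one missing city, one nameless row.
raw = [
    {"name": "  alice smith ", "city": "London"},
    {"name": "Alice Smith", "city": " london"},
    {"name": "bob jones", "city": None},
    {"name": "", "city": "Paris"},
]
cleaned = clean_records(raw)
```

Four grubby records go in and two presentable ones come out. Deciding which rows count as duplicates, and what to do with the nameless ones, is a judgment call that depends on the data at hand.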
Her fourth level is AGGREGATE/LABEL. This requires a great deal of understanding about the nature of the data and its connection to the larger, real world problem and is often the last step of human input. It is where data is given meaning, injected with semantics, collected into groups and surrounded with flesh.
Her fifth level is LEARN/OPTIMISE. This is what separates the wheat from the chaff. It is where statistics and machine learning try to fit curves, learn rules and overlay heuristics. This is where the science gets its space in the sun.
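As a small taste of curve fitting, here is a sketch of the simplest case there is: an ordinary least-squares straight line through a handful of points. The fit_line helper and the sample numbers are made up for illustration, and in practice you would more likely reach for a statistics or machine learning library than write this by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented sample points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```

With these points the fit recovers a slope of 2 and an intercept of 1. Real data is never this tidy, which is exactly why the levels below this one matter so much.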
And finally, level six - DEEP LEARNING/AI - a living set of mathematical and statistical narratives that supply the elusive insight, and a blog for another day.
So in summary,
When a datum is collected and attached to other data, when context and relationships are discovered by man or math, and when these discoveries are both well-articulated and useful, then we have data science at its best: self-defined and alive.
Or when expressed with a bit of poetic licence:
Our data is scrubbed clean, finds security, joins a family, rises to a position of respect, and finally becomes a fully actualised member of a community of insights.
The original Rogati article can be found at https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007