The ceiling of the Sistine Chapel is considered one of the most breathtaking interior paintings in the world. To create this magnificent work of art, the painter Michelangelo laboured for four years, between 1508 and 1512. Most things of beauty require a considerable amount of effort, and data analytics and machine learning are no different. To extract valuable insights or to build a high-performing machine learning model, a lot of work is done behind the scenes.
Data scientists usually go through a standard set of processes before emerging with great insights. These are:
Modelling (with many iterations)
Presentation of results
A typical machine learning process
When asked, a data scientist will invariably tell you that most of her time is spent on step 2, data preprocessing.
Data preprocessing is about getting the raw data into a state that can be fed into machine learning models. This involves:
Data cleansing: finding and eliminating errors in the data
Combining multiple datasets and creating new variables and aggregations
Formatting variables to ensure they are usable and in the same form
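To make these three activities concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made up for illustration: the order and customer records, the field names, and the helper functions `parse_date`, `clean_orders`, `combine` and `total_by_region`. A real project would more likely use a dataframe library and much messier data, so treat this as one possible shape of the work rather than a recipe.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records; all field names and values are invented for this sketch.
raw_orders = [
    {"order_id": "1", "customer_id": "C1", "amount": "120.50", "date": "2021-03-01"},
    {"order_id": "2", "customer_id": "C2", "amount": "", "date": "2021-03-02"},     # missing amount
    {"order_id": "3", "customer_id": "C1", "amount": "80", "date": "05/03/2021"},   # different date layout
    {"order_id": "4", "customer_id": "C3", "amount": "-15", "date": "2021-03-07"},  # impossible value
]
raw_customers = [
    {"customer_id": "C1", "region": " north "},
    {"customer_id": "C2", "region": "South"},
    {"customer_id": "C3", "region": "NORTH"},
]


def parse_date(text):
    """Formatting: accept either date layout and return a date object."""
    for layout in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, layout).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def clean_orders(orders):
    """Cleansing: drop rows whose amount is missing, non-numeric, or negative."""
    cleaned = []
    for row in orders:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # empty or non-numeric amount
        if amount < 0:
            continue  # a negative order amount is treated as an error here
        cleaned.append({**row, "amount": amount, "date": parse_date(row["date"])})
    return cleaned


def combine(orders, customers):
    """Combining: attach each customer's (normalised) region to their orders."""
    region_by_customer = {
        c["customer_id"]: c["region"].strip().lower() for c in customers
    }
    return [
        {**o, "region": region_by_customer.get(o["customer_id"], "unknown")}
        for o in orders
    ]


def total_by_region(rows):
    """Aggregation: derive a new variable, total spend per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


prepared = combine(clean_orders(raw_orders), raw_customers)
totals = total_by_region(prepared)
print(totals)
```

Even in this toy case, half of the raw orders are discarded and the remaining ones need their dates and regions normalised before a single total can be computed, which hints at why this step takes so much time on real data.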
So much time is spent on data preprocessing because raw data almost never comes in a neat and obedient format. As organisations venture deeper into data-driven decision making, they must understand that data is not collected solely for regulatory compliance but also for more advanced analytics.
With this in mind, it helps tremendously if the data scientist is part of the data capturing effort. This helps ensure that organisations collect all the datasets they need and, more importantly, store them in a usable format and location. This small change can make all the difference in how quickly your data science team moves from raw data to awesome insights that improve your business.