Prof. Jan Bosch
Virtually all organizations I work with have terabytes or even petabytes of data stored in different databases and file systems. However, there’s a very interesting pattern I’ve started to recognize in recent months. On the one hand, the data that gets generated is almost always intended for human interpretation. Consequently, these files and databases contain a lot of alphanumeric data, comments and other unstructured content. On the other hand, the size of the stored data is so phenomenally large that it’s impossible for any human to make heads or tails of it.
The consequence is that enormous amounts of time are required to preprocess the data in order to make it usable for training machine learning models or for inference using already trained models. Data scientists at a number of companies have told me that they and their colleagues spend well over 90 percent of their time and energy on this.
'Most of the data is mud pretending to be oil'
For most organizations, therefore, the only way to generate any value from the vast amounts of data that are stored on their servers is to throw lots and lots of human resources at it. Since, oftentimes, the business case for doing so is unclear or insufficient, the only logical conclusion is that the vast majority of data that’s stored at companies is simply useless. It’s dead weight and will never generate any relevant business value. Although the saying is that “data is the new oil”, the reality is that most of it is mud pretending to be oil.
Even if the data is relevant, there are several challenges associated with using it in analytics or machine learning. The first is timeliness: if you have a data set of, say, customer behavior that’s 24, 12 or even only 6 months old, it’s highly likely that your customer base has evolved and that preferences and behaviors have changed, invalidating your data set.
Second, particularly in companies that release new software frequently, such as those practicing DevOps, the way data is generated may change with every software version. Especially when the data is generated for human consumption, e.g. by engineers debugging systems in operation, merging data sets that were produced by different versions of the software is time-consuming.
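To make this concrete, here’s a minimal Python sketch, using entirely hypothetical log formats, of what merging operational data from two releases of the same system tends to involve: each version of the human-oriented log line needs its own parser, and every new release adds yet another pattern to maintain.

```python
import re

# Hypothetical log lines emitted by two versions of the same system; the
# format changed between releases, so each version needs its own parser.
LINE_V1 = "2023-04-01 12:00:03 temp sensor 4 reads 71.5"
LINE_V2 = "2023-09-14T08:12:44Z [sensor:4] temperature=71.5C"

PATTERNS = {
    "v1": re.compile(r"(?P<ts>\S+ \S+) temp sensor (?P<id>\d+) reads (?P<val>[\d.]+)"),
    "v2": re.compile(r"(?P<ts>\S+) \[sensor:(?P<id>\d+)\] temperature=(?P<val>[\d.]+)C"),
}

def parse(line):
    """Try each version-specific pattern until one matches."""
    for version, pattern in PATTERNS.items():
        match = pattern.match(line)
        if match:
            return {"version": version, **match.groupdict()}
    return None  # a third release will silently fall through here

print(parse(LINE_V1))
print(parse(LINE_V2))
```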
Third, in many organizations, multiple data sets are generated continuously, even by the same system. Deriving the information that’s actually relevant for the company frequently requires combining data from different sets. The challenge is that different data sets may not timestamp entries in the same way, may store data at very different levels of abstraction and frequency, and may evolve in unpredictable ways. This makes combining the data labor-intensive and any automation developed for the purpose brittle and likely to fail unpredictably.
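As an illustration, here’s a small sketch, assuming pandas and two made-up data sets, of the kind of glue code this requires: one set uses epoch-millisecond timestamps at high frequency, the other ISO-8601 strings once per minute, and both have to be normalized and aligned before they can be combined at all.

```python
import pandas as pd

# Hypothetical example: one data set logs epoch-millisecond timestamps,
# the other logs ISO-8601 strings at a much coarser frequency.
events = pd.DataFrame({
    "ts_ms": [1700000000000, 1700000030000, 1700000061000],
    "error_count": [0, 2, 1],
})
usage = pd.DataFrame({
    "timestamp": ["2023-11-14T22:13:20Z", "2023-11-14T22:14:20Z"],
    "active_users": [120, 134],
})

# Normalize both to timezone-aware datetimes before they can be joined.
events["ts"] = pd.to_datetime(events["ts_ms"], unit="ms", utc=True)
usage["ts"] = pd.to_datetime(usage["timestamp"], utc=True)

# Align the high-frequency set to the coarser one with an as-of join;
# any schema change in either producer breaks this glue code.
combined = pd.merge_asof(
    events.sort_values("ts"),
    usage.sort_values("ts"),
    on="ts",
    direction="backward",
)
print(combined[["ts", "error_count", "active_users"]])
```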
My main message is that, rather than focusing on preprocessing data, we need to spend much more time and attention on how the data is produced in the first place. The goal should be to generate data such that it doesn’t require any preprocessing at all. This opens up a host of use cases and opportunities that I’ll discuss in future articles.
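As a sketch of what generating data without the need for preprocessing could look like at the source, here’s a hypothetical emit_event helper: every record is produced as a self-describing, machine-readable event with an explicit schema version and a UTC timestamp, rather than as a free-text log line intended for human eyes.

```python
import json
import time
import uuid

def emit_event(event_type, payload):
    """Emit one self-describing, machine-readable event.

    Sketch of the idea: every record carries its schema version, a UTC
    epoch timestamp and typed fields, so downstream analytics and ML
    pipelines can consume it without ad-hoc parsing or cleaning.
    """
    event = {
        "schema_version": "1.0.0",   # bump on every change to the format
        "event_id": str(uuid.uuid4()),
        "timestamp_utc": time.time(),
        "event_type": event_type,
        "payload": payload,
    }
    return json.dumps(event)

# Example: the same sensor reading as a structured event instead of a log line.
print(emit_event("sensor_reading", {"sensor_id": 4, "temperature_c": 71.5}))
```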
Concluding, for all the focus on data, the fact of the matter is that in most companies, most data is useless or requires prohibitive amounts of human effort to unlock the value that it contains. Instead, we should focus on how we generate data in the first place. The goal should be to do that in such a way that the data can be used for analytics and machine learning without any preprocessing. So, clean up the mess, get rid of the useless data and generate data in ways that actually make sense.