Machine Learning may be at the heart of today's hottest topics, but nothing happens without data. No Machine Learning without exploitable data! For a machine to learn, it clearly must be fed the right information. The same holds for us humble human beings, doesn't it? Fortunately, when we were at school we had textbooks matched to our course. Imagine for a moment that you are in the final year of a science track (Terminale Scientifique) and you are handed a first-year literature textbook! Wouldn't the rest of your education be in jeopardy? Of course it would, so the conclusion is clear:
No effective machine learning without quality information / data!
Yes, but what quality of data?
Obviously, the requirements of a Data Science project are not the same as those of a governance project. The quality, or rather the type of quality, of the information collected is a major issue. A Data Scientist will not deal with quality indicators, for example: his concern will be to find data and, above all, to make sure that data is exploitable.
And data comes in all types and formats. The Data Scientist's challenge quickly boils down to the ability to qualify problems in order to compensate effectively for gaps in the data.
Imagine that our Data Scientist is working on credit data. You probably didn’t know it, but that is exactly what your banker uses when you apply for a loan! A certain number of characteristics (banking history, age, title, etc.) allow a Machine Learning algorithm to decide whether you are eligible for the credit. Basically, the idea is to determine whether, compared with all the information about the bank’s other customers, your profile presents a significant risk. Now you know the culprit behind your last credit refusal!
Now imagine that in all this information the ages are missing or incorrect.
Here are some classic examples:
- An empty column / field.
- A date of birth of 01/01/1901 (as if by chance, the default value set by the bank’s software package).
- An age inconsistent with the date of birth.
- A negative amount or age: -23 years old (yes, it happens!).
There are plenty of examples.
You can imagine that if you train your Machine Learning model on such data, you will not get a quality result!
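The classic problems listed above can be flagged in a few lines of pandas. This is a minimal sketch: the column names, the sample values, and the reference date used to recompute ages are all illustrative, not taken from any real banking system.

```python
import pandas as pd

# Hypothetical credit data; column names and values are illustrative.
df = pd.DataFrame({
    "birth_date": ["1985-04-12", "1901-01-01", None, "1990-06-30"],
    "age": [39, 45, 31, -23],
})
df["birth_date"] = pd.to_datetime(df["birth_date"])

# Flag the classic problems listed above.
missing_birth = df["birth_date"].isna()                          # empty field
default_birth = df["birth_date"] == pd.Timestamp("1901-01-01")   # software default
negative_age = df["age"] < 0                                     # impossible value

# Age inconsistent with the date of birth (one-year tolerance),
# computed against an arbitrary reference date.
computed_age = (pd.Timestamp("2024-01-01") - df["birth_date"]).dt.days // 365
inconsistent = (computed_age - df["age"]).abs() > 1

suspect = missing_birth | default_birth | negative_age | inconsistent
print(df[suspect])
```

Each rule yields a boolean mask, so new checks can simply be OR-ed into `suspect` as they are discovered.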
Analyze to better repair
The previous example is of course a simple one, very simple even. But let’s not forget that the challenge is not only to detect the problem; we also need to analyze the data considered correct. Examining the values and their frequency distribution, for example to determine whether the ages follow a normal distribution, will be extremely informative. Let’s not forget that our famous Data Scientist is a statistician, and in this precise case he will probably be able to replace the incorrect values with (I am simplifying) the median value.
He could just as easily use the mean, but the major concern of our Data Scientist is to avoid removing data. He must therefore constantly detect issues and work out how to patch his dataset in order to minimize the impact of poor quality on his predictions.
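The patching strategy described above (mark impossible values as missing, then impute the median rather than dropping rows) might look like this. The age bounds and sample values are assumptions for the sake of the example:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; -23 and 123 stand in for invalid entries.
ages = pd.Series([23, 35, -23, 41, 123, 29, np.nan, 38])

# Treat impossible values as missing rather than dropping the rows.
ages_clean = ages.mask((ages < 0) | (ages > 110))

# Replace missing/invalid entries with the median of the valid values.
median_age = ages_clean.median()
ages_imputed = ages_clean.fillna(median_age)
```

Using `mask` keeps the Series the same length, so no row of the dataset is lost: only the unusable values are overwritten.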
Data preparation must feed on data quality.
It therefore becomes essential to profile the data needed by the predictive model as thoroughly as possible. This analysis should also give a statistical picture of the information (frequency distributions, min and max values, standard deviation, mean, etc.). These captured elements will become essential to Data Scientists in their data quality management strategies.
Work on the characteristics
Once the analysis has been carried out on all the data, our Data Scientist will, if necessary and depending on the characteristics, rework the data.
Using a value distribution graph, for example, our Data Scientist can quickly spot what statisticians (and the Informatica jargon) call outliers: values that stand out from the rest and disrupt Machine Learning algorithms. In the same vein as before, it is up to the Data Scientist to decide whether to delete such data or replace it with another value (the median, for example).
There is in fact a plethora of solutions that the Data Scientist can implement.
The complexity remains in detecting these data quality issues and interpreting them.
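One common detection heuristic (one option among the plethora mentioned above, not the only one) is the interquartile range rule, combined here with the median replacement discussed earlier. The amounts are invented for the example:

```python
import pandas as pd

# Hypothetical amounts; 99999 is an obvious outlier.
amounts = pd.Series([120, 250, 180, 99999, 210, 160, 300])

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
iqr = q3 - q1
outliers = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Replace outliers with the median of the remaining values.
amounts_fixed = amounts.mask(outliers, amounts[~outliers].median())
```

The 1.5 multiplier is the conventional default; tightening or loosening it is precisely the kind of interpretation work that falls to the Data Scientist.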
A second interesting aspect (and I will stop there for this article) is finding whether there are dependencies between the columns. Algorithms such as Naïve Bayes require independent characteristics, so the first step is to verify that this is the case. Unfortunately, dependencies are not always as obvious as the one between date of birth and age. Sometimes a tooled approach, such as Informatica Data Quality, will be essential to spot the intruders.
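A cheap first screen for such dependencies is a correlation matrix. It only captures linear relationships (a tool can detect subtler ones), but it immediately exposes blatant cases like birth year versus age. The features below are synthetic:

```python
import numpy as np
import pandas as pd

# Hypothetical features; "age" is fully determined by "birth_year",
# so the two columns are not independent.
rng = np.random.default_rng(0)
birth_year = rng.integers(1950, 2000, size=200)
df = pd.DataFrame({
    "birth_year": birth_year,
    "age": 2024 - birth_year,                   # perfectly dependent
    "income": rng.normal(2500, 600, size=200),  # unrelated noise
})

# Pearson correlation: off-diagonal values near |1| flag redundant pairs.
corr = df.corr()
print(corr.round(2))
```

Here `corr` shows a coefficient of -1 between `birth_year` and `age`: one of the two columns should be dropped before feeding a Naïve Bayes model.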
To conclude: Machine Learning algorithms need a very high level of information quality. Beyond the diagnosis, these algorithms must be fed verified, and often reworked, data. Adequate tooling for this type of manipulation therefore quickly becomes essential if the Data Scientist does not want to spend all his time working and reworking his datasets.