A different kind of software project
A Machine Learning project cannot be approached like a classic software project. Learning from data is fundamentally different from strict programming based on rules and exceptions, so the way you run the project must differ too. V-model development is therefore ruled out. However, a 100% agile method will not necessarily be suitable either, at least not at all stages (see the steps below).
In short, a Machine Learning project is not an IT project like any other!
It is not a classic development project and therefore has its own constraints, but above all it will need great flexibility and regular readjustments.
Succeeding in your Machine Learning project therefore comes down to following the steps below:
- Definition of objectives
- Data access & analysis
- Data preparation
- Modeling
- Evaluation & scoring (iteration)
- Deployment (regular re-evaluation / iteration)
In this article we will go through these different stages together and we will especially see the main elements to remember.
Step 1: Definition of objectives
While this step may seem obvious, it is nonetheless vital for the success of the project. Beyond the underlying business problem, it is a question here of determining what type of problem we must solve.
For that, we need to know whether or not we have experimental data with results (even partial ones), in order to determine whether we are facing a supervised or an unsupervised problem.
Then comes the typology of the problem to be solved, for example:
- System of recommendations
- Reduction of the number of dimensions
For more details on these Machine Learning problem types, or if you are lost, do not hesitate to refer to the memento of Machine Learning algorithms on datacorner.fr.
Step 2: Data access & analysis
Here is a crucial step in which you will have to rework your data (features or variables). Machine Learning algorithms do not accept every type of data, so you will need to refine your variables so that the algorithms can handle them properly.
Splitting the dataset
First of all, you are working with a dataset, which you will have to cut into (at least) two parts:
- Training data: subset intended for training a model.
- Test data: subset intended for the evaluation of the model. This dataset should not be used in the design of the model!
You will handle this splitting with predefined functions (for example sklearn.model_selection.train_test_split). But nothing is ever that simple: the way you slice your dataset can have too great an influence on your model. Even at this stage you will need to be careful and test several strategies (e.g. sklearn.model_selection.KFold).
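The two scikit-learn functions mentioned above can be sketched as follows (X and y are invented placeholder data, not from the article):

```python
# Sketch of dataset splitting with scikit-learn.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)  # 10 observations, 2 features
y = np.arange(10)

# Simple hold-out split: 80% for training, 20% kept aside for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# K-fold cross-validation: every observation is used for testing exactly once,
# which reduces the influence of any single slicing choice
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_tr, X_te = X[train_idx], X[test_idx]
```

With KFold, the model is evaluated five times on five different slices, so a lucky (or unlucky) split matters far less.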
Analyzing the data

This is an equally important step in which you will have to:
- Make an inventory of your data (type of data):
- Typology: Numerical, temporal, text, binary, etc.
- Categorical, discrete or continuous variables?
- Number of observations (number of lines)?
- Number of features / variables (number of columns)?
- Detect if you have outliers and above all decide what you are going to do with them (delete them or simply alter them)
- Detect missing values
- Detection of correlated variables / features
At this level, a good dataviz tool is, in my opinion, essential!
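The inventory described above can be done in a few lines with pandas; a minimal sketch, on an invented toy dataset:

```python
# Sketch of a data inventory with pandas (toy dataset for illustration).
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, None, 51, 46],
    "city": ["Paris", "Lyon", "Paris", None, "Lille"],
    "income": [2800, 3500, 3100, 12000, 4000],  # 12000 is a likely outlier
})

print(df.shape)           # number of observations and features
print(df.dtypes)          # typology of each variable
print(df.isnull().sum())  # missing values per column
print(df.describe())      # spread of numeric variables, helps spot outliers
print(df.select_dtypes("number").corr())  # correlated numeric features
```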
Step 3: Data preparation
Step 2 gave you a complete inventory of the data you have; you will now have to prepare your features / variables so that Machine Learning algorithms can use them.
To pick up the previous points:
- You should only keep data (variables) in numeric format. If you have data of type:
- Date: Apply formulas to transform them into period, etc. Why not add aggregations on sliding windows (on the week, month, year preceding)?
- Categorical: Use One-Hot encoding whenever possible. If you have too many variables, reduce the scope by making groupings.
- Text: you will certainly have to split and reformat your data to obtain categorical data
- If you have missing values (null):
- Delete the entire row if you really have a lot of data (not recommended, but sometimes you won't have a choice)
- Replace them with a value: the median, the mean, etc.
- Scale the numeric values (feature scaling)
- Switching to a logarithm when the variables have extreme values reduces their importance.
This step is also called feature engineering!
Another important aspect is the management of your datasets: if you have a single dataset, you will need to build a training set and a test set from it!
Step 4: Modeling
Depending on the problem you are going to deal with, you have a choice of algorithms: pick and test! This phase can take a long time, because training is a very heavy task, especially when you have a lot of data (which is, by the way, recommended).
The difficulty is not in this choice but rather in the adjustment of the hyperparameters that you will have to make in order to obtain a powerful model.
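One common way to "pick and test" is to compare a few candidate algorithms on the same data with cross-validation; a sketch (the dataset and the two candidates below are examples, not the article's):

```python
# Sketch: comparing candidate algorithms with cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Each candidate is scored on 5 held-out folds of the same data
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```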
Step 5: Evaluation & scoring
Your algorithm chosen and your hyper-parameters adjusted, you will have to validate your model. It is impossible not to enter an iterative mode in which you experiment with these hyper-parameters. Do not hesitate to use third-party tools or approaches such as grid search here.
Be careful especially with over-fitting (or over-training) which will give you the illusion of a good model!
Indeed, if you exceed a certain score (around 95%), it is likely that your model is ultra-efficient… but only on your training data. So try it on the test data… you will certainly be surprised!
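The grid-search approach mentioned above is available in scikit-learn as GridSearchCV; a sketch (the parameter grid below is just an example):

```python
# Sketch of hyper-parameter tuning with a cross-validated grid search.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 5], "min_samples_leaf": [1, 5]},
    # The reported score is averaged over held-out folds, which limits
    # the over-fitting illusion described above.
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```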
The way to measure performance differs depending on the type of problem but also what you really want to measure. Several measures are available (non-exhaustive list):
- Prediction error
- XY plot of predicted value vs. actual value
- Intra-class, inter-class variance
- Number of arcs cut
You are therefore entering an optimization phase based on a necessarily iterative approach. Here are some avenues for improvement and/or optimization:
- Algorithm change
- Is the distribution of the test / training sets consistent, homogeneous?
- Adding / removing variables
- Grouping of values: add averages, sum, number by groups.
- Add / remove rows (with new data sources)
- Adjustment of hyper-parameters
- Add combinations of variables that are hard for the model to learn on its own, such as a ratio
- Aggregating over longer periods (for example 1 month for a daily granularity) can be a good idea
- Use the output of another machine learning model.
- Look for information that could help a model correct errors
Step 6: Deployment
Your model is ready: it performs well and handles all your cases. You can now deploy it through an API or integrate it directly into a program. Be careful, however, because by nature a model cannot live forever (it is based on learning from data… and data is constantly evolving). You must therefore plan to re-evaluate it regularly.
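A minimal sketch of this lifecycle: persisting a trained model so a program or API can load it, then re-scoring it later to check for drift (joblib is the persistence approach the scikit-learn documentation recommends; the dataset is an example):

```python
# Sketch: persist a trained model, reload it, and re-evaluate it.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Ship this artefact with your API or application
joblib.dump(model, "model.joblib")
reloaded = joblib.load("model.joblib")

# Regular re-evaluation: score the reloaded model on fresh labelled data.
# A dropping score is the signal that the model needs retraining.
print(reloaded.score(X, y))
```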
So much for this article, which was intended to walk through a typical Machine Learning project. If you need to go further into the methodology, I invite you to have a look at the CRISP-DM method.