The phases of an ML (Machine learning) project
An ML project takes place in several distinct, successive phases:
- Data collection
- Data preparation
- Learning (choice of model, then training and tuning of the model)
- Testing
- Prediction(s)
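To make these phases concrete, here is a minimal sketch mapping them onto scikit-learn. The dataset and the estimator are illustrative choices on my part, not prescribed by the article:

```python
# Hypothetical end-to-end sketch of the phases, using scikit-learn.
from sklearn.datasets import load_iris              # 1. data collection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. data preparation (scaling) + 3. learning (model choice and training)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. testing
print("accuracy:", model.score(X_test, y_test))

# 5. prediction on a new, unseen observation
print("prediction:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```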
Although these phases are presented as separate steps, in reality the approach is iterative. For example, tuning a model may require adding new features, which means returning to phases 1 and 2 in order to collect and prepare these new variables. Similarly, it is often worth trying new algorithms in order to check the relevance and the error level of the model.
There is, however, one immutable rule in ML: the training data absolutely must be different from the test data (and therefore also from the production data). Put simply, if you are given a dataset, you must split it in two (training and test) and, above all, never touch the test set during the learning phase!
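A minimal sketch of why this rule matters, assuming scikit-learn (the dataset and model are just placeholders): a model evaluated on its own training data looks better than it really is.

```python
# Sketch: the score on training data is over-optimistic; only the
# held-out test set gives an honest estimate of generalization.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train score:", tree.score(X_train, y_train))  # close to 1.0
print("test score: ", tree.score(X_test, y_test))    # the honest number
```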
The learning typologies
There are several main learning families:
| Learning type | Description |
|---|---|
| Supervised learning | You have collected data with all the features (model variables) and the labels (results). Your training data therefore contains both the variables and the results. |
| Unsupervised learning | You only have the features; you do not know the results of the training data. It is therefore up to your model to learn... on its own! |
| Semi-supervised learning | You do not have all the labels (results), only some of them. Your model will therefore have to learn with gaps! |
| Batch learning | A particular mode of learning in which the model cannot learn incrementally: it must ingest all the available data every time, which makes it heavier to use (a sketch contrasting batch and online learning follows this table). |
| Online learning | The model is trained gradually, on the fly (either observation by observation or in small groups of observations). |
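Here is that sketch, using scikit-learn; the synthetic dataset and the choice of estimators are illustrative assumptions on my part:

```python
# Sketch contrasting batch and online learning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

# Batch learning: the whole dataset is ingested in one go.
batch_model = RandomForestClassifier(random_state=0)
batch_model.fit(X, y)

# Online learning: the model is updated incrementally, mini-batch by
# mini-batch, via partial_fit.
online_model = SGDClassifier(random_state=0)
for chunk in np.array_split(np.arange(len(X)), 10):
    online_model.partial_fit(X[chunk], y[chunk], classes=np.unique(y))
```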
More generally, we will distinguish:
- Instance-based learning: the more basic approach, in which the system memorizes the training data and then compares new data/observations, by similarity, with those it has learned.
- Model-based learning (the one that interests us here): in which we try to build a model that generalizes from the observed examples. The idea is to find, from the observations, the mathematical links between the features (and the labels). A short sketch contrasting the two approaches follows.
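In this sketch (the toy data is my own, and the two estimators are assumptions chosen for illustration), k-nearest neighbours predicts by similarity with stored observations, while linear regression extracts a mathematical link:

```python
# Instance-based vs model-based learning on the same 1-D toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=50)

# Instance-based: keeps the training observations and predicts by
# similarity (here, averaging the 3 nearest neighbours).
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: generalizes the observations into an equation y = ax + b.
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # close to a = 2, b = 1

print(knn.predict([[5.0]]), lin.predict([[5.0]]))
```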
Algorithms
First of all, there are 3 major typologies of algorithms:
| Typology | Supervised and/or unsupervised | Description |
|---|---|---|
| Regression | Supervised | This is the basis of predictive models. The idea is to deduce, from the existing data points, the line or curve that connects them. By extension (a line or curve being infinite), it then becomes possible to determine new values. |
| Classification | Supervised / Unsupervised | The goal of this type of algorithm is not to predict an exact value (such as a specific amount or outcome) but to classify the data into groups. Classes can be binary (yes / no) or multi-valued (with the possible classes known in advance or not). |
| Clustering | Unsupervised | The idea of clustering is to create groups of observations (basically, to group together the observations that are the most similar). There are several approaches: hierarchical clustering (ascending/agglomerative or descending/divisive), which builds trees of progressively merged or split groups; non-hierarchical clustering, in which the number of groups (clusters) is fixed in advance and the data is partitioned directly; and a mixed approach. A sketch of the first two follows this table. |
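Here is that sketch, assuming scikit-learn; the toy blobs and the choice of three clusters are illustrative assumptions:

```python
# Sketch of hierarchical vs non-hierarchical clustering on toy blobs.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Hierarchical (ascending/agglomerative): builds a tree of groups by
# progressively merging the closest clusters.
hier = AgglomerativeClustering(n_clusters=3).fit(X)

# Non-hierarchical: the number of clusters is fixed in advance and the
# algorithm partitions the data directly (here, k-means).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(hier.labels_[:10], km.labels_[:10])
```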
The table below summarizes some of the most commonly used ML algorithms:
| Algorithm | Learning | Typology | Comments |
|---|---|---|---|
| Linear regression (univariate / multivariate) | Supervised | Regression | The most common and the simplest form of ML. The idea is to have the model find the equation that will allow subsequent predictions. The univariate mode has only one variable (feature), so the model is a simple line (y = ax + b); the multivariate mode takes several features into account (pay attention to the normalization of the variables). |
| Polynomial regression | Supervised | Regression | A particular extension of multivariate regression. Put simply, the idea is to fit a curve rather than a straight line (we are therefore no longer in a linear setting). |
| Regularized regression | Supervised | Regression | The idea is to improve regression models by adding shrinkage/penalty terms that constrain the coefficients and thus reduce overfitting. It is clearly a regularization method. Penalty functions: Ridge, Lasso, Elastic Net. |
| Naive Bayes | Supervised | Classification | A classifier, certainly among the most widely used, based on Bayes' law of probability. Its particularity: it assumes that the features are independent of one another. |
| Logistic regression | Supervised | Classification | A very popular and widely used classifier, thanks to its linear aspect. Its cost function is based on the log loss, which strongly penalizes confident false positives and false negatives. |
| K-NN (K nearest neighbors) | Supervised | Classification | An algorithm based on the proximity (similarity) of observations. |
| Random forest | Supervised | Classification | Fast, robust and parallelizable. The idea is to train several decision trees on random, different subsets of your dataset; in the end, a majority vote among the trees gives you the prediction. |
| SVM (Support Vector Machine) | Supervised | Classification | Finds the separating boundary (hyperplane) that maximizes the margin between classes; kernel functions make non-linear boundaries possible. |
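As a quick illustration, here is a sketch comparing several of the classifiers from this table on one toy dataset; the synthetic data and the default hyperparameters are assumptions for demonstration only:

```python
# Sketch: several classifiers from the table, compared with
# cross-validation on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# 5-fold cross-validation; the real test set should still be kept
# aside, as discussed earlier.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```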
There are many more of course, so I will try to update this table regularly.
If you are looking for guidance on how to approach a machine learning project, do not hesitate to read this article.