Machine learning – Memento

The phases of an ML (Machine learning) project

An ML project unfolds in several distinct, successive phases:

  1. Data collection
  2. Data preparation
  3. Learning (model selection, training, and tuning)
  4. Testing
  5. Prediction(s)

Although these phases are presented separately, in practice the approach is iterative. For example, tuning a model sometimes requires adding new features, which means returning to phases 1 and 2 to collect and prepare those variables. Similarly, it is often worth trying other algorithms in order to compare the relevance and the error level of the model.

There is, however, one immutable rule in ML: the training data must be different from the test data (and therefore also from the production data). Put simply, if you are given a dataset, you must split it into two parts (training and test) and, above all, never use the test set during the training phase!
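The split described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `train_test_split`, the 80/20 ratio and the toy dataset are assumptions made for the example, not something prescribed by this post.

```python
import random


def train_test_split(rows, test_ratio=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test)."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must be strictly between 0 and 1")
    shuffled = list(rows)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy dataset: (feature, label) pairs.
dataset = [(x, 2 * x + 1) for x in range(10)]
train, test = train_test_split(dataset)

# The two subsets never share an observation.
print(len(train), len(test))  # 8 2
```

The model is then fitted on `train` only, and `test` is kept aside until the testing phase.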

The typologies

There are several main learning families:

Supervised learning: In this case you have collected data with all the features (model variables) and the labels (results). Your training data therefore contains both the variables and the results.
Unsupervised learning: In this case you only have the features. You do not know the results for the training data. It is therefore up to your model to learn... but on its own!
Semi-supervised learning: In this case you do not have all the labels (results), only some of them. Your model will therefore have to learn despite the gaps!
Batch learning: A particular mode of learning in which the model cannot learn incrementally. It must ingest all the available data every time, which makes it heavier to use.
Online learning: Here the model is trained incrementally, as the data arrives (either one observation at a time or in small groups of observations).
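To make the difference between batch and online learning more concrete, here is a minimal sketch of an online learner. The class `OnlineLinearModel`, its `learn_one` method, the learning rate and the toy data stream are all hypothetical names and values chosen for the illustration. The model is a simple line (y = ax + b) whose two coefficients are nudged after every single observation, using a stochastic gradient step on the squared error.

```python
class OnlineLinearModel:
    """Hypothetical online learner for y = a*x + b, updated one observation at a time."""

    def __init__(self, learning_rate=0.1):
        self.a = 0.0
        self.b = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.a * x + self.b

    def learn_one(self, x, y):
        # One stochastic gradient step on the squared error of this observation.
        error = y - self.predict(x)
        self.a += self.learning_rate * error * x
        self.b += self.learning_rate * error


model = OnlineLinearModel()

# Toy stream: observations generated by y = 2x + 1 (an assumption for the demo).
stream = [(i / 4, 2 * (i / 4) + 1) for i in range(5)]

# The observations arrive one by one; the model never sees the whole dataset at once.
for _ in range(500):
    for x, y in stream:
        model.learn_one(x, y)

print(round(model.a, 3), round(model.b, 3))
```

A batch learner would instead recompute its coefficients from the full dataset each time new data is added.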

More generally, we will distinguish:

  • Instance-based learning: the more basic approach, in which the system memorizes the training data and then compares new observations with the learned ones by similarity.
  • Model-based learning (which interests us here): here we try to build a model that generalizes from the observations. The idea is to find the mathematical relationships that link the features to the labels.
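The first approach can be illustrated with a small nearest-neighbors sketch: nothing is "fitted", the training observations are simply stored and each new observation is compared with them. The function `knn_predict`, the value k = 3 and the toy points are assumptions made for this example.

```python
import math
from collections import Counter


def knn_predict(training, query, k=3):
    """Return the majority label among the k stored observations closest to `query`."""
    neighbors = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled observations: ((feature_1, feature_2), label)
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]

print(knn_predict(training, (1.1, 1.0)))  # A
print(knn_predict(training, (4.0, 4.0)))  # B
```

A model-based learner would, by contrast, summarize the training data into a few parameters (for instance the coefficients of a line) and discard the observations afterwards.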


There are three major typologies of algorithms:

| Typology | Supervised and/or unsupervised | Description |
|---|---|---|
| Regression | Supervised | This is the basis of predictive models. The idea is to deduce, from existing data, the line or curve that best fits the points. By extending that line or curve beyond the observed data, it is then possible to estimate new values. |
| Classification | Supervised | The goal of this type of algorithm is not to predict an exact value (such as a specific amount) but to assign the data to classes. Classes can be binary (yes / no) or multi-valued (more than two possible classes). |
| Clustering | Unsupervised | The idea of clustering is to create groups of observations (basically, grouping together the observations that are most similar). There are several approaches: (1) hierarchical clustering (ascending / agglomerative or descending / divisive), which builds a tree of progressively merged or split groups; (2) non-hierarchical clustering, which also builds groups, but this time the number of groups (clusters) is fixed in advance; (3) the mixed approach. |
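As an illustration of non-hierarchical clustering with a number of groups fixed in advance, here is a minimal k-means sketch on one-dimensional data. K-means is only one possible algorithm of this family, and the function name `k_means_1d`, the naive initialization (the first k values) and the toy data are assumptions made for the example.

```python
def k_means_1d(values, k, iterations=20):
    """Group 1-D values into k clusters; returns (centroids, clusters)."""
    centroids = list(values[:k])  # naive initialization, good enough for a sketch
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters


values = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centroids, clusters = k_means_1d(values, k=2)
print(centroids)  # [1.5, 10.5]
print(clusters)   # [[1.0, 1.5, 2.0], [10.0, 10.5, 11.0]]
```

Note that no labels are used at any point: the groups emerge only from the similarity between observations.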

The purpose of this table is to summarize some of the most commonly used ML algorithms:

| Algorithm | Learning type | Typology | Description |
|---|---|---|---|
| Linear regression (univariate / multivariate) | Supervised | Regression | This is the most common and the simplest form of ML. The idea is to have the model find the equation that will allow subsequent predictions. The univariate mode has only one variable (feature), so the model is a simple line (y = ax + b); the multivariate mode takes several features into account (take care to normalize the variables). |
| Polynomial regression | Supervised | Regression | A particular extension of multivariate regression. Put simply, the idea is to fit a curve rather than a straight line (the relationship between the feature and the result is no longer linear, even though the model remains linear in its coefficients). |
| Regularized regression | Supervised | Regression | The idea here is to improve regression models by adding shrinkage / penalty terms that constrain the coefficients and thereby limit gross errors in the model. It is a regularization method. Penalty-based variants: Ridge, Lasso, ElasticNet. |
| Naive Bayes | Supervised | Classification | A very widely used classifier based on Bayes' theorem. Its particularity: it assumes that the features are independent of each other. |
| Logistic regression | Supervised | Classification | A very popular and widely used classifier, valued for its linear nature. Its cost function is based on the log loss, which strongly penalizes confident wrong predictions (false positives and false negatives). Find an example of use here. |
| K-NN (K nearest neighbors) | Supervised | Classification | Algorithm based on the proximity of observations. |
| Random forest | Supervised | Classification | Fast, robust and parallelizable. The idea is to train several decision trees on different random subsets of your dataset. In the end, a democratic vote among the trees gives you the prediction. |
| SVM (Support Vector Machine) | Supervised (unsupervised variants exist, e.g. one-class SVM) | Classification | |
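To make the first row of the table more tangible, here is a minimal sketch of univariate linear regression, where the coefficients of y = ax + b are obtained with the ordinary least squares formulas. The function name `fit_line` and the sample values are assumptions made for the illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    a = covariance / variance  # slope
    b = mean_y - a * mean_x    # intercept
    return a, b


# Hypothetical training data, roughly following y = 2x + 1 with some noise.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

a, b = fit_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")  # y = 1.97x + 1.09

# Prediction for a value that was not in the training data.
prediction = a * 6 + b
print(f"{prediction:.2f}")
```

In the multivariate case the same principle applies with one coefficient per feature, which is where normalizing the variables starts to matter.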

There are many more of course, so I will try to update this table regularly.

If you are looking for guidance on how to approach a machine learning project, do not hesitate to read this article.

Benoit Cayla

In more than 15 years, I have built up solid experience around various integration projects (data & applications). I have, indeed, worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: learning, convincing through arguments and passing on my knowledge could be my characteristic triptych.
