The phases of an ML (Machine learning) project
An ML project takes place in several distinct, successive phases:
- Data collection
- Data preparation
- Learning (choice of model, then training and tuning of the model)
- Testing
- Prediction(s)
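To make these phases concrete, here is a minimal sketch mapping them onto scikit-learn. The dataset and the estimator are illustrative choices on my part, not prescribed by the article:

```python
# Hypothetical end-to-end sketch of the phases, using scikit-learn.
from sklearn.datasets import load_iris              # 1. data collection
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. data preparation (scaling) + 3. learning (model choice and training)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# 4. testing
print("accuracy:", model.score(X_test, y_test))

# 5. prediction on a new, unseen observation
print("prediction:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```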
Although these phases are presented as separate steps, in reality the approach is iterative. For example, tuning a model may require adding new features, which means returning to phases 1 and 2 in order to collect and prepare these new variables. Similarly, it is often worth trying new algorithms in order to check the relevance and the error level of the model.
There is, however, one immutable rule in ML: the training data absolutely must be different from the test data (and therefore also from the production data). Put simply, if you are given a dataset, you must split it in two (training and test) and, above all, never touch the test set during the learning phase!
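A minimal sketch of why this rule matters, assuming scikit-learn (the dataset and model are just placeholders): a model evaluated on its own training data looks better than it really is.

```python
# Sketch: the score on training data is over-optimistic; only the
# held-out test set gives an honest estimate of generalization.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train score:", tree.score(X_train, y_train))  # close to 1.0
print("test score: ", tree.score(X_test, y_test))    # the honest number
```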
The learning typologies
There are several main learning families:
| Learning type | Description |
|---|---|
| Supervised learning | You have collected data with all the features (model variables) and the labels (results). Your training data therefore contains both the variables and the results. |
| Unsupervised learning | You only have the features; you do not know the results of the training data. It is therefore up to your model to learn... on its own! |
| Semi-supervised learning | You do not have all the labels (results), only some of them. Your model will therefore have to learn with gaps! |
| Batch learning | A particular mode of learning in which the model cannot learn incrementally: it must ingest all the available data every time, which makes it heavier to use (a sketch contrasting batch and online learning follows this table). |
| Online learning | The model is trained gradually, on the fly (either observation by observation or in small groups of observations). |
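Here is that sketch, using scikit-learn; the synthetic dataset and the choice of estimators are illustrative assumptions on my part:

```python
# Sketch contrasting batch and online learning.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)

# Batch learning: the whole dataset is ingested in one go.
batch_model = RandomForestClassifier(random_state=0)
batch_model.fit(X, y)

# Online learning: the model is updated incrementally, mini-batch by
# mini-batch, via partial_fit.
online_model = SGDClassifier(random_state=0)
for chunk in np.array_split(np.arange(len(X)), 10):
    online_model.partial_fit(X[chunk], y[chunk], classes=np.unique(y))
```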
More generally, we will distinguish:
- Instance-based learning: the more basic approach, in which the system memorizes the training data and then compares new data/observations, by similarity, with those it has learned.
- Model-based learning (the one that interests us here): in which we try to build a model that generalizes from the observed examples. The idea is to find, from the observations, the mathematical links between the features (and the labels). A short sketch contrasting the two approaches follows.
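In this sketch (the toy data is my own, and the two estimators are assumptions chosen for illustration), k-nearest neighbours predicts by similarity with stored observations, while linear regression extracts a mathematical link:

```python
# Instance-based vs model-based learning on the same 1-D toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 2 * X.ravel() + 1 + rng.normal(scale=0.5, size=50)

# Instance-based: keeps the training observations and predicts by
# similarity (here, averaging the 3 nearest neighbours).
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# Model-based: generalizes the observations into an equation y = ax + b.
lin = LinearRegression().fit(X, y)
print(lin.coef_, lin.intercept_)  # close to a = 2, b = 1

print(knn.predict([[5.0]]), lin.predict([[5.0]]))
```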
Algorithms
First of all, there are 3 major typologies of algorithms:
| Typology | Supervised and/or unsupervised | Description |
|---|---|---|
| Regression | Supervised | This is the basis of predictive models. The idea is to deduce, from the existing data points, the line or curve that connects them. By extension (a line or curve being infinite), it then becomes possible to determine new values. |
| Classification | Supervised / Unsupervised | The goal of this type of algorithm is not to predict an exact value (such as a specific amount or outcome) but to classify the data into groups. Classes can be binary (yes / no) or multi-valued (with the possible classes known in advance or not). |
| Clustering | Unsupervised | The idea of clustering is to create groups of observations (basically, to group together the observations that are the most similar). There are several approaches: hierarchical clustering (ascending/agglomerative or descending/divisive), which builds trees of progressively merged or split groups; non-hierarchical clustering, in which the number of groups (clusters) is fixed in advance and the data is partitioned directly; and a mixed approach. A sketch of the first two follows this table. |
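Here is that sketch, assuming scikit-learn; the toy blobs and the choice of three clusters are illustrative assumptions:

```python
# Sketch of hierarchical vs non-hierarchical clustering on toy blobs.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Hierarchical (ascending/agglomerative): builds a tree of groups by
# progressively merging the closest clusters.
hier = AgglomerativeClustering(n_clusters=3).fit(X)

# Non-hierarchical: the number of clusters is fixed in advance and the
# algorithm partitions the data directly (here, k-means).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(hier.labels_[:10], km.labels_[:10])
```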
The table below summarizes some of the most commonly used ML algorithms:
| Algorithm | Learning | Typology | Comments |
|---|---|---|---|
| Linear regression (univariate / multivariate) | Supervised | Regression | The most common and the simplest form of ML. The idea is to have the model find the equation that will allow subsequent predictions. The univariate mode has only one variable (feature), so the model is a simple line (y = ax + b); the multivariate mode takes several features into account (pay attention to the normalization of the variables). |
| Polynomial regression | Supervised | Regression | A particular extension of multivariate regression. Put simply, the idea is to fit a curve rather than a straight line (we are therefore no longer in a linear setting). |
| Regularized regression | Supervised | Regression | The idea is to improve regression models by adding shrinkage/penalty terms that constrain the coefficients and thus reduce overfitting. It is clearly a regularization method. Penalty functions: Ridge, Lasso, Elastic Net. |
| Naive Bayes | Supervised | Classification | A classifier, certainly among the most widely used, based on Bayes' law of probability. Its particularity: it assumes that the features are independent of one another. |
| Logistic regression | Supervised | Classification | A very popular and widely used classifier, thanks to its linear aspect. Its cost function is based on the log loss, which strongly penalizes confident false positives and false negatives. |
| K-NN (K nearest neighbors) | Supervised | Classification | An algorithm based on the proximity (similarity) of observations. |
| Random forest | Supervised | Classification | Fast, robust and parallelizable. The idea is to train several decision trees on random, different subsets of your dataset; in the end, a majority vote among the trees gives you the prediction. |
| SVM (Support Vector Machine) | Supervised | Classification | Finds the separating boundary (hyperplane) that maximizes the margin between classes; kernel functions make non-linear boundaries possible. |
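As a quick illustration, here is a sketch comparing several of the classifiers from this table on one toy dataset; the synthetic data and the default hyperparameters are assumptions for demonstration only:

```python
# Sketch: several classifiers from the table, compared with
# cross-validation on the same synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

models = {
    "Naive Bayes": GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# 5-fold cross-validation; the real test set should still be kept
# aside, as discussed earlier.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```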
There are many more of course, so I will try to update this table regularly.
If you are looking for guidance on how to approach a machine learning project, do not hesitate to read this article.