Minimum statistics knowledge needed for Machine Learning

Share this post

It will soon be back to school, it was hot, the beach was good and the sand very warm. You are therefore well rested and ready to go back to school. It is therefore the right time to review some statistical bases that will allow you to better understand and use Machine Learning algorithms. Don’t make a face, it’s for a good cause.

In this article, I suggest that you review the essential background of any good Data-Scientist:

  • Average
  • Median
  • Standard deviation
  • Variance
  • Quantile
  • Quartile

The average

Do you all know what the mean is? yes undoubtedly, nevertheless we are going to redefine it all the same, above all to put it in perspective with the notion of median that we will see later.

The average is therefore defined by the sum of the data divided by the number of data. Or from a mathematical point of view:

Let’s take an example now with a small distribution of values ​​= [1, 7, 8, 9, 10], the mean is obviously 7.

Seen in a graphical perspective let’s bring the y-axis back to the average value to better visualize it. Well, it’s as if the area of ​​the values ​​above and below this average were equal.

The median

Beware of the confusion! because median does not mean at all mean (even if these two values ​​can be equal. By definition the median is the value of a distribution which makes it possible to cut this same distribution into two equal parts.

If we take our previous distribution, we will have the median value of 8:

In fact the calculation is rather obvious as soon as we have an odd number of values ​​in our distribution.

On the other hand, if we have a number of values ​​that is even, the median is the average of the two values ​​in the middle.

Standard deviation

The standard deviation ( standard deviation in English), also spelled standard deviation is a mathematical concept defined in probability and applied to the statistics . In probability , the standard deviation is a measure of the dispersion of a random variable ; in statistics , it is a measure of data dispersion. It is defined as the square root of the variance or, equivalently, as the root mean square of the deviations from the mean . It has the same dimension as the random variable or statistical variable in question.

Standard deviations are encountered in all fields where probabilities and statistics are applied, in particular in the field of surveys , physics , biology or finance . They generally make it possible to synthesize the numerical results of a repeated experiment. Both in probability and in statistics, it is used to express other important concepts such as the correlation coefficient , the coefficient of variation or the optimal Neyman distribution .

The standard deviation is used to measure the dispersion of a data set around the mean. The lower it is, the more the values ​​are grouped around the mean.

Example of two samples with the same mean but different standard deviations illustrating the standard deviation as a measure of the dispersion around the mean:

(Source Wikipedia)

The standard deviation is therefore a measure that measures the dispersion of data in a distribution.

This measurement consists in calculating the distance of the points in relation to the mean.

To better understand this measurement, see how it is calculated from a practical point of view in 5 simple steps:

Steps :

  1. Calculation of the mean of the distribution (here the mean is equal to 7).
  2. Calculate the distance between the mean and each value of the distribution (a simple subtraction)
  3. Squaring the previously calculated distance
  4. Variance calculation: It is the sum of the squared deviations calculated previously divided by the number of values ​​in the distribution minus 1.
  5. Calculation of the standard deviation: it is simply the square root of the variance.

From an analytical point of view here is the calculation formula:

Variance

We just saw the variance is simply the square of the standard deviation.

Quantile

We have seen that the median allows a distribution to be split into two equal parts. Well, quantiles are only a generalization of this idea of ​​breaking up a distribution into shares. The underlying idea is therefore to create cuts with a number of equal values ​​in this distribution.

From a vocabulary point of view:

  • The median separates the values ​​into two equal population groups.
  • The quartiles divide them into four groups.
  • and from a global point of view: the quantiles in n.

NB: there are also the percentiles and deciles.

Quartile

The quartiles are the three quantiles that divide a data set into four equal-sized groups. The median is the quantile that separates the dataset into two groups of equal size.

From a practical point of view (Q1 = 1st quartile, Q3 = 3rd quartile):

  • A quarter of the values ​​are less than or equal to Q1.
  • Three quarters of the values ​​are less than or equal to Q3.
  • Half of the values ​​lie in the interquartile range [Q1; Q3].
Share this post

Benoit Cayla

In more than 15 years, I have built-up a solid experience around various integration projects (data & applications). I have, indeed, worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in such sectors like insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers to automate complex business processes in a more efficient way. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr Learning, convincing by the arguments and passing on my knowledge could be my caracteristic triptych.

View all posts by Benoit Cayla →

One thought on “Minimum statistics knowledge needed for Machine Learning

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Fork me on GitHub