One-Hot encoding

Share this post

Categorical variable

Do you know what a categorical variable is ? If you do not know it, you have certainly encountered the concerns that this type of characteristic imposes with regard to Machine Learning algorithms. To put it simply, this type of characteristic is not quantitative (ordinal variable). It is therefore not digital (well in the majority of cases) but offers equally important qualitative information.

Your challenge as a Data Scientist is to convert this qualitative into quantitative!

However, the thing is not easy because to convert qualitative it is necessary to know its value and to be able to scale it in relation to the other available data. A case-by-case approach seems necessary, but other, simpler techniques can help you move forward.

One-hot encoding

This is the easiest method. The one that you will often use above all else. The idea is very simple: create and add binary columns (dummies) which refer or not to the data by a 0 or 1. Take for example the titanic dataset, you have columns which contain categorical variables like Gender, Cabin , etc. The idea is to add additional columns to the data set by specifying in relation to each data if it exists or not.

The example of Sex is rather telling. In the dataset this categorical variable takes only two values ​​male and female. We will therefore replace this column (with two values) by two columns male and female.

You can check it simply with the Pandas library as follows:

cols = ['Sex', 'Embarked']
for col in cols:
    display('Colonne: ' + col)

Note: I added the Embarked categorical variable to this to refine the future model.

Fortunately the Pandas library comes to our rescue once again and creates the famous dummies columns for us in no time at all (thanks to the get_dummies method):

data_dummies = pd.get_dummies(titanic, prefix=['S', 'E'], columns=['Sex', 'Embarked'])

Here is the result :

Some limits

Have you not wondered why you didn’t take the Cabin column or even the Name column? and quite simply because these columns present too many values ​​on their distribution. Indeed, adding dummies columns would risk adding almost as many columns as there are existing rows! (this is especially true for the name) and therefore no interest for the rest of the treatments.

An intermediate solution will require us to dissect the data in order to group them into larger clusters. This will allow us to apply the previous method to it.

For the Name column, why not split (parse) this name and get the last name there. Will the frequency of distribution of surnames make it possible to create a sufficient group? For the Cabin column, why not parse / retrieve the first letter? It is very likely that the first letter corresponds to a deck of the boat for example, right?

Let’s try this with Pandas. Let’s first look at the distribution of this characteristic:


We get 147 possible values ​​on this test set.
Be careful because we have unspecified values ​​(NaN), which must be filled for this parameter, why not add (for the moment a fictitious bridge X)??

cabin = titanic['Cabin'].fillna('X')

Obviously we now have 148 distinct values.

Now let’s take only the first character of the cabin:


t’s much better we only have 9 possible values!

Why not add this new column (Bridge?) And turn it into dummie columns (9) now?

Another concern that you may encounter is when your categorical variable is of numeric type. This happens often in fact and in this case scikit-learn offers you the OneHotEncoder class to handle this scenario.

Let’s apply our work with a logistic regression

First, let’s prepare the data so we don’t forget to provide only numeric data to our logistic regression algorithm (on titanic data):

from sklearn.linear_model import LogisticRegression
lr1 = LogisticRegression()
def Prepare_Modele(X, cabin):
    target = X.Survived
    sexe = pd.get_dummies(X['Sex'], prefix='sex')
    cabin = pd.get_dummies(cabin.str[0], prefix='Cabin')
    age = X['Age'].fillna(X['Age'].mean())
    X = X[['Pclass', 'SibSp']].join(cabin).join(sexe).join(age)
    return X, target
X, y = Prepare_Modele(titanic, cabin), y)
lr1.score(X, y)

We get a score of 81%

The conclusion is that there is no miracle recipe when it comes to dealing with categorical variables. Very often it is here that business knowledge is essential in order to better understand and manage these characteristics.

Share this post

Benoit Cayla

In more than 15 years, I have built-up a solid experience around various integration projects (data & applications). I have, indeed, worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in such sectors like insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers to automate complex business processes in a more efficient way. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: Learning, convincing by the arguments and passing on my knowledge could be my caracteristic triptych.

View all posts by Benoit Cayla →

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Fork me on GitHub