CatBoost!

Have you heard of the latest trendy gradient boosting algorithm, CatBoost? No?

You probably already know XGBoost, and also the good old LightGBM. So it's time to see what this latest addition to the gradient boosting family has to offer (or not) compared to its predecessors.

This algorithm comes straight from the Russian company Yandex (a search engine and web portal; Yandex is even the default search engine for Mozilla Firefox in Russia and Turkey!). It was released to the open-source community in April 2017 and is particularly effective in certain cases.


Installation in Python (and/or R) is rather simple.

If you are using Anaconda (like me), open a command line and type:

conda config --add channels conda-forge
conda install catboost
pip install ipywidgets
# to enable the Jupyter extensions, also type:
jupyter nbextension enable --py widgetsnbextension

Or, just using the pip utility:

pip install catboost
pip install ipywidgets
# to enable the Jupyter extensions, also type:
jupyter nbextension enable --py widgetsnbextension

CatBoost is now ready to use: launch your Jupyter notebook and type on the first line:

import catboost as cb

Categorical variables with CatBoost

What makes CatBoost unique (compared to its competitors) is that you give it the indexes of the categorical columns, and it encodes them itself, using one-hot encoding for columns whose cardinality does not exceed the one_hot_max_size parameter. If nothing is passed in the cat_features argument, then CatBoost treats all columns as "regular" numeric fields.

Warning :

  • If a column containing string values is not listed in cat_features, CatBoost raises an error.
  • Columns of type int are treated as numeric in all cases;
  • a column must be explicitly listed in cat_features for the algorithm to treat it as categorical.

Excellent news: to put it simply, the algorithm takes care of the one-hot encoding for us.

CatBoost vs XGBoost

Now let's compare the two most widely used gradient boosting algorithms: XGBoost vs CatBoost. For that, I invite you to reread my article on XGBoost, because we are going to use exactly the same use case, which, as a reminder, was the Titanic dataset.

As we saw in the previous section, we don't have to worry about one-hot encoding, so we'll leave our categorical variables (Cabin, class, etc.) as they are and let CatBoost take care of them. Our data preparation function is therefore reduced to the following:

import catboost as cb
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

train = pd.read_csv("../titanic/data/train.csv")
test = pd.read_csv("../titanic/data/test.csv")

def dataprep(data):
    # Cabin: keep only the deck letter, 'X' when missing
    cabin = data['Cabin'].fillna('X').str[0]
    # Age: fill missing values with the mean
    age = data['Age'].fillna(data['Age'].mean())
    # Port of embarkation, 'X' when missing
    emb = data['Embarked'].fillna('X').str[0]
    # Ticket fare / careful: one test record has no fare!
    faresc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Fare']].fillna(0)), columns=['Prix'])
    # Passenger class
    pc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Pclass']]), columns=['Classe'])
    dp = data[['SibSp', 'Sex']].join(pc).join(faresc).join(age).join(cabin).join(emb)
    return dp

Xtrain = dataprep(train)
Xtest = dataprep(test)
y = train.Survived

Note, however, that we still have to handle the NAs (empty or null values), otherwise training will crash, which is rather normal after all.

Now let's train our model. First of all, we must instantiate CatBoost, giving it its hyper-parameters but also specifying which columns are categorical. That is the role of the cat_features vector (below), which lists the indexes of the columns of this type. Just don't forget to pass this vector to the fit() training function so it is taken into account.

A dict (params) then lists candidate values for the hyper-parameters (for example for a grid search).

cat_features = [1, 5, 6]
params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cbc = cb.CatBoostClassifier()
cbc.fit(Xtrain, y, cat_features)

The model is now trained, let’s look at its result:

p_cbc = cbc.predict(Xtrain)
print("Score Train -->", round(cbc.score(Xtrain, y) * 100, 2), " %")
Score Train --> 87.77  %

A score of 87.7% is more than reasonable for a first try.

Let’s change some settings:

clf = cb.CatBoostClassifier(eval_metric="AUC",
                            iterations=500,
                            l2_leaf_reg=9,
                            learning_rate=0.15)
clf.fit(Xtrain, y, cat_features)
print("Score Train -->", round(clf.score(Xtrain, y) * 100, 2), " %")
Score Train --> 92.59  %

Hmm, that's better (maybe too good, for that matter), because such a score must be validated: beware of overfitting! That is something, by the way, the algorithm can manage itself (via parameters). In any case, this algorithm turns out to be very interesting, provided the input variables are adjusted well.

As usual the above sources are available on GitHub .


In a few points:

  • CatBoost is a very powerful algorithm, but also slower than its competitors.
  • Trees are its only available booster.
  • It can be used for regression or classification tasks.
  • It has several parameters for handling categorical variables (which it manages automatically, as we have seen).
  • It has overfitting-detection parameters.

Benoit Cayla

In more than 15 years, I have built up solid experience around various integration projects (data & applications). I have worked in nine different companies, successively adopting the viewpoints of the service provider, the customer, and the software editor. This experience, which made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors like insurance and finance. Truly passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: learning, convincing with arguments and passing on my knowledge could be my characteristic triptych.
