Introduction
Have you heard of the latest trendy gradient boosting algorithm: CatBoost? No?
You probably already know XGBoost, and perhaps the good old LightGBM. So it's time to see what this latest addition to the gradient boosting family has to offer (or not) compared to its predecessors.
This algorithm comes straight from the Russian company Yandex (best known as a search engine and web portal; Yandex is even the default search engine for Mozilla Firefox in Russia and Turkey!). It was released to the open-source community in April 2017 and is particularly effective in certain cases.
Installation
Installation in Python (and/or R) is rather simple.
If you are using Anaconda (like me), open a command line and type:
conda config --add channels conda-forge
conda install catboost
pip install ipywidgets
# to enable the Jupyter extensions, also type:
jupyter nbextension enable --py widgetsnbextension
Otherwise, simply use the pip utility:
pip install catboost
pip install ipywidgets
# to enable the Jupyter extensions, also type:
jupyter nbextension enable --py widgetsnbextension
That's it, CatBoost is ready for use: launch your Jupyter notebook and type on the first line:
import catboost as cb
Categorical variables with CatBoost
CatBoost is unique (compared to its competitors) in that you give it the indexes of the categorical columns and it encodes them itself (one-hot encoding is used for columns with few distinct values, a threshold controlled by the one_hot_max_size parameter). If nothing is passed in the cat_features argument, then CatBoost will treat all columns as "regular" numeric fields.
Warning:
- If a column containing string values is not declared in cat_features, CatBoost raises an error.
- Columns of type int are treated as numeric in all cases;
- a column must be explicitly listed in cat_features for the algorithm to treat it as categorical.
Excellent news: to put it simply, the algorithm will take care of encoding the categorical variables for us.
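To make this concrete, here is a minimal sketch (the toy DataFrame and its column names are made up for illustration) showing how cat_features and one_hot_max_size come into play:
import catboost as cb
import pandas as pd

# Hypothetical toy data: 'color' is a string column, 'size' is numeric
df = pd.DataFrame({'color': ['red', 'blue', 'red', 'green'],
                   'size':  [1.0, 2.5, 3.0, 0.5]})
y = [0, 1, 1, 0]

# one_hot_max_size: columns with at most that many distinct values are one-hot encoded
model = cb.CatBoostClassifier(one_hot_max_size=5, iterations=10, verbose=0)
# Column 0 ('color') is declared as categorical; column 1 ('size') stays numeric
model.fit(df, y, cat_features=[0])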
CatBoost vs XGBoost
Now let's compare the two most used gradient boosting algorithms: XGBoost vs CatBoost. For that, I invite you to reread my article on XGBoost, because we are going to use exactly the same use case, which (I remind you) was the Titanic dataset.
As we saw in the previous chapter, we don't have to worry about the one-hot encoding, so we'll leave our categorical variables (Cabin, class, etc.) as they are and let CatBoost take care of them. Our data preparation function is reduced to the following:
import catboost as cb
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import xgboost as xgb
train = pd.read_csv("../titanic/data/train.csv")
test = pd.read_csv("../titanic/data/test.csv")
def dataprep(data):
    # Cabin: keep only the deck letter, 'X' when missing
    cabin = data['Cabin'].fillna('X').str[0]
    # Age: fill missing values with the mean
    age = data['Age'].fillna(data['Age'].mean())
    # Port of embarkation: 'X' when missing
    emb = data['Embarked'].fillna('X').str[0]
    # Ticket price / Careful, one row of the test set has no Fare!
    faresc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Fare']].fillna(0)), columns=['Prix'])
    # Class
    pc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Pclass']]), columns=['Classe'])
    dp = data[['SibSp', 'Sex']].join(pc).join(faresc).join(age).join(cabin).join(emb)
    return dp
Xtrain = dataprep(train)
Xtest = dataprep(test)
y = train.Survived
Note, however, that we still have to handle the NAs (empty or null values), otherwise the training will crash, which is rather normal after all.
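A quick way to check that nothing slipped through (just a pandas sketch, using the Xtrain and Xtest built above):
# Count the remaining missing values per column; every count should be 0
print(Xtrain.isnull().sum())
print(Xtest.isnull().sum())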
Now let's train our model. First of all, we must instantiate CatBoost with its hyper-parameters, but also specify which columns are categorical. That is the role of the cat_features vector (below), which lists the indexes of the columns of this type. Just don't forget to pass this vector to the fit() training function so that it is taken into account.
A dictionary (params) then proposes candidate values for the main hyper-parameters (a grid-search sketch using them follows the next block).
cat_features = [1, 5, 6]  # Sex, Cabin, Embarked

params = {'depth': [4, 7, 10],
          'learning_rate': [0.03, 0.1, 0.15],
          'l2_leaf_reg': [1, 4, 9],
          'iterations': [300]}
cbc = cb.CatBoostClassifier()
cbc.fit(Xtrain, y, cat_features)
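For the record, the params dictionary above is the kind of grid you would feed to a hyper-parameter search. Here is a minimal sketch using scikit-learn's GridSearchCV (an assumption on my part, not part of the run above; CatBoostClassifier also accepts cat_features directly in its constructor):
from sklearn.model_selection import GridSearchCV

# cat_features is given in the constructor so that every cloned estimator keeps it
cbc_grid = cb.CatBoostClassifier(cat_features=cat_features, verbose=0)
search = GridSearchCV(cbc_grid, params, cv=3, scoring='accuracy')
search.fit(Xtrain, y)
print(search.best_params_, search.best_score_)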
The model is now trained; let's look at its result:
p_cbc = cbc.predict(Xtrain)
print ("Score Train -->", round(cbc.score(Xtrain, y) *100,2), " %")
Score Train --> 87.77 %
A score of 87.77 % is more than reasonable for a first try.
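Keep in mind, though, that this score is measured on the training data itself. As a quick sanity check (a sketch, not part of the original run), a cross-validation with scikit-learn gives a less optimistic estimate:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; cat_features is passed in the constructor this time
cv_model = cb.CatBoostClassifier(cat_features=cat_features, verbose=0)
scores = cross_val_score(cv_model, Xtrain, y, cv=5)
print(round(scores.mean() * 100, 2), "% +/-", round(scores.std() * 100, 2))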
Let’s change some settings:
clf = cb.CatBoostClassifier(eval_metric="AUC",
                            depth=10,
                            iterations=500,
                            l2_leaf_reg=9,
                            learning_rate=0.15)
clf.fit(Xtrain, y, cat_features)
print ("Score Train -->", round(clf.score(Xtrain, y) *100,2), " %")
Score Train --> 92.59 %
Hmm, that's better (maybe even too good), because such a score must be validated: beware of overfitting! This is, by the way, something the algorithm can manage itself (via parameters). In any case, this algorithm turns out to be very interesting, provided the input variables are prepared well.
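As a sketch of those parameters (an illustration, not the original code), a common approach is to hold out a validation set and let CatBoost stop early when the validation metric stops improving:
from sklearn.model_selection import train_test_split

Xtr, Xval, ytr, yval = train_test_split(Xtrain, y, test_size=0.2, random_state=42)

clf_es = cb.CatBoostClassifier(eval_metric="AUC", iterations=500,
                               depth=10, l2_leaf_reg=9, learning_rate=0.15)
# Stop if the AUC on the validation set has not improved for 50 iterations
clf_es.fit(Xtr, ytr, cat_features,
           eval_set=(Xval, yval),
           early_stopping_rounds=50,
           verbose=100)
print("Score Validation -->", round(clf_es.score(Xval, yval) * 100, 2), " %")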
As usual, the sources above are available on GitHub.
Conclusion
In a few points:
- CatBoost is a very powerful algorithm, but also slower than its competitors.
- It offers no booster other than trees.
- It can be used for regression or classification tasks.
- This algorithm has several parameters for handling categorical variables (which, as we have seen, it manages automatically).
- It has overfitting detection parameters.