The persistence of machine learning models

Training machine learning models is often a heavy and, above all, extremely time-consuming task. The trained model must therefore be serialized somewhere so that the programs using it do not have to repeat this long operation. This is called persistence, and frameworks such as Scikit-Learn, XGBoost and others support it.

With Scikit-Learn

If you are using Scikit-Learn, nothing could be easier: you only need the dump and load functions from the joblib library, and voilà. Follow the guide…

First of all we will train a simple model (a good old linear regression):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
# Load the dataset and take a quick look at it
data = pd.read_csv("./data/univariate_linear_regression_dataset.csv")
plt.scatter(data.col2, data.col1)
# scikit-learn expects a 2-D feature matrix, hence the reshape
X = data.col2.values.reshape(-1, 1)
y = data.col1.values.reshape(-1, 1)
# Train the linear regression
regr = linear_model.LinearRegression()
regr.fit(X, y)

Then we test the model we have just trained with the fit() method:

regr.predict([[30]])

We get a prediction of 22.37707681.

Now let’s dump our model, that is, save it to a file (here monpremiermodele.modele):

from joblib import dump, load
dump(regr, 'monpremiermodele.modele')

The trained model is now saved in a binary file. We can imagine turning the computer off and back on, for example. We then restore the model with the load() function, passing it the file previously saved on disk:

regr2 = load('monpremiermodele.modele')
regr2.predict([[30]])

If we re-test the prediction with the same value as just after training we get – not by magic – exactly the same result.
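The round trip can also be checked in code. Here is a minimal, self-contained sketch; it assumes synthetic data in place of the CSV file used above (the data and the temporary directory are illustrative choices, not part of the original example):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the CSV used above (hypothetical data: y is roughly 0.5 * x + 7).
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(100, 1))
y = 0.5 * X[:, 0] + 7 + rng.normal(0, 1, size=100)

regr = LinearRegression().fit(X, y)
before = regr.predict([[30]])

# Save to a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "monpremiermodele.modele")
    dump(regr, path)
    regr2 = load(path)

# The reloaded model carries the same coefficients, so the prediction is compared directly.
after = regr2.predict([[30]])
same_prediction = bool(np.array_equal(before, after))
print(same_prediction)
```

Note that load() rebuilds the object from a pickle-based file, so it is best used with the same library versions that produced the file.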

As usual, you will find the full code on GitHub.

With XGBoost

We already saw this in the article on XGBoost, but here is a quick recap. The XGBoost library (in standalone mode) can of course save and reload a model:

# get_booster() is the public accessor for the underlying Booster
boost.get_booster().save_model('titanic.modele')

Loading a saved model:

boost = xgb.Booster({'nthread': 4})
boost.load_model('titanic.modele')

With CatBoost

We did not cover this aspect in the article that presented the CatBoost algorithm. Let’s remedy that, especially since the procedure differs here (in a few details, at least).

To save a CatBoost model:

cb.CatBoost.save_model(clf, 
                       "catboost.modele", 
                       format="cbm", 
                       export_parameters=None, 
                       pool=None)

You will notice that there are many more parameters here, and therefore more options for saving the model (format, export of parameters, training data, etc.). Do not hesitate to consult the documentation for a description of these parameters.

And to reload an existing model (from the file):

from catboost import CatBoostClassifier
clf2 = CatBoostClassifier()
clf2.load_model(fname="catboost.modele", format="cbm")

The nuance here is that the load_model() method is called on the model object (clf2) and not on the CatBoost class.

You can now persist your models and reuse them directly (i.e. without retraining) from your programs or APIs.
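As an illustration of that reuse, here is one possible sketch in which a program loads the persisted model once and then serves several predictions from it. The helper names get_model and predict_one, the file name model.joblib and the toy data are all hypothetical; caching with lru_cache is just one design choice among others.

```python
import os
import tempfile
from functools import lru_cache

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression


@lru_cache(maxsize=None)
def get_model(path):
    """Load the model stored at `path` once; later calls reuse the cached object."""
    return load(path)


def predict_one(path, value):
    """Hypothetical entry point that an API handler could call."""
    return float(get_model(path).predict([[value]])[0])


with tempfile.TemporaryDirectory() as tmp:
    # Persist a tiny model trained on points that lie exactly on y = 2x + 1.
    model_path = os.path.join(tmp, "model.joblib")
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([1.0, 3.0, 5.0, 7.0])
    dump(LinearRegression().fit(X, y), model_path)

    # Two predictions, but the file is read from disk only once.
    first = predict_one(model_path, 10)
    second = predict_one(model_path, 10)
    loaded_once = get_model.cache_info().misses == 1

print(first, second, loaded_once)
```

Loading once at startup (or on first use, as here) avoids paying the deserialization cost on every request.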

Benoit Cayla
