Getting started in AutoML with AutoGluon

What is AutoML?

AutoML (or Automated Machine Learning) consists of automating the learning process at the heart of any machine learning solution. AutoML therefore manages and produces, behind the scenes and without the intervention of a Data Scientist, the complete modeling pipeline, from the raw dataset to the final deployable model. This approach is presented as an artificial intelligence solution in its own right, one that simplifies and accelerates the modeling phase. At least, that’s the promise! Let’s see what it delivers…

The idea is therefore to allow non-experts to design and use machine learning models and techniques without having to first become an expert in this field.

The reality is quite different: while these AutoML solutions work reasonably well (as we will see here with AutoGluon), the artist’s touch (that of the Data Scientist) remains essential to prepare the data beforehand and refine them.

However, and for the time being, these solutions at least have the virtue of democratizing machine learning by making it more accessible. So let’s not neglect their educational value; we are only at the beginning, and it seems obvious that this type of solution is shaping up to be the future of AI.

AutoGluon

To get started with AutoML we will use AutoGluon.

AutoGluon is a very recent open-source library published by Amazon. It is true that I could have started with something simpler, such as a more graphical tool like Dataiku (which will certainly be the subject of a later article)… but I wanted to keep the comparison possible with the Python / scikit-learn work done previously (cf. earlier articles).

By going to the AutoGluon website, we quickly understand its usefulness:

  • Quick prototyping of ML models in a few lines of code
  • Automatic hyperparameter tuning, model selection / architecture search and data processing
  • Automatic use of deep learning techniques
  • Improvement of existing models and data pipelines

Moreover, AutoGluon is customizable… so why pass it up?

First of all, let’s install the AutoGluon Python packages (with pip and without GPU support):

pip install --upgrade mxnet
pip install autogluon
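
To check that the installation went through, a quick sanity check is enough (a minimal sketch; the exact version numbers will of course depend on your environment):

# Print the installed versions to confirm that both packages import correctly
import mxnet
import autogluon

print("mxnet:", mxnet.__version__)
print("autogluon:", autogluon.__version__)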

AutoGluon vs Titanic

Let’s start by declaring the libraries:

import autogluon as ag
import pandas as pd
from autogluon import TabularPrediction as task

Then let’s load the Titanic dataset (from Kaggle):

train_data = task.Dataset(file_path="../datasources/titanic/train.csv")
print(train_data.head())
Loaded data from: ../datasources/titanic/train.csv | Columns = 12 / 12 | Rows = 891 -> 891

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

Let’s take a closer look at the dataset (which looks like, but isn’t, a Pandas dataframe):

print("Détail sur la colonne survivant: \n", train_data['Survived'].describe())
Détail sur la colonne survivant: 
 count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

Now let’s train the model directly on the raw data. This is where the appeal of AutoML becomes clear. No need to:

  • Split the dataset
  • Prepare the data
  • Choose a machine learning model
  • … and even less need to tune the hyperparameters!

dir = 'models'
label_col = 'Survived'
predictor = task.fit(train_data=train_data, label=label_col, output_directory=dir)

When we run the previous command, AutoGluon gets to work… the lines scroll by as shown below, and we can see it carry out the steps we previously performed manually:

Beginning AutoGluon training ...
AutoGluon will save models to models/
Train Data Rows:    891
Train Data Columns: 12
Preprocessing data ...
Here are the first 10 unique label values in your data:  [0 1]
AutoGluon infers your prediction problem is: binary  (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])

Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Feature Generator processed 891 data points with 33 features
Original Features:
	int features: 4
	object features: 5
	float features: 2
Generated Features:
	int features: 22
All Features:
	int features: 26
	object features: 5
	float features: 2
	Data preprocessing and feature engineering runtime = 0.34s ...
AutoGluon will gauge predictive performance using evaluation metric: accuracy
To change this, specify the eval_metric argument of fit()
AutoGluon will early stop models using evaluation metric: accuracy
/opt/anaconda3/lib/python3.7/imp.py:342: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  return _load(spec)
Fitting model: RandomForestClassifierGini ...
	0.8268	 = Validation accuracy score
	0.81s	 = Training runtime
	0.12s	 = Validation runtime

...

Fitting model: weighted_ensemble_k0_l1 ...
	0.8547	 = Validation accuracy score
	0.39s	 = Training runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 13.3s ...

What happened during training (with fit())?

We are dealing with a binary classification problem (survived or not). AutoGluon automatically deduces that the appropriate performance metric is accuracy. It analyzes each feature and infers its type (that is, which columns contain continuous numbers versus discrete categories). AutoGluon also handles missing data and the rescaling of feature values. That makes life easier, doesn’t it?
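
Had AutoGluon guessed wrong, the log above tells us we could force both the problem type and the evaluation metric directly in fit(). A hedged sketch, not needed here (and 'f1' is an assumption about the accepted metric names):

# Hypothetical override of the automatic inference (not run for this article)
predictor = task.fit(train_data=train_data, label=label_col,
                     output_directory=dir,
                     problem_type='binary',  # 'binary', 'multiclass' or 'regression'
                     eval_metric='f1')       # assumption: sklearn-style metric names are accepted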

In the example above, we did not specify separate validation data. AutoGluon therefore automatically chooses a random train/validation split of the data. The data used for validation is kept separate from the training data and is used to determine which models and hyperparameter values produce the best results. And yes! The idea is not to use just a single algorithm / model: AutoGluon tests several models and ensembles them to guarantee optimal predictive performance.
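
Note that if you prefer to control the train/validation split yourself, the legacy API appears to let you pass an explicit validation set (treat the tuning_data argument as an assumption for your AutoGluon version):

# Hypothetical explicit split: hold out ~20% of the rows for validation
val_part = train_data.sample(frac=0.2, random_state=42)
train_part = train_data.drop(val_part.index)

# tuning_data is assumed to be supported by this version of task.fit()
predictor = task.fit(train_data=train_part, tuning_data=val_part,
                     label=label_col, output_directory=dir)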

The solution will therefore try different types of models (including deep learning models) and pick the best one, while tuning the hyperparameters best suited to each.
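
To see which models were actually tried and how they ranked, the predictor can summarize its own training run (a minimal sketch; fit_summary() is assumed to be available in this version of the API):

# Print a report of the models trained, their validation scores and training times
results = predictor.fit_summary()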

Now let’s look at the predictions of this model:

predictor = task.load(dir) # Only needed if the model has not already been loaded
y_train = train_data[label_col]
x_train_data = train_data.drop(labels=[label_col],axis=1) 
y_train_pred = predictor.predict(x_train_data)
print("Predictions:  ", y_train_pred)
Predictions:   [0 1 1 1 0 0 0 0 1 1 1 1 0 ...
 0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1
 0 1 0]

And how does it perform in all of this? Easy: just use the evaluate_predictions() method:

perf = predictor.evaluate_predictions(y_true=y_train, y_pred=y_train_pred, auxiliary_metrics=True)
Evaluation: accuracy on test data: 0.961841
Evaluations on test data:
{
    "accuracy": 0.9618406285072951,
    "accuracy_score": 0.9618406285072951,
    "balanced_accuracy_score": 0.954702329594478,
    "matthews_corrcoef": 0.919374313148082,
    "f1_score": 0.9618406285072951
}
Detailed (per-class) classification report:
{
    "0": {
        "precision": 0.9541446208112875,
        "recall": 0.9854280510018215,
        "f1-score": 0.9695340501792115,
        "support": 549
    },
    "1": {
        "precision": 0.9753086419753086,
        "recall": 0.9239766081871345,
        "f1-score": 0.9489489489489489,
        "support": 342
    },
    "accuracy": 0.9618406285072951,
    "macro avg": {
        "precision": 0.9647266313932981,
        "recall": 0.954702329594478,
        "f1-score": 0.9592414995640801,
        "support": 891
    },
    "weighted avg": {
        "precision": 0.9622681844904067,
        "recall": 0.9618406285072951,
        "f1-score": 0.9616326981918379,
        "support": 891
    }
}

We get an accuracy of 96% on the training data, which is not bad, but it does not mean much on its own. To see whether our model really performs well, we will of course confront it with the Kaggle test set and submit the resulting predictions to see our score.

Let’s submit our model to Kaggle

To do this, let’s retrieve the test set from the Kaggle site and run our model’s predictions on it:

test_data = task.Dataset(file_path='../datasources/titanic/test.csv')
print(test_data.head())
predictor = task.load(dir) # Only needed if the model has not already been loaded
y_pred = predictor.predict(test_data)
print("Predictions:  ", y_pred)
Predictions:   [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
 1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0
 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0
 0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0
 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0
 0 1 1 1 1 0 0 1 0 0 0]

Now let’s format the predictions so that Kaggle can evaluate our result:

# Build the submission file expected by Kaggle: one row per passenger with its predicted label
rows = [{'PassengerId': str(test_data['PassengerId'][i]), 'Survived': str(y_pred[i])}
        for i in range(len(test_data))]
final = pd.DataFrame(rows)
final.to_csv("result.csv", columns=["PassengerId", "Survived"], index=False)

… Suspense!

Ouch! After submitting on Kaggle we get a score of 0.55980…

Conclusion

That’s not great (56%), but it proves one thing: these AutoML tools are not magic. True, I could have customized AutoGluon’s behavior, but then what would be left of the automatic side? Beyond the result, I find that this type of tool has at least two virtues. The first is educational, of course; the second is to allow rapid prototyping.
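
For the record, here is the kind of customization I am alluding to, sketched under the assumption that the legacy fit() accepts a hyperparameters dictionary and a time_limits budget (this is not what was run above):

# Hypothetical manual steering of AutoGluon: restrict the model families,
# give each one its own options and cap the total training time.
gbm_options = {'num_boost_round': 200}   # assumed options for the gradient-boosting models
nn_options = {'num_epochs': 20}          # assumed options for the neural-network models

predictor = task.fit(train_data=train_data,
                     label=label_col,
                     output_directory=dir,
                     hyperparameters={'GBM': gbm_options, 'NN': nn_options},
                     time_limits=600)    # stop after 10 minutes (assumed argument name)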

For me it also proves that knowledge of the data (and therefore its preparation) is a crucial step in any machine learning project. Data Scientists, rest assured, your job is not – yet – in danger. On the other hand, it is interesting to note how these tools increasingly hide the technical aspects in favor of better business knowledge.

Benoit Cayla

Over more than 15 years, I have built up solid experience around various integration projects (data & applications). I have worked in nine different companies, successively adopting the point of view of the service provider, the customer and the software vendor. This experience, which made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Truly passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr. Learning, convincing with arguments and passing on my knowledge could be my characteristic triptych.
