What is Auto-ML?
AutoML (Automated Machine Learning) automates the learning process at the heart of any machine learning solution. It manages and produces, behind the scenes and without the intervention of a Data Scientist, the complete modeling pipeline from the raw dataset to the final deployable model. The approach is itself presented as an artificial intelligence solution that simplifies and accelerates the modeling phase. At least that’s the promise! Let’s see how it holds up …
The idea is therefore to allow non-experts to design and use machine learning models and techniques without having to first become an expert in this field.
The reality is quite different: although these AutoML solutions work reasonably well (as we will see here with AutoGluon), the artist’s touch (the Data Scientist’s) remains essential to prepare and refine the data beforehand.
However, at present these solutions at least have the virtue of democratizing machine learning by making it more accessible. So let’s not neglect their educational value. We are also only at the beginning, and it seems clear that this type of solution is shaping up to be the future of AI.
AutoGluon
To get started with AutoML, we will use AutoGluon.
AutoGluon is a very recent open-source library published by Amazon. Admittedly, I could have started with something simpler, using a more graphical tool like Dataiku (which will certainly be the subject of a later article), but I wanted to keep the comparison possible with the Python / scikit-learn work done previously (see the earlier articles).
By going to the AutoGluon website, we quickly understand its usefulness:
- Quick prototyping of ML models in a few lines
- Take advantage of automatic hyperparameter adjustment, model selection / architecture search and data processing.
- Automatic use of Deep Learning techniques.
- Improvement of existing data models and pipelines
Moreover, AutoGluon is customizable, so why go without it?
First of all, let’s install the AutoGluon Python libraries (with pip and without GPU):
pip install --upgrade mxnet
pip install autogluon
AutoGluon vs Titanic
Let’s start by declaring the libraries:
import autogluon as ag
import pandas as pd
from autogluon import TabularPrediction as task
Then let’s load the Titanic (Kaggle) dataset:
train_data = task.Dataset(file_path="../datasources/titanic/train.csv")
print(train_data.head())
Loaded data from: ../datasources/titanic/train.csv | Columns = 12 / 12 | Rows = 891 -> 891
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
Let’s take a closer look at the dataset (an AutoGluon Dataset, which behaves like a Pandas DataFrame):
print("Details of the Survived column: \n", train_data['Survived'].describe())
Details of the Survived column:
count 891.000000
mean 0.383838
std 0.486592
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 1.000000
Name: Survived, dtype: float64
Now let’s train the model directly on the raw data. This is where AutoML shows its value. No need to:
- Split the dataset
- Prepare the data
- Choose a machine learning model
- … And even less need to tune the hyperparameters!
dir = 'models'
label_col = 'Survived'
predictor = task.fit(train_data=train_data, label=label_col, output_directory=dir)
When we run the previous command, AutoGluon goes to work and the log lines scroll by as shown below. We can see it carry out the steps that we previously performed by hand:
Beginning AutoGluon training ...
AutoGluon will save models to models/
Train Data Rows: 891
Train Data Columns: 12
Preprocessing data ...
Here are the first 10 unique label values in your data: [0 1]
AutoGluon infers your prediction problem is: binary (because only two unique label-values observed)
If this is wrong, please specify `problem_type` argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = 1, class 0 = 0
Feature Generator processed 891 data points with 33 features
Original Features:
int features: 4
object features: 5
float features: 2
Generated Features:
int features: 22
All Features:
int features: 26
object features: 5
float features: 2
Data preprocessing and feature engineering runtime = 0.34s ...
AutoGluon will gauge predictive performance using evaluation metric: accuracy
To change this, specify the eval_metric argument of fit()
AutoGluon will early stop models using evaluation metric: accuracy
/opt/anaconda3/lib/python3.7/imp.py:342: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
return _load(spec)
Fitting model: RandomForestClassifierGini ...
0.8268 = Validation accuracy score
0.81s = Training runtime
0.12s = Validation runtime
...
Fitting model: weighted_ensemble_k0_l1 ...
0.8547 = Validation accuracy score
0.39s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 13.3s ...
What happened during training (with fit())?
This is a binary classification problem (survived or not), which AutoGluon infers from the label column. It also selects accuracy as the default evaluation metric, as the log shows. It then analyzes the columns to infer the type of each feature (that is, which columns contain continuous numbers and which contain discrete categories). AutoGluon also handles missing data and the scaling of feature values. That makes life easier, doesn’t it?
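To make the inference step less mysterious, here is a minimal sketch of how a problem type could be guessed from the label column. It is not AutoGluon’s actual code: the function name `infer_problem_type` and the `max_classes` threshold are assumptions made for illustration, and only the "two unique label values means binary" rule comes from the training log.

```python
def infer_problem_type(labels, max_classes=20):
    """Guess the prediction problem from the label values (illustrative heuristic)."""
    unique_values = set(labels)
    if len(unique_values) == 2:
        # Same rule the training log reports: two distinct labels -> binary.
        return "binary"
    has_fractional_value = any(
        isinstance(value, float) and not value.is_integer()
        for value in unique_values
    )
    if has_fractional_value or len(unique_values) > max_classes:
        # Continuous or very high-cardinality targets are treated as regression
        # (assumed threshold, chosen only for this sketch).
        return "regression"
    return "multiclass"


print(infer_problem_type([0, 1, 1, 1, 0]))                # first five 'Survived' values
print(infer_problem_type([7.25, 71.2833, 7.925, 53.1]))   # first four 'Fare' values
print(infer_problem_type(["S", "C", "Q", "S"]))           # 'Embarked'-style categories
```

The same kind of rule of thumb can be applied column by column to separate numeric features from categorical ones, which is roughly what the "int features / object features / float features" lines of the log summarize.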
In the example above, we did not specify separate validation data, so AutoGluon automatically sets aside a random part of the training data for validation. This validation data is kept separate from the data used for fitting and serves to determine which models and hyperparameter values produce the best results. And yes, the idea is not to rely on a single algorithm or model: AutoGluon tests several models and combines them into an ensemble to improve predictive performance.
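The two ideas in this paragraph, a random hold-out split and a weighted ensemble, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 20% validation fraction, the function names, the probabilities, and the weights are all hypothetical, and AutoGluon chooses its own split size and learns its own ensemble weights.

```python
import random


def holdout_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle the rows and set aside a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def weighted_ensemble(probabilities_per_model, weights):
    """Blend the positive-class probabilities of several models, then threshold at 0.5."""
    total_weight = sum(weights)
    predictions = []
    for row_probabilities in zip(*probabilities_per_model):
        blended = sum(w * p for w, p in zip(weights, row_probabilities)) / total_weight
        predictions.append(1 if blended >= 0.5 else 0)
    return predictions


# 891 row indices, the size of the Titanic training set.
train_rows, validation_rows = holdout_split(range(891))
print(len(train_rows), len(validation_rows))

# Hypothetical survival probabilities from two models for three passengers.
model_a = [0.9, 0.4, 0.2]
model_b = [0.7, 0.7, 0.1]
print(weighted_ensemble([model_a, model_b], weights=[1.0, 3.0]))
```

Giving the second model three times the weight lets it overrule the first on the second passenger, which is the whole point of weighting: a model that validates better gets more say in the final prediction.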
The solution therefore tries different types of models (including deep learning models) and selects the best, while tuning the hyperparameters of each one.
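Stripped of everything else, model selection is a loop: score each candidate on the validation data and keep the winner. The sketch below shows that loop with two hand-written rules standing in for real models. The rule names are hypothetical, and the five validation rows are simply the first five passengers printed earlier, so the scores say nothing about the full dataset.

```python
def accuracy(predict, rows):
    """Fraction of rows whose label the given rule predicts correctly."""
    hits = sum(1 for features, label in rows if predict(features) == label)
    return hits / len(rows)


# Two hypothetical candidate "models", written as plain rules.
candidates = {
    "nobody_survives": lambda features: 0,
    "women_survive": lambda features: 1 if features["Sex"] == "female" else 0,
}

# (features, Survived) pairs taken from the first five rows of the training set.
validation_rows = [
    ({"Sex": "male"}, 0),
    ({"Sex": "female"}, 1),
    ({"Sex": "female"}, 1),
    ({"Sex": "female"}, 1),
    ({"Sex": "male"}, 0),
]

# Score every candidate on the same validation rows and keep the best one.
scores = {name: accuracy(rule, validation_rows) for name, rule in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print("selected:", best_name)
```

An AutoML tool runs the same comparison over real model families and many hyperparameter settings, which is why the log lists one validation score per fitted model.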
Now let’s look at the predictions of this model:
predictor = task.load(dir) # Only needed if the model has not already been loaded
y_train = train_data[label_col]
x_train_data = train_data.drop(labels=[label_col],axis=1)
y_train_pred = predictor.predict(x_train_data)
print("Predictions: ", y_train_pred)
Predictions: [0 1 1 1 0 0 0 0 1 1 1 1 0 ...
0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1
0 1 0]
And how does it perform? Easy: use the evaluate_predictions() method:
perf = predictor.evaluate_predictions(y_true=y_train, y_pred=y_train_pred, auxiliary_metrics=True)
Evaluation: accuracy on test data: 0.961841
Evaluations on test data:
{
"accuracy": 0.9618406285072951,
"accuracy_score": 0.9618406285072951,
"balanced_accuracy_score": 0.954702329594478,
"matthews_corrcoef": 0.919374313148082,
"f1_score": 0.9618406285072951
}
Detailed (per-class) classification report:
{
"0": {
"precision": 0.9541446208112875,
"recall": 0.9854280510018215,
"f1-score": 0.9695340501792115,
"support": 549
},
"1": {
"precision": 0.9753086419753086,
"recall": 0.9239766081871345,
"f1-score": 0.9489489489489489,
"support": 342
},
"accuracy": 0.9618406285072951,
"macro avg": {
"precision": 0.9647266313932981,
"recall": 0.954702329594478,
"f1-score": 0.9592414995640801,
"support": 891
},
"weighted avg": {
"precision": 0.9622681844904067,
"recall": 0.9618406285072951,
"f1-score": 0.9616326981918379,
"support": 891
}
}
We reach an accuracy of 96% on the training data, which looks good but does not mean much, since the model has already seen this data. To find out whether the model really performs well, we will of course run it on the Kaggle test set and submit the predictions to see our score.
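For readers who want to see where the figures in the report come from, here is a short worked example. The four confusion-matrix counts are not printed by AutoGluon: they are derived here from the per-class recall and support values in the report (for instance 0.92398 × 342 ≈ 316 survivors correctly predicted), so treat them as a reconstruction.

```python
import math

# Confusion-matrix counts implied by the report (positive class = survived).
tp, fn = 316, 26   # survivors predicted correctly / missed (support 342)
tn, fp = 541, 8    # non-survivors predicted correctly / wrongly flagged (support 549)

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_survived = tp / (tp + fn)
recall_died = tn / (tn + fp)
# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_survived + recall_died) / 2
precision_survived = tp / (tp + fp)
# Matthews correlation coefficient for a binary confusion matrix.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy           = {accuracy:.6f}")
print(f"balanced accuracy  = {balanced_accuracy:.6f}")
print(f"precision (class 1) = {precision_survived:.6f}")
print(f"matthews corrcoef  = {mcc:.6f}")
```

These values reproduce the accuracy, balanced accuracy, class-1 precision, and Matthews coefficient shown in the report, which confirms that only 34 of the 891 training passengers are misclassified.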
Let’s submit our model to Kaggle
To do this, we will retrieve the test set from the Kaggle site and use the prediction of our model:
test_data = task.Dataset(file_path='../datasources/titanic/test.csv')
print(test_data.head())
predictor = task.load(dir) # Only needed if the model has not already been loaded
y_pred = predictor.predict(test_data)
print("Predictions: ", y_pred)
Predictions: [0 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1
1 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 0 0
1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
1 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0
0 0 1 0 0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1
0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0
0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0
0 1 1 1 1 0 0 1 0 0 0]
Now let’s format the predictions so that Kaggle can score our result:
# Build the submission in one step (assumes y_pred holds one prediction per row of test_data, in the same order)
final = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": y_pred})
final.to_csv("result.csv", columns=["PassengerId", "Survived"], index=False)
… Suspense!
Ouch! After submission on Kaggle we get a score of 0.55980 …
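Before blaming the model, one cheap thing to rule out is a malformed submission file. The helper below is a hypothetical sanity check, not part of AutoGluon or Kaggle: it only verifies the header, the row count, and that every label is a plain 0 or 1. It does not explain the score by itself.

```python
import csv
import io


def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a Titanic-style submission file."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
        return problems
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Data starts on line 2 of the file, right after the header.
    for line_number, row in enumerate(rows, start=2):
        if not (row["PassengerId"] or "").isdigit():
            problems.append(f"line {line_number}: PassengerId is not an integer")
        if row["Survived"] not in {"0", "1"}:
            problems.append(f"line {line_number}: Survived must be 0 or 1")
    return problems


# Hypothetical three-row file; the second data row carries a float-formatted label.
sample = "PassengerId,Survived\n892,0\n893,1.0\n894,0\n"
print(check_submission(sample, expected_rows=3))
```

On the real file, the call would look like check_submission(open("result.csv").read(), expected_rows=len(test_data)); an empty list means the format is fine and the explanation has to be sought in the model or the data.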
Conclusion
It’s not great (56%), but it proves one thing: these AutoML tools are not magic. True, I could have customized AutoGluon’s behavior, but then what would be left of the automatic side? Beyond the result, I find that this type of tool has at least two virtues: the first is educational, of course, and the second is that it allows rapid prototyping.
For me it also proves that knowledge of the data (and therefore its preparation) is a crucial step in any Machine Learning project. Data Scientists, rest assured: your job is not in danger, at least not yet. On the other hand, it is interesting to note how these tools increasingly hide the technical aspects in favor of better business knowledge.