Preparing the datasets


Splitting, or more generally preparing, the datasets in a Machine Learning project is an important step that should not be neglected; otherwise you risk over-estimating your model's performance (over-fitting) or, quite simply, the opposite (under-fitting). Indeed, by nature a model will fit (but hopefully not too closely) its training data.

This step is therefore both a preliminary step and an optimization step that should not be overlooked. In this article we will see how to manage your datasets with Python and Orange.

Training data vs. test data

As we have seen when walking through a Machine Learning project, it is essential to have at least two data sets: one for training the model and the other for its validation. But very often you receive the data in bulk!

Never mind: just cut your dataset in two (say, 30% for the test data, which we set aside, and the rest for training).

The complete dataset can thus be divided into two parts:

  • 70% training data
  • 30% test data
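Before reaching for a library, this 70/30 split can be sketched by hand: shuffle the row indices, then slice. A minimal sketch with synthetic data (the array and sizes here are made up purely for illustration):

```python
import numpy as np

# Toy dataset: 100 rows, 3 features (synthetic, for illustration only)
rng = np.random.default_rng(42)
data = rng.normal(size=(100, 3))

# Shuffle the row indices, then take the first 70% for training
indices = rng.permutation(len(data))
cut = int(len(data) * 0.7)
train_set = data[indices[:cut]]
test_set = data[indices[cut:]]

print(train_set.shape, test_set.shape)  # (70, 3) (30, 3)
```

Shuffling first matters: if the file happens to be sorted (by date, by class, etc.), slicing without shuffling would give two very different subsets.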

But in that case, how do you split your data while maintaining consistency and, above all, representativeness? Doing this by hand is difficult, if not impossible, especially when your dataset grows large. We will see how to overcome this problem through several techniques.

Split your data

Python's Scikit-Learn offers a very practical function for splitting datasets: train_test_split. Here we ask for 33% of the data to be set aside for testing and the rest for training:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Titanic training data
train = pd.read_csv("../datasources/titanic/train.csv")

# Drop the label and the non-numeric columns for this example
X = train.drop(['Survived', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1)

# Fill in the missing values
X['Pclass'] = X['Pclass'].fillna(5)
X['Age'] = X['Age'].fillna(X['Age'].mean())
X['Fare'] = X['Fare'].fillna(X['Fare'].mean())
y = train['Survived']

# Split: 33% for test, the rest for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print("Train=" + str(X_train.shape) + ", Test=" + str(X_test.shape))

Train=(596, 6), Test=(295, 6)

The train_test_split function returns our data (observations = X, labels = y) divided into training and test sets. By adjusting the random_state and shuffle parameters you can even control how rows are (randomly) assigned to one set or the other.
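The representativeness problem raised earlier can be addressed with the stratify parameter of train_test_split, which forces each split to keep the same class proportions as the full dataset. A minimal sketch with synthetic labels (the 80/20 class mix below is made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 80% class 0 and 20% class 1 (illustration only)
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# stratify=y keeps the 80/20 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(np.bincount(y_train))  # [60 15]
print(np.bincount(y_test))   # [20  5]
```

Without stratify, a small or imbalanced test set could end up with almost no examples of the rare class.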

If you use Orange (a free and open-source data-science tool), you only have to use the Data Sampler widget:


Splitting again

While this first division is essential, even vital, it is unfortunately not always enough. To fine-tune the model during the training phase, we will split the training set again, to make sure the model does not stick too closely to the data used.

To optimize training and check the consistency of the model, we split the training dataset several times and train the model on each split in turn (by iteration). We can then check whether the score is consistent across all the splits.

We can perform this cross-validation directly with cross_val_score (using its cv parameter). In the example below we split in two:

from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
score = cross_val_score(clf, X, y, cv=2)

array([0.66591928, 0.73033708])

You can also use the KFold object from Scikit-Learn. In the example below we split into 4 folds:

from sklearn.model_selection import KFold

kf = KFold(n_splits=4, random_state=None, shuffle=False)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index.shape, "TEST:", test_index.shape, "\n")
    print("TRAIN:", train_index, "\n\nTEST:", test_index, "\n")

Then simply pass this split object to the algorithm to see the result:

score = cross_val_score(clf, X, y, cv=kf)

array([0.6367713 , 0.68609865, 0.71748879, 0.72522523])
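To judge consistency at a glance, the fold scores can be summarized by their mean and standard deviation. A small sketch, again using the iris dataset as a stand-in (not the Titanic data from earlier):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
clf = SVC(kernel='linear', C=1)

# 5-fold cross-validation, then a compact mean ± std summary
scores = cross_val_score(clf, X, y, cv=5)
print("%.3f ± %.3f" % (scores.mean(), scores.std()))
```

A high mean with a low standard deviation is what we are after: the model performs well and performs similarly no matter which slice of the data it was trained on.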

As usual, you can find the sources for this article on GitHub.


Benoit Cayla

In more than 15 years, I have built up solid experience around various integration projects (data & applications). I have worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: learning, convincing by argument and passing on my knowledge could be my characteristic triptych.
