Titanic: Let’s go further! (Part 2)

To follow up on my previous article on predicting Titanic survivors, I think it is worth illustrating a few more techniques to take the modeling further. This article is therefore dedicated to working on the features we are given. As usual for this type of Machine Learning project, I will be using Python, Scikit-learn and Jupyter.

Preliminary work

  • Declare the Python libraries that we are going to use (Pandas, re, scikit-learn, etc.)
  • Read the training and test files into Pandas DataFrames
  • Create a full DataFrame that concatenates the two previous datasets.

import pandas as pd
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv("./data/train.csv")
test = pd.read_csv("./data/test.csv")
full = pd.concat([train, test], ignore_index=True) # Combine both datasets under a single 0..n-1 index

Feature engineering

The objective of this post is to work on the features and not on the algorithms themselves. To that end, we are going to rework some of the data which, as it stands, is barely usable or not usable at all.

Ticket & Price: Did you notice that the listed price is not the unit price per person, and that the Ticket value is not a unique key? We will have to correct this by computing the unit price per person wherever the fare (Fare) covers a grouped ticket.

Passenger without a price: One passenger has no fare (Fare). This is a gap in the data, and we will have to assign a value in order to keep this record.

The last name: This information is not provided directly, so we will have to parse the Name string to retrieve it. Then we will group the data (train & test) by last name and count the number of people per family. With this in mind, it will be interesting to one-hot encode the families with at least 2 or 3 members.

The title: Like the last name, this information must be retrieved from the Name column. It will then be interesting to regroup these titles, for example into 3 final categories: women & children, VIPs and other adults.

Age: Age is useful information in itself, but deriving age categories from it is even more so.

We will stop there for this article, but there are still many other areas for improvement.

Ticket & Price

First let’s look at the Ticket feature. Let’s check that all passengers have a ticket:

noticket = []
tickets = full['Ticket'].fillna('X') # fillna returns a new Series, so the result must be kept
for ticketnn in tickets:
    if (ticketnn == 'X'):
        noticket.append(1)
    else:
        noticket.append(0)
pd.DataFrame(noticket)[0].value_counts()

Good news, all passengers have this information!
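
The same check can also be written without an explicit loop. Here is a minimal sketch, assuming a small made-up DataFrame (toy, not part of the Titanic files) in place of full:

```python
import pandas as pd

# Hypothetical toy data standing in for `full`; the last ticket is missing
toy = pd.DataFrame({"Ticket": ["PC 17608", "113503", None]})

# isna() flags the missing values, sum() counts the flags that are True
missing_tickets = int(toy["Ticket"].isna().sum())
print(missing_tickets)
```

A result of 0 on the real data would confirm that every passenger has a ticket.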

Now let's look at the distinct values of these tickets:

test['Ticket'].value_counts().head()

PC 17608      5
113503        4
CA. 2343      4
C.A. 31029    3
347077        3
Name: Ticket, dtype: int64

Interestingly, the values are not unique as we might have expected. The reason is that there are group tickets (several people travelling on the same ticket). That changes a lot of things, because if a ticket can be shared, the price shown is the price for the whole group.

We will therefore have to divide the price of the Ticket by the number of people with the same ticket!

Calculation of the ticket unit price

To do this we are going to use Pandas' ability to perform (left) joins between DataFrames. First we build a DataFrame, TicketCounts, that lists each ticket with its number of occurrences. Then we do a left join between the dataset and this new DataFrame. All that remains is to add a unit-price column (PrixUnitaire) that divides the total price by the number of people on the ticket. Be careful to apply fillna() to the ticket count, so that a ticket without a match counts as a single passenger.

# Build a DataFrame (TicketCounts) listing each ticket with its number of occurrences
# (no head() here: every ticket must be counted, not only the five most frequent)
TicketCounts = test['Ticket'].value_counts().rename('TicketCount').rename_axis('Ticket').reset_index()

# Bring the counts back into the test DataFrame (left join on Ticket)
fin = pd.merge(test, TicketCounts, how='left', on='Ticket')
fin['PrixUnitaire'] = fin['Fare'] / fin['TicketCount'].fillna(1)
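
As an alternative to the join, the per-ticket count can be computed directly on the rows with a groupby. This is only a sketch on made-up data (the toy DataFrame and its values are assumptions for illustration), not the code used for the results in this post:

```python
import pandas as pd

# Hypothetical toy data: ticket "A1" is shared by two passengers
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2"],
    "Fare": [20.0, 20.0, 7.5],
})

# For every row, count how many rows carry the same ticket
toy["TicketCount"] = toy.groupby("Ticket")["Ticket"].transform("size")
# Divide the grouped fare by the number of passengers on the ticket
toy["PrixUnitaire"] = toy["Fare"] / toy["TicketCount"]
print(toy)
```

Because transform returns one value per original row, no merge and no fillna are needed in this variant.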

Passenger without Price!

Be careful, because we also have a passenger who does not have a fare. Let's take a look at who it is:

import numpy as np
test.loc[np.isnan(test['Fare'])]

This is a 3rd class passenger, so let’s calculate the average price for this type of ticket:

test.loc[test['Pclass'] == 3]['Fare'].mean()
12.459677880184334

We will assign this price to this passenger.
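
The assignment itself can be sketched as follows. The toy DataFrame below is made up for illustration; on the real data the same two lines would apply to test:

```python
import pandas as pd

# Hypothetical toy data: one 3rd-class passenger has no fare
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1],
    "Fare": [8.0, 12.0, None, 80.0],
})

# Mean fare of the 3rd-class passengers (mean() skips missing values)
third_class_mean = toy.loc[toy["Pclass"] == 3, "Fare"].mean()
# Fill the missing fare; here the only missing fare belongs to a 3rd-class passenger
toy["Fare"] = toy["Fare"].fillna(third_class_mean)
print(toy)
```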

The last name

The last name is not immediately usable. It must be extracted from the Name feature, which also contains other information such as the title. Let's use a regular expression for that:

familynames = []
for noms in full["Name"]:
    # Name looks like "Last name, Title. First names"; group(1) is the last name
    familynames.append(re.search(r'([A-Za-z0-9]*), ([A-Za-z0-9 ]*)\. (.*)', noms).group(1))
pdfamilynames = pd.DataFrame(familynames, columns = ['familynames'])
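
For comparison, Pandas can apply a regular expression to a whole column at once with str.extract. The sketch below uses two sample strings in the format of the Name column and a slightly different pattern (an assumption of mine, not the pattern above) that keeps everything before the comma as the last name:

```python
import pandas as pd

# Two sample strings in the "Last name, Title. First names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "O'Brien, Mr. Thomas",
])

# Group 1: everything before the first comma (last name)
# Group 2: everything between the comma and the first period (title)
parts = names.str.extract(r"^([^,]+),\s*([^.]+)\.")
parts.columns = ["NomFamille", "Titre"]
print(parts)
```

With this pattern a last name containing an apostrophe or a space is kept whole, whereas a character class limited to letters and digits only captures its final part.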

The idea now is to one-hot encode the last name. It might sound a little crazy, but we have little data and some last names appear in both datasets.
We will first create a DataFrame with the last names that appear 2 or more times:

# Build the list of last names that occur at least twice
famsurv = full.reset_index(drop=True).join(pdfamilynames) # align both on a 0..n-1 index
famCount = famsurv['familynames'].value_counts()
pdfamCounts = famCount.rename('famCount').rename_axis('familynames').reset_index()
# Keep the filtered result (the filter on its own would only display it)
pdfamCounts = pdfamCounts[pdfamCounts['famCount'] >= 2]
pdfamCounts

This DataFrame can then be used in a function that adds the dummy (one-hot) columns:

# Function that adds the last-name columns to a DataFrame
def addColumnFamilyName(data):
    # add one column per last name, initialised to 0
    for family in pdfamCounts['familynames']:
        data[family] = 0
    # extract the last name of each passenger from the Name column
    for idx, f in enumerate(data["Name"]):
        # set the dummy column matching this last name to 1 (assumes a default 0..n-1 index)
        iNom = re.search(r'([A-Za-z0-9]*), ([A-Za-z0-9 ]*)\. (.*)', f).group(1)
        for col in data.columns:
            if (col == iNom):
                data.loc[idx, col] = 1

We will use this function when preparing the data (later).
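
To make the idea concrete, here is a compact sketch of the same encoding built with get_dummies. The last names and the NomFamille column are made up for the example; it is an illustration of the technique, not a replacement for the function above:

```python
import pandas as pd

# Hypothetical last names: "Jones" appears only once
toy = pd.DataFrame({"NomFamille": ["Smith", "Smith", "Jones", "Brown", "Brown"]})

# Keep only the last names that occur at least twice
counts = toy["NomFamille"].value_counts()
frequent = counts[counts >= 2].index

# where() turns the rare names into NaN; get_dummies ignores NaN by default
kept = toy["NomFamille"].where(toy["NomFamille"].isin(frequent))
dummies = pd.get_dummies(kept, dtype=int)
print(dummies)
```

A passenger whose last name is too rare simply gets 0 in every last-name column.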

The title

In the same way as for the last name, we have to extract the title by parsing the Name feature. Let's look at the titles across the whole dataset (full):

full['Titre'] = full.Name.map(lambda x : x.split(",")[1].split(".")[0])
full['NomFamille'] = full.Name.map(lambda x : x.split(",")[0])
titre = pd.DataFrame(full['Titre'])
full['Titre'].value_counts() # shows every title present in the data

Here are the possibilities that we will deal with:


 Mr              757
 Miss            260
 Mrs             197
 Master           61
 Dr                8
 Rev               8
 Col               4
 Ms                2
 Mlle              2
 Major             2
 Mme               1
 Lady              1
 Capt              1
 Don               1
 Jonkheer          1
 Sir               1
 the Countess      1
 Dona              1
Name: Titre, dtype: int64

For the titles we will create categories that we will then one-hot encode. In principle the "women and children first" instruction was to be followed, but in my opinion passengers of high rank were also privileged. So let's create 3 categories: women and children, VIPs and others:

X = test
X['Rang'] = 'Autres' # default category; a string, so the column is not created as integers
X['Titre'] = X.Name.map(lambda x : x.split(",")[1].split(".")[0])
vip = ['Don','Sir', 'Major', 'Col', 'Jonkheer', 'Dr', 'Rev']
femmeenfant = ['Miss', 'Mrs', 'Lady', 'Mlle', 'the Countess', 'Ms', 'Mme', 'Dona', 'Master']
for idx, titre in enumerate(X['Titre']):
    # strip() removes the leading space left by the split on ","
    if (titre.strip() in femmeenfant) :
        X.loc[idx, 'Rang'] = 'FE'
    elif (titre.strip() in vip) :
        X.loc[idx, 'Rang'] = 'VIP'
    else :
        X.loc[idx, 'Rang'] = 'Autres'
X['Rang'].value_counts()
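
The same categorisation can be expressed as a lookup table applied with map. A minimal sketch on four sample titles (the titles Series and the title_to_rang name are assumptions for the example; the two lists are the ones used above):

```python
import pandas as pd

# Sample titles, with the leading space left by the split on ","
titles = pd.Series([" Mr", " Mrs", " Col", " Master"])

vip = ["Don", "Sir", "Major", "Col", "Jonkheer", "Dr", "Rev"]
femmeenfant = ["Miss", "Mrs", "Lady", "Mlle", "the Countess", "Ms", "Mme", "Dona", "Master"]

# One lookup table: title -> category
title_to_rang = {t: "VIP" for t in vip}
title_to_rang.update({t: "FE" for t in femmeenfant})

# Titles absent from the table map to NaN, which fillna() turns into the default category
rang = titles.str.strip().map(title_to_rang).fillna("Autres")
print(rang.value_counts())
```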

Age

Here too, we are going to create several age categories according to the Age variable:

  • Babies: 0 to 3 years old
  • Children: 3 to 15 years old
  • Adults: 15 to 60 years old
  • The “old”: over 60 years old

age = X['Age'].fillna(X['Age'].mean())
catAge = []
for i in range(X.shape[0]) :
    if age[i] <= 3:
        catAge.append("bebe")
    elif age[i] > 3 and age[i] <= 15:
        catAge.append("enfant")
    elif age[i] > 15 and age[i] <= 60:
        catAge.append("adulte")
    else:
        catAge.append("vieux")
print(pd.DataFrame(catAge, columns = ['catAge'])['catAge'].value_counts())
cat = pd.get_dummies(pd.DataFrame(catAge, columns = ['catAge']), prefix='catAge')
cat.head(3)

Let’s take a look at the result:

adulte    373
enfant     21
vieux      14
bebe       10
Name: catAge, dtype: int64
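
Pandas also offers pd.cut for this kind of binning. The sketch below reproduces the same four categories on a handful of made-up ages (the ages Series is an assumption for the example), with bins closed on the right so that 3, 15 and 60 fall in the lower category, as in the loop above:

```python
import pandas as pd

# Hypothetical ages covering every category and every boundary
ages = pd.Series([1, 3, 10, 15, 40, 60, 75])

# Right-closed bins: (-inf, 3], (3, 15], (15, 60], (60, +inf)
catAge = pd.cut(
    ages,
    bins=[-float("inf"), 3, 15, 60, float("inf")],
    labels=["bebe", "enfant", "adulte", "vieux"],
)
print(catAge.value_counts())
```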

Global preparation / model function

Now let’s put all these elements together in a preparation function:

def dataprep(data):
    # Sex
    sexe = pd.get_dummies(data['Sex'], prefix='sex')

    # Cabin: keep the deck letter (deck T is merged with the nearby deck A)
    cabin = pd.get_dummies(data['Cabin'].fillna('X').str[0].replace('T', 'A'), prefix='Cabin')

    # Age and age categories
    age = data['Age'].fillna(data['Age'].mean())
    catAge = []
    for i in range(data.shape[0]) :
        if age[i] < 3:
            catAge.append("bebe")
        elif age[i] >= 3 and age[i] < 15:
            catAge.append("enfant")
        elif age[i] >= 15 and age[i] < 60:
            catAge.append("adulte")
        else:
            catAge.append("vieux")
    catage = pd.get_dummies(pd.DataFrame(catAge, columns = ['catAge']), prefix='catAge')

    # Title and rank
    data['Titre'] = data.Name.map(lambda x : x.split(",")[1].split(".")[0]).fillna('X')
    data['Rang'] = 'Autres' # default category; a string, so the column is not created as integers
    vip = ['Don','Sir', 'Major', 'Col', 'Jonkheer', 'Dr']
    femmeenfant = ['Miss', 'Mrs', 'Lady', 'Mlle', 'the Countess', 'Ms', 'Mme', 'Dona', 'Master']
    for idx, titre in enumerate(data['Titre']):
        if (titre.strip() in femmeenfant) :
            data.loc[idx, 'Rang'] = 'FE'
        elif (titre.strip() in vip) :
            data.loc[idx, 'Rang'] = 'VIP'
        else :
            data.loc[idx, 'Rang'] = 'Autres'
    rg = pd.get_dummies(data['Rang'], prefix='Rang')

    # Port of embarkation
    emb = pd.get_dummies(data['Embarked'], prefix='emb')

    # Unit price: build a DataFrame (TicketCounts) listing each ticket with its number of occurrences
    TicketCounts = data['Ticket'].value_counts().rename('TicketCount').rename_axis('Ticket').reset_index()
    # Bring the counts back into the dataset (left join on Ticket)
    fin = pd.merge(data, TicketCounts, how='left', on='Ticket')
    fin['PrixUnitaire'] = fin['Fare'] / fin['TicketCount'].fillna(1)
    prxunit = pd.DataFrame(fin['PrixUnitaire'])
    # Mean 3rd-class fare (for the 3rd-class passenger who has no fare) ... this could have been a function ;-)
    prx3eme = data.loc[data['Pclass'] == 3]['Fare'].mean()
    prxunit = prxunit['PrixUnitaire'].fillna(prx3eme)

    # Class
    pc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Pclass']]), columns = ['Classe'])

    dp = data[['SibSp', 'Parch', 'Name']].join(pc).join(sexe).join(emb).join(prxunit).join(cabin).join(age).join(catage).join(rg)
    addColumnFamilyName(dp)
    del dp['Name']

    return dp
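
One point to keep in mind with a preparation function of this kind: get_dummies only creates a column for the values it actually sees, so the training and test matrices can end up with different columns. A possible safeguard, sketched here on made-up data (train_raw, test_raw, Xtr and Xte are hypothetical names), is to align the test columns on the training columns:

```python
import pandas as pd

# Hypothetical data: the value "Q" never appears in the test set
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"]})
test_raw = pd.DataFrame({"Embarked": ["S", "S", "C"]})

Xtr = pd.get_dummies(train_raw["Embarked"], prefix="emb", dtype=int)
Xte = pd.get_dummies(test_raw["Embarked"], prefix="emb", dtype=int)

# Give the test matrix exactly the training columns, in the same order;
# columns missing from the test data are created and filled with 0
Xte = Xte.reindex(columns=Xtr.columns, fill_value=0)
print(Xte)
```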

Let’s train the model

Xtrain = dataprep(train)
Xtest = dataprep(test)

y = train.Survived
clf = LinearSVC(random_state=4)
clf.fit(Xtrain, y)
p_tr = clf.predict(Xtrain)
print ("Score Train : ", round(clf.score(Xtrain, y) *100,4), " %")

We thus obtain a very (too!?) nice 98% on the training data, a figure that suggests some overfitting. On the test data the score is a more reasonable 76.5%!
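
A score computed on the training data says little about unseen passengers. Cross-validation gives a less optimistic estimate without touching the test file. The sketch below runs on synthetic data generated by scikit-learn (X_demo, y_demo and clf_demo are stand-ins, not the Titanic matrices), so its figures say nothing about the scores above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for (Xtrain, y); not the Titanic data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=4)

clf_demo = LinearSVC(random_state=4)
# 5-fold cross-validation: each fold is scored on rows the model was not fitted on
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
print("Mean CV accuracy : ", round(scores.mean() * 100, 2), " %")
```

On the real data, the same call with clf, Xtrain and y would show how far the 98% is from what the model achieves on held-out rows.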

Benoit Cayla

In more than 15 years, I have built up solid experience in a variety of integration projects (data & applications). I have worked in nine different companies and successively adopted the viewpoint of the service provider, the customer and the software vendor. This experience, which made me almost omniscient in my field, naturally led me to take part in large-scale projects on the digitalization of business processes, mainly in sectors such as insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog that aims to show how to understand and analyze data as simply as possible: datacorner.fr. Learning, convincing through argument and passing on my knowledge could be my characteristic triptych.
