To follow up on my previous article on predicting Titanic survivors, I want to illustrate a few more techniques and take the modeling further. This article is therefore dedicated to working on the features we are given. As usual for this type of Machine Learning project, I will be using Python, Scikit-learn and Jupyter.
Preliminary work
- Declare the Python libraries that we are going to use (Pandas, RegEx, scikit-learn, etc.)
- Import / read the training and test files into Pandas DataFrames
- Create a full DataFrame that concatenates the two previous datasets.
import pandas as pd
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.preprocessing import MinMaxScaler
train = pd.read_csv("./data/train.csv")
test = pd.read_csv("./data/test.csv")
full = pd.concat([train, test], ignore_index=True) # Combine the two datasets (fresh 0..n-1 index)
Feature engineering
The objective of this post is to work on the features and not on the algorithms themselves. To do so, we are going to rework some data that is barely usable, or not usable at all, in its raw form.
Ticket & Price: Did you notice that the price given (Fare) is not the unit price per person, and that the Ticket value is not a unique key? We will correct this by computing a unit price per person for the fares that cover grouped tickets.
Passenger without price: One passenger has no price (Fare), which is an error in the data. We will assign one in order to keep this row.
The last name: This information is not provided directly, so we will have to parse the Name string to retrieve it. Then we will group the data (train & test) by last name and count the number of people per family. With this in mind, it will be interesting to do one-hot encoding on families with at least 2 members.
The title: Like the last name, this information must be extracted from the Name column. It will then be interesting to regroup these titles into 3 final categories: women & children, VIP and others.
Age: age is interesting information in itself, but creating additional age categories makes it even more useful.
We will stop there for this article, but there are still many other areas for improvement.
Ticket & Price
First let’s look at the Ticket feature. Let’s check that all passengers have a ticket:
# Replace missing tickets with 'X' (fillna returns a new Series, so keep the result)
tickets = full['Ticket'].fillna('X')
noticket = []
for ticketnn in tickets:
    if ticketnn == 'X':
        noticket.append(1)
    else:
        noticket.append(0)
pd.DataFrame(noticket)[0].value_counts()
Good news, all passengers have this information!
Now let’s look at the distinct values of these Tickets:
test['Ticket'].value_counts().head()
PC 17608 5
113503 4
CA. 2343 4
C.A. 31029 3
347077 3
Name: Ticket, dtype: int64
Interestingly, the values are not unique as we might have thought. This is because there are group tickets (several people travelling on the same ticket). That changes a lot of things, because if the ticket can be shared, so can the price.
We will therefore have to divide the price of the Ticket by the number of people with the same ticket!
Calculation of the ticket unit price
To do this we are going to use Pandas’ ability to perform (left) joins between DataFrames. First, we build a DataFrame, TicketCounts, that lists each Ticket with its number of occurrences. Then we left-join the dataset with this new DataFrame. All that remains is to add a unit price column (PrixUnitaire in the code) that divides the total price by the number of people on the Ticket. Be careful to call fillna() on the ticket count so that a missing count does not turn the division into NaN.
# Build a DataFrame (TicketCounts) listing each ticket with its number of occurrences
# (rename_axis / reset_index gives the same columns on old and recent pandas versions)
TicketCounts = test['Ticket'].value_counts().rename_axis('Ticket').reset_index(name='TicketCount')
# Join the counts back onto the test DataFrame
fin = pd.merge(test, TicketCounts, how='left', on='Ticket')
fin['PrixUnitaire'] = fin['Fare'] / fin['TicketCount'].fillna(1)
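As a cross-check, the same per-person price can be sketched without a join, using groupby(...).transform('count'). The toy DataFrame below is made up for illustration (it is not Titanic data), and the TicketCount / PrixUnitaire column names simply mirror the ones used above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic files)
toy = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Fare":   [20.0, 20.0, 7.5, 30.0, 30.0, 30.0],
})

# Number of passengers sharing each ticket, aligned row by row
toy["TicketCount"] = toy.groupby("Ticket")["Ticket"].transform("count")

# Price per person = total ticket price / number of people on the ticket
toy["PrixUnitaire"] = toy["Fare"] / toy["TicketCount"]
print(toy)
```

Because transform returns one value per original row, no merge and no fillna are needed in this variant.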
Passenger without Price!
Be careful, because we also have a passenger who has no price (Fare). Let’s take a look at who it is:
import numpy as np
test.loc[np.isnan(test['Fare'])]
This is a 3rd class passenger, so let’s calculate the average price for this type of ticket:
test.loc[test['Pclass'] == 3]['Fare'].mean()
12.459677880184334
We will assign this price to this passenger.
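Here is a minimal sketch of that assignment on made-up values (the real files are not used here). It assumes, as in our case, that the only missing fares belong to 3rd class passengers, since fillna would otherwise give the 3rd class mean to everyone without a fare.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): one 3rd class passenger has no fare
toy = pd.DataFrame({
    "Pclass": [1, 3, 3, 3],
    "Fare":   [80.0, 8.0, 12.0, np.nan],
})

# Mean fare of 3rd class (missing values are ignored by mean())
mean_third = toy.loc[toy["Pclass"] == 3, "Fare"].mean()

# Assign that mean only where the fare is missing
toy["Fare"] = toy["Fare"].fillna(mean_third)
print(toy)
```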
The last name
The last name is not immediately usable. It must be extracted from the Name feature, which also contains other information such as the title. Let’s use a regular expression for that:
familynames = []
for noms in full["Name"]:
    # group(1) is the last name, i.e. the word just before the comma
    familynames.append(re.search(r'([A-Za-z0-9]*), ([A-Za-z0-9 ]*)\. (.*)', noms).group(1))
pdfamilynames = pd.DataFrame(familynames, columns = ['familynames'])
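To see what each group of this pattern captures, here is a small sketch on two sample strings in the "Last, Title. First" format. The second name is only there to illustrate a limitation: the character class does not include apostrophes, so the captured last name is truncated.

```python
import re

# Same pattern as above, written as a raw string
pattern = r'([A-Za-z0-9]*), ([A-Za-z0-9 ]*)\. (.*)'

m = re.search(pattern, "Braund, Mr. Owen Harris")
last_name, title, rest = m.group(1), m.group(2), m.group(3)
print(last_name, "|", title, "|", rest)   # Braund | Mr | Owen Harris

# Limitation: an apostrophe is not matched by [A-Za-z0-9],
# so only the part after it is captured as the last name
truncated = re.search(pattern, "O'Brien, Mr. Thomas").group(1)
print(truncated)                          # Brien
```

For the one-hot encoding this truncation is harmless as long as the same pattern is applied everywhere, but it is worth knowing if the names are displayed.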
The idea now is to do a one-hot encoding with the last name. It might sound a little crazy but we have little data and some last names appear in both datasets.
We will first create a DataFrame with the last names appearing 2 or more times:
# Build the list of last names that appear at least twice
# (reset_index keeps the row-by-row join correct even if full has a repeated index)
famsurv = full.reset_index(drop=True).join(pdfamilynames)
pdfamCounts = famsurv['familynames'].value_counts().rename_axis('familynames').reset_index(name='famCount')
# Keep the filtered result, otherwise every last name would get its own column later
pdfamCounts = pdfamCounts[pdfamCounts['famCount'] >= 2]
pdfamCounts
This DataFrame can then be used by a function that adds the dummy (one-hot) columns:
# Function adding the last name columns to a DataFrame
# (assumes data has a default 0..n-1 index, as train and test do)
def addColumnFamilyName(data):
    # Add one zero-filled column per last name
    for family in pdfamCounts['familynames']:
        data[family] = 0
    # Extract each passenger's last name from the Name column
    for idx, f in enumerate(data["Name"]):
        iNom = re.search(r'([A-Za-z0-9]*), ([A-Za-z0-9 ]*)\. (.*)', f).group(1)
        # Set the dummy column of that last name to 1 when it exists
        if iNom in data.columns:
            data.loc[idx, iNom] = 1
We will use this function when preparing the data (later).
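For comparison, the same idea can be sketched in a vectorized way with pd.get_dummies. This is an alternative on made-up names, not the function used in this article: the last name is taken with a simple split on the comma instead of the regular expression, and the threshold of 2 occurrences mirrors the one above.

```python
import pandas as pd

# Toy names (hypothetical), in the same "Last, Title. First" format
toy = pd.DataFrame({"Name": [
    "Smith, Mr. John", "Smith, Mrs. Anna", "Doe, Miss. Jane",
    "Brown, Mr. Paul", "Brown, Master. Tom", "Brown, Mrs. Eve",
]})

# Last name = text before the comma
toy["familyname"] = toy["Name"].str.split(",").str[0]

# Keep only the last names that appear at least twice
counts = toy["familyname"].value_counts()
frequent = counts[counts >= 2].index

# One-hot encode; passengers from rarer families become NaN and get all zeros
kept = toy["familyname"].where(toy["familyname"].isin(frequent))
dummies = pd.get_dummies(kept).astype(int)
print(dummies)
```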
The title
In the same way as the last name, we have to extract the title by parsing the Name feature. Let’s look at the titles on the whole dataset (full):
full['Titre'] = full.Name.map(lambda x : x.split(",")[1].split(".")[0].strip()) # strip() removes the space left after the comma
full['NomFamille'] = full.Name.map(lambda x : x.split(",")[0])
titre = pd.DataFrame(full['Titre'])
full['Titre'].value_counts() # displays every possible title
Here are the possibilities that we will deal with:
Mr 757
Miss 260
Mrs 197
Master 61
Dr 8
Rev 8
Col 4
Ms 2
Mlle 2
Major 2
Mme 1
Lady 1
Capt 1
Don 1
Jonkheer 1
Sir 1
the Countess 1
Dona 1
Name: Titre, dtype: int64
For the titles we will create categories that we will then one-hot encode. In principle the “women and children first” rule was supposed to be followed, but in my opinion people of high rank were also favored. So let’s create 3 categories: women and children (FE), VIP and others (Autres):
X = test.copy() # work on a copy so that test itself is left untouched
X['Rang'] = 'Autres' # default category
X['Titre'] = X.Name.map(lambda x : x.split(",")[1].split(".")[0])
vip = ['Don','Sir', 'Major', 'Col', 'Jonkheer', 'Dr', 'Rev']
femmeenfant = ['Miss', 'Mrs', 'Lady', 'Mlle', 'the Countess', 'Ms', 'Mme', 'Dona', 'Master']
for idx, titre in enumerate(X['Titre']):
    if titre.strip() in femmeenfant:
        X.loc[idx, 'Rang'] = 'FE'
    elif titre.strip() in vip:
        X.loc[idx, 'Rang'] = 'VIP'
X['Rang'].value_counts()
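The same categorization can also be sketched without an explicit loop, by turning the two lists into a lookup dictionary and using Series.map. The sample titles below are made up for the demonstration, and rang_par_titre is a name I introduce here.

```python
import pandas as pd

vip = ['Don', 'Sir', 'Major', 'Col', 'Jonkheer', 'Dr', 'Rev']
femmeenfant = ['Miss', 'Mrs', 'Lady', 'Mlle', 'the Countess', 'Ms', 'Mme', 'Dona', 'Master']

# Build a title -> category lookup from the two lists above
rang_par_titre = {t: 'FE' for t in femmeenfant}
rang_par_titre.update({t: 'VIP' for t in vip})

# Toy titles (hypothetical sample)
titres = pd.Series(['Mr', 'Mrs', 'Dr', 'Master', 'Capt'])

# Titles missing from the lookup fall back to 'Autres'
rang = titres.map(rang_par_titre).fillna('Autres')
print(rang.tolist())   # ['Autres', 'FE', 'VIP', 'FE', 'Autres']
```

Note that with these lists 'Capt' ends up in 'Autres', exactly as it does in the loop version.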
Age
Here too, we are going to create several age categories according to the Age variable:
- Babies: 0 to 3 years old
- Children: 3 to 15 years old
- Adults: 15 to 60 years old
- The “old”: over 60 years old
age = X['Age'].fillna(X['Age'].mean())
catAge = []
for i in range(X.shape[0]):
    if age[i] <= 3:
        catAge.append("bebe")
    elif age[i] > 3 and age[i] <= 15:
        catAge.append("enfant")
    elif age[i] > 15 and age[i] <= 60:
        catAge.append("adulte")
    else:
        catAge.append("vieux")
print(pd.DataFrame(catAge, columns = ['catAge'])['catAge'].value_counts())
cat = pd.get_dummies(pd.DataFrame(catAge, columns = ['catAge']), prefix='catAge')
cat.head(3)
Let’s take a look at the result:
adulte 373
enfant 21
vieux 14
bebe 10
Name: catAge, dtype: int64
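This kind of binning is also what pd.cut is designed for. Here is a minimal sketch on made-up ages, with right-closed bins matching the four categories listed above (3 counts as a baby, 15 as a child, 60 as an adult).

```python
import pandas as pd

# Toy ages (hypothetical values)
ages = pd.Series([1, 3, 10, 15, 40, 60, 75])

# Right-closed bins: (-inf, 3], (3, 15], (15, 60], (60, +inf)
catAge = pd.cut(
    ages,
    bins=[-float("inf"), 3, 15, 60, float("inf")],
    labels=["bebe", "enfant", "adulte", "vieux"],
)
print(catAge.tolist())
# ['bebe', 'bebe', 'enfant', 'enfant', 'adulte', 'adulte', 'vieux']
```

The result is a categorical Series that can be passed directly to pd.get_dummies.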
Global preparation / model function
Now let’s put all these elements together in a preparation function:
def dataprep(data):
    # Sex
    sexe = pd.get_dummies(data['Sex'], prefix='sex')
    # Cabin: keep the deck letter (deck T is merged into the nearby deck A)
    cabin = pd.get_dummies(data['Cabin'].fillna('X').str[0].replace('T', 'A'), prefix='Cabin')
    # Age and age categories
    age = data['Age'].fillna(data['Age'].mean())
    catAge = []
    for i in range(data.shape[0]):
        if age[i] < 3:
            catAge.append("bebe")
        elif age[i] >= 3 and age[i] < 15:
            catAge.append("enfant")
        elif age[i] >= 15 and age[i] < 60:
            catAge.append("adulte")
        else:
            catAge.append("vieux")
    catage = pd.get_dummies(pd.DataFrame(catAge, columns = ['catAge']), prefix='catAge')
    # Title and rank ('Autres' is the default category)
    data['Titre'] = data.Name.map(lambda x : x.split(",")[1].split(".")[0]).fillna('X')
    data['Rang'] = 'Autres'
    vip = ['Don','Sir', 'Major', 'Col', 'Jonkheer', 'Dr']
    femmeenfant = ['Miss', 'Mrs', 'Lady', 'Mlle', 'the Countess', 'Ms', 'Mme', 'Dona', 'Master']
    for idx, titre in enumerate(data['Titre']):
        if titre.strip() in femmeenfant:
            data.loc[idx, 'Rang'] = 'FE'
        elif titre.strip() in vip:
            data.loc[idx, 'Rang'] = 'VIP'
    rg = pd.get_dummies(data['Rang'], prefix='Rang')
    # Port of embarkation
    emb = pd.get_dummies(data['Embarked'], prefix='emb')
    # Unit price: build a DataFrame (TicketCounts) listing each ticket with its number of occurrences
    TicketCounts = data['Ticket'].value_counts().rename_axis('Ticket').reset_index(name='TicketCount')
    # Join the counts back onto the dataset
    fin = pd.merge(data, TicketCounts, how='left', on='Ticket')
    fin['PrixUnitaire'] = fin['Fare'] / fin['TicketCount'].fillna(1)
    prxunit = pd.DataFrame(fin['PrixUnitaire'])
    # Mean 3rd class price (for the 3rd class passenger who has no price) ... this could have been a function ;-)
    prx3eme = data.loc[data['Pclass'] == 3]['Fare'].mean()
    prxunit = prxunit['PrixUnitaire'].fillna(prx3eme)
    # Class
    pc = pd.DataFrame(MinMaxScaler().fit_transform(data[['Pclass']]), columns = ['Classe'])
    dp = data[['SibSp', 'Parch', 'Name']].join(pc).join(sexe).join(emb).join(prxunit).join(cabin).join(age).join(catage).join(rg)
    addColumnFamilyName(dp)
    del dp['Name']
    return dp
Let’s train the model
Xtrain = dataprep(train)
Xtest = dataprep(test)
y = train.Survived
clf = LinearSVC(random_state=4)
clf.fit(Xtrain, y)
p_tr = clf.predict(Xtrain)
print ("Score Train : ", round(clf.score(Xtrain, y) *100,4), " %")
We thus obtain a very (too!?) nice 98% on the training data. On the test data we get a more reasonable 76.5%! The gap between the two suggests that the model overfits the training set.
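One way to get a less flattering estimate before submitting is cross-validation, which scores the model on rows it has not seen during fitting. The sketch below uses synthetic data from make_classification as a stand-in for Xtrain and y (an assumption made so that the snippet runs on its own); the names X_demo, y_demo and clf_demo are mine, and max_iter is raised only to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for Xtrain / y (hypothetical data, not the Titanic set)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=4)

clf_demo = LinearSVC(random_state=4, max_iter=10000)

# Accuracy on 5 held-out folds instead of on the training rows themselves
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
print("CV accuracy per fold:", scores.round(3))
print("Mean CV accuracy    :", round(scores.mean() * 100, 2), "%")
```

With the real data, the call would simply take clf, Xtrain and y in place of the demo objects.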