The Python Pandas library
Pandas is a Python library designed to make data manipulation easier, which makes it an essential tool for any data scientist. It provides its own data structures, known in Pandas jargon as Series, DataFrame and Panel. In our experiments we will mostly use DataFrames because they offer a two-dimensional view of the data (like an Excel table), which is exactly what we will feed to our models.
For information:
- A Pandas Series is a labeled vector capable of containing any type of object (1-dimensional array)
- A DataFrame is a two-dimensional table whose columns can be of different types (2-dimensional array)
- A Panel is a three-dimensional data structure (deprecated and later removed from Pandas, so it is not available in recent versions).
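To make the difference in dimensions concrete, here is a minimal sketch. The labels, column names and values are invented for the illustration.

```python
import pandas as pd

# A Series is one-dimensional: one value per label.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame is two-dimensional; each column may have its own type.
df = pd.DataFrame({"name": ["x", "y", "z"], "value": [1.5, 2.5, 3.5]})

print(s.ndim, s.shape)    # 1 (3,)
print(df.ndim, df.shape)  # 2 (3, 2)
print(df.dtypes)          # one dtype per column
```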
If you use a Python distribution such as Anaconda, good news: you have nothing to do, because the library is already included. Otherwise, install it with pip:
$ pip install pandas
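A quick way to check that the installation worked is to import the library and print its version (the pd alias is the usual convention, not a requirement):

```python
import pandas as pd

# If this import succeeds, Pandas is installed in the current environment.
print(pd.__version__)
```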
The other good news is that this library is designed to work with two other famous and essential libraries: NumPy and matplotlib!
For this short tutorial, I recommend working in a Jupyter notebook (the sources are available below) in order to go faster.
The official documentation is also worth consulting: https://pandas.pydata.org/pandas-docs/stable/api.html
Data types
Let’s first create a simple dataset (a DataFrame):
import pandas as pd
pd.DataFrame({'Colonne 1': [1], 'Colonne 2': [2]})
The first line imports the Pandas library under the alias pd.
The second line asks Pandas to create a DataFrame (a two-dimensional table) with two columns and a single row.
Here is the result:
  | Colonne 1 | Colonne 2 |
0 | 1 | 2 |
The columns have nice labels, but if you also want to label your rows you have to specify them with the index parameter, as follows:
pd.DataFrame({'Colonne 1': [35, 41], 'Colonne 2': [1, 2]}, index=['Ligne 1', 'Ligne 2'])
Here is the result:
  | Colonne 1 | Colonne 2 |
Ligne 1 | 35 | 1 |
Ligne 2 | 41 | 2 |
A Series on its own is used less often, but it can still come in handy. Here is how to create one (a vector):
pd.Series(["Valeur1", "Valeur2", "Valeur3", "Valeur4"], index=["Index1", "Index2", "Index3", "Index4"], name='Ma série')
The result :
Index1 | Valeur1 |
Index2 | Valeur2 |
Index3 | Valeur3 |
Index4 | Valeur4 |
Name: Ma série, dtype: object |
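As a small sketch of what the labels are for, the same Series can be read either by label or by integer position. The variable names below are chosen for the illustration.

```python
import pandas as pd

s = pd.Series(["Valeur1", "Valeur2", "Valeur3", "Valeur4"],
              index=["Index1", "Index2", "Index3", "Index4"],
              name="Ma série")

by_label = s["Index1"]     # lookup by label
by_position = s.iloc[0]    # lookup by integer position
labels = list(s.index)     # the row labels, in order

print(by_label, by_position, labels)
```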
More commonly, you will load data from a CSV file. To do so, use the read_csv function:
csv = pd.read_csv("./datasets/housing/housing.csv")
Note: to read an Excel file instead, see the read_excel function.
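If you do not have the file at hand, read_csv can be tried on in-memory text. This is only a sketch: the three columns and their values are hypothetical stand-ins, not the actual content of housing.csv.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
raw = io.StringIO(
    "longitude,latitude,median_house_value\n"
    "-122.23,37.88,452600\n"
    "-122.22,37.86,358500\n"
)

# read_csv accepts a path or a file-like object; sep defaults to ",".
sample = pd.read_csv(raw, sep=",")
print(sample)
```

The first line of the text is used as the column labels, and the rows receive a default integer index starting at 0.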
Once the file has been read, several commands are very useful (df below stands for any DataFrame):
- head(): displays the first rows of the DataFrame
- tail(): same, but for the last rows
- describe(): very useful, gives summary statistics on the data (count, mean, standard deviation, median, quartiles, min, max, etc.). Warning! Some indicators are only defined for numerical variables (e.g. mean, min, max) and others only for non-numeric ones (e.g. top, freq), hence the NaN values in certain situations. To get the statistics for all columns at once, run:
df.describe(include='all')
- shape: the dimensions of the object (it is an attribute, so no parentheses)
- value_counts(): returns the distinct values of a column with the number of times each one occurs, which is very useful:
df["Colonne 1"].value_counts()
- etc.
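The commands listed above can be tried on a small DataFrame. This is a minimal sketch; the city and price columns are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice"],
    "price": [300, 200, 350, 250],
})

print(df.head(2))    # first 2 rows
print(df.tail(2))    # last 2 rows
print(df.shape)      # (4, 2)

numeric_stats = df.describe()           # numeric columns only by default
all_stats = df.describe(include="all")  # adds unique/top/freq for text columns
print(all_stats)

counts = df["city"].value_counts()      # occurrences of each distinct value
print(counts)
```

In all_stats, the rows unique, top and freq are NaN for price, and the rows mean, std, min and max are NaN for city, which is the situation described in the warning above.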
Access the data (DataFrame)
Get a vector
We can retrieve a column vector (a column of the DataFrame) very simply via:
csv.longitude or csv["longitude"]
NB: csv is a DataFrame which has a column labeled longitude.
Retrieve a cell
csv.longitude[0]
Or
csv["longitude"][0]
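The column and cell accesses above can be sketched on a small stand-in DataFrame (the values are invented). As an alternative to chaining two sets of brackets, loc, at and iloc reach a cell in a single step.

```python
import pandas as pd

# Small stand-in for the DataFrame read from the CSV file.
csv = pd.DataFrame({
    "longitude": [-122.23, -122.22],
    "latitude": [37.88, 37.86],
})

col = csv["longitude"]               # a column, returned as a Series
chained = csv["longitude"][0]        # column first, then row label 0
cell = csv.loc[0, "longitude"]       # row label and column label in one step
same_cell = csv.at[0, "longitude"]   # scalar access by labels
by_position = csv.iloc[0, 0]         # row position and column position

print(chained, cell, same_cell, by_position)
```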
Handling DataFrame data
In the same spirit, we can retrieve parts of the DataFrame via the iloc (position-based) and loc (label-based) indexers, which are used with square brackets:
Get the first 3 rows and the first 4 columns:
csv.iloc[:3, :4]
Filtering on columns (via labels):
csv.loc[:, ['longitude', 'latitude']]
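Here is a short sketch of both indexers on a stand-in DataFrame. The rooms column and all values are invented for the example. One point worth noting: a position slice with iloc excludes its end, whereas a label slice with loc includes it.

```python
import pandas as pd

csv = pd.DataFrame({
    "longitude": [-122.23, -122.22, -122.24, -122.25],
    "latitude": [37.88, 37.86, 37.85, 37.85],
    "rooms": [880, 7099, 1467, 1274],
})

# Position-based: first 3 rows, first 2 columns (end positions excluded).
top_left = csv.iloc[:3, :2]

# Label-based: all rows, two columns selected with a list of labels.
coords = csv.loc[:, ["longitude", "latitude"]]

# Label slice: the end label is included, so this returns rows 0, 1 and 2.
first_rooms = csv.loc[0:2, "rooms"]

print(top_left.shape, coords.shape, len(first_rooms))
```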
Handling character data
It is often useful to break up strings, and Pandas offers several very practical methods for this. The str.split() method, for example, splits each string according to a separator.
Example:
monDataframe["index1"].str.split("-", expand=True)
In the example above, we split the strings in the “index1” column of the DataFrame monDataframe, using the "-" character as the separator. Note two things here. First, the str accessor is what lets us handle the values of the column as strings. Second, the expand=True option returns a new DataFrame (one column per piece) instead of a Series of lists, which is much more practical later on.
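To see the effect of expand=True, here is a minimal sketch. The date-like strings in the index1 column are invented for the illustration.

```python
import pandas as pd

monDataframe = pd.DataFrame({"index1": ["2020-01-15", "2021-06-30"]})

# Without expand: a Series in which each element is a list of pieces.
as_series = monDataframe["index1"].str.split("-")

# With expand=True: a DataFrame with one column per piece (labeled 0, 1, 2).
as_frame = monDataframe["index1"].str.split("-", expand=True)

print(as_series)
print(as_frame)
```

The pieces are still strings, so a conversion such as astype(int) is needed before doing arithmetic on them.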
See the official Pandas documentation for the other string-handling methods available through str.
Jupyter notebooks in this tutorial
You can find the examples and results above in the two Jupyter notebooks on my GitHub.