Python Pandas - Tutorial (Part No. 1)

The Python Pandas library

The Pandas library is a Python library that aims to make data manipulation easier, which makes it an essential tool to master as a data scientist. Pandas organizes data in its own structures, namely (in Pandas jargon) the Series, the DataFrame and the Panel. In our experiments we will mostly use the DataFrame, because it offers a two-dimensional view of the data (like an Excel table), and this is exactly what we will feed to our models.

For information:

A Pandas Series is a labeled vector capable of holding any type of object (1-dimensional array)
A DataFrame is a two-dimensional table whose columns can be of different types (2-dimensional array)
A Panel is a three-dimensional data structure. Note that Panel was deprecated in Pandas 0.20 and removed in 0.25, so it is not available in current versions.
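To make the dimensions listed above concrete, here is a minimal sketch that builds a Series and a DataFrame and inspects their ndim attribute. The variable names and values are made up for the illustration.

```python
import pandas as pd

# A Series is one-dimensional: one label per value.
measures = pd.Series([12.5, 14.0, 9.8], index=["a", "b", "c"])

# A DataFrame is two-dimensional, and each column can have its own type.
table = pd.DataFrame({"label": ["x", "y"], "value": [1, 2]})

print(measures.ndim)  # number of dimensions of the Series
print(table.ndim)     # number of dimensions of the DataFrame
print(table.dtypes)   # one dtype per column
```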

If you use a Python distribution such as Anaconda, good news: you have nothing to do, because the library is already included. If this is not the case, the pip command will install it:

$ pip install pandas

The other good news is that this library is designed to work with two other famous and essential libraries: NumPy and matplotlib!

For this short tutorial, I recommend using a Jupyter notebook (the sources are available below) so that you can move faster.

Do not hesitate to consult the official documentation as well: https://pandas.pydata.org/pandas-docs/stable/api.html

Data types

Let’s first create a simple dataset (DataFrame):

import pandas as pd
pd.DataFrame({'Colonne 1': [1], 'Colonne 2': [2]})

The first line imports the Pandas library under the alias pd.

The second line asks Pandas to create a two-dimensional array with one row and two columns.

Here is the result:

	Colonne 1	Colonne 2
0	1	2

The columns have nice labels, but if you also want to label your rows you have to pass the labels through the index parameter, as follows:

pd.DataFrame({'Colonne 1': [35, 41], 'Colonne 2': [1, 2]}, index=['Ligne 1', 'Ligne 2'])

Here is the result:

	Colonne 1	Colonne 2
Ligne 1	35	1
Ligne 2	41	2
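As a small follow-up, the sketch below rebuilds the same DataFrame and shows how the row and column labels can be read back, and how a single value can be looked up from its two labels. The variable name df is only an assumption for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"Colonne 1": [35, 41], "Colonne 2": [1, 2]},
    index=["Ligne 1", "Ligne 2"],
)

print(list(df.index))    # row labels
print(list(df.columns))  # column labels

# Look up one value from its row label and its column label.
value = df.loc["Ligne 1", "Colonne 2"]
print(value)
```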

A Series is not necessarily the most useful structure on its own, but it can still come in handy. Here is how to create one (a vector):

pd.Series(["Valeur1", "Valeur2", "Valeur3", "Valeur4"], index=["Index1", "Index2", "Index3", "Index4"], name='Ma série')

The result:

Index1	Valeur1
Index2	Valeur2
Index3	Valeur3
Index4	Valeur4
Name: Ma série, dtype: object
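Once the Series exists, its values can be read either by label or by position. The following sketch reuses the Series above; the variable name s is an assumption made for the example.

```python
import pandas as pd

s = pd.Series(
    ["Valeur1", "Valeur2", "Valeur3", "Valeur4"],
    index=["Index1", "Index2", "Index3", "Index4"],
    name="Ma série",
)

by_label = s["Index2"]    # lookup by index label
by_position = s.iloc[0]   # lookup by integer position
print(by_label, by_position, s.name)
```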

Loading data from a CSV file, on the other hand, is something you will do much more often. For that, use the read_csv function:

csv = pd.read_csv("./datasets/housing/housing.csv")

Note: If you want to read an Excel file instead, refer to the read_excel function.
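The housing.csv file is not reproduced in this post, so here is a self-contained sketch of read_csv that reads a small made-up CSV from an in-memory buffer instead of a path. The longitude and latitude columns mirror the ones used later in this tutorial; the rooms column and all the values are invented for the illustration.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file on disk.
csv_text = """longitude,latitude,rooms
-122.23,37.88,5
-122.22,37.86,7
-122.24,37.85,3
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # types inferred for each column
```

The first line of the text is used as the header, and Pandas infers a type for each column from its values.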

Once the file has been read, several commands are very useful:

head(): displays the first rows of the DataFrame
tail(): same, but for the last rows
describe(): very useful, gives summary statistics on the data (count, mean, standard deviation, quartiles including the median, min, max, etc.). Warning! Some indicators only apply to numeric variables (e.g. mean, min, max) and others only to non-numeric ones (e.g. top, freq), hence the "NaN" values in certain situations. By default only the numeric columns are summarized; to retrieve the statistics for all columns, call df.describe(include='all')
shape: the dimensions of the object (this is an attribute, so it takes no parentheses)
value_counts(): returns the distinct values of a column along with their frequency, which is very useful: df["colonne 1"].value_counts()
etc.
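The commands listed above can be tried on a tiny DataFrame. This is only a sketch: the city and price columns and their values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["Paris", "Lyon", "Paris", "Nice"],
        "price": [10.0, 12.5, 11.0, 9.5],
    }
)

first_rows = df.head(2)   # first 2 rows
last_rows = df.tail(2)    # last 2 rows
dimensions = df.shape     # attribute: (number of rows, number of columns)

# include="all" adds the non-numeric columns to the summary;
# statistics that do not apply to a column show up as NaN.
stats = df.describe(include="all")

# Number of occurrences of each distinct value in one column.
counts = df["city"].value_counts()

print(stats)
print(counts)
```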

Access the data (DataFrame)

Get a vector

We can retrieve a column vector (a column of the DataFrame) very simply, with either of the following:

csv.longitude

csv["longitude"]

NB: csv is a DataFrame which has a column labeled longitude.
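The two notations return the same Series, but the bracket form is the more general one: attribute access only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute. Here is a minimal sketch, using a made-up DataFrame in which the "median income" column is an invented name chosen to show the limitation.

```python
import pandas as pd

csv = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22],
        "median income": [8.3, 7.2],  # hypothetical column with a space in its name
    }
)

by_attribute = csv.longitude
by_key = csv["longitude"]
print(by_attribute.equals(by_key))  # both notations give the same column

# A name containing a space can only be reached with brackets.
income = csv["median income"]
print(type(income).__name__)
```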

Recover a cell

csv.longitude[0]

csv["longitude"][0]
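Note that the [0] in the two lines above is a lookup on the row label 0, which happens to match the first row because the DataFrame has the default integer index. As a sketch of alternatives, Pandas also provides the at, iat and loc accessors, which reach a cell in a single step. The DataFrame below is made up for the example.

```python
import pandas as pd

csv = pd.DataFrame({"longitude": [-122.23, -122.22, -122.24]})

chained = csv["longitude"][0]           # select the column, then the row labeled 0
by_labels = csv.at[0, "longitude"]      # single cell from row label and column label
by_positions = csv.iat[0, 0]            # single cell from row position and column position
with_loc = csv.loc[0, "longitude"]      # same lookup through the loc indexer

print(chained, by_labels, by_positions, with_loc)
```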

Handling DataFrame data

In the same spirit, we can retrieve parts of the DataFrame via the iloc (position-based) and loc (label-based) indexers:

Get the first 3 rows and the first 4 columns:

csv.iloc[:3, :4]

Filtering on columns (via labels):

csv.loc[:, ['longitude', 'latitude']]
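Here is a self-contained sketch of both indexers on a made-up DataFrame (all column names other than longitude and latitude, and all values, are invented). It also shows one difference worth keeping in mind: a slice in iloc excludes its end position, while a slice in loc includes its end label.

```python
import pandas as pd

csv = pd.DataFrame(
    {
        "longitude": [-122.23, -122.22, -122.24, -122.25],
        "latitude": [37.88, 37.86, 37.85, 37.85],
        "rooms": [5, 7, 3, 4],
        "age": [41, 21, 52, 52],
        "price": [452.6, 358.5, 352.1, 341.3],
    }
)

# Position-based: first 3 rows, first 4 columns.
top_left = csv.iloc[:3, :4]

# Label-based: all rows, two columns selected by name.
coords = csv.loc[:, ["longitude", "latitude"]]

# Slices behave differently in the two indexers.
by_position = csv.iloc[0:2]  # positions 0 and 1
by_label = csv.loc[0:2]      # labels 0, 1 and 2

print(top_left.shape, coords.shape, len(by_position), len(by_label))
```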

Handling character data

It is often useful to split strings, and Pandas offers several very practical functions for this. The str.split() method, for example, splits a character string on a separator.

Example:

monDataframe["index1"].str.split("-", expand=True)

In the example above, we split the strings in the "index1" column of the DataFrame monDataframe, using "-" as the separator. Note two things here. First, the str accessor is what gives access to the string methods on the values of the column. Second, the expand=True option returns a new DataFrame with one column per piece instead of a Series of lists (this is much more practical later).
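To see the effect of expand=True, here is a minimal sketch on a made-up DataFrame. The content of the "index1" column is an assumption: the date-like strings are only there to have something to split on "-".

```python
import pandas as pd

monDataframe = pd.DataFrame({"index1": ["2020-01-15", "2021-06-30"]})

# expand=True: one column per piece, returned as a new DataFrame.
parts = monDataframe["index1"].str.split("-", expand=True)

# Without expand: a Series in which each element holds the list of pieces.
as_lists = monDataframe["index1"].str.split("-")

print(parts)
print(as_lists)
```

The columns of the new DataFrame are numbered 0, 1, 2, and can then be renamed or added back to the original DataFrame.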

See the official Pandas documentation for the other string handling functions.

Jupyter notebooks in this tutorial

You can find the examples and results above in the two Jupyter notebooks on my GitHub.

The tutorial continues in Part No. 2.

Benoit Cayla
