Python Pandas – Tutorial (Part 1)


The Python Pandas library

Pandas is a Python library that aims to make your life easier when it comes to data manipulation, which makes it an essential tool to master as a data scientist. The data structures managed by Pandas can hold any type of element; in Pandas jargon they are called Series, DataFrame and Panel. In our experiments we will mostly use the DataFrame, because it offers a two-dimensional view of the data (like an Excel table), and that is exactly what we will feed to our models.

For information:

  • A Pandas Series is a labeled vector that can contain any type of object (a 1-dimensional array)
  • A DataFrame is a two-dimensional matrix whose columns can be of different types (a 2-dimensional array)
  • A Panel is a three-dimensional data structure (note that Panel has been deprecated and removed in recent versions of Pandas).

If you have opted for a Python distribution (Anaconda, for example), good news: you have nothing to do, the library is already included in the distribution. If that is not the case, the pip command will let you install it:

$ pip install pandas
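
Once installed, you can quickly check that the library is available by printing its version (a minimal sketch):

import pandas as pd

# Display the installed Pandas version
print(pd.__version__)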

The other good news is that this library is designed to work with two other famous and essential libraries: NumPy and matplotlib!

For this little tutorial, I recommend using a Jupyter notebook (the sources are available below) in order to go faster.

Do not hesitate to consult the official documentation as well: https://pandas.pydata.org/pandas-docs/stable/api.html

Data types

Let’s first create a simple dataset (DataFrame)

import pandas as pd
pd.DataFrame({'Colonne 1': [1], 'Colonne 2': [2]})

The first command tells Python that we are going to work with the Pandas library.

The second line tells Pandas to create a two-dimensional array.

Here is the result:

   Colonne 1  Colonne 2
0          1          2

The columns have nice labels, but if you also want to label your rows you will have to specify them with the index property, as follows:

pd.DataFrame({'Colonne 1': [35, 41], 'Colonne 2': [1, 2]}, index=['Ligne 1', 'Ligne 2'])

Here is the result:

         Colonne 1  Colonne 2
Ligne 1         35          1
Ligne 2         41          2

This is not what you will use most often, but it can still come in handy; here is how to create a Series (vector):

pd.Series(["Valeur1", "Valeur2", "Valeur3", "Valeur4"], index=["Index1", "Index2", "Index3", "Index4"], name='Ma série')

The result:

Index1    Valeur1
Index2    Valeur2
Index3    Valeur3
Index4    Valeur4
Name: Ma série, dtype: object

Reading data from a CSV file, on the other hand, is much more common; for that, use the read_csv command:

csv = pd.read_csv("./datasets/housing/housing.csv")

Note: if you want to read an Excel file instead, refer to the read_excel command.
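
For example, reading an Excel file could look like the sketch below (the file path and sheet name are assumptions for illustration; read_excel also requires an Excel engine such as openpyxl to be installed):

import pandas as pd

# Read the first sheet of a (hypothetical) Excel workbook into a DataFrame
excel = pd.read_excel("./datasets/housing/housing.xlsx", sheet_name=0)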

Once the file is read, several commands are very useful (see the sketch after this list):

  • head(): displays the first rows of the DataFrame
  • tail(): same, but for the last rows
  • describe(): very useful, gives statistics about the data (count, standard deviation, median, quantiles, min, max, etc.). Warning! Some indicators are only computed for numerical columns (e.g. mean, min, max) and others only for non-numeric ones (e.g. top, freq), hence the "NaN" in certain situations. To retrieve the statistics for all columns at once, use df.describe(include='all')
  • shape: the dimensions of the object (note that shape is an attribute, not a method)
  • value_counts(): returns a very useful table of values and their frequencies: df["colonne 1"].value_counts()
  • etc.
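
As an illustration, here is a minimal sketch of these inspection commands applied to the housing DataFrame loaded above (the "ocean_proximity" column used with value_counts() is an assumption based on the classic housing dataset; adapt it to your own file):

import pandas as pd

csv = pd.read_csv("./datasets/housing/housing.csv")

print(csv.head())                   # first 5 rows
print(csv.tail(3))                  # last 3 rows
print(csv.describe())               # statistics for the numeric columns
print(csv.describe(include='all'))  # statistics for all columns (NaN where not applicable)
print(csv.shape)                    # (number of rows, number of columns)

# Values of a column together with their frequencies
print(csv["ocean_proximity"].value_counts())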

Access the data (DataFrame)

Get a vector

We can recover a column vector (a column of the DataFrame) very simply via:

csv.longitude or csv["longitude"]

NB: csv is a DataFrame which has a column labeled longitude.

Recover a cell

csv.longitude[0]

Or

csv["longitude"][0]

Handling DataFrame data

In the same spirit, we can retrieve pieces of the DataFrame via the iloc and loc indexers (iloc selects by integer position, loc by label):

Get the first 3 rows and the first 4 columns:

csv.iloc[:3, :4]

Filtering on columns (via labels):

csv.loc[:, ('longitude', 'latitude')]

Handling character data

It is often useful to break strings apart, and for this Pandas offers several very practical functions. The split() function, for example, lets you split a character string on a separator.

Example:

monDataframe["index1"].str.split("-", expand=True)

In the example above, we split the character strings of the "index1" column of the DataFrame monDataframe on the "-" separator. Note two things here. First, the str accessor is used to handle the column's data as character strings. Second, the expand=True option returns a new DataFrame instead of a Series (which is much more practical later on).
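
To make this concrete, here is a minimal sketch on a hypothetical DataFrame (the column name and values are assumptions, used for illustration only):

import pandas as pd

monDataframe = pd.DataFrame({"index1": ["2019-01-15", "2020-06-30"]})

# Split each string on "-" and return a new DataFrame with one column per piece
decoupe = monDataframe["index1"].str.split("-", expand=True)
print(decoupe)
#       0   1   2
# 0  2019  01  15
# 1  2020  06  30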

See the official Pandas documentation for the other string-handling functions.

Jupyter notebooks in this tutorial

You can find the examples and results above in the two Jupyter notebooks on my GitHub.

The tutorial continues with Part 2 here.


Benoit Cayla

Over more than 15 years, I have built up solid experience around various integration projects (data & applications). I have worked in nine different companies, successively adopting the viewpoint of the service provider, the customer and the software vendor. This experience, which has made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Truly passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr. Learning, convincing with arguments and passing on my knowledge could be my characteristic triptych.
