dataprep.eda: a newcomer in data analysis

A new arrival in the world of profiling

Pandas Profiling appeared in 2016 and, with it, data preparation (such an essential phase of a data science project) opened up to the Python world. I devoted a previous article to that library.

2020 will undoubtedly be the year of “dataprep”, and more particularly of dataprep.eda. The dataprep project is broken down into several sub-projects, including eda (Exploratory Data Analysis) and dataprep.data_connector. Others are in progress, and given the name of the library we can easily imagine what they will cover.

In short, let’s see what this library has under the hood…

Installation of dataprep.eda

Like any Python library, it all starts with installation. Nothing is more efficient for this than the pip command:

pip install dataprep

For more information on the library, go to https://sfu-db.github.io/dataprep/, where you will also find the user documentation (in English).

If you prefer conda or mamba, type the following commands. As written, the first installs mamba itself with conda and the second installs dataprep with mamba:

conda install -c conda-forge mamba
then
mamba install -c conda-forge -c sfu-db dataprep
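
Whichever installer you use, a quick way to confirm that the package landed in the active environment is to ask Python for its installed version. The sketch below relies only on the standard library; the helper name installed_version is my own and is not part of dataprep.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("dataprep")
    if found:
        print(f"dataprep {found} is installed")
    else:
        print("dataprep is not installed in this environment")
```

If the second message appears, the most likely cause is that pip or conda installed the package into a different environment than the one running your notebook.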

The dataprep library is really very simple and has 4 main functions, covered in the sections below: plot(), plot_missing(), plot_correlation() and create_report().

Distribution analysis

As usual, we will use the Titanic data and look at the distribution of this dataset in a single line:

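import pandas as pd
from dataprep.eda import create_report, plot, plot_correlation, plot_missing
from pandas_profiling import ProfileReport
# Load the Titanic training set (adjust the path to your own copy)
data = pd.read_csv("../datasources/titanic/train.csv")
# A single call profiles every column of the DataFrame
plot(data)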

A simple call to the plot() function on a Pandas DataFrame is enough to display the graphs.

Each column presents a distribution graph (a bar chart, of course). Clicking on the “Show Stats Info” header provides general information (number of missing values, duplicates, etc.).

You don’t necessarily want all the columns? No problem: just pass the desired column(s) to the plot() function as follows.

Single column details

plot(data, "Age")

In this detailed view you even get additional visualizations to help your analysis. A number of tabs (Stats, Histogram, etc.) let you see at a glance almost all the information about the selected column.

Compare two columns

It is also often convenient to look at two columns against each other. Once again, the plot() function makes this very easy. Below we display age and ticket price in the same visualization:

plot(data, "Age", "Fare")

Once again, several tabs present the data in different forms, for example as a scatter plot, a box plot (box and whisker plot) or a hexagonal binning plot.

Missing values analysis

An important aspect, especially for machine learning projects, is the analysis of missing data. For this, dataprep.eda offers the plot_missing() function:

plot_missing(data)

As before, several tabs are available to present different ways of viewing this information (heat map, bars, etc.). Note that if you have categorical data the graphs will be different… try it!
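
To put exact numbers behind these charts, the same information can be computed with pandas alone. This is only a sketch of mine for cross-checking, not what plot_missing() runs internally. The helper name missing_summary and the sample rows (made up to mimic the shape of the Titanic columns) are assumptions.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the Titanic file (not real data).
sample = pd.DataFrame(
    {
        "Age": [22.0, None, 26.0, 35.0, None],
        "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
        "Cabin": [None, "C85", None, "C123", None],
    }
)


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and rate of missing values per column, most affected first."""
    summary = pd.DataFrame(
        {
            "missing": df.isna().sum(),  # number of missing cells per column
            "rate": df.isna().mean(),    # the same count as a fraction of the rows
        }
    )
    return summary.sort_values("rate", ascending=False)


print(missing_summary(sample))
```

On the real file, calling missing_summary(data) gives one row per column, which is handy when you need a threshold (for instance, dropping columns above a given missing rate) rather than a picture.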

Correlation analysis

Another very interesting aspect is correlation analysis, which reveals how strongly the columns of the dataset are related to one another. We saw this notion in a previous article, so I won’t go over the different calculation methods.

With dataprep.eda, we don’t ask ourselves any questions and simply use the plot_correlation() function:

plot_correlation(data)
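
If you want to read the coefficients as plain numbers rather than on a chart, pandas can compute a correlation matrix directly. The sketch below is a cross-check on my side, not dataprep’s own code. It assumes that the Pearson and Spearman coefficients are the ones you care about, and the Age and Fare values are made up to mimic two Titanic columns.

```python
import pandas as pd

# Made-up values that only mimic two Titanic columns (not real data).
sample = pd.DataFrame(
    {
        "Age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
        "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    }
)

# Pearson measures linear association, Spearman works on the ranks of the values.
pearson = sample.corr(method="pearson")
spearman = sample.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

Each result is a square, symmetric table with 1.0 on the diagonal. On the real dataset, restrict the DataFrame to its numeric columns first.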

Report creation with create_report()

Report generation is started with a single command:

create_report(data)

Conclusion

This library, as I said in the introduction, is recent (2020) and already very promising. After a few tests, we can already say that it has these advantages over Pandas Profiling:

  • A clearly better API design
  • According to some benchmarks, and above all according to the designers’ claims, this library is up to 100 times faster. This remains to be checked.
  • Intelligent and adaptive visualization (admittedly much less static)
  • Handling of much larger datasets (which was clearly a limitation of Pandas Profiling).
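
Since the speed claim remains to be checked, here is a minimal timing harness for doing so on your own data. It is a sketch, not a rigorous benchmark: the helper name best_time is mine, and the commented usage assumes that both libraries are installed and that data is the DataFrame loaded earlier.

```python
import time
from typing import Any, Callable


def best_time(func: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> float:
    """Run func(*args, **kwargs) `repeats` times and return the fastest run in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Intended use (assumption: both libraries are installed, `data` is already loaded):
#   from dataprep.eda import create_report
#   from pandas_profiling import ProfileReport
#   t_dataprep = best_time(create_report, data)
#   t_profiling = best_time(lambda: ProfileReport(data).to_html())
#   print(f"speed-up: {t_profiling / t_dataprep:.1f}x")
```

Keeping the fastest of several runs reduces the noise from caching and background activity. The Pandas Profiling report is rendered with to_html() so that its computation actually happens inside the timed call.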

Benoit Cayla

In more than 15 years, I have built up solid experience around various integration projects (data & applications). I have worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers automate complex business processes more efficiently. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr. Learning, convincing through arguments and passing on my knowledge could be my characteristic triptych.
