Analyze your data with Pandas-profiling

The Pandas ecosystem keeps expanding! With the pandas_profiling library, automatic (and fairly complete) data analysis is within everyone’s reach. It is simple and quick to set up, so I couldn’t pass up this little gem of profiling.

First profiling

Let’s start by installing the library with Python / pip:

$ pip install pandas_profiling

As one might suspect, this library works on Pandas DataFrames. We will therefore reuse the Titanic data to see what it produces.

from pandas_profiling import ProfileReport
import pandas as pd
train = pd.read_csv('../datasources/titanic/train.csv')

Then we generate the profiling report. Two lines of code are enough:

prof = ProfileReport(train)
prof.to_file(output_file='rapport.html')

A rapport.html file is then generated. This comprehensive report presents the information you need to start working with the data.

Description of the analyzed elements

The generated HTML page contains the following sections, organized sequentially (and also reachable via the top menu):

A summary that shows at a glance the main issues a data scientist will have to handle: missing data, high cardinality, etc.

Basic information such as the number of columns and rows, missing or duplicated data, memory usage, etc.

By column:

Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics such as mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Type inference: the actual data types found in the dataset (categorical strings, numeric strings, etc.)
The most frequent values
Distribution of values (Histograms)
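To make the per-column figures more concrete, here is a minimal sketch that computes several of them by hand with plain pandas. The `ages` series is a made-up stand-in rather than the Titanic data, and the formulas follow pandas defaults (sample standard deviation, excess kurtosis), which may differ slightly from the conventions used in the report.

```python
import pandas as pd

# Hypothetical stand-in for one numeric column (not the real Titanic data).
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27], name="Age")

# Quantile statistics
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
quantile_stats = {
    "min": ages.min(),
    "Q1": q1,
    "median": median,
    "Q3": q3,
    "max": ages.max(),
    "range": ages.max() - ages.min(),
    "IQR": q3 - q1,
}

# Descriptive statistics (pandas defaults: ddof=1, excess kurtosis)
descriptive_stats = {
    "mean": ages.mean(),
    "std": ages.std(),
    "sum": ages.sum(),
    "MAD": (ages - ages.median()).abs().median(),  # median absolute deviation
    "CV": ages.std() / ages.mean(),                # coefficient of variation
    "kurtosis": ages.kurt(),
    "skewness": ages.skew(),
}

# Most frequent values
top_values = ages.value_counts().head(3)

print(quantile_stats)
print(descriptive_stats)
print(top_values)
```

The report gathers all of this for every column at once, which is where the time saving comes from.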

A detailed analysis of correlations that highlights highly correlated variables, with Spearman, Pearson and Kendall matrices
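As a rough illustration of what such correlation matrices contain, the sketch below builds Pearson and Spearman matrices with `DataFrame.corr` and lists the pairs above a threshold. The data, the column names and the 0.9 threshold are assumptions chosen for the example; this is not the code the library runs internally.

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x but not linearly.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 8, 27, 64, 125],
    "z": [5, 3, 4, 1, 2],
})

pearson = df.corr(method="pearson")    # linear relationship
spearman = df.corr(method="spearman")  # rank-based (monotonic) relationship
# method="kendall" is also accepted by pandas, but it relies on SciPy.

# Flag strongly correlated pairs (the threshold is an arbitrary choice here).
threshold = 0.9
strong_pairs = [
    (a, b)
    for i, a in enumerate(spearman.columns)
    for b in spearman.columns[i + 1:]
    if abs(spearman.loc[a, b]) >= threshold
]

print(pearson.round(3))
print(spearman.round(3))
print(strong_pairs)
```

Comparing the two matrices is useful: a pair can have a perfect Spearman score while its Pearson score stays below 1, which is the case for `x` and `y` here.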

Missing values: matrix, count, heat map and dendrogram
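The missing-value views can be approximated with a few pandas calls. This is only a sketch on a tiny made-up DataFrame (the column names echo the Titanic file, the values are invented); the correlation of the missing-value masks is one plausible basis for a heat map, not necessarily how the report computes it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_mask = df.isna()             # boolean matrix: True where a value is missing
missing_count = missing_mask.sum()   # number of missing values per column
missing_share = missing_mask.mean()  # fraction of missing values per column

# Correlation between the missing-value masks, restricted to columns that
# actually have missing values (a constant column has no defined correlation).
nullity_corr = missing_mask.loc[:, missing_mask.any()].astype(int).corr()

print(missing_count)
print(missing_share)
print(nullity_corr)
```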

Text analysis informs you about categories (upper case, space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.

Let’s go a little further with the library

For those who use Jupyter (like me), the report can be displayed directly in a notebook instead of generating a separate HTML file:

profile = ProfileReport(train, title='Titanic file analysis', html={'style':{'full_width':True}})
profile.to_notebook_iframe()

And if we want to process the results in a script or another program, we can also retrieve the analysis results in JSON format:

profile.to_json()
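Since `to_json()` returns a string, it can be parsed with the standard `json` module. The sketch below only lists the top-level sections; the helper name `top_level_sections` and the stand-in JSON string are hypothetical, and the real report structure depends on the installed version of the library.

```python
import json


def top_level_sections(report_json: str) -> list:
    """Return the sorted top-level keys of a JSON report string."""
    return sorted(json.loads(report_json).keys())


# Hypothetical stand-in so the sketch runs on its own.
# In practice: report_json = profile.to_json()
report_json = '{"table": {"n": 891}, "variables": {}}'

sections = top_level_sections(report_json)
print(sections)
```

Inspecting the keys first is a cautious way to discover what a given version exposes before relying on a specific field.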

To save the report to a file manually:

profile.to_file(output_file="titanic.html")

For large data files, the analysis can take a very long time. You can limit it with the minimal option, in which case some heavy computations such as correlations are skipped:

profile = ProfileReport(large_dataset, minimal=True)
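A complementary approach, which is my suggestion rather than something specific to the library, is to profile a random sample of the rows. The sketch below uses pandas `DataFrame.sample`; the `large_dataset` contents and the sample size are placeholders.

```python
import pandas as pd

# Hypothetical large DataFrame standing in for large_dataset.
large_dataset = pd.DataFrame({"value": range(100_000)})

# Draw a reproducible random sample of rows (without replacement).
sample = large_dataset.sample(n=10_000, random_state=42)

print(len(sample))

# The sample can then be profiled like any other DataFrame, for example:
# profile = ProfileReport(sample, minimal=True)
```

Keep in mind that statistics computed on a sample are estimates, so rare values or rare missing-data patterns may not show up.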

Conclusion

You now know how to do advanced profiling on a dataset with almost no effort (well, OK, just two lines of Python). Be careful though: as I said just above, the computations behind this profiling can be very heavy. Beware of big DataFrames, otherwise you will have to arm yourself with patience.

Find the elements of this article on GitHub.

Benoit Cayla
