Analyze your data with Pandas-profiling


The world of Pandas never stops expanding! With the pandas_profiling library, automatic (and quite complete) data analysis is within everyone's reach. Simple and quick to put in place, I couldn't pass up this little profiling gem.

First profiling

Let's start by installing the library with pip:

$ pip install pandas_profiling

As one might suspect, this library works with Pandas DataFrames. We will therefore use the Titanic data again to see what it produces.

from pandas_profiling import ProfileReport
import pandas as pd

# Load the Titanic training set into a DataFrame
train = pd.read_csv('../datasources/titanic/train.csv')
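
Before profiling, a quick sanity check with plain Pandas (nothing specific to pandas_profiling here) confirms what was loaded:

print(train.shape)    # number of rows and columns
print(train.dtypes)   # column types inferred by Pandas
train.head()          # first few rows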

Then we generate the profiling report; two lines of code are enough:

prof = ProfileReport(train)
prof.to_file(output_file='report.html')   # writes the full HTML report to disk

A report.html file is then generated. This comprehensive report presents all the information you need to start working with the data.

Description of the analyzed elements

This is what the generated HTML page looks like:

In it we find, organized sequentially (or accessible via the top menu):

  • An overview that lets you spot at a glance the main concerns a data scientist has to deal with: missing data, cardinality, etc.
  • Basic information such as the number of columns/rows, missing or duplicated data, memory usage, etc.

For each column:

  • Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics such as mean, median, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Type inference: the actual data types found in the dataset (categorical strings, numeric strings, etc.)
  • The most frequent values
  • Distribution of values (histograms)
  • A detailed analysis of correlations highlighting highly correlated variables, with Spearman, Pearson and Kendall matrices (see the configuration sketch after this list)
  • Matrix of missing values, counts, heat map and dendrogram of missing values
  • Text analysis that informs you about categories (upper case, spaces), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
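
If you only care about some of these correlation matrices, they can be switched on or off individually. The exact configuration keys depend on your version of pandas_profiling, so treat this as a sketch rather than a guaranteed API:

prof = ProfileReport(
    train,
    correlations={
        'pearson':  {'calculate': True},
        'spearman': {'calculate': True},
        'kendall':  {'calculate': False},   # skip Kendall to save a bit of time
    },
)
prof.to_file(output_file='report.html')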

Let's go a little further with the library

For those who use Jupyter (like me), instead of generating a separate HTML file, we can display the report directly in the notebook:

profile = ProfileReport(train, title='Titanic Dataset Analysis', html={'style': {'full_width': True}})
profile.to_notebook_iframe()   # renders the report inside the notebook as an iframe
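
Depending on your version of pandas_profiling, the report can also be rendered as interactive Jupyter widgets instead of an iframe:

profile.to_widgets()   # tabbed, interactive view inside the notebook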

And if we want to process the resulting data in a script or another program, we can also retrieve the results of the analysis in JSON format:

profile.to_json()
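
to_json() returns a JSON string, so you can write it to a file or parse it straight away. The output file name below is just an example:

import json

json_data = profile.to_json()            # the full analysis as a JSON string
with open('titanic_profile.json', 'w') as f:
    f.write(json_data)

results = json.loads(json_data)          # back to a Python dict for further processing
print(list(results.keys()))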

To save manually:

profile.to_file(output_file="titanic.html")

For large data files, the analysis can take a very long time. You can then limit it via the minimal option: certain expensive computations, such as correlations, will not be executed:

profile = ProfileReport(large_dataset, minimal=True)
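
You can also start from minimal mode and selectively re-enable just the computations you need. Again, the exact configuration keys may vary between versions of pandas_profiling, so this is only a sketch, and the output file name is hypothetical:

profile = ProfileReport(
    large_dataset,
    minimal=True,
    correlations={'pearson': {'calculate': True}},   # keep only Pearson, everything else stays off
)
profile.to_file(output_file='large_dataset_report.html')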

Conclusion

You now know how to do advanced profiling on a dataset without any effort (well, OK, just two lines of Python). Be careful though: as I said just above, the computations underlying this profiling can be very heavy, so watch out for big DataFrames, otherwise you will have to arm yourself with patience.

You can find the code from this article on GitHub.
