NLP with Python NLTK

In a previous article we saw how the SpaCy library can help us analyze and, above all, exploit textual data. In this article we will see how to use Python's other (somewhat competing, but not that much) NLP library: NLTK.

Installation of NLTK

To start, we need to install the library. You must of course already have Python on your machine; then use pip (the package installer for Python) to install the nltk library:

pip install nltk

But we're not quite done: once the library is installed, you must download the NLTK corpora in order to use its features properly.

A corpus is a set of documents, artistic or not (texts, images, videos, etc.), grouped together for a specific purpose.

Wikipedia

In our case, by corpus we mean only textual elements.

To install these famous NLTK corpora, if like me you are using Jupyter, simply type:

import nltk
nltk.download()

Wait a few seconds and a window should open:

Downloading the NLTK corpora

We are not going to be selective for this tutorial: select "All", click the Download button in the lower-left corner of the window, then wait until everything has been downloaded to your destination folder.

Let’s start!

Now that our NLTK environment is ready, we will see together the basic functions of this library. For that we will use this text:

Wikipedia is a collective encyclopedia project: online, universal, multilingual, and operating on the wiki principle. Do you like the wikipedia encyclopedia?

Then import the libraries that will be necessary:

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.tokenize import sent_tokenize
import re

data = u"""Wikipédia est un projet wiki d’encyclopédie collective en ligne, universelle, multilingue et fonctionnant sur le principe du wiki. Aimez-vous l'encyclopédie wikipedia ?"""

The “stop words”

First of all, we need to remove all the words that do not really add value to the overall analysis of the text. These words are called "stop words", and of course the list is specific to each language. Good news: NLTK offers a list of stop words in French (not all languages are available):

french_stopwords = set(stopwords.words('french'))
filtre_stopfr =  lambda text: [token for token in text if token.lower() not in french_stopwords]

Thanks to Python’s lambda function, we created a small function that will allow us in a single line to filter a text from the list of French stop words.
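To make the mechanics clear without requiring the NLTK download, here is a plain-Python sketch of the same filter, using a hardcoded (and deliberately tiny, made-up) stop-word set instead of NLTK's real French list:

```python
# A minimal, NLTK-free sketch of the same idea: filter tokens against a
# hardcoded, heavily truncated set of French stop words (illustration only).
mini_stopwords = {"est", "un", "en", "et", "sur", "le", "du", "vous"}

# Keep only the tokens whose lowercased form is not a stop word.
filtre_mini = lambda text: [token for token in text if token.lower() not in mini_stopwords]

tokens = ["Wikipédia", "est", "un", "projet", "wiki"]
print(filtre_mini(tokens))  # ['Wikipédia', 'projet', 'wiki']
```

The NLTK version works exactly the same way; it just swaps in the full `stopwords.words('french')` list.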

filtre_stopfr( word_tokenize(data, language="french") )
['Wikipédia',
 'projet',
 'wiki',
 '’',
 'encyclopédie',
 'collective',
 'ligne',
 ',',
 'universelle',
 ',',
 'multilingue',
 'fonctionnant',
 'principe',
 'wiki',
 '.',
 'Aimez-vous',
 "l'encyclopédie",
 'wikipedia',
 '?']

And that's it: the words that carry little meaning (articles, etc.) have been removed from the global word list. This function is really useful; compare it, for example, with a filter that only uses a RegEx:

sp_pattern = re.compile(r"""[\.\!\"\s\?\-\,\']+""", re.M).split
sp_pattern(data)
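Running this naive split on our sentence shows its limits compared to word_tokenize: the hyphen breaks "Aimez-vous" in two, the straight apostrophe splits "l'encyclopédie", the curly apostrophe in "d’encyclopédie" is not in the character class so that token stays glued together, and a trailing empty string is left at the end. A quick self-contained check:

```python
import re

data = u"""Wikipédia est un projet wiki d’encyclopédie collective en ligne, universelle, multilingue et fonctionnant sur le principe du wiki. Aimez-vous l'encyclopédie wikipedia ?"""

# Split on runs of punctuation, whitespace, hyphens and straight apostrophes.
sp_pattern = re.compile(r"""[\.\!\"\s\?\-\,\']+""", re.M).split
tokens = sp_pattern(data)
print(tokens)
```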

Tokenization

With NLTK you can split a text by word with the function word_tokenize(…) or by sentence with sent_tokenize(…). Let's start with a split by sentence:

sent_tokenize(data, language="french")
['Wikipédia est un projet wiki d’encyclopédie collective en ligne, universelle, multilingue et fonctionnant sur le principe du wiki.',
 "Aimez-vous l'encyclopédie wikipedia ?"]

Interesting, but tokenization by word is of course even more so:

word_tokenize(data, language="french")
['Wikipédia',
 'est',
 'un',
 'projet',
 'wiki',
 'd',
...
 'Aimez-vous',
 "l'encyclopédie",
 'wikipedia',
 '?']

And if we combine this function with the stop words filter seen previously, it’s even more interesting:

filtre_stopfr( word_tokenize(data, language="french") )
['Wikipédia',
 'projet',
 'wiki',
 '’',
 'encyclopédie',
 'collective',
 'ligne',
 ',',
 'universelle',
 ',',
 'multilingue',
 'fonctionnant',
 'principe',
 'wiki',
 '.',
 'Aimez-vous',
 "l'encyclopédie",
 'wikipedia',
 '?']

Frequency distribution of values

It can be interesting to look at the frequency distribution of the values; for that there is of course a function, FreqDist():

phfr = filtre_stopfr( word_tokenize(data, language="french") )
fd = nltk.FreqDist(phfr)
print(fd.most_common())
[('wiki', 2), (',', 2), ('Wikipédia', 1), ('projet', 1), ('’', 1), ('encyclopédie', 1), ('collective', 1), ('ligne', 1), ('universelle', 1), ('multilingue', 1), ('fonctionnant', 1), ('principe', 1), ('.', 1), ('Aimez-vous', 1), ("l'encyclopédie", 1), ('wikipedia', 1), ('?', 1)]

This function returns a list of (value, count) pairs: each value of the corpus with its frequency. For example, the word wiki appears twice in the text.
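Incidentally, FreqDist behaves much like the standard library's collections.Counter, so the same count can be sketched in pure Python (token list hardcoded here for a self-contained example):

```python
from collections import Counter

# The stop-word-filtered tokens from our sentence (punctuation omitted here).
tokens = ['Wikipédia', 'projet', 'wiki', 'encyclopédie', 'collective',
          'ligne', 'universelle', 'multilingue', 'fonctionnant',
          'principe', 'wiki']

fd = Counter(tokens)
print(fd.most_common())  # 'wiki' appears twice, so it comes first
```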

Stemming

Now it would be interesting to group together words that share the same root; for that we will use NLTK's stemming function stem(). Good news again: there is a French stemmer, FrenchStemmer(), which we will of course use here:

from nltk.stem.snowball import FrenchStemmer

example_words = ["donner","don","donne","donnera","dons","test"]
stemmer = FrenchStemmer()

for w in example_words:
    print(stemmer.stem(w))
don
don
don
don
don
test

The stemmer has found the common root of the words related to "donner" (to give). Honestly, it does not always work this well, unfortunately, but this function can nevertheless be useful as a first pass.
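To give an intuition of what a stemmer does (the real Snowball algorithm behind FrenchStemmer is far more sophisticated), here is a deliberately naive suffix-stripping sketch; the suffix list is invented for the example and is NOT FrenchStemmer's actual rule set:

```python
# Naive illustration of suffix stripping; the suffixes below are made up
# for this example and do not reflect FrenchStemmer's real rules.
SUFFIXES = ("nera", "nnes", "nne", "ner", "nes", "ns", "ne", "s")

def naive_stem(word):
    # Strip the longest matching suffix, keeping at least 3 characters.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["donner", "don", "donne", "donnera", "dons", "test"]:
    print(naive_stem(w))  # "don" five times, then "test"
```

A real stemmer adds many such rules plus conditions on vowels and word regions, which is why it handles far more vocabulary than this toy version.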

Conclusion

The objective of this article being to show the basic features of NLTK, we have not gone into detail. This library is rich, very rich even, but it requires a lot of tuning if you want to build powerful NLP functions with it. The differences with SpaCy? In fact there are many: the two libraries have totally different philosophies (an object-oriented approach for SpaCy versus a more traditional one for NLTK, for example). I actually think we should see them as complementary, each making up for the other's shortcomings.

Benoit Cayla
