Tutorial: Just do NLP with SpaCy!

Share this post

Index

NLP (Natural Language Processing) yes but with what?

As promised! I told you in my article on the bag of words that we would go further with NLP, and here we are. Now, which tool to choose? which library to use? indeed the Python world seems to be torn between two packages : the historic NLTK (Natural Language Toolkit) and the new little SpaCy (2015) .

Far from me the idea of making a comparison of these two very good libraries which are NLTKand SpaCy . There are pros and cons with each, but the ease of use, the availability of pre-trained word vectors as well as statistical models in several languages (English, German and Spanish) but especially in French definitely have me. switches to SpaCy. Make way for young people.

This little tutorial will therefore show you how to use this library.

Install and use the library

This library is not installed by default with Python. The easiest way to install it is to run a command line and use the pip utility as follows:

pip install -U spaCy
python -m spacy download fr
python -m spacy download fr_core_news_md

NB: The last two commands allow you to use models already trained in French. Then to use SpaCy you have to import the library but also initialize it with the right language with the load () directive

import spacy
from spacy import displacy
nlp = spacy.load('fr')

Tokenization

The first thing we are going to do is to “tokenize” a sentence in order to cut it grammatically. Tokenization is the operation of segmenting a sentence into “atomic” units: tokens. The most common tokenizations are splitting into words or sentences. We will start with the words and for that we will use the token class of SpaCy:

import spacy
from spacy import displacy
nlp = spacy.load('fr')
doc = nlp('Demain je travaille à la maison')
for token in doc:
    print(token.text)

The previous code cuts out the sentence ‘Tomorrow I work at home’ and displays each item

Demain
je
travaille
à
la
maison
en
France

Nothing very impressive so far, we just cut out a sentence. Now let’s take a closer look at what SpaCy did in addition to this slicing:

doc = nlp("Demain je travaille à la maison.")
for token in doc:
    print("{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}\t{8}".format(
        token.text,
        token.idx,
        token.lemma_,
        token.is_punct,
        token.is_space,
        token.shape_,
        token.pos_,
        token.tag_,
        token.ent_type_
    ))

The token object is much richer than it looks and returns a lot of grammatical information about the words in the sentence:

Demain	0	Demain	False	False	Xxxxx	PROPN	PROPN___	
je	7	il	False	False	xx	PRON	PRON__Number=Sing|Person=1	
travaille	10	travailler	False	False	xxxx	VERB	VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin	
à	20	à	False	False	x	ADP	ADP___	
la	22	le	False	False	xx	DET	DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art	
maison	25	maison	False	False	xxxx	NOUN	NOUN__Gender=Fem|Number=Sing	
.	31	.	True	False	.	PUNCT	PUNCT___

This object informs us of the use / qualification of each word (for more details go to SpaCy help ):

text: The original text / word
lemma_: the basic form of the word (for a conjugated verb for example we will have its infinitive)
pos_: The part-of-speech tag (details here )
tag_: The detailed part-of-speech information (detail here )
dep_: Syntax dependency (inter-token)
shape: format / pattern
is_alpha: Alphanumeric?
is_stop: Is the word part of a Stop-List?
etc.

Tokenison now sentences. In fact, the work is already done (Cf. documentation ) and you just have to retrieve the sents object from the document:

doc = nlp("Demain je travaille à la maison. Je vais pouvoir faire du NLP")
for sent in doc.sents:
    print(sent)

The result :

Demain je travaille à la maison.
Je vais pouvoir faire du NLP

SpaCy of course allows you to recover nominal sentences (ie sentences without verbs):

doc = nlp("Terrible désillusion pour la championne du monde")
for chunk in doc.noun_chunks:
    print(chunk.text, " --> ", chunk.label_)

Terrible désillusion pour la championne du monde  -->  NP

NER

SpaCy has a very efficient statistical entity recognition system (NER or Named Entity Recognition) which will assign labels to contiguous ranges of tokens.

doc = nlp("Demain je travaille en France chez Tableau")
for ent in doc.ents:
    print(ent.text, ent.label_)

The preceding code goes through the words in their context and will recognize that France is an element of localization and that Tableau is a (commercial) organization:

France LOC
Tableau ORG

Dependencies

Personally I find that this is the most impressive function because thanks to it we will be able to recompose a sentence by connecting the words in their context. In the example below we take our sentence and indicate each dependency with the properties dep_

<pre class="wp-block-syntaxhighlighter-code">doc = nlp('Demain je travaille en France chez Tableau')
for token in doc:
    print("{0}/{1} <--{2}-- {3}/{4}".format(
                                        token.text, 
                                        token.tag_, 
                                        token.dep_, 
                                        token.head.text, 
                                        token.head.tag_))</pre>

The result is interesting but not really readable:

Demain/PROPN___ <--advmod-- travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
je/PRON__Number=Sing|Person=1 <--nsubj-- travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin <--ROOT-- travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
en/ADP___ <--case-- France/PROPN__Gender=Fem|Number=Sing
France/PROPN__Gender=Fem|Number=Sing <--obl-- travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin
chez/ADP___ <--case-- Tableau/NOUN__Gender=Masc|Number=Sing
Tableau/NOUN__Gender=Masc|Number=Sing <--obl-- travaille/VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin

Fortunately SpaCy has a trace function to make this result visual. To use it you need to import displacy:

from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 130})

It is immediately more speaking like that right?

This concludes this first part on the use of NLP with the SpaCy library. You can find the codes for this mini-tutorial on GitHub.

For those curious about NLP, do not hesitate to read my article on NLTK as well .

Share this post

One thought on “Tutorial: Just do NLP with SpaCy!”

Tutorial: Just do NLP with SpaCy!

NLP (Natural Language Processing) yes but with what?

Install and use the library

Tokenization

NER

Dependencies

Benoit Cayla

One thought on “Tutorial: Just do NLP with SpaCy!”

Leave a Reply Cancel reply

NLP (Natural Language Processing) yes but with what?

Install and use the library

Tokenization

NER

Dependencies

Benoit Cayla

You might also like

Managing location data

Access Hive & HDFS via PySpark

Introduction to nlpcloud.io

One thought on “Tutorial: Just do NLP with SpaCy!”

Leave a Reply Cancel reply