Introduction
In a previous article we saw how to compare strings, or rather sentences, with or without NLP (using “bags of words”). In this post I propose to look at the comparison of words. The question we are going to ask ourselves here is not so much the meaning of a sentence but rather how much one word resembles another.
We will therefore discuss the various issues we commonly encounter when dealing with strings: typos, simple mistakes, abbreviations, in short, the varied errors that prevent us from processing information correctly. We will use distance algorithms (Levenshtein, Hamming, etc.), which have repeatedly proven their effectiveness. Then we will go a little further with a library that has, to say the least, an original name: FuzzyWuzzy!
Old-fashioned comparison!
To begin with, the first thing to do when comparing strings is to remove the elements that add no value to the string itself. This is called noise. Imagine that my system contains two first names: “Ben oit.” and “benoît”. To a human being it is obvious that they are the same first name. It is just as obvious that a “naive” comparison will not give the expected result. Capital letters, accents, spaces and other punctuation marks carry meaning but prevent us from comparing our strings.
We will therefore have to remove or replace them (for example replacing î with i, or B with b). Likewise, we will pick one case (upper or lower) so that the strings are compared on the same footing.
Here is an example of a Python function to compare these two strings:
import re

noise_list = [".", ",", "?", "!", ";"]

def prepString(_str, _noise = True, _multiplespaces = True):
    # remove noise (punctuation) if asked (by default yes)
    if _noise:
        for car in noise_list:
            _str = _str.replace(car, "")
    # replace multiple spaces by one in the string if requested (default yes)
    if _multiplespaces:
        _str = re.sub(r"\s+", " ", _str).strip()
    return _str.strip().lower()

def compareString(Str1, Str2):
    return prepString(Str1) == prepString(Str2)
compareString("Ben oit.,", "BEN oit")
True
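Note that prepString does not yet handle accented characters (the î mentioned earlier). Here is a minimal sketch of that step, using Python's standard unicodedata module (the helper name stripAccents is our own):

import unicodedata

def stripAccents(_str):
    # Decompose accented characters ("î" becomes "i" plus a combining mark),
    # then drop the combining marks
    nfkd = unicodedata.normalize("NFKD", _str)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

stripAccents("benoît")
'benoit'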
Hamming comparison
The objective of this comparison is to count the differences between two character strings. Note that the compared strings must be the same length, which quickly becomes a drawback.
def hamming_distance(string1, string2):
    if len(string1) != len(string2):
        return -1
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance
hamming_distance("BINUIT", "BENOIT")
2
We have two differences between BINUIT and BENOIT (the I instead of the E and the U instead of the O). In practice, the Hamming comparison (as in Hamming codes) is used more to detect and correct errors in data transmission.
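As an aside, when the two strings are already known to be the same length, the same count can be written more compactly with Python's zip (note that zip would silently truncate strings of different lengths, so the explicit length check above still matters):

sum(c1 != c2 for c1, c2 in zip("BINUIT", "BENOIT"))
2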
Jaro-Winkler comparison
This comparison is much finer and gives us a score measuring the distance between the two compared strings. The measure derives from the Jaro distance, which is often used in deduplication operations.
The Jaro distance d_j between two strings s_1 and s_2 is defined as (see Wikipedia):

    d_j = 0                                            if m = 0
    d_j = (1/3) * (m/|s_1| + m/|s_2| + (m - t)/m)      otherwise

where:
- |s_1| is the length of the string s_1 (same for |s_2|)
- m is the number of matching characters
- t is half the number of transpositions
Python of course has a library that implements this formula: jaro-winkler
Let’s install the library via pip:
pip install jaro-winkler
Then let’s use it:
import jaro
# Classic Jaro calculation
jaro.jaro_metric("BINUIT", "BENOIT")
0.7777777777777777
# Jaro distance with the Winkler adjustment
jaro.jaro_winkler_metric("BINUIT", "BENOIT")
0.7999999999999999
The returned score (between 0 and 1) allows us to assess the similarity between the two strings.
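As a sanity check, we can reproduce these scores by hand from the formula above. BINUIT and BENOIT have m = 4 matching characters (B, N, I, T), no transpositions (t = 0), and both strings have length 6:

    d_j = (1/3) * (4/6 + 4/6 + 4/4) ≈ 0.7778

The Winkler adjustment then boosts this score according to the length l of the common prefix (here l = 1, the initial B) and a scaling factor p (0.1 by default):

    d_jw = d_j + l * p * (1 - d_j) = 0.7778 + 1 * 0.1 * (1 - 0.7778) ≈ 0.8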
Levenshtein
Levenshtein’s algorithm gives us two metrics: distance and ratio.
- Distance: the minimum number of edits (insertions, deletions or substitutions) needed to transform one word into the other (a minimal implementation of this idea is sketched just after this list).
- Ratio: a similarity score between the two words.
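To make the edit-count idea concrete, here is a minimal, unoptimized dynamic-programming sketch of the distance (the library used below implements a much faster version):

def levenshtein(s1, s2):
    # dp[i][j] = edits needed to turn the first i chars of s1
    # into the first j chars of s2
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(len(s2) + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

levenshtein("Benoit", "Benoist")
1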
import Levenshtein as lev

def levCalculate(str1, str2):
    Distance = lev.distance(str1, str2)
    Ratio = lev.ratio(str1, str2)
    print("Levenshtein between {0} and {1}".format(str1, str2))
    print("> Distance: {0}\n> Ratio: {1}\n".format(Distance, Ratio))

levCalculate("Benoit", "Ben")
levCalculate("Benoit", "Benoist")
Levenshtein between Benoit and Ben
> Distance: 3
> Ratio: 0.6666666666666666
Levenshtein between Benoit and Benoist
> Distance: 1
> Ratio: 0.9230769230769231
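For the record, this library appears to compute the ratio as (|s1| + |s2| - ldist) / (|s1| + |s2|), where ldist is an edit distance in which substitutions count double; since the examples above involve no substitutions, we can check the numbers directly:

(6 + 3 - 3) / (6 + 3)  # 0.666... matches levCalculate("Benoit", "Ben")
(6 + 7 - 1) / (6 + 7)  # 0.923... matches levCalculate("Benoit", "Benoist")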
Fuzzywuzzy
This library lets us go further than the distance-style comparisons we have seen so far (it can be installed with pip install fuzzywuzzy). It makes it possible to detect inclusion (one string contained in another), word inversions, and more.
Inclusion and membership (Partial_Ratio)
Str1 = "Les scouts de France"
Str2 = "France"
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Partial_Ratio)
100
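Intuitively, partial_ratio looks for the best match of the shorter string inside the longer one. Here is a naive sketch of the idea (FuzzyWuzzy's actual implementation aligns matching blocks rather than scanning every window):

def naive_partial_ratio(shorter, longer):
    # Best plain ratio of `shorter` against every same-length window of `longer`
    best = 0
    for i in range(len(longer) - len(shorter) + 1):
        best = max(best, fuzz.ratio(shorter, longer[i:i + len(shorter)]))
    return best

naive_partial_ratio("france", "les scouts de france")
100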
Inversions in the string (Token_Sort_Ratio)
Str1 = "Ceci est un test"
Str2 = "un test est ceci"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
50
50
100
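token_sort_ratio reaches 100 here because it sorts the words alphabetically before comparing; both strings reduce to the same token-sorted form:

" ".join(sorted("Ceci est un test".lower().split()))
'ceci est test un'
" ".join(sorted("un test est ceci".lower().split()))
'ceci est test un'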
Strings of different sizes (Token_Set_Ratio)
Str1 = "Ce soir on regarde un match France contre Angleterre"
Str2 = "France vs Angleterre"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
50
70
53
92
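token_set_ratio scores much higher here because, roughly speaking, it works from the intersection of the two token sets, so the shared words dominate the comparison:

t1 = set("ce soir on regarde un match france contre angleterre".split())
t2 = set("france vs angleterre".split())
sorted(t1 & t2)
['angleterre', 'france']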
Conclusion
These algorithms are actually quite straightforward to use. The difficulty lies not in using them but in choosing the right one. There are indeed many string-comparison algorithms: see https://pypi.org/project/textdistance/ for a brief overview of distance algorithms. Each has its specificities and, above all, its specialty: some are geared towards short strings, others towards long strings, strings with numbers, and so on.
In other words, it is better to study and understand these algorithms before implementing them, and why not combine them as well.
The sources for this article are on Github