Strings Comparison


Introduction

In a previous article we saw how we could compare strings, or rather sentences, with or without NLP (using “bags of words”). In this post I propose to look at the comparison of words themselves. The question here is not so much the meaning of a sentence as the resemblance of one word to another.

We will therefore discuss the issues we commonly encounter when dealing with strings: typing errors, simple mistakes, abbreviations, in short, all sorts of errors that prevent us from processing information correctly. We will rely on distance algorithms (Levenshtein, Hamming, etc.) that have repeatedly proven their effectiveness. Then we will go a little further with a library whose name is, to say the least, original: Fuzzywuzzy!

Old-fashioned comparison!

To begin with, the first thing to do when comparing strings is to remove the elements that do not add value to the string itself. This is called noise. Imagine that my system contains two first names: “Ben oit.” and “benoît”. It seems obvious to a human being that they are the same first name. It is just as obvious that a “basic” comparison will not give the expected result. Capital letters, accents, spaces and other punctuation marks carry meaning, but they prevent us from comparing our strings directly.

We will therefore have to remove or replace them (for example turning î into i, or B into b). Likewise, we will pick a single case (upper or lower) so that the strings are compared on an equal footing.

Here is an example of a Python function to compare these two strings:

import re

noise_list = [".", ",", "?", "!", ";"]

def prepString(_str, _noise = True, _multiplespaces = True):
    # remove noise (punctuation) if requested (default yes)
    if _noise:
        for car in noise_list:
            _str = _str.replace(car, "")
    # replace multiple spaces by one if requested (default yes)
    if _multiplespaces:
        _str = re.sub(r"\s+", " ", _str).strip()
    return _str.strip().lower()

def compareString(Str1, Str2):
    return prepString(Str1) == prepString(Str2)

compareString("Ben  oit.,", "BEN oit")
True
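
Note that prepString does not handle accented characters like the î in “benoît”. A minimal sketch of an accent-stripping step, using Python’s standard unicodedata module (the helper name stripAccents is just an illustration), could be added on top of it:

import unicodedata

def stripAccents(_str):
    # decompose accented characters (î -> i + combining mark) and drop the marks
    nfkd = unicodedata.normalize("NFKD", _str)
    return "".join(c for c in nfkd if not unicodedata.combining(c))

compareString(stripAccents("Benoît."), stripAccents("BENOIT"))
True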

Hamming comparison

The objective of this comparison is to count the differences between two strings of characters. Please note that the strings being compared must be the same length, which quickly becomes a drawback.

def hamming_distance(string1, string2): 
    if len(string1) != len(string2):
        return -1
    # Start with a distance of zero, and count up
    distance = 0
    # Loop over the indices of the string
    L = len(string1)
    for i in range(L):
        # Add 1 to the distance if these two characters are not equal
        if string1[i] != string2[i]:
            distance += 1
    # Return the final count of differences
    return distance

hamming_distance("BINUIT", "BENOIT")
2

We have two differences between BINUIT and BENOIT (the I instead of the E and the U instead of the O). In practice, the Hamming distance (and Hamming codes) is mostly used to detect and correct errors in data transmission.
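
As mentioned above, the length constraint is the main limitation: with the function above, strings of different lengths simply cannot be scored.

hamming_distance("BENOIT", "BEN")
-1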

Jaro-Winkler comparison

This comparison is much finer and returns a score measuring the distance between the two compared strings. This measure is derived from the Jaro distance, which is often used for deduplication operations.

If s1 and s2 are the two strings being compared, the Jaro similarity is defined as (see Wikipedia):

d_j = 0 if m = 0, otherwise d_j = ( m/|s_1| + m/|s_2| + (m − t)/m ) / 3

where:

  • |s_1| is the length of the string s1 (same for |s_2|)
  • m is the number of matching characters
  • t is half the number of transpositions
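
As a sanity check, we can compute the score for BINUIT and BENOIT by hand (under this definition the two strings have m = 4 matching characters — B, N, I, T — and t = 0 transpositions):

# Hand computation of the Jaro score for BINUIT vs BENOIT
m, t, len1, len2 = 4, 0, 6, 6
print((m/len1 + m/len2 + (m - t)/m) / 3)   # ~0.778, matching the library result below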

Python of course has a library that implements this formula:  jaro-winkler

Let’s install the library via pip:

pip install jaro-winkler

Then let’s use it:

import jaro
# Classic Jaro computation
jaro.jaro_metric("BINUIT", "BENOIT")
0.7777777777777777
# Jaro distance with the Winkler adjustment
jaro.jaro_winkler_metric("BINUIT", "BENOIT")
0.7999999999999999

The returned score (between 0 and 1) allows us to assess the similarity between the two strings.
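The Winkler variant boosts the Jaro score when the two strings share a common prefix (up to four characters), scaled by a factor p, usually 0.1. A quick hand check for BINUIT/BENOIT, whose common prefix is just “B” (length 1):

# Winkler adjustment: jaro + prefix_len * p * (1 - jaro)
jaro_score = 7/9          # the Jaro score computed above
prefix_len, p = 1, 0.1
print(jaro_score + prefix_len * p * (1 - jaro_score))   # ~0.80, matching jaro_winkler_metric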

Levenshtein

Levenshtein’s algorithm gives us two metrics: distance and ratio.

  • Distance: the minimum number of edits (insertions, deletions or substitutions) needed to transform one word into the other.
  • Ratio: the similarity between the two words (between 0 and 1).
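
The Levenshtein module used below can be installed via pip (for example through the python-Levenshtein package):

pip install python-Levenshtein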
import Levenshtein as lev

def levCalculate(str1, str2):
    Distance = lev.distance(str1, str2)
    Ratio = lev.ratio(str1, str2)
    print("Levenshtein between {0} and {1}".format(str1, str2))
    print("> Distance: {0}\n> Ratio: {1}\n".format(Distance, Ratio))

levCalculate("Benoit", "Ben")
levCalculate("Benoit", "Benoist")
Levenshtein between Benoit and Ben
> Distance: 3
> Ratio: 0.6666666666666666

Levenshtein between Benoit and Benoist
> Distance: 1
> Ratio: 0.9230769230769231

Fuzzywuzzy

This library allows us to go further than the pure distance comparisons we have seen so far. It makes it possible to detect membership (one string contained in another), word inversions, and so on.
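
As before, the library is installed via pip (the PyPI package is named fuzzywuzzy):

pip install fuzzywuzzy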

Link and membership (partial_ratio)

from fuzzywuzzy import fuzz

Str1 = "Les scouts de France"
Str2 = "France"
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Partial_Ratio)
100
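
partial_ratio looks for the best match of the shorter string within the longer one, which is why "France" scores 100 against "Les scouts de France" even though the full strings are quite different.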

Inversions in the string (token_sort_ratio)

Str1 = "Ceci est un test"
Str2 = "un test est ceci"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
50
50
100
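
token_sort_ratio sorts the words of each string alphabetically before comparing them, so the inverted word order no longer penalizes the score, hence the 100.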

Strings of different sizes (token_set_ratio)

Str1 = "Ce soir on regarde un match France contre Angleterre"
Str2 = "France vs Angleterre"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
50
70
53
92
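
Roughly speaking, token_set_ratio isolates the tokens common to both strings and scores them separately from each string's leftover tokens, keeping the best result; this is why the extra words in the first sentence barely penalize the score here.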

Conclusion

Using these algorithms is actually pretty straightforward. The difficulty lies not in their use but in choosing the right one. Indeed, there are many algorithms for comparing strings. You can go to https://pypi.org/project/textdistance/ for a brief overview of the distance algorithms available. There are many, and each has its specificities and, above all, its specialty: some are better suited to short strings, others to long strings, to strings containing numbers, and so on.

In other words, it is better to read up on and understand these algorithms before implementing them, and why not combine them as well.

The sources for this article are on Github
