In my “how to handle strings” series, we have already seen:
- Some general stuff
- NLP with spaCy
- Bags of words
So how could we not talk about regular expressions, or RegEx?
RegEx are indeed an essential tool for handling, searching and replacing character strings. It is therefore impossible to skip them when dealing with data of this type.
Before you start …
Before starting, I would like to clarify that the objective of this article is not to give a course or a reference on RegEx. Far be it from me to give a lecture on the subject. It would not be of much use anyway, because RegEx is learned by practice. And there are already plenty of great sites on the net.
We will first go through some useful links to learn RegEx. Then we will see how to practice them effectively (through cheat sheets, tools, etc.). Finally, we will see examples of implementation in languages such as Python and Java.
The idea is therefore to provide a practical and useful RegEx reference sheet!
Memento “cheat sheet”
You will find on the internet a plethora of one-page RegEx cheat sheets, so I am not going to redo what has already been done well. Instead, here are some useful links to find this information:
- regular-expressions-cheat-sheet-v2 (a PDF file in English, worth keeping)
- OpenClassrooms (in French)
- RexEgg (in English)
This list is far from exhaustive, as a simple search with your favorite search engine will show you.
The essential tools
As I mentioned above, regular expressions are learned by practice. More precisely, they are tested: we often spend our time adjusting our RegEx. Without tools, and with a trial-and-error approach, you can waste hours finding the right recipe.
We must therefore use tools that allow us to test and “debug” our regular expressions. Here are a few (free and easily accessible on the net):
regex101
Accessible at https://regex101.com/
A very practical tool with highlighting to better understand what is happening, integrated help and, of course, a live display of the results as you type. The tool can even generate the matching code (Python, Java, etc.) on request. A library of ready-to-use RegEx is also available. In short, a great tool!
debuggex
Accessible at https://www.debuggex.com/
A very practical tool with a slightly different approach, because it is graphical and somewhat more interactive. Basically, the diagram shows the decomposition of your RegEx, which makes it more visual (very useful for very large regular expressions). Also very useful is the “slider” below the diagram, which lets you move the cursor along the test string.
Other tools
There are really a lot of tools out there for testing regular expressions. I will just mention these in addition:
- ExtendsClass which also has the merit of offering several other tools for developers
- Rubular
- Directory-info (really basic tool)
- Regex pal
- QuentinC
- etc.
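Once a pattern has been tuned in one of these tools, the test strings can be kept next to the code so that the "adjust and retest" loop stays cheap. The following is only a minimal sketch of that idea: the helper name check_pattern and the five-digit postal code example are hypothetical and do not come from any of the tools above.

```python
import re

def check_pattern(pattern, cases):
    """Return the (text, expected, actual) triples where the pattern misbehaves.

    `cases` maps a test string to True (should match entirely) or False.
    """
    compiled = re.compile(pattern)
    failures = []
    for text, expected in cases.items():
        actual = compiled.fullmatch(text) is not None
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

# Hypothetical requirement: a code made of exactly five digits.
cases = {"75001": True, "7500": False, "750011": False, "7500a": False}

first_try = check_pattern(r"\d{4}", cases)   # wrong on purpose: four digits
second_try = check_pattern(r"\d{5}", cases)  # adjusted pattern

print(first_try)
print(second_try)
```

An empty list means that every test string behaved as expected, so the pattern can be adjusted until nothing is reported.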
Some useful examples
Check an email (in lowercase):
\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b
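As an illustration, here is one possible way to use this pattern from Python, both to validate a whole string and to extract addresses from a longer text. The function names is_email and find_emails are illustrative assumptions, and the sample addresses are made up.

```python
import re

# The pattern given above: lowercase letters only.
EMAIL_RE = re.compile(r"\b[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}\b")

def is_email(text):
    """True if the whole string looks like a lowercase email address."""
    return EMAIL_RE.fullmatch(text) is not None

def find_emails(text):
    """Return every substring that looks like an email address."""
    return EMAIL_RE.findall(text)

print(is_email("john.doe@example.com"))
print(is_email("John.Doe@example.com"))  # uppercase letters are rejected
print(find_emails("write to alice@example.org or bob@test.io today"))
```

If mixed-case addresses should be accepted, passing re.IGNORECASE to re.compile is one option.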
Check the format of an IP (IPv4) address:
\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b
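Note that this pattern only checks the shape of the address (four groups of one to three digits), so a string such as 999.1.1.1 also passes. A possible sketch in Python, assuming we also want each number to be between 0 and 255, combines the RegEx with a small numeric check (the function name is_ipv4 is illustrative):

```python
import re

# The pattern given above: checks the format only, not the numeric range.
IPV4_FORMAT_RE = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def is_ipv4(text):
    """True if the text has the IPv4 format and every number is in 0..255."""
    if IPV4_FORMAT_RE.fullmatch(text) is None:
        return False
    return all(0 <= int(part) <= 255 for part in text.split("."))

print(is_ipv4("192.168.0.1"))
print(IPV4_FORMAT_RE.fullmatch("999.1.1.1") is not None)  # format alone accepts it
print(is_ipv4("999.1.1.1"))                               # range check rejects it
```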
Check a VISA card number (be careful, this RegEx only works for VISA, and not Mastercard for example):
^4[0-9]{12}(?:[0-9]{3})?$
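The pattern accepts 13 or 16 digits starting with 4 and nothing else, so spaces or dashes typed by a user have to be removed first. Here is a minimal sketch in Python under that assumption. The function name looks_like_visa is illustrative, and the check says nothing about whether the card actually exists.

```python
import re

# The pattern given above: a 4 followed by 12 digits, optionally 3 more.
VISA_RE = re.compile(r"^4[0-9]{12}(?:[0-9]{3})?$")

def looks_like_visa(number):
    """True if, once spaces and dashes are removed, the number has the VISA format."""
    digits = re.sub(r"[ -]", "", number)
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate
    return VISA_RE.fullmatch(digits) is not None

print(looks_like_visa("4111 1111 1111 1111"))
print(looks_like_visa("5500 0000 0000 0004"))  # does not start with 4
```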
Check that a password meets certain criteria (at least 1 letter, 1 digit and 1 special character, with a total length between 8 and 15 characters):
^(?=.*[A-Za-z])(?=.*\d)(?=.*[&-+!*$@%_])([&-+!*$@%_\w]{8,15})$
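Each (?=...) is a lookahead that checks one criterion without consuming any character, and the final group then checks the allowed characters and the length. One detail worth testing: inside the brackets, &-+ is read as a range from & to +, so the characters ', ( and ) are accepted while a literal hyphen is not. The sketch below, with an illustrative function name, simply tries the pattern as written on a few made-up passwords:

```python
import re

# The pattern given above, unchanged.
PASSWORD_RE = re.compile(
    r"^(?=.*[A-Za-z])(?=.*\d)(?=.*[&-+!*$@%_])([&-+!*$@%_\w]{8,15})$"
)

def is_valid_password(text):
    """True if the text satisfies the pattern above."""
    return PASSWORD_RE.fullmatch(text) is not None

print(is_valid_password("Passw0rd!"))  # letter, digit, special character
print(is_valid_password("password1"))  # no special character
print(is_valid_password("Ab1!"))       # too short
print(is_valid_password("Passw0rd("))  # "(" falls inside the &-+ range
print(is_valid_password("Passw0rd-"))  # "-" is not in the character class
```

If the hyphen is meant to be an allowed special character, escaping it or moving it to the end of the brackets would be one way to adjust the pattern.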
Use with Python
Using RegEx with Python could not be simpler: just use the re module (import re). In the example below we retrieve two elements (groups) from each match found in a string:
import re

# One uppercase letter (group 1) followed by zero or more digits (group 2)
regex = r"([A-Z])([0-9]*)"
matches = re.finditer(regex, "A122 Z3")

for match_num, match in enumerate(matches, start=1):
    # Whole match, with its start and end positions in the string
    print(f"Match no. {match_num} | {match.start()}-{match.end()}: string {match.group()}")
    # Capture groups are numbered from 1
    for group_num in range(1, len(match.groups()) + 1):
        print(f"Element {group_num} found: {match.start(group_num)}-{match.end(group_num)}: {match.group(group_num)}")
Result:
Match no. 1 | 0-4: string A122
Element 1 found: 0-1: A
Element 2 found: 1-4: 122
Match no. 2 | 5-7: string Z3
Element 1 found: 5-6: Z
Element 2 found: 6-7: 3
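Since RegEx are also used for searching and replacing, here is a short complementary sketch with the same pattern, this time with named groups. The group names letter and number and the function names are illustrative choices, not something imposed by the re module.

```python
import re

# Same pattern as above, with named groups (the names are illustrative).
CODE_RE = re.compile(r"(?P<letter>[A-Z])(?P<number>[0-9]*)")

def first_code(text):
    """Return (letter, number) for the first match, or None if there is none."""
    match = CODE_RE.search(text)
    if match is None:
        return None
    return match.group("letter"), match.group("number")

def all_codes(text):
    """Return the list of (letter, number) pairs found in the text."""
    return CODE_RE.findall(text)

def swap_parts(text):
    """Rewrite every match with the number placed before the letter."""
    return CODE_RE.sub(r"\g<number>\g<letter>", text)

print(first_code("A122 Z3"))
print(all_codes("A122 Z3"))
print(swap_parts("A122 Z3"))
```

Compiling the pattern once with re.compile is a reasonable habit when the same RegEx is applied many times.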
Use with Java
In Java it is no more complicated: we use the regex package (java.util.regex.*), which makes our life just as simple:
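<pre class="wp-block-syntaxhighlighter-code">import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The class name RegexDemo is only illustrative
public class RegexDemo {
    public static void main(String[] args) {
        // One uppercase letter (group 1) followed by zero or more digits (group 2)
        final Pattern pattern = Pattern.compile("([A-Z])([0-9]*)", Pattern.MULTILINE);
        final Matcher matcher = pattern.matcher("A122 Z3");
        while (matcher.find()) {
            // group(0) is the whole match
            System.out.println("Found: " + matcher.group(0));
            for (int i = 1; i <= matcher.groupCount(); i++) {
                System.out.println("Group " + i + ": " + matcher.group(i));
            }
        }
    }
}</pre>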