Managing string character

Share this post

If you want to have an analytical approach to your data, you have of course been faced with the difficulty of using character strings. So much so that very often you have had to give up or even put them aside. Lack of tools? complexity of managing complex semantics? in short, strings are often the cause of great frustration for analysts and other data scientists.

Let’s take a quick brief about this.

Did you say Text ?

While it is rather easy to process numeric data as I said in the preamble, character strings are more complex to work with. A first reason is that it is difficult to put them into an equation. But the real complexity in the treatment of character strings lies above all in the semantic analysis that we will be able to bring to it. Indeed, if each sentence has a meaning, the latter can only be deduced after analyzing each word. A sentence is therefore a sort of recipe in which each ingredient has its dosage, its place and an order which must be precise.

Fortunately, in what I mentioned above I am only talking about purely textual data. Why fortunately? and quite simply because we will not always be dealing with this type of data. We can easily agree on 3 main types of character string data:

  • Categorical data
  • Structured chains
  • Free texts

We just briefly talked about free texts, now let’s see the other types of channels.

Categorical data

Categorical data is also often referred to as list data. As the name suggests, this is data based on lists of values. The classic example is the concept of color or sex. You fill out an input form and the information in the categorical data field is then a drop-down list imposing a value on you in a defined list.

Of course, it is not always easy to work with categorical lists, because very often we do not know the list of values. Worse, if you are working on a Machine Learning algorithm, you will have to be careful that the field of possible values ​​of the training data set is the same as that of the test set! to do this, a data analysis (profiling) is necessary in order to discover this famous list of values ​​that make up our categorical data. Following this analysis and by constantly distributing a list of defined values, you will be able to conclude that a variable is categorical or not!

If you are doing machine learning all you have to do is process your categorical variable with an appropriate one-hot encoding . If you are in an integration project, a simple lookup will certainly suffice, for example to check the consistency of this data.

These categorical data can also undergo so-called data quality pre-processing. Indeed, imagine for example that your color (brown) was incorrectly entered (for example brown with only one r). Its meaning must remain the same but if you do nothing you will have two categories: brown and brown. It is therefore essential to transcode your brown in brown beforehand in order to keep all its meaning.

Structured chains

Structured chains are often the easiest to process. This is totally true when we know the structure such as XML, JSON, addresses (AFNOR), phone number, etc.

But it remains more complex as soon as we do not really know the structure! Here again, profiling can help to discover the structure. But often this will not be enough and will require a more in-depth manual analysis.

In many cases, it will be necessary to use specialized tools, such as address validation tools.

The texts

I’m finally getting there: textual data! This is where we usually get stuck and where the frustration of not being able to use information is felt. Why ? and quite simply because a sentence has a meaning, a semantics and a simple analysis by taxonomy will not always be sufficient. We will therefore have to treat these damn chains with a different approach.

We then speak of Text Mining or text mining.

Text Mining is a very important part of another very popular discipline: Artificial Intelligence. If we take the definition of Wikipedia : “It designates a set of computer processing consisting in extracting knowledge according to a criterion of novelty or similarity in texts produced by humans for humans. In practice, this amounts to putting into an algorithm a simplified model of linguistic theories in computer systems for learning and statistics. ” 

In general, there are two main stages in text mining: the decomposition and the analysis itself. The decomposition allows to recognize the phonemes, the sentences, the overall grammatical role, the taxonomy, etc. In short, the idea is to intelligently divide the text and to find in this division the place and role of each grammatical element.

The second step is more complex because it involves finely analyzing the sentences, separating or even isolating the sets that have meaning in order to take advantage of them.

I will stop there for this article but then I will offer you new posts allowing a gradual approach to text analysis. Starting with the analysis by bag of words then going to NLP (Natural Language Processing) .

Share this post

Benoit Cayla

In more than 15 years, I have built-up a solid experience around various integration projects (data & applications). I have, indeed, worked in nine different companies and successively adopted the vision of the service provider, the customer and the software editor. This experience, which made me almost omniscient in my field naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in such sectors like insurance and finance. Really passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject matter skills with automation to help my customers to automate complex business processes in a more efficient way. In parallel with my professional activity, I run a blog aimed at showing how to understand and analyze data as simply as possible: datacorner.fr Learning, convincing by the arguments and passing on my knowledge could be my caracteristic triptych.

View all posts by Benoit Cayla →

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Fork me on GitHub