Is big data dead? Long live Machine Learning


Have you noticed? We hardly talk about Big Data anymore! Yet this “buzzword” has been at the heart of the marketing strategy of many companies and software vendors over the past ten years. Now that the buzz is fading, what is really going on? It is hard to imagine that the deluge of data will not happen. In fact, it is quite the opposite: the flood is here, the volume of stored data keeps growing, and it shows no sign of stopping.

The end of one buzz? The beginning of another…

Doing Hadoop for the sake of doing Hadoop, as we have seen, naturally makes no sense! Sometimes you have to let nature take its course and let natural selection do its work. If we look, ten years or so on, at the Hadoop projects that still exist and are still evolving, we can see what role this technology has naturally taken on in the market.

But then why are we no longer talking about Big Data? Or, to be more precise, why do we no longer talk about Hadoop (the two terms must of course be distinguished)? The problem of managing massive data has not disappeared, quite the contrary; it is rather the question of the means used to manage it that is dropping off the press’s radar. In fact, isn’t the real question about data growth: what are we going to do with it?

In recent years, companies have therefore tackled the how; what remains now is the why. Fortunately, it seems that a meaning, or should I say a use, for this data is emerging thanks to Artificial Intelligence. Nothing new there either, except for the combination of several factors, including the explosion of data, which is giving AI a boost. It can now display impressive confidence levels.

AI feeding on Big Data… it’s actually quite relevant, and it’s even happening at the right time.

Hadoop: in search of meaning

Behind this somewhat provocative title lies a real marketing question. But what is Hadoop anyway?

Hadoop is a free and open source framework written in Java, intended to facilitate the creation of distributed (data storage and processing) and scalable applications that can work with thousands of nodes and petabytes of data. Each node is a standard machine, and the nodes are grouped into a cluster. All Hadoop modules are designed with the fundamental idea that hardware failures are frequent and should therefore be handled automatically by the framework.
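
To make the processing side of this definition more concrete, here is a minimal sketch of the MapReduce programming model (one of the Hadoop components) in plain Python. It is only an illustration that runs on a single machine: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are hypothetical, and none of this is Hadoop's actual API. In a real cluster, the framework would spread the map and reduce steps across the nodes and rerun the ones lost to a hardware failure.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key (done by the framework between map and reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce sequentially on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, standing in for files spread across a cluster.
    sample = ["big data is big", "data feeds machine learning"]
    print(word_count(sample))
```

Each call to `map_phase` depends on a single line and each call to `reduce_phase` on a single key, which is what lets this kind of job be split across many standard machines.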


Hadoop is therefore not Big Data; it is a way of managing Big Data!

Beyond Wikipedia’s technical definition, it must be remembered that Hadoop made its appearance in 2008 in the form of the Open Source project of the same name. This was the start of an incredible marketing buzz that would last around 10 years. Of course, this fame followed a bell curve in terms of appeal. At its peak, in the early 2010s, it seemed obvious that every CIO needed to have a Big Data project running on a Hadoop cluster. A real must for any IT department that wanted to be innovative. Now that it is time to take stock, how many of these projects have actually led to operational deployments in production? Strangely, we find far fewer statistics on this point. How many actually managed to tame the elephant? Few, apparently.

It’s not really surprising either. Hadoop technology has many qualities, but also many limits, and its first problem was that, amid the growing enthusiasm around it, everyone wanted to use it quickly. It was then put everywhere, often just to try it out. Unfortunately, this technology could not meet every use case and, furthermore, took years to stabilize. Many (too many) project failures (cf. Silicon article) then followed one after another.

Personally, I see three major reasons:

  1. Hadoop technology is not suitable for all types of data storage (operational vs. Business Intelligence). It may seem obvious, but very often Hadoop was chosen on cost criteria alone (because it is Open Source). However, to quote Nick Heudecker, it is illusory to think that Hadoop will replace all databases!
  2. The skills required to set up and run a Hadoop cluster are rare, and it sometimes takes a lot of effort to keep a cluster operational.
  3. This technology is too complex because it is in fact an assembly of multiple technologies (also Open Source: HDFS, YARN, MapReduce, Spark, ZooKeeper, HBase, Pig, Hive, etc.). As a result, a Hadoop system is constantly (or almost constantly) being updated, and updating a single component can potentially endanger the others (hence the emergence of distributions).

What future then for Hadoop?

In the end, the failures of Hadoop projects are not really technological failures but rather the youthful mistakes of a technology and a market that needed to find their place in a complex and rapidly changing IT ecosystem. Further proof is the merger of the two giants of the field, Cloudera and Hortonworks. The market has recently consolidated to refocus on its essence.

That essence is, for example, storing massive data from IoT devices or other data probes, instead of trying to replace databases at all costs. Add to this the rise of the cloud, which undermines technologies that are this complex to implement, and we better understand why companies increasingly prefer to delegate this complexity to third-party services.

Finally putting all this data to use

For all those who had designed their data lake, somewhat under the influence of this fad, a question then arises: what to do with it? A data lake is difficult to use as a data warehouse (or only in certain, ultimately rather limited, cases), yet these famous data wells, which moreover have the natural property of growing, must be made profitable.

On closer inspection, there is one area that needs a lot of data to exist: Artificial Intelligence, and more particularly Machine Learning (in which I include Deep Learning). Because they operate by learning, these algorithms need to ingest a lot of data, and I do mean a lot, to be efficient and relevant. The data lake is therefore a perfect source of data for these algorithms.

Besides, Machine Learning and Deep Learning aren’t really new. Most of the algorithms underlying AI are far from young and have been in use for a long time (Bayes, decision trees, etc.). The real novelty for AI is that the data needed to train these algorithms is now available, and that computing power keeps growing as well.

And this is how data lakes are increasingly becoming the main data sources for our dear data scientists. It is up to them to draw from these lakes the precious data they need to produce their analyses… but that’s another story.


So, is Hadoop dead?

Not really… let’s say rather that its function has refocused to become what it should never have ceased to be, namely a tool for managing a massive, heterogeneous data sink. A real data well in which all types of information are stored so that an empty bucket can be lowered into it. Of course, we don’t necessarily know how long it will take to haul the bucket back up, but one thing remains certain: it will be full.

Its use was therefore found naturally, at least by those for whom having so much data was really useful. Which use? Well, producing analytics and predictions. That is what Hadoop was initially designed for: feeding effective analytical and predictive models! We can say that Hadoop is quite simply returning to its primary function.



Benoit Cayla

In more than 15 years, I have built up solid experience in various integration projects (data and applications). I have worked in nine different companies and successively taken on the perspective of the service provider, the customer and the software vendor. This experience, which has made me almost omniscient in my field, naturally led me to be involved in large-scale projects around the digitalization of business processes, mainly in sectors such as insurance and finance. Truly passionate about AI (Machine Learning, NLP and Deep Learning), I joined Blue Prism in 2019 as a pre-sales solution consultant, where I can combine my subject-matter skills with automation to help my customers automate complex business processes more efficiently. Alongside my professional activity, I run a blog that aims to show how to understand and analyze data as simply as possible. Learning, convincing through argument and passing on my knowledge could be my characteristic triptych.
