Hadoop 2.0 a step forward, but still just a framework for programmers

Back in the 1800s John Godfrey Saxe wrote a famous poem about six blind men trying to discover what an elephant is by touching a part and describing it. Saxe observes that, “each was partly in the right, and all were in the wrong.” Fast forward to today and we have a similar situation with Hadoop. Opinions about Hadoop are as varied – and sometimes as incorrect – as they were in the poem.

Hadoop has been variously described as the ideal way to do transaction processing, the ideal way to do search, and the ideal way to do analysis, all of which are quite different use cases. If that were not unlikely enough, it is also claimed to be the best way to analyze structured data, semi-structured data and unstructured data. In fact, we are led to believe that it is everything to everyone. 

How is this possible?

Hadoop is a primitive, undifferentiated technology that can be molded in various ways. In the evolutionary tree, it is far closer to low-level programming languages like C and Java than it is to function-specific programs like database management systems, or to even higher-level user applications like spreadsheets.

When people look at Hadoop and describe it in widely varying ways, they are all right, since it is like clay that can, in theory, be molded into whatever shape is required. The problem is that they are also wrong, in that Hadoop really is just a lump of clay. Turning it into something useful requires a lot of skill, time and effort. Hadoop 2.0 has done nothing to change this.

Now, I am not suggesting that Hadoop 2.0 isn’t a high-quality release. You definitely need lower-level technologies upon which to build the higher-level ones. It’s just that the current hype seems to be misplaced. When people praise Frank Lloyd Wright’s Fallingwater, how often do they emphasize the chemical composition of the concrete? The important thing about a piece of software is how easy it is to use and apply productively.

Like the traditional analytical stack, which employs data integration (DI), data warehousing and business intelligence (BI), Hadoop – and Hadoop 2.0 – has given us a new stack that is just as complex and acronym-rich: HDFS2 to YARN to HBase to various flavors of BI. In this new Hadoop world, data still has to be continually moved from place to place. Too many layers separate users from their data. Too much time and know-how are required to prepare data. The result: gainfully employed technologists, frustrated business managers, and a lost opportunity to remove the barriers that separate business users from insight.

The only other happy parties in this new world are recruiters, who are able to reap rich rewards for bagging unicorns – the fabled data scientists who possess a mastery of statistics, PhDs in computer science, and untold experience with Python, Hadoop, MapReduce, JSON and Hive – literally, the stuff of legends.

Business managers don’t want to worry about how to take advantage of YARN. They don’t want to learn the meaning of new terms like Hive, Spark, data reservoirs and data lakes, all of which now populate the tech discourse. They don’t want to have to ask IT to write a query or to merge in additional data sets. In short, they don’t want another system with lots of moving parts. They just want a simple tool that they can use to get answers as quickly and painlessly as possible.

At the end of the day, Hadoop 2.0 remains a framework for programmers. As necessary as low-level technologies are, it’s time we shifted our attention to the end user, for whom low-level technologies are as interesting as the wiring inside their office walls.

Hadoop may ultimately enable a renaissance in the user experience, but it hasn’t so far. After several years of hype, you can’t blame business users who sometimes feel that Hadoop is a white elephant. Let’s focus on business-user-oriented software – software that allows users to easily access and analyze unlimited amounts of data from various sources, on their own and without numerous degrees or the overhead of the traditional stack. Only then will business users see the full value of their data.

Sandy Steier is CEO and co-founder of 1010data.