Hadoop: Powering the next generation of analytics

Much like the numerous emerging enterprise technologies that surface every year, Apache Hadoop will soon see its hype fade. Hadoop, the framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models, needs to recede into the background in order to deliver on its promise of being the enterprise data hub of choice, as ubiquitous as the relational database management systems that power most of the online world today. It cannot remain an exclusive, specialized skillset; it needs to take a revolutionary approach to how it operates and become the preferred platform that powers the next generation of analytics. In other words, Hadoop must overcome the hype and evolve.

That is not to say that Hadoop has not made tremendous progress over the years. From a monolithic storage and batch architecture, it has transformed significantly into a modern, modular data platform. Hadoop now supports interactive data discovery through analytic SQL engines like Impala, and it supports Apache Spark as a next-generation data processing layer for a combination of batch and streaming workloads. Even with these many developments in place, ease of use and increased performance for developers have not been compromised.

Nevertheless, there is still more that needs to be done. Hadoop needs to be able to address the fundamental challenges that users are still facing – we identify the three most pressing issues below.

Better data engineering: Improving Spark for the enterprise

The role of data engineering needs to be addressed first, before we go into the discussion of data analytics. Some of you may be asking, “Data engineering, really?” Yes, and rightly so! With the responsibility of designing and building the infrastructure jointly with the data science team, data engineers are quite literally providing the foundation for the next generation of analytics.

Exceptional data engineering needs to be easy to use and flexible, and it needs to perform. Take Spark, for example: a general compute engine that supports a wide range of applications, it is now widely popular among users. In addition to data processing, applications also need ways to ingest, store, and serve data, while enterprise teams need tools for operations, security and data governance. In this sense, Spark naturally complements the comprehensive and complex Hadoop ecosystem with its ease of use, flexibility and performance.

Comprehensive security: Enforcing unified access control policy

One of Hadoop’s defining characteristics is its ability to allow access to very large volumes of data in a variety of ways. Now that Hadoop has moved beyond MapReduce, the programming paradigm that allows for massive scalability through parallel processing of large data sets, complex application architectures that once required many separate systems for data preparation, staging, discovery, modeling, and operational deployment can be consolidated into a single end-to-end workflow on a common platform. This empowers users with more diverse skills to extract value from data.

Of course, this flexibility needs to be balanced with security requirements. To keep sensitive data from falling into the wrong hands, a comprehensive security approach should ensure that every access path to data respects policy in the same way, right down to the most granular level.

However, the reality today is that each access engine handles security differently. Take Impala and Hive, for example. These Hadoop modules offer row- and column-based access controls with shared policy definitions through Apache Sentry. In contrast, Spark and MapReduce support only file- or table-level controls. This fragmentation forces a reliance on the lowest common denominator (coarse-grained permissions), which often results in several undesirable outcomes: limitations on data and access, security silos or, worse, inconsistent policy due to human error in policy replication. Ultimately, the issue constrains the types of applications that can be built.
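The lowest-common-denominator effect described above can be illustrated with a small Python sketch. The granularity ranking, the engine table, and the function name are illustrative assumptions based on the paragraph above; they are not part of Apache Sentry or any Hadoop API.

```python
# Illustrative only: granularity levels ordered from coarsest to finest.
GRANULARITY = ["file", "table", "column", "row"]

# Finest granularity each engine can enforce, loosely following the text above.
# The exact split between "file" and "table" here is an assumption made for
# the sake of the example; real deployments vary by version and configuration.
ENGINE_FINEST = {
    "impala": "row",
    "hive": "row",
    "spark": "table",
    "mapreduce": "file",
}


def effective_granularity(engines):
    """Return the finest granularity that every listed engine can enforce."""
    return min((ENGINE_FINEST[e] for e in engines), key=GRANULARITY.index)
```

With only Impala and Hive in play, a policy can be enforced per row. As soon as Spark or MapReduce jobs touch the same data, the policy that holds on every path drops to table or file level.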

A common API for policy-compliant data access would allow third-party products to integrate better with the Hadoop cluster, while also providing dynamic data masking everywhere. With users empowered to securely gain value from data using their tools of choice, the focus should then shift to a more fundamental problem: how to store data for the next generation of analytics.
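As a rough sketch of what a single policy-enforcing access path with dynamic masking might look like, consider the Python below. The `Policy` class, the `read_rows` function, and the `"***"` mask value are all hypothetical names chosen for illustration; this is not the API of any Hadoop component.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical policy: columns a role may read, and which of those are masked."""
    readable: set
    masked: set = field(default_factory=set)


def read_rows(rows, role, policies):
    """Single access path: every engine would read through this function."""
    policy = policies.get(role)
    if policy is None:
        raise PermissionError(f"role {role!r} has no policy")
    result = []
    for row in rows:
        visible = {}
        for column, value in row.items():
            if column not in policy.readable:
                continue  # column is hidden from this role entirely
            # Masking is applied at read time; the stored value is unchanged.
            visible[column] = "***" if column in policy.masked else value
        result.append(visible)
    return result
```

Because the policy is defined once and applied on the one path all readers share, there is no per-engine copy of the rules to drift out of sync.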

Fast analytics on fast data

The next generation of applications built on Hadoop is collapsing the distance between data collection, insight, and action; in other words, these applications are becoming more real-time. In the best case, analytical models are embedded right in the operational application, directly influencing business outcomes as users interact with them. Even in a simpler case, an operational dashboard requires the ability to integrate data and immediately analyze it.

It turns out that this is pretty difficult to achieve with Hadoop today, largely because of storage constraints concerning updates. At an early stage, users already face a dilemma: Do I pick HDFS, which offers high-throughput reads that are great for analytics but no capability to update files, or Apache HBase, which offers low-latency updates that are great for operational applications but performs poorly for analytics?

Often, the result is a complicated hybrid of the two, with HBase for updates and periodic syncs to HDFS for analytics. This is arduous for a few reasons:

  • Data pipelines need to be maintained to move data and ensure synchronization between storage systems
  • The same data is being stored multiple times, which increases the total cost of ownership
  • There is latency between when data arrives and when it can be analyzed
  • Data that is written to HDFS will need to be rewritten if it needs to be corrected (remember, no updates)
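The cost of the "no updates" constraint can be made concrete with a toy Python model. The two classes below are simplified stand-ins written for this illustration (they do not use HDFS or HBase client libraries), and counting records written is only a rough proxy for real I/O cost.

```python
class ImmutableFileStore:
    """Toy stand-in for write-once files: a correction rewrites the whole file."""

    def __init__(self, records):
        self.records = list(records)
        self.records_written = len(self.records)

    def correct(self, key, new_value):
        # No in-place update is possible, so build a full replacement copy.
        self.records = [(k, new_value if k == key else v) for k, v in self.records]
        self.records_written += len(self.records)


class KeyedStore:
    """Toy stand-in for a keyed store that supports in-place updates."""

    def __init__(self, records):
        self.records = dict(records)
        self.records_written = len(self.records)

    def correct(self, key, new_value):
        # Only the affected record is written.
        self.records[key] = new_value
        self.records_written += 1
```

Correcting one record in a 1,000-record data set costs 1,000 additional record writes in the first model and one in the second, which is the trade-off behind the hybrid architecture described above.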

Looking ahead

Hadoop has come a long way in its first 10 years. As Matt Aslett of 451 Research recently summarized, “Hadoop has evolved from a batch processing engine to encompass a set of replaceable components in a wider distributed data-processing ecosystem that includes in-memory processing and high-performance SQL analytics.”

Naturally, Hadoop’s storage options are also evolving, and this is just the beginning. With Spark as the new data processing architecture, a new unified security layer, and a new storage engine for simplified real-time analytic applications, Hadoop is ready for its next phase: Powering the next generation of analytics.

Daniel Ng, Senior Director, APAC for Cloudera