Moving beyond Hadoop for big data needs

Hadoop and MapReduce have long been mainstays of the big data movement, but some companies now need new and faster ways to extract business value from massive — and constantly growing — datasets.

While many large organizations are still turning to the open source Hadoop big data framework, Google, whose technology inspired it, and others have already moved on to newer technologies.

The Apache Hadoop platform is an open source implementation modeled on the Google File System and Google MapReduce technology, which the search engine giant developed to manage and process huge volumes of data on commodity hardware.

That technology has been a core part of the processing Google uses to crawl and index the Web.

Hundreds of enterprises have adopted Hadoop over the past three or so years to manage fast-growing volumes of structured, semi-structured and unstructured data.

The open source technology has proved to be a cheaper option than traditional enterprise data warehousing technologies for uses such as log and event data analysis, security event management, social media analytics and other applications involving petabyte-scale data sets.

Analysts note that some enterprises have started looking beyond Hadoop not because of limitations in the technology, but because of the purposes it was designed for.

Hadoop is built for handling batch-processing jobs where data is collected and processed in batches. Data in a Hadoop environment is broken up and stored in a cluster of highly distributed commodity servers or nodes.

In order to get a report from the data, users have to first write a job, submit it and wait for it to get distributed to all of the nodes and get processed.
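The batch flow described above can be sketched in a few lines of Python. This is a minimal, single-process illustration of the map, shuffle and reduce steps, not Hadoop's actual API; the sample log lines and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are hypothetical and chosen only for this example. In a real cluster, each phase would run in parallel across many nodes.

```python
from collections import defaultdict

# Hypothetical sample input: each string stands in for one log line.
LOG_LINES = [
    "2012-01-01 ERROR disk full",
    "2012-01-01 INFO job started",
    "2012-01-02 ERROR timeout",
    "2012-01-02 INFO job finished",
]


def map_phase(line):
    """Emit a (severity, 1) pair for one log line."""
    _date, severity, *_rest = line.split()
    yield severity, 1


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch: nothing is returned until every line is processed."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(LOG_LINES)
print(counts)
```

The point of the sketch is the shape of the work: the report only exists once the entire job has finished, which is why this model suits scheduled batch runs better than interactive questions.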

While the Hadoop platform performs well, it’s not fast enough for some key applications, said Curt Monash, a database and analytics expert and principal at Monash Research. For instance, Hadoop does not fare well in running interactive, ad hoc queries against large datasets, he said.

“What Hadoop has trouble with is interactive responses,” Monash said. “If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies.”

Companies needing such capabilities are already looking beyond Hadoop for their big data analytics needs.

Google, in fact, started using an internally developed technology called Dremel some five years ago to interactively analyze or “query” massive amounts of log data generated by its thousands of servers around the world.

Google says the Dremel technology supports “interactive analysis of very large datasets over shared clusters of commodity machines.”

The technology can run queries over trillion-row data tables in seconds and scales to thousands of CPUs and petabytes of data. Its SQL-like query language makes it easy for users to interact with data and to formulate ad hoc queries, Google says.

Though conventional relational database management technologies have supported interactive querying for years, Dremel offers far greater scalability and speed, contends Google.

Thousands of users across Google use Dremel for a variety of applications, such as analyzing crawled web documents, tracking installation data for Android applications, reporting crashes and maintaining disk I/O statistics for hundreds of thousands of disks.

Dremel, though, isn’t a replacement for MapReduce and Hadoop, said Ju-kay Kwek, product manager of BigQuery, Google’s recently launched hosted big data analytics service based on Dremel.

Google uses Dremel in conjunction with MapReduce, he said. Hadoop MapReduce is used to prepare, clean, transform and stage massive amounts of server log data, and then Dremel is used to analyze the data.
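The division of labor Kwek describes, a batch step that cleans and stages data followed by interactive queries over the result, can be illustrated with a small Python sketch. This is only an analogy: Python's built-in `sqlite3` module stands in for the query engine and is not Dremel or BigQuery, and the log format, table name and function names (`prepare`, `stage`) are assumptions made up for the example.

```python
import sqlite3

# Hypothetical raw log lines in "region,status,latency_ms" form.
RAW_LOGS = [
    "us-east,200,120",
    "us-east,500,950",
    "malformed line",
    "eu-west,200,80",
    "eu-west,200,100",
]


def prepare(raw_lines):
    """Batch step: drop malformed lines and convert fields to typed rows."""
    rows = []
    for line in raw_lines:
        parts = line.split(",")
        if len(parts) != 3:
            continue  # skip lines that do not match the expected format
        region, status, latency_ms = parts
        rows.append((region, int(status), int(latency_ms)))
    return rows


def stage(rows):
    """Load the cleaned rows into a table that can be queried ad hoc."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE requests (region TEXT, status INTEGER, latency_ms INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?)", rows)
    return conn


conn = stage(prepare(RAW_LOGS))

# Interactive step: an ad hoc question asked of the staged data.
avg_latency = conn.execute(
    "SELECT region, AVG(latency_ms) FROM requests "
    "WHERE status = 200 GROUP BY region ORDER BY region"
).fetchall()
print(avg_latency)
```

Once the data is staged, a different question only requires a different query string, not a new batch job.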