Big Data: Why you must consider open source

A quiet revolution has been taking place in the technology world in recent years. The popularity of Open Source software has soared as more and more businesses have realized the value of moving away from the walled-in, proprietary technologies of old.

And it’s no coincidence that this transformation has taken place in parallel with the explosion of interest in Big Data and analytics. The modular, fluid and constantly evolving nature of Open Source is in sync with the needs of cutting-edge analytics projects for faster, more flexible and, vitally, more secure systems and platforms with which to implement them.

Open Source and Big Data

So what exactly is Open Source, and what is it that makes it such a good fit for Big Data projects? Well, like Big Data, Open Source is really nothing new – it’s a concept which has existed since the early days of computing. However, it’s only more recently, with the huge growth in the number of people and the amount of data online, that its full potential is starting to be explored.

The lazy description of Open Source is often that it is “free” software. Certainly that’s how you will hear the more popular Open Source consumer and business products (such as the Microsoft Office alternative LibreOffice, or the web browser Firefox) described. But there’s much more to it than that. Generally, truly Open Source products are distributed under one of many different Open Source licenses, such as the GNU General Public License or the Apache License. As well as granting the user the right to freely download and use the software, these licenses allow it to be modified and redistributed. Software developers can even take useful parts from one Open Source project to use in their own products – which, depending on the license, could either be Open Source themselves or proprietary. In general, the main stipulation is that they must acknowledge where Open Source material has been used in their own products, and include the relevant licensing documentation in their distribution; “copyleft” licenses such as the GNU General Public License additionally require that derivative works be distributed under the same license.

Advantages of Open Source

Open Source development has many advantages over its alternative – proprietary development. Because anyone can contribute to the projects, the most popular ones have huge teams of enthusiastic volunteers constantly working to refine and improve the end product.

In fact, Justin Kestelyn, senior director of technical evangelism and developer relations at leading Open Source vendor Cloudera, tells me that proprietary solutions could be on their way out entirely in some fields of information technology.

He says: “Emerging data management platforms are just never proprietary any more. Most customers would simply see them as too risky for new applications.

“There are multiple – and at this point in history, thoroughly validated – business benefits to using open source software.”

Among those benefits, he says, are the absence of license fees, which lets customers evaluate and test products and technologies at no expense; the enthusiasm of the global development community; the appeal to developers of working in an Open Source environment; and the freedom from “lock in”.

This last benefit comes with a caveat, Kestelyn explains: “Be careful, though, of Open Source software that leaves you on an architectural island, with commercial support only available from a single vendor. This can make the principle moot.”

The literal meaning of Open Source is that the raw source code behind the project is available for anyone to inspect, scrutinize and improve. This brings big security benefits – flaws which could lead to the loss of valuable or personal data are more likely to be spotted when hundreds or thousands of people are examining the code in its raw form. In contrast, in the world of proprietary development, only the handful of people whose job it is to write and then test the code will ever see the exact nature of the nuts and bolts holding it all together.