Social data warehousing is worth the trouble

Classic data warehousing collected enormous amounts of relational data from sources across the enterprise and then correlated it to create more meaning than could be seen in any one system.

Most of the data was purely relational, and most of the inferences were pretty straightforward, even if the joins were tricky.

But when you’re looking at social marketing, sales 2.0 and social CRM, you need to pay a lot more attention to time-series data and the interactions across social networks. These can all mean a ton of data.

First, with behavioral scoring, a marketing automation system needs to track not only every email you’ve sent but also every response, including all the pages a user visits, all the cookies that have been dropped, every phone call and the click path that led to a purchase. The system needs to track almost the same amount of data for anonymous visitors as it does for leads. Even small companies may be recording millions of data points a month.
Second, with social networking, it’s not enough to know which social networks someone belongs to. The high ground is creating graphs of the social network based on patterns of emails, phone records and social postings to help you identify the mavens or connectors who have the most influence on the community.
Third, instant messaging and other social feeds can help you track audience sentiment and support lexical analysis. These feeds are the height of unstructured data, though, particularly if you include attached files. Even so, those files can be important to record if you are interested in analyzing brand mentions or logo appearances in stills and videos.
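
To make the second point concrete, here is a minimal sketch of building a contact graph from interaction records and ranking people by how many distinct contacts they have. The names, the sample data and the helper functions (build_graph, rank_connectors) are all invented for illustration, and counting distinct contacts is only the simplest possible stand-in for "influence"; a real analysis would likely weight ties by frequency and recency.

```python
from collections import defaultdict

# Hypothetical interaction log: (sender, recipient) pairs drawn from
# emails, phone records or social replies. All names are made up.
interactions = [
    ("ana", "bob"), ("ana", "cho"), ("ana", "dev"),
    ("bob", "cho"), ("dev", "eli"), ("eli", "ana"),
    ("ana", "bob"),  # a repeated contact still counts as one tie
]

def build_graph(pairs):
    """Return an undirected adjacency map: person -> set of contacts."""
    graph = defaultdict(set)
    for a, b in pairs:
        if a != b:  # ignore self-contacts
            graph[a].add(b)
            graph[b].add(a)
    return graph

def rank_connectors(graph):
    """Rank people by number of distinct contacts, highest first.

    Ties are broken alphabetically so the ordering is stable.
    """
    return sorted(graph, key=lambda person: (-len(graph[person]), person))

graph = build_graph(interactions)
ranking = rank_connectors(graph)
```

In this toy data, "ana" comes out on top because she is tied to four other people while everyone else has two contacts.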

Social Data Challenges Reflect Quantity and Quality

It’s not just that each of the feeds described above is big. It’s the need to maintain time sequencing and correlate events across several media. That leads to dreaded combinatorial explosions.
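
One way to keep time sequencing manageable is to merge the feeds into a single ordered timeline and correlate events in one pass, rather than comparing every event against every other. The sketch below assumes each feed is already sorted by timestamp. The feed contents, the tuple layout and the follow_ups helper are hypothetical.

```python
import heapq
from datetime import datetime, timedelta

# Hypothetical feeds, each already sorted by timestamp.
# Each event is (timestamp, channel, person).
emails = [
    (datetime(2024, 1, 5, 9, 0), "email", "ana"),
    (datetime(2024, 1, 5, 13, 0), "email", "bob"),
]
web = [
    (datetime(2024, 1, 5, 9, 20), "web", "ana"),
    (datetime(2024, 1, 5, 18, 0), "web", "bob"),
]
posts = [
    (datetime(2024, 1, 5, 9, 45), "social", "ana"),
]

def merged_timeline(*feeds):
    """Lazily merge sorted feeds into one stream ordered by timestamp."""
    return heapq.merge(*feeds, key=lambda event: event[0])

def follow_ups(timeline, first, then, window):
    """Find cases where a `then` event follows the same person's most
    recent `first` event within `window`."""
    last_seen = {}  # person -> timestamp of their latest `first` event
    matches = []
    for ts, channel, person in timeline:
        if channel == first:
            last_seen[person] = ts
        elif channel == then and person in last_seen:
            if ts - last_seen[person] <= window:
                matches.append((person, last_seen[person], ts))
    return matches

timeline = list(merged_timeline(emails, web, posts))
hits = follow_ups(timeline, "email", "web", timedelta(hours=1))
```

Here only "ana" qualifies: her web visit comes 20 minutes after the email, while "bob" visits five hours later. The single pass keeps the work proportional to the number of events, but every extra channel or pattern you want to correlate still adds another pass or more state to carry.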
The obvious answer is to do most of your analysis on extracts or tallies, rather than on the underlying record-level details. That works as long as you’re doing fairly stable analyses, where you can pre-determine most of the queries and all of the extracts. Soon enough, though, somebody will have a follow-up question that requires examining the detailed data, so you’ll need to have tools that can drill down below the extracted summaries.
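
The split between extracts and detail can be sketched in a few lines. The example below tallies record-level events by day and channel (the extract), then drills back into the records behind a single tally. The event layout and the function names are assumptions made for illustration, not any particular product's design.

```python
from collections import Counter
from datetime import date

# Hypothetical record-level events: (day, channel, person).
events = [
    (date(2024, 1, 5), "email", "ana"),
    (date(2024, 1, 5), "web", "ana"),
    (date(2024, 1, 5), "web", "bob"),
    (date(2024, 1, 6), "web", "ana"),
    (date(2024, 1, 6), "social", "cho"),
]

def build_extract(records):
    """Summarize detail records into a tally per (day, channel)."""
    return Counter((day, channel) for day, channel, _person in records)

def drill_down(records, day, channel):
    """Return the detail records behind one cell of the extract."""
    return [r for r in records if r[0] == day and r[1] == channel]

extract = build_extract(events)
detail = drill_down(events, date(2024, 1, 5), "web")
```

The extract is small enough to live anywhere, but drill_down only works if the detail records are still within reach, which is exactly the storage question the follow-up queries raise.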
The economics of the cloud, and the speed of deployment there, can make for compelling advantages. There are now a number of solid BI tools available only in the cloud, and users of cloud-based operational systems are increasingly using SaaS for their data warehouses.

With social data, though, it’s only the extracts that can be practically handled in a pure cloud warehouse.
Work on the underlying details (ad hoc queries, hypothesis testing and extract formulation) will almost certainly have to be done with on-premises databases. Fortunately, disk and memory prices continue to fall while capacity expands. (The laptop I’m writing this article on, for example, has more than a terabyte of internal disk space.)
The real cost of the on-premises warehouse, though, will be the software and the data analyst. While there is good news in terms of analytical power, neither the people nor the software is likely to come down in price any time soon.