OSCON 2011 Session: Hadoop – Enterprise Data Warehouse Integration

Last Monday, July 25th, I gave a presentation at OSCON about the integration between Hadoop and the Enterprise Data Warehouse (EDW).

The session was well attended, and the dialog and exchange of opinions were very good. The conversations continued after the session, which, to me, shows that Big Data management is top of mind for many practitioners.

I very much enjoyed the dialog and also learned a few things! Thanks, O’Reilly, for another successful OSCON!

The point I tried to get across is that there are multiple facets to the integration between Hadoop and the EDW. Two main ones stand out:

  • System-level Design
  • Data Lifecycle Management

The System-level Design describes how the two stacks (Hadoop and EDW) integrate, the use cases, the external interfaces, the overall performance and the data analytics capabilities that the integrated system needs to deliver.

Data Lifecycle Management consists of all the processes and procedures for managing the flow of data, from ingestion to processing to archiving and deletion.

Let’s start with the System-level Design. Conversations we had with a number of customers and vendors indicate that the most common use case for integration between Hadoop and the Data Warehouse looks something like this.

Customers use Hadoop as a staging area for mostly unstructured data, although sometimes the data can be semi-structured or fully structured. They collect, cleanse, and aggregate the data in Hadoop and then export a subset or an aggregated version to the Data Warehouse for advanced analytics. Sometimes they push the results of the analytics back to Hadoop for archiving and re-run purposes (e.g. what-if scenarios).

This ETL (Extract, Transform, Load) flow is mostly operational: data is collected in Hadoop, transformed and staged, and then shipped off to the data warehouse, where it is processed for daily sales reports, dashboards, etc. In this scenario, only a subset of the initial data is analyzed, because to generate my daily sales report, for example, all I need is the sales data from the previous day. That’s why the data flow (1)->(2)->(5) is mostly managed by the DW/BI Operations folks.
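To make the staging-and-aggregation step a bit more concrete, here is a minimal MapReduce sketch (mine, not something from a specific customer) that rolls up raw sales records into per-region totals for one day before the much smaller result gets exported to the warehouse. It assumes the raw data lands in HDFS as CSV lines of the form date,region,product,amount; the field layout and class names are purely illustrative.

```java
// Minimal sketch: aggregate raw sales lines (date,region,product,amount) in Hadoop
// before exporting the much smaller result set to the warehouse.
// File layout and field names are illustrative assumptions.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailySalesRollup {

  // Emits (region, amount) for every record that matches the requested day.
  public static class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split(",");
      if (f.length == 4 && f[0].equals(ctx.getConfiguration().get("sales.day"))) {
        ctx.write(new Text(f[1]), new DoubleWritable(Double.parseDouble(f[3])));
      }
    }
  }

  // Sums the amounts per region; this small result set is what gets shipped to the EDW.
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text region, Iterable<DoubleWritable> amounts, Context ctx)
        throws IOException, InterruptedException {
      double total = 0;
      for (DoubleWritable a : amounts) total += a.get();
      ctx.write(region, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("sales.day", args[2]);               // e.g. "2011-07-24"
    Job job = new Job(conf, "daily-sales-rollup");
    job.setJarByClass(DailySalesRollup.class);
    job.setMapperClass(SalesMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // raw sales logs in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // aggregated output, ready for export
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```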

For the next question, e.g. who has a pool, buys school supplies, and wears a green t-shirt, I’d need access to the actual data, not just a subset of it. That’s what the arrow marked (3) signifies: in order to write a new analytics function, the developer needs access to the raw data, not to a subset that was already massaged to fit a particular analysis algorithm. Eventually the analyst develops the new analysis routine and hands it to the Ops team to deploy and run in Production.

The difference between (3) and (5) is that (3) is the access needed to develop new analytics models whereas (5) is the gateway for running Production models. In most cases the development team owns (3), and Operations owns (5).

Let’s look at an example. The next figure depicts how our colleagues in Marketing may want to use data from both internal and external sources to analyze the performance of a particular product in the marketplace.

Let’s say I’m a Product Manager for a laptop product line and want to understand why one of the laptop models isn’t selling as well as expected.

To make it work, I’d need to pull data in from:

  • Engineering, i.e. product tech sheets, test results, manufacturing reports, etc.
  • Sales, i.e. monthly sales volume by region, customer pipeline, etc.
  • Product Support, i.e. quality reports, customer call records, part dispatch reports, etc.

and stitch it together with customer sentiment data from social networks. I use Hadoop as the data aggregation / data integration platform: I load all this data into Hadoop, use Hadoop to process and structure it in a way that makes it suitable for analysis, and then use something like the Dell Aster Data solution or other business analysis tools to turn the data into actionable insights. This is the real power of Hadoop: I can pull data from a variety of different sources, pre-process that data in Hadoop, and make it available for advanced analytics.
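Here is a rough sketch of what that “stitch it together” step could look like as a reduce-side join in Hadoop, using MultipleInputs so that each source gets its own thin mapper and the reducer emits one denormalized record per product model. The paths, record layouts (tab-separated, product model in the first field), and class names are assumptions I’m making for the example, not part of any particular product.

```java
// Sketch of the "stitch it together" step: a reduce-side join keyed on product model,
// pulling support records and social-sentiment extracts from separate HDFS paths.
// Paths, field positions, and class names are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProductViewJoin {

  // Each source gets a thin mapper that extracts the product model as key
  // and tags the record with its origin so the reducer knows where it came from.
  public static class SupportMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t");
      ctx.write(new Text(f[0]), new Text("SUPPORT\t" + v));
    }
  }

  public static class SentimentMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t");
      ctx.write(new Text(f[0]), new Text("SOCIAL\t" + v));
    }
  }

  // Groups everything known about one product model into a single denormalized line
  // that downstream analytics tools can consume.
  public static class StitchReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text model, Iterable<Text> records, Context ctx)
        throws IOException, InterruptedException {
      StringBuilder joined = new StringBuilder();
      for (Text r : records) joined.append(r.toString()).append(" | ");
      ctx.write(model, new Text(joined.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "product-view-join");
    job.setJarByClass(ProductViewJoin.class);
    // Additional sources (engineering tech sheets, sales volumes) would be wired the same way.
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SupportMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, SentimentMapper.class);
    job.setReducerClass(StitchReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```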

To recap, we started with the integration between Hadoop and the Data Warehouse, and we saw that (1) the integration consists of multiple data processing streams, and (2) Development and Operations need different levels of data access. This system-level perspective has a gap, though. The architecture shows us how the systems come together and how different organizations interface with the system, but it does very little when it comes to the actual evolution (not flow!) of the data. Without a clear understanding of how data evolves, it is difficult to define the right processing flow and the correct checkpoints. We call that aspect the Data Lifecycle.

The Data Lifecycle begins with the data being generated. As we already know, a variety of sources generate data, and the format of that data varies broadly. The data is then collected. Generate and Collect are two separate phases because not all the data that is generated should, or can, be collected.

Once the data is collected and loaded into the system, it goes through several intermediary steps before it is stored and/or made available for analysis. One example of an intermediary step is Data Aggregation – logs from all my machines in the data center need to be correlated. Another example is Data QA – I need to check whether there are any gaps in the logs or whether one of the machines failed to send its log. In case of errors, I may need to ask the source to re-generate the data.
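As a tiny illustration of the Data QA step, the sketch below uses the HDFS FileSystem API to compare the list of hosts I expect logs from against what actually landed in HDFS for a given day, so the missing sources can be asked to re-send. The directory layout (/logs/&lt;day&gt;/&lt;host&gt;.log) and the host list passed on the command line are assumptions for the example.

```java
// Tiny Data QA sketch: compare the hosts we expected logs from against what actually
// landed in HDFS for a given day, so missing sources can be asked to re-send.
// The /logs/<day>/<host>.log layout and the host list are assumptions.
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogGapCheck {
  public static void main(String[] args) throws Exception {
    String day = args[0];                                          // e.g. "2011-07-24"
    Set<String> expected = new HashSet<String>();
    for (int i = 1; i < args.length; i++) expected.add(args[i]);   // expected host names

    FileSystem fs = FileSystem.get(new Configuration());
    for (FileStatus f : fs.listStatus(new Path("/logs/" + day))) {
      String host = f.getPath().getName().replace(".log", "");
      expected.remove(host);                                       // this host did deliver its log
    }
    for (String missing : expected) {
      System.out.println("Missing log for host: " + missing + " (request re-generation)");
    }
  }
}
```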

The next phase is processing that data, which means turning it into actionable information that is then presented. The last phase of the data lifecycle is consuming the insights, for example triggering a new power management policy inside the data center.

The dotted lines that you see above indicate live feedback loops between the various phases. For example, the way the data is processed may require adjustments to the way the data is being collected.

The next question to answer is how we separate the functions and procedures associated with each phase between Hadoop and the EDW. I’d venture to say that the first two phases of the lifecycle can be categorized as Data Cleansing & Aggregation, and the next two phases can be thought of as Data Analysis.

The question remains: how do we separate the functionality between Hadoop and the EDW, and how do we integrate the two?

The next figure shows how! Hadoop is mainly responsible for Data Cleansing and Aggregation, and the Data Warehouse does most of the Data Analytics.

There is a nice overlap between the two systems, and we call that the Integration Domain. That’s your integration point between the two systems! The Integration Domain consists not only of data management and processing functions but also of the IT infrastructure between the two systems: the network implementation, the machines that act as gateways between the two systems, and so on. Any limitations introduced by the Integration Domain will negatively impact the overall performance of the system and eventually lead to a poor end-user experience.

Let’s talk a bit about the design of the connector between Hadoop and EDW.

There are two ways to implement the connector:

  1. using one of the utilities offered by the Hadoop ecosystem (Sqoop, Hive, etc.)
  2. using the low-level MapReduce Java API

Leveraging the ecosystem utilities will get you going in no time; however, the performance is average. The best possible performance can be achieved by interfacing with the Java API, but that is going to require advanced programming skills and a specific network design.
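To give a feel for option 2, here is a rough sketch of a map-only job that uses Hadoop’s DBOutputFormat to insert aggregated results straight into a JDBC-reachable warehouse table. The table, columns, JDBC driver, and connection details are placeholders; a real connector would also need the JDBC driver on the task classpath, a gateway network path to the warehouse, and a careful limit on the number of concurrent tasks so the warehouse isn’t flooded with connections. Sqoop packages essentially this kind of job behind a command-line tool, which is why option 1 gets you going so much faster.

```java
// Sketch of a low-level Java API connector: a map-only job that reads aggregated
// "region<TAB>total" lines from HDFS and inserts them into a warehouse table via JDBC.
// Table name, columns, driver, and credentials are placeholders.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WarehouseExport {

  // One row of the (hypothetical) daily_sales table; DBOutputFormat calls
  // write(PreparedStatement) to bind each INSERT.
  public static class SalesRow implements Writable, DBWritable {
    String region;
    double total;

    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, region);
      stmt.setDouble(2, total);
    }
    public void readFields(ResultSet rs) throws SQLException {
      region = rs.getString(1);
      total = rs.getDouble(2);
    }
    public void write(DataOutput out) throws IOException {
      out.writeUTF(region);
      out.writeDouble(total);
    }
    public void readFields(DataInput in) throws IOException {
      region = in.readUTF();
      total = in.readDouble();
    }
  }

  // Reads aggregated "region<TAB>total" lines (e.g. output of the rollup job) and
  // emits one row per line; with zero reducers, each row goes straight to the database.
  public static class ExportMapper extends Mapper<LongWritable, Text, SalesRow, NullWritable> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t");
      SalesRow row = new SalesRow();
      row.region = f[0];
      row.total = Double.parseDouble(f[1]);
      ctx.write(row, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JDBC driver, connection URL, and credentials for the warehouse (all placeholders).
    DBConfiguration.configureDB(conf, "org.postgresql.Driver",
        "jdbc:postgresql://edw-gateway/warehouse", "etl_user", "secret");

    Job job = new Job(conf, "warehouse-export");
    job.setJarByClass(WarehouseExport.class);
    job.setMapperClass(ExportMapper.class);
    job.setNumReduceTasks(0);                      // map-only: each mapper inserts its rows
    job.setOutputKeyClass(SalesRow.class);
    job.setOutputValueClass(NullWritable.class);
    DBOutputFormat.setOutput(job, "daily_sales", "region", "total");
    FileInputFormat.addInputPath(job, new Path(args[0]));  // aggregated data staged in Hadoop
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```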

In conclusion:

  • Hadoop/EDW integration is no small task
  • Good design decisions can lead to real benefits
    • Use Hadoop for: cost, quick load time, scalability
    • Use EDW for: SQL, query execution performance (especially ad-hoc or interactive queries)
  • The integration presents different challenges for developers and Operations.

Comments/concerns/suggestions are always welcome. Thanks for your attention!

For more details on the integration, please check out three of my YouTube videos: