Last Monday, July 25th, I gave a presentation at OSCON about the integration between Hadoop and the Enterprise Data Warehouse (EDW).
The session was well attended, and I thought the dialog and the exchange of opinions and ideas were very good. A few of the conversations continued after the session, which (once again!) indicates that managing Big Data is top of mind for many practitioners.
I personally enjoyed the dialog and I also learned quite a few things! Thanks O’Reilly for another successful OSCON!
The point I tried to make in my presentation is that there are multiple aspects of the integration between Hadoop and the EDW that system designers and system integrators need to consider. Two tenets that seem to prevail are:
- System-level Design
- Data Lifecycle Management
System-level Design describes how the two solutions, Hadoop and the EDW, integrate: the use cases, the external interfaces the integrated system needs to provide, and the performance and data analysis capabilities the system needs to deliver.
Data Lifecycle Management encompasses all the processes and procedures for managing the flow of data throughout its lifecycle, from creation to the time it is archived and/or deleted.
Let’s start with System-level Design. Conversations we had with a large number of customers and vendors indicate that the most common use case of integration between Hadoop and the Data Warehouse looks something like this.
Customers use Hadoop as a staging area, mostly for unstructured data, although sometimes the data can be semi-structured or fully structured. They collect, cleanse, and aggregate the data in Hadoop and then export either a subset or an aggregated version of it to the Data Warehouse for advanced analytics. Sometimes they push the results of the analytics back into Hadoop for archiving and re-run purposes.
This ETL (Extract, Transform, Load) flow is mostly operational: data is collected in Hadoop, transformed and staged, and then shipped off to the data warehouse, where it is processed for the daily sales reports, dashboards, etc. In this scenario only a subset of the initial data is analyzed, because to generate my daily sales report, for example, all I need is the previous day's sales data. That's why the data flow (1)->(2)->(5) is mostly managed by the DW/BI Operations folks.
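To make the flow concrete, here is a minimal sketch of the cleanse -> aggregate -> export-subset pipeline described above, in plain Python. The record fields, amounts, and dates are hypothetical, and a real deployment would of course run the first two steps as Hadoop jobs and the export through a connector, not in-process like this.

```python
from collections import defaultdict

# Hypothetical raw sales events as they might land in Hadoop (step 1).
raw_events = [
    {"day": "2011-07-24", "region": "EMEA", "amount": 120.0},
    {"day": "2011-07-24", "region": "AMER", "amount": 80.0},
    {"day": "2011-07-23", "region": "EMEA", "amount": 55.0},
]

def cleanse(events):
    """Drop malformed records: the cleansing part of the Hadoop stage."""
    return [e for e in events if "day" in e and e.get("amount", 0) > 0]

def aggregate(events):
    """Roll events up to (day, region) totals, the shape the warehouse expects."""
    totals = defaultdict(float)
    for e in events:
        totals[(e["day"], e["region"])] += e["amount"]
    return dict(totals)

def export_subset(totals, day):
    """Ship only the requested day's aggregates to the warehouse (flow (5))."""
    return {k: v for k, v in totals.items() if k[0] == day}

# Yesterday's report only needs yesterday's slice of the aggregates.
daily = export_subset(aggregate(cleanse(raw_events)), "2011-07-24")
```

Note how the warehouse only ever sees the aggregated subset; the raw events stay behind in Hadoop, which is exactly the distinction the next section draws between access paths (3) and (5).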
To answer the next question, for example who has a pool, needs to buy school supplies, and wears a green t-shirt, I'd need access to the actual data, not only to a subset of it. That's what the arrow marked (3) signifies: in order to write a new analytics function, the developer needs access to the raw data, not to a subset that was already massaged to fit a particular analysis algorithm. Eventually the developer finishes the new analysis routine and hands it to the Ops folks to run in production.
The difference between (3) and (5) is that (3) is the access needed to develop new analytics models, whereas (5) is the gateway for executing established models. In most cases the development team owns (3) and Operations owns (5), although I've seen cases where there is no clear distinction in ownership.
Let’s see an example. The next figure depicts how our colleagues in Marketing may want to use data from internal or external sources to analyze the performance of a particular product in the market.
Let’s say I’m a Product Manager for a laptop product line and want to see why one of the laptop models doesn’t sell very well.
I’d need to pull data in from:
- Engineering, i.e. product tech sheets, test results, manufacturing reports, etc.
- Sales, i.e. monthly sales volume by region, customer pipeline, etc.
- Product Support, i.e. quality reports, customer call records, part dispatch reports, etc.
and stitch it together with customer sentiment data from social networks. I use Hadoop as a data aggregation / data integration platform: I load all this data into Hadoop, use Hadoop to process and structure it in a way that makes it suitable for analysis, and then use something like a Dell Aster Data solution or other business analysis tools to turn the data into actionable information. This is the real power of Hadoop: I can pull data in from a variety of different sources, pre-process that data, and then make it available for advanced analytics.
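The "stitching" step above boils down to joining per-model records from each source into one analysis-ready row. Here is a toy sketch of that idea; the model name, field names, and values are all made up for illustration, and in practice this join would run as a Hadoop job over much larger extracts.

```python
# Hypothetical per-source extracts, each keyed by laptop model.
engineering = {"model-A": {"battery_test": "pass"}}
sales       = {"model-A": {"units_q2": 1200}}
support     = {"model-A": {"return_rate": 0.08}}
sentiment   = {"model-A": {"social_score": -0.4}}  # from social networks

def stitch(model, *sources):
    """Merge one model's records from every source into a single row."""
    row = {"model": model}
    for src in sources:
        row.update(src.get(model, {}))
    return row

row = stitch("model-A", engineering, sales, support, sentiment)
```

The resulting row is what gets handed off to the downstream analytics tools; a negative social score next to a high return rate, for instance, is the kind of correlation the Product Manager is after.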
To recap, we started with the integration between Hadoop and the Data Warehouse and we saw that (1) the integration consists of multiple data processing streams and (2) development and operations need different data access levels. This system-level perspective of the integration, however, has a gap. It tells us how the system comes together and shows us how different organizations interface with the system, but it does very little when it comes to the actual evolution of the data through the system. Without a clear understanding of how data evolves through the system, it is difficult to define the right processing functions and the correct checkpoints.
The Data Lifecycle begins with the data being generated. As we already know, there is a variety of sources that generate data, and the format of that data varies broadly. Data is then collected. The reason Generate and Collect are two separate phases is that not all the data that is generated should, or can, be collected.
Once the data is collected and loaded into the system, there are several intermediary steps that the data goes through before it is stored and/or made available for analysis. One example of an intermediary step is Data Aggregation: logs from all my machines in the data center need to be correlated. Another example is Data QA: I need to check whether there are any gaps in the logs or whether one of the machines failed to send its log. In case of errors, I may need to ask the source to re-generate the data.
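The Data QA check from the data-center example can be sketched in a few lines: compare the hosts that actually delivered logs in a collection window against the hosts we expect, and flag the gaps. Host names and the batch format here are hypothetical.

```python
# Hosts we expect a log from in every collection window (hypothetical).
expected_hosts = {"node01", "node02", "node03"}

def qa_missing_logs(log_batches):
    """Return the hosts whose logs never arrived in this window."""
    seen = {batch["host"] for batch in log_batches}
    return expected_hosts - seen

# node02's log is absent, so QA flags it for re-generation at the source.
missing = qa_missing_logs([{"host": "node01"}, {"host": "node03"}])
```

Any host that lands in `missing` is exactly the "ask the source to re-generate the data" case described above.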
The next phase is processing that data, which means turning the data into actionable information that is then presented. The last phase of the data lifecycle is consumption of the insights, for example triggering a new power management policy inside the data center.
The dotted lines that you see above indicate live feedback loops between the various phases. For example, the way the data is processed may require adjustments in the way data is being collected or stored.
The next question to answer is how we separate the functions and procedures associated with each phase between Hadoop and the EDW. I'd venture to say that the first two phases of the lifecycle can be categorized as Data Cleansing & Aggregation, and the next two phases can be thought of as Data Analysis.
The question remains: how do we separate the functionality between Hadoop and the EDW, and how do we integrate the two?
Next figure shows how! Hadoop is mainly responsible for Data Cleansing and Aggregation and the Data Warehouse does most of the Data Analytics.
There is a nice overlap between the two systems, and we call that the Integration Domain. That's your integration point between the two systems! The Integration Domain consists not only of data management and processing functions but also of the IT infrastructure between the two systems, for example the network implementation, the machines that act as gateways between the two systems, and so on. Any limitations introduced by the Integration Domain will negatively impact the overall performance of the system and eventually result in a poor end user experience.
For more details on the integration, please check out three of my YouTube videos:
- Aggregating Data for Advanced Analytics
- How Hadoop works with Data Warehouse
- Example use case: Hadoop + EDW web page analytics
Let’s talk a bit about the design of the connector between Hadoop and EDW.
There are two ways to implement the connector:
- using one of the utilities offered by the Hadoop ecosystem (Sqoop, Hive, etc.)
- using the low-level MapReduce Java API
Leveraging the ecosystem utilities will get you going in no time; however, the performance is average. The best possible performance can be achieved by interfacing with the Java API directly, but that requires advanced programming skills and a specific network design.
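Either way you build the connector, the core mechanic is the same: the export is cut into batches so rows can be moved (and, in a custom implementation, parallelized across gateway machines) instead of streamed one row at a time. Here is a language-neutral sketch of that batching idea in Python; the batch size and row data are arbitrary, and tools like Sqoop handle this splitting for you.

```python
def batches(rows, size):
    """Yield fixed-size batches of rows, the way a connector cuts up an export."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# Ten hypothetical rows exported in batches of four; the last batch is smaller.
rows = list(range(10))
chunks = list(batches(rows, 4))
```

A hand-rolled connector wins on performance largely because it controls this partitioning (and how many batches move in parallel), which is also why it demands the network design mentioned above.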
To wrap up, the key takeaways:
- Hadoop/EDW integration is no small task
- Good design decisions can lead to real benefits
- Use Hadoop for: cost, quick load time, scalability
- Use EDW for: SQL, query execution performance (especially ad-hoc or interactive queries)
- The integration presents different challenges for developers and Operations.
Comments/concerns/suggestions are always welcome. Thanks for your attention!