People use Hadoop to store, process, and analyze ever-increasing volumes of data. The question is no longer whether Hadoop can scale to meet demand; it is at what point the operational cost of scaling Hadoop exceeds the value realized from the data analysis. How much of the Hadoop scale-out story is reality vs. hype?
Read on to separate myth from reality.
Hadoop can scale out organically — myth or reality?
Reality! Hadoop divides an application into small chunks of work, each of which is assigned to an individual compute node to run simultaneously. More compute nodes mean more “power” for the application: throughput scales roughly linearly with the resource capacity of the Hadoop cluster. Hadoop can start very small and grow very big simply by adding resources. It can scale out efficiently; the question is, how far?
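The divide-and-conquer idea is easy to illustrate outside of Hadoop itself. The sketch below is a toy analogy in plain Python, not Hadoop code: a word-count job is split into chunks, each chunk is handed to a separate worker process (standing in for a compute node), and the partial results are merged. The chunk contents and function names are illustrative.

```python
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words within one chunk of the input."""
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge_counts(partials):
    """Reduce step: combine the per-chunk word counts."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

if __name__ == "__main__":
    # A toy "dataset" pre-split into chunks of work, one per
    # worker -- the same pattern Hadoop applies across nodes.
    chunks = ["big data big", "data big value", "value data"]
    with Pool(processes=3) as pool:
        partials = pool.map(count_words, chunks)
    print(merge_counts(partials))  # {'big': 3, 'data': 3, 'value': 2}
```

Adding workers (nodes) lets more chunks run at once, which is where the near-linear scaling comes from, until coordination overhead catches up, as the next section discusses.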
Hadoop can scale out indefinitely — myth or reality?
Myth! Despite all the hype, customers need to realize that there are boundaries. Dividing the application into tasks makes Hadoop very scalable, but past a certain point that division itself creates bottlenecks. For example, Yahoo has over 40,000 machines running Hadoop, yet its largest cluster is only 4,500 nodes. At present, the upper limit appears to be around 4,000-5,000 nodes in a single Hadoop cluster. Scaling Hadoop within those boundaries is not a trivial exercise, which leads some customers to avoid the scalability conversation altogether. Is that a good idea?
Scalability is important only when there is lots of data — myth or reality?
Myth! Hadoop delivers tangible value only when its resources can keep up with the workload. Waiting to start scaling until the volume of data gets out of control forces customers to make decisions based on lead times instead of sound design principles. Designing for scalability should be at the core of the solution, not an afterthought.
Oversubscribing or overprovisioning resources in an attempt to keep up with workloads is not sustainable. Just because you might have 20 petabytes (PB) of data in the future doesn’t mean you should purchase all 20 PB of storage today, which leads us to the next hypothesis.
Hadoop delivers value as long as it scales — myth or reality?
The graph below illustrates the three potential Hadoop ROI scenarios described in the study. The solid dark brown line depicts the continually falling cost of storing data, the solid blue line represents the business investment in infrastructure, and the dotted lines trace the three scenarios.
- Green dotted line — Hadoop continues to deliver greater value in spite of increased investments in the infrastructure. This case represents a business that manages its data very efficiently, and continues to extract value as it makes incremental investments in the infrastructure.
- Orange dotted line — Hadoop delivers tangible value in the beginning; however, there is an inflection point at which investing in the infrastructure is no longer cost-effective. The tell-tale signs are (a) data becoming stale, (b) inefficient data management (e.g., heavy duplication), and (c) data analysis algorithms that are unable to scale. To deliver a positive ROI, the business needs to re-evaluate its data management processes, re-assess its analytical algorithms, etc.
- Yellow dotted line — Hadoop fails to deliver value, even in the beginning. This is usually the case when the data content, the analysis, and the business goals are not completely aligned. It is not necessarily Hadoop’s fault that the technology does not deliver the expected value.
The bottom line: without a data strategy that keeps the value delivered ahead of the cost of incremental investments, operating a Hadoop cluster can present significant ROI challenges as business or market conditions change.
Source: IDC’s “Extracting Value from Chaos” Study, sponsored by EMC, June 2011