Startup MapR Underpins EMC's Hadoop Effort

San Jose, Calif.-based storage startup MapR, which provides a high-performance alternative to the Hadoop Distributed File System, will serve as the storage component for EMC's forthcoming Greenplum HD Enterprise Edition Hadoop distribution. The alliance helps differentiate EMC from other Hadoop vendors, and it gives MapR's technology immediate credibility along with a strong distribution channel.

Today's announcement of the licensing agreement between the two companies confirms what I suspected when EMC unveiled its Hadoop plans earlier this month. At that time, MapR CEO John Schroeder took the stage at EMC World, and EMC itself described Enterprise Edition features that closely resemble what MapR provides.

Hadoop is an Apache Software Foundation project that consists of a set of tools for storing and processing large amounts of unstructured data. The two core components are the Hadoop Distributed File System for storing data and Hadoop MapReduce for writing parallel-processing jobs.

EMC's Hadoop strategy is distinctive, and its decision to embrace MapR is strong evidence of this. Coming into the Hadoop world with knowledge of the shortcomings of the current version of HDFS, EMC wanted a storage layer that would improve upon HDFS in terms of performance, availability and ease of use. It could have attempted to bolt on its Isilon clustered file system or used its considerable engineering talent to improve upon HDFS, but EMC spotted a quality product in MapR and jumped on it.

Another distinctive element of EMC's Hadoop distribution is that rather than being based on the official Apache version of the code, it's based on Facebook's Hadoop code (subscription required), which has been optimized for scalability and multi-site deployment.

Not to be outdone, commercial Hadoop pioneer Cloudera announced an HDFS partnership of its own yesterday. Cloudera Distribution of Hadoop users can now use RainStor's data retention system to improve upon HDFS with serious compression, deduplication and compliance features. RainStor claims it can reduce the footprint of HDFS volumes by 97 percent while providing built-in security, audit trails and granular retention and expiry policies for managing the lifecycle of stored data. Additionally, customers can access data stored within RainStor via standard avenues such as SQL.

The two companies are taking different approaches to improving the HDFS experience. By not tethering itself to the Apache Hadoop project, EMC is able to address enterprise needs by leveraging MapR's innate high availability, high performance, and advanced features such as mirroring and replication. Cloudera, on the other hand, is a major contributor to Apache Hadoop and will incorporate changes to the HDFS architecture and features as Apache officially adopts them. In the meantime, Cloudera can rely on partnerships like the one with RainStor to improve the HDFS experience without diverting its attention from improving the open source Apache Hadoop code.

It's arguable that the primary benefit of the Cloudera approach is that it's open source, which means customers willing to wait for HDFS improvements won't have to pay for them. EMC's Greenplum HD Enterprise Edition, which includes the MapR technology, will cost customers money.

As interest in Hadoop gains momentum among mainstream companies, the competition to provide the most complete Hadoop experience is getting intense. Whether they rely almost solely on the Apache Hadoop code, as Cloudera does, or not, as with EMC, vendors need to show potential customers that they can address real-world needs. There isn't a lot of money being spent on Hadoop products right now, but all signs point to that changing very soon, and then we'll see whose approach carries the day.
