How Facebook moved 30 petabytes of Hadoop data

For anyone who didn't know, Facebook is a huge Hadoop user, and it does some very cool things to stretch the open source big-data platform to meet Facebook's unique needs. Today, it shared the latest of those innovations: moving its whopping 30-petabyte cluster from one data center to another.

Facebook's Paul Yang detailed the process on the Facebook Engineering page. The move was necessary because Facebook had run out of both power and space to expand the cluster (very likely the largest in the world) and had to find it a new home. Yang writes that there were two options, physically moving the machines or replicating the data, and Facebook chose replication to minimize downtime.

Once it made that decision, Facebook's data team undertook a multi-step process to copy over the data, trying to ensure that any file changes made during the copying process were accounted for before the new system went live. Perhaps not surprisingly, the sheer size of Facebook's cluster created problems:

"There were many challenges in both the replication and switchover steps. For replication, the challenges were in developing a system that could handle the size of the warehouse. The warehouse had grown to millions of files, directories, and Hive objects. Although we'd previously used a similar replication system for smaller clusters, the rate of object creation meant that the previous system couldn't keep up."
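The general pattern described above, a bulk copy followed by a catch-up pass for changes made during the copy, can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Facebook's actual system: the `Cluster` class, its in-memory change log, and the `bulk_copy` and `replay_changes` functions are hypothetical names invented for this example, and a real warehouse would be copying HDFS files and Hive metadata rather than dictionary entries.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a warehouse: a file map plus an ordered change log."""
    files: dict = field(default_factory=dict)  # path -> contents
    log: list = field(default_factory=list)    # (operation, path, data) records

    def write(self, path, data):
        self.files[path] = data
        self.log.append(("put", path, data))

    def delete(self, path):
        self.files.pop(path, None)
        self.log.append(("delete", path, None))


def bulk_copy(source, dest):
    """Copy a snapshot of the source; return the log position of that snapshot."""
    position = len(source.log)
    dest.files = dict(source.files)
    return position


def replay_changes(source, dest, position):
    """Apply every change logged after `position`; return the new log position."""
    for operation, path, data in source.log[position:]:
        if operation == "put":
            dest.files[path] = data
        else:
            dest.files.pop(path, None)
    return len(source.log)


if __name__ == "__main__":
    old, new = Cluster(), Cluster()
    old.write("/warehouse/a", "v1")
    old.write("/warehouse/b", "v1")

    position = bulk_copy(old, new)

    # The old cluster stays live, so changes keep arriving after the snapshot.
    old.write("/warehouse/a", "v2")
    old.delete("/warehouse/b")

    position = replay_changes(old, new, position)
    print(new.files == old.files)
```

In this sketch the catch-up pass can be repeated as often as needed, and the lag between the two clusters is simply the number of log entries not yet replayed. The problem Yang describes corresponds to the log growing faster than the replay step can drain it.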

Ultimately, Yang writes, the migration proved that disaster recovery is possible with Hadoop clusters. This could be an important capability for organizations considering relying on Hadoop (by running Hive atop the Hadoop Distributed File System) as a data warehouse, as Facebook does. As Yang notes, "Unlike a traditional warehouse using SAN/NAS storage, HDFS-based warehouses lack built-in data-recovery functionality. We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag."

For Facebook, though, it looks like its fast-growing Hadoop data warehouse is just part of a larger trend toward needing more space. Last night, Facebook confirmed it's building a second data center in Prineville, Ore., next to its existing one. That will make three for the company, which is also building a data center in Forest City, N.C.

Image courtesy of Flickr user daretothink.
