MemeStreams Discussion
This page contains all of the posts and discussion on MemeStreams referencing the following web page: "MapReducing 20 petabytes per day/ Google's guts!". You can find discussions on MemeStreams as you surf the web, even if you aren't a MemeStreams member, using the Threads Bookmarklet.

MapReducing 20 petabytes per day/ Google's guts!
by unmanaged at 11:28 pm EST, Feb 2, 2008

Googlers Jeff Dean and Sanjay Ghemawat have an article, "MapReduce: Simplified Data Processing on Large Clusters", in the January 2008 issue of Communications of the ACM.

It is a great introduction to MapReduce, but what I found most interesting was the numbers they cite on usage of MapReduce at Google.

Jeff and Sanjay report that, on average, 100k MapReduce jobs are executed every day, processing more than 20 petabytes of data per day.

More than 10k distinct MapReduce programs have been implemented. In the month of September 2007, 11,081 machine years of computation were used for 2.2M MapReduce jobs. On average, these jobs used 400 machines and completed their tasks in 395 seconds.
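A quick back-of-the-envelope check (my own arithmetic, not a figure from the paper) shows those September 2007 numbers hang together: 2.2M jobs times roughly 400 machines times 395 seconds each works out to about 11,000 machine years, in line with the 11,081 reported.

```python
# Sanity check on the September 2007 figures quoted above
# (assumes ~2.2M jobs, ~400 machines/job, ~395 s/job).
jobs = 2_200_000
machines_per_job = 400
seconds_per_job = 395

machine_seconds = jobs * machines_per_job * seconds_per_job
machine_years = machine_seconds / (365 * 24 * 3600)

print(f"{machine_years:,.0f} machine years")  # ~11,000, close to the reported 11,081
```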

What is so remarkable about this is how casual it makes large-scale data processing. Anyone at Google can write a MapReduce program that uses hundreds or thousands of machines from their cluster. Anyone at Google can process terabytes of data. And they can get their results back in about 10 minutes, so they can iterate on it and try something else if they didn't get what they wanted the first time.
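To give a sense of how little the programmer actually writes, here is a minimal word-count sketch of the MapReduce programming model in Python. The function names and the toy single-machine driver are my own illustration; the real system takes the two user-supplied functions, written in C++ against Google's library, and runs them across thousands of machines.

```python
from collections import defaultdict

def map_fn(document: str):
    # Emit an intermediate (word, 1) pair for every word in the input.
    for word in document.split():
        yield word, 1

def reduce_fn(word: str, counts):
    # Sum all counts emitted for the same word.
    yield word, sum(counts)

def run_local(documents):
    # Toy single-machine driver standing in for the distributed runtime,
    # which would split the input, shuffle intermediate pairs by key,
    # and run map and reduce tasks on many machines in parallel.
    intermediate = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return {k: v for key, values in intermediate.items()
            for k, v in reduce_fn(key, values)}

print(run_local(["the quick brown fox", "the lazy dog jumps over the fox"]))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1, 'jumps': 1, 'over': 1}
```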

It is an amazing tool for Google. Google's massive cluster and the tools built on top of it rightly have been called a "competitive advantage", the "secret source of Google's power", and a "major force multiplier".

By the way, there is another little tidbit in the paper about Google's machine configuration that might be of interest. They describe the machines as dual-processor boxes with gigabit Ethernet and 4-8 GB of memory. The question of how much memory Google has per box in its cluster has come up a few times, including in my previous posts, "Four petabytes in memory?" and "Power, performance, and Google".

Update: Others have now picked up on this paper and commented on it, including Niall Kennedy, Kevin Burton, Paul Kedrosky, Todd Hoff, Ionut Alex Chitu, and Erick Schonfeld.

Update: Let me also point out Professor David Patterson's comments on MapReduce in the preceding article in the Jan 2008 issue of CACM. David said, "The beauty of MapReduce is that any programmer can understand it, and its power comes from being able to harness thousands of computers behind that simple interface. When paired with the distributed Google File System to deliver data, programmers can write simple functions that do amazing things."


 
RE: MapReducing 20 petabytes per day/ Google's guts!
by noteworthy at 2:56 pm EST, Feb 4, 2008

unmanaged wrote:

Googlers Jeff Dean and Sanjay Ghemawat have an article, "MapReduce: Simplified Data Processing on Large Clusters", in the January 2008 issue of Communications of the ACM.

It is a great introduction to MapReduce, but what I found most interesting was the numbers they cite on usage of MapReduce at Google.

Thanks for the pointer to this discussion. Last month I recommended the CACM paper on MapReduce.


 
 