MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

jeff@google.com, sanjay@google.com

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
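In the full paper, these two operations are given simple conceptual types: the input keys and values are drawn from a different domain than the intermediate keys and values, while the reduce output remains in the intermediate value domain:

    map     (k1, v1)        → list(k2, v2)
    reduce  (k2, list(v2))  → list(v2)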

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google, including our experiences in using it as the basis for a rewrite of our production indexing system. Section 7 discusses related and future work.

2 Programming Model

The computation takes a set of input key/value pairs, and produces a set of output key/value pairs. The user of the MapReduce library expresses the computation as two functions: Map and Reduce.

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function. The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key, and merges these values together to form a possibly smaller set of values.
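As a concrete example (the one the full paper itself uses), consider counting the number of occurrences of each word in a large collection of documents. The following is a minimal sequential C++ sketch written for this text: the function signatures and the in-process driver are illustrative assumptions rather than the real library’s interface, and the output-vector parameters stand in for the library’s EmitIntermediate/Emit calls. As in the paper, values are passed as strings, with user code converting between strings and the appropriate types.

// Word count as a minimal sequential C++ sketch. The real MapReduce
// library runs these same phases in parallel across many machines;
// the signatures and driver below are illustrative, not the library API.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// An intermediate key/value pair emitted by Map.
using KV = std::pair<std::string, std::string>;

// Map: key is a document name (unused here), value is its contents.
// Emits ("word", "1") for every word, appending to `out` in place of
// the library's EmitIntermediate call.
void Map(const std::string& /*key*/, const std::string& value,
         std::vector<KV>& out) {
  std::istringstream words(value);
  std::string w;
  while (words >> w) out.push_back({w, "1"});
}

// Reduce: sums all counts emitted for one word. Appending to `out`
// stands in for the library's Emit call.
void Reduce(const std::string& /*key*/,
            const std::vector<std::string>& values,
            std::vector<std::string>& out) {
  int sum = 0;
  for (const std::string& v : values) sum += std::stoi(v);
  out.push_back(std::to_string(sum));
}

int main() {
  // Input: (document name, document contents) pairs.
  const std::vector<KV> inputs = {
      {"doc1", "the quick brown fox"},
      {"doc2", "the lazy dog and the quick fox"},
  };

  // Map phase: apply Map to every input record.
  std::vector<KV> intermediate;
  for (const auto& [name, contents] : inputs)
    Map(name, contents, intermediate);

  // Grouping: collect all values for each intermediate key. In the
  // real system this is the distributed shuffle between map tasks
  // and reduce tasks.
  std::map<std::string, std::vector<std::string>> grouped;
  for (const auto& [word, count] : intermediate)
    grouped[word].push_back(count);

  // Reduce phase: one Reduce invocation per distinct intermediate key.
  for (const auto& [word, counts] : grouped) {
    std::vector<std::string> out;
    Reduce(word, counts, out);
    for (const std::string& total : out)
      std::cout << word << "\t" << total << "\n";
  }
  return 0;
}

Running the sketch prints each distinct word with its total count. The point of the model is that the user writes only Map and Reduce; everything in main here is the library’s responsibility in the real system.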
