MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

What is the Problem?

For companies like Google, they need to process extremely large data, and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. While handling such distributed tasks, they have to face the issue of how to parallelize the computation, distribute the data, and handle failures, which obscures the original simple computation with large amounts of complex engineering work to deal with these issues.

Summary

This paper presents a new programming model, MapReduce, which is inspired by the map and reduce primitives in many functional languages. This model provides users with an abstraction that allows them to express the simple computations they are trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. In this paper, the authors also describe an implementation of the MapReduce interface towards the cluster-based computing environment in Google and discuss several useful refinements of the programming model.

Key Insights

Although tasks of processing a large amount of data seem complex, they can be expressed as simple and straightforward solutions: map-/reduce-based style.
In this kind of distributed system, network bandwidth can be a relatively scarce resource as well as the bottleneck of the whole process.

Notable Design Details/Strengths

MapReduce provides a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations. It hides complex technical details under the simple interface and helps users deal with data partition problem, task scheduling problem, failure handling problem, the inter-machine communication problem. Moreover, such design can achieve high performance on large clusters of commodity PCs.
The distributed fashion and the use of a master node enable the system to tolerate node failures by simply rescheduling the work to other available nodes. Furthermore, the master node periodically stores checkpoints of the master data structures so that a new copy of the master node can be restarted and resumes the whole work from the last interrupted state.
MapReduce does not limit input and output types. In this case, users can implement their readers to retrieve data from different sources and can produce customed data types.

Limitations/Weaknesses

Such programming model is not suitable for real-time data processing scenario. First, it relies on a distributed file system to store large amounts of data. Then, MapReduce framework starts to process the data. This two-step process may downgrade the performance.
MapReduce only supports a restricted programming model, which requires that tasks must be written as acyclic data-flow programs. Moreover, based on the programming model in this paper, the map function and the reduce function must be stateless, which imposes limitations that are felt in fields such as machine learning, where iterative algorithms may lead to data dependency among iterations.

Summary of Key Results

The overhead introduced by MapReduce is mainly due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of input files and to get the information needed for the locality optimization.
Backup tasks greatly reduce the total time required to complete the MapReduce task. For the sort program in this paper, with backup tasks disabled, five stragglers result in an unnecessary increase of 44% in elapsed time.

Open Questions

This paper does not provide detailed statistics to clarify how many kinds of programs can be applied to this programming model. Without such information, it is hard to say that MapReduce is useful for processing a large amount of data, because, nowadays, programs become much more complex and may not be converted to MapReduce style. As well, the performance of MapReduce must be a major problem, which is also hard to solve, because the restricted programming model sacrifices the space for improvement, though such programming model does give users high convenience.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

What is the Problem?

Summary

Key Insights

Notable Design Details/Strengths

Limitations/Weaknesses

Summary of Key Results

Open Questions

FilesExpand file tree

11.13.md

Latest commit

History

11.13.md

File metadata and controls

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

What is the Problem?

Summary

Key Insights

Notable Design Details/Strengths

Limitations/Weaknesses

Summary of Key Results

Open Questions