Background Knowledge
If you are new to Hadoop/Spark, then this page will help you learn some critical concepts behind them.
Before going deeper into the technology, let's consider Big Data.
Wikipedia defines Big Data as:
> Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time
>
> ...
>
> Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data
To me, this is too technical, and it seems to be missing something very important to business. So here is my definition:
Big Data is a strategy where you become more competitive as you have more data. Often, the winner is the one who has the biggest data.
Typically, I imagine a market like the one Google is in. This sounds very challenging, but it is worth trying!
Hadoop was inspired by Google's MapReduce and the Google File System. At Google I/O 2008, Jeff Dean, one of the creators of MapReduce, made the following two points about large-scale clusters:
- Reliability has to come from the software
- How can we make it easy to write distributed programs?
You can still enjoy his presentation here: Underneath the Covers at Google: Current Systems and Future Directions.
Simply put, the answer available to everyone is Hadoop/Spark.
But note that it may be more than you need if you are not playing in the Big Data space.
At the very least, you should know that Hadoop/Spark executes a job in parallel. There are two types of parallelism: Data Parallelism and Task Parallelism. In Data Parallelism, every distributed job executes the same code on a different portion of the data, while in Task Parallelism, each job executes different code.
Hadoop/Spark follows Data Parallelism.
This may not be immediately clear, so let me show you an example.
Suppose you summarize user access to your website using a single large access log. The items to be summarized are:
- how many unique users in each hour?
- how many requests per user?
- average response time per page
You write one program for each: A, B, and C, respectively. Now, how do you execute these programs on your cluster of, say, 3 nodes: L, M, and N? In both models, the log file exists on every node.
If you run A on L, B on M, and C on N concurrently, that is called Task Parallelism. If you think of multiple cores instead of multiple nodes, you will see that this is what you would usually expect. The execution time T is:
T = MAX(t(A), t(B), t(C))
where:
t(X) means the execution time of program X
In Data Parallelism, they are executed as follows:
- A on L, A on M, A on N
- B on L, B on M, B on N
- C on L, C on M, C on N
In Data Parallelism, each job on each node reads only a chunk of the file. That is why execution gets faster as you add more parallelism. With 3 nodes, the execution time T is:
T = t(A)/3 + t(B)/3 + t(C)/3
where:
t(X) means the execution time of program X
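For instance, if we assume hypothetical run times of t(A) = 30, t(B) = 20, and t(C) = 10 minutes, Task Parallelism gives T = MAX(30, 20, 10) = 30 minutes, while Data Parallelism on 3 nodes gives T = 30/3 + 20/3 + 10/3 = 20 minutes. Below is a minimal sketch of the same contrast in plain Python (not Hadoop/Spark); the log format, the chunk boundaries, and the three toy programs A, B, and C are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from collections import Counter

# A toy access log: "timestamp user page response_time_ms" (invented format).
ACCESS_LOG = [
    "2023-01-01T10:01 foo /index.html 120",
    "2023-01-01T10:05 bar /about.html 80",
    "2023-01-01T11:02 foo /index.html 95",
    "2023-01-01T11:10 baz /index.html 200",
    "2023-01-01T12:30 bar /contact.html 60",
    "2023-01-01T12:45 foo /about.html 110",
]

def program_a(lines):  # A: unique users per hour
    users_per_hour = {}
    for line in lines:
        ts, user, _, _ = line.split()
        users_per_hour.setdefault(ts[:13], set()).add(user)
    return {hour: len(users) for hour, users in users_per_hour.items()}

def program_b(lines):  # B: requests per user
    return Counter(line.split()[1] for line in lines)

def program_c(lines):  # C: average response time (ms) per page
    totals = {}
    for line in lines:
        _, _, page, ms = line.split()
        total, count = totals.get(page, (0, 0))
        totals[page] = (total + int(ms), count + 1)
    return {page: total / count for page, (total, count) in totals.items()}

# Task Parallelism: each worker runs a *different* program over the *whole* log.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(p, ACCESS_LOG) for p in (program_a, program_b, program_c)]
    print([f.result() for f in futures])

# Data Parallelism: every worker runs the *same* program over *its own chunk*,
# and the partial results are merged afterwards (shown here only for program B).
chunks = [ACCESS_LOG[0:2], ACCESS_LOG[2:4], ACCESS_LOG[4:6]]   # one chunk per node
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(program_b, chunks))
merged = sum(partials, Counter())      # Counter supports "+" for merging partial counts
print(merged)                          # Counter({'foo': 3, 'bar': 2, 'baz': 1})
```

The data-parallel half is the part that mirrors Hadoop/Spark: the same function runs on every chunk, and the partial results are merged at the end.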
Google found that, in any large-scale cluster, the cost of moving data around is very high. This is mentioned in the presentation above.
Imagine you have a cluster consisting of 1,000 nodes, and each piece of your data has 3 replicas. If you distribute your jobs randomly, a job has only a 3/1000 chance of finding its data locally. In most cases (997/1000), the job has to fetch its data over the network. You can see how the network becomes a bottleneck as you launch more jobs concurrently.
So, to keep the network from becoming a bottleneck, it is critical to move a job to a node where its data already exists.
This is called Data Locality.
As explained under Data Locality, it is critical to distribute jobs to the nodes where their data exist.
When a framework is asked to execute a program in parallel, it first checks the chunks of each input file and assigns each chunk to a new job. So basically (some optimizations are available), you can assume there are as many jobs as there are chunks across all the input files.
Then the framework delivers each job to a node where its chunk exists. Each node executes its assigned job and returns the result for its chunk to the program.
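Here is a minimal sketch of that placement idea, not the actual Hadoop/Spark scheduler; the chunk names, replica placement, and node names are invented for illustration.

```python
import random

# Which nodes hold a replica of each chunk (3 replicas per chunk, as in HDFS defaults).
chunk_replicas = {
    "access.log/chunk-0": ["L", "M", "N"],
    "access.log/chunk-1": ["L", "M", "P"],
    "access.log/chunk-2": ["M", "N", "P"],
}

def schedule(chunk_replicas):
    """Create one job per chunk and place it on a node that already holds that chunk."""
    assignments = {}
    for chunk, replicas in chunk_replicas.items():
        # Data-local placement: the job reads its chunk from local disk, no network copy.
        assignments[chunk] = random.choice(replicas)
    return assignments

print(schedule(chunk_replicas))
# e.g. {'access.log/chunk-0': 'N', 'access.log/chunk-1': 'L', 'access.log/chunk-2': 'M'}
```

Because each job lands on a node that already holds a replica of its chunk, no input data has to cross the network.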
MapReduce is not just a particular implementation of parallel execution; it is also a common design pattern for data-parallel frameworks like Hadoop and Spark, and understanding it helps you understand how they work.
Let's take one of the previous examples: "how many requests per user?"
To answer the question, you need to do the following:
- map each line (one access) to a user
- count the number of lines for each user
The first step is done in Map, and the second in Reduce.
Map is the phase that maps each line to some key. Here, the key is the user. Thus, after the Map jobs, your program gets a result like this:
| user name | lines |
|---|---|
| foo | XXXXXXXXXX, YYYYYYYYYY |
| ... | ... |
| bar | ZZZZZZZZZZ |
Then the framework gathers the map results, shuffles (sorts) them, and creates a Reduce job for each key: one job for foo, another for bar.
A Reduce job performs a simple "count" over its line entries and produces the number of requests for that particular user.
Finally, your program will get a result like this:
| user name | number of requests |
|---|---|
| foo | 5 |
| ... | ... |
| bar | 7 |
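To make the flow concrete, here is a minimal sketch of the same Map, Shuffle, and Reduce steps in plain Python; the real frameworks distribute each phase across nodes, and the log lines and field layout below are invented for illustration.

```python
from itertools import groupby

# A toy access log: "timestamp user page" (invented format).
ACCESS_LOG = [
    "2023-01-01T10:01 foo /index.html",
    "2023-01-01T10:05 bar /about.html",
    "2023-01-01T11:02 foo /index.html",
    "2023-01-01T12:30 bar /contact.html",
    "2023-01-01T12:45 foo /about.html",
]

def map_phase(line):
    user = line.split()[1]
    return (user, line)              # key = user, value = the original line

def reduce_phase(user, lines):
    return (user, len(list(lines)))  # "count" the lines for this user

# Map: one (key, value) pair per line.
mapped = [map_phase(line) for line in ACCESS_LOG]

# Shuffle: sort by key so all lines of the same user end up together.
shuffled = sorted(mapped, key=lambda kv: kv[0])

# Reduce: one job per key.
results = [reduce_phase(user, (line for _, line in group))
           for user, group in groupby(shuffled, key=lambda kv: kv[0])]

print(dict(results))                 # {'bar': 2, 'foo': 3}
```

The sketch maps one-to-one onto the phases described above: map_phase emits (user, line) pairs, the sort plays the role of the shuffle, and reduce_phase counts the lines per user.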