Inverted indexing is a batch job that, for each popular keyword, sorts the documents containing it, making it easier and faster for document-retrieval requests to find documents with many hits for a set of user-given keywords. For instance, given a request for documents containing term1, term2, and term3, we intersect the three corresponding posting lists and find that document 11 contains all of these keywords.
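To make the intersection step concrete, here is a minimal sketch (illustrative only, not part of the project code) of intersecting posting lists in Java:

```java
import java.util.*;

// Illustrative only: postings maps each term to the sorted set of IDs of
// documents containing it; the intersection yields documents with all terms.
class PostingIntersection {
    static SortedSet<Integer> docsWithAllTerms(
            Map<String, SortedSet<Integer>> postings, String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> docs =
                    postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);  // start from the first list
            } else {
                result.retainAll(docs);        // keep only common documents
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

With posting lists for term1, term2, and term3 that all contain 11, the returned set would include document 11.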
The main( String[] args ) function receives keywords or terms in args. As output, the program creates, for each keyword, a list of documents sorted by the number of occurrences of that keyword in the document.
Prerequisites: installed Java 1.8 and Maven.

- Install Hadoop with Homebrew:
  `brew install hadoop`
- Check the version of Hadoop:
  `brew info hadoop`
- Check that pom.xml has the same version of the Hadoop client.
- Run `mvn package` to compile the project into a jar file.
- Run the application over a test set of data in standalone mode:
  `hadoop jar target/inv-indexes-1.0-SNAPSHOT.jar kosiachenko.InvIndexes hdfs/input hdfs/output rfc UDP`
- Check the `hdfs/output/part-00000` file to see the results.
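For reference, a driver main() consistent with that command line might look roughly like the following in the classic mapred API. This is a sketch under assumptions: the `invindex.keywords` configuration key is hypothetical, and the `Map`/`Reduce` classes referenced here are only sketched later in this write-up.

```java
import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class InvIndexes {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(InvIndexes.class);
        conf.setJobName("inverted-index");

        // args[0] = input dir, args[1] = output dir, args[2..] = keywords,
        // matching: ... InvIndexes hdfs/input hdfs/output rfc UDP
        // ("invindex.keywords" is a hypothetical configuration key)
        conf.set("invindex.keywords",
                 String.join(",", Arrays.copyOfRange(args, 2, args.length)));

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);  // the line discussed below
        conf.setReducerClass(Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```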
After testing the program locally, I set up the Hadoop Distributed File System on a cluster of 4 machines (including configuring the YARN resource-management platform) to test on an actual cluster.
I also copied the input files (169 files containing IETF (Internet Engineering Task Force) protocol standards and draft standards) into HDFS.
The workflow of my program is as follows: the Map stage emits one record per keyword occurrence it finds in a document, and the Reduce stage aggregates these records into per-document counts for each keyword.
I also implemented sorting of documents by the number of keyword occurrences in each document in my reducer function.
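A hedged sketch of what such a Map stage could look like in the classic mapred API (the configuration key and the emitted "filename=1" value format are assumptions, not the actual project code):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String[] keywords = new String[0];  // user-given terms
    private String filename = "unknown";        // current input document

    @Override
    public void configure(JobConf job) {
        // Keywords are assumed to be passed in via the job configuration;
        // "invindex.keywords" is a hypothetical key.
        keywords = job.get("invindex.keywords", "").split(",");
        // The old mapred API exposes the current split's file path here.
        filename = job.get("map.input.file", "unknown");
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        for (String word : line.toString().split("\\W+")) {
            for (String kw : keywords) {
                if (word.equalsIgnoreCase(kw)) {
                    // One "document=1" record per occurrence; the Reducer
                    // sums these into a per-document count.
                    out.collect(new Text(kw), new Text(filename + "=1"));
                }
            }
        }
    }
}
```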
The main issue I encountered was with the Combiner. After spending many hours trying to debug the mysterious behavior of my program, I figured out what the problem was. The main() function I copied from the spec includes the following line:
conf.setCombinerClass(Reduce.class);
This means we specify our Reduce class as our Combiner class as well. According to the Internet, the Combiner is an optional class in a MapReduce program that accepts the outputs of the Map class and passes its own output key-value pairs on to the Reducer class. Because my initial design of the program didn't expect the Reducer's output to be passed back into the Reducer as input, it didn't work as expected and produced wrong results. The workflow was basically: Map output → Combiner (running my Reduce code) → Reducer, so the Reducer received values that were already in its own output format, which its parsing logic did not expect.
I figured I could solve this issue in several different ways:
- Delete the line in main() where we set Reduce.class as our Combiner. This is what I initially did, but then I realized it could lead to sub-optimal performance.
- Handle Combiner output in the Reduce class as valid input. This is what I have in my final program (see the sketch after this list). Restoring the Combiner improved program performance by a factor of 1.16 for the YARN version (from 175 sec to 150 sec).
- Write my own Combiner class. This seemed like overkill in this case.
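A minimal sketch of what option 2 can look like, assuming a space-separated "document=count" value format (the format and class details are illustrative, not the actual project code): if the Reducer's input and output value formats are identical, running it as a Combiner merely pre-aggregates counts and the Reducer stays correct.

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class Reduce extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text term, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        // Each value holds one or more space-separated "document=count"
        // tokens. Raw Mapper output ("doc=1") and prior Combiner output
        // ("doc1=5 doc2=3") parse the same way, so this class is safe to
        // use both as Combiner and as Reducer.
        // (java.util.Map is fully qualified below to avoid clashing with
        // the Map class sketched earlier.)
        HashMap<String, Integer> counts = new HashMap<>();
        while (values.hasNext()) {
            for (String token : values.next().toString().trim().split("\\s+")) {
                if (token.isEmpty()) continue;
                String[] parts = token.split("=");
                counts.merge(parts[0], Integer.parseInt(parts[1]), Integer::sum);
            }
        }
        // Sort documents by descending occurrence count for the output.
        List<java.util.Map.Entry<String, Integer>> sorted =
                new ArrayList<>(counts.entrySet());
        sorted.sort(java.util.Map.Entry
                .<String, Integer>comparingByValue().reversed());
        StringBuilder sb = new StringBuilder();
        for (java.util.Map.Entry<String, Integer> e : sorted) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        out.collect(term, new Text(sb.toString()));
    }
}
```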
I didn't observe any speed-up from parallelizing the job over the 4-machine cluster compared to the local job; the task actually took longer to complete in that case.
But these results are expected: I worked with a very small dataset (by Hadoop standards), so the network communication overhead far outweighed any performance gains from parallelization.

