Big Data

Lab1: WordCount (Aggregations)

Pedagogical Objectives

Description

WordCount is a simple example of a MapReduce job. It counts how often each word occurs in a text file.

There are several possible implementations of WordCount. In the following, we give a basic implementation and ask you to write an improved version using the design patterns presented in the course.

Basic Implementation (provided)

We provide the source code of a basic implementation of the WordCount with extensive comments. Follow these steps to run the WordCount example:

  1. Download the source code. The zip file contains the source code, readme.txt and pom.xml.

  2. Try to read and understand the code line by line.

  3. Compile the code using mvn compile.

  4. Create a jar file using mvn package. The jar file will be created inside target folder.

  5. Download the sample input file quote.

  6. Run WordCount on Hadoop (in standalone/pseudo-distributed mode):

       $ ${HADOOP_HOME}/bin/hadoop jar target/wordcount-1.0-SNAPSHOT.jar \
       heigvd.bda.labs.wordcount.WordCount NUM_REDUCERS INPUT_PATH OUTPUT_PATH
    

    The INPUT_PATH refers to a folder on the local/distributed file system that contains the downloaded file.

    Make sure that the OUTPUT_PATH does not exist on the local/distributed file system (otherwise Hadoop will refuse to run to avoid accidentally overwriting results from a previous run).

After you have run the job you should have an output (part-r-00000) that should look like:

Albert  1
Einstein    1
Not 1
and 1
be  2
can 2
counted 1
counted.    1
counts  1
counts, 1
everything  2
not 1
that    2

Question

Test the WordCount code on the file bible_shakes.nopunc.gz which contains the Bible and the works of Shakespeare (with all punctuation marks removed). Use the pseudo-distributed mode. Bring up the Web Interface of Task tracker in your browser and look at the provided job statistiques.

  1. How many intermediate key-value pairs are produced?

Improved Implementation

Your task is to improve the basic implementation using the In-Mapper Combining design pattern presented in the course. Copy the file WordCount.java into WordCountImproved.java (inside the same package) and modify the code. When you have completed the code, use mvn to compile and create a jar file.

Questions

Test your code on the file bible_shakes.nopunc.gz and answer these questions in the order that they appear:

  1. How does the given code isolate the words to be counted? What kind of "words" do you see in the output?

  2. What is the first term in part-r-00000 and how many times does it appear?

  3. Do you think the number of reducers affects the WordCount example? Try with different values and substantiate your answer.

  4. Use the JobTracker Web interface to examine the job counters: can you explain the differences among basic and improved implementations? For example, look at the amount of bytes shuffled by Hadoop.

  5. How many unique words are there? (Hint: read the counter values)

  6. Some words are more used than others and reducers associated with those words will work more than the others. Do you think this is good? Why?

  7. Modify the WordCount example so that only words consisting entirely of letters are counted. Your code should not be case sensitive. You need to add few lines into the words method.

  8. Now, what is the first term in part-r-00000 and how many times does it appear?

  9. How many unique words are there after the modification mentioned in 7?

Lab Deliverables

Submit your lab report (including the source code and answers to the questions) in pdf format via moodle before Tuesday September 29, 8:30am Tuesday October 6, 8:30am.