Big Data

Lab2: Bigrams (Pairs and Stripes)

The Bigram example consists of counting the co-occurrences of pairs of words. Formally, for this lab we consider a bigram to be any ordered pair of consecutive words appearing on the same line of the text. In simpler terms, the word w1 co-occurs with the word w2 if the two words are on the same line and w1 is immediately followed by w2. For example, a text containing the line "UNIX is simple" contains the pairs ("UNIX", "is") and ("is", "simple"), each with a co-occurrence count of 1.

There are two possible implementations for counting the co-occurrences. In the following, we provide the implementation based on the Pairs design pattern and ask you to improve the code using the Stripes design pattern. In the Pairs implementation, the intermediate key is a pair of words and the value is an integer count. TextPair.java provides an implementation of a pair of Text objects that can be used as a key in Hadoop.
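To make the pattern concrete, below is a minimal sketch of what the mapper and reducer of a Pairs job typically look like. It is not the provided code: the class name PairsSketch, the whitespace tokenization, and the TextPair(String, String) constructor are illustrative assumptions, so check the provided Pairs.java and TextPair.java for the actual structure and API.

    package heigvd.bda.labs.bigram;

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairsSketch {

        // Mapper: emits one (TextPair, 1) record per pair of consecutive words in a line.
        public static class PairsMapper
                extends Mapper<LongWritable, Text, TextPair, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] words = value.toString().split("\\s+");
                for (int i = 0; i < words.length - 1; i++) {
                    if (words[i].isEmpty() || words[i + 1].isEmpty())
                        continue;
                    // Assumed (String, String) constructor; see TextPair.java for the real API.
                    context.write(new TextPair(words[i], words[i + 1]), ONE);
                }
            }
        }

        // Reducer: sums the partial counts received for each distinct pair of words.
        public static class PairsReducer
                extends Reducer<TextPair, IntWritable, TextPair, IntWritable> {

            private final IntWritable total = new IntWritable();

            @Override
            protected void reduce(TextPair key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values)
                    sum += v.get();
                total.set(sum);
                context.write(key, total);
            }
        }
    }

The important point is that the intermediate key carries both words, so the shuffle groups identical bigrams together and the reducer only has to sum the 1s.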

Pairs Implementation (provided)

The source code of the Pairs implementation for the Bigram example is provided and commented. Follow these steps to test it:

  1. Download the source code. The zip file contains the source code, a readme file and a pom.xml file.

  2. Try to read and understand the code line by line.

  3. Compile the code using mvn compile.

  4. Create a jar file using mvn package.

  5. Test the code on the quote file.

  6. Run the Bigram count on Hadoop (in standalone/pseudo-distributed mode):

    $ ${HADOOP_HOME}/bin/hadoop jar target/bigram-1.0-SNAPSHOT.jar \
      heigvd.bda.labs.bigram.Pairs NUM_REDUCERS INPUT_PATH OUTPUT_PATH
    
    • The INPUT_PATH refers to a folder on the local/distributed file system that contains the downloaded file.

    • The OUTPUT_PATH should not exist on the local/distributed file system.

The output (part-r-00000) should look like the following (without pre-processing):

Albert  Einstein    1
Not everything  1
and not 1
be  counted 1
be  counted.    1
can be  2
counted counts, 1
counts  can 1
counts, and 1
everything  that    2
not everything  1
that    can 1
that    counts  1

Stripes Implementation

Your task is to improve the Bigram implementation using the Stripes design pattern presented in the course. All improvements and pre-processing steps learned in the WordCount example (including case-folding and the removal of non-alphabetic words) should be applied in this lab as well.

Copy the file Pairs.java into a new file Stripes.java (inside the same package) and modify the code. An implementation of a string-to-int associative array (essentially a HashMap) required for the Stripes pattern is provided in the class StringToIntMapWritable. When you have completed the code, use Maven to compile it and create a jar file.
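As a rough starting point, the sketch below shows one possible shape of a Stripes mapper and reducer. The class name StripesSketch and the calls clear(), increment(String), and add(StringToIntMapWritable) are assumptions made for illustration only; adapt them to the methods that the provided StringToIntMapWritable class actually exposes.

    package heigvd.bda.labs.bigram;

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesSketch {

        // Mapper: for each word, emits a stripe mapping its right neighbour to a count.
        public static class StripesMapper
                extends Mapper<LongWritable, Text, Text, StringToIntMapWritable> {

            private final Text first = new Text();
            private final StringToIntMapWritable stripe = new StringToIntMapWritable();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Pre-processing as in WordCount: case-folding and keeping alphabetic tokens only.
                String[] words = value.toString().toLowerCase().split("[^a-z]+");
                for (int i = 0; i < words.length - 1; i++) {
                    if (words[i].isEmpty() || words[i + 1].isEmpty())
                        continue;
                    stripe.clear();                 // assumed method
                    stripe.increment(words[i + 1]); // assumed method: stripe[w2] += 1
                    first.set(words[i]);
                    // One stripe per bigram here; stripes for the same word in a line
                    // could also be aggregated before writing.
                    context.write(first, stripe);
                }
            }
        }

        // Reducer: element-wise sum of all stripes received for the same left word.
        public static class StripesReducer
                extends Reducer<Text, StringToIntMapWritable, Text, StringToIntMapWritable> {

            private final StringToIntMapWritable sum = new StringToIntMapWritable();

            @Override
            protected void reduce(Text key, Iterable<StringToIntMapWritable> values, Context context)
                    throws IOException, InterruptedException {
                sum.clear();                        // assumed method
                for (StringToIntMapWritable stripe : values)
                    sum.add(stripe);                // assumed method: element-wise addition
                context.write(key, sum);
            }
        }
    }

The key idea is that each emitted value is an entire associative array (a stripe) mapping the right neighbours of a word to their counts, and the reducer sums the stripes element-wise.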

Questions

Test your code on the file bible_shakes.nopunc.gz and answer the following questions in the order that they appear:

  1. Do you think the reducer of the Stripes implementation can be used as a combiner? How does the result change when you use the reducer as a combiner?

  2. Do you think the Stripes pattern could be used with the in-Mapper combiner pattern? If yes, implement it inside the ImprovedStripes class. Do not use the reducer as a combiner in this exercise.

  3. How does the number of reducers influence the behavior of the Stripes approach?

  4. How does the number of reducers influence the behavior of the Pairs approach?

  5. Compared to the Pairs pattern, for a given number of reducers, how many bytes are shuffled using the Stripes pattern without the in-Mapper combiner pattern? And with the in-Mapper combiner pattern?

  6. Compare the Pairs and Stripes design patterns and discuss their advantages and drawbacks.

Running Hadoop on the Amazon cloud

Your task is to run the code that you have developed on the Amazon cloud. Amazon's service for providing Hadoop clusters is called Elastic MapReduce (EMR).

To run your code on Amazon, follow the EMR Getting Started Guide with some small modifications. The instructions in the guide comprise eight steps; for this lab we can skip some steps and modify others, as explained in the following.

Questions

Answer the following questions in the order that they appear:

  1. How much time did the job flow take according to the EMR console, and how long did it take in total (wall-clock time)?

  2. How many EC2 instances were created to run the job and what were their roles? How many normalized instance hours did Amazon bill for the job?

Lab Deliverables

Submit your lab report (including the source code and answers to the questions) in PDF format via Moodle before Tuesday, October 13, 8:30am.