Big Data

Lab3: Relative Frequencies (Order Inversion)

In this lab we continue the Bigram example. So far we have computed the number of times two words co-occur as a bigram in a text. Now we want to extend the MapReduce program to compute relative frequencies instead of absolute counts. The relative frequency of a bigram (wi, wj) with respect to its first word wi is the number of occurrences of the pair (wi, wj) divided by the total number of occurrences of wi. For instance, if the word "euro" followed by the word "crisis" occurs 10 times and the word "euro" occurs 20 times in total, the relative frequency of the pair ("euro", "crisis") is 10/20 = 0.5.
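
In symbols, writing count(·) for the number of occurrences in the text, the quantity to compute is:

    f(wj | wi) = count(wi, wj) / count(wi)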

Order Inversion Implementation

Your task is to implement relative frequencies using Pairs and Order Inversion. You can reuse some of the code from previous labs.

There is one file for this exercise called OrderInversion.java. The run method of the job is already implemented; your task is to implement the mapper, the reducer and the partitioner. Note that inside the OrderInversion class there is a field called ASTERISK which can be used to emit the total number of occurrences of a word. A sketch of how the three parts fit together is given below.
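
The following is a minimal sketch of the idea, assuming the compound key is a space-joined Text and ASTERISK is the string "*" (which sorts before any alphabetic word, so the marginal reaches the reducer first). The class and method names are illustrative; the skeleton in OrderInversion.java may expect a different structure.

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;

    // The classes below would be nested inside the OrderInversion class,
    // with the imports at the top of the file.

    // Mapper: for each bigram (wi, wj) emit both the pair itself and a
    // "marginal" pair (wi, ASTERISK) that counts the occurrences of wi.
    public static class PairMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] words = line.toString().trim().split("\\s+");
            for (int i = 0; i < words.length - 1; i++) {
                pair.set(words[i] + " " + OrderInversion.ASTERISK);
                context.write(pair, ONE);
                pair.set(words[i] + " " + words[i + 1]);
                context.write(pair, ONE);
            }
        }
    }

    // Partitioner: partition on the left word only, so that (wi, ASTERISK)
    // and every (wi, wj) end up on the same reducer.
    public static class FirstWordPartitioner
            extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            String first = key.toString().split(" ")[0];
            return (first.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Reducer: because (wi, ASTERISK) sorts before any (wi, wj), the total
    // for wi is known before the individual pairs arrive.
    public static class RelFreqReducer
            extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private long marginal = 0; // occurrences of the current left word

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            if (key.toString().endsWith(" " + OrderInversion.ASTERISK)) {
                marginal = sum;
            } else {
                context.write(key, new DoubleWritable((double) sum / marginal));
            }
        }
    }

If the run method also configures a combiner, it can only pre-aggregate partial counts per key; the division itself must stay in the reducer, where the marginal is known.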

  1. Download the source code. The zip file contains the Java sources, a readme file and a pom.xml file.

  2. Complete the code by adding the mapper, the reducer and the partitioner (see TODO).

  3. Compile the code using mvn compile.

  4. Create a jar file using mvn package.

  5. Test the code on the file quote.

  6. Run the completed MapReduce program on Hadoop (in standalone/pseudo-distributed mode) using a single reducer:

    $ ${HADOOP_HOME}/bin/hadoop jar target/relfreq-1.0-SNAPSHOT.jar \
      heigvd.bda.labs.relfreq.OrderInversion \
      NUM_REDUCERS INPUT_PATH OUTPUT_PATH
    
    • The INPUT_PATH refers to a folder on the local/distributed file system that contains the downloaded file.

    • The OUTPUT_PATH should not exist on the local/distributed file system.

The output (part-r-00000) should look like:

albert  einstein    1.0
and not 1.0
be  counted 1.0
can be  1.0
counted counts  1.0
counts  and 0.5
counts  can 0.5
everything  that    1.0
not everything  1.0
that    can 0.5
that    counts  0.5

Questions

Test your code on the file bible_shakes.nopunc.gz and answer the following questions, taking the role of the combiner into account.

  1. Do you think the Order Inversion approach is 'faster' than a naive approach with multiple jobs? For example, consider a compound job in which you compute the numerator and the denominator separately and then compute the relative frequency in a final step.

  2. What is the impact of using a 'special' compound key on the amount of shuffled bytes?

Evaluation on Amazon EMR

In this task you implement relative frequencies using Stripes and evaluate the performance of your algorithm on a Wikipedia dataset using Amazon EMR.

  1. The Stripes version of relative frequencies is quite simple to implement (no order inversion necessary). Copy the Bigram Stripes code into a new Java file Stripes.java in the relfreq package. You have to modify the reduce() function to compute the sum of all co-occurrence counts in a stripe and divide each count by that sum (see the sketch after this list).

    Hint: to hold the relative frequencies, implement a StringToDoubleMapWritable class, which you can easily derive from the existing StringToIntMapWritable implementation.

  2. Test your code in standalone/pseudo-distributed mode.

  3. A dataset with a subset of Wikipedia (~5 GB) is available via Amazon S3 in the bucket

    s3n://mse-bda-data/data/wikipedia

    The format of the file is as follows: each line starts with the article id, an underscore and the sentence id, followed by a space and the sentence content; a line might look like "12_7 to be or not to be". All punctuation and non-alphabetic characters are removed and all text is lowercase. The file requires no additional pre-processing.

  4. Run the Stripes version of relative frequencies on the Wikipedia dataset using 11 large instances (1 master and 10 core instances) on Amazon EMR. Set the number of reducers to 19.

    Hint: the job should take less than an hour.

  5. Repeat the previous step with the Pairs version of relative frequencies and compare the results with the Stripes version.
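
As referenced in step 1, here is a minimal sketch of the modified reduce() function. It assumes hypothetical accessors getMap() on StringToIntMapWritable and put(String, double) on the new StringToDoubleMapWritable; adapt these calls to the interface your classes actually expose.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class StripesReducer
            extends Reducer<Text, StringToIntMapWritable, Text, StringToDoubleMapWritable> {

        @Override
        protected void reduce(Text word, Iterable<StringToIntMapWritable> stripes,
                              Context context) throws IOException, InterruptedException {
            // Element-wise sum of all partial stripes for this word.
            Map<String, Integer> sums = new HashMap<String, Integer>();
            for (StringToIntMapWritable stripe : stripes) {
                for (Map.Entry<String, Integer> e : stripe.getMap().entrySet()) {
                    Integer previous = sums.get(e.getKey());
                    sums.put(e.getKey(),
                             previous == null ? e.getValue() : previous + e.getValue());
                }
            }

            // The marginal count of the first word is just the total of its
            // stripe, so the denominator is available without order inversion.
            long total = 0;
            for (int count : sums.values()) {
                total += count;
            }

            // Divide each co-occurrence count by the total to obtain the
            // relative frequencies and emit the whole stripe at once.
            StringToDoubleMapWritable frequencies = new StringToDoubleMapWritable();
            for (Map.Entry<String, Integer> e : sums.entrySet()) {
                frequencies.put(e.getKey(), (double) e.getValue() / total);
            }
            context.write(word, frequencies);
        }
    }

Because each stripe already carries all co-occurrence counts of its first word, the denominator falls out of the summation for free, which is why no order inversion is needed in the Stripes version.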

Questions

  1. How much time did the job flow take according to the EMR console, how many instance-hours were charged, and how much did the job flow cost?

  2. Extrapolate the results of the previous question to processing the whole English Wikipedia, which currently has a size of ~20 GB.

Lab Deliverables

Submit your lab report (including the source code and answers to the questions) in PDF format via Moodle before Tuesday, October 26, 8:30am.