Lab 3: Relative Frequencies (Order Inversion)
In this lab we continue the Bigram example. So far we have computed the number of times two words co-occur as bigrams in a text. Now we want to extend the MapReduce program to compute relative frequencies instead of absolute counts. The relative frequency of a bigram (wi, wj) with respect to its first word wi is the number of occurrences of the pair (wi, wj) divided by the number of occurrences of the word wi. For instance, if the word "euro" followed by the word "crisis" occurs 10 times and the word "euro" occurs 20 times in total, the relative frequency of the pair ("euro", "crisis") is 0.5.
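As a sanity check, the definition can be expressed in a few lines of plain Java (the class and method names here are illustrative, not part of the lab skeleton):

```java
public class RelFreqExample {
    // Relative frequency f(wi, wj) = count(wi, wj) / count(wi)
    static double relativeFrequency(int bigramCount, int firstWordCount) {
        return (double) bigramCount / firstWordCount;
    }

    public static void main(String[] args) {
        // ("euro", "crisis") occurs 10 times, "euro" occurs 20 times (example above)
        System.out.println(relativeFrequency(10, 20)); // prints 0.5
    }
}
```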
Order Inversion Implementation
Your task is to implement relative frequencies using Pairs and Order Inversion. You can reuse some of the code from previous labs.
There is one file for this exercise called OrderInversion.java. The run method of the job is already implemented; your task is to implement the mapper, the reducer, and the partitioner. Note that inside the OrderInversion class there is a field called ASTERISK which can be used to output the total number of occurrences of a word.
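As a sketch of the idea only (plain Java, without the Hadoop Writable and Partitioner types used in the actual lab code): order inversion rests on two properties. The special key (wi, ASTERISK) must sort before every (wi, wj), and the partitioner must hash only the left word, so that the total count and all pair counts for wi reach the same reducer, total first:

```java
import java.util.Arrays;

public class OrderInversionSketch {
    static final String ASTERISK = "*"; // special right element, as in OrderInversion

    // Sort pairs by left word, then right word; "*" compares before any
    // lowercase letter, so (wi, *) arrives at the reducer before every (wi, wj).
    static int comparePairs(String[] a, String[] b) {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    }

    // Partition on the LEFT word only, so (wi, *) and all (wi, wj)
    // end up in the same reduce task.
    static int partition(String[] pair, int numReducers) {
        return (pair[0].hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[][] keys = { {"euro", "crisis"}, {"euro", ASTERISK}, {"euro", "zone"} };
        Arrays.sort(keys, OrderInversionSketch::comparePairs);
        for (String[] k : keys) System.out.println(k[0] + " " + k[1]); // "euro *" first
        // All three keys land in the same partition:
        System.out.println(partition(keys[0], 19) == partition(keys[2], 19)); // true
    }
}
```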
- Download the source code. The zip file contains the source code, a readme file and a pom.xml file.
- Complete the code by adding the mapper, the reducer and the partitioner (see TODO).
- Compile the code using mvn compile.
- Create a jar file using mvn package.
- Test the code on the file quote.
- Run the completed MapReduce program on Hadoop (in standalone/pseudo-distributed mode) using a single reducer:

  $ ${HADOOP_HOME}/bin/hadoop jar target/relfreq-1.0-SNAPSHOT.jar \
        heigvd.bda.labs.relfreq.OrderInversion \
        NUM_REDUCERS INPUT_PATH OUTPUT_PATH
- The INPUT_PATH refers to a folder on the local/distributed file system that contains the downloaded file.
- The OUTPUT_PATH must not exist on the local/distributed file system.
- The output (part-r-00000) should look like:
albert einstein 1.0
and not 1.0
be counted 1.0
can be 1.0
counted counts 1.0
counts and 0.5
counts can 0.5
everything that 1.0
not everything 1.0
that can 0.5
that counts 0.5
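If you want to cross-check numbers like these by hand, the relative frequencies of any small text can be computed directly with a plain-Java helper (hypothetical, not part of the lab skeleton):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class BigramRelFreq {
    // Compute f(wi, wj) = count(wi, wj) / count(wi) for every bigram in the text.
    static Map<String, Double> relFreqs(String text) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        Map<String, Integer> bigramCounts = new TreeMap<>();
        Map<String, Integer> firstWordCounts = new HashMap<>();
        for (int i = 0; i + 1 < words.length; i++) {
            bigramCounts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            firstWordCounts.merge(words[i], 1, Integer::sum);
        }
        Map<String, Double> rel = new TreeMap<>();
        for (Map.Entry<String, Integer> e : bigramCounts.entrySet()) {
            String wi = e.getKey().split(" ")[0];
            rel.put(e.getKey(), (double) e.getValue() / firstWordCounts.get(wi));
        }
        return rel;
    }

    public static void main(String[] args) {
        relFreqs("not everything that can be counted counts")
            .forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```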
Questions
Test your code on the file bible_shakes.nopunc.gz and answer the following questions (by considering the role of the combiner).
- Do you think the Order Inversion approach is 'faster' than a naive approach with multiple jobs? For example, consider a compound job that computes the numerator and the denominator separately and then performs the computation of the relative frequency.
- What is the impact of using a 'special' compound key on the amount of shuffled bytes?
Evaluation on Amazon EMR
In this task you implement relative frequencies using Stripes and evaluate the performance of your algorithm using a Wikipedia dataset on Amazon EMR.
- The Stripes version of relative frequencies is quite simple to implement (no order inversion necessary). Copy the Bigram Stripes code into a new Java file Stripes.java in the relfreq package. You have to modify the reduce() function to compute the sum of co-occurrences and divide each co-occurrence count by the computed sum. Hint: to hold the relative frequencies, implement a StringToDoubleMapWritable class that you can easily derive from the existing StringToIntMapWritable implementation.
- Test your code in standalone/pseudo-distributed mode.
- A dataset with a subset of Wikipedia (~5 GB) is available via Amazon S3 in the bucket s3n://mse-bda-data/data/wikipedia. The format of the file is as follows: each line starts with the article id, an underscore and the sentence id, followed by a space and the sentence content. All punctuation and non-alphabetic characters are removed, and all characters are lowercase. The file requires no additional pre-processing.
- Run the Stripes version of relative frequencies on the Wikipedia dataset using 11 large instances (1 master and 10 core instances) on Amazon EMR. Fix the number of reducers to 19. Hint: the job should take less than an hour.
- Repeat the previous step with the Pairs version of relative frequencies and compare the results with the Stripes version.
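The reduce-side normalization needed for the Stripes version can be sketched in plain Java, with an ordinary HashMap standing in for the suggested StringToDoubleMapWritable (values taken from the "euro" example; the "zone" entry is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class StripesNormalize {
    // Given the merged stripe of a word wi (co-occurrence counts of every wj),
    // compute relative frequencies by dividing each count by the stripe's sum.
    static Map<String, Double> normalize(Map<String, Integer> stripe) {
        int sum = 0;
        for (int c : stripe.values()) sum += c;
        Map<String, Double> rel = new HashMap<>();
        for (Map.Entry<String, Integer> e : stripe.entrySet()) {
            rel.put(e.getKey(), (double) e.getValue() / sum);
        }
        return rel;
    }

    public static void main(String[] args) {
        Map<String, Integer> stripe = new HashMap<>();
        stripe.put("crisis", 10); // from the earlier example
        stripe.put("zone", 10);   // hypothetical, so that count("euro") = 20
        System.out.println(normalize(stripe)); // both entries become 0.5
    }
}
```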
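For reference, splitting a dataset line into article id, sentence id, and sentence content (format as documented above; the example line is made up) could look like:

```java
public class WikiLineParser {
    // Split "<articleId>_<sentenceId> <sentence>" into its three parts.
    static String[] parse(String line) {
        int space = line.indexOf(' ');
        String id = line.substring(0, space);
        int us = id.indexOf('_');
        return new String[] {
            id.substring(0, us),        // article id
            id.substring(us + 1),       // sentence id
            line.substring(space + 1)   // sentence content
        };
    }

    public static void main(String[] args) {
        // Hypothetical line in the documented format
        String[] parts = parse("12_3 the quick brown fox");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```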
Questions
- How much time did the job flow take according to the EMR console, how many instance hours were charged, and how much did the job flow cost?
- Extrapolate the results of the previous question to processing the whole English Wikipedia, which currently has a size of ~20 GB.
Lab Deliverables
Submit your lab report (including the source code and answers to the questions) in PDF format via Moodle before Tuesday, October 26, 8:30am.