Lab2: Bigrams (Pairs and Stripes)
The Bigram example consists of counting the co-occurrences of pairs of words. Formally speaking, for this lab we consider Bigrams to be every ordered pair of sequential words in the text appearing in the same line. In simpler terms, the word w1 co-occurs with the word w2 if the two words are in the same line and the first word is immediately followed by the second one. For example, the text with the line "UNIX is simple" contains the pairs ("UNIX", "is") and ("is", "simple"). The co-occurrence for each pair is 1.
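As a plain-Java illustration (outside Hadoop), bigram extraction from a single line can be sketched as follows; extractBigrams is a hypothetical helper name, not part of the provided code:

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    // Split a line on whitespace and return each ordered pair of
    // consecutive words, joined as "w1 w2".
    static List<String> extractBigrams(String line) {
        String[] words = line.trim().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.add(words[i] + " " + words[i + 1]);
        }
        return bigrams;
    }

    public static void main(String[] args) {
        // The line "UNIX is simple" yields the pairs
        // ("UNIX", "is") and ("is", "simple").
        System.out.println(extractBigrams("UNIX is simple"));
    }
}
```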
There are two possible implementations for counting the co-occurrences. In the following, we give the implementation using the Pairs design pattern and ask you to improve the code using the Stripes design pattern. In the implementation using the Pairs design pattern, the key is a pair of words and the value is an integer. TextPair.java provides an implementation of the Pair design pattern for Text objects in Hadoop.
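The essence of the Pairs pattern, a composite (w1, w2) key mapped to an integer count, can be sketched in plain Java. In the Hadoop job the composite key is a TextPair rather than a String, and the counting is split across mappers and reducers; this sketch only shows the data shape:

```java
import java.util.HashMap;
import java.util.Map;

public class PairsDemo {

    // Pairs pattern in miniature: the key is the bigram itself
    // (here a String "w1 w2" standing in for a TextPair), and the
    // value is its co-occurrence count.
    static Map<String, Integer> countPairs(String line) {
        Map<String, Integer> counts = new HashMap<>();
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            counts.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countPairs("everything that counts can be counted"));
    }
}
```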
Pairs Implementation (provided)
The source code of the Pairs implementation for the bigram example is provided and commented. Follow these steps to test the Bigram example:
- Download the source code. The zip file contains the source code, a readme file and a pom.xml file.
- Try to read and understand the code line by line.
- Compile the code using mvn compile.
- Create a jar file using mvn package.
- Test the code on the quote file.
- Run the Bigram count on Hadoop (in standalone/pseudo-distributed mode):
  $ ${HADOOP_HOME}/bin/hadoop jar target/bigram-1.0-SNAPSHOT.jar \
        heigvd.bda.labs.bigram.Pairs NUM_REDUCERS INPUT_PATH OUTPUT_PATH
- INPUT_PATH refers to a folder on the local/distributed file system that contains the downloaded file.
- OUTPUT_PATH should not exist on the local/distributed file system.
- The output (part-r-00000) should look like this (without pre-processing):
Albert Einstein 1
Not everything 1
and not 1
be counted 1
be counted. 1
can be 2
counted counts, 1
counts can 1
counts, and 1
everything that 2
not everything 1
that can 1
that counts 1
Stripes Implementation
Your task is to improve the Bigram implementation using the Stripes design pattern presented in the course. All improvements and pre-processing (including case-folding and removing non-alphabetic words) learned in the WordCount example should be taken into account for this lab.
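One possible pre-processing step is sketched below. The exact rules are an assumption here, since they should come from your WordCount solution: case-fold each token and keep only purely alphabetic words.

```java
public class Preprocess {

    // Case-fold a token and keep it only if it is purely alphabetic;
    // return null otherwise. The exact rules (e.g. whether to strip
    // trailing punctuation instead of dropping the word) should match
    // your WordCount solution -- this is an assumed policy.
    static String normalize(String token) {
        String t = token.toLowerCase();
        return t.matches("[a-z]+") ? t : null;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Counted"));   // counted
        System.out.println(normalize("counts,"));   // null (not purely alphabetic)
    }
}
```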
Copy the file Pairs.java into a new file Stripes.java (inside the same package) and modify the code. An implementation of a string-to-int associative array (that is, a HashMap) required for the Stripes pattern is provided in the class StringToIntMapWritable. When you have completed the code, use Maven to compile and create a jar file.
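The Stripes idea can be sketched in plain Java: for each first word, build an associative array from following words to counts. In the actual job this inner map is a StringToIntMapWritable emitted as the value, with the first word as the key; the sketch below only illustrates the data shape, not the mapper/reducer split:

```java
import java.util.HashMap;
import java.util.Map;

public class StripesDemo {

    // Stripes pattern in miniature: one key per first word w1, whose
    // value is a map ("stripe") from each following word w2 to the
    // number of times the bigram (w1, w2) occurs.
    static Map<String, Map<String, Integer>> buildStripes(String line) {
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        String[] words = line.trim().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            stripes.computeIfAbsent(words[i], k -> new HashMap<>())
                   .merge(words[i + 1], 1, Integer::sum);
        }
        return stripes;
    }

    public static void main(String[] args) {
        // "can be counted can be" -> can:{be=2}, be:{counted=1}, counted:{can=1}
        System.out.println(buildStripes("can be counted can be"));
    }
}
```

In the reducer, stripes with the same key are merged by summing their entries, which is why far fewer (but larger) key-value pairs cross the shuffle than with Pairs.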
Questions
Test your code on the file bible_shakes.nopunc.gz and answer the following questions in the order that they appear:
- Do you think the reducer of the Stripes implementation can be used as a combiner? How does the result change when you use the reducer as a combiner?
- Do you think the Stripes pattern could be used with the in-Mapper combiner pattern? If yes, implement it inside the ImprovedStripes class. Do not use the reducer as a combiner in this exercise.
- How does the number of reducers influence the behavior of the Stripes approach?
- How does the number of reducers influence the behavior of the Pairs approach?
- Compared to the Pairs pattern, for a given number of reducers, how many bytes are shuffled using the Stripes pattern without the in-Mapper combiner pattern? With the in-Mapper combiner pattern?
- Compare the Pairs and Stripes design patterns and mention their advantages and drawbacks.
Running Hadoop on the Amazon cloud
Your task is to run the code that you have developed on the Amazon cloud. Amazon calls the service that provides Hadoop clusters Elastic MapReduce (EMR).
To run your code on Amazon, follow the EMR Getting Started Guide with some small modifications. The instructions in the guide comprise eight steps; for this lab we can skip some steps and modify others, as explained in the following.
- Step 1: Sign up for the AWS Service - We assume you have already created an AWS account.
- Step 2: Create the Amazon S3 Bucket - Perform this step to create a bucket to hold the jar file and the output data.
  - Choose the S3 Service.
  - Create a bucket with the name mse-bda-your_last_name. Note that the bucket name must be unique across all existing bucket names in Amazon S3.
  - Choose the region US Standard.
- Step 3: Upload Your Data - You can skip this step as we have already uploaded the data for you. The test corpus is available in the bucket mse-bda-data at the location data/bible_shakes.nopunc.gz, or on the web at https://s3.amazonaws.com/mse-bda-data/data/bible_shakes.nopunc.gz.
- Step 4: Upload Your Jar - Upload your jar file to the S3 bucket you created. Write down its path (which starts with s3n://mse-bda-your_last_name); you will need it later.
- Step 5: Create Security Credentials:
  - Choose Security Credentials from your user account and click Continue to Security Credentials.
  - Choose Access Keys and click Create New Access Key.
  - Download the key file and save it on your computer.
  - Make sure that the status of the Access Key is Active.
- Step 6: Launch the Cluster using Custom Jar:
  - Choose the Elastic MapReduce Service.
  - Change your region to US East (N. Virginia).
  - Click Create Cluster.
  - At the step Create Cluster, select Go to advanced options and enter the following information:
    - Cluster Name: Enter a name. We recommend a descriptive name. It does not need to be unique.
    - Termination Protection: Select No.
    - Log folder S3 location: Browse your bucket. You can create a folder logs on your bucket to hold the log files.
  - At the step Software Configuration, enter the following information:
    - Hadoop distribution: Select Amazon.
    - AMI Version: Select Hadoop 1.0.3 (AMI version 2.4.2).
    - Applications to be installed: Remove Hive and Pig.
  - At the step Hardware Configuration, select the type and number of instances (keep the default values).
  - At the step Steps, enter the following information:
    - Add step: Select Custom JAR.
    - Select Configure and add.
    - Name: You can choose a name or leave the default value.
    - JAR Location: Browse your bucket to select your jar file in S3.
    - JAR Arguments: The arguments you would normally pass on the command line after the jar file (a series of strings separated by spaces). This is normally the name of the main class followed by the program arguments. In this case, the arguments are:
      - Main class: heigvd.bda.labs.bigram.Pairs
      - Number of reducers
      - Input path and output path: For S3, these can be specified in the form s3n://bucket-name/path. The input data is located at s3n://mse-bda-data/data/bible_shakes.nopunc.gz; for the output path, specify a path in the bucket you created earlier that does not yet exist, like s3n://mse-bda-einstein/output.
    - Action on failure: Choose Terminate cluster.
    - Choose Add.
  - Auto-terminate: Choose Yes.
  - Click Create Cluster.
  - The Amazon EMR console shows the new cluster starting. Starting a new cluster may take several minutes, depending on the number and type of EC2 instances Amazon EMR is launching and configuring. Click the Refresh button to update the view of the cluster's progress.
- Step 7: Monitor the Cluster - You can monitor your cluster using Monitoring and Steps. Measure the approximate time the job flow spends in its different states: STARTING, RUNNING, SHUTTING DOWN and COMPLETED.
- Step 8: View the Results - You can check the output created in your bucket on S3. You can also view the debug log files created for each step.
- Step 9: Clean Up - Make sure that your cluster is terminated. The status should be Terminated.
Questions
Answer the following questions in the order that they appear:
- How much time did the job flow take according to the EMR console, and how long did it take in total (wall-clock time)?
- How many EC2 instances were created to run the job and what were their roles? How many normalized instance hours did Amazon bill for the job?
Lab Deliverables
Submit your lab report (including the source code and answers to the questions) in PDF format via Moodle before Tuesday, October 13, 8:30am.