Big Data

Lab 4: Inverted Index

In this lab, you'll be creating an inverted index. An inverted index is a data structure common to nearly all information retrieval systems. Let us consider the following famous lines from Shakespeare's The Merchant of Venice:

1: if you prick us do we not bleed
2: if you tickle us do we not laugh
3: if you poison us do we not die and
4: if you wrong us shall we not revenge

An inverted index consists of a collection of postings lists, one associated with each unique term in the collection. Each postings list consists of a number of individual postings. Each posting holds a document identifier (docno) and the frequency (i.e., count) of the term in that document.

Let's treat each line in the above sample data as if it were a "document". The complete inverted index would look something like this:

and     : 1 : (3, 1)
bleed   : 1 : (1, 1)
die     : 1 : (3, 1)
do      : 3 : (1, 1), (2, 1), (3, 1)
if      : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
laugh   : 1 : (2, 1)
not     : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
poison  : 1 : (3, 1)
prick   : 1 : (1, 1)
revenge : 1 : (4, 1)
shall   : 1 : (4, 1)
tickle  : 1 : (2, 1)
us      : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
we      : 4 : (1, 1), (2, 1), (3, 1), (4, 1)
wrong   : 1 : (4, 1)
you     : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

As you can see, we have a postings list for each word that appears in the collection. Let us look at the postings list corresponding to the term if in a bit more detail:

if      : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

The number directly after the term is its document frequency or df for short. The df specifies the number of documents that contain this term. Since if appears in all four documents, its df is 4. Although the df can be easily reconstructed by counting the number of postings, it is often explicitly stored in the inverted index. The postings list contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through four, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term if appears once in every document. Typically, postings are sorted by ascending docno (as shown above).

Practical Tips

In this exercise, you'll have to create and manipulate postings lists, which are complex objects that have their own internal structure. Let's consider this term and its associated postings list as an example:

if      : 4 : (1, 1), (2, 1), (3, 1), (4, 1)

You have several options for representing postings lists. First, you can encode them as Java strings wrapped inside Hadoop Text objects. The string format might look something like this:

4:1,1:2,1:3,1:4,1

The downside is that when manipulating postings, you'll have to do a lot of string-based operations (e.g., splits). This approach works, but it's pretty ugly. The second approach is to write your own custom Writable. As an example, you can look at the TextPair and StringToIntMapWritable classes provided in previous labs.
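
As a rough illustration of the second approach, here is a minimal sketch of a custom Writable that holds a single (docno, tf) posting. The class name PairOfInts is made up for this example; only the Writable interface and DataInput/DataOutput come from the Hadoop and Java APIs, and a complete postings list would typically wrap an array or list of such postings.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A single posting: (docno, tf). Hypothetical class for illustration only.
public class PairOfInts implements Writable {
  private int docno;
  private int tf;

  public PairOfInts() {}                      // Hadoop needs a no-arg constructor

  public PairOfInts(int docno, int tf) {
    this.docno = docno;
    this.tf = tf;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(docno);                      // serialize the fields in a fixed order
    out.writeInt(tf);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    docno = in.readInt();                     // deserialize in the same order
    tf = in.readInt();
  }

  public int getDocno() { return docno; }
  public int getTf()    { return tf; }
}

Note that an object used as a key (rather than a value) must implement WritableComparable instead of plain Writable.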

This exercise is a good opportunity to learn about different output formats. An OutputFormat (see the Hadoop API) describes how output key-value pairs are written to HDFS. By default, Hadoop uses TextOutputFormat, which writes out the key-value pairs in human-readable text format. This is convenient for inspection, but can be annoying if you want to further manipulate the output programmatically, since you'll have to read the text file back in and parse the key-value pairs into Java objects (even if you have your own custom Writables).

As an alternative, you might want to consider SequenceFileOutputFormat. You can select that format with the setOutputFormatClass method of the Job class. If you do this, the output of your MapReduce job will be stored in one or more SequenceFiles. The advantage of SequenceFiles is that they store key-value pairs in a machine-readable format, i.e., as serialized byte representations of the Java objects (not human readable, but easy to manipulate programmatically). The SequenceFile API provides methods for reading key-value pairs, saving you the effort of manually parsing plain text files. Of course, SequenceFiles aren't very useful if you are using Text objects as output values, since you'd still end up parsing strings.
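
To make this concrete: in the driver you would call job.setOutputFormatClass(SequenceFileOutputFormat.class), and a small utility along the following lines can read the resulting file back. This is only a sketch; the class name is made up, and the path, key class, and value class (here both plain Text) must match whatever your job actually writes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical utility: dump the (key, value) pairs of one SequenceFile.
public class ReadSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);            // e.g. one of the job's output part files

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    Text key = new Text();                    // must match the job's output key class
    Text value = new Text();                  // must match the job's output value class
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}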

Along the same lines, you might also want to take a look at MapFileOutputFormat, which writes the output key-value pairs to... as you've guessed, MapFiles! These files have the additional advantage of supporting random access to the keys. You should learn to use SequenceFiles, but MapFiles are likely more useful for this exercise. Once again, see the API for details.
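
As a sketch of what that random access buys you, the snippet below opens a single MapFile (each reducer's output directory is one MapFile) and looks up one term. It assumes Text keys and Text values, and the class name is made up; with several reducers you first need to know, or try, which partition holds the term.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical utility: look up the postings list of one term in a MapFile.
public class LookupTerm {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // args[0]: a MapFile directory produced by one reducer, args[1]: the term
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Text postings = new Text();               // must match the job's output value class
    Writable found = reader.get(new Text(args[1]), postings);
    System.out.println(found == null ? "term not found" : postings.toString());
    reader.close();
  }
}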

Simple Implementation (without Secondary Sort)

Your task is to write a MapReduce program that builds an inverted index (as described above). A template is provided here. Each postings list should explicitly store the df, as well as all the individual postings. Postings should be sorted by ascending docno (postings corresponding to smaller docnos should precede postings corresponding to larger docnos). Note that the description above only specifies the logical structure of the inverted index and you are free in your choice of data structures for the actual implementation (e.g., each posting does not literally need to be a tuple denoted by parentheses).
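
To fix ideas, here is a minimal sketch of what the map and reduce logic could look like if you encode each posting as a simple "docno,tf" string inside a Text value. This is not the provided template: the class names are made up, the driver is omitted, and you are free to structure your own solution and data types differently.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

  // Mapper: one input line = one "document"; the key is its byte offset (docno).
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable docno, Text line, Context context)
        throws IOException, InterruptedException {
      // Count the term frequency (tf) of each term within this line.
      Map<String, Integer> tfs = new HashMap<String, Integer>();
      for (String term : line.toString().split("\\s+")) {
        if (term.isEmpty()) continue;
        Integer c = tfs.get(term);
        tfs.put(term, c == null ? 1 : c + 1);
      }
      // Emit one posting per distinct term, encoded as "docno,tf".
      for (Map.Entry<String, Integer> e : tfs.entrySet()) {
        context.write(new Text(e.getKey()), new Text(docno.get() + "," + e.getValue()));
      }
    }
  }

  // Reducer: buffer all postings for a term, sort by docno, prepend the df.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context context)
        throws IOException, InterruptedException {
      List<long[]> buffer = new ArrayList<long[]>();
      for (Text p : postings) {
        String[] parts = p.toString().split(",");
        buffer.add(new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) });
      }
      // In-memory sort by ascending docno -- this buffering is exactly the
      // scalability bottleneck addressed by the secondary-sort version later on.
      Collections.sort(buffer, new Comparator<long[]>() {
        public int compare(long[] a, long[] b) {
          return a[0] < b[0] ? -1 : (a[0] > b[0] ? 1 : 0);
        }
      });
      StringBuilder sb = new StringBuilder().append(buffer.size());  // the df
      for (long[] p : buffer) {
        sb.append(" (").append(p[0]).append(", ").append(p[1]).append(")");
      }
      context.write(term, new Text(sb.toString()));
    }
  }
}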

Run the inverted indexer on your local machine with the sample input provided earlier, bible_shakes.nopunc.gz. As above, treat each line as if it were an individual "document". When you map over a plain text file using TextInputFormat in Hadoop, the key passed to the mapper contains the byte offset of the line from the beginning of the file, while the value contains the text of the line. Use this offset value as the unique docno. As part of this exercise you'll also need to write a utility (outside MapReduce) that takes a given docno (i.e., the byte offset) and returns the associated line.
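
A minimal sketch of such a lookup utility is given below. It assumes the docno really is a byte offset into the uncompressed plain text (so you may need to gunzip bible_shakes.nopunc.gz first and run the lookup against the uncompressed copy); the class name is made up.

import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical utility: print the line ("document") starting at a byte offset.
public class LineLookup {
  public static void main(String[] args) throws IOException {
    String file = args[0];                    // the uncompressed input file
    long docno = Long.parseLong(args[1]);     // the byte offset used as docno

    RandomAccessFile raf = new RandomAccessFile(file, "r");
    raf.seek(docno);                          // jump straight to the offset
    System.out.println(raf.readLine());       // read until the end of that line
    raf.close();
  }
}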

Questions

  1. Look up the postings corresponding to the term "starcross". There should only be one line in the entire collection that contains the term. What is that line? What's its docno (i.e., byte offset)?

  2. Look up the postings corresponding to the term "gold". Generate a histogram of tf values. That is, in how many lines does "gold" appear once, twice, three times, etc.?

  3. Do the same for the terms "silver" and "bronze".

Improved Implementation (using Secondary Sort)

Your task is now to improve your simple implementation and test your code on the Wikipedia dataset using Amazon EMR.

The Simple Implementation currently buffers and sorts postings in the reducer, which, as we discussed in class, is not a scalable solution. Address this scalability bottleneck using the techniques discussed in class. In this implementation, you will have to define a custom partitioner.
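
As a hint of what this looks like, the sketch below assumes the secondary-sort design in which the mapper emits a composite key containing both the term and the docno (here encoded as "term<TAB>docno" in a Text), so that the framework sorts postings by docno on their way to the reducer. The partitioner must then partition on the term alone, otherwise postings for the same term would be scattered across reducers. The class name and key encoding are placeholders; a custom WritableComparable key works the same way.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner for composite keys of the form "term\tdocno".
public class TermPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    // Hash only the term part, so all (term, docno) keys for one term
    // end up in the same reduce partition.
    String term = key.toString().split("\t")[0];
    return (term.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Register it in the driver with job.setPartitionerClass(TermPartitioner.class); depending on your key design you may also need a custom sort comparator and grouping comparator.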

Dataset: The Wikipedia dataset (~5GB) is available via Amazon S3 in the bucket s3n://mse-bda-data/data/wikipedia. The folder contains several files, and each file contains several Wikipedia articles. The file format is as follows: a line starting with ### contains the article id, and the following line contains the entire article content (with no line breaks or carriage returns).

Evaluation: Run your code on the Wikipedia dataset using 11 large instances (1 master and 10 core instances) on Amazon EMR. Fix the number of reducers to 19.

Questions

  1. How many mappers do you have in total? How can you estimate the number of mappers for a given dataset?

  2. How much time did the job flow take according to the EMR console? How many instance hours were charged, and how much did the job flow cost?

  3. Extrapolate the results of the previous question to processing the whole English Wikipedia, which currently has a size of ~20 GB.

Lab Deliverables

Submit your lab report (including the source code and answers to the questions) in PDF format via Moodle before Tuesday, November 10, 8:30 am.