This assignment is to be completed in MapReduce in Java. You will be working in the same repo as before, except that everything should go into the package namespace ca.uwaterloo.cs451.a3.
Look at the inverted indexing and boolean retrieval implementation in Bespin. Make sure you understand the code. Starting from the inverted indexing baseline BuildInvertedIndex, modify the indexer code in the following ways:
1. Index compression. The index should be compressed using VInts: see org.apache.hadoop.io.WritableUtils. You should also use gap compression (i.e., delta compression) techniques as appropriate; see the sketch after this list.
2. Buffering postings. The baseline indexer implementation currently buffers and sorts postings in the reducer, which, as we discussed in class, is not a scalable solution. Address this scalability bottleneck using techniques we discussed in class and in the textbook.
3. Term partitioning. The baseline indexer implementation currently uses only one reducer and therefore all postings lists are shuffled to the same node and written to HDFS in a single partition. Change this so we can specify the number of reducers (hence, partitions) as a command-line argument. This is, of course, easy to do, but we need to make sure that the searcher understands this partitioning also.
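
To make #1 concrete, here is a minimal sketch of VInt coding with gap compression using org.apache.hadoop.io.WritableUtils. The class and the sample postings are illustrative only; they are not part of Bespin or the assignment scaffolding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;

// Hypothetical demo: round-trip (docno, tf) postings through
// gap-compressed VInts. Assumes postings are sorted by docno.
public class VIntGapDemo {
  public static void main(String[] args) throws IOException {
    int[][] postings = { {13, 2}, {44, 1}, {45, 7} }; // (docno, tf) pairs

    // Encode: write the gap between successive docnos, not the docno itself.
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    int prev = 0;
    for (int[] p : postings) {
      WritableUtils.writeVInt(out, p[0] - prev); // small gaps take few bytes
      WritableUtils.writeVInt(out, p[1]);        // term frequency
      prev = p[0];
    }
    out.close();

    // Decode: accumulate gaps to recover absolute docnos.
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
    int docno = 0;
    for (int i = 0; i < postings.length; i++) {
      docno += WritableUtils.readVInt(in);
      int tf = WritableUtils.readVInt(in);
      System.out.println(docno + "\t" + tf);
    }
  }
}
```

Gap compression pays off because VInts use fewer bytes for smaller values, and the gaps between sorted docnos are much smaller than the docnos themselves.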
Note: Do not unnecessarily change code not directly related to points #1-#3 above. In particular, do not change how the documents are tokenized, and do not remove df in BuildInvertedIndex.
Note: The major scalability issue is buffering uncompressed postings in memory. In your solution, you'll still end up buffering each postings list, but in compressed form (raw bytes, no additional object overhead). This is fine because if you use the right compression technique, the postings lists are quite small. As a data point, on a collection of 50 million web pages, 2GB heap is more than enough for a full positional index (and in this assignment you're not asked to store positional information in your postings).
To go into a bit more detail: in the reference implementation, the final value type (the postings list emitted for each term) is PairOfWritables<IntWritable, ArrayListWritable<PairOfInts>>. The most obvious idea is to change that into something like PairOfWritables<VIntWritable, ArrayListWritable<PairOfVInts>>. This does not work! The reason is that you would still be materializing every posting, i.e., all the PairOfVInts objects, in memory. This translates into a Java object for every posting, which is wasteful in terms of memory usage and will exhaust memory pretty quickly as you scale. In other words, you're still buffering objects, just inside the ArrayListWritable.
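
What does work, sketched below, is to buffer each postings list as raw compressed bytes and emit it as a BytesWritable. This is only an illustration, under the assumption that the reducer sees each term's postings in increasing docno order (e.g., via the value-to-key / secondary-sort pattern discussed in class); the class and method names are made up, not the reference solution.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableUtils;

// Hypothetical buffer holding one term's postings as raw compressed bytes:
// no Java object per posting, so memory use stays proportional to the
// compressed size of the list.
public class CompressedPostings {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final DataOutputStream out = new DataOutputStream(buffer);
  private int prevDocno = 0;
  private int df = 0; // document frequency (kept, per the note above)

  // Append one posting; must be called in increasing docno order.
  public void add(int docno, int tf) throws IOException {
    WritableUtils.writeVInt(out, docno - prevDocno); // gap-compressed docno
    WritableUtils.writeVInt(out, tf);
    prevDocno = docno;
    df++;
  }

  // Serialize the list as: df, followed by the compressed postings.
  public BytesWritable toBytesWritable() throws IOException {
    out.flush();
    ByteArrayOutputStream result = new ByteArrayOutputStream();
    DataOutputStream header = new DataOutputStream(result);
    WritableUtils.writeVInt(header, df);
    buffer.writeTo(result); // append the already-compressed postings
    header.close();
    return new BytesWritable(result.toByteArray());
  }
}
```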
This new indexer should be named BuildInvertedIndexCompressed. This new class should be in the package ca.uwaterloo.cs451.a3. Make sure it works on the Shakespeare collection.
Modify BooleanRetrieval so that it works with the new compressed indexes. Name this new class BooleanRetrievalCompressed. This new class should be in the same package as above and give the same output as the old version.
Use BuildInvertedIndex and BooleanRetrieval from Bespin as your starting points. That is, copy them over into your repo, rename them, and begin your assignment from there. Don't unnecessarily change code not directly related to points #1-#3 above. In particular, do not change how the documents are tokenized, etc. in BuildInvertedIndex (otherwise there's no good way to check the correctness of your algorithm). Also, do not change the fetchLine method in BooleanRetrieval, so that everyone's output looks the same.
In more detail, we should be able to build the inverted index with the following command (your implementation must run in the Linux student CS environment, as that is where we will be doing the marking):
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
   -input data/Shakespeare.txt -output cs451-bigdatateach-a3-index-shakespeare -reducers 4
We should be able to control the number of partitions (#3 above) with the -reducers option. That is, the code should give the correct results no matter what we set the value to.
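
For example, if you keep Hadoop's default HashPartitioner, the searcher can reproduce its computation to determine which partition holds a given term. A minimal sketch, assuming the index is keyed by Text terms and written to the standard part-r-NNNNN files (the helper names here are hypothetical):

```java
import org.apache.hadoop.io.Text;

// Hypothetical helper: locate the partition holding a term's postings by
// mirroring org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.
public class PartitionLocator {
  public static int partitionFor(Text term, int numReducers) {
    // Same formula HashPartitioner applies on the indexing side.
    return (term.hashCode() & Integer.MAX_VALUE) % numReducers;
  }

  public static String partFile(String indexPath, String term, int numReducers) {
    // Reducer outputs are named part-r-00000, part-r-00001, ...
    return String.format("%s/part-r-%05d",
        indexPath, partitionFor(new Text(term), numReducers));
  }
}
```

The searcher would then open only the matching partition file (e.g., one reader per partition, in the style of BooleanRetrieval in Bespin) instead of assuming a single output file.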
Once we build the index, we should then be able to run a boolean query as follows (in exactly the same manner as BooleanRetrieval in Bespin):
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "outrageous fortune AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "white red OR rose AND pluck AND"
Of course, we will try your program with additional queries to verify its correctness.
HINT: The size of the compressed index for the Shakespeare collection is 3.7 MB using our sample solution. You can use du -h to read the size of the index file. If the size of your index file is different, your code probably has a bug and you will lose the correctness mark.
NOTE: If you use hadoop fs -du -h on student.cs, you will get 3.6 MB instead! The two tools round differently in "human readable" mode.
Now let's try running your implementation on the Datasci cluster, on the sample Wikipedia file /data/cs451/enwiki-20180901-sentences-0.1sample.txt on HDFS:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndexCompressed \
   -input /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -output cs451-bigdatateach-a3-index-wiki -reducers 4
The Wikipedia sample contains a sentence on each line, so each "document" is actually a sentence. Each sentence begins with the article title and the sentence id, e.g., "Anarchism.0004" is sentence 4 from the article "Anarchism".
And let's try running a query:
$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-wiki \
   -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -query "waterloo stanford OR cheriton AND"

$ hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrievalCompressed \
   -index cs451-bigdatateach-a3-index-wiki \
   -collection /data/cs451/enwiki-20180901-sentences-0.1sample.txt \
   -query "big data AND hadoop spark OR AND"

HINT: When we run hadoop fs -du -h, the size of the compressed index for the sample Wikipedia collection is:

940.0 M  2.8 G  cs451-2022w-a3-index-wiki/part-r-00000

If the size of your index file is different, your code probably has a bug and you will lose the correctness mark.

Update: This file was generated with one reducer, so it is the entire index. 940.0 MB is the size of the file; the second number, 2.8 G, is the capacity used on the HDFS cluster (940 MB × 3 ≈ 2.8 GB).
Please follow these instructions carefully!
Make sure your repo has the following items:

- Your solution code for this assignment, in the package ca.uwaterloo.cs451.a3.

Make sure your implementation runs in the Linux student CS environment on the Shakespeare collection, and also on the sample Wikipedia file /data/cs451/enwiki-20180901-sentences-0.1sample.txt on HDFS in the Datasci cluster, per above.
Specifically, we will clone your repo and use the following check scripts:

- check_assignment3_public_linux.py in the Linux Student CS environment.
- check_assignment3_public_datasci.py on the Datasci cluster.

When you've done everything, commit to your repo and remember to push back to origin. You should be able to see your edits in the web interface. Before you consider the assignment "complete", we recommend that you verify everything above works by performing a clean clone of your repo and running the public check scripts.
That's it!
This assignment is worth a total of 50 points, awarded for running check_assignment3_public_linux.py (on Linux Student CS) and check_assignment3_public_datasci.py (on the Datasci cluster) successfully, without any errors.

To help you gauge the efficiency of your solution, we are giving you the running times of our reference implementations. Keep in mind that these are machine dependent and can vary depending on the server/cluster load.
| Class name | Running time (Linux) | Running time (Datasci) |
|---|---|---|
| BuildInvertedIndexCompressed | 20 seconds | 5 minutes |
| BooleanRetrievalCompressed | < 100 ms (without Hadoop startup time) | < 250 ms (without Hadoop startup time) |