README

This package finds clusters of near-duplicate documents in a large corpus
using shingling. Below is an example of two near-duplicate passages.

Example 1

1. C++ (pronounced "see plus plus") is a statically typed, free-form,
multi-paradigm, compiled, general-purpose programming language. It is regarded
as an intermediate-level language, as it comprises a combination of both
high-level and low-level language features.

2. C++ (pronounced "see plus plus") is a statically typed, compiled,
general purpose programming language. It is regarded as an intermediate level
language, as it comprises a combination of both high-level and low-level
language features.

The code essentially computes the set of n-grams (shingles) in each document
and then computes the Jaccard coefficient between the n-gram sets of each pair
of documents.
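The idea above can be sketched as follows. This is only an illustration and is
not taken from the package: it assumes word-level n-grams split on whitespace,
and the function names wordShingles and jaccard are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Return the set of word n-grams (shingles) of a text.
// ASSUMPTION: words are whitespace-separated; the package may tokenize differently.
std::set<std::string> wordShingles(const std::string& text, std::size_t n) {
    assert(n > 0);
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);

    std::set<std::string> result;
    for (std::size_t i = 0; i + n <= words.size(); ++i) {
        std::string gram = words[i];
        for (std::size_t j = 1; j < n; ++j) {
            gram += ' ';
            gram += words[i + j];
        }
        result.insert(gram);
    }
    return result;
}

// Jaccard coefficient: size of the intersection divided by size of the union.
// Two empty sets are treated as identical in this sketch.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t common = 0;
    for (const auto& s : a) common += b.count(s);
    return static_cast<double>(common) /
           static_cast<double>(a.size() + b.size() - common);
}
```

For instance, "a b c d" and "a b c e" each have two trigrams and share one of
them ("a b c"), so their Jaccard coefficient is 1/3.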

INSTALLATION

The code uses the Boost libraries for matrix computations, handling program
arguments, etc. Make sure to install Boost before compiling this package.
Also, change the path on the second line of the Makefile to the home directory
of your Boost installation.

The package can then be compiled by running "make".

RUNNING THE CODE

Running make produces the main executable, clusterShingles, which can be used
as follows to cluster a set of tweets.

./clusterShingles [OPTIONS] --meta_file filename.txt

ALLOWED OPTIONS

1) --help

Produces a help message with a description of the allowed options.

2) --similarity_th arg

The similarity threshold at which two tweets are considered near-duplicates.
Defaults to 0.75.

3) --shingle_size arg

The size of the shingles. Defaults to 3 (trigrams).

4) --number_hashes arg

The number of hash functions used for computing the Jaccard similarity.
Defaults to 100.

5) --maxWordsInADocumentFromBeginning arg

The maximum number of words, counted from the beginning of each document,
that are used. This option is useful when you want to find near-duplicates
using only the first few sentences of each document. Defaults to 500.

6) --minClusterSize arg

The minimum number of near-duplicates required to form a cluster. Defaults
to 4.

7) --dump_sim

Outputs the similarity values computed between the tweets if set to true.
Defaults to false.

8) --meta_file filename.txt

The file containing the tweets to be clustered. 
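
The --number_hashes option suggests that the Jaccard similarity is estimated
from hash-based signatures rather than computed exactly. One common way to do
this is MinHash. The following is a minimal sketch of that technique, not
necessarily the scheme used by clusterShingles; the names mixHash,
minhashSignature and estimateSimilarity are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <set>
#include <string>
#include <vector>

// 64-bit mixing function (splitmix64-style finalizer), used here to derive
// many different hash functions from a single base hash.
std::uint64_t mixHash(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// For each of numHashes hash functions, keep the minimum hash value over
// all shingles in the set.
std::vector<std::uint64_t> minhashSignature(const std::set<std::string>& shingleSet,
                                            std::size_t numHashes) {
    assert(numHashes > 0);
    std::vector<std::uint64_t> sig(numHashes,
                                   std::numeric_limits<std::uint64_t>::max());
    for (const auto& s : shingleSet) {
        const std::uint64_t base = std::hash<std::string>{}(s);
        for (std::size_t k = 0; k < numHashes; ++k) {
            const std::uint64_t h = mixHash(base ^ mixHash(k));
            if (h < sig[k]) sig[k] = h;
        }
    }
    return sig;
}

// The fraction of positions where two signatures agree estimates the
// Jaccard coefficient of the underlying shingle sets.
double estimateSimilarity(const std::vector<std::uint64_t>& a,
                          const std::vector<std::uint64_t>& b) {
    assert(!a.empty() && a.size() == b.size());
    std::size_t matches = 0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        if (a[k] == b[k]) ++matches;
    }
    return static_cast<double>(matches) / static_cast<double>(a.size());
}
```

With more hash functions the estimate has lower variance, at the cost of more
computation per document.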

INPUT FORMAT

Each line of the file should consist of two fields: the id of the tweet,
followed by the tweet itself.
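
As an illustration of this format, the sketch below splits one input line into
its two fields. It is not the package's own parser: it assumes that the id is
the first whitespace-delimited token and that the rest of the line is the
tweet (the field separator is not specified here), and the name parseLine is
hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Split one input line into (id, tweet text).
// ASSUMPTION: fields are separated by spaces or tabs, and the id itself
// contains no whitespace. The caller is expected to skip blank lines.
std::pair<std::string, std::string> parseLine(const std::string& line) {
    const std::string ws = " \t";
    const std::size_t idStart = line.find_first_not_of(ws);
    assert(idStart != std::string::npos);  // precondition: line is not blank

    const std::size_t idEnd = line.find_first_of(ws, idStart);
    if (idEnd == std::string::npos) {
        return {line.substr(idStart), std::string()};  // id with no text
    }
    const std::size_t textStart = line.find_first_not_of(ws, idEnd);
    std::string text =
        (textStart == std::string::npos) ? std::string() : line.substr(textStart);
    return {line.substr(idStart, idEnd - idStart), text};
}
```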

OUTPUT FORMAT

The code prints the total number of clusters on the first line of standard
output. Then, for each cluster, it prints the size of the cluster followed by
the documents in the cluster.
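
How the clusters are formed is not described here. One plausible reading of
the --similarity_th and --minClusterSize options is sketched below: documents
are linked whenever their similarity reaches the threshold, linked documents
are merged into groups, and groups smaller than the minimum size are dropped.
This single-linkage grouping is an assumption, not the package's confirmed
method, and the names findRoot and clusterBySimilarity are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Union-find lookup with path halving.
std::size_t findRoot(std::vector<std::size_t>& parent, std::size_t x) {
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

// sim is a symmetric n-by-n matrix of pairwise similarities.
// Returns groups of document indices with at least minClusterSize members.
std::vector<std::vector<std::size_t>> clusterBySimilarity(
    const std::vector<std::vector<double>>& sim,
    double threshold, std::size_t minClusterSize) {
    const std::size_t n = sim.size();
    std::vector<std::size_t> parent(n);
    std::iota(parent.begin(), parent.end(), std::size_t{0});

    for (std::size_t i = 0; i < n; ++i) {
        assert(sim[i].size() == n);
        for (std::size_t j = i + 1; j < n; ++j) {
            if (sim[i][j] >= threshold) {
                const std::size_t ri = findRoot(parent, i);
                const std::size_t rj = findRoot(parent, j);
                parent[ri] = rj;  // merge the two groups
            }
        }
    }

    std::map<std::size_t, std::vector<std::size_t>> groups;
    for (std::size_t i = 0; i < n; ++i) {
        groups[findRoot(parent, i)].push_back(i);
    }

    std::vector<std::vector<std::size_t>> clusters;
    for (const auto& entry : groups) {
        if (entry.second.size() >= minClusterSize) clusters.push_back(entry.second);
    }
    return clusters;
}
```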