Near Duplicate Detection
This package finds clusters of near-duplicate documents in a large corpus.
The code represents each document as the set of n-grams it contains. The similarity between two documents is the Jaccard
coefficient of their two n-gram sets. Rather than computing this coefficient exactly for every pair of documents, the code uses hashing
to estimate it probabilistically. For details, refer to the paper by Broder et al. [1].
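The idea can be illustrated with a minimal MinHash sketch. This is not the package's own code: the word-level n-grams, the function names, the number of hash functions, and the linear hash family are all assumptions chosen for illustration. The fraction of positions at which two signatures agree is an estimate of the Jaccard coefficient of the underlying n-gram sets.

```python
import hashlib
import random

# Illustrative sketch only; names and parameters are hypothetical,
# not taken from the package described above.

_PRIME = (1 << 61) - 1  # large Mersenne prime used as the hash modulus


def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _stable_hash(gram):
    """Map an n-gram to an integer that is stable across runs."""
    digest = hashlib.sha1(" ".join(gram).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % _PRIME


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(grams, params):
    """Signature = minimum hash value of the set under each hash function."""
    if not grams:
        raise ValueError("cannot build a signature for an empty set")
    hashed = [_stable_hash(g) for g in grams]
    return [min((a * h + b) % _PRIME for h in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params()
    doc_a = shingles("the quick brown fox jumps over the lazy dog")
    doc_b = shingles("the quick brown fox jumps over the lazy cat")
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact:   ", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Because each signature has a fixed length, documents can be compared, or grouped by matching signature values, without revisiting their full n-gram sets. How the package itself groups documents into clusters is not shown here.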
The README gives usage instructions and describes the input data format. The package is available for download.
References