Near Duplicate Detection

This package finds clusters of near-duplicate documents in a large corpus. Each document is represented as the set of n-grams it contains, and the similarity between two documents is the Jaccard coefficient of their two n-gram sets. Rather than computing this similarity for every pair of documents, the code uses hashing and a probabilistic estimate of the Jaccard coefficient. For details, see the paper by Broder et al. [1].
The README explains how to use the package and the input data format.
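
As a rough illustration of the idea described above, the following Python sketch builds word n-gram sets, computes the exact Jaccard coefficient, and estimates it from MinHash-style signatures. It is not the package's code: the function names, the choice of word-level 3-grams, the SHA-1 base hash, and the 128 linear hash functions are all assumptions made for the example, and the clustering step is left out.

```python
import hashlib
import random

# Assumption: a Mersenne prime larger than any 60-bit base hash value.
_PRIME = (1 << 61) - 1


def shingles(text, n=3):
    """Return the set of word n-grams in a document (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard coefficient: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _base_hash(shingle):
    """Deterministic 60-bit integer hash of one n-gram."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> 4


def make_hash_params(num_hashes=128, seed=0):
    """Draw (a, b) pairs defining hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """For each hash function, keep the minimum hash value over the set."""
    if not shingle_set:
        raise ValueError("cannot build a signature for an empty document")
    base = [_base_hash(s) for s in shingle_set]
    return [min((a * x + b) % _PRIME for x in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A signature has a fixed length regardless of document size, so two documents can be compared by looking at their signatures alone. For example, with `params = make_hash_params()`, calling `estimate_jaccard(minhash_signature(shingles(doc_a), params), minhash_signature(shingles(doc_b), params))` approximates `jaccard(shingles(doc_a), shingles(doc_b))`, with accuracy improving as the number of hash functions grows.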

Click here to download the package

References

[1] Andrei Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of the Web. In Proceedings of WWW, 1997.