Download

Clairlib
The Clair library (Clairlib) is a suite of open-source Perl modules intended to simplify a number of generic tasks in natural language processing (NLP), information retrieval (IR), and network analysis (NA). Its architecture also allows external software to be plugged in with very little effort. To download, please visit www.clairlib.org.

MEAD
A prerequisite for all Clairlib versions: MEAD

MEAD Evaluation add-on
An evaluation framework for extractive summarization: MEADeval (temporarily unavailable).

AAN: The AAN corpus includes three networks: paper citation, author citation, and author collaboration. The paper citation network (paper-citation-network.txt) is a directed network whose nodes are labeled with paper ids that correspond to individual papers (acl-metadata.txt). The author citation network (author-citation-network.txt), a directed network, is compiled from the paper network and the metadata file: for each citation in the paper network, where paper A cites paper B, an edge is created from each author of paper A to each author of paper B. The author collaboration network (author-collaboration-network.txt), an undirected network, is composed of authors where, for each paper in the paper citation network, an edge is created between each pair of collaborators on that paper. Download
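The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used to build the corpus: the function name `build_author_networks`, the in-memory input shapes, and the choice to store each edge once (unweighted) are all hypothetical, and the sketch does not parse the actual AAN file formats.

```python
from itertools import combinations


def build_author_networks(paper_citations, paper_authors):
    """Derive author-level networks from paper-level data (illustrative only).

    paper_citations: iterable of (citing_paper_id, cited_paper_id) pairs.
    paper_authors:   dict mapping paper id -> list of author names.
    Both input shapes are assumptions; the real files need their own parser.
    """
    paper_citations = list(paper_citations)
    author_citation = set()       # directed edges: (citing author, cited author)
    author_collaboration = set()  # undirected edges, stored as sorted pairs

    # For each citation A -> B, link every author of A to every author of B.
    for citing, cited in paper_citations:
        for a in paper_authors.get(citing, []):
            for b in paper_authors.get(cited, []):
                author_citation.add((a, b))

    # For each paper in the citation network, link every pair of co-authors.
    papers_in_network = {p for edge in paper_citations for p in edge}
    for paper in papers_in_network:
        authors = sorted(set(paper_authors.get(paper, [])))
        for a, b in combinations(authors, 2):
            author_collaboration.add((a, b))

    return author_citation, author_collaboration
```

For example, given a single citation from P2 to P1, where P1 is written by Ann and Bob and P2 by Cy and Ann, the sketch yields the author citation edges Cy to Ann, Cy to Bob, Ann to Ann, and Ann to Bob, and the collaboration edges Ann-Bob and Ann-Cy. Note that self-citation edges such as Ann to Ann are kept here because the description does not say they are removed.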

CSTBank: Cross-document Structure Theory Bank Download

Surveyor: paper collection Download

Cartoons: data set Download

CreateDebate: data set Download

Similarity: data set Download

FRAUD: CLAIR collection of fraud email Download

SUMMBank: a collection of summaries used in the JHU workshop in 2001 Download

String Similarity Measures: A C++ package for computing similarity between strings. The package supports the following similarity measures:

Cosine Similarity
Jaccard Similarity
Similarity based on Levenshtein Distance
P-Spectrum Kernel
Length-Weighted Kernel
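To give a feel for what these measures compute, here is a minimal Python sketch of four of them using their textbook definitions. It is not the C++ package's API: every function name below is hypothetical, whitespace tokenization and the normalization of Levenshtein distance by the longer string's length are assumptions, and the length-weighted kernel is left out because this page does not define it.

```python
import math
from collections import Counter


def jaccard_similarity(a, b):
    """Size of the token intersection divided by the size of the token union."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(sa & sb) / len(sa | sb)


def cosine_similarity(a, b):
    """Cosine of the angle between the two token-count vectors."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein_distance(a, b):
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def p_spectrum_kernel(a, b, p):
    """Dot product of the counts of all length-p substrings of a and b."""
    grams_a = Counter(a[i:i + p] for i in range(len(a) - p + 1))
    grams_b = Counter(b[i:i + p] for i in range(len(b) - p + 1))
    return sum(grams_a[g] * grams_b[g] for g in grams_a)
```

As a quick check, `levenshtein_distance("kitten", "sitting")` is 3 and `jaccard_similarity("a b c", "b c d")` is 0.5.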

Node Similarity Measures: A C++ library for computing similarity between nodes in a graph. The library supports the following similarity measures:

SimRank
Random walk based similarity measure
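As an illustration of the first measure, the following Python sketch implements the standard iterative formulation of SimRank, in which two nodes are similar if their in-neighbors are similar. It is not the C++ library's interface: the function name `simrank`, the edge-list input, and the default decay factor and iteration count are assumptions made for this example.

```python
def simrank(edges, decay=0.8, iterations=10):
    """Naive iterative SimRank on a directed graph given as (source, target) edges.

    Returns a dict mapping (node_a, node_b) to a similarity score.
    The decay factor and iteration count are illustrative defaults.
    """
    edges = set(edges)  # ignore duplicate edges
    nodes = sorted({n for edge in edges for n in edge})
    in_nbrs = {n: [] for n in nodes}
    for src, dst in edges:
        in_nbrs[dst].append(src)

    # Start from the identity: each node is fully similar only to itself.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}

    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                elif not in_nbrs[a] or not in_nbrs[b]:
                    # A node with no in-neighbors is similar only to itself.
                    new_sim[(a, b)] = 0.0
                else:
                    total = sum(sim[(i, j)]
                                for i in in_nbrs[a]
                                for j in in_nbrs[b])
                    new_sim[(a, b)] = (decay * total
                                       / (len(in_nbrs[a]) * len(in_nbrs[b])))
        sim = new_sim
    return sim
```

For a graph with edges u to a and u to b, the nodes a and b share their only in-neighbor, so their score equals the decay factor (0.8 with the defaults above). This naive version compares every pair of nodes in each iteration, so it is only suitable for small graphs.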

Relational Classification Dataset

Contains 380 papers manually classified into three research areas: Machine Translation, Dependency Parsing and Summarization.
Contains authorship information, venue information, title, and citation information for all the papers.

Publication Classification

Contains 383 papers manually classified into 31 research areas using session information.

Near Duplicate Detection
A C++ package for detecting near-duplicate documents in a large corpus.