README

The classification dataset is derived from the AAN dataset. The AAN dataset
consists of Natural Language Processing (NLP) papers published in the ACL
anthology.  The AAN dataset can be found here.
http://tangra.cs.yale.edu/newaan/index.php/home/download

There are 380 papers classified into one of the following three classes:

1. Machine Translation 
2. Dependency Parsing 
3. Summarization

SELECTION PROCESS:

We chose a subset of papers in 3 topics (Machine Translation, Dependency
Parsing, Summarization) from the ACL anthology. These topics are three 
main research areas in Natural Language Processing (NLP). Specifically,
we collected all papers which were cited by papers whose titles
contain any of the following phrases, "Dependency Parsing", "Machine
Translation", "Summarization". From this list, we removed all the papers which
contained any of the above phrases in their title because this would make the
classification task easy. The pruned list contains 1190 papers. We manually
classified each paper into four classes (Dependency Parsing, Machine
Translation, Summarization, Other) by considering the full text of the paper.
The manually cleaned data set consists of 275 Machine Translation papers, 73
Dependency Parsing papers and 32 Summarization papers. 

FILES INCLUDED:

Here is a brief description of the directory structure and what is included in
the dataset

	aan_mds 
	| 
	|-----metadata.txt    Contains the id, title, authorship,
	|	    	      venue and the class name for all the papers.  
	| 
	|-----papers_text     This directory contains the full text of the 380
    |	    	      papers. We obtained this text by converting the
	|		      PDF of the paper to text using PDFBox.  
	|
	|-----citations.txt   The file contains citations between ALL the
       			      papers in the AAN data set not just the 
			      citations between the 380	papers in the dataset.
			      This is because many link/citation similarity
			      measures like cocitation or coupling compute
			      similarity between two papers using citations 
			      between other papers.
	  
FORMAT OF metadata.txt: 

The file contains 6 different fields delimited by the tab character for each
paper in its own line.  The first field contains the class information
denoted by a single character. "Machine Translation", "Dependency Parsing"
and "Parsing" are denoted by 'M', 'D' and 'S' respectively. The second field
is the ACL id of the paper and the third field is the title of the paper. 
The fourth field is the list of authors delimited by "; " unlesspreceded by a
";". The fifth field is the venue name and the last field is the year of 
publication.