README

We have manually annotated all the publications in ACL 2005-2008 with topic
classes based on conference session information.

DATASET CREATION

We compiled session information from three conferences (COLING, ACL and
EMNLP) for 2005-2008 and merged semantically similar sessions. For example,
we merged "Text Categorization" with "Text Classification".

Below is a list of the 31 sessions we chose, one per line.

Applications
Coreference
Corpus Annotation
Discourse and Dialogue
Generation
Grammar Induction
Grammars
Inference and Entailment
Information Extraction
Information Retrieval
Lexical Acquisition from Corpora
Lexical Issues
Machine Learning and Statistical Methods
Machine Translation
Morphology
Multimodality and Situated Language Processing
Named Entity
Parsing
Question Answering
Segmentation
Semantic Role Labeling
Semantics
Sentiment and Opinion
Speech and Language Modeling
Speech Processing
Summarization
Tagging
Text Classification
Topic Modeling
Web Corpora
Word Sense Disambiguation

We manually classified 383 publications from ACL 2005-2008 into the above 31
classes.

FILES INCLUDED
 
acl_topics.txt: Each line of this file contains three fields. The first field
is the class name (session name), the second field is the ACL id, and the
third field is the title of the paper.
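
The three-field layout described above could be read with a short script
like the following. This is only a sketch: this README does not state how
the fields are delimited, so the tab separator, the UTF-8 encoding, and the
names TopicRecord, parse_topic_line, load_topics and papers_per_topic are all
assumptions made for illustration. Pass a different `sep` if the file uses
another delimiter.

```python
from collections import Counter
from typing import NamedTuple


class TopicRecord(NamedTuple):
    """One line of acl_topics.txt (hypothetical helper type)."""
    topic: str   # class name (session name)
    acl_id: str  # ACL id of the paper
    title: str   # title of the paper


def parse_topic_line(line: str, sep: str = "\t") -> TopicRecord:
    """Split one line into its three fields.

    ASSUMPTION: fields are separated by `sep` (a tab by default).
    The README does not specify the delimiter.
    """
    # maxsplit=2 keeps any later separator characters inside the title.
    parts = line.rstrip("\n").split(sep, 2)
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    topic, acl_id, title = (part.strip() for part in parts)
    return TopicRecord(topic, acl_id, title)


def load_topics(path: str = "acl_topics.txt", sep: str = "\t") -> list[TopicRecord]:
    """Read every non-blank line of the file into a TopicRecord."""
    # ASSUMPTION: the file is UTF-8 (or plain ASCII) text.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return [parse_topic_line(line, sep) for line in handle if line.strip()]


def papers_per_topic(records: list[TopicRecord]) -> Counter:
    """Count how many records carry each class name."""
    return Counter(record.topic for record in records)
```

Calling papers_per_topic(load_topics()) would then give the number of papers
in each class, which is a quick way to check that the parsed file agrees with
the class list above.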

text_content: This directory contains the text files of all the papers
included in this data set.
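
A minimal sketch for loading the paper texts is shown below. It assumes that
text_content holds one plain-text file per paper and that each file name
(without its extension) identifies the paper, for example by its ACL id.
This README does not confirm the naming scheme, so check the keys against the
ids in acl_topics.txt before relying on them. The function name
load_paper_texts is hypothetical.

```python
from pathlib import Path


def load_paper_texts(directory: str = "text_content") -> dict[str, str]:
    """Map each file's stem (name without extension) to its text.

    ASSUMPTION: one plain-text file per paper, with the stem identifying
    the paper. The README does not document the file naming.
    """
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # errors="replace" avoids a crash on bytes that are not valid
            # UTF-8, since the encoding of the files is not documented.
            texts[path.stem] = path.read_text(encoding="utf-8", errors="replace")
    return texts
```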