CoSQL 1.0

A Conversational Text-to-SQL Challenge
Towards Cross-Domain Natural Language Interfaces to Databases

What is CoSQL?

CoSQL is a corpus for building cross-domain conversational text-to-SQL systems. It is the dialogue version of the Spider and SParC tasks. CoSQL consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. Each dialogue simulates a real-world DB query scenario in which a crowd worker acts as a user exploring the database and a SQL expert retrieves answers with SQL, clarifies ambiguous questions, or informs the user when a question is unanswerable.
CoSQL Paper (EMNLP'19) CoSQL Post
Related challenges: UnifiedSKG, and the single-turn Spider and multi-turn SParC text-to-SQL tasks.
UnifiedSKG Framework, Spider Challenge (EMNLP'18), SParC Challenge (ACL'19)


  • 11/15/2020 We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. The evaluation code is available here.
  • 01/16/2020 Added some supplemental files to the CoSQL dataset (No change to the data content for each task).
  • 10/30/2019 CoSQL dataset is out, see you at EMNLP 2019, Hong Kong!

Why CoSQL?

CoSQL introduces new challenges compared to existing task-oriented dialogue tasks:
  • the dialogue states are grounded in domain-independent SQL programs instead of domain-specific slot-value pairs.
  • because testing is done on unseen databases, success requires generalizing to new domains.

Compared to other semantic parsing/text-to-SQL tasks, CoSQL presents new challenges:
  • user questions are not necessarily answerable.
  • it involves system responses to clarify ambiguous questions, verify returned results, and notify users of unanswerable or unrelated questions.
  • each dialogue is obtained in a Wizard-of-Oz setting between a crowd worker and a SQL expert.

CoSQL includes three tasks:
  • SQL-grounded dialogue state tracking to map user utterances into SQL queries if possible given the interaction history
  • natural language response generation based on an executed SQL query and its results for user verification
  • user dialogue act prediction to detect and resolve ambiguous and unanswerable questions

Getting Started

The data is split into training, development, and unreleased test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

CoSQL Dataset

Details of baseline models and the evaluation script can be found on the following GitHub site:

CoSQL GitHub Page

Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we ask you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through the official evaluation of your model:

Submission Tutorial

Data Examples

Some examples look like the following:

[Image: CoSQL data examples]

Have Questions or Want to Contribute?

Ask us questions at our GitHub issues page or contact Tao Yu, Rui Zhang, or Xi Victoria Lin.

We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.


We thank Pranav Rajpurkar for giving us permission to build this website based on SQuAD.

Part of our CoSQL team at YINS:

[Image: CoSQL team members at YINS]

Leaderboard - SQL-grounded Dialogue State Tracking

In CoSQL, user dialogue states are grounded in SQL queries. Dialogue state tracking (DST) in this setting means predicting the correct SQL query for each user utterance labeled INFORM_SQL, given the interaction context and the DB schema. Compared with other context-dependent text-to-SQL tasks such as SParC, the DST task in CoSQL also includes ambiguous questions whose system clarification the user affirms. In that case, the system clarification is given as part of the interaction context used to predict the SQL query for the question. As in the Spider and SParC tasks, we report results of Exact Set Match without Values:

Date | Model (affiliation, reference) | Question Match | Interaction Match
Oct 5, 2021 | | 53.9 | 24.6
Jul 14, 2021 | Element AI, a ServiceNow company (Scholak et al., EMNLP '21) code | 54.6 | 23.7
Jan 7, 2022 | | 53.8 | 22.1
Sep 21, 2020 | Yale & Microsoft Research & PSU (Yu et al., ICLR '21) | 51.6 | 21.2
Aug 24, 2020 | Alibaba DAMO (Hui et al., AAAI '21) code | 46.8 | 17.0
Jan 26, 2021 | University of Science and Technology of China (Wang et al., AAAI '21) code | 41.8 | 15.2
Nov 16, 2020 | | 46.1 | 15.1
May 26, 2020 | Peking University (Cai et al., EMNLP '20) code | 42.5 | 15.0
Aug 30, 2019 | Yale University & Salesforce Research (Zhang et al., EMNLP '19) code | 40.8 | 13.7
May 21, 2020 | University of Washington & Facebook AI Research (Zhong et al., EMNLP '20) | 39.7 | 12.8
Aug 30, 2019 | Yale University & Salesforce Research (Yu et al., EMNLP '19) code | 13.9 | 2.6
Aug 30, 2019 | Yale University (Yu et al., EMNLP '18) code | 14.1 | 2.2

and Execution with Values:

Date | Model (affiliation, reference) | Question Match | Interaction Match
May 21, 2020 | University of Washington & Facebook AI Research (Zhong et al., EMNLP '20) | 35.9 | 8.4
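
The two score columns above can be read as follows: Question Match is computed over individual questions, while Interaction Match counts a dialogue as correct only when every question in it is correct. The sketch below illustrates that aggregation under stated assumptions. It is not the official evaluation script: the function names and the per-turn correctness flags are hypothetical, and in practice the flags would come from the official evaluation code.

```python
from typing import List


def question_match(interactions: List[List[bool]]) -> float:
    """Fraction of individual questions whose predicted SQL was judged correct."""
    turns = [ok for dialogue in interactions for ok in dialogue]
    return sum(turns) / len(turns) if turns else 0.0


def interaction_match(interactions: List[List[bool]]) -> float:
    """Fraction of interactions in which every question was judged correct."""
    if not interactions:
        return 0.0
    return sum(all(dialogue) for dialogue in interactions) / len(interactions)


# Hypothetical per-turn correctness flags for three dialogues (illustration only).
flags = [
    [True, True, True],   # every turn correct: counts toward interaction match
    [True, False],        # one wrong turn: the whole interaction is a miss
    [False, True, True],
]

qm = question_match(flags)
im = interaction_match(flags)
print(f"question match: {qm:.3f}, interaction match: {im:.3f}")
```

Because a single wrong turn fails the whole dialogue, Interaction Match is never higher than Question Match on non-empty dialogues, which is consistent with the gap between the two columns in the tables.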

Leaderboard - Response Generation from SQL and Query Results

This task requires generating a natural language description of the SQL query and its result for each system response labeled INFORM_SQL. The input is a SQL query, its execution result, and the DB schema. Preserving logical consistency between the SQL and the NL response, measured by Logic Correctness Rate (LCR), is crucial in this task, in addition to naturalness and syntactic correctness.

Date | Model | BLEU | Grammar | LCR (%)
Aug 30, 2019 | Template baseline | 9.3 | 4.0 | 41.0
Aug 30, 2019 | Pointer-generator baseline | 15.1 | 3.6 | 35.0
Aug 30, 2019 | Seq2Seq baseline | 14.1 | 3.5 | 27.0

Leaderboard - User Dialogue Act Prediction

A real-world DB querying dialogue system has to decide whether the user question can be mapped to a SQL query or whether special actions are needed. We define a series of dialogue acts for the DB user and the SQL expert (refer to the paper for more details). For example, if the user question can be answered by a SQL query, the dialogue act of the question is INFORM_SQL.

Date | Model | Accuracy
Dec 20, 2019 | UTran-SQL | 87.2
Aug 30, 2019 | TBCNN-pair baseline | 83.9
Aug 30, 2019 | Majority baseline | 62.8
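
To make the Accuracy column concrete, here is a minimal sketch of how a majority-class baseline could be scored on dialogue act labels. It is an illustration under assumptions, not the authors' implementation: the helper names are hypothetical, the tiny label lists are made up, and the labels other than INFORM_SQL are placeholders standing in for the dialogue acts defined in the paper.

```python
from collections import Counter
from typing import List


def majority_label(train_labels: List[str]) -> str:
    """Return the most frequent dialogue act in the training labels."""
    if not train_labels:
        raise ValueError("train_labels must not be empty")
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of utterances whose predicted act equals the gold act."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Made-up labels for illustration; only INFORM_SQL is taken from the text above.
train = ["INFORM_SQL", "INFORM_SQL", "AMBIGUOUS", "INFORM_SQL", "THANK_YOU"]
dev = ["INFORM_SQL", "AMBIGUOUS", "INFORM_SQL", "INFORM_SQL"]

label = majority_label(train)
preds = [label] * len(dev)  # the baseline predicts the same act for every utterance
acc = accuracy(dev, preds)
print(f"majority label: {label}, accuracy: {acc:.2f}")
```

A baseline of this kind ignores the utterance entirely, so its accuracy equals the share of the most frequent training act among the evaluated utterances.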