test image test image CoSQL 1.0 test image test image

A Conversational Text-to-SQL Challenge
Towards Cross-Domain Natural Language Interfaces to Databases

What is CoSQL?

Aug 28, 2024: The early access version of Spider 2.0 (a more realistic and challenging text-to-SQL task) is now available! We expect to release the whole dataset in 1-2 weeks. As this is a preliminary release, there may be errors. Your feedback would be invaluable in refining the dataset!

CoSQL is a corpus for building cross-domain Conversational text-to-SQL systems. It is the dialogue version of the Spider and SParC tasks. CoSQL consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the database and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions.
XLANG Lab for Building LLM/VLM Agents CoSQL Paper (EMNLP'19) CoSQL Post
Related Works from XLANG Lab: Spider 2.0 Text-to-SQL ('24) Spider2-V ('24) OSWorld ('24) DS-1000 Challenge (ICML'23) Binder Framework (ICLR '23) UnifiedSKG Framework (EMNLP'22) Spider Chanllenge (EMNLP'18) SParC Chanllenge (ACL'19)

News

Why CoSQL?

CoSQL introduces new challenges compared to existing task-oriented dialogue tasks:
  • the dialogue states are grounded in domain-independent SQL program instead of domain-specific slot-value pairs.
  • because testing is done on unseen databases, success requires generalizing to new domains.

Compared to other semantic parsing/text-to-SQL tasks, CoSQL presents new challenges:
  • user questions are not necessarily answerable.
  • it involves system responses to clarify ambiguous questions, verify returned results, and notify users of unanswerable or unrelated questions.
  • each dialog is obtained via the Wizard-of-Oz setting between a crowd worker and a SQL expert.

CoSQL includes three tasks:
  • SQL-grounded dialogue state tracking to map user utterances into SQL queries if possible given the interaction history
  • natural language response generation based on an executed SQL and its results for user verification
  • user dialogue act prediction to detect and resolve ambiguous and unanswerable questions

Getting Started

The data is split into training, development, and unreleased test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

CoSQL Dataset Details of baseline models and evaluation script can be found on the following GitHub site: CoSQL GitHub Page

Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we request you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model:

Submission Tutorial

Data Examples

Some examples look like the following:

test image

Have Questions or Want to Contribute ?

Ask us questions at our Github issues page or contact Tao Yu, Rui Zhang, or Xi Victoria Lin.

We expect the dataset to evolve. We would greatly appreciate it if you could donate us your non-private databases or SQL queries for the project.

Acknowledgement

We thank Pranav Rajpurkar for giving us the permission to build this website based on SQuAD.

Part of our CoSQL team at YINS:

test image

Leaderboard - SQL-grounded Dialogue State Tracking

In CoSQL, user dialogue states are grounded in SQL queries. Dialogue state tracking (DST) in this case is to predict the correct SQL query for each user utterance with INFORM_SQL label given the interaction context and the DB schema. Comparing to other context-dependent text-to-SQL tasks such as SParC, the DST task in CoSQL also includes the ambiguous questions if the user affirms the system clarification of them. In this case, the system clarification is also given as part of the interaction context to predict the SQL query corresponding to the question. As in Spider and SParC tasks, we report results of Exact Set Match without Values:

Rank Model Question Match Interaction Match

1

Feb 14, 2022
STAR

Alibaba DAMO & SIAT

(Cai and Li et al., EMNLP-Findings '22) code demo
57.8 28.2

2

Apr 2, 2022
CQR-SQL

Tencent Cloud Xiaowei

(Xiao et al.,'22)
58.3 27.4

3

Jun 4, 2022
RASAT + PICARD

SJTU LUMIA & Netmind.AI

(Qi et al., EMNLP'22) code
55.7 26.5

4

Dec 26, 2022
MT Training + N-best List Rerankers + PICARD

Alexa AI

(Parthasarathi et al., ICASSP'23)
55.8 24.8

5

Oct 5, 2021
HIE-SQL + GraPPa

Alibaba DAMO

(Zheng et al. ACL-Findings '22)
53.9 24.6

6

Jul 14, 2021
T5-3B+PICARD

Element AI, a ServiceNow company

(Scholak et al., EMNLP'21) code
54.6 23.7

7

Jan 7, 2022
RATSQL++ + ELECTRA

Anonymous

53.8 22.1

8

Sep. 21, 2020
RAT-SQL + SCoRe

Yale & Microsoft Research & PSU

(Yu et al. ICLR '21)
51.6 21.2

9

Aug 24, 2020
R²SQL + BERT

Alibaba DAMO

(Hui et al. AAAI '21) code
46.8 17.0

10

Jan 26, 2021
IST-SQL + BERT

University of Science and Technology of China

(Wang et al. AAAI '21) code
41.8 15.2

11

Nov. 16, 2020
WaveSQL + BERT

Anonymous

46.1 15.1

12

May 26, 2020
IGSQL + BERT

Peking University

(Cai et al. EMNLP '20) code
42.5 15.0

13

Aug 30, 2019
EditSQL + BERT

Yale University & Salesforce Research

(Zhang et al. EMNLP '19) code
40.8 13.7

14

May 21, 2020
GAZP + BERT

University of Washington & Facebook AI Research

(Zhong et al., EMNLP '20)
39.7 12.8

15

Apr 21, 2021
MemCE

UoE

(Jain et al., TACL '21)
28.4 6.2

16

Aug 30, 2019
CD-Seq2Seq

Yale University & Salesforce Research

(Yu et al. EMNLP '19) code
13.9 2.6

17

Aug 30, 2019
SyntaxSQL-con

Yale University

(Yu et al. EMNLP '18) code
14.1 2.2
and Execution with Values:
Rank Model Question Match Interaction Match

1

Jun 4, 2022
RASAT + PICARD

SJTU LUMIA & Netmind.AI

(Qi et al., EMNLP'22) code
66.3 37.4

2

May 21, 2020
GAZP + BERT

University of Washington & Facebook AI Research

(Zhong et al., EMNLP '20)
35.9 8.4

Leaderboard - Response Generation from SQL and Query Results

This task requires generating a natural language description of the SQL query and the result for each system response labeled as INFORM_SQL. It considers a SQL query, the execution result, and the DB schema. Preserving logical consistency (Logic Correctness Rate (LCR)) between SQL and NL response is crucial in this task, in addition to naturalness and syntactical correctness.

Rank Model BLEU Grammar LCR (%)

1

Dec 15, 2022
Complexity Aware Prompts + T5

Alexa AI

28.1 - -

2

Aug 30, 2019
Template baseline 9.3 4.0 41.0

3

Aug 30, 2019
Pointer-generator baseline 15.1 3.6 35.0

4

Aug 30, 2019
Seq2Seq baseline 14.1 3.5 27.0

Leaderboard - User Dialogue Act Prediction

For a real-world DB querying dialogue system, it has to decide if the user question can be mapped to a SQL query or if special actions are needed. We define a series of dialogue acts for the DB user and the SQL expert (refer to the paper for more details). For example, if the user question can be answered by a SQL query, the dialogue act of the question is INFORM_SQL.

Rank Model Accuracy

1

Dec 20, 2019
UTran-SQL 87.2

2

Aug 30, 2019
TBCNN-pair baseline 83.9

3

Aug 30, 2019
Majority baseline 62.8