CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases

What is CoSQL?

CoSQL is a corpus for building cross-domain Conversational text-to-SQL systems. It is the dialogue version of the Spider and SParC tasks. CoSQL consists of 30k+ turns plus 10k+ annotated SQL queries, obtained from a Wizard-of-Oz collection of 3k dialogues querying 200 complex databases spanning 138 domains. Each dialogue simulates a real-world DB query scenario with a crowd worker as a user exploring the database and a SQL expert retrieving answers with SQL, clarifying ambiguous questions, or otherwise informing of unanswerable questions.

XLANG Lab for Building LLM/VLM Agents | CoSQL Paper (EMNLP'19) | CoSQL Post

Related Works from XLANG Lab: Spider 2.0 Text-to-SQL ('24), Spider2-V ('24), OSWorld ('24), DS-1000 Challenge (ICML'23), Binder Framework (ICLR'23), UnifiedSKG Framework (EMNLP'22), Spider Challenge (EMNLP'18), SParC Challenge (ACL'19)

News

11/12/2024 We have released Spider 2.0 full paper, data and code. Follow the guideline to submit your scores to the leaderboard.
08/28/2024 The early access version of Spider 2.0 (a more realistic and challenging text-to-SQL task) is now available! As this is a preliminary release, there may be errors. Your feedback would be invaluable in refining the dataset!
07/15/2024 Spider 2.0-vision (Benchmarking Multimodal Agents on Automating Data Science and Engineering Workflows) is out! Spider 2.0-SQL (much more realistic and challenging than Spider 1.0!) will be released in August.
08/10/2023 Please check out XLANG Lab for Building LLM/VLM Agents!
11/20/2022 Please check out our recent work DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. Please check out examples, data, and code on the DS-1000 project site!!
10/18/2022 Please check out our recent work Binder: an easy-to-use but SOTA neural-symbolic framework built on GPT-3 Codex and a SQL/Python interpreter. It injects GPT-3 Codex prompt API calls into programming languages! Please check out the Binder demo, code, paper, and video on the Binder project site!!
02/15/2022 Please check out our recent work UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. We open-sourced simple but SOTA/strong models for 21 tasks including text-to-SQL! Please check out our code in the UnifiedSKG repo!!
11/15/2020 We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. Please find the evaluation code from here.
01/16/2020 Added some supplemental files to the CoSQL dataset (No change to the data content for each task).
10/30/2019 CoSQL dataset is out, see you at EMNLP 2019, Hong Kong!

Why CoSQL?

CoSQL introduces new challenges compared to existing task-oriented dialogue tasks:

the dialogue states are grounded in domain-independent SQL programs instead of domain-specific slot-value pairs.
because testing is done on unseen databases, success requires generalizing to new domains.

Compared to other semantic parsing/text-to-SQL tasks, CoSQL presents new challenges:

user questions are not necessarily answerable.
it involves system responses to clarify ambiguous questions, verify returned results, and notify users of unanswerable or unrelated questions.
each dialog is obtained via the Wizard-of-Oz setting between a crowd worker and a SQL expert.

CoSQL includes three tasks:

SQL-grounded dialogue state tracking to map user utterances into SQL queries if possible given the interaction history
natural language response generation based on an executed SQL and its results for user verification
user dialogue act prediction to detect and resolve ambiguous and unanswerable questions
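
As a rough illustration of how the first task relates to a dialogue, the sketch below models a turn and extracts dialogue state tracking examples from it. The `Turn` class, its field names, the `dst_examples` helper, and the toy dialogue are all assumptions made for illustration; they are not the format of the released data files. Only the INFORM_SQL label comes from the task description.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class Turn:
    speaker: str                # "user" or "system"
    utterance: str
    dialogue_act: str           # "INFORM_SQL" comes from the task description
    sql: Optional[str] = None   # gold SQL, present only when the turn is grounded in a query


def dst_examples(turns: List[Turn]) -> Iterator[Tuple[List[str], str, str]]:
    """Yield (history, question, gold_sql) for each user turn labeled INFORM_SQL."""
    history: List[str] = []
    for turn in turns:
        if turn.speaker == "user" and turn.dialogue_act == "INFORM_SQL" and turn.sql:
            yield list(history), turn.utterance, turn.sql
        # Every turn, including system clarifications, becomes context for later turns.
        history.append(f"{turn.speaker}: {turn.utterance}")


# Invented toy dialogue; "OTHER_ACT" is a placeholder, not a real CoSQL label.
dialogue = [
    Turn("user", "How many singers are there?", "INFORM_SQL", "SELECT count(*) FROM singer"),
    Turn("system", "There are 6 singers.", "INFORM_SQL"),
    Turn("user", "Thanks!", "OTHER_ACT"),
]

for history, question, sql in dst_examples(dialogue):
    print(history, question, sql, sep=" | ")
```

Keeping system turns in the history is what lets a model see a clarification before it predicts the SQL for a follow-up question.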

Getting Started

The data is split into training, development, and unreleased test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

CoSQL Dataset

Details of baseline models and the evaluation script can be found on the following GitHub site: CoSQL GitHub Page

Once you have built a model that meets your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we ask you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through the official evaluation of your model:

Submission Tutorial

Data Examples

Some examples look like the following:

Have Questions or Want to Contribute?

Ask us questions at our GitHub issues page or contact Tao Yu, Rui Zhang, or Xi Victoria Lin.

We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.

Acknowledgement

We thank Pranav Rajpurkar for giving us the permission to build this website based on SQuAD.

Part of our CoSQL team at YINS:

Leaderboard - SQL-grounded Dialogue State Tracking

In CoSQL, user dialogue states are grounded in SQL queries. Dialogue state tracking (DST) in this case means predicting the correct SQL query for each user utterance with the INFORM_SQL label, given the interaction context and the DB schema. Compared to other context-dependent text-to-SQL tasks such as SParC, the DST task in CoSQL also includes ambiguous questions if the user affirms the system's clarification of them. In this case, the system clarification is also given as part of the interaction context used to predict the SQL query corresponding to the question. As in the Spider and SParC tasks, we report results of Exact Set Match without Values:

Rank	Date	Model	Team (Reference)	Question Match	Interaction Match
1	Feb 14, 2022	STAR	Alibaba DAMO & SIAT (Cai and Li et al., EMNLP-Findings '22) code demo	57.8	28.2
2	Apr 2, 2022	CQR-SQL	Tencent Cloud Xiaowei (Xiao et al., '22)	58.3	27.4
3	Jun 4, 2022	RASAT + PICARD	SJTU LUMIA & Netmind.AI (Qi et al., EMNLP '22) code	55.7	26.5
4	Dec 26, 2022	MT Training + N-best List Rerankers + PICARD	Alexa AI (Parthasarathi et al., ICASSP '23)	55.8	24.8
5	Oct 5, 2021	HIE-SQL + GraPPa	Alibaba DAMO (Zheng et al., ACL-Findings '22)	53.9	24.6
6	Jul 14, 2021	T5-3B+PICARD	Element AI, a ServiceNow company (Scholak et al., EMNLP '21) code	54.6	23.7
7	Jan 7, 2022	RATSQL++ + ELECTRA	Anonymous	53.8	22.1
8	Sep 21, 2020	RAT-SQL + SCoRe	Yale & Microsoft Research & PSU (Yu et al., ICLR '21)	51.6	21.2
9	Aug 24, 2020	R²SQL + BERT	Alibaba DAMO (Hui et al., AAAI '21) code	46.8	17.0
10	Jan 26, 2021	IST-SQL + BERT	University of Science and Technology of China (Wang et al., AAAI '21) code	41.8	15.2
11	Nov 16, 2020	WaveSQL + BERT	Anonymous	46.1	15.1
12	May 26, 2020	IGSQL + BERT	Peking University (Cai et al., EMNLP '20) code	42.5	15.0
13	Aug 30, 2019	EditSQL + BERT	Yale University & Salesforce Research (Zhang et al., EMNLP '19) code	40.8	13.7
14	May 21, 2020	GAZP + BERT	University of Washington & Facebook AI Research (Zhong et al., EMNLP '20)	39.7	12.8
15	Apr 21, 2021	MemCE	UoE (Jain et al., TACL '21)	28.4	6.2
16	Aug 30, 2019	CD-Seq2Seq	Yale University & Salesforce Research (Yu et al., EMNLP '19) code	13.9	2.6
17	Aug 30, 2019	SyntaxSQL-con	Yale University (Yu et al., EMNLP '18) code	14.1	2.2

and Execution with Values:

Rank	Date	Model	Team (Reference)	Question Match	Interaction Match
1	Jun 4, 2022	RASAT + PICARD	SJTU LUMIA & Netmind.AI (Qi et al., EMNLP '22) code	66.3	37.4
2	May 21, 2020	GAZP + BERT	University of Washington & Facebook AI Research (Zhong et al., EMNLP '20)	35.9	8.4
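
The two score columns in the tables above can be read as follows, assuming the definitions used for SParC: Question Match is the percentage of individual questions whose predicted SQL is judged correct, and Interaction Match is the percentage of dialogues in which every question is correct. The sketch below is a minimal illustration of that aggregation, not the official evaluation script; the function name `match_scores` and the toy correctness flags are hypothetical.

```python
from typing import Dict, List, Tuple


def match_scores(turn_correct: Dict[str, List[bool]]) -> Tuple[float, float]:
    """turn_correct maps an interaction id to per-question correctness flags."""
    n_questions = sum(len(flags) for flags in turn_correct.values())
    n_correct = sum(sum(flags) for flags in turn_correct.values())
    # An interaction counts only if all of its questions are correct.
    n_interactions_ok = sum(all(flags) for flags in turn_correct.values())
    question_match = 100.0 * n_correct / n_questions
    interaction_match = 100.0 * n_interactions_ok / len(turn_correct)
    return question_match, interaction_match


# Invented per-question results for three toy interactions.
toy = {"d1": [True, True], "d2": [True, False, True], "d3": [False]}
qm, im = match_scores(toy)
print(f"Question Match: {qm:.1f}  Interaction Match: {im:.1f}")
```

Because a single wrong question fails the whole interaction, Interaction Match is always the lower of the two numbers, which is consistent with the gap seen in every leaderboard row.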

Leaderboard - Response Generation from SQL and Query Results

This task requires generating a natural language description of the SQL query and the result for each system response labeled as INFORM_SQL. The input consists of a SQL query, the execution result, and the DB schema. Preserving logical consistency between the SQL and the NL response, measured by the Logic Correctness Rate (LCR), is crucial in this task, in addition to naturalness and syntactical correctness.

Rank	Date	Model	BLEU	Grammar	LCR (%)
1	Dec 15, 2022	Complexity Aware Prompts + T5 (Alexa AI)	28.1	-	-
2	Aug 30, 2019	Template baseline	9.3	4.0	41.0
3	Aug 30, 2019	Pointer-generator baseline	15.1	3.6	35.0
4	Aug 30, 2019	Seq2Seq baseline	14.1	3.5	27.0
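
To give a feel for what the BLEU column measures, here is a simplified single-reference, sentence-level BLEU sketch: clipped n-gram precision up to 4-grams combined with a brevity penalty. It is an assumption-laden illustration, not the script used for the leaderboard, which may be corpus-level, use smoothing, or tokenize differently. The function names and example sentences are invented.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU (0-100) of one candidate against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


reference = "there are 6 singers in the database"
print(sentence_bleu("there are 6 singers in the database", reference))
print(sentence_bleu("there are 6 singers in total in the database", reference))
```

BLEU only rewards surface overlap, so a fluent response that states the wrong result can still score well; that is why the table reports LCR alongside it.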

Leaderboard - User Dialogue Act Prediction

A real-world DB querying dialogue system has to decide whether the user question can be mapped to a SQL query or whether special actions are needed. We define a series of dialogue acts for the DB user and the SQL expert (refer to the paper for more details). For example, if the user question can be answered by a SQL query, the dialogue act of the question is INFORM_SQL.

Rank	Date	Model	Accuracy
1	Dec 20, 2019	UTran-SQL	87.2
2	Aug 30, 2019	TBCNN-pair baseline	83.9
3	Aug 30, 2019	Majority baseline	62.8
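
The accuracy figures above can be put in context with a small sketch of a majority-class baseline. This page does not define the "Majority baseline", so the sketch assumes the usual meaning: always predict the most frequent training label and score single-label accuracy. The function name and the toy labels are hypothetical, and "OTHER_ACT" is a placeholder rather than a real CoSQL dialogue act.

```python
from collections import Counter
from typing import List, Tuple


def majority_baseline_accuracy(train_labels: List[str], test_labels: List[str]) -> Tuple[str, float]:
    """Predict the most frequent training label for every test item."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, 100.0 * hits / len(test_labels)


# Invented toy labels.
train = ["INFORM_SQL", "INFORM_SQL", "INFORM_SQL", "OTHER_ACT"]
test = ["INFORM_SQL", "OTHER_ACT", "INFORM_SQL", "INFORM_SQL"]
label, accuracy = majority_baseline_accuracy(train, test)
print(label, accuracy)
```

Under this reading, a majority baseline's accuracy simply reflects how often the most common dialogue act occurs in the evaluation data, which is the floor a learned classifier has to beat.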

CoSQL 1.0
