Spider 1.0

Yale Semantic Parsing and Text-to-SQL Challenge

What is Spider?

Spider is a large-scale, complex, cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 multi-table databases covering 138 different domains. In Spider 1.0, the train and test sets contain different complex SQL queries and different databases. To do well, systems must generalize not only to new SQL queries but also to new database schemas.
Why do we call it "Spider"? Because our dataset is complex and cross-domain, like a spider crawling across multiple complex nests (databases with many foreign keys). Spider Paper (EMNLP'18) Spider Post
SParC, the context-dependent version of the Spider task, introduces a new Semantic Parsing in Context challenge. SParC Challenge (ACL'19)


  • 9/24/2019 (Min et al., EMNLP 2019) translated Spider into Chinese (coming up soon)!
  • 5/17/2019 Our paper SParC: Cross-Domain Semantic Parsing in Context with Salesforce Research was accepted to ACL 2019! It introduces the context-dependent version of the Spider challenge: SParC!
  • 5/17/2019 Please report any annotation errors here; we really appreciate your help and will update the data release this summer!
  • 1/14/2019 The submission tutorial is out!
  • 12/17/2018 We updated 7 sqlite database files (issue 14). Please download the Spider dataset from this page again.
  • 10/25/2018 The evaluation script and results were updated (issue 5). Please download the latest versions of the script and papers. Also, please follow the instructions in issue 3 to generate the latest SQL parsing results (a bug was fixed).

Why Spider?

[Spider chart comparing Spider 1.0 with previous semantic parsing and text-to-SQL datasets]

As the above spider chart shows, Spider 1.0 is distinct from most of the previous semantic parsing tasks because:
  • ATIS, Geo, Academic: Each contains only a single database with a limited number of SQL queries, and the exact same SQL queries appear in the train and test splits.
  • WikiSQL: The numbers of SQL queries and tables are significantly larger, but all of its SQL queries are simple, and each database is just a single table with no foreign keys.
Spider 1.0 spans the largest area in the chart, making it the first complex and cross-domain semantic parsing and text-to-SQL dataset! Read more on the blog post.

Getting Started

The data is split into training, development, and unreleased test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

Spider Dataset

Details of the baseline models and the evaluation script can be found on the following GitHub page: Spider GitHub Page

Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we ask you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through the official evaluation of your model:

Submission Tutorial

Data Examples

Some examples look like the following:

[Image: example questions and their corresponding SQL queries]

Have Questions or Want to Contribute?

Ask us questions on our GitHub issues page, or contact Tao Yu, Rui Zhang, or Michihiro Yasunaga.

We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.


We thank Graham Neubig, Tianze Shi, Catherine Finegan-Dollak, and the anonymous reviewers for their valuable comments on this project. We also thank Pranav Rajpurkar for giving us permission to build this website based on SQuAD.

Our team at the summit of East Rock Park in New Haven (the pose is "NLseq2SQL"):


Leaderboard - Exact Set Match without Values

For the exact matching evaluation, instead of simply conducting a string comparison between the predicted and gold SQL queries, we decompose each SQL query into several clauses and conduct a set comparison within each clause. Please refer to the paper and the GitHub page for more details.
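To make the idea concrete, here is a minimal sketch of clause-level set comparison. This is not the official evaluation script (which handles nesting, aliases, and many more clause types); the clause list and normalization here are simplifying assumptions for illustration only.

```python
import re

# Illustrative sketch of clause-level "exact set match": each SQL query is
# decomposed into clauses, and the items within each clause are compared as
# sets, so e.g. the order of selected columns does not affect the score.
CLAUSE_KEYWORDS = r"\b(SELECT|FROM|WHERE|GROUP BY|ORDER BY)\b"

def decompose(sql):
    """Map each clause keyword to the set of comma-separated items in it."""
    parts = re.split(CLAUSE_KEYWORDS, sql.strip().rstrip(";"), flags=re.I)
    it = iter(parts[1:])  # parts[0] is any text before the first keyword
    return {kw.upper(): frozenset(x.strip().lower() for x in body.split(","))
            for kw, body in zip(it, it)}

def exact_set_match(pred, gold):
    return decompose(pred) == decompose(gold)

# Column order differs, but the queries still match at the set level:
print(exact_set_match("SELECT name, age FROM singer",
                      "SELECT age, name FROM singer"))  # True
print(exact_set_match("SELECT name FROM singer",
                      "SELECT age FROM singer"))        # False
```

Because each clause is compared as a set, superficial reorderings are not penalized, while a genuinely different column or condition still counts as a mismatch.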

Rank | Date | Model | Institution | Citation | Dev | Test
1 | June 24, 2019 | IRNet v2 + BERT | Microsoft Research Asia | — | 63.9 | 55.0
2 | Sep 20, 2019 | — | — | — | 60.2 | 54.8
3 | May 19, 2019 | — | Microsoft Research Asia | (Guo and Zhan et al., ACL '19) code | 61.9 | 54.7
4 | Sep 19, 2019 | — | — | — | 60.6 | 53.7
5 | Sep 1, 2019 | — | Yale University & Salesforce Research | (Zhang et al., EMNLP '19) code | 57.6 | 53.4
6 | June 24, 2019 | IRNet v2 | Microsoft Research Asia | — | 55.4 | 48.5
7 | Aug 30, 2019 | — | Tel-Aviv University & Allen Institute for AI | (Bogin et al., EMNLP '19) code | 52.7 | 47.4
8 | May 19, 2019 | — | Microsoft Research Asia | (Guo and Zhan et al., ACL '19) code | 53.2 | 46.7
9 | June 11, 2019 | — | — | — | 51.5 | 45.6
10 | June 12, 2019 | — | — | — | 48.7 | 44.1
11 | Aug 31, 2019 | — | — | — | 52.9 | 42.5
12 | May 16, 2019 | — | Tel-Aviv University & Allen Institute for AI | (Bogin et al., ACL '19) code | 40.7 | 39.4
13 | Feb 25, 2019 | — | — | — | 40.8 | 37.4
14 | May 30, 2019 | — | Allen Institute for AI | (Lin et al., '19) | 34.8 | 33.8
15 | Sep 1, 2019 | — | Yale University & Salesforce Research | (Zhang et al., EMNLP '19) code | 36.4 | 32.9
16 | Sep 20, 2018 | SyntaxSQLNet + augment | Yale University | (Yu et al., EMNLP '18) code | 24.8 | 27.2
17 | April 18, 2019 | — | SAP Labs Korea | (Lee, EMNLP '19) | 28.5 | 24.3
18 | Sep 20, 2018 | — | Yale University | (Yu et al., EMNLP '18) code | 18.9 | 19.7
19 | Sep 20, 2018 | — | Shanghai Jiao Tong University (modified by Yale) | (Xu et al., '18) code | 10.9 | 12.4
20 | Sep 20, 2018 | — | Yale University | (Yu et al., NAACL '18) code | 8.0 | 8.2
21 | Sep 20, 2018 | Seq2Seq + attention | University of Edinburgh (modified by Yale) | (Dong and Lapata, ACL '16) code | 1.8 | 4.8

Other papers that used Spider (evaluated on the dev set but not the test set):
  1. (Min et al., EMNLP 2019), Westlake University, Spider in Chinese
  2. (Yao et al., EMNLP 2019), OSU & Facebook AI Research
  3. (Shaw et al., ACL 2019), Google
  4. (Shin et al., NeurIPS 2019), UC Berkeley & MSR
  5. (Weir et al., SIGMOD 2019), Brown University & TU Darmstadt
  6. (Baik et al., ICDE 2019), U of Michigan & IBM

Leaderboard - Execution with Value Selection

Our current models do not predict any values in SQL conditions, so we do not provide execution accuracies. However, we encourage you to provide them in future submissions. For value prediction, you can assume that a list of gold values for each question is given; your model has to fill them into the right slots in the SQL query. Will your system be the first to appear on this leaderboard?
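The slot-filling setup above can be sketched as follows. The `<VALUE>` placeholder and the in-order filling are assumptions for illustration; a real system would predict which gold value belongs in which condition slot.

```python
# Hypothetical sketch of the value-selection setting: the model predicts a
# SQL skeleton with value placeholders, and the given gold values must be
# placed into the condition slots. This sketch simply fills slots in the
# order the values are listed, quoting string values as SQL literals.
def fill_values(skeleton, values):
    for v in values:
        literal = "'{}'".format(v) if isinstance(v, str) else str(v)
        skeleton = skeleton.replace("<VALUE>", literal, 1)
    return skeleton

print(fill_values(
    "SELECT name FROM singer WHERE country = <VALUE> AND age > <VALUE>",
    ["France", 20]))
# SELECT name FROM singer WHERE country = 'France' AND age > 20
```

The resulting query is executable, which is what makes execution accuracy measurable once values are predicted.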

Rank Model Dev Test

Example Split Results

For comparison, the models achieve much higher results if we split the dataset by data examples instead of by databases, since the systems then don't need to generalize to new database schemas.

Rank | Date | Model | Institution | Citation | Exact Set Match
1 | Sep 20, 2018 | — | Yale University | (Yu et al., NAACL '18) | —
2 | Sep 20, 2018 | — | Shanghai Jiao Tong University | (Xu et al., '18) | —
3 | Sep 20, 2018 | Seq2Seq + attention | University of Edinburgh | (Dong and Lapata, ACL '16) | —
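The difference between the two evaluation settings can be sketched as below. The field names (`db_id`, `question`) and split fractions are assumptions for illustration, not the official data format.

```python
import random

# A database split holds out entire schemas (the Spider setting, which
# requires generalizing to unseen databases), while an example split only
# holds out question/SQL pairs from databases also seen in training.
def database_split(examples, test_frac=0.2, seed=0):
    """Hold out whole databases: no db_id appears in both halves."""
    dbs = sorted({ex["db_id"] for ex in examples})
    random.Random(seed).shuffle(dbs)
    held_out = set(dbs[:max(1, int(len(dbs) * test_frac))])
    train = [ex for ex in examples if ex["db_id"] not in held_out]
    test = [ex for ex in examples if ex["db_id"] in held_out]
    return train, test

def example_split(examples, test_frac=0.2, seed=0):
    """Hold out individual examples: databases may appear in both halves."""
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    cut = max(1, int(len(pool) * test_frac))
    return pool[cut:], pool[:cut]

examples = [{"db_id": "db%d" % (i % 5), "question": "q%d" % i}
            for i in range(50)]
train, test = database_split(examples)
# In the database split, the held-out schemas never appear in training.
```

Under the example split, a system can memorize schema-specific patterns seen in training, which is why its numbers are much higher than on the database split.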