Spider2-E (Spider2-EmergenceAI) is a corrected, reproducible version of the Snowflake split of the Spider 2.0 benchmark by xlang-ai. The task definitions are unchanged. Spider2-E modifies the gold artifacts (test set, gold SQL queries, gold result CSVs) because some upstream gold items can no longer be reproduced against live Snowflake, and patches the evaluation script to reject SQL submissions that bake answer values into the query.
This document describes every change made versus the upstream spider2-snow directory and the reasoning behind it. Run setup is unchanged — see README-spider2-snow.md (verbatim copy of the upstream README). Evaluation usage is summarized below; the new flag is documented in §4.
The Spider2 evaluation suite ships two distinct ground-truth artifacts per instance, both modified independently by Spider2-E:
| Artifact | Directory | What it is |
|---|---|---|
| Gold SQL queries | `evaluation_suite/gold/sql/` | The reference SQL the benchmark expects, one `.sql` file per instance. |
| Gold result CSVs | `evaluation_suite/gold/exec_result/` | The expected query output as CSV, one or more `.csv` files per instance (suffixed `_a.csv`, `_b.csv`, … when multiple variants are accepted). `evaluate.py` compares a submission's executed result against these. |
A submission is scored against the gold result CSVs (any matching variant counts as correct). The gold SQL queries exist as a reference but aren't directly used by evaluate.py. Throughout this README the two are treated as separate things — an instance can have one removed without the other.
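The any-variant-matches rule can be sketched as follows. This is a simplified illustration, not the actual `evaluate.py` logic (the real script also applies `condition_cols`, ordering rules, etc.); the helper names are hypothetical, but the `<instance>_<letter>.csv` naming follows the gold directory layout described above.

```python
import csv
from pathlib import Path

def rows_of(csv_path: Path) -> list[list[str]]:
    """Read a result CSV into a list of rows (header included)."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f)]

def matches_any_gold(pred_csv: Path, gold_dir: Path, instance_id: str) -> bool:
    """Score as correct if the predicted result equals ANY gold variant
    (<instance>_a.csv, <instance>_b.csv, ...) for this instance."""
    variants = sorted(gold_dir.glob(f"{instance_id}_*.csv"))
    pred = rows_of(pred_csv)
    return any(pred == rows_of(v) for v in variants)
```

This is why adding a new accepted variant (§3) never breaks existing submissions: the original variants stay in place and any one match suffices.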
| Artifact | Upstream | Spider2-E | Net change |
|---|---|---|---|
| Instance set (`spider2-snow.jsonl`) | 547 instances | 524 instances | −23 instances |
| Eval config (`gold/spider2snow_eval.jsonl`) | 547 instances | 524 instances | −23 instances |
| Gold SQL queries (`gold/sql/`) | 120 files | 118 files | −3 + 1 = −2 net files |
| Gold result CSVs (`gold/exec_result/`) | 1,544 CSVs | 1,469 CSVs | −80 + 5 = −75 net CSVs |
| Evaluation script (`evaluation_suite/evaluate.py`) | upstream | patched | hardcoded-SQL rejection + empty-result fix |
Four categories of change:
- 23 instances removed entirely — gold cannot be reproduced against live Snowflake (revoked data shares or pervasive data drift on rolling/refreshed datasets). Both the test entry and all gold artifacts deleted.
- 3 gold SQLs removed from upstream + 1 deterministic replacement added — see §2. Of the 3 removed, 2 instances (`sf012`, `sf040`) are also fully removed in §1b. The third (`sf_bq294`) is kept and gets a new deterministic gold SQL written from scratch (it anchors the reference year at 2025 instead of using `CURRENT_DATE`), reproducing all 3 upstream gold result variants.
- 5 gold result CSV variants added — for instances where BigQuery vs Snowflake engine semantics legitimately produce different output from the upstream gold; today's Snowflake output is added as an additional accepted variant.
- Evaluation script patched — rejects SQL submissions that bake gold values into `CASE` lookup tables, and treats empty-result-vs-empty-gold as a match instead of a failure.
For every instance not listed in §1–§3, the upstream gold artifacts are kept verbatim — Spider2-E is a strict subset-plus-deltas of upstream gold.
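The bookkeeping in the at-a-glance table is easy to sanity-check; the counts below are taken directly from the tables and prose in this README (10 revoked-share CSVs from §1a plus 70 data-drift CSVs from §1b make up the 80 deletions):

```python
# Instance set and eval config: 23 instances removed outright.
assert 547 - 23 == 524

# Gold SQL files: 3 upstream removals, 1 deterministic replacement added.
assert 120 - 3 + 1 == 118

# Gold result CSVs: 80 deleted (10 revoked-share + 70 data-drift), 5 variants added.
assert 1544 - 80 + 5 == 1469
```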
For these 23 instances both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) have been removed. Two reasons.
The Snowflake data shares behind these instances have been revoked by their publishers. `SHOW DATABASES` returns a tombstone (`origin = <revoked>`) and any query against them returns:

```
003030 (02000): SQL compilation error:
Shared database is no longer available for use.
```
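A client can distinguish this tombstone state from ordinary query failures by the error code. A minimal client-side sketch (the helper name is hypothetical; the code and message text are taken from the error shown above):

```python
def is_revoked_share_error(error_message: str) -> bool:
    """Heuristically detect the Snowflake revoked-share failure mode by
    its error code (003030, SQLSTATE 02000) or its tombstone message."""
    return (
        "003030" in error_message
        or "Shared database is no longer available for use" in error_message
    )
```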
| Instance | db_id | Tables | Why |
|---|---|---|---|
| `sf009` | NETHERLANDS_OPEN_MAP_DATA | BUILDINGS | Share revoked by publisher |
| `sf013` | NETHERLANDS_OPEN_MAP_DATA | ROAD | Same revoked share as `sf009` |
| `sf029` | AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET | SALES, TRAFFIC, INVENTORY, NET_PPM | Share unmounted by publisher |
These instances cannot be re-enabled until the publishers republish the shares. None of the three had a gold SQL in upstream gold/sql/. 10 gold result CSVs deleted (5 sf009, 4 sf013, 1 sf029).
For these 20 instances the prompt is well-formed and the SQL can be executed, but the underlying Snowflake table has drifted (rolling time windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold result was captured. Capturing today's output as a new gold variant only buys correctness for one evaluation cycle — the next agent will face the same drift and a stale variant. Removal is the only stable answer.
| Instance | db_id | Drift symptom |
|---|---|---|
| `sf006` | FINANCE__ECONOMICS | Same 56 states, refreshed branch counts and pct-change values |
| `sf008` | US_REAL_ESTATE | Home price index value drifted (6.55 → 6.8) |
| `sf012` | WEATHER__ENVIRONMENT | 2012 FEMA NFIP claim amounts already disagree across the 3 upstream gold variants and differ from live (~492.83 / .84 / .89 M vs 492.89 M today) — Cybersyn-managed table is mutable |
| `sf037` | US_REAL_ESTATE | POI_ID format changed from hex to UUID upstream |
| `sf040` | US_ADDRESSES__POI | Top-10 northernmost addresses shift as records are added/corrected (gold SQL also removed — see §2) |
| `sf_bq009` | GA360 | Revenue difference value drifted (4659.15 → 32785.17) |
| `sf_bq024` | USFS_FIA | EVALUATION_TYPE shifted from EXPCURR/EXPCHNG to EXPMORT |
| `sf_bq058` | GOOG_BLOCKCHAIN | Optimism blockchain data not present in this Snowflake share |
| `sf_bq063` | DEPS_DEV_V1 | Top-result GitHub URL shifts as npm versions are published |
| `sf_bq102` | GNOMAD | Genomic start position moved (41197780 → 41276093) — assembly/annotation update |
| `sf_bq130` | COVID19_NYT | Result granularity shifted from county-level to state-level |
| `sf_bq165` | TCGA_MITELMAN | MITELMAN cohort total changed (3373 → 3377); 6 of 320 chromosomal bands now have ±1 count differences |
| `sf_bq190` | THELOOK_ECOMMERCE | Oldest-female-user count drifted (465 → 467) |
| `sf_bq249` | GITHUB_REPOS | Whitespace-category counts shifted as repo data grew |
| `sf_bq256` | CRYPTO | Final Ethereum balance value sign flipped — possibly aggregation-direction divergence rather than pure drift |
| `sf_bq275` | GA360 | GA360 sessions are rolling/dynamic — fullVisitorId set changes per capture |
| `sf_bq366` | THE_MET | Same rows as gold, different ordering (font/drawing swap) |
| `sf_ga014` | GA4 | Live GA4 session counts drift slightly between captures |
| `sf_ga018` | GA4 | PLP-to-PDP conversion ratio changed (26.68% → 21.68%) |
| `sf_ga021` | FIREBASE | Cohort event type now resolves to level_complete_quickplay instead of level_start_quickplay |
For each of the 20: the test-set entry and eval-config row are removed and all gold result CSV variants are deleted (70 CSVs total). Two of the 20 (`sf012`, `sf040`) had a gold SQL in upstream `gold/sql/`; both were removed as part of the gold-SQL cleanup in §2.
Three upstream gold SQLs were removed because each is non-deterministic and would produce a different result on every evaluation. One of the three (sf_bq294) was then replaced with a deterministic version derived from a verified-correct agent submission. The other two had their instances also fully removed (see §1b). Net file count: 120 → 118 (−3 + 1 = −2).
| Instance | db_id | Why upstream gold SQL was removed | Spider2-E status |
|---|---|---|---|
| `sf_bq294` | SAN_FRANCISCO_PLUS | Upstream gold SQL embeds `EXTRACT(YEAR FROM CURRENT_DATE)` to compute member age — a calendar-dependent expression in a comparison column. Every age value increments by +1 each calendar year past 2025. | Instance kept; replaced with deterministic gold SQL. The new SQL anchors the reference year at the literal 2025 (matching the year the upstream gold result CSVs were captured) and is written for Snowflake syntax. Verified to score 1 against all 3 upstream gold variants under the eval `condition_cols`. |
| `sf012` | WEATHER__ENVIRONMENT | Upstream gold SQL aggregates the live Cybersyn FEMA NFIP claim table; the 3 pre-existing gold result variants already disagree with each other for year 2012, and live values differ from all three. The dataset is mutable. | Instance fully removed in §1b (data-drift case) |
| `sf040` | US_ADDRESSES__POI | Top-10 northernmost addresses from the live Cybersyn US_ADDRESSES table; the result set shifts as records are added/corrected. The upstream gold SQL was also revised post-generation, leaving one of its gold variants structurally incompatible with the revised SQL. | Instance fully removed in §1b (data-drift case) |
Thus the at-a-glance figure of −3 + 1 = −2 net files for `gold/sql/` covers 3 removals (`sf012`, `sf040`, and the upstream `sf_bq294`) plus 1 addition (the Spider2-E `sf_bq294`).
For these 5 instances the agent's SQL is structurally correct but Snowflake legitimately produces a different result from the upstream gold because of fundamental BigQuery vs Snowflake engine differences (missing UDFs, spatial-function semantics, hash/UUID generation, array unnesting, float precision). The agent's SQL was executed against live Snowflake and the result saved as a new gold result variant `<instance>_<next_letter>.csv` under `gold/exec_result/`. The original gold variants are untouched; `evaluate.py` accepts any matching variant.
| Instance | db_id | New variant | Rows × Cols | Reason |
|---|---|---|---|---|
| `sf_bq111` | TCGA_MITELMAN | `sf_bq111_f.csv` | 21 × 9 | BigQuery corr_pvalue UDF has no Snowflake equivalent; p-values and correlations diverge |
| `sf_bq276` | NOAA_PORTS | `sf_bq276_d.csv` | 234 × 9 | Spatial predicate evaluates port/storm geometry differently across engines (different port matched) |
| `sf_bq430` | EBI_CHEMBL | `sf_bq430_f.csv` | 810 × 6 | UUID/hash generation from activity IDs and SMILES strings differs between engines |
| `sf_bq458` | WORD_VECTORS_US | `sf_bq458_e.csv` | 2,371 × 4 | UNNEST vs LATERAL FLATTEN of article-vector arrays drops different rows; original gold is empty |
| `sf_local344` | F1 | `sf_local344_c.csv` | 4 × 2 | Float-boundary edge case yields a different overtake count (gold 28 vs exec 20) |
Note: `sf_bq458` was captured by running its SQL on the `SNOWFLAKE_LEARNING_WH` warehouse — the `COMPUTE_WH_PARTICIPANT` warehouse used by default has a 120 s statement timeout that the query exceeds (~258 s actual runtime). Re-evaluation against this variant requires a warehouse without that ceiling.
Spider2-E previously also added gold result variants for the data-drift cases now listed in §1b. Those variants were removed when the underlying instances were removed — adding a freshly-captured variant didn't solve the problem since the data continued to drift faster than the variant could keep up.
`evaluation_suite/evaluate.py` has three behavior changes versus upstream: hardcoded-SQL rejection, the empty-result fix, and `condition_cols` padding. Comparison logic, threading, CLI surface, and output format are otherwise unchanged.
Submissions that bake gold values into CASE lookup tables are rejected with score = 0 before execution, instead of being allowed to "pass" by trivially matching the expected output.
Reject signals (SQL-structural — no dependency on the gold result CSVs):
| Signal | Triggers when | Action |
|---|---|---|
| `NO_FROM` | SELECT with no FROM clause (pure literal return). | Reject |
| `HARDCODED_CASE_MAP` | ≥ 2 `WHEN 'literal' THEN 'literal'` pairs where the THEN values are strings ≥ 20 chars. This is the canonical "hardcoded ID-to-answer-value lookup table" pattern. | Reject |
| `VALUES_CLAUSE` | SQL contains a `VALUES (...)` constructor. | Warn only (common in legitimate inline reference tables) |
| `UNION_ALL_LITERALS` | Two or more pure-literal SELECT blocks joined by UNION ALL. | Warn only (common in legitimate inline scaffolds) |
HARDCODED_CASE_MAP is the discriminating signal — it has zero known false positives across the 547 upstream instances and catches the canonical hardcoding pattern (e.g., WHEN '0000000000…' THEN 'TRC20-wallet-address' for 5 paired hex→address mappings). Earlier draft signals based on IN (...) filter contents and on overlap between SQL literals and gold CSV cell values were dropped because they could not be distinguished from legitimate enum filtering on documented schemas (NAICS industry codes, SNOMED concept IDs, DICOM transfer-syntax UIDs, GA event names, well-known contract addresses, etc.).
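A minimal sketch of the two rejecting signals. The thresholds (≥ 2 pairs, THEN strings ≥ 20 chars) come from the table above, but the regexes and function name here are illustrative — the patched `evaluate.py` may differ in detail:

```python
import re

THEN_MIN_LEN = 20  # from the signal definition: THEN literals this long look like answers
MIN_PAIRS = 2      # from the signal definition: at least two literal->literal pairs

def hardcoded_flags(sql: str) -> list[str]:
    """Return the structural reject signals found in a submitted SQL string."""
    flags = []
    # NO_FROM: a SELECT with no FROM clause anywhere (pure literal return).
    if re.search(r"\bSELECT\b", sql, re.I) and not re.search(r"\bFROM\b", sql, re.I):
        flags.append("NO_FROM")
    # HARDCODED_CASE_MAP: >= MIN_PAIRS of WHEN 'literal' THEN 'long literal'.
    then_values = re.findall(r"WHEN\s+'[^']*'\s+THEN\s+'([^']*)'", sql, re.I)
    if sum(len(v) >= THEN_MIN_LEN for v in then_values) >= MIN_PAIRS:
        flags.append("HARDCODED_CASE_MAP")
    return flags
```

Because both checks are purely structural, a submission can be rejected without ever consulting the gold result CSVs.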
When a submission is rejected, `error_info` reports the matched flag and the offending values, so the verdict can be audited. Pass `--allow_hardcoded` to disable rejection entirely.
Upstream evaluate.py short-circuited any predicted SQL that returned zero rows with "No data found for the specified query.", marking it as a failure even when the gold for that instance was also empty. The patched script saves the header-only CSV and runs the normal comparison: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.
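The behavioral difference can be sketched as follows (heavily simplified — the real script compares parsed CSVs with `condition_cols` and ordering rules; the function name is hypothetical):

```python
def score(pred_rows: list, gold_rows: list, upstream: bool = False) -> int:
    """Illustrates the empty-result fix. With upstream=True, a zero-row
    prediction is short-circuited to failure even when the gold is also
    empty; the patched behavior runs the normal comparison instead, so
    empty-vs-empty matches and empty-vs-non-empty still fails."""
    if upstream and not pred_rows:
        return 0  # upstream: "No data found for the specified query."
    return int(pred_rows == gold_rows)
```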
When the per-instance eval config has fewer condition_cols entries than there are gold result variants, the missing entries are now padded with [] (compare full row) instead of crashing with IndexError. This was hit by the data-drift additions before they were removed; it remains useful as a defensive default for any future variant additions.
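The padding fix, sketched (illustrative; the real entries come from `spider2snow_eval.jsonl` and the function name is hypothetical):

```python
def pad_condition_cols(condition_cols: list[list[str]],
                       n_variants: int) -> list[list[str]]:
    """Defensive default: if the eval config lists fewer condition_cols
    entries than there are gold result variants, pad with [] (meaning
    'compare the full row') instead of raising IndexError on lookup."""
    missing = n_variants - len(condition_cols)
    return condition_cols + [[] for _ in range(max(0, missing))]
```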
```
Spider2-E/
├── README.md                          # This file
├── README-spider2-snow.md             # Upstream README, verbatim
├── requirements.txt                   # Pinned third-party dependencies for evaluate.py
├── .gitignore                         # Excludes credentials, env files, logs, caches
├── .github/
│   └── dependabot.yml                 # Weekly pip + github-actions dependency-update PRs
├── spider2-snow.jsonl                 # Test set (524 instances; upstream 547)
├── evaluation_suite/
│   ├── README.md                      # Upstream evaluation README, verbatim
│   ├── evaluate.py                    # Patched (see §4)
│   ├── evaluate_utils.py              # Upstream
│   ├── example_submission_folder/     # Upstream sample SQL submissions
│   ├── example_submission_folder_csv/ # Upstream sample CSV submissions
│   └── gold/
│       ├── spider2snow_eval.jsonl     # Per-instance condition_cols / ignore_order (524 records)
│       ├── sql/                       # Gold SQL queries (118 files; −3 + 1 = −2 net vs upstream)
│       └── exec_result/               # Gold result CSVs (1,469 files; −80 + 5 = −75 net vs upstream)
└── resource/                          # Upstream supporting docs and schemas
```
Setup (one-time):
- Install pinned dependencies (a fresh virtual environment is recommended): `pip install -r requirements.txt`
- Place your `snowflake_credential.json` in `evaluation_suite/`. The repo `.gitignore` already excludes `*credential*.json`, so the file is safe from accidental commits — do not rename or relocate it in a way that bypasses that rule.
Run (CLI usage is the same as upstream):
```shell
cd evaluation_suite
python evaluate.py --result_dir <your_predicted_sqls_folder> --mode sql
# or
python evaluate.py --result_dir <your_predicted_csvs_folder> --mode exec_result
```

New flag (SQL mode only):

- `--allow_hardcoded` — disable hardcoded-SQL rejection. Default is off, i.e. submissions flagged by `NO_FROM` or `HARDCODED_CASE_MAP` (see §4a) score 0.
Sample submissions are under evaluation_suite/example_submission_folder/ and evaluation_suite/example_submission_folder_csv/. See evaluation_suite/README.md for the upstream submission-folder convention.
- Upstream: `xlang-ai/Spider2`, `spider2-snow` directory.
- Spider2-E commits are scoped to gold-artifact changes and the evaluation-script patch only; no edits to task instructions, supporting resources, or the upstream eval-suite scaffolding.
- db_id verification for every modified instance was done against the upstream `spider2-snow.jsonl`.
- All Snowflake executions referenced above were performed in late April 2026 using the standard `snowflake-connector-python` SDK against credentials provisioned for the `PARTICIPANT` role.