Spider2-E (Spider2-EmergenceAI) is a corrected, reproducible version of the Snowflake split of the Spider 2.0 benchmark by xlang-ai. The task definitions are unchanged. Spider2-E modifies the gold artifacts (test set, gold SQL queries, gold result CSVs) because some upstream gold items can no longer be reproduced against live Snowflake, and patches the evaluation script to reject SQL submissions that bake answer values into the query.
This document describes every change made versus the upstream spider2-snow directory and the reasoning behind it. Run setup is unchanged — see README-spider2-snow.md (verbatim copy of the upstream README). Evaluation usage is summarized below; the new flag is documented in §4.
The Spider2 evaluation suite ships two distinct ground-truth artifacts per instance, both modified independently by Spider2-E:
| Artifact | Directory | What it is |
|---|---|---|
| Gold SQL queries | `evaluation_suite/gold/sql/` | The reference SQL the benchmark expects, one `.sql` file per instance. |
| Gold result CSVs | `evaluation_suite/gold/exec_result/` | The expected query output as CSV, one or more `.csv` files per instance (suffixed `_a.csv`, `_b.csv`, … when multiple variants are accepted). `evaluate.py` compares a submission's executed result against these. |
A submission is scored against the gold result CSVs (any matching variant counts as correct). The gold SQL queries exist as a reference but aren't directly used by evaluate.py. Throughout this README the two are treated as separate things — an instance can have one removed without the other.
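The any-variant-matches rule can be sketched as follows. This is a simplified illustration, not the actual `evaluate.py` logic (the real script also applies `condition_cols`, ordering rules, etc.); the helper names are hypothetical, but the `<instance>_<letter>.csv` naming follows the gold directory layout described above.

```python
import csv
from pathlib import Path

def rows_of(csv_path: Path) -> list[list[str]]:
    """Read a result CSV into a list of rows (header included)."""
    with open(csv_path, newline="") as f:
        return [row for row in csv.reader(f)]

def matches_any_gold(pred_csv: Path, gold_dir: Path, instance_id: str) -> bool:
    """Score as correct if the predicted result equals ANY gold variant
    (<instance>_a.csv, <instance>_b.csv, ...) for this instance."""
    variants = sorted(gold_dir.glob(f"{instance_id}_*.csv"))
    pred = rows_of(pred_csv)
    return any(pred == rows_of(v) for v in variants)
```

This is why adding a new accepted variant (§3) never breaks existing submissions: the original variants stay in place and any one match suffices.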
| Artifact | Upstream | Spider2-E | Net change |
|---|---|---|---|
| Instance set (`spider2-snow.jsonl`) | 547 instances | 524 instances | −23 instances |
| Eval config (`gold/spider2snow_eval.jsonl`) | 547 instances | 524 instances | −23 instances |
| Gold SQL queries (`gold/sql/`) | 120 files | 118 files | −3 + 1 = −2 net files |
| Gold result CSVs (`gold/exec_result/`) | 1,544 CSVs | 1,469 CSVs | −80 + 5 = −75 net CSVs |
| Evaluation script (`evaluation_suite/evaluate.py`) | upstream | patched | hardcoded-SQL rejection + empty-result fix |
Four categories of change:
- 23 instances removed entirely — gold cannot be reproduced against live Snowflake (revoked data shares or pervasive data drift on rolling/refreshed datasets). Both the test entry and all gold artifacts deleted.
- 3 gold SQLs removed from upstream + 1 deterministic replacement added — see §2. Of the 3 removed, 2 instances (`sf012`, `sf040`) are also fully removed in §1b. The third (`sf_bq294`) is kept and gets a new deterministic gold SQL written from scratch (it anchors the reference year at 2025 instead of using `CURRENT_DATE`), reproducing all 3 upstream gold result variants.
- 5 gold result CSV variants added — for instances where BigQuery vs Snowflake engine semantics legitimately produce different output from the upstream gold; today's Snowflake output is added as an additional accepted variant.
- Evaluation script patched — rejects SQL submissions that bake gold values into `CASE` lookup tables, and treats empty-result-vs-empty-gold as a match instead of a failure.
For every instance not listed in §1–§3, the upstream gold artifacts are kept verbatim — Spider2-E is a strict subset-plus-deltas of upstream gold.
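The bookkeeping in the at-a-glance table is easy to sanity-check; the counts below are taken directly from the tables and prose in this README (10 revoked-share CSVs from §1a plus 70 data-drift CSVs from §1b make up the 80 deletions):

```python
# Instance set and eval config: 23 instances removed outright.
assert 547 - 23 == 524

# Gold SQL files: 3 upstream removals, 1 deterministic replacement added.
assert 120 - 3 + 1 == 118

# Gold result CSVs: 80 deleted (10 revoked-share + 70 data-drift), 5 variants added.
assert 1544 - 80 + 5 == 1469
```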
For these 23 instances both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) have been removed. Two reasons.
The Snowflake data shares behind these instances have been revoked by their publishers. `SHOW DATABASES` returns a tombstone (`origin = <revoked>`) and any query against them returns:

```
003030 (02000): SQL compilation error:
Shared database is no longer available for use.
```
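A client can distinguish this tombstone state from ordinary query failures by the error code. A minimal client-side sketch (the helper name is hypothetical; the code and message text are taken from the error shown above):

```python
def is_revoked_share_error(error_message: str) -> bool:
    """Heuristically detect the Snowflake revoked-share failure mode by
    its error code (003030, SQLSTATE 02000) or its tombstone message."""
    return (
        "003030" in error_message
        or "Shared database is no longer available for use" in error_message
    )
```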
| Instance | db_id | Tables | Why |
|---|---|---|---|
| `sf009` | NETHERLANDS_OPEN_MAP_DATA | BUILDINGS | Share revoked by publisher |
| `sf013` | NETHERLANDS_OPEN_MAP_DATA | ROAD | Same revoked share as `sf009` |
| `sf029` | AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET | SALES, TRAFFIC, INVENTORY, NET_PPM | Share unmounted by publisher |
These instances cannot be re-enabled until the publishers republish the shares. None of the three had a gold SQL in upstream gold/sql/. 10 gold result CSVs deleted (5 sf009, 4 sf013, 1 sf029).
For these 20 instances the prompt is well-formed and the SQL can be executed, but the underlying Snowflake table has drifted (rolling time windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold result was captured. Capturing today's output as a new gold variant only buys correctness for one evaluation cycle — the next agent will face the same drift and a stale variant. Removal is the only stable answer.
| Instance | db_id | Drift symptom |
|---|---|---|
| `sf006` | FINANCE__ECONOMICS | Same 56 states, refreshed branch counts and pct-change values |
| `sf008` | US_REAL_ESTATE | Home price index value drifted (6.55 → 6.8) |
| `sf012` | WEATHER__ENVIRONMENT | 2012 FEMA NFIP claim amounts already disagree across the 3 upstream gold variants and differ from live (~492.83 / .84 / .89 M vs 492.89 M today) — Cybersyn-managed table is mutable |
| `sf037` | US_REAL_ESTATE | POI_ID format changed from hex to UUID upstream |
| `sf040` | US_ADDRESSES__POI | Top-10 northernmost addresses shift as records are added/corrected (gold SQL also removed — see §2) |
| `sf_bq009` | GA360 | Revenue difference value drifted (4659.15 → 32785.17) |
| `sf_bq024` | USFS_FIA | EVALUATION_TYPE shifted from EXPCURR/EXPCHNG to EXPMORT |
| `sf_bq058` | GOOG_BLOCKCHAIN | Optimism blockchain data not present in this Snowflake share |
| `sf_bq063` | DEPS_DEV_V1 | Top-result GitHub URL shifts as npm versions are published |
| `sf_bq102` | GNOMAD | Genomic start position moved (41197780 → 41276093) — assembly/annotation update |
| `sf_bq130` | COVID19_NYT | Result granularity shifted from county-level to state-level |
| `sf_bq165` | TCGA_MITELMAN | MITELMAN cohort total changed (3373 → 3377); 6 of 320 chromosomal bands now have ±1 count differences |
| `sf_bq190` | THELOOK_ECOMMERCE | Oldest-female-user count drifted (465 → 467) |
| `sf_bq249` | GITHUB_REPOS | Whitespace-category counts shifted as repo data grew |
| `sf_bq256` | CRYPTO | Final Ethereum balance value sign flipped — possibly aggregation-direction divergence rather than pure drift |
| `sf_bq275` | GA360 | GA360 sessions are rolling/dynamic — fullVisitorId set changes per capture |
| `sf_bq366` | THE_MET | Same rows as gold, different ordering (font/drawing swap) |
| `sf_ga014` | GA4 | Live GA4 session counts drift slightly between captures |
| `sf_ga018` | GA4 | PLP-to-PDP conversion ratio changed (26.68% → 21.68%) |
| `sf_ga021` | FIREBASE | Cohort event type now resolves to level_complete_quickplay instead of level_start_quickplay |
For each of the 20: the test-set entry and eval-config row are removed and all gold result CSV variants are deleted (70 CSVs total). Two of the 20 (`sf012`, `sf040`) had a gold SQL in upstream `gold/sql/`; both were removed as part of the gold-SQL cleanup in §2.
Three upstream gold SQLs were removed because each is non-deterministic and would produce a different result on every evaluation. One of the three (sf_bq294) was then replaced with a deterministic version derived from a verified-correct agent submission. The other two had their instances also fully removed (see §1b). Net file count: 120 → 118 (−3 + 1 = −2).
| Instance | db_id | Why upstream gold SQL was removed | Spider2-E status |
|---|---|---|---|
| `sf_bq294` | SAN_FRANCISCO_PLUS | Upstream gold SQL embeds `EXTRACT(YEAR FROM CURRENT_DATE)` to compute member age — a calendar-dependent expression in a comparison column. Every age value increments by +1 each calendar year past 2025. | Instance kept; replaced with deterministic gold SQL. The new SQL anchors the reference year at the literal 2025 (matching the year the upstream gold result CSVs were captured) and is written for Snowflake syntax. Verified to score 1 against all 3 upstream gold variants under the eval `condition_cols`. |
| `sf012` | WEATHER__ENVIRONMENT | Upstream gold SQL aggregates the live Cybersyn FEMA NFIP claim table; the 3 pre-existing gold result variants already disagree with each other for year 2012, and live values differ from all three. The dataset is mutable. | Instance fully removed in §1b (data-drift case) |
| `sf040` | US_ADDRESSES__POI | Top-10 northernmost addresses from the live Cybersyn US_ADDRESSES table; the result set shifts as records are added/corrected. The upstream gold SQL was also revised post-generation, leaving one of its gold variants structurally incompatible with the revised SQL. | Instance fully removed in §1b (data-drift case) |
Thus the at-a-glance figure of −3 + 1 = −2 net files for `gold/sql/` covers 3 removals (`sf012`, `sf040`, and the upstream `sf_bq294`) plus 1 addition (the Spider2-E `sf_bq294`).
For these 5 instances the agent's SQL is structurally correct but Snowflake legitimately produces a different result from the upstream gold because of fundamental BigQuery vs Snowflake engine differences (missing UDFs, spatial-function semantics, hash/UUID generation, array unnesting, float precision). The agent's SQL was executed against live Snowflake and the result saved as a new gold result variant `<instance>_<next_letter>.csv` under `gold/exec_result/`. The original gold variants are untouched; `evaluate.py` accepts any matching variant.
| Instance | db_id | New variant | Rows × Cols | Reason |
|---|---|---|---|---|
| `sf_bq111` | TCGA_MITELMAN | `sf_bq111_f.csv` | 21 × 9 | BigQuery corr_pvalue UDF has no Snowflake equivalent; p-values and correlations diverge |
| `sf_bq276` | NOAA_PORTS | `sf_bq276_d.csv` | 234 × 9 | Spatial predicate evaluates port/storm geometry differently across engines (different port matched) |
| `sf_bq430` | EBI_CHEMBL | `sf_bq430_f.csv` | 810 × 6 | UUID/hash generation from activity IDs and SMILES strings differs between engines |
| `sf_bq458` | WORD_VECTORS_US | `sf_bq458_e.csv` | 2,371 × 4 | UNNEST vs LATERAL FLATTEN of article-vector arrays drops different rows; original gold is empty |
| `sf_local344` | F1 | `sf_local344_c.csv` | 4 × 2 | Float-boundary edge case yields a different overtake count (gold 28 vs exec 20) |
Note: `sf_bq458` was captured by running its SQL on the `SNOWFLAKE_LEARNING_WH` warehouse — the `COMPUTE_WH_PARTICIPANT` warehouse used by default has a 120 s statement timeout that the query exceeds (~258 s actual runtime). Re-evaluation against this variant requires a warehouse without that ceiling.
Spider2-E previously also added gold result variants for the data-drift cases now listed in §1b. Those variants were removed when the underlying instances were removed — adding a freshly-captured variant didn't solve the problem since the data continued to drift faster than the variant could keep up.
`evaluation_suite/evaluate.py` has three behavior changes versus upstream: hardcoded-SQL rejection, the empty-result fix, and `condition_cols` padding. Comparison logic, threading, CLI surface, and output format are otherwise unchanged.
Submissions that bake gold values into CASE lookup tables are rejected with score = 0 before execution, instead of being allowed to "pass" by trivially matching the expected output.
Reject signals (SQL-structural — no dependency on the gold result CSVs):
| Signal | Triggers when | Action |
|---|---|---|
| `NO_FROM` | SELECT with no FROM clause (pure literal return). | Reject |
| `HARDCODED_CASE_MAP` | ≥ 2 `WHEN 'literal' THEN 'literal'` pairs where the THEN values are strings ≥ 20 chars. This is the canonical "hardcoded ID-to-answer-value lookup table" pattern. | Reject |
| `VALUES_CLAUSE` | SQL contains a `VALUES (...)` constructor. | Warn only (common in legitimate inline reference tables) |
| `UNION_ALL_LITERALS` | Two or more pure-literal SELECT blocks joined by UNION ALL. | Warn only (common in legitimate inline scaffolds) |
HARDCODED_CASE_MAP is the discriminating signal — it has zero known false positives across the 547 upstream instances and catches the canonical hardcoding pattern (e.g., WHEN '0000000000…' THEN 'TRC20-wallet-address' for 5 paired hex→address mappings). Earlier draft signals based on IN (...) filter contents and on overlap between SQL literals and gold CSV cell values were dropped because they could not be distinguished from legitimate enum filtering on documented schemas (NAICS industry codes, SNOMED concept IDs, DICOM transfer-syntax UIDs, GA event names, well-known contract addresses, etc.).
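A minimal sketch of the two rejecting signals. The thresholds (≥ 2 pairs, THEN strings ≥ 20 chars) come from the table above, but the regexes and function name here are illustrative — the patched `evaluate.py` may differ in detail:

```python
import re

THEN_MIN_LEN = 20  # from the signal definition: THEN literals this long look like answers
MIN_PAIRS = 2      # from the signal definition: at least two literal->literal pairs

def hardcoded_flags(sql: str) -> list[str]:
    """Return the structural reject signals found in a submitted SQL string."""
    flags = []
    # NO_FROM: a SELECT with no FROM clause anywhere (pure literal return).
    if re.search(r"\bSELECT\b", sql, re.I) and not re.search(r"\bFROM\b", sql, re.I):
        flags.append("NO_FROM")
    # HARDCODED_CASE_MAP: >= MIN_PAIRS of WHEN 'literal' THEN 'long literal'.
    then_values = re.findall(r"WHEN\s+'[^']*'\s+THEN\s+'([^']*)'", sql, re.I)
    if sum(len(v) >= THEN_MIN_LEN for v in then_values) >= MIN_PAIRS:
        flags.append("HARDCODED_CASE_MAP")
    return flags
```

Because both checks are purely structural, a submission can be rejected without ever consulting the gold result CSVs.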
When a submission is rejected, `error_info` reports the matched flag and the offending values, so the verdict can be audited. Pass `--allow_hardcoded` to disable rejection entirely.
Upstream evaluate.py short-circuited any predicted SQL that returned zero rows with "No data found for the specified query.", marking it as a failure even when the gold for that instance was also empty. The patched script saves the header-only CSV and runs the normal comparison: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.
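The behavioral difference can be sketched as follows (heavily simplified — the real script compares parsed CSVs with `condition_cols` and ordering rules; the function name is hypothetical):

```python
def score(pred_rows: list, gold_rows: list, upstream: bool = False) -> int:
    """Illustrates the empty-result fix. With upstream=True, a zero-row
    prediction is short-circuited to failure even when the gold is also
    empty; the patched behavior runs the normal comparison instead, so
    empty-vs-empty matches and empty-vs-non-empty still fails."""
    if upstream and not pred_rows:
        return 0  # upstream: "No data found for the specified query."
    return int(pred_rows == gold_rows)
```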
When the per-instance eval config has fewer condition_cols entries than there are gold result variants, the missing entries are now padded with [] (compare full row) instead of crashing with IndexError. This was hit by the data-drift additions before they were removed; it remains useful as a defensive default for any future variant additions.
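The padding fix, sketched (illustrative; the real entries come from `spider2snow_eval.jsonl` and the function name is hypothetical):

```python
def pad_condition_cols(condition_cols: list[list[str]],
                       n_variants: int) -> list[list[str]]:
    """Defensive default: if the eval config lists fewer condition_cols
    entries than there are gold result variants, pad with [] (meaning
    'compare the full row') instead of raising IndexError on lookup."""
    missing = n_variants - len(condition_cols)
    return condition_cols + [[] for _ in range(max(0, missing))]
```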
```
Spider2-E/
├── README.md                          # This file
├── README-spider2-snow.md             # Upstream README, verbatim
├── requirements.txt                   # Pinned third-party dependencies for evaluate.py
├── .gitignore                         # Excludes credentials, env files, logs, caches
├── .github/
│   └── dependabot.yml                 # Weekly pip + github-actions dependency-update PRs
├── spider2-snow.jsonl                 # Test set (524 instances; upstream 547)
├── evaluation_suite/
│   ├── README.md                      # Upstream evaluation README, verbatim
│   ├── evaluate.py                    # Patched (see §4)
│   ├── evaluate_utils.py              # Upstream
│   ├── example_submission_folder/     # Upstream sample SQL submissions
│   ├── example_submission_folder_csv/ # Upstream sample CSV submissions
│   └── gold/
│       ├── spider2snow_eval.jsonl     # Per-instance condition_cols / ignore_order (524 records)
│       ├── sql/                       # Gold SQL queries (118 files; −3 + 1 = −2 net vs upstream)
│       └── exec_result/               # Gold result CSVs (1,469 files; −80 + 5 = −75 net vs upstream)
└── resource/                          # Upstream supporting docs and schemas
```
Setup (one-time):
- Install pinned dependencies (a fresh virtual environment is recommended): `pip install -r requirements.txt`
- Place your `snowflake_credential.json` in `evaluation_suite/`. The repo `.gitignore` already excludes `*credential*.json`, so the file is safe from accidental commits — do not rename or relocate it in a way that bypasses that rule.
Run (CLI usage is the same as upstream):
```shell
cd evaluation_suite
python evaluate.py --result_dir <your_predicted_sqls_folder> --mode sql
# or
python evaluate.py --result_dir <your_predicted_csvs_folder> --mode exec_result
```

New flag (SQL mode only):

- `--allow_hardcoded` — disable hardcoded-SQL rejection. Default is off, i.e. submissions flagged by `NO_FROM` or `HARDCODED_CASE_MAP` (see §4a) score 0.
Sample submissions are under evaluation_suite/example_submission_folder/ and evaluation_suite/example_submission_folder_csv/. See evaluation_suite/README.md for the upstream submission-folder convention.
- Upstream: `xlang-ai/Spider2`, `spider2-snow` directory.
- Spider2-E commits are scoped to gold-artifact changes and the evaluation-script patch only; no edits to task instructions, supporting resources, or the upstream eval-suite scaffolding.
- db_id verification for every modified instance was done against the upstream `spider2-snow.jsonl`.
- All Snowflake executions referenced above were performed in late April 2026 using the standard `snowflake-connector-python` SDK against credentials provisioned for the `PARTICIPANT` role.