Skip to content

EmergenceAI/Spider2-E

Repository files navigation

Spider2-E

Spider2-E (Spider2-EmergenceAI) is a corrected, reproducible version of the Snowflake split of the Spider 2.0 benchmark by xlang-ai. The task definitions are unchanged. Spider2-E modifies the gold artifacts (test set, gold SQL queries, gold result CSVs) because some upstream gold items can no longer be reproduced against live Snowflake, and patches the evaluation script to reject SQL submissions that bake answer values into the query.

This document describes every change made versus the upstream spider2-snow directory and the reasoning behind it. Run setup is unchanged — see README-spider2-snow.md (verbatim copy of the upstream README). Evaluation usage is summarized below; the new flag is documented in §4.


Types of Ground truth artifacts

The Spider2 evaluation suite ships two distinct ground-truth artifacts per instance, both modified independently by Spider2-E:

Artifact Directory What it is
Gold SQL queries evaluation_suite/gold/sql/ The reference SQL the benchmark expects, one .sql file per instance.
Gold result CSVs evaluation_suite/gold/exec_result/ The expected query output as CSV, one or more .csv files per instance (suffixed _a.csv, _b.csv, … when multiple variants are accepted). evaluate.py compares a submission's executed result against these.

A submission is scored against the gold result CSVs (any matching variant counts as correct). The gold SQL queries exist as a reference but aren't directly used by evaluate.py. Throughout this README the two are treated as separate things — an instance can have one removed without the other.


What changed at a glance

Artifact Upstream Spider2-E Net change
Instance set (spider2-snow.jsonl) 547 instances 524 instances −23 instances
Eval config (gold/spider2snow_eval.jsonl) 547 instances 524 instances −23 instances
Gold SQL queries (gold/sql/) 120 files 118 files −3 + 1 = −2 net files
Gold result CSVs (gold/exec_result/) 1,544 CSVs 1,469 CSVs −80 + 5 = −75 net CSVs
Evaluation script (gold/../evaluate.py) upstream patched hardcoded-SQL rejection + empty-result fix

Four categories of change:

  1. 23 instances removed entirely — gold cannot be reproduced against live Snowflake (revoked data shares or pervasive data drift on rolling/refreshed datasets). Both the test entry and all gold artifacts deleted.
  2. 3 gold SQLs removed from upstream + 1 deterministic replacement added — see §2. Of the 3 removed, 2 instances (sf012, sf040) are also fully removed in §1b. The third (sf_bq294) is kept and gets a new deterministic gold SQL written from scratch (anchors the reference year at 2025 instead of using CURRENT_DATE), reproducing all 3 upstream gold result variants.
  3. 5 gold result CSV variants added — for instances where BigQuery vs Snowflake engine semantics legitimately produce different output from the upstream gold; today's Snowflake output is added as an additional accepted variant.
  4. Evaluation script patched — rejects SQL submissions that bake gold values into CASE lookup tables, and treats empty-result-vs-empty-gold as a match instead of a failure.

For every instance not listed in §1–§3, the upstream gold artifacts are kept verbatim — Spider2-E is a strict subset-plus-deltas of upstream gold.


1. Removed instances (23)

For these 23 instances both the test entry and all gold artifacts (gold SQL if present, all gold result CSV variants, eval-config row) have been removed. Two reasons.

1a. Snowflake data shares revoked (3)

The Snowflake data shares behind these instances have been revoked by their publishers. SHOW DATABASES returns a tombstone (origin = <revoked>) and any query against them returns:

003030 (02000): SQL compilation error:
Shared database is no longer available for use.
Instance db_id Tables Why
sf009 NETHERLANDS_OPEN_MAP_DATA BUILDINGS Share revoked by publisher
sf013 NETHERLANDS_OPEN_MAP_DATA ROAD Same revoked share as sf009
sf029 AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET SALES, TRAFFIC, INVENTORY, NET_PPM Share unmounted by publisher

These instances cannot be re-enabled until the publishers republish the shares. None of the three had a gold SQL in upstream gold/sql/. 10 gold result CSVs deleted (5 sf009, 4 sf013, 1 sf029).

1b. Live data drift — gold no longer reproducible (20)

For these 20 instances the prompt is well-formed and the SQL can be executed, but the underlying Snowflake table has drifted (rolling time windows, refreshed counts, regenerated identifiers, granularity shifts) since the upstream gold result was captured. Capturing today's output as a new gold variant only buys correctness for one evaluation cycle — the next agent will face the same drift and a stale variant. Removal is the only stable answer.

Instance db_id Drift symptom
sf006 FINANCE__ECONOMICS Same 56 states, refreshed branch counts and pct-change values
sf008 US_REAL_ESTATE Home price index value drifted (6.55 → 6.8)
sf012 WEATHER__ENVIRONMENT 2012 FEMA NFIP claim amounts already disagree across the 3 upstream gold variants and differ from live (~492.83 / .84 / .89 M vs 492.89 M today) — Cybersyn-managed table is mutable
sf037 US_REAL_ESTATE POI_ID format changed from hex to UUID upstream
sf040 US_ADDRESSES__POI Top-10 northernmost addresses shifts as records are added/corrected (gold SQL also removed historically — see §2)
sf_bq009 GA360 Revenue difference value drifted (4659.15 → 32785.17)
sf_bq024 USFS_FIA EVALUATION_TYPE shifted from EXPCURR/EXPCHNG to EXPMORT
sf_bq058 GOOG_BLOCKCHAIN Optimism blockchain data not present in this Snowflake share
sf_bq063 DEPS_DEV_V1 Top-result GitHub URL shifts as npm versions are published
sf_bq102 GNOMAD Genomic start position moved (41197780 → 41276093) — assembly/annotation update
sf_bq130 COVID19_NYT Result granularity shifted from county-level to state-level
sf_bq165 TCGA_MITELMAN MITELMAN cohort total changed (3373 → 3377); 6 of 320 chromosomal bands now have ±1 count differences
sf_bq190 THELOOK_ECOMMERCE Oldest-female-user count drifted (465 → 467)
sf_bq249 GITHUB_REPOS Whitespace-category counts shifted as repo data grew
sf_bq256 CRYPTO Final Ethereum balance value sign flipped — possibly aggregation-direction divergence rather than pure drift
sf_bq275 GA360 GA360 sessions are rolling/dynamic — fullVisitorId set changes per capture
sf_bq366 THE_MET Same rows as gold, different ordering (font/drawing swap)
sf_ga014 GA4 Live GA4 session counts drift slightly between captures
sf_ga018 GA4 PLP-to-PDP conversion ratio changed (26.68 % → 21.68 %)
sf_ga021 FIREBASE Cohort event-type now resolves to level_complete_quickplay instead of level_start_quickplay

For each of the 20: test-set entry removed, eval-config row removed, all gold result CSV variants deleted (70 CSVs total). Two of the 20 (sf012, sf040) had a gold SQL in upstream gold/sql/, both already removed historically; see §2.


2. Gold SQL changes (3 removed, 1 added)

Three upstream gold SQLs were removed because each is non-deterministic and would produce a different result on every evaluation. One of the three (sf_bq294) was then replaced with a deterministic version derived from a verified-correct agent submission. The other two had their instances also fully removed (see §1b). Net file count: 120 → 118 (−3 + 1 = −2).

Instance db_id Why upstream gold SQL was removed Spider2-E status
sf_bq294 SAN_FRANCISCO_PLUS Upstream gold SQL embeds EXTRACT(YEAR FROM CURRENT_DATE) to compute member age — a calendar-dependent expression in a comparison column. Every age value increments by +1 each calendar year past 2025. Instance kept; replaced with deterministic gold SQL. New SQL anchors the reference year at the literal 2025 (matching the year the upstream gold result CSVs were captured) and is written for Snowflake syntax. Verified to score 1 against all 3 upstream gold variants under the eval condition_cols.
sf012 WEATHER__ENVIRONMENT Upstream gold SQL aggregates the live Cybersyn FEMA NFIP claim table; the 3 pre-existing gold result variants already disagree with each other for year 2012, and live values differ from all three. The dataset is mutable. Instance fully removed in §1b (data-drift case)
sf040 US_ADDRESSES__POI Top-10 northernmost addresses from the live Cybersyn US_ADDRESSES table; the result set shifts as records are added/corrected. The upstream gold SQL was also revised post-generation, leaving one of its gold variants structurally incompatible with the revised SQL. Instance fully removed in §1b (data-drift case)

So the at-a-glance row −3 + 1 = −2 net files for gold/sql/ covers: 3 removals (sf012, sf040, sf_bq294-upstream) + 1 addition (sf_bq294-Spider2-E).


3. Added gold result variants (5)

For these 5 instances the agent's SQL is structurally correct but Snowflake legitimately produces a different result from the upstream gold because of fundamental BigQuery vs Snowflake engine differences (missing UDFs, spatial-function semantics, hash/UUID generation, array unnesting, float precision). The agent's SQL was executed against live Snowflake and the result saved as a new gold result variant <instance>_<next_letter>.csv under gold/exec_result/. The original gold variants are untouched; evaluate.py accepts any matching variant.

Instance db_id New variant Rows × Cols Reason
sf_bq111 TCGA_MITELMAN sf_bq111_f.csv 21 × 9 BigQuery corr_pvalue UDF has no Snowflake equivalent; p-values and correlations diverge
sf_bq276 NOAA_PORTS sf_bq276_d.csv 234 × 9 Spatial predicate evaluates port/storm geometry differently across engines (different port matched)
sf_bq430 EBI_CHEMBL sf_bq430_f.csv 810 × 6 UUID/hash generation from activity IDs and SMILES strings differs between engines
sf_bq458 WORD_VECTORS_US sf_bq458_e.csv 2,371 × 4 UNNEST vs LATERAL FLATTEN of article-vector arrays drops different rows; original gold is empty
sf_local344 F1 sf_local344_c.csv 4 × 2 Float-boundary edge case yields a different overtake count (gold 28 vs exec 20)

Note: sf_bq458 was captured by running its SQL on the SNOWFLAKE_LEARNING_WH warehouse — the COMPUTE_WH_PARTICIPANT warehouse used by default has a 120 s statement timeout that the query exceeds (~258 s actual runtime). Re-evaluation against this variant requires a warehouse without that ceiling.

Spider2-E previously also added gold result variants for the data-drift cases now listed in §1b. Those variants were removed when the underlying instances were removed — adding a freshly-captured variant didn't solve the problem since the data continued to drift faster than the variant could keep up.


4. Evaluation script changes

evaluation_suite/evaluate.py has two behavior changes versus upstream. Comparison logic, threading, CLI surface, and output format are otherwise unchanged.

4a. Hardcoded-SQL rejection (default on)

Submissions that bake gold values into CASE lookup tables are rejected with score = 0 before execution, instead of being allowed to "pass" by trivially matching the expected output.

Reject signals (SQL-structural — no dependency on the gold result CSVs):

Signal Triggers when Action
NO_FROM SELECT with no FROM clause (pure literal return). Reject
HARDCODED_CASE_MAP ≥ 2 WHEN 'literal' THEN 'literal' pairs where the THEN values are strings ≥ 20 chars. This is the canonical "hardcoded ID-to-answer-value lookup table" pattern. Reject
VALUES_CLAUSE SQL contains a VALUES (...) constructor. Warn only (common in legitimate inline reference tables)
UNION_ALL_LITERALS Two or more pure-literal SELECT blocks joined by UNION ALL. Warn only (common in legitimate inline scaffolds)

HARDCODED_CASE_MAP is the discriminating signal — it has zero known false positives across the 547 upstream instances and catches the canonical hardcoding pattern (e.g., WHEN '0000000000…' THEN 'TRC20-wallet-address' for 5 paired hex→address mappings). Earlier draft signals based on IN (...) filter contents and on overlap between SQL literals and gold CSV cell values were dropped because they could not be distinguished from legitimate enum filtering on documented schemas (NAICS industry codes, SNOMED concept IDs, DICOM transfer-syntax UIDs, GA event names, well-known contract addresses, etc.).

When a submission is rejected, error_info reports the matched flag and the offending values, so the verdict can be audited. Pass --allow_hardcoded to disable rejection entirely.

4b. Empty-result handling

Upstream evaluate.py short-circuited any predicted SQL that returned zero rows with "No data found for the specified query.", marking it as a failure even when the gold for that instance was also empty. The patched script saves the header-only CSV and runs the normal comparison: empty-vs-empty matches; empty-vs-non-empty still fails. No change for non-empty results.

4c. Per-variant condition_cols padding

When the per-instance eval config has fewer condition_cols entries than there are gold result variants, the missing entries are now padded with [] (compare full row) instead of crashing with IndexError. This was hit by the data-drift additions before they were removed; it remains useful as a defensive default for any future variant additions.


Repository layout

Spider2-E/
├── README.md                            # This file
├── README-spider2-snow.md               # Upstream README, verbatim
├── requirements.txt                     # Pinned third-party dependencies for evaluate.py
├── .gitignore                           # Excludes credentials, env files, logs, caches
├── .github/
│   └── dependabot.yml                   # Weekly pip + github-actions dependency-update PRs
├── spider2-snow.jsonl                   # Test set (524 instances; upstream 547)
├── evaluation_suite/
│   ├── README.md                        # Upstream evaluation README, verbatim
│   ├── evaluate.py                      # Patched (see §4)
│   ├── evaluate_utils.py                # Upstream
│   ├── example_submission_folder/       # Upstream sample SQL submissions
│   ├── example_submission_folder_csv/   # Upstream sample CSV submissions
│   └── gold/
│       ├── spider2snow_eval.jsonl       # Per-instance condition_cols / ignore_order (524 records)
│       ├── sql/                         # Gold SQL queries (118 files; −3 + 1 = −2 net vs upstream)
│       └── exec_result/                 # Gold result CSVs (1,469 files; −80 + 5 = −75 net vs upstream)
└── resource/                            # Upstream supporting docs and schemas

How to evaluate

Setup (one-time):

  1. Install pinned dependencies (a fresh virtual environment is recommended):
    pip install -r requirements.txt
  2. Place your snowflake_credential.json in evaluation_suite/. The repo .gitignore already excludes *credential*.json, so the file is safe from accidental commits — do not rename or relocate it in a way that bypasses that rule.

Run (CLI usage is the same as upstream):

cd evaluation_suite
python evaluate.py --result_dir <your_predicted_sqls_folder> --mode sql
# or
python evaluate.py --result_dir <your_predicted_csvs_folder> --mode exec_result

New flag (SQL mode only):

  • --allow_hardcoded — disable hardcoded-SQL rejection. Default is off, i.e. submissions flagged by NO_FROM or HARDCODED_CASE_MAP (see §4a) score 0.

Sample submissions are under evaluation_suite/example_submission_folder/ and evaluation_suite/example_submission_folder_csv/. See evaluation_suite/README.md for the upstream submission-folder convention.


Provenance

  • Upstream: xlang-ai/Spider2spider2-snow directory.
  • Spider2-E commits are scoped to gold-artifact changes and the evaluation-script patch only; no edits to task instructions, supporting resources, or the upstream eval-suite scaffolding.
  • db_id verification for every modified instance was done against the upstream spider2-snow.jsonl.
  • All Snowflake executions referenced above were performed in late April 2026 using the standard snowflake-connector-python SDK against credentials provisioned for the PARTICIPANT role.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages