feat(snowflake): GeoArrow support for bulk ingestion (GEOGRAPHY/GEOMETRY)#4114

Closed
jatorre wants to merge 1 commit into apache:main from jatorre:snowflake-geoarrow-import

Conversation


@jatorre jatorre commented Mar 17, 2026

Summary

Adds geospatial column support to the Snowflake ADBC driver's bulk ingestion path. When Arrow columns carry geoarrow.wkb or geoarrow.wkt extension metadata, the driver automatically creates GEOGRAPHY or GEOMETRY columns in Snowflake and converts the data.

  • Detects geoarrow columns from ARROW:extension:name field metadata (handles the C Data Interface path, where Go-level extension types are stripped)
  • New statement option adbc.snowflake.statement.ingest_geo_type: "geography" (default, WGS84/4326) or "geometry" (any SRID)
  • Extracts SRID from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN" format) for GEOMETRY columns
  • Unit tests for type mapping and SRID extraction

How it works

  1. Bulk ingest loads data as BINARY via the existing Parquet → PUT → COPY INTO pipeline (unchanged)
  2. After COPY, geoarrow columns are detected and converted via CTAS with TO_GEOGRAPHY/TO_GEOMETRY
  3. For GEOMETRY columns, SRID is applied via ST_SETSRID if present in geoarrow metadata

Why CTAS instead of direct COPY INTO GEOGRAPHY?

Snowflake's COPY INTO from Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only CSV and JSON/AVRO support direct geospatial loading from stages (docs). The CTAS workaround (rename → CTAS with conversion → drop staging) adds minimal overhead at scale.

A future optimization could use COPY transforms (SELECT ... FROM @stage) to convert inline.

Benchmark results

Tested with Czech Republic OSM Geofabrik data (real-world geometries):

| Dataset | Rows | Throughput | Geometry type |
| --- | --- | --- | --- |
| POIs | 465,280 | 38,119 rows/sec | Point |
| Roads | 1,885,651 | 56,804 rows/sec | LineString |
| Buildings | 5,014,886 | 68,611 rows/sec | Polygon |

This is a 4.4x improvement over the previous approach (WKT string + staging table + server-side TRY_TO_GEOGRAPHY, ~8,600 rows/sec) and approaches SnowSQL staging performance (79K rows/sec) without needing the SnowSQL CLI.

Context

This is part of a broader effort to add GeoArrow support across ADBC drivers for seamless geospatial data transfer between DuckDB and cloud data warehouses. Related work:

Test plan

  • Unit tests for toSnowflakeType with geoarrow extension types
  • Unit tests for extractSRIDFromMeta (PROJJSON, simple EPSG string, null, empty, invalid)
  • Existing TestIngestBatchedParquetWithFileLimit still passes
  • End-to-end tested against real Snowflake with points, lines, and polygons
  • Verified GEOGRAPHY column type created in Snowflake via INFORMATION_SCHEMA
  • Integration test with GEOMETRY type + custom SRID (not yet tested against Snowflake)

🤖 Generated with Claude Code

feat(snowflake): GeoArrow support for bulk ingestion (GEOGRAPHY/GEOMETRY)

Detect geoarrow.wkb/geoarrow.wkt columns during adbc_insert and create
GEOGRAPHY or GEOMETRY columns in Snowflake, with automatic WKB→geo
conversion and SRID support.

How it works:
1. Bulk ingest loads data as BINARY via existing Parquet→PUT→COPY INTO
2. After COPY, geoarrow columns are detected from Arrow field metadata
   (ARROW:extension:name) and converted via CTAS with TO_GEOGRAPHY or
   TO_GEOMETRY. SRID is extracted from geoarrow CRS metadata (PROJJSON
   or "EPSG:NNNN") and applied via ST_SETSRID for GEOMETRY columns.

The CTAS post-processing is needed because Snowflake's COPY INTO from
Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only
CSV and JSON/AVRO support direct geospatial loading from stages. See:
https://docs.snowflake.com/en/sql-reference/data-types-geospatial#loading-geospatial-data-from-stages

New statement option:
- adbc.snowflake.statement.ingest_geo_type: "geography" (default) or
  "geometry". GEOGRAPHY is WGS84/SRID 4326; GEOMETRY supports any SRID.

Benchmarked with Czech Republic OSM Geofabrik data against Snowflake:
- Points (465K):    38,119 rows/sec
- LineStrings (1.9M): 56,804 rows/sec
- Polygons (5M):    68,611 rows/sec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jatorre jatorre requested a review from zeroshade as a code owner March 17, 2026 22:34

jatorre commented Mar 18, 2026

Moving to adbc-drivers/snowflake per maintainer request. Will re-open there.
