feat(snowflake): GeoArrow support for bulk ingestion (GEOGRAPHY/GEOMETRY) #4114
Closed
jatorre wants to merge 1 commit into apache:main from
Conversation
…EOMETRY)

Detect geoarrow.wkb/geoarrow.wkt columns during adbc_insert and create GEOGRAPHY or GEOMETRY columns in Snowflake, with automatic WKB→geo conversion and SRID support.

How it works:

1. Bulk ingest loads data as BINARY via existing Parquet→PUT→COPY INTO
2. After COPY, geoarrow columns are detected from Arrow field metadata (ARROW:extension:name) and converted via CTAS with TO_GEOGRAPHY or TO_GEOMETRY. SRID is extracted from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN") and applied via ST_SETSRID for GEOMETRY columns.

The CTAS post-processing is needed because Snowflake's COPY INTO from Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only CSV and JSON/AVRO support direct geospatial loading from stages. See: https://docs.snowflake.com/en/sql-reference/data-types-geospatial#loading-geospatial-data-from-stages

New statement option:

- adbc.snowflake.statement.ingest_geo_type: "geography" (default) or "geometry". GEOGRAPHY is WGS84/SRID 4326; GEOMETRY supports any SRID.

Benchmarked with Czech Republic OSM Geofabrik data against Snowflake:

- Points (465K): 38,119 rows/sec
- LineStrings (1.9M): 56,804 rows/sec
- Polygons (5M): 68,611 rows/sec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
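The SRID extraction step described above (PROJJSON or "EPSG:NNNN") can be sketched roughly as follows. This is a minimal stand-in, not the driver's actual code: `extractSRID`, its string-based signature, and the trimmed-down PROJJSON struct are illustrative assumptions (the PR's real helper is named `extractSRIDFromMeta` and presumably operates on Arrow metadata).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// projjsonCRS models just the "id" member of a PROJJSON document,
// e.g. {"id": {"authority": "EPSG", "code": 4326}}.
type projjsonCRS struct {
	ID struct {
		Authority string `json:"authority"`
		Code      int    `json:"code"`
	} `json:"id"`
}

// extractSRID pulls an SRID out of a geoarrow "crs" metadata value,
// accepting either the plain "EPSG:NNNN" form or a PROJJSON document
// carrying an EPSG id.
func extractSRID(crs string) (int, bool) {
	crs = strings.TrimSpace(crs)
	if strings.HasPrefix(crs, "EPSG:") {
		if srid, err := strconv.Atoi(strings.TrimPrefix(crs, "EPSG:")); err == nil {
			return srid, true
		}
		return 0, false
	}
	if strings.HasPrefix(crs, "{") {
		var doc projjsonCRS
		if err := json.Unmarshal([]byte(crs), &doc); err == nil &&
			strings.EqualFold(doc.ID.Authority, "EPSG") && doc.ID.Code != 0 {
			return doc.ID.Code, true
		}
	}
	return 0, false
}

func main() {
	fmt.Println(extractSRID("EPSG:4326")) // 4326 true
	fmt.Println(extractSRID(`{"id":{"authority":"EPSG","code":3857}}`))
}
```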
Moving to adbc-drivers/snowflake per maintainer request. Will re-open there.
Summary
Adds geospatial column support to the Snowflake ADBC driver's bulk ingestion path. When Arrow columns carry geoarrow.wkb or geoarrow.wkt extension metadata, the driver automatically creates GEOGRAPHY or GEOMETRY columns in Snowflake and converts the data.

- Detection via ARROW:extension:name field metadata (handles the C Data Interface, where Go-level extension types are stripped)
- New statement option adbc.snowflake.statement.ingest_geo_type: "geography" (default, WGS84/4326) or "geometry" (any SRID)
- SRID extraction from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN" format) for GEOMETRY columns

How it works
1. Bulk ingest loads WKB data as BINARY via the existing Parquet→PUT→COPY INTO path
2. Geoarrow columns are detected from Arrow field metadata and converted via CTAS with TO_GEOGRAPHY/TO_GEOMETRY
3. SRID is applied via ST_SETSRID if present in geoarrow metadata
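The metadata-based detection could look like the sketch below. It is an assumption for illustration: a plain `map[string]string` stands in for Arrow field metadata, and `geoExtension` is a hypothetical helper name, not the driver's actual function.

```go
package main

import "fmt"

// geoExtension reports whether a column's Arrow field metadata marks it
// as GeoArrow data. Reading the raw ARROW:extension:name key (instead of
// relying on a registered Go extension type) is what makes this work
// over the C Data Interface, where Go-level extension types are stripped
// and only the key/value metadata survives.
func geoExtension(fieldMeta map[string]string) (string, bool) {
	switch name := fieldMeta["ARROW:extension:name"]; name {
	case "geoarrow.wkb", "geoarrow.wkt":
		return name, true
	default:
		return "", false
	}
}

func main() {
	meta := map[string]string{"ARROW:extension:name": "geoarrow.wkb"}
	ext, ok := geoExtension(meta)
	fmt.Println(ext, ok) // geoarrow.wkb true
}
```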
Why CTAS instead of direct COPY INTO GEOGRAPHY?

Snowflake's COPY INTO from Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only CSV and JSON/AVRO support direct geospatial loading from stages (docs). The CTAS workaround (rename → CTAS with conversion → drop staging) adds minimal overhead at scale.
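The rename → CTAS → drop sequence can be sketched as SQL generation. This is a sketch under stated assumptions, not the statements the driver actually emits: the `_adbc_staging` suffix, the `SELECT * REPLACE` shape, and the `ctasConvert` helper are all illustrative.

```go
package main

import "fmt"

// ctasConvert builds the post-COPY conversion statements: rename the
// loaded table (WKB sitting in a BINARY column), recreate it via CTAS
// with that column converted to GEOGRAPHY or GEOMETRY, then drop the
// staging copy. ST_SETSRID is applied only for GEOMETRY when an SRID
// was extracted from the geoarrow metadata.
func ctasConvert(table, geoCol, geoType string, srid int) []string {
	staging := table + "_adbc_staging" // hypothetical staging-table suffix
	expr := fmt.Sprintf("TO_GEOGRAPHY(%s)", geoCol)
	if geoType == "geometry" {
		expr = fmt.Sprintf("TO_GEOMETRY(%s)", geoCol)
		if srid != 0 {
			expr = fmt.Sprintf("ST_SETSRID(%s, %d)", expr, srid)
		}
	}
	return []string{
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", table, staging),
		fmt.Sprintf("CREATE TABLE %s AS SELECT * REPLACE (%s AS %s) FROM %s",
			table, expr, geoCol, staging),
		fmt.Sprintf("DROP TABLE %s", staging),
	}
}

func main() {
	for _, stmt := range ctasConvert("roads", "geom", "geometry", 3857) {
		fmt.Println(stmt)
	}
}
```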
A future optimization could use COPY transforms (SELECT ... FROM @stage) to convert inline.

Benchmark results
Tested with Czech Republic OSM Geofabrik data (real-world geometries):

| Dataset | Rows | Throughput |
| --- | --- | --- |
| Points | 465K | 38,119 rows/sec |
| LineStrings | 1.9M | 56,804 rows/sec |
| Polygons | 5M | 68,611 rows/sec |
This is a 4.4x improvement over the previous approach (WKT string + staging table + server-side TRY_TO_GEOGRAPHY, ~8,600 rows/sec) and approaches SnowSQL staging performance (79K rows/sec) without needing the SnowSQL CLI.
Context
This is part of a broader effort to add GeoArrow support across ADBC drivers for seamless geospatial data transfer between DuckDB and cloud data warehouses. Related work:
Test plan
- `toSnowflakeType` with geoarrow extension types
- `extractSRIDFromMeta` (PROJJSON, simple EPSG string, null, empty, invalid)
- `TestIngestBatchedParquetWithFileLimit` still passes

🤖 Generated with Claude Code