feat: GeoArrow support for bulk ingestion (GEOGRAPHY/GEOMETRY) #99

Open
jatorre wants to merge 1 commit into adbc-drivers:main from jatorre:geoarrow-support

@jatorre jatorre commented Mar 18, 2026

Summary

Adds geospatial column support to the Snowflake ADBC driver's bulk ingestion path. When Arrow columns carry geoarrow.wkb or geoarrow.wkt extension metadata, the driver automatically creates GEOGRAPHY or GEOMETRY columns in Snowflake and converts the data.

  • Detects geoarrow columns from ARROW:extension:name field metadata (this also covers data arriving through the C Data Interface, where Go-level extension types are stripped)
  • New statement option adbc.snowflake.statement.ingest_geo_type: "geography" (default, WGS84/4326) or "geometry" (any SRID)
  • Extracts SRID from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN" format) for GEOMETRY columns
  • Unit tests for type mapping and SRID extraction
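
As a rough illustration of the detection and type-mapping step, the sketch below models Arrow field metadata as a plain `map[string]string` so it stays free of third-party imports. The function name `geoColumnType` is hypothetical and is not the PR's `toSnowflakeType`. The list of extension names follows the commit message, and treating any option value other than "geometry" as the default is an assumption of this sketch.

```go
package main

// extensionNameKey is the field-metadata key that carries the Arrow
// extension type name, as referenced in the summary above.
const extensionNameKey = "ARROW:extension:name"

// geoColumnType reports the Snowflake column type to create for a field.
// meta stands in for the Arrow field metadata; geoType is the value of the
// adbc.snowflake.statement.ingest_geo_type option.
// ok is false when the field is not a geoarrow column.
func geoColumnType(meta map[string]string, geoType string) (colType string, ok bool) {
	switch meta[extensionNameKey] {
	case "geoarrow.wkb", "geoarrow.wkb_view", "geoarrow.wkt", "geoarrow.wkt_view":
		// Recognised geoarrow encoding: continue below.
	default:
		return "", false
	}
	if geoType == "geometry" {
		return "GEOMETRY", true
	}
	// Assumption: every other value falls back to the default, "geography".
	// Option validation is out of scope for this sketch.
	return "GEOGRAPHY", true
}
```

Because the lookup uses only the metadata key, the same check would apply whether or not a Go-level extension type is attached to the field.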

How it works

  1. Detect geoarrow columns from Arrow extension metadata
  2. Create table with native GEOGRAPHY/GEOMETRY columns
  3. COPY INTO with transform: TO_GEOGRAPHY($1:"geom"::BINARY, true) converts WKB to GEOGRAPHY inline during COPY — no post-processing needed
  4. For GEOMETRY columns, SRID is applied via ST_SETSRID if present in geoarrow metadata
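
The SRID lookup in step 4 could look roughly like the following. This is a minimal sketch, not the PR's `extractSRIDFromMeta`: the function names are hypothetical, and it assumes the geoarrow metadata is a JSON document with a `crs` member that is either a string such as "EPSG:4326" or a PROJJSON object carrying an `id` with `authority` and `code`.

```go
package main

import (
	"encoding/json"
	"strconv"
	"strings"
)

// sridFromGeoArrowMetadata extracts an EPSG code from the JSON string stored
// as geoarrow extension metadata. ok is false when no SRID can be determined
// (missing, null, empty, or unrecognised CRS).
func sridFromGeoArrowMetadata(extMeta string) (srid int, ok bool) {
	var doc struct {
		CRS json.RawMessage `json:"crs"`
	}
	if err := json.Unmarshal([]byte(extMeta), &doc); err != nil || len(doc.CRS) == 0 {
		return 0, false
	}

	// Case 1: the CRS is a plain string such as "EPSG:4326".
	var s string
	if err := json.Unmarshal(doc.CRS, &s); err == nil {
		s = strings.ToUpper(strings.TrimSpace(s))
		if !strings.HasPrefix(s, "EPSG:") {
			return 0, false // also covers a null CRS, which decodes to ""
		}
		return positiveInt(strings.TrimPrefix(s, "EPSG:"))
	}

	// Case 2: the CRS is a PROJJSON object with an "id" member.
	var projjson struct {
		ID struct {
			Authority string      `json:"authority"`
			Code      json.Number `json:"code"`
		} `json:"id"`
	}
	if err := json.Unmarshal(doc.CRS, &projjson); err != nil {
		return 0, false
	}
	if !strings.EqualFold(projjson.ID.Authority, "EPSG") {
		return 0, false
	}
	return positiveInt(projjson.ID.Code.String())
}

// positiveInt parses s as a strictly positive integer.
func positiveInt(s string) (int, bool) {
	n, err := strconv.Atoi(s)
	if err != nil || n <= 0 {
		return 0, false
	}
	return n, true
}
```

Returning `ok == false` rather than an error lets the caller skip ST_SETSRID when the metadata carries no usable CRS, which matches the "if present" wording in step 4.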

Why COPY transform?

Snowflake's COPY INTO from Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only CSV and JSON/AVRO support direct geospatial loading from stages (docs).

The initial approach used a post-COPY CTAS pattern (rename → CREATE TABLE AS SELECT with conversion → drop staging). This PR replaces that with a COPY transform that applies TO_GEOGRAPHY/TO_GEOMETRY in the SELECT clause of the COPY subquery, eliminating 3 SQL round-trips and a full table rewrite.
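
To make the shape of the COPY transform concrete, here is a hedged sketch of how the statement could be assembled. The helper names, the table name `places`, and the stage name `ingest_stage` are hypothetical, and the `FILE_FORMAT` clause is an assumption; the driver's actual SQL may differ. Only WKB columns are handled, and the column name is not escaped.

```go
package main

import (
	"fmt"
	"strings"
)

// geoSelectExpr builds the SELECT expression that converts one WKB column
// read from staged Parquet. srid <= 0 means no SRID is known.
func geoSelectExpr(col string, geometry bool, srid int) string {
	// Note: col is interpolated as-is; a real implementation must escape it.
	src := fmt.Sprintf(`$1:"%s"::BINARY`, col)
	if !geometry {
		// Same expression as in the PR description.
		return fmt.Sprintf("TO_GEOGRAPHY(%s, true)", src)
	}
	expr := fmt.Sprintf("TO_GEOMETRY(%s)", src)
	if srid > 0 {
		expr = fmt.Sprintf("ST_SETSRID(%s, %d)", expr, srid)
	}
	return expr
}

// copyWithTransform places the per-column expressions in the SELECT clause
// of the COPY subquery, so conversion happens during the load itself.
func copyWithTransform(table, stage string, exprs []string) string {
	return fmt.Sprintf(
		"COPY INTO %s FROM (SELECT %s FROM @%s) FILE_FORMAT = (TYPE = PARQUET)",
		table, strings.Join(exprs, ", "), stage)
}
```

For a table with an `id` column and a geography column `geom`, `copyWithTransform("places", "ingest_stage", []string{`$1:"id"`, geoSelectExpr("geom", false, 0)})` yields a single statement, in contrast to the rename, CTAS, and drop sequence of the earlier approach.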

The original CTAS path is preserved as fallback for schemas without geo columns.

Benchmark results

COPY transform vs CTAS approach (50K random points, 10 runs each; statistics are over the N runs that completed):

                       Median      P25      P75      Min      Max   Rows/sec   N
─────────────────────────────────────────────────────────────────────────────────
CTAS (old approach)     8.13s    8.11s    8.26s    7.99s    8.85s      6,150   7
COPY transform (this)   6.11s    5.81s    6.15s    5.34s    7.44s      8,183  10

Speedup (median): 1.33x  (6,150 → 8,183 rows/sec)

The COPY transform also had zero transient failures vs 3/10 for the CTAS path (fewer SQL round-trips = fewer timeout opportunities).
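
The throughput and speedup figures above are plain ratios: rows divided by elapsed seconds, and old median time divided by new median time. The helpers below are illustrative only (their names are not from the PR) and can be used to re-derive the numbers in the table.

```go
package main

import "math"

// rowsPerSec converts a row count and an elapsed time into throughput.
func rowsPerSec(rows int, seconds float64) float64 {
	return float64(rows) / seconds
}

// speedup is the ratio of the old elapsed time to the new one.
func speedup(oldSeconds, newSeconds float64) float64 {
	return oldSeconds / newSeconds
}

// roundTo rounds x to the given number of decimal places.
func roundTo(x float64, places int) float64 {
	scale := math.Pow10(places)
	return math.Round(x*scale) / scale
}
```

With the 50K medians, `rowsPerSec(50000, 8.13)` rounds to 6,150 and `rowsPerSec(50000, 6.11)` to 8,183, and `speedup(8.13, 6.11)` rounds to 1.33, matching the reported values.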

End-to-end with real-world data (Czech Republic OpenStreetMap extract from Geofabrik):

Dataset      Rows        Throughput        Geometry type
────────────────────────────────────────────────────────
POIs           465,280   38,119 rows/sec   Point
Roads        1,885,651   56,804 rows/sec   LineString
Buildings    5,014,886   68,611 rows/sec   Polygon

At scale (500K rows, single runs):

Dataset         CTAS (old)        COPY transform    Speedup
───────────────────────────────────────────────────────────
500K points     29,499 rows/sec   39,339 rows/sec   1.33x
500K polygons   25,189 rows/sec   28,074 rows/sec   1.11x

Export (not in this PR)

Export/read-path geoarrow support is in a separate PR (#100). Detecting GEOGRAPHY/GEOMETRY columns on the read path is non-trivial because:

  • With GEOGRAPHY_OUTPUT_FORMAT=EWKB, srcMeta.Type becomes "binary" (type info lost)
  • With default GeoJSON format, srcMeta.Type is "object" (same as VARIANT/OBJECT)

Context

This is part of a broader effort to add GeoArrow support across ADBC drivers. Previously opened as apache/arrow-adbc#4114, moved here per maintainer request.

Test plan

  • Unit tests for toSnowflakeType with geoarrow extension types
  • Unit tests for extractSRIDFromMeta (PROJJSON, simple EPSG string, null, empty, invalid)
  • Existing TestIngestBatchedParquetWithFileLimit still passes
  • End-to-end tested against real Snowflake with points, lines, and polygons
  • Verified GEOGRAPHY column type created in Snowflake via INFORMATION_SCHEMA
  • Benchmarked COPY transform vs CTAS approach (10 iterations, median comparison)

Add transparent geometry import via geoarrow.wkb/wkt extension types.
The driver detects geoarrow columns in Arrow metadata and converts them
to Snowflake GEOGRAPHY or GEOMETRY using a COPY transform with inline
TO_GEOGRAPHY/TO_GEOMETRY conversion.

How it works:
  1. Detect geoarrow.wkb/wkb_view/wkt/wkt_view from Arrow extension
     types or ARROW:extension:name field metadata (C Data Interface)
  2. Create table with native GEOGRAPHY/GEOMETRY columns
  3. COPY INTO with transform: TO_GEOGRAPHY($1:"geom"::BINARY, true)
     converts WKB to GEOGRAPHY inline during COPY — no post-processing

Statement option:
  adbc.snowflake.statement.ingest_geo_type = "geography" (default) | "geometry"
  GEOGRAPHY is always WGS84 (SRID 4326). GEOMETRY supports any SRID,
  extracted from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN").

The COPY transform approach is ~1.33x faster than the alternative
rename+CTAS+drop pattern because it eliminates 3 SQL round-trips and
a full table rewrite:

  50K points (median, 10 runs):  8.13s → 6.11s  (6,150 → 8,183 rows/sec)
  500K points:                  16.95s → 12.71s (29,499 → 39,339 rows/sec)
  500K polygons:                19.85s → 17.81s (25,189 → 28,074 rows/sec)