
feat: expose Arrow-native geospatial option (databricks.arrow.native_geospatial)#350

Open
jatorre wants to merge 5 commits into adbc-drivers:main from jatorre:feat/arrow-native-geospatial

Conversation

@jatorre jatorre commented Mar 14, 2026

Summary

Adds Arrow-native geospatial support via the databricks.arrow.native_geospatial ADBC connection option, implementing the full pipeline from Databricks' proprietary geometry encoding to the standard GeoArrow geoarrow.wkb extension type.

Why geoarrow.wkb: Databricks uses a proprietary Struct<srid: Int32, wkb: Binary> format for Arrow geometry serialization. While this works, it's not recognized by downstream tools. By converting to geoarrow.wkb (the standard Arrow extension type for geometry), we get maximum compatibility: DuckDB, pandas/geopandas, polars, GDAL, and any other GeoArrow-aware consumer can read geometry natively without custom parsing. The conversion is essentially free (zero-copy pointer extraction, no data copying).

Pipeline

Databricks server
  → Struct<srid: Int32, wkb: Binary>  (via geospatialAsArrow Thrift flag, SPARK-54232)
  → ADBC driver flattens to Binary + ARROW:extension:name=geoarrow.wkb
  → CRS from SRID encoded as PROJJSON in ARROW:extension:metadata
  → DuckDB adbc_scanner → native GEOMETRY (already supports geoarrow.wkb)

End result: SELECT * FROM adbc_scan(...) returns DuckDB GEOMETRY directly. No ST_AsBinary(), no ST_GeomFromWKB(). Zero geometry conversion in user code.

Dependency

Requires databricks/databricks-sql-go#328 for WithArrowNativeGeospatial() ConnOption.

Changes

  • go/driver.go — Add OptionArrowNativeGeospatial constant
  • go/database.go — useArrowNativeGeospatial field, GetOption/SetOption, connection passthrough
  • go/connection.go — Pass flag to statements
  • go/statement.go — Pass flag to IPC reader adapter
  • go/ipc_reader_adapter.go — Core geoarrow conversion:
    • isGeometryStruct() — detect Databricks Struct<srid: Int32, wkb: Binary>
    • detectGeometryColumns() — find geometry struct column indices
    • buildGeoArrowSchemaWithoutCRS() — eagerly rewrite schema: Struct → Binary with geoarrow.wkb (needed before first Next() since consumers read schema upfront)
    • buildGeoArrowSchema() — rebuild schema with SRID-based CRS from first record batch, encoded as PROJJSON in ARROW:extension:metadata
    • transformRecordForGeoArrow() — extract wkb child array from struct per batch (zero-copy, O(columns) not O(rows))

Usage

-- DuckDB with adbc_scanner: geometry arrives as native GEOMETRY
SET VARIABLE db_conn = (SELECT adbc_connect({
  'driver': 'libadbc_driver_databricks.dylib',
  'databricks.server_hostname': '...',
  'databricks.http_path': '/sql/1.0/warehouses/...',
  'databricks.access_token': 'dapi...',
  'databricks.arrow.native_geospatial': 'true'
}));

-- geom is DuckDB GEOMETRY — no ST_GeomFromWKB needed!
SELECT * FROM adbc_scan(getvariable('db_conn')::BIGINT, 'SELECT * FROM my_geo_table');

Benchmarks (vs baseline ST_AsBinary + ST_GeomFromWKB)

Dataset        Baseline                 GeoArrow                Speedup
100k points    27.04s (3.7k rows/sec)   3.85s (26k rows/sec)    7.0x
10k polygons   8.41s (1.2k rows/sec)    2.33s (4.3k rows/sec)   3.6x

SRID / CRS Handling

Databricks uses per-row SRID in Struct<srid, wkb>. The GeoArrow spec defines CRS per-column, not per-row. The driver reads the SRID from the first non-null row of each geometry column in the first record batch and encodes it as PROJJSON CRS in ARROW:extension:metadata:

{"crs":{"type":"projjson","properties":{"name":"EPSG:4326"},"id":{"authority":"EPSG","code":4326}}}

In practice, Databricks geometry/geography uses a consistent SRID within a column (typically 0 or 4326), so this per-column CRS approach is safe and follows the GeoArrow standard. Non-zero SRIDs (e.g. 4326, 3857) are preserved; SRID 0 produces empty CRS metadata (the geoarrow default).

jatorre and others added 4 commits March 14, 2026 09:53
Expose geospatialAsArrow support (SPARK-54232) as an opt-in ADBC
connection option. When set to "true", geometry/geography columns
arrive as Struct<srid: Int32, wkb: Binary> instead of EWKT strings.

This depends on databricks/databricks-sql-go#328 which adds the
WithArrowNativeGeospatial() ConnOption to the underlying Go SQL driver.

Usage via adbc_connect (e.g. from DuckDB adbc_scanner):

  adbc_connect({
    'driver': 'libadbc_driver_databricks.dylib',
    'databricks.server_hostname': '...',
    'databricks.arrow.native_geospatial': 'true'
  })

When databricks.arrow.native_geospatial is enabled, the driver now
converts Struct<srid: Int32, wkb: Binary> columns to flat Binary
columns with ARROW:extension:name=geoarrow.wkb metadata.

This enables downstream consumers (e.g. DuckDB adbc_scanner) to
automatically map geometry columns to native GEOMETRY types without
any explicit ST_GeomFromWKB conversion.

Pipeline: Databricks -> Struct<srid,wkb> -> geoarrow.wkb -> native GEOMETRY

Benchmarks vs baseline (ST_AsBinary + ST_GeomFromWKB):
  100k points:  2.05x faster (31k rows/sec vs 15k rows/sec)
  10k polygons: 1.31x faster (4.5k rows/sec vs 3.4k rows/sec)

Defer schema transformation to the first Next() call so the SRID can be
read from the first non-null row of each geometry column. The SRID is
encoded as PROJJSON CRS in ARROW:extension:metadata, e.g. EPSG:4326 or
EPSG:3857. This ensures CRS information propagates correctly to
downstream consumers (DuckDB, pandas, polars, GDAL).

Split transformSchemaForGeoArrow into:
- detectGeometryColumns: finds geometry struct column indices (called in constructor)
- buildGeoArrowSchema: builds geoarrow schema with CRS from first batch (called lazily)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The schema must be available before the first Next() call since
consumers like adbc_scanner read it upfront to create table columns.
Build the geoarrow.wkb schema eagerly with empty CRS metadata in the
constructor, then enrich it with the actual SRID from the first record
batch during the first Next() call.

Verified: DuckDB now correctly recognizes geometry columns as native
GEOMETRY type via the geoarrow.wkb extension metadata.

Benchmark results (Databricks → DuckDB):
- 100k points: 7x faster than ST_AsBinary baseline
- 10k polygons: 3.6x faster than ST_AsBinary baseline

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@paleolimbot paleolimbot left a comment


Cool!

Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>

jatorre added a commit to jatorre/snowflake that referenced this pull request Mar 18, 2026
Detect GEOGRAPHY/GEOMETRY columns during query execution and tag them
with geoarrow.wkb Arrow extension metadata, enabling DuckDB and other
Arrow consumers to receive native geometry types with CRS information.

How it works:
1. Set GEOGRAPHY/GEOMETRY_OUTPUT_FORMAT=WKB at connection time so geo
   columns arrive as binary WKB instead of GeoJSON strings
2. Before executing a query, extract the table name and run DESCRIBE
   TABLE to identify GEOGRAPHY/GEOMETRY columns (catalog metadata is
   unaffected by the WKB output format setting)
3. Tag identified columns with geoarrow.wkb extension metadata in the
   Arrow schema — GEOGRAPHY gets CRS "EPSG:4326", GEOMETRY gets no CRS
4. Data flows as binary WKB with zero conversion overhead

Note: Snowflake's REST API reports geo columns as "binary" in rowtype
metadata when WKB output format is set, losing the original type info.
This is why we need the separate DESCRIBE TABLE query. We've reported
this to Snowflake.

Limitations (documented as TODOs):
- GEOMETRY SRID: requires data inspection to determine, same cross-driver
  issue as adbc-drivers/redshift#2 and adbc-drivers/databricks#350
- Arbitrary queries: only table scans (SELECT ... FROM table) get geoarrow
  metadata. Complex queries with joins/subqueries don't trigger geo
  detection. The data is still correct WKB, just without the metadata.

Tested end-to-end: DuckDB reads Snowflake GEOGRAPHY as native GEOMETRY
with CRS EPSG:4326, and GeoParquet export preserves the type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>