[PECOBLR-1377] SRID in geospatial column type name#1157
Conversation
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
    return parser.parseJsonStringToDbStruct(object.toString(), arrowMetadata);
  }

  private static AbstractDatabricksGeospatial convertToGeospatial(

Removing duplicate code; reusing the same logic present in GeospatialConverter.
  long rowSize = executionResult.getRowCount();
  List<String> arrowMetadata = null;
  if (executionResult instanceof ArrowStreamResult) {
    arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata();

((ArrowStreamResult) executionResult).getArrowMetadata() will always be null for cloud fetch mode at this point in the code flow, so we fetch the arrow metadata from TGetResultSetMetadataResp instead.
Will getResultSetMetadata always be present?

Ideally we should have the TGetResultSetMetadataResp object in TFetchResultsResp. In any case, we handle this in the getArrowMetadata method by checking for null values.

Is arrow metadata consistently populated in TGetResultSetMetadataResp?

Not when the result is inline, but that is a limitation of complex data types as well.

Not when the result is inline.
Arrow metadata is always populated from thrift when the format is CSV/JSON_ARRAY or ARROW.
The only time arrow metadata is not present is when the format is HIVE.
Is the inline result in HIVE format?
   * @return true if the type name starts with ARRAY, MAP, STRUCT, GEOMETRY, or GEOGRAPHY, false
   *     otherwise
   */
  private static boolean isComplexType(String typeName) {

Moving this method to DatabricksTypeUtil so it can be used in other classes.
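A minimal sketch of what this helper does, based on the javadoc quoted above (the class and constant names here are assumptions, not the driver's actual code):

```java
public class TypeNameUtil {
    private static final String[] COMPLEX_PREFIXES = {
        "ARRAY", "MAP", "STRUCT", "GEOMETRY", "GEOGRAPHY"
    };

    // Returns true if the type name starts with ARRAY, MAP, STRUCT,
    // GEOMETRY, or GEOGRAPHY (e.g. "ARRAY<INT>", "GEOMETRY(4326)").
    public static boolean isComplexType(String typeName) {
        if (typeName == null) {
            return false;
        }
        String upper = typeName.toUpperCase();
        for (String prefix : COMPLEX_PREFIXES) {
            if (upper.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isComplexType("GEOMETRY(4326)")); // true
        System.out.println(isComplexType("DECIMAL(7,2)"));   // false
    }
}
```

Since the check is a pure prefix test on the type text, centralizing it in DatabricksTypeUtil avoids each result-handling path re-implementing the same list of prefixes.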
  return complexDatatypeSupport ? obj : obj.toString();
}

private Object handleComplexDataTypesForSEAInline(Object obj, String columnName)

Updating this method to always return the complex datatype; if complex mode is disabled, the caller will convert it to a string.
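The division of responsibility described above can be sketched as follows; the method names and the placeholder "complex value" here are illustrative assumptions, not the driver's actual implementation:

```java
public class ComplexResultSketch {
    // Stand-in for the parsed complex value (e.g. a struct or geometry object).
    static Object parseComplex(String raw) {
        return new java.util.AbstractMap.SimpleEntry<>("wkt", raw);
    }

    // Inner layer: always builds and returns the complex object,
    // with no knowledge of the support flag.
    static Object handleComplexInline(String raw) {
        return parseComplex(raw);
    }

    // Caller applies the flag: keep the object, or downgrade to String.
    static Object materialize(String raw, boolean complexDatatypeSupport) {
        Object obj = handleComplexInline(raw);
        return complexDatatypeSupport ? obj : obj.toString();
    }
}
```

Keeping the flag check at the caller means the inline handler has a single behavior, and the string fallback lives in exactly one place.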
  typeText = "STRING";
}

// store base type eg. DECIMAL instead of DECIMAL(7,2) except for geospatial datatypes

For geospatial, we want the SRID info (in parentheses) to be present, e.g. GEOMETRY(4326).

Will there be a case where the SRID info might be missing?

I don't think so; even for SRID 0, we get GEOMETRY(0).
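The rule discussed above can be sketched as a small helper (a hypothetical name, not the driver's actual method): strip the parenthesised precision/scale for ordinary types, but keep the parentheses for geospatial types because they carry the SRID.

```java
public class BaseTypeSketch {
    // DECIMAL(7,2) -> DECIMAL, but GEOMETRY(4326) stays intact
    // because the parentheses hold the SRID, not precision/scale.
    static String baseTypeText(String typeText) {
        String upper = typeText.toUpperCase();
        if (upper.startsWith("GEOMETRY") || upper.startsWith("GEOGRAPHY")) {
            return typeText; // keep SRID info, e.g. GEOMETRY(4326)
        }
        int paren = typeText.indexOf('(');
        return paren < 0 ? typeText : typeText.substring(0, paren);
    }

    public static void main(String[] args) {
        System.out.println(baseTypeText("DECIMAL(7,2)"));   // DECIMAL
        System.out.println(baseTypeText("GEOMETRY(4326)")); // GEOMETRY(4326)
    }
}
```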
      columnIndex++) {
  TColumnDesc columnDesc = resultManifest.getSchema().getColumns().get(columnIndex);

  ColumnInfo columnInfo = getColumnInfoFromTColumnDesc(columnDesc);

TColumnDesc doesn't have complete information about the column metadata, so we enhance it using the arrowMetadata extracted from TGetResultSetMetadataResp. This is required for methods that use ColumnInfo later in the flow (thrift + cloudfetch + complex).
      .columnTypeClassName("java.lang.String")
      .columnType(Types.OTHER)
      .columnTypeText(VARIANT);
} else if (isGeometryColumn(arrowMetadata, columnIndex)

This is handled with updates to the column info; no separate handling is required.
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
if (arrowMetadata != null && isComplexType(arrowMetadata)) {
  typeText = arrowMetadata;
  if (arrowMetadata.startsWith(GEOMETRY)) {

Do we want to check whether the geospatial flag is enabled?

This is an inner layer, so a flag check is not required here. We check for the geospatial flags in DatabricksResultSetMetaData, which is the interface through which users get metadata info.
        .map(e -> e.get(ARROW_METADATA_KEY))
        .collect(Collectors.toList());
  } catch (IOException e) {
    throw new DatabricksSQLException(

Added error log.
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
vikrantpuppala
left a comment
Thanks! Can we also run the JDBC comparator in different modes (SEA/Arrow) across different column types to ensure things are consistent?
### Fixed
- Fixed complex types not being returned as objects in SEA Inline mode when `EnableComplexDatatypeSupport=true`.
- Fixed errors with complex data types in Thrift CloudFetch mode.

Can we be more descriptive about what these errors are?

Updated the changelog with more details.
  long rowSize = executionResult.getRowCount();
  List<String> arrowMetadata = null;
  if (executionResult instanceof ArrowStreamResult) {
    arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata();

Is arrow metadata consistently populated in TGetResultSetMetadataResp?
  return parser.parseJsonStringToDbStruct(obj.toString(), columnName);
} else if (columnName.startsWith(GEOMETRY)) {
  return obj;
  return new GeospatialConverter().toDatabricksGeometry(obj);

Why is toDatabricksGeometry not a static method? Why do we need to init an object to use it?

Updated the code to use a cached converter, similar to the other datatypes' converters.
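The cached-converter pattern mentioned above, sketched with a stand-in converter (the GeospatialConverter here is a placeholder, not the driver's real class): hold one reusable instance instead of allocating a new converter for every value.

```java
public class ConverterCache {
    // Placeholder converter; the real one would parse WKB/WKT into a
    // geometry object rather than prefixing a string.
    static final class GeospatialConverter {
        Object toDatabricksGeometry(Object obj) {
            return "GEOMETRY:" + obj;
        }
    }

    // Single shared instance, created once and reused for every row,
    // avoiding a per-value allocation in the hot conversion path.
    private static final GeospatialConverter GEO_CONVERTER = new GeospatialConverter();

    static Object convertGeometry(Object obj) {
        return GEO_CONVERTER.toDatabricksGeometry(obj);
    }
}
```

This only works safely if the converter is stateless or thread-safe, which is presumably why the other datatype converters in the driver follow the same caching approach.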
  long rowSize = executionResult.getRowCount();
  List<String> arrowMetadata = null;
  if (executionResult instanceof ArrowStreamResult) {
    arrowMetadata = ((ArrowStreamResult) executionResult).getArrowMetadata();

Not when the result is inline.
Arrow metadata is always populated from thrift when the format is CSV/JSON_ARRAY or ARROW.
The only time arrow metadata is not present is when the format is HIVE.
Is the inline result in HIVE format?
  return parser.parseJsonStringToDbStruct(obj.toString(), columnName);
} else if (columnName.startsWith(GEOMETRY)) {
  return obj;
  return new GeospatialConverter().toDatabricksGeometry(obj);

Why didn't we handle this before?
What is the impact of not having this in 3.0.6?

Currently complex data types are not supported in the SEA inline flow, so the geospatial type is returned as a string (up to 3.0.7). This is consistent with our documentation. With this change we are supporting complex types for SEA inline mode as well. Will update the documentation accordingly.
  typeText = "STRING";
}

// store base type eg. DECIMAL instead of DECIMAL(7,2) except for geospatial datatypes

 * and EnableComplexDatatypeSupport are enabled AND not in Thrift+Inline mode. Otherwise, returns as
 *     STRING.
 */
public class GeospatialTests {
Let's test GEOMETRY(ANY) as well. Example query:

%sql
SELECT * FROM VALUES
  (ST_GeomFromText('POINT(17 7)', 4326)),
  (ST_GeomFromText('POINT(5 5)', 0)) AS t(geom)

Expected type name: GEOMETRY(ANY); expected SRIDs per row: 4326 and 0. Let's assert on everything, i.e. that both row-level values are correct and the column metadata is correct.

Added the tests for GEOMETRY(ANY).
Signed-off-by: Sreekanth Vadigi <sreekanth.vadigi@databricks.com>
private void setColumnInfo(TGetResultSetMetadataResp resultManifest)
    throws DatabricksSQLException {
  columnInfos = new ArrayList<>();
  List<String> arrowMetadataList = DatabricksThriftUtil.getArrowMetadata(resultManifest);

What happens when this is null?
Some very old Spark servers do not return arrowMetadata; we do not want them to break.

In the method DatabricksThriftUtil.getArrowMetadata, we check whether arrowSchema is present in TGetResultSetMetadataResp and return null if required. At every code point where we access the arrowMetadataList, we check that it is not null and check the size of the list before accessing an element, so existing flows don't break. We have multi-DBR tests to ensure backward compatibility.

In the context of geospatial datatypes, if arrowMetadata is null (which won't happen in the ideal flow, since the driver always gets arrowMetadata on DBR 17.1+, from which geospatial types are supported), the column type is shown as String (since the TColumnDesc contains STRING and the driver doesn't have other information to identify it as GEOMETRY/GEOGRAPHY).
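The null-safety behaviour described above can be sketched as follows; the method and parameter names are illustrative assumptions, but the guard structure (null check plus size check before element access, with a fallback to the TColumnDesc type) mirrors what the comment describes:

```java
public class MetadataFallback {
    // Prefer the Arrow-derived type text (e.g. "GEOMETRY(4326)") when the
    // server supplied it; otherwise fall back to the TColumnDesc type,
    // which is "STRING" for geospatial columns on old servers.
    static String resolveTypeText(java.util.List<String> arrowMetadata,
                                  int columnIndex,
                                  String tColumnDescType) {
        if (arrowMetadata != null
                && columnIndex < arrowMetadata.size()
                && arrowMetadata.get(columnIndex) != null) {
            return arrowMetadata.get(columnIndex);
        }
        return tColumnDescType;
    }
}
```

Because every access is guarded, a server that predates Arrow schema metadata simply degrades to the old STRING behaviour instead of throwing.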
  columnInfos.add(
      com.databricks.jdbc.common.util.DatabricksThriftUtil.getColumnInfoFromTColumnDesc(
          tColumnDesc));
  List<String> arrowMetadata = DatabricksThriftUtil.getArrowMetadata(resultManifest);

What happens when this is null?
Some very old Spark servers do not return arrowMetadata; we do not want them to break.

Do we have tests for when the thrift protocol version is low?

In the method DatabricksThriftUtil.getArrowMetadata, we check whether arrowSchema is present in TGetResultSetMetadataResp and return null if required. At every code point where we access the arrowMetadataList, we check that it is not null and check the size of the list before accessing an element, so existing flows don't break. We have multi-DBR tests to ensure backward compatibility.
if (arrowMetadata != null && isComplexType(arrowMetadata)) {
  typeText = arrowMetadata;
  if (arrowMetadata.startsWith(GEOMETRY)) {
    columnInfoTypeName = ColumnInfoTypeName.GEOMETRY;

Shouldn't this return the column info with the SRID in parentheses, e.g. GEOMETRY(4326)?

columnInfoTypeName is an enum, so we can't have SRID info here. But this is just an intermediate processing step, not the ultimate response that the user receives. The user response includes the SRID info. The E2E behaviour can be found in this test (from line no: 211).
Description
This PR enhances geospatial datatype handling to include SRID (Spatial Reference System Identifier) information in column type names and fixes multiple issues related to complex datatype handling across different result formats.
Key Changes
Geospatial Type Name Enhancement
- GEOMETRY(4326) instead of GEOMETRY

SEA Inline Mode Complex Type Fix
- EnableComplexDatatypeSupport=true

Thrift CloudFetch Metadata Enhancement
- INT from ARRAY<INT>) in Thrift CloudFetch mode
- getColumnInfoFromTColumnDesc() to use Arrow schema metadata alongside TColumnDesc
- Arrow metadata contains the full type (e.g., ARRAY<INT>) while TColumnDesc only contains base type (e.g., ARRAY)

Arrow Metadata Extraction
- DatabricksThriftUtil.getArrowMetadata() to deserialize Arrow schema from TGetResultSetMetadataResp
- DatabricksResultSet constructor for Thrift CloudFetch mode

Testing
- Unit Tests
- Integration Tests
- GeospatialTests.java - Comprehensive E2E integration test

Additional Notes to the Reviewer
Other required details are mentioned in comments in the diff.