SNOW-2408964: support custom_schema when reading xml#4033
SNOW-2408964: support custom_schema when reading xml#4033sfc-gh-yuwang merged 15 commits intomainfrom
Conversation
| and format.lower() == "xml" | ||
| and XML_ROW_TAG_STRING not in self._cur_options | ||
| ): | ||
| raise ValueError("When read XML with user schema, rowtag must be set.") |
There was a problem hiding this comment.
| raise ValueError("When read XML with user schema, rowtag must be set.") | |
| raise ValueError("When reading XML with user schema, rowtag must be set.") |
| return None | ||
|
|
||
|
|
||
| def generate_norm_column_name_to_ori_column_name_dict(result: dict): |
There was a problem hiding this comment.
does this really need to be a function?
There was a problem hiding this comment.
you are right, I removed this function
| # cast to input custom schema type | ||
| if self._user_schema: | ||
| cols = [ | ||
| df[single_quote(field._name)] |
There was a problem hiding this comment.
why are we using single_quote here?
There was a problem hiding this comment.
currently all column name is produced with a single quote around,
I am not sure why we are having this behavior but I think we should not change now as it would be a BCR
single quote is added here so that we can correctly read the column
There was a problem hiding this comment.
Ahh, I think I remember Jianzhun mentioning we use PIVOT and it snowflake pivot adds single quote for string, and datetime columns. This could break since server side is planning to do a BCR to remove quotes.
| with pytest.raises( | ||
| ValueError, match="When read XML with user schema, rowtag must be set." | ||
| ): | ||
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") |
There was a problem hiding this comment.
Test regex pattern mismatch with actual error message. The test expects "When read XML with user schema, rowtag must be set." but the actual error message in dataframe_reader.py:1392 is "When reading XML with user schema, rowtag must be set." (note "reading" vs "read"). This test will fail.
# Fix: Update the match pattern to match the actual error message
with pytest.raises(
ValueError, match="When reading XML with user schema, rowtag must be set."
):| with pytest.raises( | |
| ValueError, match="When read XML with user schema, rowtag must be set." | |
| ): | |
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") | |
| with pytest.raises( | |
| ValueError, match="When reading XML with user schema, rowtag must be set." | |
| ): | |
| session.read.schema(user_schema).xml(f"@{tmp_stage_name}/{test_file_books_xml}") | |
Spotted by Graphite Agent
Is this helpful? React 👍 or 👎 to let us know.
| ).option("mode", "failfast").xml(f"@{tmp_stage_name}/{test_file_books_xml}") | ||
|
|
||
|
|
||
| def test_read_xml_with_custom_schema(session): |
There was a problem hiding this comment.
can we have one simple case for nested data? books2.xml seems to be a nested one
sfc-gh-aling
left a comment
There was a problem hiding this comment.
LGTM.
can you also help address the test coverage before merging it?
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-2408964
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Here is a doc to talk about the behavior of XML read with user schema in snowpark:
https://docs.google.com/document/d/1W3os2SGbO_M89TjIa7xVuNrQ0RDDuGMHEOBHFuw0OYw/edit?tab=t.0