[SPARK-55361][DOCS][PYTHON] Suggest the use of to_arrow_schema to avoid specifying schema twice #54180
Changes from all commits
62cbc3e
294dd81
9d8fe84
b9c1c8c
162779b
@@ -512,6 +512,7 @@ The following example demonstrates how to implement a basic Data Source using Arrow

 from pyspark.sql.datasource import DataSource, DataSourceReader, InputPartition
 from pyspark.sql import SparkSession
+from pyspark.sql.pandas.types import to_arrow_schema
 import pyarrow as pa

 # Define the ArrowBatchDataSource

@@ -534,14 +535,14 @@ The following example demonstrates how to implement a basic Data Source using Arrow

 class ArrowBatchDataSourceReader(DataSourceReader):
     def __init__(self, schema, options):
         self.schema: str = schema
+        self.arrow_schema = to_arrow_schema(self.schema)
         self.options = options

     def read(self, partition):
         # Create Arrow Record Batch
         keys = pa.array([1, 2, 3, 4, 5], type=pa.int32())
         values = pa.array(["one", "two", "three", "four", "five"], type=pa.string())
-        schema = pa.schema([("key", pa.int32()), ("value", pa.string())])
-        record_batch = pa.RecordBatch.from_arrays([keys, values], schema=schema)
+        record_batch = pa.RecordBatch.from_arrays([keys, values], schema=self.arrow_schema)

         yield record_batch

     def partitions(self):

Contributor: it seems the …

Author: Yes, that is the core idea of this PR: to avoid specifying both the Arrow schema and the PySpark schema.
@@ -27,7 +27,7 @@

     write_with_length,
 )
 from pyspark.sql.datasource import DataSource, CaseInsensitiveDict
-from pyspark.sql.types import _parse_datatype_json_string, StructType
+from pyspark.sql.types import _parse_datatype_json_string, from_arrow_schema, StructType
 from pyspark.sql.worker.utils import worker_run
 from pyspark.util import local_connect_and_auth
 from pyspark.worker_util import (

@@ -36,6 +36,8 @@

     utf8_deserializer,
 )

+import pyarrow as pa
+

 def _main(infile: IO, outfile: IO) -> None:
     """

@@ -125,6 +127,11 @@ def _main(infile: IO, outfile: IO) -> None:

         # Here we cannot use _parse_datatype_string to parse the DDL string schema,
         # as it requires an active Spark session.
         is_ddl_string = True
+    if isinstance(schema, pa.Schema):
+        # Convert Arrow schema to Spark schema for compatibility,
+        # as the data source API in Python allows data source to
+        # return Arrow schema directly.
+        schema = from_arrow_schema(schema)
     else:
         schema = user_specified_schema  # type: ignore

Contributor: sorry, let me be more clear. What I thought is to add …

Contributor: @allisonwang-db for this change

Author: Sorry, I don't get what you mean.

Contributor: I am wondering whether we can directly pass the pyarrow schema in this example. Not sure whether this works, @allisonwang-db is this allowed?
Contributor: I just noticed that this is an example. Does this example really work? `self.schema: str` is a str, but `to_arrow_schema` accepts a Spark StructType as input, not a str.

Contributor: Moreover, the `to_arrow_schema` method is not a public API; I think we should not use it in examples? cc @HyukjinKwon

Author: Yes, curiously it worked (tested on Databricks DBR 17.3 with Spark 4.0.0). We can ofc change to … Regarding your second point: yes, fair, but maybe we can find another way to avoid having users specify schemas twice? Maybe allow using a PyArrow schema as a schema definition in `DataSource`?

Contributor: LGTM

Author: I'll try to implement that then & change the PR accordingly.