
[SPARK-55361][DOCS][PYTHON] suggest the use of to_arrow_schema to avoid specifying schema twice#54180

Open
casgie wants to merge 1 commit into apache:master from casgie:SPARK-55361-improve-docs-python-source

Conversation

@casgie

@casgie casgie commented Feb 6, 2026

suggest the use of to_arrow_schema to avoid specifying schema twice

What changes were proposed in this pull request?

python/docs/source/tutorial/sql/python_data_source.rst gives an example of using a PyArrow RecordBatch.

In the example, the schema is specified twice: once for Spark

def schema(self):
   return "key int, value string" 

and then again for PyArrow

def read(self, partition):
   ...
   schema = pa.schema([("key", pa.int32()), ("value", pa.string())]) 

I am proposing to change the documentation to specify only a Spark schema and to use
pyspark.sql.pandas.types.to_arrow_schema() to convert the Spark schema to an Arrow schema.

Why are the changes needed?

When using Python Data Sources in production, having to specify the schema twice is a hassle and a source of errors.

Does this PR introduce any user-facing change?

Yes, changes documentation.

How was this patch tested?

N/A

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions

github-actions bot commented Feb 6, 2026

JIRA Issue Information

=== Documentation SPARK-55361 ===
Summary: add to_arrow_schema python_data_source.rst to avoid double-specifying schema
Assignee: None
Status: Open
Affected: ["4.0.0","4.0.1","4.1.1"]


This comment was automatically generated by GitHub Actions

