[WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls#55039
Open
jiateoh wants to merge 1 commit intoapache:masterfrom
Open
[WIP][SPARK-56248][PYTHON][SS] Optimize python TWS stateful processor serialization calls#55039jiateoh wants to merge 1 commit intoapache:masterfrom
jiateoh wants to merge 1 commit intoapache:masterfrom
Conversation
jiateoh
commented
Mar 26, 2026
| return serializer | ||
|
|
||
| field_names = [f.name for f in schema.fields] | ||
| row_value = Row(**dict(zip(field_names, converted))) |
Contributor
Author
There was a problem hiding this comment.
Copying from the PR description for reference on why this field_name/Row creation is no longer necessary:
StructType.toInternaldispatches on type: for dict it looks up by field name, for tuple/list it zips by position. So functionally there is no need to convert the tuple to list. (L521 deletion)Rowis a tuple subclass, so it always hit the positional branch.- Since 3.0.0 (
types.pychange notes), Row field names are insertion ordered. Python dictionaries (as of 3.7+) are also insertion ordered. dict(zip(field_names, converted)) → Row(**...)ends up adding extra hops to (1) fetch field names, (2) zip them with row values, (3) create an insertion-ordered dictionary of those field names, and (4) create an insertion-ordered row (dropping the field names which are no longer used). With the end result being aRow(tuple subclass) which uses same positional branch ofSchema.toInternalas the original input tuple would.
…on, cache serializer, hoist normalize helpers. - Eliminate per-call Row(**dict(zip(...))) construction in _serialize_to_bytes; pass normalized tuples directly to schema.toInternal which handles them by index - Cache the serialize callable per schema id so field extraction and closure creation happen once per schema rather than per row - Hoist _normalize_value/_normalize_tuple to module level to avoid re-creating closures and re-importing numpy on every call - Add non-testing helper method has_numpy to avoid circular dependency. AI-assisted + human reviewed/edited Generated-by: Claude Code (Claude Opus 4.6)
9d1665d to
243a49b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Skip unnecessary list+Row construction, add cache serializer, hoist normalize helpers.
has_numpyto avoid circular dependency, as the existinghave_numpydepends on dataframe (hence the original deferred import). The previoushave_numpyis also in a testing file whereas the newhas_numpyis in a dedicated utils file.Of the above changes, the first one is the only one with significant logic changes (removing Row/dict/list creation). The next two are primarily moving/caching function definitions, and the last is a new helper method for enabling the new function locations.
To better explain the removal of
Rowusage:StructType.toInternaldispatches on type: for dict it looks up by field name, for tuple/list it zips by position. So functionally there is no need to convert the tuple to list.Rowis a tuple subclass, so it always hit the positional branch.types.pychange notes), Row field names are insertion ordered. Python dictionaries (as of 3.7+) are also insertion ordered.dict(zip(field_names, converted)) → Row(**...)ends up adding extra hops to (1) fetch field names, (2) zip them with row values, (3) create an insertion-ordered dictionary of those field names, and (4) create an insertion-ordered row (dropping the field names which are no longer used). With the end result being aRow(tuple subclass) which uses same positional branch ofSchema.toInternalas the original input tuple would.AI-assisted + human reviewed/updated
Why are the changes needed?
This is a code cleanup/performance optimization. Original code has unnecessary operations that are executed for every row, including: rebuilding closures, extracting field names, building intermediate lists + dicts, and constructing Row objects (which sort by field unnecessarily). These can all add minor overhead while having no effect on the underlying usage.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)