GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded (#44470) #5

Merged
EnricoMi merged 2 commits into G-Research:20.0.0-gr from EnricoMi:20.0.0-gr-patch-1
May 23, 2025

@EnricoMi

Backports apache#44470 to 20.0.0-gr.

Rationale for this change

The order of rows in a dataset can matter to users and should be preserved when the dataset is written to a filesystem. With multi-threaded writes, that order is currently not guaranteed.

What changes are included in this PR?

Preserving the row order of a dataset requires three things: the SourceNode must sequence the fragment output (so exec batches are emitted in fragment order), it must provide an ImplicitOrdering (which gives each exec batch an index), and the ConsumingSinkNode must sequence the exec batches (so batches are finally written in the order of their index).
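The sequencing idea described above can be illustrated with a small standard-library sketch. This is not Arrow's implementation: `SequencingSink`, `process`, and `run` are hypothetical names, `enumerate` merely stands in for the index that an implicit ordering would provide, and the thread pool stands in for the multi-threaded plan. The sink buffers batches that arrive early and releases them strictly in index order.

```python
import heapq
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SequencingSink:
    """Hypothetical sink that releases batches strictly in index order."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []    # min-heap of (index, batch) that arrived early
        self._next_index = 0  # index of the next batch allowed to be written
        self.written = []     # stands in for the output files

    def consume(self, index, batch):
        with self._lock:
            heapq.heappush(self._pending, (index, batch))
            # Drain every buffered batch that is now next in line.
            while self._pending and self._pending[0][0] == self._next_index:
                _, ready = heapq.heappop(self._pending)
                self.written.append(ready)
                self._next_index += 1


def process(sink, index, batch):
    # Simulate uneven per-batch work so that threads finish out of order.
    time.sleep(random.uniform(0, 0.01))
    sink.consume(index, [value * 2 for value in batch])


def run(fragments, workers=4):
    sink = SequencingSink()
    # Emit batches in fragment order and tag each one with an index.
    indexed = enumerate(batch for fragment in fragments for batch in fragment)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process, sink, i, b) for i, b in indexed]
        for future in futures:
            future.result()  # re-raise any worker exception
    return sink.written
```

Calling `run([[[1, 2], [3]], [[4], [5, 6]]])` returns the doubled batches in their original order regardless of which worker finishes first. The cost of this design is that a slow early batch holds back every later batch in the buffer.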

User-facing changes:

  • Add option preserve_order to FileSystemDatasetWriteOptions (C++) and pyarrow.dataset.write_dataset (Python).

The default behaviour is unchanged: order is only preserved when the option is set.

Are these changes tested?

Unit tests have been added.

Are there any user-facing changes?

Users can set FileSystemDatasetWriteOptions.preserve_order = true (C++) / pyarrow.dataset.write_dataset(..., preserve_order=True) (Python).
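Because the option only exists in builds that contain this change, calling code may want to check for it before relying on it. The following is a minimal sketch under that assumption: `supports_preserve_order` and `write_dataset_in_order` are hypothetical helper names, while `pyarrow.dataset.write_dataset` and its `preserve_order` keyword are the ones named in this PR.

```python
import inspect


def supports_preserve_order(write_fn) -> bool:
    """Return True if `write_fn` accepts a `preserve_order` argument."""
    return "preserve_order" in inspect.signature(write_fn).parameters


def write_dataset_in_order(data, base_dir, file_format="parquet"):
    """Write `data` to `base_dir`, asking the writer to keep the row order.

    Raises RuntimeError instead of silently writing in arbitrary order
    when the installed pyarrow build lacks the option.
    """
    # Imported lazily so the check above can be used without pyarrow installed.
    import pyarrow.dataset as ds

    if not supports_preserve_order(ds.write_dataset):
        raise RuntimeError(
            "this pyarrow build has no preserve_order option for write_dataset"
        )
    ds.write_dataset(data, base_dir, format=file_format, preserve_order=True)
```

With a build that includes the backport, `write_dataset_in_order(table, "out_dir")` would be expected to write the rows of `table` in their original order; on other builds it fails early rather than producing unordered output.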

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>

…ti-threaded (apache#44470)

### Rationale for this change
The order of rows in a dataset can matter to users and should be preserved when the dataset is written to a filesystem. With multi-threaded writes, that order is currently not guaranteed.

### What changes are included in this PR?
Preserving the row order of a dataset requires three things: the `SourceNode` must sequence the fragment output (so exec batches are emitted in fragment order), it must provide an `ImplicitOrdering` (which gives each exec batch an index), and the `ConsumingSinkNode` must sequence the exec batches (so batches are finally written in the order of their index).

User-facing changes:
- Add option `preserve_order` to `FileSystemDatasetWriteOptions` (C++) and `pyarrow.dataset.write_dataset` (Python).

The default behaviour is unchanged: order is only preserved when the option is set.

### Are these changes tested?
Unit tests have been added.

### Are there any user-facing changes?
Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) / `pyarrow.dataset.write_dataset(..., preserve_order=True)` (Python).
* GitHub Issue: apache#26818

Lead-authored-by: Enrico Minack <github@enrico.minack.dev>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Signed-off-by: Rok Mihevc <rok@mihevc.org>
@github-actions

❌ GitHub issue apache#26818 could not be retrieved.

@EnricoMi EnricoMi merged commit 50dfe89 into G-Research:20.0.0-gr May 23, 2025
30 of 41 checks passed
@EnricoMi EnricoMi deleted the 20.0.0-gr-patch-1 branch May 23, 2025 08:15