[chore] Cleaner wds #147

Merged
blefaudeux merged 17 commits into ben/update_manylinux_python from ben/cleaner_wds
Jan 7, 2026

@blefaudeux
Collaborator

@blefaudeux blefaudeux commented Jan 4, 2026

  • Back and forth testing some options for better (faster) WDS support; this is already an improvement. The key underlying change is to split the configuration between "how many download streams do we start in parallel" and "how many CPU cores can be dedicated to decoding the images".

  • Fixing missing multimodal attributes. I will update the PR with a unit test to lock this down prior to merging, if accepted.

  • Checking in the code used for plotting, while I'm at it
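
To illustrate the split described in the first bullet, here is a minimal sketch of a two-stage pipeline where the number of download threads and the number of decode threads are chosen independently. This is not datago's implementation: `download`, `decode`, `run_pipeline`, and the channel capacity of 32 are hypothetical stand-ins, and the sketch uses only the standard library rather than the channel crate used in the repository.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for an I/O-bound download: returns fake raw bytes.
fn download(shard: usize) -> Vec<u8> {
    vec![shard as u8; 4]
}

/// Hypothetical stand-in for a CPU-bound decode: reduces the bytes to a number.
fn decode(raw: Vec<u8>) -> usize {
    raw.iter().map(|b| *b as usize).sum()
}

/// Runs `n_download` I/O threads and `n_decode` CPU threads, sized independently.
fn run_pipeline(shards: Vec<usize>, n_download: usize, n_decode: usize) -> Vec<usize> {
    let (shard_tx, shard_rx) = sync_channel::<usize>(32);
    let (raw_tx, raw_rx) = sync_channel::<Vec<u8>>(32);
    let (out_tx, out_rx) = sync_channel::<usize>(32);
    // std receivers are single-consumer, so each pool shares one behind a mutex.
    let shard_rx = Arc::new(Mutex::new(shard_rx));
    let raw_rx = Arc::new(Mutex::new(raw_rx));

    // Feed shard ids from a separate thread so the main thread can drain results.
    let feeder = thread::spawn(move || {
        for shard in shards {
            if shard_tx.send(shard).is_err() {
                break;
            }
        }
        // shard_tx is dropped here, which tells the download pool to stop.
    });

    let mut workers = Vec::new();
    for _ in 0..n_download {
        let (rx, tx) = (Arc::clone(&shard_rx), raw_tx.clone());
        workers.push(thread::spawn(move || loop {
            // The lock guard is released at the end of this statement.
            let next = rx.lock().unwrap().recv();
            match next {
                Ok(shard) => {
                    if tx.send(download(shard)).is_err() {
                        break;
                    }
                }
                Err(_) => break,
            }
        }));
    }
    drop(raw_tx); // only the download workers keep senders alive

    for _ in 0..n_decode {
        let (rx, tx) = (Arc::clone(&raw_rx), out_tx.clone());
        workers.push(thread::spawn(move || loop {
            let next = rx.lock().unwrap().recv();
            match next {
                Ok(raw) => {
                    if tx.send(decode(raw)).is_err() {
                        break;
                    }
                }
                Err(_) => break,
            }
        }));
    }
    drop(out_tx); // only the decode workers keep senders alive

    // The iterator ends once every decode worker has exited.
    let mut results: Vec<usize> = out_rx.iter().collect();
    feeder.join().unwrap();
    for worker in workers {
        worker.join().unwrap();
    }
    results.sort();
    results
}
```

Calling `run_pipeline((0..100).collect(), 8, 2)` would model many parallel streams with few decoding cores, and swapping the two counts models the opposite trade-off; which is better presumably depends on the hardware and dataset, as the benchmarks below suggest.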


Some benchmark results:

Optimal settings probably vary with the underlying hardware and data distribution. Attached are a few benchmarks showing that, in any case, datago is more than competitive with the Python webdataset library. It's also quite convenient to use, as data streaming starts immediately, whereas Python webdataset buffers first.

This (from a household fiber connection + Zen 3 laptop) should be a relatively fair picture, and the speed should translate well to other setups. This is while processing the big images into Transformer-compatible buckets (PD12M).
image

When the images are not touched, the CPU spread matters less, but datago is still visibly faster (FakeIN).
image

This (from a cluster, Epyc CPU and PD12M dataset) confirms that the perf gains scale
image

Comment thread src/generator_wds.rs
if samples_metadata_tx.send(current_files_for_sample).is_err() {
    debug!("dispatch_shards (streaming): samples_metadata_tx channel closed.");
    let _ = samples_metadata_tx.close(); // Make sure that we close on both ends
    return Err("Channel closed".into());
Collaborator Author

we close the channel to signal end of processing (if enough samples received), so that was not a real error
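
A minimal sketch of that idea, assuming a standard-library channel instead of the one used in `generator_wds.rs`: the producer treats a failed `send` as "the consumer has enough" and returns normally. The names `dispatch` and `take_limited` are hypothetical and only serve the illustration.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;

/// Sends items until the list is exhausted or the receiver hangs up.
/// Returns how many items were delivered. The `Result` is kept so that
/// genuine failures could still be reported through `Err`.
fn dispatch(items: Vec<u32>, tx: SyncSender<u32>) -> Result<usize, String> {
    let mut sent = 0;
    for item in items {
        if tx.send(item).is_err() {
            // The receiver was dropped: the consumer has all the samples it
            // wanted, so this is a normal stop rather than an error.
            return Ok(sent);
        }
        sent += 1;
    }
    Ok(sent)
}

/// Consumes at most `limit` items, then closes the channel by dropping the receiver.
fn take_limited(items: Vec<u32>, limit: usize) -> (Vec<u32>, usize) {
    // Capacity 0 makes every send wait for a matching receive, so the
    // delivered count is exact in this sketch.
    let (tx, rx) = sync_channel::<u32>(0);
    let producer = thread::spawn(move || dispatch(items, tx));
    let received: Vec<u32> = rx.iter().take(limit).collect();
    drop(rx); // signals "enough samples" to the producer
    let sent = producer.join().unwrap().unwrap();
    (received, sent)
}
```

With ten items and a limit of three, the producer stops after three deliveries and still returns `Ok`, which matches the intent described in the comment above.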

Contributor

@photoroman photoroman left a comment

Looks good to me. I added a few comments and questions I'd like to discuss before merging.

Comment thread python/benchmark_webdataset.py Outdated
Comment thread src/generator_wds.rs Outdated
pub fn orchestrate(client: &DatagoClient) -> DatagoEngine {
// Allocate all the message passing pipes
let (samples_metadata_tx, samples_metadata_rx) = bounded::<TarballSample>(32);
// Allocate all the message passing pipes with larger buffers for better throughput
Contributor

I'm curious, was this comment written by Claude? It often does this sort of thing where it writes comments that refer to what's being changed instead of what's the current state. It drives me crazy 😄

E.g. "larger buffers" only makes sense with respect to this change, because we are increasing the size from 32 to max(128, samples_buffer * 2). But "larger buffers" doesn't make sense as a code comment, because they should document the code as is outside of the context of a PR. It's not clear what to compare the buffer size to outside of this PR.
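
As a sketch of what a comment describing the current state could look like, the sizing rule quoted above can be pulled into a small helper whose doc comment states what the capacity is, not how it changed. The function name `metadata_channel_capacity` is hypothetical, and whether the final code kept this exact formula is not confirmed by the thread.

```rust
/// Capacity of the sample-metadata channel: twice the requested sample
/// buffer, with a floor of 128 slots.
fn metadata_channel_capacity(samples_buffer: usize) -> usize {
    std::cmp::max(128, samples_buffer * 2)
}
```

A comment phrased this way stays accurate after the PR is merged, because it does not refer to the previous value of 32.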

Collaborator Author

ah, it was written by an LLM indeed. It was tied to a change, but I reverted it afterwards, realizing it didn't make sense... will fix, sorry about that

Contributor

No need to be sorry, not your fault. It's just an annoying thing with Claude and other LLMs.

Comment thread src/worker_wds.rs
Comment thread pyproject.toml
"webdataset>=0.2.100",
"zig>=1.0.dev0",
]

Contributor

Two questions:

  1. Should we add a separate extra (e.g. dev or benchmarks, I prefer the first one as it's more general) for development dependencies to avoid depending on pandas and matplotlib for standard datago?
  2. Do we still need the requirements.txt and requirements-benchs.txt file if we add dependencies here? If we use them, maybe only for locking dependencies to exact versions, but uv.lock could do the same.
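
For reference, the first option could look roughly like the fragment below, using the standard `[project.optional-dependencies]` table. The extra name `dev` follows the preference stated above; the package list (taken from the two packages named in the question) and the absence of version pins are assumptions, not the project's actual configuration.

```toml
[project.optional-dependencies]
# Hypothetical "dev" extra: benchmark and plotting dependencies only,
# kept out of the default install.
dev = [
    "pandas",
    "matplotlib",
]
```

Assuming the distribution is published as `datago`, the extra would then be installed with `pip install "datago[dev]"`, while a plain install would skip these packages.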

Collaborator Author

yes to both, probably easier with just a single .dev, and we're kind of in between pip and uv here, it's not super nice. Happy to move either way, I can also revert all these req changes and we just keep it local, no strong opinion

Contributor

Also no strong opinion. Whatever is easier for you at this point. We can always change this easily later on.

Collaborator Author

I reverted the pyproject part; a dedicated PR could do just that in a clean way and leave the requirement files in the dust?

In the meantime, I feel the dev requirements are practical for now: they are not required for the build or tests, but running benchmarks is part of the dev process and comes with a few more deps, so they are good to keep around? Compromising :)

Collaborator Author

nice with Marco's follow up, perfect !

Comment thread requirements-dev.txt
Contributor

nit: If this file is needed (see other comment in pyproject.toml), I suggest naming it requirements-dev.txt and making it the default place to add all dev requirements.

@blefaudeux
Collaborator Author

could probably go after this one #148

@blefaudeux blefaudeux changed the base branch from main to ben/update_manylinux_python January 6, 2026 21:48
@blefaudeux blefaudeux requested a review from photoroman January 6, 2026 21:48
@blefaudeux blefaudeux marked this pull request as ready for review January 6, 2026 21:48
blefaudeux and others added 2 commits January 6, 2026 21:49
* tentative maturin update

* bumping the version

* reverting the pyproject change, not important
Contributor

@photoroman photoroman left a comment

Very nice!

@blefaudeux blefaudeux merged commit a600cf4 into ben/update_manylinux_python Jan 7, 2026
@blefaudeux
Collaborator Author

jeez, merged on the wrong branch..

blefaudeux added a commit that referenced this pull request Jan 7, 2026
* tentative maturin update

* bumping the version

* reverting the pyproject change, not important

* [chore] Cleaner wds (#147)

* Some cleanup + better benchmarks

* cargo fmt, broken dev setup

* removing a silly change

* better workload spread while wds, tarball extraction is still the bottleneck

* better workload spread while wds, tarball extraction is still the bottleneck

* adding a small plot helper

* Renaming a confusing variable + adding more benchmarks

* Updating pyproject + PD12M bench on Epyc

* Fix the missing attributes from wds samples

* slightly cleaner code

* Adding a unit test + cleaner error handling

* Adding backtrace to the tests

* code review

* Updating the version post-merge

* [infra] Update the build process (#148)

* tentative maturin update

* bumping the version

* reverting the pyproject change, not important
MarcoForte pushed a commit that referenced this pull request Jan 17, 2026
MarcoForte added a commit that referenced this pull request Jan 28, 2026
* dependencies in pyproject for uv

* uv system python for cicd

* bumping the version

* tentative maturin update

* ty and ruff local fixes, need add to cicd

* [dupe] Merging #147 (#151)

* tentative maturin update

* bumping the version

* reverting the pyproject change, not important

* [chore] Cleaner wds (#147)

* Some cleanup + better benchmarks

* cargo fmt, broken dev setup

* removing a silly change

* better workload spread while wds, tarball extraction is still the bottleneck

* better workload spread while wds, tarball extraction is still the bottleneck

* adding a small plot helper

* Renaming a confusing variable + adding more benchmarks

* Updating pyproject + PD12M bench on Epyc

* Fix the missing attributes from wds samples

* slightly cleaner code

* Adding a unit test + cleaner error handling

* Adding backtrace to the tests

* code review

* Updating the version post-merge

* [infra] Update the build process (#148)

* tentative maturin update

* bumping the version

* reverting the pyproject change, not important

* sync

* prek precommit for ty and ruff.

* no skip

* test ty  and ruff cicd

* fix back

* remove hook stage manual

* redo hook stage manual

---------

Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@mistral.ai>
Co-authored-by: Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>
@blefaudeux blefaudeux deleted the ben/cleaner_wds branch January 30, 2026 10:29