[chore] Cleaner wds (#147)
Conversation
```rust
if samples_metadata_tx.send(current_files_for_sample).is_err() {
    debug!("dispatch_shards (streaming): samples_metadata_tx channel closed.");
    let _ = samples_metadata_tx.close(); // Make sure that we close on both ends
    return Err("Channel closed".into());
```
We close the channel to signal the end of processing (once enough samples have been received), so this is not a real error.
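The pattern under discussion, where a dropped receiver is a shutdown signal rather than an error, can be sketched with std's `sync_channel` in place of the crossbeam `bounded` channel used in the PR (names and numbers here are illustrative):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Minimal sketch: the consumer drops its end of the channel once it has
// received enough samples, and the producer treats a failed send as a
// normal end-of-stream signal, not an error.
fn produce_and_sum() -> u32 {
    let (tx, rx) = sync_channel::<u32>(4);
    let consumer = thread::spawn(move || {
        // Only two samples are needed; rx is dropped when this thread
        // returns, which closes the channel on the receiving side.
        rx.recv().unwrap() + rx.recv().unwrap()
    });
    for i in 0..100 {
        if tx.send(i).is_err() {
            // Receiver closed the channel: enough samples were received.
            break;
        }
    }
    consumer.join().unwrap()
}

fn main() {
    println!("sum of first two samples: {}", produce_and_sum());
}
```

The producer never has to know how many samples the consumer wants; channel closure carries that information.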
**photoroman** left a comment:

Looks good to me. I added a few comments and questions I'd like to discuss before merging.
```rust
pub fn orchestrate(client: &DatagoClient) -> DatagoEngine {
    // Allocate all the message passing pipes
    let (samples_metadata_tx, samples_metadata_rx) = bounded::<TarballSample>(32);
    // Allocate all the message passing pipes with larger buffers for better throughput
```
I'm curious, was this comment written by Claude? It often does this sort of thing, writing comments that refer to what's being changed instead of the current state. It drives me crazy 😄
E.g. "larger buffers" only makes sense with respect to this change, because we are increasing the size from 32 to max(128, samples_buffer * 2). But "larger buffers" doesn't make sense as a code comment: comments should document the code as is, outside of the context of a PR. It's not clear what to compare the buffer size to outside of this PR.
Ah, it was written by an LLM indeed; it was tied to a change that I reverted afterwards, realizing it didn't make sense. Will fix, sorry about that.
No need to be sorry, not your fault. It's just an annoying thing with Claude and other LLMs.
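For reference, the buffer sizing mentioned in this thread (max(128, samples_buffer * 2)) can be captured in a small helper; the function name is illustrative, not the actual datago code:

```rust
/// Channel capacity heuristic discussed above: at least 128 slots,
/// or twice the client-side samples buffer, whichever is larger.
fn channel_capacity(samples_buffer: usize) -> usize {
    std::cmp::max(128, samples_buffer * 2)
}

fn main() {
    // Small buffers are clamped up to 128; large ones are doubled.
    println!("{} {}", channel_capacity(32), channel_capacity(100));
}
```

A comment documenting this heuristic directly ("capacity scales with the samples buffer, with a floor of 128") would describe the code as is, without referring to the previous value.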
```toml
    "webdataset>=0.2.100",
    "zig>=1.0.dev0",
]
```
Two questions:
- Should we add a separate extra (e.g. `dev` or `benchmarks`; I prefer the first one as it's more general) for development dependencies, to avoid depending on pandas and matplotlib for standard datago?
- Do we still need the requirements.txt and requirements-benchs.txt files if we add dependencies here? If we use them, maybe only for locking dependencies to exact versions, but uv.lock could do the same.
Yes to both; probably easier with just a single `dev` extra. We're kind of in between pip and uv here, which is not super nice. Happy to move either way; I can also revert all these requirements changes and just keep it local, no strong opinion.
Also no strong opinion. Whatever is easier for you at this point. We can always change this easily later on.
I reverted the pyproject part; it could be a dedicated PR to do just that in a clean way and leave the requirements files in the dust?
In the meantime the dev requirements are practical, I feel: not required for the build or tests, but running benchmarks is part of the dev process and they come with a few more deps, so good to keep around? Compromising :)
Nice with Marco's follow-up, perfect!
nit: If this file is needed (see other comment in pyproject.toml), I suggest naming it requirements-dev.txt and making it the default place to add all dev requirements.
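The `dev` extra idea from this thread could look roughly like the following pyproject.toml fragment (a sketch only; the exact dependency list and extra name are what's being discussed, not settled):

```toml
# Hypothetical fragment: benchmark/dev-only dependencies live in a "dev"
# extra, so a standard install of datago does not pull pandas or matplotlib.
[project.optional-dependencies]
dev = [
    "pandas",
    "matplotlib",
]
```

Developers would then install with `uv pip install -e ".[dev]"` (or the pip equivalent), while regular users are unaffected.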
Could probably go after this one: #148
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
jeez, merged on the wrong branch..
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* [chore] Cleaner wds (#147)
* Some cleanup + better benchmarks
* cargo fmt, broken dev setup
* removing a silly change
* better workload spread while wds, tarball extraction is still the bottleneck
* better workload spread while wds, tarball extraction is still the bottleneck
* adding a small plot helper
* Renaming a confusing variable + adding more benchmarks
* Updating pyproject + PD12M bench on Epyc
* Fix the missing attributes from wds samples
* slightly cleaner code
* Adding a unit test + cleaner error handling
* Adding backtrace to the tests
* code review
* Updating the version post-merge
* [infra] Update the build process (#148)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* dependencies in pyproject for uv
* uv system python for cicd
* bumping the version
* tentative maturin update
* ty and ruff local fixes, need add to cicd
* [dupe] Merging #147 (#151)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* [chore] Cleaner wds (#147)
* Some cleanup + better benchmarks
* cargo fmt, broken dev setup
* removing a silly change
* better workload spread while wds, tarball extraction is still the bottleneck
* better workload spread while wds, tarball extraction is still the bottleneck
* adding a small plot helper
* Renaming a confusing variable + adding more benchmarks
* Updating pyproject + PD12M bench on Epyc
* Fix the missing attributes from wds samples
* slightly cleaner code
* Adding a unit test + cleaner error handling
* Adding backtrace to the tests
* code review
* Updating the version post-merge
* [infra] Update the build process (#148)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* sync
* prek precommit for ty and ruff.
* no skip
* test ty and ruff cicd
* fix back
* remove hook stage manual
* redo hook stage manual

Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@mistral.ai>
Co-authored-by: Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>
After some back and forth testing options for better (faster) WDS support, this is already an improvement. The key underlying change is to split the settings between "how many download streams do we start in parallel" and "how many CPU cores can be dedicated to decoding the images".
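That split can be sketched with plain std threads (the knob names and counts are illustrative, not the actual datago API): a pool of download streams feeds a bounded channel, and a separately sized pool of decode workers drains it.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical knobs mirroring the split described above.
const NUM_DOWNLOAD_STREAMS: usize = 2; // network-bound stage
const NUM_DECODE_WORKERS: usize = 4;   // CPU-bound stage

fn demo() -> usize {
    let (raw_tx, raw_rx) = sync_channel::<Vec<u8>>(128);
    let raw_rx = Arc::new(Mutex::new(raw_rx));

    // "Download" stage: each stream pushes a few fake payloads.
    let mut downloaders = Vec::new();
    for stream in 0..NUM_DOWNLOAD_STREAMS {
        let tx = raw_tx.clone();
        downloaders.push(thread::spawn(move || {
            for i in 0..10 {
                tx.send(vec![(stream * 10 + i) as u8]).unwrap();
            }
        }));
    }
    drop(raw_tx); // channel closes once all download streams finish

    // "Decode" stage: a separately sized pool for CPU work.
    let mut decoders = Vec::new();
    for _ in 0..NUM_DECODE_WORKERS {
        let rx = Arc::clone(&raw_rx);
        decoders.push(thread::spawn(move || {
            let mut n = 0;
            loop {
                let payload = rx.lock().unwrap().recv();
                match payload {
                    Ok(_bytes) => n += 1, // image decoding would happen here
                    Err(_) => break,      // channel closed: no more work
                }
            }
            n
        }));
    }

    for d in downloaders {
        d.join().unwrap();
    }
    decoders.into_iter().map(|d| d.join().unwrap()).sum()
}

fn main() {
    println!("decoded {} samples", demo());
}
```

The point of the split is that the two stage sizes can be tuned independently: download parallelism against network latency, decode parallelism against available CPU cores.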
This also fixes missing multimodal attributes; I'll update the PR with a unit test to lock it down prior to merging, if accepted.
Checking in the code used for plotting, while I'm at it
Some benchmark results:
Optimal settings probably vary with the underlying hardware and data distribution; attached are a few benchmarks showing that, in any case, datago is more than competitive with the Python webdataset library. It's also quite convenient to use, as data streaming starts immediately vs. initial buffering for Python webdataset.
This (from a household fiber connection + a Zen 3 laptop) should be a relatively fair picture; speed should translate well to other setups. This run processes the big images into Transformer-compatible buckets (PD12M).

When not touching the images, the CPU spread counts for less, but datago is still visibly faster (FakeIN).

This (from a cluster, Epyc CPU and PD12M dataset) confirms that the perf gains scale
