[chore] Cleaner wds (#147)
Conversation
```rust
if samples_metadata_tx.send(current_files_for_sample).is_err() {
    debug!("dispatch_shards (streaming): samples_metadata_tx channel closed.");
    let _ = samples_metadata_tx.close(); // Make sure that we close on both ends
    return Err("Channel closed".into());
```
We close the channel to signal the end of processing (once enough samples have been received), so this is not a real error.
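The pattern under discussion, where a dropped receiver is a shutdown signal rather than an error, can be sketched with std's `sync_channel` in place of the crossbeam `bounded` channel used in the PR (names and numbers here are illustrative):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Minimal sketch: the consumer drops its end of the channel once it has
// received enough samples, and the producer treats a failed send as a
// normal end-of-stream signal, not an error.
fn produce_and_sum() -> u32 {
    let (tx, rx) = sync_channel::<u32>(4);
    let consumer = thread::spawn(move || {
        // Only two samples are needed; rx is dropped when this thread
        // returns, which closes the channel on the receiving side.
        rx.recv().unwrap() + rx.recv().unwrap()
    });
    for i in 0..100 {
        if tx.send(i).is_err() {
            // Receiver closed the channel: enough samples were received.
            break;
        }
    }
    consumer.join().unwrap()
}

fn main() {
    println!("sum of first two samples: {}", produce_and_sum());
}
```

The producer never has to know how many samples the consumer wants; channel closure carries that information.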
**photoroman** left a comment:

Looks good to me. I added a few comments and questions I'd like to discuss before merging.
```rust
pub fn orchestrate(client: &DatagoClient) -> DatagoEngine {
    // Allocate all the message passing pipes
    let (samples_metadata_tx, samples_metadata_rx) = bounded::<TarballSample>(32);
    // Allocate all the message passing pipes with larger buffers for better throughput
```
I'm curious, was this comment written by Claude? It often does this sort of thing, writing comments that refer to what's being changed instead of the current state. It drives me crazy 😄
E.g. "larger buffers" only makes sense with respect to this change, because we are increasing the size from 32 to max(128, samples_buffer * 2). But "larger buffers" doesn't make sense as a code comment: comments should document the code as is, outside of the context of a PR. It's not clear what to compare the buffer size to outside of this PR.
Ah, it was written by an LLM indeed; it was tied to a change that I reverted afterwards, realizing it didn't make sense. Will fix, sorry about that.
No need to be sorry, not your fault. It's just an annoying thing with Claude and other LLMs.
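For reference, the buffer sizing mentioned in this thread (max(128, samples_buffer * 2)) can be captured in a small helper; the function name is illustrative, not the actual datago code:

```rust
/// Channel capacity heuristic discussed above: at least 128 slots,
/// or twice the client-side samples buffer, whichever is larger.
fn channel_capacity(samples_buffer: usize) -> usize {
    std::cmp::max(128, samples_buffer * 2)
}

fn main() {
    // Small buffers are clamped up to 128; large ones are doubled.
    println!("{} {}", channel_capacity(32), channel_capacity(100));
}
```

A comment documenting this heuristic directly ("capacity scales with the samples buffer, with a floor of 128") would describe the code as is, without referring to the previous value.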
```toml
    "webdataset>=0.2.100",
    "zig>=1.0.dev0",
]
```
Two questions:
- Should we add a separate extra (e.g. `dev` or `benchmarks`; I prefer the first one as it's more general) for development dependencies, to avoid depending on pandas and matplotlib for standard datago?
- Do we still need the requirements.txt and requirements-benchs.txt files if we add dependencies here? If we use them, maybe only for locking dependencies to exact versions, but uv.lock could do the same.
Yes to both; probably easier with just a single `dev` extra. We're kind of in between pip and uv here, which is not super nice. Happy to move either way; I can also revert all these requirements changes and just keep it local, no strong opinion.
Also no strong opinion. Whatever is easier for you at this point. We can always change this easily later on.
I reverted the pyproject part; it could be a dedicated PR to do just that in a clean way and leave the requirements files in the dust?
In the meantime the dev requirements are practical, I feel: not required for the build or tests, but running benchmarks is part of the dev process and they come with a few more deps, so good to keep around? Compromising :)
Nice with Marco's follow-up, perfect!
nit: If this file is needed (see other comment in pyproject.toml), I suggest naming it requirements-dev.txt and making it the default place to add all dev requirements.
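The `dev` extra idea from this thread could look roughly like the following pyproject.toml fragment (a sketch only; the exact dependency list and extra name are what's being discussed, not settled):

```toml
# Hypothetical fragment: benchmark/dev-only dependencies live in a "dev"
# extra, so a standard install of datago does not pull pandas or matplotlib.
[project.optional-dependencies]
dev = [
    "pandas",
    "matplotlib",
]
```

Developers would then install with `uv pip install -e ".[dev]"` (or the pip equivalent), while regular users are unaffected.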
Could probably go after this one: #148
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
jeez, merged on the wrong branch..
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* [chore] Cleaner wds (#147)
* Some cleanup + better benchmarks
* cargo fmt, broken dev setup
* removing a silly change
* better workload spread while wds, tarball extraction is still the bottleneck
* better workload spread while wds, tarball extraction is still the bottleneck
* adding a small plot helper
* Renaming a confusing variable + adding more benchmarks
* Updating pyproject + PD12M bench on Epyc
* Fix the missing attributes from wds samples
* slightly cleaner code
* Adding a unit test + cleaner error handling
* Adding backtrace to the tests
* code review
* Updating the version post-merge
* [infra] Update the build process (#148)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* dependencies in pyproject for uv
* uv system python for cicd
* bumping the version
* tentative maturin update
* ty and ruff local fixes, need add to cicd
* [dupe] Merging #147 (#151)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* [chore] Cleaner wds (#147)
* Some cleanup + better benchmarks
* cargo fmt, broken dev setup
* removing a silly change
* better workload spread while wds, tarball extraction is still the bottleneck
* better workload spread while wds, tarball extraction is still the bottleneck
* adding a small plot helper
* Renaming a confusing variable + adding more benchmarks
* Updating pyproject + PD12M bench on Epyc
* Fix the missing attributes from wds samples
* slightly cleaner code
* Adding a unit test + cleaner error handling
* Adding backtrace to the tests
* code review
* Updating the version post-merge
* [infra] Update the build process (#148)
* tentative maturin update
* bumping the version
* reverting the pyproject change, not important
* sync
* prek precommit for ty and ruff.
* no skip
* test ty and ruff cicd
* fix back
* remove hook stage manual
* redo hook stage manual

Co-authored-by: Benjamin Lefaudeux <benjamin.lefaudeux@mistral.ai>
Co-authored-by: Benjamin Lefaudeux <blefaudeux@users.noreply.github.com>
After some back and forth testing options for better (faster) WDS support, this is already an improvement. The key underlying change is to split the settings between "how many download streams do we start in parallel" and "how many CPU cores can be dedicated to decoding the images".
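That split can be sketched with plain std threads (the knob names and counts are illustrative, not the actual datago API): a pool of download streams feeds a bounded channel, and a separately sized pool of decode workers drains it.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical knobs mirroring the split described above.
const NUM_DOWNLOAD_STREAMS: usize = 2; // network-bound stage
const NUM_DECODE_WORKERS: usize = 4;   // CPU-bound stage

fn demo() -> usize {
    let (raw_tx, raw_rx) = sync_channel::<Vec<u8>>(128);
    let raw_rx = Arc::new(Mutex::new(raw_rx));

    // "Download" stage: each stream pushes a few fake payloads.
    let mut downloaders = Vec::new();
    for stream in 0..NUM_DOWNLOAD_STREAMS {
        let tx = raw_tx.clone();
        downloaders.push(thread::spawn(move || {
            for i in 0..10 {
                tx.send(vec![(stream * 10 + i) as u8]).unwrap();
            }
        }));
    }
    drop(raw_tx); // channel closes once all download streams finish

    // "Decode" stage: a separately sized pool for CPU work.
    let mut decoders = Vec::new();
    for _ in 0..NUM_DECODE_WORKERS {
        let rx = Arc::clone(&raw_rx);
        decoders.push(thread::spawn(move || {
            let mut n = 0;
            loop {
                let payload = rx.lock().unwrap().recv();
                match payload {
                    Ok(_bytes) => n += 1, // image decoding would happen here
                    Err(_) => break,      // channel closed: no more work
                }
            }
            n
        }));
    }

    for d in downloaders {
        d.join().unwrap();
    }
    decoders.into_iter().map(|d| d.join().unwrap()).sum()
}

fn main() {
    println!("decoded {} samples", demo());
}
```

The point of the split is that the two stage sizes can be tuned independently: download parallelism against network latency, decode parallelism against available CPU cores.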
This also fixes missing multimodal attributes; I'll update the PR with a unit test to lock it down prior to merging, if accepted.
Checking in the code used for plotting, while I'm at it
Some benchmark results:
Optimal settings probably vary with the underlying hardware and data distribution; attached are a few benchmarks showing that, in any case, datago is more than competitive with the Python webdataset library. It's also quite convenient to use, as data streaming starts immediately vs. initial buffering for Python webdataset.
This (from a household fiber connection + a Zen 3 laptop) should be a relatively fair picture; speed should translate well to other setups. This run processes the big images into Transformer-compatible buckets (PD12M).

When not touching the images, the CPU spread counts for less, but datago is still visibly faster (FakeIN).

This (from a cluster, Epyc CPU and PD12M dataset) confirms that the perf gains scale
