Add Global max_cache_size to CombinedStreamingDataset to Enforce a Total Cache Budget Across Child Datasets #790

@hvkoops

🚀 Feature

Add an optional max_cache_size at the CombinedStreamingDataset level that enforces a single total cache budget across all child StreamingDatasets, instead of only per-dataset cache limits.

Today, max_cache_size is only defined on StreamingDataset and passed into Cache -> BinaryReader -> PrepareChunksThread, where eviction is triggered based on the size of that dataset's cache directory.

Motivation

When composing many streaming datasets (e.g. 50+), each StreamingDataset can independently grow its cache up to its own max_cache_size (default is "100GB").

In a CombinedStreamingDataset, this can easily lead to runaway disk usage because eviction is enforced per dataset cache dir, not across the combined set:

  • CombinedStreamingDataset simply holds a list of StreamingDatasets and instantiates iterators for each
    (self._dataset_iters = [iter(dataset) for dataset in datasets]).
  • Each StreamingDataset lazily creates its own Cache(...) with max_cache_size=self.max_cache_size.
  • Eviction happens when _get_folder_size(self._config._cache_dir, ...) >= self._max_cache_size inside PrepareChunksThread.

So, with N datasets you can effectively consume ~N × max_cache_size on local disk. With 50 datasets and the default of "100GB" each, the upper bound is about 5 TB (even if you "only meant" to budget ~100GB total).
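
To make the arithmetic concrete, here is a small sketch of the worst case. The parse_size helper is hypothetical (it is not the library's own size parser) and assumes decimal units, i.e. 1 GB = 10^9 bytes:

```python
def parse_size(size: str) -> int:
    """Convert a string such as "100GB" to bytes (decimal units assumed)."""
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
    text = size.strip().upper()
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # no suffix: treat as a plain byte count


def worst_case_bytes(num_datasets: int, per_dataset_limit: str = "100GB") -> int:
    """Upper bound on disk usage when every child fills its own cache."""
    return num_datasets * parse_size(per_dataset_limit)


print(worst_case_bytes(50) / 10**12, "TB")  # prints: 5.0 TB
```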

This is especially painful in multi-node / shared environments where local scratch space is limited, and it's easy to miss because each dataset looks "correctly configured" in isolation.

Pitch

Introduce max_cache_size (and optionally an allocation strategy) on CombinedStreamingDataset, and have it enforce a global cache budget by distributing that budget across the child datasets before their caches are instantiated.

Proposed API:

CombinedStreamingDataset(
    datasets=[...],
    seed=42,
    iterate_over_all=True,
    batching_method="stratified",
    max_cache_size="200GB",                # NEW: total budget across all datasets
    cache_allocation="proportional",       # optional: "equal" | "proportional"
)

Behavior:

  1. If max_cache_size is not provided: keep current behavior (backward compatible).
  2. If provided: compute a per-dataset budget and apply it to each child dataset's StreamingDataset.max_cache_size before calling iter(dataset) (since iterator construction triggers cache creation).

Allocation strategies:

  • equal: per_ds = total_budget / num_datasets
  • proportional (recommended default): allocate by combined sampling weights (already computed in CombinedStreamingDataset.__init__).
    • e.g. per_ds_i = total_budget * weight_i, assuming the weights are normalized to sum to 1 (otherwise divide by the sum of the weights).
    • If a dataset is sampled more often, it benefits more from cache headroom.
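
A minimal sketch of the two strategies, assuming the budget has already been converted to bytes. The function name allocate_cache_budget is hypothetical and not part of the library. Weights are normalized inside the function, so they do not need to sum to 1:

```python
from typing import Literal, Sequence


def allocate_cache_budget(
    total_bytes: int,
    weights: Sequence[float],
    strategy: Literal["equal", "proportional"] = "proportional",
) -> list[int]:
    """Split a total cache budget (in bytes) into one budget per dataset."""
    num_datasets = len(weights)
    if num_datasets == 0:
        raise ValueError("at least one dataset is required")
    if strategy == "equal":
        # Integer division, so the shares never add up to more than the total.
        return [total_bytes // num_datasets] * num_datasets
    if strategy == "proportional":
        total_weight = sum(weights)
        if total_weight <= 0:
            raise ValueError("weights must sum to a positive value")
        # Truncate each share to whole bytes.
        return [int(total_bytes * w / total_weight) for w in weights]
    raise ValueError(f"unknown strategy: {strategy!r}")


budgets = allocate_cache_budget(200 * 10**9, [0.5, 0.25, 0.25])
print(budgets)  # [100000000000, 50000000000, 50000000000]
```

With very small weights, a proportional share can end up smaller than a single chunk, so a real implementation would probably want a per-dataset minimum. That floor is left out here.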

Implementation details (minimal-intrusion approach):

  • Add max_cache_size: int | str | None = None and cache_allocation: Literal["equal","proportional"] = "proportional" to CombinedStreamingDataset.__init__.
  • In CombinedStreamingDataset.__iter__, before creating _CombinedDatasetIterator, apply the computed per-dataset budget:
    • Set dataset.max_cache_size = per_ds_budget for each StreamingDataset child.
    • StreamingDataset uses self.max_cache_size when constructing Cache(...), which passes it down to BinaryReader(... max_cache_size=...) (the thing PrepareChunksThread uses for eviction).
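
The ordering constraint above can be sketched as follows. ChildDataset is a stand-in for StreamingDataset that models only the max_cache_size attribute, and apply_budgets is a hypothetical helper, so treat this as an illustration of "set the limit first, then build the iterators" rather than the actual implementation:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterator, Sequence


@dataclass
class ChildDataset:
    """Stand-in for StreamingDataset; only the attribute the proposal touches."""

    name: str
    max_cache_size: int | str = "100GB"

    def __iter__(self) -> Iterator[str]:
        # In the real class, the cache would be created lazily around here,
        # reading self.max_cache_size at that moment.
        yield f"{self.name}: cache limit {self.max_cache_size}"


def apply_budgets(datasets: Sequence[ChildDataset], budgets: Sequence[int]) -> None:
    """Overwrite each child's limit; must run before iter() is called on it."""
    if len(datasets) != len(budgets):
        raise ValueError("need exactly one budget per dataset")
    for dataset, budget in zip(datasets, budgets):
        dataset.max_cache_size = budget


children = [ChildDataset("a"), ChildDataset("b")]
apply_budgets(children, [150 * 10**9, 50 * 10**9])
iterators = [iter(d) for d in children]  # budgets are already in place here
```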

Edge cases / notes:

  • If a child dataset already has a user-specified max_cache_size, you can either:
    • override it when combined-level budget is set, or
    • allow optional cache_allocation="cap_only" semantics (combined-level acts as an upper cap but doesn't override smaller values).
  • If iterate_over_all=True and datasets are removed/re-added during iteration, you could keep it simple and allocate once based on initial set; or recompute allocation when active datasets change.
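
Both edge cases can be sketched in a few lines. The helper names are hypothetical, and None stands for "the user did not set a limit on this child". Since StreamingDataset defaults to "100GB", how to tell a user-set value apart from the default is left open here:

```python
from __future__ import annotations

from typing import Sequence


def cap_only(existing: Sequence[int | None], allocated: Sequence[int]) -> list[int]:
    """Combined-level budget acts as a cap; smaller user-set limits are kept."""
    return [a if e is None else min(e, a) for e, a in zip(existing, allocated)]


def reallocate(
    total_bytes: int, weights: Sequence[float], active: Sequence[bool]
) -> list[int]:
    """Redistribute the whole budget over the datasets that are still active."""
    active_weight = sum(w for w, on in zip(weights, active) if on)
    if active_weight <= 0:
        return [0] * len(weights)
    return [
        int(total_bytes * w / active_weight) if on else 0
        for w, on in zip(weights, active)
    ]


capped = cap_only([None, 40, 500], [100, 100, 100])
print(capped)  # [100, 40, 100]

rebalanced = reallocate(200, [2, 1, 1], [True, False, True])
print(rebalanced)  # [133, 0, 66]
```

Note that recomputing only stays within the total if the cache directory of an inactive dataset is actually emptied. Otherwise its leftover chunks still count against the disk.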

Eviction is already implemented and robust at the per-directory level in PrepareChunksThread. This proposal only ensures that the sum of the directory sizes stays bounded, by controlling the per-directory budgets.

Alternatives

  1. Manually set max_cache_size on every StreamingDataset
    • Becomes brittle and tedious with large mixtures.
    • Users still need to manually update configs when the number of datasets changes.
  2. Use a single shared cache_dir
    • Risky: different datasets may clash in the same directory (and even if they don't today, it's not an advertised contract).
    • Still doesn't solve budgeting unless eviction becomes global over that directory.
  3. Implement a true global eviction policy across dataset cache dirs
    • More correct but significantly more complex:
      • need global accounting of chunk sizes across dirs
      • need a consistent definition of "oldest" or "least needed" across datasets
      • need safe deletion under concurrent workers/processes (locks already exist per-chunk, but cross-dataset coordination is non-trivial).
    • The proposed budget-distribution approach solves the practical disk-exhaustion problem with minimal churn.

Additional context

Relevant code paths showing where the current per-dataset-only limit is
enforced:

  • CombinedStreamingDataset constructs iterators for each dataset (no
    cache budgeting logic at this level today).
  • StreamingDataset accepts max_cache_size (default "100GB") and
    stores it.
  • StreamingDataset._create_cache() passes
    max_cache_size=self.max_cache_size into Cache.
  • Cache passes max_cache_size into BinaryReader, which spawns
    PrepareChunksThread(max_cache_size=...).
  • PrepareChunksThread triggers deletion based on
    _get_folder_size(self._config._cache_dir) >= self._max_cache_size
    (per cache dir).
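
For illustration, the shape of that per-directory check can be sketched like this. folder_size and should_evict are hypothetical stand-ins written for this issue, not the library's _get_folder_size or its eviction code:

```python
import os
import tempfile


def folder_size(path: str) -> int:
    """Total size in bytes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # The file may have been deleted by another worker meanwhile.
                pass
    return total


def should_evict(cache_dir: str, max_cache_size_bytes: int) -> bool:
    """Per-directory check: only this one directory is measured."""
    return folder_size(cache_dir) >= max_cache_size_bytes


with tempfile.TemporaryDirectory() as cache_dir:
    with open(os.path.join(cache_dir, "chunk-0.bin"), "wb") as f:
        f.write(b"\0" * 10)
    measured = folder_size(cache_dir)
    at_limit = should_evict(cache_dir, 10)
    below_limit = should_evict(cache_dir, 11)
```

Because each check only sees its own directory, nothing in this path ever compares the sum over all child directories against a shared limit, which is the gap this feature request targets.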

This feature would prevent accidental disk exhaustion in large multi-dataset training setups while remaining fully backward compatible and minimally invasive to the existing cache/eviction architecture.

Labels: enhancement (New feature or request)
