Next design, directories oriented?/localized, is needed #69

@yarikoptic

This issue could be considered a duplicate of the earlier dandi/dandi-cli#848 (comment), but since that one started with a suggestion of an alternative implementation within dandi-cli, I decided to file a separate one within fscacher. That gives more coverage of the situation, and I still hope that fscacher could be a reusable package providing the needed solution.

Outstanding issues, all of which IMHO are bottlenecks in our initial simplistic fscacher implementation, which

  • I1: uses the same cache for a function (or collection of functions) regardless of their parametrization, etc.
  • I2: places all caches into a centralized user-wide cache directory (e.g., ~/.cache/fscacher/{cachename}), so caches related to a dataset stay behind even if that dataset gets removed
  • I3: uses joblib's memoization, so that each invocation gets its own directory + 2 files on disk (thus 3 inodes):
(dandi-devel) jovyan@jupyter-yarikoptic:~/.cache/fscacher/dandi-checksums/joblib/dandi/support/digests$
$ ls get_zarr_checksum/b74a957ae8f7e2e4fc2b07d1aaa73775/
metadata.json  output.pkl
  - in effect, this avoids any need for locking, since the fingerprint determines the directory name to use
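
To make I3 concrete, here is a minimal sketch of a fingerprint-addressed layout of that kind. It is an imitation written for illustration, not joblib's or fscacher's actual code: `fingerprint` and `cached_call` are hypothetical names, and the fingerprint recipe (path, size, mtime) is an assumption.

```python
import hashlib
import json
import os
import pickle
from pathlib import Path


def fingerprint(path):
    """Hypothetical fingerprint: hash of absolute path, size, and mtime."""
    st = os.stat(path)
    key = f"{os.path.abspath(path)}:{st.st_size}:{st.st_mtime_ns}"
    return hashlib.md5(key.encode()).hexdigest()


def cached_call(cache_root, func, path):
    """Run func(path), storing the result in a directory named after the fingerprint."""
    entry = Path(cache_root) / func.__name__ / fingerprint(path)
    output = entry / "output.pkl"
    if output.exists():
        # Cache hit: the fingerprint alone tells us where to look.
        return pickle.loads(output.read_bytes())
    result = func(path)
    # Cache miss: one directory plus two files, i.e. three inodes per invocation.
    entry.mkdir(parents=True, exist_ok=True)
    (entry / "metadata.json").write_text(json.dumps({"path": str(path)}))
    output.write_bytes(pickle.dumps(result))
    return result
```

Because each distinct fingerprint maps to its own directory, invocations with different fingerprints never write to the same location, which is why this layout gets away without locking. The cost is the inode count, which grows linearly with the number of cached invocations.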

Such simplistic initial design showed its limitations by

The question is how we could redesign (or just expand, since the current implementation is good in its simplicity) fscacher to possibly

  • RF I3: provide an alternative (to directory with 2 files)
    • more efficient in case of working with lots of small files
    • less reliant on the filesystem and thus gentler on the inodes of the storage backend
    • might need to have locking mechanisms added to guarantee
  • RF I2 and/or maybe I1: provide some "locality":
    • "dataset level": so we could keep cache within any given dandiset, so if that dandiset is removed, cache goes along with it
    • "directory/path" (in some cases only): keep all fingerprints for files within a .zarr/ folder in a single cache "file". That would save inodes, and mv and similar operations would be a matter of copying/changing one such file.
    • in principle, both localities could already be achieved to some degree with the current fscacher by establishing a separate cache per specific dataset/zarr file. We would just need to move away from the simple generic @decorator form and instantiate a cache for each dandiset/zarr file in a given location. For 'zarr' support, though, some alternative storage backend would be needed
  • gain more versatile cleaning
    • maybe tie it a little to the code, i.e. removal of a file by the code could trigger cleaning the cache for that path, although I am not sure such explicit ties are really worth adding.
  • support "early decision" for directories (e.g., decide that a directory changed and run the target compute while, in parallel, finishing the full directory fingerprint; might need storing a "tree" fingerprint at least to some depth level)
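
One possible shape for the "RF I3" and "locality" points above, as a rough sketch rather than a proposal for the actual fscacher API: a cache instantiated per dataset that keeps all entries in a single SQLite file inside that dataset. The class name `SingleFileCache`, the file name `.fscache.sqlite`, and the size + mtime fingerprint are all assumptions made for illustration; SQLite is just one candidate backend that brings its own locking.

```python
import os
import pickle
import sqlite3


class SingleFileCache:
    """Hypothetical per-dataset cache: one SQLite file instead of a directory per call."""

    DB_NAME = ".fscache.sqlite"  # assumed name, lives inside the dataset itself

    def __init__(self, root):
        self.root = os.path.abspath(root)
        # SQLite serializes concurrent writers; timeout is how long to wait on a lock.
        self.conn = sqlite3.connect(os.path.join(self.root, self.DB_NAME), timeout=30)
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS cache ("
                "func TEXT, relpath TEXT, fingerprint TEXT, value BLOB, "
                "PRIMARY KEY (func, relpath))"
            )

    def memoize_path(self, func):
        def wrapper(path):
            # Keys are relative to the dataset root, so they do not depend on
            # where the dataset directory itself is located.
            rel = os.path.relpath(os.path.abspath(path), self.root)
            st = os.stat(path)
            fp = f"{st.st_size}:{st.st_mtime_ns}"
            row = self.conn.execute(
                "SELECT fingerprint, value FROM cache WHERE func = ? AND relpath = ?",
                (func.__name__, rel),
            ).fetchone()
            if row is not None and row[0] == fp:
                return pickle.loads(row[1])
            value = func(path)
            with self.conn:  # commits on success
                self.conn.execute(
                    "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
                    (func.__name__, rel, fp, pickle.dumps(value)),
                )
            return value

        return wrapper

    def close(self):
        self.conn.close()
```

Usage would be along the lines of `cache = SingleFileCache(dandiset_path)` followed by `checksum = cache.memoize_path(compute_checksum)`, instead of a module-level decorator. Removing the dataset removes its cache with it, and the number of cache inodes stays constant regardless of how many files are fingerprinted.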

edit: adding an explicit section

Features

  • be able to query the status (whether changed) of a path (directory). This will be of use for DataLad as well, so we might need to abstract those interfaces
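
A rough sketch of what such a "has this directory changed?" query could look like, including the "early decision" idea of answering as soon as the first difference is seen rather than after the full walk. The function names and the per-file (size, mtime) fingerprint are assumptions for illustration only; running the target compute in parallel with the remaining walk is not shown.

```python
import os


def iter_fingerprints(root):
    """Lazily yield (relative path, fingerprint) for every file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            yield os.path.relpath(full, root), (st.st_size, st.st_mtime_ns)


def snapshot(root):
    """Full "tree" fingerprint: mapping of relative path to per-file fingerprint."""
    return dict(iter_fingerprints(root))


def has_changed(root, stored):
    """Return True as soon as any file differs from the stored snapshot."""
    seen = 0
    for rel, fp in iter_fingerprints(root):
        if stored.get(rel) != fp:
            return True  # early decision: new or modified file, stop walking
        seen += 1
    # Every file present matched; a lower count means some were removed.
    return seen != len(stored)
```

In the unchanged case the whole tree still has to be walked, so the gain is limited to directories that did change; storing fingerprints per subdirectory would be one way to prune further.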

WDYT @jwodder -- do you have any ideas/vision on what the next design of fscacher could be to make our life in dandi-cli better?
