
Conversation

@marktsuchida commented Oct 30, 2025

This PR builds on top of #535 (MM::StorageDevice). I'm using a new PR to keep things organized, since I expect to make some pretty extensive changes.
(Merging this PR should mark #535 as merged, as long as #535 is not further modified.)
(You will see all the commits of #535 in this PR, too.)

See only the changes relative to #535.

This is a work in progress to refine the API design for Storage devices. The goal is to get to something that we can merge with reasonable confidence that major breaking changes won't be needed for further evolution.

Tasks (includes items in my review comments to #535; may not be exhaustive):

Big issues

  • The API is overfitted to Go2ScopeTiff, with features probably not relevant to other backends
    • acquire-zarr needs all configuration at time of stream creation; after that, we only append pixel data
      • AddImage() is not possible
      • configureDatasetDimension() won't work after creation
      • configureDatasetCoordinate() won't work after creation (if ever)
      • The distinction between dimension "name" and "meaning" is probably not general
      • Summary metadata has no place to go (unless used as acquire-zarr "custom metadata", but iffy)
      • Image metadata has no place to go (perhaps we do want to support per-plane metadata in other backends, but it's not 100% clear what the best API would be -- a single string may not always suffice)
      • Custom metadata (mutable key-value API) has no place to go (acquire-zarr "custom metadata" is a single JSON blob)
    • When we consider the properties of other formats (e.g. OME-TIFF) or I/O libraries we'll have similar mismatches
    • Conversely, there are additional, backend-dependent configurations that can go into creating a dataset (e.g., everything in ZarrStreamSettings) that won't be possible with the current API
      • Not being able to specify chunking from the app seems critically limiting
    • Given that we're not even trying to come up with an API that abstracts backend differences (outside of pixel data), maybe just pass in a JSON configuration to createDataset() and leave it up to the storage backend to interpret? (A sketch of this idea follows this list.)
      • This may allow us to, e.g., pass in JSON auto-transformed on the app side from yaozarrs or ome-zarr-models in the case of acquire-zarr
      • The JSON config would replace all of: dataset (summary) metadata, dimension names, coordinate names, custom metadata; but createDataset() would still take shape and pixelFormat (possibly redundantly)
  • (reading) API should explicitly model write-only datasets (acquire-zarr): e.g., app should be able to query if readable
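
As a purely illustrative sketch of the JSON-config idea above: the keys below are invented for this example; by design, each storage backend would define and document its own configuration schema (for acquire-zarr it might mirror ZarrStreamSettings, or be generated on the app side from yaozarrs/ome-zarr-models).

    // Hypothetical only: the JSON keys are invented for illustration; each
    // storage backend would define and document its own configuration schema.
    const char* configJson = R"json({
      "dimensions": [
        {"name": "t", "chunk_size": 1},
        {"name": "c", "chunk_size": 1},
        {"name": "y", "chunk_size": 256},
        {"name": "x", "chunk_size": 256}
      ],
      "compression": "zstd",
      "summary_metadata": { "experiment": "example-run" }
    })json";

    // Shape and pixel format would still be passed explicitly (possibly
    // redundantly with the config), per the item above:
    //   handle = createDataset(storageLabel, path, name, shape, pixelFormat, configJson);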

API design

  • Let the Core assign dataset handles, rather than the devices
  • GetProgress() - remove for now? (Also check for other unimplemented functions.)
  • (reading) What is the lifetime of the pointer returned by GetImage()? (Implicitly until the next call?) Or should we instead copy into a caller-provided buffer? (See the sketch after this list.)
    • Reading into caller-provided buffer better matches OS APIs and gives caller max control over allocation and copy (and caching, if desired; we certainly don't want to see each storage backend implement ad hoc caching)
    • Counterargument: some I/O libraries (TensorStore) may support caching and return a shared (refcounted) array; requiring a caller-provided buffer adds a copy. But we don't have a mechanism to support application (Python/Java) side handles to buffers, and adding that here would be scope creep. We can consider such an optimization in the future if it becomes important.
  • (reading) getDeviceNameToOpenDataset(): replace with isDatasetSupportedByStorage(storageLabel, path)
  • StorageDataType should not be specific to storage; let's define a generally useful PixelFormat type
  • Create() says dataset name "may be modified ... to avoid overwriting". This is unhelpful, especially when left to each device -- user has no way to predict the resulting dataset name. We've had lots of bad experience with this in the MMStudio API (and GUI).
  • IsOpen() is questionable: Core may be able to support this now without device's help, but what is the use case? Remove.
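
To make the trade-off in the GetImage() item concrete, here is a hypothetical contrast of the two reading styles (neither is the current header; all names and signatures here are invented for illustration):

    #include <cstddef>

    // Style A: the device returns a pointer into memory it owns. The pointer's
    // lifetime (until the next call? until the dataset closes?) must be
    // documented and honored, and any caching has to be reimplemented per backend.
    struct ReaderStyleA {
        virtual const unsigned char* GetImage(int handle, const long* coords,
                                              int numCoords) = 0;
        virtual ~ReaderStyleA() = default;
    };

    // Style B: the caller provides the destination buffer. Lifetime questions
    // disappear, the application controls allocation, copying, and caching, and
    // the return value can carry an error code.
    struct ReaderStyleB {
        virtual int GetImage(int handle, const long* coords, int numCoords,
                             unsigned char* dest, std::size_t destSize) = 0;
        virtual ~ReaderStyleB() = default;
    };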

API technicalities

  • Sizes should be size_t, not int
  • Axis lengths should probably be 64-bit
  • Clarify rules for null terminator for metadata buffers
  • (reading) We need a way for GetImage() to return an error code
  • (reading) ReleaseStringBuffer() implementation should not default to delete[]
  • Consider using a custom type for handles (struct DatasetHandle { int h; };)
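
For the last item, a minimal sketch of what a strong handle type could look like (illustrative only, not proposed verbatim):

    // Wrapping the int prevents accidentally passing an unrelated integer
    // (an image index, an error code) where a dataset handle is expected.
    struct DatasetHandle {
        int h;
    };

    inline bool operator==(DatasetHandle a, DatasetHandle b) { return a.h == b.h; }
    inline bool operator<(DatasetHandle a, DatasetHandle b) { return a.h < b.h; }

    // void closeDataset(DatasetHandle handle);  // closeDataset(42) would no longer compile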

API naming

  • appendNextToDataset(): rename to popNextImageAndAppendToDataset() (or similar)

MMCore behavior

  • Only support mono images (for now), and correctly report error on RGB (BGRa)
  • Multi Camera won't work with appendNextToDataset(); can we prevent misuse?
  • AppendImage(): Core should throw if read-only, before asking device; similar for all invalid calls on read/write-only datasets
  • Ensure datasets are closed when the storage device is unloaded

Misc

  • MaxMetadataLength is unused; remove
  • Can we have automated/unit tests instead of ad-hoc programs?

Functionality beyond write/read

  • (reading) Clarify what (should) happen when user tries to "load" a dataset that is already open
  • (reading) Clarify Delete() requirements: dataset must be open (or user must pass path)
  • (reading) List() API is hard to use correctly (caller must initialize buffers with empty string; if all caller-provided buffers are filled, can't tell how much more is needed) -- and not exposed in Core API
  • (reading) Freeze() should either be required for read-write devices or (temporarily) removed from the API

Known limitations whose fixes are out of scope for this PR (can be fixed later):

  • No direct stream ("attached storage") from camera to storage yet
  • AcquireZarrStorage only supports a single dataset open at a time

These plans are not set in stone, so please feel free to propose alternatives or ask questions!

Cc: @go2scope, @nicost, @tlambert03.

nenada and others added 30 commits February 20, 2025 19:11
…he system level to a different interface number
nenada and others added 11 commits February 24, 2025 19:48
…ets from different storage devices; added explicit deviceLabel to alternative create and load methods
Using two different integers ("core" and "device" handles) to represent
the same dataset is unnecessary.

Instead, let the Core assign a globally unique dataset handle, and
eliminate handle assignment from devices entirely. This also avoids
having to copy-paste the handle assignment logic every time a new storage device is written.

In the Core, instead of mapping handles to device labels
(`std::string`), map to `std::weak_ptr<StorageInstance>`. This is
important, because otherwise there could be ABA problems: the storage
device with the same label may not be the same device, or it may have
been reloaded, invalidating all handles. (That could also be prevented
by making sure to remove handles when devices are unloaded, but it's
better to eliminate even the possibility of such an error.)
@marktsuchida

I'm thinking the go2scope device adapter should be just G2SBigTiffStorage and AcquireZarrStorage should be split out into its own device adapter. Reasons:

  • The 2 have almost no code in common
  • AcquireZarrStorage has dependencies, and it would be nice if G2SBigTiffStorage could be built without those
  • It's easier to read the code when you know that the 2 are independent, instead of having to determine that yourself
  • I don't think we want to explain to users over and over that they need to look under go2scope to find the AcquireZarrStorage

Each device should write this function to match how it allocates string
buffers. An implicit default is easy to miss and dangerous.
It is not clear that this is the best API (as opposed to, say, a
callback). Let's leave it out until we have at least a proof-of-principle
implementation.
It doesn't make sense to put an arbitrary selection of projects in the
official repo.

Added *.slnf to .gitignore so that people (including myself!) can use
their own solution filters locally without adding them to Git.
AcquireZarrStorage and G2SBigTiffStorage are completely independent of
each other, so this makes more sense.

Should help with discoverability, build configuration, and code
comprehension.
Inconsistency can cause problems on Linux.

go2scope -> Go2ScopeTmp
Go2Scope -> Go2ScopeTmp

The 'Tmp' will be removed in the next commit -- using two separate
commits helps Git get things right on Windows and macOS.
@marktsuchida

After further thinking, I've updated the to-do items above and categorized them.

There's a lot to fix here, and I notice that limiting an initial version to write-only storage would probably allow us to make Storage devices available sooner (an idea that I think some of us have floated previously). Also, it will be easier to decouple the API and MMCore implementation from the G2STiff adapter, for the time being, if we do that (rather than awkwardly leaving around unexposed reading capabilities). Conversely, I think it will be good to ensure that (an up-to-date version of) acquire-zarr is a good fit for the device API.

So one direction I'm entertaining is to break down the process as follows:

  • Branch off this PR to temporarily remove the device adapters (Go2Scope (G2STiff), and AcquireZarr, which is based on an obsolete(?) version of acquire-zarr and only supports a single dataset at a time).
  • On said branch, trim the API down to a minimal write-only API (solving the issues listed above that are not marked (reading)).
  • Also on said branch, re-add an AcquireZarr device adapter based on the current acquire-zarr C API. In addition to proving out the API, this should hopefully be fully usable.
  • After that work is finished and merged, we can revisit G2STiff, including resurrecting reading capabilities in the API.

Note that I'm not proposing to eliminate the read capabilities of G2STiff, only to temporarily proceed without them so that we can work on one thing at a time.

The minimal API (on the application side) would be something close to this:

  • createDataset(storageLabel, path, name, shape, pixelFormat, config) - takes backend-defined JSON config (discussed above)
  • closeDataset(handle)
  • getDatasetPath(handle), getDatasetName(handle)
  • getDatasetShape(handle), getDatasetPixelFormat(handle)
  • appendImageToDataset(handle, size, pixels) - per-plane metadata support deferred for now
  • snapAndAppendToDataset(handle), popNextImageAndAppendToDataset(handle)

I think this maps nicely to acquire-zarr, but is also implementable by most backends.
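
For concreteness, here is a rough usage sketch of that minimal surface, written against a made-up interface so it stands alone (the names follow the list above; the PixelFormat enum and the JSON contents are placeholders, not actual definitions):

    #include <cstddef>
    #include <string>
    #include <vector>

    enum class PixelFormat { Gray8, Gray16 };  // placeholder for the proposed type

    struct MinimalStorageApi {
        virtual int createDataset(const std::string& storageLabel,
                                  const std::string& path, const std::string& name,
                                  const std::vector<long>& shape,
                                  PixelFormat pixelFormat,
                                  const std::string& configJson) = 0;
        virtual void appendImageToDataset(int handle, std::size_t size,
                                          const unsigned char* pixels) = 0;
        virtual void closeDataset(int handle) = 0;
        virtual ~MinimalStorageApi() = default;
    };

    // Typical app-side flow: create once, append planes strictly in order, close.
    void writeRun(MinimalStorageApi& api, const std::string& storageLabel,
                  const std::vector<std::vector<unsigned char>>& planes) {
        const std::vector<long> shape = {static_cast<long>(planes.size()), 512, 512};
        const std::string config = R"({"note": "backend-defined settings go here"})";

        const int handle = api.createDataset(storageLabel, "/data", "run-001",
                                             shape, PixelFormat::Gray16, config);
        for (const auto& plane : planes)
            api.appendImageToDataset(handle, plane.size(), plane.data());
        api.closeDataset(handle);
    }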

How do people feel about this? I'd especially appreciate feedback on:

  • Is anything missing from the "minimal API" that would be a deal breaker for initial use with AcquireZarr?
  • What are the planned use cases of G2STiff? Is this format following an existing definition or is this code its definition (and is it set in stone)? Would it be okay if we temporarily prioritize getting Zarr writing available?
  • How much interest is there in OME-TIFF support (note that G2STiff is unrelated to OME-TIFF)? We won't have a storage backend for OME-TIFF initially, but I was planning on ensuring that the API design is compatible with the properties of this format.

Cc: @nclack, @aliddell, @talonchandler, @dpshepherd, @go2scope, @nenada, @tlambert03, @nicost.

@tlambert03

I definitely think a step-wise, build-up-from-a-minimal-API approach is the right way to go for something this large. It's nice to have the giant "working principle" branch to work with, but it's simply too large to evaluate. Having you distill the essence of the core from the scope creep would be a great way to move forward on this.
I think all you need for acquire-zarr support is createDataset with some way to map parameters to the ArraySettings spec there (this is tricky to do in a backend-agnostic way, of course... so I think your backend-JSON approach is the only way to go for now), appendImageToDataset, and closeDataset (with the other things you mentioned being nice to have too).
I do think that ome-tiff is pretty important for probably more than half of users, and I honestly don't see that going away any time soon with the pace of ome-zarr adoption and progression.

@nclack commented Dec 4, 2025

@marktsuchida thanks for all the work here!

One of the things I like about the approach with the Storage API is that it promises to give a route to more pluggable file format support. I'm very supportive of splitting out the G2STiff and AcquireZarr writers, and I think an OME-TIFF writer would be great. As you noted, zarr doesn't have great conventions for per-frame metadata at the moment but Tiff does.

I think @tlambert03 got things right wrt the acquire-zarr api. We have changed it a bit since #553. @aliddell maybe you can weigh in here?

@nenada commented Dec 5, 2025

Sorry I was out of the loop for the last couple of months and missed the API discussion. I am catching up now. @marktsuchida, thank you for your effort integrating the storage API. Let me know what I should do to help. I am also available for meetings.

A couple of thoughts:
We don't need to worry about G2STIFF; it was the first adapter we chose to develop the API with, and it was a new format that no one used, so we had the freedom to do whatever we wanted and try a bunch of approaches with the metadata. I was trying to come up with an API that made sense from the application side in general, without worrying about the idiosyncrasies of the existing formats.

We can remove go2scope and G2STiff entirely from micromanager devices. I will add a separate DLL with G2STiff when the API becomes stable.

I agree with Mark's plan for incremental API development. Regarding all the other issues, I think there is a lot of material, and I do think it requires a meeting to discuss and coordinate. I can definitely put some work towards moving things forward, but I'm not sure where to begin.

@aliddell commented Dec 5, 2025

> I think @tlambert03 got things right wrt the acquire-zarr api. We have changed it a bit since #553. @aliddell maybe you can weigh in here?

Talley's assessment about what the aqz API needs is spot on. Several things in the (configuration) API have changed since #535, but probably the most relevant change is that multiple output arrays are now supported within a single Zarr dataset, so the properties of the dataset that pertain specifically to arrays have been consolidated into a ZarrArraySettings struct. There's also a flag which allows users to overwrite (or not) anything that lives in the base dataset.

@marktsuchida commented Dec 5, 2025

Thanks, all, for your comments and support!

> I think there is a lot of material and I do think it requires a meeting to discuss and coordinate.

Absolutely happy to discuss on Zoom! Let me hammer at this a bit more and see if I can come up with something concrete that we can use as a basis for discussion -- API design being so much about details, I think that might be more productive (and I won't mind if that results in more code churn on my side).

> multiple output arrays

This is a good topic to think ahead about (thanks also @tlambert03 for helpful discussion that brought this up).

@aliddell Does overwrite=false allow one to reopen a partially written dataset and add new arrays to it (as in adding more HCS plates/wells/FOVs)?

If so, specifying the array name (output key) in the dataset configuration (JSON passed at creation time) would allow the use of multi-array datasets -- so long as their writing doesn't overlap in time (my understanding; please correct me if wrong).

Regardless, I think there is a path to fairly cleanly supporting simultaneous writing to multiple arrays, and this could be useful for OME-TIFF (which can potentially be a multi-file dataset) as well. Maybe not necessary in the first version of the API, but can be added subsequently. Here's the idea:

The appendImageToDataset() function could gain an optional "stream" parameter (probably a string). Valid values for this parameter would be determined by the storage implementation, usually (but not necessarily) based on the configuration provided at dataset creation time.

Terminology mapping something like this:

MM::Storage API    Zarr v3 spec      acquire-zarr    OME-TIFF (hypothetical implementation)
dataset            Zarr hierarchy    ZarrStream      dataset; OME-XML
stream             array             output key      TIFF file

In this model, a dataset contains 0 or more streams (1 or more to be useful), and a stream is a portion of the dataset that can only be written sequentially in a predefined order. (Reading, where supported, may be random access -- or not -- but we leave that for later.) The set of valid streams is up to the storage backend/driver, which may expose a single stream, multiple predefined streams, or even arbitrary streams that get created on first access. In the case of acquire-zarr, it would map to the key passed to ZarrStream_append(), predefined in the ZarrArraySettings (and/or HCS settings).
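
A hypothetical shape for this, just to pin down the idea (not existing code; the empty-string default is only one possible convention):

    #include <cstddef>
    #include <string>

    struct StorageWithStreams {
        // An empty stream name could mean "the backend's single default stream".
        // For acquire-zarr, the stream name would map to the output key passed to
        // ZarrStream_append(); for a hypothetical OME-TIFF backend, to one TIFF
        // file within the dataset.
        virtual void appendImageToDataset(int handle, std::size_t size,
                                          const unsigned char* pixels,
                                          const std::string& stream = "") = 0;
        virtual ~StorageWithStreams() = default;
    };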

When we add support for attaching storage to a camera for direct saving, it will be a stream, not a whole dataset, that gets attached, at least by default (though there might be a need for virtual streams that span multiple backend-defined streams).

(The "stream" parameter might also be a way to support random access writers (which can be modeled as having a stream per plane, for example), should the need arise.)

@aliddell commented Dec 8, 2025

> @aliddell Does overwrite=false allow one to reopen a partially written dataset and add new arrays to it (as in adding more HCS plates/wells/FOVs)?

Yes, that's right, though there's a risk that metadata can be overwritten. For example, if we have the structure

dataset.zarr/
├── A # contains a multiscale group
│   ├── 0 # full-res LOD
│   │   ├── c
│   │   │   ├── ...
│   │   │   └── zarr.json
│   │   └── zarr.json # contains NGFF multiscales metadata
│   └── zarr.json 
└── zarr.json

and we later decide to write another array at A/B, then acquire-zarr will overwrite A/zarr.json and the NGFF multiscales metadata contained therein. Right now acquire-zarr doesn't have any way to reconcile multiple writes to the same dataset from different processes. This is why we're able to configure multiple arrays at the same time in a single stream.

> If so, specifying the array name (output key) in the dataset configuration (JSON passed at creation time) would allow the use of multi-array datasets -- so long as their writing doesn't overlap in time (my understanding; please correct me if wrong).

Not exactly. The array name (output key) doesn't belong in the dataset configuration; it belongs in an array configuration, and you can pass several of these at dataset creation time.

@marktsuchida

Makes sense, thanks. We can initially limit our support to configs where we only have one ZarrArraySettings in the ZarrStreamSettings, but that will still allow adding arrays via overwrite=false (with the caveats about concurrent access).
