
Zarr3 transition#192

Draft
mat10d wants to merge 93 commits into main from zarr3-transition

Conversation

Collaborator

@mat10d mat10d commented Feb 5, 2026

Description

Thank you for your contribution to Brieflow!
Please succinctly summarize your proposed change.
What motivated you to make this change?

Please also link to any relevant issues that your code is associated with.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • My code follows the conventions of this project.
  • I have updated the pyproject.toml to reflect the change as designated by semantic versioning.
  • I have checked linting and formatting with ruff check and ruff format.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have deleted all non-relevant text in this pull request template.

petercla0119 and others added 30 commits January 30, 2026 10:01
- Updated `write_image_omezarr` and `write_labels_omezarr` functions to accept pixel sizes as float, tuple, or dictionary, allowing for more flexible input formats.
- Introduced `_parse_pixel_sizes` helper function to standardize pixel size extraction and validation.
- Enhanced metadata extraction in `extract_metadata_tile_nd2` and `extract_metadata_well_nd2` to include pixel size, objective magnification, zoom magnification, and binning information.
- Updated `export_omezarr_image` script to read image data from TIFF or raw formats, improving compatibility with different data sources.
- Added warnings for potential inconsistencies in pixel size calibration.
- Introduced `conftest.py` to ensure the repository root is included in `sys.path` for test imports.
- Updated `test_preprocess.py` to assert required columns in metadata instead of exact counts.
- Modified `test_omezarr_exports.py` to check for any Zarr files in the output directory.
- Enhanced `write_image_omezarr` to accept new parameters: `coarsening_factor`, `max_levels`, and `is_label`, improving flexibility in image writing.
- Added error handling for `max_levels` and `coarsening_factor` to ensure valid values.
- Updated metadata handling in `write_image_omezarr` to accommodate label images and ensure proper storage of pixel sizes.
omezarr_writer.py moved under lib/shared
resolves issue with failed labels import into napari
…ss_zarr_v4

Cherry-picked from 1161474 on 79f1eea_preprocess_zarr_v4.
- Add dynamic key selection (CONVERT_SBS_KEY/CONVERT_PHENOTYPE_KEY) based on OME_ZARR_ENABLED
- IC fields respect IC_EXT (zarr vs tiff) based on config
- Downstream rules (sbs.smk, phenotype.smk) use dynamic keys for preprocess inputs
- Add image_to_omezarr.py script that uses convert_to_array + write_image_omezarr
- Add convert_sbs_omezarr and convert_phenotype_omezarr rules
- Update CONVERT_*_KEY selection to use _omezarr variants when USE_OME_ZARR=True
- This allows direct ND2→Zarr conversion, bypassing TIFF intermediates entirely
- Added integration tests for Zarr preprocessing functionality, ensuring nd2_to_zarr conversion produces outputs equivalent to TIFF conversion.
- Updated pytest markers to include integration tests.
- Modified existing tests to prioritize Zarr format over TIFF where applicable.
- Introduced new rules for Zarr conversion in the Snakemake workflow, allowing for flexible output formats based on configuration.
- Implemented a script for direct ND2 to standard Zarr conversion, streamlining the preprocessing pipeline.
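The dynamic key selection above can be sketched as a small dispatch function. The `_omezarr` key suffix mirrors the commit message; the config flag name and signature are illustrative assumptions:

```python
def select_convert_key(config, module="sbs"):
    """Pick the _omezarr rule key when direct ND2 -> Zarr conversion is on."""
    base = f"convert_{module}"
    # When OME-Zarr output is enabled, downstream rules resolve their
    # preprocess inputs through the zarr-producing rule instead of TIFF.
    return f"{base}_omezarr" if config.get("use_ome_zarr", False) else base
```

Routing all downstream rules through one key lookup is what allows the TIFF intermediates to be bypassed entirely without touching the SBS/phenotype rule bodies.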
mat10d and others added 30 commits March 30, 2026 18:05
- Add unit on T axis ("second") and spatial axes in _axes_str_to_dicts
- Separate label axis unit patching from pixel scale patching so units
  are always set even without preprocess metadata
- Re-inject downsamplingMethod after iohub dump_meta (which strips it)
- segmentation.method now includes model (e.g. "cellpose.cyto3")
- segmentation.stitching uses string "none" instead of boolean false
- Add statistics.n_cells by counting unique labels in the array
- Validated: 0 errors with ops-schema validator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
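The axis-unit fix above can be sketched like this. Unit strings ("second", "micrometer") follow the OME-NGFF axes spec; the function body itself is an assumption about what `_axes_str_to_dicts` does, not a copy of it:

```python
def axes_str_to_dicts(axes="tczyx", space_unit="micrometer"):
    """Build OME-NGFF axis dicts with units on time and spatial axes."""
    out = []
    for ax in axes:
        if ax == "t":
            out.append({"name": "t", "type": "time", "unit": "second"})
        elif ax == "c":
            # Channel axes carry no unit in NGFF.
            out.append({"name": "c", "type": "channel"})
        else:  # z, y, x are spatial
            out.append({"name": ax, "type": "space", "unit": space_unit})
    return out
```

Because the units are attached when the axis dicts are built, labels get them even when no preprocess metadata is available to patch pixel scales.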
…with output_to_input() lambdas using new _merge_well_expand helpers. Also add a row to cell_data_metadata_cols.tsv so aggregate steps treat it as metadata, not a feature.
Template CSV with 188 cp_emulator feature patterns. {Compartment} and
{Channel} placeholders are expanded at submission time by the finalize
rule using channel names from config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New add_ensembl_ids() function maps Entrez gene IDs to Ensembl IDs using
either a static TSV mapping file or Ensembl REST API fallback. Wired into
standardize_barcode_design() via ensembl_mapping_path parameter. Non-targeting
controls are automatically labeled "non-targeting".

Required for OPS Data Standard perturbation_library.csv which requires
Ensembl gene IDs (ENSG format) instead of Entrez IDs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
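The static-TSV path of the mapping can be sketched as below. Column names (`entrez_id`, `ensembl_gene_id`) and the control labeling rule are assumptions based on the commit message, not the actual `add_ensembl_ids()` body:

```python
import pandas as pd

def add_ensembl_ids(design: pd.DataFrame, mapping_tsv) -> pd.DataFrame:
    """Map Entrez IDs to Ensembl IDs via a two-column TSV mapping file."""
    mapping = pd.read_csv(mapping_tsv, sep="\t", dtype=str)
    lookup = dict(zip(mapping["entrez_id"], mapping["ensembl_gene_id"]))
    out = design.copy()
    out["ensembl_gene_id"] = out["entrez_id"].map(lookup)
    # Non-targeting controls have no gene ID; label them explicitly,
    # as the OPS perturbation_library.csv expects.
    out.loc[out["entrez_id"].isna(), "ensembl_gene_id"] = "non-targeting"
    return out
```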
- get_gene_mapping(): downloads gene_symbol/entrez_id/ensembl_gene_id
  mapping from Ensembl BioMart at runtime (same pattern as UniProt download)
- resolve_gene_ids(): fills in missing gene identifiers from any starting
  point (symbol only, Entrez only, Ensembl only, or mixed)
- Wired into standardize_barcode_design() via gene_mapping_path parameter
- Replaces the earlier add_ensembl_ids() which only handled Entrez → Ensembl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced BioMart bulk download (unreliable) with MyGene.info querymany()
for targeted symbol→Ensembl/Entrez resolution. Only looks up genes
present in the user's barcode library — fast and doesn't hit API limits.

Requires: uv pip install mygene

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
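MyGene.info's `querymany()` returns a list of hit dicts (with `query`, `entrezgene`, `ensembl`, and a `notfound` flag on misses). A hedged sketch of collapsing those hits into a symbol lookup — the shaping function is an assumption about downstream handling, not Brieflow's code:

```python
def hits_to_mapping(hits):
    """Collapse querymany() hits into symbol -> {ensembl, entrez} records."""
    mapping = {}
    for hit in hits:
        if hit.get("notfound"):
            continue  # symbol not resolvable; leave it out of the mapping
        ens = hit.get("ensembl")
        if isinstance(ens, list):
            ens = ens[0]  # multiple Ensembl records: take the first
        mapping[hit["query"]] = {
            "ensembl_gene_id": (ens or {}).get("gene"),
            "entrez_id": hit.get("entrezgene"),
        }
    return mapping
```

The live call would be roughly `mygene.MyGeneInfo().querymany(symbols, scopes="symbol", fields="ensembl.gene,entrezgene", species="human")`, which only queries the symbols actually present in the barcode library.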
Brings in main fixes: recombination detection, spatial heatmaps,
aggregate edge cases, resolve_path, file_manifest, custom cellpose,
aggregation/clustering cleanup. Keep version at 1.5.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gene_id now contains Ensembl IDs (replaces Entrez)
- Preserve full sgRNA as protospacer_sequence before prefix truncation
- Derive role (targeting/control) and control_type from nontargeting patterns
- Add protospacer_adjacent_motif ("3' NGG" for Cas9)
- At export time, prep_cellxstate.sh just renames prefix→barcode and
  adds perturbation_id=gene_symbol

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main's spatial heatmap changes introduced hardcoded ["well"] expansion
values in eval rules. In zarr mode, wildcards use row/col instead of
well. Replace with _phen_well_expand/_sbs_well_expand/_sbs_tile_expand
which dispatch correctly based on IMG_FMT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
barcode_col pointed to sgRNA which no longer exists in the new
barcode library format. Use prefix_col: prefix instead, which is
the truncated barcode used for read matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge outputs (parquets) always use {well} paths regardless of format.
Only cross-module references (SBS/phenotype/preprocess outputs) need
format-aware expansion. Distinguish between:
- Merge own outputs: always expansion_values=["well"]
- SBS/phenotype data outputs: _merge_well_expand_all (row/col in zarr)
- Preprocess metadata: _merge_well_expand_all (row/col in zarr)

Also adds _combos_with_well() helper for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
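The format-aware expansion can be sketched as a helper in the spirit of `_merge_well_expand` (the name comes from the commits; the signature and the `well_layout` input are illustrative assumptions):

```python
def well_expand_values(img_fmt, wells, well_layout):
    """Return wildcard expansion dicts for the given image format."""
    if img_fmt == "zarr":
        # HCS zarr paths address wells as plate row / column.
        return [{"row": well_layout[w][0], "col": well_layout[w][1]} for w in wells]
    # TIFF mode keeps the flat {well} wildcard.
    return [{"well": w} for w in wells]
```

This is the dispatch that the merge rules need only for cross-module inputs; the merge module's own parquet outputs always expand over `{well}`.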
…es_singlecell.parquet outputs into a single AnnData .h5ad per channel combo, combining all cell classes
In zarr mode, the montage pipeline now writes individual cell crops to
an examples.zarr store ({gene}/{barcode}/0..N/) instead of tiled
PNG + TIFF montages. TIFF mode unchanged.

Changes:
- montage_utils.py: add_filenames() zarr-aware, grid_view() uses read_image()
- generate_montage.py: dispatches on IMG_FMT (zarr crops vs PNG/TIFF grid)
- aggregate.smk/targets: conditional outputs for zarr vs tiff mode
- rule_utils.py: get_montage_inputs() handles None overlay template

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each cell crop is now written as a proper OME-Zarr with channel names,
axes, and coordinate transforms via save_image(). This means each crop
carries its own channel metadata rather than relying on the parent store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Label groups inside a .zarr store should not have .zarr suffix —
e.g. labels/nuclei not labels/nuclei.zarr. The suffix caused
napari-ome-zarr to silently skip segmentations because the labels
index listed ["nuclei"] but the directory was "nuclei.zarr".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
iohub's channel_display_settings only recognizes standard fluorophore
names (DAPI, GFP, etc.) — marker names like COXIV, CENPA, WGA got
white/inactive defaults. Now:
- All channels set to active: true
- Colors: config color > iohub color > default palette fallback
- Default palette: blue, green, red, magenta, yellow, cyan, orange, purple

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
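The color-resolution chain above (config color > iohub color > default palette) can be sketched like this. The function name and hex palette values are illustrative assumptions; only the precedence order and palette color names come from the commit:

```python
DEFAULT_CHANNEL_COLORS = [
    "0000FF", "00FF00", "FF0000", "FF00FF",  # blue, green, red, magenta
    "FFFF00", "00FFFF", "FFA500", "800080",  # yellow, cyan, orange, purple
]

def resolve_channel_color(index, config_color=None, iohub_color=None):
    """Pick a channel color: config > iohub suggestion > palette fallback."""
    if config_color:
        return config_color
    # iohub only recognizes standard fluorophores; unknown markers come
    # back white, which we treat as "no suggestion".
    if iohub_color and iohub_color.upper() != "FFFFFF":
        return iohub_color
    return DEFAULT_CHANNEL_COLORS[index % len(DEFAULT_CHANNEL_COLORS)]
```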
Same white-color issue as HCS metadata — hardcoded FFFFFF for all
channels. Now uses the same default palette (blue, green, red, magenta,
etc.) so example zarr crops and all OME-Zarr writes get distinct colors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move DEFAULT_CHANNEL_COLORS to io.py as shared constant, import in
  write_hcs_metadata.py (removes duplicate palette definition)
- Example zarr crops: max_levels=1 (no pyramids for 80px images)
- Remove unused _combos_with_well() helper from merge.smk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phenotype: DAPI=blue, COXIV=green, CENPA=red, WGA=magenta
SBS: DAPI=blue, G=green, T=red, A=yellow, C=magenta

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tile-based workflows don't need pyramids — images are ~2400x2400.
Changed default max_levels from 5/4 to 1 in save_image() and
write_image_omezarr(). Added zarr_max_levels config option for
documentation. Users wanting pyramids pass max_levels explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
image-label metadata lives at attributes.ome.image-label in zarr v3,
not attributes.image-label. Without this fix, the labels container
zarr.json never gets written because no label stores are detected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
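The attribute layout the fix targets can be sketched with a small detection helper (an illustrative assumption, not the actual Brieflow function). In zarr v3, OME metadata including `image-label` nests under `attributes["ome"]`:

```python
def is_label_store(zarr_json: dict) -> bool:
    """Detect a label store from its zarr.json attributes (zarr v3 layout)."""
    attrs = zarr_json.get("attributes", {})
    # Correct location: attributes.ome.image-label.
    # Checking attributes["image-label"] directly finds nothing, which is
    # why the labels container zarr.json was never written.
    return "image-label" in attrs.get("ome", {})
```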
Brings in Ege's generate_anndata rule and anndata dependency.
Run script updated to include all pipeline stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Brings in Ege's param validation and direct param access across all
scripts. Resolved 6 conflicts — kept our zarr-aware read_image(),
save_image(), uint32 labels, and zarr/tiff montage dispatch while
adopting param validation and direct snakemake.params access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…220)

* fix params for multi mode (#218)

* Rename int → integrated in CP emulator features

Aligns with OPS Data Standard feature naming convention. Changes:
- cp_emulator.py: feature key "int" → "integrated", column mappings
  "int" → "integrated", "int_edge" → "integrated_edge"
- feature_definitions.csv: updated template column names
- CP_EMULATOR_FEATURES.md: updated documentation

This is the only rename needed — all other feature names already match
the standardized Vesuvius feature set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add compartment/channel columns, update feature types in template

Per updated OPS spec:
- morphology → shape
- Correlation features (K, manders, overlap, etc.) → correlation
- New compartment and channel columns for metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add format_cluster_anndata rule for spec-compliant aggregated h5ad

New cluster step that produces aggregated_data.h5ad per the OPS spec:
- obs = perturbations indexed by perturbation_id
- var = standardized feature set (shape + intensity + correlation)
- X = mean aggregated feature values per perturbation
- obsm = PHATE embedding coordinates
- uns = schema_version, default_embedding, title
- Bootstrap p-values wired but optional (TODO: reshape)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
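The structure described above can be sketched by assembling the pieces one would pass to `anndata.AnnData(X, obs=obs, var=var, obsm=obsm, uns=uns)`. Everything here is a hedged sketch built with plain pandas/numpy — the schema_version value and `X_phate` key are placeholders, not confirmed by the source:

```python
import numpy as np
import pandas as pd

def build_cluster_anndata_parts(features: pd.DataFrame, phate: np.ndarray):
    """Assemble the obs/var/X/obsm/uns components for aggregated_data.h5ad."""
    obs = pd.DataFrame(index=features.index)     # perturbations, by perturbation_id
    var = pd.DataFrame(index=features.columns)   # standardized feature set
    X = features.to_numpy(dtype=float)           # mean feature values per perturbation
    obsm = {"X_phate": phate}                    # PHATE embedding coordinates
    uns = {"schema_version": "1.0", "default_embedding": "X_phate"}  # placeholders
    return obs, var, X, obsm, uns
```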

* Add format_cluster_anndata rule for cluster-level h5ad

Combines perturbation-level features with PHATE embedding and cluster
assignments into cluster.h5ad. Includes all available metadata and
features with parsed var annotations (type, compartment, channel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Extract format_cluster_anndata logic into lib function

Move core AnnData construction into workflow/lib/cluster/, keep script
as thin caller. Follows brieflow lib/scripts pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update classifier feature names: int → integrated

Renamed features in the test classifier dill to match the CP emulator
rename. Reverted compatibility shim in train.py — fix at source instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Retrain dummy classifier with integrated feature names

XGBoost's feature_names_in_ is read-only — can't patch the dill.
Retrained a simple dummy classifier on random data with the correct
feature names (int → integrated). Same class structure (Interphase/Mitotic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add dummy_classifier.dill with integrated feature names

Properly trained dummy classifier with:
- Feature names using _integrated (not _int)
- Labels 1=Mitotic, 2=Interphase (matching original config mapping)
- LabelEncoder for XGBoost 0-indexed compatibility
- Config updated to use dummy_classifier.dill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove old classifier with outdated feature names

Replaced by dummy_classifier.dill which uses _integrated feature names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Wire bootstrap p-values into cluster h5ad

Add _add_bootstrap_layers() that reshapes per-feature p-values and
FDR from the combined gene bootstrap TSV into AnnData layers:
- layers["p_values"]: per-feature p-values per perturbation
- layers["neg_log10_fdr"]: -log10(FDR) per feature per perturbation

Bootstrap results wired as input to format_cluster_anndata rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
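The reshape step in `_add_bootstrap_layers()` can be sketched as collecting suffixed columns from the wide bootstrap table into one perturbation-by-feature layer. The function below is an illustrative assumption about that reshape; column naming follows the `{feature}_neg_log10_fdr` convention from the commits:

```python
import pandas as pd

def bootstrap_to_layer(bootstrap: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """Collect {feature}{suffix} columns into a perturbation x feature frame."""
    cols = [c for c in bootstrap.columns if c.endswith(suffix)]
    layer = bootstrap[cols].copy()
    layer.columns = [c[: -len(suffix)] for c in cols]  # strip the suffix
    return layer
```

One caveat worth a comment in real code: pick the longest suffixes first, since `_fdr` also matches columns ending in `_neg_log10_fdr`.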

* Add percentile_rank layer to cluster h5ad

Per-feature percentile rank (0-100) across all perturbations.
Useful for human-readable interpretation of feature values.
Dropped during cellxstate export but retained in pipeline h5ad.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
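The percentile-rank layer can be computed per feature with a one-liner; this sketch (function name assumed) uses pandas' built-in percentage ranking:

```python
import pandas as pd

def percentile_rank_layer(X: pd.DataFrame) -> pd.DataFrame:
    """Per-feature percentile rank (0-100) across all perturbations."""
    # rank(pct=True) gives ranks in (0, 1]; scale to a 0-100 range.
    return X.rank(axis=0, pct=True) * 100.0
```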

* Add neg_log10_fdr to bootstrap, dump all layers into cluster h5ad

Bootstrap now computes and outputs:
- {feature}_neg_log10_pval: -log10(p-value)
- {feature}_fdr: FDR-corrected p-value
- {feature}_neg_log10_fdr: -log10(FDR)

Cluster h5ad reads all four bootstrap columns directly as layers:
p_values, fdr, neg_log10_pval, neg_log10_fdr. No computation at
the cluster step — bootstrap does all the work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Combine all leiden resolutions into single cluster h5ad

format_cluster_anndata now accepts a dict of clusterings (one per
resolution) and merges cluster assignments as separate obs columns:
cluster_group_2, cluster_group_5, etc. Output is one h5ad per
cell_class/channel_combo at cluster/{combo}/{class}/h5ad/cluster.h5ad.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix bootstrap column reorder to match renamed columns

The ordered_cols list in apply_multiple_hypothesis_correction still
referenced _log10 after we renamed to _neg_log10_pval and added
_neg_log10_fdr. Updated to match the actual column names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean obs: drop PHATE duplicates and merge-suffix columns

PHATE_0/1 belong in obsm not obs. cell_count_cluster is a merge
artifact. cluster column replaced by per-resolution cluster_group_N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add col to cell_data_metadata_cols.tsv

The col column (well column index from split_well_to_cols) was missing
from the metadata cols list, causing it to leak into feature columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add row and col to DEFAULT_METADATA_COLS

These columns are added by split_well_to_cols in zarr mode but were
missing from the default metadata list, causing them to leak into
feature columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* merge ege's work, final improvements

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
