
Zarr3 transition#192

Draft
mat10d wants to merge 93 commits into main from zarr3-transition

Conversation

Collaborator

@mat10d mat10d commented Feb 5, 2026

Description

Thank you for your contribution to Brieflow!
Please succinctly summarize your proposed change.
What motivated you to make this change?

Please also link to any relevant issues that your code is associated with.

What is the nature of your change?

  • Bug fix (fixes an issue).
  • Enhancement (adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected).
  • This change requires a documentation update.

Checklist

Please ensure that all boxes are checked before indicating that a pull request is ready for review.

  • My code follows the conventions of this project.
  • I have updated the pyproject.toml to reflect the change as designated by semantic versioning.
  • I have checked linting and formatting with ruff check and ruff format.
  • I have performed a self-review of my own code.
  • I have commented my code, particularly in hard-to-understand areas.
  • I have made corresponding changes to the documentation.
  • My changes generate no new warnings.
  • I have deleted all non-relevant text in this pull request template.

petercla0119 and others added 30 commits January 30, 2026 10:01
- Updated `write_image_omezarr` and `write_labels_omezarr` functions to accept pixel sizes as float, tuple, or dictionary, allowing for more flexible input formats.
- Introduced `_parse_pixel_sizes` helper function to standardize pixel size extraction and validation.
- Enhanced metadata extraction in `extract_metadata_tile_nd2` and `extract_metadata_well_nd2` to include pixel size, objective magnification, zoom magnification, and binning information.
- Updated `export_omezarr_image` script to read image data from TIFF or raw formats, improving compatibility with different data sources.
- Added warnings for potential inconsistencies in pixel size calibration.
- Introduced `conftest.py` to ensure the repository root is included in `sys.path` for test imports.
- Updated `test_preprocess.py` to assert required columns in metadata instead of exact counts.
- Modified `test_omezarr_exports.py` to check for any Zarr files in the output directory.
- Enhanced `write_image_omezarr` to accept new parameters: `coarsening_factor`, `max_levels`, and `is_label`, improving flexibility in image writing.
- Added error handling for `max_levels` and `coarsening_factor` to ensure valid values.
- Updated metadata handling in `write_image_omezarr` to accommodate label images and ensure proper storage of pixel sizes.
omezarr_writer.py moved under lib/shared
resolves issue with failed labels import into napari
…ss_zarr_v4

Cherry-picked from 1161474 on 79f1eea_preprocess_zarr_v4.
- Add dynamic key selection (CONVERT_SBS_KEY/CONVERT_PHENOTYPE_KEY) based on OME_ZARR_ENABLED
- IC fields respect IC_EXT (zarr vs tiff) based on config
- Downstream rules (sbs.smk, phenotype.smk) use dynamic keys for preprocess inputs
- Add image_to_omezarr.py script that uses convert_to_array + write_image_omezarr
- Add convert_sbs_omezarr and convert_phenotype_omezarr rules
- Update CONVERT_*_KEY selection to use _omezarr variants when USE_OME_ZARR=True
- This allows direct ND2→Zarr conversion, bypassing TIFF intermediates entirely
- Added integration tests for Zarr preprocessing functionality, ensuring nd2_to_zarr conversion produces outputs equivalent to TIFF conversion.
- Updated pytest markers to include integration tests.
- Modified existing tests to prioritize Zarr format over TIFF where applicable.
- Introduced new rules for Zarr conversion in the Snakemake workflow, allowing for flexible output formats based on configuration.
- Implemented a script for direct ND2 to standard Zarr conversion, streamlining the preprocessing pipeline.
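The dynamic key selection above can be sketched as a small dispatch function. The `_omezarr` key suffix mirrors the commit message; the config flag name and signature are illustrative assumptions:

```python
def select_convert_key(config, module="sbs"):
    """Pick the _omezarr rule key when direct ND2 -> Zarr conversion is on."""
    base = f"convert_{module}"
    # When OME-Zarr output is enabled, downstream rules resolve their
    # preprocess inputs through the zarr-producing rule instead of TIFF.
    return f"{base}_omezarr" if config.get("use_ome_zarr", False) else base
```

Routing all downstream rules through one key lookup is what allows the TIFF intermediates to be bypassed entirely without touching the SBS/phenotype rule bodies.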
mat10d and others added 30 commits March 30, 2026 18:05
- Add unit on T axis ("second") and spatial axes in _axes_str_to_dicts
- Separate label axis unit patching from pixel scale patching so units
  are always set even without preprocess metadata
- Re-inject downsamplingMethod after iohub dump_meta (which strips it)
- segmentation.method now includes model (e.g. "cellpose.cyto3")
- segmentation.stitching uses string "none" instead of boolean false
- Add statistics.n_cells by counting unique labels in the array
- Validated: 0 errors with ops-schema validator

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
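The axis-unit fix above can be sketched like this. Unit strings ("second", "micrometer") follow the OME-NGFF axes spec; the function body itself is an assumption about what `_axes_str_to_dicts` does, not a copy of it:

```python
def axes_str_to_dicts(axes="tczyx", space_unit="micrometer"):
    """Build OME-NGFF axis dicts with units on time and spatial axes."""
    out = []
    for ax in axes:
        if ax == "t":
            out.append({"name": "t", "type": "time", "unit": "second"})
        elif ax == "c":
            # Channel axes carry no unit in NGFF.
            out.append({"name": "c", "type": "channel"})
        else:  # z, y, x are spatial
            out.append({"name": ax, "type": "space", "unit": space_unit})
    return out
```

Because the units are attached when the axis dicts are built, labels get them even when no preprocess metadata is available to patch pixel scales.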
…with output_to_input() lambdas using new _merge_well_expand helpers. Also add a row to cell_data_metadata_cols.tsv so aggregate steps treat it as metadata, not a feature.
Template CSV with 188 cp_emulator feature patterns. {Compartment} and
{Channel} placeholders are expanded at submission time by the finalize
rule using channel names from config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New add_ensembl_ids() function maps Entrez gene IDs to Ensembl IDs using
either a static TSV mapping file or Ensembl REST API fallback. Wired into
standardize_barcode_design() via ensembl_mapping_path parameter. Non-targeting
controls are automatically labeled "non-targeting".

Required for OPS Data Standard perturbation_library.csv which requires
Ensembl gene IDs (ENSG format) instead of Entrez IDs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
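The static-TSV path of the mapping can be sketched as below. Column names (`entrez_id`, `ensembl_gene_id`) and the control labeling rule are assumptions based on the commit message, not the actual `add_ensembl_ids()` body:

```python
import pandas as pd

def add_ensembl_ids(design: pd.DataFrame, mapping_tsv) -> pd.DataFrame:
    """Map Entrez IDs to Ensembl IDs via a two-column TSV mapping file."""
    mapping = pd.read_csv(mapping_tsv, sep="\t", dtype=str)
    lookup = dict(zip(mapping["entrez_id"], mapping["ensembl_gene_id"]))
    out = design.copy()
    out["ensembl_gene_id"] = out["entrez_id"].map(lookup)
    # Non-targeting controls have no gene ID; label them explicitly,
    # as the OPS perturbation_library.csv expects.
    out.loc[out["entrez_id"].isna(), "ensembl_gene_id"] = "non-targeting"
    return out
```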
- get_gene_mapping(): downloads gene_symbol/entrez_id/ensembl_gene_id
  mapping from Ensembl BioMart at runtime (same pattern as UniProt download)
- resolve_gene_ids(): fills in missing gene identifiers from any starting
  point (symbol only, Entrez only, Ensembl only, or mixed)
- Wired into standardize_barcode_design() via gene_mapping_path parameter
- Replaces the earlier add_ensembl_ids() which only handled Entrez → Ensembl

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaced BioMart bulk download (unreliable) with MyGene.info querymany()
for targeted symbol→Ensembl/Entrez resolution. Only looks up genes
present in the user's barcode library — fast and doesn't hit API limits.

Requires: uv pip install mygene

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
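MyGene.info's `querymany()` returns a list of hit dicts (with `query`, `entrezgene`, `ensembl`, and a `notfound` flag on misses). A hedged sketch of collapsing those hits into a symbol lookup — the shaping function is an assumption about downstream handling, not Brieflow's code:

```python
def hits_to_mapping(hits):
    """Collapse querymany() hits into symbol -> {ensembl, entrez} records."""
    mapping = {}
    for hit in hits:
        if hit.get("notfound"):
            continue  # symbol not resolvable; leave it out of the mapping
        ens = hit.get("ensembl")
        if isinstance(ens, list):
            ens = ens[0]  # multiple Ensembl records: take the first
        mapping[hit["query"]] = {
            "ensembl_gene_id": (ens or {}).get("gene"),
            "entrez_id": hit.get("entrezgene"),
        }
    return mapping
```

The live call would be roughly `mygene.MyGeneInfo().querymany(symbols, scopes="symbol", fields="ensembl.gene,entrezgene", species="human")`, which only queries the symbols actually present in the barcode library.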
Brings in main fixes: recombination detection, spatial heatmaps,
aggregate edge cases, resolve_path, file_manifest, custom cellpose,
aggregation/clustering cleanup. Keep version at 1.5.0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- gene_id now contains Ensembl IDs (replaces Entrez)
- Preserve full sgRNA as protospacer_sequence before prefix truncation
- Derive role (targeting/control) and control_type from nontargeting patterns
- Add protospacer_adjacent_motif ("3' NGG" for Cas9)
- At export time, prep_cellxstate.sh just renames prefix→barcode and
  adds perturbation_id=gene_symbol

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Main's spatial heatmap changes introduced hardcoded ["well"] expansion
values in eval rules. In zarr mode, wildcards use row/col instead of
well. Replace with _phen_well_expand/_sbs_well_expand/_sbs_tile_expand
which dispatch correctly based on IMG_FMT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
barcode_col pointed to sgRNA which no longer exists in the new
barcode library format. Use prefix_col: prefix instead, which is
the truncated barcode used for read matching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge outputs (parquets) always use {well} paths regardless of format.
Only cross-module references (SBS/phenotype/preprocess outputs) need
format-aware expansion. Distinguish between:
- Merge own outputs: always expansion_values=["well"]
- SBS/phenotype data outputs: _merge_well_expand_all (row/col in zarr)
- Preprocess metadata: _merge_well_expand_all (row/col in zarr)

Also adds _combos_with_well() helper for future use.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
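The format-aware expansion can be sketched as a helper in the spirit of `_merge_well_expand` (the name comes from the commits; the signature and the `well_layout` input are illustrative assumptions):

```python
def well_expand_values(img_fmt, wells, well_layout):
    """Return wildcard expansion dicts for the given image format."""
    if img_fmt == "zarr":
        # HCS zarr paths address wells as plate row / column.
        return [{"row": well_layout[w][0], "col": well_layout[w][1]} for w in wells]
    # TIFF mode keeps the flat {well} wildcard.
    return [{"well": w} for w in wells]
```

This is the dispatch that the merge rules need only for cross-module inputs; the merge module's own parquet outputs always expand over `{well}`.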
…es_singlecell.parquet outputs into a single AnnData .h5ad per channel combo, combining all cell classes
In zarr mode, the montage pipeline now writes individual cell crops to
an examples.zarr store ({gene}/{barcode}/0..N/) instead of tiled
PNG + TIFF montages. TIFF mode unchanged.

Changes:
- montage_utils.py: add_filenames() zarr-aware, grid_view() uses read_image()
- generate_montage.py: dispatches on IMG_FMT (zarr crops vs PNG/TIFF grid)
- aggregate.smk/targets: conditional outputs for zarr vs tiff mode
- rule_utils.py: get_montage_inputs() handles None overlay template

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each cell crop is now written as a proper OME-Zarr with channel names,
axes, and coordinate transforms via save_image(). This means each crop
carries its own channel metadata rather than relying on the parent store.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Label groups inside a .zarr store should not have .zarr suffix —
e.g. labels/nuclei not labels/nuclei.zarr. The suffix caused
napari-ome-zarr to silently skip segmentations because the labels
index listed ["nuclei"] but the directory was "nuclei.zarr".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
iohub's channel_display_settings only recognizes standard fluorophore
names (DAPI, GFP, etc.) — marker names like COXIV, CENPA, WGA got
white/inactive defaults. Now:
- All channels set to active: true
- Colors: config color > iohub color > default palette fallback
- Default palette: blue, green, red, magenta, yellow, cyan, orange, purple

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
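The color-resolution chain above (config color > iohub color > default palette) can be sketched like this. The function name and hex palette values are illustrative assumptions; only the precedence order and palette color names come from the commit:

```python
DEFAULT_CHANNEL_COLORS = [
    "0000FF", "00FF00", "FF0000", "FF00FF",  # blue, green, red, magenta
    "FFFF00", "00FFFF", "FFA500", "800080",  # yellow, cyan, orange, purple
]

def resolve_channel_color(index, config_color=None, iohub_color=None):
    """Pick a channel color: config > iohub suggestion > palette fallback."""
    if config_color:
        return config_color
    # iohub only recognizes standard fluorophores; unknown markers come
    # back white, which we treat as "no suggestion".
    if iohub_color and iohub_color.upper() != "FFFFFF":
        return iohub_color
    return DEFAULT_CHANNEL_COLORS[index % len(DEFAULT_CHANNEL_COLORS)]
```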
Same white-color issue as HCS metadata — hardcoded FFFFFF for all
channels. Now uses the same default palette (blue, green, red, magenta,
etc.) so example zarr crops and all OME-Zarr writes get distinct colors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move DEFAULT_CHANNEL_COLORS to io.py as shared constant, import in
  write_hcs_metadata.py (removes duplicate palette definition)
- Example zarr crops: max_levels=1 (no pyramids for 80px images)
- Remove unused _combos_with_well() helper from merge.smk

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Phenotype: DAPI=blue, COXIV=green, CENPA=red, WGA=magenta
SBS: DAPI=blue, G=green, T=red, A=yellow, C=magenta

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tile-based workflows don't need pyramids — images are ~2400x2400.
Changed default max_levels from 5/4 to 1 in save_image() and
write_image_omezarr(). Added zarr_max_levels config option for
documentation. Users wanting pyramids pass max_levels explicitly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
image-label metadata lives at attributes.ome.image-label in zarr v3,
not attributes.image-label. Without this fix, the labels container
zarr.json never gets written because no label stores are detected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
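The attribute layout the fix targets can be sketched with a small detection helper (an illustrative assumption, not the actual Brieflow function). In zarr v3, OME metadata including `image-label` nests under `attributes["ome"]`:

```python
def is_label_store(zarr_json: dict) -> bool:
    """Detect a label store from its zarr.json attributes (zarr v3 layout)."""
    attrs = zarr_json.get("attributes", {})
    # Correct location: attributes.ome.image-label.
    # Checking attributes["image-label"] directly finds nothing, which is
    # why the labels container zarr.json was never written.
    return "image-label" in attrs.get("ome", {})
```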
Brings in Ege's generate_anndata rule and anndata dependency.
Run script updated to include all pipeline stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Brings in Ege's param validation and direct param access across all
scripts. Resolved 6 conflicts — kept our zarr-aware read_image(),
save_image(), uint32 labels, and zarr/tiff montage dispatch while
adopting param validation and direct snakemake.params access.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…220)

* fix params for multi mode (#218)

* Rename int → integrated in CP emulator features

Aligns with OPS Data Standard feature naming convention. Changes:
- cp_emulator.py: feature key "int" → "integrated", column mappings
  "int" → "integrated", "int_edge" → "integrated_edge"
- feature_definitions.csv: updated template column names
- CP_EMULATOR_FEATURES.md: updated documentation

This is the only rename needed — all other feature names already match
the standardized Vesuvius feature set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add compartment/channel columns, update feature types in template

Per updated OPS spec:
- morphology → shape
- Correlation features (K, manders, overlap, etc.) → correlation
- New compartment and channel columns for metadata

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add format_cluster_anndata rule for spec-compliant aggregated h5ad

New cluster step that produces aggregated_data.h5ad per the OPS spec:
- obs = perturbations indexed by perturbation_id
- var = standardized feature set (shape + intensity + correlation)
- X = mean aggregated feature values per perturbation
- obsm = PHATE embedding coordinates
- uns = schema_version, default_embedding, title
- Bootstrap p-values wired but optional (TODO: reshape)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
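The structure described above can be sketched by assembling the pieces one would pass to `anndata.AnnData(X, obs=obs, var=var, obsm=obsm, uns=uns)`. Everything here is a hedged sketch built with plain pandas/numpy — the schema_version value and `X_phate` key are placeholders, not confirmed by the source:

```python
import numpy as np
import pandas as pd

def build_cluster_anndata_parts(features: pd.DataFrame, phate: np.ndarray):
    """Assemble the obs/var/X/obsm/uns components for aggregated_data.h5ad."""
    obs = pd.DataFrame(index=features.index)     # perturbations, by perturbation_id
    var = pd.DataFrame(index=features.columns)   # standardized feature set
    X = features.to_numpy(dtype=float)           # mean feature values per perturbation
    obsm = {"X_phate": phate}                    # PHATE embedding coordinates
    uns = {"schema_version": "1.0", "default_embedding": "X_phate"}  # placeholders
    return obs, var, X, obsm, uns
```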

* Add format_cluster_anndata rule for cluster-level h5ad

Combines perturbation-level features with PHATE embedding and cluster
assignments into cluster.h5ad. Includes all available metadata and
features with parsed var annotations (type, compartment, channel).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Extract format_cluster_anndata logic into lib function

Move core AnnData construction into workflow/lib/cluster/, keep script
as thin caller. Follows brieflow lib/scripts pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update classifier feature names: int → integrated

Renamed features in the test classifier dill to match the CP emulator
rename. Reverted compatibility shim in train.py — fix at source instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Retrain dummy classifier with integrated feature names

XGBoost's feature_names_in_ is read-only — can't patch the dill.
Retrained a simple dummy classifier on random data with the correct
feature names (int → integrated). Same class structure (Interphase/Mitotic).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add dummy_classifier.dill with integrated feature names

Properly trained dummy classifier with:
- Feature names using _integrated (not _int)
- Labels 1=Mitotic, 2=Interphase (matching original config mapping)
- LabelEncoder for XGBoost 0-indexed compatibility
- Config updated to use dummy_classifier.dill

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Remove old classifier with outdated feature names

Replaced by dummy_classifier.dill which uses _integrated feature names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Wire bootstrap p-values into cluster h5ad

Add _add_bootstrap_layers() that reshapes per-feature p-values and
FDR from the combined gene bootstrap TSV into AnnData layers:
- layers["p_values"]: per-feature p-values per perturbation
- layers["neg_log10_fdr"]: -log10(FDR) per feature per perturbation

Bootstrap results wired as input to format_cluster_anndata rule.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
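The reshape step in `_add_bootstrap_layers()` can be sketched as collecting suffixed columns from the wide bootstrap table into one perturbation-by-feature layer. The function below is an illustrative assumption about that reshape; column naming follows the `{feature}_neg_log10_fdr` convention from the commits:

```python
import pandas as pd

def bootstrap_to_layer(bootstrap: pd.DataFrame, suffix: str) -> pd.DataFrame:
    """Collect {feature}{suffix} columns into a perturbation x feature frame."""
    cols = [c for c in bootstrap.columns if c.endswith(suffix)]
    layer = bootstrap[cols].copy()
    layer.columns = [c[: -len(suffix)] for c in cols]  # strip the suffix
    return layer
```

One caveat worth a comment in real code: pick the longest suffixes first, since `_fdr` also matches columns ending in `_neg_log10_fdr`.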

* Add percentile_rank layer to cluster h5ad

Per-feature percentile rank (0-100) across all perturbations.
Useful for human-readable interpretation of feature values.
Dropped during cellxstate export but retained in pipeline h5ad.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
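The percentile-rank layer can be computed per feature with a one-liner; this sketch (function name assumed) uses pandas' built-in percentage ranking:

```python
import pandas as pd

def percentile_rank_layer(X: pd.DataFrame) -> pd.DataFrame:
    """Per-feature percentile rank (0-100) across all perturbations."""
    # rank(pct=True) gives ranks in (0, 1]; scale to a 0-100 range.
    return X.rank(axis=0, pct=True) * 100.0
```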

* Add neg_log10_fdr to bootstrap, dump all layers into cluster h5ad

Bootstrap now computes and outputs:
- {feature}_neg_log10_pval: -log10(p-value)
- {feature}_fdr: FDR-corrected p-value
- {feature}_neg_log10_fdr: -log10(FDR)

Cluster h5ad reads all four bootstrap columns directly as layers:
p_values, fdr, neg_log10_pval, neg_log10_fdr. No computation at
the cluster step — bootstrap does all the work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Combine all leiden resolutions into single cluster h5ad

format_cluster_anndata now accepts a dict of clusterings (one per
resolution) and merges cluster assignments as separate obs columns:
cluster_group_2, cluster_group_5, etc. Output is one h5ad per
cell_class/channel_combo at cluster/{combo}/{class}/h5ad/cluster.h5ad.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix bootstrap column reorder to match renamed columns

The ordered_cols list in apply_multiple_hypothesis_correction still
referenced _log10 after we renamed to _neg_log10_pval and added
_neg_log10_fdr. Updated to match the actual column names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Clean obs: drop PHATE duplicates and merge-suffix columns

PHATE_0/1 belong in obsm not obs. cell_count_cluster is a merge
artifact. cluster column replaced by per-resolution cluster_group_N.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add col to cell_data_metadata_cols.tsv

The col column (well column index from split_well_to_cols) was missing
from the metadata cols list, causing it to leak into feature columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add row and col to DEFAULT_METADATA_COLS

These columns are added by split_well_to_cols in zarr mode but were
missing from the default metadata list, causing them to leak into
feature columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* merge ege's work, final improvements

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
