From 6bbaa9701731a008c371be57fd448192603b631d Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Mon, 1 Jun 2026 14:58:23 -0700 Subject: [PATCH 01/25] config, write, validate, read planning prompts --- planning/ARCHITECTURE.md | 199 ++++++++++++++++++++++ planning/README.md | 32 ++++ planning/TODO.md | 60 +++++++ planning/prompts/00_shared_context.md | 44 +++++ planning/prompts/01_config.md | 35 ++++ planning/prompts/02_write_spec.md | 50 ++++++ planning/prompts/03_validation.md | 37 ++++ planning/prompts/04_writers.md | 50 ++++++ planning/prompts/05_readers.md | 42 +++++ planning/prompts/06_notebook_migration.md | 40 +++++ planning/prompts/07_tests.md | 26 +++ 11 files changed, 615 insertions(+) create mode 100644 planning/ARCHITECTURE.md create mode 100644 planning/README.md create mode 100644 planning/TODO.md create mode 100644 planning/prompts/00_shared_context.md create mode 100644 planning/prompts/01_config.md create mode 100644 planning/prompts/02_write_spec.md create mode 100644 planning/prompts/03_validation.md create mode 100644 planning/prompts/04_writers.md create mode 100644 planning/prompts/05_readers.md create mode 100644 planning/prompts/06_notebook_migration.md create mode 100644 planning/prompts/07_tests.md diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md new file mode 100644 index 0000000..4f9a5b5 --- /dev/null +++ b/planning/ARCHITECTURE.md @@ -0,0 +1,199 @@ +# IO Layer Architecture — write / read / validation + +Status: design agreed 2026-06-01. Implementation to be done by follow-up agents. +This document is the source of truth for the design. The runnable agent prompts +live in `planning/prompts/`. The task breakdown lives in `planning/TODO.md`. + +## Hard constraints (read before any work) + +1. **Do not edit `src/connects_common_connectivity/models.py`.** It is auto-generated + from the LinkML YAMLs in `schemas/`. Any change to the data model happens in the + YAMLs and is regenerated — never hand-edited. +2. 
**Do not change the LinkML schemas without explicit permission from YY.** If safe + writing turns out to need a new slot (e.g. a clearer project/dataset scoping key), + stop and ask first. Propose the change in writing; do not edit `schemas/*.yaml` + pre-emptively. +3. **Single source of truth = the LinkML schema.** The registry and the derived + validators read from the generated models; they never restate field definitions. +4. New IO code lives under `src/connects_common_connectivity/io/`. Plotting stays in + `code/utils.py`. Notebooks are migrated to call the new API, not to embed logic. + +## What exists today (do not rebuild) + +- `models.py` — LinkML-generated pydantic v2 models. Key classes: + `DataSet`, `DataItem`, `DataItemDataSetAssociation`, `Cluster`, `ClusterHierarchy`, + `ClusterMembership`, `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, + `CellFeatureMeasurement`, `MappingSet`, `CellToCellMapping`, `CellToClusterMapping`, + `ClusterToClusterMapping`, `ProjectionMeasurementMatrix`, `BrainRegionAssociation`, + `ZarrDataset`, `ParquetDataset`. A `ProjectScoped` mixin supplies `project_id`. +- `arrow_utils.py` — `build_arrow_schema(model_cls)`, `models_to_table(models, schema)`, + `attach_linkml_metadata(table, linkml_class=...)`, `build_cell_feature_matrix_schema(...)`. + These already convert pydantic models → Arrow tables with LinkML metadata. **Reuse.** +- `write_utils.py` — `append_new_dataitems(path, table, project_id=...)` (id-deduped + append) and `walk_ancestors(leaf_id, parent_of)` (hierarchy denormalization). **Reuse; + the new writers wrap/generalize these rather than replacing them.** +- `parquet_loader.py` — `load_parquet_to_models(...)` (Parquet → models with a report). +- `cli.py` — LinkML `SchemaView`-based full validation (the `ccc` command). Kept as the + occasional heavyweight conformance check, **not** on the hot write path. 
+- `io/io_plans.md` — pre-existing analysis-util plans (`populate_region_coverage`, + `compare_region_coverage`). Keep; fold into the read/analysis module. + +## The bug this design fixes + +In every `_01_dataset_dataitem` notebook the DataSet is written with: + +```python +write_deltalake(root+"dataset/", table_ds, mode="overwrite", + predicate=f"project_id = '{PROJECT_ID}'", partition_by=["project_id"]) +``` + +`visp_exc_patchseq` and `visp_inh_patchseq` **share** `project_id = 'visp_patchseq'` but +have different `dataset_id`. So writing the inhibitory dataset overwrites the excitatory +dataset's row (and vice versa). The association write already does the right thing +(`predicate = "project_id = '...' AND dataset_id = '...'"`). The fix is structural: the +correct scope columns must come from a **per-class registry**, not be retyped by hand in +each notebook. `DataSet`'s scope is `(project_id, id)`; the association's scope is +`(project_id, dataset_id)`; `DataItem` is append-by-id; etc. + +## Design overview + +One registry entry per class is the hub. It drives four things so they can never drift +apart: partitioning, the overwrite predicate (scope columns), which slots are required +for a safe write, and the auto-derived strict validator. + +``` + ┌─────────────────────────────┐ +LinkML schema ──▶│ models.py (generated) │ + └─────────────────────────────┘ + │ read-only + ▼ + ┌───────────────────────────────────────────────┐ + │ write_spec registry (one entry per class) │ + │ partition_by · scope_columns · write_mode · │ + │ required_for_write · cross_field_rules │ + └───────────────────────────────────────────────┘ + │ │ │ + ▼ ▼ ▼ + validation write module read module + (strict submodel (write_dataset, (predicate-based + + derived per write_dataitem, flexible cross-dataset + class) write_features...) 
reads) + │ + ▼ + Settings (global output_root) +``` + +## Module 1 — `config.py` (global output path) + +Decision: **plain pydantic `BaseModel`**, version-controlled default in code, optional +env override. No new dependency (no pydantic-settings). + +```python +class Settings(BaseModel): + output_root: Path = Path("../scratch/em_patchseq_wnm_v1/") + # add knobs here later (dry_run, schema_version_pin, ...) as needed + + @classmethod + def load(cls) -> "Settings": + default = cls.model_fields["output_root"].default + return cls(output_root=os.environ.get("CCC_OUTPUT_ROOT", default)) +``` + +Rationale: the default is readable in git without running anything and adds no +dependency; the `CCC_OUTPUT_ROOT` env override is the escape hatch for CodeOcean, where +the write location differs from local. Notebooks replace the hardcoded `OUTPUT_ROOT` +string with `settings = Settings.load()` and print the resolved value at the top. +A `table_path(settings, "dataset")` helper resolves per-table subdirectories so notebooks +never concatenate path strings. + +## Module 2 — `write_spec.py` (the registry) + +An explicit, hand-maintained lookup, one entry per writable class, seeded from the schema +and refined from early experience. It is the source of truth for write/validation +behavior. A test cross-checks it against the LinkML schema so drift fails loudly (the +class names and `project_id`/identifier slots must exist in the generated models). + +Each entry declares: + +- `subdir` — Delta table subdirectory under `output_root` (e.g. `"dataset"`). +- `partition_by` — Delta partition columns (e.g. `["project_id"]`). +- `scope_columns` — columns that define the overwrite predicate (the identity within the + shared table). DataSet → `["project_id", "id"]`; DataItemDataSetAssociation → + `["project_id", "dataset_id"]`. +- `write_mode` — `"overwrite_scoped"` (scoped idempotent overwrite) or + `"append_new_by_id"` (the `append_new_dataitems` behavior for DataItem). 
+- `required_for_write` — slots that must be present/non-null for a safe write (may be + stricter than the schema's own `required`). +- `cross_field_rules` — names of cross-field checks to attach to the strict validator. + +Predicate is built from `scope_columns` + the row values, e.g. +`"project_id = 'visp_patchseq' AND id = 'visp_exc_patchseq'"`. This is exactly the bug +fix: DataSet now carries `id` in its scope. + +## Module 3 — `validation.py` (auto-derived strict submodels) + +Decision: **auto-derived** strict submodels — single source of truth. + +`strict_model_for(cls)` takes the generated pydantic model and returns a subclass that +(a) flips each slot in the registry's `required_for_write` to required, and (b) attaches +the registry's `cross_field_rules` as pydantic `model_validator`s. No field definitions +are restated; everything is read from `models.py` + the registry. `models.py` is never +touched. Validation runs on the hot write path (fast, pydantic-only). The LinkML/`cli.py` +validator remains the separate, occasional full-conformance check. + +Example cross-field rule: an association's `dataset_id` must refer to a DataSet already +present for that `project_id` (referential safety before write). + +## Module 4 — `writers.py` (+ keep `write_utils.py`) + +A single dispatch core plus thin typed wrappers: + +- `write_models(models, *, settings=None)` — infers the class, looks up the registry, + validates each model via the strict submodel, converts via `arrow_utils`, attaches + LinkML metadata, then writes per `write_mode` (scoped overwrite with the + registry-built predicate, or `append_new_by_id` via the existing helper). +- Wrappers for ergonomics and discoverability: `write_dataset`, `write_dataitem`, + `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, + `write_cell_to_cluster_mapping`, `write_projection_matrix`, etc. Each is a one-liner + over `write_models`. 
+- `write_utils.py` stays: `append_new_dataitems` becomes the `append_new_by_id` backend; + `walk_ancestors` stays for membership/mapping denormalization. Generalize + `append_new_dataitems` only if needed (e.g. parametrize the partition column), without + breaking its current callers. + +Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` and a +matrix-specific writer path, since they are wide Parquet, not row-modeled Delta tables. + +## Module 5 — `readers.py` (+ fold in `io_plans.md`) + +Two layers: + +- Thin predicate-based readers mirroring the write spec: `read_dataset`, `read_dataitem`, + `read_features`, scoped by `project_id`/`dataset_id`, returning polars/pandas. +- Flexible cross-dataset / cross-schema reads now that datasets share tables. Flagship + example: "read all DataItems that have either a ClusterMembership or a + CellToClusterMapping to a given set of clusters" — a cross-table query joining + membership/mapping tables on cluster ids and returning the union of matching DataItems, + regardless of source dataset/modality. Users can still drop to raw + `polars.read_delta` for ad-hoc queries; the readers are conveniences, not a wall. +- Fold the `io_plans.md` analysis utilities (`populate_region_coverage`, + `compare_region_coverage`) into this module. + +## Notebook migration (no logic, no schema, no models.py changes) + +For each ETL notebook: replace hardcoded `OUTPUT_ROOT` with `Settings.load()`, replace +direct `write_deltalake(...)` calls with the typed writers, and delete the per-cell +`mode`/`predicate`/`partition_by` bookkeeping (now owned by the registry). Verification +cells stay. The `visp_*_patchseq` bug is fixed automatically once DataSet writes go +through the registry (scope = project_id + id). Confirm exc + inh DataSet rows coexist +after a re-run as the migration's acceptance test. + +## Testing + +- Registry-vs-schema drift test (class names + scope/identifier slots exist in models). 
+- Idempotency: writing the same models twice yields no duplicates and no row loss. +- Shared-partition safety: writing dataset B does not remove dataset A's rows when they + share a `project_id` (the patchseq regression test). +- Strict-validator tests: missing `required_for_write` slot or failing cross-field rule + raises before any write touches disk. +- Round-trip: write models → read back via readers → equality on scope columns. diff --git a/planning/README.md b/planning/README.md new file mode 100644 index 0000000..e51b3c9 --- /dev/null +++ b/planning/README.md @@ -0,0 +1,32 @@ +# planning/ — IO layer design & agent prompts + +This folder documents how we're building the user-friendly IO layer (write / read / +validation) for ConnectsCommonConnectivity, and holds ready-to-run prompts for the agents +that implement each piece. Created 2026-06-01. + +## Contents +- `ARCHITECTURE.md` — the design (source of truth). Registry-centric write/read/validation. +- `TODO.md` — ordered, dependency-aware task list. +- `prompts/` — one prompt per work item, to hand to implementing agents: + - `00_shared_context.md` — **prepend to every prompt below.** Hard rules + repo facts. + - `01_config.md` — global output path (`Settings`, no new dep). + - `02_write_spec.md` — the registry (source of truth) + drift test. + - `03_validation.md` — auto-derived strict submodels. + - `04_writers.md` — write dispatch + typed wrappers (fixes the patchseq bug). + - `05_readers.md` — predicate-based + cross-dataset reads. + - `06_notebook_migration.md` — migrate ETL notebooks to the new API. + - `07_tests.md` — safe-writing test suite. + +## Two hard rules (repeated everywhere on purpose) +1. **Never edit `src/connects_common_connectivity/models.py`** — auto-generated from LinkML. +2. **Never edit `schemas/*.yaml`** without explicit permission from YY. + +## Locked decisions +- Config: plain pydantic, version-controlled default + `CCC_OUTPUT_ROOT` env override. 
+- Write spec: explicit registry, schema-checked for drift. +- Validation: auto-derived strict submodels (single source of truth). + +## How to run an item +Hand the implementing agent: `00_shared_context.md` + the specific prompt, and point it at +`ARCHITECTURE.md`. Follow the order in `TODO.md` (config → registry → validation → +writers → readers → notebook migration → tests). diff --git a/planning/TODO.md b/planning/TODO.md new file mode 100644 index 0000000..5141f96 --- /dev/null +++ b/planning/TODO.md @@ -0,0 +1,60 @@ +# IO Layer — TODO + +Ordered, dependency-aware. See `ARCHITECTURE.md` for design and `prompts/` for the +agent prompt that implements each item. Hard rules: never edit `models.py`; never edit +`schemas/*.yaml` without explicit permission from YY. + +## Phase 0 — groundwork +- [ ] **0.1 Config module** (`io/config.py`) — plain pydantic `Settings` with + `output_root` default + `CCC_OUTPUT_ROOT` env override + `table_path()` helper. + No new dependency. Prompt: `prompts/01_config.md`. Blocks everything that writes. + +## Phase 1 — registry (the hub) +- [ ] **1.1 Write spec registry** (`io/write_spec.py`) — one entry per writable class: + `subdir`, `partition_by`, `scope_columns`, `write_mode`, `required_for_write`, + `cross_field_rules`. Seed DataSet/DataItem/Association first, then the rest. + Prompt: `prompts/02_write_spec.md`. Blocked by: none (reads generated models). +- [ ] **1.2 Registry↔schema drift test** — assert every entry's class + scope/identifier + slots exist in `models.py`. Part of `prompts/02_write_spec.md`. + +## Phase 2 — validation +- [ ] **2.1 Strict submodel derivation** (`io/validation.py`) — `strict_model_for(cls)` + flips `required_for_write` to required + attaches `cross_field_rules`. Auto-derived + from generated models + registry. Prompt: `prompts/03_validation.md`. Blocked by 1.1. 
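As a rough illustration of item 2.1, the derivation could look something like the sketch below. `DataSetLike` is a hypothetical stand-in for a generated class (the real models in `models.py` are not reproduced here), and passing `required_for_write` as an argument instead of looking it up in the registry is a simplification. Only the pydantic v2 calls (`create_model`, `model_fields`, `is_required`) are real API; everything else is an assumption made for the example.

```python
from typing import Optional, Union, get_args

from pydantic import BaseModel, ValidationError, create_model


class DataSetLike(BaseModel):
    """Hypothetical stand-in for a generated model; every slot is optional."""

    id: Optional[str] = None
    project_id: Optional[str] = None
    name: Optional[str] = None


def _without_none(annotation):
    """Drop NoneType from an Optional[...] annotation; leave others as-is."""
    all_args = get_args(annotation)
    args = [a for a in all_args if a is not type(None)]
    if not args or len(args) == len(all_args):
        return annotation  # not Optional, nothing to strip
    return args[0] if len(args) == 1 else Union[tuple(args)]


def strict_model_for(model_cls, required_for_write):
    """Return a subclass in which the listed slots are required and non-null."""
    overrides = {
        # (type, ...) is pydantic's way of saying "required, no default".
        slot: (_without_none(model_cls.model_fields[slot].annotation), ...)
        for slot in required_for_write
    }
    # __base__ makes the result a subclass; the generated class is not mutated.
    return create_model(f"Strict{model_cls.__name__}", __base__=model_cls, **overrides)


StrictDataSet = strict_model_for(DataSetLike, ["id", "project_id"])

try:
    StrictDataSet(project_id="visp_patchseq")  # id is missing
    rejected = False
except ValidationError:
    rejected = True
```

The design point this is meant to show is that the strict class is a new subclass, so the base model keeps accepting partially filled instances elsewhere in the codebase. Cross-field rules and caching are left out of the sketch.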
+ +## Phase 3 — writers +- [ ] **3.1 Write dispatch core** (`io/writers.py`) — `write_models(models, settings=...)`: + infer class → registry lookup → strict-validate → arrow convert → metadata → write per + `write_mode`. Reuses `arrow_utils` + `write_utils`. Prompt: `prompts/04_writers.md`. + Blocked by 0.1, 1.1, 2.1. +- [ ] **3.2 Typed wrappers** — `write_dataset`, `write_dataitem`, `write_association`, + `write_features`, `write_cluster`, `write_cluster_membership`, + `write_cell_to_cluster_mapping`, `write_projection_matrix`. Part of `prompts/04_writers.md`. +- [ ] **3.3 Reconcile `write_utils.py`** — make `append_new_dataitems` the + `append_new_by_id` backend without breaking current callers. Part of `prompts/04_writers.md`. + +## Phase 4 — readers +- [ ] **4.1 Predicate-based readers** (`io/readers.py`) — `read_dataset`, `read_dataitem`, + `read_features` scoped by project/dataset. Prompt: `prompts/05_readers.md`. Blocked by 1.1. +- [ ] **4.2 Cross-dataset reads** — flagship: DataItems with ClusterMembership OR + CellToClusterMapping to a given cluster set. Part of `prompts/05_readers.md`. +- [ ] **4.3 Fold in analysis utils** — `populate_region_coverage`, + `compare_region_coverage` from `io/io_plans.md`. Part of `prompts/05_readers.md`. + +## Phase 5 — notebook migration +- [ ] **5.1 Migrate `_01_dataset_dataitem` notebooks** — Settings + typed writers; fixes + the patchseq DataSet overwrite. Prompt: `prompts/06_notebook_migration.md`. Blocked by 3.x. +- [ ] **5.2 Migrate feature / cluster / mapping / projection notebooks.** Same prompt. +- [ ] **5.3 Patchseq regression check** — re-run exc then inh; assert both DataSet rows + coexist. Acceptance test for the migration. + +## Phase 6 — tests & docs +- [ ] **6.1 Test suite** — idempotency, shared-partition safety (patchseq regression), + strict-validator failures, round-trip. Prompt: `prompts/07_tests.md`. +- [ ] **6.2 Update README / usage docs** for the new IO API. (Ask before large edits.) 
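The shared-partition safety case in 6.1 can be reasoned about without Delta at all. The sketch below models a table as a plain list of dicts and a scoped overwrite as "drop the rows matching the scope, then append". `build_predicate` and `scoped_overwrite` are hypothetical helper names, not existing functions in this repo; the real writers would presumably pass the predicate string to `write_deltalake` instead of filtering in memory.

```python
def build_predicate(row, scope_columns):
    """Render a predicate string from the scope columns of one row.

    Assumes plain string values that contain no quote characters.
    """
    return " AND ".join(f"{col} = '{row[col]}'" for col in scope_columns)


def scoped_overwrite(table, new_rows, scope_columns):
    """Replace only the rows whose scope tuple appears in new_rows."""
    scopes = {tuple(r[c] for c in scope_columns) for r in new_rows}
    kept = [r for r in table if tuple(r[c] for c in scope_columns) not in scopes]
    return kept + list(new_rows)


exc = {"project_id": "visp_patchseq", "id": "visp_exc_patchseq"}
inh = {"project_id": "visp_patchseq", "id": "visp_inh_patchseq"}

# Scope = project_id only (what the notebooks do today): the second write
# removes the first dataset's row.
narrow = ["project_id"]
buggy = scoped_overwrite(scoped_overwrite([], [exc], narrow), [inh], narrow)

# Scope = (project_id, id), as in the registry's DataSet entry: both rows coexist.
scope = ["project_id", "id"]
fixed = scoped_overwrite(scoped_overwrite([], [exc], scope), [inh], scope)

# Re-writing the same row under the scoped overwrite is idempotent.
again = scoped_overwrite(fixed, [inh], scope)
```

A test written this way pins down the intended semantics before any Delta table is involved; the `tmp_path`-based tests would then check that the real writer behaves the same.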
+ +## Decisions locked (2026-06-01) +- Config: plain pydantic, version-controlled default + env override, no new dep. +- Write spec: explicit registry, source of truth, schema-checked for drift. +- Validation: auto-derived strict submodels from generated models + registry. +- Scope: this session produced planning docs + prompts only. diff --git a/planning/prompts/00_shared_context.md b/planning/prompts/00_shared_context.md new file mode 100644 index 0000000..22fbd51 --- /dev/null +++ b/planning/prompts/00_shared_context.md @@ -0,0 +1,44 @@ +# Shared context — prepend to every IO-layer agent prompt + +You are working in the `ConnectsCommonConnectivity` repo: a LinkML+pydantic data schema +holding multi-scale connectomics data (EM cell-to-cell, morphology cell-to-area, viral +area-to-area, patch-seq multimodal) in one format, plus taxonomies/clusters. + +## Non-negotiable rules +1. **Never edit `src/connects_common_connectivity/models.py`** — it is auto-generated + from `schemas/*.yaml`. Treat it as read-only. +2. **Never edit `schemas/*.yaml`** without explicit written permission from the maintainer + (YY). If your task seems to require a new slot for safe writing, STOP and report what + you need and why; do not change the schema. +3. **Single source of truth = the LinkML schema / generated models.** Read field + definitions from `models.py`; do not restate them. +4. New IO code goes under `src/connects_common_connectivity/io/`. Do not move plotting + code out of `code/utils.py`. +5. Read `planning/ARCHITECTURE.md` fully before starting. It governs the design. + +## What already exists — reuse, don't rebuild +- `models.py`: generated pydantic v2 classes incl. 
`DataSet`, `DataItem`, + `DataItemDataSetAssociation`, `Cluster`, `ClusterHierarchy`, `ClusterMembership`, + `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, `CellFeatureMeasurement`, + `MappingSet`, `CellToCellMapping`, `CellToClusterMapping`, `ClusterToClusterMapping`, + `ProjectionMeasurementMatrix`. `ProjectScoped` mixin → `project_id`. +- `arrow_utils.py`: `build_arrow_schema`, `models_to_table`, `attach_linkml_metadata`, + `build_cell_feature_matrix_schema`. +- `write_utils.py`: `append_new_dataitems(path, table, *, project_id, id_column="id")`, + `walk_ancestors(leaf_id, parent_of)`. +- `parquet_loader.py`: `load_parquet_to_models(...)`. `cli.py`: LinkML full validation. +- `io/io_plans.md`: analysis-util specs to fold into readers. + +## Conventions +- Python 3.10+, pydantic v2, polars + pyarrow + deltalake (already deps). +- Match existing style (ruff, line-length 100). Add docstrings like the existing modules. +- Add `pytest` tests under `tests/` for anything you implement. +- After implementing, run the relevant tests and report results. Do not mark work done + with failing tests or partial implementation. + +## Reference: the bug to keep in mind +`visp_exc_patchseq` and `visp_inh_patchseq` share `project_id='visp_patchseq'` with +different `dataset_id`. The current DataSet write uses predicate +`project_id = ''`, so writing one wipes the other. The registry fixes this by +making DataSet's scope `(project_id, id)`. Any writer you build must derive its predicate +from the registry, never from a hardcoded string. diff --git a/planning/prompts/01_config.md b/planning/prompts/01_config.md new file mode 100644 index 0000000..aee8659 --- /dev/null +++ b/planning/prompts/01_config.md @@ -0,0 +1,35 @@ +# Agent prompt — Config module (global output path) + +> Prepend `00_shared_context.md`. 
+ +## Goal +Create `src/connects_common_connectivity/io/config.py` providing a single, version- +controlled, human-readable global output path, with an optional env override. **No new +dependency** — plain pydantic `BaseModel` only (NOT pydantic-settings). + +## Requirements +1. A `Settings(BaseModel)` class with: + - `output_root: Path` — default `Path("../scratch/em_patchseq_wnm_v1/")` (current value + used across notebooks; confirm by grepping `OUTPUT_ROOT` in `code/*.ipynb`). + - A `load()` classmethod that returns `Settings`, using + `os.environ.get("CCC_OUTPUT_ROOT", )` so CodeOcean can override the path via + env without editing tracked code. + - Designed so more knobs (e.g. `dry_run`, `schema_version_pin`) can be added later. +2. A helper `table_path(settings: Settings, table: str) -> Path` that joins + `output_root / table` (e.g. `"dataset"`, `"dataitem"`, + `"dataitem_dataset_association"`) so notebooks never concatenate path strings. Use the + exact subdir names currently in the notebooks. +3. A `describe()` / `__repr__` that prints the resolved config so notebooks can show it at + the top instead of relying on hidden state. + +## Tests (`tests/test_config.py`) +- Default `output_root` is the expected path when env var unset. +- `CCC_OUTPUT_ROOT` env var overrides the default. +- `table_path` joins correctly and returns a `Path`. + +## Do not +- Add pydantic-settings or any new dependency. +- Touch `models.py` or schemas. + +## Report +List the subdir names you found in the notebooks and confirm the default matches. diff --git a/planning/prompts/02_write_spec.md b/planning/prompts/02_write_spec.md new file mode 100644 index 0000000..31ae603 --- /dev/null +++ b/planning/prompts/02_write_spec.md @@ -0,0 +1,50 @@ +# Agent prompt — Write spec registry + +> Prepend `00_shared_context.md`. Depends on nothing (reads generated models). 
+ +## Goal +Create `src/connects_common_connectivity/io/write_spec.py`: an explicit registry, one +entry per writable class, that is the single source of truth for how each class is +written and validated. Plus a test that the registry cannot drift from the schema. + +## Registry shape +Define a dataclass/pydantic model `WriteSpec` with fields: +- `model_cls` — the generated pydantic class (import from `..models`). +- `subdir: str` — Delta subdir under `output_root` (must match the notebook paths). +- `partition_by: list[str]` — Delta partition columns. +- `scope_columns: list[str]` — columns defining the overwrite predicate (identity within + the shared table). +- `write_mode: Literal["overwrite_scoped", "append_new_by_id"]`. +- `required_for_write: list[str]` — slots that must be non-null to write safely (may be + stricter than the schema's `required`). +- `cross_field_rules: list[str]` — names of cross-field checks (implemented in + `validation.py`); empty for now is fine. + +Expose `REGISTRY: dict[str, WriteSpec]` keyed by class name, and a +`get_spec(model_or_cls) -> WriteSpec` lookup. + +## Seed these first (correctness-critical) +- `DataSet`: subdir `"dataset"`, partition `["project_id"]`, + **scope `["project_id", "id"]`** (THIS is the patchseq bug fix), mode + `overwrite_scoped`. +- `DataItem`: subdir `"dataitem"`, partition `["project_id"]`, mode `append_new_by_id`, + id column `"id"`. +- `DataItemDataSetAssociation`: subdir `"dataitem_dataset_association"`, partition + `["project_id"]`, scope `["project_id", "dataset_id"]`, mode `overwrite_scoped`. + +Then add entries for `Cluster`, `ClusterHierarchy`, `ClusterMembership`, +`CellFeatureSet`, `CellFeatureDefinition`, `CellToClusterMapping`, `MappingSet`, +`ProjectionMeasurementMatrix`, etc. — derive `subdir`/`scope_columns` by reading how each +is written in `code/etl_*.ipynb` (grep `write_deltalake` and `predicate=`). 
Where a +notebook's predicate looks wrong (like the DataSet case), prefer the correct scope and +note it in a comment. `CellFeatureMatrix` is wide Parquet, not row Delta — mark it so the +writer routes it to the matrix path (`build_cell_feature_matrix_schema`). + +## Drift test (`tests/test_write_spec.py`) +- Every `REGISTRY` key resolves to a real class in `models.py`. +- Every column in `scope_columns` + `partition_by` + `required_for_write` corresponds to + a field on that model (check `model_fields`). Fail loudly otherwise. + +## Report +A table of each class → subdir / partition_by / scope_columns / write_mode, and call out +any notebook predicate you believe is wrong (do not fix notebooks here). diff --git a/planning/prompts/03_validation.md b/planning/prompts/03_validation.md new file mode 100644 index 0000000..c3ff1c9 --- /dev/null +++ b/planning/prompts/03_validation.md @@ -0,0 +1,37 @@ +# Agent prompt — Validation (auto-derived strict submodels) + +> Prepend `00_shared_context.md`. Depends on `write_spec.py`. + +## Goal +Create `src/connects_common_connectivity/io/validation.py` that derives a STRICT pydantic +submodel per class **at runtime** from (a) the generated model in `models.py` and (b) the +registry's `required_for_write` + `cross_field_rules`. Single source of truth: nothing +is restated from the schema. + +## Requirements +1. `strict_model_for(model_cls) -> type[BaseModel]`: + - Subclass the generated model. + - For each slot in the registry's `required_for_write`, make it required (no default / + not Optional). Use pydantic v2 mechanisms (`model_fields` overrides via + `create_model` or field re-annotation) — do NOT edit the generated class in place. + - Attach each named `cross_field_rule` as a `@model_validator(mode="after")`. + - Cache the derived class (e.g. `functools.lru_cache`) so it's built once. +2. 
`validate_for_write(model) -> model` (or list): run the instance through the strict + submodel, raising a clear error that names the class, the failing slot/rule, and the + offending value. This runs on the hot write path, so keep it pydantic-only (fast); do + NOT call the LinkML/`cli.py` validator here. +3. Implement a starter cross-field rule registry (a dict name → callable) including: + - `association_dataset_exists`: a `DataItemDataSetAssociation`'s `dataset_id` must + exist among written DataSets for that `project_id`. (May need a reader/lookup; if the + reader module isn't ready, implement the hook and mark it TODO without breaking the + import.) + Add others only as the registry references them. + +## Tests (`tests/test_validation.py`) +- A model missing a `required_for_write` slot fails `validate_for_write` before any IO. +- A valid model passes and is returned unchanged (round-trip equality on fields). +- The generated `models.py` class is unchanged after deriving the strict model + (no in-place mutation). + +## Do not +- Edit `models.py`. Restate schema field definitions. Put LinkML validation on the write path. diff --git a/planning/prompts/04_writers.md b/planning/prompts/04_writers.md new file mode 100644 index 0000000..e8a6072 --- /dev/null +++ b/planning/prompts/04_writers.md @@ -0,0 +1,50 @@ +# Agent prompt — Writers (dispatch core + typed wrappers) + +> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`, `validation.py`. + +## Goal +Create `src/connects_common_connectivity/io/writers.py`: a single write dispatch that uses +the registry so notebooks never hand-write `mode` / `predicate` / `partition_by` again. + +## Core +`write_models(models, *, settings=None) -> WriteResult`: +1. Accept a single model or an iterable; infer the class; require homogeneous type. +2. `settings = settings or Settings.load()`. +3. Look up the `WriteSpec` via `get_spec`. +4. 
Validate every model with `validate_for_write` (strict submodel) BEFORE any IO. +5. Convert via `arrow_utils.models_to_table` + `build_arrow_schema`; attach metadata with + `attach_linkml_metadata(linkml_class=)`. +6. Resolve path with `table_path(settings, spec.subdir)`. +7. Dispatch on `spec.write_mode`: + - `overwrite_scoped`: build the predicate from `spec.scope_columns` and the row values + (e.g. `project_id = '...' AND id = '...'`), then + `write_deltalake(path, table, mode="overwrite", predicate=..., partition_by=spec.partition_by)`. + If a batch contains multiple distinct scope tuples, write per scope group (one + predicate each) — never widen a predicate to cover rows it shouldn't. + - `append_new_by_id`: delegate to `write_utils.append_new_dataitems` (the backend), + passing `project_id` and id column. +8. Return a small result object: rows written/appended, path, mode, predicate used. + +## Typed wrappers (one-liners over `write_models`) +`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, +`write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`. +Signatures should be ergonomic (accept the model(s) and optional `settings`). + +## Wide feature matrices +`CellFeatureMatrix` is wide Parquet. Route it through a matrix-specific path using +`build_cell_feature_matrix_schema`; do not force it into the row-Delta path. + +## Reconcile `write_utils.py` +Make `append_new_dataitems` the `append_new_by_id` backend. If you must generalize it +(e.g. parametrize the partition column), keep the existing signature working — its current +notebook callers must not break. + +## Tests (`tests/test_writers.py`) +- Scoped overwrite writes only matching rows; a second dataset sharing `project_id` is + preserved (patchseq regression: write DataSet A, write DataSet B, both rows exist). +- Re-writing identical models is idempotent (no dupes, no loss). +- `append_new_by_id` appends only new ids. 
+- Predicate is built from `scope_columns`, verified by string/inspection. + +## Do not +- Hardcode any predicate. Touch `models.py` or schemas. diff --git a/planning/prompts/05_readers.md b/planning/prompts/05_readers.md new file mode 100644 index 0000000..88d05e8 --- /dev/null +++ b/planning/prompts/05_readers.md @@ -0,0 +1,42 @@ +# Agent prompt — Readers (predicate-based + cross-dataset) + +> Prepend `00_shared_context.md`. Depends on `write_spec.py` (+ `config.py`). + +## Goal +Create `src/connects_common_connectivity/io/readers.py`: convenient reads over the shared +Delta tables, scoped by the registry, plus flexible cross-dataset/cross-schema queries. +Readers are conveniences — users can always drop to raw `polars.read_delta`. + +## Layer 1 — predicate-based readers +- `read_dataset(*, project_id=None, dataset_id=None, settings=None)`, + `read_dataitem(...)`, `read_features(...)` etc. +- Resolve the path via the registry `subdir` + `table_path`; filter by the given scope + columns; return a polars DataFrame (offer `.to_pandas()` convenience). +- Reuse `parquet_loader.load_parquet_to_models` where returning typed models is wanted. + +## Layer 2 — cross-dataset / cross-schema reads +Flagship function (build this and design it to generalize): +`read_dataitems_for_clusters(cluster_ids, *, via=("membership","mapping"), project_id=None, +settings=None) -> DataFrame`: +- Returns the union of DataItems that have EITHER a `ClusterMembership` OR a + `CellToClusterMapping` to any cluster in `cluster_ids`. +- Join the membership and mapping Delta tables on cluster id; collect distinct DataItem + ids; optionally hydrate with DataItem rows. Cross-dataset and cross-modality by design — + do not assume a single source dataset. +- Use `walk_ancestors` semantics so a query for a parent cluster also matches descendants + if the membership/mapping tables are denormalized that way (check how the `_03`/cluster + notebooks write the hierarchy before assuming). 
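The union semantics described for the flagship read can be sketched on plain rows before committing to a polars implementation. This is a minimal illustration under stated assumptions: the column names `dataitem_id` and `cluster_id` are hypothetical (check the generated models for the real slot names), the rows are passed in directly, not read from Delta, and hierarchy expansion via `walk_ancestors` is left out.

```python
def read_dataitems_for_clusters(cluster_ids, membership_rows, mapping_rows,
                                via=("membership", "mapping")):
    """Return the sorted, distinct DataItem ids linked to any of cluster_ids.

    A DataItem qualifies if it appears in EITHER source named in `via`.
    """
    wanted = set(cluster_ids)
    sources = {"membership": membership_rows, "mapping": mapping_rows}
    hits = set()
    for name in via:
        for row in sources[name]:
            if row["cluster_id"] in wanted:
                hits.add(row["dataitem_id"])
    return sorted(hits)


# Synthetic fixture: ids and cluster labels are made up for the example.
membership = [
    {"dataitem_id": "em_cell_1", "cluster_id": "T1"},
    {"dataitem_id": "em_cell_2", "cluster_id": "T2"},
]
mapping = [
    {"dataitem_id": "patchseq_cell_9", "cluster_id": "T1"},
    {"dataitem_id": "em_cell_1", "cluster_id": "T1"},  # also a member: counted once
]

both = read_dataitems_for_clusters(["T1"], membership, mapping)
members_only = read_dataitems_for_clusters(["T1"], membership, mapping, via=("membership",))
```

Collecting ids into a set before hydrating DataItem rows keeps the result distinct even when the same cell is reachable through both tables, which is the cross-dataset case the tests need to cover.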
+ +## Layer 3 — fold in analysis utils +Port `populate_region_coverage(pmm, matrix)` and `compare_region_coverage(pmms)` from +`io/io_plans.md` into this module (or a sibling `analysis.py` if cleaner). Keep their +documented signatures and pure-function behavior. + +## Tests (`tests/test_readers.py`) +- Round-trip: write models via the writers, read them back scoped, assert equality on + scope columns. +- `read_dataitems_for_clusters` returns the correct union for a small synthetic + membership + mapping fixture, including cross-dataset cases. + +## Do not +- Touch `models.py` or schemas. Lock users out of raw polars (readers are additive). diff --git a/planning/prompts/06_notebook_migration.md b/planning/prompts/06_notebook_migration.md new file mode 100644 index 0000000..86955d5 --- /dev/null +++ b/planning/prompts/06_notebook_migration.md @@ -0,0 +1,40 @@ +# Agent prompt — Notebook migration + +> Prepend `00_shared_context.md`. Depends on writers (and readers for verification cells). + +## Goal +Migrate the ETL notebooks in `code/etl_*.ipynb` to use the new IO API. Move bookkeeping +into the library; keep the science logic and verification cells. + +## Per notebook +1. Replace the hardcoded `OUTPUT_ROOT = "../scratch/..."` with: + ```python + from connects_common_connectivity.io.config import Settings + settings = Settings.load() + print(settings) # show resolved output_root at top + ``` +2. Replace each direct `write_deltalake(... mode=... predicate=... partition_by=...)` call + with the matching typed writer (`write_dataset`, `write_dataitem`, `write_association`, + `write_features`, `write_cluster`, `write_cell_to_cluster_mapping`, + `write_projection_matrix`, ...). Delete the now-redundant `mode`/`predicate`/ + `partition_by` arguments and their explanatory comments — that logic now lives in the + registry. +3. Keep verification cells; update their paths to use `table_path(settings, ...)`. + +## Migrate in this order +1. 
`etl_*_01_dataset_dataitem.ipynb` (all of minnie, wnm, visp_exc/inh patchseq) — these + carry the DataSet overwrite bug. +2. feature notebooks (`_02_cell_features`). +3. cluster / membership / mapping notebooks (`_03`, cluster files). +4. projection (`etl_wnm_exc_04_projection_matrix.ipynb`). + +## Patchseq regression acceptance test (do this explicitly) +Run `etl_visp_exc_patchseq_01` then `etl_visp_inh_patchseq_01` (in that order), then read +the `dataset` table and assert BOTH `visp_exc_patchseq` and `visp_inh_patchseq` rows +exist under `project_id='visp_patchseq'`. Before the fix, the second run wiped the first. +Report the before/after row counts. + +## Do not +- Change the science/ETL transformation logic. Fix the `etl_visp_inh_patchseq` data logic + beyond the write path — the maintainer said the writer fix is enough for now. +- Touch `models.py` or schemas. diff --git a/planning/prompts/07_tests.md b/planning/prompts/07_tests.md new file mode 100644 index 0000000..73489b6 --- /dev/null +++ b/planning/prompts/07_tests.md @@ -0,0 +1,26 @@ +# Agent prompt — Test suite + +> Prepend `00_shared_context.md`. Run after writers/readers exist (can be built alongside). + +## Goal +A focused pytest suite under `tests/` covering the safe-writing guarantees. Use small +synthetic models written to a `tmp_path` Delta root (set `CCC_OUTPUT_ROOT` to `tmp_path`) +so tests never touch real data. + +## Required cases +1. **Shared-partition safety (patchseq regression):** write `DataSet(id="A")` and + `DataSet(id="B")` both with `project_id="P"`; assert both rows survive. This is the + core regression for the bug. +2. **Idempotency:** writing the same models twice → no duplicates, no row loss, for both + `overwrite_scoped` and `append_new_by_id`. +3. **Append-new-by-id:** second write with one new + one existing id appends exactly one. +4. 
**Strict validation:** a model missing a `required_for_write` slot, or violating a + cross-field rule, raises before any file is written (assert the Delta dir is unchanged). +5. **Registry↔schema drift:** (from `02_write_spec.md`) every registry entry's class and + columns exist in `models.py`. +6. **Round-trip:** write → read back via readers → equality on scope columns. +7. **Predicate construction:** the predicate is derived from `scope_columns` (assert the + DataSet predicate includes both `project_id` and `id`). + +## Reporting +Run `pytest -q` and paste the summary. Do not mark complete with failures. From d9a9e32a1a85db6f5568f7af1df2a52b3604d6bc Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Mon, 1 Jun 2026 15:22:30 -0700 Subject: [PATCH 02/25] edited plan --- planning/ARCHITECTURE.md | 80 ++++++++++++++++++++++----- planning/README.md | 6 +- planning/TODO.md | 15 ++++- planning/prompts/00_shared_context.md | 13 ++++- planning/prompts/04_writers.md | 17 +++++- planning/prompts/05_readers.md | 12 ++-- planning/prompts/08_analysis.md | 25 +++++++++ 7 files changed, 141 insertions(+), 27 deletions(-) create mode 100644 planning/prompts/08_analysis.md diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md index 4f9a5b5..c847896 100644 --- a/planning/ARCHITECTURE.md +++ b/planning/ARCHITECTURE.md @@ -35,8 +35,49 @@ live in `planning/prompts/`. The task breakdown lives in `planning/TODO.md`. - `parquet_loader.py` — `load_parquet_to_models(...)` (Parquet → models with a report). - `cli.py` — LinkML `SchemaView`-based full validation (the `ccc` command). Kept as the occasional heavyweight conformance check, **not** on the hot write path. -- `io/io_plans.md` — pre-existing analysis-util plans (`populate_region_coverage`, - `compare_region_coverage`). Keep; fold into the read/analysis module. 
+- `io/io_plans.md` — two pre-existing ideas that are **different concerns** and must land in + different modules (see below): + - `populate_region_coverage(pmm, matrix)` — derives `region_coverage` from the dense + values **before** a matrix is written → a **write-side transform**. + - `compare_region_coverage(pmms)` — summarizes overlap across already-written matrices → + **read/analysis**. + +## Target `io/` structure (clean is the goal) + +The existing IO files are scattered at the package root. The target is a single tidy `io/` +package; the existing modules are **relocated into it and become backends** the new files +call. "Do not rebuild" means *move and wrap, never reimplement*. + +``` +src/connects_common_connectivity/ + models.py # generated, UNTOUCHED, stays at root + cli.py # CLI entry point, stays at root; calls io.validation full check + io/ + config.py # NEW Settings (global output_root) + write_spec.py # NEW registry — source of truth + validation.py # NEW auto-derived strict submodels + arrow.py # MOVED from arrow_utils.py (models <-> Arrow conversion) + writers.py # NEW write_models() + typed wrappers + write_utils.py # MOVED from root (append-by-id backend, walk_ancestors) + transforms.py # NEW write-side enrichment incl. populate_region_coverage + readers.py # MOVED + folds parquet_loader.py + predicate/cross-dataset reads + analysis.py # NEW compare_region_coverage + future cross-dataset analysis +``` + +Where each existing file goes: +- `arrow_utils.py` → `io/arrow.py`. Conversion layer used by `writers.py`. Pure move. +- `write_utils.py` → `io/write_utils.py`. `append_new_dataitems` becomes the + `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers and by + cross-dataset reads. Pure move. +- `parquet_loader.py` → folded into `io/readers.py` (Parquet→models with report becomes the + typed-read backend). Pure move/merge. 
+- `cli.py` stays at the package root as the `ccc` entry point; it calls into + `io/validation.py` for the occasional full LinkML conformance check. +- `models.py` stays at root, generated, never edited. + +Migration safety: while notebooks are being migrated, the moved modules may keep one-line +re-export shims at their old import paths (e.g. `from .io.arrow import *`) so nothing breaks +mid-transition; delete the shims once `06_notebook_migration` is complete. ## The bug this design fixes @@ -144,40 +185,51 @@ validator remains the separate, occasional full-conformance check. Example cross-field rule: an association's `dataset_id` must refer to a DataSet already present for that `project_id` (referential safety before write). -## Module 4 — `writers.py` (+ keep `write_utils.py`) +## Module 4 — `writers.py` (+ `io/write_utils.py`, `io/arrow.py`, `io/transforms.py`) A single dispatch core plus thin typed wrappers: - `write_models(models, *, settings=None)` — infers the class, looks up the registry, - validates each model via the strict submodel, converts via `arrow_utils`, attaches + validates each model via the strict submodel, converts via `io/arrow.py`, attaches LinkML metadata, then writes per `write_mode` (scoped overwrite with the - registry-built predicate, or `append_new_by_id` via the existing helper). + registry-built predicate, or `append_new_by_id` via the backend). - Wrappers for ergonomics and discoverability: `write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`, etc. Each is a one-liner over `write_models`. -- `write_utils.py` stays: `append_new_dataitems` becomes the `append_new_by_id` backend; - `walk_ancestors` stays for membership/mapping denormalization. Generalize +- `io/write_utils.py` (moved from root): `append_new_dataitems` is the `append_new_by_id` + backend; `walk_ancestors` is used by membership/mapping writers. 
Generalize `append_new_dataitems` only if needed (e.g. parametrize the partition column), without - breaking its current callers. + breaking callers. +- `io/transforms.py` holds **write-side enrichment** run before a write — notably + `populate_region_coverage(pmm, matrix)` from `io_plans.md`, which derives + `region_coverage` from the dense values. `write_projection_matrix` calls it (or accepts + an already-enriched matrix). Keep it a pure function (no IO, no mutation of input). -Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` and a -matrix-specific writer path, since they are wide Parquet, not row-modeled Delta tables. +Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` (in +`io/arrow.py`) and a matrix-specific writer path, since they are wide Parquet, not +row-modeled Delta tables. -## Module 5 — `readers.py` (+ fold in `io_plans.md`) +## Module 5 — `readers.py` (folds `parquet_loader.py`) Two layers: - Thin predicate-based readers mirroring the write spec: `read_dataset`, `read_dataitem`, - `read_features`, scoped by `project_id`/`dataset_id`, returning polars/pandas. + `read_features`, scoped by `project_id`/`dataset_id`, returning polars/pandas. Typed + reads (Parquet→models) use the folded-in `load_parquet_to_models`. - Flexible cross-dataset / cross-schema reads now that datasets share tables. Flagship example: "read all DataItems that have either a ClusterMembership or a CellToClusterMapping to a given set of clusters" — a cross-table query joining membership/mapping tables on cluster ids and returning the union of matching DataItems, regardless of source dataset/modality. Users can still drop to raw `polars.read_delta` for ad-hoc queries; the readers are conveniences, not a wall. -- Fold the `io_plans.md` analysis utilities (`populate_region_coverage`, - `compare_region_coverage`) into this module. 
+ +## Module 6 — `analysis.py` (read-side analysis) + +Read-side analysis over already-written tables. Seed with `compare_region_coverage(pmms)` +from `io_plans.md` (shared vs exclusive region coverage across matrices). This is distinct +from `transforms.py`: analysis reads finished data and summarizes; transforms enrich data +on its way in. Future cross-dataset analyses live here. ## Notebook migration (no logic, no schema, no models.py changes) diff --git a/planning/README.md b/planning/README.md index e51b3c9..4c0d753 100644 --- a/planning/README.md +++ b/planning/README.md @@ -12,10 +12,12 @@ that implement each piece. Created 2026-06-01. - `01_config.md` — global output path (`Settings`, no new dep). - `02_write_spec.md` — the registry (source of truth) + drift test. - `03_validation.md` — auto-derived strict submodels. - - `04_writers.md` — write dispatch + typed wrappers (fixes the patchseq bug). - - `05_readers.md` — predicate-based + cross-dataset reads. + - `04_writers.md` — write dispatch + typed wrappers + `io/transforms.py` (fixes the + patchseq bug; relocates `arrow_utils`/`write_utils` into `io/`). + - `05_readers.md` — predicate-based + cross-dataset reads (folds in `parquet_loader`). - `06_notebook_migration.md` — migrate ETL notebooks to the new API. - `07_tests.md` — safe-writing test suite. + - `08_analysis.md` — read-side analysis (`compare_region_coverage`). ## Two hard rules (repeated everywhere on purpose) 1. **Never edit `src/connects_common_connectivity/models.py`** — auto-generated from LinkML. diff --git a/planning/TODO.md b/planning/TODO.md index 5141f96..3e112b6 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -23,23 +23,32 @@ agent prompt that implements each item. Hard rules: never edit `models.py`; neve from generated models + registry. Prompt: `prompts/03_validation.md`. Blocked by 1.1. 
## Phase 3 — writers +- [ ] **3.0 Relocate backends into `io/`** — move `arrow_utils.py`→`io/arrow.py`, + `write_utils.py`→`io/write_utils.py`, with re-export shims at old paths. Part of + `prompts/04_writers.md`. - [ ] **3.1 Write dispatch core** (`io/writers.py`) — `write_models(models, settings=...)`: infer class → registry lookup → strict-validate → arrow convert → metadata → write per - `write_mode`. Reuses `arrow_utils` + `write_utils`. Prompt: `prompts/04_writers.md`. + `write_mode`. Reuses `io/arrow.py` + `io/write_utils.py`. Prompt: `prompts/04_writers.md`. Blocked by 0.1, 1.1, 2.1. - [ ] **3.2 Typed wrappers** — `write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`. Part of `prompts/04_writers.md`. - [ ] **3.3 Reconcile `write_utils.py`** — make `append_new_dataitems` the `append_new_by_id` backend without breaking current callers. Part of `prompts/04_writers.md`. +- [ ] **3.4 Write-side transforms** (`io/transforms.py`) — `populate_region_coverage` + (pre-write enrichment of ProjectionMeasurementMatrix). Part of `prompts/04_writers.md`. ## Phase 4 — readers +- [ ] **4.0 Fold `parquet_loader.py` into `io/readers.py`** (re-export shim at old path). + Part of `prompts/05_readers.md`. - [ ] **4.1 Predicate-based readers** (`io/readers.py`) — `read_dataset`, `read_dataitem`, `read_features` scoped by project/dataset. Prompt: `prompts/05_readers.md`. Blocked by 1.1. - [ ] **4.2 Cross-dataset reads** — flagship: DataItems with ClusterMembership OR CellToClusterMapping to a given cluster set. Part of `prompts/05_readers.md`. -- [ ] **4.3 Fold in analysis utils** — `populate_region_coverage`, - `compare_region_coverage` from `io/io_plans.md`. Part of `prompts/05_readers.md`. + +## Phase 4b — analysis +- [ ] **4b.1 Analysis module** (`io/analysis.py`) — `compare_region_coverage` (read-side + overlap summary). Prompt: `prompts/08_analysis.md`. 
Blocked by 4.1. ## Phase 5 — notebook migration - [ ] **5.1 Migrate `_01_dataset_dataitem` notebooks** — Settings + typed writers; fixes diff --git a/planning/prompts/00_shared_context.md b/planning/prompts/00_shared_context.md index 22fbd51..2ac335d 100644 --- a/planning/prompts/00_shared_context.md +++ b/planning/prompts/00_shared_context.md @@ -12,9 +12,16 @@ area-to-area, patch-seq multimodal) in one format, plus taxonomies/clusters. you need and why; do not change the schema. 3. **Single source of truth = the LinkML schema / generated models.** Read field definitions from `models.py`; do not restate them. -4. New IO code goes under `src/connects_common_connectivity/io/`. Do not move plotting - code out of `code/utils.py`. -5. Read `planning/ARCHITECTURE.md` fully before starting. It governs the design. +4. All IO code lives under `src/connects_common_connectivity/io/` — this is a relocation + to a clean package, not a parallel one. Existing IO modules at the package root + (`arrow_utils.py`, `write_utils.py`, `parquet_loader.py`) are MOVED into `io/` and + become backends, not reimplemented. See the "Target io/ structure" section of + ARCHITECTURE.md for the exact layout and where each existing file goes. Do not move + plotting code out of `code/utils.py`. Do not move `cli.py` or `models.py`. +5. When you move a module, keep a one-line re-export shim at its old path (e.g. + `from .io.arrow import *`) until notebook migration is done, so nothing breaks + mid-transition. +6. Read `planning/ARCHITECTURE.md` fully before starting. It governs the design. ## What already exists — reuse, don't rebuild - `models.py`: generated pydantic v2 classes incl. `DataSet`, `DataItem`, diff --git a/planning/prompts/04_writers.md b/planning/prompts/04_writers.md index e8a6072..ab1039e 100644 --- a/planning/prompts/04_writers.md +++ b/planning/prompts/04_writers.md @@ -2,6 +2,13 @@ > Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`, `validation.py`. 
+## Relocation first (clean structure) +Before writing new code, MOVE the existing backends into `io/` (with re-export shims at the +old paths until notebook migration is done): +- `arrow_utils.py` → `io/arrow.py` +- `write_utils.py` → `io/write_utils.py` +All new code imports from the `io/` locations. + ## Goal Create `src/connects_common_connectivity/io/writers.py`: a single write dispatch that uses the registry so notebooks never hand-write `mode` / `predicate` / `partition_by` again. @@ -32,7 +39,15 @@ Signatures should be ergonomic (accept the model(s) and optional `settings`). ## Wide feature matrices `CellFeatureMatrix` is wide Parquet. Route it through a matrix-specific path using -`build_cell_feature_matrix_schema`; do not force it into the row-Delta path. +`build_cell_feature_matrix_schema` (now in `io/arrow.py`); do not force it into the +row-Delta path. + +## Write-side transforms (`io/transforms.py`) +Create `io/transforms.py` for pre-write enrichment. Port `populate_region_coverage(pmm, +matrix)` from `io/io_plans.md`: derive `region_coverage` from the dense values array, +return a copy of the `ProjectionMeasurementMatrix` (pure function, no mutation, no IO). +`write_projection_matrix` should call it (or accept an already-enriched matrix). Do NOT put +`compare_region_coverage` here — that is read-side analysis (see `08_analysis.md`). ## Reconcile `write_utils.py` Make `append_new_dataitems` the `append_new_by_id` backend. If you must generalize it diff --git a/planning/prompts/05_readers.md b/planning/prompts/05_readers.md index 88d05e8..e4725db 100644 --- a/planning/prompts/05_readers.md +++ b/planning/prompts/05_readers.md @@ -2,6 +2,10 @@ > Prepend `00_shared_context.md`. Depends on `write_spec.py` (+ `config.py`). +## Relocation first (clean structure) +Fold `parquet_loader.py` into `io/readers.py` (the typed Parquet→models backend). Keep a +re-export shim at the old `parquet_loader` path until notebook migration is done. 
+ ## Goal Create `src/connects_common_connectivity/io/readers.py`: convenient reads over the shared Delta tables, scoped by the registry, plus flexible cross-dataset/cross-schema queries. @@ -27,10 +31,10 @@ settings=None) -> DataFrame`: if the membership/mapping tables are denormalized that way (check how the `_03`/cluster notebooks write the hierarchy before assuming). -## Layer 3 — fold in analysis utils -Port `populate_region_coverage(pmm, matrix)` and `compare_region_coverage(pmms)` from -`io/io_plans.md` into this module (or a sibling `analysis.py` if cleaner). Keep their -documented signatures and pure-function behavior. +## Note — analysis is a separate module +Do NOT put analysis utils here. `compare_region_coverage(pmms)` goes in `io/analysis.py` +(`08_analysis.md`), and `populate_region_coverage` is a write-side transform +(`io/transforms.py`, `04_writers.md`). Readers only read. ## Tests (`tests/test_readers.py`) - Round-trip: write models via the writers, read them back scoped, assert equality on diff --git a/planning/prompts/08_analysis.md b/planning/prompts/08_analysis.md new file mode 100644 index 0000000..2e79344 --- /dev/null +++ b/planning/prompts/08_analysis.md @@ -0,0 +1,25 @@ +# Agent prompt — Analysis module (read-side) + +> Prepend `00_shared_context.md`. Depends on `readers.py` (uses read outputs). + +## Goal +Create `src/connects_common_connectivity/io/analysis.py` for read-side analysis over +already-written tables. This is distinct from `io/transforms.py` (write-side enrichment): +analysis reads finished data and summarizes; it never writes or mutates inputs. + +## Seed function +Port `compare_region_coverage(pmms)` from `io/io_plans.md`: +- Input: list of `ProjectionMeasurementMatrix` instances with `region_index` and + `region_coverage` populated. 
+- Compute `shared_regions` (intersection of `region_index`), `shared_coverage` + (intersection of `region_coverage`), and, for every non-empty subset of the inputs, the + count of regions exclusively covered by that combination. +- Print the summary table shown in `io_plans.md` and return a dict with keys + `shared_regions`, `shared_coverage`, `exclusive_counts`. + +## Tests (`tests/test_analysis.py`) +- Small synthetic set of PMMs gives the expected shared/exclusive counts. +- Pure: inputs are not mutated. + +## Do not +- Write to disk here. Touch `models.py` or schemas. From ccd5b2645424456cb49c052d07e60e347d7894cf Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 2 Jun 2026 15:37:53 -0700 Subject: [PATCH 03/25] editing architechture and plans, maximalist version --- planning/ARCHITECTURE.md | 160 ++++++++++++++-------- planning/README.md | 38 ++--- planning/TODO.md | 86 +++++++----- planning/prompts/00_shared_context.md | 64 +++------ planning/prompts/01_config.md | 65 ++++++--- planning/prompts/02_write_spec.md | 2 +- planning/prompts/03_validation.md | 28 ++-- planning/prompts/04_writers.md | 37 +++-- planning/prompts/05_readers.md | 11 +- planning/prompts/06_notebook_migration.md | 36 +++-- planning/prompts/07_tests.md | 35 ++--- planning/prompts/08_analysis.md | 34 +++-- planning/prompts/09_public_api.md | 30 ++++ 13 files changed, 372 insertions(+), 254 deletions(-) create mode 100644 planning/prompts/09_public_api.md diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md index c847896..004dece 100644 --- a/planning/ARCHITECTURE.md +++ b/planning/ARCHITECTURE.md @@ -6,17 +6,10 @@ live in `planning/prompts/`. The task breakdown lives in `planning/TODO.md`. ## Hard constraints (read before any work) -1. **Do not edit `src/connects_common_connectivity/models.py`.** It is auto-generated - from the LinkML YAMLs in `schemas/`. Any change to the data model happens in the - YAMLs and is regenerated — never hand-edited. -2. 
**Do not change the LinkML schemas without explicit permission from YY.** If safe - writing turns out to need a new slot (e.g. a clearer project/dataset scoping key), - stop and ask first. Propose the change in writing; do not edit `schemas/*.yaml` - pre-emptively. -3. **Single source of truth = the LinkML schema.** The registry and the derived - validators read from the generated models; they never restate field definitions. -4. New IO code lives under `src/connects_common_connectivity/io/`. Plotting stays in - `code/utils.py`. Notebooks are migrated to call the new API, not to embed logic. +The non-negotiable rules live in `prompts/00_shared_context.md` and are not restated here: +never edit `models.py` (generated) or `schemas/*.yaml` (ask YY first); the LinkML schema is +the single source of truth; all IO code lives under `src/connects_common_connectivity/io/`. +This document assumes those and adds the design on top. ## What exists today (do not rebuild) @@ -51,33 +44,49 @@ call. "Do not rebuild" means *move and wrap, never reimplement*. ``` src/connects_common_connectivity/ models.py # generated, UNTOUCHED, stays at root - cli.py # CLI entry point, stays at root; calls io.validation full check + cli.py # CLI entry point, stays at root; full LinkML conformance check + config.py # NEW package-wide Settings (output_root, dry_run, ...) 
— see below io/ - config.py # NEW Settings (global output_root) + __init__.py # NEW curated public API (what users import); __all__ + docstring write_spec.py # NEW registry — source of truth - validation.py # NEW auto-derived strict submodels - arrow.py # MOVED from arrow_utils.py (models <-> Arrow conversion) - writers.py # NEW write_models() + typed wrappers + write_validation.py# NEW auto-derived strict submodels (write-safety validation) + arrow_utils.py # MOVED from root (no rename) (models <-> Arrow conversion) + writers.py # NEW write_models() + typed wrappers + write-side transforms write_utils.py # MOVED from root (append-by-id backend, walk_ancestors) - transforms.py # NEW write-side enrichment incl. populate_region_coverage readers.py # MOVED + folds parquet_loader.py + predicate/cross-dataset reads - analysis.py # NEW compare_region_coverage + future cross-dataset analysis ``` +`config.py` lives at the **package root**, not in `io/`: configuration is package-wide +(`cli.py` and future plotting/analysis code read it too), so the general name belongs in the +general namespace next to `models.py`. Conversely the io validator is named +`write_validation.py`, not `validation.py`: it is specifically write-safety validation +coupled to `write_spec`, and the bare word "validation" is already claimed by `cli.py`'s +LinkML conformance check — two different validations, so neither owns the generic name. + +Seed-stage modules are NOT split out prematurely. Write-side enrichment +(`populate_region_coverage`) starts as a section at the top of `writers.py`; read-side +analysis (`compare_region_coverage`) starts in `readers.py`. Promote either to its own +module (`transforms.py` / `analysis.py`) only when a second function arrives — that move is +a pure relocation with no public-API change because users import from `io/__init__.py`. + Where each existing file goes: -- `arrow_utils.py` → `io/arrow.py`. Conversion layer used by `writers.py`. Pure move. 
+- `arrow_utils.py` → `io/arrow_utils.py`. Conversion layer used by `writers.py`. Pure move. - `write_utils.py` → `io/write_utils.py`. `append_new_dataitems` becomes the `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers and by cross-dataset reads. Pure move. - `parquet_loader.py` → folded into `io/readers.py` (Parquet→models with report becomes the typed-read backend). Pure move/merge. -- `cli.py` stays at the package root as the `ccc` entry point; it calls into - `io/validation.py` for the occasional full LinkML conformance check. +- `cli.py` stays at the package root as the `ccc` entry point; it owns the occasional full + LinkML conformance check (separate from `io/write_validation.py`, which is the fast + write-path check). +- `config.py` is NEW at the package root (package-wide settings; see structure note above). - `models.py` stays at root, generated, never edited. Migration safety: while notebooks are being migrated, the moved modules may keep one-line -re-export shims at their old import paths (e.g. `from .io.arrow import *`) so nothing breaks -mid-transition; delete the shims once `06_notebook_migration` is complete. +re-export shims at their old import paths (e.g. `from .io.arrow_utils import *`) so nothing breaks +mid-transition. Shim removal is a tracked task (TODO 5.4), gated by a test that asserts no +old import path is referenced anywhere once migration is complete — otherwise the two import +paths linger and become exactly the clutter this redesign removes. ## The bug this design fixes @@ -124,28 +133,51 @@ LinkML schema ──▶│ models.py (generated) │ Settings (global output_root) ``` -## Module 1 — `config.py` (global output path) +## Module 1 — `config.py` (package root; discovered config file) -Decision: **plain pydantic `BaseModel`**, version-controlled default in code, optional -env override. No new dependency (no pydantic-settings). 
+Decision: settings live in a **declarative, version-controlled `ccc_config.yaml`** at the +repo root, discovered by walking up from the working directory (the `pyproject.toml` / +`ruff` / `pytest` pattern) and loaded into a validated pydantic `Settings`. No `%run`, no +process-global mutation, no per-notebook setup. No new dependency (pydantic + PyYAML, the +latter already in the tree via LinkML). + +```yaml +# ccc_config.yaml (repo root — the ONE place values live) +output_root: ../scratch/em_patchseq_wnm_v1/ +dry_run: false +``` ```python class Settings(BaseModel): - output_root: Path = Path("../scratch/em_patchseq_wnm_v1/") - # add knobs here later (dry_run, schema_version_pin, ...) as needed - - @classmethod - def load(cls) -> "Settings": - default = cls.model_fields["output_root"].default - return cls(output_root=os.environ.get("CCC_OUTPUT_ROOT", default)) + output_root: Path # required, no default + dry_run: bool = False + # room for more knobs (schema_version_pin, ...) later + +@lru_cache +def get_settings() -> Settings: + path = find_config_file("ccc_config.yaml") # walk cwd → parents + if path is None: + raise RuntimeError("No ccc_config.yaml found — create one at the repo root " + "with output_root: ...") + data = yaml.safe_load(path.read_text()) + if env := os.environ.get("CCC_OUTPUT_ROOT"): # developer escape hatch, path only + data["output_root"] = env + return Settings(**data) ``` -Rationale: the default is readable in git without running anything and adds no -dependency; the `CCC_OUTPUT_ROOT` env override is the escape hatch for CodeOcean, where -the write location differs from local. Notebooks replace the hardcoded `OUTPUT_ROOT` -string with `settings = Settings.load()` and print the resolved value at the top. -A `table_path(settings, "dataset")` helper resolves per-table subdirectories so notebooks -never concatenate path strings. 
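
`get_settings()` above calls `find_config_file`, which this plan names but does not define. A minimal sketch of the walk-up discovery follows, assuming the helper returns the nearest matching file or `None`. The `start` parameter is hypothetical, added here so the function can be exercised without changing the working directory.

```python
from pathlib import Path
from typing import Optional


def find_config_file(name: str, start: Optional[Path] = None) -> Optional[Path]:
    """Return the nearest `name` at or above `start` (default: cwd), else None."""
    here = (start or Path.cwd()).resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for directory in (here, *here.parents):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

Returning `None` instead of raising keeps the "missing file fails loudly" decision inside `get_settings()`, where the error message can say what to create and where.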
+Resolution precedence: **explicit `settings=` arg (per call) > `CCC_OUTPUT_ROOT` env > +`ccc_config.yaml` > error.** The file is the source of truth and is validated by pydantic on +load; the env var is a subordinate developer override for `output_root` only (it cannot +express structured knobs like `dry_run`). There is no built-in default path — a missing file +fails loudly rather than writing somewhere arbitrary. `get_settings()` is a pure, cached +function of the filesystem (clearable in tests), not a mutable global. + +How the ETL uses it (kills the per-notebook setup): there is no config cell at all. A +notebook just imports and calls `write_dataset(...)` / `read_dataset(...)`; the library +discovers `ccc_config.yaml` on its own. Writers/readers do `settings = settings or +get_settings()`. To repoint local vs CodeOcean, edit the one file (or set `CCC_OUTPUT_ROOT`). +A `table_path(settings, "dataset")` helper resolves per-table subdirectories so nothing +concatenates path strings. ## Module 2 — `write_spec.py` (the registry) @@ -171,7 +203,7 @@ Predicate is built from `scope_columns` + the row values, e.g. `"project_id = 'visp_patchseq' AND id = 'visp_exc_patchseq'"`. This is exactly the bug fix: DataSet now carries `id` in its scope. -## Module 3 — `validation.py` (auto-derived strict submodels) +## Module 3 — `io/write_validation.py` (auto-derived strict submodels) Decision: **auto-derived** strict submodels — single source of truth. @@ -179,35 +211,44 @@ Decision: **auto-derived** strict submodels — single source of truth. (a) flips each slot in the registry's `required_for_write` to required, and (b) attaches the registry's `cross_field_rules` as pydantic `model_validator`s. No field definitions are restated; everything is read from `models.py` + the registry. `models.py` is never -touched. Validation runs on the hot write path (fast, pydantic-only). The LinkML/`cli.py` -validator remains the separate, occasional full-conformance check. +touched. 
Validation runs on the hot write path (fast, pydantic-only, **no I/O**). The +LinkML/`cli.py` validator remains the separate, occasional full-conformance check. -Example cross-field rule: an association's `dataset_id` must refer to a DataSet already -present for that `project_id` (referential safety before write). +Hot-path validation is purely structural: required-slot enforcement plus pure cross-field +rules that only inspect the model in hand. **Referential checks that read other tables do +NOT belong on the hot path.** Example: "an association's `dataset_id` must refer to a +DataSet already present for that `project_id`" requires a reader, so it is an opt-in check +(`write_models(..., check_refs=True)`) implemented after readers exist (Phase 4b), not a +strict-submodel validator. This keeps Phase 2 free of any dependency on Phase 4. -## Module 4 — `writers.py` (+ `io/write_utils.py`, `io/arrow.py`, `io/transforms.py`) +## Module 4 — `writers.py` (+ `io/write_utils.py`, `io/arrow_utils.py`) A single dispatch core plus thin typed wrappers: - `write_models(models, *, settings=None)` — infers the class, looks up the registry, - validates each model via the strict submodel, converts via `io/arrow.py`, attaches + validates each model via the strict submodel, converts via `io/arrow_utils.py`, attaches LinkML metadata, then writes per `write_mode` (scoped overwrite with the registry-built predicate, or `append_new_by_id` via the backend). -- Wrappers for ergonomics and discoverability: `write_dataset`, `write_dataitem`, +- Typed wrappers for discoverability (`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, - `write_cell_to_cluster_mapping`, `write_projection_matrix`, etc. Each is a one-liner - over `write_models`. + `write_cell_to_cluster_mapping`, `write_projection_matrix`). `write_models` is the one + real entry point; the wrappers are sugar. 
To avoid hand-maintaining eight one-liners that + must stay in lockstep with the registry, **generate them from the registry** (a small + factory binding the class) and re-export the generated names from `io/__init__.py`. A + hand-written wrapper is justified only where a class needs a non-uniform signature + (e.g. `write_projection_matrix` taking the dense matrix for enrichment). - `io/write_utils.py` (moved from root): `append_new_dataitems` is the `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers. Generalize `append_new_dataitems` only if needed (e.g. parametrize the partition column), without breaking callers. -- `io/transforms.py` holds **write-side enrichment** run before a write — notably - `populate_region_coverage(pmm, matrix)` from `io_plans.md`, which derives +- **Write-side enrichment** lives as a section at the top of `writers.py` (not yet its own + module): `populate_region_coverage(pmm, matrix)` from `io_plans.md` derives `region_coverage` from the dense values. `write_projection_matrix` calls it (or accepts an already-enriched matrix). Keep it a pure function (no IO, no mutation of input). + Split into `io/transforms.py` only when a second transform appears. Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` (in -`io/arrow.py`) and a matrix-specific writer path, since they are wide Parquet, not +`io/arrow_utils.py`) and a matrix-specific writer path, since they are wide Parquet, not row-modeled Delta tables. ## Module 5 — `readers.py` (folds `parquet_loader.py`) @@ -223,17 +264,16 @@ Two layers: membership/mapping tables on cluster ids and returning the union of matching DataItems, regardless of source dataset/modality. Users can still drop to raw `polars.read_delta` for ad-hoc queries; the readers are conveniences, not a wall. - -## Module 6 — `analysis.py` (read-side analysis) - -Read-side analysis over already-written tables. 
Seed with `compare_region_coverage(pmms)` -from `io_plans.md` (shared vs exclusive region coverage across matrices). This is distinct -from `transforms.py`: analysis reads finished data and summarizes; transforms enrich data -on its way in. Future cross-dataset analyses live here. +- **Read-side analysis** lives as a section in `readers.py` to start: + `compare_region_coverage(pmms)` from `io_plans.md` (shared vs exclusive region coverage + across matrices). It reads finished data and summarizes — the mirror image of write-side + enrichment, which augments data on the way in. Split into `io/analysis.py` only when a + second analysis function appears. ## Notebook migration (no logic, no schema, no models.py changes) -For each ETL notebook: replace hardcoded `OUTPUT_ROOT` with `Settings.load()`, replace +For each ETL notebook: delete hardcoded `OUTPUT_ROOT` (no config cell — the library +discovers `ccc_config.yaml`), replace direct `write_deltalake(...)` calls with the typed writers, and delete the per-cell `mode`/`predicate`/`partition_by` bookkeeping (now owned by the registry). Verification cells stay. The `visp_*_patchseq` bug is fixed automatically once DataSet writes go diff --git a/planning/README.md b/planning/README.md index 4c0d753..eba657e 100644 --- a/planning/README.md +++ b/planning/README.md @@ -1,34 +1,16 @@ # planning/ — IO layer design & agent prompts -This folder documents how we're building the user-friendly IO layer (write / read / -validation) for ConnectsCommonConnectivity, and holds ready-to-run prompts for the agents -that implement each piece. Created 2026-06-01. +How we're building the user-friendly IO layer (write / read / validation) for +ConnectsCommonConnectivity. Created 2026-06-01. -## Contents -- `ARCHITECTURE.md` — the design (source of truth). Registry-centric write/read/validation. +- `ARCHITECTURE.md` — the design (source of truth). - `TODO.md` — ordered, dependency-aware task list. 
-- `prompts/` — one prompt per work item, to hand to implementing agents: - - `00_shared_context.md` — **prepend to every prompt below.** Hard rules + repo facts. - - `01_config.md` — global output path (`Settings`, no new dep). - - `02_write_spec.md` — the registry (source of truth) + drift test. - - `03_validation.md` — auto-derived strict submodels. - - `04_writers.md` — write dispatch + typed wrappers + `io/transforms.py` (fixes the - patchseq bug; relocates `arrow_utils`/`write_utils` into `io/`). - - `05_readers.md` — predicate-based + cross-dataset reads (folds in `parquet_loader`). - - `06_notebook_migration.md` — migrate ETL notebooks to the new API. - - `07_tests.md` — safe-writing test suite. - - `08_analysis.md` — read-side analysis (`compare_region_coverage`). - -## Two hard rules (repeated everywhere on purpose) -1. **Never edit `src/connects_common_connectivity/models.py`** — auto-generated from LinkML. -2. **Never edit `schemas/*.yaml`** without explicit permission from YY. - -## Locked decisions -- Config: plain pydantic, version-controlled default + `CCC_OUTPUT_ROOT` env override. -- Write spec: explicit registry, schema-checked for drift. -- Validation: auto-derived strict submodels (single source of truth). +- `prompts/` — one prompt per work item (`00_shared_context.md` is prepended to every other + prompt and holds the **hard rules**: don't edit `models.py` or `schemas/*.yaml`): + `01_config` · `02_write_spec` · `03_validation` · `04_writers` · `05_readers` · + `06_notebook_migration` · `07_tests` · `08_analysis` (read-side analysis + opt-in + referential check) · `09_public_api` (`io/__init__.py`). ## How to run an item -Hand the implementing agent: `00_shared_context.md` + the specific prompt, and point it at -`ARCHITECTURE.md`. Follow the order in `TODO.md` (config → registry → validation → -writers → readers → notebook migration → tests). 
+Hand the implementing agent `00_shared_context.md` + the specific prompt, point it at +`ARCHITECTURE.md`, and follow the order in `TODO.md`. diff --git a/planning/TODO.md b/planning/TODO.md index 3e112b6..ddd2f2f 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -1,69 +1,87 @@ # IO Layer — TODO -Ordered, dependency-aware. See `ARCHITECTURE.md` for design and `prompts/` for the -agent prompt that implements each item. Hard rules: never edit `models.py`; never edit -`schemas/*.yaml` without explicit permission from YY. +Ordered, dependency-aware. Design lives in `ARCHITECTURE.md`; the implementing prompt for +each item is in `prompts/`. Hard rules: see `prompts/00_shared_context.md`. ## Phase 0 — groundwork -- [ ] **0.1 Config module** (`io/config.py`) — plain pydantic `Settings` with - `output_root` default + `CCC_OUTPUT_ROOT` env override + `table_path()` helper. - No new dependency. Prompt: `prompts/01_config.md`. Blocks everything that writes. +- [ ] **0.1 Config module** (`config.py`, package root — not `io/`) — pydantic `Settings` loaded from a discovered + `ccc_config.yaml` (walk-up like `pyproject.toml`), cached `get_settings()`, `table_path()` + helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` env (developer escape hatch, path + only) > `ccc_config.yaml` > **error, no default path**. No `configure()` global, no `%run`. + Deps: pydantic + PyYAML (already present). Prompt: `01_config.md`. Blocks everything that writes. ## Phase 1 — registry (the hub) - [ ] **1.1 Write spec registry** (`io/write_spec.py`) — one entry per writable class: `subdir`, `partition_by`, `scope_columns`, `write_mode`, `required_for_write`, `cross_field_rules`. Seed DataSet/DataItem/Association first, then the rest. - Prompt: `prompts/02_write_spec.md`. Blocked by: none (reads generated models). + Prompt: `02_write_spec.md`. Blocked by: none (reads generated models). 
- [ ] **1.2 Registry↔schema drift test** — assert every entry's class + scope/identifier - slots exist in `models.py`. Part of `prompts/02_write_spec.md`. + slots exist in `models.py`. Part of `02_write_spec.md`. -## Phase 2 — validation -- [ ] **2.1 Strict submodel derivation** (`io/validation.py`) — `strict_model_for(cls)` - flips `required_for_write` to required + attaches `cross_field_rules`. Auto-derived - from generated models + registry. Prompt: `prompts/03_validation.md`. Blocked by 1.1. +## Phase 2 — validation (structural only) +- [ ] **2.1 Strict submodel derivation** (`io/write_validation.py`) — `strict_model_for(cls)` + flips `required_for_write` to required + attaches *pure* `cross_field_rules`. No I/O, no + reading other tables. Auto-derived from generated models + registry. Prompt: + `03_validation.md`. Blocked by 1.1. (Referential checks are 4b.2, not here.) ## Phase 3 — writers -- [ ] **3.0 Relocate backends into `io/`** — move `arrow_utils.py`→`io/arrow.py`, +- [ ] **3.0 Relocate backends into `io/`** — move `arrow_utils.py`→`io/arrow_utils.py`, `write_utils.py`→`io/write_utils.py`, with re-export shims at old paths. Part of - `prompts/04_writers.md`. + `04_writers.md`. - [ ] **3.1 Write dispatch core** (`io/writers.py`) — `write_models(models, settings=...)`: infer class → registry lookup → strict-validate → arrow convert → metadata → write per - `write_mode`. Reuses `io/arrow.py` + `io/write_utils.py`. Prompt: `prompts/04_writers.md`. + `write_mode`. Reuses `io/arrow_utils.py` + `io/write_utils.py`. Prompt: `04_writers.md`. Blocked by 0.1, 1.1, 2.1. -- [ ] **3.2 Typed wrappers** — `write_dataset`, `write_dataitem`, `write_association`, - `write_features`, `write_cluster`, `write_cluster_membership`, - `write_cell_to_cluster_mapping`, `write_projection_matrix`. Part of `prompts/04_writers.md`. +- [ ] **3.2 Typed wrappers** — generated from the registry (not hand-maintained); hand-write + only non-uniform signatures (e.g. 
`write_projection_matrix`). Part of `04_writers.md`. - [ ] **3.3 Reconcile `write_utils.py`** — make `append_new_dataitems` the - `append_new_by_id` backend without breaking current callers. Part of `prompts/04_writers.md`. -- [ ] **3.4 Write-side transforms** (`io/transforms.py`) — `populate_region_coverage` - (pre-write enrichment of ProjectionMeasurementMatrix). Part of `prompts/04_writers.md`. + `append_new_by_id` backend without breaking current callers. Part of `04_writers.md`. +- [ ] **3.4 Write-side transform** — `populate_region_coverage` as a section in + `writers.py` (pre-write enrichment of ProjectionMeasurementMatrix). Part of `04_writers.md`. + +## Phase 3b — public API +- [ ] **3b.1 `io/__init__.py`** — curated exports, module docstring, `__all__`. Defines what + users type after `from connects_common_connectivity.io import …`. Prompt: `09_public_api.md`. + Blocked by 3.1. ## Phase 4 — readers - [ ] **4.0 Fold `parquet_loader.py` into `io/readers.py`** (re-export shim at old path). - Part of `prompts/05_readers.md`. + Part of `05_readers.md`. - [ ] **4.1 Predicate-based readers** (`io/readers.py`) — `read_dataset`, `read_dataitem`, - `read_features` scoped by project/dataset. Prompt: `prompts/05_readers.md`. Blocked by 1.1. + `read_features` scoped by project/dataset. Prompt: `05_readers.md`. Blocked by 1.1. - [ ] **4.2 Cross-dataset reads** — flagship: DataItems with ClusterMembership OR - CellToClusterMapping to a given cluster set. Part of `prompts/05_readers.md`. + CellToClusterMapping to a given cluster set. Part of `05_readers.md`. -## Phase 4b — analysis -- [ ] **4b.1 Analysis module** (`io/analysis.py`) — `compare_region_coverage` (read-side - overlap summary). Prompt: `prompts/08_analysis.md`. Blocked by 4.1. +## Phase 4b — analysis & referential checks (need readers) +- [ ] **4b.1 Read-side analysis** — `compare_region_coverage` as a section in `readers.py` + (read-side overlap summary). Prompt: `08_analysis.md`. Blocked by 4.1. 
+- [ ] **4b.2 Opt-in referential check** — `write_models(..., check_refs=True)` verifies an + association's `dataset_id` exists among written DataSets. Uses readers; off the hot path. + Part of `08_analysis.md`. Blocked by 4.1. ## Phase 5 — notebook migration -- [ ] **5.1 Migrate `_01_dataset_dataitem` notebooks** — Settings + typed writers; fixes - the patchseq DataSet overwrite. Prompt: `prompts/06_notebook_migration.md`. Blocked by 3.x. +- [ ] **5.0 Create `ccc_config.yaml`** at the repo root — the single, version-controlled + source of truth for `output_root` (+ `dry_run`). Part of `06_notebook_migration.md`. +- [ ] **5.1 Migrate `_01_dataset_dataitem` notebooks** — delete hardcoded `OUTPUT_ROOT` + (no config cell; library discovers `ccc_config.yaml`) + typed writers; fixes the patchseq + DataSet overwrite. Prompt: `06_notebook_migration.md`. Blocked by 3.x. - [ ] **5.2 Migrate feature / cluster / mapping / projection notebooks.** Same prompt. - [ ] **5.3 Patchseq regression check** — re-run exc then inh; assert both DataSet rows coexist. Acceptance test for the migration. +- [ ] **5.4 Remove re-export shims** — delete shims at `arrow_utils.py`, `write_utils.py`, + `parquet_loader.py` once no notebook/test imports them. Add a test asserting no old import + path is referenced anywhere. Blocked by 5.1–5.2. ## Phase 6 — tests & docs -- [ ] **6.1 Test suite** — idempotency, shared-partition safety (patchseq regression), - strict-validator failures, round-trip. Prompt: `prompts/07_tests.md`. +- [ ] **6.1 Test suite** — see `07_tests.md`. Pulls together the cases already specified in + `02` (drift), `04`/`06` (regression) rather than re-specifying them. - [ ] **6.2 Update README / usage docs** for the new IO API. (Ask before large edits.) ## Decisions locked (2026-06-01) -- Config: plain pydantic, version-controlled default + env override, no new dep. 
+- Config: declarative `ccc_config.yaml` at repo root, discovered by walk-up and validated by + pydantic; no per-notebook setup, no `%run`, no global. Precedence: explicit arg > env + (escape hatch) > `ccc_config.yaml` > **error (no default path)**. Deps already present. - Write spec: explicit registry, source of truth, schema-checked for drift. -- Validation: auto-derived strict submodels from generated models + registry. -- Scope: this session produced planning docs + prompts only. +- Validation: auto-derived strict submodels; hot path is structural-only, no I/O. +- Packaging: seed transforms/analysis inside writers/readers; split out on second function. + Public surface is `io/__init__.py`. Scope of this session: planning docs + prompts only. diff --git a/planning/prompts/00_shared_context.md b/planning/prompts/00_shared_context.md index 2ac335d..3807014 100644 --- a/planning/prompts/00_shared_context.md +++ b/planning/prompts/00_shared_context.md @@ -1,51 +1,29 @@ # Shared context — prepend to every IO-layer agent prompt -You are working in the `ConnectsCommonConnectivity` repo: a LinkML+pydantic data schema -holding multi-scale connectomics data (EM cell-to-cell, morphology cell-to-area, viral -area-to-area, patch-seq multimodal) in one format, plus taxonomies/clusters. +You are working in the `ConnectsCommonConnectivity` repo (LinkML+pydantic schema for +multi-scale connectomics). **Read `planning/ARCHITECTURE.md` before starting** — it owns +the design, the existing-file inventory, the target `io/` layout, and the motivating bug. +This file is only the rules of the room. ## Non-negotiable rules -1. **Never edit `src/connects_common_connectivity/models.py`** — it is auto-generated - from `schemas/*.yaml`. Treat it as read-only. -2. **Never edit `schemas/*.yaml`** without explicit written permission from the maintainer - (YY). If your task seems to require a new slot for safe writing, STOP and report what - you need and why; do not change the schema. +1. 
**Never edit `src/connects_common_connectivity/models.py`** — auto-generated from + `schemas/*.yaml`. Read-only. +2. **Never edit `schemas/*.yaml`** without explicit written permission from YY. If your + task seems to need a new slot, STOP and report what you need and why. 3. **Single source of truth = the LinkML schema / generated models.** Read field - definitions from `models.py`; do not restate them. -4. All IO code lives under `src/connects_common_connectivity/io/` — this is a relocation - to a clean package, not a parallel one. Existing IO modules at the package root - (`arrow_utils.py`, `write_utils.py`, `parquet_loader.py`) are MOVED into `io/` and - become backends, not reimplemented. See the "Target io/ structure" section of - ARCHITECTURE.md for the exact layout and where each existing file goes. Do not move - plotting code out of `code/utils.py`. Do not move `cli.py` or `models.py`. -5. When you move a module, keep a one-line re-export shim at its old path (e.g. - `from .io.arrow import *`) until notebook migration is done, so nothing breaks - mid-transition. -6. Read `planning/ARCHITECTURE.md` fully before starting. It governs the design. - -## What already exists — reuse, don't rebuild -- `models.py`: generated pydantic v2 classes incl. `DataSet`, `DataItem`, - `DataItemDataSetAssociation`, `Cluster`, `ClusterHierarchy`, `ClusterMembership`, - `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, `CellFeatureMeasurement`, - `MappingSet`, `CellToCellMapping`, `CellToClusterMapping`, `ClusterToClusterMapping`, - `ProjectionMeasurementMatrix`. `ProjectScoped` mixin → `project_id`. -- `arrow_utils.py`: `build_arrow_schema`, `models_to_table`, `attach_linkml_metadata`, - `build_cell_feature_matrix_schema`. -- `write_utils.py`: `append_new_dataitems(path, table, *, project_id, id_column="id")`, - `walk_ancestors(leaf_id, parent_of)`. -- `parquet_loader.py`: `load_parquet_to_models(...)`. `cli.py`: LinkML full validation. 
-- `io/io_plans.md`: analysis-util specs to fold into readers. + definitions from `models.py`; never restate them. +4. **IO code lives under `src/connects_common_connectivity/io/`.** Existing root + modules (`arrow_utils.py`, `write_utils.py`, `parquet_loader.py`) are MOVED there and + wrapped as backends — never reimplemented. `cli.py` and `models.py` stay at root; so does + `config.py` (package-wide settings, not IO-specific) and plotting stays in + `code/utils.py`. Exact layout: ARCHITECTURE.md → "Target io/ structure". +5. When you move a module, leave a one-line re-export shim at its old path until notebook + migration is done, so nothing breaks mid-transition. ## Conventions -- Python 3.10+, pydantic v2, polars + pyarrow + deltalake (already deps). -- Match existing style (ruff, line-length 100). Add docstrings like the existing modules. +- Python 3.10+, pydantic v2; polars + pyarrow + deltalake (already deps). No new deps + without asking. +- Match existing style (ruff, line-length 100); docstring like the existing modules. - Add `pytest` tests under `tests/` for anything you implement. -- After implementing, run the relevant tests and report results. Do not mark work done - with failing tests or partial implementation. - -## Reference: the bug to keep in mind -`visp_exc_patchseq` and `visp_inh_patchseq` share `project_id='visp_patchseq'` with -different `dataset_id`. The current DataSet write uses predicate -`project_id = ''`, so writing one wipes the other. The registry fixes this by -making DataSet's scope `(project_id, id)`. Any writer you build must derive its predicate -from the registry, never from a hardcoded string. +- Run the relevant tests and report results. Never mark work done with failing tests or a + partial implementation. 
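The rule that a writer derives its overwrite predicate from the registry, never from a hardcoded string, can be illustrated with a small sketch. `build_predicate` is a hypothetical helper name, not one defined in this patch, and the sketch assumes scope columns hold string-like values.

```python
def build_predicate(scope_columns: list[str], row: dict[str, object]) -> str:
    """Build an overwrite predicate such as "project_id = 'p' AND id = 'd'"."""
    if not scope_columns:
        raise ValueError("scope_columns must not be empty")
    clauses = []
    for column in scope_columns:
        value = row.get(column)
        if value is None:
            # A null scope value would silently widen the overwrite; refuse instead.
            raise ValueError(f"scope column {column!r} is missing or null")
        escaped = str(value).replace("'", "''")  # double single quotes, SQL-style
        clauses.append(f"{column} = '{escaped}'")
    return " AND ".join(clauses)
```

With `scope_columns = ["project_id", "id"]`, the two patchseq datasets yield different predicates, so writing one no longer matches the other's row.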
diff --git a/planning/prompts/01_config.md b/planning/prompts/01_config.md index aee8659..83a9d86 100644 --- a/planning/prompts/01_config.md +++ b/planning/prompts/01_config.md @@ -1,35 +1,56 @@ -# Agent prompt — Config module (global output path) +# Agent prompt — Config module (discovered config file) > Prepend `00_shared_context.md`. ## Goal -Create `src/connects_common_connectivity/io/config.py` providing a single, version- -controlled, human-readable global output path, with an optional env override. **No new -dependency** — plain pydantic `BaseModel` only (NOT pydantic-settings). +Create `src/connects_common_connectivity/config.py` — at the **package root**, next to +`models.py` and `cli.py`, NOT in `io/`. Configuration is package-wide (cli and future +plotting/analysis read it too), so the general name belongs in the general namespace. +Settings live in **one declarative, version-controlled file** (`ccc_config.yaml`) that every +entry point discovers automatically — no per-notebook setup, no `%run`, no process-global +mutation. The library holds the *mechanism* and validates the file via pydantic; the +*values* live in `ccc_config.yaml` at the repo root. **No new dependency** — plain pydantic +`BaseModel` + PyYAML (already in the tree via LinkML). ## Requirements -1. A `Settings(BaseModel)` class with: - - `output_root: Path` — default `Path("../scratch/em_patchseq_wnm_v1/")` (current value - used across notebooks; confirm by grepping `OUTPUT_ROOT` in `code/*.ipynb`). - - A `load()` classmethod that returns `Settings`, using - `os.environ.get("CCC_OUTPUT_ROOT", )` so CodeOcean can override the path via - env without editing tracked code. - - Designed so more knobs (e.g. `dry_run`, `schema_version_pin`) can be added later. -2. A helper `table_path(settings: Settings, table: str) -> Path` that joins - `output_root / table` (e.g. `"dataset"`, `"dataitem"`, - `"dataitem_dataset_association"`) so notebooks never concatenate path strings. 
Use the - exact subdir names currently in the notebooks. -3. A `describe()` / `__repr__` that prints the resolved config so notebooks can show it at - the top instead of relying on hidden state. +1. A `Settings(BaseModel)`: + - `output_root: Path` (required, no default). + - `dry_run: bool = False`, and room for more knobs (`schema_version_pin`, ...) later. + - **No built-in default output path.** The value comes from the config file. + - `describe()` / `__repr__` printing the resolved config. +2. **File discovery + typed load (the key piece):** + - `find_config_file(start: Path | None = None) -> Path | None` walks up from `cwd` + (or `start`) to the filesystem root looking for `ccc_config.yaml` — same pattern as + `pyproject.toml`/`ruff`/`pytest`. This is what lets a notebook in `code/` find the + repo-root config with zero config code. + - `get_settings() -> Settings` (cache with `functools.lru_cache`): + 1. find `ccc_config.yaml`; if none found, **raise a clear, actionable error** + (`"No ccc_config.yaml found — create one at the repo root with output_root: ..."`). + 2. `yaml.safe_load` it and construct `Settings(**data)` (pydantic validates here). + 3. **Developer escape hatch:** if `CCC_OUTPUT_ROOT` env is set, override + `output_root` with it (env wins over the file, for the path only — it cannot + express other knobs). Document it as override-only, not the primary path. + - Precedence overall: **explicit `settings=` arg (handled by callers) > `CCC_OUTPUT_ROOT` + env > `ccc_config.yaml` > error.** + - Provide a way to clear the cache for tests (e.g. expose `get_settings.cache_clear`). +3. `table_path(settings: Settings, table: str) -> Path` joins `output_root / table` (e.g. + `"dataset"`, `"dataitem"`, `"dataitem_dataset_association"`) using the exact subdir names + in the notebooks, so nothing concatenates path strings. +4. Export `Settings`, `get_settings`, `table_path` from `config.py` (and re-exported from + `io/__init__.py` for convenience). 
`io/` imports them via `from ..config import ...`. + Do NOT add a `configure()` process-global setter — discovery replaces it. ## Tests (`tests/test_config.py`) -- Default `output_root` is the expected path when env var unset. -- `CCC_OUTPUT_ROOT` env var overrides the default. +- `get_settings()` raises the actionable error when no `ccc_config.yaml` is discoverable. +- A `ccc_config.yaml` in a tmp dir is discovered from a nested cwd and loaded/validated. +- `CCC_OUTPUT_ROOT` env overrides only `output_root`; `dry_run` still comes from the file. +- An explicit `settings=` passed to a caller wins over both. - `table_path` joins correctly and returns a `Path`. ## Do not -- Add pydantic-settings or any new dependency. -- Touch `models.py` or schemas. +- Add a built-in default output path, a `configure()` global, or `%run`-style coupling. Add + any dependency beyond pydantic + PyYAML. Touch `models.py` or schemas. ## Report -List the subdir names you found in the notebooks and confirm the default matches. +List the subdir names found in the notebooks (for `table_path`) and the `output_root` +value you put in `ccc_config.yaml`. diff --git a/planning/prompts/02_write_spec.md b/planning/prompts/02_write_spec.md index 31ae603..b9108d0 100644 --- a/planning/prompts/02_write_spec.md +++ b/planning/prompts/02_write_spec.md @@ -18,7 +18,7 @@ Define a dataclass/pydantic model `WriteSpec` with fields: - `required_for_write: list[str]` — slots that must be non-null to write safely (may be stricter than the schema's `required`). - `cross_field_rules: list[str]` — names of cross-field checks (implemented in - `validation.py`); empty for now is fine. + `write_validation.py`); empty for now is fine. Expose `REGISTRY: dict[str, WriteSpec]` keyed by class name, and a `get_spec(model_or_cls) -> WriteSpec` lookup. 
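As a rough illustration of the `WriteSpec` / `REGISTRY` / `get_spec` shape described here, the sketch below uses a frozen dataclass. The field names come from the prompt; the mode string `"scoped_overwrite"`, the `partition_by` values, the `required_for_write` values, and the stand-in `DataSet` / `DataItem` classes are assumptions made only so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteSpec:
    subdir: str
    partition_by: list[str]
    scope_columns: list[str]
    write_mode: str  # assumed values: "scoped_overwrite" or "append_new_by_id"
    required_for_write: list[str] = field(default_factory=list)
    cross_field_rules: list[str] = field(default_factory=list)


# Illustrative entries only; partition_by and required_for_write are guesses.
REGISTRY: dict[str, WriteSpec] = {
    "DataSet": WriteSpec(
        subdir="dataset",
        partition_by=["project_id"],
        scope_columns=["project_id", "id"],
        write_mode="scoped_overwrite",
        required_for_write=["project_id", "id"],
    ),
    "DataItem": WriteSpec(
        subdir="dataitem",
        partition_by=["project_id"],
        scope_columns=["project_id"],
        write_mode="append_new_by_id",
        required_for_write=["project_id", "id"],
    ),
}


def get_spec(model_or_cls: object) -> WriteSpec:
    """Look up the spec by class name, accepting a class or an instance."""
    cls = model_or_cls if isinstance(model_or_cls, type) else type(model_or_cls)
    try:
        return REGISTRY[cls.__name__]
    except KeyError:
        raise KeyError(f"no WriteSpec registered for {cls.__name__}") from None


class DataSet:  # stand-in for the generated model, for this sketch only
    pass


class DataItem:  # stand-in for the generated model, for this sketch only
    pass
```

Keying by class name keeps the registry importable without importing `models.py`; the drift test is then what ties the names back to the generated classes.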
diff --git a/planning/prompts/03_validation.md b/planning/prompts/03_validation.md index c3ff1c9..dfbf6a2 100644 --- a/planning/prompts/03_validation.md +++ b/planning/prompts/03_validation.md @@ -1,9 +1,14 @@ -# Agent prompt — Validation (auto-derived strict submodels) +# Agent prompt — Write-validation (auto-derived strict submodels) > Prepend `00_shared_context.md`. Depends on `write_spec.py`. +## Naming +File is `io/write_validation.py`, NOT `io/validation.py`: this is specifically write-safety +validation coupled to `write_spec`. The generic word "validation" is already claimed by +`cli.py`'s LinkML full-conformance check — keep the two distinct. + ## Goal -Create `src/connects_common_connectivity/io/validation.py` that derives a STRICT pydantic +Create `src/connects_common_connectivity/io/write_validation.py` that derives a STRICT pydantic submodel per class **at runtime** from (a) the generated model in `models.py` and (b) the registry's `required_for_write` + `cross_field_rules`. Single source of truth: nothing is restated from the schema. @@ -18,16 +23,17 @@ is restated from the schema. - Cache the derived class (e.g. `functools.lru_cache`) so it's built once. 2. `validate_for_write(model) -> model` (or list): run the instance through the strict submodel, raising a clear error that names the class, the failing slot/rule, and the - offending value. This runs on the hot write path, so keep it pydantic-only (fast); do - NOT call the LinkML/`cli.py` validator here. -3. Implement a starter cross-field rule registry (a dict name → callable) including: - - `association_dataset_exists`: a `DataItemDataSetAssociation`'s `dataset_id` must - exist among written DataSets for that `project_id`. (May need a reader/lookup; if the - reader module isn't ready, implement the hook and mark it TODO without breaking the - import.) - Add others only as the registry references them. + offending value. 
This runs on the hot write path, so keep it pydantic-only (fast, **no + I/O**); do NOT call the LinkML/`cli.py` validator here. +3. Implement a starter cross-field rule registry (a dict name → callable). Rules here MUST + be pure: they inspect only the model instance in hand, do no I/O, and never read other + tables. Add rules only as the registry references them. + - Do NOT implement `association_dataset_exists` here. It reads written DataSets, so it is + a referential check, not a structural one — it belongs off the hot path as an opt-in + `check_refs` in Phase 4b (`08_analysis.md`), after readers exist. Keeping it out of + this module is what frees Phase 2 from any dependency on Phase 4. -## Tests (`tests/test_validation.py`) +## Tests (`tests/test_write_validation.py`) - A model missing a `required_for_write` slot fails `validate_for_write` before any IO. - A valid model passes and is returned unchanged (round-trip equality on fields). - The generated `models.py` class is unchanged after deriving the strict model diff --git a/planning/prompts/04_writers.md b/planning/prompts/04_writers.md index ab1039e..14d8ea1 100644 --- a/planning/prompts/04_writers.md +++ b/planning/prompts/04_writers.md @@ -1,11 +1,11 @@ # Agent prompt — Writers (dispatch core + typed wrappers) -> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`, `validation.py`. +> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`, `write_validation.py`. ## Relocation first (clean structure) Before writing new code, MOVE the existing backends into `io/` (with re-export shims at the old paths until notebook migration is done): -- `arrow_utils.py` → `io/arrow.py` +- `arrow_utils.py` → `io/arrow_utils.py` - `write_utils.py` → `io/write_utils.py` All new code imports from the `io/` locations. @@ -16,7 +16,8 @@ the registry so notebooks never hand-write `mode` / `predicate` / `partition_by` ## Core `write_models(models, *, settings=None) -> WriteResult`: 1. 
Accept a single model or an iterable; infer the class; require homogeneous type. -2. `settings = settings or Settings.load()`. +2. `settings = settings or get_settings()` (loads the discovered `ccc_config.yaml`; an + explicit `settings=` still wins). 3. Look up the `WriteSpec` via `get_spec`. 4. Validate every model with `validate_for_write` (strict submodel) BEFORE any IO. 5. Convert via `arrow_utils.models_to_table` + `build_arrow_schema`; attach metadata with @@ -32,22 +33,30 @@ the registry so notebooks never hand-write `mode` / `predicate` / `partition_by` passing `project_id` and id column. 8. Return a small result object: rows written/appended, path, mode, predicate used. -## Typed wrappers (one-liners over `write_models`) -`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, -`write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`. -Signatures should be ergonomic (accept the model(s) and optional `settings`). +## Typed wrappers (generated from the registry) +`write_models` is the one real entry point. Provide the discoverable per-class names +(`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, +`write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`) but +**generate them from the registry** with a small factory that binds the class, rather than +hand-writing eight one-liners that can drift from the registry. Hand-write a wrapper only +where the signature is non-uniform (e.g. `write_projection_matrix` accepting the dense +matrix for enrichment). The generated names are re-exported from `io/__init__.py` (Phase 3b). ## Wide feature matrices `CellFeatureMatrix` is wide Parquet. Route it through a matrix-specific path using -`build_cell_feature_matrix_schema` (now in `io/arrow.py`); do not force it into the +`build_cell_feature_matrix_schema` (now in `io/arrow_utils.py`); do not force it into the row-Delta path. 
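Generating the typed wrappers from the registry, as the typed-wrappers section asks, could look like the sketch below. Everything in it is a stand-in: `write_models` is stubbed so the example runs alone, `make_writer` and `_WRAPPER_NAMES` are hypothetical names, and the class check is simplified to a comparison on the class name.

```python
from collections.abc import Callable, Iterable
from typing import Any


def write_models(models: Iterable[Any], *, settings: Any = None) -> dict[str, Any]:
    """Stub for the real dispatch core; it only reports what it was given."""
    batch = list(models)
    return {"class": type(batch[0]).__name__ if batch else None, "rows": len(batch)}


def make_writer(name: str, class_name: str) -> Callable[..., dict[str, Any]]:
    """Return a wrapper that accepts only `class_name` models, then delegates."""

    def writer(models: Any, *, settings: Any = None) -> dict[str, Any]:
        # Accept a single model or a list/tuple of models.
        batch = list(models) if isinstance(models, (list, tuple)) else [models]
        for model in batch:
            if type(model).__name__ != class_name:
                raise TypeError(f"{name} expected {class_name}, got {type(model).__name__}")
        return write_models(batch, settings=settings)

    writer.__name__ = writer.__qualname__ = name
    writer.__doc__ = f"Write {class_name} model(s) via write_models."
    return writer


# Hypothetical name-to-class mapping; the real one would be derived from the registry.
_WRAPPER_NAMES = {"write_dataset": "DataSet", "write_dataitem": "DataItem"}
WRITERS = {name: make_writer(name, cls_name) for name, cls_name in _WRAPPER_NAMES.items()}
write_dataset = WRITERS["write_dataset"]
write_dataitem = WRITERS["write_dataitem"]
```

Because each wrapper only binds a class name and forwards to `write_models`, adding a registry entry is enough to get a new wrapper; only non-uniform signatures would still need hand-written code.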
-## Write-side transforms (`io/transforms.py`) -Create `io/transforms.py` for pre-write enrichment. Port `populate_region_coverage(pmm, -matrix)` from `io/io_plans.md`: derive `region_coverage` from the dense values array, -return a copy of the `ProjectionMeasurementMatrix` (pure function, no mutation, no IO). -`write_projection_matrix` should call it (or accept an already-enriched matrix). Do NOT put -`compare_region_coverage` here — that is read-side analysis (see `08_analysis.md`). +## Write-side transform (section in `writers.py`, not a new module) +Add pre-write enrichment as a clearly-marked section at the top of `writers.py` — do NOT +create `io/transforms.py` yet (single function = premature module). Port +`populate_region_coverage(pmm, matrix)` from `io/io_plans.md`: derive `region_coverage` +from the dense values array, return a copy of the `ProjectionMeasurementMatrix` (pure +function, no mutation, no IO). `write_projection_matrix` calls it (or accepts an +already-enriched matrix). When a second write-side transform appears, relocate the section +into `io/transforms.py` — a pure move, no public-API change (users import via +`io/__init__.py`). Do NOT put `compare_region_coverage` here — that is read-side analysis +(see `08_analysis.md`). ## Reconcile `write_utils.py` Make `append_new_dataitems` the `append_new_by_id` backend. If you must generalize it diff --git a/planning/prompts/05_readers.md b/planning/prompts/05_readers.md index e4725db..851f995 100644 --- a/planning/prompts/05_readers.md +++ b/planning/prompts/05_readers.md @@ -31,10 +31,13 @@ settings=None) -> DataFrame`: if the membership/mapping tables are denormalized that way (check how the `_03`/cluster notebooks write the hierarchy before assuming). -## Note — analysis is a separate module -Do NOT put analysis utils here. `compare_region_coverage(pmms)` goes in `io/analysis.py` -(`08_analysis.md`), and `populate_region_coverage` is a write-side transform -(`io/transforms.py`, `04_writers.md`). 
Readers only read. +## Read-side analysis (section in this file, not a new module) +`compare_region_coverage(pmms)` is read-side analysis and starts as a clearly-marked section +in `readers.py` — do NOT create `io/analysis.py` yet (single function = premature module). +Its implementation is specified in `08_analysis.md`; build it there. When a second analysis +function appears, relocate the section to `io/analysis.py` (pure move, no public-API change). +`populate_region_coverage` is a write-side transform and stays with the writers +(`04_writers.md`), not here. ## Tests (`tests/test_readers.py`) - Round-trip: write models via the writers, read them back scoped, assert equality on diff --git a/planning/prompts/06_notebook_migration.md b/planning/prompts/06_notebook_migration.md index 86955d5..19e5f4b 100644 --- a/planning/prompts/06_notebook_migration.md +++ b/planning/prompts/06_notebook_migration.md @@ -4,22 +4,33 @@ ## Goal Migrate the ETL notebooks in `code/etl_*.ipynb` to use the new IO API. Move bookkeeping -into the library; keep the science logic and verification cells. - -## Per notebook -1. Replace the hardcoded `OUTPUT_ROOT = "../scratch/..."` with: - ```python - from connects_common_connectivity.io.config import Settings - settings = Settings.load() - print(settings) # show resolved output_root at top - ``` +into the library; keep the science logic and verification cells. The output path lives in +ONE file (`ccc_config.yaml`) discovered automatically — notebooks carry no path and no +config cell. + +## First: create the config file +Create `ccc_config.yaml` at the repo root (the single source of truth, version-controlled): +```yaml +output_root: ../scratch/em_patchseq_wnm_v1/ # match the value grep'd from code/*.ipynb +dry_run: false +``` +To repoint local vs CodeOcean, edit this file (or set `CCC_OUTPUT_ROOT`); nothing else +changes. The library finds it by walking up from the notebook's working directory. + +## Per ETL notebook +1. 
Delete the hardcoded `OUTPUT_ROOT = "../scratch/..."` entirely. There is no replacement + config cell and no `%run` — the library discovers `ccc_config.yaml` on its own, so + `write_*` / `read_*` calls need neither a path nor `settings=`. (If a cell wants to show + the resolved config, it may `from connects_common_connectivity.io import get_settings; + print(get_settings())`, but this is optional.) 2. Replace each direct `write_deltalake(... mode=... predicate=... partition_by=...)` call with the matching typed writer (`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, `write_cell_to_cluster_mapping`, `write_projection_matrix`, ...). Delete the now-redundant `mode`/`predicate`/ `partition_by` arguments and their explanatory comments — that logic now lives in the registry. -3. Keep verification cells; update their paths to use `table_path(settings, ...)`. +3. Keep verification cells; update their paths to use + `table_path(get_settings(), ...)`. ## Migrate in this order 1. `etl_*_01_dataset_dataitem.ipynb` (all of minnie, wnm, visp_exc/inh patchseq) — these @@ -34,6 +45,11 @@ the `dataset` table and assert BOTH `visp_exc_patchseq` and `visp_inh_patchseq` exist under `project_id='visp_patchseq'`. Before the fix, the second run wiped the first. Report the before/after row counts. +## After migration — hand off shim removal +Once every notebook imports from the `io/` paths, the re-export shims at `arrow_utils.py`, +`write_utils.py`, `parquet_loader.py` are dead weight. Do TODO 5.4: delete them and confirm +the no-shim test (`07_tests.md`) passes. Report which old paths were still referenced, if any. + ## Do not - Change the science/ETL transformation logic. Fix the `etl_visp_inh_patchseq` data logic beyond the write path — the maintainer said the writer fix is enough for now. 
diff --git a/planning/prompts/07_tests.md b/planning/prompts/07_tests.md index 73489b6..3c306ff 100644 --- a/planning/prompts/07_tests.md +++ b/planning/prompts/07_tests.md @@ -3,24 +3,25 @@ > Prepend `00_shared_context.md`. Run after writers/readers exist (can be built alongside). ## Goal -A focused pytest suite under `tests/` covering the safe-writing guarantees. Use small -synthetic models written to a `tmp_path` Delta root (set `CCC_OUTPUT_ROOT` to `tmp_path`) -so tests never touch real data. +Pull the suite together and fill the gaps. Several cases are already specified in their +owning prompts — do NOT re-specify them here, just ensure they exist and run as one suite: +- Registry↔schema drift → `02_write_spec.md` (`tests/test_write_spec.py`). +- Patchseq shared-partition regression, idempotency, append-new-by-id, predicate + construction → `04_writers.md` (`tests/test_writers.py`). +- Round-trip + cross-dataset reads → `05_readers.md` (`tests/test_readers.py`). +- Strict-validation failures → `03_validation.md` (`tests/test_write_validation.py`). +- Public-API surface → `09_public_api.md` (`tests/test_public_api.py`). -## Required cases -1. **Shared-partition safety (patchseq regression):** write `DataSet(id="A")` and - `DataSet(id="B")` both with `project_id="P"`; assert both rows survive. This is the - core regression for the bug. -2. **Idempotency:** writing the same models twice → no duplicates, no row loss, for both - `overwrite_scoped` and `append_new_by_id`. -3. **Append-new-by-id:** second write with one new + one existing id appends exactly one. -4. **Strict validation:** a model missing a `required_for_write` slot, or violating a - cross-field rule, raises before any file is written (assert the Delta dir is unchanged). -5. **Registry↔schema drift:** (from `02_write_spec.md`) every registry entry's class and - columns exist in `models.py`. -6. **Round-trip:** write → read back via readers → equality on scope columns. -7. 
**Predicate construction:** the predicate is derived from `scope_columns` (assert the - DataSet predicate includes both `project_id` and `id`). +Use small synthetic models written to a `tmp_path` Delta root (set `CCC_OUTPUT_ROOT` to +`tmp_path`) so tests never touch real data. + +## Gaps this prompt owns (not covered elsewhere) +1. **No-shim regression (TODO 5.4):** after migration, assert no module imports the old + paths `arrow_utils`, `write_utils`, `parquet_loader` (grep the repo or import-scan); the + shims must be gone, not lingering. +2. **End-to-end smoke:** a single test exercising write → read → analysis on a tiny fixture, + proving the modules compose. +3. Confirm the whole suite is collected and green together (no per-prompt drift). ## Reporting Run `pytest -q` and paste the summary. Do not mark complete with failures. diff --git a/planning/prompts/08_analysis.md b/planning/prompts/08_analysis.md index 2e79344..6711be5 100644 --- a/planning/prompts/08_analysis.md +++ b/planning/prompts/08_analysis.md @@ -1,13 +1,15 @@ -# Agent prompt — Analysis module (read-side) +# Agent prompt — Read-side analysis + referential check (Phase 4b) > Prepend `00_shared_context.md`. Depends on `readers.py` (uses read outputs). -## Goal -Create `src/connects_common_connectivity/io/analysis.py` for read-side analysis over -already-written tables. This is distinct from `io/transforms.py` (write-side enrichment): -analysis reads finished data and summarizes; it never writes or mutates inputs. +Two things land here, both requiring readers to exist: + +## A. Read-side analysis — `compare_region_coverage` +Add as a clearly-marked section in `io/readers.py` (NOT a new `io/analysis.py` yet — single +function = premature module; relocate to `io/analysis.py` only when a second analysis +function arrives, a pure move with no public-API change). It reads finished data and +summarizes; it never writes or mutates inputs. 
-## Seed function Port `compare_region_coverage(pmms)` from `io/io_plans.md`: - Input: list of `ProjectionMeasurementMatrix` instances with `region_index` and `region_coverage` populated. @@ -17,9 +19,21 @@ Port `compare_region_coverage(pmms)` from `io/io_plans.md`: - Print the summary table shown in `io_plans.md` and return a dict with keys `shared_regions`, `shared_coverage`, `exclusive_counts`. -## Tests (`tests/test_analysis.py`) -- Small synthetic set of PMMs gives the expected shared/exclusive counts. -- Pure: inputs are not mutated. +## B. Opt-in referential check — `check_refs` +This is the home for the referential rule deliberately kept off the hot path in +`03_validation.md`. Implement it as an opt-in step invoked by writers: +- `write_models(..., check_refs=False)` — when True, before writing a + `DataItemDataSetAssociation`, read the `dataset` table (via the readers) and assert each + `dataset_id` exists for that `project_id`; raise a clear error naming the missing id. +- It reads other tables, so it is NOT a strict-submodel validator and never runs on the + default write path. Default `check_refs=False` keeps writes fast. + +## Tests +- `compare_region_coverage`: small synthetic PMM set gives expected shared/exclusive counts; + inputs are not mutated. +- `check_refs`: writing an association whose `dataset_id` is absent raises with + `check_refs=True`, and succeeds (no check) with the default. ## Do not -- Write to disk here. Touch `models.py` or schemas. +- Write to disk in the analysis function. Put referential checks on the default write path. + Touch `models.py` or schemas. diff --git a/planning/prompts/09_public_api.md b/planning/prompts/09_public_api.md new file mode 100644 index 0000000..6c7a8a3 --- /dev/null +++ b/planning/prompts/09_public_api.md @@ -0,0 +1,30 @@ +# Agent prompt — Public API (`io/__init__.py`) + +> Prepend `00_shared_context.md`. Depends on writers (3.1/3.2); readers can be added later. 
+ +## Why +`io/__init__.py` is the single most important file for "user-friendly": it defines what a +user types after `from connects_common_connectivity.io import …` and what shows up in +autocomplete. It also decouples the public surface from internal module layout, so seed +sections can later be split into `transforms.py` / `analysis.py` without breaking imports. + +## Requirements +1. A concise module docstring: one paragraph on the IO layer (note settings come from a + discovered `ccc_config.yaml`) + a 3–5 line usage example (a `write_*` call, a `read_*` + call — no config ceremony needed). +2. Curated re-exports — only the names users should touch: + - config (from the package root, `from ..config import ...`): `get_settings`, `Settings`, + `table_path` + - writers: `write_models` + the generated typed wrappers + - readers: `read_dataset`, `read_dataitem`, `read_features`, + `read_dataitems_for_clusters`, and (when present) `compare_region_coverage` + Do NOT re-export backends (`arrow`, `write_utils`) or internal helpers. +3. Define `__all__` to match exactly the curated list (keeps `dir()` and `*` imports clean). +4. Keep it import-light: no heavy work at import time; just imports + `__all__`. + +## Test (`tests/test_public_api.py`) +- Every name in `__all__` is importable from `connects_common_connectivity.io`. +- No backend/internal module name leaks into `__all__`. + +## Do not +- Re-export internal backends. Touch `models.py` or schemas. 
From d2fe9254a054c49c128735833a6411d7a3d7ee3d Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 9 Jun 2026 16:48:19 -0700 Subject: [PATCH 04/25] final iteration -1 --- planning/ARCHITECTURE.md | 128 +++++++++-------- planning/README.md | 11 +- planning/TODO.md | 135 ++++++++---------- planning/prompts/02_write_spec.md | 30 ++-- .../prompts/{04_writers.md => 03_writers.md} | 24 ++-- .../{09_public_api.md => 04_public_api.md} | 17 ++- .../{03_validation.md => 05_validation.md} | 13 +- planning/prompts/06_notebook_migration.md | 5 +- planning/prompts/07_tests.md | 34 ++--- .../prompts/{05_readers.md => 08_readers.md} | 13 +- .../{08_analysis.md => 09_analysis.md} | 7 +- 11 files changed, 226 insertions(+), 191 deletions(-) rename planning/prompts/{04_writers.md => 03_writers.md} (76%) rename planning/prompts/{09_public_api.md => 04_public_api.md} (65%) rename planning/prompts/{03_validation.md => 05_validation.md} (77%) rename planning/prompts/{05_readers.md => 08_readers.md} (81%) rename planning/prompts/{08_analysis.md => 09_analysis.md} (87%) diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md index 004dece..3966990 100644 --- a/planning/ARCHITECTURE.md +++ b/planning/ARCHITECTURE.md @@ -51,9 +51,12 @@ src/connects_common_connectivity/ write_spec.py # NEW registry — source of truth write_validation.py# NEW auto-derived strict submodels (write-safety validation) arrow_utils.py # MOVED from root (no rename) (models <-> Arrow conversion) - writers.py # NEW write_models() + typed wrappers + write-side transforms - write_utils.py # MOVED from root (append-by-id backend, walk_ancestors) - readers.py # MOVED + folds parquet_loader.py + predicate/cross-dataset reads + writers.py # NEW write_models() + typed wrappers + write_utils.py # MOVED from root (append-by-id backend, walk_ancestors, + # populate_region_coverage) + # --- deferred (see "Later — elaborations"; designs kept, not built yet) --- + parquet_loader.py # MOVED from root (PURE MOVE, not folded 
into readers) + readers.py # NEW predicate-based + cross-dataset reads ``` `config.py` lives at the **package root**, not in `io/`: configuration is package-wide @@ -63,19 +66,19 @@ general namespace next to `models.py`. Conversely the io validator is named coupled to `write_spec`, and the bare word "validation" is already claimed by `cli.py`'s LinkML conformance check — two different validations, so neither owns the generic name. -Seed-stage modules are NOT split out prematurely. Write-side enrichment -(`populate_region_coverage`) starts as a section at the top of `writers.py`; read-side -analysis (`compare_region_coverage`) starts in `readers.py`. Promote either to its own -module (`transforms.py` / `analysis.py`) only when a second function arrives — that move is -a pure relocation with no public-API change because users import from `io/__init__.py`. +Seed-stage modules are NOT split out prematurely. `populate_region_coverage` is **not** a +separate "transforms" module — it lives in `write_utils.py` as a helper the projection +writer calls (it's write plumbing, like `append_new_dataitems`). Read-side +`compare_region_coverage` is deferred entirely (see "Later — elaborations"). Where each existing file goes: - `arrow_utils.py` → `io/arrow_utils.py`. Conversion layer used by `writers.py`. Pure move. - `write_utils.py` → `io/write_utils.py`. `append_new_dataitems` becomes the - `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers and by - cross-dataset reads. Pure move. -- `parquet_loader.py` → folded into `io/readers.py` (Parquet→models with report becomes the - typed-read backend). Pure move/merge. + `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers; + `populate_region_coverage` (ported from `io_plans.md`) is the pre-write projection helper. + Pure move + additions. +- `parquet_loader.py` → `io/parquet_loader.py`. **Pure move, NOT folded into readers** — + deferred with the read-side work. 
- `cli.py` stays at the package root as the `ccc` entry point; it owns the occasional full LinkML conformance check (separate from `io/write_validation.py`, which is the fast write-path check). @@ -181,30 +184,42 @@ concatenates path strings. ## Module 2 — `write_spec.py` (the registry) -An explicit, hand-maintained lookup, one entry per writable class, seeded from the schema -and refined from early experience. It is the source of truth for write/validation -behavior. A test cross-checks it against the LinkML schema so drift fails loudly (the -class names and `project_id`/identifier slots must exist in the generated models). +An explicit, hand-maintained lookup, one entry per writable class. **Build it like a +prototype, not a derivation.** Do not assume every class is scoped-overwrite-with-predicate; +that pattern fits DataSet/Association, but `append_new_by_id` already exists for DataItem +because append was the right behavior there, and other classes may want append or modes we +haven't named yet. For each class, write a small real example in a notebook *first*, see how +it actually wants to be written, and let that experience set the entry. The registry is then +the source of truth, cross-checked against the schema for drift (class names and +`project_id`/identifier slots must exist in the generated models). Each entry declares: - `subdir` — Delta table subdirectory under `output_root` (e.g. `"dataset"`). - `partition_by` — Delta partition columns (e.g. `["project_id"]`). -- `scope_columns` — columns that define the overwrite predicate (the identity within the - shared table). DataSet → `["project_id", "id"]`; DataItemDataSetAssociation → - `["project_id", "dataset_id"]`. -- `write_mode` — `"overwrite_scoped"` (scoped idempotent overwrite) or - `"append_new_by_id"` (the `append_new_dataitems` behavior for DataItem). +- `scope_columns` — for scoped-overwrite classes, the columns that define the predicate (the + identity within the shared table). 
DataSet → `["project_id", "id"]`; + DataItemDataSetAssociation → `["project_id", "dataset_id"]`. May be empty/N-A for + append-mode classes. +- `write_mode` — a small open vocabulary, not a fixed binary: `"overwrite_scoped"`, + `"append_new_by_id"` (the `append_new_dataitems` behavior), and whatever else the + prototyping surfaces. New modes are added when a class's example shows the existing ones + don't fit — `write_mode` is a `Literal` we extend, not a constraint to force classes into. - `required_for_write` — slots that must be present/non-null for a safe write (may be stricter than the schema's own `required`). -- `cross_field_rules` — names of cross-field checks to attach to the strict validator. +- `cross_field_rules` — names of cross-field checks to attach to the strict validator + (validation is layered in after the write path works; see ordering). -Predicate is built from `scope_columns` + the row values, e.g. +For `overwrite_scoped`, the predicate is built from `scope_columns` + the row values, e.g. `"project_id = 'visp_patchseq' AND id = 'visp_exc_patchseq'"`. This is exactly the bug fix: DataSet now carries `id` in its scope. ## Module 3 — `io/write_validation.py` (auto-derived strict submodels) +Built **after** the write path works (priority order: config → write IO → validation). The +writers ship first with a pass-through validation hook; this module swaps the real validator +into that hook. + Decision: **auto-derived** strict submodels — single source of truth. `strict_model_for(cls)` takes the generated pydantic model and returns a subclass that @@ -218,17 +233,19 @@ Hot-path validation is purely structural: required-slot enforcement plus pure cr rules that only inspect the model in hand. 
**Referential checks that read other tables do NOT belong on the hot path.** Example: "an association's `dataset_id` must refer to a DataSet already present for that `project_id`" requires a reader, so it is an opt-in check -(`write_models(..., check_refs=True)`) implemented after readers exist (Phase 4b), not a -strict-submodel validator. This keeps Phase 2 free of any dependency on Phase 4. +(`write_models(..., check_refs=True)`) deferred with the read-side work (it needs a reader), +not a strict-submodel validator. This keeps validation free of any dependency on readers. ## Module 4 — `writers.py` (+ `io/write_utils.py`, `io/arrow_utils.py`) A single dispatch core plus thin typed wrappers: - `write_models(models, *, settings=None)` — infers the class, looks up the registry, - validates each model via the strict submodel, converts via `io/arrow_utils.py`, attaches - LinkML metadata, then writes per `write_mode` (scoped overwrite with the - registry-built predicate, or `append_new_by_id` via the backend). + converts via `io/arrow_utils.py`, attaches LinkML metadata, then writes per `write_mode` + (scoped overwrite with the registry-built predicate, or `append_new_by_id` via the + backend). It calls a **validation hook** before writing; in the write-IO phase that hook is + a pass-through, and Module 3 (built afterward) swaps in the real strict validator with no + restructuring. - Typed wrappers for discoverability (`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`). `write_models` is the one @@ -238,37 +255,34 @@ A single dispatch core plus thin typed wrappers: hand-written wrapper is justified only where a class needs a non-uniform signature (e.g. `write_projection_matrix` taking the dense matrix for enrichment). 
- `io/write_utils.py` (moved from root): `append_new_dataitems` is the `append_new_by_id` - backend; `walk_ancestors` is used by membership/mapping writers. Generalize - `append_new_dataitems` only if needed (e.g. parametrize the partition column), without - breaking callers. -- **Write-side enrichment** lives as a section at the top of `writers.py` (not yet its own - module): `populate_region_coverage(pmm, matrix)` from `io_plans.md` derives - `region_coverage` from the dense values. `write_projection_matrix` calls it (or accepts - an already-enriched matrix). Keep it a pure function (no IO, no mutation of input). - Split into `io/transforms.py` only when a second transform appears. + backend; `walk_ancestors` is used by membership/mapping writers; `populate_region_coverage` + (ported from `io_plans.md`) is the pre-write projection helper. `write_projection_matrix` + calls `populate_region_coverage` (or accepts an already-enriched matrix). Keep it a pure + function (no IO, no mutation of input). Generalize `append_new_dataitems` only if needed + (e.g. parametrize the partition column) without breaking callers. Rationale: this is write + plumbing the projection writer needs — same shelf as `append_new_dataitems` — not a + separate "transforms" concern. Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` (in `io/arrow_utils.py`) and a matrix-specific writer path, since they are wide Parquet, not row-modeled Delta tables. -## Module 5 — `readers.py` (folds `parquet_loader.py`) - -Two layers: - -- Thin predicate-based readers mirroring the write spec: `read_dataset`, `read_dataitem`, - `read_features`, scoped by `project_id`/`dataset_id`, returning polars/pandas. Typed - reads (Parquet→models) use the folded-in `load_parquet_to_models`. -- Flexible cross-dataset / cross-schema reads now that datasets share tables. 
Flagship - example: "read all DataItems that have either a ClusterMembership or a - CellToClusterMapping to a given set of clusters" — a cross-table query joining - membership/mapping tables on cluster ids and returning the union of matching DataItems, - regardless of source dataset/modality. Users can still drop to raw - `polars.read_delta` for ad-hoc queries; the readers are conveniences, not a wall. -- **Read-side analysis** lives as a section in `readers.py` to start: - `compare_region_coverage(pmms)` from `io_plans.md` (shared vs exclusive region coverage - across matrices). It reads finished data and summarizes — the mirror image of write-side - enrichment, which augments data on the way in. Split into `io/analysis.py` only when a - second analysis function appears. +## Later — elaborations (deferred; design kept, not built yet) + +These are **not actionable in this round.** Priority now is config → write IO → validation → +notebook migration. Once the write path is solid and notebooks are migrated, revisit: + +- **Readers** (`io/readers.py`): predicate-based readers mirroring the write spec + (`read_dataset`, `read_dataitem`, `read_features` scoped by `project_id`/`dataset_id`), + plus flexible cross-dataset reads now that datasets share tables — flagship: "all DataItems + with either a ClusterMembership or a CellToClusterMapping to a given cluster set." Users can + always drop to raw `polars.read_delta`; readers are conveniences, not a wall. When this + starts, `parquet_loader.py` is **moved** to `io/parquet_loader.py` (pure move, not folded) + and used as the typed-read backend. +- **Read-side analysis**: `compare_region_coverage(pmms)` from `io_plans.md` (shared vs + exclusive region coverage across matrices) — reads finished data and summarizes. +- **Opt-in referential check** (`write_models(..., check_refs=True)`): needs a reader, so it + rides with the read-side work. 
## Notebook migration (no logic, no schema, no models.py changes) @@ -286,6 +300,8 @@ after a re-run as the migration's acceptance test. - Idempotency: writing the same models twice yields no duplicates and no row loss. - Shared-partition safety: writing dataset B does not remove dataset A's rows when they share a `project_id` (the patchseq regression test). +- Per-class write example: every writable class has a small notebook example exercising its + registry entry (the prototyping evidence behind its `write_mode`/`scope_columns`). - Strict-validator tests: missing `required_for_write` slot or failing cross-field rule - raises before any write touches disk. -- Round-trip: write models → read back via readers → equality on scope columns. + raises before any write touches disk (added with Module 3). +- Round-trip (write → read back → equality on scope columns): deferred with readers. diff --git a/planning/README.md b/planning/README.md index eba657e..967e2ad 100644 --- a/planning/README.md +++ b/planning/README.md @@ -6,10 +6,13 @@ ConnectsCommonConnectivity. Created 2026-06-01. - `ARCHITECTURE.md` — the design (source of truth). - `TODO.md` — ordered, dependency-aware task list. - `prompts/` — one prompt per work item (`00_shared_context.md` is prepended to every other - prompt and holds the **hard rules**: don't edit `models.py` or `schemas/*.yaml`): - `01_config` · `02_write_spec` · `03_validation` · `04_writers` · `05_readers` · - `06_notebook_migration` · `07_tests` · `08_analysis` (read-side analysis + opt-in - referential check) · `09_public_api` (`io/__init__.py`). + prompt and holds the **hard rules**: don't edit `models.py` or `schemas/*.yaml`). + +**This round (priority order):** `01_config` → `02_write_spec` → `03_writers` → +`05_validation` → `06_notebook_migration` → `07_tests` (+ `04_public_api` alongside writers). 
+
+**Deferred (design kept, not actionable yet):** `08_readers`, `09_analysis` — start only
+after the write path is done and notebooks are migrated.
 
 ## How to run an item
 Hand the implementing agent `00_shared_context.md` + the specific prompt, point it at
diff --git a/planning/TODO.md b/planning/TODO.md
index ddd2f2f..0cd04eb 100644
--- a/planning/TODO.md
+++ b/planning/TODO.md
@@ -3,85 +3,76 @@
 Ordered, dependency-aware. Design lives in `ARCHITECTURE.md`; the implementing prompt for
 each item is in `prompts/`. Hard rules: see `prompts/00_shared_context.md`.
 
-## Phase 0 — groundwork
-- [ ] **0.1 Config module** (`config.py`, package root — not `io/`) — pydantic `Settings` loaded from a discovered
-  `ccc_config.yaml` (walk-up like `pyproject.toml`), cached `get_settings()`, `table_path()`
-  helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` env (developer escape hatch, path
-  only) > `ccc_config.yaml` > **error, no default path**. No `configure()` global, no `%run`.
-  Deps: pydantic + PyYAML (already present). Prompt: `01_config.md`. Blocks everything that writes.
+**Priority for this round: config → write IO → validation → notebook migration.**
+Readers and analysis are deferred — see "Later — elaborations" (not actionable yet).
 
-## Phase 1 — registry (the hub)
-- [ ] **1.1 Write spec registry** (`io/write_spec.py`) — one entry per writable class:
-  `subdir`, `partition_by`, `scope_columns`, `write_mode`, `required_for_write`,
-  `cross_field_rules`. Seed DataSet/DataItem/Association first, then the rest.
-  Prompt: `02_write_spec.md`. Blocked by: none (reads generated models).
-- [ ] **1.2 Registry↔schema drift test** — assert every entry's class + scope/identifier
-  slots exist in `models.py`. Part of `02_write_spec.md`.
+## Phase 0 — config
+- [ ] **0.1 Config module** (`config.py`, package root — not `io/`) — pydantic `Settings`
+  loaded from a discovered `ccc_config.yaml` (walk-up like `pyproject.toml`), cached
+  `get_settings()`, `table_path()` helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` env
+  (developer escape hatch, path only) > `ccc_config.yaml` > **error, no default path**. No
+  `configure()` global, no `%run`. Deps already present. Prompt: `01_config.md`. Blocks writes.
 
-## Phase 2 — validation (structural only)
-- [ ] **2.1 Strict submodel derivation** (`io/write_validation.py`) — `strict_model_for(cls)`
-  flips `required_for_write` to required + attaches *pure* `cross_field_rules`. No I/O, no
-  reading other tables. Auto-derived from generated models + registry. Prompt:
-  `03_validation.md`. Blocked by 1.1. (Referential checks are 4b.2, not here.)
-
-## Phase 3 — writers
-- [ ] **3.0 Relocate backends into `io/`** — move `arrow_utils.py`→`io/arrow_utils.py`,
-  `write_utils.py`→`io/write_utils.py`, with re-export shims at old paths. Part of
-  `04_writers.md`.
-- [ ] **3.1 Write dispatch core** (`io/writers.py`) — `write_models(models, settings=...)`:
-  infer class → registry lookup → strict-validate → arrow convert → metadata → write per
-  `write_mode`. Reuses `io/arrow_utils.py` + `io/write_utils.py`. Prompt: `04_writers.md`.
-  Blocked by 0.1, 1.1, 2.1.
-- [ ] **3.2 Typed wrappers** — generated from the registry (not hand-maintained); hand-write
-  only non-uniform signatures (e.g. `write_projection_matrix`). Part of `04_writers.md`.
-- [ ] **3.3 Reconcile `write_utils.py`** — make `append_new_dataitems` the
-  `append_new_by_id` backend without breaking current callers. Part of `04_writers.md`.
-- [ ] **3.4 Write-side transform** — `populate_region_coverage` as a section in
-  `writers.py` (pre-write enrichment of ProjectionMeasurementMatrix). Part of `04_writers.md`.
+## Phase 1 — write IO (prototype per class)
+Approach this like prototyping. **Do not assume every class is scoped-overwrite-with-
+predicate.** For each writable class, add a small real write example to a notebook, see how
+it actually wants to be written, and let that set its registry entry. `append_new_by_id`
+already exists for DataItem; other classes may want append or modes not yet named.
 
-## Phase 3b — public API
-- [ ] **3b.1 `io/__init__.py`** — curated exports, module docstring, `__all__`. Defines what
-  users type after `from connects_common_connectivity.io import …`. Prompt: `09_public_api.md`.
-  Blocked by 3.1.
+- [ ] **1.1 Write spec registry** (`io/write_spec.py`) — one entry per class: `subdir`,
+  `partition_by`, `scope_columns`, `write_mode` (open `Literal`, extend as prototyping
+  surfaces new modes), `required_for_write`, `cross_field_rules`. Seed
+  DataSet/DataItem/Association; add others as their examples are built. Prompt: `02_write_spec.md`.
+- [ ] **1.2 Registry↔schema drift test** — class + scope/identifier slots exist in `models.py`.
+- [ ] **1.3 Relocate write backends into `io/`** — `arrow_utils.py`→`io/arrow_utils.py`,
+  `write_utils.py`→`io/write_utils.py` (re-export shims). `populate_region_coverage` lands in
+  `write_utils.py` (the projection writer calls it), NOT a transforms module. Part of `03_writers.md`.
+- [ ] **1.4 Write dispatch core** (`io/writers.py`) — `write_models` + registry-generated
+  typed wrappers; dispatch on `write_mode`. Include a **pass-through validation hook** so
+  Phase 2 can slot the real validator in without restructuring. Prompt: `03_writers.md`.
+  Blocked by 0.1, 1.1.
+- [ ] **1.5 Per-class write examples in notebooks** — the prototyping evidence that informs
+  1.1; one small example per writable class. Part of `02_write_spec.md` / `03_writers.md`.
 
-## Phase 4 — readers
-- [ ] **4.0 Fold `parquet_loader.py` into `io/readers.py`** (re-export shim at old path).
-  Part of `05_readers.md`.
-- [ ] **4.1 Predicate-based readers** (`io/readers.py`) — `read_dataset`, `read_dataitem`,
-  `read_features` scoped by project/dataset. Prompt: `05_readers.md`. Blocked by 1.1.
-- [ ] **4.2 Cross-dataset reads** — flagship: DataItems with ClusterMembership OR
-  CellToClusterMapping to a given cluster set. Part of `05_readers.md`.
+## Phase 2 — validation (after write works)
+- [ ] **2.1 Strict submodel derivation** (`io/write_validation.py`) — `strict_model_for(cls)`
+  flips `required_for_write` to required + attaches *pure* `cross_field_rules` (no I/O); wire
+  `validate_for_write` into `write_models`, replacing the pass-through hook. Prompt:
+  `05_validation.md`. Blocked by 1.1, 1.4. (Referential checks deferred with readers.)
 
-## Phase 4b — analysis & referential checks (need readers)
-- [ ] **4b.1 Read-side analysis** — `compare_region_coverage` as a section in `readers.py`
-  (read-side overlap summary). Prompt: `08_analysis.md`. Blocked by 4.1.
-- [ ] **4b.2 Opt-in referential check** — `write_models(..., check_refs=True)` verifies an
-  association's `dataset_id` exists among written DataSets. Uses readers; off the hot path.
-  Part of `08_analysis.md`. Blocked by 4.1.
+## Phase 3 — notebook migration
+- [ ] **3.0 Create `ccc_config.yaml`** at repo root — single source of truth for
+  `output_root` (+ `dry_run`). Part of `06_notebook_migration.md`.
+- [ ] **3.1 Migrate `_01_dataset_dataitem` notebooks** — delete hardcoded `OUTPUT_ROOT` (no
+  config cell; library discovers `ccc_config.yaml`) + typed writers; fixes the patchseq
+  DataSet overwrite. Prompt: `06_notebook_migration.md`. Blocked by 1.x (2.x preferred).
+- [ ] **3.2 Migrate feature / cluster / mapping / projection notebooks.** Same prompt.
+- [ ] **3.3 Patchseq regression check** — re-run exc then inh; assert both DataSet rows coexist.
+- [ ] **3.4 Remove write-side re-export shims** — delete shims at `arrow_utils.py`,
+  `write_utils.py` once no notebook/test imports them; test asserts no old path is referenced.
+  Blocked by 3.1–3.2.
 
-## Phase 5 — notebook migration
-- [ ] **5.0 Create `ccc_config.yaml`** at the repo root — the single, version-controlled
-  source of truth for `output_root` (+ `dry_run`). Part of `06_notebook_migration.md`.
-- [ ] **5.1 Migrate `_01_dataset_dataitem` notebooks** — delete hardcoded `OUTPUT_ROOT`
-  (no config cell; library discovers `ccc_config.yaml`) + typed writers; fixes the patchseq
-  DataSet overwrite. Prompt: `06_notebook_migration.md`. Blocked by 3.x.
-- [ ] **5.2 Migrate feature / cluster / mapping / projection notebooks.** Same prompt.
-- [ ] **5.3 Patchseq regression check** — re-run exc then inh; assert both DataSet rows
-  coexist. Acceptance test for the migration.
-- [ ] **5.4 Remove re-export shims** — delete shims at `arrow_utils.py`, `write_utils.py`,
-  `parquet_loader.py` once no notebook/test imports them. Add a test asserting no old import
-  path is referenced anywhere. Blocked by 5.1–5.2.
+## Phase 4 — write-side tests & docs
+- [ ] **4.1 Write-side test suite** — drift, patchseq shared-partition regression, idempotency,
+  append-new-by-id, predicate construction, per-class example smoke. Prompt: `07_tests.md`.
+- [ ] **4.2 Update README / usage docs** for the write API. (Ask before large edits.)
 
-## Phase 6 — tests & docs
-- [ ] **6.1 Test suite** — see `07_tests.md`. Pulls together the cases already specified in
-  `02` (drift), `04`/`06` (regression) rather than re-specifying them.
-- [ ] **6.2 Update README / usage docs** for the new IO API. (Ask before large edits.)
+## Later — elaborations (NOT actionable yet)
+Deferred until the write path is done and notebooks migrated. Designs kept in `ARCHITECTURE.md`
+and prompts `08_readers.md` / `09_analysis.md` for reference; do not start these now.
+- **Readers** (`io/readers.py`) — predicate-based + cross-dataset reads. `parquet_loader.py`
+  is **moved** to `io/parquet_loader.py` (pure move, NOT folded) when this starts.
+- **Read-side analysis** — `compare_region_coverage`.
+- **Opt-in referential check** — `write_models(..., check_refs=True)`; needs readers.
 
 ## Decisions locked (2026-06-01)
-- Config: declarative `ccc_config.yaml` at repo root, discovered by walk-up and validated by
+- Config: declarative `ccc_config.yaml` at repo root, discovered by walk-up, validated by
   pydantic; no per-notebook setup, no `%run`, no global. Precedence: explicit arg > env
-  (escape hatch) > `ccc_config.yaml` > **error (no default path)**. Deps already present.
-- Write spec: explicit registry, source of truth, schema-checked for drift.
-- Validation: auto-derived strict submodels; hot path is structural-only, no I/O.
-- Packaging: seed transforms/analysis inside writers/readers; split out on second function.
-  Public surface is `io/__init__.py`. Scope of this session: planning docs + prompts only.
+  (escape hatch) > file > **error (no default path)**. `config.py` at package root, not `io/`.
+- Write spec: explicit registry, prototyped per class via notebook examples; `write_mode` is
+  an open vocabulary, not a forced overwrite assumption.
+- `populate_region_coverage` lives in `write_utils.py` (write plumbing), not a transforms module.
+- Validation: built after the write path; auto-derived strict submodels; structural-only, no I/O.
+  Named `io/write_validation.py` (cli owns the generic LinkML conformance check).
+- Readers, analysis, referential check: deferred. `parquet_loader.py` is a pure move, not a fold.
+- Public surface is `io/__init__.py`. Scope of this session: planning docs + prompts only.
diff --git a/planning/prompts/02_write_spec.md b/planning/prompts/02_write_spec.md
index b9108d0..5c2550e 100644
--- a/planning/prompts/02_write_spec.md
+++ b/planning/prompts/02_write_spec.md
@@ -5,16 +5,27 @@
 ## Goal
 Create `src/connects_common_connectivity/io/write_spec.py`: an explicit registry,
 one entry per writable class, that is the single source of truth for how each class is
-written and validated. Plus a test that the registry cannot drift from the schema.
+written. Plus a test that the registry cannot drift from the schema.
+
+## Approach: prototype, don't assume
+Do NOT assume every class is scoped-overwrite-with-predicate. That pattern fits
+DataSet/Association, but `append_new_by_id` already exists for DataItem because append was
+right there, and other classes may want append or modes not yet named. **For each class,
+build a small real write example in a notebook first** (paired with `03_writers.md`), see how
+it actually wants to be written, and let that decide the entry. `write_mode` is an open
+`Literal` you extend when an example doesn't fit the existing modes — not a constraint to
+force classes into. Seed the three correctness-critical classes below; add the rest as their
+examples are built rather than all at once up front.
 
 ## Registry shape
 Define a dataclass/pydantic model `WriteSpec` with fields:
 - `model_cls` — the generated pydantic class (import from `..models`).
 - `subdir: str` — Delta subdir under `output_root` (must match the notebook paths).
 - `partition_by: list[str]` — Delta partition columns.
-- `scope_columns: list[str]` — columns defining the overwrite predicate (identity within
-  the shared table).
-- `write_mode: Literal["overwrite_scoped", "append_new_by_id"]`.
+- `scope_columns: list[str]` — for scoped-overwrite classes, columns defining the predicate
+  (identity within the shared table). May be empty for append-mode classes.
+- `write_mode: Literal[...]` — start with `"overwrite_scoped"`, `"append_new_by_id"`; add new
+  members when a class's example shows neither fits. Keep it easy to extend.
 - `required_for_write: list[str]` — slots that must be non-null to write safely
   (may be stricter than the schema's `required`).
 - `cross_field_rules: list[str]` — names of cross-field checks (implemented in
@@ -34,11 +45,12 @@ Expose `REGISTRY: dict[str, WriteSpec]` keyed by class name, and a
 
 Then add entries for `Cluster`, `ClusterHierarchy`, `ClusterMembership`, `CellFeatureSet`,
 `CellFeatureDefinition`, `CellToClusterMapping`, `MappingSet`,
-`ProjectionMeasurementMatrix`, etc. — derive `subdir`/`scope_columns` by reading how each
-is written in `code/etl_*.ipynb` (grep `write_deltalake` and `predicate=`). Where a
-notebook's predicate looks wrong (like the DataSet case), prefer the correct scope and
-note it in a comment. `CellFeatureMatrix` is wide Parquet, not row Delta — mark it so the
-writer routes it to the matrix path (`build_cell_feature_matrix_schema`).
+`ProjectionMeasurementMatrix`, etc. **as each one's write example is prototyped** — read how
+it's written today in `code/etl_*.ipynb` (grep `write_deltalake` and `predicate=`), try it
+through the writer, and only then fix its entry. Where a notebook's current predicate looks
+wrong (like the DataSet case), prefer the correct scope and note it in a comment.
+`CellFeatureMatrix` is wide Parquet, not row Delta — mark it so the writer routes it to the
+matrix path (`build_cell_feature_matrix_schema`).
 
 ## Drift test (`tests/test_write_spec.py`)
 - Every `REGISTRY` key resolves to a real class in `models.py`.
diff --git a/planning/prompts/04_writers.md b/planning/prompts/03_writers.md
similarity index 76%
rename from planning/prompts/04_writers.md
rename to planning/prompts/03_writers.md
index 14d8ea1..b6a213d 100644
--- a/planning/prompts/04_writers.md
+++ b/planning/prompts/03_writers.md
@@ -1,6 +1,7 @@
 # Agent prompt — Writers (dispatch core + typed wrappers)
 
-> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`, `write_validation.py`.
+> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`. (Validation is
+> built afterward and slots into the pass-through hook below — not a dependency here.)
 
 ## Relocation first (clean structure)
 Before writing new code, MOVE the existing backends into `io/` (with re-export shims at the
@@ -19,7 +20,9 @@ the registry so notebooks never hand-write `mode` / `predicate` / `partition_by`
 2. `settings = settings or get_settings()` (loads the discovered `ccc_config.yaml`; an
    explicit `settings=` still wins).
 3. Look up the `WriteSpec` via `get_spec`.
-4. Validate every model with `validate_for_write` (strict submodel) BEFORE any IO.
+4. Call a **validation hook** before any IO. In this phase the hook is a pass-through
+   (identity) function — validation is built afterward (`05_validation.md`) and swaps the
+   real `validate_for_write` into this hook with no restructuring. Wire the call site now.
 5. Convert via `arrow_utils.models_to_table` + `build_arrow_schema`; attach metadata with
    `attach_linkml_metadata(linkml_class=)`.
 6. Resolve path with `table_path(settings, spec.subdir)`.
@@ -47,16 +50,13 @@ matrix for enrichment). The generated names are re-exported from `io/__init__.py
 `build_cell_feature_matrix_schema` (now in `io/arrow_utils.py`); do not force it into the
 row-Delta path.
 
-## Write-side transform (section in `writers.py`, not a new module)
-Add pre-write enrichment as a clearly-marked section at the top of `writers.py` — do NOT
-create `io/transforms.py` yet (single function = premature module). Port
-`populate_region_coverage(pmm, matrix)` from `io/io_plans.md`: derive `region_coverage`
-from the dense values array, return a copy of the `ProjectionMeasurementMatrix` (pure
-function, no mutation, no IO). `write_projection_matrix` calls it (or accepts an
-already-enriched matrix). When a second write-side transform appears, relocate the section
-into `io/transforms.py` — a pure move, no public-API change (users import via
-`io/__init__.py`). Do NOT put `compare_region_coverage` here — that is read-side analysis
-(see `08_analysis.md`).
+## Projection pre-write helper (in `io/write_utils.py`, not a transforms module)
+`populate_region_coverage(pmm, matrix)` is write plumbing the projection writer needs — same
+shelf as `append_new_dataitems` — so it lives in `io/write_utils.py`, NOT a separate
+`transforms` module. Port it from `io/io_plans.md`: derive `region_coverage` from the dense
+values array, return a copy of the `ProjectionMeasurementMatrix` (pure function, no mutation,
+no IO). `write_projection_matrix` calls it (or accepts an already-enriched matrix). Do NOT
+port `compare_region_coverage` — that is read-side analysis and is deferred (`09_analysis.md`).
 
 ## Reconcile `write_utils.py`
 Make `append_new_dataitems` the `append_new_by_id` backend. If you must generalize it
diff --git a/planning/prompts/09_public_api.md b/planning/prompts/04_public_api.md
similarity index 65%
rename from planning/prompts/09_public_api.md
rename to planning/prompts/04_public_api.md
index 6c7a8a3..9bdbe3a 100644
--- a/planning/prompts/09_public_api.md
+++ b/planning/prompts/04_public_api.md
@@ -1,24 +1,23 @@
 # Agent prompt — Public API (`io/__init__.py`)
 
-> Prepend `00_shared_context.md`. Depends on writers (3.1/3.2); readers can be added later.
+> Prepend `00_shared_context.md`. Depends on writers (1.4); reader exports added later when
+> the read-side work happens.
 
 ## Why
 `io/__init__.py` is the single most important file for "user-friendly": it defines what a
 user types after `from connects_common_connectivity.io import …` and what shows up in
-autocomplete. It also decouples the public surface from internal module layout, so seed
-sections can later be split into `transforms.py` / `analysis.py` without breaking imports.
+autocomplete. It also decouples the public surface from internal module layout.
 
 ## Requirements
 1. A concise module docstring: one paragraph on the IO layer (note settings come from a
-   discovered `ccc_config.yaml`) + a 3–5 line usage example (a `write_*` call, a `read_*`
-   call — no config ceremony needed).
-2. Curated re-exports — only the names users should touch:
+   discovered `ccc_config.yaml`) + a 3–5 line usage example (a `write_*` call — no config
+   ceremony needed).
+2. Curated re-exports — only the names users should touch (write-side for now):
    - config (from the package root, `from ..config import ...`): `get_settings`,
      `Settings`, `table_path`
    - writers: `write_models` + the generated typed wrappers
-   - readers: `read_dataset`, `read_dataitem`, `read_features`,
-     `read_dataitems_for_clusters`, and (when present) `compare_region_coverage`
-   Do NOT re-export backends (`arrow`, `write_utils`) or internal helpers.
+   - reader names are added here when readers land (deferred) — leave a clear TODO comment.
+   Do NOT re-export backends (`arrow_utils`, `write_utils`) or internal helpers.
 3. Define `__all__` to match exactly the curated list (keeps `dir()` and `*` imports clean).
 4. Keep it import-light: no heavy work at import time; just imports + `__all__`.
 
diff --git a/planning/prompts/03_validation.md b/planning/prompts/05_validation.md
similarity index 77%
rename from planning/prompts/03_validation.md
rename to planning/prompts/05_validation.md
index dfbf6a2..2a5689b 100644
--- a/planning/prompts/03_validation.md
+++ b/planning/prompts/05_validation.md
@@ -1,6 +1,7 @@
 # Agent prompt — Write-validation (auto-derived strict submodels)
 
-> Prepend `00_shared_context.md`. Depends on `write_spec.py`.
+> Prepend `00_shared_context.md`. Depends on `write_spec.py` and `writers.py` (built after
+> the write path; wires into the pass-through validation hook left in `write_models`).
 
 ## Naming
 File is `io/write_validation.py`, NOT `io/validation.py`: this is specifically write-safety
@@ -25,13 +26,15 @@ is restated from the schema.
    submodel, raising a clear error that names the class, the failing slot/rule, and the
    offending value. This runs on the hot write path, so keep it pydantic-only (fast,
    **no I/O**); do NOT call the LinkML/`cli.py` validator here.
-3. Implement a starter cross-field rule registry (a dict name → callable). Rules here MUST
+3. **Wire it into `write_models`:** replace the pass-through validation hook left by
+   `03_writers.md` with `validate_for_write`. This is the only change to the writer.
+4. Implement a starter cross-field rule registry (a dict name → callable). Rules here MUST
    be pure: they inspect only the model instance in hand, do no I/O, and never read other
    tables. Add rules only as the registry references them.
    - Do NOT implement `association_dataset_exists` here. It reads written DataSets, so it is
-     a referential check, not a structural one — it belongs off the hot path as an opt-in
-     `check_refs` in Phase 4b (`08_analysis.md`), after readers exist. Keeping it out of
-     this module is what frees Phase 2 from any dependency on Phase 4.
+     a referential check, not a structural one — it is deferred with the read-side work as an
+     opt-in `check_refs` (`09_analysis.md`). Keeping it out keeps validation free of any
+     dependency on readers.
 
 ## Tests (`tests/test_write_validation.py`)
 - A model missing a `required_for_write` slot fails `validate_for_write` before any IO.
diff --git a/planning/prompts/06_notebook_migration.md b/planning/prompts/06_notebook_migration.md
index 19e5f4b..10d1497 100644
--- a/planning/prompts/06_notebook_migration.md
+++ b/planning/prompts/06_notebook_migration.md
@@ -46,9 +46,10 @@ exist under `project_id='visp_patchseq'`. Before the fix, the second run wiped t
 Report the before/after row counts.
 
 ## After migration — hand off shim removal
-Once every notebook imports from the `io/` paths, the re-export shims at `arrow_utils.py`,
-`write_utils.py`, `parquet_loader.py` are dead weight. Do TODO 5.4: delete them and confirm
+Once every notebook imports from the `io/` paths, the write-side re-export shims at
+`arrow_utils.py` and `write_utils.py` are dead weight. Do TODO 3.4: delete them and confirm
 the no-shim test (`07_tests.md`) passes. Report which old paths were still referenced, if any.
+(`parquet_loader.py` is untouched this round — it moves with the deferred read-side work.)
 
 ## Do not
 - Change the science/ETL transformation logic. Fix the `etl_visp_inh_patchseq` data logic
diff --git a/planning/prompts/07_tests.md b/planning/prompts/07_tests.md
index 3c306ff..8d1a688 100644
--- a/planning/prompts/07_tests.md
+++ b/planning/prompts/07_tests.md
@@ -1,27 +1,29 @@
-# Agent prompt — Test suite
+# Agent prompt — Write-side test suite
 
-> Prepend `00_shared_context.md`. Run after writers/readers exist (can be built alongside).
+> Prepend `00_shared_context.md`. Run after the write path + validation exist. (Reader/
+> analysis tests are deferred with that work.)
 
 ## Goal
-Pull the suite together and fill the gaps. Several cases are already specified in their
-owning prompts — do NOT re-specify them here, just ensure they exist and run as one suite:
+Pull the write-side suite together and fill the gaps. Several cases are already specified in
+their owning prompts — do NOT re-specify them here, just ensure they exist and run as one
+suite:
 - Registry↔schema drift → `02_write_spec.md` (`tests/test_write_spec.py`).
 - Patchseq shared-partition regression, idempotency, append-new-by-id, predicate
-  construction → `04_writers.md` (`tests/test_writers.py`).
-- Round-trip + cross-dataset reads → `05_readers.md` (`tests/test_readers.py`).
-- Strict-validation failures → `03_validation.md` (`tests/test_write_validation.py`).
-- Public-API surface → `09_public_api.md` (`tests/test_public_api.py`).
+  construction → `03_writers.md` (`tests/test_writers.py`).
+- Strict-validation failures → `05_validation.md` (`tests/test_write_validation.py`).
+- Public-API surface → `04_public_api.md` (`tests/test_public_api.py`).
 
-Use small synthetic models written to a `tmp_path` Delta root (set `CCC_OUTPUT_ROOT` to
-`tmp_path`) so tests never touch real data.
+Use small synthetic models written to a `tmp_path` Delta root (point `CCC_OUTPUT_ROOT` at
+`tmp_path`, or a tmp `ccc_config.yaml`) so tests never touch real data.
 
 ## Gaps this prompt owns (not covered elsewhere)
-1. **No-shim regression (TODO 5.4):** after migration, assert no module imports the old
-   paths `arrow_utils`, `write_utils`, `parquet_loader` (grep the repo or import-scan); the
-   shims must be gone, not lingering.
-2. **End-to-end smoke:** a single test exercising write → read → analysis on a tiny fixture,
-   proving the modules compose.
-3. Confirm the whole suite is collected and green together (no per-prompt drift).
+1. **Per-class write-example smoke:** every writable class in the registry has a tiny write
+   that round-trips through `write_models` without error (the prototyping evidence as a test).
+2. **No-shim regression (TODO 3.4):** after migration, assert no module imports the old
+   write-side paths `arrow_utils`, `write_utils` (grep / import-scan); the shims must be gone.
+3. Confirm the suite is collected and green together (no per-prompt drift).
+
+Round-trip and cross-dataset read tests are deferred to the read-side work.
 
 ## Reporting
 Run `pytest -q` and paste the summary. Do not mark complete with failures.
diff --git a/planning/prompts/05_readers.md b/planning/prompts/08_readers.md
similarity index 81%
rename from planning/prompts/05_readers.md
rename to planning/prompts/08_readers.md
index 851f995..c75f8d4 100644
--- a/planning/prompts/05_readers.md
+++ b/planning/prompts/08_readers.md
@@ -1,10 +1,15 @@
 # Agent prompt — Readers (predicate-based + cross-dataset)
 
+> **DEFERRED — not actionable this round.** Priority is config → write IO → validation →
+> notebook migration. This design is kept for reference; do not start it until the write path
+> is done and notebooks are migrated.
+>
 > Prepend `00_shared_context.md`. Depends on `write_spec.py` (+ `config.py`).
 
 ## Relocation first (clean structure)
-Fold `parquet_loader.py` into `io/readers.py` (the typed Parquet→models backend). Keep a
-re-export shim at the old `parquet_loader` path until notebook migration is done.
+**Move** `parquet_loader.py` → `io/parquet_loader.py` as a PURE MOVE (re-export shim at the
+old path). Do NOT fold it into `io/readers.py` — keep it a standalone module; `readers.py`
+imports `load_parquet_to_models` from it where typed reads are wanted.
 
 ## Goal
 Create `src/connects_common_connectivity/io/readers.py`: convenient reads over the shared
@@ -34,10 +39,10 @@ settings=None) -> DataFrame`:
 ## Read-side analysis (section in this file, not a new module)
 `compare_region_coverage(pmms)` is read-side analysis and starts as a clearly-marked section
 in `readers.py` — do NOT create `io/analysis.py` yet (single function = premature module).
-Its implementation is specified in `08_analysis.md`; build it there. When a second analysis
+Its implementation is specified in `09_analysis.md`; build it there. When a second analysis
 function appears, relocate the section to `io/analysis.py` (pure move, no public-API change).
 `populate_region_coverage` is a write-side transform and stays with the writers
-(`04_writers.md`), not here.
+(`03_writers.md`), not here.
 
 ## Tests (`tests/test_readers.py`)
 - Round-trip: write models via the writers, read them back scoped, assert equality on
diff --git a/planning/prompts/08_analysis.md b/planning/prompts/09_analysis.md
similarity index 87%
rename from planning/prompts/08_analysis.md
rename to planning/prompts/09_analysis.md
index 6711be5..30304d5 100644
--- a/planning/prompts/08_analysis.md
+++ b/planning/prompts/09_analysis.md
@@ -1,5 +1,8 @@
-# Agent prompt — Read-side analysis + referential check (Phase 4b)
+# Agent prompt — Read-side analysis + referential check
 
+> **DEFERRED — not actionable this round.** Rides with the read-side work, after config →
+> write IO → validation → notebook migration. Design kept for reference.
+>
 > Prepend `00_shared_context.md`. Depends on `readers.py` (uses read outputs).
 
 Two things land here, both requiring readers to exist:
@@ -21,7 +24,7 @@ Port `compare_region_coverage(pmms)` from `io/io_plans.md`:
 
 ## B. Opt-in referential check — `check_refs`
 This is the home for the referential rule deliberately kept off the hot path in
-`03_validation.md`. Implement it as an opt-in step invoked by writers:
+`05_validation.md`. Implement it as an opt-in step invoked by writers:
 - `write_models(..., check_refs=False)` — when True, before writing a
   `DataItemDataSetAssociation`, read the `dataset` table (via the readers) and assert each
   `dataset_id` exists for that `project_id`; raise a clear error naming the missing id.
From dd2e940f09c305fd3aa93e7389c0bc0ee6eabe89 Mon Sep 17 00:00:00 2001
From: reneyagmur
Date: Tue, 9 Jun 2026 17:04:52 -0700
Subject: [PATCH 05/25] final plans

---
 .../instructions/changelog.instructions.md | 54 +++++++++
 CHANGELOG.md | 20 +++
 planning/ARCHITECTURE.md | 15 +--
 planning/README.md | 24 ++--
 planning/TODO.md | 114 ++++++++----------
 planning/prompts/00_shared_context.md | 9 +-
 planning/prompts/02_write_spec.md | 9 +-
 .../prompts/{ => _deferred}/08_readers.md | 0
 .../prompts/{ => _deferred}/09_analysis.md | 0
 9 files changed, 158 insertions(+), 87 deletions(-)
 create mode 100644 .github/instructions/changelog.instructions.md
 create mode 100644 CHANGELOG.md
 rename planning/prompts/{ => _deferred}/08_readers.md (100%)
 rename planning/prompts/{ => _deferred}/09_analysis.md (100%)

diff --git a/.github/instructions/changelog.instructions.md b/.github/instructions/changelog.instructions.md
new file mode 100644
index 0000000..5b3ebda
--- /dev/null
+++ b/.github/instructions/changelog.instructions.md
@@ -0,0 +1,54 @@
+---
+description: "Use when editing CHANGELOG.md, drafting release notes, or summarizing user-visible changes. Enforces Keep a Changelog format, SemVer scope, and the user-voice rule."
+applyTo: "CHANGELOG.md"
+---
+# Changelog rules
+
+The changelog is the user-facing log of what changed in
+`connects_common_connectivity`. It is **not** an internal work journal.
+
+## Format
+- [Keep a Changelog 1.1.0](https://keepachangelog.com/en/1.1.0/) +
+  [SemVer](https://semver.org/spec/v2.0.0.html).
+- All new entries go under `## [Unreleased]` until a release is cut.
+- Use only the standard sections: `Added`, `Changed`, `Deprecated`, `Removed`,
+  `Fixed`, `Security`. Omit empty sections in released versions; keep them as
+  empty headers under `[Unreleased]` so contributors see the slots.
+- Newest version on top. Releases are `## [X.Y.Z] - YYYY-MM-DD`.
+
+## Voice and scope (the rule that actually matters)
+- Write in **user voice**: what changed for someone who imports
+  `connects_common_connectivity`, runs the `ccc` CLI, or follows the README.
+- One bullet per change. Past tense, present-perfect-style is fine
+  (`Added …`, `Moved …`, `Fixed …`). No first person, no narrative.
+- **Include**: new public names, removed public names, moved import paths,
+  changed signatures, changed defaults, behavior fixes a user could observe,
+  new CLI flags, new config keys, dropped Python versions.
+- **Exclude**: internal refactors, test-only changes, planning-doc edits,
+  prompt/agent-customization edits, dev-tooling tweaks, comment-only changes.
+  If a user couldn't notice it, it doesn't belong here.
+- If a change has both an internal and a user-visible side, log only the
+  user-visible side.
+
+## Linking
+- Reference public names in backticks: `` `write_models` ``, `` `io.writers` ``.
+- Link to issues/PRs only when they add information a user would want
+  (`#123`); do not link to internal planning docs.
+
+## Deprecations and removals
+- Announce in `Deprecated` first (one release minimum) before moving to
+  `Removed`, except for genuinely unused or never-released names.
+- Name the replacement when there is one: "Deprecated `X`; use `Y` instead."
+
+## Releasing (manual for now)
+1. Rename `## [Unreleased]` to `## [X.Y.Z] - YYYY-MM-DD` (today's date).
+2. Drop empty subsections from the released block.
+3. Add a fresh `## [Unreleased]` at the top with all six empty sub-headers.
+4. Bump the version in `pyproject.toml` in the same commit.
+
+## Anti-patterns
+- "Refactored internals." — internal, drop it.
+- "Updated planning docs." — internal, drop it.
+- "Various fixes." — split into specific bullets or drop.
+- "Added new feature." — name the public symbol or describe the behavior.
+- Long prose paragraphs — one bullet, one change.
diff --git a/CHANGELOG.md b/CHANGELOG.md
new file mode 100644
index 0000000..10652f1
--- /dev/null
+++ b/CHANGELOG.md
@@ -0,0 +1,20 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+### Added
+
+### Changed
+
+### Deprecated
+
+### Removed
+
+### Fixed
+
+### Security
diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md
index 3966990..1cc8d32 100644
--- a/planning/ARCHITECTURE.md
+++ b/planning/ARCHITECTURE.md
@@ -71,14 +71,11 @@ separate "transforms" module — it lives in `write_utils.py` as a helper the pr
 writer calls (it's write plumbing, like `append_new_dataitems`). Read-side
 `compare_region_coverage` is deferred entirely (see "Later — elaborations").
 
-Where each existing file goes:
-- `arrow_utils.py` → `io/arrow_utils.py`. Conversion layer used by `writers.py`. Pure move.
-- `write_utils.py` → `io/write_utils.py`. `append_new_dataitems` becomes the
-  `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers;
-  `populate_region_coverage` (ported from `io_plans.md`) is the pre-write projection helper.
-  Pure move + additions.
-- `parquet_loader.py` → `io/parquet_loader.py`. **Pure move, NOT folded into readers** —
-  deferred with the read-side work.
+Module placement summary (the operational "how to move them" lives in
+`prompts/03_writers.md` so it is not restated in three places):
+- `arrow_utils.py`, `write_utils.py` → `io/` as backends to `writers.py` (W3).
+- `parquet_loader.py` → `io/parquet_loader.py` is a **pure move, deferred** with the
+  read-side work; do NOT move it now.
 - `cli.py` stays at the package root as the `ccc` entry point; it owns the occasional
   full LinkML conformance check (separate from `io/write_validation.py`, which is the
   fast write-path check).
@@ -87,7 +84,7 @@ Where each existing file goes:
 
 Migration safety: while notebooks are being migrated, the moved modules may keep one-line
 re-export shims at their old import paths (e.g. `from .io.arrow_utils import *`) so nothing breaks
-mid-transition. Shim removal is a tracked task (TODO 5.4), gated by a test that asserts no
+mid-transition. Shim removal is a tracked task (TODO W6), gated by a test that asserts no
 old import path is referenced anywhere once migration is complete — otherwise the two import
 paths linger and become exactly the clutter this redesign removes.
 
diff --git a/planning/README.md b/planning/README.md
index 967e2ad..d947d1e 100644
--- a/planning/README.md
+++ b/planning/README.md
@@ -4,15 +4,25 @@ How we're building the user-friendly IO layer (write / read / validation) for
 ConnectsCommonConnectivity. Created 2026-06-01.
 
 - `ARCHITECTURE.md` — the design (source of truth).
-- `TODO.md` — ordered, dependency-aware task list.
-- `prompts/` — one prompt per work item (`00_shared_context.md` is prepended to every other
-  prompt and holds the **hard rules**: don't edit `models.py` or `schemas/*.yaml`).
+- `TODO.md` — ordered, flat task list (W1–W8).
+- `prompts/` — one prompt per work item. `00_shared_context.md` is prepended to every other
+  prompt and holds the **hard rules** (don't edit `models.py` or `schemas/*.yaml`).
+- `prompts/_deferred/` — designs kept for reference; not actionable this round.
 
-**This round (priority order):** `01_config` → `02_write_spec` → `03_writers` →
-`05_validation` → `06_notebook_migration` → `07_tests` (+ `04_public_api` alongside writers).
+## TODO ↔ prompt map
 
-**Deferred (design kept, not actionable yet):** `08_readers`, `09_analysis` — start only
-after the write path is done and notebooks are migrated.
+| TODO | Prompt | What it owns | +|------|-------------------------------------|-------------------------------------------------| +| W1 | `01_config.md` | `config.py` + `ccc_config.yaml` discovery | +| W2 | `02_write_spec.md` | Registry (seed 3 classes) + drift test | +| W3 | `03_writers.md` | Relocation, writers, per-class prototyping | +| W4 | `04_public_api.md` | `io/__init__.py` curated surface | +| W5 | `05_validation.md` | Strict submodels + hook swap | +| W6 | `06_notebook_migration.md` | Migrate notebooks, regression, shim removal | +| W7 | `07_tests.md` | Write-side suite gaps | +| W8 | (no prompt) | README / usage docs update | +| L1 | `_deferred/08_readers.md` | Readers (deferred) | +| L2 | `_deferred/09_analysis.md` | Read-side analysis + `check_refs` (deferred) | ## How to run an item Hand the implementing agent `00_shared_context.md` + the specific prompt, point it at diff --git a/planning/TODO.md b/planning/TODO.md index 0cd04eb..4de287f 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -1,74 +1,58 @@ # IO Layer — TODO -Ordered, dependency-aware. Design lives in `ARCHITECTURE.md`; the implementing prompt for -each item is in `prompts/`. Hard rules: see `prompts/00_shared_context.md`. - -**Priority for this round: config → write IO → validation → notebook migration.** -Readers and analysis are deferred — see "Later — elaborations" (not actionable yet). - -## Phase 0 — config -- [ ] **0.1 Config module** (`config.py`, package root — not `io/`) — pydantic `Settings` - loaded from a discovered `ccc_config.yaml` (walk-up like `pyproject.toml`), cached - `get_settings()`, `table_path()` helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` env - (developer escape hatch, path only) > `ccc_config.yaml` > **error, no default path**. No - `configure()` global, no `%run`. Deps already present. Prompt: `01_config.md`. Blocks writes. - -## Phase 1 — write IO (prototype per class) -Approach this like prototyping. 
**Do not assume every class is scoped-overwrite-with- -predicate.** For each writable class, add a small real write example to a notebook, see how -it actually wants to be written, and let that set its registry entry. `append_new_by_id` -already exists for DataItem; other classes may want append or modes not yet named. - -- [ ] **1.1 Write spec registry** (`io/write_spec.py`) — one entry per class: `subdir`, - `partition_by`, `scope_columns`, `write_mode` (open `Literal`, extend as prototyping - surfaces new modes), `required_for_write`, `cross_field_rules`. Seed - DataSet/DataItem/Association; add others as their examples are built. Prompt: `02_write_spec.md`. -- [ ] **1.2 Registry↔schema drift test** — class + scope/identifier slots exist in `models.py`. -- [ ] **1.3 Relocate write backends into `io/`** — `arrow_utils.py`→`io/arrow_utils.py`, - `write_utils.py`→`io/write_utils.py` (re-export shims). `populate_region_coverage` lands in - `write_utils.py` (the projection writer calls it), NOT a transforms module. Part of `03_writers.md`. -- [ ] **1.4 Write dispatch core** (`io/writers.py`) — `write_models` + registry-generated - typed wrappers; dispatch on `write_mode`. Include a **pass-through validation hook** so - Phase 2 can slot the real validator in without restructuring. Prompt: `03_writers.md`. - Blocked by 0.1, 1.1. -- [ ] **1.5 Per-class write examples in notebooks** — the prototyping evidence that informs - 1.1; one small example per writable class. Part of `02_write_spec.md` / `03_writers.md`. - -## Phase 2 — validation (after write works) -- [ ] **2.1 Strict submodel derivation** (`io/write_validation.py`) — `strict_model_for(cls)` - flips `required_for_write` to required + attaches *pure* `cross_field_rules` (no I/O); wire - `validate_for_write` into `write_models`, replacing the pass-through hook. Prompt: - `05_validation.md`. Blocked by 1.1, 1.4. (Referential checks deferred with readers.) 
- -## Phase 3 — notebook migration -- [ ] **3.0 Create `ccc_config.yaml`** at repo root — single source of truth for - `output_root` (+ `dry_run`). Part of `06_notebook_migration.md`. -- [ ] **3.1 Migrate `_01_dataset_dataitem` notebooks** — delete hardcoded `OUTPUT_ROOT` (no - config cell; library discovers `ccc_config.yaml`) + typed writers; fixes the patchseq - DataSet overwrite. Prompt: `06_notebook_migration.md`. Blocked by 1.x (2.x preferred). -- [ ] **3.2 Migrate feature / cluster / mapping / projection notebooks.** Same prompt. -- [ ] **3.3 Patchseq regression check** — re-run exc then inh; assert both DataSet rows coexist. -- [ ] **3.4 Remove write-side re-export shims** — delete shims at `arrow_utils.py`, - `write_utils.py` once no notebook/test imports them; test asserts no old path is referenced. - Blocked by 3.1–3.2. - -## Phase 4 — write-side tests & docs -- [ ] **4.1 Write-side test suite** — drift, patchseq shared-partition regression, idempotency, - append-new-by-id, predicate construction, per-class example smoke. Prompt: `07_tests.md`. -- [ ] **4.2 Update README / usage docs** for the write API. (Ask before large edits.) - -## Later — elaborations (NOT actionable yet) -Deferred until the write path is done and notebooks migrated. Designs kept in `ARCHITECTURE.md` -and prompts `08_readers.md` / `09_analysis.md` for reference; do not start these now. -- **Readers** (`io/readers.py`) — predicate-based + cross-dataset reads. `parquet_loader.py` - is **moved** to `io/parquet_loader.py` (pure move, NOT folded) when this starts. -- **Read-side analysis** — `compare_region_coverage`. -- **Opt-in referential check** — `write_models(..., check_refs=True)`; needs readers. +Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design lives in +`ARCHITECTURE.md`. Hard rules: see `prompts/00_shared_context.md`. + +**Priority for this round: W1 → W7. 
Readers and analysis are deferred.** + +## This round (write path → migration → tests) + +- [ ] **W1 — Config** (`prompts/01_config.md`) — `config.py` at the package root, pydantic + `Settings` loaded from a discovered `ccc_config.yaml` (walk-up like `pyproject.toml`), + cached `get_settings()`, `table_path()` helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` + env > `ccc_config.yaml` > error. No `configure()` global, no `%run`. Blocks W3+. +- [ ] **W2 — Write spec registry** (`prompts/02_write_spec.md`) — `io/write_spec.py`: one + entry per writable class (`subdir`, `partition_by`, `scope_columns`, `write_mode`, + `required_for_write`, `cross_field_rules`). Seed DataSet/DataItem/Association now; add + others as W3 prototypes them. Includes the registry↔schema drift test. +- [ ] **W3 — Writers + relocation + per-class prototyping** (`prompts/03_writers.md`) — + Move `arrow_utils.py`/`write_utils.py` into `io/` (re-export shims at old paths). Build + `io/writers.py` (`write_models` + registry-generated typed wrappers) with a pass-through + validation hook for W5. Land `populate_region_coverage` in `io/write_utils.py`. For each + writable class, add a small real write example to a notebook and let it set the W2 entry. + Blocked by W1, W2. +- [ ] **W4 — Public API** (`prompts/04_public_api.md`) — `io/__init__.py`: curated + re-exports + `__all__`. The user-facing surface; defines what autocomplete shows. Blocked + by W3. +- [ ] **W5 — Write validation** (`prompts/05_validation.md`) — `io/write_validation.py`: + `strict_model_for(cls)` flips `required_for_write` to required + attaches pure + `cross_field_rules` (no I/O). Swap `validate_for_write` into the W3 hook. Blocked by W2, W3. +- [ ] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Create + `ccc_config.yaml` at repo root. Migrate every ETL notebook to typed writers; delete + hardcoded `OUTPUT_ROOT` and per-cell `mode`/`predicate`/`partition_by`. 
Run the patchseq + regression (exc then inh, both DataSet rows must coexist). Remove the W3 re-export shims + and confirm nothing imports the old paths. Blocked by W3 (W5 preferred). +- [ ] **W7 — Write-side test suite** (`prompts/07_tests.md`) — Drift, patchseq regression, + idempotency, append-new-by-id, predicate construction, per-class example smoke, no-shim + regression, public-API surface. Owns only the gaps not specified by W2/W3/W4/W5. +- [ ] **W8 — README / usage docs** — Update README for the write API. No prompt; small task. + Ask before large edits. + +## Deferred (do not start; design kept for reference) + +Designs live in `ARCHITECTURE.md` and `prompts/_deferred/`. Pick up only after W1–W7 land. + +- **L1 — Readers** (`prompts/_deferred/08_readers.md`) — `io/readers.py` (predicate-based + + cross-dataset). `parquet_loader.py` is **moved** to `io/parquet_loader.py` (pure move, + not folded) when this starts. +- **L2 — Read-side analysis + opt-in referential check** + (`prompts/_deferred/09_analysis.md`) — `compare_region_coverage` and + `write_models(..., check_refs=True)`. ## Decisions locked (2026-06-01) - Config: declarative `ccc_config.yaml` at repo root, discovered by walk-up, validated by pydantic; no per-notebook setup, no `%run`, no global. Precedence: explicit arg > env - (escape hatch) > file > **error (no default path)**. `config.py` at package root, not `io/`. + (escape hatch) > file > error (no default path). `config.py` at package root, not `io/`. - Write spec: explicit registry, prototyped per class via notebook examples; `write_mode` is an open vocabulary, not a forced overwrite assumption. - `populate_region_coverage` lives in `write_utils.py` (write plumbing), not a transforms module. 
diff --git a/planning/prompts/00_shared_context.md b/planning/prompts/00_shared_context.md index 3807014..fc6da43 100644 --- a/planning/prompts/00_shared_context.md +++ b/planning/prompts/00_shared_context.md @@ -12,10 +12,11 @@ This file is only the rules of the room. task seems to need a new slot, STOP and report what you need and why. 3. **Single source of truth = the LinkML schema / generated models.** Read field definitions from `models.py`; never restate them. -4. **IO code lives under `src/connects_common_connectivity/io/`.** Existing root - modules (`arrow_utils.py`, `write_utils.py`, `parquet_loader.py`) are MOVED there and - wrapped as backends — never reimplemented. `cli.py` and `models.py` stay at root; so does - `config.py` (package-wide settings, not IO-specific) and plotting stays in +4. **IO code lives under `src/connects_common_connectivity/io/`.** Write-side root + modules (`arrow_utils.py`, `write_utils.py`) are MOVED there and wrapped as backends in + W3 — never reimplemented. `parquet_loader.py` is a **deferred move** that rides with the + read-side work; do NOT relocate it during W1–W7. `cli.py` and `models.py` stay at root; + so does `config.py` (package-wide settings, not IO-specific) and plotting stays in `code/utils.py`. Exact layout: ARCHITECTURE.md → "Target io/ structure". 5. When you move a module, leave a one-line re-export shim at its old path until notebook migration is done, so nothing breaks mid-transition. diff --git a/planning/prompts/02_write_spec.md b/planning/prompts/02_write_spec.md index 5c2550e..1a78508 100644 --- a/planning/prompts/02_write_spec.md +++ b/planning/prompts/02_write_spec.md @@ -14,8 +14,13 @@ right there, and other classes may want append or modes not yet named. **For eac build a small real write example in a notebook first** (paired with `03_writers.md`), see how it actually wants to be written, and let that decide the entry. 
`write_mode` is an open `Literal` you extend when an example doesn't fit the existing modes — not a constraint to -force classes into. Seed the three correctness-critical classes below; add the rest as their -examples are built rather than all at once up front. +force classes into. + +**Sequencing:** in W2 (this prompt) seed *only* the three correctness-critical classes +below — enough to unblock W3. The remaining entries are added during W3, where the writer +exists to prototype against; that loop (notebook example → registry entry) is W3's job, not +this one's. Trying to fill the whole registry up front contradicts "prototype, don't +assume." ## Registry shape Define a dataclass/pydantic model `WriteSpec` with fields: diff --git a/planning/prompts/08_readers.md b/planning/prompts/_deferred/08_readers.md similarity index 100% rename from planning/prompts/08_readers.md rename to planning/prompts/_deferred/08_readers.md diff --git a/planning/prompts/09_analysis.md b/planning/prompts/_deferred/09_analysis.md similarity index 100% rename from planning/prompts/09_analysis.md rename to planning/prompts/_deferred/09_analysis.md From e1f8d3a28ec9bf6535722799b5429544f430b6fe Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Wed, 10 Jun 2026 00:58:37 +0000 Subject: [PATCH 06/25] config for global outdir and open to more config settings --- ccc_config.yaml | 5 + code/etl_wnm_exc_01_dataset_dataitem.ipynb | 100 ++++++++- planning/TODO.md | 12 +- src/connects_common_connectivity/config.py | 180 ++++++++++++++++ .../io/__init__.py | 13 ++ tests/test_config.py | 195 ++++++++++++++++++ 6 files changed, 493 insertions(+), 12 deletions(-) create mode 100644 ccc_config.yaml create mode 100644 src/connects_common_connectivity/config.py create mode 100644 tests/test_config.py diff --git a/ccc_config.yaml b/ccc_config.yaml new file mode 100644 index 0000000..9256860 --- /dev/null +++ b/ccc_config.yaml @@ -0,0 +1,5 @@ +# Package-wide settings for ConnectsCommonConnectivity. 
+# Discovered by walking up from cwd (pyproject.toml/ruff/pytest pattern). +# Edit this file (or set CCC_OUTPUT_ROOT) to repoint writers/readers. +output_root: scratch/em_patchseq_wnm_v1/ +dry_run: false diff --git a/code/etl_wnm_exc_01_dataset_dataitem.ipynb b/code/etl_wnm_exc_01_dataset_dataitem.ipynb index 01fa280..8e1dec4 100644 --- a/code/etl_wnm_exc_01_dataset_dataitem.ipynb +++ b/code/etl_wnm_exc_01_dataset_dataitem.ipynb @@ -12,14 +12,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:48:14.065449Z", - "iopub.status.busy": "2026-04-30T23:48:14.065189Z", - "iopub.status.idle": "2026-04-30T23:48:15.118656Z", - "shell.execute_reply": "2026-04-30T23:48:15.117861Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", @@ -38,7 +31,96 @@ " DataItemDataSetAssociation,\n", " Modality,\n", ")\n", - "from connects_common_connectivity.write_utils import append_new_dataitems" + "from connects_common_connectivity.write_utils import append_new_dataitems\n", + "import connects_common_connectivity.config as c" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "OUTPUT_ROOT = c.output_root()" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'../scratch/em_patchseq_wnm_v1/'" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "OUTPUT_ROOT" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "output_root=PosixPath('scratch/em_patchseq_wnm_v1') dry_run=False\n", + "dataset table path: scratch/em_patchseq_wnm_v1/dataset\n" + ] + } + ], + "source": [ + "settings = get_settings()\n", + "print(settings)\n", + "print(\"dataset table path:\", table_path(settings, \"dataset\"))" + ] + }, + 
{ + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "output_root=PosixPath('/tmp/ccc-smoke') dry_run=False\n" + ] + } + ], + "source": [ + "import os, importlib, connects_common_connectivity.config as c\n", + "os.environ[\"CCC_OUTPUT_ROOT\"] = \"/tmp/ccc-smoke\"\n", + "c.get_settings.cache_clear()\n", + "print(c.get_settings())" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "output_root=PosixPath('scratch/em_patchseq_wnm_v1') dry_run=False\n" + ] + } + ], + "source": [ + "del os.environ[\"CCC_OUTPUT_ROOT\"]\n", + "c.get_settings.cache_clear()\n", + "print(c.get_settings())" ] }, { diff --git a/planning/TODO.md b/planning/TODO.md index 4de287f..0bf5b1f 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -7,10 +7,16 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li ## This round (write path → migration → tests) -- [ ] **W1 — Config** (`prompts/01_config.md`) — `config.py` at the package root, pydantic +- [x] **W1 — Config** (`prompts/01_config.md`) — `config.py` at the package root, pydantic `Settings` loaded from a discovered `ccc_config.yaml` (walk-up like `pyproject.toml`), - cached `get_settings()`, `table_path()` helper. Precedence: explicit arg > `CCC_OUTPUT_ROOT` - env > `ccc_config.yaml` > error. No `configure()` global, no `%run`. Blocks W3+. + cached `get_settings()`, `table_path()` helper, plus `output_root()` convenience that + returns the path relative to cwd (notebooks in `code/` see `../scratch/...`). Relative + values in the file are anchored at the config file's directory using `os.path.abspath` + (not `Path.resolve`, so Code Ocean's `scratch -> /scratch` symlink isn't followed). + Precedence: explicit arg > `CCC_OUTPUT_ROOT` env > `ccc_config.yaml` > error. No + `configure()` global, no `%run`. 
Re-exported from `io/__init__.py`. + `ccc_config.yaml` seeded at repo root with `output_root: scratch/em_patchseq_wnm_v1/`. + Tests: `tests/test_config.py` (14 tests, all passing). - [ ] **W2 — Write spec registry** (`prompts/02_write_spec.md`) — `io/write_spec.py`: one entry per writable class (`subdir`, `partition_by`, `scope_columns`, `write_mode`, `required_for_write`, `cross_field_rules`). Seed DataSet/DataItem/Association now; add diff --git a/src/connects_common_connectivity/config.py b/src/connects_common_connectivity/config.py new file mode 100644 index 0000000..f31510c --- /dev/null +++ b/src/connects_common_connectivity/config.py @@ -0,0 +1,180 @@ +"""Package-wide settings discovered from a repo-root ``ccc_config.yaml``. + +Configuration is a *mechanism* here; the *values* live in a single +version-controlled ``ccc_config.yaml`` at the repo root. Every entry point +(CLI, writers/readers, notebooks, future plotting/analysis) calls +:func:`get_settings`, which walks up from ``cwd`` to find that file, +validates it with pydantic, and returns a cached :class:`Settings`. + +No notebook setup cell, no ``%run``, no process-global mutation. + +Resolution precedence (highest wins): + +1. An explicit ``settings=`` argument passed by a caller. +2. ``CCC_OUTPUT_ROOT`` environment variable (overrides ``output_root`` only; + it cannot express structured knobs like ``dry_run``). +3. The discovered ``ccc_config.yaml``. +4. Otherwise: a clear, actionable error. 
+""" + +from __future__ import annotations + +import os +from functools import lru_cache +from pathlib import Path +from typing import Optional + +import yaml +from pydantic import BaseModel, Field + +CONFIG_FILENAME = "ccc_config.yaml" + + +class Settings(BaseModel): + """Validated, package-wide settings loaded from ``ccc_config.yaml``.""" + + output_root: Path = Field( + ..., + description="Root directory under which Delta/Parquet tables are written.", + ) + dry_run: bool = Field( + default=False, + description="If True, callers should log intended writes instead of executing them.", + ) + + model_config = {"extra": "forbid"} + + def describe(self) -> str: + """Return a human-readable summary of the resolved settings.""" + return ( + f"Settings(output_root={self.output_root!s}, dry_run={self.dry_run})" + ) + + def __repr__(self) -> str: # pragma: no cover - trivial + return self.describe() + + +def find_config_file( + start: Optional[Path] = None, + filename: str = CONFIG_FILENAME, +) -> Optional[Path]: + """Walk up from ``start`` (default: ``cwd``) to the filesystem root looking + for ``filename``. + + Returns the resolved path to the first match, or ``None`` if not found. + Mirrors the discovery pattern used by ``pyproject.toml``, ``ruff``, and + ``pytest`` — a notebook in ``code/`` finds the repo-root config with zero + config code. + """ + here = (start or Path.cwd()).resolve() + for candidate in (here, *here.parents): + path = candidate / filename + if path.is_file(): + return path + return None + + +@lru_cache(maxsize=1) +def get_settings() -> Settings: + """Discover ``ccc_config.yaml``, validate it, and return cached settings. + + Raises ``RuntimeError`` with an actionable message if no config file is + discoverable from the current working directory. + + Tests can call ``get_settings.cache_clear()`` to force re-discovery. 
+ """ + path = find_config_file() + if path is None: + raise RuntimeError( + f"No {CONFIG_FILENAME} found — create one at the repo root with " + "output_root: . Discovery walks up from the current working " + "directory, like pyproject.toml/ruff/pytest." + ) + + raw = yaml.safe_load(path.read_text()) or {} + if not isinstance(raw, dict): + raise RuntimeError( + f"{path}: expected a YAML mapping at the top level, got {type(raw).__name__}." + ) + + config_dir = path.parent + + if "output_root" in raw and raw["output_root"] is not None: + raw["output_root"] = _anchor_path(raw["output_root"], config_dir) + + env_override = os.environ.get("CCC_OUTPUT_ROOT") + if env_override: + # Env values come from the user's shell; anchor to cwd so they are + # cwd-independent thereafter (matches shell intuition). + raw["output_root"] = _anchor_path(env_override, Path.cwd()) + + return Settings(**raw) + + +def _anchor_path(value, base: Path) -> Path: + """Return ``value`` as an absolute :class:`Path`, anchored at ``base`` if relative. + + Uses :func:`os.path.abspath` rather than :meth:`Path.resolve`: abspath + normalizes the path without following symlinks, so a symlinked + ``scratch -> /scratch`` doesn't suddenly point outside the repo and + relative-path output stays sensible (e.g. ``../scratch/x`` from ``code/``). + """ + p = Path(value) + if not p.is_absolute(): + p = base / p + return Path(os.path.abspath(p)) + + +def table_path(settings: Settings, table: str) -> Path: + """Resolve the on-disk path for a named Delta/Parquet table subdir. + + ``table`` should be one of the canonical subdir names used by the + notebooks (e.g. ``"dataset"``, ``"dataitem"``, + ``"dataitem_dataset_association"``, ``"cellfeatureset"``, + ``"cellfeaturematrix"``, ``"cluster"``, ``"clusterhierarchy"``, + ``"clustermembership"``, ``"mappingset"``, ``"celltoclustermapping"``, + ``"projectionmeasurementmatrix"``). Callers pass the exact name so + nothing concatenates path strings ad hoc. 
+ """ + return Path(settings.output_root) / table + + +def output_root(settings: Optional[Settings] = None, *, absolute: bool = False) -> str: + """Return ``output_root`` as a string with a trailing ``/``. + + Resolution rule (the bit that makes notebooks Just Work): a relative + ``output_root`` in ``ccc_config.yaml`` is anchored at the config file's + directory (the repo root), not at ``cwd``. So a notebook running in + ``code/`` and a script running at the repo root both point at the same + place. By default this function then returns the path **relative to the + current working directory**, so a notebook in ``code/`` sees + ``"../scratch/em_patchseq_wnm_v1/"`` while a process at the repo root + sees ``"scratch/em_patchseq_wnm_v1/"``. Pass ``absolute=True`` to get the + fully resolved absolute path instead. + + Prefer :func:`table_path` for new code — it returns a typed :class:`Path` + for a named table subdir and is cwd-independent. + """ + s = settings if settings is not None else get_settings() + abs_path = Path(s.output_root) + if not abs_path.is_absolute(): + abs_path = Path(os.path.abspath(abs_path)) + if absolute: + text = str(abs_path) + else: + try: + text = os.path.relpath(abs_path, Path.cwd()) + except ValueError: + # Different drives on Windows — fall back to absolute. + text = str(abs_path) + return text if text.endswith("/") else text + "/" + + +__all__ = [ + "CONFIG_FILENAME", + "Settings", + "find_config_file", + "get_settings", + "output_root", + "table_path", +] diff --git a/src/connects_common_connectivity/io/__init__.py b/src/connects_common_connectivity/io/__init__.py index e69de29..c4018fc 100644 --- a/src/connects_common_connectivity/io/__init__.py +++ b/src/connects_common_connectivity/io/__init__.py @@ -0,0 +1,13 @@ +"""IO layer for ConnectsCommonConnectivity. + +This package owns write/read backends and (re-)exports a few package-wide +helpers for convenience. 
The settings live in :mod:`connects_common_connectivity.config`; +they are re-exported here so IO callers can ``from connects_common_connectivity.io +import get_settings, table_path``. +""" + +from __future__ import annotations + +from ..config import Settings, get_settings, output_root, table_path + +__all__ = ["Settings", "get_settings", "output_root", "table_path"] diff --git a/tests/test_config.py b/tests/test_config.py new file mode 100644 index 0000000..dd27641 --- /dev/null +++ b/tests/test_config.py @@ -0,0 +1,195 @@ +"""Tests for the package-wide config module.""" + +from __future__ import annotations + +import os +from pathlib import Path + +import pytest +from connects_common_connectivity.config import ( + CONFIG_FILENAME, + Settings, + find_config_file, + get_settings, + output_root, + table_path, +) + + +@pytest.fixture(autouse=True) +def _reset_cache_and_env(monkeypatch, tmp_path): + """Each test runs in an isolated tmp cwd with a cleared cache and no env override.""" + monkeypatch.delenv("CCC_OUTPUT_ROOT", raising=False) + monkeypatch.chdir(tmp_path) + get_settings.cache_clear() + yield + get_settings.cache_clear() + + +def _write_config(dir_: Path, **values) -> Path: + import yaml + + path = dir_ / CONFIG_FILENAME + path.write_text(yaml.safe_dump(values)) + return path + + +def test_get_settings_raises_actionable_error_when_missing(tmp_path): + # tmp_path has no ccc_config.yaml anywhere up the tree (we chdir'd into it). 
+ with pytest.raises(RuntimeError, match=CONFIG_FILENAME): + get_settings() + + +def test_find_and_load_from_nested_cwd(tmp_path, monkeypatch): + _write_config(tmp_path, output_root=str(tmp_path / "out"), dry_run=True) + nested = tmp_path / "a" / "b" / "c" + nested.mkdir(parents=True) + monkeypatch.chdir(nested) + get_settings.cache_clear() + + found = find_config_file() + assert found == (tmp_path / CONFIG_FILENAME).resolve() + + settings = get_settings() + assert isinstance(settings, Settings) + assert settings.output_root == Path(str(tmp_path / "out")) + assert settings.dry_run is True + + +def test_env_overrides_only_output_root(tmp_path, monkeypatch): + _write_config(tmp_path, output_root=str(tmp_path / "from_file"), dry_run=True) + monkeypatch.setenv("CCC_OUTPUT_ROOT", str(tmp_path / "from_env")) + get_settings.cache_clear() + + settings = get_settings() + assert settings.output_root == Path(str(tmp_path / "from_env")) + # dry_run still comes from the file; env cannot express it. + assert settings.dry_run is True + + +def test_explicit_settings_wins_over_env_and_file(tmp_path, monkeypatch): + _write_config(tmp_path, output_root=str(tmp_path / "from_file"), dry_run=True) + monkeypatch.setenv("CCC_OUTPUT_ROOT", str(tmp_path / "from_env")) + get_settings.cache_clear() + + explicit = Settings(output_root=tmp_path / "explicit", dry_run=False) + + # Simulate the caller-side precedence pattern documented for writers/readers. + def writer(settings=None): + return settings or get_settings() + + resolved = writer(settings=explicit) + assert resolved is explicit + assert resolved.output_root == tmp_path / "explicit" + assert resolved.dry_run is False + + +def test_table_path_joins_and_returns_path(tmp_path): + settings = Settings(output_root=tmp_path / "root") + p = table_path(settings, "dataset") + assert isinstance(p, Path) + assert p == tmp_path / "root" / "dataset" + # A few of the canonical subdir names used by the notebooks. 
+ for name in ( + "dataitem", + "dataitem_dataset_association", + "cellfeatureset", + "cellfeaturematrix", + "cluster", + "clustermembership", + "projectionmeasurementmatrix", + ): + assert table_path(settings, name) == tmp_path / "root" / name + + +def test_output_root_is_required(tmp_path): + _write_config(tmp_path, dry_run=False) # missing output_root + get_settings.cache_clear() + with pytest.raises(Exception): + get_settings() + + +def test_unknown_keys_rejected(tmp_path): + _write_config(tmp_path, output_root=str(tmp_path), nonsense_key=1) + get_settings.cache_clear() + with pytest.raises(Exception): + get_settings() + + +def test_io_reexports_settings_helpers(): + from connects_common_connectivity.io import ( + Settings as IOSettings, + get_settings as io_get_settings, + output_root as io_output_root, + table_path as io_table_path, + ) + + assert IOSettings is Settings + assert io_get_settings is get_settings + assert io_table_path is table_path + assert io_output_root is output_root + + +def test_get_settings_is_cached(tmp_path, monkeypatch): + _write_config(tmp_path, output_root=str(tmp_path / "out")) + get_settings.cache_clear() + first = get_settings() + # Mutating the file should not change the cached result. + _write_config(tmp_path, output_root=str(tmp_path / "changed")) + second = get_settings() + assert first is second + # After clearing, discovery re-runs. + get_settings.cache_clear() + third = get_settings() + assert third.output_root == Path(str(tmp_path / "changed")) + + +def test_describe_includes_resolved_values(tmp_path): + settings = Settings(output_root=tmp_path / "root", dry_run=True) + text = settings.describe() + assert "root" in text + assert "dry_run=True" in text + + +def test_output_root_helper_appends_trailing_slash(tmp_path, monkeypatch): + _write_config(tmp_path, output_root=str(tmp_path / "out")) + get_settings.cache_clear() + # cwd is tmp_path (autouse fixture), so relpath of tmp_path/out is "out". 
+ root = output_root() + assert isinstance(root, str) + assert root.endswith("/") + assert root == "out/" + + +def test_output_root_helper_absolute_flag(tmp_path): + settings = Settings(output_root=tmp_path / "explicit") + assert output_root(settings, absolute=True) == str(tmp_path / "explicit") + "/" + + +def test_output_root_helper_accepts_explicit_settings(tmp_path, monkeypatch): + monkeypatch.chdir(tmp_path) + explicit = Settings(output_root=tmp_path / "explicit") + # Default returns path relative to cwd (tmp_path). + assert output_root(explicit) == "explicit/" + + +def test_relative_output_root_in_config_is_anchored_at_config_dir(tmp_path, monkeypatch): + # Config sits at tmp_path; output_root is relative ("scratch/x/"). + _write_config(tmp_path, output_root="scratch/x/") + nested = tmp_path / "code" + nested.mkdir() + monkeypatch.chdir(nested) + get_settings.cache_clear() + + settings = get_settings() + # Settings.output_root is absolute, anchored at the config file's dir + # (abspath, not resolve — symlinks must not be followed). + assert settings.output_root == Path(os.path.abspath(tmp_path / "scratch" / "x")) + + # output_root() returns the path relative to cwd → "../scratch/x/". + assert output_root() == "../scratch/x/" + + # table_path joins to an absolute path that works regardless of cwd. + tp = table_path(settings, "dataset") + assert tp.is_absolute() + assert tp == Path(os.path.abspath(tmp_path / "scratch" / "x" / "dataset")) From 822df72aee139fcbb34f411a87e3ea306a2ca4a7 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Wed, 10 Jun 2026 19:35:13 +0000 Subject: [PATCH 07/25] sync local co --- ccc_config.yaml | 2 +- code/etl_wnm_exc_01_dataset_dataitem.ipynb | 91 +--------------------- 2 files changed, 2 insertions(+), 91 deletions(-) diff --git a/ccc_config.yaml b/ccc_config.yaml index 9256860..6a121c9 100644 --- a/ccc_config.yaml +++ b/ccc_config.yaml @@ -1,5 +1,5 @@ # Package-wide settings for ConnectsCommonConnectivity. 
# Discovered by walking up from cwd (pyproject.toml/ruff/pytest pattern). # Edit this file (or set CCC_OUTPUT_ROOT) to repoint writers/readers. -output_root: scratch/em_patchseq_wnm_v1/ +output_root: scratch/em_patchseq_wnm_v2/ dry_run: false diff --git a/code/etl_wnm_exc_01_dataset_dataitem.ipynb b/code/etl_wnm_exc_01_dataset_dataitem.ipynb index 8e1dec4..60ae9ad 100644 --- a/code/etl_wnm_exc_01_dataset_dataitem.ipynb +++ b/code/etl_wnm_exc_01_dataset_dataitem.ipynb @@ -31,96 +31,7 @@ " DataItemDataSetAssociation,\n", " Modality,\n", ")\n", - "from connects_common_connectivity.write_utils import append_new_dataitems\n", - "import connects_common_connectivity.config as c" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "OUTPUT_ROOT = c.output_root()" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "'../scratch/em_patchseq_wnm_v1/'" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "OUTPUT_ROOT" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "output_root=PosixPath('scratch/em_patchseq_wnm_v1') dry_run=False\n", - "dataset table path: scratch/em_patchseq_wnm_v1/dataset\n" - ] - } - ], - "source": [ - "settings = get_settings()\n", - "print(settings)\n", - "print(\"dataset table path:\", table_path(settings, \"dataset\"))" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "output_root=PosixPath('/tmp/ccc-smoke') dry_run=False\n" - ] - } - ], - "source": [ - "import os, importlib, connects_common_connectivity.config as c\n", - "os.environ[\"CCC_OUTPUT_ROOT\"] = \"/tmp/ccc-smoke\"\n", - "c.get_settings.cache_clear()\n", - "print(c.get_settings())" - ] - 
}, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "output_root=PosixPath('scratch/em_patchseq_wnm_v1') dry_run=False\n" - ] - } - ], - "source": [ - "del os.environ[\"CCC_OUTPUT_ROOT\"]\n", - "c.get_settings.cache_clear()\n", - "print(c.get_settings())" + "from connects_common_connectivity.write_utils import append_new_dataitems" ] }, { From eff8f99ed79224af5b431230d3d27fb9cc1b66dc Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Wed, 10 Jun 2026 13:39:49 -0700 Subject: [PATCH 08/25] write plan edits --- planning/ARCHITECTURE.md | 50 +++-- planning/TODO.md | 33 +-- planning/prompts/02_write_spec.md | 109 +++++----- planning/prompts/03_writers.md | 254 ++++++++++++++++------ planning/prompts/04_public_api.md | 14 +- planning/prompts/06_notebook_migration.md | 12 +- 6 files changed, 308 insertions(+), 164 deletions(-) diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md index 1cc8d32..1df8221 100644 --- a/planning/ARCHITECTURE.md +++ b/planning/ARCHITECTURE.md @@ -51,7 +51,7 @@ src/connects_common_connectivity/ write_spec.py # NEW registry — source of truth write_validation.py# NEW auto-derived strict submodels (write-safety validation) arrow_utils.py # MOVED from root (no rename) (models <-> Arrow conversion) - writers.py # NEW write_models() + typed wrappers + writers.py # NEW write_models() + write_projection_matrix() write_utils.py # MOVED from root (append-by-id backend, walk_ancestors, # populate_region_coverage) # --- deferred (see "Later — elaborations"; designs kept, not built yet) --- @@ -125,9 +125,9 @@ LinkML schema ──▶│ models.py (generated) │ │ │ │ ▼ ▼ ▼ validation write module read module - (strict submodel (write_dataset, (predicate-based + - derived per write_dataitem, flexible cross-dataset - class) write_features...) 
reads) + (strict submodel (write_models + (predicate-based + + derived per write_projection flexible cross-dataset + class) _matrix) reads) │ ▼ Settings (global output_root) @@ -173,7 +173,7 @@ fails loudly rather than writing somewhere arbitrary. `get_settings()` is a pure function of the filesystem (clearable in tests), not a mutable global. How the ETL uses it (kills the per-notebook setup): there is no config cell at all. A -notebook just imports and calls `write_dataset(...)` / `read_dataset(...)`; the library +notebook just imports and calls `write_models(...)` / `read_dataset(...)`; the library discovers `ccc_config.yaml` on its own. Writers/readers do `settings = settings or get_settings()`. To repoint local vs CodeOcean, edit the one file (or set `CCC_OUTPUT_ROOT`). A `table_path(settings, "dataset")` helper resolves per-table subdirectories so nothing @@ -235,22 +235,24 @@ not a strict-submodel validator. This keeps validation free of any dependency on ## Module 4 — `writers.py` (+ `io/write_utils.py`, `io/arrow_utils.py`) -A single dispatch core plus thin typed wrappers: - -- `write_models(models, *, settings=None)` — infers the class, looks up the registry, - converts via `io/arrow_utils.py`, attaches LinkML metadata, then writes per `write_mode` - (scoped overwrite with the registry-built predicate, or `append_new_by_id` via the - backend). It calls a **validation hook** before writing; in the write-IO phase that hook is - a pass-through, and Module 3 (built afterward) swaps in the real strict validator with no - restructuring. -- Typed wrappers for discoverability (`write_dataset`, `write_dataitem`, - `write_association`, `write_features`, `write_cluster`, `write_cluster_membership`, - `write_cell_to_cluster_mapping`, `write_projection_matrix`). `write_models` is the one - real entry point; the wrappers are sugar. 
To avoid hand-maintaining eight one-liners that - must stay in lockstep with the registry, **generate them from the registry** (a small - factory binding the class) and re-export the generated names from `io/__init__.py`. A - hand-written wrapper is justified only where a class needs a non-uniform signature - (e.g. `write_projection_matrix` taking the dense matrix for enrichment). +A single dispatch core, no per-class wrappers: + +- `write_models(models, *, settings=None) -> WriteResult` — infers the class, looks up + the registry, converts via `io/arrow_utils.py`, attaches LinkML metadata, then writes + per `write_mode` (scoped overwrite with the registry-built predicate, `append_new_by_id` + via the backend, `wide_parquet` for `CellFeatureMatrix`). It calls a **validation hook** + before writing; in the write-IO phase that hook is a pass-through, and Module 3 (built + afterward) swaps in the real strict validator with no restructuring. +- **No `write_dataset` / `write_dataitem` / `write_association` / etc. wrappers.** + `write_models()` infers the class from its argument; renaming it per class adds no + behavior, only drift surface. Discoverability is provided by + `WRITABLE_CLASSES = tuple(s.model_cls for s in REGISTRY.values())` plus + `write_models`'s docstring. +- `write_projection_matrix(pmm, matrix, *, settings=None) -> WriteResult` is the **one** + non-`write_models` public writer, justified because its signature is non-uniform (it + takes the dense matrix for `populate_region_coverage` enrichment before delegating to + `write_models`). No other exceptions — if a future class needs pre-write enrichment, the + caller does the enrichment and then calls `write_models`. - `io/write_utils.py` (moved from root): `append_new_dataitems` is the `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers; `populate_region_coverage` (ported from `io_plans.md`) is the pre-write projection helper. 
`write_projection_matrix` @@ -260,9 +262,9 @@ A single dispatch core plus thin typed wrappers: plumbing the projection writer needs — same shelf as `append_new_dataitems` — not a separate "transforms" concern. -Wide feature matrices (`CellFeatureMatrix`) use `build_cell_feature_matrix_schema` (in -`io/arrow_utils.py`) and a matrix-specific writer path, since they are wide Parquet, not -row-modeled Delta tables. +Wide feature matrices (`CellFeatureMatrix`) stay inside the registry under +`write_mode = "wide_parquet"`; `write_models` dispatches them through +`build_cell_feature_matrix_schema` (in `io/arrow_utils.py`) and a Parquet write. ## Later — elaborations (deferred; design kept, not built yet) diff --git a/planning/TODO.md b/planning/TODO.md index 0bf5b1f..359ac17 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -17,15 +17,20 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li `configure()` global, no `%run`. Re-exported from `io/__init__.py`. `ccc_config.yaml` seeded at repo root with `output_root: scratch/em_patchseq_wnm_v1/`. Tests: `tests/test_config.py` (14 tests, all passing). -- [ ] **W2 — Write spec registry** (`prompts/02_write_spec.md`) — `io/write_spec.py`: one - entry per writable class (`subdir`, `partition_by`, `scope_columns`, `write_mode`, - `required_for_write`, `cross_field_rules`). Seed DataSet/DataItem/Association now; add - others as W3 prototypes them. Includes the registry↔schema drift test. -- [ ] **W3 — Writers + relocation + per-class prototyping** (`prompts/03_writers.md`) — - Move `arrow_utils.py`/`write_utils.py` into `io/` (re-export shims at old paths). Build - `io/writers.py` (`write_models` + registry-generated typed wrappers) with a pass-through - validation hook for W5. Land `populate_region_coverage` in `io/write_utils.py`. For each - writable class, add a small real write example to a notebook and let it set the W2 entry. 
+- [ ] **W2 — Write spec registry (seed only)** (`prompts/02_write_spec.md`) — + `io/write_spec.py`: `WriteSpec` pydantic model, `REGISTRY` seeded with **exactly three** + entries (`DataSet`, `DataItem`, `DataItemDataSetAssociation`), `get_spec()` lookup, and + the drift test (`tests/test_write_spec.py`). `required_for_write` and + `cross_field_rules` left empty — W5 owns those. The remaining classes are W3's job. +- [ ] **W3 — Writers + relocation + registry expansion** (`prompts/03_writers.md`) — + Move `arrow_utils.py`/`write_utils.py` into `io/` (re-export shims at old paths, removed + in W6). Build `io/writers.py`: `write_models()` dispatch, `WriteResult` frozen dataclass, + `WRITABLE_CLASSES` discovery tuple, pass-through `_validation_hook` for W5 to swap, plus + `write_projection_matrix()` (the one non-`write_models` public writer, justified by its + non-uniform signature). **No per-class wrappers** — `write_models` infers the class. + Land `populate_region_coverage` in `io/write_utils.py`. **Expand the registry** by + prototyping each remaining writable class one at a time (notebook write → registry entry + → smoke test). `CellFeatureMatrix` stays in the registry under `wide_parquet` mode. Blocked by W1, W2. - [ ] **W4 — Public API** (`prompts/04_public_api.md`) — `io/__init__.py`: curated re-exports + `__all__`. The user-facing surface; defines what autocomplete shows. Blocked @@ -33,11 +38,11 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li - [ ] **W5 — Write validation** (`prompts/05_validation.md`) — `io/write_validation.py`: `strict_model_for(cls)` flips `required_for_write` to required + attaches pure `cross_field_rules` (no I/O). Swap `validate_for_write` into the W3 hook. Blocked by W2, W3. -- [ ] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Create - `ccc_config.yaml` at repo root. 
Migrate every ETL notebook to typed writers; delete - hardcoded `OUTPUT_ROOT` and per-cell `mode`/`predicate`/`partition_by`. Run the patchseq - regression (exc then inh, both DataSet rows must coexist). Remove the W3 re-export shims - and confirm nothing imports the old paths. Blocked by W3 (W5 preferred). +- [ ] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Migrate every + ETL notebook to typed writers; delete hardcoded `OUTPUT_ROOT` and per-cell + `mode`/`predicate`/`partition_by` (`ccc_config.yaml` already exists from W1). Run the + patchseq regression (exc then inh, both DataSet rows must coexist). Remove the W3 + re-export shims and confirm nothing imports the old paths. Blocked by W3 (W5 preferred). - [ ] **W7 — Write-side test suite** (`prompts/07_tests.md`) — Drift, patchseq regression, idempotency, append-new-by-id, predicate construction, per-class example smoke, no-shim regression, public-API surface. Owns only the gaps not specified by W2/W3/W4/W5. diff --git a/planning/prompts/02_write_spec.md b/planning/prompts/02_write_spec.md index 1a78508..750fcb1 100644 --- a/planning/prompts/02_write_spec.md +++ b/planning/prompts/02_write_spec.md @@ -1,67 +1,70 @@ -# Agent prompt — Write spec registry +# Agent prompt — Write spec registry (seed only) > Prepend `00_shared_context.md`. Depends on nothing (reads generated models). ## Goal -Create `src/connects_common_connectivity/io/write_spec.py`: an explicit registry, one -entry per writable class, that is the single source of truth for how each class is -written. Plus a test that the registry cannot drift from the schema. +Create `src/connects_common_connectivity/io/write_spec.py` with the `WriteSpec` shape, a +`REGISTRY` seeded with **exactly three** entries (DataSet, DataItem, +DataItemDataSetAssociation), a `get_spec()` lookup, and a drift test. -## Approach: prototype, don't assume -Do NOT assume every class is scoped-overwrite-with-predicate. 
That pattern fits -DataSet/Association, but `append_new_by_id` already exists for DataItem because append was -right there, and other classes may want append or modes not yet named. **For each class, -build a small real write example in a notebook first** (paired with `03_writers.md`), see how -it actually wants to be written, and let that decide the entry. `write_mode` is an open -`Literal` you extend when an example doesn't fit the existing modes — not a constraint to -force classes into. +This prompt is the **minimum** needed to unblock W3. The remaining classes are added during +W3, where the writer exists to prototype against — see `03_writers.md` for that loop. -**Sequencing:** in W2 (this prompt) seed *only* the three correctness-critical classes -below — enough to unblock W3. The remaining entries are added during W3, where the writer -exists to prototype against; that loop (notebook example → registry entry) is W3's job, not -this one's. Trying to fill the whole registry up front contradicts "prototype, don't -assume." +## `WriteSpec` shape +Pydantic v2 `BaseModel` (the rest of the codebase uses pydantic — match it): -## Registry shape -Define a dataclass/pydantic model `WriteSpec` with fields: -- `model_cls` — the generated pydantic class (import from `..models`). -- `subdir: str` — Delta subdir under `output_root` (must match the notebook paths). -- `partition_by: list[str]` — Delta partition columns. -- `scope_columns: list[str]` — for scoped-overwrite classes, columns defining the predicate - (identity within the shared table). May be empty for append-mode classes. -- `write_mode: Literal[...]` — start with `"overwrite_scoped"`, `"append_new_by_id"`; add new - members when a class's example shows neither fits. Keep it easy to extend. -- `required_for_write: list[str]` — slots that must be non-null to write safely (may be - stricter than the schema's `required`). 
-- `cross_field_rules: list[str]` — names of cross-field checks (implemented in - `write_validation.py`); empty for now is fine. +```python +class WriteSpec(BaseModel): + model_cls: type # the generated pydantic class + subdir: str # Delta subdir under output_root + partition_by: list[str] # Delta partition columns + scope_columns: list[str] # columns defining the predicate + # (or the id column for append_new_by_id) + write_mode: Literal["overwrite_scoped", "append_new_by_id"] # extend in W3 if needed + required_for_write: list[str] = [] # leave empty here; W5 owns this + cross_field_rules: list[str] = [] # leave empty here; W5 owns this +``` -Expose `REGISTRY: dict[str, WriteSpec]` keyed by class name, and a -`get_spec(model_or_cls) -> WriteSpec` lookup. +Notes: +- `scope_columns` does double duty: for `overwrite_scoped` it's the predicate; for + `append_new_by_id` it's the id column(s) the backend dedupes on. One field, two + interpretations dispatched on `write_mode`. +- `required_for_write` and `cross_field_rules` are owned by W5 (validation). Leave them as + empty lists for the seed entries; do not guess. -## Seed these first (correctness-critical) -- `DataSet`: subdir `"dataset"`, partition `["project_id"]`, - **scope `["project_id", "id"]`** (THIS is the patchseq bug fix), mode - `overwrite_scoped`. -- `DataItem`: subdir `"dataitem"`, partition `["project_id"]`, mode `append_new_by_id`, - id column `"id"`. -- `DataItemDataSetAssociation`: subdir `"dataitem_dataset_association"`, partition - `["project_id"]`, scope `["project_id", "dataset_id"]`, mode `overwrite_scoped`. +Expose: +- `REGISTRY: dict[str, WriteSpec]` keyed by class name (`"DataSet"`, etc.). +- `get_spec(model_or_cls) -> WriteSpec` — accepts a class or an instance. -Then add entries for `Cluster`, `ClusterHierarchy`, `ClusterMembership`, -`CellFeatureSet`, `CellFeatureDefinition`, `CellToClusterMapping`, `MappingSet`, -`ProjectionMeasurementMatrix`, etc. 
**as each one's write example is prototyped** — read how -it's written today in `code/etl_*.ipynb` (grep `write_deltalake` and `predicate=`), try it -through the writer, and only then fix its entry. Where a notebook's current predicate looks -wrong (like the DataSet case), prefer the correct scope and note it in a comment. -`CellFeatureMatrix` is wide Parquet, not row Delta — mark it so the writer routes it to the -matrix path (`build_cell_feature_matrix_schema`). +## Seed exactly these three + +| class | subdir | partition_by | scope_columns | write_mode | +|---|---|---|---|---| +| `DataSet` | `dataset` | `["project_id"]` | `["project_id", "id"]` ← patchseq fix | `overwrite_scoped` | +| `DataItem` | `dataitem` | `["project_id"]` | `["id"]` | `append_new_by_id` | +| `DataItemDataSetAssociation` | `dataitem_dataset_association` | `["project_id"]` | `["project_id", "dataset_id"]` | `overwrite_scoped` | + +The subdir names must match the existing notebook paths (grep +`code/etl_*_01_dataset_dataitem.ipynb` for `write_deltalake(` to confirm). The DataSet +scope is the patchseq fix — today's notebooks predicate only on `project_id`, which is why +`visp_inh_patchseq` overwrites `visp_exc_patchseq`. + +**Do NOT add any other classes here.** `Cluster`, `ClusterMembership`, `MappingSet`, +`CellFeatureSet`, `CellToClusterMapping`, `ProjectionMeasurementMatrix`, `CellFeatureMatrix`, +etc. are W3's responsibility, added one at a time as their write examples are prototyped. ## Drift test (`tests/test_write_spec.py`) -- Every `REGISTRY` key resolves to a real class in `models.py`. -- Every column in `scope_columns` + `partition_by` + `required_for_write` corresponds to - a field on that model (check `model_fields`). Fail loudly otherwise. +- Every `REGISTRY` key resolves to a real class in `models.py` (importable, `model_cls` + matches the key). 
+- For each entry, every name in `scope_columns + partition_by + required_for_write` is a + field on `model_cls` (check `model_cls.model_fields`). Fail with the offending + class/field name. +- `get_spec(SomeClass)` and `get_spec(SomeClass(...))` return the same entry. ## Report -A table of each class → subdir / partition_by / scope_columns / write_mode, and call out -any notebook predicate you believe is wrong (do not fix notebooks here). +- The three subdir names you wrote, and the matching paths grep'd from the notebooks. +- Confirmation that `tests/test_write_spec.py` passes (`pytest tests/test_write_spec.py -q`). + +## Do not +- Add a fourth class. Edit `models.py` or schemas. Touch any notebook. Populate + `required_for_write` or `cross_field_rules` (those are W5's job). diff --git a/planning/prompts/03_writers.md b/planning/prompts/03_writers.md index b6a213d..6901004 100644 --- a/planning/prompts/03_writers.md +++ b/planning/prompts/03_writers.md @@ -1,74 +1,204 @@ -# Agent prompt — Writers (dispatch core + typed wrappers) +# Agent prompt — Writers (dispatch core + registry expansion) -> Prepend `00_shared_context.md`. Depends on `config.py`, `write_spec.py`. (Validation is -> built afterward and slots into the pass-through hook below — not a dependency here.) +> Prepend `00_shared_context.md`. Depends on `config.py` (W1), `write_spec.py` (W2). +> Validation (W5) slots into the pass-through hook below — not a dependency here. -## Relocation first (clean structure) -Before writing new code, MOVE the existing backends into `io/` (with re-export shims at the -old paths until notebook migration is done): +## What W3 ships +1. The dispatch core `write_models()` and a `WriteResult` value object. +2. The remaining `WriteSpec` entries (everything except the three W2 seeded), each driven + by a small write example so the entry reflects how the class actually wants to be + written. +3. 
`write_projection_matrix()` — the **only** standalone writer function, because it + needs a non-uniform signature (the dense matrix for `populate_region_coverage`). +4. The relocation of `arrow_utils.py` and `write_utils.py` into `io/`, plus the + `populate_region_coverage` helper. + +## No per-class wrapper functions +Decision: there are NO `write_dataset`, `write_dataitem`, `write_association`, etc. +wrappers. `write_models()` infers the class from its argument; renaming it eight times +adds no behavior, only drift surface. The single exception is `write_projection_matrix()` +because its signature is genuinely different (it accepts a dense matrix). Discoverability +is provided by `WRITABLE_CLASSES` (a tuple of `model_cls`) plus `write_models`'s docstring +listing them. + +## Relocation first +Before writing new code, MOVE the existing backends into `io/` with one-line re-export +shims at the old paths (deleted in W6): - `arrow_utils.py` → `io/arrow_utils.py` - `write_utils.py` → `io/write_utils.py` -All new code imports from the `io/` locations. - -## Goal -Create `src/connects_common_connectivity/io/writers.py`: a single write dispatch that uses -the registry so notebooks never hand-write `mode` / `predicate` / `partition_by` again. - -## Core -`write_models(models, *, settings=None) -> WriteResult`: -1. Accept a single model or an iterable; infer the class; require homogeneous type. -2. `settings = settings or get_settings()` (loads the discovered `ccc_config.yaml`; an - explicit `settings=` still wins). -3. Look up the `WriteSpec` via `get_spec`. -4. Call a **validation hook** before any IO. In this phase the hook is a pass-through - (identity) function — validation is built afterward (`05_validation.md`) and swaps the - real `validate_for_write` into this hook with no restructuring. Wire the call site now. + +All new code imports from the `io/` locations. 
The shims look like: +```python +# src/connects_common_connectivity/arrow_utils.py +from .io.arrow_utils import * # noqa: F401,F403 (deprecated; removed in W6) +``` + +Add a quick smoke test (`tests/test_write_relocation.py`) that asserts the public names +(`build_arrow_schema`, `models_to_table`, `attach_linkml_metadata`, +`build_cell_feature_matrix_schema`, `append_new_dataitems`, `walk_ancestors`) are +importable from BOTH the new and the shim path. + +## Core: `write_models` +```python +def write_models(models, *, settings: Settings | None = None) -> WriteResult: ... +``` + +1. Accept a single model or an iterable; require homogeneous type; infer the class. +2. `settings = settings or get_settings()`. Explicit `settings=` always wins. +3. `spec = get_spec(cls)`. +4. **Validation hook** — call `_validation_hook(models, spec)` before any IO. In W3 this + is a pass-through (identity) function defined at module top: + ```python + _validation_hook = lambda models, spec: models # replaced in W5 + ``` + Wire the call site now; W5 monkey-patches the real validator in. 5. Convert via `arrow_utils.models_to_table` + `build_arrow_schema`; attach metadata with - `attach_linkml_metadata(linkml_class=)`. + `attach_linkml_metadata(linkml_class=cls.__name__)`. 6. Resolve path with `table_path(settings, spec.subdir)`. -7. Dispatch on `spec.write_mode`: - - `overwrite_scoped`: build the predicate from `spec.scope_columns` and the row values - (e.g. `project_id = '...' AND id = '...'`), then - `write_deltalake(path, table, mode="overwrite", predicate=..., partition_by=spec.partition_by)`. - If a batch contains multiple distinct scope tuples, write per scope group (one - predicate each) — never widen a predicate to cover rows it shouldn't. - - `append_new_by_id`: delegate to `write_utils.append_new_dataitems` (the backend), - passing `project_id` and id column. -8. Return a small result object: rows written/appended, path, mode, predicate used. 
- -## Typed wrappers (generated from the registry) -`write_models` is the one real entry point. Provide the discoverable per-class names -(`write_dataset`, `write_dataitem`, `write_association`, `write_features`, `write_cluster`, -`write_cluster_membership`, `write_cell_to_cluster_mapping`, `write_projection_matrix`) but -**generate them from the registry** with a small factory that binds the class, rather than -hand-writing eight one-liners that can drift from the registry. Hand-write a wrapper only -where the signature is non-uniform (e.g. `write_projection_matrix` accepting the dense -matrix for enrichment). The generated names are re-exported from `io/__init__.py` (Phase 3b). +7. Dispatch on `spec.write_mode` (factor each branch into a private helper so the tests + below can target each in isolation): + - `_dispatch_overwrite_scoped`: group rows by their `scope_columns` tuple via + `_group_by_scope`. **Write each group with its own predicate** — never widen a + predicate to cover rows it shouldn't. Predicate built by `_build_predicate`, format + `col1 = 'val1' AND col2 = 'val2'` (single quotes, AND-joined). One + `write_deltalake(... mode="overwrite", predicate=..., partition_by=spec.partition_by)` + call per group. + - `_dispatch_append_new_by_id`: delegate to `write_utils.append_new_dataitems`. The + existing signature (`output_path, table, *, project_id, id_column="id"`) already + covers the seed entries; if a new `append_new_by_id` entry needs a different + partition column, generalize then. Pull `id_column` from `spec.scope_columns[0]` + and `project_id` from the row values. +8. Return a `WriteResult`. + +`write_models` should know nothing class-specific; everything class-specific lives in the +registry. The only places that mention specific model classes are `write_spec.py` (the +registry) and `write_projection_matrix` (the one signature exception). 
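The scoped-overwrite rule in step 7 (one predicate per scope group, never a widened predicate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned implementation: it works on plain dict rows rather than Arrow tables, the un-prefixed names `build_predicate`, `group_by_scope`, and `scoped_predicates` are hypothetical stand-ins for the private helpers, and no quote escaping is applied because the plan does not specify an escaping rule.

```python
from collections import defaultdict
from typing import Any, Iterable, List, Mapping, Sequence, Tuple


def build_predicate(scope_columns: Sequence[str], row_values: Mapping[str, Any]) -> str:
    """Render `col1 = 'val1' AND col2 = 'val2'` for a single scope tuple."""
    if not scope_columns:
        raise ValueError("overwrite_scoped needs at least one scope column")
    # Assumption: values contain no single quotes, so no escaping is done here.
    return " AND ".join(f"{col} = '{row_values[col]}'" for col in scope_columns)


def group_by_scope(
    rows: Iterable[Mapping[str, Any]], scope_columns: Sequence[str]
) -> List[Tuple[tuple, list]]:
    """Split rows into groups that share the same scope tuple (first-seen order)."""
    groups: dict = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in scope_columns)
        groups[key].append(row)
    return list(groups.items())


def scoped_predicates(
    rows: Iterable[Mapping[str, Any]], scope_columns: Sequence[str]
) -> List[str]:
    """One predicate per scope group; each would drive its own overwrite call."""
    return [
        build_predicate(scope_columns, dict(zip(scope_columns, key)))
        for key, _group in group_by_scope(rows, scope_columns)
    ]
```

With scope `["project_id", "id"]`, two DataSet rows that share a `project_id` but differ in `id` produce two separate predicates, so neither overwrite can touch the other row. That is the property the patchseq regression test relies on.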
+ +## `WriteResult` +A frozen dataclass — this is a return value, not validated data: + +```python +from dataclasses import dataclass +from pathlib import Path + +@dataclass(frozen=True) +class WriteResult: + class_name: str + path: Path + mode: str + predicates: tuple[str, ...] # one per group; () for append_new_by_id / wide_parquet + rows_written: int +``` + +Co-locate in `writers.py`. + +## Discovery: `WRITABLE_CLASSES` +Replaces the per-class wrappers. One line in `writers.py`: + +```python +WRITABLE_CLASSES: tuple[type, ...] = tuple(spec.model_cls for spec in REGISTRY.values()) +``` + +`write_models`'s docstring should list `WRITABLE_CLASSES` (or instruct the reader to print +it) so users can see what's writable without reading the registry source. + +## Registry expansion (the prototype loop — the main intellectual work of W3) +W2 only seeded `DataSet`, `DataItem`, `DataItemDataSetAssociation`. Add the rest now, +one at a time, each driven by a real write example. Do NOT batch them up front. + +For each class below: +1. **Read the existing notebook write.** `grep -n 'write_deltalake' code/etl_*.ipynb` to + find the call(s); note the current `mode`, `predicate`, and `partition_by`. +2. **Decide the mode.** If neither `overwrite_scoped` nor `append_new_by_id` fits, extend + the `Literal` in `write_spec.py` with a new value, document it in one comment line, + and add the dispatch branch in `write_models`. Don't force a class into a mode that + doesn't fit. +3. **Add the entry to `REGISTRY`.** If a current notebook predicate looks wrong (like the + DataSet case), use the correct scope and note it in a comment. +4. **Write a smoke test** in `tests/test_writers.py` (NOT a production notebook — + notebooks are W6) that constructs one or two instances and round-trips them through + `write_models`. +5. **Update the drift test** if the new entry exposes a column the test doesn't already + check. 
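As a rough illustration of what the drift test guards while the registry grows, the check can be sketched with standard-library stand-ins. Everything here is an assumption for illustration: `SpecSketch` mimics only the `WriteSpec` fields the check reads, the `DataSet` dataclass is a stand-in for the generated pydantic model, and `dataclasses.fields` substitutes for `model_cls.model_fields`.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SpecSketch:
    """Stand-in for WriteSpec, reduced to what the drift check needs."""
    model_cls: type
    scope_columns: Tuple[str, ...] = ()
    partition_by: Tuple[str, ...] = ()
    required_for_write: Tuple[str, ...] = ()


@dataclass
class DataSet:
    """Hypothetical stand-in for the generated DataSet model."""
    id: str
    project_id: str


def drift_errors(registry: Dict[str, SpecSketch]) -> List[str]:
    """Return one message per registry entry that no longer matches its model."""
    errors: List[str] = []
    for key, spec in registry.items():
        if spec.model_cls.__name__ != key:
            errors.append(f"{key}: model_cls is {spec.model_cls.__name__}")
        known = {f.name for f in fields(spec.model_cls)}
        referenced = (*spec.scope_columns, *spec.partition_by, *spec.required_for_write)
        # dict.fromkeys drops duplicates (e.g. project_id used twice) but keeps order.
        for name in dict.fromkeys(referenced):
            if name not in known:
                errors.append(f"{key}: unknown field {name!r}")
    return errors
```

Each message names the offending class and field, which matches the "fail with the offending class/field name" requirement from the W2 drift test.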
+ +Classes to add this round, roughly grouped: +- Cluster side: `Cluster`, `ClusterHierarchy`, `ClusterMembership`. +- Mapping side: `MappingSet`, `CellToClusterMapping`. (`CellToCellMapping` and + `ClusterToClusterMapping` only if a notebook actually writes them this round; otherwise + defer.) +- Feature side: `CellFeatureSet`, `CellFeatureDefinition`. (`CellFeatureMatrix` is wide + Parquet — see "Wide feature matrices" below.) +- Projection: `ProjectionMeasurementMatrix`. See "Projection pre-write helper" below. + +If a class isn't written by any current notebook, skip it — adding an entry no caller +exercises violates "prototype, don't assume." ## Wide feature matrices -`CellFeatureMatrix` is wide Parquet. Route it through a matrix-specific path using -`build_cell_feature_matrix_schema` (now in `io/arrow_utils.py`); do not force it into the -row-Delta path. - -## Projection pre-write helper (in `io/write_utils.py`, not a transforms module) -`populate_region_coverage(pmm, matrix)` is write plumbing the projection writer needs — same -shelf as `append_new_dataitems` — so it lives in `io/write_utils.py`, NOT a separate -`transforms` module. Port it from `io/io_plans.md`: derive `region_coverage` from the dense -values array, return a copy of the `ProjectionMeasurementMatrix` (pure function, no mutation, -no IO). `write_projection_matrix` calls it (or accepts an already-enriched matrix). Do NOT -port `compare_region_coverage` — that is read-side analysis and is deferred (`09_analysis.md`). - -## Reconcile `write_utils.py` -Make `append_new_dataitems` the `append_new_by_id` backend. If you must generalize it -(e.g. parametrize the partition column), keep the existing signature working — its current -notebook callers must not break. +`CellFeatureMatrix` is wide Parquet, not row-Delta. It doesn't fit `overwrite_scoped` / +`append_new_by_id`. 
Keep it inside the registry by adding `write_mode = "wide_parquet"` +and routing it through `build_cell_feature_matrix_schema` + a Parquet write inside +`write_models`. Same registry, same dispatch, different branch — no separate wrapper. +(`write_models(cell_feature_matrix)` is the call.) If during prototyping the wide-Parquet +path turns out to need invariants that don't fit `WriteSpec` cleanly, stop and report +before adding a separate function. + +## Projection pre-write helper + `write_projection_matrix` +Port `populate_region_coverage(pmm, matrix)` from `io/io_plans.md` into +`io/write_utils.py` (write plumbing — same shelf as `append_new_dataitems`, NOT a separate +`transforms` module). Pure function: derive `region_coverage` from the dense values array, +return a NEW `ProjectionMeasurementMatrix` instance (no mutation, no IO). + +`write_projection_matrix` is the **one** non-`write_models` public writer: +```python +def write_projection_matrix(pmm, matrix, *, settings=None) -> WriteResult: + enriched = populate_region_coverage(pmm, matrix) + return write_models(enriched, settings=settings) +``` +It exists because its signature is non-uniform (takes the dense matrix). Don't introduce +a second exception — if some other class needs pre-write enrichment, route it through +`write_models` with the enrichment done by the caller, not via a new wrapper. + +Do NOT port `compare_region_coverage` — read-side, deferred (`_deferred/09_analysis.md`). + +## Private helpers (factor these out for testability) +- `_build_predicate(scope_columns, row_values) -> str` +- `_group_by_scope(table, scope_columns) -> list[tuple[tuple, Table]]` +- `_dispatch_overwrite_scoped(table, spec, path) -> WriteResult` +- `_dispatch_append_new_by_id(table, spec, path) -> WriteResult` +- `_validation_hook(models, spec) -> models` — pass-through; replaced by W5 + +These are private (underscore-prefixed). Tests import them directly to exercise their +units without going through Delta. 
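One detail of the pass-through hook is worth sketching: the call site has to look the hook up by its module-level name at call time, otherwise a later swap has no effect. The sketch below assumes a hypothetical `set_validation_hook` helper and a `write_models_sketch` stand-in that does no IO; the plan itself only says W5 replaces `_validation_hook`, so the swap mechanism shown here is an assumption.

```python
from typing import Any, Callable, Sequence


def _passthrough(models: Sequence[Any], spec: Any) -> Sequence[Any]:
    """W3 behaviour: accept everything unchanged."""
    return models


# Module-level hook; resolved by name on every call, not captured at definition.
_validation_hook: Callable[[Sequence[Any], Any], Sequence[Any]] = _passthrough


def set_validation_hook(hook: Callable) -> Callable:
    """Hypothetical helper: install a new hook and return the previous one."""
    global _validation_hook
    previous = _validation_hook
    _validation_hook = hook
    return previous


def write_models_sketch(models: Sequence[Any], spec: Any = None) -> int:
    """Stand-in for write_models: validate first, then 'write' (here: count rows)."""
    checked = _validation_hook(models, spec)  # looked up at call time
    return len(checked)
```

Because `write_models_sketch` reads the global each time it runs, installing a stricter hook changes behaviour without touching the writer, and restoring the returned previous hook undoes it (useful in tests).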
## Tests (`tests/test_writers.py`) -- Scoped overwrite writes only matching rows; a second dataset sharing `project_id` is - preserved (patchseq regression: write DataSet A, write DataSet B, both rows exist). -- Re-writing identical models is idempotent (no dupes, no loss). -- `append_new_by_id` appends only new ids. -- Predicate is built from `scope_columns`, verified by string/inspection. +- **Patchseq regression** (the headline): `write_models(DataSet(A))`, then + `write_models(DataSet(B))` with the same `project_id` but different `id`, read the + table back, assert both rows exist. +- **Idempotency**: writing the same models twice yields the same row count. +- **Append-new-by-id**: writing a batch with one new + one existing id appends only the + new one. +- **Multi-scope-group dispatch**: a batch with two distinct scope tuples produces two + predicates and two rows in the table; neither overwrites the other. Inspect + `WriteResult.predicates` to assert the count. +- **Predicate construction**: call `_build_predicate` directly and verify the format + (`col = 'val' AND col = 'val'`) by string match. +- **Per-class smoke**: iterate `WRITABLE_CLASSES` and round-trip a small instance of each + through `write_models` — every registry entry exercised. +- **`write_projection_matrix`**: enriches the PMM (sets `region_coverage`) and writes + successfully; the input is unmutated. + +## Reporting +- The full list of registry entries at the end of W3 (table: class / subdir / + partition_by / scope_columns / write_mode). +- Any class you skipped because no notebook writes it, and why. +- Any new `write_mode` you added beyond `overwrite_scoped` / `append_new_by_id` / + `wide_parquet`, with a one-sentence justification. +- Any current notebook predicate you believe is wrong (do not fix the notebook here — + W6 owns that). +- `pytest tests/ -q` summary (full suite, not just `test_writers.py`). ## Do not -- Hardcode any predicate. Touch `models.py` or schemas. 
+- Add per-class `write_*` wrapper functions. Hardcode any predicate. Skip the prototype + loop and bulk-add registry entries from intuition. Touch `models.py`, schemas, or any + notebook. Re-export internal backends from `io/__init__.py` (W4 owns the public + surface). diff --git a/planning/prompts/04_public_api.md b/planning/prompts/04_public_api.md index 9bdbe3a..9ab6f42 100644 --- a/planning/prompts/04_public_api.md +++ b/planning/prompts/04_public_api.md @@ -1,6 +1,6 @@ # Agent prompt — Public API (`io/__init__.py`) -> Prepend `00_shared_context.md`. Depends on writers (1.4); reader exports added later when +> Prepend `00_shared_context.md`. Depends on writers (W3); reader exports added later when > the read-side work happens. ## Why @@ -10,20 +10,24 @@ autocomplete. It also decouples the public surface from internal module layout. ## Requirements 1. A concise module docstring: one paragraph on the IO layer (note settings come from a - discovered `ccc_config.yaml`) + a 3–5 line usage example (a `write_*` call — no config - ceremony needed). + discovered `ccc_config.yaml`) + a 3–5 line usage example using `write_models(...)` and + `write_projection_matrix(...)` — no config ceremony needed. 2. Curated re-exports — only the names users should touch (write-side for now): - config (from the package root, `from ..config import ...`): `get_settings`, `Settings`, `table_path` - - writers: `write_models` + the generated typed wrappers + - writers: `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES` - reader names are added here when readers land (deferred) — leave a clear TODO comment. Do NOT re-export backends (`arrow_utils`, `write_utils`) or internal helpers. + Do NOT add per-class wrappers (`write_dataset`, etc.) — they don't exist; `write_models` + infers the class. 3. Define `__all__` to match exactly the curated list (keeps `dir()` and `*` imports clean). 4. Keep it import-light: no heavy work at import time; just imports + `__all__`. 
## Test (`tests/test_public_api.py`) - Every name in `__all__` is importable from `connects_common_connectivity.io`. - No backend/internal module name leaks into `__all__`. +- `__all__` does NOT contain any `write_dataset` / `write_dataitem` / etc. — those + wrappers don't exist by design. ## Do not -- Re-export internal backends. Touch `models.py` or schemas. +- Re-export internal backends. Add per-class wrappers. Touch `models.py` or schemas. diff --git a/planning/prompts/06_notebook_migration.md b/planning/prompts/06_notebook_migration.md index 10d1497..ec97c6c 100644 --- a/planning/prompts/06_notebook_migration.md +++ b/planning/prompts/06_notebook_migration.md @@ -20,15 +20,15 @@ changes. The library finds it by walking up from the notebook's working director ## Per ETL notebook 1. Delete the hardcoded `OUTPUT_ROOT = "../scratch/..."` entirely. There is no replacement config cell and no `%run` — the library discovers `ccc_config.yaml` on its own, so - `write_*` / `read_*` calls need neither a path nor `settings=`. (If a cell wants to show + `write_models(...)` calls need neither a path nor `settings=`. (If a cell wants to show the resolved config, it may `from connects_common_connectivity.io import get_settings; print(get_settings())`, but this is optional.) 2. Replace each direct `write_deltalake(... mode=... predicate=... partition_by=...)` call - with the matching typed writer (`write_dataset`, `write_dataitem`, `write_association`, - `write_features`, `write_cluster`, `write_cell_to_cluster_mapping`, - `write_projection_matrix`, ...). Delete the now-redundant `mode`/`predicate`/ - `partition_by` arguments and their explanatory comments — that logic now lives in the - registry. + with `write_models(my_instance)` (or `write_models([inst1, inst2])`). The class is + inferred from the argument; the registry owns mode / predicate / partition. 
Use + `write_projection_matrix(pmm, matrix)` for the one projection notebook — it's the + single non-`write_models` writer. Delete the now-redundant `mode`/`predicate`/ + `partition_by` arguments and their explanatory comments. 3. Keep verification cells; update their paths to use `table_path(get_settings(), ...)`. From ac5e9b00690f93b3028a49972335889b569b2b1e Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Thu, 11 Jun 2026 05:07:12 +0000 Subject: [PATCH 09/25] write spec and registry and functions --- CHANGELOG.md | 19 + planning/TODO.md | 29 +- .../arrow_utils.py | 356 +----------------- .../io/arrow_utils.py | 353 +++++++++++++++++ .../io/write_spec.py | 171 +++++++++ .../io/write_utils.py | 173 +++++++++ .../io/writers.py | 300 +++++++++++++++ .../write_utils.py | 125 +----- tests/test_write_relocation.py | 51 +++ tests/test_write_spec.py | 55 +++ tests/test_writers.py | 322 ++++++++++++++++ 11 files changed, 1478 insertions(+), 476 deletions(-) create mode 100644 src/connects_common_connectivity/io/arrow_utils.py create mode 100644 src/connects_common_connectivity/io/write_spec.py create mode 100644 src/connects_common_connectivity/io/write_utils.py create mode 100644 src/connects_common_connectivity/io/writers.py create mode 100644 tests/test_write_relocation.py create mode 100644 tests/test_write_spec.py create mode 100644 tests/test_writers.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 10652f1..607e76d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,10 +9,29 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Added `connects_common_connectivity.io.writers` with `write_models()` (the + single dispatch core for all generated pydantic models), + `write_projection_matrix()`, `WriteResult`, and `WRITABLE_CLASSES`. +- Added `populate_region_coverage()` in + `connects_common_connectivity.io.write_utils` for deriving + `ProjectionMeasurementMatrix.region_coverage` from a dense matrix. 
+ ### Changed +- Moved `arrow_utils` and `write_utils` under + `connects_common_connectivity.io.*`. The old import paths + (`connects_common_connectivity.arrow_utils`, + `connects_common_connectivity.write_utils`) keep working as deprecated + re-export shims. + ### Deprecated +- Importing from `connects_common_connectivity.arrow_utils` and + `connects_common_connectivity.write_utils`; use + `connects_common_connectivity.io.arrow_utils` / + `connects_common_connectivity.io.write_utils` instead. The shims will be + removed once notebook migration completes. + ### Removed ### Fixed diff --git a/planning/TODO.md b/planning/TODO.md index 359ac17..e3ad5b9 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -17,21 +17,28 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li `configure()` global, no `%run`. Re-exported from `io/__init__.py`. `ccc_config.yaml` seeded at repo root with `output_root: scratch/em_patchseq_wnm_v1/`. Tests: `tests/test_config.py` (14 tests, all passing). -- [ ] **W2 — Write spec registry (seed only)** (`prompts/02_write_spec.md`) — +- [x] **W2 — Write spec registry (seed only)** (`prompts/02_write_spec.md`) — `io/write_spec.py`: `WriteSpec` pydantic model, `REGISTRY` seeded with **exactly three** entries (`DataSet`, `DataItem`, `DataItemDataSetAssociation`), `get_spec()` lookup, and the drift test (`tests/test_write_spec.py`). `required_for_write` and `cross_field_rules` left empty — W5 owns those. The remaining classes are W3's job. -- [ ] **W3 — Writers + relocation + registry expansion** (`prompts/03_writers.md`) — - Move `arrow_utils.py`/`write_utils.py` into `io/` (re-export shims at old paths, removed - in W6). Build `io/writers.py`: `write_models()` dispatch, `WriteResult` frozen dataclass, - `WRITABLE_CLASSES` discovery tuple, pass-through `_validation_hook` for W5 to swap, plus - `write_projection_matrix()` (the one non-`write_models` public writer, justified by its - non-uniform signature). 
**No per-class wrappers** — `write_models` infers the class. - Land `populate_region_coverage` in `io/write_utils.py`. **Expand the registry** by - prototyping each remaining writable class one at a time (notebook write → registry entry - → smoke test). `CellFeatureMatrix` stays in the registry under `wide_parquet` mode. - Blocked by W1, W2. +- [x] **W3 — Writers + relocation + registry expansion** (`prompts/03_writers.md`) — + Moved `arrow_utils.py`/`write_utils.py` into `io/` (re-export shims at old paths, to + be removed in W6). Built `io/writers.py`: `write_models()` dispatch, `WriteResult` + frozen dataclass, `WRITABLE_CLASSES` discovery tuple, pass-through `_validation_hook` + for W5 to swap, plus `write_projection_matrix()` (the one non-`write_models` public + writer, justified by its non-uniform signature). **No per-class wrappers** — + `write_models` infers the class. `populate_region_coverage` landed in + `io/write_utils.py`. Registry expanded to 12 entries (added `Cluster`, + `ClusterHierarchy`, `ClusterMembership`, `MappingSet`, `CellToClusterMapping`, + `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, + `ProjectionMeasurementMatrix`); `CellToCellMapping` / `ClusterToClusterMapping` / + `AlgorithmRun` deferred (no notebook writes them this round). **Deviation:** did + **not** add `wide_parquet` mode — the wide cell-feature Parquet is built from raw + dataframes that don't fit `WriteSpec`'s shape; `CellFeatureMatrix` stays as + `overwrite_scoped` for its metadata-pointer rows. Revisit when the wide-matrix + contract is clarified. Tests: `tests/test_writers.py`, `tests/test_write_relocation.py` + (full suite 119 passing). - [ ] **W4 — Public API** (`prompts/04_public_api.md`) — `io/__init__.py`: curated re-exports + `__all__`. The user-facing surface; defines what autocomplete shows. Blocked by W3. 
diff --git a/src/connects_common_connectivity/arrow_utils.py b/src/connects_common_connectivity/arrow_utils.py index e9000e5..1e2f661 100644 --- a/src/connects_common_connectivity/arrow_utils.py +++ b/src/connects_common_connectivity/arrow_utils.py @@ -1,353 +1,7 @@ -"""Utilities for converting Pydantic (LinkML-derived) models to PyArrow Tables efficiently. +"""Deprecated re-export shim — moved to :mod:`connects_common_connectivity.io.arrow_utils`. -Design goals: -- Avoid JSON serialization; work directly with Python-native values. -- Normalize Enum instances, nested Pydantic models, and object references. -- Provide schema construction from Pydantic field annotations for stability. -- Support column-oriented batch conversion for speed and memory efficiency. +This module exists to avoid breaking notebooks and external imports while the +codebase is mid-migration to the ``io/`` layer. It will be removed in W6. """ -from __future__ import annotations - -from enum import Enum -from typing import Any, Iterable, List, Dict, Union, get_origin, get_args -import hashlib -from datetime import datetime, date - -import pyarrow as pa - -try: - from pydantic import BaseModel # type: ignore -except ImportError: # pragma: no cover - class BaseModel: # minimal placeholder to satisfy type hints - pass - -# Primitive Python -> Arrow type mapping (extend as needed) -PRIMITIVE_TYPE_MAP: Dict[Any, pa.DataType] = { - int: pa.int64(), - float: pa.float64(), - bool: pa.bool_(), - str: pa.string(), - bytes: pa.binary(), - datetime: pa.timestamp("us"), - date: pa.date32(), -} - - -def normalize_value(v: Any) -> Any: - """Recursively normalize a value so Arrow can ingest it. 
- - - Enum -> underlying value (typically str) - - Pydantic model -> dict of normalized fields - - list/dict -> element-wise normalization - - Other primitives unchanged - """ - if isinstance(v, Enum): - return v.value - if hasattr(v, "model_dump") and callable(getattr(v, "model_dump")): - return {k: normalize_value(x) for k, x in v.model_dump(mode="python").items()} - if isinstance(v, list): - return [normalize_value(x) for x in v] - if isinstance(v, dict): - return {k: normalize_value(x) for k, x in v.items()} - return v - - -def flatten_refs(row: Dict[str, Any]) -> Dict[str, Any]: - """Flatten object reference dicts to id form when they contain an identifier field. - - Example: {'parent_identifier': {'id': 'BR123'}} -> {'parent_identifier_id': 'BR123'} - Leaves the original key removed for simpler Arrow schema. - """ - for key, val in list(row.items()): - # Single embedded reference -> promote to *_id - if isinstance(val, dict): - ident = val.get("id") or val.get("identifier") - if ident is not None: # heuristic: treat as ref not complex struct - row[f"{key}"] = ident - continue - # List of simple embedded references -> replace with list of ids - if isinstance(val, list) and val and all(isinstance(x, dict) for x in val): - ids: List[Any] = [] - simple = True - for x in val: - ident = x.get("id") or x.get("identifier") if isinstance(x, dict) else None - if ident is None: - simple = False - break - ids.append(ident) - if simple: - row[key] = ids - return row - - -def model_to_row(model: Any, *, flatten: bool = True) -> Dict[str, Any]: - """Convert a single Pydantic model instance to a normalized row dict. - - Parameters - ---------- - model: BaseModel - Pydantic instance. - flatten: bool - If True, attempt to flatten object references to *_id columns. 
- """ - raw = model.model_dump(mode="python", exclude_none=True) - norm = {k: normalize_value(v) for k, v in raw.items()} - return flatten_refs(norm) if flatten else norm - - -def _arrow_field_for(name: str, annotation: Any, required: bool) -> pa.Field: - """Infer an Arrow Field from a Python type annotation. - - Handles Optional[T], List[T], Enums, and basic primitives. - Fallback is string. - """ - nullable = not required - origin = get_origin(annotation) - args = get_args(annotation) - - # Optional / Union[T, None] - if origin is Union and type(None) in args: - non_none = [a for a in args if a is not type(None)] - if non_none: - annotation = non_none[0] - nullable = True - origin = get_origin(annotation) - args = get_args(annotation) - - # List[T] - if origin is list and args: - inner = args[0] - inner_field = _arrow_field_for(name, inner, required=True) - return pa.field(name, pa.list_(inner_field.type), nullable=True) - - # Enum - if isinstance(annotation, type) and issubclass(annotation, Enum): - return pa.field(name, pa.string(), nullable=nullable) - - # Primitive direct map - if annotation in PRIMITIVE_TYPE_MAP: - return pa.field(name, PRIMITIVE_TYPE_MAP[annotation], nullable=nullable) - - # Fallback struct for embedded BaseModel? treat as JSON/string for now. - return pa.field(name, pa.string(), nullable=nullable) - - -def build_arrow_schema(model_cls: Any) -> pa.Schema: - """Build a stable Arrow schema from a Pydantic model class. - - Uses model_fields (Pydantic v2) annotations; unknown types default to string. 
- """ - fields: List[pa.Field] = [] - model_fields = getattr(model_cls, "model_fields", {}) - for fname, finfo in model_fields.items(): - annotation = getattr(finfo, "annotation", Any) - required = finfo.is_required() - fields.append(_arrow_field_for(fname, annotation, required)) - return pa.schema(fields) - - -def models_to_table(models: Iterable[Any], schema: Union[pa.Schema, None] = None, *, flatten: bool = True) -> pa.Table: - """Convert an iterable of Pydantic models to a PyArrow Table. - - If no schema is provided, one is generated from the class of the first model. - Column-oriented assembly for speed (single pass). Nested dicts become stringified. - """ - models = list(models) - if not models: - return pa.Table.from_arrays([], schema=schema or pa.schema([])) - - # Build schema if absent - if schema is None: - schema = build_arrow_schema(models[0].__class__) - # Schema must be a pyarrow.Schema now - assert isinstance(schema, pa.Schema), "Schema construction failed" - - # Initialize column buffers - buffers: Dict[str, List[Any]] = {field.name: [] for field in schema} # type: ignore[arg-type] - - for m in models: - row: Dict[str, Any] = model_to_row(m, flatten=flatten) - for field in schema: # type: ignore[arg-type] - val = row.get(field.name) - if isinstance(val, dict): - val = str(val) - # If list contains dicts, attempt to reduce each dict to id/identifier, else stringify - if isinstance(val, list) and val and any(isinstance(x, dict) for x in val): - reduced: List[Any] = [] - for x in val: - if isinstance(x, dict): - ident = x.get("id") or x.get("identifier") - reduced.append(ident if ident is not None else str(x)) - else: - reduced.append(x) - val = reduced - buffers[field.name].append(val) - # Build arrays now that buffers are filled - arrays: List[pa.Array] = [] - for field in schema: # type: ignore[arg-type] - arrays.append(pa.array(buffers[field.name], type=field.type)) - return pa.Table.from_arrays(arrays, schema=schema) -def 
_schema_fingerprint(schema: pa.Schema) -> str: - """Create a stable fingerprint for an Arrow schema (field names + types).""" - parts = [f"{f.name}:{f.type}" for f in schema] - digest = hashlib.sha256("|".join(parts).encode()).hexdigest() - return f"sha256:{digest}" - - -def attach_linkml_metadata(table: pa.Table, *, linkml_class: str, linkml_schema_version: str | None = None) -> pa.Table: - """Attach LinkML metadata (class, optional schema version, schema fingerprint) to an Arrow table. - - Parameters - ---------- - table : pa.Table - Table to decorate. - linkml_class : str - Name of the LinkML class represented by rows. - linkml_schema_version : str, optional - Version string of the LinkML schema. - """ - meta = dict(table.schema.metadata or {}) - meta.setdefault(b"linkml_class", linkml_class.encode()) - if linkml_schema_version is None: - try: # pragma: no cover - from connects_common_connectivity import __version__ # type: ignore - linkml_schema_version = __version__ - except Exception: - linkml_schema_version = None - if linkml_schema_version: - meta.setdefault(b"linkml_schema_version", str(linkml_schema_version).encode()) - meta.setdefault(b"schema_fingerprint", _schema_fingerprint(table.schema).encode()) - new_schema = table.schema.with_metadata(meta) # type: ignore[arg-type] - return table.replace_schema_metadata(new_schema.metadata) - - -__all__ = [ - "normalize_value", - "flatten_refs", - "model_to_row", - "build_arrow_schema", - "models_to_table", - "attach_linkml_metadata", - "build_cell_feature_matrix_schema", -] - -# --------------------------------------------------------------------------- -# Feature Matrix Schema Construction (Wide Parquet) -# --------------------------------------------------------------------------- -def _numpy_typestr_to_arrow(dtype_str: str) -> pa.DataType: - """Map a NumPy typestr (e.g. ' duration[ns]), M (datetime -> timestamp[ns]), S (bytes), U (unicode string), - O (object -> string), V (void/raw -> binary). 
- Complex 'c' not natively supported; raise for now. - """ - if not isinstance(dtype_str, str): # fail-safe - return pa.string() - dtype_str = dtype_str.strip() - # Basic validation: must match pattern like ', '|', '=') else dtype_str[0] - # Extract item size digits - import re - m = re.search(r'(\d+)$', dtype_str) - size = int(m.group(1)) if m else None - # Map kind - if kind == 't': - return pa.bool_() - if kind == 'b': - return pa.int8() - if kind == 'i': - # Signed integers - if size == 1: - return pa.int8() - if size == 2: - return pa.int16() - if size == 4: - return pa.int32() - if size == 8: - return pa.int64() - return pa.int32() # default - if kind == 'u': - if size == 1: - return pa.uint8() - if size == 2: - return pa.uint16() - if size == 4: - return pa.uint32() - if size == 8: - return pa.uint64() - return pa.uint32() - if kind == 'f': - if size == 2: # half precision not always supported; treat as float32 - return pa.float32() - if size == 4: - return pa.float32() - if size == 8: - return pa.float64() - return pa.float32() - if kind == 'm': # timedelta - return pa.duration('ns') - if kind == 'M': # datetime - return pa.timestamp('ns') - if kind == 'S': # bytes length-limited; store as binary - return pa.binary() - if kind == 'U': # unicode string - return pa.string() - if kind == 'O': # generic object -> string - return pa.string() - if kind == 'V': # raw data - return pa.binary() - if kind == 'c': # complex number not directly supported: represent as fixed-size list of two floats? - return pa.list_(pa.float64(), list_size=2) - return pa.string() - -def build_cell_feature_matrix_schema(cell_feature_set: Any, feature_definitions: Iterable[Any], *, cell_index_column: str = "id") -> pa.Schema: - """Construct a PyArrow schema for a wide CellFeatureMatrix Parquet file. - - Parameters - ---------- - cell_feature_set : CellFeatureSet (Pydantic/LinkML instance or object with 'id') - The feature set defining which features appear. 
- feature_definitions : Iterable[CellFeatureDefinition] - Collection of feature definition instances (must have id, data_type, unit, description). - cell_index_column : str, default 'id' - Name of the column holding DataItem identifiers (row index semantics). - - Returns - ------- - pyarrow.Schema - Schema with first field the cell index column (string) followed by one column per feature id - using mapped Arrow types and embedding metadata (feature_id, unit, dtype, description). - """ - fields: List[pa.Field] = [] - # Cell index column always string (DataItemId ultimately string in model) - fields.append(pa.field(cell_index_column, pa.string(), nullable=False, metadata={"role": "cell_index"})) - fields.append(pa.field('project_id', pa.string(), nullable=False, metadata={"description": "Project identifier"})) - fields.append(pa.field('feature_set_id', pa.string(), nullable=False, metadata={"description": "CellFeatureSet identifier"})) - for fd in feature_definitions: - fid = getattr(fd, 'id', None) - dtype_str = getattr(fd, 'data_type', None) or '' - unit = getattr(fd, 'unit', None) - desc = getattr(fd, 'description', None) - if not fid: - continue # skip invalid definition - arrow_type = _numpy_typestr_to_arrow(dtype_str) - meta: Dict[str, bytes] = {} - meta['feature_id'] = str(fid).encode() - if unit: - meta['unit'] = str(unit).encode() - if dtype_str: - meta['dtype'] = str(dtype_str).encode() - if desc: - meta['description'] = str(desc).encode() - fields.append(pa.field(str(fid), arrow_type, nullable=True, metadata=meta)) - - schema = pa.schema(fields, metadata={ - b'linkml_class': b'CellFeatureMatrix', - b'feature_set_id': str(getattr(cell_feature_set, 'id', 'UNKNOWN')).encode(), - b'schema_fingerprint': _schema_fingerprint(pa.schema(fields)).encode(), - }) - return schema - +from .io.arrow_utils import * # noqa: F401,F403 (deprecated; removed in W6) +from .io.arrow_utils import __all__ # noqa: F401 diff --git a/src/connects_common_connectivity/io/arrow_utils.py 
b/src/connects_common_connectivity/io/arrow_utils.py new file mode 100644 index 0000000..e9000e5 --- /dev/null +++ b/src/connects_common_connectivity/io/arrow_utils.py @@ -0,0 +1,353 @@ +"""Utilities for converting Pydantic (LinkML-derived) models to PyArrow Tables efficiently. + +Design goals: +- Avoid JSON serialization; work directly with Python-native values. +- Normalize Enum instances, nested Pydantic models, and object references. +- Provide schema construction from Pydantic field annotations for stability. +- Support column-oriented batch conversion for speed and memory efficiency. +""" +from __future__ import annotations + +from enum import Enum +from typing import Any, Iterable, List, Dict, Union, get_origin, get_args +import hashlib +from datetime import datetime, date + +import pyarrow as pa + +try: + from pydantic import BaseModel # type: ignore +except ImportError: # pragma: no cover + class BaseModel: # minimal placeholder to satisfy type hints + pass + +# Primitive Python -> Arrow type mapping (extend as needed) +PRIMITIVE_TYPE_MAP: Dict[Any, pa.DataType] = { + int: pa.int64(), + float: pa.float64(), + bool: pa.bool_(), + str: pa.string(), + bytes: pa.binary(), + datetime: pa.timestamp("us"), + date: pa.date32(), +} + + +def normalize_value(v: Any) -> Any: + """Recursively normalize a value so Arrow can ingest it. 
+ + - Enum -> underlying value (typically str) + - Pydantic model -> dict of normalized fields + - list/dict -> element-wise normalization + - Other primitives unchanged + """ + if isinstance(v, Enum): + return v.value + if hasattr(v, "model_dump") and callable(getattr(v, "model_dump")): + return {k: normalize_value(x) for k, x in v.model_dump(mode="python").items()} + if isinstance(v, list): + return [normalize_value(x) for x in v] + if isinstance(v, dict): + return {k: normalize_value(x) for k, x in v.items()} + return v + + +def flatten_refs(row: Dict[str, Any]) -> Dict[str, Any]: + """Flatten object reference dicts to id form when they contain an identifier field. + + Example: {'parent_identifier': {'id': 'BR123'}} -> {'parent_identifier_id': 'BR123'} + Leaves the original key removed for simpler Arrow schema. + """ + for key, val in list(row.items()): + # Single embedded reference -> promote to *_id + if isinstance(val, dict): + ident = val.get("id") or val.get("identifier") + if ident is not None: # heuristic: treat as ref not complex struct + row[f"{key}"] = ident + continue + # List of simple embedded references -> replace with list of ids + if isinstance(val, list) and val and all(isinstance(x, dict) for x in val): + ids: List[Any] = [] + simple = True + for x in val: + ident = x.get("id") or x.get("identifier") if isinstance(x, dict) else None + if ident is None: + simple = False + break + ids.append(ident) + if simple: + row[key] = ids + return row + + +def model_to_row(model: Any, *, flatten: bool = True) -> Dict[str, Any]: + """Convert a single Pydantic model instance to a normalized row dict. + + Parameters + ---------- + model: BaseModel + Pydantic instance. + flatten: bool + If True, attempt to flatten object references to *_id columns. 
+ """ + raw = model.model_dump(mode="python", exclude_none=True) + norm = {k: normalize_value(v) for k, v in raw.items()} + return flatten_refs(norm) if flatten else norm + + +def _arrow_field_for(name: str, annotation: Any, required: bool) -> pa.Field: + """Infer an Arrow Field from a Python type annotation. + + Handles Optional[T], List[T], Enums, and basic primitives. + Fallback is string. + """ + nullable = not required + origin = get_origin(annotation) + args = get_args(annotation) + + # Optional / Union[T, None] + if origin is Union and type(None) in args: + non_none = [a for a in args if a is not type(None)] + if non_none: + annotation = non_none[0] + nullable = True + origin = get_origin(annotation) + args = get_args(annotation) + + # List[T] + if origin is list and args: + inner = args[0] + inner_field = _arrow_field_for(name, inner, required=True) + return pa.field(name, pa.list_(inner_field.type), nullable=True) + + # Enum + if isinstance(annotation, type) and issubclass(annotation, Enum): + return pa.field(name, pa.string(), nullable=nullable) + + # Primitive direct map + if annotation in PRIMITIVE_TYPE_MAP: + return pa.field(name, PRIMITIVE_TYPE_MAP[annotation], nullable=nullable) + + # Fallback struct for embedded BaseModel? treat as JSON/string for now. + return pa.field(name, pa.string(), nullable=nullable) + + +def build_arrow_schema(model_cls: Any) -> pa.Schema: + """Build a stable Arrow schema from a Pydantic model class. + + Uses model_fields (Pydantic v2) annotations; unknown types default to string. 
+ """ + fields: List[pa.Field] = [] + model_fields = getattr(model_cls, "model_fields", {}) + for fname, finfo in model_fields.items(): + annotation = getattr(finfo, "annotation", Any) + required = finfo.is_required() + fields.append(_arrow_field_for(fname, annotation, required)) + return pa.schema(fields) + + +def models_to_table(models: Iterable[Any], schema: Union[pa.Schema, None] = None, *, flatten: bool = True) -> pa.Table: + """Convert an iterable of Pydantic models to a PyArrow Table. + + If no schema is provided, one is generated from the class of the first model. + Column-oriented assembly for speed (single pass). Nested dicts become stringified. + """ + models = list(models) + if not models: + return pa.Table.from_arrays([], schema=schema or pa.schema([])) + + # Build schema if absent + if schema is None: + schema = build_arrow_schema(models[0].__class__) + # Schema must be a pyarrow.Schema now + assert isinstance(schema, pa.Schema), "Schema construction failed" + + # Initialize column buffers + buffers: Dict[str, List[Any]] = {field.name: [] for field in schema} # type: ignore[arg-type] + + for m in models: + row: Dict[str, Any] = model_to_row(m, flatten=flatten) + for field in schema: # type: ignore[arg-type] + val = row.get(field.name) + if isinstance(val, dict): + val = str(val) + # If list contains dicts, attempt to reduce each dict to id/identifier, else stringify + if isinstance(val, list) and val and any(isinstance(x, dict) for x in val): + reduced: List[Any] = [] + for x in val: + if isinstance(x, dict): + ident = x.get("id") or x.get("identifier") + reduced.append(ident if ident is not None else str(x)) + else: + reduced.append(x) + val = reduced + buffers[field.name].append(val) + # Build arrays now that buffers are filled + arrays: List[pa.Array] = [] + for field in schema: # type: ignore[arg-type] + arrays.append(pa.array(buffers[field.name], type=field.type)) + return pa.Table.from_arrays(arrays, schema=schema) +def 
_schema_fingerprint(schema: pa.Schema) -> str: + """Create a stable fingerprint for an Arrow schema (field names + types).""" + parts = [f"{f.name}:{f.type}" for f in schema] + digest = hashlib.sha256("|".join(parts).encode()).hexdigest() + return f"sha256:{digest}" + + +def attach_linkml_metadata(table: pa.Table, *, linkml_class: str, linkml_schema_version: str | None = None) -> pa.Table: + """Attach LinkML metadata (class, optional schema version, schema fingerprint) to an Arrow table. + + Parameters + ---------- + table : pa.Table + Table to decorate. + linkml_class : str + Name of the LinkML class represented by rows. + linkml_schema_version : str, optional + Version string of the LinkML schema. + """ + meta = dict(table.schema.metadata or {}) + meta.setdefault(b"linkml_class", linkml_class.encode()) + if linkml_schema_version is None: + try: # pragma: no cover + from connects_common_connectivity import __version__ # type: ignore + linkml_schema_version = __version__ + except Exception: + linkml_schema_version = None + if linkml_schema_version: + meta.setdefault(b"linkml_schema_version", str(linkml_schema_version).encode()) + meta.setdefault(b"schema_fingerprint", _schema_fingerprint(table.schema).encode()) + new_schema = table.schema.with_metadata(meta) # type: ignore[arg-type] + return table.replace_schema_metadata(new_schema.metadata) + + +__all__ = [ + "normalize_value", + "flatten_refs", + "model_to_row", + "build_arrow_schema", + "models_to_table", + "attach_linkml_metadata", + "build_cell_feature_matrix_schema", +] + +# --------------------------------------------------------------------------- +# Feature Matrix Schema Construction (Wide Parquet) +# --------------------------------------------------------------------------- +def _numpy_typestr_to_arrow(dtype_str: str) -> pa.DataType: + """Map a NumPy typestr (e.g. ' duration[ns]), M (datetime -> timestamp[ns]), S (bytes), U (unicode string), + O (object -> string), V (void/raw -> binary). 
+ Complex 'c' not natively supported; raise for now. + """ + if not isinstance(dtype_str, str): # fail-safe + return pa.string() + dtype_str = dtype_str.strip() + # Basic validation: must match pattern like ', '|', '=') else dtype_str[0] + # Extract item size digits + import re + m = re.search(r'(\d+)$', dtype_str) + size = int(m.group(1)) if m else None + # Map kind + if kind == 't': + return pa.bool_() + if kind == 'b': + return pa.int8() + if kind == 'i': + # Signed integers + if size == 1: + return pa.int8() + if size == 2: + return pa.int16() + if size == 4: + return pa.int32() + if size == 8: + return pa.int64() + return pa.int32() # default + if kind == 'u': + if size == 1: + return pa.uint8() + if size == 2: + return pa.uint16() + if size == 4: + return pa.uint32() + if size == 8: + return pa.uint64() + return pa.uint32() + if kind == 'f': + if size == 2: # half precision not always supported; treat as float32 + return pa.float32() + if size == 4: + return pa.float32() + if size == 8: + return pa.float64() + return pa.float32() + if kind == 'm': # timedelta + return pa.duration('ns') + if kind == 'M': # datetime + return pa.timestamp('ns') + if kind == 'S': # bytes length-limited; store as binary + return pa.binary() + if kind == 'U': # unicode string + return pa.string() + if kind == 'O': # generic object -> string + return pa.string() + if kind == 'V': # raw data + return pa.binary() + if kind == 'c': # complex number not directly supported: represent as fixed-size list of two floats? + return pa.list_(pa.float64(), list_size=2) + return pa.string() + +def build_cell_feature_matrix_schema(cell_feature_set: Any, feature_definitions: Iterable[Any], *, cell_index_column: str = "id") -> pa.Schema: + """Construct a PyArrow schema for a wide CellFeatureMatrix Parquet file. + + Parameters + ---------- + cell_feature_set : CellFeatureSet (Pydantic/LinkML instance or object with 'id') + The feature set defining which features appear. 
+ feature_definitions : Iterable[CellFeatureDefinition] + Collection of feature definition instances (must have id, data_type, unit, description). + cell_index_column : str, default 'id' + Name of the column holding DataItem identifiers (row index semantics). + + Returns + ------- + pyarrow.Schema + Schema with first field the cell index column (string) followed by one column per feature id + using mapped Arrow types and embedding metadata (feature_id, unit, dtype, description). + """ + fields: List[pa.Field] = [] + # Cell index column always string (DataItemId ultimately string in model) + fields.append(pa.field(cell_index_column, pa.string(), nullable=False, metadata={"role": "cell_index"})) + fields.append(pa.field('project_id', pa.string(), nullable=False, metadata={"description": "Project identifier"})) + fields.append(pa.field('feature_set_id', pa.string(), nullable=False, metadata={"description": "CellFeatureSet identifier"})) + for fd in feature_definitions: + fid = getattr(fd, 'id', None) + dtype_str = getattr(fd, 'data_type', None) or '' + unit = getattr(fd, 'unit', None) + desc = getattr(fd, 'description', None) + if not fid: + continue # skip invalid definition + arrow_type = _numpy_typestr_to_arrow(dtype_str) + meta: Dict[str, bytes] = {} + meta['feature_id'] = str(fid).encode() + if unit: + meta['unit'] = str(unit).encode() + if dtype_str: + meta['dtype'] = str(dtype_str).encode() + if desc: + meta['description'] = str(desc).encode() + fields.append(pa.field(str(fid), arrow_type, nullable=True, metadata=meta)) + + schema = pa.schema(fields, metadata={ + b'linkml_class': b'CellFeatureMatrix', + b'feature_set_id': str(getattr(cell_feature_set, 'id', 'UNKNOWN')).encode(), + b'schema_fingerprint': _schema_fingerprint(pa.schema(fields)).encode(), + }) + return schema + diff --git a/src/connects_common_connectivity/io/write_spec.py b/src/connects_common_connectivity/io/write_spec.py new file mode 100644 index 0000000..d7bd233 --- /dev/null +++ 
b/src/connects_common_connectivity/io/write_spec.py @@ -0,0 +1,171 @@ +"""Write-spec registry for IO-layer Delta writers. + +A :class:`WriteSpec` describes how a generated pydantic model is persisted into +the shared Delta lake: which subdirectory, which partition columns, which scope +columns, and which write mode the backend should dispatch on. + +The seed entries needed to unblock W3 are ``DataSet``, ``DataItem``, and +``DataItemDataSetAssociation``. The remaining classes registered below were +added as their writers were prototyped in W3. See +``planning/prompts/03_writers.md``. +""" + +from __future__ import annotations + +from typing import Literal + +from pydantic import BaseModel, ConfigDict + +from ..models import ( + CellFeatureDefinition, + CellFeatureMatrix, + CellFeatureSet, + CellToClusterMapping, + Cluster, + ClusterHierarchy, + ClusterMembership, + DataItem, + DataItemDataSetAssociation, + DataSet, + MappingSet, + ProjectionMeasurementMatrix, +) + + +class WriteSpec(BaseModel): + """Declarative description of how a model class is written to Delta.""" + + model_config = ConfigDict(arbitrary_types_allowed=True) + + model_cls: type + subdir: str + partition_by: list[str] + scope_columns: list[str] + write_mode: Literal["overwrite_scoped", "append_new_by_id"] + required_for_write: list[str] = [] + cross_field_rules: list[str] = [] + + +REGISTRY: dict[str, WriteSpec] = { + "DataSet": WriteSpec( + model_cls=DataSet, + subdir="dataset", + partition_by=["project_id"], + # patchseq fix: today's notebooks predicate only on project_id, which + # is why visp_inh_patchseq overwrites visp_exc_patchseq. Scoping on + # (project_id, id) keeps each DataSet row independent. 
+ scope_columns=["project_id", "id"], + write_mode="overwrite_scoped", + ), + "DataItem": WriteSpec( + model_cls=DataItem, + subdir="dataitem", + partition_by=["project_id"], + scope_columns=["id"], + write_mode="append_new_by_id", + ), + "DataItemDataSetAssociation": WriteSpec( + model_cls=DataItemDataSetAssociation, + subdir="dataitem_dataset_association", + partition_by=["project_id"], + scope_columns=["project_id", "dataset_id"], + write_mode="overwrite_scoped", + ), + # Cluster taxonomy is project-agnostic in the schema — Cluster and + # ClusterHierarchy do not carry project_id. Scope is the hierarchy id + # (Cluster) or the row id (ClusterHierarchy), matching the existing + # cluster ETL notebooks. + "Cluster": WriteSpec( + model_cls=Cluster, + subdir="cluster", + partition_by=["hierarchy_id"], + scope_columns=["hierarchy_id"], + write_mode="overwrite_scoped", + ), + "ClusterHierarchy": WriteSpec( + model_cls=ClusterHierarchy, + subdir="clusterhierarchy", + partition_by=[], + scope_columns=["id"], + write_mode="overwrite_scoped", + ), + "ClusterMembership": WriteSpec( + model_cls=ClusterMembership, + subdir="clustermembership", + partition_by=["project_id", "hierarchy_id"], + scope_columns=["project_id", "hierarchy_id"], + write_mode="overwrite_scoped", + ), + "MappingSet": WriteSpec( + model_cls=MappingSet, + subdir="mappingset", + partition_by=["project_id"], + scope_columns=["project_id", "id"], + write_mode="overwrite_scoped", + ), + "CellToClusterMapping": WriteSpec( + model_cls=CellToClusterMapping, + subdir="celltoclustermapping", + partition_by=["project_id"], + # Notebooks predicate on (project_id, mapping_set), which is the + # mapping-set foreign key on the row. 
+ scope_columns=["project_id", "mapping_set"], + write_mode="overwrite_scoped", + ), + "CellFeatureSet": WriteSpec( + model_cls=CellFeatureSet, + subdir="cellfeatureset", + partition_by=["project_id"], + scope_columns=["project_id", "id"], + write_mode="overwrite_scoped", + ), + "CellFeatureDefinition": WriteSpec( + model_cls=CellFeatureDefinition, + subdir="cellfeaturedefinition", + partition_by=["project_id", "feature_set_id"], + scope_columns=["project_id", "feature_set_id"], + write_mode="overwrite_scoped", + ), + "CellFeatureMatrix": WriteSpec( + model_cls=CellFeatureMatrix, + subdir="cellfeaturematrix", + partition_by=["project_id"], + scope_columns=["project_id", "feature_set_id"], + # CellFeatureMatrix rows are metadata pointers (one row per matrix); + # the wide-form numeric Parquet at ``cellfeatures/{feature_set_id}/`` + # is built from raw dataframes in the notebook, not from a model + # instance, so it does not flow through ``write_models`` and stays + # outside the registry. See planning/prompts/03_writers.md report. + write_mode="overwrite_scoped", + ), + "ProjectionMeasurementMatrix": WriteSpec( + model_cls=ProjectionMeasurementMatrix, + subdir="projectionmeasurementmatrix", + # ProjectionMeasurementMatrix is not ProjectScoped (schema gap noted + # in etl_wnm_exc_04). The notebook predicate is therefore ``id IN (...)`` + # only, with no partition columns. Once the schema gains + # ``ProjectScoped``, partition_by/scope_columns should be widened. + partition_by=[], + scope_columns=["id"], + write_mode="overwrite_scoped", + ), +} + + +def get_spec(model_or_cls: type | BaseModel) -> WriteSpec: + """Look up the :class:`WriteSpec` for a model class or instance. + + Accepts either the generated pydantic class itself or an instance of it, + keyed by ``__name__`` of the class. 
+ """ + cls = model_or_cls if isinstance(model_or_cls, type) else type(model_or_cls) + try: + return REGISTRY[cls.__name__] + except KeyError as err: + raise KeyError( + f"No WriteSpec registered for {cls.__name__!r}. " + f"Known: {sorted(REGISTRY)}" + ) from err + + +__all__ = ["WriteSpec", "REGISTRY", "get_spec"] diff --git a/src/connects_common_connectivity/io/write_utils.py b/src/connects_common_connectivity/io/write_utils.py new file mode 100644 index 0000000..91a8b52 --- /dev/null +++ b/src/connects_common_connectivity/io/write_utils.py @@ -0,0 +1,173 @@ +"""Idempotent write helpers for Delta Lake tables shared across notebooks.""" +from __future__ import annotations + +from typing import Any, Iterator, Mapping, Optional, Tuple + +import pyarrow as pa +import pyarrow.compute as pc +from deltalake import write_deltalake + +__all__ = [ + "append_new_dataitems", + "populate_region_coverage", + "walk_ancestors", +] + + +def walk_ancestors( + leaf_id: str, + parent_of: Mapping[str, Optional[str]], +) -> Iterator[Tuple[str, bool]]: + """Yield ``(cluster_id, is_leaf)`` from a leaf cluster up to the root. + + Used by cluster-membership / cell-to-cluster-mapping notebooks to + denormalize the hierarchy into the membership/mapping table so that + consumers can filter at any level without a recursive cluster join. + The first yielded tuple has ``is_leaf=True``; all ancestors yield + ``is_leaf=False``. The walk terminates when ``parent_of[current]`` is + ``None`` (the root). + + Parameters + ---------- + leaf_id: + Cluster id to start from. Must be a key in ``parent_of``. + parent_of: + Mapping from cluster id to parent id, with ``None`` for the + root. Typically built as + ``dict(zip(cluster_df["id"], cluster_df["parent"]))`` filtered to + a single ``hierarchy_id``. + + Yields + ------ + tuple[str, bool] + ``(cluster_id, is_leaf)`` pairs from leaf to root, inclusive. 
+ + Raises + ------ + KeyError + If ``leaf_id`` is not a key in ``parent_of`` (the caller should + validate cluster ids against the registered taxonomy first and + fail loudly on unknowns). + """ + if leaf_id not in parent_of: + raise KeyError(leaf_id) + cur: Optional[str] = leaf_id + is_leaf = True + while cur is not None: + yield cur, is_leaf + is_leaf = False + cur = parent_of.get(cur) + + +def append_new_dataitems( + output_path: str, + table: pa.Table, + *, + project_id: str, + id_column: str = "id", +) -> int: + """Append only rows whose ``id`` is not already in the Delta table for this project. + + Safe to call from multiple notebooks that share the same ``project_id`` partition + (e.g. ``visp_inh_patchseq_01`` and ``visp_exc_patchseq_01`` both write to + ``dataitem/`` under ``project_id='visp_patchseq'``). Unlike a scoped overwrite, + this function never removes rows written by another notebook. + + Idempotent: re-running with the same rows appends nothing and returns 0. + Handles the case where the Delta table does not yet exist. + + Parameters + ---------- + output_path: + Path to the Delta table directory. + table: + PyArrow table of candidate rows to append. + project_id: + Value used to filter existing rows before checking for duplicates. + id_column: + Name of the id column to deduplicate on. Defaults to ``"id"``. + + Returns + ------- + int + Number of rows actually appended (0 if all were already present). + """ + existing_ids: set[str] = set() + try: + import polars as pl + + existing_ids = set( + pl.read_delta(output_path) + .filter(pl.col("project_id") == project_id)[id_column] + .to_list() + ) + except Exception: + # Table doesn't exist yet, or read failed — treat all rows as new. 
+ pass + + if existing_ids: + id_array = table.column(id_column) + existing_array = pa.array(list(existing_ids), type=id_array.type) + in_existing = pc.is_in(id_array, value_set=existing_array) + new_rows = table.filter(pc.invert(in_existing)) + else: + new_rows = table + + if new_rows.num_rows == 0: + return 0 + + write_deltalake(output_path, new_rows, mode="append", partition_by=["project_id"]) + return new_rows.num_rows + + +def populate_region_coverage(pmm: Any, matrix: Any) -> Any: + """Return a copy of ``pmm`` with ``region_coverage`` derived from ``matrix``. + + ``region_coverage`` is the subset of ``pmm.region_index`` whose + corresponding column in the dense ``matrix`` has at least one non-zero + value. Pure function: the input ``pmm`` is not mutated. + + Parameters + ---------- + pmm: + A :class:`ProjectionMeasurementMatrix` instance with ``region_index`` + already populated. + matrix: + Dense numeric array of shape + ``(len(pmm.data_item_index), len(pmm.region_index))`` — typically a + NumPy ``ndarray``, but anything that supports element-wise truthiness + plus column-wise ``any()`` works. + + Returns + ------- + ProjectionMeasurementMatrix + A new instance equal to ``pmm`` except that ``region_coverage`` is + the list of region ids with at least one non-zero entry, in the + order they appear in ``region_index``. + + Raises + ------ + ValueError + If ``pmm.region_index`` is missing or its length does not match + ``matrix.shape[1]``. 
+ """ + region_index = getattr(pmm, "region_index", None) + if region_index is None: + raise ValueError("pmm.region_index must be set before populating region_coverage") + + import numpy as np + + arr = np.asarray(matrix) + if arr.ndim != 2: + raise ValueError( + f"matrix must be 2D (cells x regions); got shape {arr.shape!r}" + ) + if arr.shape[1] != len(region_index): + raise ValueError( + f"matrix.shape[1] ({arr.shape[1]}) must equal len(region_index) " + f"({len(region_index)})" + ) + + nonzero_cols = np.any(arr != 0, axis=0) + coverage = [r for r, keep in zip(region_index, nonzero_cols.tolist()) if keep] + return pmm.model_copy(update={"region_coverage": coverage}) diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py new file mode 100644 index 0000000..1aba554 --- /dev/null +++ b/src/connects_common_connectivity/io/writers.py @@ -0,0 +1,300 @@ +"""Dispatch core for IO-layer Delta writers. + +A single public entry point — :func:`write_models` — accepts a homogeneous +batch of generated pydantic models and routes the write through the +:class:`~connects_common_connectivity.io.write_spec.WriteSpec` registered +for that class. The only standalone writer is +:func:`write_projection_matrix`, which exists because its signature is +genuinely non-uniform (it accepts a dense matrix alongside the model). + +Class-specific behavior lives in the registry, never here. Callers +discover what is writable via :data:`WRITABLE_CLASSES`. 
+""" + +from __future__ import annotations + +from dataclasses import dataclass +from pathlib import Path +from typing import Any, Iterable, Sequence + +import pyarrow as pa +from deltalake import write_deltalake +from pydantic import BaseModel + +from ..config import Settings, get_settings, table_path +from .arrow_utils import attach_linkml_metadata, build_arrow_schema, models_to_table +from .write_spec import REGISTRY, WriteSpec, get_spec +from .write_utils import append_new_dataitems, populate_region_coverage + +# --------------------------------------------------------------------------- +# Result type +# --------------------------------------------------------------------------- + + +@dataclass(frozen=True) +class WriteResult: + """Return value of a single :func:`write_models` invocation. + + ``predicates`` is one entry per scope group for ``overwrite_scoped`` + writes; an empty tuple for ``append_new_by_id`` (no predicate is + issued — Delta append + id-dedupe handles idempotency). + """ + + class_name: str + path: Path + mode: str + predicates: tuple[str, ...] + rows_written: int + + +# --------------------------------------------------------------------------- +# Discovery +# --------------------------------------------------------------------------- + + +WRITABLE_CLASSES: tuple[type, ...] 
= tuple( + spec.model_cls for spec in REGISTRY.values() +) + + +# --------------------------------------------------------------------------- +# Validation hook (replaced by W5) +# --------------------------------------------------------------------------- + + +def _validation_hook(models: Sequence[BaseModel], spec: WriteSpec) -> Sequence[BaseModel]: + """Pass-through identity hook; W5 monkey-patches this to enforce invariants.""" + return models + + +# --------------------------------------------------------------------------- +# Helpers (private; tested directly) +# --------------------------------------------------------------------------- + + +def _normalize_models(models: Any) -> list[BaseModel]: + """Coerce ``models`` to a list, accepting a single model or any iterable. + + Requires homogeneous type. Empty input is rejected — callers always + know which class they are writing. + """ + if isinstance(models, BaseModel): + return [models] + if isinstance(models, (str, bytes)) or not isinstance(models, Iterable): + raise TypeError( + f"write_models expected a pydantic model or iterable of models; " + f"got {type(models).__name__}" + ) + items = list(models) + if not items: + raise ValueError("write_models received an empty batch") + cls = type(items[0]) + for m in items: + if type(m) is not cls: + raise TypeError( + f"write_models requires homogeneous types; got " + f"{cls.__name__} and {type(m).__name__}" + ) + return items + + +def _format_value(v: Any) -> str: + """Render ``v`` as a single-quoted SQL literal for the Delta predicate.""" + if v is None: + return "NULL" + return "'" + str(v).replace("'", "''") + "'" + + +def _build_predicate(scope_columns: Sequence[str], row_values: Sequence[Any]) -> str: + """Build an AND-joined ``col = 'val'`` predicate for ``write_deltalake``. + + The format is exactly ``col1 = 'val1' AND col2 = 'val2'`` — single + quotes, AND-joined, no extra whitespace beyond the single space around + each operator. 
Notebooks that compose predicates by hand use the same + format; this helper is the canonical implementation. + """ + if len(scope_columns) != len(row_values): + raise ValueError( + f"scope_columns ({len(scope_columns)}) and row_values " + f"({len(row_values)}) length mismatch" + ) + parts = [f"{c} IS NULL" if v is None else f"{c} = {_format_value(v)}" for c, v in zip(scope_columns, row_values)] + return " AND ".join(parts) + + +def _group_by_scope( + table: pa.Table, scope_columns: Sequence[str] +) -> list[tuple[tuple, pa.Table]]: + """Partition ``table`` into one ``(scope_tuple, sub_table)`` per scope group. + + Scope groups preserve row order within each group. Two rows belong to + the same group iff they have equal values across every column in + ``scope_columns``. Order of groups is the order of first appearance. + """ + if not scope_columns: + raise ValueError("scope_columns must be non-empty for overwrite_scoped writes") + + cols = [table.column(c).to_pylist() for c in scope_columns] + keys: list[tuple] = list(zip(*cols)) if cols else [] + + seen: dict[tuple, list[int]] = {} + for i, key in enumerate(keys): + seen.setdefault(key, []).append(i) + + return [(key, table.take(pa.array(idxs))) for key, idxs in seen.items()] + + +# --------------------------------------------------------------------------- +# Dispatch branches +# --------------------------------------------------------------------------- + + +def _dispatch_overwrite_scoped( + table: pa.Table, spec: WriteSpec, path: Path +) -> WriteResult: + """Group by scope, issue one predicated overwrite per group.""" + groups = _group_by_scope(table, spec.scope_columns) + predicates: list[str] = [] + rows_written = 0 + partition_by = spec.partition_by or None + for key, sub in groups: + predicate = _build_predicate(spec.scope_columns, key) + write_deltalake( + str(path), + sub, + mode="overwrite", + predicate=predicate, + partition_by=partition_by, + ) + predicates.append(predicate) + rows_written += sub.num_rows + return WriteResult( + 
class_name=spec.model_cls.__name__, + path=path, + mode="overwrite_scoped", + predicates=tuple(predicates), + rows_written=rows_written, + ) + + +def _dispatch_append_new_by_id( + table: pa.Table, spec: WriteSpec, path: Path +) -> WriteResult: + """Append only rows whose id is new, scoped to a single ``project_id``.""" + if not spec.scope_columns: + raise ValueError( + f"{spec.model_cls.__name__}: scope_columns is empty for append_new_by_id " + f"(expected the id column at index 0)" + ) + id_column = spec.scope_columns[0] + + if "project_id" not in table.column_names: + raise ValueError( + f"{spec.model_cls.__name__}: append_new_by_id requires a 'project_id' " + f"column on every row (got columns {table.column_names!r})" + ) + project_ids = set(table.column("project_id").to_pylist()) + if len(project_ids) != 1: + raise ValueError( + f"{spec.model_cls.__name__}: append_new_by_id requires a single " + f"project_id per call (got {sorted(project_ids)!r}). Split the " + f"batch upstream." + ) + (project_id,) = project_ids + + rows_written = append_new_dataitems( + str(path), table, project_id=project_id, id_column=id_column + ) + return WriteResult( + class_name=spec.model_cls.__name__, + path=path, + mode="append_new_by_id", + predicates=(), + rows_written=rows_written, + ) + + +# --------------------------------------------------------------------------- +# Public API +# --------------------------------------------------------------------------- + + +def write_models(models: Any, *, settings: Settings | None = None) -> WriteResult: + """Write a batch of generated pydantic models to the shared Delta lake. + + The class is inferred from ``models`` and dispatched through its + :class:`WriteSpec` (see :mod:`connects_common_connectivity.io.write_spec`). + No per-class wrapper functions exist; renaming this function eight times + would add no behavior, only drift surface. 
+ + Parameters + ---------- + models: + A single model instance or a non-empty iterable of instances of the + same class. The class must be one of :data:`WRITABLE_CLASSES`. + settings: + Optional explicit settings. Falls back to :func:`get_settings` when + omitted; an explicit ``settings=`` always wins (matches the + precedence documented in :mod:`connects_common_connectivity.config`). + + Returns + ------- + WriteResult + Class name, on-disk path, dispatch mode, the predicates issued (one + per scope group for ``overwrite_scoped``; empty for + ``append_new_by_id``), and the number of rows written. + + Notes + ----- + Writable classes (the registry, in order): + ``DataSet``, ``DataItem``, ``DataItemDataSetAssociation``, + ``Cluster``, ``ClusterHierarchy``, ``ClusterMembership``, + ``MappingSet``, ``CellToClusterMapping``, + ``CellFeatureSet``, ``CellFeatureDefinition``, ``CellFeatureMatrix``, + ``ProjectionMeasurementMatrix``. + Use ``WRITABLE_CLASSES`` to enumerate at runtime. + """ + items = _normalize_models(models) + cls = type(items[0]) + spec = get_spec(cls) + + items = list(_validation_hook(items, spec)) + + settings = settings or get_settings() + schema = build_arrow_schema(cls) + table = models_to_table(items, schema=schema) + table = attach_linkml_metadata(table, linkml_class=cls.__name__) + + path = table_path(settings, spec.subdir) + + if spec.write_mode == "overwrite_scoped": + return _dispatch_overwrite_scoped(table, spec, path) + if spec.write_mode == "append_new_by_id": + return _dispatch_append_new_by_id(table, spec, path) + raise ValueError( + f"{cls.__name__}: unsupported write_mode {spec.write_mode!r}. " + f"Add a dispatch branch in writers.py." + ) + + +def write_projection_matrix( + pmm: Any, matrix: Any, *, settings: Settings | None = None +) -> WriteResult: + """Enrich ``pmm`` with derived ``region_coverage`` and write it. 
+ + The single non-:func:`write_models` public writer, justified by the + non-uniform signature: callers must hand in the dense ``matrix`` + alongside the model so coverage can be derived from it. The input + ``pmm`` is not mutated — :func:`populate_region_coverage` returns a + new instance. + """ + enriched = populate_region_coverage(pmm, matrix) + return write_models(enriched, settings=settings) + + +__all__ = [ + "WRITABLE_CLASSES", + "WriteResult", + "write_models", + "write_projection_matrix", +] diff --git a/src/connects_common_connectivity/write_utils.py b/src/connects_common_connectivity/write_utils.py index 6eee0cc..9c3cd10 100644 --- a/src/connects_common_connectivity/write_utils.py +++ b/src/connects_common_connectivity/write_utils.py @@ -1,114 +1,11 @@ -"""Idempotent write helpers for Delta Lake tables shared across notebooks.""" -from __future__ import annotations - -from typing import Iterator, Mapping, Optional, Tuple - -import pyarrow as pa -import pyarrow.compute as pc -from deltalake import write_deltalake - - -def walk_ancestors( - leaf_id: str, - parent_of: Mapping[str, Optional[str]], -) -> Iterator[Tuple[str, bool]]: - """Yield ``(cluster_id, is_leaf)`` from a leaf cluster up to the root. - - Used by cluster-membership / cell-to-cluster-mapping notebooks to - denormalize the hierarchy into the membership/mapping table so that - consumers can filter at any level without a recursive cluster join. - The first yielded tuple has ``is_leaf=True``; all ancestors yield - ``is_leaf=False``. The walk terminates when ``parent_of[current]`` is - ``None`` (the root). - - Parameters - ---------- - leaf_id: - Cluster id to start from. Must be a key in ``parent_of``. - parent_of: - Mapping from cluster id to parent id, with ``None`` for the - root. Typically built as - ``dict(zip(cluster_df["id"], cluster_df["parent"]))`` filtered to - a single ``hierarchy_id``. 
- - Yields - ------ - tuple[str, bool] - ``(cluster_id, is_leaf)`` pairs from leaf to root, inclusive. - - Raises - ------ - KeyError - If ``leaf_id`` is not a key in ``parent_of`` (the caller should - validate cluster ids against the registered taxonomy first and - fail loudly on unknowns). - """ - if leaf_id not in parent_of: - raise KeyError(leaf_id) - cur: Optional[str] = leaf_id - is_leaf = True - while cur is not None: - yield cur, is_leaf - is_leaf = False - cur = parent_of.get(cur) - - -def append_new_dataitems( - output_path: str, - table: pa.Table, - *, - project_id: str, - id_column: str = "id", -) -> int: - """Append only rows whose ``id`` is not already in the Delta table for this project. - - Safe to call from multiple notebooks that share the same ``project_id`` partition - (e.g. ``visp_inh_patchseq_01`` and ``visp_exc_patchseq_01`` both write to - ``dataitem/`` under ``project_id='visp_patchseq'``). Unlike a scoped overwrite, - this function never removes rows written by another notebook. - - Idempotent: re-running with the same rows appends nothing and returns 0. - Handles the case where the Delta table does not yet exist. - - Parameters - ---------- - output_path: - Path to the Delta table directory. - table: - PyArrow table of candidate rows to append. - project_id: - Value used to filter existing rows before checking for duplicates. - id_column: - Name of the id column to deduplicate on. Defaults to ``"id"``. - - Returns - ------- - int - Number of rows actually appended (0 if all were already present). - """ - existing_ids: set[str] = set() - try: - import polars as pl - - existing_ids = set( - pl.read_delta(output_path) - .filter(pl.col("project_id") == project_id)[id_column] - .to_list() - ) - except Exception: - # Table doesn't exist yet, or read failed — treat all rows as new. 
- pass - - if existing_ids: - id_array = table.column(id_column) - existing_array = pa.array(list(existing_ids), type=id_array.type) - in_existing = pc.is_in(id_array, value_set=existing_array) - new_rows = table.filter(pc.invert(in_existing)) - else: - new_rows = table - - if new_rows.num_rows == 0: - return 0 - - write_deltalake(output_path, new_rows, mode="append", partition_by=["project_id"]) - return new_rows.num_rows +"""Deprecated re-export shim — moved to :mod:`connects_common_connectivity.io.write_utils`. + +This module exists to avoid breaking notebooks and external imports while the +codebase is mid-migration to the ``io/`` layer. It will be removed in W6. +""" +from .io.write_utils import * # noqa: F401,F403 (deprecated; removed in W6) +from .io.write_utils import ( # noqa: F401 + append_new_dataitems, + populate_region_coverage, + walk_ancestors, +) diff --git a/tests/test_write_relocation.py b/tests/test_write_relocation.py new file mode 100644 index 0000000..4707e7f --- /dev/null +++ b/tests/test_write_relocation.py @@ -0,0 +1,51 @@ +"""Smoke test asserting public IO names are importable from BOTH paths. + +The shims at the package root re-export from ``io/`` to keep notebooks +working through W6. This test pins that contract: anything notebooks +import today must keep working. 
+""" + +from __future__ import annotations + + +def test_public_names_from_io_paths(): + from connects_common_connectivity.io.arrow_utils import ( # noqa: F401 + attach_linkml_metadata, + build_arrow_schema, + build_cell_feature_matrix_schema, + models_to_table, + ) + from connects_common_connectivity.io.write_utils import ( # noqa: F401 + append_new_dataitems, + populate_region_coverage, + walk_ancestors, + ) + + +def test_public_names_from_shim_paths(): + from connects_common_connectivity.arrow_utils import ( # noqa: F401 + attach_linkml_metadata, + build_arrow_schema, + build_cell_feature_matrix_schema, + models_to_table, + ) + from connects_common_connectivity.write_utils import ( # noqa: F401 + append_new_dataitems, + populate_region_coverage, + walk_ancestors, + ) + + +def test_shim_and_io_resolve_to_same_object(): + from connects_common_connectivity import arrow_utils as shim_arrow + from connects_common_connectivity import write_utils as shim_write + from connects_common_connectivity.io import arrow_utils as io_arrow + from connects_common_connectivity.io import write_utils as io_write + + assert shim_arrow.build_arrow_schema is io_arrow.build_arrow_schema + assert shim_arrow.models_to_table is io_arrow.models_to_table + assert shim_arrow.attach_linkml_metadata is io_arrow.attach_linkml_metadata + assert shim_arrow.build_cell_feature_matrix_schema is io_arrow.build_cell_feature_matrix_schema + assert shim_write.append_new_dataitems is io_write.append_new_dataitems + assert shim_write.walk_ancestors is io_write.walk_ancestors + assert shim_write.populate_region_coverage is io_write.populate_region_coverage diff --git a/tests/test_write_spec.py b/tests/test_write_spec.py new file mode 100644 index 0000000..e7219e9 --- /dev/null +++ b/tests/test_write_spec.py @@ -0,0 +1,55 @@ +"""Drift tests for the WriteSpec registry. 
+ +These tests guard against the registry getting out of sync with +``models.py`` — e.g., a renamed field silently breaking a writer's +predicate. +""" + +from __future__ import annotations + +import pytest + +from connects_common_connectivity import models as models_module +from connects_common_connectivity.io.write_spec import REGISTRY, WriteSpec, get_spec + + +def test_registry_contains_seed_entries(): + seed = {"DataSet", "DataItem", "DataItemDataSetAssociation"} + assert seed.issubset(set(REGISTRY)) + + +@pytest.mark.parametrize("key", list(REGISTRY)) +def test_registry_key_matches_model_cls(key): + spec = REGISTRY[key] + cls = getattr(models_module, key, None) + assert cls is not None, f"models.py has no class named {key!r}" + assert spec.model_cls is cls, ( + f"REGISTRY[{key!r}].model_cls is {spec.model_cls!r}, expected {cls!r}" + ) + assert spec.model_cls.__name__ == key + + +@pytest.mark.parametrize("key", list(REGISTRY)) +def test_spec_columns_exist_on_model(key): + spec: WriteSpec = REGISTRY[key] + fields = set(spec.model_cls.model_fields) + for col in spec.scope_columns + spec.partition_by + spec.required_for_write: + assert col in fields, ( + f"{spec.model_cls.__name__}: column {col!r} is not a field " + f"(have: {sorted(fields)})" + ) + + +def test_get_spec_accepts_class_and_instance(): + ds_cls = REGISTRY["DataSet"].model_cls + instance = ds_cls(id="d1", name="example", project_id="p1") + assert get_spec(ds_cls) is REGISTRY["DataSet"] + assert get_spec(instance) is REGISTRY["DataSet"] + + +def test_get_spec_unknown_class_raises(): + class NotRegistered: + pass + + with pytest.raises(KeyError): + get_spec(NotRegistered) diff --git a/tests/test_writers.py b/tests/test_writers.py new file mode 100644 index 0000000..ab2f995 --- /dev/null +++ b/tests/test_writers.py @@ -0,0 +1,322 @@ +"""Tests for the IO writer dispatch core. + +Covers: + +* The patchseq regression — overlapping ``project_id`` writes do not wipe + each other (the original motivating bug). 
+* Idempotency, multi-scope-group dispatch, predicate construction. +* Append-new-by-id semantics. +* A per-class round-trip smoke test for every entry in ``WRITABLE_CLASSES``. +* ``write_projection_matrix`` enrichment + write. +""" + +from __future__ import annotations + +import numpy as np +import polars as pl +import pyarrow as pa +import pytest + +from connects_common_connectivity.config import Settings +from connects_common_connectivity.io.write_spec import REGISTRY +from connects_common_connectivity.io.writers import ( + WRITABLE_CLASSES, + WriteResult, + _build_predicate, + _group_by_scope, + write_models, + write_projection_matrix, +) +from connects_common_connectivity.models import ( + CellFeatureDefinition, + CellFeatureMatrix, + CellFeatureSet, + CellToClusterMapping, + Cluster, + ClusterHierarchy, + ClusterMembership, + DataItem, + DataItemDataSetAssociation, + DataSet, + Laterality, + MappingSet, + Modality, + ProjectionMeasurementMatrix, + ProjectionMeasurementType, + Unit, +) + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +@pytest.fixture +def settings(tmp_path) -> Settings: + return Settings(output_root=tmp_path) + + +def _read(path) -> pl.DataFrame: + return pl.read_delta(str(path)) + + +# --------------------------------------------------------------------------- +# Predicate construction +# --------------------------------------------------------------------------- + + +def test_build_predicate_format(): + assert ( + _build_predicate(["project_id"], ["minnie65"]) + == "project_id = 'minnie65'" + ) + assert ( + _build_predicate(["project_id", "id"], ["minnie65", "ds_a"]) + == "project_id = 'minnie65' AND id = 'ds_a'" + ) + + +def test_build_predicate_escapes_single_quotes(): + assert ( + _build_predicate(["name"], ["O'Hara"]) + == "name = 'O''Hara'" + ) + + +# 
--------------------------------------------------------------------------- +# _group_by_scope +# --------------------------------------------------------------------------- + + +def test_group_by_scope_preserves_first_appearance_order(): + table = pa.table( + { + "project_id": ["p", "p", "p"], + "id": ["b", "a", "b"], + "value": [1, 2, 3], + } + ) + groups = _group_by_scope(table, ["project_id", "id"]) + keys = [k for k, _ in groups] + assert keys == [("p", "b"), ("p", "a")] + # The first 'b' group should hold rows 0 and 2 (preserved order). + first_sub = groups[0][1] + assert first_sub.column("value").to_pylist() == [1, 3] + + +# --------------------------------------------------------------------------- +# Patchseq regression: the headline test +# --------------------------------------------------------------------------- + + +def test_patchseq_regression_two_datasets_same_project(settings): + """Two DataSet rows with the same ``project_id`` but different ``id`` must coexist. + + Before W2/W3 the notebooks predicated on ``project_id`` only, so a + second write wiped the first. The new ``scope_columns=[project_id, id]`` + keeps each row independent. 
+ """ + ds_a = DataSet(id="visp_exc_patchseq", name="exc", project_id="visp_patchseq") + ds_b = DataSet(id="visp_inh_patchseq", name="inh", project_id="visp_patchseq") + write_models(ds_a, settings=settings) + write_models(ds_b, settings=settings) + + rows = _read(settings.output_root / "dataset") + ids = sorted(rows["id"].to_list()) + assert ids == ["visp_exc_patchseq", "visp_inh_patchseq"] + + +def test_overwrite_scoped_is_idempotent(settings): + ds = DataSet(id="d1", name="example", project_id="p1") + write_models(ds, settings=settings) + write_models(ds, settings=settings) + rows = _read(settings.output_root / "dataset") + assert rows.shape[0] == 1 + + +def test_multi_scope_group_dispatch_yields_one_predicate_per_group(settings): + rows_in = [ + DataSet(id="a", name="A", project_id="p1"), + DataSet(id="b", name="B", project_id="p1"), + ] + result = write_models(rows_in, settings=settings) + assert isinstance(result, WriteResult) + assert len(result.predicates) == 2 + assert result.rows_written == 2 + # Both end up in the table. 
+ rows = _read(settings.output_root / "dataset") + assert sorted(rows["id"].to_list()) == ["a", "b"] + + +# --------------------------------------------------------------------------- +# append_new_by_id semantics (DataItem) +# --------------------------------------------------------------------------- + + +def test_append_new_by_id_only_appends_unseen(settings): + items_first = [ + DataItem(id="cell_1", name="cell_1", project_id="p1"), + DataItem(id="cell_2", name="cell_2", project_id="p1"), + ] + r1 = write_models(items_first, settings=settings) + assert r1.mode == "append_new_by_id" + assert r1.predicates == () + assert r1.rows_written == 2 + + items_second = [ + DataItem(id="cell_2", name="cell_2", project_id="p1"), # already there + DataItem(id="cell_3", name="cell_3", project_id="p1"), # new + ] + r2 = write_models(items_second, settings=settings) + assert r2.rows_written == 1 + + rows = _read(settings.output_root / "dataitem") + assert sorted(rows["id"].to_list()) == ["cell_1", "cell_2", "cell_3"] + + +def test_append_new_by_id_rejects_mixed_project_ids(settings): + bad = [ + DataItem(id="x", name="x", project_id="p1"), + DataItem(id="y", name="y", project_id="p2"), + ] + with pytest.raises(ValueError, match="single project_id"): + write_models(bad, settings=settings) + + +# --------------------------------------------------------------------------- +# Per-class smoke (every entry in WRITABLE_CLASSES exercised) +# --------------------------------------------------------------------------- + + +def _make_instance(cls): + """Return a minimal valid instance of ``cls`` for the round-trip smoke test.""" + if cls is DataSet: + return DataSet(id="ds1", name="ds", project_id="p1") + if cls is DataItem: + return DataItem(id="di1", name="di1", project_id="p1") + if cls is DataItemDataSetAssociation: + return DataItemDataSetAssociation( + dataitem_id="di1", dataset_id="ds1", project_id="p1" + ) + if cls is Cluster: + return Cluster(id="c1", hierarchy_id="h1", level=0) 
+ if cls is ClusterHierarchy: + return ClusterHierarchy(id="h1", root="c1", clusters=["c1"]) + if cls is ClusterMembership: + return ClusterMembership( + item="cell_1", cluster="c1", hierarchy_id="h1", project_id="p1" + ) + if cls is MappingSet: + return MappingSet(id="m1", project_id="p1", name="m", method_name="dummy") + if cls is CellToClusterMapping: + return CellToClusterMapping( + id="ctc1", + project_id="p1", + mapping_set="m1", + source_cell="cell_1", + target_cluster="c1", + ) + if cls is CellFeatureSet: + return CellFeatureSet(id="fs1", project_id="p1") + if cls is CellFeatureDefinition: + return CellFeatureDefinition( + id="feat_a", + project_id="p1", + feature_set_id="fs1", + data_type="= 1 + + +# --------------------------------------------------------------------------- +# write_projection_matrix +# --------------------------------------------------------------------------- + + +def test_write_projection_matrix_enriches_and_does_not_mutate_input(settings): + pmm = ProjectionMeasurementMatrix( + id="pmm_test", + measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON, + modality=Modality.MORPHOLOGY, + laterality=Laterality.IPSILATERAL, + unit=Unit.MICRONS_LENGTH, + data_item_index=["c1", "c2"], + region_index=["VISp", "ACA", "MOB"], + values="file:///tmp/pmm.delta", + ) + matrix = np.array( + [ + [1.0, 0.0, 0.0], + [0.0, 0.0, 2.0], + ] + ) + assert pmm.region_coverage in (None, []) + + result = write_projection_matrix(pmm, matrix, settings=settings) + assert result.class_name == "ProjectionMeasurementMatrix" + assert pmm.region_coverage in (None, []) # input not mutated + + rows = _read(settings.output_root / "projectionmeasurementmatrix") + coverage = rows.filter(pl.col("id") == "pmm_test")["region_coverage"].to_list()[0] + assert list(coverage) == ["VISp", "MOB"] + + +# --------------------------------------------------------------------------- +# Input validation +# --------------------------------------------------------------------------- + + 
+def test_write_models_rejects_empty(settings): + with pytest.raises(ValueError, match="empty"): + write_models([], settings=settings) + + +def test_write_models_rejects_heterogeneous(settings): + with pytest.raises(TypeError, match="homogeneous"): + write_models( + [ + DataSet(id="d1", name="d", project_id="p1"), + DataItem(id="x", name="x", project_id="p1"), + ], + settings=settings, + ) + + +def test_write_models_rejects_unregistered_class(settings): + class NotInRegistry: + pass + + with pytest.raises(TypeError): + write_models(NotInRegistry(), settings=settings) From 7842182a2a3b532c76a312b273c891cfe12b2930 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Thu, 11 Jun 2026 05:25:31 +0000 Subject: [PATCH 10/25] io init and validation based on write registry --- CHANGELOG.md | 10 ++ planning/TODO.md | 21 ++- planning/prompts/04_public_api.md | 42 +++-- planning/prompts/05_validation.md | 73 ++++----- .../io/__init__.py | 36 ++++- .../io/write_spec.py | 3 + .../io/write_validation.py | 149 ++++++++++++++++++ .../io/writers.py | 9 +- tests/test_config.py | 2 - tests/test_public_api.py | 50 ++++++ tests/test_write_validation.py | 131 +++++++++++++++ 11 files changed, 451 insertions(+), 75 deletions(-) create mode 100644 src/connects_common_connectivity/io/write_validation.py create mode 100644 tests/test_public_api.py create mode 100644 tests/test_write_validation.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 607e76d..473a0a6 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Added write-time validation: `write_models()` now re-validates each + model through a runtime-derived strict subclass that flips + `WriteSpec.required_for_write` slots to non-optional, raising + `ValueError` before any IO if a write-required slot is missing or + `None`. 
Public helpers `strict_model_for()` and `validate_for_write()` + live in `connects_common_connectivity.io.write_validation`. +- Added curated public API at `connects_common_connectivity.io`: imports + for `get_settings`, `Settings`, `table_path`, `write_models`, + `write_projection_matrix`, `WriteResult`, and `WRITABLE_CLASSES` are + now stable and pinned by `__all__`. - Added `connects_common_connectivity.io.writers` with `write_models()` (the single dispatch core for all generated pydantic models), `write_projection_matrix()`, `WriteResult`, and `WRITABLE_CLASSES`. diff --git a/planning/TODO.md b/planning/TODO.md index e3ad5b9..90050af 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -39,12 +39,21 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li `overwrite_scoped` for its metadata-pointer rows. Revisit when the wide-matrix contract is clarified. Tests: `tests/test_writers.py`, `tests/test_write_relocation.py` (full suite 119 passing). -- [ ] **W4 — Public API** (`prompts/04_public_api.md`) — `io/__init__.py`: curated - re-exports + `__all__`. The user-facing surface; defines what autocomplete shows. Blocked - by W3. -- [ ] **W5 — Write validation** (`prompts/05_validation.md`) — `io/write_validation.py`: - `strict_model_for(cls)` flips `required_for_write` to required + attaches pure - `cross_field_rules` (no I/O). Swap `validate_for_write` into the W3 hook. Blocked by W2, W3. +- [x] **W4 — Public API** (`prompts/04_public_api.md`) — `io/__init__.py`: curated + re-exports + `__all__` (`get_settings`, `Settings`, `table_path`, `write_models`, + `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES`). Module docstring + with usage example, `# TODO(W8): reader exports` placeholder. Test: + `tests/test_public_api.py`. 
+- [x] **W5 — Write validation** (`prompts/05_validation.md`) — `io/write_validation.py`: + `strict_model_for(cls)` flips `required_for_write` to required + strips `Optional` + from those annotations (cached per class, no `models.py` mutation); + `validate_for_write()` re-validates instances and raises `ValueError` naming the + missing slots before any IO. Wired into `write_models` (replaces the W3 + pass-through hook). Populated `required_for_write` for `Cluster`, + `ClusterMembership`, and `CellFeatureDefinition` (the only entries whose + predicate / partition columns are `Optional` in the generated schema). Tests: + `tests/test_write_validation.py`. Cross-field rules deferred (still empty list + on every spec). - [ ] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Migrate every ETL notebook to typed writers; delete hardcoded `OUTPUT_ROOT` and per-cell `mode`/`predicate`/`partition_by` (`ccc_config.yaml` already exists from W1). Run the diff --git a/planning/prompts/04_public_api.md b/planning/prompts/04_public_api.md index 9ab6f42..149975c 100644 --- a/planning/prompts/04_public_api.md +++ b/planning/prompts/04_public_api.md @@ -1,33 +1,29 @@ # Agent prompt — Public API (`io/__init__.py`) -> Prepend `00_shared_context.md`. Depends on writers (W3); reader exports added later when -> the read-side work happens. +> Prepend `00_shared_context.md`. Depends on writers (W3). ## Why -`io/__init__.py` is the single most important file for "user-friendly": it defines what a -user types after `from connects_common_connectivity.io import …` and what shows up in -autocomplete. It also decouples the public surface from internal module layout. +`io/__init__.py` defines what users type after `from connects_common_connectivity.io +import …` and what shows up in autocomplete. It is the file that decides whether the +package feels curated or sprawling. ## Requirements -1. 
A concise module docstring: one paragraph on the IO layer (note settings come from a - discovered `ccc_config.yaml`) + a 3–5 line usage example using `write_models(...)` and - `write_projection_matrix(...)` — no config ceremony needed. -2. Curated re-exports — only the names users should touch (write-side for now): - - config (from the package root, `from ..config import ...`): `get_settings`, `Settings`, - `table_path` - - writers: `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES` - - reader names are added here when readers land (deferred) — leave a clear TODO comment. - Do NOT re-export backends (`arrow_utils`, `write_utils`) or internal helpers. - Do NOT add per-class wrappers (`write_dataset`, etc.) — they don't exist; `write_models` - infers the class. -3. Define `__all__` to match exactly the curated list (keeps `dir()` and `*` imports clean). -4. Keep it import-light: no heavy work at import time; just imports + `__all__`. +1. Re-export, and only re-export, the curated names below. Source paths in parentheses. + - `get_settings`, `Settings`, `table_path` — from `..config` + - `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES` + — from `.writers` +2. Define `__all__` to exactly that list (no more, no less). +3. Add a module docstring: one short paragraph on the IO layer + a 3–5 line usage + example using `write_models(...)` and `write_projection_matrix(...)`. No config + ceremony in the example — `get_settings()` is implicit. +4. Leave a single `# TODO(W8): reader exports` comment at the bottom of the imports + block, so the reader slot is obvious when W8 lands. ## Test (`tests/test_public_api.py`) -- Every name in `__all__` is importable from `connects_common_connectivity.io`. -- No backend/internal module name leaks into `__all__`. -- `__all__` does NOT contain any `write_dataset` / `write_dataitem` / etc. — those - wrappers don't exist by design. 
+- Import every name in `__all__` from `connects_common_connectivity.io` and assert it + resolves to a non-`None` object. +- Assert no name in `__all__` starts with `_`. ## Do not -- Re-export internal backends. Add per-class wrappers. Touch `models.py` or schemas. +- Re-export `arrow_utils`, `write_utils`, or any private helper. +- Touch `models.py` or schemas. \ No newline at end of file diff --git a/planning/prompts/05_validation.md b/planning/prompts/05_validation.md index 2a5689b..1aba4d8 100644 --- a/planning/prompts/05_validation.md +++ b/planning/prompts/05_validation.md @@ -1,46 +1,47 @@ # Agent prompt — Write-validation (auto-derived strict submodels) -> Prepend `00_shared_context.md`. Depends on `write_spec.py` and `writers.py` (built after -> the write path; wires into the pass-through validation hook left in `write_models`). +> Prepend `00_shared_context.md`. Depends on `write_spec.py` (W2) and `writers.py` (W3). +> Wires into the pass-through `_validation_hook(models, spec) -> models` left in +> `write_models`. ## Naming -File is `io/write_validation.py`, NOT `io/validation.py`: this is specifically write-safety -validation coupled to `write_spec`. The generic word "validation" is already claimed by -`cli.py`'s LinkML full-conformance check — keep the two distinct. +File is `io/write_validation.py` — write-time, pydantic-only, registry-coupled. The +generic word "validation" is already used by `cli.py`'s LinkML conformance check; the +two are intentionally distinct. -## Goal -Create `src/connects_common_connectivity/io/write_validation.py` that derives a STRICT pydantic -submodel per class **at runtime** from (a) the generated model in `models.py` and (b) the -registry's `required_for_write` + `cross_field_rules`. Single source of truth: nothing -is restated from the schema. +## What W5 ships +1. **Populate `required_for_write`** on the registry entries that need it. 
Driven by + the same prototype loop as W3: read the corresponding notebook's write call, identify + the slots the predicate / partition / append-id depend on, and list them. Empty list + is a valid answer — only add slots a real write actually relies on. +2. `strict_model_for(model_cls) -> type[BaseModel]`: + - Subclass the generated model at runtime; do NOT mutate `models.py` classes. + - For each name in `spec.required_for_write`, override the field to be required + (no default, not Optional). Use any pydantic v2 mechanism that doesn't touch the + parent class. + - Cache by class so the derived type is built once. +3. `validate_for_write(models, spec) -> models` — accepts the same shape `_validation_hook` + already does (single instance OR iterable, returns the same shape). Runs each instance + through the strict submodel; on failure, raise an error naming the class and the + failing slot. Pydantic-only, no I/O. +4. **Wire it in.** In `write_models`, replace the pass-through `_validation_hook` with + `validate_for_write`. This is the only edit to `writers.py`. -## Requirements -1. `strict_model_for(model_cls) -> type[BaseModel]`: - - Subclass the generated model. - - For each slot in the registry's `required_for_write`, make it required (no default / - not Optional). Use pydantic v2 mechanisms (`model_fields` overrides via - `create_model` or field re-annotation) — do NOT edit the generated class in place. - - Attach each named `cross_field_rule` as a `@model_validator(mode="after")`. - - Cache the derived class (e.g. `functools.lru_cache`) so it's built once. -2. `validate_for_write(model) -> model` (or list): run the instance through the strict - submodel, raising a clear error that names the class, the failing slot/rule, and the - offending value. This runs on the hot write path, so keep it pydantic-only (fast, **no - I/O**); do NOT call the LinkML/`cli.py` validator here. -3. 
**Wire it into `write_models`:** replace the pass-through validation hook left by - `03_writers.md` with `validate_for_write`. This is the only change to the writer. -4. Implement a starter cross-field rule registry (a dict name → callable). Rules here MUST - be pure: they inspect only the model instance in hand, do no I/O, and never read other - tables. Add rules only as the registry references them. - - Do NOT implement `association_dataset_exists` here. It reads written DataSets, so it is - a referential check, not a structural one — it is deferred with the read-side work as an - opt-in `check_refs` (`09_analysis.md`). Keeping it out keeps validation free of any - dependency on readers. +## Out of scope (deferred, not skipped) +- Cross-field rules. `WriteSpec.cross_field_rules` exists as an empty list; until a real + invariant needs one, do not introduce a rule registry. Add the dict + `model_validator` + scaffolding when the first rule is actually written, not before. +- Referential checks (e.g. "association.dataset_id exists in DataSet"). These read other + tables and belong with the read-side opt-in `check_refs` (`_deferred/09_analysis.md`), + not on the write path. ## Tests (`tests/test_write_validation.py`) -- A model missing a `required_for_write` slot fails `validate_for_write` before any IO. -- A valid model passes and is returned unchanged (round-trip equality on fields). -- The generated `models.py` class is unchanged after deriving the strict model - (no in-place mutation). +- A model with a missing `required_for_write` slot fails before any IO. +- A model with all slots present passes and is returned unchanged (field-by-field equal). +- The class object in `models.py` has the same `model_fields` after `strict_model_for` + runs as before — proving no in-place mutation. +- `validate_for_write([m1, m2], spec)` accepts a list (same shape contract as the hook). ## Do not -- Edit `models.py`. Restate schema field definitions. 
Put LinkML validation on the write path. +- Edit `models.py` or schemas. Restate field types from the schema. Call the LinkML + validator on the write path. Add cross-field rules speculatively. \ No newline at end of file diff --git a/src/connects_common_connectivity/io/__init__.py b/src/connects_common_connectivity/io/__init__.py index c4018fc..93e7759 100644 --- a/src/connects_common_connectivity/io/__init__.py +++ b/src/connects_common_connectivity/io/__init__.py @@ -1,13 +1,37 @@ """IO layer for ConnectsCommonConnectivity. -This package owns write/read backends and (re-)exports a few package-wide -helpers for convenience. The settings live in :mod:`connects_common_connectivity.config`; -they are re-exported here so IO callers can ``from connects_common_connectivity.io -import get_settings, table_path``. +The IO layer owns the write/read path between generated pydantic models +and the shared Delta lake. This module is the curated public surface: +import from here for stable user code; everything else under ``io/`` is +internal plumbing. 
+ +Example:: + + from connects_common_connectivity.io import write_models, write_projection_matrix + from connects_common_connectivity.models import DataSet + + write_models(DataSet(id="ds1", name="example", project_id="p1")) + write_projection_matrix(pmm, dense_matrix) """ from __future__ import annotations -from ..config import Settings, get_settings, output_root, table_path +from ..config import Settings, get_settings, table_path +from .writers import ( + WRITABLE_CLASSES, + WriteResult, + write_models, + write_projection_matrix, +) + +# TODO(W8): reader exports -__all__ = ["Settings", "get_settings", "output_root", "table_path"] +__all__ = [ + "get_settings", + "Settings", + "table_path", + "write_models", + "write_projection_matrix", + "WriteResult", + "WRITABLE_CLASSES", +] diff --git a/src/connects_common_connectivity/io/write_spec.py b/src/connects_common_connectivity/io/write_spec.py index d7bd233..400dbad 100644 --- a/src/connects_common_connectivity/io/write_spec.py +++ b/src/connects_common_connectivity/io/write_spec.py @@ -81,6 +81,7 @@ class WriteSpec(BaseModel): partition_by=["hierarchy_id"], scope_columns=["hierarchy_id"], write_mode="overwrite_scoped", + required_for_write=["hierarchy_id"], ), "ClusterHierarchy": WriteSpec( model_cls=ClusterHierarchy, @@ -95,6 +96,7 @@ class WriteSpec(BaseModel): partition_by=["project_id", "hierarchy_id"], scope_columns=["project_id", "hierarchy_id"], write_mode="overwrite_scoped", + required_for_write=["hierarchy_id"], ), "MappingSet": WriteSpec( model_cls=MappingSet, @@ -125,6 +127,7 @@ class WriteSpec(BaseModel): partition_by=["project_id", "feature_set_id"], scope_columns=["project_id", "feature_set_id"], write_mode="overwrite_scoped", + required_for_write=["feature_set_id"], ), "CellFeatureMatrix": WriteSpec( model_cls=CellFeatureMatrix, diff --git a/src/connects_common_connectivity/io/write_validation.py b/src/connects_common_connectivity/io/write_validation.py new file mode 100644 index 0000000..3c1c199 --- 
/dev/null +++ b/src/connects_common_connectivity/io/write_validation.py @@ -0,0 +1,149 @@ +"""Write-time, pydantic-only validation hooked into :func:`write_models`. + +The IO layer should never blindly trust that a model carries every slot +the write actually depends on. Many generated fields are ``Optional`` in +``models.py`` because the schema permits them to be missing in some +contexts, but the *write* path needs them concretely (e.g. the predicate +columns, the partition columns, the id used for dedupe). + +W2's :class:`WriteSpec` records this in ``required_for_write``. This +module turns that list into a real check by deriving a strict pydantic +subclass of the generated model — runtime-only, never mutating +``models.py`` — and re-validating each instance through it before any IO. + +The CLI's LinkML-conformance check is a different beast (whole-schema, +generic, no registry). The two intentionally do not share code. +""" + +from __future__ import annotations + +from functools import lru_cache +from types import UnionType +from typing import Any, Iterable, Sequence, Union, get_args, get_origin + +from pydantic import BaseModel, Field, ValidationError, create_model + +from .write_spec import WriteSpec + + +__all__ = ["strict_model_for", "validate_for_write"] + + +def _strip_optional(annotation: Any) -> Any: + """Return ``annotation`` with ``None`` removed from any top-level Union. + + A field annotated ``Optional[str]`` (``str | None``) accepts ``None`` as + a valid value even when ``Field(...)`` makes it required. For write-time + enforcement we want ``None`` to be a validation error, so we strip the + ``NoneType`` arm of any top-level union. 
+ """ + origin = get_origin(annotation) + if origin is Union or origin is UnionType: + args = tuple(a for a in get_args(annotation) if a is not type(None)) + if not args: + return annotation + if len(args) == 1: + return args[0] + return Union[args] # type: ignore[return-value] + return annotation + + +@lru_cache(maxsize=None) +def strict_model_for(model_cls: type) -> type[BaseModel]: + """Return a pydantic subclass of ``model_cls`` with write-required slots forced. + + For each name in the registered :attr:`WriteSpec.required_for_write` + list, the corresponding field on the returned subclass is required + (no default, ``...`` ellipsis). The annotation, validators, and other + metadata of the parent class are preserved — only the default is + flipped. + + Cached on ``model_cls`` so the derived class is built once and reused + across calls. + + Important: ``models.py`` is never mutated. The returned class is a + runtime-only subclass; assertions on the parent class's + ``model_fields`` continue to reflect the schema as generated. + """ + # Local import avoids a hard top-level cycle through the registry. + from .write_spec import REGISTRY + + spec = REGISTRY.get(model_cls.__name__) + required: Sequence[str] = spec.required_for_write if spec else () + if not required: + # Nothing to tighten — return the original class. 
+ return model_cls + + overrides: dict[str, Any] = {} + for name in required: + finfo = model_cls.model_fields.get(name) + if finfo is None: + raise ValueError( + f"{model_cls.__name__}: required_for_write field {name!r} " + f"is not declared on the model" + ) + overrides[name] = (_strip_optional(finfo.annotation), Field(...)) + + strict = create_model( + f"{model_cls.__name__}_StrictWrite", + __base__=model_cls, + **overrides, + ) + return strict + + +def _coerce_iterable(models: Any) -> tuple[bool, list[BaseModel]]: + """Return ``(was_iterable, items)`` for the same shape contract as the hook.""" + if isinstance(models, BaseModel): + return False, [models] + if isinstance(models, (str, bytes)) or not isinstance(models, Iterable): + raise TypeError( + f"validate_for_write expected a model or iterable; " + f"got {type(models).__name__}" + ) + return True, list(models) + + +def validate_for_write(models: Any, spec: WriteSpec) -> Any: + """Re-validate ``models`` through the strict submodel for ``spec.model_cls``. + + Same shape contract as the W3 ``_validation_hook``: a single instance + in returns a single instance out; an iterable in returns a list out. + No I/O. Pydantic-only. On failure, raises :class:`ValueError` naming + the class and the failing slot. 
+ """ + was_iter, items = _coerce_iterable(models) + if not items: + return items if was_iter else None + + cls = type(items[0]) + if cls is not spec.model_cls: + raise TypeError( + f"validate_for_write: spec.model_cls is {spec.model_cls.__name__!r} " + f"but received {cls.__name__!r}" + ) + + strict = strict_model_for(cls) + if strict is cls: + return items if was_iter else items[0] + + revalidated: list[BaseModel] = [] + for m in items: + try: + revalidated.append(strict.model_validate(m.model_dump())) + except ValidationError as err: + missing = sorted( + { + ".".join(str(p) for p in e.get("loc", ())) + for e in err.errors() + if e.get("type") + in ("missing", "none_not_allowed", "string_type", "value_error") + } + ) + slot_text = ", ".join(missing) if missing else "(see below)" + raise ValueError( + f"{cls.__name__}: missing required_for_write slot(s): " + f"{slot_text}. {err}" + ) from err + + return revalidated if was_iter else revalidated[0] diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py index 1aba554..b6dc56c 100644 --- a/src/connects_common_connectivity/io/writers.py +++ b/src/connects_common_connectivity/io/writers.py @@ -25,6 +25,7 @@ from .arrow_utils import attach_linkml_metadata, build_arrow_schema, models_to_table from .write_spec import REGISTRY, WriteSpec, get_spec from .write_utils import append_new_dataitems, populate_region_coverage +from .write_validation import validate_for_write # --------------------------------------------------------------------------- # Result type @@ -63,8 +64,12 @@ class WriteResult: def _validation_hook(models: Sequence[BaseModel], spec: WriteSpec) -> Sequence[BaseModel]: - """Pass-through identity hook; W5 monkey-patches this to enforce invariants.""" - return models + """Strict re-validation against ``spec.required_for_write`` (W5). + + Identity-shaped: takes a sequence in, returns a sequence out. Pure + pydantic; no I/O. 
+ """ + return validate_for_write(list(models), spec) # --------------------------------------------------------------------------- diff --git a/tests/test_config.py b/tests/test_config.py index dd27641..0ebd243 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -120,14 +120,12 @@ def test_io_reexports_settings_helpers(): from connects_common_connectivity.io import ( Settings as IOSettings, get_settings as io_get_settings, - output_root as io_output_root, table_path as io_table_path, ) assert IOSettings is Settings assert io_get_settings is get_settings assert io_table_path is table_path - assert io_output_root is output_root def test_get_settings_is_cached(tmp_path, monkeypatch): diff --git a/tests/test_public_api.py b/tests/test_public_api.py new file mode 100644 index 0000000..f139261 --- /dev/null +++ b/tests/test_public_api.py @@ -0,0 +1,50 @@ +"""Lock the curated public surface of ``connects_common_connectivity.io``. + +The public API is whatever ``__all__`` says — nothing more, nothing less. 
+""" + +from __future__ import annotations + +import importlib + +import connects_common_connectivity.io as io_mod + + +EXPECTED = { + "get_settings", + "Settings", + "table_path", + "write_models", + "write_projection_matrix", + "WriteResult", + "WRITABLE_CLASSES", +} + + +def test_all_exact_set(): + assert set(io_mod.__all__) == EXPECTED + + +def test_all_resolves_to_non_none_objects(): + for name in io_mod.__all__: + obj = getattr(io_mod, name) + assert obj is not None, f"io.{name} resolved to None" + + +def test_no_private_names_in_all(): + for name in io_mod.__all__: + assert not name.startswith("_"), f"private name {name!r} in __all__" + + +def test_each_name_imports_cleanly(): + mod = importlib.reload(io_mod) + for name in EXPECTED: + assert hasattr(mod, name), f"io.{name} missing" + + +def test_internal_modules_not_re_exported(): + # arrow_utils / write_utils / write_spec / writers are accessible as + # submodules (they're real modules) but their names must not leak into + # io.__all__. 
+ forbidden = {"arrow_utils", "write_utils", "write_spec", "writers"} + assert forbidden.isdisjoint(set(io_mod.__all__)) diff --git a/tests/test_write_validation.py b/tests/test_write_validation.py new file mode 100644 index 0000000..9083625 --- /dev/null +++ b/tests/test_write_validation.py @@ -0,0 +1,131 @@ +"""Tests for write-time validation (auto-derived strict submodels).""" + +from __future__ import annotations + +import pytest + +from connects_common_connectivity.io.write_spec import REGISTRY, WriteSpec +from connects_common_connectivity.io.write_validation import ( + strict_model_for, + validate_for_write, +) +from connects_common_connectivity.models import ( + CellFeatureDefinition, + Cluster, + DataSet, +) + + +# --------------------------------------------------------------------------- +# strict_model_for +# --------------------------------------------------------------------------- + + +def test_strict_model_subclasses_parent_without_mutating_it(): + before = dict(Cluster.model_fields) + strict = strict_model_for(Cluster) + after = dict(Cluster.model_fields) + + assert before.keys() == after.keys() + for k in before: + assert before[k].is_required() == after[k].is_required(), ( + f"Cluster.model_fields[{k!r}] was mutated" + ) + assert issubclass(strict, Cluster) + assert strict is not Cluster + + +def test_strict_model_for_is_cached(): + a = strict_model_for(Cluster) + b = strict_model_for(Cluster) + assert a is b + + +def test_strict_model_returns_parent_when_no_required_for_write(): + # DataSet has empty required_for_write; the strict subclass is just the parent. + assert REGISTRY["DataSet"].required_for_write == [] + assert strict_model_for(DataSet) is DataSet + + +def test_strict_model_flips_optional_field_to_required(): + strict = strict_model_for(Cluster) + # On the parent, hierarchy_id is optional. + assert not Cluster.model_fields["hierarchy_id"].is_required() + # On the strict subclass, hierarchy_id is required. 
+ assert strict.model_fields["hierarchy_id"].is_required() + + +# --------------------------------------------------------------------------- +# validate_for_write — failure path +# --------------------------------------------------------------------------- + + +def test_missing_required_for_write_slot_raises_before_io(): + spec = REGISTRY["Cluster"] + bad = Cluster(id="c1") # hierarchy_id missing + with pytest.raises(ValueError, match="hierarchy_id"): + validate_for_write(bad, spec) + + +def test_missing_slot_names_class_in_error(): + spec = REGISTRY["CellFeatureDefinition"] + bad = CellFeatureDefinition(id="f1", project_id="p1") # feature_set_id missing + with pytest.raises(ValueError, match="CellFeatureDefinition"): + validate_for_write(bad, spec) + + +# --------------------------------------------------------------------------- +# validate_for_write — happy path +# --------------------------------------------------------------------------- + + +def test_valid_model_passes_and_round_trips_field_by_field(): + spec = REGISTRY["Cluster"] + good = Cluster(id="c1", hierarchy_id="h1", level=2) + result = validate_for_write(good, spec) + # Field-by-field equality with the input. + for name in Cluster.model_fields: + assert getattr(result, name) == getattr(good, name) + + +def test_validate_for_write_accepts_a_list(): + spec = REGISTRY["Cluster"] + items = [ + Cluster(id="c1", hierarchy_id="h1"), + Cluster(id="c2", hierarchy_id="h1"), + ] + result = validate_for_write(items, spec) + assert isinstance(result, list) + assert [m.id for m in result] == ["c1", "c2"] + + +def test_validate_for_write_passthrough_when_required_is_empty(): + spec = REGISTRY["DataSet"] + ds = DataSet(id="d1", name="d", project_id="p1") + result = validate_for_write(ds, spec) + # No revalidation needed; identity-equal. 
+ assert result is ds + + +def test_validate_for_write_rejects_class_mismatch(): + spec = REGISTRY["Cluster"] + not_a_cluster = DataSet(id="d1", name="d", project_id="p1") + with pytest.raises(TypeError, match="Cluster"): + validate_for_write(not_a_cluster, spec) + + +# --------------------------------------------------------------------------- +# Wired into write_models +# --------------------------------------------------------------------------- + + +def test_write_models_calls_validation_before_io(tmp_path): + from connects_common_connectivity.config import Settings + from connects_common_connectivity.io.writers import write_models + + settings = Settings(output_root=tmp_path) + bad = Cluster(id="c1") # hierarchy_id missing + with pytest.raises(ValueError, match="hierarchy_id"): + write_models(bad, settings=settings) + # No table directory created — IO never happened. + assert not (tmp_path / "cluster").exists() From 954d589a30fc00f18e79d0f937f45c2e03ac4e08 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Fri, 12 Jun 2026 14:24:37 -0700 Subject: [PATCH 11/25] notebook migration refinement --- planning/prompts/06_notebook_migration.md | 78 +++++++++++++++++------ planning/prompts/07_tests.md | 16 +++-- 2 files changed, 70 insertions(+), 24 deletions(-) diff --git a/planning/prompts/06_notebook_migration.md b/planning/prompts/06_notebook_migration.md index ec97c6c..951de01 100644 --- a/planning/prompts/06_notebook_migration.md +++ b/planning/prompts/06_notebook_migration.md @@ -3,34 +3,72 @@ > Prepend `00_shared_context.md`. Depends on writers (and readers for verification cells). ## Goal -Migrate the ETL notebooks in `code/etl_*.ipynb` to use the new IO API. Move bookkeeping -into the library; keep the science logic and verification cells. The output path lives in -ONE file (`ccc_config.yaml`) discovered automatically — notebooks carry no path and no -config cell. 
- -## First: create the config file -Create `ccc_config.yaml` at the repo root (the single source of truth, version-controlled): -```yaml -output_root: ../scratch/em_patchseq_wnm_v1/ # match the value grep'd from code/*.ipynb -dry_run: false -``` -To repoint local vs CodeOcean, edit this file (or set `CCC_OUTPUT_ROOT`); nothing else -changes. The library finds it by walking up from the notebook's working directory. +Migrate the ETL notebooks in `code/etl_*.ipynb` to use the new IO API. Replace the +hand-rolled `write_deltalake(... mode/predicate/partition_by ...)` calls with +`write_models` / `write_projection_matrix`, and replace the hardcoded +`OUTPUT_ROOT = "../scratch/..."` constant with a call to +`connects_common_connectivity.config.output_root()` — a cwd-aware helper that returns +the path string with trailing `/`, so it's a literal drop-in for the old constant. +Notebooks keep their per-dataset config cell (input paths, dataset/project ids, +versions, feature-set ids, etc.); only the output root and the manual write +bookkeeping move into the library. + +## Required reading before touching any notebook +1. `etl_example_prompt.md` (repo root) — describes the **pre-migration** notebook patterns: + write predicates, two-level overwrite rules, `append_new_dataitems`, the patchseq + shared-partition bug, parent propagation, etc. Read this so you understand WHAT each + notebook is doing scientifically and WHY the old write patterns were shaped that way. + Treat its rules about ids, enums, schemas, and verification cells as still binding. +2. `src/connects_common_connectivity/io/` — the **post-migration** target. The functions + `write_models`, `write_projection_matrix`, `get_settings`, `table_path` (re-exported + from `connects_common_connectivity.io`) now own everything `etl_example_prompt.md` + spelled out by hand: mode, predicate, partition_by, append-new-by-id, two-level scoping + per class. 
Migration is the act of replacing those manual rules with these calls. +3. The config file `ccc_config.yaml` already exists at repo root — do NOT recreate it. + Migration only edits notebooks. + +## What changes between old and new +| Old (per `etl_example_prompt.md`) | New (this migration) | +|---|---| +| `OUTPUT_ROOT = "../scratch/..."` constant in cell 3 | `OUTPUT_ROOT = output_root()` — same string shape, sourced from `ccc_config.yaml` | +| `write_deltalake(path, table, mode="overwrite", predicate=..., partition_by=...)` | `write_models(instance_or_list)` — registry owns mode/predicate/partition | +| `append_new_dataitems(...)` for `dataitem/` | `write_models(dataitem_list)` — append-new-by-id is the registered mode | +| Manual two-level predicate strings | None in notebooks; the `WriteSpec` for each class encodes them | +| Verification cell hardcoded path string | `output_root() + "/"` (or `table_path(get_settings(), "
")` for a typed `Path`) | +| `write_deltalake(...)` for projection matrix wide form | `write_projection_matrix(pmm, dense_matrix)` | + +The model construction, ETL transforms, and verification assertions do not change. ## Per ETL notebook -1. Delete the hardcoded `OUTPUT_ROOT = "../scratch/..."` entirely. There is no replacement - config cell and no `%run` — the library discovers `ccc_config.yaml` on its own, so - `write_models(...)` calls need neither a path nor `settings=`. (If a cell wants to show - the resolved config, it may `from connects_common_connectivity.io import get_settings; - print(get_settings())`, but this is optional.) +1. Replace the hardcoded `OUTPUT_ROOT = "../scratch/..."` with + `OUTPUT_ROOT = output_root()` (imported from + `connects_common_connectivity.config`). The helper returns a cwd-relative path + string with trailing `/`, so existing string concatenations like + `OUTPUT_ROOT + "dataitem/"` keep working. `write_models(...)` calls need neither a + path nor `settings=` — the library discovers `ccc_config.yaml` on its own. 2. Replace each direct `write_deltalake(... mode=... predicate=... partition_by=...)` call with `write_models(my_instance)` (or `write_models([inst1, inst2])`). The class is inferred from the argument; the registry owns mode / predicate / partition. Use `write_projection_matrix(pmm, matrix)` for the one projection notebook — it's the single non-`write_models` writer. Delete the now-redundant `mode`/`predicate`/ `partition_by` arguments and their explanatory comments. -3. Keep verification cells; update their paths to use - `table_path(get_settings(), ...)`. +3. Keep verification cells; their `OUTPUT_ROOT + "
/"` reads continue to work + unchanged once `OUTPUT_ROOT` is sourced from `output_root()`. + +## Pilot first — do not fan out +Migrate ONE notebook end-to-end before touching any others. Pick +`etl_visp_inh_patchseq_01_dataset_dataitem.ipynb` as the pilot (small, exercises the +patchseq bug, uses both `DataSet` and `DataItem` writes). For the pilot: + +1. Run the pre-migration version once and record the output Delta tables (row counts and + `(project_id, id)` sets for `dataset/`, `dataitem/`, `dataitem_dataset_association/`). +2. Migrate the notebook per the rules above and run it against a **fresh** output root + (point `ccc_config.yaml` or `CCC_OUTPUT_ROOT` somewhere new so the pre-migration data + is preserved for comparison). +3. Diff: assert the post-migration tables match the pre-migration ones in row count and + `(project_id, id)` set equality. Any drift is a registry/spec bug — STOP and report + before migrating further notebooks. +4. Only after the pilot passes the diff, proceed in the order below. ## Migrate in this order 1. `etl_*_01_dataset_dataitem.ipynb` (all of minnie, wnm, visp_exc/inh patchseq) — these diff --git a/planning/prompts/07_tests.md b/planning/prompts/07_tests.md index 8d1a688..17fe774 100644 --- a/planning/prompts/07_tests.md +++ b/planning/prompts/07_tests.md @@ -4,15 +4,20 @@ > analysis tests are deferred with that work.) ## Goal -Pull the write-side suite together and fill the gaps. Several cases are already specified in -their owning prompts — do NOT re-specify them here, just ensure they exist and run as one -suite: +This is the LAST write-side prompt that will run — prompts 02–05 will not be re-executed. +That means this prompt is responsible for both the gaps below AND any cleanup left over +from earlier prompts. Several cases are already specified in their owning prompts: - Registry↔schema drift → `02_write_spec.md` (`tests/test_write_spec.py`). 
- Patchseq shared-partition regression, idempotency, append-new-by-id, predicate construction → `03_writers.md` (`tests/test_writers.py`). - Strict-validation failures → `05_validation.md` (`tests/test_write_validation.py`). - Public-API surface → `04_public_api.md` (`tests/test_public_api.py`). +If any of those tests are missing, red, or do not actually assert what their prompt +claimed, **fix them here** — there is no second pass. When you patch a test owned by an +earlier prompt, list which prompt and which test in the report so the spec docs can be +updated later. + Use small synthetic models written to a `tmp_path` Delta root (point `CCC_OUTPUT_ROOT` at `tmp_path`, or a tmp `ccc_config.yaml`) so tests never touch real data. @@ -22,8 +27,11 @@ Use small synthetic models written to a `tmp_path` Delta root (point `CCC_OUTPUT 2. **No-shim regression (TODO 3.4):** after migration, assert no module imports the old write-side paths `arrow_utils`, `write_utils` (grep / import-scan); the shims must be gone. 3. Confirm the suite is collected and green together (no per-prompt drift). +4. Patch any 02–05 test gaps discovered while running the suite (see goal above). Round-trip and cross-dataset read tests are deferred to the read-side work. ## Reporting -Run `pytest -q` and paste the summary. Do not mark complete with failures. +Run `uv run pytest -q` (this repo uses `uv` — plain `pytest` will not pick up the +project venv) and paste the summary. Do not mark complete with failures. Also list any +tests you patched on behalf of an earlier prompt and a one-line reason for each. 
From f573b99dbaf1bb26e8276d2609d38f943efe98b5 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Mon, 15 Jun 2026 20:43:37 +0000 Subject: [PATCH 12/25] notebook migration and patches to writespec --- CHANGELOG.md | 27 +- code/etl_minnie_01_dataset_dataitem.ipynb | 242 +-------- code/etl_minnie_02_cell_features.ipynb | 477 ++++++------------ ...ie_03_cluster_and_cluster_membership.ipynb | 151 +----- code/etl_minnie_04_cell_cell.ipynb | 2 +- code/etl_tasic_01_cluster.ipynb | 176 +++---- ...isp_exc_patchseq_01_dataset_dataitem.ipynb | 145 ++---- ...l_visp_exc_patchseq_02_cell_features.ipynb | 169 +++---- ...eq_03_cluster_membership_and_mapping.ipynb | 304 ++++------- ...isp_inh_patchseq_01_dataset_dataitem.ipynb | 155 +++--- ...l_visp_inh_patchseq_02_cell_features.ipynb | 225 ++++++--- ...eq_03_cluster_membership_and_mapping.ipynb | 277 +++++----- code/etl_visp_met_types_01_cluster.ipynb | 176 +++---- code/etl_wnm_exc_01_dataset_dataitem.ipynb | 172 +++---- code/etl_wnm_exc_02_cell_features.ipynb | 409 +++++++-------- ...l_wnm_exc_03_cell_to_cluster_mapping.ipynb | 120 ++--- code/etl_wnm_exc_04_projection_matrix.ipynb | 193 +++---- code/parse_minnie_clustering.ipynb | 4 +- .../arrow_utils.py | 7 - .../io/write_spec.py | 18 + .../write_utils.py | 11 - tests/test_write_relocation.py | 114 +++-- tests/test_write_utils.py | 2 +- tests/test_writers.py | 6 + 24 files changed, 1389 insertions(+), 2193 deletions(-) delete mode 100644 src/connects_common_connectivity/arrow_utils.py delete mode 100644 src/connects_common_connectivity/write_utils.py diff --git a/CHANGELOG.md b/CHANGELOG.md index 473a0a6..88c2eb2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Added `WriteSpec` registry entries for `AlgorithmRun` and + `HierarchyCategory` (both project-agnostic, scope=`["id"]`, + `overwrite_scoped`). 
These classes are now writable through + `write_models(...)` and surface in `WRITABLE_CLASSES`. - Added write-time validation: `write_models()` now re-validates each model through a runtime-derived strict subclass that flips `WriteSpec.required_for_write` slots to non-optional, raising @@ -28,22 +32,25 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Changed +- Migrated `code/etl_*.ipynb` notebooks to the curated IO API: + hardcoded `OUTPUT_ROOT = "../scratch/..."` strings are replaced with + `output_root()` from `connects_common_connectivity.config`, and + hand-rolled `write_deltalake(..., mode=..., predicate=..., partition_by=...)` + calls for registry-backed models are replaced with `write_models(...)` + (and `write_projection_matrix(...)` for projection matrices). - Moved `arrow_utils` and `write_utils` under - `connects_common_connectivity.io.*`. The old import paths - (`connects_common_connectivity.arrow_utils`, - `connects_common_connectivity.write_utils`) keep working as deprecated - re-export shims. + `connects_common_connectivity.io.*`. ### Deprecated -- Importing from `connects_common_connectivity.arrow_utils` and - `connects_common_connectivity.write_utils`; use - `connects_common_connectivity.io.arrow_utils` / - `connects_common_connectivity.io.write_utils` instead. The shims will be - removed once notebook migration completes. - ### Removed +- Removed the deprecated re-export shims + `connects_common_connectivity.arrow_utils` and + `connects_common_connectivity.write_utils`. Import from + `connects_common_connectivity.io.arrow_utils` / + `connects_common_connectivity.io.write_utils` instead. 
+ ### Fixed ### Security diff --git a/code/etl_minnie_01_dataset_dataitem.ipynb b/code/etl_minnie_01_dataset_dataitem.ipynb index 81c4028..db105fb 100644 --- a/code/etl_minnie_01_dataset_dataitem.ipynb +++ b/code/etl_minnie_01_dataset_dataitem.ipynb @@ -12,14 +12,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:19.297771Z", - "iopub.status.busy": "2026-04-30T23:47:19.297592Z", - "iopub.status.idle": "2026-04-30T23:47:20.707966Z", - "shell.execute_reply": "2026-04-30T23:47:20.707186Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stderr", @@ -37,49 +30,37 @@ "import pandas as pd\n", "import polars as pl\n", "import pyarrow as pa\n", - "from deltalake import write_deltalake\n", "\n", - "from connects_common_connectivity.arrow_utils import (\n", - " build_arrow_schema,\n", - " models_to_table,\n", - " attach_linkml_metadata,\n", - ")\n", "from connects_common_connectivity.models import (\n", " DataSet,\n", " DataItem,\n", " DataItemDataSetAssociation,\n", " Modality,\n", ")\n", - "from connects_common_connectivity.write_utils import append_new_dataitems" + "from connects_common_connectivity.config import output_root\n", + "from connects_common_connectivity.io import write_models\n" ] }, { "cell_type": "code", - "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:20.710164Z", - "iopub.status.busy": "2026-04-30T23:47:20.709772Z", - "iopub.status.idle": "2026-04-30T23:47:20.713941Z", - "shell.execute_reply": "2026-04-30T23:47:20.713229Z" - } - }, + "execution_count": 5, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v1/\n", + "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v2/\n", "PROJECT_ID : minnie65\n", "DATASET_ID : minnie65_v1412_nuclei\n", "CAVE_DATASTACK : minnie65_phase3_v1\n", - "CAVE_VERSION : 1412\n", + "CAVE_VERSION : 1419\n", "CAVE_VIEW : 
nucleus_detection_lookup_v1\n" ] } ], "source": [ - "OUTPUT_ROOT = \"../scratch/em_patchseq_wnm_v1/\"\n", + "OUTPUT_ROOT = output_root()\n", "PROJECT_ID = \"minnie65\"\n", "DATASET_ID = \"minnie65_v1412_nuclei\"\n", "CAVE_DATASTACK = \"minnie65_phase3_v1\"\n", @@ -102,106 +83,8 @@ ] }, { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:20.715508Z", - "iopub.status.busy": "2026-04-30T23:47:20.715326Z", - "iopub.status.idle": "2026-04-30T23:47:24.276080Z", - "shell.execute_reply": "2026-04-30T23:47:24.275422Z" - } - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Shape: (133969, 7)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
idvolumept_root_idorig_root_idpt_supervoxel_idpt_positionpt_position_lookup
1373879229.04504386469113609013560786469113609013560796218056992431305[228816, 239776, 19593][228816, 239776, 19593]
320185893.75383686469113537389367886469113537389367884955554103121097[146848, 213600, 26267][146848, 213600, 26267]
4600774135.1897918646911356823787440111493022281121981[339120, 276112, 19442][339520, 276480, 19506]
\n", - "" - ], - "text/plain": [ - " id volume pt_root_id orig_root_id \\\n", - "1 373879 229.045043 864691136090135607 864691136090135607 \n", - "3 201858 93.753836 864691135373893678 864691135373893678 \n", - "4 600774 135.189791 864691135682378744 0 \n", - "\n", - " pt_supervoxel_id pt_position pt_position_lookup \n", - "1 96218056992431305 [228816, 239776, 19593] [228816, 239776, 19593] \n", - "3 84955554103121097 [146848, 213600, 26267] [146848, 213600, 26267] \n", - "4 111493022281121981 [339120, 276112, 19442] [339520, 276480, 19506] " - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], + "cell_type": "raw", + "metadata": {}, "source": [ "client = caveclient.CAVEclient(CAVE_DATASTACK, auth_token=os.environ[\"CUSTOM_KEY\"])\n", "client.materialize.version = CAVE_VERSION\n", @@ -222,7 +105,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:24.278066Z", @@ -231,15 +114,7 @@ "shell.execute_reply": "2026-04-30T23:47:24.364040Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataSet written: (1, 5)\n" - ] - } - ], + "outputs": [], "source": [ "dataset = DataSet(\n", " id=DATASET_ID,\n", @@ -248,27 +123,13 @@ " modality=Modality.ELECTRON_MICROSCOPY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "\n", - "schema_ds = build_arrow_schema(DataSet)\n", - "table_ds = models_to_table([dataset], schema=schema_ds)\n", - "table_ds = attach_linkml_metadata(table_ds, linkml_class=\"DataSet\")\n", - "\n", - "# mode='overwrite' makes re-runs idempotent instead of appending duplicates.\n", - "# predicate scopes the overwrite to this project only — other projects' rows\n", - "# in the shared Delta table are left untouched.\n", - "write_deltalake(\n", - " OUTPUT_ROOT + \"dataset/\",\n", - " table_ds,\n", - " mode=\"overwrite\",\n", - " predicate=f\"project_id = '{PROJECT_ID}'\",\n", - " 
partition_by=[\"project_id\"],\n", - ")\n", - "print(\"DataSet written:\", table_ds.shape)" + "result = write_models([dataset])\n", + "print(f\"DataSet written: {result.rows_written} rows\")" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:24.366592Z", @@ -277,29 +138,12 @@ "shell.execute_reply": "2026-04-30T23:47:24.397292Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "(1, 5)\n", - "shape: (1, 5)\n", - "┌──────────────────────┬─────────────────┬──────────────────────┬─────────────────────┬────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞══════════════════════╪═════════════════╪══════════════════════╪═════════════════════╪════════════╡\n", - "│ minnie65_v1412_nucle ┆ Minnie65 v1412 ┆ doi.org/10.1038/s415 ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", - "│ i ┆ nucleus catalog ┆ 86-025-087… ┆ ┆ │\n", - "└──────────────────────┴─────────────────┴──────────────────────┴─────────────────────┴────────────┘\n" - ] - } - ], + "outputs": [], "source": [ "# Verification\n", "ds_verify = (\n", " pl.read_delta(OUTPUT_ROOT + \"dataset/\")\n", - " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"id\") == DATASET_ID))\n", " .filter(pl.col(\"id\") == DATASET_ID)\n", ")\n", "print(ds_verify.shape)\n", @@ -317,7 +161,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:24.400176Z", @@ -326,29 +170,13 @@ "shell.execute_reply": "2026-04-30T23:47:26.643138Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataItem rows appended: 0 (total in batch: 133969)\n" - ] - } - ], + "outputs": [], "source": [ "dataitems = [\n", " DataItem(id=str(row.id), 
name=str(row.pt_root_id), project_id=PROJECT_ID)\n", " for row in nuc_df.itertuples()\n", "]\n", - "\n", - "schema_di = build_arrow_schema(DataItem)\n", - "table_di = models_to_table(dataitems, schema=schema_di)\n", - "table_di = attach_linkml_metadata(table_di, linkml_class=\"DataItem\")\n", - "\n", - "# append_new_dataitems checks which ids already exist for this project and appends\n", - "# only new rows — safe when multiple _01 notebooks share a project_id, since\n", - "# each dataset's cells are registered without wiping the other's rows.\n", - "n_appended = append_new_dataitems(OUTPUT_ROOT + \"dataitem/\", table_di, project_id=PROJECT_ID)\n", + "n_appended = write_models(dataitems).rows_written\n", "print(f\"DataItem rows appended: {n_appended} (total in batch: {len(dataitems)})\")" ] }, @@ -407,7 +235,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:26.723482Z", @@ -416,15 +244,7 @@ "shell.execute_reply": "2026-04-30T23:47:28.450093Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataItemDataSetAssociation written: (133969, 3)\n" - ] - } - ], + "outputs": [], "source": [ "associations = [\n", " DataItemDataSetAssociation(\n", @@ -434,22 +254,8 @@ " )\n", " for item in dataitems\n", "]\n", - "\n", - "schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", - "table_assoc = models_to_table(associations, schema=schema_assoc)\n", - "table_assoc = attach_linkml_metadata(table_assoc, linkml_class=\"DataItemDataSetAssociation\")\n", - "\n", - "# mode='overwrite' makes re-runs idempotent instead of appending duplicates.\n", - "# predicate scopes the overwrite to this project only — other projects' rows\n", - "# in the shared Delta table are left untouched.\n", - "write_deltalake(\n", - " OUTPUT_ROOT + \"dataitem_dataset_association/\",\n", - " table_assoc,\n", - " mode=\"overwrite\",\n", - " predicate=f\"project_id = 
'{PROJECT_ID}' AND dataset_id = '{DATASET_ID}'\",\n", - " partition_by=[\"project_id\"],\n", - ")\n", - "print(\"DataItemDataSetAssociation written:\", table_assoc.shape)" + "result = write_models(associations)\n", + "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, { diff --git a/code/etl_minnie_02_cell_features.ipynb b/code/etl_minnie_02_cell_features.ipynb index 08ae4fc..32ecc83 100644 --- a/code/etl_minnie_02_cell_features.ipynb +++ b/code/etl_minnie_02_cell_features.ipynb @@ -4,14 +4,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — Minnie65: Cell Features\n", + "# ETL \u2014 Minnie65: Cell Features\n", "\n", "Writes the CSM dendrite-ultrastructure cohort `DataSet` (`minnie65_v1412_csm_cluster`), its `DataItemDataSetAssociation` links, `CellFeatureDefinition` rows, `CellFeatureSet` rows, wide-form feature parquet tables, and `CellFeatureMatrix` pointer rows for two feature sets. Each feature-set section is independently idempotent. Prerequisite: `etl_minnie_01_dataset_dataitem.ipynb`." 
] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:37.269601Z", @@ -20,16 +20,7 @@ "shell.execute_reply": "2026-04-30T23:47:39.328688Z" } }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/opt/conda/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.2.2) or chardet (7.4.3)/charset_normalizer (3.3.2) doesn't match a supported version!\n", - " warnings.warn(\n" - ] - } - ], + "outputs": [], "source": [ "import os\n", "from pathlib import Path\n", @@ -42,12 +33,6 @@ "import standard_transform\n", "from deltalake import write_deltalake\n", "\n", - "from connects_common_connectivity.arrow_utils import (\n", - " attach_linkml_metadata,\n", - " build_arrow_schema,\n", - " build_cell_feature_matrix_schema,\n", - " models_to_table,\n", - ")\n", "from connects_common_connectivity.models import (\n", " CellFeatureDefinition,\n", " CellFeatureMatrix,\n", @@ -56,12 +41,15 @@ " DataSet,\n", " Modality,\n", " Unit,\n", - ")" + ")\n", + "from connects_common_connectivity.io.arrow_utils import build_cell_feature_matrix_schema\n", + "from connects_common_connectivity.config import output_root\n", + "from connects_common_connectivity.io import write_models\n" ] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:39.331475Z", @@ -70,23 +58,11 @@ "shell.execute_reply": "2026-04-30T23:47:39.334773Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v1/\n", - "PROJECT_ID : minnie65\n", - "COHORT_DATASET_ID : minnie65_v1412_csm_cluster\n", - "FSI_CSM : csm_cluster_features\n", - "FSI_STD : minnie65_std_transform_coordinates\n" - ] - } - ], + "outputs": [], "source": [ "FEATURES_PARQUET = \"/data/minnie1412/minnie_features.parquet\"\n", "FEATURES_CSV 
= \"/data/minnie1412/minnie_cell_features.csv\"\n", - "OUTPUT_ROOT = \"../scratch/em_patchseq_wnm_v1/\"\n", + "OUTPUT_ROOT = output_root()\n", "PROJECT_ID = \"minnie65\"\n", "COHORT_DATASET_ID = \"minnie65_v1412_csm_cluster\"\n", "FSI_CSM = \"csm_cluster_features\"\n", @@ -134,7 +110,7 @@ " pl.read_delta(OUTPUT_ROOT + \"dataset/\")\n", " .filter(pl.col(\"id\") == \"minnie65_v1412_nuclei\")\n", ")\n", - "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first — minnie65_v1412_nuclei DataSet not found\"\n", + "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first \u2014 minnie65_v1412_nuclei DataSet not found\"\n", "print(\"Prerequisite OK:\", prereq[\"id\"][0])" ] }, @@ -161,7 +137,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Dropped 4 duplicate id row(s): 35787 → 35783\n", + "Dropped 4 duplicate id row(s): 35787 \u2192 35783\n", "Features parquet shape: (35783, 112)\n", "Feature metadata CSV shape: (82, 6)\n" ] @@ -285,7 +261,7 @@ " \n", " \n", "\n", - "

3 rows × 112 columns

\n", + "

3 rows \u00d7 112 columns

\n", "" ], "text/plain": [ @@ -415,7 +391,7 @@ "# table cell-indexed (one row per nucleus id).\n", "n_before = len(feat_df)\n", "feat_df = feat_df.drop_duplicates(subset=\"id\", keep=\"first\")\n", - "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} → {len(feat_df)}\")\n", + "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} \u2192 {len(feat_df)}\")\n", "\n", "print(\"Features parquet shape:\", feat_df.shape)\n", "print(\"Feature metadata CSV shape:\", feat_meta.shape)\n", @@ -432,7 +408,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:39.520747Z", @@ -441,16 +417,7 @@ "shell.execute_reply": "2026-04-30T23:47:40.174643Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataSet written: (1, 5)\n", - "DataItemDataSetAssociation written: (35783, 3)\n" - ] - } - ], + "outputs": [], "source": [ "cohort_ds = DataSet(\n", " id=COHORT_DATASET_ID,\n", @@ -463,22 +430,7 @@ " DataItemDataSetAssociation(dataitem_id=cid, dataset_id=COHORT_DATASET_ID, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "\n", - "schema_ds = build_arrow_schema(DataSet)\n", - "table_ds = attach_linkml_metadata(models_to_table([cohort_ds], schema=schema_ds), linkml_class=\"DataSet\")\n", - "schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", - "table_assoc = attach_linkml_metadata(models_to_table(associations, schema=schema_assoc), linkml_class=\"DataItemDataSetAssociation\")\n", - "\n", - "# Predicate must include id=COHORT_DATASET_ID, not just project_id. 
The dataset/ table\n", - "# is shared across all notebooks for this project — a predicate of project_id='minnie65'\n", - "# alone would wipe the nucleus catalog row (minnie65_v1412_nuclei) written by\n", - "# etl_minnie_01, forcing that notebook to be rerun before this one works again.\n", - "write_deltalake(OUTPUT_ROOT + \"dataset/\", table_ds,\n", - " mode=\"overwrite\", predicate=f\"project_id = '{PROJECT_ID}' AND id = '{COHORT_DATASET_ID}'\", partition_by=[\"project_id\"])\n", - "write_deltalake(OUTPUT_ROOT + \"dataitem_dataset_association/\", table_assoc,\n", - " mode=\"overwrite\", predicate=f\"project_id = '{PROJECT_ID}'\", partition_by=[\"project_id\"])\n", - "print(\"DataSet written:\", table_ds.shape)\n", - "print(\"DataItemDataSetAssociation written:\", table_assoc.shape)" + "result = write_models([cohort_ds])" ] }, { @@ -499,14 +451,14 @@ "text": [ "DataSet: (1, 5)\n", "shape: (1, 5)\n", - "┌────────────────────────────┬────────────────────┬─────────────┬─────────────────────┬────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞════════════════════════════╪════════════════════╪═════════════╪═════════════════════╪════════════╡\n", - "│ minnie65_v1412_csm_cluster ┆ Minnie65 v1412 CSM ┆ null ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", - "│ ┆ dendrite ul… ┆ ┆ ┆ │\n", - "└────────────────────────────┴────────────────────┴─────────────┴─────────────────────┴────────────┘\n", + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 minnie65_v1412_csm_cluster \u2506 Minnie65 v1412 CSM \u2506 null \u2506 ELECTRON_MICROSCOPY \u2506 minnie65 \u2502\n", + "\u2502 \u2506 dendrite ul\u2026 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "Associations: (35783, 3)\n" ] } @@ -534,7 +486,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-04-30T23:47:40.224480Z", @@ -543,15 +495,7 @@ "shell.execute_reply": "2026-04-30T23:47:40.318057Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "CSM CellFeatureDefinition written: (82, 8)\n" - ] - } - ], + "outputs": [], "source": [ "# Build CSM CellFeatureDefinitions from CSV\n", "csm_fds = []\n", @@ -569,19 +513,8 @@ " if pd.notna(row[\"range_max\"]):\n", " kwargs[\"range_max\"] = float(row[\"range_max\"])\n", " csm_fds.append(CellFeatureDefinition(**kwargs))\n", - "\n", - "schema_cfd = build_arrow_schema(CellFeatureDefinition)\n", - "table_cfd_csm = attach_linkml_metadata(\n", - " models_to_table(csm_fds, schema=schema_cfd), linkml_class=\"CellFeatureDefinition\"\n", - ")\n", - "# Predicate scopes this overwrite to the CSM feature set only — STD definitions are untouched.\n", - "write_deltalake(\n", - " OUTPUT_ROOT + \"cellfeaturedefinition/\", table_cfd_csm,\n", - " mode=\"overwrite\",\n", - " predicate=f\"project_id = '{PROJECT_ID}' AND feature_set_id = '{FSI_CSM}'\",\n", - " partition_by=[\"project_id\", \"feature_set_id\"],\n", - ")\n", - "print(\"CSM CellFeatureDefinition written:\", table_cfd_csm.shape)" + "result = 
write_models(csm_fds)\n", + "print(f\"CellFeatureDefinition written: {result.rows_written} rows\")" ] }, { @@ -602,23 +535,23 @@ "text": [ "(82, 8)\n", "shape: (3, 8)\n", - "┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐\n", - "│ id ┆ descriptio ┆ unit ┆ data_type ┆ range_min ┆ range_max ┆ project_i ┆ feature_s │\n", - "│ --- ┆ n ┆ --- ┆ --- ┆ --- ┆ --- ┆ d ┆ et_id │\n", - "│ str ┆ --- ┆ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ --- │\n", - "│ ┆ str ┆ ┆ ┆ ┆ ┆ str ┆ str │\n", - "╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡\n", - "│ nucleus_vo ┆ Nucleus ┆ MICRONS_CU ┆ pa.Table: diff --git a/tests/test_writers.py b/tests/test_writers.py index ab2f995..c9af853 100644 --- a/tests/test_writers.py +++ b/tests/test_writers.py @@ -28,6 +28,7 @@ write_projection_matrix, ) from connects_common_connectivity.models import ( + AlgorithmRun, CellFeatureDefinition, CellFeatureMatrix, CellFeatureSet, @@ -38,6 +39,7 @@ DataItem, DataItemDataSetAssociation, DataSet, + HierarchyCategory, Laterality, MappingSet, Modality, @@ -245,6 +247,10 @@ def _make_instance(cls): region_index=["VISp"], values="file:///tmp/pmm.delta", ) + if cls is AlgorithmRun: + return AlgorithmRun(id="run1", algorithm_name="kmeans") + if cls is HierarchyCategory: + return HierarchyCategory(id="cluster", description="leaf", level="0") raise AssertionError(f"no fixture for {cls.__name__}") From 046df9811995789d49d3f58cf46fcd364bef819a Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Mon, 15 Jun 2026 21:55:06 +0000 Subject: [PATCH 13/25] minnie save v 1300 migration notebooks --- code/etl_minnie_01_dataset_dataitem.ipynb | 222 +++++-- code/etl_minnie_02_cell_features.ipynb | 595 ++++++++---------- ...ie_03_cluster_and_cluster_membership.ipynb | 182 +++--- code/etl_minnie_04_cell_cell.ipynb | 355 ++++++++--- .../io/writers.py | 22 + 5 files changed, 793 insertions(+), 583 deletions(-) diff --git 
a/code/etl_minnie_01_dataset_dataitem.ipynb b/code/etl_minnie_01_dataset_dataitem.ipynb index db105fb..ac6dd2b 100644 --- a/code/etl_minnie_01_dataset_dataitem.ipynb +++ b/code/etl_minnie_01_dataset_dataitem.ipynb @@ -6,7 +6,7 @@ "source": [ "# ETL — Minnie65: DataSet & DataItem\n", "\n", - "Writes one `DataSet` record (`dataset_id = \"minnie65_v1412_nuclei\"`, `project_id = \"minnie65\"`) and one `DataItem` per nucleus from the CAVE `nucleus_detection_lookup_v1` view at materialization version 1412, plus the corresponding `DataItemDataSetAssociation` links. Cohort DataSets (e.g. `minnie65_v1412_csm_cluster`) and cell features are written by later notebooks." + "Writes one `DataSet` record (`dataset_id = \"minnie65_v1300_nuclei\"`, `project_id = \"minnie65\"`) and one `DataItem` per nucleus from the CAVE `nucleus_detection_lookup_v1` view at materialization version 1300, plus the corresponding `DataItemDataSetAssociation` links. Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) and cell features are written by later notebooks." 
] }, { @@ -43,7 +43,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 2, "metadata": {}, "outputs": [ { @@ -52,9 +52,9 @@ "text": [ "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v2/\n", "PROJECT_ID : minnie65\n", - "DATASET_ID : minnie65_v1412_nuclei\n", + "DATASET_ID : minnie65_v1300_nuclei\n", "CAVE_DATASTACK : minnie65_phase3_v1\n", - "CAVE_VERSION : 1419\n", + "CAVE_VERSION : 1300\n", "CAVE_VIEW : nucleus_detection_lookup_v1\n" ] } @@ -62,9 +62,9 @@ "source": [ "OUTPUT_ROOT = output_root()\n", "PROJECT_ID = \"minnie65\"\n", - "DATASET_ID = \"minnie65_v1412_nuclei\"\n", + "DATASET_ID = \"minnie65_v1300_nuclei\"\n", "CAVE_DATASTACK = \"minnie65_phase3_v1\"\n", - "CAVE_VERSION = 1412\n", + "CAVE_VERSION = 1300\n", "CAVE_VIEW = \"nucleus_detection_lookup_v1\"\n", "\n", "print(f\"OUTPUT_ROOT : {OUTPUT_ROOT}\")\n", @@ -83,8 +83,99 @@ ] }, { - "cell_type": "raw", + "cell_type": "code", + "execution_count": 3, "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Shape: (133969, 7)\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
    id      volume      pt_root_id          orig_root_id        pt_supervoxel_id    pt_position              pt_position_lookup
1   373879  229.045043  864691136090135607  864691136090135607  96218056992431305   [228816, 239776, 19593]  [228816, 239776, 19593]
3   201858  93.753836   864691135373893678  864691135373893678  84955554103121097   [146848, 213600, 26267]  [146848, 213600, 26267]
4   600774  135.189791  864691135682378744  0                   111493022281121981  [339120, 276112, 19442]  [339520, 276480, 19506]
" + ], + "text/plain": [ + " id volume pt_root_id orig_root_id \\\n", + "1 373879 229.045043 864691136090135607 864691136090135607 \n", + "3 201858 93.753836 864691135373893678 864691135373893678 \n", + "4 600774 135.189791 864691135682378744 0 \n", + "\n", + " pt_supervoxel_id pt_position pt_position_lookup \n", + "1 96218056992431305 [228816, 239776, 19593] [228816, 239776, 19593] \n", + "3 84955554103121097 [146848, 213600, 26267] [146848, 213600, 26267] \n", + "4 111493022281121981 [339120, 276112, 19442] [339520, 276480, 19506] " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "client = caveclient.CAVEclient(CAVE_DATASTACK, auth_token=os.environ[\"CUSTOM_KEY\"])\n", "client.materialize.version = CAVE_VERSION\n", @@ -105,20 +196,21 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:24.278066Z", - "iopub.status.busy": "2026-04-30T23:47:24.277862Z", - "iopub.status.idle": "2026-04-30T23:47:24.364744Z", - "shell.execute_reply": "2026-04-30T23:47:24.364040Z" + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataSet written: 1 rows\n" + ] } - }, - "outputs": [], + ], "source": [ "dataset = DataSet(\n", " id=DATASET_ID,\n", - " name=\"Minnie65 v1412 nucleus catalog\",\n", + " name=\"Minnie65 v1300 nucleus catalog\",\n", " publication=\"doi.org/10.1038/s41586-025-08778-6\",\n", " modality=Modality.ELECTRON_MICROSCOPY.value,\n", " project_id=PROJECT_ID,\n", @@ -129,16 +221,26 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:24.366592Z", - "iopub.status.busy": "2026-04-30T23:47:24.366391Z", - "iopub.status.idle": "2026-04-30T23:47:24.398220Z", - "shell.execute_reply": "2026-04-30T23:47:24.397292Z" + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "(1, 5)\n", + "shape: (1, 5)\n", + "┌──────────────────────┬─────────────────┬──────────────────────┬─────────────────────┬────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞══════════════════════╪═════════════════╪══════════════════════╪═════════════════════╪════════════╡\n", + "│ minnie65_v1300_nucle ┆ Minnie65 v1300 ┆ doi.org/10.1038/s415 ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", + "│ i ┆ nucleus catalog ┆ 86-025-087… ┆ ┆ │\n", + "└──────────────────────┴─────────────────┴──────────────────────┴─────────────────────┴────────────┘\n" + ] } - }, - "outputs": [], + ], "source": [ "# Verification\n", "ds_verify = (\n", @@ -161,16 +263,17 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:24.400176Z", - "iopub.status.busy": "2026-04-30T23:47:24.399893Z", - "iopub.status.idle": "2026-04-30T23:47:26.643935Z", - "shell.execute_reply": "2026-04-30T23:47:26.643138Z" + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataItem rows appended: 133969 (total in batch: 133969)\n" + ] } - }, - "outputs": [], + ], "source": [ "dataitems = [\n", " DataItem(id=str(row.id), name=str(row.pt_root_id), project_id=PROJECT_ID)\n", @@ -183,14 +286,7 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:26.645628Z", - "iopub.status.busy": "2026-04-30T23:47:26.645427Z", - "iopub.status.idle": "2026-04-30T23:47:26.721724Z", - "shell.execute_reply": "2026-04-30T23:47:26.720975Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -235,16 +331,17 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:26.723482Z", - "iopub.status.busy": 
"2026-04-30T23:47:26.723266Z", - "iopub.status.idle": "2026-04-30T23:47:28.450853Z", - "shell.execute_reply": "2026-04-30T23:47:28.450093Z" + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataItemDataSetAssociation written: 133969 rows\n" + ] } - }, - "outputs": [], + ], "source": [ "associations = [\n", " DataItemDataSetAssociation(\n", @@ -261,14 +358,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:28.452562Z", - "iopub.status.busy": "2026-04-30T23:47:28.452363Z", - "iopub.status.idle": "2026-04-30T23:47:28.497155Z", - "shell.execute_reply": "2026-04-30T23:47:28.496377Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -281,11 +371,11 @@ "│ --- ┆ --- ┆ --- │\n", "│ str ┆ str ┆ str │\n", "╞═════════════╪═══════════════════════╪════════════╡\n", - "│ 373879 ┆ minnie65_v1412_nuclei ┆ minnie65 │\n", - "│ 201858 ┆ minnie65_v1412_nuclei ┆ minnie65 │\n", - "│ 600774 ┆ minnie65_v1412_nuclei ┆ minnie65 │\n", - "│ 408486 ┆ minnie65_v1412_nuclei ┆ minnie65 │\n", - "│ 598774 ┆ minnie65_v1412_nuclei ┆ minnie65 │\n", + "│ 373879 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 201858 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 600774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 408486 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 598774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", "└─────────────┴───────────────────────┴────────────┘\n" ] } @@ -318,7 +408,7 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | `len(nuc_df)` |\n", "\n", "**Intentionally not written here:**\n", - "- Cohort DataSets (e.g. `minnie65_v1412_csm_cluster`) — each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", + "- Cohort DataSets (e.g. 
`minnie65_v1300_csm_cluster`) — each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", "- Cell features (`pt_position`, cell type labels, etc.) — written in `_02` as `CellFeature` records." ] }, diff --git a/code/etl_minnie_02_cell_features.ipynb b/code/etl_minnie_02_cell_features.ipynb index 32ecc83..01c3828 100644 --- a/code/etl_minnie_02_cell_features.ipynb +++ b/code/etl_minnie_02_cell_features.ipynb @@ -4,23 +4,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 Minnie65: Cell Features\n", + "# ETL — Minnie65: Cell Features\n", "\n", - "Writes the CSM dendrite-ultrastructure cohort `DataSet` (`minnie65_v1412_csm_cluster`), its `DataItemDataSetAssociation` links, `CellFeatureDefinition` rows, `CellFeatureSet` rows, wide-form feature parquet tables, and `CellFeatureMatrix` pointer rows for two feature sets. Each feature-set section is independently idempotent. Prerequisite: `etl_minnie_01_dataset_dataitem.ipynb`." + "Writes the CSM dendrite-ultrastructure cohort `DataSet` (`minnie65_v1300_csm_cluster`), its `DataItemDataSetAssociation` links, `CellFeatureDefinition` rows, `CellFeatureSet` rows, wide-form feature parquet tables, and `CellFeatureMatrix` pointer rows for two feature sets. Each feature-set section is independently idempotent. Prerequisite: `etl_minnie_01_dataset_dataitem.ipynb`." 
] }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:37.269601Z", - "iopub.status.busy": "2026-04-30T23:47:37.269413Z", - "iopub.status.idle": "2026-04-30T23:47:39.329482Z", - "shell.execute_reply": "2026-04-30T23:47:39.328688Z" + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/opt/conda/lib/python3.12/site-packages/requests/__init__.py:113: RequestsDependencyWarning: urllib3 (2.2.2) or chardet (7.4.3)/charset_normalizer (3.3.2) doesn't match a supported version!\n", + " warnings.warn(\n" + ] } - }, - "outputs": [], + ], "source": [ "import os\n", "from pathlib import Path\n", @@ -49,26 +51,31 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:39.331475Z", - "iopub.status.busy": "2026-04-30T23:47:39.331078Z", - "iopub.status.idle": "2026-04-30T23:47:39.335310Z", - "shell.execute_reply": "2026-04-30T23:47:39.334773Z" + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v2/\n", + "PROJECT_ID : minnie65\n", + "COHORT_DATASET_ID : minnie65_v1300_csm_cluster\n", + "FSI_CSM : csm_cluster_features\n", + "FSI_STD : minnie65_std_transform_coordinates\n" + ] } - }, - "outputs": [], + ], "source": [ "FEATURES_PARQUET = \"/data/minnie1412/minnie_features.parquet\"\n", "FEATURES_CSV = \"/data/minnie1412/minnie_cell_features.csv\"\n", "OUTPUT_ROOT = output_root()\n", "PROJECT_ID = \"minnie65\"\n", - "COHORT_DATASET_ID = \"minnie65_v1412_csm_cluster\"\n", + "COHORT_DATASET_ID = \"minnie65_v1300_csm_cluster\"\n", "FSI_CSM = \"csm_cluster_features\"\n", "FSI_STD = \"minnie65_std_transform_coordinates\"\n", "CAVE_DATASTACK = \"minnie65_phase3_v1\"\n", - "CAVE_VERSION = 1412\n", + "CAVE_VERSION = 1300\n", "CAVE_VIEW = 
\"nucleus_detection_lookup_v1\"\n", "\n", "print(f\"OUTPUT_ROOT : {OUTPUT_ROOT}\")\n", @@ -88,29 +95,22 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:39.337347Z", - "iopub.status.busy": "2026-04-30T23:47:39.337158Z", - "iopub.status.idle": "2026-04-30T23:47:39.378022Z", - "shell.execute_reply": "2026-04-30T23:47:39.377250Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Prerequisite OK: minnie65_v1412_nuclei\n" + "Prerequisite OK: minnie65_v1300_nuclei\n" ] } ], "source": [ "prereq = (\n", " pl.read_delta(OUTPUT_ROOT + \"dataset/\")\n", - " .filter(pl.col(\"id\") == \"minnie65_v1412_nuclei\")\n", + " .filter(pl.col(\"id\") == \"minnie65_v1300_nuclei\")\n", ")\n", - "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first \u2014 minnie65_v1412_nuclei DataSet not found\"\n", + "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first — minnie65_v1300_nuclei DataSet not found\"\n", "print(\"Prerequisite OK:\", prereq[\"id\"][0])" ] }, @@ -124,20 +124,13 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:39.380017Z", - "iopub.status.busy": "2026-04-30T23:47:39.379810Z", - "iopub.status.idle": "2026-04-30T23:47:39.518990Z", - "shell.execute_reply": "2026-04-30T23:47:39.518178Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Dropped 4 duplicate id row(s): 35787 \u2192 35783\n", + "Dropped 4 duplicate id row(s): 35787 → 35783\n", "Features parquet shape: (35783, 112)\n", "Feature metadata CSV shape: (82, 6)\n" ] @@ -261,7 +254,7 @@ " \n", " \n", "\n", - "

3 rows \u00d7 112 columns

\n", + "

3 rows × 112 columns

\n", "" ], "text/plain": [ @@ -391,7 +384,7 @@ "# table cell-indexed (one row per nucleus id).\n", "n_before = len(feat_df)\n", "feat_df = feat_df.drop_duplicates(subset=\"id\", keep=\"first\")\n", - "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} \u2192 {len(feat_df)}\")\n", + "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} → {len(feat_df)}\")\n", "\n", "print(\"Features parquet shape:\", feat_df.shape)\n", "print(\"Feature metadata CSV shape:\", feat_meta.shape)\n", @@ -408,42 +401,37 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:39.520747Z", - "iopub.status.busy": "2026-04-30T23:47:39.520544Z", - "iopub.status.idle": "2026-04-30T23:47:40.175516Z", - "shell.execute_reply": "2026-04-30T23:47:40.174643Z" - } - }, + "execution_count": 12, + "metadata": {}, "outputs": [], "source": [ "cohort_ds = DataSet(\n", " id=COHORT_DATASET_ID,\n", - " name=\"Minnie65 v1412 CSM dendrite ultrastructure cohort\",\n", + " name=\"Minnie65 v1300 CSM dendrite ultrastructure cohort\",\n", " modality=Modality.ELECTRON_MICROSCOPY.value,\n", " project_id=PROJECT_ID,\n", ")\n", + "result = write_models([cohort_ds])" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ "cell_ids = feat_df[\"id\"].astype(str).tolist()\n", "associations = [\n", " DataItemDataSetAssociation(dataitem_id=cid, dataset_id=COHORT_DATASET_ID, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "result = write_models([cohort_ds])" + "result = write_models(associations)" ] }, { "cell_type": "code", - "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:40.177331Z", - "iopub.status.busy": "2026-04-30T23:47:40.177117Z", - "iopub.status.idle": "2026-04-30T23:47:40.222700Z", - "shell.execute_reply": "2026-04-30T23:47:40.221892Z" - } - }, + "execution_count": 14, + 
"metadata": {}, "outputs": [ { "name": "stdout", @@ -451,14 +439,14 @@ "text": [ "DataSet: (1, 5)\n", "shape: (1, 5)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 minnie65_v1412_csm_cluster \u2506 Minnie65 v1412 CSM \u2506 null \u2506 ELECTRON_MICROSCOPY \u2506 minnie65 \u2502\n", - "\u2502 \u2506 dendrite ul\u2026 \u2506 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + "┌────────────────────────────┬────────────────────┬─────────────┬─────────────────────┬────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞════════════════════════════╪════════════════════╪═════════════╪═════════════════════╪════════════╡\n", + "│ minnie65_v1300_csm_cluster ┆ Minnie65 v1300 CSM ┆ null ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", + "│ ┆ dendrite ul… ┆ ┆ ┆ │\n", + "└────────────────────────────┴────────────────────┴─────────────┴─────────────────────┴────────────┘\n", "Associations: (35783, 3)\n" ] } @@ -486,16 +474,17 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:40.224480Z", - "iopub.status.busy": "2026-04-30T23:47:40.224274Z", - "iopub.status.idle": "2026-04-30T23:47:40.318717Z", - "shell.execute_reply": "2026-04-30T23:47:40.318057Z" + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CellFeatureDefinition written: 82 rows\n" + ] } - }, - "outputs": [], + ], "source": [ "# Build CSM CellFeatureDefinitions from CSV\n", "csm_fds = []\n", @@ -519,15 +508,8 @@ }, { "cell_type": "code", - "execution_count": 8, - "metadata": { - "execution": { - "iopub.execute_input": "2026-04-30T23:47:40.320667Z", - 
"iopub.status.busy": "2026-04-30T23:47:40.320468Z", - "iopub.status.idle": "2026-04-30T23:47:40.360176Z", - "shell.execute_reply": "2026-04-30T23:47:40.359367Z" - } - }, + "execution_count": 16, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -535,23 +517,23 @@ "text": [ "(82, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 descriptio \u2506 unit \u2506 data_type \u2506 range_min \u2506 range_max \u2506 project_i \u2506 feature_s \u2502\n", - "\u2502 --- \u2506 n \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 d \u2506 et_id \u2502\n", - "\u2502 str \u2506 --- \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 nucleus_vo \u2506 Nucleus \u2506 
MICRONS_CU \u2506 \n", + "shape: (3, 9)
pre_pt_root_idpost_pt_root_idn_synsum_sizepre_nuc_idpost_nuc_id__index_level_0__pre_nuc_id_strpost_nuc_id_str
i64i64i64i64i64i64i64strstr
8646911351368998658646911348847431621598433717530404343176584"337175""304043"
8646911353603462008646911348847567301592033016733914243181492"330167""339142"
8646911353736017368646911348847567301385227359533914243181499"273595""339142"
" + ], + "text/plain": [ + "shape: (3, 9)\n", + "┌────────────┬────────────┬───────┬──────────┬───┬────────────┬────────────┬───────────┬───────────┐\n", + "│ pre_pt_roo ┆ post_pt_ro ┆ n_syn ┆ sum_size ┆ … ┆ post_nuc_i ┆ __index_le ┆ pre_nuc_i ┆ post_nuc_ │\n", + "│ t_id ┆ ot_id ┆ --- ┆ --- ┆ ┆ d ┆ vel_0__ ┆ d_str ┆ id_str │\n", + "│ --- ┆ --- ┆ i64 ┆ i64 ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ i64 ┆ i64 ┆ ┆ ┆ ┆ i64 ┆ i64 ┆ str ┆ str │\n", + "╞════════════╪════════════╪═══════╪══════════╪═══╪════════════╪════════════╪═══════════╪═══════════╡\n", + "│ 8646911351 ┆ 8646911348 ┆ 1 ┆ 5984 ┆ … ┆ 304043 ┆ 43176584 ┆ 337175 ┆ 304043 │\n", + "│ 36899865 ┆ 84743162 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 8646911353 ┆ 8646911348 ┆ 1 ┆ 5920 ┆ … ┆ 339142 ┆ 43181492 ┆ 330167 ┆ 339142 │\n", + "│ 60346200 ┆ 84756730 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 8646911353 ┆ 8646911348 ┆ 1 ┆ 3852 ┆ … ┆ 339142 ┆ 43181499 ┆ 273595 ┆ 339142 │\n", + "│ 73601736 ┆ 84756730 ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└────────────┴────────────┴───────┴──────────┴───┴────────────┴────────────┴───────────┴───────────┘" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ "conn_df = pl.read_parquet(PARQUET_PATH)\n", "print(f\"Raw parquet rows: {conn_df.shape[0]}\")\n", @@ -321,10 +446,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 25, "id": "6c002b88", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Example 1 filtered rows ((proofread ∩ CSM)-pre × CSM-post): 583750\n" + ] + } + ], "source": [ "conn_ex1 = conn_df.filter(\n", " pl.col(\"pre_nuc_id_str\").is_in(proofread_nuc_ids)\n", @@ -335,10 +468,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 26, "id": "91251dcc", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All 34313 unique cell ids confirmed in dataitem/.\n" + ] + } + ], "source": [ "pre_ids_ex1 
= set(conn_ex1[\"pre_nuc_id_str\"].to_list())\n", "post_ids_ex1 = set(conn_ex1[\"post_nuc_id_str\"].to_list())\n", @@ -351,10 +492,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 27, "id": "ebf80af9", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CellCellConnectivityLong rows (Example 1): 1167500\n" + ] + } + ], "source": [ "rows_ex1 = []\n", "for row in conn_ex1.iter_rows(named=True):\n", @@ -390,10 +539,18 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 28, "id": "e0ef2071", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Written to cellcellconnectivitylong_proofread_pre_to_csm_post/: 1167500 rows\n" + ] + } + ], "source": [ "schema_cc = build_arrow_schema(CellCellConnectivityLong)\n", "table_ex1 = attach_linkml_metadata(\n", @@ -413,10 +570,44 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 29, "id": "a494d0fe", "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Shape: (1167500, 9)\n", + "shape: (2, 2)\n", + "┌─────────────────────┬────────┐\n", + "│ measurement_type ┆ len │\n", + "│ --- ┆ --- │\n", + "│ str ┆ u32 │\n", + "╞═════════════════════╪════════╡\n", + "│ SYNAPSE_COUNT ┆ 583750 │\n", + "│ SUM_ANATOMICAL_SIZE ┆ 583750 │\n", + "└─────────────────────┴────────┘\n", + "shape: (3, 9)\n", + "┌─────────────┬────────────┬────────────┬────────────┬───┬───────┬───────┬────────────┬────────────┐\n", + "│ id ┆ descriptio ┆ presynapti ┆ postsynapt ┆ … ┆ value ┆ unit ┆ project_id ┆ measuremen │\n", + "│ --- ┆ n ┆ c_cell ┆ ic_cell ┆ ┆ --- ┆ --- ┆ --- ┆ t_type │\n", + "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ f64 ┆ str ┆ str ┆ --- │\n", + "│ ┆ str ┆ str ┆ str ┆ ┆ ┆ ┆ ┆ str │\n", + "╞═════════════╪════════════╪════════════╪════════════╪═══╪═══════╪═══════╪════════════╪════════════╡\n", + "│ 
301245_3614 ┆ null ┆ 301245 ┆ 361468 ┆ … ┆ 1.0 ┆ COUNT ┆ minnie65 ┆ SYNAPSE_CO │\n", + "│ 68_SYNAPSE_ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ UNT │\n", + "│ COUNT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 262722_2564 ┆ null ┆ 262722 ┆ 256466 ┆ … ┆ 1.0 ┆ COUNT ┆ minnie65 ┆ SYNAPSE_CO │\n", + "│ 66_SYNAPSE_ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ UNT │\n", + "│ COUNT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 296668_2285 ┆ null ┆ 296668 ┆ 228553 ┆ … ┆ 1.0 ┆ COUNT ┆ minnie65 ┆ SYNAPSE_CO │\n", + "│ 53_SYNAPSE_ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ UNT │\n", + "│ COUNT ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────┴────────────┴────────────┴────────────┴───┴───────┴───────┴────────────┴────────────┘\n" + ] + } + ], "source": [ "ex1_v = pl.read_delta(OUTPUT_ROOT + \"cellcellconnectivitylong_proofread_pre_to_csm_post/\").filter(\n", " pl.col(\"project_id\") == PROJECT_ID\n", @@ -441,7 +632,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": 30, "id": "34be2c33", "metadata": {}, "outputs": [ @@ -449,7 +640,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Example 2 filtered rows ((proofread ∩ CSM) × (proofread ∩ CSM)): 96788\n" + "Example 2 filtered rows ((proofread ∩ CSM) × (proofread ∩ CSM)): 83877\n" ] } ], @@ -463,7 +654,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": 31, "id": "b0aa6ab0", "metadata": {}, "outputs": [ @@ -471,7 +662,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "All 1861 unique cell ids confirmed in dataitem/.\n" + "All 1636 unique cell ids confirmed in dataitem/.\n" ] } ], @@ -487,7 +678,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": 32, "id": "6e4a2ee8", "metadata": {}, "outputs": [ @@ -495,7 +686,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "CellCellConnectivityLong rows (Example 2): 96788\n" + "CellCellConnectivityLong rows (Example 2): 83877\n" ] } ], @@ -522,7 +713,7 @@ }, { "cell_type": "code", - "execution_count": 21, + "execution_count": 33, "id": "65ffa4e9", "metadata": {}, "outputs": [ @@ -530,7 +721,7 @@ "name": "stdout", 
"output_type": "stream", "text": [ - "Written to cellcellconnectivitylong_proofread_to_proofread/: 96788 rows\n" + "Written to cellcellconnectivitylong_proofread_to_proofread/: 83877 rows\n" ] } ], @@ -552,7 +743,7 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": 34, "id": "e40d8180", "metadata": {}, "outputs": [ @@ -560,14 +751,14 @@ "name": "stdout", "output_type": "stream", "text": [ - "Shape: (96788, 9)\n", + "Shape: (83877, 9)\n", "shape: (1, 2)\n", "┌──────────────────┬───────┐\n", "│ measurement_type ┆ len │\n", "│ --- ┆ --- │\n", "│ str ┆ u32 │\n", "╞══════════════════╪═══════╡\n", - "│ SYNAPSE_COUNT ┆ 96788 │\n", + "│ SYNAPSE_COUNT ┆ 83877 │\n", "└──────────────────┴───────┘\n", "shape: (3, 9)\n", "┌─────────────┬────────────┬────────────┬────────────┬───┬───────┬───────┬────────────┬────────────┐\n", @@ -611,7 +802,7 @@ "\n", "| Path | Rows |\n", "|------|------|\n", - "| `dataset/` | +1 (`minnie65_v1412_proofread` = proofread ∩ CSM cells) |\n", + "| `dataset/` | +1 (`minnie65_v1300_proofread` = proofread ∩ CSM cells) |\n", "| `dataitem_dataset_association/` | one per proofread ∩ CSM cell |\n", "| `cellcellconnectivitylong_proofread_pre_to_csm_post/` | 2 × filtered pairs: (proofread ∩ CSM)-pre × CSM-post (`SYNAPSE_COUNT` + `SUM_ANATOMICAL_SIZE`) |\n", "| `cellcellconnectivitylong_proofread_to_proofread/` | 1 × filtered pairs: (proofread ∩ CSM) × (proofread ∩ CSM) (`SYNAPSE_COUNT` only) |\n", diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py index b6dc56c..e32072e 100644 --- a/src/connects_common_connectivity/io/writers.py +++ b/src/connects_common_connectivity/io/writers.py @@ -297,9 +297,31 @@ def write_projection_matrix( return write_models(enriched, settings=settings) +def write_cellcellconnectivitylong( + *args: Any, **kwargs: Any +) -> WriteResult: + """Placeholder writer for ``CellCellConnectivityLong`` rows. 
+ + TODO: ``CellCellConnectivityLong`` is not yet in the WriteSpec REGISTRY, + and the existing ETL notebooks (``etl_minnie_04_cell_cell.ipynb``, + ``parse_minnie_clustering.ipynb``) write to non-canonical, run-specific + subdirs (e.g. ``cellcellconnectivitylong_proofread_pre_to_csm_post/``) + rather than the canonical ``cellcellconnectivitylong/`` subdir that + ``write_models`` would resolve. Until we either (a) consolidate those + callers onto the canonical subdir and add a ``WriteSpec``, or (b) extend + the dispatch to accept a per-call subdir override, those notebooks keep + using ``write_deltalake`` directly. This stub exists as a reminder. + """ + raise NotImplementedError( + "write_cellcellconnectivitylong is not implemented yet; " + "see writers.py docstring for migration plan." + ) + + __all__ = [ "WRITABLE_CLASSES", "WriteResult", "write_models", "write_projection_matrix", + "write_cellcellconnectivitylong", ] From 82aaad86bcfd65e11c2568b4e82ea99e2124543f Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 16 Jun 2026 20:51:12 +0000 Subject: [PATCH 14/25] rewired output root in write functions, v1dd explore --- .codeocean/datasets.json | 4 + code/etl_minnie_01_dataset_dataitem.ipynb | 72 +- code/etl_minnie_02_cell_features.ipynb | 266 ++-- ...ie_03_cluster_and_cluster_membership.ipynb | 88 +- code/etl_minnie_04_cell_cell.ipynb | 222 ++-- code/etl_tasic_01_cluster.ipynb | 86 +- code/etl_v1dd_00_explore.ipynb | 1101 +++++++++++++++++ ...isp_exc_patchseq_01_dataset_dataitem.ipynb | 72 +- ...l_visp_exc_patchseq_02_cell_features.ipynb | 122 +- ...eq_03_cluster_membership_and_mapping.ipynb | 178 +-- ...isp_inh_patchseq_01_dataset_dataitem.ipynb | 70 +- ...l_visp_inh_patchseq_02_cell_features.ipynb | 132 +- ...eq_03_cluster_membership_and_mapping.ipynb | 140 +-- code/etl_visp_met_types_01_cluster.ipynb | 84 +- code/etl_wnm_exc_01_dataset_dataitem.ipynb | 72 +- code/etl_wnm_exc_02_cell_features.ipynb | 268 ++-- ...l_wnm_exc_03_cell_to_cluster_mapping.ipynb | 70 
+- code/etl_wnm_exc_04_projection_matrix.ipynb | 126 +- .../io/writers.py | 61 +- tests/test_writers.py | 76 ++ 20 files changed, 2268 insertions(+), 1042 deletions(-) create mode 100644 code/etl_v1dd_00_explore.ipynb diff --git a/.codeocean/datasets.json b/.codeocean/datasets.json index 716efc2..d16d13c 100644 --- a/.codeocean/datasets.json +++ b/.codeocean/datasets.json @@ -17,6 +17,10 @@ "id": "78a80081-c645-4e38-beb7-b9d9308a35d9", "mount": "microns1412" }, + { + "id": "aafc99cc-92ee-4d04-b152-92f1063a3268", + "mount": "v1dd_1196" + }, { "id": "aff09b9b-5cdc-49ef-8e39-358a8ead98d8", "mount": "visp-patchseq-taxonomy-info" diff --git a/code/etl_minnie_01_dataset_dataitem.ipynb b/code/etl_minnie_01_dataset_dataitem.ipynb index ac6dd2b..ed3d463 100644 --- a/code/etl_minnie_01_dataset_dataitem.ipynb +++ b/code/etl_minnie_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — Minnie65: DataSet & DataItem\n", + "# ETL \u2014 Minnie65: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"minnie65_v1300_nuclei\"`, `project_id = \"minnie65\"`) and one `DataItem` per nucleus from the CAVE `nucleus_detection_lookup_v1` view at materialization version 1300, plus the corresponding `DataItemDataSetAssociation` links. Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) and cell features are written by later notebooks." 
] @@ -215,7 +215,7 @@ " modality=Modality.ELECTRON_MICROSCOPY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([dataset])\n", + "result = write_models([dataset], output_root=OUTPUT_ROOT)\n", "print(f\"DataSet written: {result.rows_written} rows\")" ] }, @@ -230,14 +230,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌──────────────────────┬─────────────────┬──────────────────────┬─────────────────────┬────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞══════════════════════╪═════════════════╪══════════════════════╪═════════════════════╪════════════╡\n", - "│ minnie65_v1300_nucle ┆ Minnie65 v1300 ┆ doi.org/10.1038/s415 ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", - "│ i ┆ nucleus catalog ┆ 86-025-087… ┆ ┆ │\n", - "└──────────────────────┴─────────────────┴──────────────────────┴─────────────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 minnie65_v1300_nucle \u2506 Minnie65 v1300 \u2506 doi.org/10.1038/s415 \u2506 ELECTRON_MICROSCOPY \u2506 minnie65 \u2502\n", + "\u2502 i \u2506 nucleus catalog \u2506 86-025-087\u2026 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -279,7 +279,7 @@ " DataItem(id=str(row.id), name=str(row.pt_root_id), project_id=PROJECT_ID)\n", " for row in nuc_df.itertuples()\n", "]\n", - "n_appended = write_models(dataitems).rows_written\n", + "n_appended = write_models(dataitems, output_root=OUTPUT_ROOT).rows_written\n", "print(f\"DataItem rows appended: {n_appended} (total in batch: {len(dataitems)})\")" ] }, @@ -294,17 +294,17 @@ "text": [ "(133969, 4)\n", "shape: (5, 4)\n", - "┌────────┬────────────────────┬───────────────────┬────────────┐\n", - "│ id ┆ name ┆ neuroglancer_link ┆ 
project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str │\n", - "╞════════╪════════════════════╪═══════════════════╪════════════╡\n", - "│ 373879 ┆ 864691136090135607 ┆ null ┆ minnie65 │\n", - "│ 201858 ┆ 864691135373893678 ┆ null ┆ minnie65 │\n", - "│ 600774 ┆ 864691135682378744 ┆ null ┆ minnie65 │\n", - "│ 408486 ┆ 864691135194387242 ┆ null ┆ minnie65 │\n", - "│ 598774 ┆ 864691135741608653 ┆ null ┆ minnie65 │\n", - "└────────┴────────────────────┴───────────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 373879 \u2506 864691136090135607 \u2506 null \u2506 minnie65 \u2502\n", + "\u2502 201858 \u2506 864691135373893678 \u2506 null \u2506 minnie65 \u2502\n", + "\u2502 600774 \u2506 864691135682378744 \u2506 null \u2506 minnie65 \u2502\n", + "\u2502 408486 \u2506 864691135194387242 \u2506 null \u2506 minnie65 \u2502\n", + "\u2502 598774 \u2506 864691135741608653 \u2506 null \u2506 minnie65 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -351,7 +351,7 @@ " )\n", " for item in dataitems\n", "]\n", - "result = write_models(associations)\n", + "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, @@ -366,17 +366,17 @@ "text": [ "(133969, 3)\n", "shape: (5, 3)\n", - "┌─────────────┬───────────────────────┬────────────┐\n", - "│ dataitem_id ┆ dataset_id ┆ project_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═════════════╪═══════════════════════╪════════════╡\n", - "│ 373879 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", - "│ 201858 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", - "│ 600774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", - "│ 408486 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", - "│ 598774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", - "└─────────────┴───────────────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 373879 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", + "\u2502 201858 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", + "\u2502 600774 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", + "\u2502 408486 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", + "\u2502 598774 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -408,8 +408,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | `len(nuc_df)` |\n", "\n", "**Intentionally not written here:**\n", - "- Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) — each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", - "- Cell features (`pt_position`, cell type labels, etc.) — written in `_02` as `CellFeature` records." + "- Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) \u2014 each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", + "- Cell features (`pt_position`, cell type labels, etc.) \u2014 written in `_02` as `CellFeature` records." 
] }, { diff --git a/code/etl_minnie_02_cell_features.ipynb b/code/etl_minnie_02_cell_features.ipynb index 01c3828..dd20d19 100644 --- a/code/etl_minnie_02_cell_features.ipynb +++ b/code/etl_minnie_02_cell_features.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — Minnie65: Cell Features\n", + "# ETL \u2014 Minnie65: Cell Features\n", "\n", "Writes the CSM dendrite-ultrastructure cohort `DataSet` (`minnie65_v1300_csm_cluster`), its `DataItemDataSetAssociation` links, `CellFeatureDefinition` rows, `CellFeatureSet` rows, wide-form feature parquet tables, and `CellFeatureMatrix` pointer rows for two feature sets. Each feature-set section is independently idempotent. Prerequisite: `etl_minnie_01_dataset_dataitem.ipynb`." ] @@ -110,7 +110,7 @@ " pl.read_delta(OUTPUT_ROOT + \"dataset/\")\n", " .filter(pl.col(\"id\") == \"minnie65_v1300_nuclei\")\n", ")\n", - "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first — minnie65_v1300_nuclei DataSet not found\"\n", + "assert prereq.shape[0] == 1, \"etl_minnie_01 must be run first \u2014 minnie65_v1300_nuclei DataSet not found\"\n", "print(\"Prerequisite OK:\", prereq[\"id\"][0])" ] }, @@ -130,7 +130,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Dropped 4 duplicate id row(s): 35787 → 35783\n", + "Dropped 4 duplicate id row(s): 35787 \u2192 35783\n", "Features parquet shape: (35783, 112)\n", "Feature metadata CSV shape: (82, 6)\n" ] @@ -254,7 +254,7 @@ " \n", " \n", "\n", - "

3 rows × 112 columns

\n", + "

3 rows \u00d7 112 columns

\n", "" ], "text/plain": [ @@ -384,7 +384,7 @@ "# table cell-indexed (one row per nucleus id).\n", "n_before = len(feat_df)\n", "feat_df = feat_df.drop_duplicates(subset=\"id\", keep=\"first\")\n", - "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} → {len(feat_df)}\")\n", + "print(f\"Dropped {n_before - len(feat_df)} duplicate id row(s): {n_before} \u2192 {len(feat_df)}\")\n", "\n", "print(\"Features parquet shape:\", feat_df.shape)\n", "print(\"Feature metadata CSV shape:\", feat_meta.shape)\n", @@ -411,7 +411,7 @@ " modality=Modality.ELECTRON_MICROSCOPY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([cohort_ds])" + "result = write_models([cohort_ds], output_root=OUTPUT_ROOT)" ] }, { @@ -425,7 +425,7 @@ " DataItemDataSetAssociation(dataitem_id=cid, dataset_id=COHORT_DATASET_ID, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "result = write_models(associations)" + "result = write_models(associations, output_root=OUTPUT_ROOT)" ] }, { @@ -439,14 +439,14 @@ "text": [ "DataSet: (1, 5)\n", "shape: (1, 5)\n", - "┌────────────────────────────┬────────────────────┬─────────────┬─────────────────────┬────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞════════════════════════════╪════════════════════╪═════════════╪═════════════════════╪════════════╡\n", - "│ minnie65_v1300_csm_cluster ┆ Minnie65 v1300 CSM ┆ null ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", - "│ ┆ dendrite ul… ┆ ┆ ┆ │\n", - "└────────────────────────────┴────────────────────┴─────────────┴─────────────────────┴────────────┘\n", + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 minnie65_v1300_csm_cluster \u2506 Minnie65 v1300 CSM \u2506 null \u2506 ELECTRON_MICROSCOPY \u2506 minnie65 \u2502\n", + "\u2502 \u2506 dendrite ul\u2026 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "Associations: (35783, 3)\n" ] } @@ -502,7 +502,7 @@ " if pd.notna(row[\"range_max\"]):\n", " kwargs[\"range_max\"] = float(row[\"range_max\"])\n", " csm_fds.append(CellFeatureDefinition(**kwargs))\n", - "result = write_models(csm_fds)\n", + "result = write_models(csm_fds, output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureDefinition written: {result.rows_written} rows\")" ] }, @@ -517,23 +517,23 @@ "text": [ "(82, 8)\n", "shape: (3, 8)\n", - "┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐\n", - "│ id ┆ descriptio ┆ unit ┆ data_type ┆ range_min ┆ range_max ┆ project_i ┆ feature_s │\n", - "│ --- ┆ n ┆ --- ┆ --- ┆ --- ┆ --- ┆ d ┆ et_id │\n", - "│ str ┆ --- ┆ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ --- │\n", - "│ ┆ str ┆ ┆ ┆ ┆ ┆ str ┆ str │\n", - "╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡\n", - "│ nucleus_vo ┆ Nucleus ┆ MICRONS_CU ┆ \n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idpt_position_xpt_position_ypt_position_zpt_position_trform_xpt_position_trform_ypt_position_trform_zpt_root_idvolumecell_type_coarsecell_type
0228132632828749849738270-323721.447979549910.283106392909.832613864691132737039043458.464831NoneNone
1543247130492297791583880330339.020171595962.275760-306424.55135486469113273083998873.345940NoneNone
2203262624680531094283770-252082.627894203770.72823521544.029756864691132654552792338.276613EL3-IT
\n", + "" + ], + "text/plain": [ + " id pt_position_x pt_position_y pt_position_z pt_position_trform_x \\\n", + "0 228132 632828 749849 738270 -323721.447979 \n", + "1 543247 1304922 977915 83880 330339.020171 \n", + "2 203262 624680 531094 283770 -252082.627894 \n", + "\n", + " pt_position_trform_y pt_position_trform_z pt_root_id volume \\\n", + "0 549910.283106 392909.832613 864691132737039043 458.464831 \n", + "1 595962.275760 -306424.551354 864691132730839988 73.345940 \n", + "2 203770.728235 21544.029756 864691132654552792 338.276613 \n", + "\n", + " cell_type_coarse cell_type \n", + "0 None None \n", + "1 None None \n", + "2 E L3-IT " + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "soma_df = pd.read_feather(DATA_ROOT / \"soma_and_cell_type_1196.feather\")\n", + "print(\"shape :\", soma_df.shape)\n", + "print(\"cols :\", list(soma_df.columns))\n", + "print(\"dtypes :\\n\", soma_df.dtypes)\n", + "print(\"n_unique pt_root_id :\", soma_df['pt_root_id'].nunique())\n", + "print(\"n_unique id :\", soma_df['id'].nunique())\n", + "print(\"cell_type_coarse counts:\\n\", soma_df['cell_type_coarse'].value_counts(dropna=False).head())\n", + "print(\"cell_type counts (top 10):\\n\", soma_df['cell_type'].value_counts(dropna=False).head(10))\n", + "soma_df.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "744dd27f", + "metadata": {}, + "source": [ + "**Schema mapping:**\n", + "\n", + "- `core_schema.yaml::DataItem` — one row per nucleus, `id = str(row.id)` (the soma id), `name = str(row.pt_root_id)`. Mirrors `etl_minnie_01`. Also `DataItemDataSetAssociation` linking each soma to the V1DD `DataSet`.\n", + "- `cell_features_schema.yaml::CellFeatureMatrix` — `pt_position_{x,y,z}` (voxel-space soma centroid), `pt_position_trform_{x,y,z}` (transformed/CCF coords), and `volume` make a numeric feature set (e.g. 
`feature_set_id = \"v1dd_soma_geometry\"`).\n", + "- `cell_type_coarse` / `cell_type` — categorical labels. Two options: (a) write as categorical columns inside a `CellFeatureMatrix` (cf. Minnie's CSM coarse types), or (b) treat the V1DD coarse/fine cell-type taxonomy as a `clustering_schema.yaml::ClusterHierarchy` and write `ClusterMembership` rows. Pattern (b) matches `etl_minnie_03_cluster_and_cluster_membership.ipynb`." + ] + }, + { + "cell_type": "markdown", + "id": "60aee4b0", + "metadata": {}, + "source": [ + "## 3. Proofread axon / dendrite lists — `.npy`\n", + "\n", + "Lists of `pt_root_id`s whose axon (resp. dendrite) has been manually proofread. These define the proofread cohort used in the synapse table below." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9a1cf28b", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "axon list : shape (1210,) dtype int64\n", + " n_unique : 1210\n", + " sample : [864691132534275418, 864691132534315610, 864691132535664474, 864691132536286810, 864691132536904794]\n", + "\n", + "dendrite list : shape (63986,) dtype int64\n", + " n_unique : 63986\n", + " sample : [864691132496108732, 864691132511800666, 864691132525163794, 864691132533275738, 864691132533347418]\n", + "\n", + "overlap axon ∩ dendrite : 1148\n" + ] + } + ], + "source": [ + "axon_ids = np.load(DATA_ROOT / \"proofread_axon_list_1196.npy\", allow_pickle=True)\n", + "dend_ids = np.load(DATA_ROOT / \"proofread_dendrite_list_1196.npy\", allow_pickle=True)\n", + "\n", + "print(\"axon list : shape\", axon_ids.shape, \"dtype\", axon_ids.dtype)\n", + "print(\" n_unique :\", len(set(axon_ids.tolist())))\n", + "print(\" sample :\", axon_ids[:5].tolist())\n", + "print()\n", + "print(\"dendrite list : shape\", dend_ids.shape, \"dtype\", dend_ids.dtype)\n", + "print(\" n_unique :\", len(set(dend_ids.tolist())))\n", + "print(\" sample :\", dend_ids[:5].tolist())\n", + "print()\n", + "print(\"overlap axon 
∩ dendrite :\", len(set(axon_ids.tolist()) & set(dend_ids.tolist())))" + ] + }, + { + "cell_type": "markdown", + "id": "284b3da6", + "metadata": {}, + "source": [ + "**Schema mapping:** These are cohort definitions, not features. Best modelled as two extra `core_schema.yaml::DataSet` rows (e.g. `v1dd_1196_proofread_axons`, `v1dd_1196_proofread_dendrites`) with their own `DataItemDataSetAssociation` rows pointing at the existing soma `DataItem` ids. Same pattern as the Minnie cohort DataSets noted in `etl_minnie_01`'s summary cell. Note: ids here are `pt_root_id` (int64); the soma `DataItem`s above are keyed by the soma `id` column — a `pt_root_id → soma_id` join is required before writing the associations." + ] + }, + { + "cell_type": "markdown", + "id": "783658a6", + "metadata": {}, + "source": [ + "## 4. Functional coregistration — `coregistration_1196.feather`\n", + "\n", + "Maps EM `pt_root_id`s to functional 2P ROIs (volume / column / plane / roi tuple)." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "55cc1599", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape : (571, 5)\n", + "cols : ['pt_root_id', 'column', 'volume', 'plane', 'roi']\n", + "n_unique pt_root_id : 553\n", + "n_unique (volume,column,plane,roi): 565\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
pt_root_idcolumnvolumeplaneroi
0864691132830842994130143
186469113274146645713240
286469113277089372913398
\n", + "
" + ], + "text/plain": [ + " pt_root_id column volume plane roi\n", + "0 864691132830842994 1 3 0 143\n", + "1 864691132741466457 1 3 2 40\n", + "2 864691132770893729 1 3 3 98" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "coreg_df = pd.read_feather(DATA_ROOT / \"coregistration_1196.feather\")\n", + "print(\"shape :\", coreg_df.shape)\n", + "print(\"cols :\", list(coreg_df.columns))\n", + "print(\"n_unique pt_root_id :\", coreg_df['pt_root_id'].nunique())\n", + "print(\"n_unique (volume,column,plane,roi):\", coreg_df.drop_duplicates(['volume','column','plane','roi']).shape[0])\n", + "coreg_df.head(3)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "af630fa8-8303-4315-bfa1-6ccfe657f956", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([ 143, 40, 98, 100, 60, 105, 12, 232, 109, 409, 269,\n", + " 230, 29, 443, 170, 145, 226, 402, 99, 144, 120, 38,\n", + " 206, 117, 21, 30, 4, 341, 22, 25, 361, 14, 19,\n", + " 0, 240, 166, 444, 159, 189, 346, 75, 360, 548, 212,\n", + " 207, 215, 89, 187, 31, 45, 6, 271, 129, 139, 158,\n", + " 237, 245, 112, 5, 177, 367, 197, 193, 318, 150, 163,\n", + " 49, 93, 368, 254, 203, 247, 33, 15, 62, 69, 90,\n", + " 119, 154, 169, 195, 192, 184, 140, 116, 222, 77, 228,\n", + " 191, 121, 94, 141, 10, 36, 52, 32, 3, 67, 108,\n", + " 70, 73, 74, 78, 92, 97, 113, 58, 125, 134, 152,\n", + " 61, 72, 17, 84, 46, 26, 39, 41, 44, 671, 48,\n", + " 43, 227, 457, 127, 122, 229, 176, 107, 87, 231, 380,\n", + " 148, 255, 379, 258, 552, 295, 251, 261, 623, 481, 34,\n", + " 316, 432, 257, 223, 211, 137, 173, 440, 294, 395, 185,\n", + " 37, 162, 183, 253, 27, 867, 16, 80, 500, 221, 155,\n", + " 160, 194, 135, 149, 161, 164, 168, 200, 115, 201, 263,\n", + " 370, 132, 114, 13, 42, 47, 59, 65, 83, 101, 9,\n", + " 128, 103, 104, 81, 133, 204, 55, 35, 50, 64, 66,\n", + " 76, 202, 82, 88, 95, 7, 18, 198, 213, 250, 289,\n", + " 23, 282, 287, 56, 20, 24, 
28, 171, 147, 11, 8,\n", + " 172, 85, 872, 68, 509, 256, 96, 106, 118, 389, 153,\n", + " 259, 281, 243, 401, 157, 265, 317, 1, 339, 421, 462,\n", + " 479, 499, 511, 2, 460, 165, 180, 196, 314, 71, 130,\n", + " 57, 63, 86, 91, 411, 420, 306, 293, 218, 217, 182,\n", + " 267, 233, 581, 284, 142, 280, 278, 635, 388, 657, 333,\n", + " 556, 234, 365, 167, 220, 151, 291, 416, 584, 79, 326,\n", + " 475, 355, 319, 1113, 236, 383, 393])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "coreg_df.roi.unique()" + ] + }, + { + "cell_type": "markdown", + "id": "831e6318", + "metadata": {}, + "source": [ + "**Schema mapping:** A cross-modal cell-to-cell link table. Two reasonable options:\n", + "\n", + "- `mappings_schema.yaml` — if a `CellToCellMapping` (or similar cross-cell mapping) class exists, this is the natural home (EM cell ↔ functional cell).\n", + "- Otherwise, register the coregistered functional cells as `DataItem`s in a `v1dd_coregistered_functional_cells` `DataSet` (id = the 4-tuple stringified), then write association rows. The mapping itself (EM ↔ functional) can be a `CellCellConnectivityLong` row with a relation tag like `coregistration` — but that is a stretch and a dedicated mapping class is preferred. Schema-fit decision deferred to notebook `_03`." + ] + }, + { + "cell_type": "markdown", + "id": "1f3ecd8e", + "metadata": {}, + "source": [ + "## 5. Functional SNR — `snr_by_cell.feather`\n", + "\n", + "One SNR scalar per functional ROI (keyed by the same `volume / column / plane / roi` tuple as the coregistration table)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "6a252201", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-16T18:43:21.648866Z", + "iopub.status.busy": "2026-06-16T18:43:21.648688Z", + "iopub.status.idle": "2026-06-16T18:43:21.660835Z", + "shell.execute_reply": "2026-06-16T18:43:21.660207Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape : (4458, 5)\n", + "cols : ['column', 'volume', 'plane', 'roi', 'snr']\n", + "n_unique cells: 4458\n", + "snr describe :\n", + " count 4458.000000\n", + "mean 4.196671\n", + "std 4.135927\n", + "min 0.953515\n", + "25% 2.021927\n", + "50% 3.285306\n", + "75% 4.877459\n", + "max 93.560258\n", + "Name: snr, dtype: float64\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
columnvolumeplaneroisnr
013002.974124
113012.304902
213021.442091
\n", + "
" + ], + "text/plain": [ + " column volume plane roi snr\n", + "0 1 3 0 0 2.974124\n", + "1 1 3 0 1 2.304902\n", + "2 1 3 0 2 1.442091" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "snr_df = pd.read_feather(DATA_ROOT / \"snr_by_cell.feather\")\n", + "print(\"shape :\", snr_df.shape)\n", + "print(\"cols :\", list(snr_df.columns))\n", + "print(\"n_unique cells:\", snr_df.drop_duplicates(['volume','column','plane','roi']).shape[0])\n", + "print(\"snr describe :\\n\", snr_df['snr'].describe())\n", + "snr_df.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "44dab7c2", + "metadata": {}, + "source": [ + "**Schema mapping:** `cell_features_schema.yaml::CellFeatureMatrix` with one `CellFeatureDefinition` (`snr`, dtype `\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
idpre_pt_position_xpre_pt_position_ypre_pt_position_zpost_pt_position_xpost_pt_position_ypost_pt_position_zctr_pt_position_xctr_pt_position_yctr_pt_position_zsizepre_pt_root_idpost_pt_root_id
0354386968758200.5802316.1304380.0757861.0802558.6304650.0757967.7802597.4304380.0240864691132536286810864691132734919083
1378070488792063.2514342.5183735.0792664.6514284.3183915.0792412.4514294.0183735.03056864691132572190492864691132606767301
2499493001977071.3390075.8191340.0976974.3390104.9190935.0976838.5390337.7190935.01346864691132573738810864691132747578447
\n", + "" + ], + "text/plain": [ + " id pre_pt_position_x pre_pt_position_y pre_pt_position_z \\\n", + "0 354386968 758200.5 802316.1 304380.0 \n", + "1 378070488 792063.2 514342.5 183735.0 \n", + "2 499493001 977071.3 390075.8 191340.0 \n", + "\n", + " post_pt_position_x post_pt_position_y post_pt_position_z \\\n", + "0 757861.0 802558.6 304650.0 \n", + "1 792664.6 514284.3 183915.0 \n", + "2 976974.3 390104.9 190935.0 \n", + "\n", + " ctr_pt_position_x ctr_pt_position_y ctr_pt_position_z size \\\n", + "0 757967.7 802597.4 304380.0 240 \n", + "1 792412.4 514294.0 183735.0 3056 \n", + "2 976838.5 390337.7 190935.0 1346 \n", + "\n", + " pre_pt_root_id post_pt_root_id \n", + "0 864691132536286810 864691132734919083 \n", + "1 864691132572190492 864691132606767301 \n", + "2 864691132573738810 864691132747578447 " + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "syn_df = pd.read_feather(DATA_ROOT / \"syn_df_all_to_proofread_to_all_1196.feather\")\n", + "syn_label_df = pd.read_feather(DATA_ROOT / \"syn_label_df_all_to_proofread_to_all_1196.feather\")\n", + "\n", + "print(\"syn_df shape :\", syn_df.shape)\n", + "print(\"syn_df cols :\", list(syn_df.columns))\n", + "print(\"n_unique pre_pt_root :\", syn_df['pre_pt_root_id'].nunique())\n", + "print(\"n_unique post_pt_root :\", syn_df['post_pt_root_id'].nunique())\n", + "print(\"size describe :\\n\", syn_df['size'].describe())\n", + "print()\n", + "print(\"syn_label_df shape :\", syn_label_df.shape)\n", + "print(\"syn_label_df cols :\", list(syn_label_df.columns))\n", + "print(\"syn_label_df index :\", syn_label_df.index.name)\n", + "print(\"tag counts :\\n\", syn_label_df['tag'].value_counts(dropna=False).head())\n", + "syn_df.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "a848e37a", + "metadata": {}, + "source": [ + "**Schema mapping:**\n", + "\n", + "- `cell_cell_schema.yaml::CellCellConnectivityLong` — aggregate synapses per (pre, post) pair into 
synapse-count and total-size weights, written to a dedicated subdirectory per §5g of the prompt guide (e.g. `cellcellconnectivitylong_all_to_proofread_to_all/`). Mirrors `etl_minnie_04_cell_cell.ipynb`.\n", + "- Raw per-synapse rows (8.2M) do **not** fit any current schema — there is no per-synapse class in the common schemas. They would either stay as a parquet sidecar or be summarized away. The label table (spine vs other) is per-synapse and would be summarized in the same aggregation (e.g. as `n_spine_synapses` weight or a separate connectivity matrix)." + ] + }, + { + "cell_type": "markdown", + "id": "1e0ee419", + "metadata": {}, + "source": [ + "## 7. Functional cell–cell correlations\n", + "\n", + "`cell_cell_correlations_by_stimulus.feather` — all functional ROI pairs, one Pearson correlation per stimulus condition.\n", + "\n", + "`cell_cell_correlations_by_stimulus_coregistered.feather` — same, but restricted to coregistered EM cells and keyed by `pt_root_id` rather than ROI tuple." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "23774f12", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-16T18:43:34.090024Z", + "iopub.status.busy": "2026-06-16T18:43:34.089746Z", + "iopub.status.idle": "2026-06-16T18:43:43.292708Z", + "shell.execute_reply": "2026-06-16T18:43:43.292033Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "corr_df shape : (8846260, 13)\n", + "corr_df cols : ['pre_roi', 'post_roi', 'pre_plane', 'post_plane', 'column', 'volume', 'drifting_gratings_full', 'drifting_gratings_windowed', 'locally_sparse_noise', 'natural_images', 'natural_images_12', 'natural_movie', 'spontaneous']\n", + "stimulus columns : ['drifting_gratings_full', 'drifting_gratings_windowed', 'locally_sparse_noise', 'natural_images', 'natural_images_12', 'natural_movie', 'spontaneous']\n", + "\n", + "corr_co_df shape : (148728, 9)\n", + "corr_co_df cols : ['pre_pt_root_id', 'post_pt_root_id', 'drifting_gratings_full', 'drifting_gratings_windowed', 'locally_sparse_noise', 'natural_images', 'natural_images_12', 'natural_movie', 'spontaneous']\n", + "corr_co_df n_unique pre : 551\n", + "corr_co_df n_unique post: 551\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
pre_pt_root_idpost_pt_root_iddrifting_gratings_fulldrifting_gratings_windowedlocally_sparse_noisenatural_imagesnatural_images_12natural_moviespontaneous
08646911326318723548646911329937477010.0501420.0278190.1544720.1817240.1589090.0049570.078471
18646911326318723548646911327864477560.0552670.0182840.1194180.1155870.1240210.0102960.197349
28646911326318723548646911326179615370.0654440.0623670.1436600.0650780.0732970.0531970.108968
\n", + "
" + ], + "text/plain": [ + " pre_pt_root_id post_pt_root_id drifting_gratings_full \\\n", + "0 864691132631872354 864691132993747701 0.050142 \n", + "1 864691132631872354 864691132786447756 0.055267 \n", + "2 864691132631872354 864691132617961537 0.065444 \n", + "\n", + " drifting_gratings_windowed locally_sparse_noise natural_images \\\n", + "0 0.027819 0.154472 0.181724 \n", + "1 0.018284 0.119418 0.115587 \n", + "2 0.062367 0.143660 0.065078 \n", + "\n", + " natural_images_12 natural_movie spontaneous \n", + "0 0.158909 0.004957 0.078471 \n", + "1 0.124021 0.010296 0.197349 \n", + "2 0.073297 0.053197 0.108968 " + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "corr_df = pd.read_feather(DATA_ROOT / \"cell_cell_correlations_by_stimulus.feather\")\n", + "corr_co_df = pd.read_feather(DATA_ROOT / \"cell_cell_correlations_by_stimulus_coregistered.feather\")\n", + "\n", + "stim_cols = ['drifting_gratings_full','drifting_gratings_windowed','locally_sparse_noise',\n", + " 'natural_images','natural_images_12','natural_movie','spontaneous']\n", + "\n", + "print(\"corr_df shape :\", corr_df.shape)\n", + "print(\"corr_df cols :\", list(corr_df.columns))\n", + "print(\"stimulus columns :\", stim_cols)\n", + "print()\n", + "print(\"corr_co_df shape :\", corr_co_df.shape)\n", + "print(\"corr_co_df cols :\", list(corr_co_df.columns))\n", + "print(\"corr_co_df n_unique pre :\", corr_co_df['pre_pt_root_id'].nunique())\n", + "print(\"corr_co_df n_unique post:\", corr_co_df['post_pt_root_id'].nunique())\n", + "corr_co_df.head(3)" + ] + }, + { + "cell_type": "markdown", + "id": "eebce5a3", + "metadata": {}, + "source": [ + "**Schema mapping:** Both tables are cell-pair × scalar-per-stimulus → `cell_cell_schema.yaml::CellCellConnectivityLong`, one folder per stimulus condition (§5g pattern) **per table**:\n", + "\n", + "- `cellcellconnectivitylong_func_corr_/` — keyed by functional-cell ids (from §4 registration). 
8.8M rows × 7 stimuli ≈ 62M rows total; may want to threshold or sample.\n", + "- `cellcellconnectivitylong_func_corr_coreg_/` — keyed by EM `pt_root_id` (i.e. by soma `DataItem` ids). 149k rows per stimulus; small.\n", + "\n", + "The coregistered version is the one with direct anatomical interpretability and should be prioritized." + ] + }, + { + "cell_type": "markdown", + "id": "e1417e82", + "metadata": {}, + "source": [ + "## Summary — proposed notebook split\n", + "\n", + "| File(s) | Schema target | Future notebook |\n", + "|---|---|---|\n", + "| `data_description.json`, `subject.json`, `soma_and_cell_type_1196.feather` | `DataSet` + `DataItem` + `DataItemDataSetAssociation` | `etl_v1dd_01_dataset_dataitem.ipynb` |\n", + "| `proofread_axon_list_1196.npy`, `proofread_dendrite_list_1196.npy` | extra cohort `DataSet`s + associations | `etl_v1dd_01_dataset_dataitem.ipynb` (or `_01b`) |\n", + "| `soma_and_cell_type_1196.feather` (numeric cols), `snr_by_cell.feather` | `CellFeatureMatrix` | `etl_v1dd_02_cell_features.ipynb` |\n", + "| `soma_and_cell_type_1196.feather` (`cell_type` / `cell_type_coarse`) | `ClusterHierarchy` + `ClusterMembership` | `etl_v1dd_03_cluster_and_cluster_membership.ipynb` |\n", + "| `coregistration_1196.feather` | cross-modal mapping (schema TBD; see §4 note) | `etl_v1dd_03_mapping.ipynb` |\n", + "| `syn_df_…_1196.feather` (+ labels) | `CellCellConnectivityLong` (aggregated) | `etl_v1dd_04_cell_cell.ipynb` |\n", + "| `cell_cell_correlations_by_stimulus_coregistered.feather` | `CellCellConnectivityLong` (one folder per stimulus) | `etl_v1dd_04_cell_cell.ipynb` |\n", + "| `cell_cell_correlations_by_stimulus.feather` | same, functional-cell-keyed | `etl_v1dd_04_cell_cell.ipynb` (defer if functional cells aren't registered) |\n", + "\n", + "**Open questions for the schema owner before writing _01:**\n", + "1. 
Is there a canonical cross-modal cell-link class for the EM↔functional coregistration table, or should it ride on `CellCellConnectivityLong`?\n", + "2. Should non-coregistered functional cells (the 4458 in `snr_by_cell` minus the 571 coregistered) be registered as `DataItem`s? If yes, with what id scheme — the `(volume, column, plane, roi)` 4-tuple stringified?\n", + "3. Does the V1DD cell-type taxonomy already exist as a `ClusterHierarchy` somewhere (shared with MICrONS CSM), or does this dataset own it?" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb b/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb index 0b9ae4b..3a6e565 100644 --- a/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb +++ b/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — VISp Excitatory Patch-seq: DataSet & DataItem\n", + "# ETL \u2014 VISp Excitatory Patch-seq: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_exc_patchseq\"`, `project_id = \"visp_patchseq\"`), one `DataItem` per cell from `inferred_met_types.csv`, and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." 
] @@ -200,7 +200,7 @@ " modality=Modality.MORPHOLOGY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([dataset])\n", + "result = write_models([dataset], output_root=OUTPUT_ROOT)\n", "print(f\"DataSet written: {result.rows_written} rows\")" ] }, @@ -222,14 +222,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌───────────────────┬─────────────────┬───────────────────────────────┬────────────┬───────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════════════╪═════════════════╪═══════════════════════════════╪════════════╪═══════════════╡\n", - "│ visp_exc_patchseq ┆ VISp excitatory ┆ doi.org/10.1101/2023.11.25.56 ┆ MORPHOLOGY ┆ visp_patchseq │\n", - "│ ┆ Patch-seq data… ┆ 8… ┆ ┆ │\n", - "└───────────────────┴─────────────────┴───────────────────────────────┴────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_exc_patchseq \u2506 VISp excitatory \u2506 doi.org/10.1101/2023.11.25.56 \u2506 MORPHOLOGY \u2506 visp_patchseq \u2502\n", + "\u2502 \u2506 Patch-seq data\u2026 \u2506 8\u2026 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -280,7 +280,7 @@ " DataItem(id=cid, name=cid, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "n_appended = write_models(dataitems).rows_written\n", + "n_appended = write_models(dataitems, output_root=OUTPUT_ROOT).rows_written\n", "print(f\"DataItem rows appended: {n_appended} (total in batch: {len(cell_ids)})\")" ] }, @@ -302,17 +302,17 @@ "text": [ "(1528, 4)\n", "shape: (5, 4)\n", - "┌───────────┬───────────┬───────────────────┬───────────────┐\n", - "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- │\n", - 
"│ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════════════╪═══════════════╡\n", - "│ 908902400 ┆ 908902400 ┆ null ┆ visp_patchseq │\n", - "│ 965091329 ┆ 965091329 ┆ null ┆ visp_patchseq │\n", - "│ 978149378 ┆ 978149378 ┆ null ┆ visp_patchseq │\n", - "│ 834891776 ┆ 834891776 ┆ null ┆ visp_patchseq │\n", - "│ 897003522 ┆ 897003522 ┆ null ┆ visp_patchseq │\n", - "└───────────┴───────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 908902400 \u2506 908902400 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 965091329 \u2506 965091329 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 978149378 \u2506 978149378 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 834891776 \u2506 834891776 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 897003522 \u2506 897003522 \u2506 null \u2506 visp_patchseq \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -365,7 +365,7 @@ " )\n", " for cid in cell_ids\n", "]\n", - "result = write_models(associations)\n", + "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, @@ -387,17 +387,17 @@ "text": [ "(1528, 3)\n", "shape: (5, 3)\n", - "┌─────────────┬───────────────────┬───────────────┐\n", - "│ dataitem_id ┆ dataset_id ┆ project_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═════════════╪═══════════════════╪═══════════════╡\n", - "│ 908902400 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", - "│ 965091329 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", - "│ 978149378 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", - "│ 834891776 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", - "│ 897003522 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", - "└─────────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 908902400 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 965091329 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 978149378 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 834891776 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 897003522 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -429,8 +429,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 1 528 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `t_type` — transcriptomic type label; written in a later notebook as `CellToClusterMapping`.\n", - "- `met_type`, `inferred_met_type` — MET-type labels; written in a later notebook as `ClusterMembership`." + "- `t_type` \u2014 transcriptomic type label; written in a later notebook as `CellToClusterMapping`.\n", + "- `met_type`, `inferred_met_type` \u2014 MET-type labels; written in a later notebook as `ClusterMembership`." 
] }, { diff --git a/code/etl_visp_exc_patchseq_02_cell_features.ipynb b/code/etl_visp_exc_patchseq_02_cell_features.ipynb index 96a5c25..75d084f 100644 --- a/code/etl_visp_exc_patchseq_02_cell_features.ipynb +++ b/code/etl_visp_exc_patchseq_02_cell_features.ipynb @@ -4,9 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — VISp excitatory Patch-seq: Cell Features\n", + "# ETL \u2014 VISp excitatory Patch-seq: Cell Features\n", "\n", - "Writes 50 `CellFeatureDefinition` rows, one `CellFeatureSet` (`exc_visp_morph_features`), the wide-form morphology feature parquet (389 cells × 50 features), and one `CellFeatureMatrix` pointer. All 389 cells are already registered in `DataItem` by `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb`; no new cell registration is needed. Prerequisite: `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`)." + "Writes 50 `CellFeatureDefinition` rows, one `CellFeatureSet` (`exc_visp_morph_features`), the wide-form morphology feature parquet (389 cells \u00d7 50 features), and one `CellFeatureMatrix` pointer. All 389 cells are already registered in `DataItem` by `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb`; no new cell registration is needed. Prerequisite: `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`)." 
] }, { @@ -116,7 +116,7 @@ " )\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_visp_exc_patchseq_01 must be run first — \"\n", + " f\"etl_visp_exc_patchseq_01 must be run first \u2014 \"\n", " f\"no DataItemDataSetAssociation rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", @@ -288,7 +288,7 @@ " if pd.notna(row[\"range_max\"]):\n", " kwargs[\"range_max\"] = float(row[\"range_max\"])\n", " feature_defs.append(CellFeatureDefinition(**kwargs))\n", - "result = write_models(feature_defs)\n", + "result = write_models(feature_defs, output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureDefinition written: {result.rows_written} rows\")" ] }, @@ -310,23 +310,23 @@ "text": [ "(50, 8)\n", "shape: (3, 8)\n", - "┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐\n", - "│ id ┆ descriptio ┆ unit ┆ data_type ┆ range_min ┆ range_max ┆ project_i ┆ feature_s │\n", - "│ --- ┆ n ┆ --- ┆ --- ┆ --- ┆ --- ┆ d ┆ et_id │\n", - "│ str ┆ --- ┆ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ --- │\n", - "│ ┆ str ┆ ┆ ┆ ┆ ┆ str ┆ str │\n", - "╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡\n", - "│ apical_den ┆ Difference ┆ MICRONS_LE ┆ \n", " \n", "\n", - "

3 rows × 53 columns

\n", + "

3 rows \u00d7 53 columns

\n", "" ], "text/plain": [ @@ -628,7 +628,7 @@ "wide_df = pd.read_csv(WIDE_CSV)\n", "print(\"Wide CSV shape:\", wide_df.shape)\n", "\n", - "# Rename id column; convert int64 → str to match DataItem ids (values unchanged).\n", + "# Rename id column; convert int64 \u2192 str to match DataItem ids (values unchanged).\n", "wide_df = wide_df.rename(columns={\"specimen_id\": \"id\"})\n", "wide_df[\"id\"] = wide_df[\"id\"].astype(str)\n", "wide_df[\"project_id\"] = PROJECT_ID\n", @@ -693,24 +693,24 @@ "text": [ "(389, 53)\n", "shape: (3, 53)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ apical_de ┆ apical_de ┆ apical_de ┆ … ┆ basal_den ┆ soma_alig ┆ project_i ┆ feature_ │\n", - "│ --- ┆ ndrite_bi ┆ ndrite_bi ┆ ndrite_de ┆ ┆ drite_tot ┆ ned_dist_ ┆ d ┆ set_id │\n", - "│ str ┆ as_x ┆ as_y ┆ pth_pc_0 ┆ ┆ al_surfac ┆ from_pia ┆ --- ┆ --- │\n", - "│ ┆ --- ┆ --- ┆ --- ┆ ┆ e_a… ┆ --- ┆ str ┆ str │\n", - "│ ┆ f32 ┆ f32 ┆ f32 ┆ ┆ --- ┆ f32 ┆ ┆ │\n", - "│ ┆ ┆ ┆ ┆ ┆ f32 ┆ ┆ ┆ │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ 601628311 ┆ 147.07672 ┆ 388.73733 ┆ -0.134846 ┆ … ┆ 4251.7275 ┆ 543.53991 ┆ visp_patc ┆ exc_visp │\n", - "│ ┆ 1 ┆ 5 ┆ ┆ ┆ 39 ┆ 7 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "│ 603229579 ┆ 117.91847 ┆ 513.29565 ┆ 181.61207 ┆ … ┆ 3349.0791 ┆ 541.66149 ┆ visp_patc ┆ exc_visp │\n", - "│ ┆ 2 ┆ 4 ┆ 6 ┆ ┆ 02 ┆ 9 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "│ 603337985 ┆ 74.871315 ┆ 382.30099 ┆ -77.73823 ┆ … ┆ 2933.8225 ┆ 458.13421 ┆ visp_patc ┆ exc_visp │\n", - "│ ┆ ┆ 5 ┆ 5 ┆ ┆ 1 ┆ 6 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 apical_de \u2506 apical_de \u2506 apical_de \u2506 \u2026 \u2506 basal_den \u2506 soma_alig \u2506 project_i \u2506 feature_ \u2502\n", + "\u2502 --- \u2506 ndrite_bi \u2506 ndrite_bi \u2506 ndrite_de \u2506 \u2506 drite_tot \u2506 ned_dist_ \u2506 d \u2506 set_id \u2502\n", + "\u2502 str \u2506 as_x \u2506 as_y \u2506 pth_pc_0 \u2506 \u2506 al_surfac \u2506 from_pia \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 e_a\u2026 \u2506 --- \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 f32 \u2506 f32 \u2506 f32 \u2506 \u2506 --- \u2506 f32 \u2506 \u2506 \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 f32 \u2506 \u2506 \u2506 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 601628311 \u2506 147.07672 \u2506 388.73733 \u2506 -0.134846 
\u2506 \u2026 \u2506 4251.7275 \u2506 543.53991 \u2506 visp_patc \u2506 exc_visp \u2502\n", + "\u2502 \u2506 1 \u2506 5 \u2506 \u2506 \u2506 39 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2502 603229579 \u2506 117.91847 \u2506 513.29565 \u2506 181.61207 \u2506 \u2026 \u2506 3349.0791 \u2506 541.66149 \u2506 visp_patc \u2506 exc_visp \u2502\n", + "\u2502 \u2506 2 \u2506 4 \u2506 6 \u2506 \u2506 02 \u2506 9 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2502 603337985 \u2506 74.871315 \u2506 382.30099 \u2506 -77.73823 \u2506 \u2026 \u2506 2933.8225 \u2506 458.13421 \u2506 visp_patc \u2506 exc_visp \u2502\n", + "\u2502 \u2506 \u2506 5 \u2506 5 \u2506 \u2506 1 \u2506 6 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -761,7 +761,7 @@ " cell_index_column=\"id\",\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([cfm])\n", + "result = write_models([cfm], output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureMatrix written: {result.rows_written} rows\")" ] }, @@ -783,14 +783,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - 
"┌────────────────────┬────────────────────┬────────────────────┬───────────────────┬───────────────┐\n", - "│ id ┆ feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞════════════════════╪════════════════════╪════════════════════╪═══════════════════╪═══════════════╡\n", - "│ visp_patchseq_exc_ ┆ exc_visp_morph_fea ┆ file:///scratch/em ┆ id ┆ visp_patchseq │\n", - "│ visp_morph_f… ┆ tures ┆ _patchseq_wn… ┆ ┆ │\n", - "└────────────────────┴────────────────────┴────────────────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_patchseq_exc_ \u2506 exc_visp_morph_fea \u2506 file:///scratch/em \u2506 id \u2506 visp_patchseq \u2502\n", + "\u2502 visp_morph_f\u2026 \u2506 tures \u2506 _patchseq_wn\u2026 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -814,7 +814,7 @@ "|---|---|---|\n", "| `cellfeaturedefinition/` | `CellFeatureDefinition` | 50 |\n", "| `cellfeatureset/` | `CellFeatureSet` | 1 (`exc_visp_morph_features`) |\n", - "| `cellfeatures/exc_visp_morph_features/` | wide parquet | 389 cells × 50 features |\n", + "| `cellfeatures/exc_visp_morph_features/` | wide parquet | 389 cells \u00d7 50 features |\n", "| `cellfeaturematrix/` | `CellFeatureMatrix` | 1 |\n", "\n", "All writes use `mode=\"overwrite\"` with a two-level predicate (`project_id AND feature_set_id`) so re-running is idempotent. 
The inh Patch-seq notebook (same `project_id`, `feature_set_id='inh_visp_morph_features'`) and any future WNM notebook (same `feature_set_id`, `project_id='visp_wnm'`) cannot clobber these rows." diff --git a/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb b/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb index 4f97f79..652009b 100644 --- a/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb +++ b/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb @@ -5,13 +5,13 @@ "id": "da0a1046", "metadata": {}, "source": [ - "# ETL — VISp Excitatory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", + "# ETL \u2014 VISp Excitatory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", "\n", "Registers three taxonomy assignments per VISp excitatory Patch-seq cell (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`):\n", "\n", - "1. **T-type** against the Tasic 2018 taxonomy (`hierarchy_id=\"tasic_2018_visp_taxonomy\"`) as `CellToClusterMapping` — these cells are *mapped* into Tasic via the same Patch-seq tree-mapping pipeline used in Gouwens et al. 2020; they were not part of the Tasic dataset. Source column: `t_type` (1528 cells, with the legacy `ET → PT` rename applied).\n", - "2. **Ground-truth MET-type** against the VISp MET-types taxonomy (`hierarchy_id=\"visp_met_types_taxonomy\"`) as `ClusterMembership` — Patch-seq cells *define* the MET-types space, so this is direct membership, not a mapping. Source column: `met_type` (Gouwens 2020 mMET-type assignments, 384 cells).\n", - "3. **Inferred MET-type** against the same VISp MET-types taxonomy as `CellToClusterMapping` — this is an algorithmically predicted label, semantically a *mapping* rather than direct membership. 
Source column: `inferred_met_type`, registered only for the 1053 cells that lack a ground-truth `met_type` (so this set is disjoint from the membership rows above; the inferred column matches ground truth perfectly on the overlap, asserted in-notebook). The producing algorithm is not documented in the source data — `MappingSet.method_name` is a generic placeholder and should be updated when the method is confirmed.\n", + "1. **T-type** against the Tasic 2018 taxonomy (`hierarchy_id=\"tasic_2018_visp_taxonomy\"`) as `CellToClusterMapping` \u2014 these cells are *mapped* into Tasic via the same Patch-seq tree-mapping pipeline used in Gouwens et al. 2020; they were not part of the Tasic dataset. Source column: `t_type` (1528 cells, with the legacy `ET \u2192 PT` rename applied).\n", + "2. **Ground-truth MET-type** against the VISp MET-types taxonomy (`hierarchy_id=\"visp_met_types_taxonomy\"`) as `ClusterMembership` \u2014 Patch-seq cells *define* the MET-types space, so this is direct membership, not a mapping. Source column: `met_type` (Gouwens 2020 mMET-type assignments, 384 cells).\n", + "3. **Inferred MET-type** against the same VISp MET-types taxonomy as `CellToClusterMapping` \u2014 this is an algorithmically predicted label, semantically a *mapping* rather than direct membership. Source column: `inferred_met_type`, registered only for the 1053 cells that lack a ground-truth `met_type` (so this set is disjoint from the membership rows above; the inferred column matches ground truth perfectly on the overlap, asserted in-notebook). The producing algorithm is not documented in the source data \u2014 `MappingSet.method_name` is a generic placeholder and should be updated when the method is confirmed.\n", "\n", "Per-cell rows are emitted at the leaf level **and at every ancestor level** so that level-agnostic queries against `clustermembership/` / `celltoclustermapping/` work without a hierarchy join. 
`probability` (when available) is recorded on the leaf row only and left null on ancestors, matching the reference notebook convention.\n", "\n", @@ -135,7 +135,7 @@ " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_visp_exc_patchseq_01 must run first — no association rows for dataset_id='{DATASET_ID}'\"\n", + " f\"etl_visp_exc_patchseq_01 must run first \u2014 no association rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", "print(f\"Registered DataItems for {DATASET_ID}: {len(registered_ids)}\")\n", @@ -144,8 +144,8 @@ "cluster_df = pl.read_delta(OUTPUT_ROOT + \"cluster/\")\n", "ttype_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == TTYPE_HIERARCHY_ID)\n", "met_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", - "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first — no clusters for {TTYPE_HIERARCHY_ID}\"\n", - "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", + "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first \u2014 no clusters for {TTYPE_HIERARCHY_ID}\"\n", + "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", "\n", "ttype_parent = dict(zip(ttype_clu[\"id\"].to_list(), ttype_clu[\"parent\"].to_list()))\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", @@ -291,9 +291,9 @@ "id": "2340b987", "metadata": {}, "source": [ - "## T-type → `CellToClusterMapping` against Tasic 2018\n", + "## T-type \u2192 `CellToClusterMapping` against Tasic 2018\n", "\n", - "Apply the legacy `ET → PT` rename so that t-type labels match Tasic cluster ids (Tasic predates the ET nomenclature). Validate every translated label exists as a Tasic cluster id; raise on unknowns. 
Emit one `CellToClusterMapping` per (cell, ancestor) pair against `target_hierarchy=tasic_2018_visp_taxonomy`." + "Apply the legacy `ET \u2192 PT` rename so that t-type labels match Tasic cluster ids (Tasic predates the ET nomenclature). Validate every translated label exists as a Tasic cluster id; raise on unknowns. Emit one `CellToClusterMapping` per (cell, ancestor) pair against `target_hierarchy=tasic_2018_visp_taxonomy`." ] }, { @@ -349,7 +349,7 @@ } ], "source": [ - "# MappingSet — one row describing the t-type assignment method.\n", + "# MappingSet \u2014 one row describing the t-type assignment method.\n", "ttype_mapping_set = MappingSet(\n", " id=MAPPING_SET_ID,\n", " name=\"VISp excitatory Patch-seq T-type assignments\",\n", @@ -364,7 +364,7 @@ " target_hierarchy=TTYPE_HIERARCHY_ID,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([ttype_mapping_set])\n", + "result = write_models([ttype_mapping_set], output_root=OUTPUT_ROOT)\n", "print(f\"MappingSet written: {result.rows_written} rows\")\n" ] }, @@ -387,17 +387,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", - "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", - "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ visp_exc_ ┆ VISp exci ┆ Tree-mapp ┆ Patch-seq ┆ … ┆ null ┆ tasic_201 ┆ null ┆ visp_pat │\n", - "│ patchseq_ ┆ tatory ┆ ing of ┆ tree-mapp ┆ ┆ ┆ 8_visp_ta ┆ ┆ chseq │\n", - "│ ttype_map ┆ Patch-seq ┆ VISp exci ┆ ing ┆ ┆ ┆ xonomy ┆ ┆ │\n", - "│ pin… ┆ T-ty… ┆ tator… ┆ (Gouwen… ┆ ┆ ┆ ┆ ┆ │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", + "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_exc_ \u2506 VISp exci \u2506 Tree-mapp \u2506 Patch-seq \u2506 \u2026 \u2506 null \u2506 tasic_201 \u2506 null \u2506 visp_pat \u2502\n", + "\u2502 patchseq_ \u2506 tatory \u2506 ing of \u2506 tree-mapp \u2506 \u2506 \u2506 8_visp_ta \u2506 \u2506 chseq \u2502\n", + "\u2502 ttype_map \u2506 
Patch-seq \u2506 VISp exci \u2506 ing \u2506 \u2506 \u2506 xonomy \u2506 \u2506 \u2502\n", + "\u2502 pin\u2026 \u2506 T-ty\u2026 \u2506 tator\u2026 \u2506 (Gouwen\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -448,7 +448,7 @@ "ttype_mappings: list[CellToClusterMapping] = []\n", "for cell_id, leaf in zip(df.index, translated):\n", " if not isinstance(leaf, str):\n", - " continue # no t_type — skip (current data has none, but be defensive)\n", + " continue # no t_type \u2014 skip (current data has none, but be defensive)\n", " for cid, is_leaf in walk_ancestors(leaf, ttype_parent):\n", " ttype_mappings.append(CellToClusterMapping(\n", " id=f\"{cell_id}-{cid}-{PROJECT_ID}-{TTYPE_HIERARCHY_ID}\",\n", @@ -459,7 +459,7 @@ " project_id=PROJECT_ID,\n", " ))\n", "print(f\"CellToClusterMapping rows built: {len(ttype_mappings)}\")\n", - "result = write_models(ttype_mappings)\n", + "result = write_models(ttype_mappings, output_root=OUTPUT_ROOT)\n", "print(f\"CellToClusterMapping written: {result.rows_written} rows\")\n" ] }, @@ -482,22 +482,22 @@ "text": [ "(6112, 8)\n", "shape: (3, 8)\n", - "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", - "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ 
str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", - "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", - "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT VISp ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ 6 CT VISp ┆ tchseq_ttyp ┆ ┆ Ctxn3 Sla ┆ ┆ ┆ ┆ seq │\n", - "│ Ctxn3 Sla… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ 6 CT-visp_p ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", - "│ atchseq-… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 908902400-G ┆ visp_exc_pa ┆ 908902400 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ lutamatergi ┆ tchseq_ttyp ┆ ┆ ic ┆ ┆ ┆ ┆ seq │\n", - "│ c-visp_p… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT VISp \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 6 CT VISp \u2506 tchseq_ttyp \u2506 \u2506 Ctxn3 Sla \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 Ctxn3 Sla\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 6 CT-visp_p \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 atchseq-\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 908902400-G \u2506 visp_exc_pa \u2506 908902400 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 lutamatergi \u2506 tchseq_ttyp \u2506 \u2506 ic \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 c-visp_p\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -525,7 +525,7 @@ "id": "6f6c11c5", "metadata": {}, "source": [ - "## MET-type → `ClusterMembership` against VISp MET-types\n", + "## MET-type \u2192 `ClusterMembership` against VISp MET-types\n", "\n", "Subset to cells with non-null `met_type` (Gouwens 2020 mMET-type ground-truth assignments). Validate every label is a known MET cluster id; raise on unknowns. Emit one `ClusterMembership` per (cell, ancestor) pair with `hierarchy_id=\"visp_met_types_taxonomy\"`. 
Membership (not mapping), because Patch-seq cells *define* this taxonomy.\n", "\n", @@ -603,7 +603,7 @@ "import polars as _pl\n", "other_cm = _pl.DataFrame({\"item\": []})\n", "all_memberships = memberships\n", - "result = write_models(all_memberships)\n", + "result = write_models(all_memberships, output_root=OUTPUT_ROOT)\n", "print(f\"ClusterMembership written: {result.rows_written} rows\")\n" ] }, @@ -626,19 +626,19 @@ "text": [ "(1152, 7)\n", "shape: (3, 7)\n", - "┌────────────┬───────────────┬──────────────┬─────────────┬──────────┬──────────────┬──────────────┐\n", - "│ item ┆ cluster ┆ membership_s ┆ probability ┆ distance ┆ project_id ┆ hierarchy_id │\n", - "│ --- ┆ --- ┆ core ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ f64 ┆ ┆ ┆ ┆ │\n", - "╞════════════╪═══════════════╪══════════════╪═════════════╪══════════╪══════════════╪══════════════╡\n", - "│ 1039273993 ┆ L6b ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", - "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", - "│ 1039273993 ┆ Glutamatergic ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", - "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", - "│ 1039273993 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", - "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", - "└────────────┴───────────────┴──────────────┴─────────────┴──────────┴──────────────┴──────────────┘\n", + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 item 
\u2506 cluster \u2506 membership_s \u2506 probability \u2506 distance \u2506 project_id \u2506 hierarchy_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 core \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 f64 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 1039273993 \u2506 L6b \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", + "\u2502 1039273993 \u2506 Glutamatergic \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", + "\u2502 1039273993 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "Our cells present: 384 / 384\n", "Other-notebook rows preserved: 0\n" ] @@ -679,9 +679,9 @@ "id": "ce675b24", "metadata": {}, "source": [ - "## Inferred MET-type → `CellToClusterMapping`\n", + "## Inferred MET-type \u2192 `CellToClusterMapping`\n", "\n", - "`inferred_met_type` is an algorithmically predicted MET-type label, available for 1437 of 1528 cells. It is *inferred*, not direct measurement, so it belongs as `CellToClusterMapping` against `target_hierarchy=visp_met_types_taxonomy` — distinct from the ground-truth `met_type` membership written above.\n", + "`inferred_met_type` is an algorithmically predicted MET-type label, available for 1437 of 1528 cells. It is *inferred*, not direct measurement, so it belongs as `CellToClusterMapping` against `target_hierarchy=visp_met_types_taxonomy` \u2014 distinct from the ground-truth `met_type` membership written above.\n", "\n", "**Subset rule:** register only the 1053 cells whose `met_type` is null. 
The 384 cells with ground-truth `met_type` are already in `ClusterMembership` (and the inferred column agrees with them perfectly on the overlap, asserted below).\n" ] @@ -766,7 +766,7 @@ " target_hierarchy=METTYPE_HIERARCHY_ID,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([inferred_mapping_set])\n", + "result = write_models([inferred_mapping_set], output_root=OUTPUT_ROOT)\n", "print(f\"MappingSet written: {result.rows_written} rows\")\n" ] }, @@ -789,17 +789,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", - "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", - "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ visp_exc_ ┆ VISp exci ┆ Algorithm ┆ inferred ┆ … ┆ null ┆ visp_met_ ┆ null ┆ visp_pat │\n", - "│ patchseq_ ┆ tatory ┆ ically ┆ MET-type ┆ ┆ ┆ types_tax ┆ ┆ chseq │\n", - "│ inferred_ ┆ Patch-seq ┆ predicted ┆ assignmen ┆ ┆ ┆ onomy ┆ ┆ │\n", - "│ met… ┆ infe… ┆ MET-… ┆ t (… ┆ ┆ ┆ ┆ ┆ │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", + "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_exc_ \u2506 VISp exci \u2506 Algorithm \u2506 inferred \u2506 \u2026 \u2506 null \u2506 visp_met_ \u2506 null \u2506 visp_pat \u2502\n", + "\u2502 patchseq_ \u2506 tatory \u2506 ically \u2506 MET-type \u2506 \u2506 \u2506 types_tax \u2506 \u2506 chseq \u2502\n", + "\u2502 inferred_ \u2506 
Patch-seq \u2506 predicted \u2506 assignmen \u2506 \u2506 \u2506 onomy \u2506 \u2506 \u2502\n", + "\u2502 met\u2026 \u2506 infe\u2026 \u2506 MET-\u2026 \u2506 t (\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -864,7 +864,7 @@ " project_id=PROJECT_ID,\n", " ))\n", "print(f\"CellToClusterMapping (inferred) rows built: {len(inferred_mappings)}\")\n", - "result = write_models(inferred_mappings)\n", + "result = write_models(inferred_mappings, output_root=OUTPUT_ROOT)\n", "print(f\"CellToClusterMapping written: {result.rows_written} rows\")\n" ] }, @@ -887,22 +887,22 @@ "text": [ "(3159, 8)\n", "shape: (3, 8)\n", - "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", - "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", - "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", - "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT-1 ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ 6 CT-1-visp ┆ tchseq_infe ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", - "│ _patchse… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 908902400-G ┆ visp_exc_pa ┆ 908902400 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ 
lutamatergi ┆ tchseq_infe ┆ ┆ ic ┆ ┆ ┆ ┆ seq │\n", - "│ c-visp_p… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 908902400-c ┆ visp_exc_pa ┆ 908902400 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ ell-visp_pa ┆ tchseq_infe ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", - "│ tchseq-v… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT-1 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 6 CT-1-visp \u2506 tchseq_infe \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 _patchse\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 908902400-G \u2506 visp_exc_pa \u2506 908902400 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 lutamatergi \u2506 tchseq_infe \u2506 \u2506 ic \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 c-visp_p\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 908902400-c \u2506 visp_exc_pa \u2506 908902400 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 ell-visp_pa \u2506 tchseq_infe \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 tchseq-v\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -955,9 +955,9 @@ "|---|---|---|\n", "| `mappingset/` (`id={MAPPING_SET_ID}`) | `MappingSet` (T-type tree mapping) | 1 |\n", "| `mappingset/` (`id={MAPPING_SET_INFERRED_ID}`) | `MappingSet` (inferred MET-type, method unspecified) | 1 |\n", - "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell × t-type ancestor), all 1528 cells |\n", - "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_INFERRED_ID}`) | `CellToClusterMapping` (inferred) | one per (cell × MET-type ancestor), 1053 cells without ground-truth `met_type` |\n", - "| `clustermembership/` (`hierarchy_id={METTYPE_HIERARCHY_ID}`) | `ClusterMembership` | one per (cell × MET-type ancestor), 384 cells with ground-truth `met_type` |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell \u00d7 t-type ancestor), all 1528 cells |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_INFERRED_ID}`) | `CellToClusterMapping` (inferred) | one per (cell \u00d7 MET-type ancestor), 1053 cells without ground-truth `met_type` |\n", + "| `clustermembership/` (`hierarchy_id={METTYPE_HIERARCHY_ID}`) | `ClusterMembership` | one per (cell \u00d7 MET-type ancestor), 384 cells with ground-truth `met_type` |\n", "\n", "All three columns of `inferred_met_types.csv` are now registered. 
The 91 cells with neither `met_type` nor `inferred_met_type` are unrepresented in cluster tables (no label to assign).\n" ] diff --git a/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb b/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb index c6065e6..f5589c7 100644 --- a/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb +++ b/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — VISp Inhibitory Patch-seq: DataSet & DataItem\n", + "# ETL \u2014 VISp Inhibitory Patch-seq: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_inh_patchseq\"`, `project_id = \"visp_patchseq\"`), one `DataItem` per cell from `patchseq_tx_cell_ttype_labels.csv`, and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." ] @@ -196,7 +196,7 @@ " modality=Modality.MORPHOLOGY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([dataset])\n", + "result = write_models([dataset], output_root=OUTPUT_ROOT)\n", "print(f\"DataSet written: {result.rows_written} rows\")" ] }, @@ -218,14 +218,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌───────────────────┬─────────────────┬───────────────────────────────┬────────────┬───────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════════════╪═════════════════╪═══════════════════════════════╪════════════╪═══════════════╡\n", - "│ visp_inh_patchseq ┆ VISp inhibitory ┆ doi.org/10.1016/j.cell.2020.0 ┆ MORPHOLOGY ┆ visp_patchseq │\n", - "│ ┆ Patch-seq data… ┆ 9… ┆ ┆ │\n", - "└───────────────────┴─────────────────┴───────────────────────────────┴────────────┴───────────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_inh_patchseq \u2506 VISp inhibitory \u2506 doi.org/10.1016/j.cell.2020.0 \u2506 MORPHOLOGY \u2506 visp_patchseq \u2502\n", + "\u2502 \u2506 Patch-seq data\u2026 \u2506 9\u2026 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -275,7 +275,7 @@ " DataItem(id=cid, name=cid, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "n_appended = write_models(dataitems).rows_written\n", + "n_appended = write_models(dataitems, output_root=OUTPUT_ROOT).rows_written\n", "print(f\"DataItem rows appended: {n_appended} (total in batch: {len(cell_ids)})\")" ] }, @@ -297,17 +297,17 @@ "text": [ "(4287, 4)\n", "shape: (5, 4)\n", - "┌───────────┬───────────┬───────────────────┬───────────────┐\n", - "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════════════╪═══════════════╡\n", - "│ 888001481 ┆ 888001481 ┆ null ┆ visp_patchseq │\n", - "│ 736493069 ┆ 736493069 ┆ null ┆ visp_patchseq │\n", - "│ 830445950 ┆ 830445950 ┆ null ┆ visp_patchseq │\n", - "│ 644941196 ┆ 644941196 ┆ null ┆ visp_patchseq │\n", - "│ 658075752 ┆ 658075752 ┆ null ┆ visp_patchseq │\n", - "└───────────┴───────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + 
"\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 888001481 \u2506 888001481 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 736493069 \u2506 736493069 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 830445950 \u2506 830445950 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 644941196 \u2506 644941196 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2502 658075752 \u2506 658075752 \u2506 null \u2506 visp_patchseq \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -360,7 +360,7 @@ " )\n", " for cid in cell_ids\n", "]\n", - "result = write_models(associations)\n", + "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, @@ -382,17 +382,17 @@ "text": [ "(2759, 3)\n", "shape: (5, 3)\n", - "┌─────────────┬───────────────────┬───────────────┐\n", - "│ dataitem_id ┆ dataset_id ┆ project_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═════════════╪═══════════════════╪═══════════════╡\n", - "│ 888001481 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", - "│ 736493069 ┆ visp_inh_patchseq ┆ 
visp_patchseq │\n", - "│ 830445950 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", - "│ 644941196 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", - "│ 658075752 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", - "└─────────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 888001481 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 736493069 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 830445950 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 644941196 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", + "\u2502 658075752 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -425,7 +425,7 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 2 759 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `ttype` — T-type label; written in a later notebook as `CellToClusterMapping`." 
+ "- `ttype` \u2014 T-type label; written in a later notebook as `CellToClusterMapping`." ] }, { diff --git a/code/etl_visp_inh_patchseq_02_cell_features.ipynb b/code/etl_visp_inh_patchseq_02_cell_features.ipynb index 14b1e27..8c4d444 100644 --- a/code/etl_visp_inh_patchseq_02_cell_features.ipynb +++ b/code/etl_visp_inh_patchseq_02_cell_features.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — VISp inhibitory Patch-seq: Cell Features\n", + "# ETL \u2014 VISp inhibitory Patch-seq: Cell Features\n", "\n", "Writes 46 `CellFeatureDefinition` rows, one `CellFeatureSet` (`inh_visp_morph_features`), the wide-form morphology feature parquet, and one `CellFeatureMatrix` pointer. Also registers any cell ids present in the wide-form CSV but absent from the `DataItem` table (i.e., cells not in the original `_01` source CSV). Prerequisite: `etl_visp_inh_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_inh_patchseq\"`, `dataset_id=\"visp_inh_patchseq\"`)." ] @@ -119,7 +119,7 @@ " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", ")\n", "assert existing_dataitems.shape[0] > 0, (\n", - " f\"etl_visp_inh_patchseq_01 must be run first — no DataItem rows for project_id='{PROJECT_ID}'\"\n", + " f\"etl_visp_inh_patchseq_01 must be run first \u2014 no DataItem rows for project_id='{PROJECT_ID}'\"\n", ")\n", "print(f\"Prerequisite OK: {existing_dataitems.shape[0]} DataItem rows for project_id='{PROJECT_ID}'\")" ] @@ -191,7 +191,7 @@ ], "source": [ "if new_ids:\n", - " n_di = write_models([DataItem(id=cid, name=cid, project_id=PROJECT_ID) for cid in new_ids]).rows_written\n", + " n_di = write_models([DataItem(id=cid, name=cid, project_id=PROJECT_ID) for cid in new_ids], output_root=OUTPUT_ROOT).rows_written\n", " print(f\"DataItems appended: {n_di}\")\n", "\n", " schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", @@ -213,7 +213,7 @@ " )\n", " print(f\"Associations appended: {len(new_ids)}\")\n", "else:\n", - " print(\"No new cells 
to register — all already present.\")" + " print(\"No new cells to register \u2014 all already present.\")" ] }, { @@ -413,7 +413,7 @@ " if pd.notna(row[\"range_max\"]):\n", " kwargs[\"range_max\"] = float(row[\"range_max\"])\n", " feature_defs.append(CellFeatureDefinition(**kwargs))\n", - "result = write_models(feature_defs)\n", + "result = write_models(feature_defs, output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureDefinition written: {result.rows_written} rows\")" ] }, @@ -435,25 +435,25 @@ "text": [ "(46, 8)\n", "shape: (3, 8)\n", - "┌────────────┬────────────┬────────────┬───────────┬───────────┬───────────┬───────────┬───────────┐\n", - "│ id ┆ descriptio ┆ unit ┆ data_type ┆ range_min ┆ range_max ┆ project_i ┆ feature_s │\n", - "│ --- ┆ n ┆ --- ┆ --- ┆ --- ┆ --- ┆ d ┆ et_id │\n", - "│ str ┆ --- ┆ str ┆ str ┆ f64 ┆ f64 ┆ --- ┆ --- │\n", - "│ ┆ str ┆ ┆ ┆ ┆ ┆ str ┆ str │\n", - "╞════════════╪════════════╪════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡\n", - "│ axon_bias_ ┆ Difference ┆ MICRONS_LE ┆ \n", " \n", "\n", - "

3 rows × 49 columns

\n", + "

3 rows \u00d7 49 columns

\n", "" ], "text/plain": [ @@ -745,7 +745,7 @@ "wide_df = pd.read_csv(WIDE_CSV)\n", "print(\"Wide CSV shape:\", wide_df.shape)\n", "\n", - "# Rename id column; convert int64 → str to match DataItem ids (values unchanged).\n", + "# Rename id column; convert int64 \u2192 str to match DataItem ids (values unchanged).\n", "wide_df = wide_df.rename(columns={\"specimen_id\": \"id\"})\n", "wide_df[\"id\"] = wide_df[\"id\"].astype(str)\n", "wide_df[\"project_id\"] = PROJECT_ID\n", @@ -810,24 +810,24 @@ "text": [ "(520, 49)\n", "shape: (3, 49)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ axon_bias ┆ axon_bias ┆ axon_dept ┆ … ┆ basal_den ┆ soma_alig ┆ project_i ┆ feature_ │\n", - "│ --- ┆ _x ┆ _y ┆ h_pc_0 ┆ ┆ drite_tot ┆ ned_dist_ ┆ d ┆ set_id │\n", - "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ al_surfac ┆ from_pia ┆ --- ┆ --- │\n", - "│ ┆ f32 ┆ f32 ┆ f32 ┆ ┆ e_a… ┆ --- ┆ str ┆ str │\n", - "│ ┆ ┆ ┆ ┆ ┆ --- ┆ f32 ┆ ┆ │\n", - "│ ┆ ┆ ┆ ┆ ┆ f32 ┆ ┆ ┆ │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ 601506507 ┆ 180.83319 ┆ -249.8307 ┆ -255.2250 ┆ … ┆ 7207.4599 ┆ 357.15982 ┆ visp_patc ┆ inh_visp │\n", - "│ ┆ 1 ┆ 5 ┆ 98 ┆ ┆ 61 ┆ 1 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "│ 601790961 ┆ 25.481123 ┆ 434.25106 ┆ -216.8098 ┆ … ┆ 11691.149 ┆ 663.10302 ┆ visp_patc ┆ inh_visp │\n", - "│ ┆ ┆ 8 ┆ 91 ┆ ┆ 414 ┆ 7 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "│ 601803754 ┆ 42.650597 ┆ 104.69784 ┆ 1157.1519 ┆ … ┆ 11384.542 ┆ 170.36506 ┆ visp_patc ┆ inh_visp │\n", - "│ ┆ ┆ 5 ┆ 78 ┆ ┆ 969 ┆ 7 ┆ hseq ┆ _morph_f │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 axon_bias \u2506 axon_bias \u2506 axon_dept \u2506 \u2026 \u2506 basal_den \u2506 soma_alig \u2506 project_i \u2506 feature_ \u2502\n", + "\u2502 --- \u2506 _x \u2506 _y \u2506 h_pc_0 \u2506 \u2506 drite_tot \u2506 ned_dist_ \u2506 d \u2506 set_id \u2502\n", + "\u2502 str \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 al_surfac \u2506 from_pia \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 f32 \u2506 f32 \u2506 f32 \u2506 \u2506 e_a\u2026 \u2506 --- \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 --- \u2506 f32 \u2506 \u2506 \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 f32 \u2506 \u2506 \u2506 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 601506507 \u2506 180.83319 \u2506 -249.8307 \u2506 -255.2250 \u2506 \u2026 \u2506 7207.4599 \u2506 
357.15982 \u2506 visp_patc \u2506 inh_visp \u2502\n", + "\u2502 \u2506 1 \u2506 5 \u2506 98 \u2506 \u2506 61 \u2506 1 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2502 601790961 \u2506 25.481123 \u2506 434.25106 \u2506 -216.8098 \u2506 \u2026 \u2506 11691.149 \u2506 663.10302 \u2506 visp_patc \u2506 inh_visp \u2502\n", + "\u2502 \u2506 \u2506 8 \u2506 91 \u2506 \u2506 414 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2502 601803754 \u2506 42.650597 \u2506 104.69784 \u2506 1157.1519 \u2506 \u2026 \u2506 11384.542 \u2506 170.36506 \u2506 visp_patc \u2506 inh_visp \u2502\n", + "\u2502 \u2506 \u2506 5 \u2506 78 \u2506 \u2506 969 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -878,7 +878,7 @@ " cell_index_column=\"id\",\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([cfm])\n", + "result = write_models([cfm], output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureMatrix written: {result.rows_written} rows\")" ] }, @@ -900,14 +900,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌────────────────────┬────────────────────┬────────────────────┬───────────────────┬───────────────┐\n", - "│ id ┆ 
feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞════════════════════╪════════════════════╪════════════════════╪═══════════════════╪═══════════════╡\n", - "│ visp_patchseq_inh_ ┆ inh_visp_morph_fea ┆ file:///scratch/em ┆ id ┆ visp_patchseq │\n", - "│ visp_morph_f… ┆ tures ┆ _patchseq_wn… ┆ ┆ │\n", - "└────────────────────┴────────────────────┴────────────────────┴───────────────────┴───────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_patchseq_inh_ 
\u2506 inh_visp_morph_fea \u2506 file:///scratch/em \u2506 id \u2506 visp_patchseq \u2502\n", + "\u2502 visp_morph_f\u2026 \u2506 tures \u2506 _patchseq_wn\u2026 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -929,14 +929,14 @@ "\n", "| Output path | Class | Rows |\n", "|---|---|---|\n", - "| `dataitem/` | `DataItem` | +new cells from wide CSV (≤ 520 total, 120 new on first run) |\n", + "| `dataitem/` | `DataItem` | +new cells from wide CSV (\u2264 520 total, 120 new on first run) |\n", "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | +new cells from wide CSV |\n", "| `cellfeaturedefinition/` | `CellFeatureDefinition` | 46 |\n", "| `cellfeatureset/` | `CellFeatureSet` | 1 (`inh_visp_morph_features`) |\n", - "| `cellfeatures/inh_visp_morph_features/` | wide parquet | 520 cells × 46 features |\n", + "| `cellfeatures/inh_visp_morph_features/` | wide parquet | 520 cells \u00d7 46 features |\n", "| `cellfeaturematrix/` | `CellFeatureMatrix` | 1 |\n", "\n", - "`dataitem/` and `dataitem_dataset_association/` use `append_new_dataitems` / `mode=\"append\"` scoped to new cells only — re-running is idempotent and never wipes rows from `etl_visp_inh_patchseq_01`. All other writes use `mode=\"overwrite\"` with a scoped predicate." 
+ "`dataitem/` and `dataitem_dataset_association/` use `append_new_dataitems` / `mode=\"append\"` scoped to new cells only \u2014 re-running is idempotent and never wipes rows from `etl_visp_inh_patchseq_01`. All other writes use `mode=\"overwrite\"` with a scoped predicate." ] }, { diff --git a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb index 7bc6f1e..47da515 100644 --- a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb +++ b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb @@ -5,24 +5,24 @@ "id": "d07bcdbd", "metadata": {}, "source": [ - "# ETL — VISp Inhibitory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", + "# ETL \u2014 VISp Inhibitory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", "\n", "For VISp inhibitory Patch-seq cells (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_inh_patchseq\"`),\n", "this notebook registers two assignments per cell:\n", "\n", - "1. **T-type → Tasic 2018 VISp scRNA-seq taxonomy** as `CellToClusterMapping` (the cells were\n", - " not part of Tasic — this is a *mapping*). Source: `patchseq_tx_cell_ttype_labels.csv`,\n", - " column `ttype`, indexed by cell id. **No `ET → PT` translation** (that was an exc-only\n", + "1. **T-type \u2192 Tasic 2018 VISp scRNA-seq taxonomy** as `CellToClusterMapping` (the cells were\n", + " not part of Tasic \u2014 this is a *mapping*). Source: `patchseq_tx_cell_ttype_labels.csv`,\n", + " column `ttype`, indexed by cell id. **No `ET \u2192 PT` translation** (that was an exc-only\n", " convention; inhibitory ttypes don't contain `ET`).\n", - "2. **MET-type → VISp MET-types taxonomy** as `ClusterMembership` (these cells *belong* to\n", - " these MET-types by direct measurement — same cohort that defined the taxonomy). Source:\n", + "2. 
**MET-type \u2192 VISp MET-types taxonomy** as `ClusterMembership` (these cells *belong* to\n", + " these MET-types by direct measurement \u2014 same cohort that defined the taxonomy). Source:\n", " `visp_met_cell_assignments_text_names.csv`, column `met_type`, indexed by cell id.\n", "\n", "Both writes use **parent propagation**: one row per (cell, ancestor) pair walked from the\n", "leaf to the root via `walk_ancestors` (in `connects_common_connectivity.io.write_utils`).\n", "`probability` is left null (no probability column in either source).\n", "\n", - "## Section 0 — register missing inhibitory dataset associations\n", + "## Section 0 \u2014 register missing inhibitory dataset associations\n", "\n", "The MET CSV has 495 cells. All 495 are already in `dataitem/` (registered by earlier\n", "notebooks), but only 392 are associated with `dataset_id=\"visp_inh_patchseq\"`. The other\n", @@ -48,8 +48,8 @@ "|---|---|---|\n", "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | +103 (first run); 0 on re-run |\n", "| `mappingset/` | `MappingSet` | 1 (`visp_inh_patchseq_ttype_mapping`) |\n", - "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells × 4 ancestors = 11036 |\n", - "| `clustermembership/` | `ClusterMembership` | 495 cells × 3 ancestors = 1485 (merged with exc's 1152 → 2637 total under predicate) |\n" + "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells \u00d7 4 ancestors = 11036 |\n", + "| `clustermembership/` | `ClusterMembership` | 495 cells \u00d7 3 ancestors = 1485 (merged with exc's 1152 \u2192 2637 total under predicate) |\n" ] }, { @@ -175,7 +175,7 @@ " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", ")\n", "assert existing_dataitems.shape[0] > 0, (\n", - " f\"earlier notebooks must run first — no DataItem rows for project_id='{PROJECT_ID}'\"\n", + " f\"earlier notebooks must run first \u2014 no DataItem rows for project_id='{PROJECT_ID}'\"\n", ")\n", "registered_ids = 
set(existing_dataitems[\"id\"].to_list())\n", "print(f\"DataItems for project_id='{PROJECT_ID}': {len(registered_ids)}\")\n", @@ -183,8 +183,8 @@ "cluster_df = pl.read_delta(OUTPUT_ROOT + \"cluster/\")\n", "ttype_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == TTYPE_HIERARCHY_ID)\n", "met_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", - "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first — no clusters for {TTYPE_HIERARCHY_ID}\"\n", - "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", + "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first \u2014 no clusters for {TTYPE_HIERARCHY_ID}\"\n", + "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", "\n", "ttype_parent = dict(zip(ttype_clu[\"id\"].to_list(), ttype_clu[\"parent\"].to_list()))\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", @@ -196,7 +196,7 @@ "id": "7d1b7045", "metadata": {}, "source": [ - "## Section 0 — register missing `visp_inh_patchseq` associations\n", + "## Section 0 \u2014 register missing `visp_inh_patchseq` associations\n", "\n", "Cells in `visp_met_cell_assignments_text_names.csv` that exist in `dataitem/` for\n", "`project_id='visp_patchseq'` but lack a `dataset_id='visp_inh_patchseq'` association\n", @@ -351,7 +351,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "No new associations needed — all MET cells already linked to inh dataset.\n", + "No new associations needed \u2014 all MET cells already linked to inh dataset.\n", "Total visp_inh_patchseq associations now: 2879\n" ] } @@ -376,7 +376,7 @@ " )\n", " print(f\"Associations appended: {len(ids_needing_assoc)}\")\n", "else:\n", - " print(\"No new associations needed — all MET cells already linked to inh dataset.\")\n", + " print(\"No new associations needed \u2014 all 
MET cells already linked to inh dataset.\")\n", "\n", "# Verify post-condition: every MET cell now has the inh dataset association.\n", "post_assoc = (\n", @@ -397,7 +397,7 @@ "id": "b6557137", "metadata": {}, "source": [ - "## Section 1 — T-type → `CellToClusterMapping` against Tasic 2018\n" + "## Section 1 \u2014 T-type \u2192 `CellToClusterMapping` against Tasic 2018\n" ] }, { @@ -484,9 +484,9 @@ "tt_df.index = tt_df.index.astype(str)\n", "print(\"T-type CSV shape:\", tt_df.shape)\n", "print(\"ttype non-null:\", tt_df[\"ttype\"].notna().sum())\n", - "# Inhibitory ttypes don't contain \"ET\" — assert and skip the legacy ET→PT translation.\n", + "# Inhibitory ttypes don't contain \"ET\" \u2014 assert and skip the legacy ET\u2192PT translation.\n", "assert tt_df[\"ttype\"].astype(str).str.contains(\"ET\").sum() == 0, (\n", - " \"unexpected 'ET' in inhibitory ttypes — exc-only translation rule should not apply\"\n", + " \"unexpected 'ET' in inhibitory ttypes \u2014 exc-only translation rule should not apply\"\n", ")\n", "tt_df.head(3)\n" ] @@ -558,7 +558,7 @@ } ], "source": [ - "# MappingSet — one row describing the t-type assignment method.\n", + "# MappingSet \u2014 one row describing the t-type assignment method.\n", "ttype_mapping_set = MappingSet(\n", " id=MAPPING_SET_ID,\n", " name=\"VISp inhibitory Patch-seq T-type assignments\",\n", @@ -566,14 +566,14 @@ " \"Tree-mapping of VISp inhibitory Patch-seq cells onto the Tasic 2018 VISp \"\n", " \"scRNA-seq taxonomy, as used in Gouwens et al. 2020. Source labels are read \"\n", " \"from the `ttype` column of patchseq_tx_cell_ttype_labels.csv. No legacy \"\n", - " \"ET→PT rename is applied (inhibitory ttypes do not contain 'ET').\"\n", + " \"ET\u2192PT rename is applied (inhibitory ttypes do not contain 'ET').\"\n", " ),\n", " method_name=\"Patch-seq tree-mapping (Gouwens et al. 
2020)\",\n", " source_dataset=DATASET_ID,\n", " target_hierarchy=TTYPE_HIERARCHY_ID,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([ttype_mapping_set])\n", + "result = write_models([ttype_mapping_set], output_root=OUTPUT_ROOT)\n", "print(f\"MappingSet written: {result.rows_written} rows\")\n" ] }, @@ -596,17 +596,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", - "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", - "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ visp_inh_ ┆ VISp inhi ┆ Tree-mapp ┆ Patch-seq ┆ … ┆ null ┆ tasic_201 ┆ null ┆ visp_pat │\n", - "│ patchseq_ ┆ bitory ┆ ing of ┆ tree-mapp ┆ ┆ ┆ 8_visp_ta ┆ ┆ chseq │\n", - "│ ttype_map ┆ Patch-seq ┆ VISp inhi ┆ ing ┆ ┆ ┆ xonomy ┆ ┆ │\n", - "│ pin… ┆ T-ty… ┆ bitor… ┆ (Gouwen… ┆ ┆ ┆ ┆ ┆ │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 
json_obje \u2506 project_ \u2502\n", + "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_inh_ \u2506 VISp inhi \u2506 Tree-mapp \u2506 Patch-seq \u2506 \u2026 \u2506 null \u2506 tasic_201 \u2506 null \u2506 visp_pat \u2502\n", + "\u2502 patchseq_ \u2506 bitory \u2506 ing of \u2506 tree-mapp \u2506 \u2506 \u2506 8_visp_ta \u2506 \u2506 chseq \u2502\n", + "\u2502 ttype_map \u2506 Patch-seq \u2506 VISp inhi \u2506 ing \u2506 \u2506 \u2506 xonomy \u2506 \u2506 \u2502\n", + "\u2502 pin\u2026 \u2506 T-ty\u2026 \u2506 bitor\u2026 \u2506 (Gouwen\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -657,7 +657,7 @@ "ttype_mappings: list[CellToClusterMapping] = []\n", "for cell_id, leaf in zip(tt_df.index, tt_df[\"ttype\"]):\n", " if not isinstance(leaf, str):\n", - " continue # defensive — current data has no NaN ttypes\n", + " continue # defensive \u2014 current data has no NaN ttypes\n", " for cid, is_leaf in walk_ancestors(leaf, ttype_parent):\n", " ttype_mappings.append(CellToClusterMapping(\n", " id=f\"{cell_id}-{cid}-{PROJECT_ID}-{TTYPE_HIERARCHY_ID}\",\n", @@ -668,7 +668,7 @@ " project_id=PROJECT_ID,\n", " ))\n", "print(f\"CellToClusterMapping rows built: {len(ttype_mappings)}\")\n", - "result = write_models(ttype_mappings)\n", + "result = write_models(ttype_mappings, output_root=OUTPUT_ROOT)\n", "print(f\"CellToClusterMapping written: {result.rows_written} rows\")\n" ] }, @@ -691,23 +691,23 @@ "text": [ "(11036, 8)\n", "shape: (3, 8)\n", - "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", - "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", - "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", - "│ 888001481-L ┆ visp_inh_pa ┆ 888001481 ┆ Lamp5 ┆ null ┆ 
null ┆ null ┆ visp_patch │\n", - "│ amp5 ┆ tchseq_ttyp ┆ ┆ Fam19a1 ┆ ┆ ┆ ┆ seq │\n", - "│ Fam19a1 ┆ e_mappin… ┆ ┆ Tmem182 ┆ ┆ ┆ ┆ │\n", - "│ Tmem18… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 888001481-L ┆ visp_inh_pa ┆ 888001481 ┆ Lamp5 ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ amp5-visp_p ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", - "│ atchseq-… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ 888001481-G ┆ visp_inh_pa ┆ 888001481 ┆ GABAergic ┆ null ┆ null ┆ null ┆ visp_patch │\n", - "│ ABAergic-vi ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", - "│ sp_patch… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 888001481-L \u2506 visp_inh_pa \u2506 888001481 \u2506 Lamp5 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 amp5 \u2506 tchseq_ttyp \u2506 \u2506 Fam19a1 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 Fam19a1 \u2506 e_mappin\u2026 \u2506 \u2506 Tmem182 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 Tmem18\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 888001481-L \u2506 visp_inh_pa \u2506 888001481 \u2506 Lamp5 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 amp5-visp_p \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 atchseq-\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 888001481-G \u2506 visp_inh_pa \u2506 888001481 \u2506 GABAergic \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", + "\u2502 ABAergic-vi \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", + "\u2502 sp_patch\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -733,13 +733,13 @@ "id": "7aea9cb7", "metadata": {}, "source": [ - "## Section 2 — MET-type → `ClusterMembership` against VISp MET-types\n", + "## Section 2 \u2014 MET-type \u2192 `ClusterMembership` against VISp MET-types\n", "\n", "Uses **merge-then-overwrite**: read the existing rows under the\n", "`(project_id, hierarchy_id)` predicate, drop rows whose `item` is one of our 495\n", "cells, union with the new rows, and overwrite. This preserves whatever else is\n", "written under the same predicate (e.g. 
`etl_visp_exc_patchseq_03`'s 1152 rows for\n", - "the exc cells — disjoint cell ids, but same `project_id`/`hierarchy_id` partition).\n" + "the exc cells \u2014 disjoint cell ids, but same `project_id`/`hierarchy_id` partition).\n" ] }, { @@ -800,7 +800,7 @@ } ], "source": [ - "# Build new ClusterMembership rows (one per cell × ancestor).\n", + "# Build new ClusterMembership rows (one per cell \u00d7 ancestor).\n", "new_memberships: list[ClusterMembership] = []\n", "for cell_id, leaf in zip(met_clean[\"specimen_id\"], met_clean[\"met_type\"]):\n", " for cid, is_leaf in walk_ancestors(leaf, met_parent):\n", @@ -866,7 +866,7 @@ } ], "source": [ - "result = write_models(all_memberships)\n", + "result = write_models(all_memberships, output_root=OUTPUT_ROOT)\n", "print(f\"ClusterMembership written: {result.rows_written} rows\")\n" ] }, @@ -889,19 +889,19 @@ "text": [ "(1485, 7)\n", "shape: (3, 7)\n", - "┌───────────┬───────────┬────────────────┬─────────────┬──────────┬───────────────┬────────────────┐\n", - "│ item ┆ cluster ┆ membership_sco ┆ probability ┆ distance ┆ project_id ┆ hierarchy_id │\n", - "│ --- ┆ --- ┆ re ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ f64 ┆ ┆ ┆ ┆ │\n", - "╞═══════════╪═══════════╪════════════════╪═════════════╪══════════╪═══════════════╪════════════════╡\n", - "│ 601506507 ┆ Vip-MET-2 ┆ null ┆ null ┆ null ┆ visp_patchseq ┆ visp_met_types │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ _taxonomy │\n", - "│ 601506507 ┆ GABAergic ┆ null ┆ null ┆ null ┆ visp_patchseq ┆ visp_met_types │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ _taxonomy │\n", - "│ 601506507 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patchseq ┆ visp_met_types │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ _taxonomy │\n", - "└───────────┴───────────┴────────────────┴─────────────┴──────────┴───────────────┴────────────────┘\n", + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 item \u2506 cluster \u2506 membership_sco \u2506 probability \u2506 distance \u2506 project_id \u2506 hierarchy_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 re \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 f64 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 601506507 \u2506 Vip-MET-2 \u2506 null \u2506 null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", + "\u2502 601506507 \u2506 GABAergic \u2506 null \u2506 null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", + 
"\u2502 601506507 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "Our cells present: 495 / 495\n", "Other-notebook rows preserved: 0\n" ] @@ -950,8 +950,8 @@ "|---|---|---|\n", "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | up to 103 (first run) |\n", "| `mappingset/` | `MappingSet` | 1 (`visp_inh_patchseq_ttype_mapping`) |\n", - "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells × 4 levels = 11036 |\n", - "| `clustermembership/` | `ClusterMembership` | 495 cells × 3 levels = 1485 (merged with prior rows under same predicate) |\n", + "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells \u00d7 4 levels = 11036 |\n", + "| `clustermembership/` | `ClusterMembership` | 495 cells \u00d7 3 levels = 1485 (merged with prior rows under same predicate) |\n", "\n", "All writes are scoped by two-level predicates and are individually idempotent on re-run.\n" ] diff --git a/code/etl_visp_met_types_01_cluster.ipynb b/code/etl_visp_met_types_01_cluster.ipynb index 92df10b..47ca867 100644 --- a/code/etl_visp_met_types_01_cluster.ipynb +++ b/code/etl_visp_met_types_01_cluster.ipynb @@ -5,11 +5,11 @@ "id": "d0e57e11", "metadata": {}, "source": [ - "# ETL — VISp MET-types Taxonomy (cluster reference)\n", + "# ETL 
\u2014 VISp MET-types Taxonomy (cluster reference)\n", "\n", "Registers the **VISp MET-types taxonomy** as a global cluster reference. Writes `algorithmrun/`, `clusterhierarchy/`, `cluster/`, `hierarchycategory/`. **Out of scope:** no `DataItem` registration.\n", "\n", - "Source: `met_type_colors.json` (45 MET-type labels, leaf-only colors). Two real levels (class → cluster) with a synthetic `cell` root. Class-level colors sourced from Tasic's `anno.feather` for visual consistency. Schema caveats already documented in `etl_tasic_01_cluster.ipynb`; not repeated here." + "Source: `met_type_colors.json` (45 MET-type labels, leaf-only colors). Two real levels (class \u2192 cluster) with a synthetic `cell` root. Class-level colors sourced from Tasic's `anno.feather` for visual consistency. Schema caveats already documented in `etl_tasic_01_cluster.ipynb`; not repeated here." ] }, { @@ -125,7 +125,7 @@ "GABA_COLOR = tasic_class_colors[\"GABAergic\"]\n", "GLUT_COLOR = tasic_class_colors[\"Glutamatergic\"]\n", "\n", - "# Leaf split: \"MET\" in label → GABAergic, otherwise → Glutamatergic.\n", + "# Leaf split: \"MET\" in label \u2192 GABAergic, otherwise \u2192 Glutamatergic.\n", "gaba_met_types = [t for t in met_colors if \"MET\" in t]\n", "glut_met_types = [t for t in met_colors if \"MET\" not in t]\n", "\n", @@ -143,7 +143,7 @@ "id": "95b9bb7e", "metadata": {}, "source": [ - "## `HierarchyCategory` — 3 rows (`major_class`/`class`/`cluster`); no `subclass` for this taxonomy" + "## `HierarchyCategory` \u2014 3 rows (`major_class`/`class`/`cluster`); no `subclass` for this taxonomy" ] }, { @@ -176,7 +176,7 @@ "]\n", "CATEGORY_IDS = [c.id for c in category_rows]\n", "\n", - "result = write_models(category_rows)\n", + "result = write_models(category_rows, output_root=OUTPUT_ROOT)\n", "print(f\"HierarchyCategory written: {result.rows_written} rows\")\n" ] }, @@ -198,16 +198,16 @@ "output_type": "stream", "text": [ "shape: (4, 3)\n", - 
"┌─────────────┬─────────────────────────────────┬───────┐\n", - "│ id ┆ description ┆ level │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═════════════╪═════════════════════════════════╪═══════╡\n", - "│ class ┆ Top-level cell class. ┆ 2 │\n", - "│ cluster ┆ Leaf cluster (cell type / MET-… ┆ 0 │\n", - "│ major_class ┆ Synthetic root grouping all cl… ┆ 3 │\n", - "│ subclass ┆ Subclass of cell types. ┆ 1 │\n", - "└─────────────┴─────────────────────────────────┴───────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 description \u2506 level \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 class \u2506 Top-level cell class. \u2506 2 \u2502\n", + "\u2502 cluster \u2506 Leaf cluster (cell type / MET-\u2026 \u2506 0 \u2502\n", + "\u2502 major_class \u2506 Synthetic root grouping all cl\u2026 \u2506 3 \u2502\n", + "\u2502 subclass \u2506 Subclass of cell types. 
\u2506 1 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -225,7 +225,7 @@ "id": "0114bd8d", "metadata": {}, "source": [ - "## `AlgorithmRun` — 1 row" + "## `AlgorithmRun` \u2014 1 row" ] }, { @@ -260,7 +260,7 @@ " # produced_hierarchies omitted: schema declares it as inlined dict[id, ClusterHierarchy].\n", ")\n", "\n", - "result = write_models([run_row])\n", + "result = write_models([run_row], output_root=OUTPUT_ROOT)\n", "print(f\"AlgorithmRun written: {result.rows_written} rows\")\n" ] }, @@ -283,18 +283,18 @@ "text": [ "(1, 9)\n", "shape: (1, 9)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ algorithm ┆ algorithm ┆ json_obje ┆ … ┆ input_dat ┆ produced_ ┆ score_des ┆ distance │\n", - "│ --- ┆ _name ┆ _version ┆ ct ┆ ┆ aset ┆ hierarchi ┆ cription ┆ _descrip │\n", - "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ es ┆ --- ┆ tion │\n", - "│ ┆ str ┆ str ┆ str ┆ ┆ str ┆ --- ┆ str ┆ --- │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ str │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ visp_met_ ┆ VISp ┆ 2021 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │\n", - "│ types_clu ┆ MET-types ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ stering ┆ taxonomy ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "│ ┆ (Patch… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 algorithm \u2506 algorithm \u2506 json_obje \u2506 \u2026 \u2506 input_dat \u2506 produced_ \u2506 score_des \u2506 distance \u2502\n", + "\u2502 --- \u2506 _name \u2506 _version \u2506 ct \u2506 \u2506 aset \u2506 hierarchi \u2506 cription \u2506 _descrip \u2502\n", + "\u2502 str \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 es \u2506 --- \u2506 tion \u2502\n", + "\u2502 \u2506 str \u2506 str \u2506 str \u2506 \u2506 str \u2506 --- \u2506 str \u2506 --- \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_met_ \u2506 VISp \u2506 2021 \u2506 null \u2506 \u2026 \u2506 null \u2506 null \u2506 null \u2506 null \u2502\n", + "\u2502 types_clu \u2506 MET-types \u2506 \u2506 \u2506 \u2506 
\u2506 \u2506 \u2506 \u2502\n", + "\u2502 stering \u2506 taxonomy \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 \u2506 (Patch\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -310,7 +310,7 @@ "id": "c92da157", "metadata": {}, "source": [ - "## `Cluster` — 48 rows (1 synthetic root + 2 classes + 45 leaves)" + "## `Cluster` \u2014 48 rows (1 synthetic root + 2 classes + 45 leaves)" ] }, { @@ -389,7 +389,7 @@ "\n", "assert len(cluster_rows) == 1 + 2 + 45 == 48\n", "print(f\"Cluster rows built: {len(cluster_rows)}\")\n", - "result = write_models(cluster_rows)\n", + "result = write_models(cluster_rows, output_root=OUTPUT_ROOT)\n", "print(f\"Cluster written: {result.rows_written} rows\")" ] }, @@ -412,15 +412,15 @@ "text": [ "(48, 9)\n", "shape: (3, 2)\n", - "┌────────────────────┬─────┐\n", - "│ hierarchy_category ┆ len │\n", - "│ --- ┆ --- │\n", - "│ str ┆ u32 │\n", - "╞════════════════════╪═════╡\n", - "│ class ┆ 2 │\n", - "│ cluster ┆ 45 │\n", - "│ major_class ┆ 1 │\n", - "└────────────────────┴─────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 hierarchy_category \u2506 len \u2502\n", + "\u2502 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 u32 \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 class \u2506 2 \u2502\n", + "\u2502 cluster \u2506 45 \u2502\n", + "\u2502 major_class \u2506 1 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -437,7 +437,7 @@ "id": "27f5ed38", "metadata": {}, "source": [ - "## `ClusterHierarchy` — 1 row" + "## `ClusterHierarchy` \u2014 1 row" ] }, { @@ -468,7 +468,7 @@ " root=ROOT_ID,\n", " clusters=[c.id for c in cluster_rows],\n", ")\n", - "result = write_models([hierarchy_row])\n", + "result = write_models([hierarchy_row], output_root=OUTPUT_ROOT)\n", "print(f\"ClusterHierarchy written: {result.rows_written} rows\")" ] }, diff --git a/code/etl_wnm_exc_01_dataset_dataitem.ipynb b/code/etl_wnm_exc_01_dataset_dataitem.ipynb index d700c16..dc10fbc 100644 --- a/code/etl_wnm_exc_01_dataset_dataitem.ipynb +++ b/code/etl_wnm_exc_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — VISp Excitatory Whole Neuron Morphology: DataSet & DataItem\n", + "# ETL \u2014 VISp Excitatory Whole Neuron Morphology: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_exc_wnm\"`, `project_id = \"visp_wnm\"`), one `DataItem` per cell from `FullMorphMetaData_Master.csv` (cell id = SWC filename with `.swc` stripped), and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." 
] @@ -298,7 +298,7 @@ " modality=Modality.MORPHOLOGY.value,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([dataset])\n", + "result = write_models([dataset], output_root=OUTPUT_ROOT)\n", "print(f\"DataSet written: {result.rows_written} rows\")" ] }, @@ -320,14 +320,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌──────────────┬───────────────────────┬─────────────────────────────────┬────────────┬────────────┐\n", - "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞══════════════╪═══════════════════════╪═════════════════════════════════╪════════════╪════════════╡\n", - "│ visp_exc_wnm ┆ VISp excitatory whole ┆ doi.org/10.1101/2023.11.25.568… ┆ MORPHOLOGY ┆ visp_wnm │\n", - "│ ┆ neuron m… ┆ ┆ ┆ │\n", - "└──────────────┴───────────────────────┴─────────────────────────────────┴────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_exc_wnm \u2506 VISp excitatory whole \u2506 doi.org/10.1101/2023.11.25.568\u2026 \u2506 MORPHOLOGY \u2506 visp_wnm \u2502\n", + "\u2502 \u2506 neuron m\u2026 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -378,7 +378,7 @@ " DataItem(id=cid, name=cid, project_id=PROJECT_ID)\n", " for cid in cell_ids\n", "]\n", - "n_appended = write_models(dataitems).rows_written\n", + "n_appended = write_models(dataitems, output_root=OUTPUT_ROOT).rows_written\n", "print(f\"DataItem rows appended: {n_appended} (total in batch: {len(cell_ids)})\")" ] }, @@ -400,17 +400,17 @@ "text": [ "(341, 4)\n", "shape: (5, 4)\n", - "┌──────────────────────────────┬──────────────────────────────┬───────────────────┬────────────┐\n", - "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", - "│ --- ┆ 
--- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str │\n", - "╞══════════════════════════════╪══════════════════════════════╪═══════════════════╪════════════╡\n", - "│ 182709_6984-X2452-Y12423_reg ┆ 182709_6984-X2452-Y12423_reg ┆ null ┆ visp_wnm │\n", - "│ 182709_7126-X2913-Y10535_reg ┆ 182709_7126-X2913-Y10535_reg ┆ null ┆ visp_wnm │\n", - "│ 182724_5937-X3804-Y11955_reg ┆ 182724_5937-X3804-Y11955_reg ┆ null ┆ visp_wnm │\n", - "│ 182724_6175-X3782-Y10859_reg ┆ 182724_6175-X3782-Y10859_reg ┆ null ┆ visp_wnm │\n", - "│ 182724_6354-X4834-Y8105_reg ┆ 182724_6354-X4834-Y8105_reg ┆ null ┆ visp_wnm │\n", - "└──────────────────────────────┴──────────────────────────────┴───────────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 182709_6984-X2452-Y12423_reg \u2506 182709_6984-X2452-Y12423_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 182709_7126-X2913-Y10535_reg \u2506 182709_7126-X2913-Y10535_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 182724_5937-X3804-Y11955_reg \u2506 182724_5937-X3804-Y11955_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 182724_6175-X3782-Y10859_reg \u2506 182724_6175-X3782-Y10859_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 182724_6354-X4834-Y8105_reg \u2506 182724_6354-X4834-Y8105_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -463,7 +463,7 @@ " )\n", " for cid in cell_ids\n", "]\n", - "result = write_models(associations)\n", + "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, @@ 
-485,17 +485,17 @@ "text": [ "(341, 3)\n", "shape: (5, 3)\n", - "┌──────────────────────────────┬──────────────┬────────────┐\n", - "│ dataitem_id ┆ dataset_id ┆ project_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞══════════════════════════════╪══════════════╪════════════╡\n", - "│ 182709_6984-X2452-Y12423_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 182709_7126-X2913-Y10535_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 182724_5937-X3804-Y11955_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 182724_6175-X3782-Y10859_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 182724_6354-X4834-Y8105_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "└──────────────────────────────┴──────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 182709_6984-X2452-Y12423_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 182709_7126-X2913-Y10535_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 182724_5937-X3804-Y11955_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 182724_6175-X3782-Y10859_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 182724_6354-X4834-Y8105_reg \u2506 visp_exc_wnm 
\u2506 visp_wnm \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -527,8 +527,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 341 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `predicted_met_type`, `probability` — MET-type classification; written in a later notebook as `CellToClusterMapping`.\n", - "- `ccf_soma_location`, `ccf_soma_x/y/z` and remaining morphology metadata — written in a later notebook as `SingleCellRecon` records." + "- `predicted_met_type`, `probability` \u2014 MET-type classification; written in a later notebook as `CellToClusterMapping`.\n", + "- `ccf_soma_location`, `ccf_soma_x/y/z` and remaining morphology metadata \u2014 written in a later notebook as `SingleCellRecon` records." ] }, { diff --git a/code/etl_wnm_exc_02_cell_features.ipynb b/code/etl_wnm_exc_02_cell_features.ipynb index 00856ae..59d7ec8 100644 --- a/code/etl_wnm_exc_02_cell_features.ipynb +++ b/code/etl_wnm_exc_02_cell_features.ipynb @@ -4,9 +4,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL — WNM excitatory: Cell Features (three feature sets)\n", + "# ETL \u2014 WNM excitatory: Cell Features (three feature sets)\n", "\n", - "Writes WNM excitatory neuron cell features across three feature sets: **`exc_visp_morph_features`** (shared with `etl_visp_exc_patchseq_02`; defs/set owned by that notebook), **`wnm_exc_local_axon_features`** (local axon + apical dendrite morphology), and **`wnm_exc_complete_axon_features`** (whole-brain axon features from fMOST — placeholder, file not yet available). All rows use `project_id=\"visp_wnm\"`. 
Prerequisites: `etl_wnm_exc_01_dataset_dataitem.ipynb` and `etl_visp_exc_patchseq_02_cell_features.ipynb`." + "Writes WNM excitatory neuron cell features across three feature sets: **`exc_visp_morph_features`** (shared with `etl_visp_exc_patchseq_02`; defs/set owned by that notebook), **`wnm_exc_local_axon_features`** (local axon + apical dendrite morphology), and **`wnm_exc_complete_axon_features`** (whole-brain axon features from fMOST \u2014 placeholder, file not yet available). All rows use `project_id=\"visp_wnm\"`. Prerequisites: `etl_wnm_exc_01_dataset_dataitem.ipynb` and `etl_visp_exc_patchseq_02_cell_features.ipynb`." ] }, { @@ -129,7 +129,7 @@ " )\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_wnm_exc_01 must be run first — \"\n", + " f\"etl_wnm_exc_01 must be run first \u2014 \"\n", " f\"no DataItemDataSetAssociation rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "wnm_registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", @@ -138,7 +138,7 @@ "# Assert the shared CellFeatureSet exists (written by etl_visp_exc_patchseq_02).\n", "cfs_check = pl.read_delta(OUTPUT_ROOT + \"cellfeatureset/\").filter(pl.col(\"id\") == FSI_SHARED)\n", "assert cfs_check.shape[0] == 1, (\n", - " f\"etl_visp_exc_patchseq_02 must be run first — \"\n", + " f\"etl_visp_exc_patchseq_02 must be run first \u2014 \"\n", " f\"CellFeatureSet '{FSI_SHARED}' not found\"\n", ")\n", "print(f\"Shared CellFeatureSet '{FSI_SHARED}' found.\")" @@ -149,7 +149,7 @@ "metadata": {}, "source": [ "---\n", - "## Set 1 — `exc_visp_morph_features` (shared defs; WNM rows only)\n", + "## Set 1 \u2014 `exc_visp_morph_features` (shared defs; WNM rows only)\n", "\n", "Defs and `CellFeatureSet` are owned by `etl_visp_exc_patchseq_02_cell_features.ipynb`. This notebook only writes WNM rows to `cellfeatures/exc_visp_morph_features/` and the corresponding `CellFeatureMatrix` pointer." ] @@ -326,7 +326,7 @@ " \n", " \n", "\n", - "

3 rows × 45 columns

\n", + "

3 rows \u00d7 45 columns

\n", "" ], "text/plain": [ @@ -418,7 +418,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "WARNING: 6 shared def columns are missing from Set1 CSV — will be filled with NaN:\n", + "WARNING: 6 shared def columns are missing from Set1 CSV \u2014 will be filled with NaN:\n", " ['apical_dendrite_mean_diameter', 'apical_dendrite_total_surface_area', 'axon_exit_distance', 'axon_exit_theta', 'basal_dendrite_mean_diameter', 'basal_dendrite_total_surface_area']\n", "After alignment: 6 NaN-filled, 0 dropped\n" ] @@ -431,13 +431,13 @@ "extra_cols = [c for c in csv_feat_cols if c not in shared_def_ids]\n", "\n", "if missing_cols:\n", - " print(f\"WARNING: {len(missing_cols)} shared def columns are missing from Set1 CSV — \"\n", + " print(f\"WARNING: {len(missing_cols)} shared def columns are missing from Set1 CSV \u2014 \"\n", " f\"will be filled with NaN:\\n {missing_cols}\")\n", " for col in missing_cols:\n", " wide1_raw[col] = np.nan\n", "\n", "if extra_cols:\n", - " print(f\"WARNING: {len(extra_cols)} columns in CSV are NOT in shared defs — dropping:\\n {extra_cols}\")\n", + " print(f\"WARNING: {len(extra_cols)} columns in CSV are NOT in shared defs \u2014 dropping:\\n {extra_cols}\")\n", " wide1_raw = wide1_raw.drop(columns=extra_cols)\n", "\n", "print(f\"After alignment: {len(missing_cols)} NaN-filled, {len(extra_cols)} dropped\")" @@ -508,7 +508,7 @@ "# Register new cells (DataItem + DataItemDataSetAssociation) for those in Set1 not yet in _01.\n", "if new_ids_set1:\n", " new_items = [DataItem(id=i, name=i, project_id=PROJECT_ID) for i in new_ids_set1]\n", - " n_appended = write_models(new_items).rows_written\n", + " n_appended = write_models(new_items, output_root=OUTPUT_ROOT).rows_written\n", " print(f\"Appended {n_appended} new DataItem rows\")\n", "\n", " new_assoc = [\n", @@ -520,7 +520,7 @@ " models_to_table(new_assoc, schema=schema_assoc),\n", " linkml_class=\"DataItemDataSetAssociation\",\n", " )\n", - " # Append — association table uses 
append_new_dataitems pattern\n", + " # Append \u2014 association table uses append_new_dataitems pattern\n", " # (no overwrite predicate since we only add new rows here)\n", " existing_assoc = pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\").filter(\n", " (pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID)\n", @@ -540,7 +540,7 @@ " else:\n", " print(\"No new association rows to append.\")\n", "else:\n", - " print(\"All Set1 cells already registered — no new DataItem or association writes.\")\n", + " print(\"All Set1 cells already registered \u2014 no new DataItem or association writes.\")\n", "\n", "# Refresh registered ids so Set2 coverage check reflects newly added cells.\n", "wnm_registered_ids = set(\n", @@ -657,20 +657,20 @@ "text": [ "(345, 53)\n", "shape: (3, 3)\n", - "┌─────────────────────────────┬────────────┬─────────────────────────┐\n", - "│ id ┆ project_id ┆ feature_set_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═════════════════════════════╪════════════╪═════════════════════════╡\n", - "│ 17109_6201-X4328-Y6753_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", - "│ 17109_6301-X4756-Y24516_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", - "│ 17109_6601-X4384-Y7436_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", - "└─────────────────────────────┴────────────┴─────────────────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 project_id \u2506 feature_set_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 17109_6201-X4328-Y6753_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", + "\u2502 17109_6301-X4756-Y24516_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", + "\u2502 17109_6601-X4384-Y7436_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], "source": [ - "# Verification — Set1 wide parquet.\n", + "# Verification \u2014 Set1 wide parquet.\n", "set1_v = pl.read_delta(OUTPUT_ROOT + f\"cellfeatures/{FSI_SHARED}/\").filter(\n", " pl.col(\"project_id\") == PROJECT_ID\n", ")\n", @@ -710,7 +710,7 @@ " cell_index_column=\"id\",\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([cfm1])\n", + "result = write_models([cfm1], output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureMatrix written: {result.rows_written} rows\")" ] }, @@ -732,19 +732,19 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "┌─────────────────────┬─────────────────────┬─────────────────────┬───────────────────┬────────────┐\n", - "│ id ┆ feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - 
"╞═════════════════════╪═════════════════════╪═════════════════════╪═══════════════════╪════════════╡\n", - "│ visp_wnm_exc_visp_m ┆ exc_visp_morph_feat ┆ file:///scratch/em_ ┆ id ┆ visp_wnm │\n", - "│ orph_featur… ┆ ures ┆ patchseq_wn… ┆ ┆ │\n", - "└─────────────────────┴─────────────────────┴─────────────────────┴───────────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_wnm_exc_visp_m \u2506 exc_visp_morph_feat \u2506 file:///scratch/em_ \u2506 id \u2506 visp_wnm \u2502\n", + "\u2502 orph_featur\u2026 \u2506 ures \u2506 
patchseq_wn\u2026 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], "source": [ - "# Verification — CellFeatureMatrix Set1.\n", + "# Verification \u2014 CellFeatureMatrix Set1.\n", "cfm1_v = pl.read_delta(OUTPUT_ROOT + \"cellfeaturematrix/\").filter(\n", " (pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"feature_set_id\") == FSI_SHARED)\n", ")\n", @@ -757,7 +757,7 @@ "metadata": {}, "source": [ "---\n", - "## Set 2 — `wnm_exc_local_axon_features`" + "## Set 2 \u2014 `wnm_exc_local_axon_features`" ] }, { @@ -898,7 +898,7 @@ " \n", " \n", "\n", - "

3 rows × 52 columns

\n", + "

3 rows \u00d7 52 columns

\n", "" ], "text/plain": [ @@ -1051,7 +1051,7 @@ ], "source": [ "# Write CellFeatureDefinition for Set2.\n", - "result = write_models(feature_defs_2)\n", + "result = write_models(feature_defs_2, output_root=OUTPUT_ROOT)\n", "print(f\"CellFeatureDefinition written: {result.rows_written} rows\")" ] }, @@ -1073,27 +1073,27 @@ "text": [ "(51, 8)\n", "shape: (3, 8)\n", - "┌──────────────┬─────────────┬──────┬───────────┬───────────┬───────────┬────────────┬─────────────┐\n", - "│ id ┆ description ┆ unit ┆ data_type ┆ range_min ┆ range_max ┆ project_id ┆ feature_set │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ _id │\n", - "│ str ┆ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ str ┆ --- │\n", - "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str │\n", - "╞══════════════╪═════════════╪══════╪═══════════╪═══════════╪═══════════╪════════════╪═════════════╡\n", - "│ apical_dendr ┆ null ┆ null ┆ 0, (\n", - " f\"etl_wnm_exc_01 must run first — no association rows for dataset_id='{DATASET_ID}'\"\n", + " f\"etl_wnm_exc_01 must run first \u2014 no association rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", "print(f\"Registered DataItems for {DATASET_ID}: {len(registered_ids)}\")\n", @@ -135,7 +135,7 @@ " .filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", ")\n", "assert met_clu.shape[0] > 0, (\n", - " f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", + " f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", ")\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", "print(f\"Clusters loaded: {METTYPE_HIERARCHY_ID}={len(met_parent)}\")\n" @@ -324,7 +324,7 @@ " target_hierarchy=METTYPE_HIERARCHY_ID,\n", " project_id=PROJECT_ID,\n", ")\n", - "result = write_models([mapping_set])\n", + "result = write_models([mapping_set], output_root=OUTPUT_ROOT)\n", "print(f\"MappingSet written: {result.rows_written} rows\")\n" ] }, @@ -347,18 
+347,18 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", - "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", - "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", - "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", - "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", - "│ visp_exc_ ┆ VISp WNM ┆ Routed ┆ Routed ┆ … ┆ null ┆ visp_met_ ┆ null ┆ visp_wnm │\n", - "│ wnm_metty ┆ excitator ┆ random ┆ random ┆ ┆ ┆ types_tax ┆ ┆ │\n", - "│ pe_mappin ┆ y ┆ forest ┆ forest ┆ ┆ ┆ onomy ┆ ┆ │\n", - "│ g ┆ MET-type ┆ mapping ┆ mapping ┆ ┆ ┆ ┆ ┆ │\n", - "│ ┆ a… ┆ o… ┆ ┆ ┆ ┆ ┆ ┆ │\n", - "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", + "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", + "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 
str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 visp_exc_ \u2506 VISp WNM \u2506 Routed \u2506 Routed \u2506 \u2026 \u2506 null \u2506 visp_met_ \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 wnm_metty \u2506 excitator \u2506 random \u2506 random \u2506 \u2506 \u2506 types_tax \u2506 \u2506 \u2502\n", + "\u2502 pe_mappin \u2506 y \u2506 forest \u2506 forest \u2506 \u2506 \u2506 onomy \u2506 \u2506 \u2502\n", + "\u2502 g \u2506 MET-type \u2506 mapping \u2506 mapping \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 \u2506 a\u2026 \u2506 o\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -432,7 +432,7 @@ " project_id=PROJECT_ID,\n", " ))\n", "print(f\"CellToClusterMapping rows built: {len(mappings)}\")\n", - "result = write_models(mappings)\n", + 
"result = write_models(mappings, output_root=OUTPUT_ROOT)\n", "print(f\"CellToClusterMapping written: {result.rows_written} rows\")\n" ] }, @@ -455,22 +455,22 @@ "text": [ "(1023, 8)\n", "shape: (3, 8)\n", - "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", - "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", - "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", - "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", - "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ L5 ET-2 ┆ null ┆ 0.988 ┆ null ┆ visp_wnm │\n", - "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ┆ ┆ ┆ ┆ │\n", - "│ 23_reg-L… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", - "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_wnm │\n", - "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ic ┆ ┆ ┆ ┆ │\n", - "│ 23_reg-G… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", - "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ cell ┆ null ┆ null ┆ null ┆ visp_wnm │\n", - "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ┆ ┆ ┆ ┆ │\n", - "│ 23_reg-c… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", - "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + 
"\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", + "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 L5 ET-2 \u2506 null \u2506 0.988 \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 23_reg-L\u2026 \u2506 apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 ic \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 23_reg-G\u2026 \u2506 apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 23_reg-c\u2026 \u2506 
apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -511,7 +511,7 @@ "| Output path | Class | Rows |\n", "|---|---|---|\n", "| `mappingset/` (`id={MAPPING_SET_ID}`) | `MappingSet` (Routed random forest mapping) | 1 |\n", - "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell × MET-type ancestor); leaf rows carry `probability`, ancestors are null |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell \u00d7 MET-type ancestor); leaf rows carry `probability`, ancestors are null |\n", "\n", "**Not written:** no `ClusterMembership` rows. WNM cells did not define the VISp MET-types taxonomy, so their assignments are mappings, not memberships.\n" ] diff --git a/code/etl_wnm_exc_04_projection_matrix.ipynb b/code/etl_wnm_exc_04_projection_matrix.ipynb index a3dab81..73a7d6a 100644 --- a/code/etl_wnm_exc_04_projection_matrix.ipynb +++ b/code/etl_wnm_exc_04_projection_matrix.ipynb @@ -5,9 +5,9 @@ "id": "7dab0b27", "metadata": {}, "source": [ - "# ETL — WNM Excitatory: Projection Matrix\n", + "# ETL \u2014 WNM Excitatory: Projection Matrix\n", "\n", - "Writes two `ProjectionMeasurementMatrix` rows (ipsi + contra) for `project_id=\"visp_wnm\"`, `dataset_id=\"visp_exc_wnm\"`, plus the backing wide-form Delta tables. 
Source: `ProjectionMatrix_tip_and_branch_roll_up.csv` (345 cells × 152 ipsi + 68 contra regions). Prerequisite: `etl_wnm_exc_01`. Registers 4 cells absent from `_01` via `append_new_dataitems`." + "Writes two `ProjectionMeasurementMatrix` rows (ipsi + contra) for `project_id=\"visp_wnm\"`, `dataset_id=\"visp_exc_wnm\"`, plus the backing wide-form Delta tables. Source: `ProjectionMatrix_tip_and_branch_roll_up.csv` (345 cells \u00d7 152 ipsi + 68 contra regions). Prerequisite: `etl_wnm_exc_01`. Registers 4 cells absent from `_01` via `append_new_dataitems`." ] }, { @@ -15,7 +15,7 @@ "id": "b1804bce", "metadata": {}, "source": [ - "**Caveat:** `measurement_type=MICRONS_OF_AXON` is a best guess. The filename `tip_and_branch_roll_up` suggests counts, but values are floats with magnitudes ~10⁴ — consistent with µm of axon length per region. To confirm with the data owner." + "**Caveat:** `measurement_type=MICRONS_OF_AXON` is a best guess. The filename `tip_and_branch_roll_up` suggests counts, but values are floats with magnitudes ~10\u2074 \u2014 consistent with \u00b5m of axon length per region. To confirm with the data owner." ] }, { @@ -25,7 +25,7 @@ "source": [ "**Known schema mismatches (stopgaps):**\n", "\n", - "1. `ProjectionMeasurementMatrix` lacks `ProjectScoped` → metadata predicate is `id IN (...)` only. Fix: add `mixins: [ProjectScoped]` in `schemas/projection_schema.yaml` and regenerate.\n", + "1. `ProjectionMeasurementMatrix` lacks `ProjectScoped` \u2192 metadata predicate is `id IN (...)` only. Fix: add `mixins: [ProjectScoped]` in `schemas/projection_schema.yaml` and regenerate.\n", "2. `region_index` stores raw acronym strings instead of `BrainRegion.id`s; `brainregion/` is not yet populated. Re-run after that bootstrap.\n", "3. `values` is typed `ZarrArray` but stored here as a `file://` delta-path string (mirrors `CellFeatureMatrix.parquet_path`). Fix: add a `parquet_path` slot or commit to zarr." 
] @@ -136,7 +136,7 @@ " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", ")\n", "assert prereq_assoc.shape[0] > 0, (\n", - " f\"etl_wnm_exc_01_dataset_dataitem.ipynb must be run first — \"\n", + " f\"etl_wnm_exc_01_dataset_dataitem.ipynb must be run first \u2014 \"\n", " f\"no DataItemDataSetAssociation rows for project_id='{PROJECT_ID}', dataset_id='{DATASET_ID}'\"\n", ")\n", "print(f\"Prereq OK: {prereq_assoc.shape[0]} DataItem associations registered for {DATASET_ID}.\")" @@ -261,7 +261,7 @@ ], "source": [ "# First column (unnamed) is the swc filename. Strip the .swc suffix to get the cell id\n", - "# (matches etl_wnm_exc_01 convention). Cell ids are kept as strings — never cast.\n", + "# (matches etl_wnm_exc_01 convention). Cell ids are kept as strings \u2014 never cast.\n", "df = pd.read_csv(INPUT_CSV, index_col=0)\n", "df.index = df.index.astype(str).str.removesuffix(\".swc\")\n", "df.index.name = \"id\"\n", @@ -374,14 +374,14 @@ } ], "source": [ - "# Append only-new DataItem rows for cells absent from _01. append_new_dataitems is idempotent —\n", + "# Append only-new DataItem rows for cells absent from _01. 
append_new_dataitems is idempotent \u2014\n", "# re-running this cell appends 0 and does not disturb other projects' rows in dataitem/.\n", "if new_ids:\n", " new_items = [\n", " DataItem(id=cid, name=cid, project_id=PROJECT_ID, modality=Modality.MORPHOLOGY.value)\n", " for cid in new_ids\n", " ]\n", - " n_appended = write_models(new_items).rows_written\n", + " n_appended = write_models(new_items, output_root=OUTPUT_ROOT).rows_written\n", " print(f\"Appended {n_appended} new DataItem rows\")\n", "else:\n", " print(\"All cells already in DataItem; nothing to append.\")" @@ -406,15 +406,15 @@ "text": [ "(345, 4)\n", "shape: (3, 4)\n", - "┌───────────────────────────────┬───────────────────────────────┬───────────────────┬────────────┐\n", - "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", - "│ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str │\n", - "╞═══════════════════════════════╪═══════════════════════════════╪═══════════════════╪════════════╡\n", - "│ 17109_6801-X7432-Y4405_reg ┆ 17109_6801-X7432-Y4405_reg ┆ null ┆ visp_wnm │\n", - "│ 211541_6961-X18505-Y15909_reg ┆ 211541_6961-X18505-Y15909_reg ┆ null ┆ visp_wnm │\n", - "│ 220309_5824-X3486-Y10261_reg ┆ 220309_5824-X3486-Y10261_reg ┆ null ┆ visp_wnm │\n", - "└───────────────────────────────┴───────────────────────────────┴───────────────────┴────────────┘\n", + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 
project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 17109_6801-X7432-Y4405_reg \u2506 17109_6801-X7432-Y4405_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 211541_6961-X18505-Y15909_reg \u2506 211541_6961-X18505-Y15909_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2502 220309_5824-X3486-Y10261_reg \u2506 220309_5824-X3486-Y10261_reg \u2506 null \u2506 visp_wnm \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "All 345 cells present in DataItem.\n" ] } @@ -473,7 +473,7 @@ " )\n", " for cid in cell_ids\n", "]\n", - "result = write_models(associations)\n", + "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" ] }, @@ -496,15 +496,15 @@ "text": [ 
"(345, 3)\n", "shape: (3, 3)\n", - "┌───────────────────────────────┬──────────────┬────────────┐\n", - "│ dataitem_id ┆ dataset_id ┆ project_id │\n", - "│ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str │\n", - "╞═══════════════════════════════╪══════════════╪════════════╡\n", - "│ 18864_6734-X4899-Y27447_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 191812_7938-X6892-Y25312_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "│ 211550_7718-X19461-Y16950_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", - "└───────────────────────────────┴──────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 18864_6734-X4899-Y27447_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 191812_7938-X6892-Y25312_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + "\u2502 211550_7718-X19461-Y16950_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -597,17 +597,17 @@ "text": [ "(345, 155)\n", "shape: (3, 6)\n", - "┌─────────────────────────────┬────────────┬──────────────┬────────────┬──────────────┬────────────┐\n", - "│ id ┆ project_id ┆ dataset_id ┆ VISam ┆ VISp ┆ VISpm │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 │\n", - "╞═════════════════════════════╪════════════╪══════════════╪════════════╪══════════════╪════════════╡\n", - "│ 18864_6734-X4899-Y27447_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 8287.70664 ┆ 34450.175934 ┆ 483.223644 │\n", - "│ 191812_7938-X6892-Y25312_re ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 794.102517 ┆ 0.0 │\n", - "│ g ┆ ┆ ┆ ┆ ┆ │\n", - "│ 211550_7718-X19461-Y16950_r ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 6473.751624 ┆ 0.0 │\n", - "│ eg ┆ ┆ ┆ ┆ ┆ │\n", - "└─────────────────────────────┴────────────┴──────────────┴────────────┴──────────────┴────────────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 project_id \u2506 dataset_id \u2506 VISam \u2506 VISp \u2506 VISpm \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- 
\u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 f64 \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 18864_6734-X4899-Y27447_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 8287.70664 \u2506 34450.175934 \u2506 483.223644 \u2502\n", + "\u2502 191812_7938-X6892-Y25312_re \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 794.102517 \u2506 0.0 \u2502\n", + "\u2502 g \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2502 211550_7718-X19461-Y16950_r \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 6473.751624 \u2506 0.0 \u2502\n", + "\u2502 eg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -697,15 +697,15 @@ "text": [ "(345, 71)\n", "shape: (3, 6)\n", - 
"┌───────────────────────────────┬────────────┬──────────────┬─────────────┬──────┬─────┐\n", - "│ id ┆ project_id ┆ dataset_id ┆ VISpor ┆ VISp ┆ CP │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 │\n", - "╞═══════════════════════════════╪════════════╪══════════════╪═════════════╪══════╪═════╡\n", - "│ 18864_6734-X4899-Y27447_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", - "│ 191812_7938-X6892-Y25312_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 1045.437572 ┆ 0.0 ┆ 0.0 │\n", - "│ 211550_7718-X19461-Y16950_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", - "└───────────────────────────────┴────────────┴──────────────┴─────────────┴──────┴─────┘\n" + "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 project_id \u2506 dataset_id \u2506 VISpor \u2506 VISp \u2506 CP \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 f64 \u2502\n", + 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 18864_6734-X4899-Y27447_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 0.0 \u2506 0.0 \u2502\n", + "\u2502 191812_7938-X6892-Y25312_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 1045.437572 \u2506 0.0 \u2506 0.0 \u2502\n", + "\u2502 211550_7718-X19461-Y16950_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 0.0 \u2506 0.0 \u2502\n", + "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518\n" ] } ], @@ -791,7 +791,7 @@ "\n", "ipsi_matrix = ProjectionMeasurementMatrix(\n", " id=FSI_IPSI,\n", - " description=\"WNM excitatory ipsilateral projection matrix: per-cell axon length (µm, inferred) by ipsilateral CCF region.\",\n", + " description=\"WNM excitatory ipsilateral projection matrix: per-cell axon length (\u00b5m, inferred) by ipsilateral CCF region.\",\n", " measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON,\n", " modality=Modality.MORPHOLOGY,\n", " laterality=Laterality.IPSILATERAL,\n", @@ -803,7 +803,7 @@ ")\n", "contra_matrix = 
ProjectionMeasurementMatrix(\n", " id=FSI_CONTRA,\n", - " description=\"WNM excitatory contralateral projection matrix: per-cell axon length (µm, inferred) by contralateral CCF region.\",\n", + " description=\"WNM excitatory contralateral projection matrix: per-cell axon length (\u00b5m, inferred) by contralateral CCF region.\",\n", " measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON,\n", " modality=Modality.MORPHOLOGY,\n", " laterality=Laterality.CONTRALATERAL,\n", @@ -814,8 +814,8 @@ " unit=Unit.MICRONS_LENGTH,\n", ")\n", "\n", - "write_projection_matrix(ipsi_matrix, df[ipsi_cols].to_numpy())\n", - "write_projection_matrix(contra_matrix, df[contra_cols].to_numpy())\n", + "write_projection_matrix(ipsi_matrix, df[ipsi_cols].to_numpy(), output_root=OUTPUT_ROOT)\n", + "write_projection_matrix(contra_matrix, df[contra_cols].to_numpy(), output_root=OUTPUT_ROOT)\n", "print(\"ProjectionMeasurementMatrix written: 2 rows\")\n" ] }, @@ -838,16 +838,16 @@ "text": [ "(2, 10)\n", "shape: (2, 5)\n", - "┌─────────────────────┬───────────────┬──────────────────┬────────────────┬────────────────────────┐\n", - "│ id ┆ laterality ┆ measurement_type ┆ unit ┆ values │\n", - "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", - "│ str ┆ str ┆ str ┆ str ┆ str │\n", - "╞═════════════════════╪═══════════════╪══════════════════╪════════════════╪════════════════════════╡\n", - "│ wnm_exc_proj_contra ┆ CONTRALATERAL ┆ MICRONS_OF_AXON ┆ MICRONS_LENGTH ┆ file:///scratch/em_pat │\n", - "│ ┆ ┆ ┆ ┆ chseq_wn… │\n", - "│ wnm_exc_proj_ipsi ┆ IPSILATERAL ┆ MICRONS_OF_AXON ┆ MICRONS_LENGTH ┆ file:///scratch/em_pat │\n", - "│ ┆ ┆ ┆ ┆ chseq_wn… │\n", - "└─────────────────────┴───────────────┴──────────────────┴────────────────┴────────────────────────┘\n", + 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", + "\u2502 id \u2506 laterality \u2506 measurement_type \u2506 unit \u2506 values \u2502\n", + "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", + "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", + "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", + "\u2502 wnm_exc_proj_contra \u2506 CONTRALATERAL \u2506 MICRONS_OF_AXON \u2506 MICRONS_LENGTH \u2506 file:///scratch/em_pat \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 chseq_wn\u2026 \u2502\n", + "\u2502 wnm_exc_proj_ipsi \u2506 IPSILATERAL \u2506 MICRONS_OF_AXON \u2506 MICRONS_LENGTH \u2506 file:///scratch/em_pat \u2502\n", + "\u2502 \u2506 \u2506 \u2506 \u2506 chseq_wn\u2026 \u2502\n", + 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", "Verified both matrix rows.\n" ] } @@ -884,8 +884,8 @@ "|---|---|---|\n", "| `dataitem/` | `DataItem` | +N new cells (4 expected; via `append_new_dataitems`) |\n", "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 345 (overwrite scoped to `project_id` + `dataset_id`) |\n", - "| `projectionmeasurementmatrix/wnm_exc_proj_ipsi/` | wide parquet | 345 cells × 152 ipsilateral region columns |\n", - "| `projectionmeasurementmatrix/wnm_exc_proj_contra/` | wide parquet | 345 cells × 68 contralateral region columns |\n", + "| `projectionmeasurementmatrix/wnm_exc_proj_ipsi/` | wide parquet | 345 cells \u00d7 152 ipsilateral region columns |\n", + "| `projectionmeasurementmatrix/wnm_exc_proj_contra/` | wide parquet | 345 cells \u00d7 68 contralateral region columns |\n", "| `projectionmeasurementmatrix/` | `ProjectionMeasurementMatrix` | 2 (one per laterality) |\n", "\n", "`measurement_type=MICRONS_OF_AXON` is recorded based on inference from value magnitudes; awaiting confirmation from the data owner. 
Region indices are stored as raw acronym strings until `brainregion/` is bootstrapped (see schema-mismatch note above).\n" diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py index e32072e..3d9847c 100644 --- a/src/connects_common_connectivity/io/writers.py +++ b/src/connects_common_connectivity/io/writers.py @@ -21,7 +21,7 @@ from deltalake import write_deltalake from pydantic import BaseModel -from ..config import Settings, get_settings, table_path +from ..config import Settings, get_settings from .arrow_utils import attach_linkml_metadata, build_arrow_schema, models_to_table from .write_spec import REGISTRY, WriteSpec, get_spec from .write_utils import append_new_dataitems, populate_region_coverage @@ -224,7 +224,37 @@ def _dispatch_append_new_by_id( # --------------------------------------------------------------------------- -def write_models(models: Any, *, settings: Settings | None = None) -> WriteResult: +def _resolve_output_root( + settings: Settings | None, + output_root: str | Path | None, +) -> Path: + """Resolve the effective on-disk root for a single write call. + + Precedence (highest first): + + 1. Explicit ``output_root=`` (str or :class:`Path`). Used verbatim; + passing both ``settings=`` and ``output_root=`` is an error so callers + never have to remember a precedence rule. + 2. Explicit ``settings=`` → ``settings.output_root``. + 3. :func:`get_settings` → the discovered ``ccc_config.yaml``. + """ + if output_root is not None and settings is not None: + raise TypeError( + "Pass either settings= or output_root=, not both. " + "output_root= is the per-call override; settings= carries the " + "full Settings object." 
+ ) + if output_root is not None: + return Path(output_root) + return Path((settings or get_settings()).output_root) + + +def write_models( + models: Any, + *, + settings: Settings | None = None, + output_root: str | Path | None = None, +) -> WriteResult: """Write a batch of generated pydantic models to the shared Delta lake. The class is inferred from ``models`` and dispatched through its @@ -239,8 +269,16 @@ def write_models(models: Any, *, settings: Settings | None = None) -> WriteResul same class. The class must be one of :data:`WRITABLE_CLASSES`. settings: Optional explicit settings. Falls back to :func:`get_settings` when - omitted; an explicit ``settings=`` always wins (matches the - precedence documented in :mod:`connects_common_connectivity.config`). + omitted; an explicit ``settings=`` always wins over the discovered + config (matches the precedence documented in + :mod:`connects_common_connectivity.config`). + output_root: + Optional per-call override of the on-disk root under which the + canonical ``spec.subdir`` is written. Use this when a single + notebook/dataset should write to a different location than the + shared ``ccc_config.yaml`` ``output_root`` (e.g. an isolated test + dataset). Mutually exclusive with ``settings=`` — passing both + raises ``TypeError``. 
Returns ------- @@ -265,12 +303,12 @@ def write_models(models: Any, *, settings: Settings | None = None) -> WriteResul items = list(_validation_hook(items, spec)) - settings = settings or get_settings() + root = _resolve_output_root(settings, output_root) schema = build_arrow_schema(cls) table = models_to_table(items, schema=schema) table = attach_linkml_metadata(table, linkml_class=cls.__name__) - path = table_path(settings, spec.subdir) + path = root / spec.subdir if spec.write_mode == "overwrite_scoped": return _dispatch_overwrite_scoped(table, spec, path) @@ -283,7 +321,11 @@ def write_models(models: Any, *, settings: Settings | None = None) -> WriteResul def write_projection_matrix( - pmm: Any, matrix: Any, *, settings: Settings | None = None + pmm: Any, + matrix: Any, + *, + settings: Settings | None = None, + output_root: str | Path | None = None, ) -> WriteResult: """Enrich ``pmm`` with derived ``region_coverage`` and write it. @@ -292,9 +334,12 @@ def write_projection_matrix( alongside the model so coverage can be derived from it. The input ``pmm`` is not mutated — :func:`populate_region_coverage` returns a new instance. + + ``settings`` and ``output_root`` have the same semantics — and the same + mutual-exclusion rule — as in :func:`write_models`. 
""" enriched = populate_region_coverage(pmm, matrix) - return write_models(enriched, settings=settings) + return write_models(enriched, settings=settings, output_root=output_root) def write_cellcellconnectivitylong( diff --git a/tests/test_writers.py b/tests/test_writers.py index c9af853..8ef8945 100644 --- a/tests/test_writers.py +++ b/tests/test_writers.py @@ -326,3 +326,79 @@ class NotInRegistry: with pytest.raises(TypeError): write_models(NotInRegistry(), settings=settings) + + +# --------------------------------------------------------------------------- +# Per-call output_root override +# --------------------------------------------------------------------------- + + +def test_write_models_output_root_override_writes_to_given_path(tmp_path): + """Passing output_root= writes under that root, bypassing get_settings().""" + alt_root = tmp_path / "alt_dataset" + ds = DataSet(id="d_alt", name="alt", project_id="p_alt") + + result = write_models(ds, output_root=alt_root) + + assert result.path == alt_root / "dataset" + rows = pl.read_delta(str(alt_root / "dataset")).filter( + pl.col("id") == "d_alt" + ) + assert rows.shape[0] == 1 + + +def test_write_models_output_root_accepts_string(tmp_path): + """str and Path are both accepted for output_root.""" + alt_root = tmp_path / "string_root" + ds = DataSet(id="d_str", name="s", project_id="p_str") + + result = write_models(ds, output_root=str(alt_root)) + + assert result.path == alt_root / "dataset" + + +def test_write_models_rejects_both_settings_and_output_root(settings, tmp_path): + """Passing both settings= and output_root= raises (no precedence to memorize).""" + ds = DataSet(id="d_x", name="x", project_id="p_x") + with pytest.raises(TypeError, match="either settings= or output_root="): + write_models(ds, settings=settings, output_root=tmp_path / "other") + + +def test_write_projection_matrix_output_root_override(tmp_path): + """write_projection_matrix forwards output_root through write_models.""" + alt_root = 
tmp_path / "pmm_alt" + pmm = ProjectionMeasurementMatrix( + id="pmm_alt", + measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON, + modality=Modality.MORPHOLOGY, + laterality=Laterality.IPSILATERAL, + unit=Unit.MICRONS_LENGTH, + data_item_index=["c1", "c2"], + region_index=["r1", "r2"], + values="file:///tmp/pmm_alt.delta", + ) + matrix = np.array([[1.0, 0.0], [0.0, 2.0]]) + + result = write_projection_matrix(pmm, matrix, output_root=alt_root) + + assert result.path == alt_root / "projectionmeasurementmatrix" + + +def test_write_projection_matrix_rejects_both_settings_and_output_root( + settings, tmp_path +): + pmm = ProjectionMeasurementMatrix( + id="pmm_x", + measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON, + modality=Modality.MORPHOLOGY, + laterality=Laterality.IPSILATERAL, + unit=Unit.MICRONS_LENGTH, + data_item_index=["c1"], + region_index=["r1"], + values="file:///tmp/pmm_x.delta", + ) + matrix = np.array([[1.0]]) + with pytest.raises(TypeError, match="either settings= or output_root="): + write_projection_matrix( + pmm, matrix, settings=settings, output_root=tmp_path / "other" + ) From 76c904ff46aeb4edb3f01fc238cd90f25797d420 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Wed, 17 Jun 2026 00:27:59 +0000 Subject: [PATCH 15/25] added calcium imaging to correlative connectivity --- CHANGELOG.md | 9 +++++++++ schemas/base_schema.yaml | 2 ++ src/connects_common_connectivity/models.py | 4 ++++ 3 files changed, 15 insertions(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index 88c2eb2..5baa08b 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,6 +9,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added +- Added `CALCIUM_IMAGING` value to the `Modality` enum for calcium imaging + based functional correlations. +- Added an `output_root=` keyword to `write_models()` and + `write_projection_matrix()` for per-call overrides of the on-disk root. 
+ Accepts a `str` or `Path` and writes to `<output_root>/<spec.subdir>/`, + bypassing `ccc_config.yaml` for that call. Mutually exclusive with + `settings=` (passing both raises `TypeError`). Lets a single notebook + redirect its writes (e.g. an isolated test dataset) without mutating + process-global config or environment variables. - Added `WriteSpec` registry entries for `AlgorithmRun` and `HierarchyCategory` (both project-agnostic, scope=`["id"]`, `overwrite_scoped`). These classes are now writable through diff --git a/schemas/base_schema.yaml b/schemas/base_schema.yaml index a72221c..91fe769 100644 --- a/schemas/base_schema.yaml +++ b/schemas/base_schema.yaml @@ -26,6 +26,8 @@ enums: description: X-ray microscopy based connectivity mapping. EXPANSION_MICROSCOPY: description: Expansion microscopy based connectivity mapping. + CALCIUM_IMAGING: + description: Calcium imaging based functional correlations. OTHER: description: Other modality. ProjectionMeasurementType: diff --git a/src/connects_common_connectivity/models.py b/src/connects_common_connectivity/models.py index 1e6fb50..3233b1b 100644 --- a/src/connects_common_connectivity/models.py +++ b/src/connects_common_connectivity/models.py @@ -112,6 +112,10 @@ class Modality(str, Enum): """ Expansion microscopy based connectivity mapping. """ + CALCIUM_IMAGING = "CALCIUM_IMAGING" + """ + Calcium imaging based functional correlations. + """ OTHER = "OTHER" """ Other modality. 
From 20b77f8c187eca7c86b99251277a249340b8fb9a Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Wed, 17 Jun 2026 01:13:26 +0000 Subject: [PATCH 16/25] v1dd skeleton and initial push --- code/etl_v1dd_00_explore.ipynb | 60 +- code/etl_v1dd_01_v1196.ipynb | 1060 ++++++++++++++++++++++++++++++ etl_v1dd_01_v1196_temp_prompt.md | 119 ++++ 3 files changed, 1182 insertions(+), 57 deletions(-) create mode 100644 code/etl_v1dd_01_v1196.ipynb create mode 100644 etl_v1dd_01_v1196_temp_prompt.md diff --git a/code/etl_v1dd_00_explore.ipynb b/code/etl_v1dd_00_explore.ipynb index 48665ea..6df6557 100644 --- a/code/etl_v1dd_00_explore.ipynb +++ b/code/etl_v1dd_00_explore.ipynb @@ -492,53 +492,6 @@ "coreg_df.head(3)" ] }, - { - "cell_type": "code", - "execution_count": 11, - "id": "af630fa8-8303-4315-bfa1-6ccfe657f956", - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([ 143, 40, 98, 100, 60, 105, 12, 232, 109, 409, 269,\n", - " 230, 29, 443, 170, 145, 226, 402, 99, 144, 120, 38,\n", - " 206, 117, 21, 30, 4, 341, 22, 25, 361, 14, 19,\n", - " 0, 240, 166, 444, 159, 189, 346, 75, 360, 548, 212,\n", - " 207, 215, 89, 187, 31, 45, 6, 271, 129, 139, 158,\n", - " 237, 245, 112, 5, 177, 367, 197, 193, 318, 150, 163,\n", - " 49, 93, 368, 254, 203, 247, 33, 15, 62, 69, 90,\n", - " 119, 154, 169, 195, 192, 184, 140, 116, 222, 77, 228,\n", - " 191, 121, 94, 141, 10, 36, 52, 32, 3, 67, 108,\n", - " 70, 73, 74, 78, 92, 97, 113, 58, 125, 134, 152,\n", - " 61, 72, 17, 84, 46, 26, 39, 41, 44, 671, 48,\n", - " 43, 227, 457, 127, 122, 229, 176, 107, 87, 231, 380,\n", - " 148, 255, 379, 258, 552, 295, 251, 261, 623, 481, 34,\n", - " 316, 432, 257, 223, 211, 137, 173, 440, 294, 395, 185,\n", - " 37, 162, 183, 253, 27, 867, 16, 80, 500, 221, 155,\n", - " 160, 194, 135, 149, 161, 164, 168, 200, 115, 201, 263,\n", - " 370, 132, 114, 13, 42, 47, 59, 65, 83, 101, 9,\n", - " 128, 103, 104, 81, 133, 204, 55, 35, 50, 64, 66,\n", - " 76, 202, 82, 88, 95, 7, 18, 198, 213, 250, 289,\n", - 
" 23, 282, 287, 56, 20, 24, 28, 171, 147, 11, 8,\n", - " 172, 85, 872, 68, 509, 256, 96, 106, 118, 389, 153,\n", - " 259, 281, 243, 401, 157, 265, 317, 1, 339, 421, 462,\n", - " 479, 499, 511, 2, 460, 165, 180, 196, 314, 71, 130,\n", - " 57, 63, 86, 91, 411, 420, 306, 293, 218, 217, 182,\n", - " 267, 233, 581, 284, 142, 280, 278, 635, 388, 657, 333,\n", - " 556, 234, 365, 167, 220, 151, 291, 416, 584, 79, 326,\n", - " 475, 355, 319, 1113, 236, 383, 393])" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "coreg_df.roi.unique()" - ] - }, { "cell_type": "markdown", "id": "831e6318", @@ -562,16 +515,9 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": 12, "id": "6a252201", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-16T18:43:21.648866Z", - "iopub.status.busy": "2026-06-16T18:43:21.648688Z", - "iopub.status.idle": "2026-06-16T18:43:21.660835Z", - "shell.execute_reply": "2026-06-16T18:43:21.660207Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -656,7 +602,7 @@ "2 1 3 0 2 1.442091" ] }, - "execution_count": 7, + "execution_count": 12, "metadata": {}, "output_type": "execute_result" } diff --git a/code/etl_v1dd_01_v1196.ipynb b/code/etl_v1dd_01_v1196.ipynb new file mode 100644 index 0000000..4b629f2 --- /dev/null +++ b/code/etl_v1dd_01_v1196.ipynb @@ -0,0 +1,1060 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "title", + "metadata": {}, + "source": [ + "# ETL — V1DD release 1196 (single-notebook)\n", + "\n", + "Writes the full V1DD 1196 release into the common-connectivity schemas under one notebook, project `v1dd`.\n", + "\n", + "**Two DataSets inside project `v1dd`:**\n", + "- `v1dd_1196_em` — every EM soma in `soma_and_cell_type_1196.feather` (DataItem id = soma `id`).\n", + "- `v1dd_1196_func` — every functional ROI in `snr_by_cell.feather` (DataItem id = `f\"{volume}-{column}-{plane}-{roi}\"`).\n", + "\n", + "**Additional cohort 
DataSets (subsets of `v1dd_1196_em`):**\n", + "- `v1dd_1196_proofread_axons` — `proofread_axon_list_1196.npy`.\n", + "- `v1dd_1196_proofread_dendrites` — `proofread_dendrite_list_1196.npy`.\n", + "\n", + "**Additional cohort DataSet (subset of `v1dd_1196_func`):**\n", + "- `v1dd_1196_func_coregistered` — functional ROIs that appear in `coregistration_1196.feather`.\n", + "\n", + "Sections marked **TODO** are skeletons only; we will fill them together. Each section ends with an **open questions** list." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "imports", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:55.435587Z", + "iopub.status.busy": "2026-06-17T01:07:55.435338Z", + "iopub.status.idle": "2026-06-17T01:07:56.615920Z", + "shell.execute_reply": "2026-06-17T01:07:56.615055Z" + } + }, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import polars as pl\n", + "import pyarrow as pa\n", + "\n", + "from connects_common_connectivity.models import (\n", + " AlgorithmRun,\n", + " CellCellConnectivityLong,\n", + " CellFeatureDefinition,\n", + " CellFeatureMatrix,\n", + " CellFeatureSet,\n", + " CellToCellMapping,\n", + " Cluster,\n", + " ClusterHierarchy,\n", + " ClusterMembership,\n", + " DataItem,\n", + " DataItemDataSetAssociation,\n", + " DataSet,\n", + " MappingSet,\n", + " Modality,\n", + " SpatialLocation,\n", + ")\n", + "from connects_common_connectivity.io import write_models" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "constants", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:56.617795Z", + "iopub.status.busy": "2026-06-17T01:07:56.617515Z", + "iopub.status.idle": "2026-06-17T01:07:56.621880Z", + "shell.execute_reply": "2026-06-17T01:07:56.621183Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DATA_ROOT : /data/v1dd_1196\n", + 
"OUTPUT_ROOT : ../scratch/v1dd_1196_v1/\n", + "PROJECT_ID : v1dd\n", + "RELEASE : 1196\n", + "DATASET_EM : v1dd_1196_em\n", + "DATASET_FUNC : v1dd_1196_func\n", + "DATASET_PROOFREAD_AXON : v1dd_1196_proofread_axons\n", + "DATASET_PROOFREAD_DEND : v1dd_1196_proofread_dendrites\n", + "DATASET_FUNC_COREG : v1dd_1196_func_coregistered\n", + "HIERARCHY_ID_V1DD : v1dd_cell_types\n", + "HIERARCHY_ID_MINNIE : minnie65_csm_cell_types\n", + "FS_EM_SOMA_GEOM : v1dd_em_soma_geometry\n", + "FS_FUNC_QC : v1dd_func_qc\n", + "FS_FUNC_POSITION : v1dd_func_imaging_position\n" + ] + } + ], + "source": [ + "DATA_ROOT = Path(\"/data/v1dd_1196\")\n", + "OUTPUT_ROOT = \"../scratch/v1dd_1196_v1/\"\n", + "PROJECT_ID = \"v1dd\"\n", + "RELEASE = \"1196\"\n", + "\n", + "DATASET_EM = \"v1dd_1196_em\"\n", + "DATASET_FUNC = \"v1dd_1196_func\"\n", + "DATASET_PROOFREAD_AXON = \"v1dd_1196_proofread_axons\"\n", + "DATASET_PROOFREAD_DEND = \"v1dd_1196_proofread_dendrites\"\n", + "DATASET_FUNC_COREG = \"v1dd_1196_func_coregistered\"\n", + "\n", + "HIERARCHY_ID_V1DD = \"v1dd_cell_types\"\n", + "HIERARCHY_ID_MINNIE = \"minnie65_csm_cell_types\" # for comparison only\n", + "\n", + "FS_EM_SOMA_GEOM = \"v1dd_em_soma_geometry\"\n", + "FS_FUNC_QC = \"v1dd_func_qc\"\n", + "FS_FUNC_POSITION = \"v1dd_func_imaging_position\"\n", + "\n", + "for k, v in list(locals().items()):\n", + " if k.isupper():\n", + " print(f\"{k:24s}: {v}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "prereq", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:56.623448Z", + "iopub.status.busy": "2026-06-17T01:07:56.623169Z", + "iopub.status.idle": "2026-06-17T01:07:56.788552Z", + "shell.execute_reply": "2026-06-17T01:07:56.787849Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "All input files present.\n" + ] + } + ], + "source": [ + "# Sanity check: every expected input file is on disk.\n", + "expected = [\n", + " \"data_description.json\",\n", 
+ " \"subject.json\",\n", + " \"soma_and_cell_type_1196.feather\",\n", + " \"proofread_axon_list_1196.npy\",\n", + " \"proofread_dendrite_list_1196.npy\",\n", + " \"snr_by_cell.feather\",\n", + " \"coregistration_1196.feather\",\n", + " \"syn_df_all_to_proofread_to_all_1196.feather\",\n", + " \"syn_label_df_all_to_proofread_to_all_1196.feather\",\n", + " \"cell_cell_correlations_by_stimulus.feather\",\n", + " \"cell_cell_correlations_by_stimulus_coregistered.feather\",\n", + "]\n", + "missing = [f for f in expected if not (DATA_ROOT / f).exists()]\n", + "assert not missing, f\"Missing input files: {missing}\"\n", + "print(\"All input files present.\")" + ] + }, + { + "cell_type": "markdown", + "id": "master-id-decision", + "metadata": {}, + "source": [ + "## Master decision — `DataItem.id = pt_root_id`\n", + "\n", + "Every downstream file in this release is keyed by `pt_root_id`; only `soma_and_cell_type_1196.feather` carries the per-detection soma `id`. We therefore use `str(pt_root_id)` as the EM `DataItem.id` for the whole notebook and treat soma-centroid data as features attached to the cell.\n", + "\n", + "### Where each id appears\n", + "\n", + "| File | soma `id` | `pt_root_id` |\n", + "|---|:---:|:---:|\n", + "| `soma_and_cell_type_1196.feather` | ✅ | ✅ |\n", + "| `proofread_axon_list_1196.npy` | — | ✅ |\n", + "| `proofread_dendrite_list_1196.npy` | — | ✅ |\n", + "| `coregistration_1196.feather` | — | ✅ |\n", + "| `syn_df_all_to_proofread_to_all_1196.feather` | — | ✅ (`pre_pt_root_id`, `post_pt_root_id`) |\n", + "| `cell_cell_correlations_by_stimulus_coregistered.feather` | — | ✅ |\n", + "| `snr_by_cell.feather`, `cell_cell_correlations_by_stimulus.feather` | — | — (functional ROI tuples) |\n", + "\n", + "### Key counts from `soma_and_cell_type` (207,455 rows)\n", + "\n", + "| quantity | value |\n", + "|---|---:|\n", + "| unique soma `id` | 207,455 |\n", + "| unique `pt_root_id` | 163,064 |\n", + "| rows with `pt_root_id == 0` (orphan detections) | 3,835 
|\n", + "| rows with non-zero `pt_root_id` | 203,620 |\n", + "| unique non-zero `pt_root_id` | 163,063 |\n", + "| `pt_root_id`s with > 1 soma row | 19,615 (~12 %) |\n", + "| max soma rows for a single `pt_root_id` | 184 |\n", + "\n", + "### Policy\n", + "\n", + "- **EM `DataItem.id = str(pt_root_id)`** (one row per segment), `name = str(pt_root_id)`.\n", + "- **Drop `pt_root_id == 0` rows** — they cannot be referenced from any other file and so cannot be cohort-associated or linked.\n", + "- **Collapse multi-soma `pt_root_id`s** to one DataItem by picking the soma row with the largest `volume` (largest nucleus detection is the most plausible primary soma); the other rows are dropped from the cell-features matrix. Number of collapsed cells: 19,615; rows discarded: 207,455 − 163,063 − 3,835 = 40,557.\n", + "- **Downstream joins are direct lookups** on `pt_root_id` everywhere. No `pt_root_id → soma_id` resolution step is needed in §2, §5, §8, §9, §10.\n", + "\n", + "This matches the way the rest of the V1DD release is keyed and aligns with Minnie's nucleus-per-segment convention (the 12 % multi-detection rate is the only thing that differs — Minnie's `nucleus_detection_lookup_v1` already does the collapse at the source)." + ] + }, + { + "cell_type": "markdown", + "id": "s1-md", + "metadata": {}, + "source": [ + "## 1. `DataSet` rows\n", + "\n", + "Five DataSet rows under `project_id=\"v1dd\"`. 
Provenance comes from `data_description.json`; `publication` points at the V1DD physiology repository.\n", + "\n", + "| id | modality | parent |\n", + "|---|---|---|\n", + "| `v1dd_1196_em` | `ELECTRON_MICROSCOPY` | — |\n", + "| `v1dd_1196_proofread_axons` | `ELECTRON_MICROSCOPY` | subset of `v1dd_1196_em` |\n", + "| `v1dd_1196_proofread_dendrites` | `ELECTRON_MICROSCOPY` | subset of `v1dd_1196_em` |\n", + "| `v1dd_1196_func` | `CALCIUM_IMAGING` | — |\n", + "| `v1dd_1196_func_coregistered` | `CALCIUM_IMAGING` | subset of `v1dd_1196_func` |" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "s1-code", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:56.790316Z", + "iopub.status.busy": "2026-06-17T01:07:56.790121Z", + "iopub.status.idle": "2026-06-17T01:07:57.314099Z", + "shell.execute_reply": "2026-06-17T01:07:57.296497Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataSet rows written: 5\n" + ] + } + ], + "source": [ + "V1DD_PUBLICATION = \"https://github.com/AllenInstitute/v1dd_physiology\"\n", + "\n", + "datasets = [\n", + " DataSet(\n", + " id=DATASET_EM,\n", + " name=\"V1DD release 1196 — EM somas\",\n", + " publication=V1DD_PUBLICATION,\n", + " modality=Modality.ELECTRON_MICROSCOPY.value,\n", + " project_id=PROJECT_ID,\n", + " ),\n", + " DataSet(\n", + " id=DATASET_PROOFREAD_AXON,\n", + " name=\"V1DD release 1196 — proofread axons cohort\",\n", + " publication=V1DD_PUBLICATION,\n", + " modality=Modality.ELECTRON_MICROSCOPY.value,\n", + " project_id=PROJECT_ID,\n", + " ),\n", + " DataSet(\n", + " id=DATASET_PROOFREAD_DEND,\n", + " name=\"V1DD release 1196 — proofread dendrites cohort\",\n", + " publication=V1DD_PUBLICATION,\n", + " modality=Modality.ELECTRON_MICROSCOPY.value,\n", + " project_id=PROJECT_ID,\n", + " ),\n", + " DataSet(\n", + " id=DATASET_FUNC,\n", + " name=\"V1DD release 1196 — functional 2P ROIs\",\n", + " publication=V1DD_PUBLICATION,\n", + " 
modality=Modality.CALCIUM_IMAGING.value,\n", + " project_id=PROJECT_ID,\n", + " ),\n", + " DataSet(\n", + " id=DATASET_FUNC_COREG,\n", + " name=\"V1DD release 1196 — coregistered functional ROIs cohort\",\n", + " publication=V1DD_PUBLICATION,\n", + " modality=Modality.CALCIUM_IMAGING.value,\n", + " project_id=PROJECT_ID,\n", + " ),\n", + "]\n", + "\n", + "result = write_models(datasets, output_root=OUTPUT_ROOT)\n", + "print(f\"DataSet rows written: {result.rows_written}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "s1-verify", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:57.316012Z", + "iopub.status.busy": "2026-06-17T01:07:57.315796Z", + "iopub.status.idle": "2026-06-17T01:07:57.353718Z", + "shell.execute_reply": "2026-06-17T01:07:57.352843Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "shape: (5, 5)\n", + " id name modality publication\n", + " v1dd_1196_func_coregistered V1DD release 1196 — coregistered functional ROIs cohort CALCIUM_IMAGING https://github.com/AllenInstitute/v1dd_physiology\n", + " v1dd_1196_func V1DD release 1196 — functional 2P ROIs CALCIUM_IMAGING https://github.com/AllenInstitute/v1dd_physiology\n", + "v1dd_1196_proofread_dendrites V1DD release 1196 — proofread dendrites cohort ELECTRON_MICROSCOPY https://github.com/AllenInstitute/v1dd_physiology\n", + " v1dd_1196_proofread_axons V1DD release 1196 — proofread axons cohort ELECTRON_MICROSCOPY https://github.com/AllenInstitute/v1dd_physiology\n", + " v1dd_1196_em V1DD release 1196 — EM somas ELECTRON_MICROSCOPY https://github.com/AllenInstitute/v1dd_physiology\n", + "\n", + "OK — 5 DataSet rows for project v1dd.\n" + ] + } + ], + "source": [ + "ds_verify = (\n", + " pl.read_delta(OUTPUT_ROOT + \"dataset/\")\n", + " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", + ")\n", + "print(\"shape:\", ds_verify.shape)\n", + "print(ds_verify.select([\"id\", \"name\", \"modality\", 
\"publication\"]).to_pandas().to_string(index=False))\n", + "\n", + "expected_ids = {DATASET_EM, DATASET_PROOFREAD_AXON, DATASET_PROOFREAD_DEND,\n", + " DATASET_FUNC, DATASET_FUNC_COREG}\n", + "got_ids = set(ds_verify[\"id\"].to_list())\n", + "assert expected_ids <= got_ids, f\"missing DataSet ids: {expected_ids - got_ids}\"\n", + "assert ds_verify[\"id\"].n_unique() == ds_verify.shape[0], \"duplicate DataSet ids\"\n", + "modalities = dict(zip(ds_verify[\"id\"].to_list(), ds_verify[\"modality\"].to_list()))\n", + "for em_id in (DATASET_EM, DATASET_PROOFREAD_AXON, DATASET_PROOFREAD_DEND):\n", + " assert modalities[em_id] == Modality.ELECTRON_MICROSCOPY.value, em_id\n", + "for fn_id in (DATASET_FUNC, DATASET_FUNC_COREG):\n", + " assert modalities[fn_id] == Modality.CALCIUM_IMAGING.value, fn_id\n", + "print(\"\\nOK — 5 DataSet rows for project v1dd.\")" + ] + }, + { + "cell_type": "markdown", + "id": "s2-md", + "metadata": {}, + "source": [ + "## 2. EM `DataItem`s and `DataItemDataSetAssociation`s\n", + "\n", + "Per the master decision: one EM `DataItem` per unique non-zero `pt_root_id` in `soma_and_cell_type_1196.feather`. `id = name = str(pt_root_id)`. Where multiple soma rows share a `pt_root_id`, keep the one with the largest `volume`.\n", + "\n", + "Associations:\n", + "- Every kept EM cell → `v1dd_1196_em`.\n", + "- Cells whose `pt_root_id` ∈ `proofread_axon_list_1196.npy` → also `v1dd_1196_proofread_axons` (46/1210 proofread roots are absent from the soma catalog and will be skipped — see exploration cell).\n", + "- Cells whose `pt_root_id` ∈ `proofread_dendrite_list_1196.npy` → also `v1dd_1196_proofread_dendrites` (63986/63986 present).\n", + "\n", + "**Open questions:**\n", + "1. Proofread axon roots missing from the soma catalog (46/1210) — skip silently or log + skip? (Leaning: log + skip; they're real proofread cells without a soma centroid in this release.)\n", + "2. 
`neuroglancer_link` slot — V1DD has a public neuroglancer state; if a URL template is available we should populate it. Leaving null until a template is confirmed." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "s2-load", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:57.355615Z", + "iopub.status.busy": "2026-06-17T01:07:57.355409Z", + "iopub.status.idle": "2026-06-17T01:07:57.686941Z", + "shell.execute_reply": "2026-06-17T01:07:57.686050Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "soma_df shape : (207455, 11)\n", + "pt_root_id unique : 163064\n", + "id unique : 207455\n", + "axon ids unique : 1210\n", + "dend ids unique : 63986\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "axon ∩ soma roots : 1164\n", + "dend ∩ soma roots : 63986\n" + ] + } + ], + "source": [ + "soma_df = pd.read_feather(DATA_ROOT / \"soma_and_cell_type_1196.feather\")\n", + "axon_ids = np.load(DATA_ROOT / \"proofread_axon_list_1196.npy\", allow_pickle=True)\n", + "dend_ids = np.load(DATA_ROOT / \"proofread_dendrite_list_1196.npy\", allow_pickle=True)\n", + "\n", + "print(\"soma_df shape :\", soma_df.shape)\n", + "print(\"pt_root_id unique :\", soma_df['pt_root_id'].nunique())\n", + "print(\"id unique :\", soma_df['id'].nunique())\n", + "print(\"axon ids unique :\", len(set(axon_ids.tolist())))\n", + "print(\"dend ids unique :\", len(set(dend_ids.tolist())))\n", + "print(\"axon ∩ soma roots :\", len(set(axon_ids.tolist()) & set(soma_df['pt_root_id'].tolist())))\n", + "print(\"dend ∩ soma roots :\", len(set(dend_ids.tolist()) & set(soma_df['pt_root_id'].tolist())))" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "s2-code", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:57.688584Z", + "iopub.status.busy": "2026-06-17T01:07:57.688389Z", + "iopub.status.idle": "2026-06-17T01:07:57.691184Z", + "shell.execute_reply": 
"2026-06-17T01:07:57.690436Z" + } + }, + "outputs": [], + "source": [ + "# TODO: build EM DataItem rows + base associations to DATASET_EM.\n", + "# TODO: join axon/dend lists on pt_root_id to derive cohort association rows.\n", + "# TODO: write_models(...) + verify counts.\n", + "pass" + ] + }, + { + "cell_type": "markdown", + "id": "s3-md", + "metadata": {}, + "source": [ + "## 3. EM soma `CellFeatureMatrix` (`v1dd_em_soma_geometry`)\n", + "\n", + "Numeric features per EM DataItem:\n", + "- `pt_position_x`, `pt_position_y`, `pt_position_z` — voxel coordinates in the EM volume.\n", + "- `volume` — soma volume (units to confirm; likely µm³).\n", + "\n", + "Three `CellFeatureDefinition` rows (dtype ` soma DataItem id.\n", + "# TODO: write mappingset/ and celltocellmapping/ with correct predicates.\n", + "pass" + ] + }, + { + "cell_type": "markdown", + "id": "s9-md", + "metadata": {}, + "source": [ + "## 9. Synapses — `CellCellConnectivityLong`\n", + "\n", + "Source: `syn_df_all_to_proofread_to_all_1196.feather` (8.2M rows) + `syn_label_df_all_to_proofread_to_all_1196.feather` (6.7M tag rows, indexed by synapse `id`).\n", + "\n", + "Aggregate per (`pre_pt_root_id`, `post_pt_root_id`) pair into:\n", + "- `synapse_count` — number of synapses (count, dimensionless).\n", + "- `synapse_size_sum` — total `size` (voxel-count; units to confirm).\n", + "- *(optional)* `spine_synapse_count` — count of synapses tagged `spine`.\n", + "\n", + "Write to its own subdirectory per §5g: `cellcellconnectivitylong_proofread_to_proofread/` (folder name from the source feather). Pre/post cell ids are EM `DataItem` ids — i.e. `str(pt_root_id)` directly, no join needed.\n", + "\n", + "**Open question — defer decision:** *Should raw per-synapse rows get their own schema?* Today `CellCellConnectivityLong` collapses to one row per cell pair, which loses per-synapse position, size, and label information. 
Two paths:\n", + "- Keep aggregated only; ship raw rows as a parquet sidecar outside the common schema.\n", + "- Propose a new `Synapse` class (slots: id, pre_cell, post_cell, ctr_position, size, tag) — would require a schema PR.\n", + "\n", + "Leaving this **open**; the skeleton implements the aggregated form only.\n", + "\n", + "**Other open questions:**\n", + "1. What `measurement_type` enum value covers `synapse_count` and `synapse_size_sum`? Need to read `SynapticMeasurementType` enum values.\n", + "2. Unit for `synapse_size_sum` — `size` is in voxels (need confirmation); convert to nm³ or leave as voxel counts?\n", + "3. Most synapse endpoints (≈4.2M roots, of which only ~59k are in the soma catalog) have no matching EM `DataItem` — `CellCellConnectivityLong` requires both endpoints to be registered DataItems, so the un-cataloged endpoints must be dropped or we must register additional \"synapse-partner\" DataItems for them. Leaning: drop the un-cataloged side and keep only edges where both endpoints are in `v1dd_1196_em`.\n", + "4. The label feather is indexed by synapse `id` but is shorter than the main synapse table (6.7M vs 8.2M) — unlabelled synapses should be treated as `tag=null`, not implicitly `non-spine`.\n", + "5. Connectome discriminator — per §5g, the folder name scopes the example; confirm `cellcellconnectivitylong_proofread_to_proofread/` is the right convention." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "s9-code", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:57.982113Z", + "iopub.status.busy": "2026-06-17T01:07:57.981950Z", + "iopub.status.idle": "2026-06-17T01:07:57.984248Z", + "shell.execute_reply": "2026-06-17T01:07:57.983638Z" + } + }, + "outputs": [], + "source": [ + "# TODO: load syn_df + syn_label_df; join labels on synapse id.\n", + "# TODO: groupby (pre, post) -> synapse_count, synapse_size_sum, spine_count.\n", + "# TODO: join pre/post pt_root_id -> EM DataItem id.\n", + "# TODO: build CellCellConnectivityLong rows (one per pair per measurement_type).\n", + "# TODO: write to cellcellconnectivitylong_proofread_to_proofread/.\n", + "pass" + ] + }, + { + "cell_type": "markdown", + "id": "s10-md", + "metadata": {}, + "source": [ + "## 10. Functional cell-cell correlations — `CellCellConnectivityLong`\n", + "\n", + "Two source tables, seven stimulus conditions each:\n", + "- `cell_cell_correlations_by_stimulus_coregistered.feather` — keyed by `pre_pt_root_id` / `post_pt_root_id` (which **are** the EM DataItem ids). Pairs **do** repeat (multiple ROIs per EM cell, ~4 %). Three options: (a) write rows directly as (EM, EM) pairs and let consumers see the duplicates, (b) average correlations within each (pre, post) pair, (c) explode back into functional DataItem ids via coreg and write at (func, func) level.\n", + "- `cell_cell_correlations_by_stimulus.feather` — keyed by `(volume, column, plane, roi)` × 2. Tuples are unique. Maps cleanly to functional DataItem ids.\n", + "\n", + "Skeleton plan: one folder per (table, stimulus), e.g. `cellcellconnectivitylong_func_corr_drifting_gratings_full/`, `cellcellconnectivitylong_func_corr_coreg_drifting_gratings_full/`. 7 stimuli × 2 tables = 14 folders.\n", + "\n", + "Verified earlier: in the coregistered table, 148728 rows reduce to 142410 unique (pre_root, post_root) pairs — i.e. ~4% of rows share a pair with another row.
12 self-pairs exist. Pre-set == post-set (551 cells, fully symmetric).\n", + "\n", + "**Open questions:**\n", + "1. **Which key for the coregistered table?** With EM ids = pt_root_id, options (a)/(b)/(c) above are all on the table. (a) is the most direct; (b) loses ROI-level information; (c) requires picking which of the multiple coreg ROIs gets the correlation when collapsing pre side and same on post.\n", + "2. **Symmetry** — Pearson correlation is symmetric (corr(a,b) == corr(b,a)). The table appears to include both directions (8.8M rows ≈ N*(N-1) not N*(N-1)/2). Should we deduplicate, or keep as-is for ease of querying? `CellCellConnectivityLong` doesn't enforce direction.\n", + "3. **Self-pairs** — drop the 12 self-pair rows in the coregistered table?\n", + "4. **Measurement type / unit** — need a `SynapticMeasurementType` enum value for \"Pearson correlation\". If none exists, propose `pearson_correlation`?\n", + "5. **Scale** — 7 stimuli × 8.8M = 62M rows for the all-ROI table. Write all of it, or threshold (|r| > 0.1) and keep a sparse view? Storage cost vs query utility tradeoff.\n", + "6. Per §5g, do we need a per-stimulus folder, or can we use one folder with `measurement_type` as the discriminator? 
(Schema has only one `measurement_type` enum per row, so per-stimulus folders are the natural fit unless the enum has a `pearson_correlation_` variant — unlikely.)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "s10-explore", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:07:57.985687Z", + "iopub.status.busy": "2026-06-17T01:07:57.985523Z", + "iopub.status.idle": "2026-06-17T01:08:11.820420Z", + "shell.execute_reply": "2026-06-17T01:08:11.819635Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "== coregistered table ==\n", + "rows : 148728\n", + "unique pre roots : 551\n", + "unique post roots : 551\n", + "unique (pre,post) : 142410\n", + "self pairs : 12\n", + "pre set == post set : True\n", + "\n", + "== all-ROI table ==\n", + "rows : 8846260\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "unique tuples : 8846260\n", + "self pairs (same vol,col,pln,roi): 0\n" + ] + } + ], + "source": [ + "corr_df = pd.read_feather(DATA_ROOT / \"cell_cell_correlations_by_stimulus.feather\")\n", + "corr_co_df = pd.read_feather(DATA_ROOT / \"cell_cell_correlations_by_stimulus_coregistered.feather\")\n", + "\n", + "print(\"== coregistered table ==\")\n", + "print(\"rows :\", len(corr_co_df))\n", + "print(\"unique pre roots :\", corr_co_df['pre_pt_root_id'].nunique())\n", + "print(\"unique post roots :\", corr_co_df['post_pt_root_id'].nunique())\n", + "print(\"unique (pre,post) :\", corr_co_df.drop_duplicates(['pre_pt_root_id','post_pt_root_id']).shape[0])\n", + "print(\"self pairs :\", (corr_co_df['pre_pt_root_id']==corr_co_df['post_pt_root_id']).sum())\n", + "print(\"pre set == post set :\", set(corr_co_df['pre_pt_root_id'].unique()) == set(corr_co_df['post_pt_root_id'].unique()))\n", + "print()\n", + "key = ['pre_roi','post_roi','pre_plane','post_plane','column','volume']\n", + "print(\"== all-ROI table ==\")\n", + "print(\"rows :\", len(corr_df))\n", + 
"print(\"unique tuples :\", corr_df.drop_duplicates(key).shape[0])\n", + "print(\"self pairs (same vol,col,pln,roi):\",\n", + " ((corr_df['pre_roi']==corr_df['post_roi']) & (corr_df['pre_plane']==corr_df['post_plane'])).sum())" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "s10-code", + "metadata": { + "execution": { + "iopub.execute_input": "2026-06-17T01:08:11.822465Z", + "iopub.status.busy": "2026-06-17T01:08:11.822184Z", + "iopub.status.idle": "2026-06-17T01:08:11.824801Z", + "shell.execute_reply": "2026-06-17T01:08:11.824101Z" + } + }, + "outputs": [], + "source": [ + "# TODO: pivot each table -> long form (one row per pair per stimulus).\n", + "# TODO: resolve open questions above (key choice, dedup, threshold).\n", + "# TODO: write 14 folders cellcellconnectivitylong_func_corr_{coreg_,}/.\n", + "pass" + ] + }, + { + "cell_type": "markdown", + "id": "summary-md", + "metadata": {}, + "source": [ + "## Summary (skeleton)\n", + "\n", + "| Output path | Class | Rows | Status |\n", + "|---|---|---|---|\n", + "| `dataset/` | `DataSet` × 5 | 5 | skeleton |\n", + "| `dataitem/` | `DataItem` | ~207k EM + ~4.5k func | skeleton |\n", + "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | ~213k base + cohort rows | skeleton |\n", + "| `cellfeaturedefinition/`, `cellfeatureset/`, `cellfeaturematrix/`, `cellfeatures/v1dd_em_soma_geometry/` | EM soma geometry | 4 defs, 1 set, 1 matrix, 207k cell rows | skeleton |\n", + "| `singlecellreconstruction/` | `SingleCellReconstruction` | ≤207k (drops NaN trform) | skeleton |\n", + "| `cluster/`, `clusterhierarchy/`, `algorithmrun/`, `clustermembership/` | V1DD taxonomy | 15, 1, 1, parent-propagated | skeleton |\n", + "| `cellfeatures/v1dd_func_qc/`, `cellfeatures/v1dd_func_imaging_position/` | Functional features | 1 + 4 defs, 2 sets, 2 matrices | skeleton |\n", + "| `mappingset/`, `celltocellmapping/` | EM↔func coregistration | 1 set, 571 rows | skeleton |\n", + "| 
`cellcellconnectivitylong_proofread_to_proofread/` | Synapse aggregation | ~N pairs × M measurement types | skeleton |\n", + "| `cellcellconnectivitylong_func_corr_<stimulus>/` × 7 | All-ROI correlations | ~8.8M per stim | skeleton |\n", + "| `cellcellconnectivitylong_func_corr_coreg_<stimulus>/` × 7 | Coreg correlations | ~149k per stim | skeleton |\n", + "\n", + "**Cross-section open questions (need answers before we wire the writes):**\n", + "- pt_root_id → soma_id join policy (multi-match, missing): resolved by the master id decision above; §2, §8, §9 now join directly on `pt_root_id`.\n", + "- Modality enum value for calcium imaging: resolved; §1 uses `Modality.CALCIUM_IMAGING`.\n", + "- Reference space + units for `pt_position_trform_*` — §3, §4.\n", + "- Per-synapse schema vs aggregation-only — §9.\n", + "- Correlation key (EM cell vs functional ROI) and symmetry handling — §10." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/etl_v1dd_01_v1196_temp_prompt.md b/etl_v1dd_01_v1196_temp_prompt.md new file mode 100644 index 0000000..ff66bc3 --- /dev/null +++ b/etl_v1dd_01_v1196_temp_prompt.md @@ -0,0 +1,119 @@ +# Handoff prompt — continue building `etl_v1dd_01_v1196.ipynb` + +You are picking up an in-progress ETL notebook that ingests the V1DD release 1196 dataset into the Common-Connectivity (CCC) Delta-lake schemas. One previous agent built the skeleton + section 1 (DataSets). The user wants the remaining sections filled in **one at a time, together** — finish a section, show the result, wait for the user to review before moving on. + +--- + +## Read first (in this order) + +### Authoritative conventions +- `/root/capsule/etl_example_prompt.md` — full ETL conventions guide.
**Read end-to-end before writing any code.** Pay special attention to: + - §2 hard rules (never edit `src/` or `models.py`; never cast ids; use enum `.value`; every write has a verification cell). + - §4 canonical notebook structure. + - §5a–§5j write patterns per table family. + - §10 common mistakes table. +- `/root/capsule/CHANGELOG.md` — only relevant if you end up changing schemas (don't unless the user asks). + +### The notebook in progress (the one you'll be editing) +- `/root/capsule/code/etl_v1dd_01_v1196.ipynb` + - Cells 0–6: title, imports, constants, prereq check, **master id decision**, §1 DataSets (DONE, written + verified). + - Cells 7+: §2…§10 are skeletons with markdown plans + `# TODO` code stubs + per-section open questions. + - **`OUTPUT_ROOT = "../scratch/v1dd_1196_v1/"`** — relative to `code/`. The §1 outputs are already there under `dataset/`. + - Re-execute the whole notebook with `cd /root/capsule/code && uv run jupyter nbconvert --to notebook --execute --inplace etl_v1dd_01_v1196.ipynb` after every change. + +### Exploration / scratch reference +- `/root/capsule/code/etl_v1dd_00_explore.ipynb` — initial exploration of every input file with schema-fit notes. Useful for sanity-checking shapes/columns. + +### Example notebooks to mirror (same modality as V1DD = MICrONS Minnie) +- `/root/capsule/code/etl_minnie_01_dataset_dataitem.ipynb` — DataSet + DataItem + association pattern. +- `/root/capsule/code/etl_minnie_02_cell_features.ipynb` — CellFeatureSet/Definition/Matrix + wide parquet. +- `/root/capsule/code/etl_minnie_03_cluster_and_cluster_membership.ipynb` — Cluster taxonomy + parent-propagated memberships. +- `/root/capsule/code/etl_minnie_04_cell_cell.ipynb` — CellCellConnectivityLong with per-example folder convention. +- `/root/capsule/code/etl_wnm_exc_04_projection_matrix.ipynb` — for the `SingleCellReconstruction` + `SpatialLocation` pattern if needed. 
+ +### Schemas (source of truth — do not modify without explicit user request) +- `/root/capsule/schemas/base_schema.yaml` +- `/root/capsule/schemas/core_schema.yaml` — `DataSet`, `DataItem`, `DataItemDataSetAssociation`, `SpatialLocation`, `Modality`. +- `/root/capsule/schemas/cell_features_schema.yaml` +- `/root/capsule/schemas/clustering_schema.yaml` +- `/root/capsule/schemas/mappings_schema.yaml` — `MappingSet`, `CellToCellMapping`, `CellToClusterMapping`. +- `/root/capsule/schemas/cell_cell_schema.yaml` — `CellCellConnectivityLong`, `SynapticMeasurementType` enum. +- `/root/capsule/schemas/single_cell_schema.yaml` — `SingleCellReconstruction`. +- The user has already added `Modality.CALCIUM_IMAGING` and regenerated `src/connects_common_connectivity/models.py`. Trust this. + +### Package utilities (read-only) +- `/root/capsule/src/connects_common_connectivity/models.py` — auto-generated pydantic models; read to confirm field names and enum values. +- `/root/capsule/src/connects_common_connectivity/io/writers.py` — `write_models(models, *, output_root=...)` dispatches by class. Use `output_root=OUTPUT_ROOT` because we are NOT writing to the shared `ccc_config.yaml` location. +- `/root/capsule/src/connects_common_connectivity/io/write_spec.py` — per-class WriteSpec (predicates, partition keys); consult before any write that isn't already in the writer registry. +- `/root/capsule/src/connects_common_connectivity/write_utils.py` — `append_new_dataitems`, `walk_ancestors`. +- `/root/capsule/src/connects_common_connectivity/arrow_utils.py` — `build_arrow_schema`, `models_to_table`, `attach_linkml_metadata`, `build_cell_feature_matrix_schema` (kwarg-only — see §10 of the prompt guide). 
+ +--- + +## Raw V1DD data — `/data/v1dd_1196/` + +| File | Shape | Notes | +|---|---|---| +| `data_description.json`, `subject.json`, `metadata.nd.json` | aind-data-schema records | provenance; `name`, `project_name`, modalities, S3 location | +| `soma_and_cell_type_1196.feather` | (207 455, 11) | soma centroids + `cell_type_coarse` ∈ {E,I} + `cell_type` (12 leaves) | +| `proofread_axon_list_1196.npy` | (1 210,) int64 | `pt_root_id`s with proofread axons; 1164/1210 are in soma catalog | +| `proofread_dendrite_list_1196.npy` | (63 986,) int64 | `pt_root_id`s with proofread dendrites; all in soma catalog | +| `snr_by_cell.feather` | (4 458, 5) | functional ROI `(volume, column, plane, roi)` + `snr` | +| `coregistration_1196.feather` | (571, 5) | EM↔functional mapping; pre/post not unique on either side | +| `syn_df_all_to_proofread_to_all_1196.feather` | (8 204 497, 13) | per-synapse rows; `pre_pt_root_id`/`post_pt_root_id` + positions + `size` | +| `syn_label_df_all_to_proofread_to_all_1196.feather` | (6 706 286, 1) | per-synapse `tag` (`spine`, …), indexed by synapse `id` | +| `cell_cell_correlations_by_stimulus.feather` | (8 846 260, 13) | all-ROI functional Pearson corr per stimulus, ROI-tuple-keyed, tuples unique | +| `cell_cell_correlations_by_stimulus_coregistered.feather` | (148 728, 9) | same but EM-rootid-keyed; 142 410 unique pairs (≈4 % repeat), 12 self-pairs | + +--- + +## Master decisions already made (do not relitigate) + +- **One notebook for all of V1DD 1196**, no `_02`/`_03` follow-ups. +- **`OUTPUT_ROOT = "../scratch/v1dd_1196_v1/"`.** +- **`PROJECT_ID = "v1dd"`.** +- **Five DataSets** (`v1dd_1196_em`, `v1dd_1196_proofread_axons`, `v1dd_1196_proofread_dendrites`, `v1dd_1196_func`, `v1dd_1196_func_coregistered`). Already written, do not rewrite. +- **EM `DataItem.id = str(pt_root_id)`** — single source of truth for EM cells. 
See the `master-id-decision` markdown cell for the table + numbers + collapse policy (drop `pt_root_id==0`, keep largest-`volume` soma when multiple rows share a root). +- **Functional `DataItem.id = f"{volume}-{column}-{plane}-{roi}"`** (planned in §6 skeleton). +- **`publication = "https://github.com/AllenInstitute/v1dd_physiology"`** for every DataSet. + +--- + +## Sections still to fill (in order) + +| § | Title | Status | +|---|---|---| +| 1 | DataSet rows | ✅ DONE | +| 2 | EM DataItems + cohort associations | TODO — next up | +| 3 | EM soma `CellFeatureMatrix` (`v1dd_em_soma_geometry`) | TODO | +| 4 | `SingleCellReconstruction` + `SpatialLocation` (CCF) | TODO | +| 5 | V1DD cell-type taxonomy (Cluster + ClusterMembership) | TODO; includes a v1dd↔minnie taxonomy comparison table already verified | +| 6 | Functional DataItems + coregistered cohort | TODO | +| 7 | Functional feature sets (`v1dd_func_qc`, `v1dd_func_imaging_position`) | TODO | +| 8 | `CellToCellMapping` for EM↔functional coregistration | TODO | +| 9 | Synapse aggregation → `CellCellConnectivityLong` | TODO (per-synapse schema question is intentionally OPEN) | +| 10 | Functional correlations → `CellCellConnectivityLong` × 7 stimuli × 2 tables | TODO | + +Each section has open questions in its markdown cell. **Ask the user before answering them yourself** — they want to review each section before you wire the writes. + +--- + +## Working agreement (per user instruction) + +1. **Build one section at a time.** Do not jump ahead. After each section: re-execute the full notebook, show the verification cell output to the user, then stop and wait. +2. **Update the markdown of each section as decisions are resolved** — remove answered open questions, keep unresolved ones, keep the section concise. +3. **Don't touch `src/` or `schemas/`** unless the user explicitly asks for a schema change. +4. 
**Don't relitigate the master id decision.** If a section's open question is rendered moot by it, just delete the question. +5. **Verification cell after every write** — read back with `pl.read_delta`, print shape + head(3), assert at least one invariant (row count, unique ids, expected categorical value). +6. **Use `write_models(..., output_root=OUTPUT_ROOT)`** for everything that has a WriteSpec. For things without one (wide-form feature parquets, cell-cell folders), fall back to `deltalake.write_deltalake` with the patterns in §5b–§5g of the prompt guide. +7. **Run the notebook headless to validate**: `cd /root/capsule/code && uv run jupyter nbconvert --to notebook --execute --inplace etl_v1dd_01_v1196.ipynb`. + +--- + +## Next action when the user returns + +Start with **§2 (EM DataItems + cohort associations)**. The skeleton is in place and the master id decision is documented. The two remaining open questions in that section are: +1. Proofread axon roots missing from the soma catalog (46/1210) — skip silently or log + skip? +2. `neuroglancer_link` — populate, or leave null? + +Ask the user, then implement, write, verify, and stop. 
From 0122e2d642559a32ff4153c5f3b78fb3a840da3a Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 09:36:45 +0000 Subject: [PATCH 17/25] notebook migration clean up --- CHANGELOG.md | 6 +- code/etl_minnie_01_dataset_dataitem.ipynb | 179 +++++------------- ...l_visp_inh_patchseq_02_cell_features.ipynb | 57 ++---- ...eq_03_cluster_membership_and_mapping.ipynb | 54 ++---- code/etl_wnm_exc_02_cell_features.ipynb | 77 +++----- planning/TODO.md | 21 +- 6 files changed, 127 insertions(+), 267 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 5baa08b..491e56d 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -45,8 +45,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 hardcoded `OUTPUT_ROOT = "../scratch/..."` strings are replaced with `output_root()` from `connects_common_connectivity.config`, and hand-rolled `write_deltalake(..., mode=..., predicate=..., partition_by=...)` - calls for registry-backed models are replaced with `write_models(...)` - (and `write_projection_matrix(...)` for projection matrices). + calls for every registry-backed model are replaced with `write_models(...)` + (and `write_projection_matrix(...)` for projection-matrix metadata rows). + Wide cell-feature / projection-matrix parquets and `CellCellConnectivityLong` + writes remain on raw `write_deltalake` pending registry support. - Moved `arrow_utils` and `write_utils` under `connects_common_connectivity.io.*`. 
diff --git a/code/etl_minnie_01_dataset_dataitem.ipynb b/code/etl_minnie_01_dataset_dataitem.ipynb index ed3d463..b150252 100644 --- a/code/etl_minnie_01_dataset_dataitem.ipynb +++ b/code/etl_minnie_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 Minnie65: DataSet & DataItem\n", + "# ETL — Minnie65: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"minnie65_v1300_nuclei\"`, `project_id = \"minnie65\"`) and one `DataItem` per nucleus from the CAVE `nucleus_detection_lookup_v1` view at materialization version 1300, plus the corresponding `DataItemDataSetAssociation` links. Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) and cell features are written by later notebooks." ] @@ -50,7 +50,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v2/\n", + "OUTPUT_ROOT : ../scratch/em_patchseq_wnm_v3/\n", "PROJECT_ID : minnie65\n", "DATASET_ID : minnie65_v1300_nuclei\n", "CAVE_DATASTACK : minnie65_phase3_v1\n", @@ -88,92 +88,21 @@ "metadata": {}, "outputs": [ { - "name": "stdout", - "output_type": "stream", - "text": [ - "Shape: (133969, 7)\n" + "ename": "HTTPError", + "evalue": "503 Server Error: Service Temporarily Unavailable for url: https://minnie.microns-daf.com/materialize/version content:b'\\r\\n503 Service Temporarily Unavailable\\r\\n\\r\\n

503 Service Temporarily Unavailable\\r\\nnginx
\\r\\n\\r\\n\\r\\n'", + "output_type": "error", + "traceback": [ + "\u001b[31m---------------------------------------------------------------------------\u001b[39m", + "\u001b[31mHTTPError\u001b[39m Traceback (most recent call last)", + "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[3]\u001b[39m\u001b[32m, line 2\u001b[39m\n\u001b[32m 1\u001b[39m client = caveclient.CAVEclient(CAVE_DATASTACK, auth_token=os.environ[\u001b[33m\"CUSTOM_KEY\"\u001b[39m])\n\u001b[32m----> \u001b[39m\u001b[32m2\u001b[39m client.materialize.version = CAVE_VERSION\n\u001b[32m 3\u001b[39m \n\u001b[32m 4\u001b[39m nuc_df = client.materialize.query_view(CAVE_VIEW)\n\u001b[32m 5\u001b[39m nuc_df = nuc_df.query(\u001b[33m\"pt_root_id != 0\"\u001b[39m)\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/frameworkclient.py:633\u001b[39m, in \u001b[36mCAVEclientFull.materialize\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 628\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 629\u001b[39m \u001b[33;03mA client for the materialization service. 
See [client.materialize](../api/materialize.md)\u001b[39;00m\n\u001b[32m 630\u001b[39m \u001b[33;03mfor more information.\u001b[39;00m\n\u001b[32m 631\u001b[39m \u001b[33;03m\"\"\"\u001b[39;00m\n\u001b[32m 632\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;28mself\u001b[39m._materialize \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m633\u001b[39m \u001b[38;5;28mself\u001b[39m._materialize = \u001b[30;43mMaterializationClient\u001b[39;49m\u001b[30;43m(\u001b[39;49m\n\u001b[32m 634\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mserver_address\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mlocal_server\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 635\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mauth_client\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mauth\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 636\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mdatastack_name\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m_datastack_name\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 637\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43msynapse_table\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43minfo\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mget_datastack_info\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43m)\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mget\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43m\"\u001b[39;49m\u001b[30;43msynapse_table\u001b[39;49m\u001b[30;43m\"\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43;01mNone\u001b[39;49;00m\u001b[30;43m)\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 638\u001b[39m \u001b[30;43m 
\u001b[39;49m\u001b[30;43mmax_retries\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m_max_retries\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 639\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mpool_maxsize\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m_pool_maxsize\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 640\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mpool_block\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m_pool_block\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 641\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mover_client\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 642\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mdesired_resolution\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43mdesired_resolution\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 643\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 644\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28mself\u001b[39m._materialize\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/materializationengine.py:221\u001b[39m, in \u001b[36mMaterializationClient.__init__\u001b[39m\u001b[34m(self, server_address, datastack_name, auth_client, cg_client, synapse_table, api_version, version, verify, max_retries, pool_maxsize, pool_block, desired_resolution, over_client)\u001b[39m\n\u001b[32m 209\u001b[39m auth_header = auth_client.request_header\n\u001b[32m 210\u001b[39m endpoints, api_version = _api_endpoints(\n\u001b[32m 211\u001b[39m api_version,\n\u001b[32m 212\u001b[39m SERVER_KEY,\n\u001b[32m (...)\u001b[39m\u001b[32m 218\u001b[39m verify=verify,\n\u001b[32m 219\u001b[39m 
)\n\u001b[32m--> \u001b[39m\u001b[32m221\u001b[39m \u001b[30;43msuper\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43mMaterializationClient\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mself\u001b[39;49m\u001b[30;43m)\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m__init__\u001b[39;49m\u001b[30;43m(\u001b[39;49m\n\u001b[32m 222\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mserver_address\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 223\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mauth_header\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 224\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mapi_version\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 225\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mendpoints\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 226\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mSERVER_KEY\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 227\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mverify\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mverify\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 228\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mmax_retries\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mmax_retries\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 229\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mpool_maxsize\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mpool_maxsize\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 230\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mpool_block\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mpool_block\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 231\u001b[39m \u001b[30;43m \u001b[39;49m\u001b[30;43mover_client\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mover_client\u001b[39;49m\u001b[30;43m,\u001b[39;49m\n\u001b[32m 232\u001b[39m \u001b[30;43m\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 233\u001b[39m \u001b[38;5;28mself\u001b[39m._datastack_name = 
datastack_name\n\u001b[32m 234\u001b[39m \u001b[38;5;28mself\u001b[39m._version = version\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/base.py:217\u001b[39m, in \u001b[36mClientBase.__init__\u001b[39m\u001b[34m(self, server_address, auth_header, api_version, endpoints, server_name, verify, max_retries, pool_maxsize, pool_block, over_client)\u001b[39m\n\u001b[32m 215\u001b[39m \u001b[38;5;28mself\u001b[39m._endpoints = endpoints\n\u001b[32m 216\u001b[39m \u001b[38;5;28mself\u001b[39m._fc = over_client\n\u001b[32m--> \u001b[39m\u001b[32m217\u001b[39m \u001b[38;5;28mself\u001b[39m._server_version = \u001b[30;43mself\u001b[39;49m\u001b[30;43m.\u001b[39;49m\u001b[30;43m_get_version\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/base.py:246\u001b[39m, in \u001b[36mClientBase._get_version\u001b[39m\u001b[34m(self)\u001b[39m\n\u001b[32m 244\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[32m 245\u001b[39m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[32m--> \u001b[39m\u001b[32m246\u001b[39m version_str = \u001b[30;43mhandle_response\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43mresponse\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mas_json\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43;01mTrue\u001b[39;49;00m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 247\u001b[39m version = Version(version_str)\n\u001b[32m 248\u001b[39m \u001b[38;5;28;01mreturn\u001b[39;00m version\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/base.py:94\u001b[39m, in \u001b[36mhandle_response\u001b[39m\u001b[34m(response, as_json, log_warning)\u001b[39m\n\u001b[32m 92\u001b[39m \u001b[38;5;250m\u001b[39m\u001b[33;03m\"\"\"Deal with potential errors in endpoint response and return json for default 
case\"\"\"\u001b[39;00m\n\u001b[32m 93\u001b[39m \u001b[38;5;66;03m# NOTE: consider adding \"None on 404\" as an option?\u001b[39;00m\n\u001b[32m---> \u001b[39m\u001b[32m94\u001b[39m \u001b[30;43m_raise_for_status\u001b[39;49m\u001b[30;43m(\u001b[39;49m\u001b[30;43mresponse\u001b[39;49m\u001b[30;43m,\u001b[39;49m\u001b[30;43m \u001b[39;49m\u001b[30;43mlog_warning\u001b[39;49m\u001b[30;43m=\u001b[39;49m\u001b[30;43mlog_warning\u001b[39;49m\u001b[30;43m)\u001b[39;49m\n\u001b[32m 95\u001b[39m _check_authorization_redirect(response)\n\u001b[32m 96\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m as_json:\n", + "\u001b[36mFile \u001b[39m\u001b[32m/opt/conda/lib/python3.12/site-packages/caveclient/base.py:84\u001b[39m, in \u001b[36m_raise_for_status\u001b[39m\u001b[34m(r, log_warning)\u001b[39m\n\u001b[32m 76\u001b[39m http_error_msg = \u001b[33m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[33m Server Error: \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[33m for url: \u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[33m content:\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[33m\"\u001b[39m % (\n\u001b[32m 77\u001b[39m r.status_code,\n\u001b[32m 78\u001b[39m reason,\n\u001b[32m 79\u001b[39m r.url,\n\u001b[32m 80\u001b[39m r.content,\n\u001b[32m 81\u001b[39m )\n\u001b[32m 83\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m http_error_msg:\n\u001b[32m---> \u001b[39m\u001b[32m84\u001b[39m \u001b[38;5;28;01mraise\u001b[39;00m requests.HTTPError(http_error_msg, response=r)\n\u001b[32m 85\u001b[39m \u001b[38;5;28;01mif\u001b[39;00m log_warning:\n\u001b[32m 86\u001b[39m warning = r.headers.get(\u001b[33m\"\u001b[39m\u001b[33mWarning\u001b[39m\u001b[33m\"\u001b[39m)\n", + "\u001b[31mHTTPError\u001b[39m: 503 Server Error: Service Temporarily Unavailable for url: https://minnie.microns-daf.com/materialize/version content:b'\\r\\n503 Service Temporarily Unavailable\\r\\n\\r\\n

503 Service Temporarily Unavailable\\r\\nnginx
\\r\\n\\r\\n\\r\\n'" ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
| | id | volume | pt_root_id | orig_root_id | pt_supervoxel_id | pt_position | pt_position_lookup |
| 1 | 373879 | 229.045043 | 864691136090135607 | 864691136090135607 | 96218056992431305 | [228816, 239776, 19593] | [228816, 239776, 19593] |
| 3 | 201858 | 93.753836 | 864691135373893678 | 864691135373893678 | 84955554103121097 | [146848, 213600, 26267] | [146848, 213600, 26267] |
| 4 | 600774 | 135.189791 | 864691135682378744 | 0 | 111493022281121981 | [339120, 276112, 19442] | [339520, 276480, 19506] |
\n", - "
" - ], - "text/plain": [ - " id volume pt_root_id orig_root_id \\\n", - "1 373879 229.045043 864691136090135607 864691136090135607 \n", - "3 201858 93.753836 864691135373893678 864691135373893678 \n", - "4 600774 135.189791 864691135682378744 0 \n", - "\n", - " pt_supervoxel_id pt_position pt_position_lookup \n", - "1 96218056992431305 [228816, 239776, 19593] [228816, 239776, 19593] \n", - "3 84955554103121097 [146848, 213600, 26267] [146848, 213600, 26267] \n", - "4 111493022281121981 [339120, 276112, 19442] [339520, 276480, 19506] " - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" } ], "source": [ @@ -196,17 +125,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataSet written: 1 rows\n" - ] - } - ], + "outputs": [], "source": [ "dataset = DataSet(\n", " id=DATASET_ID,\n", @@ -230,14 +151,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 minnie65_v1300_nucle \u2506 Minnie65 v1300 \u2506 doi.org/10.1038/s415 \u2506 ELECTRON_MICROSCOPY \u2506 minnie65 \u2502\n", - "\u2502 i \u2506 nucleus catalog \u2506 86-025-087\u2026 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌──────────────────────┬─────────────────┬──────────────────────┬─────────────────────┬────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞══════════════════════╪═════════════════╪══════════════════════╪═════════════════════╪════════════╡\n", + "│ minnie65_v1300_nucle ┆ Minnie65 v1300 ┆ doi.org/10.1038/s415 ┆ ELECTRON_MICROSCOPY ┆ minnie65 │\n", + "│ i ┆ nucleus catalog ┆ 86-025-087… ┆ ┆ │\n", + 
"└──────────────────────┴─────────────────┴──────────────────────┴─────────────────────┴────────────┘\n" ] } ], @@ -294,17 +215,17 @@ "text": [ "(133969, 4)\n", "shape: (5, 4)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 373879 \u2506 864691136090135607 \u2506 null \u2506 minnie65 \u2502\n", - "\u2502 201858 \u2506 864691135373893678 \u2506 null \u2506 minnie65 \u2502\n", - "\u2502 600774 \u2506 864691135682378744 \u2506 null \u2506 minnie65 \u2502\n", - "\u2502 408486 \u2506 864691135194387242 \u2506 null \u2506 minnie65 \u2502\n", - "\u2502 598774 \u2506 864691135741608653 \u2506 null \u2506 minnie65 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + 
"┌────────┬────────────────────┬───────────────────┬────────────┐\n", + "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str │\n", + "╞════════╪════════════════════╪═══════════════════╪════════════╡\n", + "│ 373879 ┆ 864691136090135607 ┆ null ┆ minnie65 │\n", + "│ 201858 ┆ 864691135373893678 ┆ null ┆ minnie65 │\n", + "│ 600774 ┆ 864691135682378744 ┆ null ┆ minnie65 │\n", + "│ 408486 ┆ 864691135194387242 ┆ null ┆ minnie65 │\n", + "│ 598774 ┆ 864691135741608653 ┆ null ┆ minnie65 │\n", + "└────────┴────────────────────┴───────────────────┴────────────┘\n" ] } ], @@ -366,17 +287,17 @@ "text": [ "(133969, 3)\n", "shape: (5, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 373879 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", - "\u2502 201858 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", - "\u2502 600774 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", - "\u2502 408486 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", - "\u2502 598774 \u2506 minnie65_v1300_nuclei \u2506 minnie65 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬───────────────────────┬────────────┐\n", + "│ dataitem_id ┆ dataset_id ┆ project_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════╪═══════════════════════╪════════════╡\n", + "│ 373879 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 201858 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 600774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 408486 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "│ 598774 ┆ minnie65_v1300_nuclei ┆ minnie65 │\n", + "└─────────────┴───────────────────────┴────────────┘\n" ] } ], @@ -408,8 +329,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | `len(nuc_df)` |\n", "\n", "**Intentionally not written here:**\n", - "- Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) \u2014 each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", - "- Cell features (`pt_position`, cell type labels, etc.) \u2014 written in `_02` as `CellFeature` records." + "- Cohort DataSets (e.g. `minnie65_v1300_csm_cluster`) — each cohort is an additional `DataSet` row plus `DataItemDataSetAssociation` rows pointing at the same `DataItem` ids; written by `_02`/`_03` notebooks.\n", + "- Cell features (`pt_position`, cell type labels, etc.) — written in `_02` as `CellFeature` records." 
] }, { diff --git a/code/etl_visp_inh_patchseq_02_cell_features.ipynb b/code/etl_visp_inh_patchseq_02_cell_features.ipynb index 8c4d444..f86be35 100644 --- a/code/etl_visp_inh_patchseq_02_cell_features.ipynb +++ b/code/etl_visp_inh_patchseq_02_cell_features.ipynb @@ -32,10 +32,7 @@ "from deltalake import write_deltalake\n", "\n", "from connects_common_connectivity.io.arrow_utils import (\n", - " attach_linkml_metadata,\n", - " build_arrow_schema,\n", " build_cell_feature_matrix_schema,\n", - " models_to_table,\n", ")\n", "from connects_common_connectivity.models import (\n", " CellFeatureDefinition,\n", @@ -170,7 +167,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-06-12T23:38:17.447595Z", @@ -179,41 +176,27 @@ "shell.execute_reply": "2026-06-12T23:38:17.621135Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "DataItems appended: 120\n", - "Associations appended: 120\n" - ] - } - ], + "outputs": [], "source": [ "if new_ids:\n", - " n_di = write_models([DataItem(id=cid, name=cid, project_id=PROJECT_ID) for cid in new_ids], output_root=OUTPUT_ROOT).rows_written\n", + " n_di = write_models(\n", + " [DataItem(id=cid, name=cid, project_id=PROJECT_ID) for cid in new_ids],\n", + " output_root=OUTPUT_ROOT,\n", + " ).rows_written\n", " print(f\"DataItems appended: {n_di}\")\n", - "\n", - " schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", - " new_assoc_table = attach_linkml_metadata(\n", - " models_to_table(\n", - " [\n", - " DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", - " for cid in new_ids\n", - " ],\n", - " schema=schema_assoc,\n", - " ),\n", - " linkml_class=\"DataItemDataSetAssociation\",\n", - " )\n", - " # mode=\"append\" is safe here: new_ids only contains cells not yet in DataItem.\n", - " # Re-runs skip this block (new_ids is empty), so no duplicate associations 
accumulate.\n", - " write_deltalake(\n", - " OUTPUT_ROOT + \"dataitem_dataset_association/\", new_assoc_table,\n", - " mode=\"append\", partition_by=[\"project_id\"],\n", - " )\n", - " print(f\"Associations appended: {len(new_ids)}\")\n", "else:\n", - " print(\"No new cells to register \u2014 all already present.\")" + " print(\"No new cells to register \\u2014 all already present.\")\n", + "\n", + "# Re-assert the full (project_id, dataset_id) association scope. The\n", + "# DataItemDataSetAssociation WriteSpec is overwrite_scoped on those two\n", + "# columns, so passing the full intended set is idempotent and self-heals\n", + "# any partial prior run.\n", + "n_assoc = write_models(\n", + " [DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", + " for cid in all_wide_ids],\n", + " output_root=OUTPUT_ROOT,\n", + ").rows_written\n", + "print(f\"Associations written for ({PROJECT_ID}, {DATASET_ID}): {n_assoc}\")\n" ] }, { @@ -963,9 +946,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb index 47da515..c56dd0f 100644 --- a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb +++ b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb @@ -68,14 +68,7 @@ "source": [ "import pandas as pd\n", "import polars as pl\n", - "import pyarrow as pa\n", - "from deltalake import write_deltalake\n", "\n", - "from connects_common_connectivity.io.arrow_utils import (\n", - " attach_linkml_metadata,\n", - " build_arrow_schema,\n", - " models_to_table,\n", - ")\n", "from connects_common_connectivity.models import (\n", " CellToClusterMapping,\n", " ClusterMembership,\n", @@ -336,7 +329,7 @@ }, { "cell_type": "code", - 
"execution_count": 6, + "execution_count": null, "id": "2c8d4901", "metadata": { "execution": { @@ -346,37 +339,18 @@ "shell.execute_reply": "2026-06-12T23:38:34.392730Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "No new associations needed \u2014 all MET cells already linked to inh dataset.\n", - "Total visp_inh_patchseq associations now: 2879\n" - ] - } - ], + "outputs": [], "source": [ - "if ids_needing_assoc:\n", - " schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", - " new_assoc_table = attach_linkml_metadata(\n", - " models_to_table(\n", - " [DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", - " for cid in ids_needing_assoc],\n", - " schema=schema_assoc,\n", - " ),\n", - " linkml_class=\"DataItemDataSetAssociation\",\n", - " )\n", - " # mode=\"append\" is idempotent here: ids_needing_assoc only contains ids without an\n", - " # existing (project, dataset) association. On re-run, the set is empty and we skip.\n", - " write_deltalake(\n", - " OUTPUT_ROOT + \"dataitem_dataset_association/\", new_assoc_table,\n", - " mode=\"append\",\n", - " partition_by=[\"project_id\"],\n", - " )\n", - " print(f\"Associations appended: {len(ids_needing_assoc)}\")\n", - "else:\n", - " print(\"No new associations needed \u2014 all MET cells already linked to inh dataset.\")\n", + "# Re-assert the full (project_id, dataset_id) association scope for every\n", + "# MET cell. 
DataItemDataSetAssociation is overwrite_scoped on\n", + "# (project_id, dataset_id), so passing the full set is idempotent and\n", + "# self-heals any cells that were missed by a prior partial run.\n", + "n_assoc = write_models(\n", + " [DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", + " for cid in sorted(met_csv_ids)],\n", + " output_root=OUTPUT_ROOT,\n", + ").rows_written\n", + "print(f\"Associations written for ({PROJECT_ID}, {DATASET_ID}): {n_assoc}\")\n", "\n", "# Verify post-condition: every MET cell now has the inh dataset association.\n", "post_assoc = (\n", @@ -973,9 +947,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/code/etl_wnm_exc_02_cell_features.ipynb b/code/etl_wnm_exc_02_cell_features.ipynb index 59d7ec8..bd3955b 100644 --- a/code/etl_wnm_exc_02_cell_features.ipynb +++ b/code/etl_wnm_exc_02_cell_features.ipynb @@ -29,14 +29,10 @@ "import pandas as pd\n", "import polars as pl\n", "import pyarrow as pa\n", - "from deltalake import DeltaTable, write_deltalake\n", - "from deltalake.exceptions import TableNotFoundError\n", + "from deltalake import write_deltalake\n", "\n", "from connects_common_connectivity.io.arrow_utils import (\n", - " attach_linkml_metadata,\n", - " build_arrow_schema,\n", " build_cell_feature_matrix_schema,\n", - " models_to_table,\n", ")\n", "from connects_common_connectivity.models import (\n", " CellFeatureDefinition,\n", @@ -479,7 +475,7 @@ }, { "cell_type": "code", - "execution_count": 8, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2026-06-12T21:50:50.460934Z", @@ -488,66 +484,39 @@ "shell.execute_reply": "2026-06-12T21:50:50.703619Z" } }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Appended 4 new DataItem rows\n" - ] - }, - { - "name": 
"stdout", - "output_type": "stream", - "text": [ - "Appended 4 new DataItemDataSetAssociation rows\n" - ] - } - ], + "outputs": [], "source": [ "# Register new cells (DataItem + DataItemDataSetAssociation) for those in Set1 not yet in _01.\n", "if new_ids_set1:\n", " new_items = [DataItem(id=i, name=i, project_id=PROJECT_ID) for i in new_ids_set1]\n", " n_appended = write_models(new_items, output_root=OUTPUT_ROOT).rows_written\n", " print(f\"Appended {n_appended} new DataItem rows\")\n", - "\n", - " new_assoc = [\n", - " DataItemDataSetAssociation(dataitem_id=i, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", - " for i in new_ids_set1\n", - " ]\n", - " schema_assoc = build_arrow_schema(DataItemDataSetAssociation)\n", - " table_assoc = attach_linkml_metadata(\n", - " models_to_table(new_assoc, schema=schema_assoc),\n", - " linkml_class=\"DataItemDataSetAssociation\",\n", - " )\n", - " # Append \u2014 association table uses append_new_dataitems pattern\n", - " # (no overwrite predicate since we only add new rows here)\n", - " existing_assoc = pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\").filter(\n", - " (pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID)\n", - " )\n", - " existing_assoc_ids = set(existing_assoc[\"dataitem_id\"].to_list())\n", - " truly_new_assoc = [a for a in new_assoc if a.dataitem_id not in existing_assoc_ids]\n", - " if truly_new_assoc:\n", - " table_new_assoc = attach_linkml_metadata(\n", - " models_to_table(truly_new_assoc, schema=schema_assoc),\n", - " linkml_class=\"DataItemDataSetAssociation\",\n", - " )\n", - " write_deltalake(\n", - " OUTPUT_ROOT + \"dataitem_dataset_association/\", table_new_assoc,\n", - " mode=\"append\", partition_by=[\"project_id\"],\n", - " )\n", - " print(f\"Appended {len(truly_new_assoc)} new DataItemDataSetAssociation rows\")\n", - " else:\n", - " print(\"No new association rows to append.\")\n", "else:\n", - " print(\"All Set1 cells already registered \u2014 no new 
DataItem or association writes.\")\n", + " print(\"All Set1 cells already registered \\u2014 no new DataItem writes.\")\n", + "\n", + "# Re-assert the full (project_id, dataset_id) association scope as the union\n", + "# of any existing assoc rows and the Set1 ids. DataItemDataSetAssociation is\n", + "# overwrite_scoped on (project_id, dataset_id), so passing the full intended\n", + "# set is idempotent and self-heals partial prior runs.\n", + "existing_assoc = (\n", + " pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\")\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", + ")\n", + "existing_assoc_ids = set(existing_assoc[\"dataitem_id\"].to_list())\n", + "all_assoc_ids = sorted(existing_assoc_ids | set(set1_ids))\n", + "n_assoc = write_models(\n", + " [DataItemDataSetAssociation(dataitem_id=i, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", + " for i in all_assoc_ids],\n", + " output_root=OUTPUT_ROOT,\n", + ").rows_written\n", + "print(f\"Associations written for ({PROJECT_ID}, {DATASET_ID}): {n_assoc}\")\n", "\n", "# Refresh registered ids so Set2 coverage check reflects newly added cells.\n", "wnm_registered_ids = set(\n", " pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\")\n", " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", " [\"dataitem_id\"].to_list()\n", - ")" + ")\n" ] }, { @@ -1954,4 +1923,4 @@ }, "nbformat": 4, "nbformat_minor": 4 -} +} \ No newline at end of file diff --git a/planning/TODO.md b/planning/TODO.md index 90050af..3fa1236 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -54,11 +54,22 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li predicate / partition columns are `Optional` in the generated schema). Tests: `tests/test_write_validation.py`. Cross-field rules deferred (still empty list on every spec). 
-- [ ] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Migrate every - ETL notebook to typed writers; delete hardcoded `OUTPUT_ROOT` and per-cell - `mode`/`predicate`/`partition_by` (`ccc_config.yaml` already exists from W1). Run the - patchseq regression (exc then inh, both DataSet rows must coexist). Remove the W3 - re-export shims and confirm nothing imports the old paths. Blocked by W3 (W5 preferred). +- [x] **W6 — Notebook migration** (`prompts/06_notebook_migration.md`) — Every ETL + notebook now routes registry-backed writes through `write_models()` / + `write_projection_matrix()`; hardcoded `OUTPUT_ROOT = "../scratch/..."` strings + replaced with `output_root()` from `config`. Patchseq regression covered (exc and + inh `DataSet` rows coexist via `scope=["project_id", "id"]`). W3 re-export shims + removed; nothing imports the old `connects_common_connectivity.arrow_utils` / + `write_utils` paths. **Carve-outs (deferred, tracked elsewhere):** + (a) Wide cell-feature and wide projection-matrix parquets in + `etl_minnie_02`, `etl_visp_exc_patchseq_02`, `etl_visp_inh_patchseq_02`, + `etl_wnm_exc_02`, and `etl_wnm_exc_04` (ipsi/contra) still call + `write_deltalake` directly — `wide_parquet` mode not yet in the registry + (W3 deviation; revisit when the wide-matrix contract is clarified). + (b) `CellCellConnectivityLong` writes in `etl_minnie_04` (cells 19, 25) — + class not in the registry; `writers.py` has a `write_cellcellconnectivitylong` + stub documenting the migration plan. + (c) `etl_v1dd_01_v1196` cell 12 wide-parquet stub (still a `# TODO` placeholder). - [ ] **W7 — Write-side test suite** (`prompts/07_tests.md`) — Drift, patchseq regression, idempotency, append-new-by-id, predicate construction, per-class example smoke, no-shim regression, public-API surface. Owns only the gaps not specified by W2/W3/W4/W5. 
From 29c98522924db738b63979887196c2bb84cedfeb Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 10:15:08 +0000 Subject: [PATCH 18/25] tighten tests before pr, plans --- planning/tests_review/README.md | 20 +++ planning/tests_review/findings.md | 60 +++++++ planning/tests_review/plan.md | 267 ++++++++++++++++++++++++++++++ 3 files changed, 347 insertions(+) create mode 100644 planning/tests_review/README.md create mode 100644 planning/tests_review/findings.md create mode 100644 planning/tests_review/plan.md diff --git a/planning/tests_review/README.md b/planning/tests_review/README.md new file mode 100644 index 0000000..fc64ff4 --- /dev/null +++ b/planning/tests_review/README.md @@ -0,0 +1,20 @@ +# Tests Review — Findings & Implementation Plan + +Review of `tests/` (12 files, ~1,540 LOC) on branch `ingestion-v2`. + +## Documents + +- [`findings.md`](./findings.md) — Numbered review report: high / medium / low priority issues, plus what's working well. +- [`plan.md`](./plan.md) — 5-PR implementation plan for the next steps, with code snippets and sequencing. + +## TL;DR + +Suite is solid (good docstrings, parametrization, regression tests named after the bug). Main gaps: + +1. No `conftest.py` → duplicated helpers, cache-pollution risk. +2. `pytest.raises(Exception)` used in several places → too broad. +3. Regression assertions lack failure messages. +4. `WRITABLE_CLASSES` ↔ `_make_instance` drift is silent. +5. Missing coverage for `cli.py`, `parquet_loader.py`, and `dry_run` semantics. + +Five small PRs proposed, sequenced so PRs 1–4 are pure test refactors and PR 5 is the only one likely to surface production bugs. diff --git a/planning/tests_review/findings.md b/planning/tests_review/findings.md new file mode 100644 index 0000000..0b358f9 --- /dev/null +++ b/planning/tests_review/findings.md @@ -0,0 +1,60 @@ +# Tests Review — Findings + +Review of `tests/` on branch `ingestion-v2` (12 files, ~1,540 LOC). + +## 🔴 High priority + +1. 
**No `conftest.py`.** Shared helpers are duplicated across files: `_models()` is redefined in 4 schema files; `_make_table`, `_read`, the `settings` / `tmp_path` fixtures appear ad-hoc. Promote `settings`, `_read`, `_models`, `_make_instance` to `tests/conftest.py` as fixtures. Will shrink the suite and stop drift. + +2. **`pytest.raises(Exception)` is too broad** in `test_basic.py` (lines 22, 37, 69) and `test_config.py` (108, 115). It will pass on completely unrelated failures (ImportError, TypeError from a refactor). Use `ValidationError` / `RuntimeError` with `match=` like the other schema tests already do. + +3. **Cross-test cache pollution risk.** `get_settings` is `lru_cache`d but only `test_config.py` clears it (via autouse fixture). If any other test imports `get_settings` first, later config tests can flake. Move the `_reset_cache_and_env` autouse fixture into `conftest.py` so it runs for every test. + +4. **`test_no_source_references_shim_paths` walks `REPO_ROOT.rglob("*")`** including `data/`, `results/`, `scratch/`, `metadata/`, `.venv` siblings, etc. It's slow and brittle. Either restrict to `{src, tests, code, scripts, planning}` or add those large dirs to `EXCLUDED_DIRS`. Also worth caching the file list. + +5. **`test_round_trip_each_writable_class` silently skips coverage drift.** If someone adds a class to `WRITABLE_CLASSES` and forgets to extend `_make_instance`, the test raises `AssertionError("no fixture for …")` — which *looks* like a test failure but doesn't tell you the spec is missing. Convert the `raise AssertionError` to `pytest.fail("…add a fixture in _make_instance")`, or better, register per-class fixtures in a dict and assert `set(fixtures) == set(WRITABLE_CLASSES)` as its own test. + +## 🟡 Medium priority + +6. **Assertion failure messages are mostly bare.** Examples: + - `test_patchseq_regression_two_datasets_same_project` → `assert ids == [...]` with no `, f"…"` message. 
When this fails in CI, you'll get `AssertionError: assert ['x'] == ['visp_exc_patchseq','visp_inh_patchseq']` and nothing about which write was lost. Add messages like `f"second write wiped first; remaining ids={ids}"`. + - `test_first_write_appends_all`, `test_idempotent_partial_rerun`: same — a custom message naming the scenario would speed debugging by months over time. + +7. **Lots of inline `import` statements** (`test_basic.py` imports `pytest` and `ccc` inside every test, `test_write_validation.test_write_models_calls_validation_before_io` imports `Settings` and `write_models` inside the function). Lift to module top for consistency with the rest of the suite. + +8. **`test_enum_validation` and `test_projection_measurement_matrix_laterality` use overly permissive assertions:** `assert str(ds.modality) in {Modality.TRACER.value, Modality.TRACER.name, str(Modality.TRACER)}`. That comment says "depending on dynamic generation" — pin it. If the schema can return three things, the schema isn't deterministic and *that* is the bug; if it's deterministic, assert exactly one. + +9. **No negative test for `validate_for_write` with a `list` containing a bad row.** `test_validate_for_write_accepts_a_list` covers the happy path; add a counterpart that passes `[good, bad]` and asserts the error names *which row* failed. + +10. **`test_write_models_rejects_unregistered_class`** uses `pytest.raises(TypeError)` without `match=`. Add `match="WRITABLE_CLASSES"` or similar so a misleading TypeError from elsewhere doesn't false-positive. + +11. **`test_describe_includes_resolved_values`** asserts substring `"root"` which trivially matches the path. Strengthen: assert `str(settings.output_root)` is in the output verbatim. + +12. **Idempotency assertion in `test_overwrite_scoped_is_idempotent`** checks only row count, not row equality. If the writer silently overwrites with wrong content, the test passes. Read back and assert the row matches `ds`. 
+ +## 🟢 Low priority / polish + +13. **Naming consistency.** Some files use `def _models()` factory, others import directly from `connects_common_connectivity.models`. Pick one — preferably the direct import, since `generate_pydantic_models()` is re-invoked on every test and is presumably expensive. + +14. **`test_basic.py` is a grab bag** (imports, model generation, enum, required field, multivalued, bounds). Split into `test_import.py` + fold the rest into the topical schema files that already exist. + +15. **No markers / no test plan.** Consider `pytest.mark.slow` for the full per-class round-trip and the repo-walk shim test. Speeds local TDD. + +16. **`test_write_relocation.py` test name is misleading** — it's about shim removal, not relocation. Rename to `test_no_shim_imports.py`. + +17. **Missing coverage:** + - No tests for `cli.py` (the `ccc` entry point). + - No tests for `parquet_loader.py`. + - No tests for `dry_run=True` actually being honored by `write_models` (config has the flag; writer behavior under it is untested). + - No concurrent-write / locking behavior for delta tables, even a basic sanity test. + - `_build_predicate_escapes_single_quotes` covers `'` — also test backslash, empty string, and unicode. + +18. **`test_io_reexports_settings_helpers`** asserts identity (`is`) which is fine, but the same pattern in `test_public_api` uses `hasattr`. Pick one approach for re-export tests. + +## ✅ What's working well + +- **Excellent module-level docstrings** stating *why* the test exists (`test_writers.py`, `test_write_validation.py`, `test_write_relocation.py`, `test_public_api.py`). Keep doing this. +- **Headline regression test** (`test_patchseq_regression_two_datasets_same_project`) is exactly right — named for the bug, documents the prior failure mode in its docstring. +- **Parametrization over the registry** in `test_write_spec.py` is the right shape — it auto-grows with new entries. 
+- **`extra="forbid"` enforcement test** (`test_cluster_rejects_project_id`) prevents silent schema breakage. Good. +- Strong **regex `match=` usage** in schema tests catches the right error *and* the right field. diff --git a/planning/tests_review/plan.md b/planning/tests_review/plan.md new file mode 100644 index 0000000..d3baf7d --- /dev/null +++ b/planning/tests_review/plan.md @@ -0,0 +1,267 @@ +# Tests Review — Implementation Plan + +Five PRs, each independently mergeable, each small enough to review in <10 min. + +--- + +## PR 1 — `conftest.py` foundation (enables everything else) + +**Goal:** kill duplication and enforce stable test isolation (fresh cwd/env + cleared `get_settings` cache) in one shot. + +```python +# tests/conftest.py +from __future__ import annotations +import pytest +from pathlib import Path +import polars as pl + +import connects_common_connectivity as ccc +from connects_common_connectivity.config import Settings, get_settings +from connects_common_connectivity import models as _models_mod + + +@pytest.fixture(autouse=True) +def _isolate_settings(monkeypatch, tmp_path): + """Every test gets a clean cwd, no CCC_OUTPUT_ROOT, and a cleared cache.""" + monkeypatch.delenv("CCC_OUTPUT_ROOT", raising=False) + monkeypatch.chdir(tmp_path) + get_settings.cache_clear() + yield + get_settings.cache_clear() + + +@pytest.fixture(scope="session") +def models() -> dict: + """Generate pydantic models once per session (expensive).""" + return ccc.generate_pydantic_models() + + +@pytest.fixture +def settings(tmp_path) -> Settings: + return Settings(output_root=tmp_path) + + +@pytest.fixture +def read_delta(): + def _read(path) -> pl.DataFrame: + return pl.read_delta(str(path)) + return _read +``` + +Then: +- delete the duplicated `_models()` from 4 schema files; switch tests to `def test_x(models):` +- delete the duplicated `settings` / `_read` from `test_writers.py` +- delete `_reset_cache_and_env` from `test_config.py` (now autouse globally) + 
+**Decision:** keep autouse `chdir(tmp_path)` globally. In this package it is a feature, not a risk: config discovery is cwd-based and cached, so per-test cwd isolation prevents cross-test bleed. `test_write_relocation.py` is safe because `REPO_ROOT` is anchored from `__file__`, not cwd. + +--- + +## PR 2 — Tighten exception assertions + +**Pattern:** prefer the narrowest exception + a `match=` that names the *field or condition*, not the generic word. + +`pytest.raises` signature reminder: +```python +with pytest.raises(ExpectedException, match=r"regex against str(exc)"): + ... +``` + +**Concrete replacements:** + +| File:line | Before | After | +|---|---|---| +| `test_basic.py:22` | `pytest.raises(Exception)` | `pytest.raises(ValidationError, match=r"project_id.*[Ff]ield required")` | +| `test_basic.py:37` | `pytest.raises(Exception)` | `pytest.raises(ValidationError, match=r"modality.*Input should be")` | +| `test_basic.py:69` | `pytest.raises(Exception)` | `pytest.raises(ValidationError, match=r"probability.*less than or equal to 1")` | +| `test_config.py:108` | `pytest.raises(Exception)` | `pytest.raises(ValidationError, match=r"output_root.*[Ff]ield required")` | +| `test_config.py:115` | `pytest.raises(Exception)` | `pytest.raises(ValidationError, match=r"[Ee]xtra inputs are not permitted")` *(verify what Settings raises first)* | +| `test_writers.py:328` | `pytest.raises(TypeError)` | `pytest.raises(TypeError, match=r"pydantic model or iterable")` | + +Also add a **new** test for true registry rejection (different code path): + +```python +from pydantic import BaseModel + +class UnregisteredModel(BaseModel): + id: str + +with pytest.raises(KeyError, match=r"UnregisteredModel"): + write_models(UnregisteredModel(id="u1"), settings=settings) +``` + +Note: this test needs `from pydantic import BaseModel` at the top of `test_writers.py`. Match on the class name rather than the exact error string — it's a more durable contract than the message text. 
+ +**Rule of thumb to leave in the PR description:** +> Never `pytest.raises(Exception)`. Always pick the narrowest class the production code raises, and always include `match=` naming the field or condition. If you don't know which exception the code raises, that's the first thing to find out — that's the contract. + +For the dynamically-generated pydantic models in `test_basic.py`, import `from pydantic import ValidationError` at module top — it's the same class instance the dynamic models will raise. + +--- + +## PR 3 — Failure messages on regression-critical asserts + +Only add custom messages where the failure mode is non-obvious. Don't litter every assert. + +**Targets:** + +```python +# test_writers.py — patchseq regression +ids = sorted(rows["id"].to_list()) +assert ids == ["visp_exc_patchseq", "visp_inh_patchseq"], ( + f"patchseq regression: second write wiped first. " + f"Expected both datasets, got {ids}" +) +``` + +```python +# test_writers.py — idempotency, also strengthen content equality +rows = _read(settings.output_root / "dataset") +assert rows.shape[0] == 1, f"idempotent rewrite produced {rows.shape[0]} rows" +assert rows["id"].to_list() == ["d1"], "row identity changed across rewrites" +assert rows["name"].to_list() == ["example"], "row content drifted across rewrites" +``` + +```python +# test_write_utils.py — partial rerun +assert n == 1, f"expected only 'c' to be new; appended {n} rows" +``` + +```python +# test_write_validation.py — IO-never-happened check +assert not (tmp_path / "cluster").exists(), ( + "validation failure should short-circuit before any IO; " + "cluster/ directory was created anyway" +) +``` + +Skip messages on simple positive assertions like `assert cfd.range_max is None` — pytest's introspection already shows the value. + +--- + +## PR 4 — Coverage drift guards & list-failure tests + +### 4a. 
`WRITABLE_CLASSES` ↔ fixture drift + +Replace the if/elif tower in `_make_instance` with a registry dict + drift test: + +```python +# tests/_fixtures.py (or in conftest) +INSTANCE_FACTORIES = { + DataSet: lambda: DataSet(id="ds1", name="ds", project_id="p1"), + DataItem: lambda: DataItem(id="di1", name="di1", project_id="p1"), + # ... +} + +def make_instance(cls): + try: + return INSTANCE_FACTORIES[cls]() + except KeyError: + pytest.fail( + f"No fixture for {cls.__name__}. Add an entry to " + f"INSTANCE_FACTORIES in tests/_fixtures.py." + ) +``` + +```python +def test_every_writable_class_has_a_fixture(): + missing = set(WRITABLE_CLASSES) - set(INSTANCE_FACTORIES) + assert not missing, ( + f"WRITABLE_CLASSES added entries without fixtures: " + f"{sorted(c.__name__ for c in missing)}" + ) + stale = set(INSTANCE_FACTORIES) - set(WRITABLE_CLASSES) + assert not stale, ( + f"INSTANCE_FACTORIES has stale entries not in WRITABLE_CLASSES: " + f"{sorted(c.__name__ for c in stale)}" + ) +``` + +This makes the drift visible as a dedicated test failure instead of a parametrized round-trip error. + +### 4b. Negative-path coverage for `validate_for_write` with a list + +```python +def test_validate_for_write_list_reports_failing_row(): + spec = REGISTRY["Cluster"] + items = [ + Cluster(id="c1", hierarchy_id="h1"), + Cluster(id="c2"), # missing hierarchy_id + ] + with pytest.raises(ValueError, match=r"hierarchy_id") as ei: + validate_for_write(items, spec) + # row identity should appear in the error to make debugging tractable + assert "c2" in str(ei.value), ( + f"error should name failing row; got: {ei.value}" + ) +``` + +(If the production code doesn't currently name the row, that's a real finding to file — the test documents the desired contract.) + +--- + +## PR 5 — Plug the real coverage gaps + +This is the only PR that touches behavior beyond test infra. Split per module to keep diffs small. + +### 5a. 
`dry_run` semantics +```python +def test_dry_run_does_not_write(tmp_path): + settings = Settings(output_root=tmp_path, dry_run=True) + ds = DataSet(id="d1", name="d", project_id="p1") + result = write_models(ds, settings=settings) + assert result.rows_written == 0, "dry_run must report 0 rows written" + assert not (tmp_path / "dataset").exists(), "dry_run must not create tables" +``` +If this fails, you've found a bug — `dry_run` exists in `Settings` but nothing checks it's honored. + +### 5b. `cli.py` +This CLI is `argparse`, not Click. Use `subprocess.run([sys.executable, "-m", "connects_common_connectivity.cli", ...])`. +Cover: top-level `--help`, `info` (assert package version text appears), one happy-path command (`bundle`), one error path (bad subcommand/args → nonzero exit). + +Skip `cmd_validate` and `etl-brain-regions` — both are marked `# pragma: no cover` in `cli.py` as runtime smoke commands; respect the existing exclusion. + +### 5c. `parquet_loader.py` +Test the public contract of `load_parquet_to_models(...)`: write a tiny parquet, load into a concrete class (e.g. `DataItem`), assert instance count + key field values + report counts/mapping. Add one negative test where required data is missing and assert the failure is surfaced in `report["errors"]`. + +### 5d. Extra escapes in `_build_predicate` (1-line additions to existing test) +```python +@pytest.mark.parametrize("value,expected_literal", [ + ("O'Hara", "'O''Hara'"), + ("", "''"), + ("a\\b", "'a\\b'"), # backslash is not special in SQL string literals + ("café", "'café'"), +]) +def test_build_predicate_escapes(value, expected_literal): + assert _build_predicate(["name"], [value]) == f"name = {expected_literal}" +``` + +### 5e. 
Repo-walk hardening in `test_write_relocation.py` +```python +SEARCH_ROOTS = ["src", "tests", "code", "scripts", "planning"] + +def _iter_source_files(): + for root in SEARCH_ROOTS: + base = REPO_ROOT / root + if not base.exists(): + continue + for path in base.rglob("*"): + if path.is_file() and path.suffix in {".py", ".ipynb"}: + if not any(p in EXCLUDED_DIRS for p in path.parts): + yield path +``` +Drops `data/`, `results/`, `scratch/`, `metadata/`, `environment/` from the walk. + +--- + +## Sequencing & rollout + +| PR | Effort | Risk | Blocks | +|---|---|---|---| +| 1. conftest | 1h | low | 2, 3 | +| 2. exceptions | 30m | low | — | +| 3. messages | 30m | none | — | +| 4. drift guards | 1h | low | — | +| 5. coverage gaps | 2–4h | medium (may surface real bugs) | — | + +PRs 1–4 are pure test refactors → merge fast. PR 5 is where you'll likely find a `dry_run` bug; budget time for a fix-PR alongside it. From aa7aaf4a81f79436dee85de6f5f79bc2306b10b1 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 11:16:27 +0000 Subject: [PATCH 19/25] tests tightening done --- CHANGELOG.md | 3 + PR_message.md | 10 + planning/tests_review/README.md | 4 +- planning/tests_review/plan.md | 32 ++- .../io/write_validation.py | 6 +- .../io/writers.py | 22 +- tests/conftest.py | 38 ++++ tests/test_basic.py | 14 +- tests/test_cell_features_schema.py | 47 ++-- tests/test_cli.py | 43 ++++ tests/test_clustering_schema.py | 75 +++---- tests/test_config.py | 15 +- tests/test_mappings_schema.py | 43 ++-- tests/test_parquet_loader.py | 59 +++++ tests/test_projection_schema.py | 12 +- tests/test_write_relocation.py | 14 +- tests/test_write_utils.py | 2 +- tests/test_write_validation.py | 16 +- tests/test_writers.py | 211 ++++++++++-------- 19 files changed, 424 insertions(+), 242 deletions(-) create mode 100644 PR_message.md create mode 100644 tests/conftest.py create mode 100644 tests/test_cli.py create mode 100644 tests/test_parquet_loader.py diff --git a/CHANGELOG.md 
b/CHANGELOG.md index 491e56d..8a835a1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -64,4 +64,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Fixed +- Fixed `write_models()` to honor `Settings.dry_run=True`: writes are now skipped, + `rows_written` is reported as `0`, and no Delta table directories are created. + ### Security diff --git a/PR_message.md b/PR_message.md new file mode 100644 index 0000000..60b61cb --- /dev/null +++ b/PR_message.md @@ -0,0 +1,10 @@ +# PR Message + +Implemented the full `planning/tests_review/plan.md` sequence (WP1 to WP5) end-to-end, with package-by-package verification gates in order. + +- Added shared test foundations in `tests/conftest.py` (settings/cache/cwd isolation + shared fixtures) and removed duplicated helpers across tests. +- Tightened exception assertions to specific exception classes with meaningful `match=` checks. +- Added high-signal regression assertion messages where failures are otherwise hard to diagnose. +- Added fixture/registry drift guards for writable model coverage and improved list-validation failure reporting to include row context. +- Closed remaining coverage gaps by adding tests for CLI behavior, parquet loader contract, predicate escaping edge cases, relocation scan roots, and dry-run semantics. +- Fixed writer behavior so `write_models(..., settings=Settings(..., dry_run=True))` does not write any data and returns `rows_written=0`. diff --git a/planning/tests_review/README.md b/planning/tests_review/README.md index fc64ff4..20542f9 100644 --- a/planning/tests_review/README.md +++ b/planning/tests_review/README.md @@ -5,7 +5,7 @@ Review of `tests/` (12 files, ~1,540 LOC) on branch `ingestion-v2`. ## Documents - [`findings.md`](./findings.md) — Numbered review report: high / medium / low priority issues, plus what's working well. -- [`plan.md`](./plan.md) — 5-PR implementation plan for the next steps, with code snippets and sequencing. 
+- [`plan.md`](./plan.md) — Sequential implementation plan (5 work packages) for an agent to execute end-to-end in one go, with code snippets and per-package guardrails. ## TL;DR @@ -17,4 +17,4 @@ Suite is solid (good docstrings, parametrization, regression tests named after t 4. `WRITABLE_CLASSES` ↔ `_make_instance` drift is silent. 5. Missing coverage for `cli.py`, `parquet_loader.py`, and `dry_run` semantics. -Five small PRs proposed, sequenced so PRs 1–4 are pure test refactors and PR 5 is the only one likely to surface production bugs. +Five sequential work packages proposed for end-to-end agent execution. WPs 1–4 are pure test refactors; WP 5 is the only one likely to surface production bugs (fix in-place). diff --git a/planning/tests_review/plan.md b/planning/tests_review/plan.md index d3baf7d..8594f53 100644 --- a/planning/tests_review/plan.md +++ b/planning/tests_review/plan.md @@ -1,10 +1,10 @@ # Tests Review — Implementation Plan -Five PRs, each independently mergeable, each small enough to review in <10 min. +Five sequential work packages, implemented in order on the same execution track (no PR slicing). --- -## PR 1 — `conftest.py` foundation (enables everything else) +## Work Package 1 — `conftest.py` foundation (enables everything else) **Goal:** kill duplication and enforce stable test isolation (fresh cwd/env + cleared `get_settings` cache) in one shot. @@ -57,7 +57,7 @@ Then: --- -## PR 2 — Tighten exception assertions +## Work Package 2 — Tighten exception assertions **Pattern:** prefer the narrowest exception + a `match=` that names the *field or condition*, not the generic word. @@ -92,14 +92,14 @@ with pytest.raises(KeyError, match=r"UnregisteredModel"): Note: this test needs `from pydantic import BaseModel` at the top of `test_writers.py`. Match on the class name rather than the exact error string — it's a more durable contract than the message text. 
-**Rule of thumb to leave in the PR description:** +**Rule of thumb to leave in the package notes:** > Never `pytest.raises(Exception)`. Always pick the narrowest class the production code raises, and always include `match=` naming the field or condition. If you don't know which exception the code raises, that's the first thing to find out — that's the contract. For the dynamically-generated pydantic models in `test_basic.py`, import `from pydantic import ValidationError` at module top — it's the same class instance the dynamic models will raise. --- -## PR 3 — Failure messages on regression-critical asserts +## Work Package 3 — Failure messages on regression-critical asserts Only add custom messages where the failure mode is non-obvious. Don't litter every assert. @@ -139,7 +139,7 @@ Skip messages on simple positive assertions like `assert cfd.range_max is None` --- -## PR 4 — Coverage drift guards & list-failure tests +## Work Package 4 — Coverage drift guards & list-failure tests ### 4a. `WRITABLE_CLASSES` ↔ fixture drift @@ -200,9 +200,9 @@ def test_validate_for_write_list_reports_failing_row(): --- -## PR 5 — Plug the real coverage gaps +## Work Package 5 — Plug the real coverage gaps -This is the only PR that touches behavior beyond test infra. Split per module to keep diffs small. +This is the only work package that may touch behavior beyond test infra. Split per module to keep diffs small. ### 5a. `dry_run` semantics ```python @@ -254,9 +254,21 @@ Drops `data/`, `results/`, `scratch/`, `metadata/`, `environment/` from the walk --- +## Sequential execution guardrails (hard stops between packages) + +Do **not** start the next work package until the current package meets its guardrail. + +1. **WP1 → WP2:** `conftest.py` is in place, duplicated helper fixtures are removed from target files, and settings/cache isolation behavior remains intact. +2. 
**WP2 → WP3:** broad `pytest.raises(Exception)` uses targeted in this plan are replaced with narrow exception types and meaningful `match=` checks. +3. **WP3 → WP4:** custom assertion messages were added only to regression-critical/non-obvious assertions (no blanket message churn). +4. **WP4 → WP5:** fixture drift guard(s) are in place and list-failure validation coverage is added; registry/fixture mismatch now fails with explicit guidance. +5. **WP5 completion:** coverage-gap tests are in place; if `dry_run` exposes a real bug, fix behavior in the same package before declaring completion. + +--- + ## Sequencing & rollout -| PR | Effort | Risk | Blocks | +| Work package | Effort | Risk | Blocks | |---|---|---|---| | 1. conftest | 1h | low | 2, 3 | | 2. exceptions | 30m | low | — | @@ -264,4 +276,4 @@ Drops `data/`, `results/`, `scratch/`, `metadata/`, `environment/` from the walk | 4. drift guards | 1h | low | — | | 5. coverage gaps | 2–4h | medium (may surface real bugs) | — | -PRs 1–4 are pure test refactors → merge fast. PR 5 is where you'll likely find a `dry_run` bug; budget time for a fix-PR alongside it. +Work packages 1–4 are pure test refactors. Work package 5 is where you'll likely find a `dry_run` bug; budget time for an immediate behavior fix in the same execution sequence. 
diff --git a/src/connects_common_connectivity/io/write_validation.py b/src/connects_common_connectivity/io/write_validation.py index 3c1c199..1963d44 100644 --- a/src/connects_common_connectivity/io/write_validation.py +++ b/src/connects_common_connectivity/io/write_validation.py @@ -128,7 +128,7 @@ def validate_for_write(models: Any, spec: WriteSpec) -> Any: return items if was_iter else items[0] revalidated: list[BaseModel] = [] - for m in items: + for idx, m in enumerate(items): try: revalidated.append(strict.model_validate(m.model_dump())) except ValidationError as err: @@ -141,9 +141,11 @@ def validate_for_write(models: Any, spec: WriteSpec) -> Any: } ) slot_text = ", ".join(missing) if missing else "(see below)" + row_id = getattr(m, "id", None) + row_hint = f"row {idx}" if row_id is None else f"row {idx} (id={row_id})" raise ValueError( f"{cls.__name__}: missing required_for_write slot(s): " - f"{slot_text}. {err}" + f"{slot_text} at {row_hint}. {err}" ) from err return revalidated if was_iter else revalidated[0] diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py index 3d9847c..1440b7c 100644 --- a/src/connects_common_connectivity/io/writers.py +++ b/src/connects_common_connectivity/io/writers.py @@ -227,7 +227,7 @@ def _dispatch_append_new_by_id( def _resolve_output_root( settings: Settings | None, output_root: str | Path | None, -) -> Path: +) -> tuple[Path, Settings | None]: """Resolve the effective on-disk root for a single write call. Precedence (highest first): @@ -245,8 +245,9 @@ def _resolve_output_root( "full Settings object." 
) if output_root is not None: - return Path(output_root) - return Path((settings or get_settings()).output_root) + return Path(output_root), None + resolved = settings or get_settings() + return Path(resolved.output_root), resolved def write_models( @@ -303,13 +304,22 @@ def write_models( items = list(_validation_hook(items, spec)) - root = _resolve_output_root(settings, output_root) + root, resolved_settings = _resolve_output_root(settings, output_root) + path = root / spec.subdir + + if resolved_settings is not None and resolved_settings.dry_run: + return WriteResult( + class_name=spec.model_cls.__name__, + path=path, + mode=spec.write_mode, + predicates=(), + rows_written=0, + ) + schema = build_arrow_schema(cls) table = models_to_table(items, schema=schema) table = attach_linkml_metadata(table, linkml_class=cls.__name__) - path = root / spec.subdir - if spec.write_mode == "overwrite_scoped": return _dispatch_overwrite_scoped(table, spec, path) if spec.write_mode == "append_new_by_id": diff --git a/tests/conftest.py b/tests/conftest.py new file mode 100644 index 0000000..12b3378 --- /dev/null +++ b/tests/conftest.py @@ -0,0 +1,38 @@ +from __future__ import annotations + +from pathlib import Path + +import polars as pl +import pytest + +import connects_common_connectivity as ccc +from connects_common_connectivity.config import Settings, get_settings + + +@pytest.fixture(autouse=True) +def _isolate_settings(monkeypatch: pytest.MonkeyPatch, tmp_path: Path): + """Each test gets isolated cwd/env and a fresh get_settings cache.""" + monkeypatch.delenv("CCC_OUTPUT_ROOT", raising=False) + monkeypatch.chdir(tmp_path) + get_settings.cache_clear() + yield + get_settings.cache_clear() + + +@pytest.fixture(scope="session") +def models() -> dict: + """Generate pydantic models once per session (expensive).""" + return ccc.generate_pydantic_models() + + +@pytest.fixture +def settings(tmp_path: Path) -> Settings: + return Settings(output_root=tmp_path) + + +@pytest.fixture +def 
read_delta(): + def _read(path: str | Path) -> pl.DataFrame: + return pl.read_delta(str(path)) + + return _read diff --git a/tests/test_basic.py b/tests/test_basic.py index bd1e600..c4220be 100644 --- a/tests/test_basic.py +++ b/tests/test_basic.py @@ -1,3 +1,7 @@ +import pytest +from pydantic import ValidationError + + def test_import(): import connects_common_connectivity as ccc assert ccc.__version__ @@ -14,17 +18,15 @@ def test_model_generation(): def test_required_field_enforcement(): - import pytest import connects_common_connectivity as ccc models = ccc.generate_pydantic_models() DataItem = models["DataItem"] # project_id is required; omitting should raise a validation error - with pytest.raises(Exception): + with pytest.raises(ValidationError, match=r"(?s)project_id.*[Ff]ield required"): DataItem(id="D1", name="Item 1") def test_enum_validation(): - import pytest import connects_common_connectivity as ccc models = ccc.generate_pydantic_models() Modality = models["Modality"] # Enum @@ -34,7 +36,7 @@ def test_enum_validation(): # Depending on dynamic generation, modality may be stored as enum value or raw string assert str(ds.modality) in {Modality.TRACER.value, Modality.TRACER.name, str(Modality.TRACER)} # Invalid modality should raise error now that slot has enum range - with pytest.raises(Exception): + with pytest.raises(ValidationError, match=r"(?s)modality.*Input should be"): DataSet(id="DS2", name="Dataset 2", modality="NOT_A_VALID_MODALITY", project_id="P1") @@ -51,7 +53,6 @@ def test_multivalued_slot_list_type(): def test_probability_bounds_and_pattern(): - import pytest import connects_common_connectivity as ccc models = ccc.generate_pydantic_models() MappingSet = models["MappingSet"] @@ -66,9 +67,8 @@ def test_probability_bounds_and_pattern(): mapping = CellToCellMapping(id="M1", mapping_set=ms.id, source_cell=cell1.id, target_cell=cell2.id, probability=0.5, project_id="P1") assert 0 <= mapping.probability <= 1 # Invalid probability > 1 - with 
pytest.raises(Exception): + with pytest.raises(ValidationError, match=r"(?s)probability.*less than or equal to 1"): CellToCellMapping(id="M2", mapping_set=ms.id, source_cell=cell1.id, target_cell=cell2.id, probability=1.5, project_id="P1") - diff --git a/tests/test_cell_features_schema.py b/tests/test_cell_features_schema.py index 51bae1d..28135b6 100644 --- a/tests/test_cell_features_schema.py +++ b/tests/test_cell_features_schema.py @@ -1,26 +1,19 @@ import pytest from pydantic import ValidationError -import connects_common_connectivity as ccc - - -def _models(): - return ccc.generate_pydantic_models() - - # --------------------------------------------------------------------------- # CellFeatureDefinition # --------------------------------------------------------------------------- -def test_cell_feature_definition_project_id_required(): - CellFeatureDefinition = _models()["CellFeatureDefinition"] +def test_cell_feature_definition_project_id_required(models): + CellFeatureDefinition = models["CellFeatureDefinition"] with pytest.raises(ValidationError, match=r"(?s)project_id.*Field required"): CellFeatureDefinition(id="nucleus_volume_um", description="Nucleus volume") -def test_cell_feature_definition_valid(): - CellFeatureDefinition = _models()["CellFeatureDefinition"] +def test_cell_feature_definition_valid(models): + CellFeatureDefinition = models["CellFeatureDefinition"] cfd = CellFeatureDefinition( id="nucleus_volume_um", description="Nucleus volume in cubic microns", @@ -34,37 +27,37 @@ def test_cell_feature_definition_valid(): assert cfd.range_max is None # optional -def test_cell_feature_definition_range_min_max_optional(): - CellFeatureDefinition = _models()["CellFeatureDefinition"] +def test_cell_feature_definition_range_min_max_optional(models): + CellFeatureDefinition = models["CellFeatureDefinition"] # Both range fields absent — should not raise cfd = CellFeatureDefinition(id="some_feature", project_id="minnie65") assert cfd.range_min is None assert 
cfd.range_max is None -def test_cell_feature_definition_data_type_pattern_valid(): - CellFeatureDefinition = _models()["CellFeatureDefinition"] +def test_cell_feature_definition_data_type_pattern_valid(models): + CellFeatureDefinition = models["CellFeatureDefinition"] for dt in ["f8", "=i4"]: cfd = CellFeatureDefinition(id="feat", data_type=dt, project_id="p1") assert cfd.data_type == dt -def test_cell_feature_definition_data_type_pattern_invalid(): - CellFeatureDefinition = _models()["CellFeatureDefinition"] +def test_cell_feature_definition_data_type_pattern_invalid(models): + CellFeatureDefinition = models["CellFeatureDefinition"] for bad in ["float32", "f4", " subprocess.CompletedProcess[str]: + return subprocess.run( + [sys.executable, "-m", "connects_common_connectivity.cli", *args], + cwd=str(cwd) if cwd else None, + capture_output=True, + text=True, + ) + + +def test_cli_help(): + result = _run_cli("--help") + assert result.returncode == 0 + assert "usage:" in result.stdout.lower() + + +def test_cli_info_shows_version(): + result = _run_cli("info") + assert result.returncode == 0 + assert "Package version:" in result.stdout + + +def test_cli_bundle_happy_path(tmp_path): + out = tmp_path / "connectivity_bundle.tar.gz" + result = _run_cli("bundle", "--output", str(out), cwd=tmp_path) + assert result.returncode == 0 + assert out.exists() + with tarfile.open(out, "r:gz") as tf: + names = tf.getnames() + assert any(name.startswith("schemas/") for name in names) + + +def test_cli_bad_subcommand_exits_nonzero(): + result = _run_cli("not-a-command") + assert result.returncode != 0 + assert "invalid choice" in result.stderr.lower() diff --git a/tests/test_clustering_schema.py b/tests/test_clustering_schema.py index ec30630..0b56ff5 100644 --- a/tests/test_clustering_schema.py +++ b/tests/test_clustering_schema.py @@ -1,32 +1,25 @@ import pytest from pydantic import ValidationError -import connects_common_connectivity as ccc - - -def _models(): - return 
ccc.generate_pydantic_models() - - # --------------------------------------------------------------------------- # Cluster — no longer ProjectScoped (taxonomies are global reference artifacts) # --------------------------------------------------------------------------- -def test_cluster_has_no_project_id_field(): - Cluster = _models()["Cluster"] +def test_cluster_has_no_project_id_field(models): + Cluster = models["Cluster"] assert "project_id" not in Cluster.model_fields -def test_cluster_constructs_without_project_id(): - Cluster = _models()["Cluster"] +def test_cluster_constructs_without_project_id(models): + Cluster = models["Cluster"] cluster = Cluster(id="c1") assert cluster.id == "c1" -def test_cluster_rejects_project_id(): +def test_cluster_rejects_project_id(models): # Pydantic config is extra='forbid', so passing project_id raises rather than silently dropping. - Cluster = _models()["Cluster"] + Cluster = models["Cluster"] with pytest.raises(ValidationError, match=r"(?s)project_id.*Extra inputs are not permitted"): Cluster(id="c1", project_id="visp_patchseq") @@ -36,20 +29,20 @@ def test_cluster_rejects_project_id(): # --------------------------------------------------------------------------- -def test_cluster_membership_project_id_required(): - ClusterMembership = _models()["ClusterMembership"] +def test_cluster_membership_project_id_required(models): + ClusterMembership = models["ClusterMembership"] with pytest.raises(ValidationError, match=r"(?s)project_id.*Field required"): ClusterMembership(item="cell_1", cluster="c1") -def test_cluster_membership_hierarchy_id_optional(): - ClusterMembership = _models()["ClusterMembership"] +def test_cluster_membership_hierarchy_id_optional(models): + ClusterMembership = models["ClusterMembership"] cm = ClusterMembership(item="cell_1", cluster="c1", project_id="visp_patchseq") assert cm.hierarchy_id is None -def test_cluster_membership_hierarchy_id_round_trip(): - ClusterMembership = _models()["ClusterMembership"] 
+def test_cluster_membership_hierarchy_id_round_trip(models): + ClusterMembership = models["ClusterMembership"] cm = ClusterMembership( item="cell_1", cluster="c1", @@ -59,8 +52,8 @@ def test_cluster_membership_hierarchy_id_round_trip(): assert cm.hierarchy_id == "visp_met_types_v1" -def test_cluster_membership_hierarchy_id_must_be_string(): - ClusterMembership = _models()["ClusterMembership"] +def test_cluster_membership_hierarchy_id_must_be_string(models): + ClusterMembership = models["ClusterMembership"] with pytest.raises(ValidationError, match=r"(?s)hierarchy_id.*Input should be a valid string"): ClusterMembership( item="cell_1", @@ -75,27 +68,27 @@ def test_cluster_membership_hierarchy_id_must_be_string(): # --------------------------------------------------------------------------- -def test_cluster_hierarchy_id_optional(): - Cluster = _models()["Cluster"] +def test_cluster_hierarchy_id_optional(models): + Cluster = models["Cluster"] cluster = Cluster(id="c1") assert cluster.hierarchy_id is None -def test_cluster_hierarchy_id_round_trip(): - Cluster = _models()["Cluster"] +def test_cluster_hierarchy_id_round_trip(models): + Cluster = models["Cluster"] cluster = Cluster(id="c1", hierarchy_id="visp_met_types_v1") assert cluster.hierarchy_id == "visp_met_types_v1" -def test_cluster_hierarchy_id_must_be_string(): - Cluster = _models()["Cluster"] +def test_cluster_hierarchy_id_must_be_string(models): + Cluster = models["Cluster"] with pytest.raises(ValidationError, match=r"(?s)hierarchy_id.*Input should be a valid string"): Cluster(id="c1", hierarchy_id=123) -def test_cluster_still_has_no_project_id_after_hierarchy_id_added(): +def test_cluster_still_has_no_project_id_after_hierarchy_id_added(models): # Regression guard: hierarchy_id was added without re-introducing ProjectScoped on Cluster. 
- Cluster = _models()["Cluster"] + Cluster = models["Cluster"] assert "project_id" not in Cluster.model_fields with pytest.raises(ValidationError, match=r"(?s)project_id.*Extra inputs are not permitted"): Cluster(id="c1", project_id="visp_patchseq") @@ -106,39 +99,39 @@ def test_cluster_still_has_no_project_id_after_hierarchy_id_added(): # --------------------------------------------------------------------------- -def test_cluster_hierarchy_constructs_with_id_run_root_clusters(): - ClusterHierarchy = _models()["ClusterHierarchy"] +def test_cluster_hierarchy_constructs_with_id_run_root_clusters(models): + ClusterHierarchy = models["ClusterHierarchy"] h = ClusterHierarchy(id="h1", run="run1", root="root", clusters=["root", "c1"]) assert h.id == "h1" assert h.root == "root" assert h.clusters == ["root", "c1"] -def test_cluster_hierarchy_requires_id(): - ClusterHierarchy = _models()["ClusterHierarchy"] +def test_cluster_hierarchy_requires_id(models): + ClusterHierarchy = models["ClusterHierarchy"] with pytest.raises(ValidationError, match=r"(?s)id.*Field required"): ClusterHierarchy(run="run1", root="root", clusters=["root"]) -def test_algorithm_run_requires_algorithm_name(): - AlgorithmRun = _models()["AlgorithmRun"] +def test_algorithm_run_requires_algorithm_name(models): + AlgorithmRun = models["AlgorithmRun"] with pytest.raises(ValidationError, match=r"(?s)algorithm_name.*Field required"): AlgorithmRun(id="run1") -def test_algorithm_run_constructs_without_input_dataset(): - AlgorithmRun = _models()["AlgorithmRun"] +def test_algorithm_run_constructs_without_input_dataset(models): + AlgorithmRun = models["AlgorithmRun"] run = AlgorithmRun(id="run1", algorithm_name="hierarchical") assert run.input_dataset is None -def test_hierarchy_category_requires_id(): - HierarchyCategory = _models()["HierarchyCategory"] +def test_hierarchy_category_requires_id(models): + HierarchyCategory = models["HierarchyCategory"] with pytest.raises(ValidationError, match=r"(?s)id.*Field 
required"): HierarchyCategory(description="leaf") -def test_hierarchy_category_level_optional(): - HierarchyCategory = _models()["HierarchyCategory"] +def test_hierarchy_category_level_optional(models): + HierarchyCategory = models["HierarchyCategory"] cat = HierarchyCategory(id="cluster") assert cat.level is None diff --git a/tests/test_config.py b/tests/test_config.py index 0ebd243..e65970e 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -6,6 +6,7 @@ from pathlib import Path import pytest +from pydantic import ValidationError from connects_common_connectivity.config import ( CONFIG_FILENAME, Settings, @@ -16,16 +17,6 @@ ) -@pytest.fixture(autouse=True) -def _reset_cache_and_env(monkeypatch, tmp_path): - """Each test runs in an isolated tmp cwd with a cleared cache and no env override.""" - monkeypatch.delenv("CCC_OUTPUT_ROOT", raising=False) - monkeypatch.chdir(tmp_path) - get_settings.cache_clear() - yield - get_settings.cache_clear() - - def _write_config(dir_: Path, **values) -> Path: import yaml @@ -105,14 +96,14 @@ def test_table_path_joins_and_returns_path(tmp_path): def test_output_root_is_required(tmp_path): _write_config(tmp_path, dry_run=False) # missing output_root get_settings.cache_clear() - with pytest.raises(Exception): + with pytest.raises(ValidationError, match=r"(?s)output_root.*[Ff]ield required"): get_settings() def test_unknown_keys_rejected(tmp_path): _write_config(tmp_path, output_root=str(tmp_path), nonsense_key=1) get_settings.cache_clear() - with pytest.raises(Exception): + with pytest.raises(ValidationError, match=r"(?s)[Ee]xtra inputs are not permitted"): get_settings() diff --git a/tests/test_mappings_schema.py b/tests/test_mappings_schema.py index 1e70041..8e026b7 100644 --- a/tests/test_mappings_schema.py +++ b/tests/test_mappings_schema.py @@ -1,21 +1,14 @@ import pytest from pydantic import ValidationError -import connects_common_connectivity as ccc - - -def _models(): - return ccc.generate_pydantic_models() - - # 
--------------------------------------------------------------------------- # MappingSet — source/target endpoints can be DataSet or ClusterHierarchy # --------------------------------------------------------------------------- -def test_mapping_set_dataset_to_dataset(): +def test_mapping_set_dataset_to_dataset(models): # Cell-to-cell shape: source_dataset + target_dataset (back-compat). - MappingSet = _models()["MappingSet"] + MappingSet = models["MappingSet"] ms = MappingSet( id="ms_cell_cell", name="ms_cell_cell", @@ -30,9 +23,9 @@ def test_mapping_set_dataset_to_dataset(): assert ms.target_hierarchy is None -def test_mapping_set_dataset_to_hierarchy(): +def test_mapping_set_dataset_to_hierarchy(models): # Cell-to-cluster shape: source_dataset + target_hierarchy. - MappingSet = _models()["MappingSet"] + MappingSet = models["MappingSet"] ms = MappingSet( id="ms_cell_cluster", name="ms_cell_cluster", @@ -47,9 +40,9 @@ def test_mapping_set_dataset_to_hierarchy(): assert ms.source_hierarchy is None -def test_mapping_set_hierarchy_to_hierarchy(): +def test_mapping_set_hierarchy_to_hierarchy(models): # Cluster-to-cluster shape: source_hierarchy + target_hierarchy. - MappingSet = _models()["MappingSet"] + MappingSet = models["MappingSet"] ms = MappingSet( id="ms_cluster_cluster", name="ms_cluster_cluster", @@ -64,10 +57,10 @@ def test_mapping_set_hierarchy_to_hierarchy(): assert ms.target_hierarchy == "visp_met_types_v1" -def test_mapping_set_endpoints_optional(): +def test_mapping_set_endpoints_optional(models): # All four endpoint slots are optional at the schema level (LinkML can't enforce # "exactly one of"); convention is enforced per-mapping kind. 
- MappingSet = _models()["MappingSet"] + MappingSet = models["MappingSet"] ms = MappingSet( id="ms_minimal", name="ms_minimal", method_name="m", project_id="p1", @@ -78,20 +71,20 @@ def test_mapping_set_endpoints_optional(): assert ms.target_hierarchy is None -def test_mapping_set_method_name_still_required(): - MappingSet = _models()["MappingSet"] +def test_mapping_set_method_name_still_required(models): + MappingSet = models["MappingSet"] with pytest.raises(ValidationError, match=r"(?s)method_name.*Field required"): MappingSet(id="ms1", project_id="p1") -def test_mapping_set_project_id_still_required(): - MappingSet = _models()["MappingSet"] +def test_mapping_set_project_id_still_required(models): + MappingSet = models["MappingSet"] with pytest.raises(ValidationError, match=r"(?s)project_id.*Field required"): MappingSet(id="ms1", method_name="m") -def test_mapping_set_hierarchy_fields_must_be_strings(): - MappingSet = _models()["MappingSet"] +def test_mapping_set_hierarchy_fields_must_be_strings(models): + MappingSet = models["MappingSet"] with pytest.raises(ValidationError, match=r"(?s)target_hierarchy.*Input should be a valid string"): MappingSet( id="ms1", method_name="m", project_id="p1", @@ -104,8 +97,8 @@ def test_mapping_set_hierarchy_fields_must_be_strings(): # --------------------------------------------------------------------------- -def test_cell_to_cluster_mapping_round_trip(): - CellToClusterMapping = _models()["CellToClusterMapping"] +def test_cell_to_cluster_mapping_round_trip(models): + CellToClusterMapping = models["CellToClusterMapping"] m = CellToClusterMapping( id="map_001", mapping_set="ms_cell_cluster", @@ -122,8 +115,8 @@ def test_cell_to_cluster_mapping_round_trip(): assert m.probability == 0.91 -def test_cell_to_cluster_mapping_requires_target_cluster(): - CellToClusterMapping = _models()["CellToClusterMapping"] +def test_cell_to_cluster_mapping_requires_target_cluster(models): + CellToClusterMapping = models["CellToClusterMapping"] with 
pytest.raises(ValidationError, match=r"(?s)target_cluster.*Field required"): CellToClusterMapping( id="map_001", diff --git a/tests/test_parquet_loader.py b/tests/test_parquet_loader.py new file mode 100644 index 0000000..2522c90 --- /dev/null +++ b/tests/test_parquet_loader.py @@ -0,0 +1,59 @@ +from __future__ import annotations + +import pyarrow as pa +import pyarrow.parquet as pq + +from connects_common_connectivity.parquet_loader import load_parquet_to_models + + +def _write_parquet(path, columns: dict[str, list[str]]) -> None: + table = pa.table({name: pa.array(values) for name, values in columns.items()}) + pq.write_table(table, path) + + +def test_load_parquet_to_models_happy_path_dataitem(tmp_path): + parquet_path = tmp_path / "dataitems.parquet" + _write_parquet( + parquet_path, + { + "id": ["d1", "d2"], + "name": ["item-1", "item-2"], + "project_id": ["p1", "p1"], + }, + ) + + instances, report = load_parquet_to_models( + "connectivity_schema.yaml", + "DataItem", + str(parquet_path), + ) + + assert len(instances) == 2 + assert [item.id for item in instances] == ["d1", "d2"] + assert [item.project_id for item in instances] == ["p1", "p1"] + assert report["mapping"]["id"] == "id" + assert report["mapping"]["project_id"] == "project_id" + assert report["counts"]["rows"] == 2 + assert report["counts"]["instances"] == 2 + assert report["counts"]["errors"] == 0 + + +def test_load_parquet_to_models_reports_missing_required_slot(tmp_path): + parquet_path = tmp_path / "missing_project_id.parquet" + _write_parquet( + parquet_path, + { + "id": ["d1"], + "name": ["item-1"], + }, + ) + + instances, report = load_parquet_to_models( + "connectivity_schema.yaml", + "DataItem", + str(parquet_path), + ) + + assert instances == [] + assert report["counts"]["errors"] == 1 + assert any("project_id" in err["message"] for err in report["errors"]) diff --git a/tests/test_projection_schema.py b/tests/test_projection_schema.py index 899d3ba..53c0fa3 100644 --- 
a/tests/test_projection_schema.py +++ b/tests/test_projection_schema.py @@ -1,11 +1,7 @@ import pytest from pydantic import ValidationError -import connects_common_connectivity as ccc - - -def test_laterality_enum(): - models = ccc.generate_pydantic_models() +def test_laterality_enum(models): Laterality = models["Laterality"] assert Laterality.IPSILATERAL.name == "IPSILATERAL" assert Laterality.CONTRALATERAL.name == "CONTRALATERAL" @@ -13,8 +9,7 @@ def test_laterality_enum(): assert Laterality.UNKNOWN.name == "UNKNOWN" -def test_projection_measurement_matrix_laterality(): - models = ccc.generate_pydantic_models() +def test_projection_measurement_matrix_laterality(models): PMM = models["ProjectionMeasurementMatrix"] Laterality = models["Laterality"] Modality = models["Modality"] @@ -32,8 +27,7 @@ def test_projection_measurement_matrix_laterality(): modality=Modality.MORPHOLOGY, laterality="NOT_VALID") -def test_region_coverage_on_pmm(): - models = ccc.generate_pydantic_models() +def test_region_coverage_on_pmm(models): PMM = models["ProjectionMeasurementMatrix"] Laterality = models["Laterality"] Modality = models["Modality"] diff --git a/tests/test_write_relocation.py b/tests/test_write_relocation.py index 8ea6a94..42d7779 100644 --- a/tests/test_write_relocation.py +++ b/tests/test_write_relocation.py @@ -18,6 +18,7 @@ import pytest REPO_ROOT = Path(__file__).resolve().parents[1] +SEARCH_ROOTS = ["src", "tests", "code", "scripts", "planning"] EXCLUDED_DIRS = {".venv", ".git", ".pytest_cache", ".ruff_cache", ".ipynb_checkpoints", ".Trash-0", "node_modules"} @@ -41,13 +42,14 @@ def test_shim_modules_not_importable(): def _iter_source_files(): - for path in REPO_ROOT.rglob("*"): - if not path.is_file(): + for root in SEARCH_ROOTS: + base = REPO_ROOT / root + if not base.exists(): continue - if any(part in EXCLUDED_DIRS for part in path.parts): - continue - if path.suffix in {".py", ".ipynb"}: - yield path + for path in base.rglob("*"): + if path.is_file() and 
path.suffix in {".py", ".ipynb"}: + if not any(part in EXCLUDED_DIRS for part in path.parts): + yield path def test_no_source_references_shim_paths(): diff --git a/tests/test_write_utils.py b/tests/test_write_utils.py index 5424874..29625e4 100644 --- a/tests/test_write_utils.py +++ b/tests/test_write_utils.py @@ -51,7 +51,7 @@ def test_idempotent_partial_rerun(tmp_path): path = str(tmp_path / "dataitem") append_new_dataitems(path, _make_table(["a", "b"]), project_id="proj_a") n = append_new_dataitems(path, _make_table(["a", "b", "c"]), project_id="proj_a") - assert n == 1 # only "c" is new + assert n == 1, f"expected only 'c' to be new; appended {n} rows" # --------------------------------------------------------------------------- diff --git a/tests/test_write_validation.py b/tests/test_write_validation.py index 9083625..74a6727 100644 --- a/tests/test_write_validation.py +++ b/tests/test_write_validation.py @@ -99,6 +99,17 @@ def test_validate_for_write_accepts_a_list(): assert [m.id for m in result] == ["c1", "c2"] +def test_validate_for_write_list_reports_failing_row(): + spec = REGISTRY["Cluster"] + items = [ + Cluster(id="c1", hierarchy_id="h1"), + Cluster(id="c2"), # missing hierarchy_id + ] + with pytest.raises(ValueError, match="hierarchy_id") as ei: + validate_for_write(items, spec) + assert "c2" in str(ei.value), f"error should name failing row; got: {ei.value}" + + def test_validate_for_write_passthrough_when_required_is_empty(): spec = REGISTRY["DataSet"] ds = DataSet(id="d1", name="d", project_id="p1") @@ -128,4 +139,7 @@ def test_write_models_calls_validation_before_io(tmp_path): with pytest.raises(ValueError, match="hierarchy_id"): write_models(bad, settings=settings) # No table directory created — IO never happened. 
- assert not (tmp_path / "cluster").exists() + assert not (tmp_path / "cluster").exists(), ( + "validation failure should short-circuit before any IO; " + "cluster/ directory was created anyway" + ) diff --git a/tests/test_writers.py b/tests/test_writers.py index 8ef8945..2f94a30 100644 --- a/tests/test_writers.py +++ b/tests/test_writers.py @@ -16,6 +16,7 @@ import polars as pl import pyarrow as pa import pytest +from pydantic import BaseModel from connects_common_connectivity.config import Settings from connects_common_connectivity.io.write_spec import REGISTRY @@ -48,20 +49,6 @@ Unit, ) -# --------------------------------------------------------------------------- -# Fixtures -# --------------------------------------------------------------------------- - - -@pytest.fixture -def settings(tmp_path) -> Settings: - return Settings(output_root=tmp_path) - - -def _read(path) -> pl.DataFrame: - return pl.read_delta(str(path)) - - # --------------------------------------------------------------------------- # Predicate construction # --------------------------------------------------------------------------- @@ -78,11 +65,17 @@ def test_build_predicate_format(): ) -def test_build_predicate_escapes_single_quotes(): - assert ( - _build_predicate(["name"], ["O'Hara"]) - == "name = 'O''Hara'" - ) +@pytest.mark.parametrize( + "value,expected_literal", + [ + ("O'Hara", "'O''Hara'"), + ("", "''"), + ("a\\b", "'a\\b'"), + ("café", "'café'"), + ], +) +def test_build_predicate_escapes(value, expected_literal): + assert _build_predicate(["name"], [value]) == f"name = {expected_literal}" # --------------------------------------------------------------------------- @@ -111,7 +104,7 @@ def test_group_by_scope_preserves_first_appearance_order(): # --------------------------------------------------------------------------- -def test_patchseq_regression_two_datasets_same_project(settings): +def test_patchseq_regression_two_datasets_same_project(settings, read_delta): """Two DataSet 
rows with the same ``project_id`` but different ``id`` must coexist. Before W2/W3 the notebooks predicated on ``project_id`` only, so a @@ -123,20 +116,35 @@ def test_patchseq_regression_two_datasets_same_project(settings): write_models(ds_a, settings=settings) write_models(ds_b, settings=settings) - rows = _read(settings.output_root / "dataset") + rows = read_delta(settings.output_root / "dataset") ids = sorted(rows["id"].to_list()) - assert ids == ["visp_exc_patchseq", "visp_inh_patchseq"] + assert ids == ["visp_exc_patchseq", "visp_inh_patchseq"], ( + f"patchseq regression: second write wiped first. " + f"Expected both datasets, got {ids}" + ) -def test_overwrite_scoped_is_idempotent(settings): +def test_overwrite_scoped_is_idempotent(settings, read_delta): ds = DataSet(id="d1", name="example", project_id="p1") write_models(ds, settings=settings) write_models(ds, settings=settings) - rows = _read(settings.output_root / "dataset") - assert rows.shape[0] == 1 + rows = read_delta(settings.output_root / "dataset") + assert rows.shape[0] == 1, f"idempotent rewrite produced {rows.shape[0]} rows" + assert rows["id"].to_list() == ["d1"], "row identity changed across rewrites" + assert rows["name"].to_list() == ["example"], "row content drifted across rewrites" + +def test_dry_run_does_not_write(tmp_path): + settings = Settings(output_root=tmp_path, dry_run=True) + ds = DataSet(id="d1", name="d", project_id="p1") -def test_multi_scope_group_dispatch_yields_one_predicate_per_group(settings): + result = write_models(ds, settings=settings) + + assert result.rows_written == 0, "dry_run must report 0 rows written" + assert not (tmp_path / "dataset").exists(), "dry_run must not create tables" + + +def test_multi_scope_group_dispatch_yields_one_predicate_per_group(settings, read_delta): rows_in = [ DataSet(id="a", name="A", project_id="p1"), DataSet(id="b", name="B", project_id="p1"), @@ -146,7 +154,7 @@ def 
test_multi_scope_group_dispatch_yields_one_predicate_per_group(settings): assert len(result.predicates) == 2 assert result.rows_written == 2 # Both end up in the table. - rows = _read(settings.output_root / "dataset") + rows = read_delta(settings.output_root / "dataset") assert sorted(rows["id"].to_list()) == ["a", "b"] @@ -155,7 +163,7 @@ def test_multi_scope_group_dispatch_yields_one_predicate_per_group(settings): # --------------------------------------------------------------------------- -def test_append_new_by_id_only_appends_unseen(settings): +def test_append_new_by_id_only_appends_unseen(settings, read_delta): items_first = [ DataItem(id="cell_1", name="cell_1", project_id="p1"), DataItem(id="cell_2", name="cell_2", project_id="p1"), @@ -172,7 +180,7 @@ def test_append_new_by_id_only_appends_unseen(settings): r2 = write_models(items_second, settings=settings) assert r2.rows_written == 1 - rows = _read(settings.output_root / "dataitem") + rows = read_delta(settings.output_root / "dataitem") assert sorted(rows["id"].to_list()) == ["cell_1", "cell_2", "cell_3"] @@ -190,79 +198,88 @@ def test_append_new_by_id_rejects_mixed_project_ids(settings): # --------------------------------------------------------------------------- +INSTANCE_FACTORIES = { + DataSet: lambda: DataSet(id="ds1", name="ds", project_id="p1"), + DataItem: lambda: DataItem(id="di1", name="di1", project_id="p1"), + DataItemDataSetAssociation: lambda: DataItemDataSetAssociation( + dataitem_id="di1", dataset_id="ds1", project_id="p1" + ), + Cluster: lambda: Cluster(id="c1", hierarchy_id="h1", level=0), + ClusterHierarchy: lambda: ClusterHierarchy(id="h1", root="c1", clusters=["c1"]), + ClusterMembership: lambda: ClusterMembership( + item="cell_1", cluster="c1", hierarchy_id="h1", project_id="p1" + ), + MappingSet: lambda: MappingSet(id="m1", project_id="p1", name="m", method_name="dummy"), + CellToClusterMapping: lambda: CellToClusterMapping( + id="ctc1", + project_id="p1", + mapping_set="m1", + 
source_cell="cell_1", + target_cluster="c1", + ), + CellFeatureSet: lambda: CellFeatureSet(id="fs1", project_id="p1"), + CellFeatureDefinition: lambda: CellFeatureDefinition( + id="feat_a", + project_id="p1", + feature_set_id="fs1", + data_type="= 1 @@ -271,7 +288,7 @@ def test_round_trip_each_writable_class(cls, settings): # --------------------------------------------------------------------------- -def test_write_projection_matrix_enriches_and_does_not_mutate_input(settings): +def test_write_projection_matrix_enriches_and_does_not_mutate_input(settings, read_delta): pmm = ProjectionMeasurementMatrix( id="pmm_test", measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON, @@ -294,7 +311,7 @@ def test_write_projection_matrix_enriches_and_does_not_mutate_input(settings): assert result.class_name == "ProjectionMeasurementMatrix" assert pmm.region_coverage in (None, []) # input not mutated - rows = _read(settings.output_root / "projectionmeasurementmatrix") + rows = read_delta(settings.output_root / "projectionmeasurementmatrix") coverage = rows.filter(pl.col("id") == "pmm_test")["region_coverage"].to_list()[0] assert list(coverage) == ["VISp", "MOB"] @@ -324,10 +341,18 @@ def test_write_models_rejects_unregistered_class(settings): class NotInRegistry: pass - with pytest.raises(TypeError): + with pytest.raises(TypeError, match="pydantic model or iterable"): write_models(NotInRegistry(), settings=settings) +def test_write_models_rejects_unregistered_pydantic_model(settings): + class UnregisteredModel(BaseModel): + id: str + + with pytest.raises(KeyError, match="UnregisteredModel"): + write_models(UnregisteredModel(id="u1"), settings=settings) + + # --------------------------------------------------------------------------- # Per-call output_root override # --------------------------------------------------------------------------- From 468cf69be767227821ed0694e9219f52638fcd13 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 12:52:45 +0000 Subject: 
[PATCH 20/25] start to clean up documentation and pr message --- PR_message.md | 20 ++++++++++++- planning/ARCHITECTURE.md | 16 ++++++---- planning/TODO.md | 15 ++++++++-- planning/prompts/03_writers.md | 2 +- planning/prompts/_deferred/09_analysis.md | 36 ++++++++++++++++++----- 5 files changed, 70 insertions(+), 19 deletions(-) diff --git a/PR_message.md b/PR_message.md index 60b61cb..48c2e81 100644 --- a/PR_message.md +++ b/PR_message.md @@ -7,4 +7,22 @@ Implemented the full `planning/tests_review/plan.md` sequence (WP1 to WP5) end-t - Added high-signal regression assertion messages where failures are otherwise hard to diagnose. - Added fixture/registry drift guards for writable model coverage and improved list-validation failure reporting to include row context. - Closed remaining coverage gaps by adding tests for CLI behavior, parquet loader contract, predicate escaping edge cases, relocation scan roots, and dry-run semantics. -- Fixed writer behavior so `write_models(..., settings=Settings(..., dry_run=True))` does not write any data and returns `rows_written=0`. +- Migrated all `DataItemDataSetAssociation` hand-rolled `write_deltalake` calls in `etl_visp_inh_patchseq_02`, `etl_visp_inh_patchseq_03`, and `etl_wnm_exc_02` to `write_models(...)`. Every registry-backed class is now exclusively written through `write_models` / `write_projection_matrix` in the ETL notebooks; per-notebook imports trimmed accordingly. +- Flipped `planning/TODO.md` W6 (notebook migration) and W7 (write-side test suite) to done, with explicit deferred carve-outs recorded on W6 (wide cell-feature/projection-matrix parquets, `CellCellConnectivityLong`, the v1dd-01 stub) and a coverage inventory recorded on W7. `uv run pytest -q` → 160 passed. +- Tightened the `CHANGELOG.md` notebook-migration entry to "every registry-backed model" and disclosed the remaining `write_deltalake` carve-outs so the changelog no longer overclaims. 
+- Cleaned up `src/connects_common_connectivity/io/io_plans.md`: shipped `populate_region_coverage` left to live by its docstring; deferred `compare_region_coverage` spec moved into `planning/prompts/_deferred/09_analysis.md`; the source-tree file removed and back-references in `planning/ARCHITECTURE.md` and `planning/prompts/03_writers.md` updated. + +## IO module rollout (W1–W7, per `planning/TODO.md`) + +End-to-end build of the curated `connects_common_connectivity.io` write path, tracked in `planning/TODO.md` and prompted from `planning/prompts/`. + +- **W1 — Config.** Added `connects_common_connectivity.config` with a pydantic `Settings` loaded from a walk-up–discovered `ccc_config.yaml`, a cached `get_settings()`, `table_path()`, and an `output_root()` helper. Relative values anchor at the config file's directory via `os.path.abspath` (avoids Code Ocean's `scratch -> /scratch` symlink). Precedence: explicit arg > `CCC_OUTPUT_ROOT` env > `ccc_config.yaml` > error. Repo-root `ccc_config.yaml` seeded. +- **W2 — Write-spec registry.** Added `io/write_spec.py` with the `WriteSpec` model, a `REGISTRY` of all writable classes, `get_spec()` lookup, and a drift test. `DataSet` scope widened to `(project_id, id)` so patchseq exc/inh `DataSet` rows coexist. +- **W3 — Writers + relocation + registry expansion.** Moved `arrow_utils.py` / `write_utils.py` under `io/`. Built `io/writers.py`: `write_models()` single dispatch over the registry (no per-class wrappers), `WriteResult` frozen dataclass, `WRITABLE_CLASSES` tuple, and the one non-`write_models` writer `write_projection_matrix()` (justified by its non-uniform signature). Added `populate_region_coverage()` in `io/write_utils.py`. Registry grew to 14 entries (added `Cluster`, `ClusterHierarchy`, `ClusterMembership`, `MappingSet`, `CellToClusterMapping`, `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, `ProjectionMeasurementMatrix`, `AlgorithmRun`, `HierarchyCategory`). 
+- **W4 — Public API.** Curated `io/__init__.py` re-exports pinned by `__all__`: `get_settings`, `Settings`, `table_path`, `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES`. +- **W5 — Write validation.** Added `io/write_validation.py`: `strict_model_for(cls)` flips `WriteSpec.required_for_write` slots to non-optional and strips `Optional` from those annotations (cached per class, no mutation of generated `models.py`); `validate_for_write()` re-validates instances and raises `ValueError` naming the missing slots before any IO. Wired into `write_models`. Populated `required_for_write` for `Cluster`, `ClusterMembership`, `CellFeatureDefinition`. +- **W6 — Notebook migration.** Every ETL notebook routes registry-backed writes through `write_models()` / `write_projection_matrix()`; hardcoded `OUTPUT_ROOT = "../scratch/..."` strings replaced with `output_root()`. Patchseq regression covered. W3 re-export shims removed; nothing imports `connects_common_connectivity.arrow_utils` / `write_utils` anymore. Deferred carve-outs: wide cell-feature / projection-matrix parquets, `CellCellConnectivityLong` (no registry entry; `write_cellcellconnectivitylong` stub documents the plan), and the `etl_v1dd_01` cell-12 stub. +- **W7 — Write-side test suite.** Per-class smoke parametrized over `WRITABLE_CLASSES`; no-shim regression (`test_shim_modules_deleted`, `_not_importable`, `_no_source_references_shim_paths`); registry drift; patchseq, idempotency, append-new-by-id, predicate construction, and `output_root=` override coverage in `tests/test_writers.py`; strict-validation failures; public-API surface. `uv run pytest -q` → 160 passed. +- **Other.** Added per-call `output_root=` keyword on `write_models()` and `write_projection_matrix()` (mutually exclusive with `settings=`) so a single notebook can redirect its writes without mutating process-global config. Added `Modality.CALCIUM_IMAGING`. 
Removed deprecated `connects_common_connectivity.arrow_utils` / `write_utils` re-export shims. + +Deferred and unchanged in this PR: readers (L1), read-side analysis + opt-in `check_refs` (L2), and the carve-outs called out under W6. diff --git a/planning/ARCHITECTURE.md b/planning/ARCHITECTURE.md index 1df8221..beb2651 100644 --- a/planning/ARCHITECTURE.md +++ b/planning/ARCHITECTURE.md @@ -28,12 +28,15 @@ This document assumes those and adds the design on top. - `parquet_loader.py` — `load_parquet_to_models(...)` (Parquet → models with a report). - `cli.py` — LinkML `SchemaView`-based full validation (the `ccc` command). Kept as the occasional heavyweight conformance check, **not** on the hot write path. -- `io/io_plans.md` — two pre-existing ideas that are **different concerns** and must land in - different modules (see below): +- `io/io_plans.md` — historical: two pre-existing design notes (now superseded). Both + are referenced below so the design history stays linkable. The source-tree file has + been deleted; what remained relevant moved into `planning/`: - `populate_region_coverage(pmm, matrix)` — derives `region_coverage` from the dense - values **before** a matrix is written → a **write-side transform**. + values **before** a matrix is written → a **write-side transform**. **Shipped** in + `io/write_utils.py`; the file's docstring is now the source of truth. - `compare_region_coverage(pmms)` — summarizes overlap across already-written matrices → - **read/analysis**. + **read/analysis**. **Deferred** to the read-side work; full spec moved to + `planning/prompts/_deferred/09_analysis.md`. ## Target `io/` structure (clean is the goal) @@ -255,7 +258,7 @@ A single dispatch core, no per-class wrappers: caller does the enrichment and then calls `write_models`. 
- `io/write_utils.py` (moved from root): `append_new_dataitems` is the `append_new_by_id` backend; `walk_ancestors` is used by membership/mapping writers; `populate_region_coverage` - (ported from `io_plans.md`) is the pre-write projection helper. `write_projection_matrix` + (ported from the now-deleted `io_plans.md`) is the pre-write projection helper. `write_projection_matrix` calls `populate_region_coverage` (or accepts an already-enriched matrix). Keep it a pure function (no IO, no mutation of input). Generalize `append_new_dataitems` only if needed (e.g. parametrize the partition column) without breaking callers. Rationale: this is write @@ -278,7 +281,8 @@ notebook migration. Once the write path is solid and notebooks are migrated, rev always drop to raw `polars.read_delta`; readers are conveniences, not a wall. When this starts, `parquet_loader.py` is **moved** to `io/parquet_loader.py` (pure move, not folded) and used as the typed-read backend. -- **Read-side analysis**: `compare_region_coverage(pmms)` from `io_plans.md` (shared vs +- **Read-side analysis**: `compare_region_coverage(pmms)` (spec in + `planning/prompts/_deferred/09_analysis.md`) (shared vs exclusive region coverage across matrices) — reads finished data and summarizes. - **Opt-in referential check** (`write_models(..., check_refs=True)`): needs a reader, so it rides with the read-side work. diff --git a/planning/TODO.md b/planning/TODO.md index 3fa1236..28a624f 100644 --- a/planning/TODO.md +++ b/planning/TODO.md @@ -70,9 +70,18 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li class not in the registry; `writers.py` has a `write_cellcellconnectivitylong` stub documenting the migration plan. (c) `etl_v1dd_01_v1196` cell 12 wide-parquet stub (still a `# TODO` placeholder). 
-- [ ] **W7 — Write-side test suite** (`prompts/07_tests.md`) — Drift, patchseq regression, - idempotency, append-new-by-id, predicate construction, per-class example smoke, no-shim - regression, public-API surface. Owns only the gaps not specified by W2/W3/W4/W5. +- [x] **W7 — Write-side test suite** (`prompts/07_tests.md`) — Coverage verified against + the prompt's gap list: (1) per-class smoke for every `WRITABLE_CLASSES` entry via + `tests/test_writers.py::test_round_trip_each_writable_class` (parametrized, auto-covers + the 14 registered classes including post-hoc `AlgorithmRun` / `HierarchyCategory`); + (2) no-shim regression in `tests/test_write_relocation.py` + (`test_shim_modules_deleted`, `test_shim_modules_not_importable`, + `test_no_source_references_shim_paths`); (3) registry drift in + `tests/test_write_spec.py`; (4) patchseq regression / idempotency / append-new-by-id / + predicate construction / per-call `output_root=` override in `tests/test_writers.py`; + (5) strict-validation failures in `tests/test_write_validation.py`; (6) public-API + surface in `tests/test_public_api.py`. Full suite green: `uv run pytest -q` → 160 + passed. - [ ] **W8 — README / usage docs** — Update README for the write API. No prompt; small task. Ask before large edits. diff --git a/planning/prompts/03_writers.md b/planning/prompts/03_writers.md index 6901004..7426912 100644 --- a/planning/prompts/03_writers.md +++ b/planning/prompts/03_writers.md @@ -143,7 +143,7 @@ path turns out to need invariants that don't fit `WriteSpec` cleanly, stop and r before adding a separate function. ## Projection pre-write helper + `write_projection_matrix` -Port `populate_region_coverage(pmm, matrix)` from `io/io_plans.md` into +Port `populate_region_coverage(pmm, matrix)` from the (now-deleted) `io/io_plans.md` into `io/write_utils.py` (write plumbing — same shelf as `append_new_dataitems`, NOT a separate `transforms` module). 
Pure function: derive `region_coverage` from the dense values array, return a NEW `ProjectionMeasurementMatrix` instance (no mutation, no IO). diff --git a/planning/prompts/_deferred/09_analysis.md b/planning/prompts/_deferred/09_analysis.md index 30304d5..88c266c 100644 --- a/planning/prompts/_deferred/09_analysis.md +++ b/planning/prompts/_deferred/09_analysis.md @@ -13,14 +13,34 @@ function = premature module; relocate to `io/analysis.py` only when a second ana function arrives, a pure move with no public-API change). It reads finished data and summarizes; it never writes or mutates inputs. -Port `compare_region_coverage(pmms)` from `io/io_plans.md`: -- Input: list of `ProjectionMeasurementMatrix` instances with `region_index` and - `region_coverage` populated. -- Compute `shared_regions` (intersection of `region_index`), `shared_coverage` - (intersection of `region_coverage`), and, for every non-empty subset of the inputs, the - count of regions exclusively covered by that combination. -- Print the summary table shown in `io_plans.md` and return a dict with keys - `shared_regions`, `shared_coverage`, `exclusive_counts`. +Spec for `compare_region_coverage(pmms) → dict` (moved here from the old +`src/connects_common_connectivity/io/io_plans.md`; source-tree file deleted): + +- **Input:** `pmms` — list of `ProjectionMeasurementMatrix` instances, each with + `region_index` and `region_coverage` populated. (`region_coverage` is produced by + `populate_region_coverage`, already shipped in `io/write_utils.py`.) +- **Computes:** + - `shared_regions`: intersection of all `region_index` across inputs (what regions can + we compare at all?). + - `shared_coverage`: intersection of all `region_coverage` across inputs (where do all + datasets have signal?). 
+ - For every non-empty subset of the input PMMs (powerset, size 1 through N): count of + regions that are in that subset's `region_coverage` intersection but **not** in any + other PMM's `region_coverage` (exclusive to that combination). +- **Prints:** A summary table showing, for each subset combination, how many regions are + exclusively covered by that combination. Example for 3 datasets A, B, C: + ``` + Only in A: 12 + Only in B: 5 + Only in C: 8 + Only in A ∩ B: 3 + Only in A ∩ C: 2 + Only in B ∩ C: 1 + In all (A ∩ B ∩ C): 45 + ``` +- **Returns:** dict with keys `shared_regions`, `shared_coverage`, and + `exclusive_counts` (mapping subset labels to region counts). +- **Properties:** Pure function, no side effects. Does not modify inputs. ## B. Opt-in referential check — `check_refs` This is the home for the referential rule deliberately kept off the hot path in From d2bec9bdad6d0e042f979da602dd840f3625ded8 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 12:58:43 +0000 Subject: [PATCH 21/25] etl example prompt and readme updated --- code/etl_examples_readme.ipynb | 6 +- etl_example_prompt.md | 109 ++++++++++++++------------------- 2 files changed, 49 insertions(+), 66 deletions(-) diff --git a/code/etl_examples_readme.ipynb b/code/etl_examples_readme.ipynb index d4fd6b5..2224376 100644 --- a/code/etl_examples_readme.ipynb +++ b/code/etl_examples_readme.ipynb @@ -8,7 +8,7 @@ "\n", "A quick-reference guide to what was registered and why. Use this notebook to orient yourself before diving into a specific ETL notebook.\n", "\n", - "> **All notebooks live in `code/`.** Outputs land in `../scratch/em_patchseq_wnm_v1/`." + "> **All notebooks live in `code/`.** Outputs land in `../scratch/em_patchseq_wnm_v1/`. Registry-backed model tables are written with `write_models(...)` (projection rows use `write_projection_matrix(...)`)." 
] }, { @@ -123,7 +123,7 @@ "| `etl_tasic_01_cluster.ipynb` | `tasic_2018_visp_taxonomy` | Tasic 2018 VISp scRNA-seq taxonomy (class → subclass → cluster) |\n", "| `etl_visp_met_types_01_cluster.ipynb` | `visp_met_types_taxonomy` | VISp MET-types (class → cluster), 45 leaves |\n", "\n", - "Both write `algorithmrun/`, `clusterhierarchy/`, `cluster/`, and `hierarchycategory/` rows. No `project_id`; rows are scoped by `hierarchy_id` so multiple taxonomies can coexist in the same Delta tables.\n" + "Both write `algorithmrun/`, `clusterhierarchy/`, `cluster/`, and `hierarchycategory/` rows. No `project_id`; `Cluster` rows are scoped by `hierarchy_id`, while the others are id-scoped in the write registry.\n" ] }, { @@ -289,7 +289,7 @@ "\n", "Source: `ProjectionMatrix_tip_and_branch_roll_up.csv`. Cell ids are the SWC filename with `.swc` stripped (matches `_01`).\n", "\n", - "Adds **+4 new cells** found in the projection CSV but not yet in `dataitem/` — the same late-addition pattern as the `_02` notebooks. Registered via `append_new_dataitems`.\n" + "Adds **+4 new cells** found in the projection CSV but not yet in `dataitem/` — the same late-addition pattern as the `_02` notebooks. 
Registered via `write_models(DataItem(...))` (append-new-by-id mode).\n" ] }, { diff --git a/etl_example_prompt.md b/etl_example_prompt.md index 1f56b22..664908c 100644 --- a/etl_example_prompt.md +++ b/etl_example_prompt.md @@ -25,10 +25,11 @@ schemas/single_cell_schema.yaml ### Package utilities (read-only reference) ``` src/connects_common_connectivity/models.py # Pydantic models — read to understand fields -src/connects_common_connectivity/arrow_utils.py # build_arrow_schema, models_to_table, - # attach_linkml_metadata, - # build_cell_feature_matrix_schema -src/connects_common_connectivity/write_utils.py # append_new_dataitems, walk_ancestors +src/connects_common_connectivity/io/arrow_utils.py # build_arrow_schema, models_to_table, + # attach_linkml_metadata, + # build_cell_feature_matrix_schema +src/connects_common_connectivity/io/write_utils.py # walk_ancestors +src/connects_common_connectivity/io/writers.py # write_models, write_projection_matrix ``` ### Example notebooks (read for patterns) @@ -110,45 +111,30 @@ Every notebook follows this cell order. Do not skip sections. ## 5. Write pattern reference -### 5a. `dataitem/` — use `append_new_dataitems` (never overwrite, never plain append) +### 5a. Registry-backed tables — use `write_models` (do not hand-build predicates) ```python -from connects_common_connectivity.write_utils import append_new_dataitems +from connects_common_connectivity.io import write_models -n = append_new_dataitems(OUTPUT_ROOT + "dataitem/", arrow_table, project_id=PROJECT_ID) -print(f"Appended {n} new DataItem rows") +result = write_models(rows, output_root=OUTPUT_ROOT) +print(result.mode, result.rows_written, result.predicates) ``` -- Reads existing `(project_id, id)` pairs, appends only rows whose `id` is new. -- Re-running appends nothing. Two notebooks sharing the same `project_id` do not clobber each other. -- **Never** use `write_deltalake(..., mode="overwrite", predicate="project_id=...")` for `dataitem/`. 
A predicate-scoped overwrite wipes the entire partition, deleting the other dataset's cells. +- `write_models` infers class from `rows`, then applies the registered `subdir`, `partition_by`, scope predicate, and write mode from `io/write_spec.py`. +- For registry tables, notebook code should **not** call `write_deltalake` directly and should not build `predicate=` strings by hand. +- Re-running is idempotent when you pass the full intended scope slice (the standard pattern in current notebooks). -### 5b. All registry tables — `mode="overwrite"` with a two-level predicate +### 5b. `DataItem` writes are id-deduped append via `write_models` ```python -write_deltalake( - OUTPUT_ROOT + "/", arrow_table, - mode="overwrite", - predicate=f"project_id = '{PROJECT_ID}' AND = '{VALUE}'", - partition_by=["project_id"], -) +new_dataitems = [DataItem(id=cid, name=cid, project_id=PROJECT_ID) for cid in new_ids] +written = write_models(new_dataitems, output_root=OUTPUT_ROOT).rows_written +print(f"DataItems appended: {written}") ``` -A **two-level predicate** is required. One level (`project_id`) is not enough when two notebooks share a `project_id` but write different rows to the same table. The second level pins the predicate to exactly the rows this notebook owns. 
- -| Table | Second predicate field | Example value | -|---|---|---| -| `dataset/` | `id` | `dataset_id = 'visp_inh_patchseq'` | -| `dataitem_dataset_association/` | `dataset_id` | `dataset_id = 'visp_inh_patchseq'` | -| `cellfeaturedefinition/` | `feature_set_id` | `feature_set_id = 'inh_visp_morph_features'` | -| `cellfeatureset/` | `id` | `id = 'inh_visp_morph_features'` | -| `cellfeaturematrix/` | `feature_set_id` | `feature_set_id = 'inh_visp_morph_features'` | -| `clustermembership/` | `hierarchy_id` | `hierarchy_id = 'visp_met_types_taxonomy'` | -| `celltoclustermapping/` | `mapping_set` | `mapping_set = 'visp_exc_wnm_mettype_mapping'` | -| `projectionmeasurementmatrix/` | `id` | `id = 'wnm_exc_proj_ipsi'` | -| `cellcellconnectivitylong//` | (folder scopes the example) | — | - -`cellfeaturedefinition/` should also use `partition_by=["project_id", "feature_set_id"]` for query performance. +- `DataItem` dispatches to `append_new_by_id` (id dedupe within one `project_id` per call). +- Re-running with the same ids appends nothing. +- **Do not** use scoped overwrite for `dataitem/`. ### 5c. Wide-form feature parquet — `mode="overwrite"` with predicate on `project_id` @@ -171,37 +157,34 @@ If a feature CSV contains cell ids not present in the `_01` DataItems, register 1. Read `dataitem_dataset_association/` filtered to `project_id AND dataset_id` → collect existing ids. 2. Identify new ids (`set(csv_ids) - set(existing_ids)`). -3. Call `append_new_dataitems` for new `DataItem` rows. -4. Plain `mode="append"` for new `DataItemDataSetAssociation` rows — but only after deduplicating against existing association rows: - ```python - existing_assoc_ids = set(pl.read_delta(...).filter(...)[\"dataitem_id\"]) - truly_new = [a for a in new_assoc if a.dataitem_id not in existing_assoc_ids] - if truly_new: - write_deltalake(..., mode="append", ...) - ``` +3. Call `write_models([...DataItem(...)...], output_root=OUTPUT_ROOT)` for any new cells. +4. 
Re-assert the full `(project_id, dataset_id)` association scope with `write_models([...DataItemDataSetAssociation(...)...])` (pass the full intended set, not append-only deltas). ### 5e. Cluster taxonomy tables (global) -`cluster/`, `clusterhierarchy/`, `algorithmrun/` have **no `project_id`**. Multiple taxonomies coexist in the same Delta table; scope by `hierarchy_id` (or `id` for the hierarchy/run rows themselves). +`cluster/`, `clusterhierarchy/`, `algorithmrun/`, and `hierarchycategory/` have **no `project_id`**. Write through `write_models`; registry scopes are: + +- `Cluster`: `hierarchy_id` +- `ClusterHierarchy`: `id` +- `AlgorithmRun`: `id` +- `HierarchyCategory`: `id` ```python -write_deltalake( - OUTPUT_ROOT + "cluster/", arrow_table, - mode="overwrite", - predicate=f"hierarchy_id = '{HIERARCHY_ID}'", - partition_by=["hierarchy_id"], -) +write_models(cluster_rows, output_root=OUTPUT_ROOT) +write_models([hierarchy_row], output_root=OUTPUT_ROOT) +write_models([run_row], output_root=OUTPUT_ROOT) +write_models(category_rows, output_root=OUTPUT_ROOT) ``` -Use `predicate=f"id = '{HIERARCHY_ID}'"` for the single `clusterhierarchy/` row and `predicate=f"id = '{RUN_ID}'"` for the single `algorithmrun/` row. See `etl_tasic_01_cluster.ipynb` and `etl_visp_met_types_01_cluster.ipynb`. +See `etl_tasic_01_cluster.ipynb` and `etl_visp_met_types_01_cluster.ipynb`. ### 5f. Membership and mapping (project-scoped, per-hierarchy) -- `clustermembership/` — predicate `project_id AND hierarchy_id`, `partition_by=["project_id", "hierarchy_id"]`. -- `celltoclustermapping/` — predicate `project_id AND mapping_set`, `partition_by=["project_id", "mapping_set"]`. -- `mappingset/` — predicate by `id` (one row per named mapping). +- `ClusterMembership` is scoped by `project_id AND hierarchy_id`. +- `CellToClusterMapping` is scoped by `project_id AND mapping_set`. +- `MappingSet` is scoped by `project_id AND id`. 
-When two notebooks merge into the same `(project_id, hierarchy_id)` slice (e.g. exc + inh patch-seq both writing memberships into `(visp_patchseq, visp_met_types_taxonomy)`), each must read the existing slice back, union with the new rows, then overwrite. Re-running either notebook is then idempotent. +When two notebooks merge into the same scoped slice (for example, both patch-seq `_03` notebooks writing memberships for the same `(project_id, hierarchy_id)`), each notebook should write the full intended slice via `write_models(...)`. Re-running either notebook remains idempotent. ### 5g. Cell-cell connectivity (`cellcellconnectivitylong/`) @@ -216,7 +199,7 @@ Predicate `project_id` only; the folder scopes the example. See `etl_minnie_04_c ### 5h. Projection matrix (`projectionmeasurementmatrix/` + wide-form parquet) -One Delta row per matrix; underlying wide table in `projection_/`. Predicate `project_id AND id` for the registry row; predicate `project_id` for the wide-form folder (the folder already scopes to one matrix). See `etl_wnm_exc_04_projection_matrix.ipynb`. +Use `write_projection_matrix(pmm_row, dense_matrix, output_root=OUTPUT_ROOT)` for `ProjectionMeasurementMatrix` rows; it computes `region_coverage` from the dense matrix and delegates to the registry-backed writer. Keep direct `write_deltalake` only for the underlying wide-form `projectionmeasurementmatrix//` parquet folders. See `etl_wnm_exc_04_projection_matrix.ipynb`. ### 5i. Membership vs mapping @@ -229,10 +212,10 @@ If the cells were not in the cohort that defined the taxonomy, write `CellToClus ### 5j. Parent propagation (`walk_ancestors`) -Every membership and mapping is parent-propagated: one row per (cell × ancestor) all the way up to the root. Use `walk_ancestors` from `write_utils.py`: +Every membership and mapping is parent-propagated: one row per (cell × ancestor) all the way up to the root. 
Use `walk_ancestors` from `io.write_utils`: ```python -from connects_common_connectivity.write_utils import walk_ancestors +from connects_common_connectivity.io.write_utils import walk_ancestors for ancestor_id, is_leaf in walk_ancestors(leaf_id, parent_by_child): ... # build one row, set probability/membership_score on the leaf only @@ -245,7 +228,7 @@ for ancestor_id, is_leaf in walk_ancestors(leaf_id, parent_by_child): ## 6. Building arrow tables ```python -from connects_common_connectivity.arrow_utils import ( +from connects_common_connectivity.io.arrow_utils import ( build_arrow_schema, models_to_table, attach_linkml_metadata, @@ -317,10 +300,10 @@ When two projects (different `project_id`) share a feature set (same `feature_se | Mistake | What goes wrong | Correct approach | |---|---|---| -| `write_deltalake(dataitem/, mode="overwrite", predicate="project_id=...")` | Wipes the entire partition, deleting the other dataset's cells | Use `append_new_dataitems` | -| Single-level predicate `project_id` on shared tables | Second notebook wipes first notebook's rows | Always use two-level predicate | -| `mode="append"` on registry tables (dataset, cellfeatureset, etc.) 
| Accumulates duplicate rows on every re-run | Use `mode="overwrite"` with predicate | -| `mode="append"` on association table without dedup check | Accumulates duplicate association rows | Check existing ids before appending | +| Calling `write_deltalake` directly for a registry-backed model table | Notebook-level predicate/partition drift from `io/write_spec.py` | Use `write_models(...)` | +| Hand-building `predicate=` / `partition_by=` for model writes | Scope bugs (row loss or accidental clobber) | Let `write_models` apply the registered scope | +| Writing `DataItem` with overwrite or plain append | Clobbers or duplicates within a project partition | Use `write_models(DataItem(...))` (append_new_by_id) | +| Appending only delta associations in `_02`/`_03` notebooks | Partial reruns can leave missing links | Re-write the full `(project_id, dataset_id)` scoped association slice with `write_models` | | Raw string for enum slot (`modality="MORPHOLOGY"`) | Pydantic validation error | Use `Modality.MORPHOLOGY.value` | | Casting or reformatting id values | Ids won't match across tables | Use ids as-is from the source file | | Editing `models.py` directly | Changes lost on next schema regen | Edit the schema YAML, then regenerate | @@ -328,12 +311,12 @@ When two projects (different `project_id`) share a feature set (same `feature_se | Verifying with `project_id` filter only on a shared table | Asserts pass but row count is wrong (includes other dataset) | Always filter by both `project_id` and `dataset_id` (or `feature_set_id`) | | Positional `models_to_table(rows, ModelClass)` or `attach_linkml_metadata(table, "Cluster")` | Silent schema-construction error, opaque message | Use `schema=` and `linkml_class=` kwargs | | Setting `AlgorithmRun.produced_hierarchies = [hierarchy]` | Pydantic expects an inlined dict, not a list — validation error | Omit it; `ClusterHierarchy.run` carries the inverse link | -| `mode="overwrite"` on `clustermembership/` with predicate on 
`project_id` only | Wipes other hierarchies' rows for the same project | Use two-level predicate: `project_id AND hierarchy_id` | +| Manual overwrite on `clustermembership/` scoped only by `project_id` | Wipes other hierarchies' rows for the same project | Use `write_models` (`ClusterMembership` scope is `project_id AND hierarchy_id`) | | Writing `ClusterMembership` for cells not in the cohort that defined the taxonomy | Misrepresents provenance — they were classified, not members | Use `CellToClusterMapping` + a `MappingSet` row instead | --- ## 11. Known limitations -- **`HierarchyCategory` has no safe global write pattern today.** The table has no `project_id` and no `hierarchy_id` discriminator, and category ids (`class`, `subclass`, `cluster`) are intentionally shared across taxonomies. Predicate-scoped overwrite would clobber sibling taxonomies' rows; plain append collides on `id`. Current `_03` notebooks (`etl_minnie_03`, `etl_visp_met_types_01_cluster`) skip this write and flag a TODO. A global-dedup append helper is the planned fix. +- **`HierarchyCategory` rows are id-scoped global vocabulary rows.** Because ids like `class`, `subclass`, and `cluster` are shared across taxonomies, only write canonical shared definitions (same ids/meaning) via `write_models`. Do not invent taxonomy-specific category ids without a schema-level discriminator. - **`CellCellConnectivityLong` has no `connectome_id` discriminator.** Two example connectomes for the same project must live in separate folders (see §5g). Schema addition would let them share a folder. 
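
Reviewer note on the write semantics in the `etl_example_prompt.md` changes above: the sketch below illustrates the two behaviors the prompt relies on, id-deduped append for `DataItem` and scope-limited overwrite for the other registry tables. It is a minimal in-memory model, not the package's implementation. The helper names `append_new_by_id_sketch` and `scoped_overwrite_sketch`, the `Row` alias, and the list-of-dicts "table" are assumptions made for illustration; the real writers operate on Delta tables through `write_models(...)`.

```python
from typing import Dict, Iterable, List

# Hypothetical stand-in for a table row; the real tables are Delta tables.
Row = Dict[str, str]


def append_new_by_id_sketch(table: List[Row], incoming: Iterable[Row], project_id: str) -> int:
    """Append only rows whose id is new within project_id; return rows appended."""
    seen = {row["id"] for row in table if row["project_id"] == project_id}
    appended = 0
    for row in incoming:
        if row["project_id"] != project_id:
            raise ValueError("this sketch handles one project_id per call")
        if row["id"] in seen:
            continue  # already registered, so a re-run appends nothing
        table.append(row)
        seen.add(row["id"])
        appended += 1
    return appended


def scoped_overwrite_sketch(table: List[Row], incoming: Iterable[Row], scope: Dict[str, str]) -> List[Row]:
    """Replace exactly the rows matching every scope column; keep all others."""
    kept = [row for row in table if any(row.get(col) != val for col, val in scope.items())]
    return kept + list(incoming)


# Id-deduped append: the second call with the same ids is a no-op.
dataitems: List[Row] = [{"project_id": "p", "id": "c1"}]
batch = [{"project_id": "p", "id": "c1"}, {"project_id": "p", "id": "c2"}]
first = append_new_by_id_sketch(dataitems, batch, "p")
second = append_new_by_id_sketch(dataitems, batch, "p")

# Scoped overwrite: a two-column scope leaves the sibling hierarchy untouched,
# while a project_id-only scope would wipe it.
memberships: List[Row] = [
    {"project_id": "p", "hierarchy_id": "h1", "id": "a"},
    {"project_id": "p", "hierarchy_id": "h2", "id": "b"},
]
new_h1 = [{"project_id": "p", "hierarchy_id": "h1", "id": "a2"}]
two_level = scoped_overwrite_sketch(memberships, new_h1, {"project_id": "p", "hierarchy_id": "h1"})
one_level = scoped_overwrite_sketch(memberships, new_h1, {"project_id": "p"})
```

Here `two_level` still contains the `h2` row while `one_level` does not. This is the failure mode that the "Manual overwrite on `clustermembership/` scoped only by `project_id`" row of the mistakes table warns about, and the reason the prompt asks notebooks to pass the full intended slice to `write_models(...)` rather than hand-build predicates.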
From 306d263b7931440222678a536f628430f5bafd47 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 13:13:52 +0000 Subject: [PATCH 22/25] edited readme --- README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 7517cd0..7820126 100644 --- a/README.md +++ b/README.md @@ -16,6 +16,7 @@ The pilot of the Common Connectivity Pilot is focused on developing a framework - Packaged with `pyproject.toml` and intended to be managed via `uv` - BrainRegion ETL example from Parquet (S3/local) via `examples/etl_brain_regions.py` or CLI `ccc etl-brain-regions` - Generic Parquet→LinkML loader utility (`parquet_loader.py`) for any class in the schema +- Curated IO layer (`connects_common_connectivity.io`) for writing generated pydantic models to a shared Delta lake — `write_models(...)` / `write_projection_matrix(...)` dispatched via a `WriteSpec` registry, with output location resolved from `ccc_config.yaml` ## Getting Started (with uv) @@ -147,7 +148,7 @@ Pydantic models; this repository currently favors agility for early design. ## ETL Notebooks -A set of ETL Jupyter notebooks in `code/` registers real datasets into the shared Delta Lake store under `results/em_patchseq_wnm_v1/`. These serve as concrete working examples for every schema class. +A set of ETL Jupyter notebooks in `code/` registers real datasets into a shared Delta Lake store via the `connects_common_connectivity.io` layer (`write_models`, `write_projection_matrix`). The output location is resolved from `ccc_config.yaml` at the repo root (or the `CCC_OUTPUT_ROOT` environment variable), so notebooks do not hard-code a destination path. These serve as concrete working examples for every schema class. - **`code/etl_examples_readme.ipynb`** — markdown-only overview of all registered datasets and feature sets: what each dataset contains, why cell counts differ between sources, and how shared feature sets work across projects. 
Start here if you're new to the data. From 77c32a82ce69627c48ce40c2377d304932e1f12e Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 13:52:44 +0000 Subject: [PATCH 23/25] docstring and header cleanup, pr message drafting cleanup --- PR_message.md | 85 +++++++++++++------ src/connects_common_connectivity/config.py | 6 +- .../io/__init__.py | 2 - .../io/io_plans.md | 36 -------- .../io/write_spec.py | 17 ++-- .../io/write_validation.py | 19 ++--- .../io/writers.py | 19 +++-- 7 files changed, 85 insertions(+), 99 deletions(-) delete mode 100644 src/connects_common_connectivity/io/io_plans.md diff --git a/PR_message.md b/PR_message.md index 48c2e81..6d3a2c8 100644 --- a/PR_message.md +++ b/PR_message.md @@ -1,28 +1,57 @@ -# PR Message - -Implemented the full `planning/tests_review/plan.md` sequence (WP1 to WP5) end-to-end, with package-by-package verification gates in order. - -- Added shared test foundations in `tests/conftest.py` (settings/cache/cwd isolation + shared fixtures) and removed duplicated helpers across tests. -- Tightened exception assertions to specific exception classes with meaningful `match=` checks. -- Added high-signal regression assertion messages where failures are otherwise hard to diagnose. -- Added fixture/registry drift guards for writable model coverage and improved list-validation failure reporting to include row context. -- Closed remaining coverage gaps by adding tests for CLI behavior, parquet loader contract, predicate escaping edge cases, relocation scan roots, and dry-run semantics. -- Migrated all `DataItemDataSetAssociation` hand-rolled `write_deltalake` calls in `etl_visp_inh_patchseq_02`, `etl_visp_inh_patchseq_03`, and `etl_wnm_exc_02` to `write_models(...)`. Every registry-backed class is now exclusively written through `write_models` / `write_projection_matrix` in the ETL notebooks; per-notebook imports trimmed accordingly. 
-- Flipped `planning/TODO.md` W6 (notebook migration) and W7 (write-side test suite) to done, with explicit deferred carve-outs recorded on W6 (wide cell-feature/projection-matrix parquets, `CellCellConnectivityLong`, the v1dd-01 stub) and a coverage inventory recorded on W7. `uv run pytest -q` → 160 passed. -- Tightened the `CHANGELOG.md` notebook-migration entry to "every registry-backed model" and disclosed the remaining `write_deltalake` carve-outs so the changelog no longer overclaims. -- Cleaned up `src/connects_common_connectivity/io/io_plans.md`: shipped `populate_region_coverage` left to live by its docstring; deferred `compare_region_coverage` spec moved into `planning/prompts/_deferred/09_analysis.md`; the source-tree file removed and back-references in `planning/ARCHITECTURE.md` and `planning/prompts/03_writers.md` updated. - -## IO module rollout (W1–W7, per `planning/TODO.md`) - -End-to-end build of the curated `connects_common_connectivity.io` write path, tracked in `planning/TODO.md` and prompted from `planning/prompts/`. - -- **W1 — Config.** Added `connects_common_connectivity.config` with a pydantic `Settings` loaded from a walk-up–discovered `ccc_config.yaml`, a cached `get_settings()`, `table_path()`, and an `output_root()` helper. Relative values anchor at the config file's directory via `os.path.abspath` (avoids Code Ocean's `scratch -> /scratch` symlink). Precedence: explicit arg > `CCC_OUTPUT_ROOT` env > `ccc_config.yaml` > error. Repo-root `ccc_config.yaml` seeded. -- **W2 — Write-spec registry.** Added `io/write_spec.py` with the `WriteSpec` model, a `REGISTRY` of all writable classes, `get_spec()` lookup, and a drift test. `DataSet` scope widened to `(project_id, id)` so patchseq exc/inh `DataSet` rows coexist. -- **W3 — Writers + relocation + registry expansion.** Moved `arrow_utils.py` / `write_utils.py` under `io/`. 
Built `io/writers.py`: `write_models()` single dispatch over the registry (no per-class wrappers), `WriteResult` frozen dataclass, `WRITABLE_CLASSES` tuple, and the one non-`write_models` writer `write_projection_matrix()` (justified by its non-uniform signature). Added `populate_region_coverage()` in `io/write_utils.py`. Registry grew to 14 entries (added `Cluster`, `ClusterHierarchy`, `ClusterMembership`, `MappingSet`, `CellToClusterMapping`, `CellFeatureSet`, `CellFeatureDefinition`, `CellFeatureMatrix`, `ProjectionMeasurementMatrix`, `AlgorithmRun`, `HierarchyCategory`). -- **W4 — Public API.** Curated `io/__init__.py` re-exports pinned by `__all__`: `get_settings`, `Settings`, `table_path`, `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES`. -- **W5 — Write validation.** Added `io/write_validation.py`: `strict_model_for(cls)` flips `WriteSpec.required_for_write` slots to non-optional and strips `Optional` from those annotations (cached per class, no mutation of generated `models.py`); `validate_for_write()` re-validates instances and raises `ValueError` naming the missing slots before any IO. Wired into `write_models`. Populated `required_for_write` for `Cluster`, `ClusterMembership`, `CellFeatureDefinition`. -- **W6 — Notebook migration.** Every ETL notebook routes registry-backed writes through `write_models()` / `write_projection_matrix()`; hardcoded `OUTPUT_ROOT = "../scratch/..."` strings replaced with `output_root()`. Patchseq regression covered. W3 re-export shims removed; nothing imports `connects_common_connectivity.arrow_utils` / `write_utils` anymore. Deferred carve-outs: wide cell-feature / projection-matrix parquets, `CellCellConnectivityLong` (no registry entry; `write_cellcellconnectivitylong` stub documents the plan), and the `etl_v1dd_01` cell-12 stub. 
-- **W7 — Write-side test suite.** Per-class smoke parametrized over `WRITABLE_CLASSES`; no-shim regression (`test_shim_modules_deleted`, `_not_importable`, `_no_source_references_shim_paths`); registry drift; patchseq, idempotency, append-new-by-id, predicate construction, and `output_root=` override coverage in `tests/test_writers.py`; strict-validation failures; public-API surface. `uv run pytest -q` → 160 passed. -- **Other.** Added per-call `output_root=` keyword on `write_models()` and `write_projection_matrix()` (mutually exclusive with `settings=`) so a single notebook can redirect its writes without mutating process-global config. Added `Modality.CALCIUM_IMAGING`. Removed deprecated `connects_common_connectivity.arrow_utils` / `write_utils` re-export shims. - -Deferred and unchanged in this PR: readers (L1), read-side analysis + opt-in `check_refs` (L2), and the carve-outs called out under W6. +# IO layer: write path + validation + +Ships the curated `connects_common_connectivity.io` write path end-to-end: package-wide configuration, a registry-driven write API, write-time validation derived from that same registry, ETL notebook migration to the new API, and the test suite to back it. + +## Design: WriteSpec as the single source of truth + +The `WriteSpec` registered per writable class is one declaration that drives both Delta dispatch (subdir, partitioning, scope columns, write mode) and write-time validation (`required_for_write` slots are flipped non-optional in auto-derived strict submodels and re-validated before any IO). Generated `models.py` is never touched. + +## Configuration + +- New `connects_common_connectivity.config`: pydantic `Settings`, cached `get_settings()`, walk-up discovery of `ccc_config.yaml`, plus `output_root()` / `table_path()` helpers. Relative values anchor at the config file's directory via `os.path.abspath` (avoids Code Ocean's `scratch -> /scratch` symlink). 
+- Precedence: explicit arg > `CCC_OUTPUT_ROOT` env > `ccc_config.yaml` > error. +- Repo-root `ccc_config.yaml` seeded. + +## Write registry and dispatch + +- `io/write_spec.py`: `WriteSpec`, `REGISTRY` (14 entries), `get_spec()`. +- `io/writers.py`: `write_models()` single-dispatch over the registry (no per-class wrappers), frozen `WriteResult` dataclass, `WRITABLE_CLASSES` tuple. `write_projection_matrix()` is the only non-`write_models` writer, justified by its non-uniform signature (dense matrix + model). +- `populate_region_coverage()` added in `io/write_utils.py`; derives `region_coverage` from the dense values before write. +- `DataSet` scope widened to `(project_id, id)` so patchseq exc/inh `DataSet` rows coexist (today's predicate-only-on-`project_id` behavior would overwrite one with the other). + +## Write-time validation + +- `io/write_validation.py`: `strict_model_for(cls)` flips `WriteSpec.required_for_write` slots to non-optional and strips `Optional` from those annotations (cached per class, no mutation of generated `models.py`). `validate_for_write()` re-validates instances and raises `ValueError` naming the missing slots before any IO. Wired into `write_models`. +- `required_for_write` populated for `Cluster`, `ClusterMembership`, `CellFeatureDefinition`. + +## Public API surface + +- Curated `io/__init__.py` re-exports pinned by `__all__`: `get_settings`, `Settings`, `table_path`, `write_models`, `write_projection_matrix`, `WriteResult`, `WRITABLE_CLASSES`. +- Per-call `output_root=` keyword on `write_models()` / `write_projection_matrix()` (mutually exclusive with `settings=`) so a single notebook can redirect its writes without mutating process-global config. +- `Modality.CALCIUM_IMAGING` added (for functional correlations in microns or v1dd-like datasets with EM + CI experiments). 
+- Removed `connects_common_connectivity.arrow_utils` / `connects_common_connectivity.write_utils` re-export shims; `arrow_utils.py` and `write_utils.py` now live exclusively under `io/`. + +## ETL notebook migration + +- Every registry-backed class is now exclusively written through `write_models` / `write_projection_matrix` in the ETL notebooks. Hand-rolled `write_deltalake` migrated. Per-notebook imports trimmed. +- Hardcoded `OUTPUT_ROOT = "../scratch/..."` strings replaced with `output_root()`. +- Patchseq exc/inh regression covered (see `DataSet` scope fix above). + +## Tests + +- Shared `tests/conftest.py` foundations (settings/cache/cwd isolation + shared fixtures); duplicated helpers removed. +- Tightened exception assertions to specific classes with meaningful `match=` checks. +- High-signal regression assertion messages where failures are otherwise hard to diagnose; list-validation failures now include row context. +- Per-class smoke parametrized over `WRITABLE_CLASSES`; registry-drift guard; no-shim regression (`test_shim_modules_deleted`, `_not_importable`, `_no_source_references_shim_paths`). +- Closed coverage gaps: CLI behavior, parquet loader contract, predicate escaping edge cases, relocation scan roots, dry-run semantics. +- Patchseq regression, idempotency, append-new-by-id, predicate construction, `output_root=` override, strict-validation failures, public-API surface. + +## Not in this PR + +- Wide cell-feature / projection-matrix parquet writes (still use `write_deltalake` directly). +- `CellCellConnectivityLong` — no registry entry yet; the `write_cellcellconnectivitylong` stub in `io/writers.py` documents the migration plan. +- The `etl_v1dd_01` new dataset ingestion prototype ongoing in parallel. + +## Verification + +`uv run pytest -q` → 160 passed. 
diff --git a/src/connects_common_connectivity/config.py b/src/connects_common_connectivity/config.py index f31510c..da5c9e4 100644 --- a/src/connects_common_connectivity/config.py +++ b/src/connects_common_connectivity/config.py @@ -148,9 +148,9 @@ def output_root(settings: Optional[Settings] = None, *, absolute: bool = False) ``code/`` and a script running at the repo root both point at the same place. By default this function then returns the path **relative to the current working directory**, so a notebook in ``code/`` sees - ``"../scratch/em_patchseq_wnm_v1/"`` while a process at the repo root - sees ``"scratch/em_patchseq_wnm_v1/"``. Pass ``absolute=True`` to get the - fully resolved absolute path instead. + ``"../scratch//"`` while a process at the repo root sees + ``"scratch//"``. Pass ``absolute=True`` to get the fully + resolved absolute path instead. Prefer :func:`table_path` for new code — it returns a typed :class:`Path` for a named table subdir and is cwd-independent. diff --git a/src/connects_common_connectivity/io/__init__.py b/src/connects_common_connectivity/io/__init__.py index 93e7759..1ee5ba6 100644 --- a/src/connects_common_connectivity/io/__init__.py +++ b/src/connects_common_connectivity/io/__init__.py @@ -24,8 +24,6 @@ write_projection_matrix, ) -# TODO(W8): reader exports - __all__ = [ "get_settings", "Settings", diff --git a/src/connects_common_connectivity/io/io_plans.md b/src/connects_common_connectivity/io/io_plans.md deleted file mode 100644 index a5dee65..0000000 --- a/src/connects_common_connectivity/io/io_plans.md +++ /dev/null @@ -1,36 +0,0 @@ -# IO Utility Functions — Plans - -## `populate_region_coverage(pmm, matrix) → ProjectionMeasurementMatrix` - -Automatically populates the `region_coverage` field on a `ProjectionMeasurementMatrix` from the dense values array. - -- **Input:** - - `pmm`: a `ProjectionMeasurementMatrix` instance with `region_index` already set. 
- - `matrix`: dense numeric array of shape `(len(data_item_index), len(region_index))` — numpy ndarray or similar. -- **Logic:** For each column index `i`, check `any(matrix[:, i] != 0)`. Collect the corresponding `pmm.region_index[i]` entries where the column has at least one non-zero value. -- **Output:** Returns a copy of `pmm` with `region_coverage` set to the non-zero-column subset of `region_index`. -- **Properties:** Pure function, no side effects. Does not modify the input `pmm`. - ---- - -## `compare_region_coverage(pmms) → dict` - -Compares region index and region coverage across multiple `ProjectionMeasurementMatrix` instances. Answers: "which regions are shared, and which are exclusive to specific dataset combinations?" - -- **Input:** - - `pmms`: list of `ProjectionMeasurementMatrix` instances, each with `region_index` and `region_coverage` populated. -- **Computes:** - - `shared_regions`: intersection of all `region_index` across inputs (what regions can we compare at all?). - - `shared_coverage`: intersection of all `region_coverage` across inputs (where do all datasets have signal?). - - For every non-empty subset of the input PMMs (powerset, size 1 through N): count of regions that are in that subset's `region_coverage` intersection but **not** in any other PMM's `region_coverage` (exclusive to that combination). -- **Prints:** A summary table showing, for each subset combination, how many regions are exclusively covered by that combination. Example for 3 datasets A, B, C: - ``` - Only in A: 12 - Only in B: 5 - Only in C: 8 - Only in A ∩ B: 3 - Only in A ∩ C: 2 - Only in B ∩ C: 1 - In all (A ∩ B ∩ C): 45 - ``` -- **Returns:** dict with keys `shared_regions`, `shared_coverage`, and `exclusive_counts` (mapping subset labels to region counts). 
diff --git a/src/connects_common_connectivity/io/write_spec.py b/src/connects_common_connectivity/io/write_spec.py index 34d4c09..5a325b5 100644 --- a/src/connects_common_connectivity/io/write_spec.py +++ b/src/connects_common_connectivity/io/write_spec.py @@ -2,12 +2,9 @@ A :class:`WriteSpec` describes how a generated pydantic model is persisted into the shared Delta lake: which subdirectory, which partition columns, which scope -columns, and which write mode the backend should dispatch on. - -Only the seed entries needed to unblock W3 are registered here -(``DataSet``, ``DataItem``, ``DataItemDataSetAssociation``). Additional classes -are added in W3 as their writers are prototyped — see -``planning/prompts/03_writers.md``. +columns, and which write mode the backend should dispatch on. :data:`REGISTRY` +is the source of truth for which classes are writable; add an entry here to +make a new class writable through :func:`write_models`. """ from __future__ import annotations @@ -53,9 +50,9 @@ class WriteSpec(BaseModel): model_cls=DataSet, subdir="dataset", partition_by=["project_id"], - # patchseq fix: today's notebooks predicate only on project_id, which - # is why visp_inh_patchseq overwrites visp_exc_patchseq. Scoping on - # (project_id, id) keeps each DataSet row independent. + # Scoped on (project_id, id) so DataSet rows from sibling notebooks + # sharing a project_id (e.g. patchseq exc/inh) do not overwrite each + # other. scope_columns=["project_id", "id"], write_mode="overwrite_scoped", ), @@ -140,7 +137,7 @@ class WriteSpec(BaseModel): # the wide-form numeric Parquet at ``cellfeatures/{feature_set_id}/`` # is built from raw dataframes in the notebook, not from a model # instance, so it does not flow through ``write_models`` and stays - # outside the registry. See planning/prompts/03_writers.md report. + # outside the registry. 
write_mode="overwrite_scoped", ), "ProjectionMeasurementMatrix": WriteSpec( diff --git a/src/connects_common_connectivity/io/write_validation.py b/src/connects_common_connectivity/io/write_validation.py index 1963d44..78f68b3 100644 --- a/src/connects_common_connectivity/io/write_validation.py +++ b/src/connects_common_connectivity/io/write_validation.py @@ -6,13 +6,11 @@ contexts, but the *write* path needs them concretely (e.g. the predicate columns, the partition columns, the id used for dedupe). -W2's :class:`WriteSpec` records this in ``required_for_write``. This -module turns that list into a real check by deriving a strict pydantic -subclass of the generated model — runtime-only, never mutating -``models.py`` — and re-validating each instance through it before any IO. - -The CLI's LinkML-conformance check is a different beast (whole-schema, -generic, no registry). The two intentionally do not share code. +The :class:`WriteSpec` for each writable class records this in +``required_for_write``. This module turns that list into a real check by +deriving a strict pydantic subclass of the generated model — +runtime-only, never mutating ``models.py`` — and re-validating each +instance through it before any IO. """ from __future__ import annotations @@ -107,10 +105,9 @@ def _coerce_iterable(models: Any) -> tuple[bool, list[BaseModel]]: def validate_for_write(models: Any, spec: WriteSpec) -> Any: """Re-validate ``models`` through the strict submodel for ``spec.model_cls``. - Same shape contract as the W3 ``_validation_hook``: a single instance - in returns a single instance out; an iterable in returns a list out. - No I/O. Pydantic-only. On failure, raises :class:`ValueError` naming - the class and the failing slot. + Single instance in returns a single instance out; an iterable in + returns a list out. No I/O. Pydantic-only. On failure, raises + :class:`ValueError` naming the class and the failing slot. 
""" was_iter, items = _coerce_iterable(models) if not items: diff --git a/src/connects_common_connectivity/io/writers.py b/src/connects_common_connectivity/io/writers.py index 1440b7c..4d09ed0 100644 --- a/src/connects_common_connectivity/io/writers.py +++ b/src/connects_common_connectivity/io/writers.py @@ -59,12 +59,12 @@ class WriteResult: # --------------------------------------------------------------------------- -# Validation hook (replaced by W5) +# Validation hook # --------------------------------------------------------------------------- def _validation_hook(models: Sequence[BaseModel], spec: WriteSpec) -> Sequence[BaseModel]: - """Strict re-validation against ``spec.required_for_write`` (W5). + """Strict re-validation against ``spec.required_for_write``. Identity-shaped: takes a sequence in, returns a sequence out. Pure pydantic; no I/O. @@ -357,19 +357,20 @@ def write_cellcellconnectivitylong( ) -> WriteResult: """Placeholder writer for ``CellCellConnectivityLong`` rows. - TODO: ``CellCellConnectivityLong`` is not yet in the WriteSpec REGISTRY, - and the existing ETL notebooks (``etl_minnie_04_cell_cell.ipynb``, + Not implemented. ``CellCellConnectivityLong`` has no ``WriteSpec`` entry + yet, and the existing ETL notebooks (``etl_minnie_04_cell_cell.ipynb``, ``parse_minnie_clustering.ipynb``) write to non-canonical, run-specific subdirs (e.g. ``cellcellconnectivitylong_proofread_pre_to_csm_post/``) rather than the canonical ``cellcellconnectivitylong/`` subdir that - ``write_models`` would resolve. Until we either (a) consolidate those - callers onto the canonical subdir and add a ``WriteSpec``, or (b) extend - the dispatch to accept a per-call subdir override, those notebooks keep - using ``write_deltalake`` directly. This stub exists as a reminder. + ``write_models`` would resolve. 
Until either (a) those callers + consolidate onto the canonical subdir and a ``WriteSpec`` is added, or + (b) dispatch is extended to accept a per-call subdir override, those + notebooks keep using ``write_deltalake`` directly. This stub exists as + a reminder of that open work. """ raise NotImplementedError( "write_cellcellconnectivitylong is not implemented yet; " - "see writers.py docstring for migration plan." + "see the docstring for the migration plan." ) From 49c821acbab411781ea45ca40a6848d22d7e9026 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 14:09:47 +0000 Subject: [PATCH 24/25] arrange past prompts planning as docu, changelog structure --- CHANGELOG.md | 72 ++++++++++++------- planning/{ => 20260623}/ARCHITECTURE.md | 0 .../20260623/PR_message.md | 0 planning/{ => 20260623}/README.md | 0 planning/{ => 20260623}/TODO.md | 2 +- .../prompts/00_shared_context.md | 0 planning/{ => 20260623}/prompts/01_config.md | 0 .../{ => 20260623}/prompts/02_write_spec.md | 0 planning/{ => 20260623}/prompts/03_writers.md | 0 .../{ => 20260623}/prompts/04_public_api.md | 0 .../{ => 20260623}/prompts/05_validation.md | 0 .../prompts/06_notebook_migration.md | 0 planning/{ => 20260623}/prompts/07_tests.md | 0 .../prompts/_deferred/08_readers.md | 0 .../prompts/_deferred/09_analysis.md | 0 .../{ => 20260623}/tests_review/README.md | 0 .../{ => 20260623}/tests_review/findings.md | 0 planning/{ => 20260623}/tests_review/plan.md | 0 .../etl_v1dd_01_v1196_temp_prompt.md | 0 pyproject.toml | 2 +- 20 files changed, 48 insertions(+), 28 deletions(-) rename planning/{ => 20260623}/ARCHITECTURE.md (100%) rename PR_message.md => planning/20260623/PR_message.md (100%) rename planning/{ => 20260623}/README.md (100%) rename planning/{ => 20260623}/TODO.md (99%) rename planning/{ => 20260623}/prompts/00_shared_context.md (100%) rename planning/{ => 20260623}/prompts/01_config.md (100%) rename planning/{ => 20260623}/prompts/02_write_spec.md (100%) rename planning/{ => 
20260623}/prompts/03_writers.md (100%) rename planning/{ => 20260623}/prompts/04_public_api.md (100%) rename planning/{ => 20260623}/prompts/05_validation.md (100%) rename planning/{ => 20260623}/prompts/06_notebook_migration.md (100%) rename planning/{ => 20260623}/prompts/07_tests.md (100%) rename planning/{ => 20260623}/prompts/_deferred/08_readers.md (100%) rename planning/{ => 20260623}/prompts/_deferred/09_analysis.md (100%) rename planning/{ => 20260623}/tests_review/README.md (100%) rename planning/{ => 20260623}/tests_review/findings.md (100%) rename planning/{ => 20260623}/tests_review/plan.md (100%) rename etl_v1dd_01_v1196_temp_prompt.md => planning/etl_v1dd_01_v1196_temp_prompt.md (100%) diff --git a/CHANGELOG.md b/CHANGELOG.md index 8a835a1..0d8c810 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,8 +9,42 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Added -- Added `CALCIUM_IMAGING` value to the `Modality` enum for calcium imaging - based functional correlations. +### Changed + +### Deprecated + +### Removed + +### Fixed + +### Security + +## [0.2.0] - 2026-06-23 + +### Added + +- Added `connects_common_connectivity.config` with `Settings`, + `get_settings()`, `find_config_file()`, `output_root()`, and + `table_path()`. Settings are discovered from a `ccc_config.yaml` at (or + above) the cwd; `CCC_OUTPUT_ROOT` overrides `output_root`. Relative + `output_root` values are anchored at the config file's directory so a + notebook in `code/` and a script at the repo root resolve to the same + place. +- Added curated public API at `connects_common_connectivity.io`: + `write_models()` (single dispatch core for all generated pydantic + models), `write_projection_matrix()`, `WriteResult`, + `WRITABLE_CLASSES`, and re-exports of `get_settings`, `Settings`, and + `table_path`. The surface is pinned by `__all__`. 
+- Added write-time validation: `write_models()` now re-validates each + model through a runtime-derived strict subclass that flips + `WriteSpec.required_for_write` slots to non-optional, raising + `ValueError` before any IO if a write-required slot is missing or + `None`. Public helpers `strict_model_for()` and `validate_for_write()` + live in `connects_common_connectivity.io.write_validation`. +- Added `WriteSpec` registry entries for `AlgorithmRun` and + `HierarchyCategory` (both project-agnostic, scope=`["id"]`, + `overwrite_scoped`). These classes are now writable through + `write_models(...)` and surface in `WRITABLE_CLASSES`. - Added an `output_root=` keyword to `write_models()` and `write_projection_matrix()` for per-call overrides of the on-disk root. Accepts a `str` or `Path` and writes to `//`, @@ -18,26 +52,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 `settings=` (passing both raises `TypeError`). Lets a single notebook redirect its writes (e.g. an isolated test dataset) without mutating process-global config or environment variables. -- Added `WriteSpec` registry entries for `AlgorithmRun` and - `HierarchyCategory` (both project-agnostic, scope=`["id"]`, - `overwrite_scoped`). These classes are now writable through - `write_models(...)` and surface in `WRITABLE_CLASSES`. -- Added write-time validation: `write_models()` now re-validates each - model through a runtime-derived strict subclass that flips - `WriteSpec.required_for_write` slots to non-optional, raising - `ValueError` before any IO if a write-required slot is missing or - `None`. Public helpers `strict_model_for()` and `validate_for_write()` - live in `connects_common_connectivity.io.write_validation`. -- Added curated public API at `connects_common_connectivity.io`: imports - for `get_settings`, `Settings`, `table_path`, `write_models`, - `write_projection_matrix`, `WriteResult`, and `WRITABLE_CLASSES` are - now stable and pinned by `__all__`. 
-- Added `connects_common_connectivity.io.writers` with `write_models()` (the - single dispatch core for all generated pydantic models), - `write_projection_matrix()`, `WriteResult`, and `WRITABLE_CLASSES`. - Added `populate_region_coverage()` in `connects_common_connectivity.io.write_utils` for deriving `ProjectionMeasurementMatrix.region_coverage` from a dense matrix. +- Added `CALCIUM_IMAGING` value to the `Modality` enum for calcium + imaging based functional correlations. ### Changed @@ -49,11 +68,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 (and `write_projection_matrix(...)` for projection-matrix metadata rows). Wide cell-feature / projection-matrix parquets and `CellCellConnectivityLong` writes remain on raw `write_deltalake` pending registry support. -- Moved `arrow_utils` and `write_utils` under +- Moved `connects_common_connectivity.arrow_utils` and + `connects_common_connectivity.write_utils` under `connects_common_connectivity.io.*`. -### Deprecated - ### Removed - Removed the deprecated re-export shims @@ -64,7 +82,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Fixed -- Fixed `write_models()` to honor `Settings.dry_run=True`: writes are now skipped, - `rows_written` is reported as `0`, and no Delta table directories are created. - -### Security +- Fixed `DataSet` writes to scope on `(project_id, id)` instead of + `project_id` alone, so sibling notebooks sharing a `project_id` (e.g. + patchseq exc/inh) no longer overwrite each other's `DataSet` rows. +- Fixed `write_models()` to honor `Settings.dry_run=True`: writes are now + skipped, `rows_written` is reported as `0`, and no Delta table + directories are created. 
diff --git a/planning/ARCHITECTURE.md b/planning/20260623/ARCHITECTURE.md similarity index 100% rename from planning/ARCHITECTURE.md rename to planning/20260623/ARCHITECTURE.md diff --git a/PR_message.md b/planning/20260623/PR_message.md similarity index 100% rename from PR_message.md rename to planning/20260623/PR_message.md diff --git a/planning/README.md b/planning/20260623/README.md similarity index 100% rename from planning/README.md rename to planning/20260623/README.md diff --git a/planning/TODO.md b/planning/20260623/TODO.md similarity index 99% rename from planning/TODO.md rename to planning/20260623/TODO.md index 28a624f..17e96e0 100644 --- a/planning/TODO.md +++ b/planning/20260623/TODO.md @@ -82,7 +82,7 @@ Flat, ordered list. One row per prompt; sub-tasks live in the prompts. Design li (5) strict-validation failures in `tests/test_write_validation.py`; (6) public-API surface in `tests/test_public_api.py`. Full suite green: `uv run pytest -q` → 160 passed. -- [ ] **W8 — README / usage docs** — Update README for the write API. No prompt; small task. +- [x] **W8 — README / usage docs** — Update README for the write API. No prompt; small task. Ask before large edits. 
## Deferred (do not start; design kept for reference) diff --git a/planning/prompts/00_shared_context.md b/planning/20260623/prompts/00_shared_context.md similarity index 100% rename from planning/prompts/00_shared_context.md rename to planning/20260623/prompts/00_shared_context.md diff --git a/planning/prompts/01_config.md b/planning/20260623/prompts/01_config.md similarity index 100% rename from planning/prompts/01_config.md rename to planning/20260623/prompts/01_config.md diff --git a/planning/prompts/02_write_spec.md b/planning/20260623/prompts/02_write_spec.md similarity index 100% rename from planning/prompts/02_write_spec.md rename to planning/20260623/prompts/02_write_spec.md diff --git a/planning/prompts/03_writers.md b/planning/20260623/prompts/03_writers.md similarity index 100% rename from planning/prompts/03_writers.md rename to planning/20260623/prompts/03_writers.md diff --git a/planning/prompts/04_public_api.md b/planning/20260623/prompts/04_public_api.md similarity index 100% rename from planning/prompts/04_public_api.md rename to planning/20260623/prompts/04_public_api.md diff --git a/planning/prompts/05_validation.md b/planning/20260623/prompts/05_validation.md similarity index 100% rename from planning/prompts/05_validation.md rename to planning/20260623/prompts/05_validation.md diff --git a/planning/prompts/06_notebook_migration.md b/planning/20260623/prompts/06_notebook_migration.md similarity index 100% rename from planning/prompts/06_notebook_migration.md rename to planning/20260623/prompts/06_notebook_migration.md diff --git a/planning/prompts/07_tests.md b/planning/20260623/prompts/07_tests.md similarity index 100% rename from planning/prompts/07_tests.md rename to planning/20260623/prompts/07_tests.md diff --git a/planning/prompts/_deferred/08_readers.md b/planning/20260623/prompts/_deferred/08_readers.md similarity index 100% rename from planning/prompts/_deferred/08_readers.md rename to planning/20260623/prompts/_deferred/08_readers.md 
diff --git a/planning/prompts/_deferred/09_analysis.md b/planning/20260623/prompts/_deferred/09_analysis.md similarity index 100% rename from planning/prompts/_deferred/09_analysis.md rename to planning/20260623/prompts/_deferred/09_analysis.md diff --git a/planning/tests_review/README.md b/planning/20260623/tests_review/README.md similarity index 100% rename from planning/tests_review/README.md rename to planning/20260623/tests_review/README.md diff --git a/planning/tests_review/findings.md b/planning/20260623/tests_review/findings.md similarity index 100% rename from planning/tests_review/findings.md rename to planning/20260623/tests_review/findings.md diff --git a/planning/tests_review/plan.md b/planning/20260623/tests_review/plan.md similarity index 100% rename from planning/tests_review/plan.md rename to planning/20260623/tests_review/plan.md diff --git a/etl_v1dd_01_v1196_temp_prompt.md b/planning/etl_v1dd_01_v1196_temp_prompt.md similarity index 100% rename from etl_v1dd_01_v1196_temp_prompt.md rename to planning/etl_v1dd_01_v1196_temp_prompt.md diff --git a/pyproject.toml b/pyproject.toml index 223a69b..e1341a5 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,6 +1,6 @@ [project] name = "connects-common-connectivity" -version = "0.1.0" +version = "0.2.0" description = "Common connectivity data models and utilities (LinkML + Pydantic) for BRAIN CONNECTS pilot" authors = [ { name = "Forrest Collman" } ] license = { text = "MIT" } From 2dab4e9b0cef2b97b3cba9d67370d1229e900208 Mon Sep 17 00:00:00 2001 From: reneyagmur Date: Tue, 23 Jun 2026 15:32:57 +0000 Subject: [PATCH 25/25] read union merge notebook migration regression for patchseq notebooks --- code/etl_tasic_01_cluster.ipynb | 203 ++---- ...isp_exc_patchseq_01_dataset_dataitem.ipynb | 153 ++--- ...l_visp_exc_patchseq_02_cell_features.ipynb | 235 +++---- ...eq_03_cluster_membership_and_mapping.ipynb | 370 ++++------- ...isp_inh_patchseq_01_dataset_dataitem.ipynb | 151 ++--- 
...l_visp_inh_patchseq_02_cell_features.ipynb | 312 +++++----- ...eq_03_cluster_membership_and_mapping.ipynb | 376 ++++++----- code/etl_visp_met_types_01_cluster.ipynb | 193 ++---- code/etl_wnm_exc_01_dataset_dataitem.ipynb | 149 ++--- code/etl_wnm_exc_02_cell_features.ipynb | 585 ++++++------------ ...l_wnm_exc_03_cell_to_cluster_mapping.ipynb | 155 ++--- code/etl_wnm_exc_04_projection_matrix.ipynb | 300 +++------ planning/20260623/PR_message.md | 1 + planning/multi_writer_scope_design.md | 201 ++++++ uv.lock | 4 +- 15 files changed, 1343 insertions(+), 2045 deletions(-) create mode 100644 planning/multi_writer_scope_design.md diff --git a/code/etl_tasic_01_cluster.ipynb b/code/etl_tasic_01_cluster.ipynb index 42a39d5..66c78dc 100644 --- a/code/etl_tasic_01_cluster.ipynb +++ b/code/etl_tasic_01_cluster.ipynb @@ -5,7 +5,7 @@ "id": "7f4f8c85", "metadata": {}, "source": [ - "# ETL \u2014 Tasic 2018 VISp Taxonomy (cluster reference)\n", + "# ETL — Tasic 2018 VISp Taxonomy (cluster reference)\n", "\n", "Registers the **Tasic 2018 VISp scRNA-seq taxonomy** as a global cluster reference. Writes `algorithmrun/`, `clusterhierarchy/`, `cluster/`, `hierarchycategory/`. **Out of scope:** Tasic cells are not registered as `DataItem`s here.\n", "\n", @@ -20,23 +20,16 @@ "**Known schema caveats (documented, not fixed):**\n", "\n", "1. **Two opposite `level` conventions.** `Cluster.level` uses `0=root` (depth from root, increasing downward). `HierarchyCategory.level` uses `0=lowest` (resolution-detail order, leaf to coarse). The two values for a given node always sum to the hierarchy depth (3 here). Don't cross-compare.\n", - "2. **`HierarchyCategory` has no taxonomy discriminator.** Categories like `class`/`subclass`/`cluster`/`major_class` are intentionally shared vocabulary across taxonomies. 
If a future taxonomy disagrees on a shared category's `level` or `description`, the later writer silently overwrites \u2014 so this notebook narrows its overwrite to `id IN (...)` over only the rows it owns.\n", + "2. **`HierarchyCategory` has no taxonomy discriminator.** Categories like `class`/`subclass`/`cluster`/`major_class` are intentionally shared vocabulary across taxonomies. If a future taxonomy disagrees on a shared category's `level` or `description`, the later writer silently overwrites — so this notebook narrows its overwrite to `id IN (...)` over only the rows it owns.\n", "3. **`HierarchyCategory.level` is generated as `Optional[str]`** (the top-level `level` slot has no `range:` set; only `Cluster.level` overrides to integer). Stored here as the str repr of the intended integer (`\"0\"`, `\"1\"`, ...). Future fix: add `range: integer` to `HierarchyCategory.level` in `slot_usage`.\n", - "4. **Color anomaly.** `Non-Neuronal` has two `class_color` values upstream (`#808285` \u00d7577 rows, `#8D1800` \u00d7181). Resolved by silent `dict(zip(label, color))` last-wins (matches the reference notebook). Final color: `#8D1800`." + "4. **Color anomaly.** `Non-Neuronal` has two `class_color` values upstream (`#808285` ×577 rows, `#8D1800` ×181). Resolved by silent `dict(zip(label, color))` last-wins (matches the reference notebook). Final color: `#8D1800`." 
] }, { "cell_type": "code", "execution_count": 1, "id": "49b62c21", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:20.347590Z", - "iopub.status.busy": "2026-06-12T23:38:20.347316Z", - "iopub.status.idle": "2026-06-12T23:38:21.516554Z", - "shell.execute_reply": "2026-06-12T23:38:21.515638Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", @@ -59,14 +52,7 @@ "cell_type": "code", "execution_count": 2, "id": "a6acd535", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:21.518766Z", - "iopub.status.busy": "2026-06-12T23:38:21.518462Z", - "iopub.status.idle": "2026-06-12T23:38:21.525036Z", - "shell.execute_reply": "2026-06-12T23:38:21.524314Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -105,26 +91,13 @@ "cell_type": "code", "execution_count": 3, "id": "333984ed", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:21.526995Z", - "iopub.status.busy": "2026-06-12T23:38:21.526805Z", - "iopub.status.idle": "2026-06-12T23:38:21.738989Z", - "shell.execute_reply": "2026-06-12T23:38:21.738199Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "anno.feather shape: (14236, 152)\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "anno.feather shape: (14236, 152)\n", "classes=3 subclasses=23 clusters=111\n", "Non-Neuronal class_color resolved to: #8D1800\n" ] @@ -170,21 +143,14 @@ "id": "f20fdd20", "metadata": {}, "source": [ - "## `HierarchyCategory` \u2014 4 rows (`major_class`/`class`/`subclass`/`cluster`)" + "## `HierarchyCategory` — 4 rows (`major_class`/`class`/`subclass`/`cluster`)" ] }, { "cell_type": "code", "execution_count": 4, "id": "1685c251", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:21.740776Z", - "iopub.status.busy": "2026-06-12T23:38:21.740579Z", - "iopub.status.idle": "2026-06-12T23:38:22.047104Z", - "shell.execute_reply": 
"2026-06-12T23:38:22.046212Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -213,14 +179,7 @@ "cell_type": "code", "execution_count": 5, "id": "72271432", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.048970Z", - "iopub.status.busy": "2026-06-12T23:38:22.048769Z", - "iopub.status.idle": "2026-06-12T23:38:22.120558Z", - "shell.execute_reply": "2026-06-12T23:38:22.119673Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -228,16 +187,16 @@ "text": [ "(4, 3)\n", "shape: (4, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 description \u2506 level \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 class \u2506 Top-level transcriptomic class\u2026 \u2506 2 \u2502\n", - "\u2502 cluster \u2506 Leaf cluster (cell type / T-ty\u2026 \u2506 0 \u2502\n", - "\u2502 major_class \u2506 Synthetic root grouping all cl\u2026 \u2506 3 \u2502\n", - "\u2502 subclass \u2506 Subclass of cell types. 
\u2506 1 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────────────────────────┬───────┐\n", + "│ id ┆ description ┆ level │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════╪═════════════════════════════════╪═══════╡\n", + "│ class ┆ Top-level transcriptomic class… ┆ 2 │\n", + "│ cluster ┆ Leaf cluster (cell type / T-ty… ┆ 0 │\n", + "│ major_class ┆ Synthetic root grouping all cl… ┆ 3 │\n", + "│ subclass ┆ Subclass of cell types. ┆ 1 │\n", + "└─────────────┴─────────────────────────────────┴───────┘\n" ] } ], @@ -253,21 +212,14 @@ "id": "8936a274", "metadata": {}, "source": [ - "## `AlgorithmRun` \u2014 1 row" + "## `AlgorithmRun` — 1 row" ] }, { "cell_type": "code", "execution_count": 6, "id": "025bb878", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.122454Z", - "iopub.status.busy": "2026-06-12T23:38:22.122171Z", - "iopub.status.idle": "2026-06-12T23:38:22.198474Z", - "shell.execute_reply": "2026-06-12T23:38:22.197736Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -298,14 +250,7 @@ "cell_type": "code", "execution_count": 7, "id": "067fb84c", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.200059Z", - "iopub.status.busy": "2026-06-12T23:38:22.199866Z", - "iopub.status.idle": "2026-06-12T23:38:22.215438Z", - "shell.execute_reply": "2026-06-12T23:38:22.214697Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -313,18 +258,18 @@ "text": [ "(1, 9)\n", "shape: (1, 9)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 algorithm \u2506 algorithm \u2506 json_obje \u2506 \u2026 \u2506 input_dat \u2506 produced_ \u2506 score_des \u2506 distance \u2502\n", - "\u2502 --- \u2506 _name \u2506 _version \u2506 ct \u2506 \u2506 aset \u2506 hierarchi \u2506 cription \u2506 _descrip \u2502\n", - "\u2502 str \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 es \u2506 --- \u2506 tion \u2502\n", - "\u2502 \u2506 str \u2506 str \u2506 str \u2506 \u2506 str \u2506 --- \u2506 str \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 tasic_201 \u2506 hierarchi \u2506 2018 \u2506 null \u2506 \u2026 \u2506 null \u2506 null \u2506 null \u2506 null \u2502\n", - "\u2502 8_visp_cl \u2506 cal \u2506 \u2506 \u2506 \u2506 
\u2506 \u2506 \u2506 \u2502\n", - "\u2502 ustering \u2506 (Tasic et \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 \u2506 al. 201\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ algorithm ┆ algorithm ┆ json_obje ┆ … ┆ input_dat ┆ produced_ ┆ score_des ┆ distance │\n", + "│ --- ┆ _name ┆ _version ┆ ct ┆ ┆ aset ┆ hierarchi ┆ cription ┆ _descrip │\n", + "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ es ┆ --- ┆ tion │\n", + "│ ┆ str ┆ str ┆ str ┆ ┆ str ┆ --- ┆ str ┆ --- │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ tasic_201 ┆ hierarchi ┆ 2018 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │\n", + "│ 8_visp_cl ┆ cal ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ ustering ┆ (Tasic et ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ ┆ al. 
201… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -340,33 +285,20 @@ "id": "bb7f663e", "metadata": {}, "source": [ - "## `Cluster` \u2014 138 rows (1 synthetic root + 3 classes + 23 subclasses + 111 leaf clusters)" + "## `Cluster` — 138 rows (1 synthetic root + 3 classes + 23 subclasses + 111 leaf clusters)" ] }, { "cell_type": "code", "execution_count": 8, "id": "0d1e5c5c", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.217025Z", - "iopub.status.busy": "2026-06-12T23:38:22.216842Z", - "iopub.status.idle": "2026-06-12T23:38:22.312533Z", - "shell.execute_reply": "2026-06-12T23:38:22.311621Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Cluster rows built: 138\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "Cluster rows built: 138\n", "Cluster written: 138 rows\n" ] } @@ -431,14 +363,7 @@ "cell_type": "code", "execution_count": 9, "id": "9aa57316", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.314948Z", - "iopub.status.busy": "2026-06-12T23:38:22.314685Z", - "iopub.status.idle": "2026-06-12T23:38:22.333367Z", - "shell.execute_reply": "2026-06-12T23:38:22.332356Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -446,16 +371,16 @@ "text": [ "(138, 9)\n", "shape: (4, 2)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 hierarchy_category \u2506 len \u2502\n", - "\u2502 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 u32 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 class \u2506 3 \u2502\n", - "\u2502 cluster \u2506 111 \u2502\n", - 
"\u2502 major_class \u2506 1 \u2502\n", - "\u2502 subclass \u2506 23 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌────────────────────┬─────┐\n", + "│ hierarchy_category ┆ len │\n", + "│ --- ┆ --- │\n", + "│ str ┆ u32 │\n", + "╞════════════════════╪═════╡\n", + "│ class ┆ 3 │\n", + "│ cluster ┆ 111 │\n", + "│ major_class ┆ 1 │\n", + "│ subclass ┆ 23 │\n", + "└────────────────────┴─────┘\n" ] } ], @@ -473,21 +398,14 @@ "id": "48bf29de", "metadata": {}, "source": [ - "## `ClusterHierarchy` \u2014 1 row" + "## `ClusterHierarchy` — 1 row" ] }, { "cell_type": "code", "execution_count": 10, "id": "31d4b276", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.335559Z", - "iopub.status.busy": "2026-06-12T23:38:22.335284Z", - "iopub.status.idle": "2026-06-12T23:38:22.416260Z", - "shell.execute_reply": "2026-06-12T23:38:22.415532Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -512,14 +430,7 @@ "cell_type": "code", "execution_count": 11, "id": "5072ae81", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:22.417824Z", - "iopub.status.busy": "2026-06-12T23:38:22.417633Z", - "iopub.status.idle": "2026-06-12T23:38:22.432337Z", - "shell.execute_reply": "2026-06-12T23:38:22.431572Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -559,6 +470,14 @@ "\n", "Tasic taxonomy: 1 synthetic root + 3 classes + 23 subclasses + 111 leaf clusters. No `DataItem`s registered (out of scope). Idempotent." 
] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9c123e95-ff37-4190-8381-5593ed47082d", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -577,7 +496,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb b/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb index 3a6e565..849b83b 100644 --- a/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb +++ b/code/etl_visp_exc_patchseq_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 VISp Excitatory Patch-seq: DataSet & DataItem\n", + "# ETL — VISp Excitatory Patch-seq: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_exc_patchseq\"`, `project_id = \"visp_patchseq\"`), one `DataItem` per cell from `inferred_met_types.csv`, and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." 
] @@ -12,14 +12,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:05.430484Z", - "iopub.status.busy": "2026-06-12T23:38:05.430258Z", - "iopub.status.idle": "2026-06-12T23:38:06.593666Z", - "shell.execute_reply": "2026-06-12T23:38:06.592934Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", @@ -39,14 +32,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:06.595708Z", - "iopub.status.busy": "2026-06-12T23:38:06.595415Z", - "iopub.status.idle": "2026-06-12T23:38:06.599858Z", - "shell.execute_reply": "2026-06-12T23:38:06.599188Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -81,14 +67,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:06.636467Z", - "iopub.status.busy": "2026-06-12T23:38:06.636236Z", - "iopub.status.idle": "2026-06-12T23:38:06.741365Z", - "shell.execute_reply": "2026-06-12T23:38:06.740644Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -175,14 +154,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:06.743157Z", - "iopub.status.busy": "2026-06-12T23:38:06.742959Z", - "iopub.status.idle": "2026-06-12T23:38:06.894898Z", - "shell.execute_reply": "2026-06-12T23:38:06.894218Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -207,14 +179,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:06.896587Z", - "iopub.status.busy": "2026-06-12T23:38:06.896365Z", - "iopub.status.idle": "2026-06-12T23:38:06.916236Z", - "shell.execute_reply": "2026-06-12T23:38:06.915319Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -222,14 +187,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_exc_patchseq \u2506 VISp excitatory \u2506 doi.org/10.1101/2023.11.25.56 \u2506 MORPHOLOGY \u2506 visp_patchseq \u2502\n", - "\u2502 \u2506 Patch-seq data\u2026 \u2506 8\u2026 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────────────┬─────────────────┬───────────────────────────────┬────────────┬───────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════════════╪═════════════════╪═══════════════════════════════╪════════════╪═══════════════╡\n", + "│ visp_exc_patchseq ┆ VISp excitatory ┆ doi.org/10.1101/2023.11.25.56 ┆ MORPHOLOGY ┆ visp_patchseq │\n", + "│ ┆ Patch-seq data… ┆ 8… ┆ ┆ │\n", + "└───────────────────┴─────────────────┴───────────────────────────────┴────────────┴───────────────┘\n" ] } ], @@ -256,20 +221,13 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:06.918092Z", - "iopub.status.busy": "2026-06-12T23:38:06.917892Z", - "iopub.status.idle": "2026-06-12T23:38:07.026981Z", - "shell.execute_reply": "2026-06-12T23:38:07.026264Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "DataItem rows appended: 1528 (total in batch: 1528)\n" + "DataItem rows appended: 0 (total in batch: 1528)\n" ] } ], @@ -287,32 +245,25 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:07.028778Z", - "iopub.status.busy": "2026-06-12T23:38:07.028580Z", - "iopub.status.idle": 
"2026-06-12T23:38:07.044979Z", - "shell.execute_reply": "2026-06-12T23:38:07.044143Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "(1528, 4)\n", + "(4407, 4)\n", "shape: (5, 4)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 908902400 \u2506 908902400 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 965091329 \u2506 965091329 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 978149378 \u2506 978149378 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 834891776 \u2506 834891776 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 897003522 \u2506 897003522 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + 
"┌───────────┬───────────┬───────────────────┬───────────────┐\n", + "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════════════╪═══════════════╡\n", + "│ 601790961 ┆ 601790961 ┆ null ┆ visp_patchseq │\n", + "│ 602535278 ┆ 602535278 ┆ null ┆ visp_patchseq │\n", + "│ 604646725 ┆ 604646725 ┆ null ┆ visp_patchseq │\n", + "│ 623326230 ┆ 623326230 ┆ null ┆ visp_patchseq │\n", + "│ 623434306 ┆ 623434306 ┆ null ┆ visp_patchseq │\n", + "└───────────┴───────────┴───────────────────┴───────────────┘\n" ] } ], @@ -339,14 +290,7 @@ { "cell_type": "code", "execution_count": 8, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:07.046793Z", - "iopub.status.busy": "2026-06-12T23:38:07.046597Z", - "iopub.status.idle": "2026-06-12T23:38:07.154276Z", - "shell.execute_reply": "2026-06-12T23:38:07.153632Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -372,14 +316,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:07.156197Z", - "iopub.status.busy": "2026-06-12T23:38:07.155972Z", - "iopub.status.idle": "2026-06-12T23:38:07.221610Z", - "shell.execute_reply": "2026-06-12T23:38:07.220787Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -387,17 +324,17 @@ "text": [ "(1528, 3)\n", "shape: (5, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 908902400 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 965091329 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 978149378 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 834891776 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 897003522 \u2506 visp_exc_patchseq \u2506 visp_patchseq \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬───────────────────┬───────────────┐\n", + "│ dataitem_id ┆ dataset_id ┆ project_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════╪═══════════════════╪═══════════════╡\n", + "│ 908902400 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", + "│ 965091329 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", + "│ 978149378 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", + "│ 834891776 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", + "│ 897003522 ┆ visp_exc_patchseq ┆ visp_patchseq │\n", + "└─────────────┴───────────────────┴───────────────┘\n" ] } ], @@ -429,8 +366,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 1 528 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `t_type` \u2014 transcriptomic type label; written in a later notebook as `CellToClusterMapping`.\n", - "- `met_type`, `inferred_met_type` \u2014 MET-type labels; written in a later notebook as `ClusterMembership`." 
+ "- `t_type` — transcriptomic type label; written in a later notebook as `CellToClusterMapping`.\n", + "- `met_type`, `inferred_met_type` — MET-type labels; written in a later notebook as `ClusterMembership`." ] }, { @@ -457,7 +394,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_visp_exc_patchseq_02_cell_features.ipynb b/code/etl_visp_exc_patchseq_02_cell_features.ipynb index 75d084f..1d82194 100644 --- a/code/etl_visp_exc_patchseq_02_cell_features.ipynb +++ b/code/etl_visp_exc_patchseq_02_cell_features.ipynb @@ -4,22 +4,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 VISp excitatory Patch-seq: Cell Features\n", + "# ETL — VISp excitatory Patch-seq: Cell Features\n", "\n", - "Writes 50 `CellFeatureDefinition` rows, one `CellFeatureSet` (`exc_visp_morph_features`), the wide-form morphology feature parquet (389 cells \u00d7 50 features), and one `CellFeatureMatrix` pointer. All 389 cells are already registered in `DataItem` by `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb`; no new cell registration is needed. Prerequisite: `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`)." + "Writes 50 `CellFeatureDefinition` rows, one `CellFeatureSet` (`exc_visp_morph_features`), the wide-form morphology feature parquet (389 cells × 50 features), and one `CellFeatureMatrix` pointer. All 389 cells are already registered in `DataItem` by `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb`; no new cell registration is needed. Prerequisite: `etl_visp_exc_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`)." 
] }, { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:12.411111Z", - "iopub.status.busy": "2026-06-12T23:38:12.410905Z", - "iopub.status.idle": "2026-06-12T23:38:13.544267Z", - "shell.execute_reply": "2026-06-12T23:38:13.543365Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -45,14 +38,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.546112Z", - "iopub.status.busy": "2026-06-12T23:38:13.545813Z", - "iopub.status.idle": "2026-06-12T23:38:13.550775Z", - "shell.execute_reply": "2026-06-12T23:38:13.549995Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -89,14 +75,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.552342Z", - "iopub.status.busy": "2026-06-12T23:38:13.552163Z", - "iopub.status.idle": "2026-06-12T23:38:13.647681Z", - "shell.execute_reply": "2026-06-12T23:38:13.646932Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -116,7 +95,7 @@ " )\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_visp_exc_patchseq_01 must be run first \u2014 \"\n", + " f\"etl_visp_exc_patchseq_01 must be run first — \"\n", " f\"no DataItemDataSetAssociation rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", @@ -140,14 +119,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.649320Z", - "iopub.status.busy": "2026-06-12T23:38:13.649129Z", - "iopub.status.idle": "2026-06-12T23:38:13.663385Z", - "shell.execute_reply": "2026-06-12T23:38:13.662621Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -255,14 +227,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.665180Z", - 
"iopub.status.busy": "2026-06-12T23:38:13.664982Z", - "iopub.status.idle": "2026-06-12T23:38:13.795060Z", - "shell.execute_reply": "2026-06-12T23:38:13.794343Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -295,14 +260,7 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.797069Z", - "iopub.status.busy": "2026-06-12T23:38:13.796862Z", - "iopub.status.idle": "2026-06-12T23:38:13.812508Z", - "shell.execute_reply": "2026-06-12T23:38:13.811753Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -310,23 +268,23 @@ "text": [ "(50, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 descriptio \u2506 unit \u2506 data_type \u2506 range_min \u2506 range_max \u2506 project_i \u2506 feature_s \u2502\n", - "\u2502 --- \u2506 n \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 d \u2506 et_id \u2502\n", - "\u2502 str \u2506 --- \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 apical_den \u2506 Difference \u2506 MICRONS_LE \u2506 \n", " \n", "
\n", - "

3 rows \u00d7 53 columns

\n", + "

3 rows × 53 columns

\n", "" ], "text/plain": [ @@ -628,7 +565,7 @@ "wide_df = pd.read_csv(WIDE_CSV)\n", "print(\"Wide CSV shape:\", wide_df.shape)\n", "\n", - "# Rename id column; convert int64 \u2192 str to match DataItem ids (values unchanged).\n", + "# Rename id column; convert int64 → str to match DataItem ids (values unchanged).\n", "wide_df = wide_df.rename(columns={\"specimen_id\": \"id\"})\n", "wide_df[\"id\"] = wide_df[\"id\"].astype(str)\n", "wide_df[\"project_id\"] = PROJECT_ID\n", @@ -645,14 +582,7 @@ { "cell_type": "code", "execution_count": 10, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:13.956537Z", - "iopub.status.busy": "2026-06-12T23:38:13.956342Z", - "iopub.status.idle": "2026-06-12T23:38:14.136390Z", - "shell.execute_reply": "2026-06-12T23:38:14.135513Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -678,14 +608,7 @@ { "cell_type": "code", "execution_count": 11, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:14.138273Z", - "iopub.status.busy": "2026-06-12T23:38:14.138059Z", - "iopub.status.idle": "2026-06-12T23:38:14.158556Z", - "shell.execute_reply": "2026-06-12T23:38:14.157669Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -693,24 +616,24 @@ "text": [ "(389, 53)\n", "shape: (3, 53)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 apical_de \u2506 apical_de \u2506 apical_de \u2506 \u2026 \u2506 
basal_den \u2506 soma_alig \u2506 project_i \u2506 feature_ \u2502\n", - "\u2502 --- \u2506 ndrite_bi \u2506 ndrite_bi \u2506 ndrite_de \u2506 \u2506 drite_tot \u2506 ned_dist_ \u2506 d \u2506 set_id \u2502\n", - "\u2502 str \u2506 as_x \u2506 as_y \u2506 pth_pc_0 \u2506 \u2506 al_surfac \u2506 from_pia \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 e_a\u2026 \u2506 --- \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 f32 \u2506 f32 \u2506 f32 \u2506 \u2506 --- \u2506 f32 \u2506 \u2506 \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 f32 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 601628311 \u2506 147.07672 \u2506 388.73733 \u2506 -0.134846 \u2506 \u2026 \u2506 4251.7275 \u2506 543.53991 \u2506 visp_patc \u2506 exc_visp \u2502\n", - "\u2502 \u2506 1 \u2506 5 \u2506 \u2506 \u2506 39 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2502 603229579 \u2506 117.91847 \u2506 513.29565 \u2506 181.61207 \u2506 \u2026 \u2506 3349.0791 \u2506 541.66149 \u2506 visp_patc \u2506 exc_visp \u2502\n", - "\u2502 \u2506 2 \u2506 4 \u2506 6 \u2506 \u2506 02 \u2506 9 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2502 603337985 \u2506 74.871315 \u2506 382.30099 \u2506 
-77.73823 \u2506 \u2026 \u2506 2933.8225 \u2506 458.13421 \u2506 visp_patc \u2506 exc_visp \u2502\n", - "\u2502 \u2506 \u2506 5 \u2506 5 \u2506 \u2506 1 \u2506 6 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ apical_de ┆ apical_de ┆ apical_de ┆ … ┆ basal_den ┆ soma_alig ┆ project_i ┆ feature_ │\n", + "│ --- ┆ ndrite_bi ┆ ndrite_bi ┆ ndrite_de ┆ ┆ drite_tot ┆ ned_dist_ ┆ d ┆ set_id │\n", + "│ str ┆ as_x ┆ as_y ┆ pth_pc_0 ┆ ┆ al_surfac ┆ from_pia ┆ --- ┆ --- │\n", + "│ ┆ --- ┆ --- ┆ --- ┆ ┆ e_a… ┆ --- ┆ str ┆ str │\n", + "│ ┆ f32 ┆ f32 ┆ f32 ┆ ┆ --- ┆ f32 ┆ ┆ │\n", + "│ ┆ ┆ ┆ ┆ ┆ f32 ┆ ┆ ┆ │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ 601628311 ┆ 147.07672 ┆ 388.73733 ┆ -0.134846 ┆ … ┆ 4251.7275 ┆ 543.53991 ┆ visp_patc ┆ exc_visp │\n", + "│ ┆ 1 ┆ 5 ┆ ┆ ┆ 39 ┆ 7 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "│ 603229579 ┆ 117.91847 ┆ 513.29565 ┆ 181.61207 ┆ … ┆ 3349.0791 ┆ 541.66149 ┆ visp_patc ┆ exc_visp │\n", + "│ ┆ 2 ┆ 4 ┆ 6 ┆ ┆ 02 ┆ 9 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "│ 603337985 ┆ 74.871315 ┆ 382.30099 ┆ -77.73823 ┆ … ┆ 2933.8225 ┆ 458.13421 ┆ visp_patc ┆ 
exc_visp │\n", + "│ ┆ ┆ 5 ┆ 5 ┆ ┆ 1 ┆ 6 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -735,14 +658,7 @@ { "cell_type": "code", "execution_count": 12, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:14.211041Z", - "iopub.status.busy": "2026-06-12T23:38:14.210580Z", - "iopub.status.idle": "2026-06-12T23:38:14.300305Z", - "shell.execute_reply": "2026-06-12T23:38:14.299589Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -768,14 +684,7 @@ { "cell_type": "code", "execution_count": 13, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:14.301975Z", - "iopub.status.busy": "2026-06-12T23:38:14.301778Z", - "iopub.status.idle": "2026-06-12T23:38:14.319065Z", - "shell.execute_reply": "2026-06-12T23:38:14.318153Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -783,14 +692,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_patchseq_exc_ \u2506 exc_visp_morph_fea \u2506 file:///scratch/em \u2506 id \u2506 visp_patchseq \u2502\n", - "\u2502 visp_morph_f\u2026 \u2506 tures \u2506 _patchseq_wn\u2026 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌────────────────────┬────────────────────┬────────────────────┬───────────────────┬───────────────┐\n", + "│ id ┆ feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞════════════════════╪════════════════════╪════════════════════╪═══════════════════╪═══════════════╡\n", + "│ visp_patchseq_exc_ ┆ exc_visp_morph_fea ┆ file:///scratch/em ┆ id ┆ visp_patchseq │\n", + "│ visp_morph_f… ┆ tures ┆ _patchseq_wn… ┆ ┆ │\n", + 
"└────────────────────┴────────────────────┴────────────────────┴───────────────────┴───────────────┘\n" ] } ], @@ -814,7 +723,7 @@ "|---|---|---|\n", "| `cellfeaturedefinition/` | `CellFeatureDefinition` | 50 |\n", "| `cellfeatureset/` | `CellFeatureSet` | 1 (`exc_visp_morph_features`) |\n", - "| `cellfeatures/exc_visp_morph_features/` | wide parquet | 389 cells \u00d7 50 features |\n", + "| `cellfeatures/exc_visp_morph_features/` | wide parquet | 389 cells × 50 features |\n", "| `cellfeaturematrix/` | `CellFeatureMatrix` | 1 |\n", "\n", "All writes use `mode=\"overwrite\"` with a two-level predicate (`project_id AND feature_set_id`) so re-running is idempotent. The inh Patch-seq notebook (same `project_id`, `feature_set_id='inh_visp_morph_features'`) and any future WNM notebook (same `feature_set_id`, `project_id='visp_wnm'`) cannot clobber these rows." @@ -844,7 +753,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb b/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb index 652009b..742bbb6 100644 --- a/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb +++ b/code/etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb @@ -5,13 +5,13 @@ "id": "da0a1046", "metadata": {}, "source": [ - "# ETL \u2014 VISp Excitatory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", + "# ETL — VISp Excitatory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", "\n", "Registers three taxonomy assignments per VISp excitatory Patch-seq cell (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_exc_patchseq\"`):\n", "\n", - "1. **T-type** against the Tasic 2018 taxonomy (`hierarchy_id=\"tasic_2018_visp_taxonomy\"`) as `CellToClusterMapping` \u2014 these cells are *mapped* into Tasic via the same Patch-seq tree-mapping pipeline used in Gouwens et al. 
2020; they were not part of the Tasic dataset. Source column: `t_type` (1528 cells, with the legacy `ET \u2192 PT` rename applied).\n", - "2. **Ground-truth MET-type** against the VISp MET-types taxonomy (`hierarchy_id=\"visp_met_types_taxonomy\"`) as `ClusterMembership` \u2014 Patch-seq cells *define* the MET-types space, so this is direct membership, not a mapping. Source column: `met_type` (Gouwens 2020 mMET-type assignments, 384 cells).\n", - "3. **Inferred MET-type** against the same VISp MET-types taxonomy as `CellToClusterMapping` \u2014 this is an algorithmically predicted label, semantically a *mapping* rather than direct membership. Source column: `inferred_met_type`, registered only for the 1053 cells that lack a ground-truth `met_type` (so this set is disjoint from the membership rows above; the inferred column matches ground truth perfectly on the overlap, asserted in-notebook). The producing algorithm is not documented in the source data \u2014 `MappingSet.method_name` is a generic placeholder and should be updated when the method is confirmed.\n", + "1. **T-type** against the Tasic 2018 taxonomy (`hierarchy_id=\"tasic_2018_visp_taxonomy\"`) as `CellToClusterMapping` — these cells are *mapped* into Tasic via the same Patch-seq tree-mapping pipeline used in Gouwens et al. 2020; they were not part of the Tasic dataset. Source column: `t_type` (1528 cells, with the legacy `ET → PT` rename applied).\n", + "2. **Ground-truth MET-type** against the VISp MET-types taxonomy (`hierarchy_id=\"visp_met_types_taxonomy\"`) as `ClusterMembership` — Patch-seq cells *define* the MET-types space, so this is direct membership, not a mapping. Source column: `met_type` (Gouwens 2020 mMET-type assignments, 384 cells).\n", + "3. **Inferred MET-type** against the same VISp MET-types taxonomy as `CellToClusterMapping` — this is an algorithmically predicted label, semantically a *mapping* rather than direct membership. 
Source column: `inferred_met_type`, registered only for the 1053 cells that lack a ground-truth `met_type` (so this set is disjoint from the membership rows above; the inferred column matches ground truth perfectly on the overlap, asserted in-notebook). The producing algorithm is not documented in the source data — `MappingSet.method_name` is a generic placeholder and should be updated when the method is confirmed.\n", "\n", "Per-cell rows are emitted at the leaf level **and at every ancestor level** so that level-agnostic queries against `clustermembership/` / `celltoclustermapping/` work without a hierarchy join. `probability` (when available) is recorded on the leaf row only and left null on ancestors, matching the reference notebook convention.\n", "\n", @@ -22,14 +22,7 @@ "cell_type": "code", "execution_count": 1, "id": "fbe20dd9", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:28.615615Z", - "iopub.status.busy": "2026-06-12T23:38:28.615425Z", - "iopub.status.idle": "2026-06-12T23:38:29.877420Z", - "shell.execute_reply": "2026-06-12T23:38:29.876545Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", @@ -49,14 +42,7 @@ "cell_type": "code", "execution_count": 2, "id": "f48baf63", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:29.879289Z", - "iopub.status.busy": "2026-06-12T23:38:29.878999Z", - "iopub.status.idle": "2026-06-12T23:38:29.884373Z", - "shell.execute_reply": "2026-06-12T23:38:29.883673Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -110,14 +96,7 @@ "cell_type": "code", "execution_count": 3, "id": "9ffb70f0", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:29.885943Z", - "iopub.status.busy": "2026-06-12T23:38:29.885760Z", - "iopub.status.idle": "2026-06-12T23:38:30.053119Z", - "shell.execute_reply": "2026-06-12T23:38:30.052352Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -135,7 +114,7 @@ " 
.filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_visp_exc_patchseq_01 must run first \u2014 no association rows for dataset_id='{DATASET_ID}'\"\n", + " f\"etl_visp_exc_patchseq_01 must run first — no association rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", "print(f\"Registered DataItems for {DATASET_ID}: {len(registered_ids)}\")\n", @@ -144,8 +123,8 @@ "cluster_df = pl.read_delta(OUTPUT_ROOT + \"cluster/\")\n", "ttype_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == TTYPE_HIERARCHY_ID)\n", "met_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", - "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first \u2014 no clusters for {TTYPE_HIERARCHY_ID}\"\n", - "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", + "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first — no clusters for {TTYPE_HIERARCHY_ID}\"\n", + "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", "\n", "ttype_parent = dict(zip(ttype_clu[\"id\"].to_list(), ttype_clu[\"parent\"].to_list()))\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", @@ -166,14 +145,7 @@ "cell_type": "code", "execution_count": 4, "id": "363edbba", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.054935Z", - "iopub.status.busy": "2026-06-12T23:38:30.054735Z", - "iopub.status.idle": "2026-06-12T23:38:30.070275Z", - "shell.execute_reply": "2026-06-12T23:38:30.069530Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -258,14 +230,7 @@ "cell_type": "code", "execution_count": 5, "id": "d6fe8d5a", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.071906Z", - 
"iopub.status.busy": "2026-06-12T23:38:30.071719Z", - "iopub.status.idle": "2026-06-12T23:38:30.075499Z", - "shell.execute_reply": "2026-06-12T23:38:30.074785Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -291,23 +256,16 @@ "id": "2340b987", "metadata": {}, "source": [ - "## T-type \u2192 `CellToClusterMapping` against Tasic 2018\n", + "## T-type → `CellToClusterMapping` against Tasic 2018\n", "\n", - "Apply the legacy `ET \u2192 PT` rename so that t-type labels match Tasic cluster ids (Tasic predates the ET nomenclature). Validate every translated label exists as a Tasic cluster id; raise on unknowns. Emit one `CellToClusterMapping` per (cell, ancestor) pair against `target_hierarchy=tasic_2018_visp_taxonomy`." + "Apply the legacy `ET → PT` rename so that t-type labels match Tasic cluster ids (Tasic predates the ET nomenclature). Validate every translated label exists as a Tasic cluster id; raise on unknowns. Emit one `CellToClusterMapping` per (cell, ancestor) pair against `target_hierarchy=tasic_2018_visp_taxonomy`." 
] }, { "cell_type": "code", "execution_count": 6, "id": "1bf18692", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.077129Z", - "iopub.status.busy": "2026-06-12T23:38:30.076860Z", - "iopub.status.idle": "2026-06-12T23:38:30.082438Z", - "shell.execute_reply": "2026-06-12T23:38:30.081696Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -331,14 +289,7 @@ "cell_type": "code", "execution_count": 7, "id": "66a5586b", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.083979Z", - "iopub.status.busy": "2026-06-12T23:38:30.083802Z", - "iopub.status.idle": "2026-06-12T23:38:30.180815Z", - "shell.execute_reply": "2026-06-12T23:38:30.180029Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -349,7 +300,7 @@ } ], "source": [ - "# MappingSet \u2014 one row describing the t-type assignment method.\n", + "# MappingSet — one row describing the t-type assignment method.\n", "ttype_mapping_set = MappingSet(\n", " id=MAPPING_SET_ID,\n", " name=\"VISp excitatory Patch-seq T-type assignments\",\n", @@ -372,14 +323,7 @@ "cell_type": "code", "execution_count": 8, "id": "2614cb03", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.182451Z", - "iopub.status.busy": "2026-06-12T23:38:30.182259Z", - "iopub.status.idle": "2026-06-12T23:38:30.198813Z", - "shell.execute_reply": "2026-06-12T23:38:30.198076Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -387,17 +331,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", - "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_exc_ \u2506 VISp exci \u2506 Tree-mapp \u2506 Patch-seq \u2506 \u2026 \u2506 null \u2506 tasic_201 \u2506 null \u2506 visp_pat \u2502\n", - "\u2502 patchseq_ \u2506 tatory \u2506 ing of \u2506 tree-mapp \u2506 \u2506 \u2506 8_visp_ta \u2506 \u2506 chseq \u2502\n", - "\u2502 ttype_map \u2506 
Patch-seq \u2506 VISp exci \u2506 ing \u2506 \u2506 \u2506 xonomy \u2506 \u2506 \u2502\n", - "\u2502 pin\u2026 \u2506 T-ty\u2026 \u2506 tator\u2026 \u2506 (Gouwen\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", + "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", + "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ visp_exc_ ┆ VISp exci ┆ Tree-mapp ┆ Patch-seq ┆ … ┆ null ┆ tasic_201 ┆ null ┆ visp_pat │\n", + "│ patchseq_ ┆ tatory ┆ ing of ┆ tree-mapp ┆ ┆ ┆ 8_visp_ta ┆ ┆ chseq │\n", + "│ ttype_map ┆ Patch-seq ┆ VISp exci ┆ ing ┆ ┆ ┆ xonomy ┆ ┆ │\n", + "│ pin… ┆ T-ty… ┆ tator… ┆ (Gouwen… ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -419,26 +363,13 @@ "cell_type": "code", "execution_count": 9, "id": "7a0ca37b", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.200547Z", - "iopub.status.busy": "2026-06-12T23:38:30.200330Z", - "iopub.status.idle": "2026-06-12T23:38:30.463242Z", - 
"shell.execute_reply": "2026-06-12T23:38:30.462426Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CellToClusterMapping rows built: 6112\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "CellToClusterMapping rows built: 6112\n", "CellToClusterMapping written: 6112 rows\n" ] } @@ -448,7 +379,7 @@ "ttype_mappings: list[CellToClusterMapping] = []\n", "for cell_id, leaf in zip(df.index, translated):\n", " if not isinstance(leaf, str):\n", - " continue # no t_type \u2014 skip (current data has none, but be defensive)\n", + " continue # no t_type — skip (current data has none, but be defensive)\n", " for cid, is_leaf in walk_ancestors(leaf, ttype_parent):\n", " ttype_mappings.append(CellToClusterMapping(\n", " id=f\"{cell_id}-{cid}-{PROJECT_ID}-{TTYPE_HIERARCHY_ID}\",\n", @@ -467,14 +398,7 @@ "cell_type": "code", "execution_count": 10, "id": "1e3146fb", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.464836Z", - "iopub.status.busy": "2026-06-12T23:38:30.464648Z", - "iopub.status.idle": "2026-06-12T23:38:30.482919Z", - "shell.execute_reply": "2026-06-12T23:38:30.482230Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -482,22 +406,22 @@ "text": [ "(6112, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus 
\u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT VISp \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 6 CT VISp \u2506 tchseq_ttyp \u2506 \u2506 Ctxn3 Sla \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 Ctxn3 Sla\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 6 CT-visp_p \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 atchseq-\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 908902400-G \u2506 visp_exc_pa \u2506 908902400 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 lutamatergi \u2506 tchseq_ttyp \u2506 \u2506 ic \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 c-visp_p\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", + "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", + "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", + "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT VISp ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ 6 CT VISp ┆ tchseq_ttyp ┆ ┆ Ctxn3 Sla ┆ ┆ ┆ ┆ seq │\n", + "│ Ctxn3 Sla… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ 6 CT-visp_p ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", + "│ atchseq-… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 908902400-G ┆ visp_exc_pa ┆ 908902400 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ lutamatergi ┆ tchseq_ttyp ┆ ┆ ic ┆ ┆ ┆ ┆ seq │\n", + "│ c-visp_p… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" ] } ], @@ -525,7 +449,7 @@ "id": "6f6c11c5", "metadata": {}, "source": [ - "## MET-type \u2192 `ClusterMembership` against VISp MET-types\n", + "## MET-type → `ClusterMembership` against VISp MET-types\n", "\n", "Subset to cells with 
non-null `met_type` (Gouwens 2020 mMET-type ground-truth assignments). Validate every label is a known MET cluster id; raise on unknowns. Emit one `ClusterMembership` per (cell, ancestor) pair with `hierarchy_id=\"visp_met_types_taxonomy\"`. Membership (not mapping), because Patch-seq cells *define* this taxonomy.\n", "\n", @@ -536,14 +460,7 @@ "cell_type": "code", "execution_count": 11, "id": "435ca181", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.484538Z", - "iopub.status.busy": "2026-06-12T23:38:30.484356Z", - "iopub.status.idle": "2026-06-12T23:38:30.489999Z", - "shell.execute_reply": "2026-06-12T23:38:30.489342Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -567,20 +484,14 @@ "cell_type": "code", "execution_count": 12, "id": "65ce039a", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.491602Z", - "iopub.status.busy": "2026-06-12T23:38:30.491418Z", - "iopub.status.idle": "2026-06-12T23:38:30.616015Z", - "shell.execute_reply": "2026-06-12T23:38:30.615272Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "New ClusterMembership rows built: 1152\n", + "Existing rows under predicate: 1152; kept (other notebooks): 0; new: 1152\n", "ClusterMembership written: 1152 rows\n" ] } @@ -600,9 +511,24 @@ "print(f\"New ClusterMembership rows built: {len(memberships)}\")\n", "\n", "our_cell_ids = set(met_df.index.tolist())\n", - "import polars as _pl\n", - "other_cm = _pl.DataFrame({\"item\": []})\n", - "all_memberships = memberships\n", + "\n", + "# Merge-then-overwrite: ClusterMembership is overwrite_scoped on\n", + "# (project_id, hierarchy_id), so a plain overwrite here would clobber rows\n", + "# written under the same predicate by sibling notebooks (e.g.\n", + "# etl_visp_inh_patchseq_03's 495 GABAergic-MET cells). 
Read existing rows,\n", + "# keep the ones this notebook does not own (item NOT IN our_cell_ids), and\n", + "# union them with the new rows before re-writing the full scope.\n", + "try:\n", + " existing_cm = (\n", + " pl.read_delta(OUTPUT_ROOT + \"clustermembership/\")\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID))\n", + " )\n", + "except Exception:\n", + " existing_cm = pl.DataFrame(schema={\"item\": pl.Utf8})\n", + "other_cm = existing_cm.filter(~pl.col(\"item\").is_in(list(our_cell_ids))) if existing_cm.shape[0] else existing_cm\n", + "other_memberships = [ClusterMembership(**row) for row in other_cm.to_dicts()]\n", + "all_memberships = other_memberships + memberships\n", + "print(f\"Existing rows under predicate: {existing_cm.shape[0]}; kept (other notebooks): {other_cm.shape[0]}; new: {len(memberships)}\")\n", "result = write_models(all_memberships, output_root=OUTPUT_ROOT)\n", "print(f\"ClusterMembership written: {result.rows_written} rows\")\n" ] @@ -611,14 +537,7 @@ "cell_type": "code", "execution_count": 13, "id": "6c93328b", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.617717Z", - "iopub.status.busy": "2026-06-12T23:38:30.617489Z", - "iopub.status.idle": "2026-06-12T23:38:30.636171Z", - "shell.execute_reply": "2026-06-12T23:38:30.635261Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -626,19 +545,19 @@ "text": [ "(1152, 7)\n", "shape: (3, 7)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 item \u2506 cluster \u2506 membership_s \u2506 probability \u2506 distance \u2506 project_id \u2506 hierarchy_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 core \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 f64 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 1039273993 \u2506 L6b \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", - "\u2502 1039273993 \u2506 Glutamatergic \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", 
- "\u2502 1039273993 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patchse \u2506 visp_met_typ \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 q \u2506 es_taxonomy \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + "┌────────────┬───────────────┬──────────────┬─────────────┬──────────┬──────────────┬──────────────┐\n", + "│ item ┆ cluster ┆ membership_s ┆ probability ┆ distance ┆ project_id ┆ hierarchy_id │\n", + "│ --- ┆ --- ┆ core ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ f64 ┆ ┆ ┆ ┆ │\n", + "╞════════════╪═══════════════╪══════════════╪═════════════╪══════════╪══════════════╪══════════════╡\n", + "│ 1039273993 ┆ L6b ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + "│ 1039273993 ┆ Glutamatergic ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + "│ 1039273993 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + "└────────────┴───────────────┴──────────────┴─────────────┴──────────┴──────────────┴──────────────┘\n", "Our cells present: 384 / 384\n", "Other-notebook rows preserved: 0\n" ] @@ -679,9 +598,9 @@ "id": "ce675b24", "metadata": {}, "source": [ - "## Inferred MET-type \u2192 `CellToClusterMapping`\n", + "## Inferred MET-type → `CellToClusterMapping`\n", "\n", - "`inferred_met_type` is an 
algorithmically predicted MET-type label, available for 1437 of 1528 cells. It is *inferred*, not direct measurement, so it belongs as `CellToClusterMapping` against `target_hierarchy=visp_met_types_taxonomy` \u2014 distinct from the ground-truth `met_type` membership written above.\n", + "`inferred_met_type` is an algorithmically predicted MET-type label, available for 1437 of 1528 cells. It is *inferred*, not direct measurement, so it belongs as `CellToClusterMapping` against `target_hierarchy=visp_met_types_taxonomy` — distinct from the ground-truth `met_type` membership written above.\n", "\n", "**Subset rule:** register only the 1053 cells whose `met_type` is null. The 384 cells with ground-truth `met_type` are already in `ClusterMembership` (and the inferred column agrees with them perfectly on the overlap, asserted below).\n" ] @@ -690,14 +609,7 @@ "cell_type": "code", "execution_count": 14, "id": "a9e13592", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.638133Z", - "iopub.status.busy": "2026-06-12T23:38:30.637866Z", - "iopub.status.idle": "2026-06-12T23:38:30.648350Z", - "shell.execute_reply": "2026-06-12T23:38:30.647571Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -731,14 +643,7 @@ "cell_type": "code", "execution_count": 15, "id": "74b1a5ca", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.650127Z", - "iopub.status.busy": "2026-06-12T23:38:30.649866Z", - "iopub.status.idle": "2026-06-12T23:38:30.769158Z", - "shell.execute_reply": "2026-06-12T23:38:30.768407Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -774,14 +679,7 @@ "cell_type": "code", "execution_count": 16, "id": "cd687756", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.770777Z", - "iopub.status.busy": "2026-06-12T23:38:30.770585Z", - "iopub.status.idle": "2026-06-12T23:38:30.793984Z", - "shell.execute_reply": "2026-06-12T23:38:30.793350Z" - } - }, + "metadata": {}, 
"outputs": [ { "name": "stdout", @@ -789,17 +687,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", - "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_exc_ \u2506 VISp exci \u2506 Algorithm \u2506 inferred \u2506 \u2026 \u2506 null \u2506 visp_met_ \u2506 null \u2506 visp_pat \u2502\n", - "\u2502 patchseq_ \u2506 tatory \u2506 ically \u2506 
MET-type \u2506 \u2506 \u2506 types_tax \u2506 \u2506 chseq \u2502\n", - "\u2502 inferred_ \u2506 Patch-seq \u2506 predicted \u2506 assignmen \u2506 \u2506 \u2506 onomy \u2506 \u2506 \u2502\n", - "\u2502 met\u2026 \u2506 infe\u2026 \u2506 MET-\u2026 \u2506 t (\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", + "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", + "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ visp_exc_ ┆ VISp exci ┆ Algorithm ┆ inferred ┆ … ┆ null ┆ visp_met_ ┆ null ┆ visp_pat │\n", + "│ patchseq_ ┆ tatory ┆ ically ┆ MET-type ┆ ┆ ┆ types_tax ┆ ┆ chseq │\n", + "│ inferred_ ┆ Patch-seq ┆ predicted ┆ assignmen ┆ ┆ ┆ onomy ┆ ┆ │\n", + "│ met… ┆ infe… ┆ MET-… ┆ t (… ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -827,26 +725,13 @@ "cell_type": "code", "execution_count": 17, "id": "dcdef943", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:30.795789Z", - 
"iopub.status.busy": "2026-06-12T23:38:30.795605Z", - "iopub.status.idle": "2026-06-12T23:38:31.026263Z", - "shell.execute_reply": "2026-06-12T23:38:31.025252Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CellToClusterMapping (inferred) rows built: 3159\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "CellToClusterMapping (inferred) rows built: 3159\n", "CellToClusterMapping written: 3159 rows\n" ] } @@ -872,14 +757,7 @@ "cell_type": "code", "execution_count": 18, "id": "2406a7b7", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:31.028216Z", - "iopub.status.busy": "2026-06-12T23:38:31.027948Z", - "iopub.status.idle": "2026-06-12T23:38:31.062012Z", - "shell.execute_reply": "2026-06-12T23:38:31.061251Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -887,22 +765,22 @@ "text": [ "(3159, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 908902400-L \u2506 visp_exc_pa \u2506 908902400 \u2506 L6 CT-1 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 6 CT-1-visp \u2506 tchseq_infe \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 _patchse\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 908902400-G \u2506 visp_exc_pa \u2506 908902400 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 lutamatergi \u2506 tchseq_infe \u2506 \u2506 ic \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 c-visp_p\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 908902400-c \u2506 visp_exc_pa \u2506 908902400 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 ell-visp_pa \u2506 tchseq_infe \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 tchseq-v\u2026 \u2506 rred_met\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", + "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", + "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", + "│ 908902400-L ┆ visp_exc_pa ┆ 908902400 ┆ L6 CT-1 ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ 6 CT-1-visp ┆ tchseq_infe ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", + "│ _patchse… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 908902400-G ┆ visp_exc_pa ┆ 908902400 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ lutamatergi ┆ tchseq_infe ┆ ┆ ic ┆ ┆ ┆ ┆ seq │\n", + "│ c-visp_p… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 908902400-c ┆ visp_exc_pa ┆ 908902400 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ ell-visp_pa ┆ tchseq_infe ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", + "│ tchseq-v… ┆ rred_met… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" ] } ], @@ -955,9 +833,9 @@ "|---|---|---|\n", "| `mappingset/` (`id={MAPPING_SET_ID}`) | `MappingSet` (T-type tree mapping) | 1 |\n", "| `mappingset/` (`id={MAPPING_SET_INFERRED_ID}`) | `MappingSet` (inferred MET-type, method unspecified) | 1 |\n", - 
"| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell \u00d7 t-type ancestor), all 1528 cells |\n", - "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_INFERRED_ID}`) | `CellToClusterMapping` (inferred) | one per (cell \u00d7 MET-type ancestor), 1053 cells without ground-truth `met_type` |\n", - "| `clustermembership/` (`hierarchy_id={METTYPE_HIERARCHY_ID}`) | `ClusterMembership` | one per (cell \u00d7 MET-type ancestor), 384 cells with ground-truth `met_type` |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell × t-type ancestor), all 1528 cells |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_INFERRED_ID}`) | `CellToClusterMapping` (inferred) | one per (cell × MET-type ancestor), 1053 cells without ground-truth `met_type` |\n", + "| `clustermembership/` (`hierarchy_id={METTYPE_HIERARCHY_ID}`) | `ClusterMembership` | one per (cell × MET-type ancestor), 384 cells with ground-truth `met_type` |\n", "\n", "All three columns of `inferred_met_types.csv` are now registered. 
The 91 cells with neither `met_type` nor `inferred_met_type` are unrepresented in cluster tables (no label to assign).\n" ] @@ -979,7 +857,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb b/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb index f5589c7..0e8db3f 100644 --- a/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb +++ b/code/etl_visp_inh_patchseq_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 VISp Inhibitory Patch-seq: DataSet & DataItem\n", + "# ETL — VISp Inhibitory Patch-seq: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_inh_patchseq\"`, `project_id = \"visp_patchseq\"`), one `DataItem` per cell from `patchseq_tx_cell_ttype_labels.csv`, and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." 
] @@ -12,14 +12,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:09.025703Z", - "iopub.status.busy": "2026-06-12T23:38:09.025505Z", - "iopub.status.idle": "2026-06-12T23:38:10.181580Z", - "shell.execute_reply": "2026-06-12T23:38:10.180791Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", @@ -39,14 +32,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.183658Z", - "iopub.status.busy": "2026-06-12T23:38:10.183359Z", - "iopub.status.idle": "2026-06-12T23:38:10.187923Z", - "shell.execute_reply": "2026-06-12T23:38:10.187228Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -81,14 +67,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.189613Z", - "iopub.status.busy": "2026-06-12T23:38:10.189331Z", - "iopub.status.idle": "2026-06-12T23:38:10.319847Z", - "shell.execute_reply": "2026-06-12T23:38:10.319194Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -171,14 +150,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.321614Z", - "iopub.status.busy": "2026-06-12T23:38:10.321417Z", - "iopub.status.idle": "2026-06-12T23:38:10.402594Z", - "shell.execute_reply": "2026-06-12T23:38:10.401862Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -203,14 +175,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.404617Z", - "iopub.status.busy": "2026-06-12T23:38:10.404267Z", - "iopub.status.idle": "2026-06-12T23:38:10.423561Z", - "shell.execute_reply": "2026-06-12T23:38:10.422734Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -218,14 +183,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_inh_patchseq \u2506 VISp inhibitory \u2506 doi.org/10.1016/j.cell.2020.0 \u2506 MORPHOLOGY \u2506 visp_patchseq \u2502\n", - "\u2502 \u2506 Patch-seq data\u2026 \u2506 9\u2026 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────────────┬─────────────────┬───────────────────────────────┬────────────┬───────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════════════╪═════════════════╪═══════════════════════════════╪════════════╪═══════════════╡\n", + "│ visp_inh_patchseq ┆ VISp inhibitory ┆ doi.org/10.1016/j.cell.2020.0 ┆ MORPHOLOGY ┆ visp_patchseq │\n", + "│ ┆ Patch-seq data… ┆ 9… ┆ ┆ │\n", + "└───────────────────┴─────────────────┴───────────────────────────────┴────────────┴───────────────┘\n" ] } ], @@ -251,20 +216,13 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.425342Z", - "iopub.status.busy": "2026-06-12T23:38:10.425136Z", - "iopub.status.idle": "2026-06-12T23:38:10.536997Z", - "shell.execute_reply": "2026-06-12T23:38:10.536321Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "DataItem rows appended: 2759 (total in batch: 2759)\n" + "DataItem rows appended: 0 (total in batch: 2759)\n" ] } ], @@ -282,32 +240,25 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.538766Z", - "iopub.status.busy": "2026-06-12T23:38:10.538571Z", - "iopub.status.idle": 
"2026-06-12T23:38:10.552390Z", - "shell.execute_reply": "2026-06-12T23:38:10.551671Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "(4287, 4)\n", + "(4407, 4)\n", "shape: (5, 4)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 888001481 \u2506 888001481 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 736493069 \u2506 736493069 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 830445950 \u2506 830445950 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 644941196 \u2506 644941196 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2502 658075752 \u2506 658075752 \u2506 null \u2506 visp_patchseq \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + 
"┌───────────┬───────────┬───────────────────┬───────────────┐\n", + "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════════════╪═══════════════╡\n", + "│ 601790961 ┆ 601790961 ┆ null ┆ visp_patchseq │\n", + "│ 602535278 ┆ 602535278 ┆ null ┆ visp_patchseq │\n", + "│ 604646725 ┆ 604646725 ┆ null ┆ visp_patchseq │\n", + "│ 623326230 ┆ 623326230 ┆ null ┆ visp_patchseq │\n", + "│ 623434306 ┆ 623434306 ┆ null ┆ visp_patchseq │\n", + "└───────────┴───────────┴───────────────────┴───────────────┘\n" ] } ], @@ -334,14 +285,7 @@ { "cell_type": "code", "execution_count": 8, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.554222Z", - "iopub.status.busy": "2026-06-12T23:38:10.554018Z", - "iopub.status.idle": "2026-06-12T23:38:10.664488Z", - "shell.execute_reply": "2026-06-12T23:38:10.663495Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -367,14 +311,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:10.666246Z", - "iopub.status.busy": "2026-06-12T23:38:10.666043Z", - "iopub.status.idle": "2026-06-12T23:38:10.680836Z", - "shell.execute_reply": "2026-06-12T23:38:10.680038Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -382,17 +319,17 @@ "text": [ "(2759, 3)\n", "shape: (5, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 888001481 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 736493069 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 830445950 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 644941196 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", - "\u2502 658075752 \u2506 visp_inh_patchseq \u2506 visp_patchseq \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬───────────────────┬───────────────┐\n", + "│ dataitem_id ┆ dataset_id ┆ project_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════╪═══════════════════╪═══════════════╡\n", + "│ 888001481 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", + "│ 736493069 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", + "│ 830445950 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", + "│ 644941196 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", + "│ 658075752 ┆ visp_inh_patchseq ┆ visp_patchseq │\n", + "└─────────────┴───────────────────┴───────────────┘\n" ] } ], @@ -425,7 +362,7 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 2 759 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `ttype` \u2014 T-type label; written in a later notebook as `CellToClusterMapping`." + "- `ttype` — T-type label; written in a later notebook as `CellToClusterMapping`." 
] }, { @@ -452,7 +389,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_visp_inh_patchseq_02_cell_features.ipynb b/code/etl_visp_inh_patchseq_02_cell_features.ipynb index f86be35..645e5c7 100644 --- a/code/etl_visp_inh_patchseq_02_cell_features.ipynb +++ b/code/etl_visp_inh_patchseq_02_cell_features.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 VISp inhibitory Patch-seq: Cell Features\n", + "# ETL — VISp inhibitory Patch-seq: Cell Features\n", "\n", "Writes 46 `CellFeatureDefinition` rows, one `CellFeatureSet` (`inh_visp_morph_features`), the wide-form morphology feature parquet, and one `CellFeatureMatrix` pointer. Also registers any cell ids present in the wide-form CSV but absent from the `DataItem` table (i.e., cells not in the original `_01` source CSV). Prerequisite: `etl_visp_inh_patchseq_01_dataset_dataitem.ipynb` (`project_id=\"visp_inh_patchseq\"`, `dataset_id=\"visp_inh_patchseq\"`)." 
] @@ -14,10 +14,10 @@ "execution_count": 1, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:16.132505Z", - "iopub.status.busy": "2026-06-12T23:38:16.132321Z", - "iopub.status.idle": "2026-06-12T23:38:17.267648Z", - "shell.execute_reply": "2026-06-12T23:38:17.266852Z" + "iopub.execute_input": "2026-06-23T15:20:50.096869Z", + "iopub.status.busy": "2026-06-23T15:20:50.096673Z", + "iopub.status.idle": "2026-06-23T15:20:51.049345Z", + "shell.execute_reply": "2026-06-23T15:20:51.048486Z" } }, "outputs": [], @@ -51,10 +51,10 @@ "execution_count": 2, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.269443Z", - "iopub.status.busy": "2026-06-12T23:38:17.269166Z", - "iopub.status.idle": "2026-06-12T23:38:17.273934Z", - "shell.execute_reply": "2026-06-12T23:38:17.273181Z" + "iopub.execute_input": "2026-06-23T15:20:51.051494Z", + "iopub.status.busy": "2026-06-23T15:20:51.051177Z", + "iopub.status.idle": "2026-06-23T15:20:51.056074Z", + "shell.execute_reply": "2026-06-23T15:20:51.055398Z" } }, "outputs": [ @@ -95,10 +95,10 @@ "execution_count": 3, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.275520Z", - "iopub.status.busy": "2026-06-12T23:38:17.275334Z", - "iopub.status.idle": "2026-06-12T23:38:17.429140Z", - "shell.execute_reply": "2026-06-12T23:38:17.428335Z" + "iopub.execute_input": "2026-06-23T15:20:51.093387Z", + "iopub.status.busy": "2026-06-23T15:20:51.093158Z", + "iopub.status.idle": "2026-06-23T15:20:51.248230Z", + "shell.execute_reply": "2026-06-23T15:20:51.247573Z" } }, "outputs": [ @@ -106,7 +106,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Prerequisite OK: 4287 DataItem rows for project_id='visp_patchseq'\n" + "Prerequisite OK: 4407 DataItem rows for project_id='visp_patchseq'\n" ] } ], @@ -116,7 +116,7 @@ " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", ")\n", "assert existing_dataitems.shape[0] > 0, (\n", - " f\"etl_visp_inh_patchseq_01 must be run first \u2014 no 
DataItem rows for project_id='{PROJECT_ID}'\"\n", + " f\"etl_visp_inh_patchseq_01 must be run first — no DataItem rows for project_id='{PROJECT_ID}'\"\n", ")\n", "print(f\"Prerequisite OK: {existing_dataitems.shape[0]} DataItem rows for project_id='{PROJECT_ID}'\")" ] @@ -127,8 +127,11 @@ "source": [ "## Register new cells from the wide CSV\n", "\n", - "Check which cell ids in the wide CSV are not yet in the `DataItem` table, register any new ones\n", - "via `append_new_dataitems`, and add `DataItemDataSetAssociation` rows for those new cells." + "Check which cell ids in the wide CSV are not yet in the `DataItem` table and register any\n", + "new ones via `append_new_dataitems`. Then re-assert the full\n", + "`(project_id, dataset_id)` association scope as the **union** of the existing scope and the\n", + "wide-CSV ids — `DataItemDataSetAssociation` is `overwrite_scoped`, so passing only the\n", + "wide-CSV ids would clobber rows written by `_01` for the same scope." ] }, { @@ -136,10 +139,10 @@ "execution_count": 4, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.430996Z", - "iopub.status.busy": "2026-06-12T23:38:17.430794Z", - "iopub.status.idle": "2026-06-12T23:38:17.445865Z", - "shell.execute_reply": "2026-06-12T23:38:17.445180Z" + "iopub.execute_input": "2026-06-23T15:20:51.250117Z", + "iopub.status.busy": "2026-06-23T15:20:51.249923Z", + "iopub.status.idle": "2026-06-23T15:20:51.265020Z", + "shell.execute_reply": "2026-06-23T15:20:51.264236Z" } }, "outputs": [ @@ -148,8 +151,8 @@ "output_type": "stream", "text": [ "Cells in wide CSV : 520\n", - "Already in DataItem : 400\n", - "New to register : 120\n" + "Already in DataItem : 520\n", + "New to register : 0\n" ] } ], @@ -167,16 +170,31 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.447595Z", - "iopub.status.busy": "2026-06-12T23:38:17.447309Z", - "iopub.status.idle": 
"2026-06-12T23:38:17.622049Z", - "shell.execute_reply": "2026-06-12T23:38:17.621135Z" + "iopub.execute_input": "2026-06-23T15:20:51.266754Z", + "iopub.status.busy": "2026-06-23T15:20:51.266462Z", + "iopub.status.idle": "2026-06-23T15:20:51.503978Z", + "shell.execute_reply": "2026-06-23T15:20:51.503186Z" } }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "No new cells to register — all already present.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Associations written for (visp_patchseq, visp_inh_patchseq): 2879\n" + ] + } + ], "source": [ "if new_ids:\n", " n_di = write_models(\n", @@ -189,11 +207,23 @@ "\n", "# Re-assert the full (project_id, dataset_id) association scope. The\n", "# DataItemDataSetAssociation WriteSpec is overwrite_scoped on those two\n", - "# columns, so passing the full intended set is idempotent and self-heals\n", + "# columns, so we must pass every id that should remain in scope — not\n", + "# just the wide-CSV ids — otherwise rows registered by earlier notebooks\n", + "# (e.g. `_01`'s ttype-CSV cells) would be clobbered. 
Union the existing\n", + "# scope with the wide-CSV ids; the write is idempotent and self-heals\n", "# any partial prior run.\n", + "try:\n", + " existing_assoc_ids = set(\n", + " pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\")\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", + " [\"dataitem_id\"].to_list()\n", + " )\n", + "except Exception:\n", + " existing_assoc_ids = set()\n", + "full_assoc_ids = sorted(existing_assoc_ids | set(all_wide_ids))\n", "n_assoc = write_models(\n", " [DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", - " for cid in all_wide_ids],\n", + " for cid in full_assoc_ids],\n", " output_root=OUTPUT_ROOT,\n", ").rows_written\n", "print(f\"Associations written for ({PROJECT_ID}, {DATASET_ID}): {n_assoc}\")\n" @@ -204,10 +234,10 @@ "execution_count": 6, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.623763Z", - "iopub.status.busy": "2026-06-12T23:38:17.623560Z", - "iopub.status.idle": "2026-06-12T23:38:17.890733Z", - "shell.execute_reply": "2026-06-12T23:38:17.889940Z" + "iopub.execute_input": "2026-06-23T15:20:51.505735Z", + "iopub.status.busy": "2026-06-23T15:20:51.505441Z", + "iopub.status.idle": "2026-06-23T15:20:51.724583Z", + "shell.execute_reply": "2026-06-23T15:20:51.723848Z" } }, "outputs": [ @@ -215,13 +245,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Total DataItems for project: 4407\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "Total DataItems for project: 4407\n", "Associations for visp_inh_patchseq: 2879\n" ] } @@ -255,10 +279,10 @@ "execution_count": 7, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.892420Z", - "iopub.status.busy": "2026-06-12T23:38:17.892228Z", - "iopub.status.idle": "2026-06-12T23:38:17.922873Z", - "shell.execute_reply": "2026-06-12T23:38:17.922117Z" + "iopub.execute_input": 
"2026-06-23T15:20:51.795339Z", + "iopub.status.busy": "2026-06-23T15:20:51.795074Z", + "iopub.status.idle": "2026-06-23T15:20:51.810825Z", + "shell.execute_reply": "2026-06-23T15:20:51.810089Z" } }, "outputs": [ @@ -365,10 +389,10 @@ "execution_count": 8, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:17.924911Z", - "iopub.status.busy": "2026-06-12T23:38:17.924550Z", - "iopub.status.idle": "2026-06-12T23:38:18.019448Z", - "shell.execute_reply": "2026-06-12T23:38:18.018467Z" + "iopub.execute_input": "2026-06-23T15:20:51.812587Z", + "iopub.status.busy": "2026-06-23T15:20:51.812386Z", + "iopub.status.idle": "2026-06-23T15:20:51.911130Z", + "shell.execute_reply": "2026-06-23T15:20:51.910362Z" } }, "outputs": [ @@ -405,10 +429,10 @@ "execution_count": 9, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:18.021312Z", - "iopub.status.busy": "2026-06-12T23:38:18.021047Z", - "iopub.status.idle": "2026-06-12T23:38:18.037296Z", - "shell.execute_reply": "2026-06-12T23:38:18.036348Z" + "iopub.execute_input": "2026-06-23T15:20:51.912804Z", + "iopub.status.busy": "2026-06-23T15:20:51.912596Z", + "iopub.status.idle": "2026-06-23T15:20:51.955067Z", + "shell.execute_reply": "2026-06-23T15:20:51.954279Z" } }, "outputs": [ @@ -418,25 +442,25 @@ "text": [ "(46, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 descriptio \u2506 unit \u2506 
data_type \u2506 range_min \u2506 range_max \u2506 project_i \u2506 feature_s \u2502\n", - "\u2502 --- \u2506 n \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 d \u2506 et_id \u2502\n", - "\u2502 str \u2506 --- \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 axon_bias_ \u2506 Difference \u2506 MICRONS_LE \u2506 \n", " \n", "\n", - "

3 rows \u00d7 49 columns

\n", + "

3 rows × 49 columns

\n", "" ], "text/plain": [ @@ -728,7 +752,7 @@ "wide_df = pd.read_csv(WIDE_CSV)\n", "print(\"Wide CSV shape:\", wide_df.shape)\n", "\n", - "# Rename id column; convert int64 \u2192 str to match DataItem ids (values unchanged).\n", + "# Rename id column; convert int64 → str to match DataItem ids (values unchanged).\n", "wide_df = wide_df.rename(columns={\"specimen_id\": \"id\"})\n", "wide_df[\"id\"] = wide_df[\"id\"].astype(str)\n", "wide_df[\"project_id\"] = PROJECT_ID\n", @@ -747,10 +771,10 @@ "execution_count": 13, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:18.211506Z", - "iopub.status.busy": "2026-06-12T23:38:18.211242Z", - "iopub.status.idle": "2026-06-12T23:38:18.363428Z", - "shell.execute_reply": "2026-06-12T23:38:18.362674Z" + "iopub.execute_input": "2026-06-23T15:20:52.132768Z", + "iopub.status.busy": "2026-06-23T15:20:52.132581Z", + "iopub.status.idle": "2026-06-23T15:20:52.239611Z", + "shell.execute_reply": "2026-06-23T15:20:52.238847Z" } }, "outputs": [ @@ -780,10 +804,10 @@ "execution_count": 14, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:18.364994Z", - "iopub.status.busy": "2026-06-12T23:38:18.364810Z", - "iopub.status.idle": "2026-06-12T23:38:18.381362Z", - "shell.execute_reply": "2026-06-12T23:38:18.380741Z" + "iopub.execute_input": "2026-06-23T15:20:52.241324Z", + "iopub.status.busy": "2026-06-23T15:20:52.241119Z", + "iopub.status.idle": "2026-06-23T15:20:52.260862Z", + "shell.execute_reply": "2026-06-23T15:20:52.260110Z" } }, "outputs": [ @@ -793,24 +817,24 @@ "text": [ "(520, 49)\n", "shape: (3, 49)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 axon_bias \u2506 axon_bias \u2506 axon_dept \u2506 \u2026 \u2506 basal_den \u2506 soma_alig \u2506 project_i \u2506 feature_ \u2502\n", - "\u2502 --- \u2506 _x \u2506 _y \u2506 h_pc_0 \u2506 \u2506 drite_tot \u2506 ned_dist_ \u2506 d \u2506 set_id \u2502\n", - "\u2502 str \u2506 --- \u2506 --- \u2506 --- \u2506 \u2506 al_surfac \u2506 from_pia \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 f32 \u2506 f32 \u2506 f32 \u2506 \u2506 e_a\u2026 \u2506 --- \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 --- \u2506 f32 \u2506 \u2506 \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 f32 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 601506507 \u2506 180.83319 \u2506 -249.8307 \u2506 -255.2250 \u2506 \u2026 \u2506 7207.4599 \u2506 
357.15982 \u2506 visp_patc \u2506 inh_visp \u2502\n", - "\u2502 \u2506 1 \u2506 5 \u2506 98 \u2506 \u2506 61 \u2506 1 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2502 601790961 \u2506 25.481123 \u2506 434.25106 \u2506 -216.8098 \u2506 \u2026 \u2506 11691.149 \u2506 663.10302 \u2506 visp_patc \u2506 inh_visp \u2502\n", - "\u2502 \u2506 \u2506 8 \u2506 91 \u2506 \u2506 414 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2502 601803754 \u2506 42.650597 \u2506 104.69784 \u2506 1157.1519 \u2506 \u2026 \u2506 11384.542 \u2506 170.36506 \u2506 visp_patc \u2506 inh_visp \u2502\n", - "\u2502 \u2506 \u2506 5 \u2506 78 \u2506 \u2506 969 \u2506 7 \u2506 hseq \u2506 _morph_f \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 eatures \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ axon_bias ┆ axon_bias ┆ axon_dept ┆ … ┆ basal_den ┆ soma_alig ┆ project_i ┆ feature_ │\n", + "│ --- ┆ _x ┆ _y ┆ h_pc_0 ┆ ┆ drite_tot ┆ ned_dist_ ┆ d ┆ set_id │\n", + "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ al_surfac ┆ from_pia ┆ --- ┆ --- │\n", + "│ ┆ f32 ┆ f32 ┆ f32 ┆ ┆ e_a… ┆ --- ┆ str ┆ str │\n", + "│ ┆ ┆ ┆ ┆ ┆ --- ┆ f32 ┆ ┆ │\n", + "│ ┆ 
┆ ┆ ┆ ┆ f32 ┆ ┆ ┆ │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ 601506507 ┆ 180.83319 ┆ -249.8307 ┆ -255.2250 ┆ … ┆ 7207.4599 ┆ 357.15982 ┆ visp_patc ┆ inh_visp │\n", + "│ ┆ 1 ┆ 5 ┆ 98 ┆ ┆ 61 ┆ 1 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "│ 601790961 ┆ 25.481123 ┆ 434.25106 ┆ -216.8098 ┆ … ┆ 11691.149 ┆ 663.10302 ┆ visp_patc ┆ inh_visp │\n", + "│ ┆ ┆ 8 ┆ 91 ┆ ┆ 414 ┆ 7 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "│ 601803754 ┆ 42.650597 ┆ 104.69784 ┆ 1157.1519 ┆ … ┆ 11384.542 ┆ 170.36506 ┆ visp_patc ┆ inh_visp │\n", + "│ ┆ ┆ 5 ┆ 78 ┆ ┆ 969 ┆ 7 ┆ hseq ┆ _morph_f │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ eatures │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -837,10 +861,10 @@ "execution_count": 15, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:18.383162Z", - "iopub.status.busy": "2026-06-12T23:38:18.382972Z", - "iopub.status.idle": "2026-06-12T23:38:18.448609Z", - "shell.execute_reply": "2026-06-12T23:38:18.447680Z" + "iopub.execute_input": "2026-06-23T15:20:52.262523Z", + "iopub.status.busy": "2026-06-23T15:20:52.262324Z", + "iopub.status.idle": "2026-06-23T15:20:52.368326Z", + "shell.execute_reply": "2026-06-23T15:20:52.367544Z" } }, "outputs": [ @@ -870,10 +894,10 @@ "execution_count": 16, "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:18.450425Z", - "iopub.status.busy": "2026-06-12T23:38:18.450170Z", - "iopub.status.idle": "2026-06-12T23:38:18.513603Z", - "shell.execute_reply": "2026-06-12T23:38:18.512837Z" + "iopub.execute_input": "2026-06-23T15:20:52.370303Z", + "iopub.status.busy": "2026-06-23T15:20:52.370029Z", + "iopub.status.idle": "2026-06-23T15:20:52.430668Z", + "shell.execute_reply": "2026-06-23T15:20:52.429928Z" } }, "outputs": [ @@ -883,14 +907,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_patchseq_inh_ \u2506 inh_visp_morph_fea \u2506 file:///scratch/em \u2506 id \u2506 visp_patchseq \u2502\n", - "\u2502 visp_morph_f\u2026 \u2506 tures \u2506 _patchseq_wn\u2026 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌────────────────────┬────────────────────┬────────────────────┬───────────────────┬───────────────┐\n", + "│ id ┆ feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞════════════════════╪════════════════════╪════════════════════╪═══════════════════╪═══════════════╡\n", + "│ visp_patchseq_inh_ ┆ inh_visp_morph_fea ┆ file:///scratch/em ┆ id ┆ visp_patchseq │\n", + "│ visp_morph_f… ┆ tures ┆ _patchseq_wn… ┆ ┆ │\n", + "└────────────────────┴────────────────────┴────────────────────┴───────────────────┴───────────────┘\n" ] } ], @@ -912,14 +936,14 @@ "\n", "| Output path | Class | Rows |\n", "|---|---|---|\n", - "| `dataitem/` | `DataItem` | +new cells from wide CSV (\u2264 520 total, 120 new on first run) |\n", - "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | +new cells from wide CSV |\n", + "| `dataitem/` | `DataItem` | +new cells from wide CSV (≤ 520 total, 120 new on first run) |\n", + "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | full scope = existing ∪ wide-CSV ids (re-written every run) |\n", "| `cellfeaturedefinition/` | `CellFeatureDefinition` | 46 |\n", "| `cellfeatureset/` | `CellFeatureSet` | 1 (`inh_visp_morph_features`) |\n", - "| `cellfeatures/inh_visp_morph_features/` | wide parquet | 520 cells \u00d7 46 features 
|\n", + "| `cellfeatures/inh_visp_morph_features/` | wide parquet | 520 cells × 46 features |\n", "| `cellfeaturematrix/` | `CellFeatureMatrix` | 1 |\n", "\n", - "`dataitem/` and `dataitem_dataset_association/` use `append_new_dataitems` / `mode=\"append\"` scoped to new cells only \u2014 re-running is idempotent and never wipes rows from `etl_visp_inh_patchseq_01`. All other writes use `mode=\"overwrite\"` with a scoped predicate." + "`dataitem/` uses `append_new_dataitems` scoped to new cells only — re-running is idempotent and never wipes rows from `etl_visp_inh_patchseq_01`. `dataitem_dataset_association/` is `overwrite_scoped` on `(project_id, dataset_id)`; this notebook re-asserts the full scope as `existing ∪ wide-CSV ids` so siblings under the same scope (e.g. `_01`'s ttype cells, `_03`'s MET cells) are preserved. All other writes use `mode=\"overwrite\"` with a scoped predicate." ] }, { @@ -946,9 +970,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.13.13" } }, "nbformat": 4, "nbformat_minor": 4 -} \ No newline at end of file +} diff --git a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb index c56dd0f..cb5d7b9 100644 --- a/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb +++ b/code/etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb @@ -5,51 +5,53 @@ "id": "d07bcdbd", "metadata": {}, "source": [ - "# ETL \u2014 VISp Inhibitory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", + "# ETL — VISp Inhibitory Patch-seq: Cluster Membership & Cell-to-Cluster Mapping\n", "\n", "For VISp inhibitory Patch-seq cells (`project_id=\"visp_patchseq\"`, `dataset_id=\"visp_inh_patchseq\"`),\n", "this notebook registers two assignments per cell:\n", "\n", - "1. 
**T-type \u2192 Tasic 2018 VISp scRNA-seq taxonomy** as `CellToClusterMapping` (the cells were\n", - " not part of Tasic \u2014 this is a *mapping*). Source: `patchseq_tx_cell_ttype_labels.csv`,\n", - " column `ttype`, indexed by cell id. **No `ET \u2192 PT` translation** (that was an exc-only\n", + "1. **T-type → Tasic 2018 VISp scRNA-seq taxonomy** as `CellToClusterMapping` (the cells were\n", + " not part of Tasic — this is a *mapping*). Source: `patchseq_tx_cell_ttype_labels.csv`,\n", + " column `ttype`, indexed by cell id. **No `ET → PT` translation** (that was an exc-only\n", " convention; inhibitory ttypes don't contain `ET`).\n", - "2. **MET-type \u2192 VISp MET-types taxonomy** as `ClusterMembership` (these cells *belong* to\n", - " these MET-types by direct measurement \u2014 same cohort that defined the taxonomy). Source:\n", + "2. **MET-type → VISp MET-types taxonomy** as `ClusterMembership` (these cells *belong* to\n", + " these MET-types by direct measurement — same cohort that defined the taxonomy). Source:\n", " `visp_met_cell_assignments_text_names.csv`, column `met_type`, indexed by cell id.\n", "\n", "Both writes use **parent propagation**: one row per (cell, ancestor) pair walked from the\n", "leaf to the root via `walk_ancestors` (in `connects_common_connectivity.io.write_utils`).\n", "`probability` is left null (no probability column in either source).\n", "\n", - "## Section 0 \u2014 register missing inhibitory dataset associations\n", + "## Section 0 — re-assert the inhibitory dataset associations\n", "\n", "The MET CSV has 495 cells. All 495 are already in `dataitem/` (registered by earlier\n", - "notebooks), but only 392 are associated with `dataset_id=\"visp_inh_patchseq\"`. The other\n", - "103 are GABAergic MET-types with no dataset association at all. Section 0 appends the\n", - "missing 103 `DataItemDataSetAssociation` rows so every MET cell has the proper inh\n", - "dataset link before membership is written. 
(This mirrors the pattern in\n", - "`etl_visp_inh_patchseq_02_cell_features.ipynb`.)\n", + "notebooks). `DataItemDataSetAssociation` is `overwrite_scoped` on\n", + "`(project_id, dataset_id)`, so any `write_models` call replaces *every* row in the\n", + "scope. Multiple notebooks contribute rows to this same scope\n", + "(`_01` writes the 2759 ttype cells, `_02` writes the wide-CSV cells, `_03` adds the\n", + "GABAergic MET cells), so a plain overwrite from any one of them clobbers the\n", + "others. Section 0 therefore reads the existing scope, **unions** it with the 495\n", + "MET CSV cells, and re-writes the full set. The result is idempotent and self-heals\n", + "any cells missed by a prior partial run.\n", "\n", "## Merge-then-overwrite for `clustermembership/`\n", "\n", - "`etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb` already wrote 1152\n", + "`etl_visp_exc_patchseq_03_cluster_membership_and_mapping.ipynb` writes 1152\n", "ClusterMembership rows under predicate\n", "`project_id='visp_patchseq' AND hierarchy_id='visp_met_types_taxonomy'`. This notebook\n", "writes under the **same predicate**, so a plain overwrite would clobber the exc rows.\n", "Instead, the membership write uses **merge-then-overwrite**: read existing rows, drop\n", "the rows this notebook owns (`item IN `), union with new rows, then\n", - "overwrite. This makes both notebooks idempotent and order-independent, matching the\n", - "codebase pattern used in `_02` for `dataitem_dataset_association`.\n", + "overwrite. 
This makes both notebooks idempotent and order-independent.\n", "\n", - "## Outputs (under `../scratch/em_patchseq_wnm_v1/`)\n", + "## Outputs \n", "\n", "| Path | Class | Rows added |\n", "|---|---|---|\n", - "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | +103 (first run); 0 on re-run |\n", + "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | union of existing scope ∪ 495 MET cells (full scope re-written every run) |\n", "| `mappingset/` | `MappingSet` | 1 (`visp_inh_patchseq_ttype_mapping`) |\n", - "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells \u00d7 4 ancestors = 11036 |\n", - "| `clustermembership/` | `ClusterMembership` | 495 cells \u00d7 3 ancestors = 1485 (merged with exc's 1152 \u2192 2637 total under predicate) |\n" + "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells × 4 ancestors = 11036 |\n", + "| `clustermembership/` | `ClusterMembership` | 495 cells × 3 ancestors = 1485, merged with exc's 1152 → 2637 total under predicate |\n" ] }, { @@ -58,10 +60,10 @@ "id": "18153e32", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:32.926306Z", - "iopub.status.busy": "2026-06-12T23:38:32.926117Z", - "iopub.status.idle": "2026-06-12T23:38:34.205384Z", - "shell.execute_reply": "2026-06-12T23:38:34.204543Z" + "iopub.execute_input": "2026-06-23T15:20:55.810196Z", + "iopub.status.busy": "2026-06-23T15:20:55.809925Z", + "iopub.status.idle": "2026-06-23T15:20:56.809125Z", + "shell.execute_reply": "2026-06-23T15:20:56.808287Z" } }, "outputs": [], @@ -86,10 +88,10 @@ "id": "989cb0f6", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.207389Z", - "iopub.status.busy": "2026-06-12T23:38:34.207089Z", - "iopub.status.idle": "2026-06-12T23:38:34.212353Z", - "shell.execute_reply": "2026-06-12T23:38:34.211666Z" + "iopub.execute_input": "2026-06-23T15:20:56.811285Z", + "iopub.status.busy": "2026-06-23T15:20:56.810951Z", + "iopub.status.idle": 
"2026-06-23T15:20:56.816451Z", + "shell.execute_reply": "2026-06-23T15:20:56.815736Z" } }, "outputs": [ @@ -145,10 +147,10 @@ "id": "8888859f", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.213993Z", - "iopub.status.busy": "2026-06-12T23:38:34.213800Z", - "iopub.status.idle": "2026-06-12T23:38:34.344188Z", - "shell.execute_reply": "2026-06-12T23:38:34.343305Z" + "iopub.execute_input": "2026-06-23T15:20:56.818120Z", + "iopub.status.busy": "2026-06-23T15:20:56.817924Z", + "iopub.status.idle": "2026-06-23T15:20:57.032494Z", + "shell.execute_reply": "2026-06-23T15:20:57.031770Z" } }, "outputs": [ @@ -156,7 +158,13 @@ "name": "stdout", "output_type": "stream", "text": [ - "DataItems for project_id='visp_patchseq': 4407\n", + "DataItems for project_id='visp_patchseq': 4407\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ "Clusters loaded: tasic_2018_visp_taxonomy=138 visp_met_types_taxonomy=48\n" ] } @@ -168,7 +176,7 @@ " .filter(pl.col(\"project_id\") == PROJECT_ID)\n", ")\n", "assert existing_dataitems.shape[0] > 0, (\n", - " f\"earlier notebooks must run first \u2014 no DataItem rows for project_id='{PROJECT_ID}'\"\n", + " f\"earlier notebooks must run first — no DataItem rows for project_id='{PROJECT_ID}'\"\n", ")\n", "registered_ids = set(existing_dataitems[\"id\"].to_list())\n", "print(f\"DataItems for project_id='{PROJECT_ID}': {len(registered_ids)}\")\n", @@ -176,8 +184,8 @@ "cluster_df = pl.read_delta(OUTPUT_ROOT + \"cluster/\")\n", "ttype_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == TTYPE_HIERARCHY_ID)\n", "met_clu = cluster_df.filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", - "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must run first \u2014 no clusters for {TTYPE_HIERARCHY_ID}\"\n", - "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", + "assert ttype_clu.shape[0] > 0, f\"etl_tasic_01_cluster must 
run first — no clusters for {TTYPE_HIERARCHY_ID}\"\n", + "assert met_clu.shape[0] > 0, f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", "\n", "ttype_parent = dict(zip(ttype_clu[\"id\"].to_list(), ttype_clu[\"parent\"].to_list()))\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", @@ -189,12 +197,15 @@ "id": "7d1b7045", "metadata": {}, "source": [ - "## Section 0 \u2014 register missing `visp_inh_patchseq` associations\n", + "## Section 0 — re-assert `visp_inh_patchseq` associations\n", "\n", - "Cells in `visp_met_cell_assignments_text_names.csv` that exist in `dataitem/` for\n", - "`project_id='visp_patchseq'` but lack a `dataset_id='visp_inh_patchseq'` association\n", - "get one appended here. `mode=\"append\"` is safe because we only emit rows for ids that\n", - "are not yet associated; on re-run the to-register set is empty and the block no-ops.\n" + "`DataItemDataSetAssociation` is `overwrite_scoped` on `(project_id, dataset_id)`,\n", + "so a `write_models` call replaces the entire scope. Several notebooks contribute\n", + "rows to `(visp_patchseq, visp_inh_patchseq)` — `_01` writes the 2759 ttype cells,\n", + "`_02` writes the wide-CSV cells, and `_03` adds the 495 GABAergic MET cells. To\n", + "avoid clobbering siblings, this section reads the existing scope, unions it with\n", + "the MET CSV ids, and re-writes the full set. 
On re-run the set is unchanged so\n", + "the write is a no-op in content.\n" ] }, { @@ -203,10 +214,10 @@ "id": "f7cce99b", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.345890Z", - "iopub.status.busy": "2026-06-12T23:38:34.345691Z", - "iopub.status.idle": "2026-06-12T23:38:34.359951Z", - "shell.execute_reply": "2026-06-12T23:38:34.359243Z" + "iopub.execute_input": "2026-06-23T15:20:57.034186Z", + "iopub.status.busy": "2026-06-23T15:20:57.033963Z", + "iopub.status.idle": "2026-06-23T15:20:57.048121Z", + "shell.execute_reply": "2026-06-23T15:20:57.047461Z" } }, "outputs": [ @@ -289,10 +300,10 @@ "id": "9e0f8d3e", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.361584Z", - "iopub.status.busy": "2026-06-12T23:38:34.361395Z", - "iopub.status.idle": "2026-06-12T23:38:34.376612Z", - "shell.execute_reply": "2026-06-12T23:38:34.375906Z" + "iopub.execute_input": "2026-06-23T15:20:57.049626Z", + "iopub.status.busy": "2026-06-23T15:20:57.049448Z", + "iopub.status.idle": "2026-06-23T15:20:57.113248Z", + "shell.execute_reply": "2026-06-23T15:20:57.112507Z" } }, "outputs": [ @@ -329,25 +340,38 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "id": "2c8d4901", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.378220Z", - "iopub.status.busy": "2026-06-12T23:38:34.378034Z", - "iopub.status.idle": "2026-06-12T23:38:34.393559Z", - "shell.execute_reply": "2026-06-12T23:38:34.392730Z" + "iopub.execute_input": "2026-06-23T15:20:57.115015Z", + "iopub.status.busy": "2026-06-23T15:20:57.114800Z", + "iopub.status.idle": "2026-06-23T15:20:57.337830Z", + "shell.execute_reply": "2026-06-23T15:20:57.337011Z" } }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Associations written for (visp_patchseq, visp_inh_patchseq): 2879\n", + "Total visp_inh_patchseq associations now: 2879\n" + ] + } + ], "source": [ - "# Re-assert the full 
(project_id, dataset_id) association scope for every\n", - "# MET cell. DataItemDataSetAssociation is overwrite_scoped on\n", - "# (project_id, dataset_id), so passing the full set is idempotent and\n", - "# self-heals any cells that were missed by a prior partial run.\n", + "# Re-assert the full (project_id, dataset_id) association scope.\n", + "# DataItemDataSetAssociation is overwrite_scoped on (project_id, dataset_id),\n", + "# so we must pass every id that should remain in scope — not just the MET\n", + "# CSV ids — otherwise rows registered by earlier notebooks (e.g. `_01`'s\n", + "# ttype-CSV cells and `_02`'s wide-CSV cells) would be clobbered. Union the\n", + "# existing scope with the MET CSV ids; the write is idempotent and self-\n", + "# heals any cells that were missed by a prior partial run.\n", + "full_assoc_ids = sorted(existing_inh_ids | met_csv_ids)\n", "n_assoc = write_models(\n", " [DataItemDataSetAssociation(dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID)\n", - " for cid in sorted(met_csv_ids)],\n", + " for cid in full_assoc_ids],\n", " output_root=OUTPUT_ROOT,\n", ").rows_written\n", "print(f\"Associations written for ({PROJECT_ID}, {DATASET_ID}): {n_assoc}\")\n", @@ -371,7 +395,7 @@ "id": "b6557137", "metadata": {}, "source": [ - "## Section 1 \u2014 T-type \u2192 `CellToClusterMapping` against Tasic 2018\n" + "## Section 1 — T-type → `CellToClusterMapping` against Tasic 2018\n" ] }, { @@ -380,10 +404,10 @@ "id": "321ae589", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.395367Z", - "iopub.status.busy": "2026-06-12T23:38:34.395156Z", - "iopub.status.idle": "2026-06-12T23:38:34.408631Z", - "shell.execute_reply": "2026-06-12T23:38:34.407961Z" + "iopub.execute_input": "2026-06-23T15:20:57.339643Z", + "iopub.status.busy": "2026-06-23T15:20:57.339427Z", + "iopub.status.idle": "2026-06-23T15:20:57.353843Z", + "shell.execute_reply": "2026-06-23T15:20:57.353141Z" } }, "outputs": [ @@ -458,9 +482,9 @@ 
"tt_df.index = tt_df.index.astype(str)\n", "print(\"T-type CSV shape:\", tt_df.shape)\n", "print(\"ttype non-null:\", tt_df[\"ttype\"].notna().sum())\n", - "# Inhibitory ttypes don't contain \"ET\" \u2014 assert and skip the legacy ET\u2192PT translation.\n", + "# Inhibitory ttypes don't contain \"ET\" — assert and skip the legacy ET→PT translation.\n", "assert tt_df[\"ttype\"].astype(str).str.contains(\"ET\").sum() == 0, (\n", - " \"unexpected 'ET' in inhibitory ttypes \u2014 exc-only translation rule should not apply\"\n", + " \"unexpected 'ET' in inhibitory ttypes — exc-only translation rule should not apply\"\n", ")\n", "tt_df.head(3)\n" ] @@ -471,10 +495,10 @@ "id": "be05809d", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.410231Z", - "iopub.status.busy": "2026-06-12T23:38:34.410044Z", - "iopub.status.idle": "2026-06-12T23:38:34.415895Z", - "shell.execute_reply": "2026-06-12T23:38:34.415278Z" + "iopub.execute_input": "2026-06-23T15:20:57.355599Z", + "iopub.status.busy": "2026-06-23T15:20:57.355306Z", + "iopub.status.idle": "2026-06-23T15:20:57.361378Z", + "shell.execute_reply": "2026-06-23T15:20:57.360677Z" } }, "outputs": [ @@ -482,13 +506,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "All 2759 T-type CSV cells are in visp_inh_patchseq.\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "All 2759 T-type CSV cells are in visp_inh_patchseq.\n", "All 2759 cells have a valid Tasic ttype.\n" ] } @@ -516,10 +534,10 @@ "id": "1719a00c", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.417580Z", - "iopub.status.busy": "2026-06-12T23:38:34.417401Z", - "iopub.status.idle": "2026-06-12T23:38:34.515524Z", - "shell.execute_reply": "2026-06-12T23:38:34.514677Z" + "iopub.execute_input": "2026-06-23T15:20:57.363017Z", + "iopub.status.busy": "2026-06-23T15:20:57.362820Z", + "iopub.status.idle": "2026-06-23T15:20:57.532173Z", + "shell.execute_reply": "2026-06-23T15:20:57.531422Z" } 
}, "outputs": [ @@ -532,7 +550,7 @@ } ], "source": [ - "# MappingSet \u2014 one row describing the t-type assignment method.\n", + "# MappingSet — one row describing the t-type assignment method.\n", "ttype_mapping_set = MappingSet(\n", " id=MAPPING_SET_ID,\n", " name=\"VISp inhibitory Patch-seq T-type assignments\",\n", @@ -540,7 +558,7 @@ " \"Tree-mapping of VISp inhibitory Patch-seq cells onto the Tasic 2018 VISp \"\n", " \"scRNA-seq taxonomy, as used in Gouwens et al. 2020. Source labels are read \"\n", " \"from the `ttype` column of patchseq_tx_cell_ttype_labels.csv. No legacy \"\n", - " \"ET\u2192PT rename is applied (inhibitory ttypes do not contain 'ET').\"\n", + " \"ET→PT rename is applied (inhibitory ttypes do not contain 'ET').\"\n", " ),\n", " method_name=\"Patch-seq tree-mapping (Gouwens et al. 2020)\",\n", " source_dataset=DATASET_ID,\n", @@ -557,10 +575,10 @@ "id": "8ba5cb2d", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.517229Z", - "iopub.status.busy": "2026-06-12T23:38:34.517028Z", - "iopub.status.idle": "2026-06-12T23:38:34.534910Z", - "shell.execute_reply": "2026-06-12T23:38:34.534226Z" + "iopub.execute_input": "2026-06-23T15:20:57.533854Z", + "iopub.status.busy": "2026-06-23T15:20:57.533639Z", + "iopub.status.idle": "2026-06-23T15:20:57.570176Z", + "shell.execute_reply": "2026-06-23T15:20:57.569421Z" } }, "outputs": [ @@ -570,17 +588,17 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", - "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_inh_ \u2506 VISp inhi \u2506 Tree-mapp \u2506 Patch-seq \u2506 \u2026 \u2506 null \u2506 tasic_201 \u2506 null \u2506 visp_pat \u2502\n", - "\u2502 patchseq_ \u2506 bitory \u2506 ing of \u2506 tree-mapp \u2506 \u2506 \u2506 8_visp_ta \u2506 \u2506 chseq \u2502\n", - "\u2502 ttype_map \u2506 
Patch-seq \u2506 VISp inhi \u2506 ing \u2506 \u2506 \u2506 xonomy \u2506 \u2506 \u2502\n", - "\u2502 pin\u2026 \u2506 T-ty\u2026 \u2506 bitor\u2026 \u2506 (Gouwen\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", + "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", + "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ visp_inh_ ┆ VISp inhi ┆ Tree-mapp ┆ Patch-seq ┆ … ┆ null ┆ tasic_201 ┆ null ┆ visp_pat │\n", + "│ patchseq_ ┆ bitory ┆ ing of ┆ tree-mapp ┆ ┆ ┆ 8_visp_ta ┆ ┆ chseq │\n", + "│ ttype_map ┆ Patch-seq ┆ VISp inhi ┆ ing ┆ ┆ ┆ xonomy ┆ ┆ │\n", + "│ pin… ┆ T-ty… ┆ bitor… ┆ (Gouwen… ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -604,10 +622,10 @@ "id": "e9784282", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.536620Z", - "iopub.status.busy": "2026-06-12T23:38:34.536428Z", - "iopub.status.idle": "2026-06-12T23:38:34.949621Z", - "shell.execute_reply": 
"2026-06-12T23:38:34.948838Z" + "iopub.execute_input": "2026-06-23T15:20:57.571869Z", + "iopub.status.busy": "2026-06-23T15:20:57.571658Z", + "iopub.status.idle": "2026-06-23T15:20:57.934109Z", + "shell.execute_reply": "2026-06-23T15:20:57.933171Z" } }, "outputs": [ @@ -631,7 +649,7 @@ "ttype_mappings: list[CellToClusterMapping] = []\n", "for cell_id, leaf in zip(tt_df.index, tt_df[\"ttype\"]):\n", " if not isinstance(leaf, str):\n", - " continue # defensive \u2014 current data has no NaN ttypes\n", + " continue # defensive — current data has no NaN ttypes\n", " for cid, is_leaf in walk_ancestors(leaf, ttype_parent):\n", " ttype_mappings.append(CellToClusterMapping(\n", " id=f\"{cell_id}-{cid}-{PROJECT_ID}-{TTYPE_HIERARCHY_ID}\",\n", @@ -652,10 +670,10 @@ "id": "0f5e1931", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.951312Z", - "iopub.status.busy": "2026-06-12T23:38:34.951105Z", - "iopub.status.idle": "2026-06-12T23:38:34.975660Z", - "shell.execute_reply": "2026-06-12T23:38:34.974916Z" + "iopub.execute_input": "2026-06-23T15:20:57.935888Z", + "iopub.status.busy": "2026-06-23T15:20:57.935680Z", + "iopub.status.idle": "2026-06-23T15:20:57.977838Z", + "shell.execute_reply": "2026-06-23T15:20:57.977022Z" } }, "outputs": [ @@ -665,23 +683,23 @@ "text": [ "(11036, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 mapping_set \u2506 source_cell 
\u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 888001481-L \u2506 visp_inh_pa \u2506 888001481 \u2506 Lamp5 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 amp5 \u2506 tchseq_ttyp \u2506 \u2506 Fam19a1 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 Fam19a1 \u2506 e_mappin\u2026 \u2506 \u2506 Tmem182 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 Tmem18\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 888001481-L \u2506 visp_inh_pa \u2506 888001481 \u2506 Lamp5 \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 amp5-visp_p \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 atchseq-\u2026 \u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 888001481-G \u2506 visp_inh_pa \u2506 888001481 \u2506 GABAergic \u2506 null \u2506 null \u2506 null \u2506 visp_patch \u2502\n", - "\u2502 ABAergic-vi \u2506 tchseq_ttyp \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 seq \u2502\n", - "\u2502 sp_patch\u2026 
\u2506 e_mappin\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", + "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", + "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", + "│ 888001481-L ┆ visp_inh_pa ┆ 888001481 ┆ Lamp5 ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ amp5 ┆ tchseq_ttyp ┆ ┆ Fam19a1 ┆ ┆ ┆ ┆ seq │\n", + "│ Fam19a1 ┆ e_mappin… ┆ ┆ Tmem182 ┆ ┆ ┆ ┆ │\n", + "│ Tmem18… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 888001481-L ┆ visp_inh_pa ┆ 888001481 ┆ Lamp5 ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ amp5-visp_p ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", + "│ atchseq-… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ 888001481-G ┆ visp_inh_pa ┆ 888001481 ┆ GABAergic ┆ null ┆ null ┆ null ┆ visp_patch │\n", + "│ ABAergic-vi ┆ tchseq_ttyp ┆ ┆ ┆ ┆ ┆ ┆ seq │\n", + "│ sp_patch… ┆ e_mappin… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" ] } ], @@ -707,13 +725,13 @@ "id": "7aea9cb7", "metadata": {}, "source": [ - "## Section 2 \u2014 MET-type \u2192 `ClusterMembership` against 
VISp MET-types\n", + "## Section 2 — MET-type → `ClusterMembership` against VISp MET-types\n", "\n", "Uses **merge-then-overwrite**: read the existing rows under the\n", "`(project_id, hierarchy_id)` predicate, drop rows whose `item` is one of our 495\n", "cells, union with the new rows, and overwrite. This preserves whatever else is\n", "written under the same predicate (e.g. `etl_visp_exc_patchseq_03`'s 1152 rows for\n", - "the exc cells \u2014 disjoint cell ids, but same `project_id`/`hierarchy_id` partition).\n" + "the exc cells — disjoint cell ids, but same `project_id`/`hierarchy_id` partition).\n" ] }, { @@ -722,10 +740,10 @@ "id": "1c6330d0", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.977460Z", - "iopub.status.busy": "2026-06-12T23:38:34.977180Z", - "iopub.status.idle": "2026-06-12T23:38:34.983225Z", - "shell.execute_reply": "2026-06-12T23:38:34.982574Z" + "iopub.execute_input": "2026-06-23T15:20:57.979675Z", + "iopub.status.busy": "2026-06-23T15:20:57.979460Z", + "iopub.status.idle": "2026-06-23T15:20:57.986158Z", + "shell.execute_reply": "2026-06-23T15:20:57.985423Z" } }, "outputs": [ @@ -758,10 +776,10 @@ "id": "3d5fd53f", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.984754Z", - "iopub.status.busy": "2026-06-12T23:38:34.984580Z", - "iopub.status.idle": "2026-06-12T23:38:34.994024Z", - "shell.execute_reply": "2026-06-12T23:38:34.993357Z" + "iopub.execute_input": "2026-06-23T15:20:57.987882Z", + "iopub.status.busy": "2026-06-23T15:20:57.987675Z", + "iopub.status.idle": "2026-06-23T15:20:57.995516Z", + "shell.execute_reply": "2026-06-23T15:20:57.994782Z" } }, "outputs": [ @@ -774,7 +792,7 @@ } ], "source": [ - "# Build new ClusterMembership rows (one per cell \u00d7 ancestor).\n", + "# Build new ClusterMembership rows (one per cell × ancestor).\n", "new_memberships: list[ClusterMembership] = []\n", "for cell_id, leaf in zip(met_clean[\"specimen_id\"], met_clean[\"met_type\"]):\n", " for cid, 
is_leaf in walk_ancestors(leaf, met_parent):\n", @@ -795,10 +813,10 @@ "id": "e0e80e17", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:34.995537Z", - "iopub.status.busy": "2026-06-12T23:38:34.995354Z", - "iopub.status.idle": "2026-06-12T23:38:34.998642Z", - "shell.execute_reply": "2026-06-12T23:38:34.998025Z" + "iopub.execute_input": "2026-06-23T15:20:57.997261Z", + "iopub.status.busy": "2026-06-23T15:20:57.997054Z", + "iopub.status.idle": "2026-06-23T15:20:58.048872Z", + "shell.execute_reply": "2026-06-23T15:20:58.048055Z" } }, "outputs": [ @@ -806,15 +824,29 @@ "name": "stdout", "output_type": "stream", "text": [ - "Total ClusterMembership rows to write: 1485\n" + "Existing rows under predicate: 2637; kept (other notebooks): 1152; new: 1485\n", + "Total ClusterMembership rows to write: 2637\n" ] } ], "source": [ - "# write_models overwrites scope (project_id, hierarchy_id) so no manual merge needed.\n", - "import polars as _pl\n", - "other_cm = _pl.DataFrame({\"item\": []})\n", - "all_memberships = new_memberships\n", + "# Merge-then-overwrite: ClusterMembership is overwrite_scoped on\n", + "# (project_id, hierarchy_id), so a plain overwrite here would clobber rows\n", + "# written under the same predicate by sibling notebooks (e.g.\n", + "# etl_visp_exc_patchseq_03's 1152 mMET-exc cells). 
Read existing rows,\n", + "# keep the ones this notebook does not own (item NOT IN our_cell_ids), and\n", + "# union them with the new rows before re-writing the full scope.\n", + "try:\n", + " existing_cm = (\n", + " pl.read_delta(OUTPUT_ROOT + \"clustermembership/\")\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID))\n", + " )\n", + "except Exception:\n", + " existing_cm = pl.DataFrame(schema={\"item\": pl.Utf8})\n", + "other_cm = existing_cm.filter(~pl.col(\"item\").is_in(list(our_cell_ids))) if existing_cm.shape[0] else existing_cm\n", + "other_memberships = [ClusterMembership(**row) for row in other_cm.to_dicts()]\n", + "all_memberships = other_memberships + new_memberships\n", + "print(f\"Existing rows under predicate: {existing_cm.shape[0]}; kept (other notebooks): {other_cm.shape[0]}; new: {len(new_memberships)}\")\n", "print(f\"Total ClusterMembership rows to write: {len(all_memberships)}\")\n" ] }, @@ -824,10 +856,10 @@ "id": "be955ac8", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:35.000108Z", - "iopub.status.busy": "2026-06-12T23:38:34.999904Z", - "iopub.status.idle": "2026-06-12T23:38:35.155111Z", - "shell.execute_reply": "2026-06-12T23:38:35.154443Z" + "iopub.execute_input": "2026-06-23T15:20:58.050716Z", + "iopub.status.busy": "2026-06-23T15:20:58.050415Z", + "iopub.status.idle": "2026-06-23T15:20:58.201025Z", + "shell.execute_reply": "2026-06-23T15:20:58.200282Z" } }, "outputs": [ @@ -835,7 +867,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "ClusterMembership written: 1485 rows\n" + "ClusterMembership written: 2637 rows\n" ] } ], @@ -850,10 +882,10 @@ "id": "4f82b279", "metadata": { "execution": { - "iopub.execute_input": "2026-06-12T23:38:35.157052Z", - "iopub.status.busy": "2026-06-12T23:38:35.156855Z", - "iopub.status.idle": "2026-06-12T23:38:35.174766Z", - "shell.execute_reply": "2026-06-12T23:38:35.174116Z" + "iopub.execute_input": 
"2026-06-23T15:20:58.203086Z", + "iopub.status.busy": "2026-06-23T15:20:58.202793Z", + "iopub.status.idle": "2026-06-23T15:20:58.240103Z", + "shell.execute_reply": "2026-06-23T15:20:58.239359Z" } }, "outputs": [ @@ -861,23 +893,23 @@ "name": "stdout", "output_type": "stream", "text": [ - "(1485, 7)\n", + "(2637, 7)\n", "shape: (3, 7)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 item \u2506 cluster \u2506 membership_sco \u2506 probability \u2506 distance \u2506 project_id \u2506 hierarchy_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 re \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 f64 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 601506507 \u2506 Vip-MET-2 \u2506 null \u2506 
null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", - "\u2502 601506507 \u2506 GABAergic \u2506 null \u2506 null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", - "\u2502 601506507 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_patchseq \u2506 visp_met_types \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 _taxonomy \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + "┌────────────┬───────────────┬──────────────┬─────────────┬──────────┬──────────────┬──────────────┐\n", + "│ item ┆ cluster ┆ membership_s ┆ probability ┆ distance ┆ project_id ┆ hierarchy_id │\n", + "│ --- ┆ --- ┆ core ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ f64 ┆ ┆ ┆ ┆ │\n", + "╞════════════╪═══════════════╪══════════════╪═════════════╪══════════╪══════════════╪══════════════╡\n", + "│ 1039273993 ┆ L6b ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + "│ 1039273993 ┆ Glutamatergic ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + "│ 1039273993 ┆ cell ┆ null ┆ null ┆ null ┆ visp_patchse ┆ visp_met_typ │\n", + "│ ┆ ┆ ┆ ┆ ┆ q ┆ es_taxonomy │\n", + 
"└────────────┴───────────────┴──────────────┴─────────────┴──────────┴──────────────┴──────────────┘\n", "Our cells present: 495 / 495\n", - "Other-notebook rows preserved: 0\n" + "Other-notebook rows preserved: 1152\n" ] } ], @@ -924,8 +956,8 @@ "|---|---|---|\n", "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | up to 103 (first run) |\n", "| `mappingset/` | `MappingSet` | 1 (`visp_inh_patchseq_ttype_mapping`) |\n", - "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells \u00d7 4 levels = 11036 |\n", - "| `clustermembership/` | `ClusterMembership` | 495 cells \u00d7 3 levels = 1485 (merged with prior rows under same predicate) |\n", + "| `celltoclustermapping/` | `CellToClusterMapping` | 2759 cells × 4 levels = 11036 |\n", + "| `clustermembership/` | `ClusterMembership` | 495 cells × 3 levels = 1485 (merged with prior rows under same predicate) |\n", "\n", "All writes are scoped by two-level predicates and are individually idempotent on re-run.\n" ] @@ -947,9 +979,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.12.4" + "version": "3.13.13" } }, "nbformat": 4, "nbformat_minor": 5 -} \ No newline at end of file +} diff --git a/code/etl_visp_met_types_01_cluster.ipynb b/code/etl_visp_met_types_01_cluster.ipynb index 47ca867..6c7bbe4 100644 --- a/code/etl_visp_met_types_01_cluster.ipynb +++ b/code/etl_visp_met_types_01_cluster.ipynb @@ -5,25 +5,18 @@ "id": "d0e57e11", "metadata": {}, "source": [ - "# ETL \u2014 VISp MET-types Taxonomy (cluster reference)\n", + "# ETL — VISp MET-types Taxonomy (cluster reference)\n", "\n", "Registers the **VISp MET-types taxonomy** as a global cluster reference. Writes `algorithmrun/`, `clusterhierarchy/`, `cluster/`, `hierarchycategory/`. **Out of scope:** no `DataItem` registration.\n", "\n", - "Source: `met_type_colors.json` (45 MET-type labels, leaf-only colors). Two real levels (class \u2192 cluster) with a synthetic `cell` root. 
Class-level colors sourced from Tasic's `anno.feather` for visual consistency. Schema caveats already documented in `etl_tasic_01_cluster.ipynb`; not repeated here." + "Source: `met_type_colors.json` (45 MET-type labels, leaf-only colors). Two real levels (class → cluster) with a synthetic `cell` root. Class-level colors sourced from Tasic's `anno.feather` for visual consistency. Schema caveats already documented in `etl_tasic_01_cluster.ipynb`; not repeated here." ] }, { "cell_type": "code", "execution_count": 1, "id": "c11cf781", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:24.568804Z", - "iopub.status.busy": "2026-06-12T23:38:24.568612Z", - "iopub.status.idle": "2026-06-12T23:38:25.733936Z", - "shell.execute_reply": "2026-06-12T23:38:25.733185Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", @@ -46,14 +39,7 @@ "cell_type": "code", "execution_count": 2, "id": "b00131aa", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:25.735740Z", - "iopub.status.busy": "2026-06-12T23:38:25.735447Z", - "iopub.status.idle": "2026-06-12T23:38:25.742186Z", - "shell.execute_reply": "2026-06-12T23:38:25.741467Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -96,14 +82,7 @@ "cell_type": "code", "execution_count": 3, "id": "2a561a2d", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:25.743706Z", - "iopub.status.busy": "2026-06-12T23:38:25.743526Z", - "iopub.status.idle": "2026-06-12T23:38:25.925732Z", - "shell.execute_reply": "2026-06-12T23:38:25.924959Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -125,7 +104,7 @@ "GABA_COLOR = tasic_class_colors[\"GABAergic\"]\n", "GLUT_COLOR = tasic_class_colors[\"Glutamatergic\"]\n", "\n", - "# Leaf split: \"MET\" in label \u2192 GABAergic, otherwise \u2192 Glutamatergic.\n", + "# Leaf split: \"MET\" in label → GABAergic, otherwise → Glutamatergic.\n", "gaba_met_types = [t for t in met_colors 
if \"MET\" in t]\n", "glut_met_types = [t for t in met_colors if \"MET\" not in t]\n", "\n", @@ -143,21 +122,14 @@ "id": "95b9bb7e", "metadata": {}, "source": [ - "## `HierarchyCategory` \u2014 3 rows (`major_class`/`class`/`cluster`); no `subclass` for this taxonomy" + "## `HierarchyCategory` — 3 rows (`major_class`/`class`/`cluster`); no `subclass` for this taxonomy" ] }, { "cell_type": "code", "execution_count": 4, "id": "761bfedf", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:25.927759Z", - "iopub.status.busy": "2026-06-12T23:38:25.927554Z", - "iopub.status.idle": "2026-06-12T23:38:26.200589Z", - "shell.execute_reply": "2026-06-12T23:38:26.199792Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -184,30 +156,23 @@ "cell_type": "code", "execution_count": 5, "id": "c9e74c2d", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.202206Z", - "iopub.status.busy": "2026-06-12T23:38:26.202010Z", - "iopub.status.idle": "2026-06-12T23:38:26.227829Z", - "shell.execute_reply": "2026-06-12T23:38:26.227077Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape: (4, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 description \u2506 level \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 class \u2506 Top-level cell class. \u2506 2 \u2502\n", - "\u2502 cluster \u2506 Leaf cluster (cell type / MET-\u2026 \u2506 0 \u2502\n", - "\u2502 major_class \u2506 Synthetic root grouping all cl\u2026 \u2506 3 \u2502\n", - "\u2502 subclass \u2506 Subclass of cell types. \u2506 1 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────────────────────────┬───────┐\n", + "│ id ┆ description ┆ level │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════╪═════════════════════════════════╪═══════╡\n", + "│ class ┆ Top-level cell class. ┆ 2 │\n", + "│ cluster ┆ Leaf cluster (cell type / MET-… ┆ 0 │\n", + "│ major_class ┆ Synthetic root grouping all cl… ┆ 3 │\n", + "│ subclass ┆ Subclass of cell types. 
┆ 1 │\n", + "└─────────────┴─────────────────────────────────┴───────┘\n" ] } ], @@ -225,21 +190,14 @@ "id": "0114bd8d", "metadata": {}, "source": [ - "## `AlgorithmRun` \u2014 1 row" + "## `AlgorithmRun` — 1 row" ] }, { "cell_type": "code", "execution_count": 6, "id": "b472c110", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.229530Z", - "iopub.status.busy": "2026-06-12T23:38:26.229243Z", - "iopub.status.idle": "2026-06-12T23:38:26.296348Z", - "shell.execute_reply": "2026-06-12T23:38:26.295577Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -268,14 +226,7 @@ "cell_type": "code", "execution_count": 7, "id": "d4e82c41", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.298241Z", - "iopub.status.busy": "2026-06-12T23:38:26.297947Z", - "iopub.status.idle": "2026-06-12T23:38:26.314156Z", - "shell.execute_reply": "2026-06-12T23:38:26.313410Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -283,18 +234,18 @@ "text": [ "(1, 9)\n", "shape: (1, 9)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 algorithm \u2506 algorithm \u2506 json_obje \u2506 \u2026 \u2506 input_dat \u2506 produced_ \u2506 score_des \u2506 distance \u2502\n", - "\u2502 --- \u2506 _name \u2506 _version \u2506 ct \u2506 \u2506 aset \u2506 hierarchi \u2506 cription \u2506 _descrip \u2502\n", - "\u2502 str \u2506 --- \u2506 --- \u2506 --- 
\u2506 \u2506 --- \u2506 es \u2506 --- \u2506 tion \u2502\n", - "\u2502 \u2506 str \u2506 str \u2506 str \u2506 \u2506 str \u2506 --- \u2506 str \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 str \u2506 \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_met_ \u2506 VISp \u2506 2021 \u2506 null \u2506 \u2026 \u2506 null \u2506 null \u2506 null \u2506 null \u2502\n", - "\u2502 types_clu \u2506 MET-types \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 stering \u2506 taxonomy \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 \u2506 (Patch\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ 
algorithm ┆ algorithm ┆ json_obje ┆ … ┆ input_dat ┆ produced_ ┆ score_des ┆ distance │\n", + "│ --- ┆ _name ┆ _version ┆ ct ┆ ┆ aset ┆ hierarchi ┆ cription ┆ _descrip │\n", + "│ str ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ es ┆ --- ┆ tion │\n", + "│ ┆ str ┆ str ┆ str ┆ ┆ str ┆ --- ┆ str ┆ --- │\n", + "│ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ visp_met_ ┆ VISp ┆ 2021 ┆ null ┆ … ┆ null ┆ null ┆ null ┆ null │\n", + "│ types_clu ┆ MET-types ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ stering ┆ taxonomy ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "│ ┆ (Patch… ┆ ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ -310,33 +261,20 @@ "id": "c92da157", "metadata": {}, "source": [ - "## `Cluster` \u2014 48 rows (1 synthetic root + 2 classes + 45 leaves)" + "## `Cluster` — 48 rows (1 synthetic root + 2 classes + 45 leaves)" ] }, { "cell_type": "code", "execution_count": 8, "id": "932c050e", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.316301Z", - "iopub.status.busy": "2026-06-12T23:38:26.316090Z", - "iopub.status.idle": "2026-06-12T23:38:26.452251Z", - "shell.execute_reply": "2026-06-12T23:38:26.451453Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "Cluster rows built: 48\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "Cluster rows built: 48\n", "Cluster written: 48 rows\n" ] } @@ -397,14 +335,7 @@ "cell_type": "code", "execution_count": 9, "id": "69e176bd", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.454202Z", - "iopub.status.busy": "2026-06-12T23:38:26.453898Z", - "iopub.status.idle": "2026-06-12T23:38:26.524013Z", - "shell.execute_reply": "2026-06-12T23:38:26.523161Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -412,15 +343,15 @@ "text": [ "(48, 9)\n", "shape: (3, 2)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 hierarchy_category \u2506 len \u2502\n", - "\u2502 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 u32 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 class \u2506 2 \u2502\n", - "\u2502 cluster \u2506 45 \u2502\n", - "\u2502 major_class \u2506 1 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌────────────────────┬─────┐\n", + "│ hierarchy_category ┆ len │\n", + "│ --- ┆ --- │\n", + "│ str ┆ u32 │\n", + "╞════════════════════╪═════╡\n", + "│ class ┆ 2 │\n", + "│ cluster ┆ 45 │\n", + "│ major_class ┆ 1 │\n", + "└────────────────────┴─────┘\n" ] } ], @@ -437,21 +368,14 @@ "id": "27f5ed38", "metadata": {}, "source": [ - "## `ClusterHierarchy` \u2014 1 row" + "## `ClusterHierarchy` — 1 row" ] }, { "cell_type": "code", "execution_count": 10, "id": "6322bb75", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.526101Z", - "iopub.status.busy": "2026-06-12T23:38:26.525892Z", - "iopub.status.idle": "2026-06-12T23:38:26.593814Z", - "shell.execute_reply": "2026-06-12T23:38:26.593044Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -476,14 +400,7 @@ "cell_type": "code", "execution_count": 11, "id": "759595ff", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T23:38:26.595365Z", - "iopub.status.busy": "2026-06-12T23:38:26.595162Z", - "iopub.status.idle": "2026-06-12T23:38:26.610227Z", - "shell.execute_reply": "2026-06-12T23:38:26.609513Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -522,6 +439,14 @@ "\n", "Coexists alongside the 
Tasic taxonomy in the same global tables. Idempotent." ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ffe9882c-bf92-428d-bb65-f3fb574bbc13", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -540,7 +465,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_wnm_exc_01_dataset_dataitem.ipynb b/code/etl_wnm_exc_01_dataset_dataitem.ipynb index dc10fbc..ab9d264 100644 --- a/code/etl_wnm_exc_01_dataset_dataitem.ipynb +++ b/code/etl_wnm_exc_01_dataset_dataitem.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 VISp Excitatory Whole Neuron Morphology: DataSet & DataItem\n", + "# ETL — VISp Excitatory Whole Neuron Morphology: DataSet & DataItem\n", "\n", "Writes one `DataSet` record (`dataset_id = \"visp_exc_wnm\"`, `project_id = \"visp_wnm\"`), one `DataItem` per cell from `FullMorphMetaData_Master.csv` (cell id = SWC filename with `.swc` stripped), and the corresponding `DataItemDataSetAssociation` links. No prerequisites; features and cluster mappings are written in `_02` and `_03`." 
] @@ -12,14 +12,7 @@ { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:44.696832Z", - "iopub.status.busy": "2026-06-12T21:50:44.696614Z", - "iopub.status.idle": "2026-06-12T21:50:45.925796Z", - "shell.execute_reply": "2026-06-12T21:50:45.924805Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", @@ -39,14 +32,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:45.928230Z", - "iopub.status.busy": "2026-06-12T21:50:45.927910Z", - "iopub.status.idle": "2026-06-12T21:50:45.932765Z", - "shell.execute_reply": "2026-06-12T21:50:45.932056Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -81,14 +67,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:45.969754Z", - "iopub.status.busy": "2026-06-12T21:50:45.969539Z", - "iopub.status.idle": "2026-06-12T21:50:46.313326Z", - "shell.execute_reply": "2026-06-12T21:50:46.312352Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -273,14 +252,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.315756Z", - "iopub.status.busy": "2026-06-12T21:50:46.315468Z", - "iopub.status.idle": "2026-06-12T21:50:46.405149Z", - "shell.execute_reply": "2026-06-12T21:50:46.404259Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -305,14 +277,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.406929Z", - "iopub.status.busy": "2026-06-12T21:50:46.406731Z", - "iopub.status.idle": "2026-06-12T21:50:46.427100Z", - "shell.execute_reply": "2026-06-12T21:50:46.426221Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -320,14 +285,14 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 publication \u2506 modality \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_exc_wnm \u2506 VISp excitatory whole \u2506 doi.org/10.1101/2023.11.25.568\u2026 \u2506 MORPHOLOGY \u2506 visp_wnm \u2502\n", - "\u2502 \u2506 neuron m\u2026 \u2506 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌──────────────┬───────────────────────┬─────────────────────────────────┬────────────┬────────────┐\n", + "│ id ┆ name ┆ publication ┆ modality ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞══════════════╪═══════════════════════╪═════════════════════════════════╪════════════╪════════════╡\n", + "│ visp_exc_wnm ┆ VISp excitatory whole ┆ doi.org/10.1101/2023.11.25.568… ┆ MORPHOLOGY ┆ visp_wnm │\n", + "│ ┆ neuron m… ┆ ┆ ┆ │\n", + "└──────────────┴───────────────────────┴─────────────────────────────────┴────────────┴────────────┘\n" ] } ], @@ -354,14 +319,7 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.428830Z", - "iopub.status.busy": "2026-06-12T21:50:46.428632Z", - "iopub.status.idle": "2026-06-12T21:50:46.523232Z", - "shell.execute_reply": "2026-06-12T21:50:46.522279Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -385,14 +343,7 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.525360Z", - "iopub.status.busy": "2026-06-12T21:50:46.524988Z", - "iopub.status.idle": "2026-06-12T21:50:46.546864Z", - "shell.execute_reply": "2026-06-12T21:50:46.546055Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -400,17 +351,17 @@ "text": [ 
"(341, 4)\n", "shape: (5, 4)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 182709_6984-X2452-Y12423_reg \u2506 182709_6984-X2452-Y12423_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 182709_7126-X2913-Y10535_reg \u2506 182709_7126-X2913-Y10535_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 182724_5937-X3804-Y11955_reg \u2506 182724_5937-X3804-Y11955_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 182724_6175-X3782-Y10859_reg \u2506 182724_6175-X3782-Y10859_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 182724_6354-X4834-Y8105_reg \u2506 182724_6354-X4834-Y8105_reg \u2506 null \u2506 visp_wnm \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌──────────────────────────────┬──────────────────────────────┬───────────────────┬────────────┐\n", + "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str │\n", + "╞══════════════════════════════╪══════════════════════════════╪═══════════════════╪════════════╡\n", + "│ 182709_6984-X2452-Y12423_reg ┆ 182709_6984-X2452-Y12423_reg ┆ null ┆ visp_wnm │\n", + "│ 182709_7126-X2913-Y10535_reg ┆ 182709_7126-X2913-Y10535_reg ┆ null ┆ visp_wnm │\n", + "│ 182724_5937-X3804-Y11955_reg ┆ 182724_5937-X3804-Y11955_reg ┆ null ┆ visp_wnm │\n", + "│ 182724_6175-X3782-Y10859_reg ┆ 182724_6175-X3782-Y10859_reg ┆ null ┆ visp_wnm │\n", + "│ 182724_6354-X4834-Y8105_reg ┆ 182724_6354-X4834-Y8105_reg ┆ null ┆ visp_wnm │\n", + "└──────────────────────────────┴──────────────────────────────┴───────────────────┴────────────┘\n" ] } ], @@ -437,14 +388,7 @@ { "cell_type": "code", "execution_count": 8, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.548907Z", - "iopub.status.busy": "2026-06-12T21:50:46.548707Z", - "iopub.status.idle": "2026-06-12T21:50:46.742101Z", - "shell.execute_reply": "2026-06-12T21:50:46.741177Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -470,14 +414,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:46.743811Z", - 
"iopub.status.busy": "2026-06-12T21:50:46.743611Z", - "iopub.status.idle": "2026-06-12T21:50:46.762469Z", - "shell.execute_reply": "2026-06-12T21:50:46.761718Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -485,17 +422,17 @@ "text": [ "(341, 3)\n", "shape: (5, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 182709_6984-X2452-Y12423_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 182709_7126-X2913-Y10535_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 182724_5937-X3804-Y11955_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 182724_6175-X3782-Y10859_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 182724_6354-X4834-Y8105_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + 
"┌──────────────────────────────┬──────────────┬────────────┐\n", + "│ dataitem_id ┆ dataset_id ┆ project_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞══════════════════════════════╪══════════════╪════════════╡\n", + "│ 182709_6984-X2452-Y12423_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 182709_7126-X2913-Y10535_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 182724_5937-X3804-Y11955_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 182724_6175-X3782-Y10859_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 182724_6354-X4834-Y8105_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "└──────────────────────────────┴──────────────┴────────────┘\n" ] } ], @@ -527,8 +464,8 @@ "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 341 |\n", "\n", "**Input columns intentionally not written here:**\n", - "- `predicted_met_type`, `probability` \u2014 MET-type classification; written in a later notebook as `CellToClusterMapping`.\n", - "- `ccf_soma_location`, `ccf_soma_x/y/z` and remaining morphology metadata \u2014 written in a later notebook as `SingleCellRecon` records." + "- `predicted_met_type`, `probability` — MET-type classification; written in a later notebook as `CellToClusterMapping`.\n", + "- `ccf_soma_location`, `ccf_soma_x/y/z` and remaining morphology metadata — written in a later notebook as `SingleCellRecon` records." 
] }, { @@ -555,7 +492,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/code/etl_wnm_exc_02_cell_features.ipynb b/code/etl_wnm_exc_02_cell_features.ipynb index bd3955b..dfce782 100644 --- a/code/etl_wnm_exc_02_cell_features.ipynb +++ b/code/etl_wnm_exc_02_cell_features.ipynb @@ -4,22 +4,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# ETL \u2014 WNM excitatory: Cell Features (three feature sets)\n", + "# ETL — WNM excitatory: Cell Features (three feature sets)\n", "\n", - "Writes WNM excitatory neuron cell features across three feature sets: **`exc_visp_morph_features`** (shared with `etl_visp_exc_patchseq_02`; defs/set owned by that notebook), **`wnm_exc_local_axon_features`** (local axon + apical dendrite morphology), and **`wnm_exc_complete_axon_features`** (whole-brain axon features from fMOST \u2014 placeholder, file not yet available). All rows use `project_id=\"visp_wnm\"`. Prerequisites: `etl_wnm_exc_01_dataset_dataitem.ipynb` and `etl_visp_exc_patchseq_02_cell_features.ipynb`." + "Writes WNM excitatory neuron cell features across three feature sets: **`exc_visp_morph_features`** (shared with `etl_visp_exc_patchseq_02`; defs/set owned by that notebook), **`wnm_exc_local_axon_features`** (local axon + apical dendrite morphology), and **`wnm_exc_complete_axon_features`** (whole-brain axon features from fMOST — placeholder, file not yet available). All rows use `project_id=\"visp_wnm\"`. Prerequisites: `etl_wnm_exc_01_dataset_dataitem.ipynb` and `etl_visp_exc_patchseq_02_cell_features.ipynb`." 
] }, { "cell_type": "code", "execution_count": 1, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:48.872848Z", - "iopub.status.busy": "2026-06-12T21:50:48.872649Z", - "iopub.status.idle": "2026-06-12T21:50:50.164949Z", - "shell.execute_reply": "2026-06-12T21:50:50.164135Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "import os\n", @@ -48,14 +41,7 @@ { "cell_type": "code", "execution_count": 2, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.167070Z", - "iopub.status.busy": "2026-06-12T21:50:50.166693Z", - "iopub.status.idle": "2026-06-12T21:50:50.171735Z", - "shell.execute_reply": "2026-06-12T21:50:50.170946Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -98,14 +84,7 @@ { "cell_type": "code", "execution_count": 3, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.173234Z", - "iopub.status.busy": "2026-06-12T21:50:50.173055Z", - "iopub.status.idle": "2026-06-12T21:50:50.263721Z", - "shell.execute_reply": "2026-06-12T21:50:50.263025Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -125,7 +104,7 @@ " )\n", ")\n", "assert assoc.shape[0] > 0, (\n", - " f\"etl_wnm_exc_01 must be run first \u2014 \"\n", + " f\"etl_wnm_exc_01 must be run first — \"\n", " f\"no DataItemDataSetAssociation rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "wnm_registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", @@ -134,7 +113,7 @@ "# Assert the shared CellFeatureSet exists (written by etl_visp_exc_patchseq_02).\n", "cfs_check = pl.read_delta(OUTPUT_ROOT + \"cellfeatureset/\").filter(pl.col(\"id\") == FSI_SHARED)\n", "assert cfs_check.shape[0] == 1, (\n", - " f\"etl_visp_exc_patchseq_02 must be run first \u2014 \"\n", + " f\"etl_visp_exc_patchseq_02 must be run first — \"\n", " f\"CellFeatureSet '{FSI_SHARED}' not found\"\n", ")\n", "print(f\"Shared CellFeatureSet '{FSI_SHARED}' found.\")" @@ -145,7 +124,7 @@ "metadata": {}, "source": [ "---\n", - "## Set 1 
\u2014 `exc_visp_morph_features` (shared defs; WNM rows only)\n", + "## Set 1 — `exc_visp_morph_features` (shared defs; WNM rows only)\n", "\n", "Defs and `CellFeatureSet` are owned by `etl_visp_exc_patchseq_02_cell_features.ipynb`. This notebook only writes WNM rows to `cellfeatures/exc_visp_morph_features/` and the corresponding `CellFeatureMatrix` pointer." ] @@ -153,14 +132,7 @@ { "cell_type": "code", "execution_count": 4, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.266102Z", - "iopub.status.busy": "2026-06-12T21:50:50.265805Z", - "iopub.status.idle": "2026-06-12T21:50:50.284645Z", - "shell.execute_reply": "2026-06-12T21:50:50.283869Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -187,14 +159,7 @@ { "cell_type": "code", "execution_count": 5, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.286351Z", - "iopub.status.busy": "2026-06-12T21:50:50.286161Z", - "iopub.status.idle": "2026-06-12T21:50:50.444331Z", - "shell.execute_reply": "2026-06-12T21:50:50.443545Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -322,7 +287,7 @@ " \n", " \n", "\n", - "

3 rows \u00d7 45 columns

\n", + "

3 rows × 45 columns

\n", "" ], "text/plain": [ @@ -401,20 +366,13 @@ { "cell_type": "code", "execution_count": 6, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.446130Z", - "iopub.status.busy": "2026-06-12T21:50:50.445878Z", - "iopub.status.idle": "2026-06-12T21:50:50.453008Z", - "shell.execute_reply": "2026-06-12T21:50:50.452196Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "WARNING: 6 shared def columns are missing from Set1 CSV \u2014 will be filled with NaN:\n", + "WARNING: 6 shared def columns are missing from Set1 CSV — will be filled with NaN:\n", " ['apical_dendrite_mean_diameter', 'apical_dendrite_total_surface_area', 'axon_exit_distance', 'axon_exit_theta', 'basal_dendrite_mean_diameter', 'basal_dendrite_total_surface_area']\n", "After alignment: 6 NaN-filled, 0 dropped\n" ] @@ -427,13 +385,13 @@ "extra_cols = [c for c in csv_feat_cols if c not in shared_def_ids]\n", "\n", "if missing_cols:\n", - " print(f\"WARNING: {len(missing_cols)} shared def columns are missing from Set1 CSV \u2014 \"\n", + " print(f\"WARNING: {len(missing_cols)} shared def columns are missing from Set1 CSV — \"\n", " f\"will be filled with NaN:\\n {missing_cols}\")\n", " for col in missing_cols:\n", " wide1_raw[col] = np.nan\n", "\n", "if extra_cols:\n", - " print(f\"WARNING: {len(extra_cols)} columns in CSV are NOT in shared defs \u2014 dropping:\\n {extra_cols}\")\n", + " print(f\"WARNING: {len(extra_cols)} columns in CSV are NOT in shared defs — dropping:\\n {extra_cols}\")\n", " wide1_raw = wide1_raw.drop(columns=extra_cols)\n", "\n", "print(f\"After alignment: {len(missing_cols)} NaN-filled, {len(extra_cols)} dropped\")" @@ -442,14 +400,7 @@ { "cell_type": "code", "execution_count": 7, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.454769Z", - "iopub.status.busy": "2026-06-12T21:50:50.454512Z", - "iopub.status.idle": "2026-06-12T21:50:50.459107Z", - "shell.execute_reply": 
"2026-06-12T21:50:50.458472Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -475,16 +426,18 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.460934Z", - "iopub.status.busy": "2026-06-12T21:50:50.460749Z", - "iopub.status.idle": "2026-06-12T21:50:50.704454Z", - "shell.execute_reply": "2026-06-12T21:50:50.703619Z" + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Appended 4 new DataItem rows\n", + "Associations written for (visp_wnm, visp_exc_wnm): 345\n" + ] } - }, - "outputs": [], + ], "source": [ "# Register new cells (DataItem + DataItemDataSetAssociation) for those in Set1 not yet in _01.\n", "if new_ids_set1:\n", @@ -522,14 +475,7 @@ { "cell_type": "code", "execution_count": 9, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.706198Z", - "iopub.status.busy": "2026-06-12T21:50:50.705993Z", - "iopub.status.idle": "2026-06-12T21:50:50.726193Z", - "shell.execute_reply": "2026-06-12T21:50:50.725444Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -580,14 +526,7 @@ { "cell_type": "code", "execution_count": 10, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.727914Z", - "iopub.status.busy": "2026-06-12T21:50:50.727714Z", - "iopub.status.idle": "2026-06-12T21:50:50.898963Z", - "shell.execute_reply": "2026-06-12T21:50:50.898200Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -611,14 +550,7 @@ { "cell_type": "code", "execution_count": 11, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.901297Z", - "iopub.status.busy": "2026-06-12T21:50:50.900977Z", - "iopub.status.idle": "2026-06-12T21:50:50.922714Z", - "shell.execute_reply": "2026-06-12T21:50:50.921510Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -626,20 +558,20 @@ "text": [ "(345, 53)\n", "shape: (3, 
3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 project_id \u2506 feature_set_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 17109_6201-X4328-Y6753_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", - "\u2502 17109_6301-X4756-Y24516_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", - "\u2502 17109_6601-X4384-Y7436_reg \u2506 visp_wnm \u2506 exc_visp_morph_features \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────────────────────┬────────────┬─────────────────────────┐\n", + "│ id ┆ project_id ┆ feature_set_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════════════════════╪════════════╪═════════════════════════╡\n", + "│ 
17109_6201-X4328-Y6753_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", + "│ 17109_6301-X4756-Y24516_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", + "│ 17109_6601-X4384-Y7436_reg ┆ visp_wnm ┆ exc_visp_morph_features │\n", + "└─────────────────────────────┴────────────┴─────────────────────────┘\n" ] } ], "source": [ - "# Verification \u2014 Set1 wide parquet.\n", + "# Verification — Set1 wide parquet.\n", "set1_v = pl.read_delta(OUTPUT_ROOT + f\"cellfeatures/{FSI_SHARED}/\").filter(\n", " pl.col(\"project_id\") == PROJECT_ID\n", ")\n", @@ -653,14 +585,7 @@ { "cell_type": "code", "execution_count": 12, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:50.924866Z", - "iopub.status.busy": "2026-06-12T21:50:50.924661Z", - "iopub.status.idle": "2026-06-12T21:50:51.025705Z", - "shell.execute_reply": "2026-06-12T21:50:51.024820Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -686,14 +611,7 @@ { "cell_type": "code", "execution_count": 13, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.027575Z", - "iopub.status.busy": "2026-06-12T21:50:51.027374Z", - "iopub.status.idle": "2026-06-12T21:50:51.046071Z", - "shell.execute_reply": "2026-06-12T21:50:51.045341Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -701,19 +619,19 @@ "text": [ "(1, 5)\n", "shape: (1, 5)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 
feature_set_id \u2506 parquet_path \u2506 cell_index_column \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_wnm_exc_visp_m \u2506 exc_visp_morph_feat \u2506 file:///scratch/em_ \u2506 id \u2506 visp_wnm \u2502\n", - "\u2502 orph_featur\u2026 \u2506 ures \u2506 patchseq_wn\u2026 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────────────┬─────────────────────┬─────────────────────┬───────────────────┬────────────┐\n", + "│ id ┆ feature_set_id ┆ parquet_path ┆ cell_index_column ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + 
"╞═════════════════════╪═════════════════════╪═════════════════════╪═══════════════════╪════════════╡\n", + "│ visp_wnm_exc_visp_m ┆ exc_visp_morph_feat ┆ file:///scratch/em_ ┆ id ┆ visp_wnm │\n", + "│ orph_featur… ┆ ures ┆ patchseq_wn… ┆ ┆ │\n", + "└─────────────────────┴─────────────────────┴─────────────────────┴───────────────────┴────────────┘\n" ] } ], "source": [ - "# Verification \u2014 CellFeatureMatrix Set1.\n", + "# Verification — CellFeatureMatrix Set1.\n", "cfm1_v = pl.read_delta(OUTPUT_ROOT + \"cellfeaturematrix/\").filter(\n", " (pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"feature_set_id\") == FSI_SHARED)\n", ")\n", @@ -726,20 +644,13 @@ "metadata": {}, "source": [ "---\n", - "## Set 2 \u2014 `wnm_exc_local_axon_features`" + "## Set 2 — `wnm_exc_local_axon_features`" ] }, { "cell_type": "code", "execution_count": 14, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.047765Z", - "iopub.status.busy": "2026-06-12T21:50:51.047573Z", - "iopub.status.idle": "2026-06-12T21:50:51.208885Z", - "shell.execute_reply": "2026-06-12T21:50:51.208132Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -867,7 +778,7 @@ " \n", " \n", "\n", - "

3 rows \u00d7 52 columns

\n", + "

3 rows × 52 columns

\n", "" ], "text/plain": [ @@ -936,14 +847,7 @@ { "cell_type": "code", "execution_count": 15, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.211213Z", - "iopub.status.busy": "2026-06-12T21:50:51.210954Z", - "iopub.status.idle": "2026-06-12T21:50:51.217087Z", - "shell.execute_reply": "2026-06-12T21:50:51.216108Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -966,14 +870,7 @@ { "cell_type": "code", "execution_count": 16, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.218953Z", - "iopub.status.busy": "2026-06-12T21:50:51.218706Z", - "iopub.status.idle": "2026-06-12T21:50:51.222853Z", - "shell.execute_reply": "2026-06-12T21:50:51.222194Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -1001,14 +898,7 @@ { "cell_type": "code", "execution_count": 17, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.224726Z", - "iopub.status.busy": "2026-06-12T21:50:51.224424Z", - "iopub.status.idle": "2026-06-12T21:50:51.333620Z", - "shell.execute_reply": "2026-06-12T21:50:51.332901Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -1027,14 +917,7 @@ { "cell_type": "code", "execution_count": 18, - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:51.335542Z", - "iopub.status.busy": "2026-06-12T21:50:51.335347Z", - "iopub.status.idle": "2026-06-12T21:50:51.352807Z", - "shell.execute_reply": "2026-06-12T21:50:51.352181Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -1042,27 +925,27 @@ "text": [ "(51, 8)\n", "shape: (3, 8)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 description \u2506 unit \u2506 data_type \u2506 range_min \u2506 range_max \u2506 project_id \u2506 feature_set \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 _id \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 str \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 apical_dendr \u2506 null \u2506 null \u2506 0, (\n", - " f\"etl_wnm_exc_01 must run first \u2014 no association rows for dataset_id='{DATASET_ID}'\"\n", + " f\"etl_wnm_exc_01 must run first — no association rows for dataset_id='{DATASET_ID}'\"\n", ")\n", "registered_ids = set(assoc[\"dataitem_id\"].to_list())\n", "print(f\"Registered DataItems for 
{DATASET_ID}: {len(registered_ids)}\")\n", @@ -135,7 +114,7 @@ " .filter(pl.col(\"hierarchy_id\") == METTYPE_HIERARCHY_ID)\n", ")\n", "assert met_clu.shape[0] > 0, (\n", - " f\"etl_visp_met_types_01_cluster must run first \u2014 no clusters for {METTYPE_HIERARCHY_ID}\"\n", + " f\"etl_visp_met_types_01_cluster must run first — no clusters for {METTYPE_HIERARCHY_ID}\"\n", ")\n", "met_parent = dict(zip(met_clu[\"id\"].to_list(), met_clu[\"parent\"].to_list()))\n", "print(f\"Clusters loaded: {METTYPE_HIERARCHY_ID}={len(met_parent)}\")\n" @@ -155,14 +134,7 @@ "cell_type": "code", "execution_count": 4, "id": "bd1355ee", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.449026Z", - "iopub.status.busy": "2026-06-12T21:50:56.448686Z", - "iopub.status.idle": "2026-06-12T21:50:56.518182Z", - "shell.execute_reply": "2026-06-12T21:50:56.517284Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -243,14 +215,7 @@ "cell_type": "code", "execution_count": 5, "id": "8b8edd30", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.520130Z", - "iopub.status.busy": "2026-06-12T21:50:56.519850Z", - "iopub.status.idle": "2026-06-12T21:50:56.526433Z", - "shell.execute_reply": "2026-06-12T21:50:56.525702Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -292,14 +257,7 @@ "cell_type": "code", "execution_count": 6, "id": "75c8f92f", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.528218Z", - "iopub.status.busy": "2026-06-12T21:50:56.528038Z", - "iopub.status.idle": "2026-06-12T21:50:56.621862Z", - "shell.execute_reply": "2026-06-12T21:50:56.620821Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -332,14 +290,7 @@ "cell_type": "code", "execution_count": 7, "id": "ce55a05f", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.623828Z", - "iopub.status.busy": "2026-06-12T21:50:56.623449Z", - "iopub.status.idle": 
"2026-06-12T21:50:56.712469Z", - "shell.execute_reply": "2026-06-12T21:50:56.711471Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -347,18 +298,18 @@ "text": [ "(1, 13)\n", "shape: (1, 13)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 descripti \u2506 method_na \u2506 \u2026 \u2506 source_hi \u2506 target_hi \u2506 json_obje \u2506 project_ \u2502\n", - "\u2502 --- \u2506 --- \u2506 on \u2506 me \u2506 \u2506 erarchy \u2506 erarchy \u2506 ct \u2506 id \u2502\n", - "\u2502 str \u2506 str \u2506 --- \u2506 --- \u2506 \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 \u2506 \u2506 str \u2506 str \u2506 \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 visp_exc_ \u2506 VISp WNM \u2506 Routed \u2506 Routed \u2506 \u2026 \u2506 null \u2506 
visp_met_ \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 wnm_metty \u2506 excitator \u2506 random \u2506 random \u2506 \u2506 \u2506 types_tax \u2506 \u2506 \u2502\n", - "\u2502 pe_mappin \u2506 y \u2506 forest \u2506 forest \u2506 \u2506 \u2506 onomy \u2506 \u2506 \u2502\n", - "\u2502 g \u2506 MET-type \u2506 mapping \u2506 mapping \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 \u2506 a\u2026 \u2506 o\u2026 \u2506 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐\n", + "│ id ┆ name ┆ descripti ┆ method_na ┆ … ┆ source_hi ┆ target_hi ┆ json_obje ┆ project_ │\n", + "│ --- ┆ --- ┆ on ┆ me ┆ ┆ erarchy ┆ erarchy ┆ ct ┆ id │\n", + "│ str ┆ str ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ ┆ ┆ str ┆ str ┆ ┆ str ┆ str ┆ str ┆ str │\n", + "╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡\n", + "│ visp_exc_ ┆ VISp WNM ┆ Routed ┆ Routed ┆ … ┆ null ┆ visp_met_ ┆ null ┆ visp_wnm │\n", + "│ wnm_metty ┆ excitator ┆ random ┆ random ┆ ┆ ┆ types_tax ┆ ┆ │\n", + "│ pe_mappin ┆ y ┆ forest ┆ forest ┆ ┆ ┆ onomy ┆ ┆ │\n", + "│ g ┆ MET-type ┆ mapping ┆ mapping ┆ ┆ ┆ ┆ ┆ │\n", + "│ ┆ a… ┆ o… ┆ ┆ ┆ ┆ ┆ ┆ │\n", + "└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘\n" ] } ], @@ 
-390,26 +341,13 @@ "cell_type": "code", "execution_count": 8, "id": "2372cceb", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.714800Z", - "iopub.status.busy": "2026-06-12T21:50:56.714510Z", - "iopub.status.idle": "2026-06-12T21:50:56.837555Z", - "shell.execute_reply": "2026-06-12T21:50:56.836828Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ - "CellToClusterMapping rows built: 1023\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ + "CellToClusterMapping rows built: 1023\n", "CellToClusterMapping written: 1023 rows\n" ] } @@ -440,14 +378,7 @@ "cell_type": "code", "execution_count": 9, "id": "99654342", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:56.839650Z", - "iopub.status.busy": "2026-06-12T21:50:56.839446Z", - "iopub.status.idle": "2026-06-12T21:50:56.871761Z", - "shell.execute_reply": "2026-06-12T21:50:56.870951Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -455,22 +386,22 @@ "text": [ "(1023, 8)\n", "shape: (3, 8)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 mapping_set \u2506 source_cell \u2506 target_clus \u2506 score \u2506 probability \u2506 notes \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 ter \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 --- 
\u2506 f64 \u2506 f64 \u2506 str \u2506 str \u2502\n", - "\u2502 \u2506 \u2506 \u2506 str \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 L5 ET-2 \u2506 null \u2506 0.988 \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 23_reg-L\u2026 \u2506 apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 Glutamaterg \u2506 null \u2506 null \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 ic \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 23_reg-G\u2026 \u2506 apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 182709_6984 \u2506 visp_exc_wn \u2506 182709_6984 \u2506 cell \u2506 null \u2506 null \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 -X2452-Y124 \u2506 m_mettype_m \u2506 -X2452-Y124 \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 23_reg-c\u2026 \u2506 apping \u2506 23_reg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────┬─────────────┬─────────────┬─────────────┬───────┬─────────────┬───────┬────────────┐\n", + "│ id ┆ mapping_set ┆ source_cell ┆ target_clus ┆ score ┆ probability ┆ notes ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ ter ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ --- ┆ f64 ┆ f64 ┆ str ┆ str │\n", + "│ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ │\n", + "╞═════════════╪═════════════╪═════════════╪═════════════╪═══════╪═════════════╪═══════╪════════════╡\n", + "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ L5 ET-2 ┆ null ┆ 0.988 ┆ null ┆ visp_wnm │\n", + "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ┆ ┆ ┆ ┆ │\n", + "│ 23_reg-L… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", + "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ Glutamaterg ┆ null ┆ null ┆ null ┆ visp_wnm │\n", + "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ic ┆ ┆ ┆ ┆ │\n", + "│ 23_reg-G… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", + "│ 182709_6984 ┆ visp_exc_wn ┆ 182709_6984 ┆ cell ┆ null ┆ null ┆ null ┆ visp_wnm │\n", + "│ -X2452-Y124 ┆ m_mettype_m ┆ -X2452-Y124 ┆ ┆ ┆ ┆ ┆ │\n", + "│ 23_reg-c… ┆ apping ┆ 23_reg ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────┴─────────────┴─────────────┴─────────────┴───────┴─────────────┴───────┴────────────┘\n" ] } ], @@ -511,7 +442,7 @@ "| Output path | Class | Rows |\n", "|---|---|---|\n", "| `mappingset/` (`id={MAPPING_SET_ID}`) | `MappingSet` (Routed random forest mapping) | 1 |\n", - "| `celltoclustermapping/` 
(`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell \u00d7 MET-type ancestor); leaf rows carry `probability`, ancestors are null |\n", + "| `celltoclustermapping/` (`mapping_set={MAPPING_SET_ID}`) | `CellToClusterMapping` | one per (cell × MET-type ancestor); leaf rows carry `probability`, ancestors are null |\n", "\n", "**Not written:** no `ClusterMembership` rows. WNM cells did not define the VISp MET-types taxonomy, so their assignments are mappings, not memberships.\n" ] diff --git a/code/etl_wnm_exc_04_projection_matrix.ipynb b/code/etl_wnm_exc_04_projection_matrix.ipynb index 73a7d6a..6ec399b 100644 --- a/code/etl_wnm_exc_04_projection_matrix.ipynb +++ b/code/etl_wnm_exc_04_projection_matrix.ipynb @@ -5,9 +5,9 @@ "id": "7dab0b27", "metadata": {}, "source": [ - "# ETL \u2014 WNM Excitatory: Projection Matrix\n", + "# ETL — WNM Excitatory: Projection Matrix\n", "\n", - "Writes two `ProjectionMeasurementMatrix` rows (ipsi + contra) for `project_id=\"visp_wnm\"`, `dataset_id=\"visp_exc_wnm\"`, plus the backing wide-form Delta tables. Source: `ProjectionMatrix_tip_and_branch_roll_up.csv` (345 cells \u00d7 152 ipsi + 68 contra regions). Prerequisite: `etl_wnm_exc_01`. Registers 4 cells absent from `_01` via `append_new_dataitems`." + "Writes two `ProjectionMeasurementMatrix` rows (ipsi + contra) for `project_id=\"visp_wnm\"`, `dataset_id=\"visp_exc_wnm\"`, plus the backing wide-form Delta tables. Source: `ProjectionMatrix_tip_and_branch_roll_up.csv` (345 cells × 152 ipsi + 68 contra regions). Prerequisite: `etl_wnm_exc_01`. Registers 4 cells absent from `_01` via `append_new_dataitems`." ] }, { @@ -15,7 +15,7 @@ "id": "b1804bce", "metadata": {}, "source": [ - "**Caveat:** `measurement_type=MICRONS_OF_AXON` is a best guess. The filename `tip_and_branch_roll_up` suggests counts, but values are floats with magnitudes ~10\u2074 \u2014 consistent with \u00b5m of axon length per region. To confirm with the data owner." 
+ "**Caveat:** `measurement_type=MICRONS_OF_AXON` is a best guess. The filename `tip_and_branch_roll_up` suggests counts, but values are floats with magnitudes ~10⁴ — consistent with µm of axon length per region. To confirm with the data owner." ] }, { @@ -25,7 +25,7 @@ "source": [ "**Known schema mismatches (stopgaps):**\n", "\n", - "1. `ProjectionMeasurementMatrix` lacks `ProjectScoped` \u2192 metadata predicate is `id IN (...)` only. Fix: add `mixins: [ProjectScoped]` in `schemas/projection_schema.yaml` and regenerate.\n", + "1. `ProjectionMeasurementMatrix` lacks `ProjectScoped` → metadata predicate is `id IN (...)` only. Fix: add `mixins: [ProjectScoped]` in `schemas/projection_schema.yaml` and regenerate.\n", "2. `region_index` stores raw acronym strings instead of `BrainRegion.id`s; `brainregion/` is not yet populated. Re-run after that bootstrap.\n", "3. `values` is typed `ZarrArray` but stored here as a `file://` delta-path string (mirrors `CellFeatureMatrix.parquet_path`). Fix: add a `parquet_path` slot or commit to zarr." 
] @@ -34,14 +34,7 @@ "cell_type": "code", "execution_count": 1, "id": "631cddfd", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:50:59.038836Z", - "iopub.status.busy": "2026-06-12T21:50:59.038653Z", - "iopub.status.idle": "2026-06-12T21:51:00.359122Z", - "shell.execute_reply": "2026-06-12T21:51:00.358164Z" - } - }, + "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", @@ -69,14 +62,7 @@ "cell_type": "code", "execution_count": 2, "id": "ecbab904", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.361196Z", - "iopub.status.busy": "2026-06-12T21:51:00.360893Z", - "iopub.status.idle": "2026-06-12T21:51:00.366512Z", - "shell.execute_reply": "2026-06-12T21:51:00.365316Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -112,14 +98,7 @@ "cell_type": "code", "execution_count": 3, "id": "12f64431", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.368665Z", - "iopub.status.busy": "2026-06-12T21:51:00.368469Z", - "iopub.status.idle": "2026-06-12T21:51:00.532177Z", - "shell.execute_reply": "2026-06-12T21:51:00.531441Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -136,7 +115,7 @@ " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", ")\n", "assert prereq_assoc.shape[0] > 0, (\n", - " f\"etl_wnm_exc_01_dataset_dataitem.ipynb must be run first \u2014 \"\n", + " f\"etl_wnm_exc_01_dataset_dataitem.ipynb must be run first — \"\n", " f\"no DataItemDataSetAssociation rows for project_id='{PROJECT_ID}', dataset_id='{DATASET_ID}'\"\n", ")\n", "print(f\"Prereq OK: {prereq_assoc.shape[0]} DataItem associations registered for {DATASET_ID}.\")" @@ -154,14 +133,7 @@ "cell_type": "code", "execution_count": 4, "id": "faa98c9e", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.534048Z", - "iopub.status.busy": "2026-06-12T21:51:00.533852Z", - "iopub.status.idle": 
"2026-06-12T21:51:00.737783Z", - "shell.execute_reply": "2026-06-12T21:51:00.736996Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -261,7 +233,7 @@ ], "source": [ "# First column (unnamed) is the swc filename. Strip the .swc suffix to get the cell id\n", - "# (matches etl_wnm_exc_01 convention). Cell ids are kept as strings \u2014 never cast.\n", + "# (matches etl_wnm_exc_01 convention). Cell ids are kept as strings — never cast.\n", "df = pd.read_csv(INPUT_CSV, index_col=0)\n", "df.index = df.index.astype(str).str.removesuffix(\".swc\")\n", "df.index.name = \"id\"\n", @@ -273,14 +245,7 @@ "cell_type": "code", "execution_count": 5, "id": "e931e9cc", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.739511Z", - "iopub.status.busy": "2026-06-12T21:51:00.739310Z", - "iopub.status.idle": "2026-06-12T21:51:00.744617Z", - "shell.execute_reply": "2026-06-12T21:51:00.743896Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -320,14 +285,7 @@ "cell_type": "code", "execution_count": 6, "id": "360d2dce", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.746171Z", - "iopub.status.busy": "2026-06-12T21:51:00.745985Z", - "iopub.status.idle": "2026-06-12T21:51:00.750123Z", - "shell.execute_reply": "2026-06-12T21:51:00.749426Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -356,14 +314,7 @@ "cell_type": "code", "execution_count": 7, "id": "88c01dbf", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.751862Z", - "iopub.status.busy": "2026-06-12T21:51:00.751669Z", - "iopub.status.idle": "2026-06-12T21:51:00.755298Z", - "shell.execute_reply": "2026-06-12T21:51:00.754592Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -374,7 +325,7 @@ } ], "source": [ - "# Append only-new DataItem rows for cells absent from _01. append_new_dataitems is idempotent \u2014\n", + "# Append only-new DataItem rows for cells absent from _01. 
append_new_dataitems is idempotent —\n", "# re-running this cell appends 0 and does not disturb other projects' rows in dataitem/.\n", "if new_ids:\n", " new_items = [\n", @@ -391,14 +342,7 @@ "cell_type": "code", "execution_count": 8, "id": "9f5495f0", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.756881Z", - "iopub.status.busy": "2026-06-12T21:51:00.756695Z", - "iopub.status.idle": "2026-06-12T21:51:00.780828Z", - "shell.execute_reply": "2026-06-12T21:51:00.779770Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -406,15 +350,15 @@ "text": [ "(345, 4)\n", "shape: (3, 4)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 name \u2506 neuroglancer_link \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2502\n", - 
"\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 17109_6801-X7432-Y4405_reg \u2506 17109_6801-X7432-Y4405_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 211541_6961-X18505-Y15909_reg \u2506 211541_6961-X18505-Y15909_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2502 220309_5824-X3486-Y10261_reg \u2506 220309_5824-X3486-Y10261_reg \u2506 null \u2506 visp_wnm \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + "┌───────────────────────────────┬───────────────────────────────┬───────────────────┬────────────┐\n", + "│ id ┆ name ┆ neuroglancer_link ┆ project_id │\n", + "│ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str │\n", + "╞═══════════════════════════════╪═══════════════════════════════╪═══════════════════╪════════════╡\n", + "│ 17109_6801-X7432-Y4405_reg ┆ 17109_6801-X7432-Y4405_reg ┆ null ┆ visp_wnm │\n", + "│ 211541_6961-X18505-Y15909_reg ┆ 
211541_6961-X18505-Y15909_reg ┆ null ┆ visp_wnm │\n", + "│ 220309_5824-X3486-Y10261_reg ┆ 220309_5824-X3486-Y10261_reg ┆ null ┆ visp_wnm │\n", + "└───────────────────────────────┴───────────────────────────────┴───────────────────┴────────────┘\n", "All 345 cells present in DataItem.\n" ] } @@ -439,21 +383,18 @@ "id": "6f6bf4aa", "metadata": {}, "source": [ - "## Write `DataItemDataSetAssociation` for all 345 cells" + "## Write `DataItemDataSetAssociation` as `existing ∪ 345 cells`\n", + "\n", + "`DataItemDataSetAssociation` is `overwrite_scoped` on `(project_id, dataset_id)`, so passing\n", + "only the 345 cell ids from this CSV would clobber rows written by `_01`/`_02` for the same\n", + "scope. Union with the existing scope before re-writing." ] }, { "cell_type": "code", "execution_count": 9, "id": "cd9f70b7", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.783268Z", - "iopub.status.busy": "2026-06-12T21:51:00.783056Z", - "iopub.status.idle": "2026-06-12T21:51:00.925226Z", - "shell.execute_reply": "2026-06-12T21:51:00.924491Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -464,14 +405,25 @@ } ], "source": [ - "# Two-level predicate scopes the overwrite to (project_id, dataset_id), so other datasets\n", - "# sharing project_id (none today, but future-proof) are untouched. _01's 341 associations are\n", - "# a strict subset of these 345, so this overwrite is a safe superset.\n", + "# Re-assert the full (project_id, dataset_id) association scope as the union\n", + "# of any existing assoc rows and this CSV's cell_ids. 
DataItemDataSetAssociation\n", + "# is overwrite_scoped on (project_id, dataset_id), so passing only this CSV's\n", + "# ids would clobber rows registered by `_01` or `_02` for the same scope.\n", + "# Union with existing ids — the write is idempotent and self-heals partial runs.\n", + "try:\n", + " existing_assoc_ids = set(\n", + " pl.read_delta(OUTPUT_ROOT + \"dataitem_dataset_association/\")\n", + " .filter((pl.col(\"project_id\") == PROJECT_ID) & (pl.col(\"dataset_id\") == DATASET_ID))\n", + " [\"dataitem_id\"].to_list()\n", + " )\n", + "except Exception:\n", + " existing_assoc_ids = set()\n", + "full_assoc_ids = sorted(existing_assoc_ids | set(cell_ids))\n", "associations = [\n", " DataItemDataSetAssociation(\n", " dataitem_id=cid, dataset_id=DATASET_ID, project_id=PROJECT_ID,\n", " )\n", - " for cid in cell_ids\n", + " for cid in full_assoc_ids\n", "]\n", "result = write_models(associations, output_root=OUTPUT_ROOT)\n", "print(f\"DataItemDataSetAssociation written: {result.rows_written} rows\")" @@ -481,14 +433,7 @@ "cell_type": "code", "execution_count": 10, "id": "8a91a81d", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.927041Z", - "iopub.status.busy": "2026-06-12T21:51:00.926848Z", - "iopub.status.idle": "2026-06-12T21:51:00.946179Z", - "shell.execute_reply": "2026-06-12T21:51:00.945448Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -496,15 +441,15 @@ "text": [ "(345, 3)\n", "shape: (3, 3)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 dataitem_id \u2506 dataset_id \u2506 project_id \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 
str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 18864_6734-X4899-Y27447_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 191812_7938-X6892-Y25312_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2502 211550_7718-X19461-Y16950_reg \u2506 visp_exc_wnm \u2506 visp_wnm \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────────────────────┬──────────────┬────────────┐\n", + "│ dataitem_id ┆ dataset_id ┆ project_id │\n", + "│ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str │\n", + "╞═════════════════════════════╪══════════════╪════════════╡\n", + "│ 17109_6201-X4328-Y6753_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 17109_6301-X4756-Y24516_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "│ 17109_6601-X4384-Y7436_reg ┆ visp_exc_wnm ┆ visp_wnm │\n", + "└─────────────────────────────┴──────────────┴────────────┘\n" ] } ], @@ -534,14 +479,7 @@ "cell_type": "code", "execution_count": 11, "id": "4af58dbe", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:00.947857Z", - "iopub.status.busy": "2026-06-12T21:51:00.947662Z", - "iopub.status.idle": "2026-06-12T21:51:01.082640Z", - "shell.execute_reply": "2026-06-12T21:51:01.081927Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -582,14 +520,7 @@ "cell_type": "code", "execution_count": 12, "id": 
"f097ee84", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.084409Z", - "iopub.status.busy": "2026-06-12T21:51:01.084177Z", - "iopub.status.idle": "2026-06-12T21:51:01.111716Z", - "shell.execute_reply": "2026-06-12T21:51:01.110614Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -597,17 +528,17 @@ "text": [ "(345, 155)\n", "shape: (3, 6)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 project_id \u2506 dataset_id \u2506 VISam \u2506 VISp \u2506 VISpm \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 f64 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 18864_6734-X4899-Y27447_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 8287.70664 \u2506 34450.175934 \u2506 483.223644 
\u2502\n", - "\u2502 191812_7938-X6892-Y25312_re \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 794.102517 \u2506 0.0 \u2502\n", - "\u2502 g \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2502 211550_7718-X19461-Y16950_r \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 6473.751624 \u2506 0.0 \u2502\n", - "\u2502 eg \u2506 \u2506 \u2506 \u2506 \u2506 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌─────────────────────────────┬────────────┬──────────────┬────────────┬──────────────┬────────────┐\n", + "│ id ┆ project_id ┆ dataset_id ┆ VISam ┆ VISp ┆ VISpm │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 │\n", + "╞═════════════════════════════╪════════════╪══════════════╪════════════╪══════════════╪════════════╡\n", + "│ 18864_6734-X4899-Y27447_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 8287.70664 ┆ 34450.175934 ┆ 483.223644 │\n", + "│ 191812_7938-X6892-Y25312_re ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 794.102517 ┆ 0.0 │\n", + "│ g ┆ ┆ ┆ ┆ ┆ │\n", + "│ 211550_7718-X19461-Y16950_r ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 6473.751624 ┆ 0.0 │\n", + "│ eg ┆ ┆ ┆ ┆ ┆ │\n", + "└─────────────────────────────┴────────────┴──────────────┴────────────┴──────────────┴────────────┘\n" ] } ], @@ -637,14 +568,7 @@ "cell_type": "code", "execution_count": 13, "id": "eb78fbca", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.113992Z", - "iopub.status.busy": 
"2026-06-12T21:51:01.113587Z", - "iopub.status.idle": "2026-06-12T21:51:01.323307Z", - "shell.execute_reply": "2026-06-12T21:51:01.322613Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -682,14 +606,7 @@ "cell_type": "code", "execution_count": 14, "id": "b6dacbdf", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.325448Z", - "iopub.status.busy": "2026-06-12T21:51:01.325220Z", - "iopub.status.idle": "2026-06-12T21:51:01.343597Z", - "shell.execute_reply": "2026-06-12T21:51:01.342780Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -697,15 +614,15 @@ "text": [ "(345, 71)\n", "shape: (3, 6)\n", - "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 project_id \u2506 dataset_id \u2506 VISpor \u2506 VISp \u2506 CP \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 f64 \u2506 f64 \u2506 f64 \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 
18864_6734-X4899-Y27447_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 0.0 \u2506 0.0 \u2502\n", - "\u2502 191812_7938-X6892-Y25312_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 1045.437572 \u2506 0.0 \u2506 0.0 \u2502\n", - "\u2502 211550_7718-X19461-Y16950_reg \u2506 visp_wnm \u2506 visp_exc_wnm \u2506 0.0 \u2506 0.0 \u2506 0.0 \u2502\n", - "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518\n" + "┌───────────────────────────────┬────────────┬──────────────┬─────────────┬──────┬─────┐\n", + "│ id ┆ project_id ┆ dataset_id ┆ VISpor ┆ VISp ┆ CP │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ f64 ┆ f64 ┆ f64 │\n", + "╞═══════════════════════════════╪════════════╪══════════════╪═════════════╪══════╪═════╡\n", + "│ 18864_6734-X4899-Y27447_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", + "│ 191812_7938-X6892-Y25312_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 1045.437572 ┆ 0.0 ┆ 0.0 │\n", + "│ 211550_7718-X19461-Y16950_reg ┆ visp_wnm ┆ visp_exc_wnm ┆ 0.0 ┆ 0.0 ┆ 0.0 │\n", + "└───────────────────────────────┴────────────┴──────────────┴─────────────┴──────┴─────┘\n" ] } ], @@ -735,14 +652,7 @@ "cell_type": "code", "execution_count": 15, "id": "43d47d58", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.345312Z", - "iopub.status.busy": "2026-06-12T21:51:01.345125Z", - "iopub.status.idle": "2026-06-12T21:51:01.352500Z", - "shell.execute_reply": "2026-06-12T21:51:01.351862Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ 
-767,14 +677,7 @@ "cell_type": "code", "execution_count": 16, "id": "0f278d8a", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.354162Z", - "iopub.status.busy": "2026-06-12T21:51:01.353981Z", - "iopub.status.idle": "2026-06-12T21:51:01.510423Z", - "shell.execute_reply": "2026-06-12T21:51:01.509319Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -791,7 +694,7 @@ "\n", "ipsi_matrix = ProjectionMeasurementMatrix(\n", " id=FSI_IPSI,\n", - " description=\"WNM excitatory ipsilateral projection matrix: per-cell axon length (\u00b5m, inferred) by ipsilateral CCF region.\",\n", + " description=\"WNM excitatory ipsilateral projection matrix: per-cell axon length (µm, inferred) by ipsilateral CCF region.\",\n", " measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON,\n", " modality=Modality.MORPHOLOGY,\n", " laterality=Laterality.IPSILATERAL,\n", @@ -803,7 +706,7 @@ ")\n", "contra_matrix = ProjectionMeasurementMatrix(\n", " id=FSI_CONTRA,\n", - " description=\"WNM excitatory contralateral projection matrix: per-cell axon length (\u00b5m, inferred) by contralateral CCF region.\",\n", + " description=\"WNM excitatory contralateral projection matrix: per-cell axon length (µm, inferred) by contralateral CCF region.\",\n", " measurement_type=ProjectionMeasurementType.MICRONS_OF_AXON,\n", " modality=Modality.MORPHOLOGY,\n", " laterality=Laterality.CONTRALATERAL,\n", @@ -823,14 +726,7 @@ "cell_type": "code", "execution_count": 17, "id": "b5fca905", - "metadata": { - "execution": { - "iopub.execute_input": "2026-06-12T21:51:01.512612Z", - "iopub.status.busy": "2026-06-12T21:51:01.512251Z", - "iopub.status.idle": "2026-06-12T21:51:01.535372Z", - "shell.execute_reply": "2026-06-12T21:51:01.534491Z" - } - }, + "metadata": {}, "outputs": [ { "name": "stdout", @@ -838,16 +734,16 @@ "text": [ "(2, 10)\n", "shape: (2, 5)\n", - 
"\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n", - "\u2502 id \u2506 laterality \u2506 measurement_type \u2506 unit \u2506 values \u2502\n", - "\u2502 --- \u2506 --- \u2506 --- \u2506 --- \u2506 --- \u2502\n", - "\u2502 str \u2506 str \u2506 str \u2506 str \u2506 str \u2502\n", - "\u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n", - "\u2502 wnm_exc_proj_contra \u2506 CONTRALATERAL \u2506 MICRONS_OF_AXON \u2506 MICRONS_LENGTH \u2506 file:///scratch/em_pat \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 chseq_wn\u2026 \u2502\n", - "\u2502 wnm_exc_proj_ipsi \u2506 IPSILATERAL \u2506 MICRONS_OF_AXON \u2506 MICRONS_LENGTH \u2506 file:///scratch/em_pat \u2502\n", - "\u2502 \u2506 \u2506 \u2506 \u2506 chseq_wn\u2026 \u2502\n", - 
"\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n", + "┌─────────────────────┬───────────────┬──────────────────┬────────────────┬────────────────────────┐\n", + "│ id ┆ laterality ┆ measurement_type ┆ unit ┆ values │\n", + "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", + "│ str ┆ str ┆ str ┆ str ┆ str │\n", + "╞═════════════════════╪═══════════════╪══════════════════╪════════════════╪════════════════════════╡\n", + "│ wnm_exc_proj_contra ┆ CONTRALATERAL ┆ MICRONS_OF_AXON ┆ MICRONS_LENGTH ┆ file:///scratch/em_pat │\n", + "│ ┆ ┆ ┆ ┆ chseq_wn… │\n", + "│ wnm_exc_proj_ipsi ┆ IPSILATERAL ┆ MICRONS_OF_AXON ┆ MICRONS_LENGTH ┆ file:///scratch/em_pat │\n", + "│ ┆ ┆ ┆ ┆ chseq_wn… │\n", + "└─────────────────────┴───────────────┴──────────────────┴────────────────┴────────────────────────┘\n", "Verified both matrix rows.\n" ] } @@ -883,9 +779,9 @@ "| Output path | Class | Rows |\n", "|---|---|---|\n", "| `dataitem/` | `DataItem` | +N new cells (4 expected; via `append_new_dataitems`) |\n", - "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | 345 (overwrite scoped to `project_id` + `dataset_id`) |\n", - "| `projectionmeasurementmatrix/wnm_exc_proj_ipsi/` | wide parquet | 345 cells \u00d7 152 ipsilateral region columns |\n", - "| `projectionmeasurementmatrix/wnm_exc_proj_contra/` | wide parquet | 345 cells \u00d7 68 contralateral region columns |\n", + "| `dataitem_dataset_association/` | `DataItemDataSetAssociation` | full 
scope = existing ∪ 345 cells (overwrite_scoped on `(project_id, dataset_id)`; union preserves rows from `_01`/`_02`) |\n", + "| `projectionmeasurementmatrix/wnm_exc_proj_ipsi/` | wide parquet | 345 cells × 152 ipsilateral region columns |\n", + "| `projectionmeasurementmatrix/wnm_exc_proj_contra/` | wide parquet | 345 cells × 68 contralateral region columns |\n", "| `projectionmeasurementmatrix/` | `ProjectionMeasurementMatrix` | 2 (one per laterality) |\n", "\n", "`measurement_type=MICRONS_OF_AXON` is recorded based on inference from value magnitudes; awaiting confirmation from the data owner. Region indices are stored as raw acronym strings until `brainregion/` is bootstrapped (see schema-mismatch note above).\n" @@ -908,7 +804,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.13" + "version": "3.12.4" } }, "nbformat": 4, diff --git a/planning/20260623/PR_message.md b/planning/20260623/PR_message.md index 6d3a2c8..5575892 100644 --- a/planning/20260623/PR_message.md +++ b/planning/20260623/PR_message.md @@ -51,6 +51,7 @@ The `WriteSpec` registered per writable class is one declaration that drives bot - Wide cell-feature / projection-matrix parquet writes (still use `write_deltalake` directly). - `CellCellConnectivityLong` — no registry entry yet; the `write_cellcellconnectivitylong` stub in `io/writers.py` documents the migration plan. - The `etl_v1dd_01` new dataset ingestion prototype ongoing in parallel. +- A `merge_by_id` (read-existing → union → overwrite) write mode for shared scopes like `(visp_patchseq, visp_inh_patchseq)` where multiple notebooks contribute disjoint subsets. The union is currently inlined in patch-seq / WNM notebooks; see `planning/multi_writer_scope_design.md` for the draft design discussion. 
## Verification diff --git a/planning/multi_writer_scope_design.md b/planning/multi_writer_scope_design.md new file mode 100644 index 0000000..eede026 --- /dev/null +++ b/planning/multi_writer_scope_design.md @@ -0,0 +1,201 @@ +# Multi-writer Delta scopes: bug, contrast with minnie, design options + +Captured 2026-06-23 from a debugging session that started with an +`AssertionError` in `etl_visp_inh_patchseq_03_cluster_membership_and_mapping.ipynb`. + +## The problem + +`write_models` dispatches `overwrite_scoped` writes that **replace every row** in +the scope defined by a spec's `scope_columns`. For example +(`src/connects_common_connectivity/io/write_spec.py`): + +| Class | `scope_columns` | +|---|---| +| `DataItemDataSetAssociation` | `(project_id, dataset_id)` | +| `ClusterMembership` | `(project_id, hierarchy_id)` | +| `CellToClusterMapping` | `(project_id, mapping_set)` | + +When **multiple notebooks contribute disjoint row subsets to the same scope**, +any one of them issuing `write_models([...own_rows...])` deletes the other +notebooks' rows. The latest writer wins, silently. + +This bit us concretely in the patch-seq pipeline. After running +`etl_visp_inh_patchseq_01/02/03`: + +``` +visp_patchseq / visp_inh_patchseq associations: 495 (expected ≥ 2759) +``` + +`_03` Section 1's assertion surfaced it: + +``` +AssertionError: 2367 T-type CSV cells are not associated with visp_inh_patchseq +``` + +Root cause: `_01` writes 2759 association rows from the ttype CSV → `_02` +overwrites with 520 rows from the wide CSV → `_03` overwrites with 495 rows from +the MET CSV. Every step is a valid `overwrite_scoped` call; together they shrink +the scope monotonically. The same bug existed (silently) for `ClusterMembership` +under `(visp_patchseq, visp_met_types_taxonomy)`, where `etl_visp_exc_patchseq_03`'s +1152 rows were being wiped by `etl_visp_inh_patchseq_03`'s 1485-row overwrite. 
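The shrinking described above can be reproduced with a minimal sketch. It assumes an in-memory list of row dicts as a stand-in for the Delta table; `overwrite_scoped` and `associations` here are hypothetical helpers that mimic the scoped-overwrite semantics, not the real `write_models` code, and the cell ids are made up (the real CSVs are not nested subsets of one another).

```python
# Hypothetical in-memory stand-in for a Delta table: a plain list of row dicts.
# This is NOT the real write_models implementation, only its scoped-overwrite semantics.

def overwrite_scoped(table, new_rows, scope_columns):
    """Drop every existing row in the scopes touched by new_rows, then append new_rows."""
    touched = {tuple(row[c] for c in scope_columns) for row in new_rows}
    kept = [row for row in table if tuple(row[c] for c in scope_columns) not in touched]
    return kept + list(new_rows)


def associations(n_cells):
    """Build n_cells association rows with made-up cell ids in one shared scope."""
    return [
        {"project_id": "visp_patchseq", "dataset_id": "visp_inh_patchseq", "dataitem_id": f"cell_{i}"}
        for i in range(n_cells)
    ]


SCOPE = ("project_id", "dataset_id")
table = []
sizes = []
for n_cells in (2759, 520, 495):  # row counts written by _01, _02, _03
    table = overwrite_scoped(table, associations(n_cells), SCOPE)
    sizes.append(len(table))

print(sizes)  # [2759, 520, 495]
```

Each call is individually valid, yet the scope ends up holding only what the last writer supplied.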
+ +### Numbers from patch-seq + +| Source CSV (input) | Notebook | Rows in CSV | Scope written | +|---|---|---:|---| +| `patchseq_tx_cell_ttype_labels.csv` | `inh_01` | 2759 | `(visp_patchseq, visp_inh_patchseq)` | +| `inh_ivscc_features_wide_unnormalized.csv` | `inh_02` | 520 | `(visp_patchseq, visp_inh_patchseq)` | +| `visp_met_cell_assignments_text_names.csv` | `inh_03` § 0 | 495 | `(visp_patchseq, visp_inh_patchseq)` | +| `visp_met_cell_assignments_text_names.csv` | `inh_03` § 2 | 495 cells × 3 ancestors = 1485 | `(visp_patchseq, visp_met_types_taxonomy)` | +| `inferred_met_types.csv` | `exc_03` § 1 | 384 cells × 3 ancestors = 1152 | `(visp_patchseq, visp_met_types_taxonomy)` | + +After the fix (read-existing → union → re-write the full scope): + +``` +visp_patchseq / visp_inh_patchseq associations: 2879 +visp_patchseq / visp_met_types_taxonomy clustermembership rows: 2637 (=1152 exc + 1485 inh) +visp_patchseq / visp_met_types_taxonomy clustermembership items: 879 (=384 exc + 495 inh) +``` + +### Origin: migration regression + +This is a migration regression, not an original design flaw. The pre-migration +notebooks did the merge manually with raw +`write_deltalake(..., mode="overwrite", predicate=...)`: + +```python +existing_cm = pl.read_delta(...).filter(predicate) +other_cm = existing_cm.filter(~pl.col("item").is_in(our_cell_ids)) +all_memberships = [ClusterMembership(**r) for r in other_cm.to_dicts()] + new +write_deltalake(..., mode="overwrite", predicate=..., partition_by=...) +``` + +When that pattern was migrated to `write_models([...])`, the read-and-union step +was dropped (replaced with a stub `other_cm = pl.DataFrame({"item": []})`) and +the assertion `others_present.shape[0] == other_cm.shape[0]` continued to +"pass" because both sides became 0 — the verification was no longer +load-bearing. 
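One way to keep such a check load-bearing is to derive the expected sibling rows from a snapshot of the table taken before the write, rather than from a value the notebook builds itself. The sketch below assumes plain row dicts; `count_preserved_siblings` is a hypothetical helper, not part of the library, and the ids are illustrative.

```python
def count_preserved_siblings(before_rows, after_rows, our_ids, id_column="item"):
    """Return how many sibling ids survived the write; raise if any were dropped.

    before_rows must be read from the table before writing, so the expectation
    cannot silently collapse to an empty set the way a stubbed frame does.
    """
    our_ids = set(our_ids)
    expected = {row[id_column] for row in before_rows} - our_ids
    present = {row[id_column] for row in after_rows}
    missing = expected - present
    if missing:
        raise AssertionError(f"{len(missing)} sibling ids were dropped by the write")
    return len(expected)


before = [{"item": "exc_1"}, {"item": "exc_2"}, {"item": "inh_1"}]
good_after = before + [{"item": "inh_2"}]            # union write: siblings kept
bad_after = [{"item": "inh_1"}, {"item": "inh_2"}]   # plain overwrite: siblings lost
ours = {"inh_1", "inh_2"}

print(count_preserved_siblings(before, good_after, ours))  # 2
try:
    count_preserved_siblings(before, bad_after, ours)
except AssertionError as err:
    print(err)  # 2 sibling ids were dropped by the write
```

Returning the count lets the notebook print it, so a check that compares zero rows against zero rows is visible rather than silently passing.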
+ +## How minnie avoids the problem entirely + +Minnie uses a **sub-dataset (cohort) pattern**: each notebook writes into its +own unique `(project_id, dataset_id)` scope, so `overwrite_scoped` calls never +collide. + +| Notebook | `DATASET_ID` | +|---|---| +| `etl_minnie_01_dataset_dataitem` | `minnie65_v1300_nuclei` (the universe) | +| `etl_minnie_02_cell_features` | `minnie65_v1300_csm_cluster` (CSM cohort) | +| `etl_minnie_03_cluster_and_cluster_membership` | reuses `minnie65_v1300_csm_cluster`, but writes `ClusterMembership` under `hierarchy_id="minnie65_csm_cell_types"` — a hierarchy no other minnie notebook writes to | +| `etl_minnie_04_cell_cell` | proofread cohorts (`minnie65_v1300_proofread*`) | + +For every `overwrite_scoped` write minnie issues, the **scope owner is exactly +one notebook**. No merge, no surprises. + +Patch-seq took the opposite philosophy: one `DataSet` +(`visp_inh_patchseq`) is treated as a single coherent cohort, and multiple +notebooks add rows of different kinds to the **same** `(project, dataset)` and +`(project, hierarchy)` scopes. That's what creates the multi-writer hazard. + +There's a meta-question buried here: should patch-seq follow minnie's cohort +pattern? E.g. `visp_inh_patchseq_ttype`, `visp_inh_patchseq_morph`, +`visp_inh_patchseq_met` as sibling sub-datasets. It would remove the merge +problem entirely but would also fragment what is currently a clean +"inh-cohort" abstraction. Not obvious which is better. + +## Considered solutions + +### Option A — Add a merging write mode to `write_models` + +Add a new `WriteSpec.write_mode` value, e.g. `"merge_by_id"` or +`"overwrite_scoped_by_id"`, that: + +1. Requires the spec to declare an **identity column** within the scope + (`dataitem_id` for `DataItemDataSetAssociation`, `item` for + `ClusterMembership`, …). +2. On write: reads existing rows in scope, replaces rows whose identity is in + the incoming batch, keeps the rest. 
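The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: `merge_by_id` is a hypothetical function over in-memory row dicts, the item ids and `cluster` values are made up, and only the row counts (384 exc cells and 495 inh cells, 3 ancestors each) come from the patch-seq numbers in this document.

```python
def merge_by_id(existing, incoming, scope_columns, id_column):
    """Replace existing rows whose (scope, identity) appears in incoming; keep the rest."""
    def key(row):
        return tuple(row[c] for c in scope_columns), row[id_column]

    incoming_keys = {key(row) for row in incoming}
    kept = [row for row in existing if key(row) not in incoming_keys]
    return kept + list(incoming)


def memberships(prefix, n_cells, n_ancestors=3):
    """Build n_cells * n_ancestors ClusterMembership-like rows with made-up ids."""
    return [
        {
            "project_id": "visp_patchseq",
            "hierarchy_id": "visp_met_types_taxonomy",
            "item": f"{prefix}_{i}",
            "cluster": f"level_{level}",
        }
        for i in range(n_cells)
        for level in range(n_ancestors)
    ]


SCOPE = ("project_id", "hierarchy_id")
exc_rows = memberships("exc", 384)   # 1152 rows
inh_rows = memberships("inh", 495)   # 1485 rows

merged = merge_by_id(exc_rows, inh_rows, SCOPE, id_column="item")
rerun = merge_by_id(merged, inh_rows, SCOPE, id_column="item")

print(len(merged), len(rerun))  # 2637 2637
```

Because all rows for an incoming `item` are replaced as a group, re-running the same notebook leaves the scope unchanged, while rows contributed by other notebooks are untouched.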
+ +**Pros** +- Eliminates the boilerplate currently duplicated in every patch-seq notebook. +- Makes the multi-writer contract explicit in the spec (it's *declared* that + this scope is multi-writer and merged on column X). +- Closes the regression class that bit us — a future migration cannot + accidentally strip the merge logic because the merge lives in the library. + +**Cons** +- Silently merging vs. overwriting is a semantically distinct contract; a + caller who actually wanted to *clear* sibling rows would have to opt out. +- Requires a read per write (negligible at current data sizes). +- The library implicitly trusts that the caller's batch is the authoritative + subset for the ids it contains. + +### Option B — Keep `write_models` overwrite-only, add a sibling helper + +```python +write_models_merging_on(items, id_column="item", output_root=...) +``` + +**Pros** +- No change to existing call sites or `write_models` semantics. +- Explicit at the call site: a reader sees "this notebook merges into a shared + scope" without having to look up the spec. +- Matches how minnie sidesteps the issue (use distinct scopes whenever + possible; reach for the merging helper only when you can't). + +**Cons** +- Still requires every shared-scope notebook to remember to use the merge + variant; the next migration can still regress this. + +### Option C — Status quo (don't change the library) + +Document the convention; every notebook touching a shared scope does its own +read-and-union before `write_models`. + +**Pros** +- Library stays minimal and explicit. + +**Cons** +- This is exactly the trap the recent migration walked into. There is no + structural mechanism preventing a recurrence. + +### Option D — Forbid shared scopes (push patch-seq toward minnie's pattern) + +Refactor patch-seq notebooks so each `(project, dataset)` and +`(project, hierarchy)` scope has a single owner — possibly by introducing +sub-datasets (`visp_inh_patchseq_ttype`, `_morph`, `_met`). 
+ +**Pros** +- Removes the multi-writer hazard at the data-model level rather than papering + over it in the library. +- Brings patch-seq into stylistic alignment with minnie / V1DD. + +**Cons** +- Larger change. Downstream queries that group rows by "the inh cohort" now + need to union sub-datasets. May lose a useful natural grouping. +- Doesn't solve the `ClusterMembership` case (different MET-types + contributors *do* share a hierarchy — that's the taxonomy's whole point). + So a merge mechanism is probably still needed somewhere. + +## Suggested next step (for discussion, not yet decided) + +Lean toward **A + a scope-ownership audit**: + +1. For every `overwrite_scoped` spec, decide whether the scope is + single-writer (minnie-style) or multi-writer (patch-seq-style). +2. Single-writer specs stay as-is. +3. Multi-writer specs declare a merge key (Option A). +4. Bonus: `write_models` could detect "a write that would shrink the scope it + targets" (i.e. incoming rows form a strict subset of the existing scope by + the merge key) and warn/error when the spec is not marked multi-writer. + That would have caught the regression at runtime. + +Open questions: + +- Should `WriteSpec` gain a `merge_on: list[str] | None` field? +- Is the implicit "incoming batch is the truth for these ids" contract + acceptable for every multi-writer class, or do we need a more general + "upsert by composite key" mode? +- Do we want to keep patch-seq as multi-writer at all, or migrate to + sub-datasets and reserve the merge mechanism only for `ClusterMembership` + (where taxonomy-sharing makes single-ownership impossible)? diff --git a/uv.lock b/uv.lock index 0f39a21..f3926a0 100644 --- a/uv.lock +++ b/uv.lock @@ -9,7 +9,7 @@ resolution-markers = [ ] [options] -exclude-newer = "2026-04-15T22:36:33.389267Z" +exclude-newer = "0001-01-01T00:00:00Z" # This has no effect and is included for backwards compatibility when using relative exclude-newer values. 
exclude-newer-span = "P7D" [[package]] @@ -505,7 +505,7 @@ wheels = [ [[package]] name = "connects-common-connectivity" -version = "0.1.0" +version = "0.2.0" source = { editable = "." } dependencies = [ { name = "caveclient" },