Skip to content

[WIP] Generalize entities#86

Open
gkiar wants to merge 12 commits into
serial-indexfrom
generalize-entities
Open

[WIP] Generalize entities#86
gkiar wants to merge 12 commits into
serial-indexfrom
generalize-entities

Conversation

@gkiar
Copy link
Copy Markdown
Collaborator

@gkiar gkiar commented May 21, 2026

Downstream/On top of #85

PR Contribution Summary

This PR generalizes the indexing layer from subject-only discovery to entity-based crawling, and replaces hardcoded patterns with schema-derived logic wherever possible.

Architecture: Subject → Entity

  • _index_bids_subject_dir_index_bids_entity_dir — indexes any entity directory (sub-*, tpl-*, etc.)
  • _find_bids_subject_dirs_find_bids_entity_dirs — discovers any entity type at a dataset root
  • _is_bids_subject_dir_is_bids_entity_dir — checks arbitrary entity type by name
  • format_bids_path now uses schema-derived directory hierarchy (TPL → Cohort → Sub → Ses → Datatype)
  • All schema-related discovery functions centralized in _entities.py

Template and Cohort Support

  • tpl-* and cohort-* directories are indexed alongside sub-*
  • _is_bids_dataset derivative checks look for subject OR template entity dirs
  • Verified with real TemplateFlow datasets (1590 files, 30 templates)

Schema-Driven Discovery

  • get_entity_child_dirs(dataset_type, parent_rule) — reads valid entity subdirectories from rules.directories
  • get_file_entity_prefixes() — root-level entity name prefixes derived from schema
  • get_all_root_entity_types() — deduplicated root entity types across all dataset types
  • get_all_dataset_types() — enumerates schema-defined dataset types
  • _BIDS_JSON_SIDECAR_EXCEPTION_SUFFIXES — derived from rules.files (currently coordsystem, description)
  • _BIDS_DATATYPE_PATTERN — built from entity names at schema init
  • _ensure_dict() helper — centralizes bidsschematools Namespace→dict conversion

Derivative Detection

  • _is_bids_dataset() and _get_dataset_type() detect derivative datasets without dataset_description.json by checking inside derivatives/ for valid entity subdirectories
  • Correctly rejects combined sub-*_ses-* directories (spec-invalid)
  • Fallback iterates all dataset types when the detected type yields no entity dirs

Generic Filtering and .bidsignore

  • include_subjects → generic filters dict mapping any entity name to glob patterns
  • --filter / -f CLI argument replaces --subjects (deprecated, backward-compatible)
  • .bidsignore support via _is_bidsignored with cached upward search
  • Filters forwarded through batch_index_dataset to workers

Dataset Metadata Columns

  • dataset_name, dataset_type, bids_version added to Arrow schema, populated from dataset_description.json
  • clear_schema_caches() exposed as public API for schema reload safety

Code Cleanup

  • Removed dead code: get_all_entity_prefixes, get_required_entity_types
  • Deduplicated _get_subdir_names() for oneOf expansion
  • _read_dataset_description with @lru_cache to deduplicate reads
  • Simplified _resolve_entity_dirs — extracts entity discovery into _discover_entity_dirs
  • Updated stale comments and removed redundant wrapper functions

Testing

  • test_derivative_detection — 5 scenarios including no-description derivatives and invalid combined entity dirs
  • test_index_dataset_filters — single, multi-value, glob, and cross-entity AND filters
  • test_batch_index_dataset_filters — filter forwarding through parallel workers
  • test_index_dataset_bidsignore.bidsignore exclusion
  • Template integration tests gated by @templateflow_available
  • Renamed test_is_bids_subject_dirtest_is_bids_entity_dir
  • test_find_bids_datasets is now skipped (@pytest.mark.skip); the rglob("dataset_description.json") baseline no longer matches the schema-correct derivative detection

Impact

We should now be less fragile in schema updates, and can correctly index derivative datasets using entity types other than subject and session (namely template and cohort), meaning this can be used across a wide range of the field's projects.

@github-actions
Copy link
Copy Markdown

Coverage

Coverage Report
FileStmtsMissCoverMissing
__init__.py70100% 
__main__.py64592%101, 127, 155, 159, 163
_entities.py112199%129
_indexing.py228796%154, 163–164, 179, 357, 407, 445
_logging.py31487%30, 37, 39–40
_metadata.py48491%39–40, 66, 71
_pathlib.py17382%12–13, 15
_version.py110100% 
pybids
   __init__.py40100% 
   _bidsfile.py381365%71–73, 77–79, 83–85, 89–91, 95
   _layout.py1564571%63, 72, 81, 104, 114–115, 118, 140–141, 156–157, 173–174, 177–181, 186, 188–189, 192–193, 228, 233, 241, 322–324, 389–394, 396, 399–404, 406, 462, 482
   _utils.py13561%47–50, 52
TOTAL7298788% 

Tests Skipped Failures Errors Time
100 1 💤 0 ❌ 0 🔥 23.612s ⏱️

@gkiar gkiar changed the base branch from main to serial-index May 21, 2026 13:07
@gkiar gkiar marked this pull request as ready for review May 21, 2026 20:30
Comment thread bids2table/_entities.py
name = get_entity_name(entity_type)
if not name:
return ""
return f"{name}-[a-zA-Z0-9]+"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to be very correct, you can look up the format:

fmt = schema.objects.entities[entity_type].format
pattern = schema.objects.formats[fmt].pattern
return f'{name}-{pattern}'

But the main thing that's actually missing here is + in labels:

Suggested change
return f"{name}-[a-zA-Z0-9]+"
return f"{name}-[a-zA-Z0-9+]+"

Comment thread bids2table/_indexing.py


@lru_cache(maxsize=None)
def _find_bidsignore(start: PathT) -> PathT | None:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK if you actually want this. The BIDS validator only supports a root-level .bidsignore, though that is defined at the dataset root level, not a git-like per-directory ignore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants