Added sanitization feature #244

CodyCBakerPhD · 2025-12-17T18:44:49Z

What Changed

Release Notes

The sanitization feature is now available. 'Sanitization' refers to actively altering various values extracted from the NWB files in an effort to adhere to BIDS validation requirements.

An example would be an NWB file containing the field nwbfile.session_id = "my session #2". While this is perfectly allowed in NWB, filenames such as _ses-my session #2_ are not valid in BIDS, and files carrying such labels could not be uploaded to common data archives. Sanitization instead generates BIDS filenames such as _ses-my+session+2_.
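As a rough illustration of the label rewrite described above, here is a minimal sketch. The helper name `sanitize_label` and the regular expression are assumptions made for this example and are not taken from the nwb2bids source; the real implementation may handle edge cases differently.

```python
import re


def sanitize_label(label: str) -> str:
    """Hypothetical helper: collapse each run of non-alphanumeric characters into a single '+'."""
    # Replace every run of characters outside [A-Za-z0-9] with one plus sign,
    # then drop any plus signs left dangling at either end of the label.
    return re.sub(r"[^A-Za-z0-9]+", "+", label).strip("+")


print(sanitize_label("my session #2"))  # my+session+2
```

Collapsing whole runs (rather than single characters) is what keeps " #" from turning into "++" in this sketch.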

Currently, only the codes 'sub-labels' and 'ses-labels' are available. These sanitize subject and session ID labels, respectively, in a way guaranteed to be valid BIDS. We plan to add more capabilities in the future, including some limited in-place modification of NWB contents to allow consistency with the BIDS validation requirements.

@CodyCBakerPhD

* feat: add basic approach to sanitization
* feat: improve test cases
* feat: expand to session as well
* feat: fix tests
* fix: test
* fix: test
* feat: rollback to non-LRU caches
* fix: Replace click with rich_click for CLI options
* fix: Change click.Choice to rich_click.Choice
* feat: adding file dumping
* feat: saving state of passing report context
* feat: generalize log structure
* feat: some refactors for simpler config handling
* fix: all tests on new structure
* Apply suggestion from @CodyCBakerPhD
* chore: fix pre-commit from GitHub suggestion
* chore: fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add sanitization level to command line interface
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Apply suggestions from code review
  Co-authored-by: Isaac To <candleindark@users.noreply.github.com>
* other PR suggestions
* other PR suggestions
* restore sanitization code
* Enhance tests for sanitization and reduce code duplications in those tests (#210)
* test(fix): test `sanitize_session_id()` in `test_sanitize_session_id()`
  `sanitize_participant_id()` instead of `sanitize_session_id()` was tested in `test_sanitize_session_id()`, which I don't think was intended.
* refactor(test): reduce code in `test_sanitization.py`
  By using the parametrization feature from pytest
* test: parametrize `sanitization_level`
  In `test_sanitize_participant_id()` and `test_sanitize_session_id()`
* style(test): simplify calls to sanitization func with positional args
  There is no benefits from using keyword arguments since the input variable names are the same as the keyword arguments' name
* docs: complete docstring for `sanitize_session_id()`
* docs: complete docstring for `sanitize_participant_id()`
* test: add test cases for sanitization level set to `SanitizationLevel.NONE`
---------
Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>
* Apply suggestions from code review
  Co-authored-by: Isaac To <candleindark@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix merge
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix initial creation of sanitization path
* Alternative sanitization approach (#168)
* feat: finish alternative approach to sanitization model and fix all tests
* chore: remove comments
* fix: resolve pandas complaint
* fix: just suppress pandas wrong warning
* properly newlin
* add missing participant id support
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fixup tests
* fix merge
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* re-order some conversion steps
* re-order some conversion steps and parent creation
* re-order some conversion steps and parent creation
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com>
* fix test path
* major refactors
* lower case and hyphens in CLI; some pandas typing fixes
* adjust all remaining references to levels
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Isaac To <candleindark@users.noreply.github.com>
Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com>

codecov · 2025-12-17T18:48:21Z

Codecov Report

❌ Patch coverage is 90.52632% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.75%. Comparing base (e727b61) to head (71b410c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/nwb2bids/_command_line_interface/_main.py	0.00%	7 Missing ⚠️
src/nwb2bids/_converters/_session_converter.py	92.85%	1 Missing ⚠️
src/nwb2bids/bids_models/_bids_session_metadata.py	88.88%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #244      +/-   ##
==========================================
+ Coverage   84.65%   84.75%   +0.10%     
==========================================
  Files          32       35       +3     
  Lines        1212     1279      +67     
==========================================
+ Hits         1026     1084      +58     
- Misses        186      195       +9
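For readers who want to check the figures, the percentages in this report can be reproduced from the hit and line counts in the table above. The patch total of 95 lines is not stated anywhere in the report; it is inferred here from 9 missing lines at 90.52632% and should be read as an assumption.

```python
# Counts copied from the coverage diff table (base = main, head = this PR).
base_hits, base_lines = 1026, 1212
head_hits, head_lines = 1084, 1279

base_coverage = round(100 * base_hits / base_lines, 2)
head_coverage = round(100 * head_hits / head_lines, 2)
coverage_delta = round(100 * head_hits / head_lines - 100 * base_hits / base_lines, 2)

# Assumption: the patch touches 95 coverable lines, of which 9 are missing.
patch_lines, patch_missing = 95, 9
patch_coverage = round(100 * (patch_lines - patch_missing) / patch_lines, 5)

print(base_coverage, head_coverage, coverage_delta, patch_coverage)
```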

Flag	Coverage Δ
unittests	`84.75% <90.52%> (+0.10%)`	⬆️

Files with missing lines	Coverage Δ
src/nwb2bids/__init__.py	`100.00% <ø> (ø)`
src/nwb2bids/_converters/_dataset_converter.py	`87.33% <100.00%> (-0.34%)`	⬇️
src/nwb2bids/_converters/_run_config.py	`94.59% <100.00%> (+1.26%)`	⬆️
src/nwb2bids/bids_models/_participant.py	`95.34% <100.00%> (ø)`
src/nwb2bids/sanitization/__init__.py	`100.00% <100.00%> (ø)`
src/nwb2bids/sanitization/_configuration.py	`100.00% <100.00%> (ø)`
src/nwb2bids/sanitization/_sanitization.py	`100.00% <100.00%> (ø)`
src/nwb2bids/_converters/_session_converter.py	`89.74% <92.85%> (-0.61%)`	⬇️
src/nwb2bids/bids_models/_bids_session_metadata.py	`92.39% <88.88%> (-0.38%)`	⬇️
src/nwb2bids/_command_line_interface/_main.py	`0.00% <0.00%> (ø)`

asmacdo

I wasn't able to find a way to test with real data use cases, but the code looks good.

My issue was that I struggled to find a dataset that meets the criteria suggested: "have a bunch of errors on the unsanitized branch, but don't (or have fewer) after basic sanitization". Maybe I misunderstood, but https://github.com/bids-dandisets/dashboard/blob/main/README.md shows an equal number of errors for sanitized and unsanitized (ignoring BEP 32).

The exception is 000016, which actually shows 1 extra error. So there may be some regression/edge case/race condition bug with parent dir?

[Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt'

diff not_sanitized_000016.json sanitized_000016.json

2a3,14
>     "title": "Failed to extract metadata for one or more sessions",
>     "reason": "An error occurred while executing `DatasetConverter.extract_metadata`.\n\nTraceback (most recent call last):\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 191, in extract_metadata\n    collections.deque(\n    ~~~~~~~~~~~~~~~~~^\n        (\n        ^\n    ...<4 lines>...\n        maxlen=0,\n        ^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 193, in <genexpr>\n    session_converter.extract_metadata()\n    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_session_converter.py\", line 102, in extract_metadata\n    self.session_metadata = BidsSessionMetadata.from_nwbfile_paths(\n                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^\n        nwbfile_paths=self.nwbfile_paths, run_config=self.run_config\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 39, in wrapper_function\n    return wrapper(*args, **kwargs)\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 136, in __call__\n    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 149, in from_nwbfile_paths\n    session_metadata = cls(**dictionary)\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 44, in model_post_init\n    self.sanitization = 
Sanitization(\n                        ~~~~~~~~~~~~^\n        sanitization_config=self.run_config.sanitization_config,\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    ...<2 lines>...\n        original_participant_id=self.participant.participant_id,\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/sanitization/_sanitization.py\", line 35, in model_post_init\n    with self.sanitization_file_path.open(mode=\"w\") as file_stream:\n         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/pathlib/_local.py\", line 537, in open\n    return io.open(self, mode, buffering, encoding, errors, newline)\n           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nFileNotFoundError: [Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt'\n",
>     "solution": "Please raise an issue on `nwb2bids`: https://github.com/con/nwb2bids/issues.",
>     "examples": null,
>     "field": null,
>     "source_file_paths": null,
>     "target_file_paths": null,
>     "data_standards": null,
>     "category": "INTERNAL_ERROR",
>     "severity": "ERROR"
>   },
>   {

src/nwb2bids/_command_line_interface/_main.py

CodyCBakerPhD · 2025-12-19T22:49:34Z

shows equal number of errors for sanitized and unsanitized (ignoring BEP 32).

The feature is entirely about BIDS validation (not even specific to BEP32 tbh) - no changes are expected with nwb2bids notifications, though I might change that in a follow-up

As I mentioned, the one you had already looked at for IP freely (000003) is a good example

CodyCBakerPhD · 2025-12-19T22:51:39Z

The effectiveness is further summarized by the common issues table

CodyCBakerPhD · 2025-12-19T23:07:16Z

The exception is 000016, which actually shows 1 extra error. So there may be some regression/edge case/race condition bug with parent dir?

How interesting. Looking at it I can maybe see how that might happen (and clearly did at least once)

Odd it doesn't happen all the time though, but ah well...

I added a line to ensure the parent directories exist before the first write, so this shouldn't happen again

Good catch
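The kind of guard described in this comment can be sketched as follows. The function name `write_sanitization_file` and the file contents are hypothetical; only the `open(mode="w")` call and the `.nwb2bids` directory name come from the traceback above, and the actual fix in the PR may be placed differently.

```python
import pathlib
import tempfile


def write_sanitization_file(file_path: pathlib.Path, text: str) -> None:
    """Hypothetical helper: make sure the parent directory exists, then write the file."""
    # Without this call, open(mode="w") raises FileNotFoundError when `.nwb2bids/` is missing.
    # exist_ok=True makes the call safe to repeat, including when another session created it first.
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with file_path.open(mode="w") as file_stream:
        file_stream.write(text)


with tempfile.TemporaryDirectory() as temporary_directory:
    target = pathlib.Path(temporary_directory) / ".nwb2bids" / "example_sanitization.txt"
    write_sanitization_file(target, "sub-labels\n")
    written_text = target.read_text()
```

Calling `mkdir` unconditionally before every write is cheap and avoids depending on which converter step happens to create the directory first.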

CodyCBakerPhD · 2026-01-05T16:18:37Z

@asmacdo Welcome back from break! Let's make this top priority since a fair amount of follow-up work would be a pain to have to rebase after this fairly extensive PR

I will do what I can for now on completely independent PRs

asmacdo · 2026-01-05T18:19:35Z

No sanitization flags:

├── sub-mouse_AAYYT
│   ├── ses-20180420_sample_2
│   │   └── ecephys
│   │       └── sub-mouse_AAYYT_ses-20180420_sample_2_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2_icephys.nwb
│   ├── ses-20180420_sample_3
│   │   └── ecephys
│   │       └── sub-mouse_AAYYT_ses-20180420_sample_3_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3_icephys.nwb

Sanitization flags: nwb2bids convert 000008 --bids-directory 08-sanit --sanitization sub-labels --sanitization ses-labels

├── sub-mouse+AAYYT
│   ├── ses-20180420+sample+2
│   │   └── ecephys
│   │       └── sub-mouse+AAYYT_ses-20180420+sample+2_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2_icephys.nwb
│   ├── ses-20180420+sample+3
│   │   └── ecephys
│   │       └── sub-mouse+AAYYT_ses-20180420+sample+3_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3_icephys.nwb

The participants.tsv is modified as well (i.e., sub-mouse+AAYYT), so conversion + sanitization looks like it is working correctly to me!

However, this does create a confusing situation: the validations are called prior to any sanitization, so nwb2bids reports that it failed even when it succeeds! (Note: the sanitization flags have no effect on the output messages; both are identical.)

nwb2bids convert 000008 --bids-directory 08-sanit2 --sanitization sub-labels --sanitization ses-labels

BIDS dataset was not successfully created!
Some errors were encountered during conversion.
The first 3 of 98 are shown below:


	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.
	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.
	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.

Please review the full notifications report at 08-sanit2/.nwb2bids/datetime-20260105121259_notifications.json

CodyCBakerPhD · 2026-01-05T18:24:03Z

However, this does create a confusing situation: the validations are called prior to any sanitization, so nwb2bids reports that it failed even when it succeeds! (Note: the sanitization flags have no effect on the output messages; both are identical.)

Hmm.. it should have diminished the level of the notifications from errors to warnings (resulting in slightly different printout text + coloration)

CodyCBakerPhD · 2026-01-05T18:25:05Z

Could you upload or copy/paste (under details plz) the full .json dump from each run (with/without sanitization flags)?

CodyCBakerPhD · 2026-01-05T18:26:32Z

Though I will note in both cases the intention of the notifications is to reflect any 'badness' to inform the user to fix them within the source files (future aggressive sanitization modes may make such modifications in-place, in which case no warning would be re-issued on second run)

asmacdo · 2026-01-05T18:33:10Z

sanitization off datetime-20260105122610_notifications.json
sanitization on datetime-20260105122738_notifications.json

asmacdo · 2026-01-05T18:49:04Z

For completeness: I only ran this against a subset of subjects; the others were deleted.

▾ 000008/
  ▸ sub-mouse-AAYYT/
  ▸ sub-mouse-AEJGZ/
  ▸ sub-mouse-ALHXK/
  ▸ sub-mouse-ANUPT/
  ▸ sub-mouse-APLSV/
  ▸ sub-mouse-AVLEY/
  ▸ sub-mouse-AWOAY/
  ▸ sub-mouse-AXEMD/
  ▸ sub-mouse-AYGIX/
  ▸ sub-mouse-AZZAO/
    dandiset.yaml

CodyCBakerPhD · 2026-01-05T18:51:41Z

@asmacdo OK thank you for clarifying - I think it was in my mind to do something like that with the notifications but would be a bit much to add it to this PR so I say we resolve in a follow-up

As-is the notifications still serve as feedback to the user that their subject/session IDs (in the NWB files) are invalid and they should change them to avoid any need for sanitization to begin with

CodyCBakerPhD · 2026-01-05T18:52:54Z

It's also probably confusing that these are called errors, but it is also on the to-do list to rethink the notification schema design in this respect; only 'internal errors' are blocking to the process (whereas this is an external error), so the non-blocking ones should most likely be called something else
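One way to picture the distinction drawn here is a success check that only treats internal errors as blocking. This is a sketch of the idea under discussion, not current nwb2bids behavior: the function name `conversion_succeeded` and the placeholder category string are assumptions, while the `category` and `severity` keys and the `INTERNAL_ERROR` value appear in the notifications JSON quoted earlier in this thread.

```python
# Hypothetical rule: only notifications in the INTERNAL_ERROR category block the conversion.
BLOCKING_CATEGORIES = {"INTERNAL_ERROR"}


def conversion_succeeded(notifications: list[dict]) -> bool:
    """Return True when no notification belongs to a blocking category."""
    return not any(
        notification.get("category") in BLOCKING_CATEGORIES for notification in notifications
    )


# Placeholder category name; the real one for invalid labels is not shown in this thread.
invalid_label = {"category": "PLACEHOLDER_EXTERNAL_CATEGORY", "severity": "ERROR"}
internal_failure = {"category": "INTERNAL_ERROR", "severity": "ERROR"}

print(conversion_succeeded([invalid_label]))  # True
print(conversion_succeeded([invalid_label, internal_failure]))  # False
```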

asmacdo · 2026-01-05T18:53:06Z

Hmm.. it should have diminished the level of the notifications from errors to warnings (resulting in slightly different printout text + coloration)

Looks the same to me

CodyCBakerPhD · 2026-01-05T18:56:44Z

Looks the same to me

Yep, this is expected now - see #244 (comment)

I was dreaming about next steps and forgot what had been done vs. planned

asmacdo

Oh ok gotcha, IMO would be fine to rethink how notifications work in a separate PR.

FWIW the part that bothers me the most is "BIDS dataset was not successfully created!". All the warnings/errors are annoying maybe, but this is actually a lie!

CodyCBakerPhD · 2026-01-05T19:07:04Z

FWIW the part that bothers me the most is "BIDS dataset was not successfully created!". All the warnings/errors are annoying maybe, but this is actually a lie!

Yeah, true. I can prioritize that next then to at least get some minor patch through

CodyCBakerPhD requested review from asmacdo and candleindark December 17, 2025 18:44

CodyCBakerPhD self-assigned this Dec 17, 2025

CodyCBakerPhD added this to nwb2bids Roadmap Dec 17, 2025

CodyCBakerPhD added the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Dec 17, 2025

Merge branch 'main' into redux_sanitization

f5159a8

CodyCBakerPhD added 2 commits December 18, 2025 13:12

Merge branch 'main' into redux_sanitization

67053f1

Merge branch 'main' into redux_sanitization

d2ce27f

asmacdo reviewed Dec 19, 2025

src/nwb2bids/_command_line_interface/_main.py

CodyCBakerPhD and others added 4 commits December 19, 2025 17:51

Merge branch 'main' into redux_sanitization

ee46ae2

Update src/nwb2bids/_command_line_interface/_main.py

afebc5e

Co-authored-by: Austin Macdonald <austin@dartmouth.edu>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a2d0fa8

for more information, see https://pre-commit.ci

just always ensure parent exists before writing sanitization file

53171cd

Merge branch 'main' into redux_sanitization

3e28304

Merge branch 'main' into redux_sanitization

71b410c

asmacdo approved these changes Jan 5, 2026

github-project-automation bot moved this to In Progress in nwb2bids Roadmap Jan 5, 2026

CodyCBakerPhD merged commit a475be8 into main Jan 5, 2026
28 checks passed

CodyCBakerPhD deleted the redux_sanitization branch January 5, 2026 19:07

github-project-automation bot moved this from In Progress to Done in nwb2bids Roadmap Jan 5, 2026

CodyCBakerPhD mentioned this pull request Jan 6, 2026

Add basic sanitization #75

Closed
