@CodyCBakerPhD
Collaborator

@CodyCBakerPhD CodyCBakerPhD commented Dec 17, 2025

What Changed

Release Notes

The sanitization feature is now available. 'Sanitization' refers to actively altering various values extracted from the NWB files in an effort to adhere to BIDS validation requirements.

An example would be an NWB file containing the field nwbfile.session_id = "my session #2". While this is perfectly allowed in NWB, writing filenames containing _ses-my session #2_ would not be valid in BIDS, and the resulting files could not be uploaded to common data archives. With sanitization enabled, the BIDS filenames would instead contain _ses-my+session+2_.

Currently, only the codes 'sub-labels' and 'ses-labels' are available. These sanitize subject and session ID labels, respectively, in a way guaranteed to be valid BIDS. We plan to add more capabilities in the future, including some limited in-place modification of NWB contents to allow consistency with the BIDS validation requirements.
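
For readers who want a concrete picture of the label rewriting described above, here is a minimal sketch. It assumes a simple rule (each run of non-alphanumeric characters becomes a single plus sign) that reproduces the examples in this PR; `sanitize_label` is a hypothetical name and this is not necessarily how nwb2bids implements it.

```python
import re


def sanitize_label(label: str) -> str:
    """Hypothetical helper: collapse each run of non-alphanumeric characters into one '+'."""
    collapsed = re.sub(r"[^A-Za-z0-9]+", "+", label)
    # Drop any separator left dangling at either end of the label.
    return collapsed.strip("+")


print(sanitize_label("my session #2"))
print(sanitize_label("mouse_AAYYT"))
```

Under this assumed rule, the space and the " #" in the session ID each collapse to one plus sign, so the label stays readable while containing only alphanumerics and the separator.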

* feat: add basic approach to sanitization

* feat: improve test cases

* feat: expand to session as well

* feat: fix tests

* fix: test

* fix: test

* feat: rollback to non-LRU caches

* fix: Replace click with rich_click for CLI options

* fix: Change click.Choice to rich_click.Choice

* feat: adding file dumping

* feat: saving state of passing report context

* feat: generalize log structure

* feat: some refactors for simpler config handling

* fix: all tests on new structure

* Apply suggestion from @CodyCBakerPhD

* chore: fix pre-commit from GitHub suggestion

* chore: fix typo

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add sanitization level to command line interface

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Apply suggestions from code review

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

* other PR suggestions

* other PR suggestions

* restore sanitization code

* Enhance tests for sanitization and reduce code duplications in those tests (#210)

* test(fix): test `sanitize_session_id()` in `test_sanitize_session_id()`

`sanitize_participant_id()` instead of `sanitize_session_id()`
 was tested in `test_sanitize_session_id()`, which I don't
 think was intended.

* refactor(test): reduce code in `test_sanitization.py`

By using the parametrization feature from pytest

* test: parametrize `sanitization_level`

In `test_sanitize_participant_id()` and
`test_sanitize_session_id()`

* style(test): simplify calls to sanitization func with positional args

There is no benefits from using
keyword arguments since the
input variable names are the same
as the keyword arguments' name

* docs: complete docstring for `sanitize_session_id()`

* docs: complete docstring for `sanitize_participant_id()`

* test: add test cases for sanitization level set to `SanitizationLevel.NONE`

---------

Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Isaac To <candleindark@users.noreply.github.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix merge

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix initial creation of sanitization path

* Alternative sanitization approach (#168)

* feat: finish alternative approach to sanitization model and fix all tests

* chore: remove comments

* fix: resolve pandas complaint

* fix: just suppress pandas wrong warning

* properly newlin

* add missing participant id support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixup tests

* fix merge

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* re-order some conversion steps

* re-order some conversion steps and parent creation

* re-order some conversion steps and parent creation

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com>

* fix test path

* major refactors

* lower case and hyphens in CLI; some pandas typing fixes

* adjust all remaining references to levels

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Isaac To <candleindark@users.noreply.github.com>
Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com>
@CodyCBakerPhD CodyCBakerPhD self-assigned this Dec 17, 2025
@CodyCBakerPhD CodyCBakerPhD added the enhancement (New feature or request) and minor (Increment the minor version when merged) labels Dec 17, 2025
@codecov

codecov bot commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 90.52632% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.75%. Comparing base (e727b61) to head (71b410c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/nwb2bids/_command_line_interface/_main.py 0.00% 7 Missing ⚠️
src/nwb2bids/_converters/_session_converter.py 92.85% 1 Missing ⚠️
src/nwb2bids/bids_models/_bids_session_metadata.py 88.88% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #244      +/-   ##
==========================================
+ Coverage   84.65%   84.75%   +0.10%     
==========================================
  Files          32       35       +3     
  Lines        1212     1279      +67     
==========================================
+ Hits         1026     1084      +58     
- Misses        186      195       +9     
Flag Coverage Δ
unittests 84.75% <90.52%> (+0.10%) ⬆️


Files with missing lines Coverage Δ
src/nwb2bids/__init__.py 100.00% <ø> (ø)
src/nwb2bids/_converters/_dataset_converter.py 87.33% <100.00%> (-0.34%) ⬇️
src/nwb2bids/_converters/_run_config.py 94.59% <100.00%> (+1.26%) ⬆️
src/nwb2bids/bids_models/_participant.py 95.34% <100.00%> (ø)
src/nwb2bids/sanitization/__init__.py 100.00% <100.00%> (ø)
src/nwb2bids/sanitization/_configuration.py 100.00% <100.00%> (ø)
src/nwb2bids/sanitization/_sanitization.py 100.00% <100.00%> (ø)
src/nwb2bids/_converters/_session_converter.py 89.74% <92.85%> (-0.61%) ⬇️
src/nwb2bids/bids_models/_bids_session_metadata.py 92.39% <88.88%> (-0.38%) ⬇️
src/nwb2bids/_command_line_interface/_main.py 0.00% <0.00%> (ø)

Member

@asmacdo asmacdo left a comment

I wasn't able to find a way to test with real data use cases, but the code looks good.

My issue was that I struggled to find a dataset that meets the criteria suggested: "have a bunch of errors on the unsanitized branch, but don't (or have fewer) after basic sanitization". Maybe I misunderstood, but https://github.com/bids-dandisets/dashboard/blob/main/README.md shows an equal number of errors for sanitized and unsanitized (ignoring BEP 32).

The exception is 000016, which actually shows 1 extra error. So there may be some regression/edge case/race condition bug with parent dir?

[Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt'

diff not_sanitized_000016.json sanitized_000016.json
2a3,14
>     "title": "Failed to extract metadata for one or more sessions",
>     "reason": "An error occurred while executing `DatasetConverter.extract_metadata`.\n\nTraceback (most recent call last):\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 191, in extract_metadata\n    collections.deque(\n    ~~~~~~~~~~~~~~~~~^\n        (\n        ^\n    ...<4 lines>...\n        maxlen=0,\n        ^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 193, in <genexpr>\n    session_converter.extract_metadata()\n    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_session_converter.py\", line 102, in extract_metadata\n    self.session_metadata = BidsSessionMetadata.from_nwbfile_paths(\n                            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^\n        nwbfile_paths=self.nwbfile_paths, run_config=self.run_config\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 39, in wrapper_function\n    return wrapper(*args, **kwargs)\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 136, in __call__\n    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 149, in from_nwbfile_paths\n    session_metadata = cls(**dictionary)\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 44, in model_post_init\n    self.sanitization = Sanitization(\n                        ~~~~~~~~~~~~^\n        sanitization_config=self.run_config.sanitization_config,\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    ...<2 lines>...\n        original_participant_id=self.participant.participant_id,\n        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n    )\n    ^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n  File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/sanitization/_sanitization.py\", line 35, in model_post_init\n    with self.sanitization_file_path.open(mode=\"w\") as file_stream:\n         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^\n  File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/pathlib/_local.py\", line 537, in open\n    return io.open(self, mode, buffering, encoding, errors, newline)\n           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nFileNotFoundError: [Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt'\n",
>     "solution": "Please raise an issue on `nwb2bids`: https://github.com/con/nwb2bids/issues.",
>     "examples": null,
>     "field": null,
>     "source_file_paths": null,
>     "target_file_paths": null,
>     "data_standards": null,
>     "category": "INTERNAL_ERROR",
>     "severity": "ERROR"
>   },
>   {

@CodyCBakerPhD
Collaborator Author

CodyCBakerPhD commented Dec 19, 2025

> shows an equal number of errors for sanitized and unsanitized (ignoring BEP 32).

The feature is entirely about BIDS validation (not even specific to BEP32 tbh) - no changes are expected with nwb2bids notifications, though I might change that in a follow-up

As I mentioned, the one you had already looked at for IP freely (000003) is a good example

image

@CodyCBakerPhD
Collaborator Author

The effectiveness is further summarized by the common issues table

image

@CodyCBakerPhD
Collaborator Author

> The exception is 000016, which actually shows 1 extra error. So there may be some regression/edge case/race condition bug with parent dir?

How interesting. Looking at it I can maybe see how that might happen (and clearly did at least once)

Odd it doesn't happen all the time though, but ah well...

I added a line to ensure the parent directories exist before the first write, so it shouldn't happen again

Good catch
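
For context, the kind of guard described above can be sketched as follows. This is only an illustration of the general pattern with `pathlib`; the function name and the example file name are hypothetical and are not taken from the nwb2bids source.

```python
import pathlib
import tempfile


def write_sanitization_log(file_path: pathlib.Path, text: str) -> None:
    """Hypothetical helper: create missing parent directories, then write the file."""
    # parents=True creates every missing ancestor (e.g. `.nwb2bids/`);
    # exist_ok=True makes the call a no-op if the directory is already there.
    file_path.parent.mkdir(parents=True, exist_ok=True)
    with file_path.open(mode="w") as file_stream:
        file_stream.write(text)


with tempfile.TemporaryDirectory() as temporary_directory:
    # Neither `work/` nor `.nwb2bids/` exists yet inside the temporary directory.
    log_path = pathlib.Path(temporary_directory) / "work" / ".nwb2bids" / "example_sanitization.txt"
    write_sanitization_log(log_path, "sub-labels\n")
    written_text = log_path.read_text()
```

Without the `mkdir` call, opening the file for writing raises `FileNotFoundError` whenever the parent directory has not been created yet, which matches the traceback reported for 000016.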

@CodyCBakerPhD
Collaborator Author

@asmacdo Welcome back from break! Let's make this top priority since a fair amount of follow-up work would be a pain to have to rebase after this fairly extensive PR

I will do what I can for now on completely independent PRs

@asmacdo
Member

asmacdo commented Jan 5, 2026

No sanitization flags:

├── sub-mouse_AAYYT
│   ├── ses-20180420_sample_2
│   │   └── ecephys
│   │       └── sub-mouse_AAYYT_ses-20180420_sample_2_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2_icephys.nwb
│   ├── ses-20180420_sample_3
│   │   └── ecephys
│   │       └── sub-mouse_AAYYT_ses-20180420_sample_3_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3_icephys.nwb

Sanitization flags (`nwb2bids convert 000008 --bids-directory 08-sanit --sanitization sub-labels --sanitization ses-labels`):

├── sub-mouse+AAYYT
│   ├── ses-20180420+sample+2
│   │   └── ecephys
│   │       └── sub-mouse+AAYYT_ses-20180420+sample+2_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2_icephys.nwb
│   ├── ses-20180420+sample+3
│   │   └── ecephys
│   │       └── sub-mouse+AAYYT_ses-20180420+sample+3_ecephys.nwb -> ../../../../000008/sub-mouse-AAYYT/sub-mouse-AAYYT_ses-20180420-sample-3_slice-20180420-slice-3_cell-20180420-sample-3_icephys.nwb

And the participants.tsv is modified too, i.e. sub-mouse+AAYYT. Looks like conversion+sanitization is working right to me!
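
As an aside on the repeated `--sanitization` flag used in the command above, the pattern of one option collecting several values can be sketched with the standard library. This uses `argparse` purely as a stand-in; the project itself uses `rich_click` per the commit list, and the parser below is a hypothetical illustration, not the nwb2bids CLI.

```python
import argparse

# Hypothetical stand-in parser; only the two codes named in the release notes are accepted.
parser = argparse.ArgumentParser(prog="example-convert")
parser.add_argument("dataset")
parser.add_argument("--bids-directory")
parser.add_argument(
    "--sanitization",
    action="append",  # each occurrence of the flag appends one value
    choices=["sub-labels", "ses-labels"],
    default=[],
)

arguments = parser.parse_args(
    [
        "000008",
        "--bids-directory", "08-sanit",
        "--sanitization", "sub-labels",
        "--sanitization", "ses-labels",
    ]
)
print(arguments.sanitization)
```

Omitting the flag entirely leaves the list empty, which corresponds to the "no sanitization flags" run shown earlier.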

However, this does create a confusing situation, the validations are called prior to any sanitization, so nwb2bids reports that it failed even when it succeeds! (Note, the sanitization flags have no effect on the output messages, both are identical)

nwb2bids convert 000008 --bids-directory 08-sanit2 --sanitization sub-labels --sanitization ses-labels

BIDS dataset was not successfully created!
Some errors were encountered during conversion.
The first 3 of 98 are shown below:


	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.
	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.
	- The participant ID contains invalid characters. BIDS allows only the plus sign to be used as a separator in the subject entity label. Underscores, dashes, spaces, slashes, and other special characters (including #) are expressly forbidden.

Please review the full notifications report at 08-sanit2/.nwb2bids/datetime-20260105121259_notifications.json

@CodyCBakerPhD
Collaborator Author

CodyCBakerPhD commented Jan 5, 2026

> However, this does create a confusing situation, the validations are called prior to any sanitization, so nwb2bids reports that it failed even when it succeeds! (Note, the sanitization flags have no effect on the output messages, both are identical)

Hmm.. it should have diminished the level of the notifications from errors to warnings (resulting in slightly different printout text + coloration)

@CodyCBakerPhD
Collaborator Author

Could you upload or copy/paste (under details plz) the full .json dump from each run (with/without sanitization flags)

@CodyCBakerPhD
Collaborator Author

Though I will note in both cases the intention of the notifications is to reflect any 'badness' to inform the user to fix them within the source files (future aggressive sanitization modes may make such modifications in-place, in which case no warning would be re-issued on second run)

@asmacdo
Member

asmacdo commented Jan 5, 2026

@asmacdo
Member

asmacdo commented Jan 5, 2026

For completeness I only ran this against a subset of subjects, others were deleted.

▾ 000008/                                                                                                      
  ▸ sub-mouse-AAYYT/                                                                                           
  ▸ sub-mouse-AEJGZ/                                                                                           
  ▸ sub-mouse-ALHXK/                                                                                           
  ▸ sub-mouse-ANUPT/                                                                                           
  ▸ sub-mouse-APLSV/                                                                                           
  ▸ sub-mouse-AVLEY/                                                                                           
  ▸ sub-mouse-AWOAY/                                                                                           
  ▸ sub-mouse-AXEMD/                                                                                           
  ▸ sub-mouse-AYGIX/                                                                                           
  ▸ sub-mouse-AZZAO/                                                                                           
    dandiset.yaml     

@CodyCBakerPhD
Collaborator Author

@asmacdo OK thank you for clarifying - I think it was in my mind to do something like that with the notifications but would be a bit much to add it to this PR so I say we resolve in a follow-up

As-is the notifications still serve as feedback to the user that their subject/session IDs (in the NWB files) are invalid and they should change them to avoid any need for sanitization to begin with

@CodyCBakerPhD
Collaborator Author

It's also probably confusing that these are called errors. Rethinking the notification schema design in this respect is also on the to-do list: only 'internal errors' are blocking to the process (and this one is an external error), so the rest should most likely be called something else.
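
To make the blocking vs. non-blocking distinction concrete, here is a rough sketch of how a success check could key off the notification category rather than the severity. It assumes each notification is a dictionary with the `category` and `severity` fields seen in the JSON dump earlier in this thread; the function name and the placeholder category are hypothetical, and this is not the current nwb2bids behavior.

```python
# Assumption: only this category blocks the conversion, per the comment above.
BLOCKING_CATEGORIES = frozenset({"INTERNAL_ERROR"})


def dataset_was_created(notifications: list[dict]) -> bool:
    """Hypothetical check: True unless some notification has a blocking category."""
    return not any(
        notification.get("category") in BLOCKING_CATEGORIES
        for notification in notifications
    )


# Illustrative records; "EXAMPLE_EXTERNAL_CATEGORY" is a placeholder, not a real category name.
invalid_label = {"category": "EXAMPLE_EXTERNAL_CATEGORY", "severity": "ERROR"}
failed_write = {"category": "INTERNAL_ERROR", "severity": "ERROR"}

print(dataset_was_created([invalid_label] * 3))
print(dataset_was_created([invalid_label, failed_write]))
```

With a check like this, a run that only reports invalid subject or session labels would count as a created dataset, while an internal failure such as the missing-directory error would not.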

@asmacdo
Member

asmacdo commented Jan 5, 2026

> Hmm.. it should have diminished the level of the notifications from errors to warnings (resulting in slightly different printout text + coloration)

Looks the same to me
image
image

@CodyCBakerPhD
Collaborator Author

> Looks the same to me

Yep, this is expected now - see #244 (comment)

I was dreaming about next steps and forgot what had been done vs. planned

Member

@asmacdo asmacdo left a comment

Oh ok gotcha, IMO would be fine to rethink how notifications work in a separate PR.

FWIW the part that bothers me the most is "BIDS dataset was not successfully created!". All the warnings/errors are annoying maybe, but this is actually a lie!

@github-project-automation github-project-automation bot moved this to In Progress in nwb2bids Roadmap Jan 5, 2026
@CodyCBakerPhD
Collaborator Author

> FWIW the part that bothers me the most is "BIDS dataset was not successfully created!". All the warnings/errors are annoying maybe, but this is actually a lie!

Yeah, true. I can prioritize that next then to at least get some minor patch through

@CodyCBakerPhD CodyCBakerPhD merged commit a475be8 into main Jan 5, 2026
28 checks passed
@CodyCBakerPhD CodyCBakerPhD deleted the redux_sanitization branch January 5, 2026 19:07
@github-project-automation github-project-automation bot moved this from In Progress to Done in nwb2bids Roadmap Jan 5, 2026