-
Notifications
You must be signed in to change notification settings - Fork 4
Added sanitization feature #244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* feat: add basic approach to sanitization * feat: improve test cases * feat: expand to session as well * feat: fix tests * fix: test * fix: test * feat: rollback to non-LRU caches * fix: Replace click with rich_click for CLI options * fix: Change click.Choice to rich_click.Choice * feat: adding file dumping * feat: saving state of passing report context * feat: generalize log structure * feat: some refactors for simpler config handling * fix: all tests on new structure * Apply suggestion from @CodyCBakerPhD * chore: fix pre-commit from GitHub suggestion * chore: fix typo * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add sanitization level to command line interface * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply suggestions from code review Co-authored-by: Isaac To <candleindark@users.noreply.github.com> * other PR suggestions * other PR suggestions * restore sanitization code * Enhance tests for sanitization and reduce code duplications in those tests (#210) * test(fix): test `sanitize_session_id()` in `test_sanitize_session_id()` `sanitize_participant_id()` instead of `sanitize_session_id()` was tested in `test_sanitize_session_id()`, which I don't think was intended. * refactor(test): reduce code in `test_sanitization.py` By using the parametrization feature from pytest * test: parametrize `sanitization_level` In `test_sanitize_participant_id()` and `test_sanitize_session_id()` * style(test): simplify calls to sanitization func with positional args There is no benefits from using keyword arguments since the input variable names are the same as the keyword arguments' name * docs: complete docstring for `sanitize_session_id()` * docs: complete docstring for `sanitize_participant_id()` * test: add test cases for sanitization level set to `SanitizationLevel.NONE` --------- Co-authored-by: Cody Baker <51133164+CodyCBakerPhD@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Isaac To <candleindark@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix merge * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix initial creation of sanitization path * Alternative sanitization approach (#168) * feat: finish alternative approach to sanitization model and fix all tests * chore: remove comments * fix: resolve pandas complaint * fix: just suppress pandas wrong warning * properly newlin * add missing participant id support * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixup tests * fix merge * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * re-order some conversion steps * re-order some conversion steps and parent creation * re-order some conversion steps and parent creation --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com> * fix test path * major refactors * lower case and hyphens in CLI; some pandas typing fixes * adjust all remaining references to levels --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Isaac To <candleindark@users.noreply.github.com> Co-authored-by: CodyCBakerPhD <codycbakerphd@gmail.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #244 +/- ##
==========================================
+ Coverage 84.65% 84.75% +0.10%
==========================================
Files 32 35 +3
Lines 1212 1279 +67
==========================================
+ Hits 1026 1084 +58
- Misses 186 195 +9
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
asmacdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't able to find a way to test with real data use cases, but the code looks good.
My issue was that I struggled to find a dataset that meets the criteria suggested: "have a bunch of errors on the unsanitized branch, but don't (or have fewer) after basic sanitization". Maybe I misunderstood, but https://github.com/bids-dandisets/dashboard/blob/main/README.md shows equal number of errors for sanitized and unsanitized (ignoring BEP 32).
The exception is 000016, which acutally shows 1 extra error. So there may be some regression/edge case/race condition bug with parent dir?
[Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt
diff not_sanitized_000016.json sanitized_000016.json
2a3,14
> "title": "Failed to extract metadata for one or more sessions",
> "reason": "An error occurred while executing `DatasetConverter.extract_metadata`.\n\nTraceback (most recent call last):\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 191, in extract_metadata\n collections.deque(\n ~~~~~~~~~~~~~~~~~^\n (\n ^\n ...<4 lines>...\n maxlen=0,\n ^^^^^^^^^\n )\n ^\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_dataset_converter.py\", line 193, in <genexpr>\n session_converter.extract_metadata()\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/_converters/_session_converter.py\", line 102, in extract_metadata\n self.session_metadata = BidsSessionMetadata.from_nwbfile_paths(\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^\n nwbfile_paths=self.nwbfile_paths, run_config=self.run_config\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n )\n ^\n File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 39, in wrapper_function\n return wrapper(*args, **kwargs)\n File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/_internal/_validate_call.py\", line 136, in __call__\n res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 149, in from_nwbfile_paths\n session_metadata = cls(**dictionary)\n File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/bids_models/_bids_session_metadata.py\", line 44, in model_post_init\n self.sanitization = Sanitization(\n ~~~~~~~~~~~~^\n sanitization_config=self.run_config.sanitization_config,\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n ...<2 lines>...\n original_participant_id=self.participant.participant_id,\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n )\n ^\n File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/site-packages/pydantic/main.py\", line 250, in __init__\n validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)\n File \"/data/dandi/bids-dandisets/nwb2bids/src/nwb2bids/sanitization/_sanitization.py\", line 35, in model_post_init\n with self.sanitization_file_path.open(mode=\"w\") as file_stream:\n ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^\n File \"/data/dandi/s3-logs/conda/envs/bids_dandisets/lib/python3.13/pathlib/_local.py\", line 537, in open\n return io.open(self, mode, buffering, encoding, errors, newline)\n ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nFileNotFoundError: [Errno 2] No such file or directory: '/data/dandi/bids-dandisets/work/000016/.nwb2bids/datetime-20251218063120_sanitization.txt'\n",
> "solution": "Please raise an issue on `nwb2bids`: https://github.com/con/nwb2bids/issues.",
> "examples": null,
> "field": null,
> "source_file_paths": null,
> "target_file_paths": null,
> "data_standards": null,
> "category": "INTERNAL_ERROR",
> "severity": "ERROR"
> },
> {
|
The effectiveness is further summarized by the common issues table
|
Co-authored-by: Austin Macdonald <austin@dartmouth.edu>
for more information, see https://pre-commit.ci
How interesting. Looking at it I can maybe see how that might happen (and clearly did at least once) Odd it doesn't happen all the time though, but ah well... I added a line to ensure parents exist before first write so shouldn't happen again Good catch |
|
@asmacdo Welcome back from break! Let's make this top priority since a fair amount of follow-up work would be a pain to have to rebase after this fairly extensive PR I will do what I can for now on completely independent PRs |
|
No sanitization flags: Sanitization flags And the participants.tsv is modified also, ie However, this does create a confusing situation, the validations are called prior to any sanitization, so
|
Hmm.. it should have diminished the level of the notifications from errors to warnings (resulting in slightly different printout text + coloration) |
|
Could you upload or copy/paste (under details plz) the full .json dump from each run (with/without sanitization flags) |
|
Though I will note in both cases the intention of the notifications is to reflect any 'badness' to inform the user to fix them within the source files (future aggressive sanitization modes may make such modifications in-place, in which case no warning would be re-issued on second run) |
|
For completeness I only ran this against a subset of subjects, others were deleted. |
|
@asmacdo OK thank you for clarifying - I think it was in my mind to do something like that with the notifications but would be a bit much to add it to this PR so I say we resolve in a follow-up As-is the notifications still serve as feedback to the user that their subject/session IDs (in the NWB files) are invalid and they should change them to avoid any need for sanitization to begin with |
|
It's also probably confusing that these are called errors, but it is also on the to do list to rethink notification schema design in this respect; only 'internal errors' (which this is an external error) are blocking to the process so they should be called something else most likely |
Yep, this is expected now - see #244 (comment) I was dreaming about next steps and forgot what had been done vs. planned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh ok gotcha, IMO would be fine to rethink how notifications work in a separate PR.
FWIW the part that bothers me the most is BIDS dataset was not successfully created!. All the warnings/errors are annoying maybe, but this is actually a lie!
Yeah, true. I can prioritize that next then to at least get some minor patch through |




What Changed
Release Notes
The sanitization feature is now available. 'Sanitization' refers to actively altering various values extracted from the NWB files in an effort to adhere to BIDS validation requirements.
An example would be a NWB file containing the field
nwbfile.session_id = "my session #2". While this is perfectly allowed in NWB, attempting to write filenames such as_ses-my session #2_would not be valid in BIDS and the resulting files including these labels would not be uploadable to common data archives. Sanitization would instead generate BIDS filenames as_ses-my+session+2_.Currently, only the codes 'sub-labels' or 'ses-labels' are available. These sanitize session or subject ID labels in a way guaranteed to be valid BIDS. We plan to add more capabilities in the future, including some limited in-place modification of NWB contents to allow consistency with the BIDS validation requirements.