Scripts and data for bulk-generation of OpenNeuro data dictionaries.
- Clone and install ondiagnostics from source.
- Run the following script:
    import asyncio
    from pathlib import Path

    from ondiagnostics.graphql import Dataset, create_client, datasets_generator
    from ondiagnostics.tasks import clone_dataset

    # Directory that receives the cloned dataset repositories
    PUT_HERE = Path("/home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data")

    client = create_client()


    async def main():
        # Clone every dataset returned by the OpenNeuro GraphQL API
        async for dataset in datasets_generator(client):
            print("getting", dataset.id, "now")
            await clone_dataset(Dataset(id=dataset.id, tag=dataset.tag), cache_dir=PUT_HERE)


    if __name__ == "__main__":
        asyncio.run(main())
- Create files from git blob:
    for dir in /home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data/*; do
        dataset_name=$(basename "$dir" .git)
        # Skip datasets whose TSV has already been extracted
        if [ ! -f "./${dataset_name}.tsv" ]; then
            git -C "$dir" show HEAD:participants.tsv > "./${dataset_name}.tsv"
        fi
    done
NOTE: This script creates a TSV file for every dataset, whether or not participants.tsv exists in the repo. For a dataset without one, git show fails and the output file is left empty.
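Because the shell redirection creates the output file before git runs, datasets without a participants.tsv end up as zero-byte files. A minimal sketch for listing them, assuming the loop was run from the current working directory (the function name `find_empty_tsvs` is hypothetical and not part of this repository):

```python
from pathlib import Path


def find_empty_tsvs(directory: Path) -> list[str]:
    """Return the names of zero-byte .tsv files in `directory`, sorted."""
    return sorted(
        path.name
        for path in directory.glob("*.tsv")
        if path.stat().st_size == 0
    )


if __name__ == "__main__":
    # Assumption: the extracted TSV files live in the current directory.
    for name in find_empty_tsvs(Path(".")):
        print("no participants.tsv content extracted for", name)
```

Note that a participants.tsv that is itself empty in the repo would be flagged as well, so the list is an upper bound on the datasets that lack the file.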
Note
This script is agnostic to whether participants.tsv files have already been fetched.
- Create a file containing a private key for the Neurobagel Bot app
- Set two environment variables:
  - NB_BOT_ID: app ID of the Neurobagel Bot app, which you can find on the settings page for the app
  - NB_BOT_KEY_PATH: path to the private key file on your machine
- Run the script to get all participants.json files:
    python code/get_participants_json_files.py
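Before running the script, it can help to confirm that both environment variables are in place. The following is a small sketch of such a check; the function `check_bot_env` is hypothetical and not part of this repository, and it only verifies that the variables are set and that the key path points to an existing file, not that the key is valid.

```python
import os
from pathlib import Path


def check_bot_env(env=os.environ) -> list[str]:
    """Return a list of problems with the bot-related environment variables."""
    problems = []
    if not env.get("NB_BOT_ID"):
        problems.append("NB_BOT_ID is not set")
    key_path = env.get("NB_BOT_KEY_PATH")
    if not key_path:
        problems.append("NB_BOT_KEY_PATH is not set")
    elif not Path(key_path).is_file():
        problems.append(f"NB_BOT_KEY_PATH does not point to a file: {key_path}")
    return problems


if __name__ == "__main__":
    # An empty list means both variables look usable.
    for problem in check_bot_env():
        print("problem:", problem)
```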
    python code/create_data_overview.py
This will create a file called data_overview.tsv.
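To take a quick look at the resulting file without assuming anything about its columns, a sketch along these lines reads the header and counts the data rows. The helper `summarize_tsv` is hypothetical, and the location of data_overview.tsv (here the current directory) is an assumption.

```python
import csv
from pathlib import Path


def summarize_tsv(path: Path) -> tuple[list[str], int]:
    """Return (column names, number of data rows) for a tab-separated file."""
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])  # empty list if the file has no header
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    # Assumption: data_overview.tsv was written to the current directory.
    columns, n_rows = summarize_tsv(Path("data_overview.tsv"))
    print(n_rows, "rows with columns:", ", ".join(columns))
```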
Tip
To adjust the target percentage, modify the PERCENTAGE constant in code/get_openneuro_tabular_overview.py.
    python code/get_openneuro_tabular_overview.py

    python code/create_participants_tsv_column_and_value_summaries_tables.py
This script will also generate a version of the column summaries table with heuristic-based first guesses of standardized variable mappings and age column formats, and a version of the value summaries table with heuristic-based first guesses of standardized term mappings for sex column values.
    python code/process_annotations_to_dicts.py
This script will:
- Create a JSON file (resources/annotated_columns_by_dataset.json) summarizing the currently annotated vs. unannotated columns by dataset, based on the input column summaries table
- Create a Neurobagel data dictionary JSON for each dataset in the column summaries table that has at least one column annotation, with output files stored in data/annotated_dictionaries
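As a quick sanity check on the output, the generated dictionaries can be loaded to confirm that each one parses as JSON. This is a sketch only: the helper `load_dictionaries` is hypothetical, and it assumes the output files use a .json extension, which the steps above do not state explicitly.

```python
import json
from pathlib import Path


def load_dictionaries(directory: Path) -> dict[str, object]:
    """Parse every .json file in `directory`, keyed by file name without extension."""
    return {
        path.stem: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(directory.glob("*.json"))
    }


if __name__ == "__main__":
    # Raises json.JSONDecodeError if any file is not valid JSON.
    dictionaries = load_dictionaries(Path("data/annotated_dictionaries"))
    print("loaded", len(dictionaries), "data dictionaries")
```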
- Set the environment variable OPENROUTER_API_KEY to the value of your API key for OpenRouter
- Run the LLM classification (note: can take up to several minutes per dataset):
    python code/llm_classify_assessments.py
- Add the LLM annotation information for columns classified as assessment back to the column summaries table:
    python code/add_llm_annotations_to_column_summaries_table.py