annotate-openneuro

Scripts and data for bulk-generation of OpenNeuro data dictionaries.

Get annotatable data

Fetch all `participants.tsv` files (WIP)

Clone and install from source ondiagnostics:

Run the following script:

import asyncio
from ondiagnostics.graphql import create_client, datasets_generator
from ondiagnostics.graphql import Dataset
from ondiagnostics.tasks import clone_dataset
from pathlib import Path

PUT_HERE = Path("/home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data")
client = create_client()


async def main():
    async for dataset in datasets_generator(client):
        print("getting", dataset.id, "now")
        await clone_dataset(Dataset(id=dataset.id, tag=dataset.tag), cache_dir=PUT_HERE)


if __name__ == "__main__":
    asyncio.run(main())%

Create files from git blob:

for dir in /home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data/*; do
    dataset_name=$(basename "$dir" .git)
    if [ ! -f "./${dataset_name}.tsv" ]; then
        git -C $dir show HEAD:participants.tsv > "./${dataset_name}.tsv"
    fi
done

NOTE: This script creates a TSV file for all datasets regardless of if the file exists or not in the repo.

Fetch all `participants.json` files

Note

This script is agnostic to whether participants.tsv files have already been fetched.

Create a file containing a private key for the Neurobagel Bot app
Set two environment variables

NB_BOT_ID: app ID of the Neurobagel Bot app, which you can find on the settings page for the app
NB_BOT_KEY_PATH: path to the private key file on your machine

Run the script to get all participants.json files:
```
python run code/get_participants_json_files.py
```

Create overview of `participants.tsv` files

python code/create_data_overview.py

This will create a file called data_overview.tsv.

Determine datasets needed to cover a target percentage of all participants

Tip

To adjust the target percentage, modify the PERCENTAGE constant in code/get_openneuro_tabular_overview.py.

python code/get_openneuro_tabular_overview.py

Generate summary mega-tables of all columns and categorical values in all `participants.tsv` files

python code/create_participants_tsv_column_and_value_summaries_tables.py

This script will also generate a version of the column summaries table with heuristic-based first guesses of standardized variable mappings and age column formats, and a version of the value summaries table with heuristic-based first guesses of standardized term mappings for sex column values.

Create bulk annotations

Create Neurobagel data dictionaries from bulk annotations

python code/process_annotations_to_dicts.py

This script will:

Create a JSON file (resources/annotated_columns_by_dataset.json) summarizing the currently annotated vs. unannotated columns by dataset based on the input column summaries table
Create a Neurobagel data dictionary JSON for each dataset from the column summaries table with at least one column annotation, with output files stored in data/annotated_dictionaries

First-pass LLM assessment annotation

Set the environment variable OPENROUTER_API_KEY to the value of your API key for OpenRouter
Run the LLM classification (note: can take up to several minutes per dataset):
```
python code/llm_classify_assessments.py
```
Add the LLM annotation information for columns classified as assessment back to the column summaries table:
```
python code/add_llm_annotations_to_column_summaries_table.py
```

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
code		code
data		data
resources		resources
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

annotate-openneuro

Get annotatable data

Fetch all `participants.tsv` files (WIP)

Fetch all `participants.json` files

Create overview of `participants.tsv` files

Determine datasets needed to cover a target percentage of all participants

Generate summary mega-tables of all columns and categorical values in all `participants.tsv` files

Create bulk annotations

Create Neurobagel data dictionaries from bulk annotations

First-pass LLM assessment annotation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

annotate-openneuro

Get annotatable data

Fetch all participants.tsv files (WIP)

Fetch all participants.json files

Create overview of participants.tsv files

Determine datasets needed to cover a target percentage of all participants

Generate summary mega-tables of all columns and categorical values in all participants.tsv files

Create bulk annotations

Create Neurobagel data dictionaries from bulk annotations

First-pass LLM assessment annotation

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Fetch all `participants.tsv` files (WIP)

Fetch all `participants.json` files

Create overview of `participants.tsv` files

Generate summary mega-tables of all columns and categorical values in all `participants.tsv` files

Packages