
annotate-openneuro

Scripts and data for bulk-generation of OpenNeuro data dictionaries.

Get annotatable data

Fetch all participants.tsv files (WIP)

  1. Clone and install ondiagnostics from source:

  2. Run the following script:

    import asyncio
    from pathlib import Path

    from ondiagnostics.graphql import Dataset, create_client, datasets_generator
    from ondiagnostics.tasks import clone_dataset

    # Local cache directory for the cloned datasets; adjust to your machine.
    PUT_HERE = Path("/home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data")
    client = create_client()


    async def main():
        async for dataset in datasets_generator(client):
            print("getting", dataset.id, "now")
            await clone_dataset(Dataset(id=dataset.id, tag=dataset.tag), cache_dir=PUT_HERE)


    if __name__ == "__main__":
        asyncio.run(main())
  3. Extract the participants.tsv files from the git blobs:

    for dir in /home/surchs/Repositories/external/ondiagnostics/seb_fun_things/data/*; do
        dataset_name=$(basename "$dir" .git)
        if [ ! -f "./${dataset_name}.tsv" ]; then
            git -C "$dir" show HEAD:participants.tsv > "./${dataset_name}.tsv"
        fi
    done

    NOTE: This loop creates a TSV file for every dataset, whether or not participants.tsv exists in the repository; datasets without one end up with an empty TSV.

Fetch all participants.json files

Note

This script is agnostic to whether participants.tsv files have already been fetched.

  1. Create a file containing a private key for the Neurobagel Bot app
  2. Set two environment variables:
  • NB_BOT_ID: app ID of the Neurobagel Bot app, which you can find on the settings page for the app
  • NB_BOT_KEY_PATH: path to the private key file on your machine
  3. Run the script to fetch all participants.json files:
    python code/get_participants_json_files.py

Create overview of participants.tsv files

python code/create_data_overview.py

This will create a file called data_overview.tsv.
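The README does not specify what the overview contains; a minimal sketch of the idea, recording per dataset the participant row count and column names of each fetched participants.tsv (the output columns here are assumptions, not the script's actual layout), could be:

```python
import csv
from pathlib import Path


def summarize_tsv(tsv_path: Path) -> dict:
    """Return dataset name, participant row count, and column list for one TSV."""
    with tsv_path.open(newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return {
        "dataset": tsv_path.stem,
        "n_participants": n_rows,
        "columns": "|".join(header),
    }


def write_overview(tsv_paths: list[Path], out_path: Path) -> None:
    """Write one overview row per dataset to a data_overview-style TSV."""
    rows = [summarize_tsv(p) for p in tsv_paths]
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["dataset", "n_participants", "columns"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)
```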

Determine datasets needed to cover a target percentage of all participants

Tip

To adjust the target percentage, modify the PERCENTAGE constant in code/get_openneuro_tabular_overview.py.

python code/get_openneuro_tabular_overview.py
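At its simplest, picking datasets to cover a target percentage of participants is a greedy cumulative sum over datasets sorted by size. A sketch of that idea (the actual script's logic may differ):

```python
def datasets_for_coverage(sizes: dict[str, int], percentage: float) -> list[str]:
    """Greedily pick the largest datasets until `percentage` (0-100) of all
    participants across datasets is covered."""
    total = sum(sizes.values())
    target = total * percentage / 100
    picked, covered = [], 0
    for name, n in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target:
            break
        picked.append(name)
        covered += n
    return picked
```

Sorting by size first means the fewest datasets are needed to hit the target.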

Generate summary mega-tables of all columns and categorical values in all participants.tsv files

python code/create_participants_tsv_column_and_value_summaries_tables.py

This script also generates two heuristic-annotated variants: a column summaries table with first guesses of standardized variable mappings and age column formats, and a value summaries table with first guesses of standardized term mappings for sex column values.

Create bulk annotations

Create Neurobagel data dictionaries from bulk annotations

python code/process_annotations_to_dicts.py

This script will:

  • Create a JSON file (resources/annotated_columns_by_dataset.json) summarizing the currently annotated vs. unannotated columns by dataset based on the input column summaries table
  • Create a Neurobagel data dictionary JSON for each dataset from the column summaries table with at least one column annotation, with output files stored in data/annotated_dictionaries
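Neurobagel data dictionaries extend BIDS-style JSON sidecars with an Annotations object per column. A minimal sketch of assembling one from annotated columns (the field names shown are illustrative; check the Neurobagel documentation for the exact schema):

```python
import json


def build_data_dictionary(annotated_columns: dict[str, dict]) -> str:
    """Assemble a sidecar-style JSON string from per-column annotations.

    `annotated_columns` maps column names to dicts holding a free-text
    description and a standardized term mapping (illustrative schema).
    """
    dictionary = {
        column: {
            "Description": info.get("description", ""),
            "Annotations": {
                "IsAbout": {
                    "TermURL": info["term_url"],
                    "Label": info["label"],
                },
            },
        }
        for column, info in annotated_columns.items()
    }
    return json.dumps(dictionary, indent=2)
```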

First-pass LLM assessment annotation

  1. Set the environment variable OPENROUTER_API_KEY to the value of your API key for OpenRouter
  2. Run the LLM classification (note: can take up to several minutes per dataset):
    python code/llm_classify_assessments.py
  3. Add the LLM annotation information for columns classified as assessment back to the column summaries table:
    python code/add_llm_annotations_to_column_summaries_table.py
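OpenRouter exposes an OpenAI-compatible chat completions endpoint authenticated with a Bearer token. A sketch of the kind of request such a classification step might send (the prompt, model name, and response handling are assumptions about the script, not its actual contents):

```python
import os

API_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_classification_request(
    column_name: str, sample_values: list[str]
) -> tuple[dict, dict]:
    """Build headers and JSON body asking an LLM whether a column is an assessment."""
    headers = {
        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "openai/gpt-4o-mini",  # hypothetical choice; any OpenRouter model works
        "messages": [
            {
                "role": "user",
                "content": (
                    f"Is the participants.tsv column {column_name!r} with sample "
                    f"values {sample_values} an assessment? Answer yes or no."
                ),
            }
        ],
    }
    return headers, body


# The request itself could then be sent with e.g.
#   requests.post(API_URL, headers=headers, json=body)
```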
