Taxonomy‑aware protein naming by fast homology search with DIAMOND. Brownaming assigns concise names to predicted proteins by prioritizing homologs from the closest available taxa and only expanding outward in the taxonomy when needed.
Goal: give each query protein the most specific, biologically meaningful name supported by homology.
- Iterative taxonomy expansion: start at the target species; if no hit is found, move up one rank (genus → family → order ...)
- Prioritize candidates by (1) taxonomic distance (constant within an iteration, so it only varies across iterations), then (2) DIAMOND bitscore, then (3) % identity
- If nothing found up to configured rank → "Uncharacterized protein"
- Fully local: UniProt Swiss‑Prot + TrEMBL DIAMOND database with taxonomy metadata
- Optional exclusion of unwanted taxa
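The expansion-and-ranking loop above can be sketched roughly as follows. This is illustrative only: `best_name`, `search_fn`, and the hit-tuple layout are assumptions, not Brownaming's actual internals.

```python
def best_name(query_id, lineage, search_fn, last_taxid=None):
    """Walk up the lineage (species -> genus -> family -> ...) until hits appear.

    lineage: TaxIDs ordered from the target species upward.
    search_fn(query_id, taxid): hits as (name, bitscore, pct_identity) tuples,
    restricted to the subtree of `taxid` minus already-searched taxa.
    """
    for taxid in lineage:
        hits = search_fn(query_id, taxid)
        if hits:
            # Within one iteration the taxonomic distance is fixed, so rank
            # candidates by bitscore first, then by % identity.
            name, _bitscore, _identity = max(hits, key=lambda h: (h[1], h[2]))
            return name
        if taxid == last_taxid:  # configured stopping rank (cf. --last-tax)
            break
    return "Uncharacterized protein"
```

The fallback string is only returned once the configured stopping rank (or the root of the lineage) has been searched without a hit.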
- Linux environment for database build (tested on Ubuntu; search also works under WSL)
- DIAMOND ≥ 2.1.x compiled with taxonomy support
- curl, awk, gzip, coreutils (for the provided build script)
- Disk space: ≈ 120–150 GB (TrEMBL + Swiss‑Prot FASTA + DIAMOND index + temporary files)
- RAM: ≥ 32 GB recommended for faster makedb sorting (less will still work, slower)
- Python ≥ 3.9
- Conda or Mamba (recommended for dependency management)
See environment.yml for the complete list of dependencies. Main packages include:
- BioPython
- NumPy, Pandas
- scikit-learn
- openpyxl
- matplotlib
- requests
Before using Brownaming, you must create the local DIAMOND database:
- Clone the repository:

```bash
git clone <repo_url>
cd Brownaming
```

- Create or edit config.json:

```json
{
  "local_db_path": "/path/to/brownaming_db"
}
```

- Run the database build script:

```bash
./create_local_db.sh
```

What it does:
- Downloads UniProt Swiss‑Prot + TrEMBL (current release)
- Extracts TaxIDs from FASTA headers (OX=)
- Generates taxonmap.tsv and taxonomy JSON caches (parent/rank/children)
- Builds two DIAMOND databases:
- full (Swiss‑Prot + TrEMBL)
- swissprot (Swiss‑Prot only)
Duration: ~8 h
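The TaxID extraction step relies on the `OX=` field that UniProt puts in every FASTA header. A minimal sketch of the idea (the actual build script uses awk; `taxonmap_rows` is a name invented here):

```python
import re

OX_RE = re.compile(r"\bOX=(\d+)")

def taxonmap_rows(fasta_lines):
    """Yield (accession, taxid) pairs from UniProt FASTA headers.

    Headers look like:
    >sp|P12345|NAME_HUMAN Some protein OS=Homo sapiens OX=9606 GN=...
    """
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        header = line[1:]
        accession = header.split("|")[1] if "|" in header else header.split()[0]
        m = OX_RE.search(header)
        if m:
            yield accession, int(m.group(1))
```

Each yielded pair becomes one row of taxonmap.tsv, which DIAMOND consumes when the taxonomy-aware databases are built.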
- Create the conda environment:

```bash
conda env create -f environment.yml
conda activate brownaming
```

```bash
# Basic run
python main.py -p /path/to/query.fasta -s 83333 --threads 16

# SwissProt only
python main.py -p /path/to/query.fasta -s 83333 --swissprot-only

# Specify database path explicitly (overrides config.json)
python main.py -p /path/to/query.fasta -s 83333 --local-db /custom/path/to/db

# Specify custom final output directory
python main.py -p /path/to/query.fasta -s 83333 --working-dir /custom/output/path

# Resume run
python main.py --resume 2026-02-24-14-30-83333
```

Brownaming always executes in: runs/YYYY-MM-DD-HH-MM-TAXID/
If --working-dir is provided, the run directory is moved to that destination only after successful completion.
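That move-on-success behaviour can be sketched as below; `finalize_run` is a hypothetical helper, not Brownaming's actual function.

```python
import shutil
from pathlib import Path
from typing import Optional

def finalize_run(run_dir: Path, working_dir: Optional[Path], succeeded: bool) -> Path:
    """Move a completed run directory to --working-dir, but only on success.

    Failed or interrupted runs stay under runs/ so --resume can pick them up.
    """
    if succeeded and working_dir is not None:
        working_dir.mkdir(parents=True, exist_ok=True)
        dest = working_dir / run_dir.name
        shutil.move(str(run_dir), str(dest))
        return dest
    return run_dir
```

Keeping failed runs in place is what makes resuming by run ID possible.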
After running several analyses, update the time prediction model:
```bash
# Activate environment (if not already active)
conda activate brownaming

# Generate updated dataset from all run logs
cd time_prediction_model/
python create_data.py

# Retrain the model
python train_model.py
```

Docker provides an isolated, reproducible environment.
```bash
docker build -t brownaming .
```

The wrapper script brownaming-compose automatically:
- Reads the database path from config.json
- Detects and mounts the directory containing your input file
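The mount logic amounts to mapping the input file's host directory to a path inside the container. A sketch of the idea (the real wrapper is a shell script; `mount_for_input` and the `/data` mount point are assumptions made here):

```python
import os
from pathlib import Path, PurePosixPath

def mount_for_input(input_path, container_root="/data"):
    """Derive the docker bind mount for an input file.

    Returns (host_dir, container_path): bind-mount the file's directory at
    `container_root` and address the file by that path inside the container.
    """
    p = Path(os.path.abspath(input_path))
    return str(p.parent), str(PurePosixPath(container_root) / p.name)
```

This is also why absolute input paths are required: the wrapper needs an unambiguous host directory to mount.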
Make the script executable:

```bash
chmod +x brownaming-compose
```

Use absolute paths for your input files:
```bash
# Basic run
./brownaming-compose run --rm brownaming \
    python main.py -p /absolute/path/to/query.fasta -s 83333 --threads 16

# SwissProt only
./brownaming-compose run --rm brownaming \
    python main.py -p /absolute/path/to/query.fasta -s 83333 --swissprot-only

# Specify database path explicitly (overrides config.json)
./brownaming-compose run --rm brownaming \
    python main.py -p /absolute/path/to/query.fasta -s 83333 --local-db /path/to/db

# Specify custom final output directory
./brownaming-compose run --rm brownaming \
    python main.py -p /absolute/path/to/query.fasta -s 83333 --working-dir /custom/output/path

# Resume run
./brownaming-compose run --rm brownaming \
    python main.py --resume 2026-02-24-14-30-83333
```

Brownaming always executes in: runs/YYYY-MM-DD-HH-MM-TAXID/
If --working-dir is provided, the run directory is moved to that destination only after successful completion.
After running several analyses, update the time prediction model:
```bash
# Generate updated dataset from all run logs
./brownaming-compose run --rm brownaming \
    python time_prediction_model/create_data.py

# Retrain the model
./brownaming-compose run --rm brownaming \
    python time_prediction_model/train_model.py
```

The updated model is immediately available: no rebuild needed. The more analyses you run, the more accurate the time estimates become.
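To illustrate what retraining such a model can look like: the sketch below fits a linear regression from run-log features to runtime. The feature choice (protein count, total residues), the numbers, and the model class are all assumptions; the real create_data.py/train_model.py may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data extracted from run logs:
# features = (number of proteins, total residues), target = runtime in minutes.
X = np.array([[100, 40_000],
              [500, 210_000],
              [1000, 390_000],
              [2000, 820_000]], dtype=float)
y = np.array([6.0, 31.0, 58.0, 122.0])

model = LinearRegression().fit(X, y)

# Estimate runtime for a new proteome before launching it.
estimate = float(model.predict(np.array([[1500, 600_000]]))[0])
```

Every completed run adds a row to the training set, which is why estimates improve as more analyses accumulate.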
Required:
- -p / --proteins : Query protein FASTA
- -s / --species : NCBI TaxID of target species (root of initial search)
Optional:
- --local-db : Path to local database directory (overrides LOCAL_DB_PATH env var and config.json)
- --working-dir : Final output directory (optional). Computation still runs in script_dir/runs/<run_id> and is moved there at the end if successful.
- --threads : DIAMOND threads (default: all)
- --last-tax : Stop expanding after this specific TaxID is reached.
- --ex-tax : TaxID to exclude. For multiple exclusions, use this flag multiple times; each instance excludes the specified taxon and its subtree.
- --swissprot-only: Run DIAMOND searches only on the SwissProt database.
- --run-id <custom_id> : Custom run ID (optional, default: YYYY-MM-DD-HH-MM-TAXID). Useful for integration with external systems.
- --resume <run_id> : Resume a previous run using its run ID (format: YYYY-MM-DD-HH-MM-TAXID)
When using --resume, only the run_id is required. Brownaming reloads saved parameters from runs/<run_id>/state_args.json
Example:
```bash
python main.py --resume <run_id>
```

Brownaming determines the database location in the following order:
1. --local-db command-line argument
2. LOCAL_DB_PATH environment variable
3. local_db_path in config.json
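That precedence can be sketched as a simple fall-through; `resolve_db_path` is a name invented for this example, not Brownaming's actual function.

```python
import json
import os
from pathlib import Path

def resolve_db_path(cli_arg=None, config_file="config.json"):
    """Resolve the database directory: CLI flag, then env var, then config.json."""
    if cli_arg:                                  # 1. --local-db
        return Path(cli_arg)
    env = os.environ.get("LOCAL_DB_PATH")
    if env:                                      # 2. environment variable
        return Path(env)
    cfg = json.loads(Path(config_file).read_text())
    return Path(cfg["local_db_path"])            # 3. config.json
```

Passing --local-db therefore always wins, regardless of what the environment or config file says.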
- query_file_name_brownamed.fasta : FASTA file with updated headers containing the assigned names.
- query_file_name_diamond_results.xlsx : Excel table listing, for each query protein, the match used for naming, including homology scores (identity, evalue, bitscore, ...), and the rank and name of the last common ancestor.
- query_file_name_brownaming_stats.png : Statistics figure showing the progression through taxonomic ranks.
- YYYY-MM-DD-HH-MM-TAXID.log : Complete log file of the run (in the run directory).
To update the local database, re-run the build script:

```bash
./create_local_db.sh --refresh
```