Scripts to automatically retrieve and process information from NCBI.
Retrieve NCBI sequences and host and organism taxonomy information based on a list of tax_ids
- To retrieve GenBank information for nucleotide sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db nucleotide -ot gb
To retrieve the records in XML format instead, add the parameter -tf xml.
- To retrieve CDS translated to amino acids from nucleotide sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db nucleotide -ot fasta_cds_aa
To retrieve untranslated CDS instead, replace fasta_cds_aa with fasta_cds_na.
- To retrieve nucleotide or amino acid sequences:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot fasta
To retrieve the sequences in XML format instead, add the parameter -tf xml.
- To retrieve taxonomy information for NCBI accession IDs:
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot gb -tx True
- To retrieve taxonomy information for the hosts of NCBI accession IDs (ideal for viruses):
python ncbi_seq_retrieve.py -in file_with_access_ids.txt -db (nucleotide or protein) -ot gb -tx True -th True
A file with IDs of nucleotide sequences cannot be used with the protein database, and vice versa. If you call the help function, it shows a table of which text formats are allowed per output type, and which output types are allowed per database.
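Because an ID file must match the database passed to -db, it can help to check the IDs before running the script. The following is a minimal sketch, not part of ncbi_seq_retrieve.py: the function names guess_database and find_mismatches are hypothetical, and the prefix and length rules are the common RefSeq and GenBank accession conventions, so unusual formats (for example WGS accessions) are reported as unrecognized rather than guessed.

```python
import re

# RefSeq prefixes (two letters followed by an underscore).
REFSEQ_PROTEIN = {"AP", "NP", "XP", "YP", "WP"}
REFSEQ_NUCLEOTIDE = {"AC", "NC", "NG", "NM", "NR", "NT", "NW", "NZ", "XM", "XR"}


def guess_database(accession):
    """Return 'protein', 'nucleotide', or None when the format is not recognized."""
    # Drop the version suffix (".1") and normalize case.
    acc = accession.strip().split(".")[0].upper()

    refseq = re.fullmatch(r"([A-Z]{2})_[A-Z0-9]+", acc)
    if refseq:
        prefix = refseq.group(1)
        if prefix in REFSEQ_PROTEIN:
            return "protein"
        if prefix in REFSEQ_NUCLEOTIDE:
            return "nucleotide"
        return None

    # GenBank protein: 3 letters + 5 or 7 digits.
    if re.fullmatch(r"[A-Z]{3}\d{5}(\d{2})?", acc):
        return "protein"
    # GenBank nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits.
    if re.fullmatch(r"[A-Z]\d{5}|[A-Z]{2}\d{6}(\d{2})?", acc):
        return "nucleotide"
    return None


def find_mismatches(accessions, database):
    """List the IDs whose guessed database differs from the one requested."""
    return [a for a in accessions if guess_database(a) != database]
```

For example, find_mismatches(["NC_045512.2", "YP_010037467.1"], "nucleotide") flags the second ID, which is a RefSeq protein accession.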
Sample a FASTA file based on taxonomy and virus name. The sequence header should follow the pattern: <ncbi-access>|<tax>|<sequence name>[<virus name>].
For example, a file whose headers look like YP_010037467.1|Alphacoronavirus|polyprotein 1ab [Alphacoronavirus sp.] can be used to sample by genus.
python split_by_tax.py input.fasta output_directory seed
With test data:
python split_by_tax.py test_data/ncbi_virus.fa test_out 123
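To make the expected header pattern concrete, here is a minimal sketch of how such headers could be parsed and grouped by the <tax> field. It is an illustration only and not the code of split_by_tax.py; the names HEADER_RE, parse_header, and group_by_tax are hypothetical.

```python
import re
from collections import defaultdict

# <ncbi-access>|<tax>|<sequence name>[<virus name>], with an optional leading ">".
HEADER_RE = re.compile(
    r"^>?(?P<accession>[^|]+)\|(?P<tax>[^|]+)\|(?P<name>.*?)\s*\[(?P<virus>[^\]]*)\]\s*$"
)


def parse_header(header):
    """Split a FASTA header into accession, tax, sequence name, and virus name."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"header does not follow the expected pattern: {header!r}")
    return match.groupdict()


def group_by_tax(headers):
    """Map each <tax> value (for example a genus) to the accessions found under it."""
    groups = defaultdict(list)
    for header in headers:
        fields = parse_header(header)
        groups[fields["tax"]].append(fields["accession"])
    return dict(groups)


if __name__ == "__main__":
    example = "YP_010037467.1|Alphacoronavirus|polyprotein 1ab [Alphacoronavirus sp.]"
    print(parse_header(example))
```

Headers that do not match the pattern raise a ValueError, which makes malformed input visible before any sampling is attempted.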