File Format
In structure file names, replace characters such as |, : and . with _.
For sequence IDs in a fasta file, you can use any ID, but every |, : and . will be automatically replaced by _. However, for UniProt headers such as sp|UniqueIdentifier|EntryName or tr|UniqueIdentifier|EntryName, the UniqueIdentifier is extracted and used as the sequence ID.
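The ID rule above can be sketched in a few lines of Python. This is only an illustration of the behaviour described here, not ASMC's own code; the function name `sequence_id` is hypothetical.

```python
import re


def sequence_id(raw_id: str) -> str:
    """Return the sequence ID implied by the ID part of a fasta header."""
    raw_id = raw_id.lstrip(">").strip()
    # UniProt-style headers: keep only the UniqueIdentifier field.
    match = re.match(r"^(?:sp|tr)\|([^|]+)\|", raw_id)
    if match:
        return match.group(1)
    # Any other ID: replace every |, : and . with an underscore.
    return re.sub(r"[|:.]", "_", raw_id)


print(sequence_id("sp|P12345|NAME_HUMAN"))
print(sequence_id("my:seq.1|x"))
```

Renaming structure files by hand with the same substitution keeps file names and sequence IDs consistent.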
Input of:
- subcommand run with the option -r/--ref
- subcommand identity with the option -r/--ref-str
Mandatory to run ASMC based on structures.
This file contains one reference structure path per line, e.g:
/home/User/data/RefA.pdb
/home/User/data/RefZ.pdb
Input of:
- subcommand run with the option -p/--pocket
Output of:
- subcommand run if the option -p/--pocket isn't provided
File used to indicate the active site positions. The format is ID,Chain,pos..., e.g.:
RefA,A,55,57,59,77,101,102,129,130,131,145,148
RefZ,A,89,91,93,118,142,143,170,171,172,197,198
If not provided, this file will be built by ASMC.
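For reference, a pocket file in the ID,Chain,pos... format can be read with the standard csv module. This is a minimal sketch; `parse_pocket` is a hypothetical helper, not part of ASMC.

```python
import csv
import io


def parse_pocket(handle):
    """Map each reference ID to (chain, [positions]) from an ID,Chain,pos... file."""
    pockets = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        ref_id, chain, *positions = row
        pockets[ref_id] = (chain, [int(p) for p in positions])
    return pockets


example = io.StringIO(
    "RefA,A,55,57,59,77,101,102,129,130,131,145,148\n"
    "RefZ,A,89,91,93,118,142,143,170,171,172,197,198\n"
)
pockets = parse_pocket(example)
print(pockets["RefA"])
```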
Input of:
- subcommand run with the option -m/--models
- subcommand identity with the option -r/--ref-str*
*only the first column is necessary.
Output of:
- subcommand run if the option -s/--seqs is provided
File built by ASMC if the modelling step is performed. Otherwise, you must build the file yourself and provide it to the subcommand run, e.g.:
/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb
/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb
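A small sketch of reading such a file follows. It assumes the two columns are separated by whitespace and that paths contain no spaces; the second column is treated as optional, since only the first one is needed by the subcommand identity. `parse_models` is a hypothetical helper.

```python
def parse_models(lines):
    """Return (model_path, reference_path or None) for each non-empty line."""
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        reference = fields[1] if len(fields) > 1 else None
        pairs.append((fields[0], reference))
    return pairs


pairs = parse_models([
    "/home/User/data/models/target_1.pdb /home/User/data/RefA.pdb",
    "/home/User/data/models/target_2.pdb /home/User/data/RefZ.pdb",
])
print(pairs)
```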
Input of:
- subcommand run with the option -a/--active-sites
Output of:
- subcommand run if the option -a/--active-sites isn't provided
- subcommand run also returns a fasta file for each group which can be used with the option -a/--active-sites
This is simply a fasta file containing all active site sequences to be clustered, e.g:
>A0A015SZL4
MGAPECWKFSRHHEYERD
>A0A015TUY7
MGAPECWKFSRHHEYERD
>A0A017H2J5
MGAPECWEKANLREYKGA
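To inspect such a file before clustering, a plain fasta reader is enough. The sketch below (with a hypothetical `read_fasta` helper) loads the example above and counts how often each active site sequence occurs.

```python
from collections import Counter


def read_fasta(lines):
    """Return {sequence_id: sequence} from the lines of a fasta file."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].strip()
            sequences[current_id] = ""
        elif current_id is not None:
            sequences[current_id] += line  # supports multi-line sequences
    return sequences


sites = read_fasta([
    ">A0A015SZL4", "MGAPECWKFSRHHEYERD",
    ">A0A015TUY7", "MGAPECWKFSRHHEYERD",
    ">A0A017H2J5", "MGAPECWEKANLREYKGA",
])
counts = Counter(sites.values())
print(counts.most_common())
```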
Input of:
- subcommand run with the option -M/--msa
If only one reference is used, the file should contain two pieces of information:
- The active site positions in the reference sequence
- The path to the multiple sequence alignment
refA,55,57,59,77,101,102,129,130,131,145,148
/home/User/data/multiple_sequence_alignment.fasta
If there are multiple references, the file must also contain 1) the pocket positions of each reference and 2) the path to a file similar to identity_targets_refs.tsv (see below):
refA,55,57,59,77,101,102,129,130,131,145,148
RefZ,89,91,93,118,142,143,170,171,172,197,198
/home/User/data/identity_targets_refs.tsv
/home/User/data/multiple_sequence_alignment.fasta
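The two layouts can be told apart line by line. The sketch below assumes that lines containing commas are reference position lines and that every other non-empty line is a path (so paths must not contain commas); `parse_msa_config` is a hypothetical helper, not ASMC's parser.

```python
def parse_msa_config(lines):
    """Split the file into {reference_id: [positions]} and a list of paths."""
    references = {}
    paths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "," in line:
            ref_id, *positions = line.split(",")
            references[ref_id] = [int(p) for p in positions]
        else:
            paths.append(line)  # kept in file order
    return references, paths


references, paths = parse_msa_config([
    "refA,55,57,59,77,101,102,129,130,131,145,148",
    "RefZ,89,91,93,118,142,143,170,171,172,197,198",
    "/home/User/data/identity_targets_refs.tsv",
    "/home/User/data/multiple_sequence_alignment.fasta",
])
print(sorted(references), paths)
```

With a single reference, the same function returns one reference entry and one path.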
Input of:
- subcommand identity with the option -R/--ref-seq
This file contains one reference ID per line, e.g:
RefA
RefZ
Input of:
- subcommand compare with the options -f1 and -f2
- subcommand to_xlsx with the option -f/--file
Output of:
- subcommand run
The x corresponds to the -e/--eps value of the subcommand run. By default, the value is auto, so it is chosen automatically before the clustering, based on the distribution of normalised distances.
The y corresponds to the --min-samples value of the subcommand run. By default, the value is auto, which gives 5 if the number of samples is ≤ 1500 and 25 otherwise.
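The auto rule for --min-samples is simple enough to write down. The function below only restates the rule given above; its name is hypothetical, and the automatic choice of -e/--eps is not reproduced because its procedure is not detailed here.

```python
def default_min_samples(n_samples: int) -> int:
    """Value used for --min-samples when it is left on auto."""
    return 5 if n_samples <= 1500 else 25


print(default_min_samples(1500), default_min_samples(1501))
```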
The format is ID Active_site_sequence Group_id, e.g:
ID1 ACQGINFIRVDYEIHIGMGGT -1
ID2 SAEGINLMRNSFVQHVGHQGT 0
ID3 SAEGINFVRNSFVQHVGHQGT 0
ID4 SCEGVNFVRVDRLVHVGLIGT 1
ID5 SCEGVNFIRVDRLVHVGLIGT 1
Note: Group numbering starts at 0, and -1 is the ID for outliers.
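As an illustration, the file can be loaded into one list per group, with outliers under the key -1. The sketch assumes whitespace-separated columns; `read_groups` is a hypothetical helper.

```python
from collections import defaultdict


def read_groups(lines):
    """Return {group_id: [(sequence_id, active_site_sequence), ...]}."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        seq_id, site, group_id = fields
        groups[int(group_id)].append((seq_id, site))
    return dict(groups)


groups = read_groups([
    "ID1 ACQGINFIRVDYEIHIGMGGT -1",
    "ID2 SAEGINLMRNSFVQHVGHQGT 0",
    "ID3 SAEGINFVRNSFVQHVGHQGT 0",
    "ID4 SCEGVNFVRVDRLVHVGLIGT 1",
    "ID5 SCEGVNFIRVDRLVHVGLIGT 1",
])
outliers = [seq_id for seq_id, _ in groups.get(-1, [])]
print(outliers, sorted(groups))
```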
Input of:
- subcommand run as a path in a file to provide to the option -M/--msa, see above
- subcommand compare with the option -id
Output of:
- subcommand run if the option -s/--seqs is provided (homology modelling performed by ASMC with MODELLER)
- subcommand identity
Output example for the subcommand identity:
id1 refA 62.50
id2 refA 68.75
id3 refZ 68.75
id4 refZ 50.00
id5 refZ 62.50
Note: the identity_targets_refs.tsv returned by the subcommand run has 4 columns; the last column contains the value of the --id option.
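A sketch of reading this file is shown below. It keeps the first three columns and ignores the optional fourth one, assuming whitespace-separated columns; `read_identity` and the 65 % threshold are purely illustrative.

```python
def read_identity(lines):
    """Return (target_id, reference_id, percent_identity) for each row."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        rows.append((fields[0], fields[1], float(fields[2])))
    return rows


rows = read_identity([
    "id1 refA 62.50",
    "id2 refA 68.75",
    "id3 refZ 68.75",
    "id4 refZ 50.00",
    "id5 refZ 62.50",
])
# Illustrative filter: targets with at least 65 % identity to their reference.
close_targets = [target for target, _, percent in rows if percent >= 65.0]
print(close_targets)
```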
Output of:
- subcommand
compare
The format of this file is: ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ, e.g:
ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ
ID22 0 FGSNLGCYEVFMYP 0 FGSNLGCYEVFMYP 0 REFC 16.81 LPSQLDWYEVMEYP
ID45 0 ILSKVAWFEVFVPG -1 ILS-VAWFEAVIYP 5 REFB 18.14 VLSAAAWYEIIVYP
ID48 0 VGSEVTWYESAMYP 0 VGSSVTWYESAMYP 1 REFD 26.85 LGSQVTWYEIIIYP
ID61 0 IASQMGWYEAIIYP 0 IASQMGWYEAIIYP 0 REFB 39.82 VLSAAAWYEIIVYP
ID67 0 ILSAAAWYEIIVYP 0 ILSAAAWYEIIVYP 0 REFB 51.77 VLSAAAWYEIIVYP
Note: The values in the G1 and G2 columns are just the group IDs from their respective runs. Two 0s do not mean that the two groups have the same members. However, multiple runs with the same parameters on the same active site alignment always return the same clusters.
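For a quick look at the sequences that differ between two runs, the file can be parsed into typed rows and filtered on the DIFF column. This is a minimal sketch assuming whitespace-separated columns and a header line starting with ID G1; `CompareRow` and `read_compare` are hypothetical names.

```python
from typing import NamedTuple


class CompareRow(NamedTuple):
    seq_id: str
    g1: int
    seq1: str
    g2: int
    seq2: str
    diff: int
    ref_id: str
    perc_id: float
    ref_seq: str


def read_compare(lines):
    """Parse the compare output, skipping the header and malformed lines."""
    rows = []
    for line in lines:
        f = line.split()
        if len(f) != 9 or f[:2] == ["ID", "G1"]:
            continue
        rows.append(CompareRow(f[0], int(f[1]), f[2], int(f[3]), f[4],
                               int(f[5]), f[6], float(f[7]), f[8]))
    return rows


compare_rows = read_compare([
    "ID G1 SEQ1 G2 SEQ2 DIFF REF_ID PERC_ID REF_SEQ",
    "ID22 0 FGSNLGCYEVFMYP 0 FGSNLGCYEVFMYP 0 REFC 16.81 LPSQLDWYEVMEYP",
    "ID45 0 ILSKVAWFEVFVPG -1 ILS-VAWFEAVIYP 5 REFB 18.14 VLSAAAWYEIIVYP",
    "ID48 0 VGSEVTWYESAMYP 0 VGSSVTWYESAMYP 1 REFD 26.85 LGSQVTWYEIIIYP",
])
changed = [row.seq_id for row in compare_rows if row.diff > 0]
print(changed)
```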