PROBEst is a tool designed for generating nucleotide probes with specified properties, leveraging advanced algorithms and AI-driven techniques to ensure high-quality results. The tool is particularly useful for researchers and bioinformaticians who require probes with tailored universality and specificity for applications such as PCR, hybridization, and sequencing. By integrating a wrapped evolutionary algorithm, PROBEst optimizes probe generation through iterative refinement, ensuring that the final probes meet stringent biological and computational criteria.
At the core of PROBEst is an AI-enhanced workflow that combines Primer3 or OligoMiner for initial oligonucleotide generation, BLASTn for specificity and universality checks, and a mutation module for probe optimization. The tool allows users to input target sequences, select reference files for universality and specificity validation, and customize layouts for probe design. The evolutionary algorithm iteratively refines the probes by introducing mutations and evaluating their performance, ensuring that the final output is both specific to the target and universally applicable across related sequences.
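The mutate-and-evaluate loop described above can be illustrated with a minimal sketch. It is not PROBEst's implementation: the `toy_score` fitness is a hypothetical stand-in for the BLASTn-based universality and specificity checks, and all function names and parameters are assumptions made for this example.

```python
import random

BASES = "ACGT"


def mutate(probe, rate, rng):
    """Return a copy of `probe` with each base substituted with probability `rate`."""
    return "".join(
        rng.choice([b for b in BASES if b != base]) if rng.random() < rate else base
        for base in probe
    )


def toy_score(probe, target):
    """Hypothetical fitness: positions where the probe matches the target.

    In the real tool this role is played by BLASTn-based checks.
    """
    return sum(a == b for a, b in zip(probe, target))


def evolve(probes, target, iterations=20, top_n=5, children=4, rate=0.1, seed=0):
    """Iteratively mutate probes and keep the best-scoring candidates."""
    rng = random.Random(seed)
    population = list(probes)
    for _ in range(iterations):
        offspring = [mutate(p, rate, rng) for p in population for _ in range(children)]
        # Parents stay in the pool, so the best score never decreases.
        pool = set(population) | set(offspring)
        population = sorted(pool, key=lambda p: (-toy_score(p, target), p))[:top_n]
    return population


best = evolve(["AAAAAAAAAAAAAAAAAAAA"], "ACGTACGTACGTACGTACGT")
print(best[0], toy_score(best[0], "ACGTACGTACGTACGTACGT"))
```

Keeping the parents in the candidate pool is a deliberate choice here: it guarantees that a good probe is never lost to an unlucky mutation round.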
git clone https://github.com/CTLab-ITMO/PROBESt.git
cd PROBESt
bash setup/install.sh
# validate installation with
# setup/test_generator.sh
PROBEst can be run using the following command:
python pipeline.py \
-i {INPUT} \
-tb {TRUE_BASE} \
-fb [FALSE_BASE ...] \
-c {CONTIG_TABLE} \
-o {OUTPUT}
The BLASTn databases and the contig table are produced by prep_db.sh.
-i INPUT: Input FASTA file (or directory with fasta / fasta.gz files) for the initial probe set generation.
-tb TRUE_BASE: Input BLASTn database path for primer adjusting.
-fb FALSE_BASE: Input BLASTn database path for non-specific testing.
-c CONTIG_TABLE: .tsv table with BLAST database information.
-o OUTPUT: Output path for results.
-t THREADS: Number of threads to use.
-a ALGORITHM: Algorithm for probe generation (FISH or primer).
--initial_generator: Tool for initial probe set generation (primer3 or oligominer, default: primer3).
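When the pipeline is launched from another script, the documented flags can be assembled programmatically. The helper below is a small sketch: `build_pipeline_command` and all file paths are hypothetical, and only the flags listed above are used.

```python
import shlex


def build_pipeline_command(input_path, true_base, false_bases, contig_table, output,
                           threads=None, algorithm=None, initial_generator=None):
    """Assemble the argument list for pipeline.py from the documented flags."""
    cmd = [
        "python", "pipeline.py",
        "-i", input_path,
        "-tb", true_base,
        "-fb", *false_bases,  # -fb accepts one or more databases
        "-c", contig_table,
        "-o", output,
    ]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if algorithm is not None:
        cmd += ["-a", algorithm]
    if initial_generator is not None:
        cmd += ["--initial_generator", initial_generator]
    return cmd


# Hypothetical paths, for illustration only.
command = build_pipeline_command(
    "input.fasta", "db/true_base", ["db/false_base_1", "db/false_base_2"],
    "db/contigs.tsv", "results", threads=4, initial_generator="primer3",
)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to start the run.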
For a full list of arguments, run:
python pipeline.py --help
For parameter selection, a grid search is implemented. You can specify parameters in JSON (see, for example, data/test/general/param_grid_light.json) and run:
python test_parameters.py \
-p {JSON}
pipeline.py relies on pre-prepared BLASTn databases. To create the required true_base, false_base, and contig_table, you can use the following script:
bash scripts/generator/prep_db.sh \
-n {DATABASE_NAME} \
-c {CONTIG_NAME} \
-t {TMP_DIR} \
[FASTA]
-n DATABASE_NAME: Name of the output BLAST database (required).
-c CONTIG_TABLE: Output file to store contig names and their corresponding sequence headers (required).
-t TMP_DIR: Temporary directory for intermediate files (optional, defaults to ./.tmp).
FASTA: List of input FASTA files (gzipped or uncompressed).
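To make the idea of a contig table more concrete, the sketch below collects sequence headers from plain or gzipped FASTA files and writes them to a tab-separated file. The two-column layout (source file name, header) and the function names are assumptions for illustration; the table actually written by prep_db.sh may be organized differently.

```python
import csv
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file in text mode, transparently handling .gz files."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def collect_headers(fasta_paths):
    """Return (file name, header) pairs for every sequence in the given files."""
    rows = []
    for path in fasta_paths:
        name = Path(path).name
        with open_fasta(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    rows.append((name, line[1:].strip()))
    return rows


def write_table(rows, out_path):
    """Write the pairs as a tab-separated table (assumed layout)."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Example (hypothetical file names):
# write_table(collect_headers(["genome_a.fasta", "genome_b.fasta.gz"]), "contigs.tsv")
```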
PROBEst includes a user-friendly web interface for probe generation. To launch the web app, run:
python app/app.py
For detailed web app documentation, see app/README.md.
- Prepare BLASTn databases
- Select File for Probe Generation (INPUT)
- Select Files for Universality Check (TRUE_BASE)
- Select Files for Specificity Check (FALSE_BASE)
- Select Layouts and Run Wrapped Evolutionary Algorithm (pipeline.py)
  a. Primer3 Generation
  b. BLASTn Check
  c. Parsing
  d. Mutation in Probe
  e. AI corrections
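The "BLASTn Check" and "Parsing" steps can be sketched as follows, assuming BLASTn is run with the standard 12-column tabular output (`-outfmt 6`). Whether PROBEst uses this exact format is an assumption, and `parse_blast_tabular` and the identity threshold are hypothetical names and values chosen for the example.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
OUTFMT6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_blast_tabular(lines, min_identity=90.0):
    """Map each probe (query ID) to the list of subject IDs it hits.

    Hits below `min_identity` percent identity are discarded. Repeated
    subject IDs are kept, so multiple hits on one contig stay visible.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        record = dict(zip(OUTFMT6_FIELDS, line.split("\t")))
        if float(record["pident"]) >= min_identity:
            hits[record["qseqid"]].append(record["sseqid"])
    return dict(hits)


example = [
    "probe1\tcontigA\t100.0\t20\t0\t0\t1\t20\t5\t24\t1e-5\t40.1",
    "probe2\tcontigB\t80.0\t20\t4\t0\t1\t20\t7\t26\t0.1\t20.0",
]
print(parse_blast_tabular(example))
```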
---
config:
  layout: elk
  look: classic
---
%%{init: {
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': 'white',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
subgraph inputs
A
A1
T1
T3
end
A([Initial probe generation]):::input -- primer3 --> B2(initial probe set):::probe
A -- oligominer --> B2
A1([Custom probes]):::input --> B2
T1([Target sequences]):::input -- blastn-db --> T2[(target)]
T3([Offtarget sequences]):::input -- blastn-db --> T4[(offtarget)]
subgraph database
T2
T4
end
T2 --> EA
T4 --> EA
B2 --> EA
EA[evolutionary algorithm] --> T11(results):::probe
classDef empty width:0px,height:0px;
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
---
config:
  layout: elk
  look: classic
---
%%{init: {
'layout': 'elk',
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': 'white',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
subgraph evolutionary algorithm
subgraph hits
TP
TN
end
B(probe set):::probe --> TP[target]
B --> TN[offtarget]
B1 -- mutations --> B
TP -- coverage --> T6[universality]
TP -- duplications --> T7[multimapping]
TN ---> T8[specificity]
subgraph check
T6
T7
T8
M1
E3
end
B --- E6[ ]:::empty --> M1[modeling]
TP --- E6
M1 --- E3[ ]:::empty
T6 --- E3
T7 --- E3
T8 --- E3
E3 -- quality prediction --> B1(filtered probe set):::probe
end
B1 --> T11(results):::probe
classDef empty width:0px,height:0px;
classDef input fill:#90EE9020,stroke:#fff,stroke-width:2px,shape:ellipse;
classDef probe fill:#FFAC1C20,stroke:#fff,stroke-width:2px;
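The three checks in the diagram above (universality from target coverage, multimapping from duplicated target hits, specificity from off-target hits) could be computed per probe roughly as follows. The definitions used here are illustrative assumptions, not PROBEst's actual scoring, and `probe_metrics` and `passes_filter` are hypothetical names.

```python
def probe_metrics(target_hits, offtarget_hits, n_targets):
    """Summarize one probe's hits.

    target_hits:    target contig IDs hit by the probe (repeats allowed)
    offtarget_hits: off-target contig IDs hit by the probe
    n_targets:      total number of target contigs in the database
    """
    unique_targets = set(target_hits)
    return {
        # Fraction of target contigs covered by at least one hit.
        "universality": len(unique_targets) / n_targets if n_targets else 0.0,
        # Extra hits on contigs that were already hit once.
        "multimapping": len(target_hits) - len(unique_targets),
        # Any off-target hit counts against specificity.
        "offtarget_hits": len(offtarget_hits),
    }


def passes_filter(metrics, min_universality=0.5, max_multimapping=0, max_offtarget=0):
    """Hypothetical filter combining the three checks."""
    return (
        metrics["universality"] >= min_universality
        and metrics["multimapping"] <= max_multimapping
        and metrics["offtarget_hits"] <= max_offtarget
    )


metrics = probe_metrics(["t1", "t2", "t3"], [], n_targets=4)
print(metrics, passes_filter(metrics))
```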
---
config:
  theme: neutral
  look: classic
---
%%{init: {
'theme': 'base',
'themeVariables': {
'fontFamily': 'arial',
'fontSize': '16px',
'primaryColor': '#fff',
'primaryBorderColor': '#FFAC1C',
'primaryTextColor': '#000',
'lineColor': '#000',
'secondaryColor': '#90EE90',
'tertiaryColor': '#fff',
'subgraphBorderStyle': 'dotted'
},
'flowchart': {
'curve': 'monotoneY',
'padding': 15
}
}}%%
graph LR
PROBEst([PROBEst]) --> src[src/]
PROBEst --> scripts[scripts/]
PROBEst --> tests[tests/]
PROBEst --> app[app/]
subgraph folders
src
scripts
tests
app
end
src --> C[benchmarking]
src --> A[generation]
tests --> A
scripts --> D[preprocessing]
scripts --> B[database parsing]
D --> A
app --> E[web interface]
E --> A
- To check the installation: bash test_generator.sh
- For developers: use pytest
If you use the PROBEst LLM pipeline for extracting research data, please cite:
BibTeX:
@article{202511.2140,
doi = {10.20944/preprints202511.2140.v1},
url = {https://doi.org/10.20944/preprints202511.2140.v1},
year = 2025,
month = {November},
publisher = {Preprints},
author = {Alexandr Serdiukov and Vitaliy Dragvelis and Daniil Smutin and Amir Taldaev and Sergey Muravyov},
title = {Efficient and Verified Extraction of the Research Data Using LLM},
journal = {Preprints}
}
Plain text: Serdiukov, A., Dragvelis, V., Smutin, D., Taldaev, A., & Muravyov, S. (2025). Efficient and Verified Extraction of the Research Data Using LLM. Preprints. https://doi.org/10.20944/preprints202511.2140.v1
This project is licensed under the MIT License - see the LICENSE file for details.
We welcome contributions from the community! To contribute, please read the Contribution Guidelines for more details.
The tool has its own Wiki pages with detailed information on use cases, data descriptions, and other necessary information.
