A FASTQ-based machine learning classifier that identifies Anopheles mosquito species directly from raw sequencing reads, with no alignment or variant calling required.

Built as part of a MalariaGEN GSoC 2026 proposal for lightweight taxonomic identification of malaria vectors.

Taxonomic identification of Anopheles mosquitoes is critical for malaria control but challenging because many species are morphologically identical. Misidentification leads to wasted sequencing resources and flawed epidemiological insights.

This tool classifies mosquito samples into three major species groups using only raw FASTQ reads:
| Species | Region |
|---|---|
| An. gambiae complex | Sub-Saharan Africa |
| An. funestus | Sub-Saharan Africa |
| An. stephensi | South/Southeast Asia |
- No alignment needed: works directly on raw FASTQ reads streamed from ENA
- K-mer frequency profiles: uses `HashingVectorizer` to extract character n-gram features
- LightGBM classifier: fast, handles sparse high-dimensional data natively
- Multiple k-mer sizes: tested with k=15 (98.3% accuracy) and k=21 (97.5% accuracy)
- Noise robust: maintains 100% accuracy up to 10% base substitution error rate
- Low read depth: reliable classification with as few as 1,000 reads
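
The noise and read-depth claims above can be probed with simple perturbations of the input reads. The following is a minimal sketch using only the standard library, not the repository's `poison_test.py`; the function names `substitute_bases` and `subsample_reads` are hypothetical, and it assumes reads are plain strings over A/C/G/T.

```python
import random

BASES = "ACGT"

def substitute_bases(read: str, error_rate: float, rng: random.Random) -> str:
    """Replace each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            # Choose among the other bases so every hit is a real substitution.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def subsample_reads(reads: list[str], depth: int, rng: random.Random) -> list[str]:
    """Draw up to `depth` reads without replacement to mimic a shallow run."""
    return rng.sample(reads, min(depth, len(reads)))

# Toy input: 5,000 identical reads, reduced to 1,000 and given 10% noise.
rng = random.Random(42)
reads = ["ACGTACGTACGTACGTACGT"] * 5000
noisy = [substitute_bases(r, 0.10, rng) for r in subsample_reads(reads, 1000, rng)]
```

Feeding `noisy` through the same feature extraction and model as the clean reads, at several error rates and depths, would give the kind of robustness curve described above.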
| Metric | k=15 | k=21 |
|---|---|---|
| Accuracy | 98.3% | 97.5% |
| Macro-F1 | 0.983 | 0.975 |
| Compression Ratio | 10³:1 | 10⁶:1 |
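
For readers unfamiliar with the Macro-F1 row, it is the unweighted mean of the per-class F1 scores, so each species counts equally regardless of sample count. The sketch below is a plain-Python illustration of that definition, not the project's evaluation code; the labels are made-up toy data.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (illustrative only, not results from this project).
y_true = ["gambiae", "gambiae", "funestus", "funestus", "stephensi", "stephensi"]
y_pred = ["gambiae", "gambiae", "funestus", "stephensi", "stephensi", "stephensi"]
score = macro_f1(y_true, y_pred)
```

In this toy case the per-class F1 scores are 1.0, 2/3, and 0.8, so `score` is about 0.822.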
Raw FASTQ reads → K-mer extraction → Hashing (2²⁰ bins) → LightGBM → Species prediction
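
The k-mer extraction and hashing stages of this pipeline can be illustrated without any dependencies. This is a sketch of the general feature-hashing idea, not the project's code: it uses `zlib.crc32` as a stand-in hash (scikit-learn's `HashingVectorizer` uses a different hash function), and skipping k-mers that contain `N` is an assumption made here. A roughly comparable scikit-learn setup would be `HashingVectorizer(analyzer="char", ngram_range=(k, k), n_features=2**20)`, though the repository's exact parameters are not stated above.

```python
import zlib

N_BINS = 2 ** 20  # number of hash bins, as in the pipeline description

def kmer_profile(reads, k=15, n_bins=N_BINS):
    """Return sparse hashed k-mer counts as a {bin_index: count} dict."""
    counts = {}
    for read in reads:
        seq = read.upper()
        # Slide a window of length k across the read.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" in kmer:
                continue  # assumption: ignore k-mers with ambiguous bases
            # Map the k-mer to one of n_bins buckets (collisions are possible).
            idx = zlib.crc32(kmer.encode("ascii")) % n_bins
            counts[idx] = counts.get(idx, 0) + 1
    return counts

# A 20-base read yields 20 - 15 + 1 = 6 k-mers at k=15.
example = kmer_profile(["ACGTACGTACGTACGTACGT"], k=15)
```

Because the feature width is fixed at 2²⁰ regardless of `k`, the same downstream model input shape works for both k=15 and k=21.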
- Extract: Stream FASTQ reads from ENA and extract k-mer frequency profiles using `HashingVectorizer`
- Train: Train a LightGBM model with stratified cross-validation
- Evaluate: Predict species from unseen FASTQ files with confidence scores
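
The Extract step depends on reading FASTQ records as a stream rather than loading whole files. A minimal sketch of that idea follows, assuming standard four-line FASTQ records and gzip compression; the helper names `iter_fastq_sequences` and `first_reads` are hypothetical and not taken from the repository.

```python
import gzip
import io
from itertools import islice

def iter_fastq_sequences(handle):
    """Yield the sequence line of each four-line FASTQ record in a text stream."""
    while True:
        record = list(islice(handle, 4))
        if not record or not record[0].strip():
            return  # end of stream (or trailing blank line)
        if len(record) < 4 or not record[0].startswith("@") or not record[2].startswith("+"):
            raise ValueError("truncated or malformed FASTQ record")
        yield record[1].strip()

def first_reads(handle, n_reads):
    """Collect at most n_reads sequences, reading no more records than needed."""
    return list(islice(iter_fastq_sequences(handle), n_reads))

# In-memory gzip stream standing in for a downloaded FASTQ file.
raw = b"@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rt") as handle:
    demo_reads = first_reads(handle, 1)
```

Because the generator stops after `n_reads` records, only the start of a large file needs to be read, which fits the low read depth use case.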
Clone the repository and create a virtual environment:

    git clone https://github.com/31puneet/Taxon-Classifier.git
    cd Taxon-Classifier
    python -m venv venv
    source venv/bin/activate  # Linux/Mac
    # venv\Scripts\activate   # Windows

Install the dependencies:

    pip install -r requirements.txt

Evaluate the models:

    cd src/evaluation
    python evaluate_models.py --k 15

Run the noise robustness test from the same directory:

    python poison_test.py --k 15

Run the full pipeline from the repository root:

    cd src
    python main.py --k 15

Or run individual steps:

    cd src/scripts
    python extract_kmers.py --k 15
    python train_models.py --k 15
    cd ../evaluation
    python evaluate_models.py --k 15

Open notebooks/visualization.ipynb in Jupyter to generate plots:

    from evaluation.visualize import plot_tsne, plot_k_comparison, plot_poison_comparison, plot_depth_curve

    plot_tsne(k=15)            # t-SNE species clustering
    plot_k_comparison()        # k=15 vs k=21 comparison
    plot_poison_comparison()   # Noise robustness curves
    plot_depth_curve(k=15)     # Read depth vs confidence