From 33bee4c4ff101198548e9fac1a6b47b00ab4ef8f Mon Sep 17 00:00:00 2001 From: Merryalem <134082046+Merryalem@users.noreply.github.com> Date: Sun, 29 Mar 2026 06:11:24 -0700 Subject: [PATCH 1/4] Create ml_taxon_classifier_idea_byMeron.md The general Idea note --- docs/ml_taxon_classifier_idea_byMeron.md | 40 ++++++++++++++++++++++++ 1 file changed, 40 insertions(+) create mode 100644 docs/ml_taxon_classifier_idea_byMeron.md diff --git a/docs/ml_taxon_classifier_idea_byMeron.md b/docs/ml_taxon_classifier_idea_byMeron.md new file mode 100644 index 000000000..ed5293af5 --- /dev/null +++ b/docs/ml_taxon_classifier_idea_byMeron.md @@ -0,0 +1,40 @@ +# Machine Learning Taxon Classifier (GSoC Idea Contribution) + +## Motivation +Accurate classification of Anopheles mosquito species is critical for malaria control. Current approaches rely on variant calling pipelines, which can be computationally expensive and inaccessible in low-resource settings. + +## Proposed Approach +This document outlines a lightweight machine learning approach to classify mosquito samples directly from raw sequencing reads (FASTQ files). + +## Method Idea + +### 1. Feature Extraction +- Extract k-mer frequencies from raw FASTQ reads +- Use small k (e.g., k=5–7) for efficiency +- Normalize counts to create feature vectors + +### 2. Model +- Train a supervised classifier such as: + - Random Forest + - eXtream Gradient Boosting + - Gradient Boosting + - Support Vector Machine + - Light Gradient Boosting + - Logistic Regression + - Labels: major taxa (e.g., An. gambiae, An. coluzzii, An. Arabiensis, and An. funestus) + +### 3. Advantages +- No need for full variant calling +- Lower computational cost +- Faster classification pipeline + +### 4. Future Extensions +- Deep learning models (CNNs on sequence data) +- Integration into malariagen-data-python workflows +- Deployment via cloud (e.g., Google Cloud Storage) + +### 5. 
Main Idea +- Develop a lightweight FASTQ-based taxonomic classifier that learns discriminative k-mer signatures across Anopheles species and integrate them into a scalable ML model, potentially benchmarking against tools like Kraken and improving performance in low-coverage or noisy data. + +## Relevance to MalariaGEN +This approach aligns with MalariaGEN’s goal of lowering barriers to genomic data analysis and enabling scalable tools for malaria-endemic regions. From 93ee91abf8bec2424178ab610b69aef15e95c5ba Mon Sep 17 00:00:00 2001 From: Merryalem <134082046+Merryalem@users.noreply.github.com> Date: Sun, 29 Mar 2026 07:53:30 -0700 Subject: [PATCH 2/4] Add files via upload ML taxon classifier demo notebook for GSoC_by Meron Asmamaw --- ...n_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb | 397 ++++++++++++++++++ 1 file changed, 397 insertions(+) create mode 100644 notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb diff --git a/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb b/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb new file mode 100644 index 000000000..4dd173667 --- /dev/null +++ b/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb @@ -0,0 +1,397 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# ML Taxon Classifier Demo for East African Mosquito Samples by **Meron Asmamaw Alemayehu**\n", + "This notebook demonstrates my workflow for filtering East African samples from MalariaGEN AG3 metadata and preparing them for a machine-learning taxon classifier. \n", + "\n", + "I will write the code step by step, explaining what I plan to do at each stage." 
+ ], + "metadata": { + "id": "1cPntG7rZLjV" + } + }, + { + "cell_type": "markdown", + "source": [ + "# 1 Install and Import Libraries" + ], + "metadata": { + "id": "1TajsazZb9N2" + } + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oFEqsL9ZYk1B" + }, + "outputs": [], + "source": [ + "# -----------------------------\n", + "# 1. Install and import libraries\n", + "# -----------------------------\n", + "# I will install the malariagen_data package to access AG3 data\n", + "%pip install -q --no-warn-conflicts malariagen_data\n", + "\n", + "# Import libraries I will need\n", + "import malariagen_data\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import warnings\n", + "\n", + "# I will also use this to make plots render nicely in Colab\n", + "import plotly.io as pio\n", + "pio.renderers.default = \"notebook+colab\"" + ] + }, + { + "cell_type": "markdown", + "source": [ + "# 2 Load AG3 Metadata\n", + "I will load the AG3 dataset and check the metadata so I know which samples are available and what columns I can use for filtering." + ], + "metadata": { + "id": "7y2wJHxWZgwX" + } + }, + { + "cell_type": "code", + "source": [ + "# Initialize AG3 dataset\n", + "ag3 = malariagen_data.Ag3()\n", + "\n", + "# Load sample metadata into a DataFrame\n", + "samples_df = ag3.sample_metadata()\n", + "\n", + "# View first few rows to understand structure\n", + "samples_df.head()\n", + "\n", + "# I will also check all available columns to plan which ones I need\n", + "samples_df.columns.tolist()" + ], + "metadata": { + "id": "nowZv_76ZRky" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 3 Filter for East African Samples\n", + "I will filter samples to only include those from East African countries. \n", + "Later I will use the 'sample_id' column to access ENA fastq files." 
+ ], + "metadata": { + "id": "muHO3XtqZl-W" + } + }, + { + "cell_type": "code", + "source": [ + "# List of East African countries I will filter\n", + "east_africa_countries = [\n", + " \"Ethiopia\", \"Comoros, The Union of the\", \"Kenya\", \"Madagascar\",\n", + " \"Malawi\", \"Mozambique\", \"South Sudan\", \"Tanzania\", \"Uganda\",\n", + " \"Zambia\", \"Zimbabwe\"\n", + "]\n", + "\n", + "# Filter metadata for East African samples\n", + "east_africa_samples = samples_df[samples_df['country'].isin(east_africa_countries)].copy()\n", + "print(\"Total East African samples:\", len(east_africa_samples))\n", + "\n", + "# Save the filtered metadata for downstream analysis\n", + "east_africa_samples.to_csv(\"east_africa_samples_metadata.csv\", index=False)" + ], + "metadata": { + "id": "cQcOt1rOZRjF" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 4 Explore Samples\n", + "I will look at how many samples come from each country and each year. \n", + "This will help me understand the dataset before I fetch FASTQ files and do ML." 
+ ], + "metadata": { + "id": "A8cfACwdZvj3" + } + }, + { + "cell_type": "code", + "source": [ + "# Samples per country\n", + "samples_per_country = east_africa_samples['country'].value_counts()\n", + "print(samples_per_country)\n", + "\n", + "# Samples per year\n", + "samples_per_year = east_africa_samples.groupby(\"year\").size()\n", + "print(samples_per_year)\n", + "\n", + "# plot for visualization\n", + "samples_per_year.plot(kind=\"bar\", figsize=(8,4), title=\"East African samples per year\")\n", + "plt.xlabel(\"Year\")\n", + "plt.ylabel(\"Number of samples\")\n", + "plt.show()\n" + ], + "metadata": { + "id": "F70jBWFKZRhI" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 5 geographic plot" + ], + "metadata": { + "id": "FVs14s6ZZ7H2" + } + }, + { + "cell_type": "code", + "source": [ + "# geographic plot\n", + "plt.figure(figsize=(6,6))\n", + "plt.scatter(\n", + " east_africa_samples[\"longitude\"],\n", + " east_africa_samples[\"latitude\"],\n", + " alpha=0.5,\n", + " c=\"blue\",\n", + " label=\"Samples\"\n", + ")\n", + "plt.xlabel(\"Longitude\")\n", + "plt.ylabel(\"Latitude\")\n", + "plt.title(\"Geographic distribution of East African samples\")\n", + "plt.legend()\n", + "plt.show()" + ], + "metadata": { + "id": "tSyHsnzIZyWw" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# 6 Planned ML Pipeline Steps (with example code)\n", + "Here I will show **how I plan to implement the full workflow** for my taxon classifier. \n", + "For demonstration, I will use placeholder data so that the notebook runs safely in Colab." 
+ ], + "metadata": { + "id": "p-hTAYGraE1M" + } + }, + { + "cell_type": "code", + "source": [ + "# 6.1 Use 'sample_id' to get ENA run accessions and links to FASTQ files\n", + "\n", + "# `ml_metadata` is used from here on but was never defined in earlier cells.\n", + "# Assumption: it refers to the filtered East African metadata from section 3.\n", + "ml_metadata = east_africa_samples\n", + "\n", + "# I will create a toy example of how to retrieve ENA links using sample_id\n", + "# In practice, I would use ag3.ena_run_accessions() or metadata lookups\n", + "\n", + "# Example: pretend we have these sample_ids\n", + "sample_ids = ml_metadata['sample_id'].values[:5] # take first 5 for demo\n", + "\n", + "# I will create a dictionary mapping sample_id to fake ENA FTP links\n", + "ena_links = {sid: f\"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{sid}_R1.fastq.gz\" for sid in sample_ids}\n", + "print(\"Sample ENA links:\")\n", + "for k, v in ena_links.items():\n", + "    print(k, \"->\", v)" + ], + "metadata": { + "id": "xwyAPekHZyS8" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6.2 Fetch FASTQ files in Colab using !wget\n", + "### I will download files directly into Colab (not local machine) using wget. 
For demonstration, I will not actually fetch the large files" + ], + "metadata": { + "id": "2jLVsO4NaVAg" + } + }, + { + "cell_type": "code", + "source": [ + "# Example wget command for Colab (commented out to avoid downloading)\n", + "# for sample_id, link in ena_links.items():\n", + "# !wget {link} -O {sample_id}_R1.fastq.gz\n", + "\n", + "print(\"I would use wget to fetch FASTQ files for each sample_id directly in Colab.\")" + ], + "metadata": { + "id": "yZLXXJZYZyRO" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6.3 Filter FASTQ reads for high quality\n", + "I will show a placeholder example: in practice, I would parse FASTQ and filter by quality scores" + ], + "metadata": { + "id": "De2XaWwza6Ll" + } + }, + { + "cell_type": "code", + "source": [ + "# Example: pseudo-function to filter FASTQ reads\n", + "def filter_high_quality_reads(fastq_file):\n", + " \"\"\"\n", + " I will parse each read from the FASTQ file and keep only reads\n", + " where quality scores are high. 
This is a placeholder.\n", + " \"\"\"\n", + " # Placeholder: just return a list of \"high-quality reads\"\n", + " return [\"ATCG\"*25]*100 # pretend 100 high-quality reads\n", + "\n", + "# Demonstrate with one sample\n", + "sample_reads = filter_high_quality_reads(\"fake_sample.fastq.gz\")\n", + "print(\"Number of high-quality reads:\", len(sample_reads))" + ], + "metadata": { + "id": "pKI4gJlwaeWG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6.4 Extract k-mer features for each sample\n", + "I will use the filtered reads to generate k-mer counts (toy example)" + ], + "metadata": { + "id": "FCXUvI_GbOn0" + } + }, + { + "cell_type": "code", + "source": [ + "# Example function to extract k-mer features\n", + "def extract_kmer_features(reads, k=5):\n", + " \"\"\"\n", + " I will count k-mers in the reads and return a feature vector.\n", + " Here I use random numbers as a placeholder.\n", + " \"\"\"\n", + " import numpy as np\n", + " num_features = 50\n", + " return np.random.rand(num_features)\n", + "\n", + "# Apply to one sample\n", + "features = extract_kmer_features(sample_reads)\n", + "print(\"Example feature vector (length):\", len(features))" + ], + "metadata": { + "id": "OkKdO0lJaeSA" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6.5 Split dataset into train and test, and the Target Column will be the '**taxon**' for supervised ML.\n", + "I will use placeholder features for all samples to demonstrate splitting" + ], + "metadata": { + "id": "mXHpx9DWbX9N" + } + }, + { + "cell_type": "code", + "source": [ + "from sklearn.model_selection import train_test_split\n", + "import numpy as np\n", + "\n", + "# Generate fake feature matrix for all samples\n", + "num_samples = ml_metadata.shape[0]\n", + "num_features = 50\n", + "X = np.random.rand(num_samples, num_features)\n", + "\n", + "# Target labels from 'taxon' column\n", + "y = 
ml_metadata['taxon'].values\n", + "\n", + "# Split dataset\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n", + "print(\"Training samples:\", len(X_train), \"Testing samples:\", len(X_test))" + ], + "metadata": { + "id": "4rYPdfLlaeP7" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "## 6.6 Train ML classifiers\n", + "I will train a few example classifiers using the toy features to show the workflow" + ], + "metadata": { + "id": "IYqNDCMdbstJ" + } + }, + { + "cell_type": "code", + "source": [ + "from sklearn.ensemble import RandomForestClassifier\n", + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.svm import SVC\n", + "from sklearn.metrics import accuracy_score\n", + "\n", + "# Random Forest classifier\n", + "rf = RandomForestClassifier(n_estimators=50, random_state=42)\n", + "rf.fit(X_train, y_train)\n", + "y_pred_rf = rf.predict(X_test)\n", + "print(\"Random Forest toy accuracy:\", accuracy_score(y_test, y_pred_rf))\n", + "\n", + "# Logistic Regression\n", + "lr = LogisticRegression(max_iter=200)\n", + "lr.fit(X_train, y_train)\n", + "y_pred_lr = lr.predict(X_test)\n", + "print(\"Logistic Regression toy accuracy:\", accuracy_score(y_test, y_pred_lr))\n", + "\n", + "# Support Vector Machine\n", + "svm = SVC()\n", + "svm.fit(X_train, y_train)\n", + "y_pred_svm = svm.predict(X_test)\n", + "print(\"SVM toy accuracy:\", accuracy_score(y_test, y_pred_svm))" + ], + "metadata": { + "id": "koG11432bXlV" + }, + "execution_count": null, + "outputs": [] + } + ] +} \ No newline at end of file From 1e6e0663684b68e9806be9b4955a7465af83e6a5 Mon Sep 17 00:00:00 2001 From: Merryalem <134082046+Merryalem@users.noreply.github.com> Date: Sun, 29 Mar 2026 08:00:23 -0700 Subject: [PATCH 3/4] Create Meron_ML Taxon Classifier_GSoC 2026_Prop.md Description of the 'Meron_ML Taxon Classifier_GSoC 2026_Prop.ipynb' notebook --- ...eron_ML Taxon Classifier_GSoC 
2026_Prop.md | 77 +++++++++++++++++++ 1 file changed, 77 insertions(+) create mode 100644 notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md diff --git a/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md b/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md new file mode 100644 index 000000000..8fbafdcbe --- /dev/null +++ b/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md @@ -0,0 +1,77 @@ +# **This README describes the workflow for `Meron_ML Taxon Classifier_GSoC 2026_Prop.ipynb`.** + +# ML Taxon Classifier Proposal – GSoC 2026 + +This file describes my step-by-step plan for building a machine-learning classifier to assign malaria mosquito samples to major taxa using raw sequencing reads. + +--- + +## 1. Filter Samples + +I will start by using the MalariaGEN AG3 metadata to filter samples from **East African countries**. +This will give me the subset of samples I want to analyze, focusing on countries like Ethiopia, Kenya, Tanzania, Uganda, etc. + +--- + +## 2. Access ENA FASTQ Links + +Using the `sample_id` column of the filtered metadata, I will retrieve **ENA run accessions** and generate links to the raw FASTQ files. +I will not download these files locally but instead use **Colab** to access them directly via HTTP/FTP links. + +--- + +## 3. Fetch FASTQ Files in Colab + +In Colab, I will use `!wget` to fetch FASTQ files directly into the Colab environment without storing them on my local machine. +This ensures that the analysis can be done even in resource-limited environments. + +--- + +## 4. Filter FASTQ Reads + +I will filter the raw FASTQ reads to **retain only high-quality reads**. +This step ensures that downstream k-mer extraction and ML models are trained on reliable data. + +--- + +## 5. Extract K-mer Features + +From the filtered FASTQ reads, I will extract **k-mer counts** for each sample. +These k-mer features will form the **input matrix** for ML classification. + +--- + +## 6. 
Split Dataset + +I will split the k-mer feature matrix into **training and testing datasets**. +The `taxon` column in the metadata will serve as the **target label**. + +--- + +## 7. Train Machine Learning Classifiers + +I will train several ML models on the training dataset, including: + +- Random Forest +- Logistic Regression +- Support Vector Machine +- XGBoost +- LightGBM +- K-Nearest Neighbors +- Decision Tree +- Gradient Boosting + +I will evaluate the models on the testing set and choose the best-performing one for further work. + +--- + +## 8. Demonstration Notebook + +Alongside this proposal, I have included a **Colab-ready notebook** (`Meron_ML Taxon Classifier_GSoC 2026_Prop.ipynb`) that: + +- Filters East African samples +- Shows placeholder ENA FASTQ links +- Demonstrates k-mer feature extraction on toy data +- Splits dataset and trains toy classifiers + +This notebook is intended to **demonstrate my workflow and coding approach safely**, without needing to download large FASTQ files. From 8c3e29abb751d5fca1dfc0c27888be6201359447 Mon Sep 17 00:00:00 2001 From: Merryalem <134082046+Merryalem@users.noreply.github.com> Date: Sun, 29 Mar 2026 08:12:11 -0700 Subject: [PATCH 4/4] Delete docs/ml_taxon_classifier_idea_byMeron.md --- docs/ml_taxon_classifier_idea_byMeron.md | 40 ------------------------ 1 file changed, 40 deletions(-) delete mode 100644 docs/ml_taxon_classifier_idea_byMeron.md diff --git a/docs/ml_taxon_classifier_idea_byMeron.md b/docs/ml_taxon_classifier_idea_byMeron.md deleted file mode 100644 index ed5293af5..000000000 --- a/docs/ml_taxon_classifier_idea_byMeron.md +++ /dev/null @@ -1,40 +0,0 @@ -# Machine Learning Taxon Classifier (GSoC Idea Contribution) - -## Motivation -Accurate classification of Anopheles mosquito species is critical for malaria control. Current approaches rely on variant calling pipelines, which can be computationally expensive and inaccessible in low-resource settings. 
- -## Proposed Approach -This document outlines a lightweight machine learning approach to classify mosquito samples directly from raw sequencing reads (FASTQ files). - -## Method Idea - -### 1. Feature Extraction -- Extract k-mer frequencies from raw FASTQ reads -- Use small k (e.g., k=5–7) for efficiency -- Normalize counts to create feature vectors - -### 2. Model -- Train a supervised classifier such as: - - Random Forest - - eXtream Gradient Boosting - - Gradient Boosting - - Support Vector Machine - - Light Gradient Boosting - - Logistic Regression - - Labels: major taxa (e.g., An. gambiae, An. coluzzii, An. Arabiensis, and An. funestus) - -### 3. Advantages -- No need for full variant calling -- Lower computational cost -- Faster classification pipeline - -### 4. Future Extensions -- Deep learning models (CNNs on sequence data) -- Integration into malariagen-data-python workflows -- Deployment via cloud (e.g., Google Cloud Storage) - -### 5. Main Idea -- Develop a lightweight FASTQ-based taxonomic classifier that learns discriminative k-mer signatures across Anopheles species and integrate them into a scalable ML model, potentially benchmarking against tools like Kraken and improving performance in low-coverage or noisy data. - -## Relevance to MalariaGEN -This approach aligns with MalariaGEN’s goal of lowering barriers to genomic data analysis and enabling scalable tools for malaria-endemic regions.
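
Note on the feature-extraction step: the notebook's `extract_kmer_features` returns random numbers as a placeholder, while the idea document describes extracting k-mer frequencies with small k and normalizing the counts. A minimal sketch of that step could look like the following. The function name `kmer_frequency_vector`, the fixed A/C/G/T vocabulary, and the choice to skip k-mers containing other symbols (such as N) are assumptions made for illustration, not part of the submitted patches.

```python
from itertools import product


def kmer_frequency_vector(reads, k=5):
    """Return normalized k-mer frequencies for a list of read sequences.

    Hypothetical helper: the feature order is the sorted list of all 4**k
    k-mers over A/C/G/T, so every sample shares the same columns.
    """
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(vocabulary)}

    counts = [0] * len(vocabulary)
    for read in reads:
        read = read.upper()
        # Slide a window of width k across the read
        for start in range(len(read) - k + 1):
            position = index.get(read[start:start + k])
            # Assumption: k-mers with N or other non-ACGT symbols are skipped
            if position is not None:
                counts[position] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    # Normalize so that samples with different read depth are comparable
    return [count / total for count in counts]


# Same toy reads as the notebook's placeholder filter step
reads = ["ATCG" * 25] * 100
features = kmer_frequency_vector(reads, k=5)
print(len(features))
```

With k=5 the vector has 4**5 = 1024 entries, which stays small enough for the classifiers listed in the proposal. Larger k grows the vector by a factor of four per step, so a sparse or hashed representation would likely be needed beyond the k=5 to 7 range the idea document suggests.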