From 33bee4c4ff101198548e9fac1a6b47b00ab4ef8f Mon Sep 17 00:00:00 2001
From: Merryalem <134082046+Merryalem@users.noreply.github.com>
Date: Sun, 29 Mar 2026 06:11:24 -0700
Subject: [PATCH 1/4] Create ml_taxon_classifier_idea_byMeron.md

Add the general idea note
---
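Note on the feature-extraction step (k-mer frequencies, small k, normalized
counts): a minimal sketch of one way it could look, using only the Python
standard library. The function name `kmer_frequencies` and the choice to skip
windows containing `N` are assumptions for illustration, not part of this patch.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(reads, k=5):
    """Return (kmers, freqs): all 4**k k-mers in lexicographic order
    and their relative frequencies across the given reads."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    valid = set(kmers)
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in valid:  # skips windows with N or other ambiguity codes
                counts[kmer] += 1
    total = sum(counts.values())
    # Normalize counts so that samples with different read depth are comparable
    freqs = [counts[km] / total if total else 0.0 for km in kmers]
    return kmers, freqs


# Toy reads only; real input would be sequences parsed from FASTQ
kmers, freqs = kmer_frequencies(["ATCG" * 25] * 100, k=5)
print(len(freqs), round(sum(freqs), 6))
```

With k=5 the vector has 4**5 = 1024 entries and with k=7 it has 16384, so the
feature matrix stays small enough for the classifiers listed in the note.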
 docs/ml_taxon_classifier_idea_byMeron.md | 40 ++++++++++++++++++++++++
 1 file changed, 40 insertions(+)
 create mode 100644 docs/ml_taxon_classifier_idea_byMeron.md

diff --git a/docs/ml_taxon_classifier_idea_byMeron.md b/docs/ml_taxon_classifier_idea_byMeron.md
new file mode 100644
index 000000000..ed5293af5
--- /dev/null
+++ b/docs/ml_taxon_classifier_idea_byMeron.md
@@ -0,0 +1,40 @@
+# Machine Learning Taxon Classifier (GSoC Idea Contribution)
+
+## Motivation
+Accurate classification of Anopheles mosquito species is critical for malaria control. Current approaches rely on variant calling pipelines, which can be computationally expensive and inaccessible in low-resource settings.
+
+## Proposed Approach
+This document outlines a lightweight machine learning approach to classify mosquito samples directly from raw sequencing reads (FASTQ files).
+
+## Method Idea
+
+### 1. Feature Extraction
+- Extract k-mer frequencies from raw FASTQ reads
+- Use small k (e.g., k=5–7) for efficiency
+- Normalize counts to create feature vectors
+
+### 2. Model
+- Train a supervised classifier such as:
+  - Random Forest
+  - eXtream Gradient Boosting
+  - Gradient Boosting
+  - Support Vector Machine
+  - Light Gradient Boosting 
+  - Logistic Regression
+  - Labels: major taxa (e.g., An. gambiae, An. coluzzii, An.  Arabiensis, and An. funestus)
+
+### 3. Advantages
+- No need for full variant calling
+- Lower computational cost
+- Faster classification pipeline
+
+### 4. Future Extensions
+- Deep learning models (CNNs on sequence data)
+- Integration into malariagen-data-python workflows
+- Deployment via cloud (e.g., Google Cloud Storage)
+  
+### 5. Main Idea
+- Develop a lightweight FASTQ-based taxonomic classifier that learns discriminative k-mer signatures across Anopheles species and integrate them into a scalable ML model, potentially benchmarking against tools like Kraken and improving performance in low-coverage or noisy data.
+  
+## Relevance to MalariaGEN
+This approach aligns with MalariaGEN’s goal of lowering barriers to genomic data analysis and enabling scalable tools for malaria-endemic regions.

From 93ee91abf8bec2424178ab610b69aef15e95c5ba Mon Sep 17 00:00:00 2001
From: Merryalem <134082046+Merryalem@users.noreply.github.com>
Date: Sun, 29 Mar 2026 07:53:30 -0700
Subject: [PATCH 2/4] Add files via upload

ML taxon classifier demo notebook for GSoC, by Meron Asmamaw
---
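Note on section 6.3 of the notebook, where `filter_high_quality_reads` is a
placeholder: a minimal sketch of a real filter, assuming standard four-line
FASTQ records with Phred+33 quality encoding. The thresholds
(`min_mean_quality=20`, `min_length=30`) and the handle-based signature are
illustrative assumptions, not values chosen in the notebook.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 by default)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def filter_high_quality_reads(handle, min_mean_quality=20, min_length=30):
    """Keep sequences that are long enough and have a high enough mean quality."""
    return [
        seq
        for _, seq, qual in read_fastq(handle)
        if len(seq) >= min_length and mean_phred(qual) >= min_mean_quality
    ]


# Two toy records: 'I' encodes Phred 40, '#' encodes Phred 2
records = [
    ("@read1", "ACGT" * 10, "I" * 40),
    ("@read2", "ACGT" * 10, "#" * 40),
]
demo = io.StringIO("".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in records))
kept = filter_high_quality_reads(demo)
print("Reads kept:", len(kept))
```

For a downloaded file, the same function could be given a text handle from
`gzip.open(path, "rt")`, so the reads are streamed rather than decompressed to
disk first.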
 ...n_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb | 397 ++++++++++++++++++
 1 file changed, 397 insertions(+)
 create mode 100644 notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb

diff --git a/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb b/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb
new file mode 100644
index 000000000..4dd173667
--- /dev/null
+++ b/notebooks/Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb
@@ -0,0 +1,397 @@
+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "toc_visible": true
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "#  ML Taxon Classifier Demo for East African Mosquito Samples by **Meron Asmamaw Alemayehu**\n",
+        "This notebook demonstrates my workflow for filtering East African samples from MalariaGEN AG3 metadata and preparing them for a machine-learning taxon classifier.  \n",
+        "\n",
+        "I will write the code step by step, explaining what I plan to do at each stage."
+      ],
+      "metadata": {
+        "id": "1cPntG7rZLjV"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 1 Install and Import Libraries"
+      ],
+      "metadata": {
+        "id": "1TajsazZb9N2"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "oFEqsL9ZYk1B"
+      },
+      "outputs": [],
+      "source": [
+        "# -----------------------------\n",
+        "# 1. Install and import libraries\n",
+        "# -----------------------------\n",
+        "# I will install the malariagen_data package to access AG3 data\n",
+        "%pip install -q --no-warn-conflicts malariagen_data\n",
+        "\n",
+        "# Import libraries I will need\n",
+        "import malariagen_data\n",
+        "import pandas as pd\n",
+        "import matplotlib.pyplot as plt\n",
+        "import warnings\n",
+        "\n",
+        "# I will also use this to make plots render nicely in Colab\n",
+        "import plotly.io as pio\n",
+        "pio.renderers.default = \"notebook+colab\""
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 2 Load AG3 Metadata\n",
+        "I will load the AG3 dataset and check the metadata so I know which samples are available and what columns I can use for filtering."
+      ],
+      "metadata": {
+        "id": "7y2wJHxWZgwX"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Initialize AG3 dataset\n",
+        "ag3 = malariagen_data.Ag3()\n",
+        "\n",
+        "# Load sample metadata into a DataFrame\n",
+        "samples_df = ag3.sample_metadata()\n",
+        "\n",
+        "# View first few rows to understand structure\n",
+        "display(samples_df.head())\n",
+        "\n",
+        "# I will also check all available columns to plan which ones I need\n",
+        "samples_df.columns.tolist()"
+      ],
+      "metadata": {
+        "id": "nowZv_76ZRky"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 3 Filter for East African Samples\n",
+        "I will filter samples to only include those from East African countries.  \n",
+        "Later I will use the 'sample_id' column to access ENA FASTQ files."
+      ],
+      "metadata": {
+        "id": "muHO3XtqZl-W"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# List of East African countries I will filter\n",
+        "east_africa_countries = [\n",
+        "    \"Ethiopia\", \"Comoros, The Union of the\", \"Kenya\", \"Madagascar\",\n",
+        "    \"Malawi\", \"Mozambique\", \"South Sudan\", \"Tanzania\", \"Uganda\",\n",
+        "    \"Zambia\", \"Zimbabwe\"\n",
+        "]\n",
+        "\n",
+        "# Filter metadata for East African samples\n",
+        "east_africa_samples = samples_df[samples_df['country'].isin(east_africa_countries)].copy()\n",
+        "print(\"Total East African samples:\", len(east_africa_samples))\n",
+        "\n",
+        "# Save the filtered metadata for downstream analysis\n",
+        "east_africa_samples.to_csv(\"east_africa_samples_metadata.csv\", index=False)"
+      ],
+      "metadata": {
+        "id": "cQcOt1rOZRjF"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 4 Explore Samples\n",
+        "I will look at how many samples come from each country and each year.  \n",
+        "This will help me understand the dataset before I fetch FASTQ files and do ML."
+      ],
+      "metadata": {
+        "id": "A8cfACwdZvj3"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Samples per country\n",
+        "samples_per_country = east_africa_samples['country'].value_counts()\n",
+        "print(samples_per_country)\n",
+        "\n",
+        "# Samples per year\n",
+        "samples_per_year = east_africa_samples.groupby(\"year\").size()\n",
+        "print(samples_per_year)\n",
+        "\n",
+        "# Bar plot of samples per year\n",
+        "samples_per_year.plot(kind=\"bar\", figsize=(8,4), title=\"East African samples per year\")\n",
+        "plt.xlabel(\"Year\")\n",
+        "plt.ylabel(\"Number of samples\")\n",
+        "plt.show()\n"
+      ],
+      "metadata": {
+        "id": "F70jBWFKZRhI"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 5 Geographic Plot"
+      ],
+      "metadata": {
+        "id": "FVs14s6ZZ7H2"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Scatter plot of sample collection coordinates\n",
+        "plt.figure(figsize=(6,6))\n",
+        "plt.scatter(\n",
+        "    east_africa_samples[\"longitude\"],\n",
+        "    east_africa_samples[\"latitude\"],\n",
+        "    alpha=0.5,\n",
+        "    c=\"blue\",\n",
+        "    label=\"Samples\"\n",
+        ")\n",
+        "plt.xlabel(\"Longitude\")\n",
+        "plt.ylabel(\"Latitude\")\n",
+        "plt.title(\"Geographic distribution of East African samples\")\n",
+        "plt.legend()\n",
+        "plt.show()"
+      ],
+      "metadata": {
+        "id": "tSyHsnzIZyWw"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# 6 Planned ML Pipeline Steps (with example code)\n",
+        "Here I will show **how I plan to implement the full workflow** for my taxon classifier.  \n",
+        "For demonstration, I will use placeholder data so that the notebook runs safely in Colab."
+      ],
+      "metadata": {
+        "id": "p-hTAYGraE1M"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# 6.1 Use 'sample_id' to get ENA run accessions and links to FASTQ files\n",
+        "\n",
+        "# I will create a toy example of how to retrieve ENA links using sample_id\n",
+        "# In practice, I would use ag3.ena_run_accessions() or metadata lookups\n",
+        "\n",
+        "# Example: take the first 5 sample_ids from the filtered metadata\n",
+        "sample_ids = east_africa_samples['sample_id'].values[:5]\n",
+        "\n",
+        "# I will create a dictionary mapping sample_id to fake ENA FTP links\n",
+        "ena_links = {sid: f\"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/{sid}_R1.fastq.gz\" for sid in sample_ids}\n",
+        "print(\"Sample ENA links:\")\n",
+        "for k, v in ena_links.items():\n",
+        "    print(k, \"->\", v)"
+      ],
+      "metadata": {
+        "id": "xwyAPekHZyS8"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 6.2 Fetch FASTQ files in Colab using !wget\n",
+        "I will download files directly into Colab (not to my local machine) using wget. For demonstration, I will not actually fetch the large files."
+      ],
+      "metadata": {
+        "id": "2jLVsO4NaVAg"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Example wget command for Colab (commented out to avoid downloading)\n",
+        "# for sample_id, link in ena_links.items():\n",
+        "#     !wget {link} -O {sample_id}_R1.fastq.gz\n",
+        "\n",
+        "print(\"I would use wget to fetch FASTQ files for each sample_id directly in Colab.\")"
+      ],
+      "metadata": {
+        "id": "yZLXXJZYZyRO"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 6.3 Filter FASTQ reads for high quality\n",
+        "This is a placeholder example: in practice, I would parse the FASTQ file and filter reads by quality score."
+      ],
+      "metadata": {
+        "id": "De2XaWwza6Ll"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Example: pseudo-function to filter FASTQ reads\n",
+        "def filter_high_quality_reads(fastq_file):\n",
+        "    \"\"\"\n",
+        "    I will parse each read from the FASTQ file and keep only reads\n",
+        "    where quality scores are high. This is a placeholder.\n",
+        "    \"\"\"\n",
+        "    # Placeholder: just return a list of \"high-quality reads\"\n",
+        "    return [\"ATCG\"*25]*100  # pretend 100 high-quality reads\n",
+        "\n",
+        "# Demonstrate with one sample\n",
+        "sample_reads = filter_high_quality_reads(\"fake_sample.fastq.gz\")\n",
+        "print(\"Number of high-quality reads:\", len(sample_reads))"
+      ],
+      "metadata": {
+        "id": "pKI4gJlwaeWG"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 6.4 Extract k-mer features for each sample\n",
+        "I will use the filtered reads to generate k-mer counts (toy example with random placeholder features)."
+      ],
+      "metadata": {
+        "id": "FCXUvI_GbOn0"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Example function to extract k-mer features\n",
+        "def extract_kmer_features(reads, k=5):\n",
+        "    \"\"\"\n",
+        "    I will count k-mers in the reads and return a feature vector.\n",
+        "    Here I use random numbers as a placeholder.\n",
+        "    \"\"\"\n",
+        "    import numpy as np\n",
+        "    num_features = 50\n",
+        "    return np.random.rand(num_features)\n",
+        "\n",
+        "# Apply to one sample\n",
+        "features = extract_kmer_features(sample_reads)\n",
+        "print(\"Example feature vector (length):\", len(features))"
+      ],
+      "metadata": {
+        "id": "OkKdO0lJaeSA"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 6.5 Split dataset into train and test sets\n",
+        "The target column for supervised ML is **taxon**. I will use placeholder features for all samples to demonstrate splitting."
+      ],
+      "metadata": {
+        "id": "mXHpx9DWbX9N"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sklearn.model_selection import train_test_split\n",
+        "import numpy as np\n",
+        "\n",
+        "# Generate fake feature matrix for all samples\n",
+        "num_samples = east_africa_samples.shape[0]\n",
+        "num_features = 50\n",
+        "X = np.random.rand(num_samples, num_features)\n",
+        "\n",
+        "# Target labels from 'taxon' column\n",
+        "y = east_africa_samples['taxon'].values\n",
+        "\n",
+        "# Split dataset\n",
+        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
+        "print(\"Training samples:\", len(X_train), \"Testing samples:\", len(X_test))"
+      ],
+      "metadata": {
+        "id": "4rYPdfLlaeP7"
+      },
+      "execution_count": null,
+      "outputs": []
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## 6.6 Train ML classifiers\n",
+        "I will train a few example classifiers on the toy features to show the workflow. Because the features are random, the accuracies are not meaningful."
+      ],
+      "metadata": {
+        "id": "IYqNDCMdbstJ"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from sklearn.ensemble import RandomForestClassifier\n",
+        "from sklearn.linear_model import LogisticRegression\n",
+        "from sklearn.svm import SVC\n",
+        "from sklearn.metrics import accuracy_score\n",
+        "\n",
+        "# Random Forest classifier\n",
+        "rf = RandomForestClassifier(n_estimators=50, random_state=42)\n",
+        "rf.fit(X_train, y_train)\n",
+        "y_pred_rf = rf.predict(X_test)\n",
+        "print(\"Random Forest toy accuracy:\", accuracy_score(y_test, y_pred_rf))\n",
+        "\n",
+        "# Logistic Regression\n",
+        "lr = LogisticRegression(max_iter=200)\n",
+        "lr.fit(X_train, y_train)\n",
+        "y_pred_lr = lr.predict(X_test)\n",
+        "print(\"Logistic Regression toy accuracy:\", accuracy_score(y_test, y_pred_lr))\n",
+        "\n",
+        "# Support Vector Machine\n",
+        "svm = SVC()\n",
+        "svm.fit(X_train, y_train)\n",
+        "y_pred_svm = svm.predict(X_test)\n",
+        "print(\"SVM toy accuracy:\", accuracy_score(y_test, y_pred_svm))"
+      ],
+      "metadata": {
+        "id": "koG11432bXlV"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
\ No newline at end of file

From 1e6e0663684b68e9806be9b4955a7465af83e6a5 Mon Sep 17 00:00:00 2001
From: Merryalem <134082046+Merryalem@users.noreply.github.com>
Date: Sun, 29 Mar 2026 08:00:23 -0700
Subject: [PATCH 3/4] Create Meron_ML Taxon Classifier_GSoC 2026_Prop.md

Description of the 'Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb' notebook
---
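Note on step 2 of the README (retrieving FASTQ links from ENA run accessions):
a minimal sketch, assuming the ENA Portal API `filereport` endpoint is used and
that a run accession for each sample is already known. The accession
`ERR0000001` and the file paths in `example_tsv` are made up for illustration;
how run accessions are obtained from the AG3 metadata is not shown in this
patch and is left open here.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build a filereport query URL asking for the FASTQ paths of one accession."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Map run_accession -> list of FTP URLs from a filereport TSV response."""
    links = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # fastq_ftp holds ';'-separated paths without a URL scheme
        paths = [p for p in row["fastq_ftp"].split(";") if p]
        links[row["run_accession"]] = [f"ftp://{p}" for p in paths]
    return links


# Made-up response text, shaped like a filereport TSV (no network access needed)
example_tsv = (
    "run_accession\tfastq_ftp\n"
    "ERR0000001\tftp.sra.ebi.ac.uk/vol1/fastq/EXAMPLE_1.fastq.gz;"
    "ftp.sra.ebi.ac.uk/vol1/fastq/EXAMPLE_2.fastq.gz\n"
)
links = parse_filereport(example_tsv)
print(filereport_url("ERR0000001"))
print(links)
```

Keeping URL construction and response parsing separate means the parsing can
be checked on a saved response before any large download is attempted in Colab.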
 ...eron_ML Taxon Classifier_GSoC 2026_Prop.md | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md

diff --git a/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md b/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md
new file mode 100644
index 000000000..8fbafdcbe
--- /dev/null
+++ b/notebooks/Meron_ML Taxon Classifier_GSoC 2026_Prop.md	
@@ -0,0 +1,77 @@
+**This README describes the workflow for `Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb`.**
+
+# ML Taxon Classifier Proposal – GSoC 2026
+
+This file describes my step-by-step plan for building a machine-learning classifier that assigns Anopheles mosquito samples to major taxa using raw sequencing reads.
+
+---
+
+## 1. Filter Samples
+
+I will start by using the MalariaGEN AG3 metadata to filter samples from **East African countries**.  
+This gives me the subset of samples I want to analyze, from countries such as Ethiopia, Kenya, Tanzania, and Uganda.
+
+---
+
+## 2. Access ENA FASTQ Links
+
+Using the `sample_id` column of the filtered metadata, I will retrieve **ENA run accessions** and generate links to the raw FASTQ files.  
+I will not download these files locally but instead use **Colab** to access them directly via HTTP/FTP links.
+
+---
+
+## 3. Fetch FASTQ Files in Colab
+
+In Colab, I will use `!wget` to fetch FASTQ files directly into the Colab environment without storing them on my local machine.  
+This ensures that the analysis can be done even in resource-limited environments.
+
+---
+
+## 4. Filter FASTQ Reads
+
+I will filter the raw FASTQ reads to **retain only high-quality reads**.  
+This step ensures that downstream k-mer extraction and ML models are trained on reliable data.
+
+---
+
+## 5. Extract K-mer Features
+
+From the filtered FASTQ reads, I will extract **k-mer counts** for each sample.  
+These k-mer features will form the **input matrix** for ML classification.
+
+---
+
+## 6. Split Dataset
+
+I will split the k-mer feature matrix into **training and testing datasets**.  
+The `taxon` column in the metadata will serve as the **target label**.
+
+---
+
+## 7. Train Machine Learning Classifiers
+
+I will train several ML models on the training dataset, including:
+
+- Random Forest  
+- Logistic Regression  
+- Support Vector Machine  
+- XGBoost  
+- LightGBM  
+- K-Nearest Neighbors  
+- Decision Tree  
+- Gradient Boosting  
+
+I will evaluate the models on the testing set and choose the best-performing one for further work.
+
+---
+
+## 8. Demonstration Notebook
+
+Alongside this proposal, I have included a **Colab-ready notebook** (`Meron_ML_Taxon_Classifier_GSoC_2026_Prop.ipynb`) that:
+
+- Filters East African samples  
+- Shows placeholder ENA FASTQ links  
+- Shows a placeholder for k-mer feature extraction (random toy features)  
+- Splits the dataset and trains toy classifiers  
+
+This notebook is intended to **demonstrate my workflow and coding approach safely**, without needing to download large FASTQ files.

From 8c3e29abb751d5fca1dfc0c27888be6201359447 Mon Sep 17 00:00:00 2001
From: Merryalem <134082046+Merryalem@users.noreply.github.com>
Date: Sun, 29 Mar 2026 08:12:11 -0700
Subject: [PATCH 4/4] Delete docs/ml_taxon_classifier_idea_byMeron.md

---
 docs/ml_taxon_classifier_idea_byMeron.md | 40 ------------------------
 1 file changed, 40 deletions(-)
 delete mode 100644 docs/ml_taxon_classifier_idea_byMeron.md

diff --git a/docs/ml_taxon_classifier_idea_byMeron.md b/docs/ml_taxon_classifier_idea_byMeron.md
deleted file mode 100644
index ed5293af5..000000000
--- a/docs/ml_taxon_classifier_idea_byMeron.md
+++ /dev/null
@@ -1,40 +0,0 @@
-# Machine Learning Taxon Classifier (GSoC Idea Contribution)
-
-## Motivation
-Accurate classification of Anopheles mosquito species is critical for malaria control. Current approaches rely on variant calling pipelines, which can be computationally expensive and inaccessible in low-resource settings.
-
-## Proposed Approach
-This document outlines a lightweight machine learning approach to classify mosquito samples directly from raw sequencing reads (FASTQ files).
-
-## Method Idea
-
-### 1. Feature Extraction
-- Extract k-mer frequencies from raw FASTQ reads
-- Use small k (e.g., k=5–7) for efficiency
-- Normalize counts to create feature vectors
-
-### 2. Model
-- Train a supervised classifier such as:
-  - Random Forest
-  - eXtream Gradient Boosting
-  - Gradient Boosting
-  - Support Vector Machine
-  - Light Gradient Boosting 
-  - Logistic Regression
-  - Labels: major taxa (e.g., An. gambiae, An. coluzzii, An.  Arabiensis, and An. funestus)
-
-### 3. Advantages
-- No need for full variant calling
-- Lower computational cost
-- Faster classification pipeline
-
-### 4. Future Extensions
-- Deep learning models (CNNs on sequence data)
-- Integration into malariagen-data-python workflows
-- Deployment via cloud (e.g., Google Cloud Storage)
-  
-### 5. Main Idea
-- Develop a lightweight FASTQ-based taxonomic classifier that learns discriminative k-mer signatures across Anopheles species and integrate them into a scalable ML model, potentially benchmarking against tools like Kraken and improving performance in low-coverage or noisy data.
-  
-## Relevance to MalariaGEN
-This approach aligns with MalariaGEN’s goal of lowering barriers to genomic data analysis and enabling scalable tools for malaria-endemic regions.