# Essential Gene Classification from DNA Sequences

This project implements a baseline machine learning pipeline to classify bacterial genes as essential or non-essential using DNA sequence information from the **macwiatrak/bacbench-essential-genes-dna** dataset (Hugging Face Datasets).

## Project Overview

The notebook:
- Loads the BacBench essential genes dataset (train/validation/test splits).
- Cleans and simplifies the dataset by removing unused metadata columns.
- Encodes DNA sequences into integer representations using a custom nucleotide mapping.
- Extracts non-overlapping 4-mer (length-4 subsequence) count features.
- Trains a Logistic Regression classifier on the resulting feature vectors.
- Evaluates model performance using accuracy and F1 score on the validation and test splits.

This serves as a simple, fast baseline for essential-gene prediction from raw DNA sequences.

## Dataset

The project uses the `macwiatrak/bacbench-essential-genes-dna` dataset loaded via `datasets.load_dataset`.
Each split (train, validation, test) originally contains, among others, the following fields:
- `dna_seq`: DNA sequence of the gene.
- `essential`: Label indicating whether a gene is essential (`"Yes"` or `"No"`).
- Several metadata columns (e.g., `genome_name`, `start`, `end`, `protein_id`, `strand`, `product`, `__index_level_0__`).

In this notebook, the unnecessary metadata columns are dropped, and only `dna_seq` and `essential` are retained for modeling.
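
As a sketch, the column filtering might look like the following. Toy in-memory rows stand in for the real records (which would come from `datasets.load_dataset("macwiatrak/bacbench-essential-genes-dna")`); the field names are taken from the list above.

```python
# Toy rows standing in for dataset records; the real data is fetched with
# datasets.load_dataset("macwiatrak/bacbench-essential-genes-dna").
rows = [
    {"dna_seq": "ATGAAC", "essential": "Yes", "genome_name": "g1", "start": 10},
    {"dna_seq": "GGCATT", "essential": "No", "genome_name": "g2", "start": 55},
]

# Keep only the two columns used for modeling.
KEEP = {"dna_seq", "essential"}
rows = [{k: v for k, v in r.items() if k in KEEP} for r in rows]
```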

## Preprocessing

Key preprocessing steps:

- **Label encoding**
  The `essential` field is converted from string to integer:
  - `"Yes"` → `1`
  - `"No"` → `0`

- **DNA character mapping**
  Each base in `dna_seq` is mapped to an integer for efficiency:
  - `A → 0`, `T → 1`, `C → 2`, `G → 3`
  - Ambiguous bases: `N → 4`, `K → 5`, `R → 6`, `S → 7`, `Y → 8`, `M → 9`, `W → 10`

- **Sequence encoding**
  A helper function converts each DNA string into a list of integers using the mapping above, discarding any characters not present in the map.
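
A minimal sketch of these steps (identifier names are illustrative, not the notebook's actual ones):

```python
# Label encoding: "Yes"/"No" -> 1/0.
LABEL_MAP = {"Yes": 1, "No": 0}

# Base-to-integer mapping, including ambiguity codes.
BASE_MAP = {
    "A": 0, "T": 1, "C": 2, "G": 3,
    "N": 4, "K": 5, "R": 6, "S": 7, "Y": 8, "M": 9, "W": 10,
}

def encode_seq(seq):
    """Convert a DNA string to a list of ints, dropping unmapped characters."""
    return [BASE_MAP[c] for c in seq if c in BASE_MAP]
```

For example, `encode_seq("ATGXN")` drops the unmapped `X` and yields `[0, 1, 3, 4]`.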

## Feature Extraction

The feature representation is based on **non-overlapping 4-mers**:

- The number of possible symbols is `NUM_BASES = 11`.
- The total number of distinct 4-mers is `NUM_4MERS = 11^4 = 14641`.
- For each encoded sequence, the notebook:
  - Iterates with step size `STEP = 4` to form non-overlapping 4-mers.
  - Maps each 4-mer `(b_0, b_1, b_2, b_3)` to a unique integer index using base-11 positional encoding:
    \[
    \text{kmer\_int} = b_0 \cdot 11^3 + b_1 \cdot 11^2 + b_2 \cdot 11 + b_3
    \]
  - Increments the corresponding position in a length-14641 count vector.

The resulting dense feature matrix is then converted to a SciPy CSR sparse matrix for memory efficiency.
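
The counting scheme above can be sketched as follows (a simplified reimplementation, not the notebook's exact code):

```python
import numpy as np
from scipy.sparse import csr_matrix

NUM_BASES = 11
K = 4
STEP = 4                      # non-overlapping windows
NUM_4MERS = NUM_BASES ** K    # 11^4 = 14641

def kmer_count_vector(encoded):
    """Count non-overlapping 4-mers using base-11 positional encoding."""
    counts = np.zeros(NUM_4MERS, dtype=np.int64)
    for i in range(0, len(encoded) - K + 1, STEP):
        b0, b1, b2, b3 = encoded[i:i + K]
        counts[b0 * 11**3 + b1 * 11**2 + b2 * 11 + b3] += 1
    return counts

# Stack one vector per sequence, then convert to CSR for memory efficiency.
encoded_seqs = [[0, 1, 2, 3, 3, 2, 1, 0]]  # toy encoded sequences
X = csr_matrix(np.stack([kmer_count_vector(s) for s in encoded_seqs]))
```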

## Model

The classification model is a **Logistic Regression** from `sklearn.linear_model` with:

- `solver='saga'`
- `max_iter=2000`
- `n_jobs=-1` (parallel training where possible)

Training is performed on the 4-mer count features of the train split.
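
The training step might look like this; the synthetic `X_train`/`y_train` below are stand-ins for the real 4-mer feature matrix and labels:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the real 4-mer count matrix and labels.
rng = np.random.default_rng(0)
X_train = csr_matrix(rng.integers(0, 3, size=(40, 14641)))
y_train = rng.integers(0, 2, size=40)

clf = LogisticRegression(solver="saga", max_iter=2000, n_jobs=-1)
clf.fit(X_train, y_train)
```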

## Evaluation

Model performance is evaluated on both validation and test splits using:

- **Accuracy** (`sklearn.metrics.accuracy_score`)
- **F1 Score** (`sklearn.metrics.f1_score`)

The notebook prints:

- Validation Accuracy
- Validation F1 Score
- Test Accuracy
- Test F1 Score

These metrics provide an initial benchmark for this simple 4-mer + Logistic Regression approach.
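
Metric computation is a one-liner per score; the labels below are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels and predictions.
y_val = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]

val_acc = accuracy_score(y_val, y_pred)  # 4/5 correct -> 0.8
val_f1 = f1_score(y_val, y_pred)         # precision 2/3, recall 1 -> 0.8
print(f"Validation Accuracy: {val_acc:.2f}")
print(f"Validation F1 Score: {val_f1:.2f}")
```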

## Requirements

Main Python dependencies:

- `pandas`
- `numpy`
- `scipy`
- `datasets` (Hugging Face Datasets)
- `scikit-learn`

Example installation (if running locally):
`pip install pandas numpy scipy datasets scikit-learn`

## How to Run

1. Open the notebook in Google Colab or your preferred environment.
2. Ensure all required packages are installed.
3. Run the cells in order:
   - Dataset loading and column filtering
   - Label encoding
   - DNA mapping and sequence encoding
   - 4-mer feature extraction
   - Model training
   - Evaluation on validation and test splits

## Possible Extensions

- Use overlapping k-mers or different k-mer sizes to capture more sequence context.
- Try more expressive models (e.g., tree-based methods, neural networks).
- Explore alternative encodings (e.g., one-hot, embeddings, or biologically informed encodings).
- Add cross-validation and hyperparameter tuning for more robust performance estimates.
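
As a sketch of the first extension, switching to overlapping k-mers only requires a step size of 1 (a hypothetical helper, not in the notebook):

```python
def overlapping_kmer_indices(encoded, k=4, base=11):
    """Base-`base` index of every overlapping k-mer (step size 1)."""
    return [
        sum(b * base ** (k - 1 - j) for j, b in enumerate(encoded[i:i + k]))
        for i in range(len(encoded) - k + 1)
    ]
```

A length-5 sequence now yields two 4-mers instead of one, so far fewer patterns are skipped.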

# Issues with Current Gene Classifier

1. **Class Imbalance**
   - Essential genes (`1`) are much rarer than non-essential genes (`0`).
   - Logistic Regression tends to predict the majority class, lowering the F1 score on validation.

2. **Simple Features**
   - Using **non-overlapping 4-mer counts** loses many sequence patterns.
   - Linear combinations of k-mer counts may not capture complex dependencies between nucleotides.

3. **Non-Overlapping k-mers**
   - A step size of 4 skips many overlapping patterns in the DNA sequence.
   - Important motifs or codon patterns might be missed.

4. **Normalization**
   - Raw 4-mer counts vary with sequence length.
   - Longer sequences dominate the feature vectors, potentially biasing the classifier.

5. **Linear Model Limitations**
   - Logistic Regression is a linear classifier.
   - It cannot capture non-linear interactions between k-mers that may be biologically relevant.

6. **Potential Data Leakage**
   - Some sequences in the train/test splits may be very similar or overlapping.
   - This can artificially inflate test accuracy, as seen in the high test F1 compared to validation.

7. **Limited Biological Context**
   - Only nucleotide sequences are considered.
   - Other biological features (gene location, GC content, protein info) that may be predictive of essentiality are ignored.

8. **Sparse Signal**
   - Many 4-mer combinations may never appear, making the feature vectors sparse.
   - Sparse linear models may struggle to generalize with limited data for certain patterns.

9. **Mapping**
   - The mapping assigns `W → 10`, a two-digit value. I did not verify whether `10` is always treated as a single base-11 digit: if k-mer indices were ever built by concatenating digits as strings, `10` would be read as `1` followed by `0`, which would essentially derail the classification. (The positional arithmetic `b_0 · 11^3 + …` handles it correctly, since each `b_i` is a single integer.)
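
Two of the issues above have cheap mitigations worth noting; a hedged sketch (neither is in the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Class imbalance: reweight classes inversely to frequency
# using scikit-learn's built-in class_weight option.
clf = LogisticRegression(solver="saga", max_iter=2000, n_jobs=-1,
                         class_weight="balanced")

# Normalization: turn raw k-mer counts into length-independent frequencies.
def to_frequencies(counts):
    total = counts.sum()
    return counts / total if total > 0 else counts

freqs = to_frequencies(np.array([2.0, 0.0, 2.0]))
```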

## Model Evaluation

The baseline Logistic Regression classifier was evaluated on the validation and test splits using **accuracy** and **F1 score**:

| Split      | Accuracy | F1 Score |
|------------|----------|----------|
| Validation | 0.45     | 0.25     |
| Test       | 0.90     | 0.80     |

> ⚠️ Note:
> - Validation F1 is low due to class imbalance and the simple linear model.
> - The high test metrics may be artificially inflated if some sequences are very similar across splits.
> - This baseline serves as a starting point for further improvements.

## Credits

- **Dataset:** [BacBench Essential Genes DNA Dataset](https://huggingface.co/macwiatrak/bacbench-essential-genes-dna) by Mac Wiatrak et al., hosted on Hugging Face.
- **Libraries & Tools:**
  - [Hugging Face `datasets`](https://huggingface.co/docs/datasets) for data loading and preprocessing
  - [NumPy](https://numpy.org/) for numerical operations
  - [SciPy](https://www.scipy.org/) for scientific computing
  - [scikit-learn](https://scikit-learn.org/) for machine learning models and evaluation metrics
- **Inspired by:** Standard bioinformatics workflows for DNA k-mer feature extraction and baseline classification.
- **Workflow & Model Implementation:** Sharat Doddihal

### Note

This was my first attempt at creating an ML model mostly on my own; AI was used only to help with the debugging process. Overall, I am happy with how this turned out, as it was a great learning experience. There are several fundamental errors that affect the accuracy.