179 changes: 179 additions & 0 deletions Binary-Gene-Classifier-Model/README.md
@@ -0,0 +1,179 @@
# Essential Gene Classification from DNA Sequences

This project implements a baseline machine learning pipeline to classify bacterial genes as essential or non-essential using DNA sequence information from the **macwiatrak/bacbench-essential-genes-dna** dataset (Hugging Face Datasets).

## Project Overview

The notebook:
- Loads the BacBench essential genes dataset (train/validation/test splits).
- Cleans and simplifies the dataset by removing unused metadata columns.
- Encodes DNA sequences into integer representations using a custom nucleotide mapping.
- Extracts non-overlapping 4-mer (length-4 subsequence) count features.
- Trains a Logistic Regression classifier on the resulting feature vectors.
- Evaluates model performance using accuracy and F1 score on validation and test splits.

This serves as a simple, fast baseline for essential-gene prediction from raw DNA sequences.

## Dataset

The project uses the `macwiatrak/bacbench-essential-genes-dna` dataset loaded via `datasets.load_dataset`.
Each split (train, validation, test) originally contains, among others, the following fields:
- `dna_seq`: DNA sequence of the gene.
- `essential`: Label indicating whether a gene is essential (`"Yes"` or `"No"`).
- Several metadata columns (e.g., `genome_name`, `start`, `end`, `protein_id`, `strand`, `product`, `__index_level_0__`).

In this notebook, the unnecessary metadata columns are dropped, and only `dna_seq` and `essential` are retained for modeling.

## Preprocessing

Key preprocessing steps:

- **Label encoding**
  The `essential` field is converted from string to integer:
  - `"Yes"` → `1`
  - `"No"` → `0`

- **DNA character mapping**
  Each base in `dna_seq` is mapped to an integer code for efficient processing:
  - `A → 0`, `T → 1`, `C → 2`, `G → 3`
  - Ambiguous bases: `N → 4`, `K → 5`, `R → 6`, `S → 7`, `Y → 8`, `M → 9`, `W → 10`

- **Sequence encoding**
  A helper function converts each DNA string into a list of integers using the mapping above, discarding characters not present in the map.
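
The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the notebook's exact code: `encode_label` is a hypothetical helper name, and the sample sequence is made up.

```python
# Nucleotide-to-integer mapping, as listed above.
DNA_MAP = {
    "A": 0, "T": 1, "C": 2, "G": 3,
    "N": 4, "K": 5, "R": 6, "S": 7,
    "Y": 8, "M": 9, "W": 10,
}


def encode_label(label):
    """Map the string label to an integer: "Yes" -> 1, anything else -> 0."""
    return 1 if label == "Yes" else 0


def encode_sequence(seq, mapping=DNA_MAP):
    """Convert a DNA string to a list of integers, skipping unmapped characters."""
    return [mapping[base] for base in seq if base in mapping]


# Made-up example: "X" is not in the map, so it is dropped.
print(encode_label("Yes"))        # 1
print(encode_sequence("ATGXCW"))  # [0, 1, 3, 2, 10]
```

Because each base becomes one list element, `W` stays the single integer `10` rather than the two digits 1 and 0.
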

## Feature Extraction

The feature representation is based on **non-overlapping 4-mers**:

- The number of possible symbols is `NUM_BASES = 11`.
- The total number of distinct 4-mers is `NUM_4MERS = 11^4 = 14641`.
- For each encoded sequence, the notebook:
  - Iterates with step size `STEP = 4` to form non-overlapping 4-mers.
  - Maps each 4-mer to a unique integer index using positional encoding, where \(b_0, \dots, b_3\) are the integer codes of its four bases:
    \[
    \text{kmer\_int} = b_0 \cdot 11^3 + b_1 \cdot 11^2 + b_2 \cdot 11 + b_3
    \]
  - Increments the corresponding position in a length-14641 count vector.

The resulting dense feature matrix is then converted to a SciPy CSR sparse matrix for memory efficiency.
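
As a worked example of the indexing scheme, the sketch below counts non-overlapping 4-mers in plain Python. It is illustrative only: it stores counts in a `Counter` keyed by k-mer index instead of the length-14641 vector, `kmer_index` and `count_4mers` are hypothetical helper names, and the input sequence is made up.

```python
from collections import Counter

NUM_BASES = 11  # size of the nucleotide alphabet used in this README
K = 4           # k-mer length
STEP = 4        # a step equal to K gives non-overlapping k-mers


def kmer_index(kmer, base=NUM_BASES):
    """Interpret a list of base codes as the digits of a base-`base` number."""
    index = 0
    for code in kmer:
        index = index * base + code
    return index


def count_4mers(encoded_seq, k=K, step=STEP):
    """Return {kmer_index: count} for the complete k-mers in the sequence."""
    counts = Counter()
    for start in range(0, len(encoded_seq) - k + 1, step):
        counts[kmer_index(encoded_seq[start:start + k])] += 1
    return counts


# Made-up encoded sequence: A T C G | A T C G | W W (the trailing 2 bases are dropped)
seq = [0, 1, 2, 3, 0, 1, 2, 3, 10, 10]
print(kmer_index([0, 1, 2, 3]))  # 0*1331 + 1*121 + 2*11 + 3 = 146
print(dict(count_4mers(seq)))    # {146: 2}
```

The largest possible index is `kmer_index([10, 10, 10, 10]) = 14640`, which is why a vector of length 14641 is enough.
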

## Model

The classification model is a **Logistic Regression** from `sklearn.linear_model` with:

- `solver='saga'`
- `max_iter=2000`
- `n_jobs=-1` (parallel training where possible)

Training is performed on the 4-mer count features of the train split.

## Evaluation

Model performance is evaluated on both validation and test splits using:

- **Accuracy** (`sklearn.metrics.accuracy_score`)
- **F1 Score** (`sklearn.metrics.f1_score`)

The notebook prints:

- Validation Accuracy
- Validation F1 Score
- Test Accuracy
- Test F1 Score

These metrics provide an initial benchmark for this simple 4-mer + Logistic Regression approach.
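
To make the two metrics concrete, here is a small plain-Python sketch of how accuracy and the binary F1 score (for the positive class `1`) are computed. The helper names and the label lists are made up for illustration; the notebook itself uses the scikit-learn functions named above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up imbalanced labels: 8 non-essential genes, 2 essential genes.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10
mixed = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy(y_true, always_negative))  # 0.8
print(f1(y_true, always_negative))        # 0.0
print(accuracy(y_true, mixed))            # 0.8
print(f1(y_true, mixed))                  # 0.5
```

Both prediction lists reach the same accuracy, but only F1 reveals that the first one never finds an essential gene, which is why both metrics are reported.
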

## Requirements

Main Python dependencies:

- `pandas`
- `numpy`
- `scipy`
- `datasets` (Hugging Face Datasets)
- `scikit-learn`

Example installation (if running locally):
`pip install pandas numpy scipy datasets scikit-learn`

## How to Run

1. Open the notebook in Google Colab or your preferred environment.
2. Ensure all required packages are installed.
3. Run the cells in order:
   - Dataset loading and column filtering
   - Label encoding
   - DNA mapping and sequence encoding
   - 4-mer feature extraction
   - Model training
   - Evaluation on validation and test splits

## Possible Extensions

- Use overlapping k-mers or different k-mer sizes to capture more sequence context.
- Try more expressive models (e.g., tree-based methods, neural networks).
- Explore alternative encodings (e.g., one-hot, embeddings, or biologically informed encodings).
- Add cross-validation and hyperparameter tuning for more robust performance estimates.
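
The first extension can be illustrated with a short sketch that extracts k-mers for an arbitrary `k` and step size. The `kmers` helper and the example sequence are hypothetical and not part of the notebook.

```python
def kmers(seq, k, step):
    """Return the complete k-mers of `seq`, advancing `step` positions each time."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


seq = "ATGCGTAC"  # made-up sequence
print(kmers(seq, k=4, step=4))  # ['ATGC', 'GTAC']
print(kmers(seq, k=4, step=1))  # ['ATGC', 'TGCG', 'GCGT', 'CGTA', 'GTAC']
print(kmers(seq, k=3, step=3))  # ['ATG', 'CGT']
```

With `step=1` every position starts a k-mer, so patterns that straddle the fixed 4-base boundaries are no longer missed; `k=3, step=3` reads the sequence in codon-sized chunks.
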

## Issues with Current Gene Classifier

1. **Class Imbalance**
   - Essential genes (`1`) are much rarer than non-essential genes (`0`).
   - Logistic Regression tends to predict the majority class, lowering F1 score on validation.

2. **Simple Features**
   - Using **non-overlapping 4-mer counts** loses many sequence patterns.
   - Linear combinations of k-mer counts may not capture complex dependencies between nucleotides.

3. **Non-Overlapping k-mers**
   - Step size of 4 skips many overlapping patterns in the DNA sequence.
   - Important motifs or codon patterns might be missed.

4. **Normalization**
   - Raw 4-mer counts vary with sequence length.
   - Longer sequences dominate the feature vectors, potentially biasing the classifier.

5. **Linear Model Limitations**
   - Logistic Regression is a linear classifier.
   - It cannot capture non-linear interactions between k-mers that may be biologically relevant.

6. **Potential Data Leakage**
   - Some sequences in train/test splits may be very similar or overlapping.
   - This can artificially inflate test accuracy, which may explain the high test F1 compared to validation.

7. **Limited Biological Context**
   - Only nucleotide sequences are considered.
   - Other biological features (gene location, GC content, protein info) are ignored, which may be predictive of essentiality.

8. **Sparse Signal**
   - Many 4-mer combinations may never appear, making feature vectors sparse.
   - Sparse linear models may struggle to generalize with limited data for certain patterns.

9. **Mapping**
   - I did not verify whether `W`, which is mapped to `10`, is treated as the single value 10 or as the two digits 1 and 0; the latter would essentially derail the classification.

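
One possible remedy for the normalization issue is to divide each count by the total number of 4-mers in the gene, so that features become relative frequencies. The sketch below is only an illustration under that assumption: `normalize_counts` is a hypothetical helper, and the count dictionaries are made up.

```python
def normalize_counts(counts):
    """Convert {kmer_index: count} into relative frequencies that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}  # sequence too short to contain any complete k-mer
    return {kmer: n / total for kmer, n in counts.items()}


# Made-up counts: same composition, but the second gene is ten times longer.
short_gene = {146: 2, 300: 2}
long_gene = {146: 20, 300: 20}

print(normalize_counts(short_gene))  # {146: 0.5, 300: 0.5}
print(normalize_counts(long_gene))   # {146: 0.5, 300: 0.5}
```

After normalization the two genes have identical feature values, so sequence length alone no longer drives the classifier. On a SciPy sparse matrix, the same row-wise scaling could be done with `sklearn.preprocessing.normalize(X, norm="l1")`.
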
## Model Evaluation

The baseline Logistic Regression classifier was evaluated on the validation and test splits using **accuracy** and **F1 score**:

| Split | Accuracy | F1 Score |
|------------|---------|----------|
| Validation | 0.45 | 0.25 |
| Test | 0.90 | 0.80 |

> ⚠️ Note:
> - Validation F1 is low, likely due to class imbalance and the simple linear model.
> - The high test metrics may be artificially inflated if some sequences are very similar across splits.
> - This baseline serves as a starting point for further improvements.
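
A quick way to probe the leakage concern is to look for sequences that appear verbatim in both splits. The sketch below assumes the raw `dna_seq` strings of each split are available as Python lists; the helper names and the example sequences are made up.

```python
def shared_sequences(train_seqs, test_seqs):
    """Return the set of sequences that occur in both splits."""
    return set(train_seqs) & set(test_seqs)


def overlap_fraction(train_seqs, test_seqs):
    """Fraction of test sequences that also occur verbatim in the train split."""
    if not test_seqs:
        return 0.0
    train_set = set(train_seqs)
    return sum(1 for s in test_seqs if s in train_set) / len(test_seqs)


# Made-up sequences for illustration only.
train = ["ATGAAA", "ATGCCC", "ATGGGG"]
test = ["ATGCCC", "ATGTTT"]

print(shared_sequences(train, test))  # {'ATGCCC'}
print(overlap_fraction(train, test))  # 0.5
```

This only detects exact duplicates. Near-identical genes from related genomes would need a sequence-similarity comparison, which is beyond this baseline.
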


## Credits

- **Dataset:** [BacBench Essential Genes DNA Dataset](https://huggingface.co/macwiatrak/bacbench-essential-genes-dna) by Mac Wiatrak et al., hosted on HuggingFace.
- **Libraries & Tools:**
  - [HuggingFace `datasets`](https://huggingface.co/docs/datasets) for data loading and preprocessing
  - [NumPy](https://numpy.org/) for numerical operations
  - [SciPy](https://www.scipy.org/) for scientific computing
  - [scikit-learn](https://scikit-learn.org/) for machine learning models and evaluation metrics
- **Inspired by:** Standard bioinformatics workflows for DNA k-mer feature extraction and baseline classification.
- **Workflow & Model Implementation:** Done by Sharat Doddihal

### Note
This was my first attempt at creating an ML model by myself without relying much on AI. AI was used here, but only to help with the debugging process.
Overall I am happy with how this turned out, as it was a great learning experience. There are still many fundamental errors that hurt the accuracy.
82 changes: 82 additions & 0 deletions Binary-Gene-Classifier-Model/main.py
@@ -0,0 +1,82 @@
import numpy as np
from datasets import load_dataset
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

DATASET = "macwiatrak/bacbench-essential-genes-dna"
# Metadata columns that are not used for modeling
UNUSED_COLUMNS = ['genome_name', 'start', 'end', 'protein_id',
                  'strand', 'product', '__index_level_0__']

# Load datasets (ds = validation, ds1 = train, ds2 = test)
ds = load_dataset(DATASET, split="validation").remove_columns(UNUSED_COLUMNS)
ds1 = load_dataset(DATASET, split="train").remove_columns(UNUSED_COLUMNS)
ds2 = load_dataset(DATASET, split="test").remove_columns(UNUSED_COLUMNS)


# Convert Yes/No to 1/0.
# This must run with batched=True so that example["essential"] is a list of
# strings. Without it, the value is a single string (assuming the labels are
# plain "Yes"/"No" strings, as the README states) and the comprehension would
# iterate over its characters, turning every label into a list of zeros.
def convert(example):
    example["essential"] = [1 if x == "Yes" else 0 for x in example["essential"]]
    return example


ds = ds.map(convert, batched=True)
ds1 = ds1.map(convert, batched=True)
ds2 = ds2.map(convert, batched=True)

# DNA base mapping
dna_map = {
    "A": 0, "T": 1, "C": 2, "G": 3,
    "N": 4, "K": 5, "R": 6, "S": 7,
    "Y": 8, "M": 9, "W": 10
}
# Sequences are integer-encoded inside prepare_dataset(). The earlier per-row
# loops (ds[i]["dna_seq"] = ...) were dropped: assigning to a row returned by a
# Dataset does not modify the dataset, and an unmapped base raised a KeyError.

# 4-mer encoding utilities
NUM_BASES = len(dna_map)
NUM_4MERS = NUM_BASES ** 4
STEP = 4


def encode_sequence(seq, mapping=dna_map):
    # Characters missing from the mapping are skipped
    return [mapping[base] for base in seq if base in mapping]


def sequence_to_4mer_counts(seq, step=STEP):
    counts = np.zeros(NUM_4MERS, dtype=int)
    # The range stops early enough that only complete 4-mers are visited
    for i in range(0, len(seq) - (step - 1), step):
        kmer = seq[i:i + step]
        # Base-11 positional encoding of the four base codes
        kmer_int = (
            kmer[0] * NUM_BASES ** 3 +
            kmer[1] * NUM_BASES ** 2 +
            kmer[2] * NUM_BASES +
            kmer[3]
        )
        counts[kmer_int] += 1
    return counts


# Prepare dataset for ML
def prepare_dataset(ds_split):
    def _map_dna_sequence_to_integers(batch):
        return {"dna_seq": [encode_sequence(seq_str) for seq_str in batch["dna_seq"]]}

    ds_processed = ds_split.map(_map_dna_sequence_to_integers, batched=True)
    X_dense = np.array([sequence_to_4mer_counts(item["dna_seq"]) for item in ds_processed])
    # Labels are already integers after convert()
    y = np.array([item["essential"] for item in ds_processed])
    return X_dense, y


X_train_dense, y_train = prepare_dataset(ds1)
X_val_dense, y_val = prepare_dataset(ds)
X_test_dense, y_test = prepare_dataset(ds2)

# sparse conversion
X_train = csr_matrix(X_train_dense)
X_val = csr_matrix(X_val_dense)
X_test = csr_matrix(X_test_dense)

# Train classifier
clf = LogisticRegression(max_iter=2000, solver='saga', n_jobs=-1)
clf.fit(X_train, y_train)

# Evaluation
y_pred_val = clf.predict(X_val)
print("Validation Accuracy:", accuracy_score(y_val, y_pred_val))
print("Validation F1 Score:", f1_score(y_val, y_pred_val))
y_pred_test = clf.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))
print("Test F1 Score:", f1_score(y_test, y_pred_test))
5 changes: 5 additions & 0 deletions Binary-Gene-Classifier-Model/requirements.txt
@@ -0,0 +1,5 @@
numpy==1.25.0
pandas==2.0.1
scipy==1.11.0
scikit-learn==1.3.0
datasets==2.16.0