# Essential Gene Classification from DNA Sequences

This project implements a baseline machine learning pipeline to classify bacterial genes as essential or non-essential using DNA sequence information from the **macwiatrak/bacbench-essential-genes-dna** dataset (Hugging Face Datasets).

## Project Overview

The notebook:
- Loads the BacBench essential genes dataset (train/validation/test splits).
- Cleans and simplifies the dataset by removing unused metadata columns.
- Encodes DNA sequences into integer representations using a custom nucleotide mapping.
- Extracts non-overlapping 4-mer (length-4 subsequence) count features.
- Trains a Logistic Regression classifier on the resulting feature vectors.
- Evaluates model performance using accuracy and F1 score on the validation and test splits.

This serves as a simple, fast baseline for essential-gene prediction from raw DNA sequences.

## Dataset

The project uses the `macwiatrak/bacbench-essential-genes-dna` dataset loaded via `datasets.load_dataset`.
Each split (train, validation, test) originally contains, among others, the following fields:
- `dna_seq`: DNA sequence of the gene.
- `essential`: Label indicating whether a gene is essential (`"Yes"` or `"No"`).
- Several metadata columns (e.g., `genome_name`, `start`, `end`, `protein_id`, `strand`, `product`, `__index_level_0__`).

In this notebook, the unnecessary metadata columns are dropped, and only `dna_seq` and `essential` are retained for modeling.
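
As a sketch, the column filtering might look like the following. Toy in-memory rows stand in for the real records (which would come from `datasets.load_dataset("macwiatrak/bacbench-essential-genes-dna")`); the field names are taken from the list above.

```python
# Toy rows standing in for dataset records; the real data is fetched with
# datasets.load_dataset("macwiatrak/bacbench-essential-genes-dna").
rows = [
    {"dna_seq": "ATGAAC", "essential": "Yes", "genome_name": "g1", "start": 10},
    {"dna_seq": "GGCATT", "essential": "No", "genome_name": "g2", "start": 55},
]

# Keep only the two columns used for modeling.
KEEP = {"dna_seq", "essential"}
rows = [{k: v for k, v in r.items() if k in KEEP} for r in rows]
```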

## Preprocessing

Key preprocessing steps:

- **Label encoding**
  The `essential` field is converted from string to integer:
  - `"Yes"` → `1`
  - `"No"` → `0`

- **DNA character mapping**
  Each base in `dna_seq` is mapped to an integer for efficiency:
  - `A → 0`, `T → 1`, `C → 2`, `G → 3`
  - Ambiguous bases: `N → 4`, `K → 5`, `R → 6`, `S → 7`, `Y → 8`, `M → 9`, `W → 10`

- **Sequence encoding**
  A helper function converts each DNA string into a list of integers using the mapping above, discarding any characters not present in the map.
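
A minimal sketch of these steps (identifier names are illustrative, not the notebook's actual ones):

```python
# Label encoding: "Yes"/"No" -> 1/0.
LABEL_MAP = {"Yes": 1, "No": 0}

# Base-to-integer mapping, including ambiguity codes.
BASE_MAP = {
    "A": 0, "T": 1, "C": 2, "G": 3,
    "N": 4, "K": 5, "R": 6, "S": 7, "Y": 8, "M": 9, "W": 10,
}

def encode_seq(seq):
    """Convert a DNA string to a list of ints, dropping unmapped characters."""
    return [BASE_MAP[c] for c in seq if c in BASE_MAP]
```

For example, `encode_seq("ATGXN")` drops the unmapped `X` and yields `[0, 1, 3, 4]`.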

## Feature Extraction

The feature representation is based on **non-overlapping 4-mers**:

- The number of possible symbols is `NUM_BASES = 11`.
- The total number of distinct 4-mers is `NUM_4MERS = 11^4 = 14641`.
- For each encoded sequence, the notebook:
  - Iterates with step size `STEP = 4` to form non-overlapping 4-mers.
  - Maps each 4-mer `(b_0, b_1, b_2, b_3)` to a unique integer index using base-11 positional encoding:
    \[
    \text{kmer\_int} = b_0 \cdot 11^3 + b_1 \cdot 11^2 + b_2 \cdot 11 + b_3
    \]
  - Increments the corresponding position in a length-14641 count vector.

The resulting dense feature matrix is then converted to a SciPy CSR sparse matrix for memory efficiency.
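
The counting scheme above can be sketched as follows (a simplified reimplementation, not the notebook's exact code):

```python
import numpy as np
from scipy.sparse import csr_matrix

NUM_BASES = 11
K = 4
STEP = 4                      # non-overlapping windows
NUM_4MERS = NUM_BASES ** K    # 11^4 = 14641

def kmer_count_vector(encoded):
    """Count non-overlapping 4-mers using base-11 positional encoding."""
    counts = np.zeros(NUM_4MERS, dtype=np.int64)
    for i in range(0, len(encoded) - K + 1, STEP):
        b0, b1, b2, b3 = encoded[i:i + K]
        counts[b0 * 11**3 + b1 * 11**2 + b2 * 11 + b3] += 1
    return counts

# Stack one vector per sequence, then convert to CSR for memory efficiency.
encoded_seqs = [[0, 1, 2, 3, 3, 2, 1, 0]]  # toy encoded sequences
X = csr_matrix(np.stack([kmer_count_vector(s) for s in encoded_seqs]))
```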

## Model

The classification model is a **Logistic Regression** from `sklearn.linear_model` with:

- `solver='saga'`
- `max_iter=2000`
- `n_jobs=-1` (parallel training where possible)

Training is performed on the 4-mer count features of the train split.
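
The training step might look like this; the synthetic `X_train`/`y_train` below are stand-ins for the real 4-mer feature matrix and labels:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the real 4-mer count matrix and labels.
rng = np.random.default_rng(0)
X_train = csr_matrix(rng.integers(0, 3, size=(40, 14641)))
y_train = rng.integers(0, 2, size=40)

clf = LogisticRegression(solver="saga", max_iter=2000, n_jobs=-1)
clf.fit(X_train, y_train)
```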

## Evaluation

Model performance is evaluated on both validation and test splits using:

- **Accuracy** (`sklearn.metrics.accuracy_score`)
- **F1 Score** (`sklearn.metrics.f1_score`)

The notebook prints:

- Validation Accuracy
- Validation F1 Score
- Test Accuracy
- Test F1 Score

These metrics provide an initial benchmark for this simple 4-mer + Logistic Regression approach.
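
Metric computation is a one-liner per score; the labels below are illustrative:

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative labels and predictions.
y_val = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0]

val_acc = accuracy_score(y_val, y_pred)  # 4/5 correct -> 0.8
val_f1 = f1_score(y_val, y_pred)         # precision 2/3, recall 1 -> 0.8
print(f"Validation Accuracy: {val_acc:.2f}")
print(f"Validation F1 Score: {val_f1:.2f}")
```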

## Requirements

Main Python dependencies:

- `pandas`
- `numpy`
- `scipy`
- `datasets` (Hugging Face Datasets)
- `scikit-learn`

Example installation (if running locally):
`pip install pandas numpy scipy datasets scikit-learn`

## How to Run

1. Open the notebook in Google Colab or your preferred environment.
2. Ensure all required packages are installed.
3. Run the cells in order:
   - Dataset loading and column filtering
   - Label encoding
   - DNA mapping and sequence encoding
   - 4-mer feature extraction
   - Model training
   - Evaluation on validation and test splits

## Possible Extensions

- Use overlapping k-mers or different k-mer sizes to capture more sequence context.
- Try more expressive models (e.g., tree-based methods, neural networks).
- Explore alternative encodings (e.g., one-hot, embeddings, or biologically informed encodings).
- Add cross-validation and hyperparameter tuning for more robust performance estimates.
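
As a sketch of the first extension, switching to overlapping k-mers only requires a step size of 1 (a hypothetical helper, not in the notebook):

```python
def overlapping_kmer_indices(encoded, k=4, base=11):
    """Base-`base` index of every overlapping k-mer (step size 1)."""
    return [
        sum(b * base ** (k - 1 - j) for j, b in enumerate(encoded[i:i + k]))
        for i in range(len(encoded) - k + 1)
    ]
```

A length-5 sequence now yields two 4-mers instead of one, so far fewer patterns are skipped.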

# Issues with Current Gene Classifier

1. **Class Imbalance**
   - Essential genes (`1`) are much rarer than non-essential genes (`0`).
   - Logistic Regression tends to predict the majority class, lowering the F1 score on validation.

2. **Simple Features**
   - Using **non-overlapping 4-mer counts** loses many sequence patterns.
   - Linear combinations of k-mer counts may not capture complex dependencies between nucleotides.

3. **Non-Overlapping k-mers**
   - A step size of 4 skips many overlapping patterns in the DNA sequence.
   - Important motifs or codon patterns might be missed.

4. **Normalization**
   - Raw 4-mer counts vary with sequence length.
   - Longer sequences dominate the feature vectors, potentially biasing the classifier.

5. **Linear Model Limitations**
   - Logistic Regression is a linear classifier.
   - It cannot capture non-linear interactions between k-mers that may be biologically relevant.

6. **Potential Data Leakage**
   - Some sequences in the train/test splits may be very similar or overlapping.
   - This can artificially inflate test accuracy, as seen in the high test F1 compared to validation.

7. **Limited Biological Context**
   - Only nucleotide sequences are considered.
   - Other biological features (gene location, GC content, protein info) that may be predictive of essentiality are ignored.

8. **Sparse Signal**
   - Many 4-mer combinations may never appear, making the feature vectors sparse.
   - Sparse linear models may struggle to generalize with limited data for certain patterns.

9. **Mapping**
   - The mapping assigns `W → 10`, a two-digit value. I did not verify whether `10` is always treated as a single base-11 digit: if k-mer indices were ever built by concatenating digits as strings, `10` would be read as `1` followed by `0`, which would essentially derail the classification. (The positional arithmetic `b_0 · 11^3 + …` handles it correctly, since each `b_i` is a single integer.)
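
Two of the issues above have cheap mitigations worth noting; a hedged sketch (neither is in the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Class imbalance: reweight classes inversely to frequency
# using scikit-learn's built-in class_weight option.
clf = LogisticRegression(solver="saga", max_iter=2000, n_jobs=-1,
                         class_weight="balanced")

# Normalization: turn raw k-mer counts into length-independent frequencies.
def to_frequencies(counts):
    total = counts.sum()
    return counts / total if total > 0 else counts

freqs = to_frequencies(np.array([2.0, 0.0, 2.0]))
```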

## Model Evaluation

The baseline Logistic Regression classifier was evaluated on the validation and test splits using **accuracy** and **F1 score**:

| Split      | Accuracy | F1 Score |
|------------|----------|----------|
| Validation | 0.45     | 0.25     |
| Test       | 0.90     | 0.80     |

> ⚠️ Note:
> - Validation F1 is low due to class imbalance and the simple linear model.
> - The high test metrics may be artificially inflated if some sequences are very similar across splits.
> - This baseline serves as a starting point for further improvements.

## Credits

- **Dataset:** [BacBench Essential Genes DNA Dataset](https://huggingface.co/macwiatrak/bacbench-essential-genes-dna) by Mac Wiatrak et al., hosted on Hugging Face.
- **Libraries & Tools:**
  - [Hugging Face `datasets`](https://huggingface.co/docs/datasets) for data loading and preprocessing
  - [NumPy](https://numpy.org/) for numerical operations
  - [SciPy](https://www.scipy.org/) for scientific computing
  - [scikit-learn](https://scikit-learn.org/) for machine learning models and evaluation metrics
- **Inspired by:** Standard bioinformatics workflows for DNA k-mer feature extraction and baseline classification.
- **Workflow & Model Implementation:** Sharat Doddihal

### Note

This was my first attempt at creating an ML model mostly on my own; AI was used only to help with the debugging process. Overall, I am happy with how this turned out, as it was a great learning experience. There are several fundamental errors that affect the accuracy.