Add baseline essential-gene classifier with main.py, requirements, and README.md #49
Adds a baseline machine learning project to classify bacterial genes as essential or non-essential using DNA sequence data from the macwiatrak/bacbench-essential-genes-dna dataset.
Files Added:
main.py – Implements the logistic regression pipeline with 4-mer count features extracted from each gene's DNA sequence.
requirements.txt – Lists all Python dependencies needed to run the project.
README.md – Project overview, dataset description, preprocessing steps, model evaluation, and usage instructions.
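
For reviewers unfamiliar with k-mer features, here is a minimal sketch of how a DNA sequence can be turned into a fixed-length 4-mer count vector. It is not the code in main.py: the names `kmer_vocab` and `kmer_counts` and the `step` parameter are hypothetical, and the sketch assumes an A/C/G/T alphabet where windows containing any other character (such as `N`) are simply skipped.

```python
from collections import Counter
from itertools import product


def kmer_vocab(k=4, alphabet="ACGT"):
    """Return all possible k-mers in a fixed order (256 entries for k=4)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def kmer_counts(seq, k=4, step=None):
    """Count k-mers in seq and return them as a list aligned with kmer_vocab(k).

    step=k (the default) reads non-overlapping windows;
    step=1 reads overlapping windows.
    """
    if step is None:
        step = k
    seq = seq.upper()
    # Slide a window of width k across the sequence, moving `step` bases each time.
    windows = (seq[i:i + k] for i in range(0, len(seq) - k + 1, step))
    counts = Counter(windows)
    # Windows not in the vocabulary (e.g. containing 'N') are never looked up,
    # so they do not contribute to the feature vector.
    return [counts.get(kmer, 0) for kmer in kmer_vocab(k)]


features = kmer_counts("ACGTACGTAA")
print(len(features), sum(features))
```

Each gene maps to a vector of the same length, which can then be passed to any standard classifier such as logistic regression.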
Notes:
Serves as a simple baseline for essential gene prediction.
First ML project attempt; AI was used only for debugging assistance.
Follow-up improvements could include handling class imbalance, using overlapping k-mers, and trying more advanced models.
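
As one possible starting point for the class-imbalance follow-up, the sketch below computes inverse-frequency class weights. The function name `balanced_class_weights` and the example label counts are hypothetical and not taken from the dataset. The formula, `n_samples / (n_classes * class_count)`, is the same heuristic scikit-learn applies when `class_weight="balanced"` is passed to `LogisticRegression`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Illustrative labels only: 8 non-essential (0) and 2 essential (1) genes.
labels = [0] * 8 + [1] * 2
print(balanced_class_weights(labels))
```

The rarer class receives the larger weight, so misclassifying an essential gene costs more during training than misclassifying a non-essential one.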