joshiyukta16/Machine_Learning_Breast_Cancer_Detection_Model-


🧬 Breast Cancer Detection using Machine Learning (Gene Expression Data)


📌 Overview

This project focuses on classifying breast cancer samples into tumor and normal categories using gene expression data.
The dataset is obtained from the NCBI Gene Expression Omnibus (GEO) with accession ID GSE183947.

Machine learning techniques are applied to analyze high-dimensional RNA-Seq data and build a predictive classification model.


🎯 Objective

To develop an accurate and reliable machine learning model for classifying breast cancer samples based on gene expression profiles.


📊 Dataset Information

  • Source: NCBI GEO
  • Accession ID: GSE183947
  • Data Type: RNA-Seq (FPKM normalized)
  • Total Samples: 60
    • Tumor: 30
    • Normal: 30
  • Features: ~20,000 genes

๐Ÿ› ๏ธ Tools and Technologies

  • IDE: Visual Studio Code
  • Language: Python

📚 Libraries Used

  • NumPy
  • Pandas
  • Matplotlib
  • Scikit-learn
  • SciPy

โš™๏ธ Project Workflow

1๏ธโƒฃ Data Collection

Dataset downloaded from GEO database.

2๏ธโƒฃ Data Preprocessing

  • Dataset transposed (samples โ†’ rows, genes โ†’ columns)
  • Converted to numeric format
  • Missing values handled
  • Logโ‚‚(x + 1) transformation applied
  • Feature scaling using StandardScaler
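
The first four preprocessing steps above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `preprocess_expression`, the toy table, and the choice to fill missing values with 0 are all assumptions (the README does not say how missing values were handled).

```python
import numpy as np
import pandas as pd


def preprocess_expression(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn a genes-by-samples FPKM table into a samples-by-genes log2 matrix."""
    # Transpose so that each row is a sample and each column is a gene.
    expr = raw.T
    # Coerce every entry to a number; entries that cannot be parsed become NaN.
    expr = expr.apply(pd.to_numeric, errors="coerce")
    # Assumption: missing values are replaced with 0 (no measured expression).
    expr = expr.fillna(0.0)
    # log2(x + 1) compresses the dynamic range and maps 0 to 0.
    return np.log2(expr + 1.0)


# Toy genes-by-samples table standing in for the real FPKM file (hypothetical values).
toy = pd.DataFrame(
    {"tumor_1": [0.0, 3.0, "7"], "normal_1": [1.0, None, 15.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)
log_expr = preprocess_expression(toy)
print(log_expr)
```

StandardScaler is left out of this sketch on purpose: fitting it inside a pipeline, as in the Cross Validation section, keeps held-out samples from influencing the scaling statistics.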

3๏ธโƒฃ Label Assignment

  • Tumor and Normal labels assigned
  • Data split into:
    • X (features)
    • y (labels)

4๏ธโƒฃ Train-Test Split

  • 80% training, 20% testing
  • Stratified sampling used
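
A stratified 80/20 split might look like the sketch below. The data are synthetic stand-ins with the same class balance as the dataset (30 tumor, 30 normal); the feature count of 200 and `random_state=42` are arbitrary assumptions, not values taken from the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 60 samples, 200 "genes" (the real data has ~20,000).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
y = np.array(["tumor"] * 30 + ["normal"] * 30)

# stratify=y keeps the tumor/normal ratio identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
print("tumor samples in test set:", int((y_test == "tumor").sum()))
```

With 60 balanced samples, stratification yields 6 tumor and 6 normal samples in the test set, which matches the support column of the classification report.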

5๏ธโƒฃ Feature Selection

  • SelectKBest (ANOVA F-test) used
  • Reduced dimensionality
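
As a rough illustration of ANOVA F-test selection, the sketch below fits `SelectKBest` on synthetic training data only. The value `k=100` follows the pipeline description later in this README; the synthetic matrix and the signal planted in the first 10 columns are assumptions made so the example has something to find.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic training set: 48 samples, 500 "genes", balanced labels (1 = tumor).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(48, 500))
y_train = np.array([1] * 24 + [0] * 24)

# Plant a class difference in the first 10 genes so they carry real signal.
X_train[y_train == 1, :10] += 3.0

# Score each gene with a one-way ANOVA F-test and keep the 100 best.
selector = SelectKBest(score_func=f_classif, k=100)
X_train_sel = selector.fit_transform(X_train, y_train)

# Column indices of the genes that were kept.
selected = selector.get_support(indices=True)
print(X_train_sel.shape)
```

Fitting the selector on the training samples alone, then calling `selector.transform` on the test samples, avoids choosing genes with knowledge of the test labels.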

6๏ธโƒฃ Model Building

  • Algorithm: Support Vector Machine (SVM)
  • Kernel: Linear

7๏ธโƒฃ Model Evaluation

  • Accuracy Score
  • Confusion Matrix
  • Classification Report
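
The model building and evaluation steps can be sketched together. The example below trains a linear-kernel SVM on synthetic data and computes the three metrics listed above; the data, the planted signal, and the seed are assumptions, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 60 samples, 50 "genes", with signal in the first 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.array(["tumor"] * 30 + ["normal"] * 30)
X[y == "tumor", :5] += 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Linear-kernel support vector classifier with default regularization.
model = SVC(kernel="linear")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_test, y_pred, labels=["normal", "tumor"])
report = classification_report(y_test, y_pred)

print("accuracy:", acc)
print(cm)
print(report)
```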

📈 Results

  • Accuracy of 1.00 on the held-out test set (12 samples: 6 tumor, 6 normal)
  • Feature selection improved performance
  • Precision, recall, and F1-score of 1.00 for both classes

📊 Confusion Matrix

(Figure: Confusion Matrix)


📋 Classification Report

    Classification Report:
                  precision    recall  f1-score   support

          normal       1.00      1.00      1.00         6
           tumor       1.00      1.00      1.00         6

        accuracy                           1.00        12
       macro avg       1.00      1.00      1.00        12
    weighted avg       1.00      1.00      1.00        12


Interpretation

The SVM model achieved 100% accuracy on the test dataset, correctly classifying all tumor and normal samples. However, due to the high dimensionality of gene expression data (~20,000 features) and relatively small sample size (60 samples), the model may be prone to overfitting. Therefore, additional validation techniques such as cross-validation are required to ensure robustness and generalizability of the model.


📈 ROC Curve

The Receiver Operating Characteristic (ROC) curve was used to evaluate the model’s ability to distinguish between tumor and normal samples.

  • AUC Score: 1.0

(Figure: ROC Curve)

✅ Results from this Project

The ROC curve is positioned near the top-left corner, indicating:

  • High sensitivity (correct tumor detection)
  • Low false positive rate (minimal misclassification of normal samples)
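
One way to obtain the ROC curve and AUC for a linear SVM is to rank test samples by their signed distance to the decision boundary. The sketch below does this on synthetic data; the integer label encoding (1 = tumor, 0 = normal) and the data themselves are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with integer labels: 1 = tumor, 0 = normal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.array([1] * 30 + [0] * 30)
X[y == 1, :5] += 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = SVC(kernel="linear").fit(X_train, y_train)

# Signed distance to the hyperplane; larger values mean "more tumor-like".
scores = model.decision_function(X_test)

# False positive rate and true positive rate at every score threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)
print("AUC:", auc)
```

Plotting `tpr` against `fpr` with Matplotlib gives the curve itself; a curve that hugs the top-left corner corresponds to an AUC close to 1.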

📊 PCA Visualization

Principal Component Analysis (PCA) was applied to project the high-dimensional gene expression data into 2D space for visualization.

(Figure: PCA Plot)

🔍 PCA Interpretation

  • Clear separation between tumor and normal samples
  • Indicates strong feature distinction in the dataset
  • Consistent with the model’s high accuracy
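
A 2D PCA projection of this kind can be sketched as below. The data are synthetic, and scaling before PCA is an assumption (the README does not state whether PCA was run on scaled or unscaled values).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 60 samples, 300 "genes", with signal in the first 20.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 300))
y = np.array([1] * 30 + [0] * 30)  # 1 = tumor, 0 = normal
X[y == 1, :20] += 2.0

# Standardize each gene, then project onto the two leading components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

# Fraction of total variance captured by PC1 and PC2.
print(pca.explained_variance_ratio_)
```

A scatter plot of `coords[:, 0]` against `coords[:, 1]`, colored by `y`, reproduces the type of figure described above. PCA is unsupervised, so visible separation suggests the class difference is a dominant source of variance rather than something the classifier had to search for.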

Cross Validation

To evaluate the robustness and generalizability of the model, 5-fold cross-validation was performed using a machine learning pipeline that includes feature scaling, feature selection, and classification.

🔧 Pipeline Components

  • Feature Scaling: StandardScaler
  • Feature Selection: SelectKBest (top 100 genes using ANOVA F-test)
  • Classifier: Support Vector Machine (SVM, linear kernel)
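
The three components above map directly onto a scikit-learn `Pipeline`. The following sketch runs 5-fold stratified cross-validation on synthetic data; the step names, the shuffling seed, and the data are assumptions, and the printed scores do not reflect the real dataset.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in: 60 samples, 500 "genes", with signal in the first 10.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))
y = np.array([1] * 30 + [0] * 30)  # 1 = tumor, 0 = normal
X[y == 1, :10] += 2.0

# Scaling and gene selection are refit on the training part of every fold,
# so the held-out fold never influences which genes are chosen.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=100)),
        ("svm", SVC(kernel="linear")),
    ]
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

Wrapping all steps in one pipeline matters most with data like this, where genes far outnumber samples: selecting features on the full dataset before cross-validation would leak label information into every fold.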

๐Ÿ Conclusion

In this project, a machine learning pipeline was developed to classify breast cancer samples using high-dimensional gene expression data. The dataset was preprocessed using log transformation and feature scaling to improve data quality and model performance. Feature selection helped reduce dimensionality and retain the most informative genes.

A Support Vector Machine (SVM) model was trained and evaluated, achieving strong performance in distinguishing tumor and normal samples. The results demonstrate the effectiveness of machine learning techniques in analyzing complex biological data.

This project highlights the potential of computational approaches in bioinformatics for disease classification and can be extended for biomarker discovery and precision medicine applications.

🔮 Future Work

The model can be further validated using independent datasets such as:

  • GSE42568 (tumor vs normal classification)
  • GSE70947 (RNA-seq based validation)
  • GSE2034 (cross-platform validation)

This would help assess the robustness and generalizability of the model across different biological datasets.


๐Ÿ“ Project Structure

Machine_Learning_Project/ โ”‚ โ”œโ”€โ”€ data/ โ”‚ โ””โ”€โ”€ GSE183947_fpkm.csv โ”‚ |โ”€โ”€ images/ | โ””โ”€โ”€ Confusion_Matrix_Output_Image.png โ”œโ”€โ”€ notebooks/ โ”‚ โ””โ”€โ”€ Machine_Learning_Project.ipynb โ”‚ โ”œโ”€โ”€ src/ โ”‚ โ””โ”€โ”€ model.py โ”‚ โ”œโ”€โ”€ README.md โ””โ”€โ”€ requirements.txt


โ–ถ๏ธ How to Run

git clone https://github.com/joshiyukta16/Machine_Learning_Breast_Cancer_Detection_Model-.git cd Breast-Cancer-Detection-ML pip install -r requirements.txt jupyter notebook


🧠 Applications

  • Breast cancer detection
  • Biomarker discovery
  • Precision medicine
  • Bioinformatics research

🔮 Future Scope

  • Apply deep learning models
  • Perform feature importance analysis
  • Integrate multi-omics data
  • Deploy as a web application

๐Ÿ‘ฉโ€๐Ÿ’ป Author

Yukta Joshi B.Tech Bioinformatics (AI & ML)

๐Ÿ™ Acknowledgment

Data provided by NCBI GEO database.

โญ If you like this project, give it a star!!


About

Developed a machine learning-based system for breast cancer detection using gene expression data from the GEO dataset (GSE183947). The project involves data preprocessing, feature handling, model training using scikit-learn algorithms, and performance evaluation to classify samples accurately as cancerous or non-cancerous.
