Author: Pranava Upparlapalli
Date: June 2025
This project analyzes RNA-Seq gene expression data across five major cancer types:
- BRCA – Breast invasive carcinoma
- KIRC – Kidney renal clear cell carcinoma
- LUAD – Lung adenocarcinoma
- COAD – Colon adenocarcinoma
- PRAD – Prostate adenocarcinoma
Using the DESeq2 pipeline in R, we performed differential expression analysis, PCA, and clustering, and visualized key patterns to identify cancer-specific gene signatures.
CANCER-RNA-SEQ-ANALYSIS/
│
├── src.R # Main code
├── plots/ # All generated plots
│ ├── heatmap_expression.png
│ ├── ma_plot.png
│ ├── sample_distribution.png
│ ├── silhouette_plot.png
│ └── volcano_plot.png
│
├── Notebook/ # RMarkdown and Word version of the analysis
│ └── Project V2.Rmd
│
├── index.html # HTML output of the analysis
├── README.md # This file
└── LICENSE # License file
The RNA-Seq count data used in this project was sourced from Kaggle. Due to its large size, the data files (data.csv and labels.csv) are not included in this GitHub repository.
You can download the dataset from Kaggle at:
🔗 Kaggle Dataset – RNA-Seq Expression Data for Cancer Types
After downloading, place the files inside the data/ directory as follows:
CANCER-RNA-SEQ-ANALYSIS/
└── data/
    ├── data.csv
    └── labels.csv
Data Preprocessing
- Merged data.csv and labels.csv.
- Removed missing values and ensured valid class labels.
Differential Expression (DESeq2)
- Used DESeq2 to normalize counts, fit the model, and identify significantly differentially expressed genes (adjusted p-value, padj < 0.05).
Visualization
- Volcano plot (fold change vs. significance) and MA plot (fold change vs. mean expression).
- PCA to visualize clustering of samples.
- Heatmap of top 50 most variable genes.
- K-means clustering and silhouette plot for unsupervised pattern discovery.
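The preprocessing and DESeq2 steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src.R: the helper names prepare_data and run_deseq, the sample_id column, and the Class column are assumptions about the file layout, and the toy data frames stand in for data.csv and labels.csv. run_deseq assumes raw integer counts and is only defined here, not called, because it needs DESeq2 and the real data.

```r
# Hypothetical helper: merge expression values with class labels and clean them.
# Assumes both tables share a sample-ID column (the name "sample_id" is illustrative).
prepare_data <- function(expr, labels, id_col = "sample_id", class_col = "Class",
                         valid_classes = c("BRCA", "KIRC", "LUAD", "COAD", "PRAD")) {
  merged <- merge(labels, expr, by = id_col)                 # inner join on sample ID
  merged <- merged[complete.cases(merged), , drop = FALSE]   # drop rows with missing values
  merged <- merged[merged[[class_col]] %in% valid_classes, , drop = FALSE]
  merged[[class_col]] <- factor(merged[[class_col]])         # keep only observed classes as levels
  rownames(merged) <- merged[[id_col]]
  merged
}

# Hypothetical helper: run DESeq2 for one pairwise contrast.
# Assumes the gene columns hold raw integer counts.
run_deseq <- function(merged, id_col = "sample_id", class_col = "Class",
                      contrast_levels = c("BRCA", "KIRC"), alpha = 0.05) {
  if (!requireNamespace("DESeq2", quietly = TRUE)) stop("DESeq2 is not installed")
  gene_cols <- setdiff(names(merged), c(id_col, class_col))
  # DESeq2 expects genes in rows and samples in columns
  counts <- t(as.matrix(merged[, gene_cols, drop = FALSE]))
  storage.mode(counts) <- "integer"
  col_data <- data.frame(Class = merged[[class_col]], row.names = rownames(merged))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts, colData = col_data,
                                        design = ~ Class)
  dds <- DESeq2::DESeq(dds)
  res <- as.data.frame(DESeq2::results(dds, contrast = c("Class", contrast_levels),
                                       alpha = alpha))
  # Keep genes that pass the adjusted p-value threshold
  res[!is.na(res$padj) & res$padj < alpha, , drop = FALSE]
}

# Toy stand-ins for data.csv and labels.csv (made-up values)
expr <- data.frame(sample_id = c("s1", "s2", "s3", "s4"),
                   gene_0 = c(10, 0, 5, NA),
                   gene_1 = c(3, 8, 2, 1))
labels <- data.frame(sample_id = c("s1", "s2", "s3", "s4"),
                     Class = c("BRCA", "KIRC", "XXXX", "LUAD"))

clean <- prepare_data(expr, labels)
print(clean)
```

In the toy example, sample s3 is dropped because its label is not one of the five cancer types, and s4 is dropped because of a missing value. With the real data, a call such as run_deseq(clean) would return the genes with padj < 0.05 for the chosen pair of cancer types.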
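- Sample distribution (plots/sample_distribution.png): shows the distribution of samples across cancer types.
- Volcano plot (plots/volcano_plot.png): highlights significantly up- and down-regulated genes.
- MA plot (plots/ma_plot.png): visualizes mean expression vs. fold change.
- Silhouette plot (plots/silhouette_plot.png): assesses clustering performance using PCA + K-means.
- Heatmap (plots/heatmap_expression.png): displays patterns of gene expression across cancer types.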
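The PCA, K-means, and silhouette steps can be illustrated on simulated data. The sketch below is a stand-in under stated assumptions: it uses base R's prcomp and kmeans plus cluster::silhouette rather than factoextra, and the matrix is random noise with three artificial groups, not the project's expression data.

```r
set.seed(42)

# Simulated stand-in for a normalized expression matrix:
# 30 samples (rows) x 20 genes (columns), three artificial groups
groups <- rep(c("A", "B", "C"), each = 10)
shift <- unname(c(A = 0, B = 6, C = 12)[groups])   # per-sample offset by group
mat <- matrix(rnorm(30 * 20), nrow = 30) + shift   # offset recycles down the rows

# PCA on centered, scaled genes; keep the first two components
pca <- prcomp(mat, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# K-means on the PCA scores (k = 3 matches the simulated groups)
km <- kmeans(scores, centers = 3, nstart = 25)

# Silhouette widths measure how well each sample fits its assigned cluster
sil <- cluster::silhouette(km$cluster, dist(scores))
mean_sil <- mean(sil[, "sil_width"])

# Cross-tabulate clusters against the known groups
agreement <- table(cluster = km$cluster, group = groups)
print(agreement)
print(mean_sil)
```

An average silhouette width close to 1 suggests compact, well-separated clusters, while values near 0 suggest overlapping ones. The cross-tabulation is one simple way to check how closely the clusters follow the known labels.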
- Thousands of genes are differentially expressed between cancers.
- PCA showed distinct clustering of cancer types.
- K-means clustering aligned well with known classes.
- Some cancer types share expression signatures, indicating potential shared biology.
- R (v4.3.2)
- DESeq2
- ggplot2
- pheatmap
- factoextra
- tidyverse
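Before running the analysis, it may help to check that the packages listed above are available. The following is a small sketch that only reports what is missing and installs nothing. DESeq2 is distributed through Bioconductor (for example via BiocManager::install("DESeq2")); the other packages are on CRAN.

```r
required <- c("DESeq2", "ggplot2", "pheatmap", "factoextra", "tidyverse")

# TRUE for each package that can be loaded in this R installation
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) == 0) {
  message("All required packages are available.")
} else {
  message("Missing packages: ", paste(not_installed, collapse = ", "))
}
```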
- Clone the repository:
  git clone https://github.com/yourusername/CANCER-RNA-SEQ-ANALYSIS.git
  cd CANCER-RNA-SEQ-ANALYSIS
- Download the data from Kaggle and place it under data/.
- Open Project V2.Rmd in RStudio.
- Knit to HTML to generate index.html.
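As an alternative to the Knit button, the report can probably be rendered from the R console. This is a sketch under assumptions: knit_report is a hypothetical helper, the Rmd path is taken from the directory tree above, and the output location assumes the command is run from the repository root.

```r
# Hypothetical helper: render the notebook to index.html without RStudio's Knit button.
knit_report <- function(rmd_path = file.path("Notebook", "Project V2.Rmd"),
                        out_dir = ".") {
  if (!file.exists(rmd_path)) {
    message("Rmd file not found: ", rmd_path)
    return(invisible(NULL))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    message("Package 'rmarkdown' is not installed")
    return(invisible(NULL))
  }
  # Writes index.html into out_dir and returns the output path
  rmarkdown::render(rmd_path,
                    output_format = "html_document",
                    output_file = "index.html",
                    output_dir = out_dir)
}

# Usage, from the repository root:
# knit_report()
```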
This project is licensed under the MIT License.
See LICENSE for details.
For questions, feedback, or collaborations:
📧 pxu.bioinfo@gmail.com
🔗 LinkedIn
🌐 Portfolio




