diff --git a/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb
deleted file mode 100644
index 3aa20548..00000000
--- a/code/molecular_phenotypes/QC/kellis_atacseq_preprocessing.ipynb
+++ /dev/null
@@ -1,3188 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e",
- "metadata": {},
- "source": [
- "# Kellis Lab Single-nuclei ATAC-seq Preprocessing Pipeline\n",
- "\n",
- "---\n",
- "\n",
- "### Overview\n",
- "\n",
- "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) data from the Kellis lab for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies. It processes pseudobulk peak count data across six major brain cell types with flexible workflow options depending on your analysis goals.\n",
- "\n",
- "**Pipeline Purpose:**\n",
- "- Transform raw pseudobulk peak counts into analysis-ready formats\n",
- "- Remove technical confounders and optionally biological covariates\n",
- "- Generate QTL-ready phenotype files or region-specific datasets\n",
- "\n",
- "**Supported Cell Types:**\n",
- "- **Mic** - Microglia\n",
- "- **Astro** - Astrocytes\n",
- "- **Oligo** - Oligodendrocytes\n",
- "- **Exc** - Excitatory neurons\n",
- "- **Inh** - Inhibitory neurons\n",
- "- **OPC** - Oligodendrocyte precursor cells\n",
- "\n",
- "---\n",
- "\n",
- "### Workflow Structure\n",
- "\n",
 - "This pipeline consists of two main sequential steps, plus an alternative standalone pipeline for datasets with severe batch effects.\n",
- "\n",
 - "#### Step 1: Pseudobulk QC with batch as a covariate\n",
- "\n",
- "**Option A: Remove Biological Covariates**\n",
 - "- Regresses out demographic and cohort variables (msex, age_death, pmi, study)\n",
- "- **Use when:** You want to identify genetic effects independent of sex/age\n",
- "- **Model includes:** technical covariates + sequencingBatch + msex + age_death + pmi + study\n",
- "\n",
- "**Option B: Preserve Biological Covariates**\n",
- "- Regresses out only non-demographic variables (pmi, study)\n",
- "- **Use when:** You want to study sex/age effects or preserve biological heterogeneity\n",
- "- **Model includes:** technical covariates + sequencingBatch + pmi + study (NO msex, age_death)\n",
- "\n",
- "#### Step 2: Format Output\n",
- "\n",
- "**Format A: Phenotype Reformatting**\n",
- "- Converts residuals to genome-wide BED format\n",
- "- **Input:** `{celltype}_residuals.txt` (from Step 1 Option A or B)\n",
- "- **Use for:** FastQTL, TensorQTL, MatrixEQTL (genome-wide caQTL mapping)\n",
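 - "\n",
 - "A sketch of the expected BED layout (column widths, sample IDs, and values here are illustrative, not taken from the pipeline output):\n",
 - "\n",
 - "```\n",
 - "#chr   start   end     ID                  SAMPLE_1  SAMPLE_2\n",
 - "chr1   181293  181565  chr1-181293-181565  0.52      -0.13\n",
 - "```\n",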
- "\n",
- "**Format B: Region Peak Filtering**\n",
- "- Filters to specific genomic regions (chr7: 28-28.3 Mb, chr11: 85.05-86.2 Mb)\n",
- "- **Input:** `{celltype}_filtered_raw_counts.txt` (only from Step 1 Option B)\n",
- "- **Use for:** Hypothesis-driven locus analysis, region-specific comparisons\n",
- "\n",
- "#### Alternative Pseudobulk Pipeline: Explicit Batch Correction (Multiome Dataset)\n",
- "- Complete standalone pipeline with explicit batch correction using limma's `removeBatchEffect` or ComBat-seq\n",
 - "- **Input:** QC'ed Seurat object `{celltype}_qced.rds` and pseudobulk peak counts `{celltype}.rds`\n",
- "- **Use when:** Strong batch effects visible in PCA/t-SNE, many small fragmented batches, batch confounds with biology\n",
- "- **Note:** From different dataset (multiome) but demonstrates alternative batch correction approach\n",
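 - "\n",
 - "A minimal sketch of the explicit-correction step (assuming `logCPM` is a log2-CPM matrix and `batch` holds the per-sample batch labels; both names are illustrative):\n",
 - "\n",
 - "```r\n",
 - "library(limma)\n",
 - "# remove additive batch effects from the log-scale matrix\n",
 - "corrected <- removeBatchEffect(logCPM, batch = batch)\n",
 - "```\n",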
- "\n",
- "---\n",
- "\n",
- "### Key Features:\n",
- "- Blacklist region filtering (ENCODE hg38)\n",
- "- Technical QC covariate adjustment (TSS enrichment, nucleosome signal, sequencing depth)\n",
- "- TMM normalization and expression filtering\n",
- "- Log-transformation of count-based covariates\n",
- "- Flexible batch handling (covariate vs explicit correction)\n",
- "\n",
- "#### Pipeline Outputs:\n",
- "\n",
- "**From Step 1:**\n",
- "- `{celltype}_residuals.txt`: Covariate-adjusted residuals (log2-CPM scale)\n",
- "- `{celltype}_results.rds`: Complete analysis results\n",
- "- `{celltype}_summary.txt`: QC summary and filtering statistics\n",
- "- `{celltype}_variable_explanation.txt`: Covariate documentation (Option A only)\n",
- "- `{celltype}_filtered_raw_counts.txt`: TMM-normalized counts (Option B only)\n",
- "\n",
- "**From Step 2, Format A:**\n",
- "- `{celltype}_kellis_snatac_phenotype.bed.gz`: Genome-wide QTL-ready BED file\n",
- "\n",
- "**From Step 2, Format B:**\n",
- "- `{celltype}_filtered_regions_of_interest.txt`: Region-specific count data (chr7, chr11)\n",
- "- `{celltype}_filtered_regions_of_interest_summary.txt`: Peak metadata and statistics\n",
- "\n",
 - "**From the Alternative Pseudobulk Pipeline (Multiome with Batch Correction):**\n",
- "- `{celltype}_residuals.txt`: Batch-corrected residuals (log2-CPM scale)\n",
- "- `{celltype}_results.rds`: Complete results with batch_adjusted_counts\n",
- "\n",
- "---\n",
- "\n",
- "### Input Files\n",
- "Input files needed to run this pipeline can be downloaded [here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5476354a-a9b1-45c4-bd41-010551ca96f1",
- "metadata": {},
- "source": [
- "#### Before you start, let's set up your working path."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "id": "955bda26-9f91-41bb-adb7-c09fbf361c5e",
- "metadata": {},
- "outputs": [],
- "source": [
- "input_dir <- \"/restricted/projectnb/xqtl/jaempawi/atac_seq/kellis_data\" #set your input directory\n",
- "output_dir <- \"/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis\" #set your output directory"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5540a4da-843a-4789-8123-47911cf519c5",
- "metadata": {},
- "source": [
 - "## Step 1: Pseudobulk QC with batch as a covariate\n",
- "\n",
- "This preprocessing workflow offers **two approaches** depending on whether you want to regress out biological covariates:\n",
- "\n",
- "---\n",
 - "### Option A: Pseudobulk QC with Biological Covariates Regressed Out (Standard QTL Analysis)\n",
- "\n",
- "Use this option when you want residuals adjusted for all technical AND biological covariates (sex, age, PMI).\n",
- "\n",
- "**Input:**\n",
- "- Pseudobulk peak counts (in 1_files_with_sampleid folder): `pseudobulk_peaks_counts{celltype}_50nuc.csv.gz`\n",
- "- Cell metadata (in 1_files_with_sampleid folder): `metadata_{celltype}_50nuc.csv`\n",
- "- Sample covariates: `rosmap_cov.txt`\n",
- "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n",
- "\n",
- "**Process:**\n",
- "1. Loads pseudobulk peak count matrix and metadata per cell type\n",
- "2. Calculates technical QC metrics per sample:\n",
- " - `log_n_nuclei`: Log-transformed number of nuclei\n",
- " - `med_nucleosome_signal`: Median nucleosome signal\n",
- " - `med_tss_enrich`: Median TSS enrichment score\n",
- " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n",
- " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n",
- "3. Filters blacklisted genomic regions using `foverlaps()`\n",
- "4. Merges with demographic covariates (msex, age_death, pmi, study)\n",
- "5. Applies expression filtering with `filterByExpr()`:\n",
- " - `min.count = 2`: Minimum 2 reads in at least one sample\n",
- " - `min.total.count = 15`: Minimum 15 total reads across all samples\n",
- " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n",
- "6. TMM normalization with `calcNormFactors()`\n",
- "7. Handles sequencingBatch as a covariate (not batch-corrected)\n",
- "8. Fits linear model using `voom()` and `lmFit()`:\n",
- "\n",
- " ```r\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + msex + age_death + pmi + study \n",
- " ```\n",
- "\n",
- "9. Calculates residuals adjusted for ALL covariates (technical + biological)\n",
- "10. Computes final adjusted data using predictOffset(): offset + residuals\n",
- "- `offset`: Predicted expression at median/reference covariate values\n",
- "- `residuals`: Unexplained variation after removing all covariate effects\n",
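 - "\n",
 - "As a sketch (assuming `v` is the voom object and `fit` the lmFit result), the adjusted matrix is:\n",
 - "\n",
 - "```r\n",
 - "# adjusted accessibility = predicted value at reference covariates + residual\n",
 - "offsets <- predictOffset(fit)\n",
 - "adjusted <- offsets + residuals(fit, y = v)\n",
 - "```\n",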
- "\n",
- "**Output:** `output/2_residuals/{celltype}/`\n",
- "\n",
- "- `{celltype}_residuals.txt`: Final covariate-adjusted peak accessibility (log2-CPM scale)\n",
- "- `{celltype}_results.rds`: Complete analysis results (DGEList, fit object, design matrix)\n",
- "- `{celltype}_summary.txt`: Filtering statistics and QC summary\n",
- "- `{celltype}_variable_explanation.txt`: Detailed covariate documentation\n",
- "\n",
- "**Key Variables Regressed Out**:\n",
- "\n",
- "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n",
- "- Biological: sex (msex), age at death (age_death), post-mortem interval (pmi), study cohort\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a58dfe97-3e57-4ce9-b8bb-009aec26b1a5",
- "metadata": {},
- "source": [
 - "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "77deb405-f916-42e5-a74a-c3569d587cbf",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n",
- "Attaching package: ‘dplyr’\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:data.table’:\n",
- "\n",
- " between, first, last\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:stats’:\n",
- "\n",
- " filter, lag\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:base’:\n",
- "\n",
- " intersect, setdiff, setequal, union\n",
- "\n",
- "\n",
- "Loading required package: limma\n",
- "\n"
- ]
- }
- ],
- "source": [
- "library(data.table)\n",
- "library(stringr)\n",
- "library(dplyr)\n",
- "library(edgeR)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "5f5d8a77-91c8-4808-94cf-bc576378556c",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing celltype: Astro \n"
- ]
- }
- ],
- "source": [
- "# Set cell type and create output directory\n",
 - "celltype <- \"Astro\" # Change this for different cell types, e.g. Exc, Inh, Mic, Oligo, OPC\n",
- "cat(\"Processing celltype:\", celltype, \"\\n\")\n",
- "\n",
 - "out_dir <- file.path(output_dir, \"2_residuals\", celltype)\n",
- "dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3ed15afb-f621-4dd3-be00-c15dd736835b",
- "metadata": {},
- "source": [
- "#### Create predictOffset function "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "823abb05-f105-4f40-918a-5c470a04ffb9",
- "metadata": {},
- "outputs": [],
- "source": [
- "predictOffset <- function(fit) {\n",
- " # Define which variables are factors and which are continuous\n",
- " usedFactors <- c(\"sequencingBatch\", \"study\") \n",
- " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n",
- " \"log_total_unique_peaks\", \"msex\", \"age_death\", \"pmi\")\n",
- " \n",
- " # Filter to only use variables actually in the design matrix\n",
- " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " \n",
- " # Get indices for factor and continuous variables\n",
- " facInd <- unlist(lapply(as.list(usedFactors), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " contInd <- unlist(lapply(as.list(usedContinuous), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " \n",
- " # Add the intercept\n",
- " all_indices <- c(1, facInd, contInd)\n",
- " \n",
- " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n",
- " all_indices_sorted <- sort(unique(all_indices))\n",
- " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n",
- " \n",
- " # Create new design matrix with median values\n",
- " D <- fit$design\n",
- " D[, facInd] <- 0 # Set all factor levels to reference level\n",
- " \n",
- " # For continuous variables, set to median value\n",
- " if (length(contInd) > 0) {\n",
- " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n",
- " for (i in 1:length(medContVals)) {\n",
- " D[, names(medContVals)[i]] <- medContVals[i]\n",
- " }\n",
- " }\n",
- " \n",
- " # Calculate offsets\n",
- " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n",
- " offsets <- apply(coefficients(fit), 1, function(c) {\n",
- " return(D %*% c)\n",
- " })\n",
- " offsets <- t(offsets)\n",
- " colnames(offsets) <- rownames(fit$design)\n",
- " \n",
- " return(offsets)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5d79ceae-e255-4a39-a288-12626481b0ac",
- "metadata": {},
- "source": [
- "#### Load input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "46927164-2761-490f-afc2-86181e917a49",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Loaded metadata with 82 samples and peak data with 531489 peaks\n"
- ]
- },
- {
- "data": {
- "text/latex": [
- "A data.table: 6 × 9\n",
- "\\begin{tabular}{lllllllll}\n",
- " individualID & sampleid & sequencingBatch & main\\_cell\\_type & avg.pct.read.in.peak.ct & med.nucleosome\\_signal.ct & med.n\\_tot\\_fragment.ct & med.tss.enrich.ct & n.nuclei\\\\\n",
- " & & & & & & & & \\\\\n",
- "\\hline\n",
- "\t R1042011 & SM-CJK5G & 191203Kel & Astro & 0.3939189 & 0.7894187 & 10923.00 & 0.3762771 & 409\\\\\n",
- "\t R1154454 & SM-CTDQN & 191203Kel & Astro & 0.2557693 & 0.7786428 & 23144.00 & 0.2516681 & 144\\\\\n",
- "\t R1213305 & SM-CJEIE & 191203Kel & Astro & 0.3277831 & 0.8077042 & 16094.78 & 0.2896403 & 630\\\\\n",
- "\t R1407047 & SM-CTEM5 & 191203Kel & Astro & 0.3361316 & 0.8275109 & 59451.00 & 0.3266785 & 189\\\\\n",
- "\t R1609849 & SM-CJJ27 & 191203Kel & Astro & 0.2857020 & 0.7868788 & 7522.00 & 0.2688059 & 186\\\\\n",
- "\t R1617674 & SM-CJIWT & 191203Kel & Astro & 0.1934420 & 0.7879911 & 33724.00 & 0.1702281 & 141\\\\\n",
- "\\end{tabular}\n"
- ],
- "text/markdown": [
- "\n",
- "A data.table: 6 × 9\n",
- "\n",
- "| individualID <chr> | sampleid <chr> | sequencingBatch <chr> | main_cell_type <chr> | avg.pct.read.in.peak.ct <dbl> | med.nucleosome_signal.ct <dbl> | med.n_tot_fragment.ct <dbl> | med.tss.enrich.ct <dbl> | n.nuclei <int> |\n",
- "|---|---|---|---|---|---|---|---|---|\n",
- "| R1042011 | SM-CJK5G | 191203Kel | Astro | 0.3939189 | 0.7894187 | 10923.00 | 0.3762771 | 409 |\n",
- "| R1154454 | SM-CTDQN | 191203Kel | Astro | 0.2557693 | 0.7786428 | 23144.00 | 0.2516681 | 144 |\n",
- "| R1213305 | SM-CJEIE | 191203Kel | Astro | 0.3277831 | 0.8077042 | 16094.78 | 0.2896403 | 630 |\n",
- "| R1407047 | SM-CTEM5 | 191203Kel | Astro | 0.3361316 | 0.8275109 | 59451.00 | 0.3266785 | 189 |\n",
- "| R1609849 | SM-CJJ27 | 191203Kel | Astro | 0.2857020 | 0.7868788 | 7522.00 | 0.2688059 | 186 |\n",
- "| R1617674 | SM-CJIWT | 191203Kel | Astro | 0.1934420 | 0.7879911 | 33724.00 | 0.1702281 | 141 |\n",
- "\n"
- ],
- "text/plain": [
- " individualID sampleid sequencingBatch main_cell_type avg.pct.read.in.peak.ct\n",
- "1 R1042011 SM-CJK5G 191203Kel Astro 0.3939189 \n",
- "2 R1154454 SM-CTDQN 191203Kel Astro 0.2557693 \n",
- "3 R1213305 SM-CJEIE 191203Kel Astro 0.3277831 \n",
- "4 R1407047 SM-CTEM5 191203Kel Astro 0.3361316 \n",
- "5 R1609849 SM-CJJ27 191203Kel Astro 0.2857020 \n",
- "6 R1617674 SM-CJIWT 191203Kel Astro 0.1934420 \n",
- " med.nucleosome_signal.ct med.n_tot_fragment.ct med.tss.enrich.ct n.nuclei\n",
- "1 0.7894187 10923.00 0.3762771 409 \n",
- "2 0.7786428 23144.00 0.2516681 144 \n",
- "3 0.8077042 16094.78 0.2896403 630 \n",
- "4 0.8275109 59451.00 0.3266785 189 \n",
- "5 0.7868788 7522.00 0.2688059 186 \n",
- "6 0.7879911 33724.00 0.1702281 141 "
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "text/latex": [
- "A data.table: 6 × 82\n",
- "\\begin{tabular}{lllllllllllllllllllll}\n",
- " SM-CJK5G & SM-CTDQN & SM-CJEIE & SM-CTEM5 & SM-CJJ27 & SM-CJIWT & SM-CTEEG & ROS11430815 & SM-CJGLG & SM-CJIXU & ⋯ & R9395022 & SM-CJIX5 & SM-CJEGU & SM-CJIYH & SM-CJGMS & SM-CTEGU & SM-CTEFJ & SM-CJEJU & SM-CTEGT & SM-CJIZE\\\\\n",
- " & & & & & & & & & & ⋯ & & & & & & & & & & \\\\\n",
- "\\hline\n",
- "\t 4 & 0 & 0 & 2 & 0 & 0 & 0 & 2 & 2 & 0 & ⋯ & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0\\\\\n",
- "\t 20 & 12 & 45 & 36 & 1 & 7 & 9 & 30 & 16 & 5 & ⋯ & 13 & 5 & 7 & 10 & 3 & 6 & 11 & 10 & 18 & 5\\\\\n",
- "\t 8 & 1 & 3 & 6 & 0 & 6 & 11 & 1 & 1 & 3 & ⋯ & 5 & 2 & 0 & 1 & 3 & 0 & 2 & 3 & 5 & 4\\\\\n",
- "\t 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & ⋯ & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0\\\\\n",
- "\t 15 & 4 & 15 & 9 & 2 & 3 & 16 & 8 & 3 & 5 & ⋯ & 5 & 6 & 5 & 5 & 5 & 2 & 6 & 12 & 7 & 7\\\\\n",
- "\t 33 & 4 & 55 & 70 & 4 & 10 & 26 & 21 & 22 & 5 & ⋯ & 30 & 15 & 8 & 21 & 5 & 20 & 35 & 26 & 48 & 6\\\\\n",
- "\\end{tabular}\n"
- ],
- "text/markdown": [
- "\n",
- "A data.table: 6 × 82\n",
- "\n",
- "| SM-CJK5G <int> | SM-CTDQN <int> | SM-CJEIE <int> | SM-CTEM5 <int> | SM-CJJ27 <int> | SM-CJIWT <int> | SM-CTEEG <int> | ROS11430815 <int> | SM-CJGLG <int> | SM-CJIXU <int> | ⋯ ⋯ | R9395022 <int> | SM-CJIX5 <int> | SM-CJEGU <int> | SM-CJIYH <int> | SM-CJGMS <int> | SM-CTEGU <int> | SM-CTEFJ <int> | SM-CJEJU <int> | SM-CTEGT <int> | SM-CJIZE <int> |\n",
- "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
- "| 4 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 2 | 0 | ⋯ | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |\n",
- "| 20 | 12 | 45 | 36 | 1 | 7 | 9 | 30 | 16 | 5 | ⋯ | 13 | 5 | 7 | 10 | 3 | 6 | 11 | 10 | 18 | 5 |\n",
- "| 8 | 1 | 3 | 6 | 0 | 6 | 11 | 1 | 1 | 3 | ⋯ | 5 | 2 | 0 | 1 | 3 | 0 | 2 | 3 | 5 | 4 |\n",
- "| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |\n",
- "| 15 | 4 | 15 | 9 | 2 | 3 | 16 | 8 | 3 | 5 | ⋯ | 5 | 6 | 5 | 5 | 5 | 2 | 6 | 12 | 7 | 7 |\n",
- "| 33 | 4 | 55 | 70 | 4 | 10 | 26 | 21 | 22 | 5 | ⋯ | 30 | 15 | 8 | 21 | 5 | 20 | 35 | 26 | 48 | 6 |\n",
- "\n"
- ],
- "text/plain": [
- " SM-CJK5G SM-CTDQN SM-CJEIE SM-CTEM5 SM-CJJ27 SM-CJIWT SM-CTEEG ROS11430815\n",
- "1 4 0 0 2 0 0 0 2 \n",
- "2 20 12 45 36 1 7 9 30 \n",
- "3 8 1 3 6 0 6 11 1 \n",
- "4 0 0 0 0 0 0 1 0 \n",
- "5 15 4 15 9 2 3 16 8 \n",
- "6 33 4 55 70 4 10 26 21 \n",
- " SM-CJGLG SM-CJIXU ⋯ R9395022 SM-CJIX5 SM-CJEGU SM-CJIYH SM-CJGMS SM-CTEGU\n",
- "1 2 0 ⋯ 1 0 0 0 1 0 \n",
- "2 16 5 ⋯ 13 5 7 10 3 6 \n",
- "3 1 3 ⋯ 5 2 0 1 3 0 \n",
- "4 0 0 ⋯ 0 0 0 0 0 0 \n",
- "5 3 5 ⋯ 5 6 5 5 5 2 \n",
- "6 22 5 ⋯ 30 15 8 21 5 20 \n",
- " SM-CTEFJ SM-CJEJU SM-CTEGT SM-CJIZE\n",
- "1 0 1 0 0 \n",
- "2 11 10 18 5 \n",
- "3 2 3 5 4 \n",
- "4 0 0 0 0 \n",
- "5 6 12 7 7 \n",
- "6 35 26 48 6 "
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "meta <- fread(file.path(input_dir, \"1_files_with_sampleid\", paste0(\"metadata_\", celltype, \"_50nuc.csv\")))\n",
- "peak_data <- fread(file.path(input_dir, \"1_files_with_sampleid\", paste0(\"pseudobulk_peaks_counts\", celltype, \"_50nuc.csv.gz\")))\n",
- "\n",
- "cat(\"Loaded metadata with\", nrow(meta), \"samples and peak data with\", nrow(peak_data), \"peaks\\n\")\n",
- "\n",
- "# Extract peak_id and set as rownames\n",
- "peak_id <- peak_data$peak_id\n",
- "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n",
- "peak_matrix <- as.matrix(peak_data)\n",
- "rownames(peak_matrix) <- peak_id\n",
- "\n",
- "head(meta)\n",
- "head(peak_data)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "785bc2c7-8940-47c8-8dd4-769ab2c29f27",
- "metadata": {},
- "source": [
 - "#### Process technical variables from metadata\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "a6714741-5c18-47ed-a0f5-c6472120ea3a",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Column name normalization (for easier handling)\n",
- "meta_clean <- meta %>%\n",
- " rename(\n",
- " med_nucleosome_signal = med.nucleosome_signal.ct,\n",
- " med_tss_enrich = med.tss.enrich.ct,\n",
- " med_n_tot_fragment = med.n_tot_fragment.ct,\n",
- " n_nuclei = n.nuclei\n",
- " )\n",
- "\n",
- "# Calculate peak metrics - total unique peaks per sample\n",
- "peak_metrics <- data.frame(\n",
- " sampleid = colnames(peak_matrix),\n",
- " total_unique_peaks = colSums(peak_matrix > 0)\n",
- ") %>%\n",
- " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "15031ec1-8106-45ce-9056-7ae771f2468e",
- "metadata": {},
- "source": [
- "#### Process peaks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "id": "06ee1c4e-7b39-4ba6-ab07-f7395de638dd",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Sample of peak coordinates:\n",
- " peak_name chr start end\n",
- " \n",
- "1: chr1-181293-181565 chr1 181293 181565\n",
- "2: chr1-190726-191626 chr1 190726 191626\n",
- "3: chr1-629712-630662 chr1 629712 630662\n",
- "4: chr1-631261-631470 chr1 631261 631470\n",
- "5: chr1-633891-634506 chr1 633891 634506\n",
- "6: chr1-777873-779958 chr1 777873 779958\n",
- "Number of blacklisted peaks: 2354 \n",
- "Number of peaks after blacklist filtering: 529135 \n"
- ]
- }
- ],
- "source": [
- "# Process peak coordinates\n",
- "peak_df <- data.table(\n",
- " peak_name = peak_id,\n",
- " chr = sapply(strsplit(peak_id, \"-\"), `[`, 1),\n",
- " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n",
- " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3)),\n",
- " stringsAsFactors = FALSE\n",
- ")\n",
- "\n",
- "# Verify peak coordinates were extracted correctly\n",
- "cat(\"Sample of peak coordinates:\\n\")\n",
- "print(head(peak_df))\n",
- "\n",
- "# Load blacklist\n",
- "blacklist_file <- file.path(input_dir,\"hg38-blacklist.v2.bed.gz\")\n",
- "if (file.exists(blacklist_file)) {\n",
- " blacklist_df <- fread(blacklist_file)\n",
- " if (ncol(blacklist_df) >= 4) {\n",
- " colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n",
- " } else {\n",
- " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
- " }\n",
- " \n",
- " # Filter blacklisted peaks\n",
- " setkey(blacklist_df, chr, start, end)\n",
- " setkey(peak_df, chr, start, end)\n",
 - "  overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0) # interval-overlap join; nomatch=0 keeps only peaks overlapping blacklist regions\n",
- " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n",
- " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n",
- " \n",
- " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n",
- " filtered_peak <- peak_matrix[filtered_peak_idx, ]\n",
- " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "} else {\n",
- " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n",
- " cat(\"Proceeding without blacklist filtering\\n\")\n",
- " filtered_peak <- peak_matrix\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "14144ad5-10bf-4475-9e60-370b48550fd1",
- "metadata": {},
- "source": [
- "#### Load covariates"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "1421c90d-6b16-40ff-a0c0-7b7c60a20d0c",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Variable statistics before and after log transformation:\n",
- "n_nuclei: min=56.00, median=227.00, max=1293.00, SD=193.79\n",
- "log_n_nuclei: min=4.03, median=5.42, max=7.16, SD=0.64\n",
- "med_n_tot_fragment: min=2890.00, median=20306.00, max=73185.00, SD=15906.37\n",
- "log_med_n_tot_fragment: min=7.97, median=9.92, max=11.20, SD=0.66\n",
- "Number of samples after joining: 76 \n",
- "Sample IDs: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n",
- "Available covariates: sampleid, individualID, sequencingBatch, main_cell_type, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, n_nuclei, total_unique_peaks, log_total_unique_peaks, msex, age_death, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n"
- ]
- }
- ],
- "source": [
- "covariates_file <- file.path(input_dir,\"rosmap_cov.txt\")\n",
- "if (file.exists(covariates_file)) {\n",
- " covariates <- fread(covariates_file)\n",
- " # Check column names and adjust if needed\n",
- " if ('#id' %in% colnames(covariates)) {\n",
- " id_col <- '#id'\n",
- " } else if ('individualID' %in% colnames(covariates)) {\n",
- " id_col <- 'individualID'\n",
- " } else {\n",
- " cat(\"Warning: Could not identify ID column in covariates file. Available columns:\", \n",
- " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n",
- " id_col <- colnames(covariates)[1]\n",
- " cat(\"Using\", id_col, \"as ID column\\n\")\n",
- " }\n",
- " \n",
- " # Select relevant columns\n",
- " cov_cols <- intersect(c(id_col, 'msex', 'age_death', 'pmi', 'study'), colnames(covariates))\n",
- " covariates <- covariates[, ..cov_cols]\n",
- " \n",
- " # Merge with metadata\n",
- " meta_with_ind <- meta_clean %>%\n",
- " select(sampleid, everything())\n",
- " \n",
- " all_covs <- meta_with_ind %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\") %>%\n",
- " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n",
- " \n",
- " # Impute missing values\n",
- " for (col in c(\"pmi\", \"age_death\")) {\n",
- " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n",
- " cat(\"Imputing missing values for\", col, \"\\n\")\n",
- " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n",
- " }\n",
- " }\n",
- "} else {\n",
- " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n",
- " cat(\"Proceeding with only technical variables.\\n\")\n",
- " all_covs <- meta_clean %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\")\n",
- "}\n",
- "\n",
- "\n",
- "# Perform log transformations on necessary variables\n",
- "# Add a small constant to avoid log(0)\n",
- "epsilon <- 1e-6\n",
- "\n",
- "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n",
- "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n",
- "\n",
- "# Show distribution of original and log-transformed variables\n",
- "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n",
- "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n",
- " orig_var <- all_covs[[var]]\n",
- " log_var <- all_covs[[paste0(\"log_\", var)]]\n",
- " \n",
- " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n",
- " cat(sprintf(\"log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n",
- "}\n",
- "\n",
- "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n",
- "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n",
- "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "33e8ab2a-87bb-46be-9c44-5e605b4cc179",
- "metadata": {},
- "source": [
- "#### Create DGE object"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "8146b2c5-56b5-449b-b86f-cb64deed05e5",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of valid samples: 76 \n"
- ]
- }
- ],
- "source": [
- "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n",
- "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n",
- "\n",
- "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n",
- "filtered_peak_filtered <- filtered_peak[, valid_samples]\n",
- "\n",
- "dge <- DGEList(\n",
- " counts = filtered_peak_filtered,\n",
- " samples = all_covs_filtered\n",
- ")\n",
- "rownames(dge$samples) <- dge$samples$sampleid"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "55bb8d6b-3e61-4f2b-9c29-c20d0f38663a",
- "metadata": {},
- "source": [
- "#### Filter low counts and normalize"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "id": "6862b6b6-0dfd-45f8-9d6c-c6dfca5247de",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks before filtering: 529135 \n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Warning message in filterByExpr.DGEList(dge, min.count = 2, min.total.count = 15, :\n",
- "“All samples appear to belong to the same group.”\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks after filtering: 323638 \n"
- ]
- }
- ],
- "source": [
- "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n",
- "keep <- filterByExpr(dge, \n",
- " min.count = 2, # for one sample, min reads \n",
- " min.total.count = 15, # min reads overall\n",
- " min.prop = 0.1) \n",
- "\n",
- "dge <- dge[keep, , keep.lib.sizes=FALSE]\n",
 - "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") # retained peak count varies by cell type\n",
- "dge <- calcNormFactors(dge, method=\"TMM\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2b4d4f64-1e91-4edd-ad87-813db4f2547b",
- "metadata": {},
- "source": [
- "#### Handle batch as technical variable"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "0389e3c4-75fc-4195-b775-032da343b664",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Handling sequencingBatch as a technical variable\n",
- "Found 2 unique batches\n",
- "Batch sizes:\n",
- "batches\n",
- "190820Kel 191203Kel \n",
- " 4 72 \n"
- ]
- }
- ],
- "source": [
- "# We'll handle batch as a technical variable rather than doing batch adjustment\n",
- "cat(\"Handling sequencingBatch as a technical variable\\n\")\n",
- "\n",
- "# Check batch information\n",
- "batches <- dge$samples$sequencingBatch\n",
- "cat(\"Found\", length(unique(batches)), \"unique batches\\n\")\n",
- "\n",
- "# Check batch size\n",
- "batch_counts <- table(batches)\n",
- "cat(\"Batch sizes:\\n\")\n",
- "print(batch_counts)\n",
- "\n",
- "# Create sequencingBatch_factor; use a dummy single-level factor if only one batch exists\n",
- "if (length(unique(batches)) < 2) {\n",
- " cat(\"Only one batch found. Adding dummy batch for model compatibility.\\n\")\n",
- " # Create a dummy batch factor to avoid model errors\n",
- " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n",
- "} else {\n",
- " # Use the existing batch information\n",
- " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n",
- "}\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "1b23595b-8bd0-471b-8c11-cb0819e9055e",
- "metadata": {},
- "source": [
- "#### Create model and run voom"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "id": "efba3973-0cfc-4afd-9dcc-5842190a9995",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Using full model with demographic and technical covariates\n",
- "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + msex + age_death + pmi + study \n",
- "Warning: Factor variable group has only one level. Converting to character.\n",
- "Successfully created design matrix with 11 columns\n",
- "Calculating offsets and residuals...\n"
- ]
- }
- ],
- "source": [
- "# Define the model based on available covariates - using log-transformed variables\n",
- "if (all(c(\"msex\", \"age_death\", \"pmi\", \"study\") %in% colnames(dge$samples))) {\n",
- " # Full model with all covariates\n",
- " cat(\"Using full model with demographic and technical covariates\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + sequencingBatch_factor + \n",
- " msex + age_death + pmi + study\n",
- "} else {\n",
- " # Technical variables only model\n",
- " cat(\"Using model with technical covariates only\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + sequencingBatch_factor\n",
- "}\n",
- "\n",
- "# Print the model formula\n",
- "cat(\"Model formula:\", deparse(model), \"\\n\")\n",
- "\n",
- "# Check for factor variables with only one level\n",
- "for (col in colnames(dge$samples)) {\n",
- " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n",
- " cat(\"Warning: Factor variable\", col, \"has only one level. Converting to character.\\n\")\n",
- " dge$samples[[col]] <- as.character(dge$samples[[col]])\n",
- " }\n",
- "}\n",
- "\n",
- "# Create design matrix with error checking\n",
- "tryCatch({\n",
- " design <- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- "}, error = function(e) {\n",
- " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n",
- " cat(\"Attempting to fix model formula...\\n\")\n",
- " \n",
- " # Check each term in the model\n",
- " all_terms <- all.vars(model)\n",
- " valid_terms <- character(0)\n",
- " \n",
- " for (term in all_terms) {\n",
- " if (term %in% colnames(dge$samples)) {\n",
- " # Check if it's a factor with at least 2 levels\n",
- " if (is.factor(dge$samples[[term]])) {\n",
- " if (nlevels(dge$samples[[term]]) >= 2) {\n",
- " valid_terms <- c(valid_terms, term)\n",
- " } else {\n",
- " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n",
- " }\n",
- " } else {\n",
- " # Non-factor variables are fine\n",
- " valid_terms <- c(valid_terms, term)\n",
- " }\n",
- " } else {\n",
- " cat(\"Variable\", term, \"not found in sample data\\n\")\n",
- " }\n",
- " }\n",
- " \n",
- " # Create a simplified model with valid terms\n",
- " if (length(valid_terms) > 0) {\n",
- " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n",
- " model <- as.formula(model_str)\n",
- " cat(\"New model formula:\", model_str, \"\\n\")\n",
- " design <- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- " } else {\n",
- " stop(\"Could not create a valid model with the available variables\")\n",
- " }\n",
- "})\n",
- "\n",
- "# Check if the design matrix is full rank\n",
- "if (!is.fullrank(design)) {\n",
- " cat(\"Design matrix is not full rank. Adjusting...\\n\")\n",
- " # Find and remove the problematic columns\n",
- " qr_res <- qr(design)\n",
- " design <- design[, qr_res$pivot[1:qr_res$rank]]\n",
- " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n",
- "}\n",
- "\n",
- "# Run voom and fit model\n",
- "v <- voom(dge, design, plot=FALSE) #logCPM\n",
- "fit <- lmFit(v, design)\n",
- "fit <- eBayes(fit)\n",
- "\n",
- "# Calculate offset and residuals\n",
- "cat(\"Calculating offsets and residuals...\\n\")\n",
- "offset <- predictOffset(fit)\n",
- "resids <- residuals(fit, y=v)\n",
- "\n",
- "# Verify offsets and residuals align before combining\n",
- "stopifnot(all(dim(offset) == dim(resids)))\n",
- "stopifnot(all(rownames(offset) == rownames(resids)),\n",
- "          all(colnames(offset) == colnames(resids)))\n",
- "\n",
- "# Final adjusted data\n",
- "final_data <- offset + resids"
- ]
- },
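- {
- "cell_type": "markdown",
- "id": "a1f4c9d2-7b3e-4f61-9c2a-5d8e0b4a6c21",
- "metadata": {},
- "source": [
- "The full-rank adjustment above can be illustrated on a tiny collinear design (a toy matrix, not pipeline data; base R only):\n",
- "\n",
- "```r\n",
- "X <- cbind(intercept = 1, a = c(1, 2, 3, 4), b = c(2, 4, 6, 8)) # b = 2 * a\n",
- "qr_res <- qr(X)\n",
- "X_fixed <- X[, qr_res$pivot[seq_len(qr_res$rank)]]\n",
- "ncol(X_fixed) # the redundant column is dropped\n",
- "```\n",
- "\n",
- "`qr()` pivots linearly independent columns first, so keeping `pivot[1:rank]` retains a maximal independent subset of the design."
- ]
- },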
- {
- "cell_type": "markdown",
- "id": "cbc2d0da-33f0-4d51-ae43-4de228d57873",
- "metadata": {},
- "source": [
- "#### Save results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "id": "c0c15fea-c4d6-41a2-aa92-795b4fd0b9b7",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing completed. Results and documentation saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals//Astro \n"
- ]
- }
- ],
- "source": [
- "# Save results\n",
- "saveRDS(list(\n",
- " dge = dge,\n",
- " offset = offset,\n",
- " residuals = resids,\n",
- " final_data = final_data,\n",
- " valid_samples = colnames(dge),\n",
- " design = design,\n",
- " fit = fit,\n",
- " model = model\n",
- "), file = file.path(out_dir, paste0(celltype,\"_results.rds\")))\n",
- "\n",
- "# Write final residual data to file\n",
- "write.table(final_data,\n",
- " file = file.path(out_dir, paste0(celltype,\"_residuals.txt\")), \n",
- " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n",
- "\n",
- "# Write summary statistics\n",
- "sink(file = file.path(out_dir, paste0(celltype, \"_summary.txt\")))\n",
- "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n",
- "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n",
- "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n",
- "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n",
- "cat(\"\\nTechnical Variables Used:\\n\")\n",
- "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n",
- "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n",
- "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n",
- "cat(\"\\nDemographic Variables Used:\\n\")\n",
- "cat(\"- msex: Sex (male=1, female=0)\\n\")\n",
- "cat(\"- age_death: Age at death\\n\")\n",
- "cat(\"- pmi: Post-mortem interval\\n\")\n",
- "cat(\"- study: Study cohort\\n\")\n",
- "sink()\n",
- "\n",
- "# Write an additional explanation file about the variables and log transformation\n",
- "sink(file = file.path(out_dir, paste0(celltype,\"_variable_explanation.txt\")))\n",
- "cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n",
- "\n",
- "cat(\"## Why Log Transformation?\\n\")\n",
- "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n",
- "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n",
- "cat(\"2. To stabilize variance across the range of values\\n\")\n",
- "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n",
- "cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n",
- "\n",
- "cat(\"## Variables and Their Meanings\\n\\n\")\n",
- "\n",
- "cat(\"### Technical Variables\\n\")\n",
- "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n",
- "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n",
- "\n",
- "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n",
- "cat(\" * Represents sequencing depth\\n\")\n",
- "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n",
- "\n",
- "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n",
- "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n",
- "\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\" * Measures the degree of nucleosome positioning\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n",
- "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "cat(\"### Demographic Variables\\n\")\n",
- "cat(\"- msex: Sex (male=1, female=0)\\n\")\n",
- "cat(\"- age_death: Age at death\\n\")\n",
- "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n",
- "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n",
- "\n",
- "cat(\"## Relationship to voom Transformation\\n\")\n",
- "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n",
- "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n",
- "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n",
- "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n",
- "sink()\n",
- "\n",
- "cat(\"Processing completed. Results and documentation saved to:\", out_dir, \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b28beaf3-804a-4b88-9e7e-156a5d4ee3d0",
- "metadata": {},
- "source": [
- "### Option B: Pseudobulk QC Without Regressing Biological Covariates (noBIOvar)\n",
- "Use this option when you want to preserve biological variation (e.g., for comparing across ages/sexes or region-specific analyses).\n",
- "\n",
- "**Input:** (Same as Option A)\n",
- "- Pseudobulk peak counts (in `1_files_with_sampleid` folder): `pseudobulk_peaks_counts{celltype}_50nuc.csv.gz`\n",
- "- Cell metadata (in `1_files_with_sampleid` folder): `metadata_{celltype}_50nuc.csv`\n",
- "- Sample covariates: `rosmap_cov.txt`\n",
- "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n",
- "\n",
- "**Process:**\n",
- "1. Loads pseudobulk peak count matrix and metadata per cell type\n",
- "2. Calculates technical QC metrics per sample:\n",
- " - `log_n_nuclei`: Log-transformed number of nuclei\n",
- " - `med_nucleosome_signal`: Median nucleosome signal\n",
- " - `med_tss_enrich`: Median TSS enrichment score\n",
- " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n",
- " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n",
- "3. Filters blacklisted genomic regions using `foverlaps()`\n",
- "4. Merges with demographic covariates (msex, age_death, pmi, study)\n",
- "5. Applies expression filtering with `filterByExpr()`:\n",
- " - `min.count = 2`: Minimum 2 reads in at least one sample\n",
- " - `min.total.count = 15`: Minimum 15 total reads across all samples\n",
- " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n",
- "6. TMM normalization with `calcNormFactors()`\n",
- "7. Saves **filtered raw counts** without covariate adjustment\n",
- "\n",
- "**Key Difference:**\n",
- "- Does NOT regress out msex or age_death\n",
- "- Saves the TMM-normalized, filtered count matrix before any covariate regression\n",
- "\n",
- "**Model formula (technical covariates plus pmi and study only):**\n",
- "```r\n",
- "model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study\n",
- "```\n",
- "Note: the count matrix is saved before the voom/residual step, so the saved output contains no covariate adjustment\n",
- "\n",
- "**Output:** `output/2_residuals/{celltype}/`\n",
- "\n",
- "`{celltype}_filtered_raw_counts.txt`: TMM-normalized, filtered peak counts without biological covariate adjustment\n",
- "\n",
- "**Key Variables NOT Regressed:**\n",
- "- Sex (msex)\n",
- "- Age at death (age_death)"
- ]
- },
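- {
- "cell_type": "markdown",
- "id": "c2e7b5a8-3d1f-4e92-8a6b-7f0c4d2e9b53",
- "metadata": {},
- "source": [
- "Steps 5-6 above can be sketched on simulated counts (toy data, not the pipeline inputs; assumes edgeR is installed):\n",
- "\n",
- "```r\n",
- "library(edgeR)\n",
- "set.seed(1)\n",
- "counts <- matrix(rpois(1000 * 10, lambda = 3), nrow = 1000,\n",
- "                 dimnames = list(paste0(\"peak\", 1:1000), paste0(\"s\", 1:10)))\n",
- "dge_toy <- DGEList(counts = counts)\n",
- "keep <- filterByExpr(dge_toy, min.count = 5, min.total.count = 15, min.prop = 0.1)\n",
- "dge_toy <- dge_toy[keep, , keep.lib.sizes = FALSE]\n",
- "dge_toy <- calcNormFactors(dge_toy, method = \"TMM\")\n",
- "```\n",
- "\n",
- "`keep.lib.sizes = FALSE` recomputes library sizes after filtering, so the TMM factors reflect only the retained peaks."
- ]
- },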
- {
- "cell_type": "markdown",
- "id": "8ee626cb-8aa6-4464-8066-4f501b5d6eaf",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "id": "0bfb521c-fdc2-4029-b8e6-9c3459ee8872",
- "metadata": {},
- "outputs": [],
- "source": [
- "library(data.table)\n",
- "library(stringr)\n",
- "library(dplyr)\n",
- "library(edgeR)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e346a569-892d-43b7-974d-f55ca725d83b",
- "metadata": {},
- "source": [
- "#### Set cell type and output directory"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "id": "55b554b2-722b-48d8-aa25-bbdae074963f",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing celltype: Exc \n"
- ]
- }
- ],
- "source": [
- "# Set cell type and create output directory\n",
- "#args <- commandArgs(trailingOnly = TRUE)\n",
- "#celltype <- args[1] # First argument is the cell type\n",
- "celltype <- \"Exc\" # Change this for different cell types\n",
- "cat(\"Processing celltype:\", celltype, \"\\n\")\n",
- "\n",
- "out_dir <- paste0(file.path(output_dir,\"2_residuals/\", celltype))\n",
- "dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6d336458-99cd-4ff0-838b-4423d6bf2e9a",
- "metadata": {},
- "source": [
- "#### Create predictOffset function "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "id": "e01e3c57-8abc-4b09-94fb-1ec7fc55b0ac",
- "metadata": {},
- "outputs": [],
- "source": [
- "predictOffset <- function(fit) {\n",
- " # Define which variables are factors and which are continuous\n",
- " usedFactors <- c(\"sequencingBatch\", \"study\") \n",
- " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n",
- " \"log_total_unique_peaks\", \"med_peakwidth\", \"pmi\")\n",
- " \n",
- " # Filter to only use variables actually in the design matrix\n",
- " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " \n",
- " # Get indices for factor and continuous variables\n",
- " facInd <- unlist(lapply(as.list(usedFactors), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " contInd <- unlist(lapply(as.list(usedContinuous), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " \n",
- " # Add the intercept\n",
- " all_indices <- c(1, facInd, contInd)\n",
- " \n",
- " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n",
- " all_indices_sorted <- sort(unique(all_indices))\n",
- " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n",
- " \n",
- " # Create new design matrix with median values\n",
- " D <- fit$design\n",
- " D[, facInd] <- 0 # Set all factor levels to reference level\n",
- " \n",
- " # For continuous variables, set to median value\n",
- " if (length(contInd) > 0) {\n",
- " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n",
- "    for (i in seq_along(medContVals)) {\n",
- " D[, names(medContVals)[i]] <- medContVals[i]\n",
- " }\n",
- " }\n",
- " \n",
- " # Calculate offsets\n",
- " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n",
- " offsets <- apply(coefficients(fit), 1, function(c) {\n",
- " return(D %*% c)\n",
- " })\n",
- " offsets <- t(offsets)\n",
- " colnames(offsets) <- rownames(fit$design)\n",
- " \n",
- " return(offsets)\n",
- "}"
- ]
- },
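- {
- "cell_type": "markdown",
- "id": "d9b3f1c6-5a2e-4c78-b4e1-8a6d3f0c2e74",
- "metadata": {},
- "source": [
- "The idea behind `predictOffset()` can be shown with a toy `lm()` fit (hypothetical data, base R only): the offset is the model prediction with factors held at their reference level and continuous covariates at their median, so `offset + residuals` removes covariate effects while keeping a realistic baseline level.\n",
- "\n",
- "```r\n",
- "set.seed(1)\n",
- "x <- rnorm(20)\n",
- "batch <- gl(2, 10)\n",
- "y <- 5 + 2 * x + as.numeric(batch) + rnorm(20, sd = 0.1)\n",
- "fit_toy <- lm(y ~ x + batch)\n",
- "D <- model.matrix(fit_toy)\n",
- "D[, \"batch2\"] <- 0            # factor at reference level\n",
- "D[, \"x\"] <- median(D[, \"x\"])  # continuous covariate at median\n",
- "adjusted <- as.vector(D %*% coef(fit_toy)) + resid(fit_toy)\n",
- "```\n",
- "\n",
- "After adjustment the batch and `x` effects are gone; only the baseline plus residual noise remains, which mirrors how `final_data <- offset + resids` is built below."
- ]
- },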
- {
- "cell_type": "markdown",
- "id": "ab4a5c28-5295-47e4-a5f5-26d6cbb995ca",
- "metadata": {},
- "source": [
- "#### Load input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "id": "8ea62e42-0bbf-4166-8c51-ded8318a6463",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Loaded metadata with 90 samples and peak data with 531489 peaks\n"
- ]
- }
- ],
- "source": [
- "meta <- fread(paste0(file.path(input_dir, \"1_files_with_sampleid/metadata_\"), celltype, \"_50nuc.csv\"))\n",
- "peak_data <- fread(file.path(input_dir,\"1_files_with_sampleid\", paste0(\"pseudobulk_peaks_counts\", celltype, \"_50nuc.csv.gz\")))\n",
- "\n",
- "cat(\"Loaded metadata with\", nrow(meta), \"samples and peak data with\", nrow(peak_data), \"peaks\\n\")\n",
- "\n",
- "# Extract peak_id and set as rownames\n",
- "peak_id <- peak_data$peak_id\n",
- "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n",
- "peak_matrix <- as.matrix(peak_data)\n",
- "rownames(peak_matrix) <- peak_id"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b3b9c011-775e-41bc-b581-4269628592eb",
- "metadata": {},
- "source": [
- "#### Process technical variables from meta data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "id": "866e1e87-0c20-4a71-9d83-450c49a3e647",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Column name normalization (for easier handling)\n",
- "meta_clean <- meta %>%\n",
- " rename(\n",
- " med_nucleosome_signal = med.nucleosome_signal.ct,\n",
- " med_tss_enrich = med.tss.enrich.ct,\n",
- " med_n_tot_fragment = med.n_tot_fragment.ct,\n",
- " n_nuclei = n.nuclei\n",
- " )\n",
- "\n",
- "# Calculate peak metrics - total unique peaks per sample and median peak width\n",
- "peak_metrics <- data.frame(\n",
- " sampleid = colnames(peak_matrix),\n",
- " total_unique_peaks = colSums(peak_matrix > 0)\n",
- ") %>%\n",
- " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))\n",
- "\n",
- "# Calculate median peak width for each sample using count as weight\n",
- "calculate_median_peakwidth <- function(peak_matrix, peak_info) {\n",
- " # Create a data frame with peak widths\n",
- " peak_widths <- peak_info$end - peak_info$start\n",
- " \n",
- " # Initialize a vector to store median peak widths\n",
- " median_peak_widths <- numeric(ncol(peak_matrix))\n",
- " names(median_peak_widths) <- colnames(peak_matrix)\n",
- " \n",
- " # For each sample, calculate the weighted median peak width\n",
- " for (i in 1:ncol(peak_matrix)) {\n",
- " sample_counts <- peak_matrix[, i]\n",
- " # Only consider peaks with counts > 0\n",
- " idx <- which(sample_counts > 0)\n",
- " \n",
- " if (length(idx) > 0) {\n",
- " # Method 1: Use counts as weights\n",
- " weights <- sample_counts[idx]\n",
- " # Repeat each peak width by its count for weighted calculation\n",
- " all_widths <- rep(peak_widths[idx], times=weights)\n",
- " median_peak_widths[i] <- median(all_widths)\n",
- " } else {\n",
- " median_peak_widths[i] <- NA\n",
- " }\n",
- " }\n",
- " \n",
- " return(median_peak_widths)\n",
- "}\n",
- "\n",
- "# Calculate median peak width for each sample\n",
- "# Note: peak coordinates are re-parsed from peak_id here; the peak_df used for blacklist filtering is built in the next cell\n",
- "median_peakwidths <- calculate_median_peakwidth(peak_matrix, data.frame(\n",
- " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n",
- " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3))\n",
- "))\n",
- "\n",
- "# Add median peak width to peak metrics\n",
- "peak_metrics$med_peakwidth <- median_peakwidths"
- ]
- },
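- {
- "cell_type": "markdown",
- "id": "e4a8d2b1-6c3f-4a95-9d7e-2b5f8c1a3d46",
- "metadata": {},
- "source": [
- "A quick check of the count-weighted median idea (toy numbers): each width is repeated by its read count before taking the median.\n",
- "\n",
- "```r\n",
- "widths <- c(200, 300, 500)\n",
- "counts <- c(3, 1, 1)\n",
- "median(rep(widths, times = counts)) # expands to 200, 200, 200, 300, 500 -> 200\n",
- "```\n",
- "\n",
- "For large matrices, a dedicated weighted-median routine (e.g. `matrixStats::weightedMedian()`) avoids materializing the expanded vector, though its interpolation settings can change results slightly."
- ]
- },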
- {
- "cell_type": "markdown",
- "id": "0f7eee8d-91f2-48a8-b3df-f5f6fbd6ac9b",
- "metadata": {},
- "source": [
- "#### Process peaks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "id": "0fc2bc63-131f-425d-bbcd-66d6eba93076",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Sample of peak coordinates:\n",
- " peak_name chr start end\n",
- " \n",
- "1: chr1-181293-181565 chr1 181293 181565\n",
- "2: chr1-190726-191626 chr1 190726 191626\n",
- "3: chr1-629712-630662 chr1 629712 630662\n",
- "4: chr1-631261-631470 chr1 631261 631470\n",
- "5: chr1-633891-634506 chr1 633891 634506\n",
- "6: chr1-777873-779958 chr1 777873 779958\n",
- "Number of blacklisted peaks: 2354 \n",
- "Number of peaks after blacklist filtering: 529135 \n"
- ]
- }
- ],
- "source": [
- "# Process peak coordinates\n",
- "peak_df <- data.table(\n",
- " peak_name = peak_id,\n",
- " chr = sapply(strsplit(peak_id, \"-\"), `[`, 1),\n",
- " start = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 2)),\n",
- " end = as.integer(sapply(strsplit(peak_id, \"-\"), `[`, 3)),\n",
- " stringsAsFactors = FALSE\n",
- ")\n",
- "\n",
- "# Verify peak coordinates were extracted correctly\n",
- "cat(\"Sample of peak coordinates:\\n\")\n",
- "print(head(peak_df))\n",
- "\n",
- "# Load blacklist\n",
- "blacklist_file <- file.path(input_dir,\"hg38-blacklist.v2.bed.gz\")\n",
- "if (file.exists(blacklist_file)) {\n",
- " blacklist_df <- fread(blacklist_file)\n",
- " if (ncol(blacklist_df) >= 4) {\n",
- " colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n",
- " } else {\n",
- " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
- " }\n",
- " \n",
- " # Filter blacklisted peaks\n",
- " setkey(blacklist_df, chr, start, end)\n",
- " setkey(peak_df, chr, start, end)\n",
- " overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n",
- " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n",
- " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n",
- " \n",
- " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n",
- " filtered_peak <- peak_matrix[filtered_peak_idx, ]\n",
- " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "} else {\n",
- " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n",
- " cat(\"Proceeding without blacklist filtering\\n\")\n",
- " filtered_peak <- peak_matrix\n",
- "}\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d764e632-2fca-401e-9457-8174ff204000",
- "metadata": {},
- "source": [
- "#### Load covariates"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "id": "8aaabbbf-70e3-421c-863d-1c8c08c0fc24",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Variable statistics before and after log transformation:\n",
- "n_nuclei: min=77.00, median=1762.00, max=7024.00, SD=1275.90\n",
- "log_n_nuclei: min=4.34, median=7.47, max=8.86, SD=0.88\n",
- "med_n_tot_fragment: min=3234.00, median=21072.00, max=133932.50, SD=20162.62\n",
- "log_med_n_tot_fragment: min=8.08, median=9.96, max=11.81, SD=0.73\n",
- "Number of samples after joining: 83 \n",
- "Sample IDs: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n",
- "Available covariates: sampleid, individualID, sequencingBatch, main_cell_type, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, n_nuclei, total_unique_peaks, log_total_unique_peaks, med_peakwidth, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n"
- ]
- }
- ],
- "source": [
- "covariates_file <- file.path(input_dir,'rosmap_cov.txt')\n",
- "if (file.exists(covariates_file)) {\n",
- " covariates <- fread(covariates_file)\n",
- " # Check column names and adjust if needed\n",
- " if ('#id' %in% colnames(covariates)) {\n",
- " id_col <- '#id'\n",
- " } else if ('individualID' %in% colnames(covariates)) {\n",
- " id_col <- 'individualID'\n",
- " } else {\n",
- " cat(\"Warning: Could not identify ID column in covariates file. Available columns:\", \n",
- " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n",
- " id_col <- colnames(covariates)[1]\n",
- " cat(\"Using\", id_col, \"as ID column\\n\")\n",
- " }\n",
- " \n",
- " # Select relevant columns - excluding msex and age_death\n",
- " cov_cols <- intersect(c(id_col, 'pmi', 'study'), colnames(covariates))\n",
- " covariates <- covariates[, ..cov_cols]\n",
- " \n",
- " # Merge with metadata\n",
- " meta_with_ind <- meta_clean %>%\n",
- " select(sampleid, everything())\n",
- " \n",
- " all_covs <- meta_with_ind %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\") %>%\n",
- " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n",
- " \n",
- " # Impute missing values\n",
- " for (col in c(\"pmi\")) {\n",
- " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n",
- " cat(\"Imputing missing values for\", col, \"\\n\")\n",
- " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n",
- " }\n",
- " }\n",
- "} else {\n",
- " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n",
- " cat(\"Proceeding with only technical variables.\\n\")\n",
- " all_covs <- meta_clean %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\")\n",
- "}\n",
- "\n",
- "\n",
- "# Perform log transformations on necessary variables\n",
- "# Add a small constant to avoid log(0)\n",
- "epsilon <- 1e-6\n",
- "\n",
- "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n",
- "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n",
- "\n",
- "# Show distribution of original and log-transformed variables\n",
- "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n",
- "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n",
- " orig_var <- all_covs[[var]]\n",
- " log_var <- all_covs[[paste0(\"log_\", var)]]\n",
- " \n",
- " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n",
- " cat(sprintf(\"log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n",
- "}\n",
- "\n",
- "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n",
- "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n",
- "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6bbc0158-7095-48db-a80c-020fad7bd4ec",
- "metadata": {},
- "source": [
- "#### Create DGE object"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "id": "13ebe29a-6598-4d9d-b9ff-223ae3a98656",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of valid samples: 83 \n"
- ]
- }
- ],
- "source": [
- "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n",
- "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n",
- "\n",
- "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n",
- "filtered_peak_filtered <- filtered_peak[, valid_samples]\n",
- "\n",
- "dge <- DGEList(\n",
- " counts = filtered_peak_filtered,\n",
- " samples = all_covs_filtered\n",
- ")\n",
- "rownames(dge$samples) <- dge$samples$sampleid"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6962730b-7cdd-41e4-ba2c-51cd08d16013",
- "metadata": {},
- "source": [
- "#### Filter low counts and normalize"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "id": "9f0f07a2-0b66-4031-acf9-cc0db9e8af4f",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks before filtering: 529135 \n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n",
- "“All samples appear to belong to the same group.”\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks after filtering: 521515 \n",
- "Saved filtered raw counts to /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals//Exc/Exc_filtered_raw_counts.txt \n"
- ]
- }
- ],
- "source": [
- "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n",
- "keep <- filterByExpr(dge, \n",
- "                     min.count = 5,   # min reads per sample \n",
- " min.total.count = 15, # min reads overall\n",
- " min.prop = 0.1) \n",
- "\n",
- "dge <- dge[keep, , keep.lib.sizes=FALSE]\n",
- "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") # retained peak count varies by cell type\n",
- "\n",
- "# Save filtered raw count data\n",
- "filtered_raw_counts <- dge$counts\n",
- "write.table(filtered_raw_counts,\n",
- " file = file.path(out_dir, paste0(celltype, \"_filtered_raw_counts.txt\")), \n",
- " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n",
- "cat(\"Saved filtered raw counts to\", file.path(out_dir, paste0(celltype, \"_filtered_raw_counts.txt\")), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3a8ac9ec-e713-411e-940a-3e0e7eff0c27",
- "metadata": {},
- "source": [
- "#### Handle batch as technical variable"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "id": "82da4179-feae-47f0-a566-a04127beacc7",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Handling sequencingBatch as a technical variable\n",
- "Found 2 unique batches\n",
- "Batch sizes:\n",
- "batches\n",
- "190820Kel 191203Kel \n",
- " 6 77 \n"
- ]
- }
- ],
- "source": [
- "dge <- calcNormFactors(dge, method=\"TMM\")\n",
- "# We'll handle batch as a technical variable rather than doing batch adjustment\n",
- "cat(\"Handling sequencingBatch as a technical variable\\n\")\n",
- "\n",
- "# Check batch information\n",
- "batches <- dge$samples$sequencingBatch\n",
- "cat(\"Found\", length(unique(batches)), \"unique batches\\n\")\n",
- "\n",
- "# Check batch size\n",
- "batch_counts <- table(batches)\n",
- "cat(\"Batch sizes:\\n\")\n",
- "print(batch_counts)\n",
- "\n",
- "# Create sequencingBatch_factor; use a dummy single-level factor if only one batch exists\n",
- "if (length(unique(batches)) < 2) {\n",
- " cat(\"Only one batch found. Adding dummy batch for model compatibility.\\n\")\n",
- " # Create a dummy batch factor to avoid model errors\n",
- " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n",
- "} else {\n",
- " # Use the existing batch information\n",
- " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6593bcae-cf9b-46d4-8411-37aa7b0d2f7a",
- "metadata": {},
- "source": [
- "#### Create model and run voom"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "id": "d078474a-45e4-4762-adb2-06925885ff88",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Using model with technical covariates plus pmi and study\n",
- "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study \n",
- "Warning: Factor variable group has only one level. Converting to character.\n",
- "Successfully created design matrix with 10 columns\n",
- "Calculating offsets and residuals...\n"
- ]
- }
- ],
- "source": [
- "# Define the model based on available covariates - using log-transformed variables\n",
- "# Removed msex and age_death from the model\n",
- "if (\"study\" %in% colnames(dge$samples) && \"pmi\" %in% colnames(dge$samples)) {\n",
- " # Technical model with pmi and study\n",
- " cat(\"Using model with technical covariates plus pmi and study\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi + study\n",
- "} else if (\"pmi\" %in% colnames(dge$samples)) {\n",
- " # Technical model with pmi only\n",
- " cat(\"Using model with technical covariates and pmi\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor + pmi\n",
- "} else {\n",
- " # Technical variables only model\n",
- " cat(\"Using model with technical covariates only\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + med_peakwidth + sequencingBatch_factor\n",
- "}\n",
- "\n",
- "# Print the model formula\n",
- "cat(\"Model formula:\", deparse(model), \"\\n\")\n",
- "\n",
- "# Check for factor variables with only one level\n",
- "for (col in colnames(dge$samples)) {\n",
- " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n",
- " cat(\"Warning: Factor variable\", col, \"has only one level. Converting to character.\\n\")\n",
- " dge$samples[[col]] <- as.character(dge$samples[[col]])\n",
- " }\n",
- "}\n",
- "\n",
- "# Create design matrix with error checking\n",
- "tryCatch({\n",
- " design <- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- "}, error = function(e) {\n",
- " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n",
- " cat(\"Attempting to fix model formula...\\n\")\n",
- " \n",
- " # Check each term in the model\n",
- " all_terms <- all.vars(model)\n",
- " valid_terms <- character(0)\n",
- " \n",
- " for (term in all_terms) {\n",
- " if (term %in% colnames(dge$samples)) {\n",
- " # Check if it's a factor with at least 2 levels\n",
- " if (is.factor(dge$samples[[term]])) {\n",
- " if (nlevels(dge$samples[[term]]) >= 2) {\n",
- " valid_terms <- c(valid_terms, term)\n",
- " } else {\n",
- " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n",
- " }\n",
- " } else {\n",
- " # Non-factor variables are fine\n",
- " valid_terms <- c(valid_terms, term)\n",
- " }\n",
- " } else {\n",
- " cat(\"Variable\", term, \"not found in sample data\\n\")\n",
- " }\n",
- " }\n",
- " \n",
- " # Create a simplified model with valid terms\n",
- " if (length(valid_terms) > 0) {\n",
- " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n",
- " # Assign with <<- so the repaired model and design escape this error handler\n",
- " model <<- as.formula(model_str)\n",
- " cat(\"New model formula:\", model_str, \"\\n\")\n",
- " design <<- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- " } else {\n",
- " stop(\"Could not create a valid model with the available variables\")\n",
- " }\n",
- "})\n",
- "\n",
- "# Check if the design matrix is full rank\n",
- "if (!is.fullrank(design)) {\n",
- " cat(\"Design matrix is not full rank. Adjusting...\\n\")\n",
- " # Find and remove the problematic columns\n",
- " qr_res <- qr(design)\n",
- " design <- design[, qr_res$pivot[1:qr_res$rank]]\n",
- " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n",
- "}\n",
- "\n",
- "# Run voom and fit model\n",
- "v <- voom(dge, design, plot=FALSE) #logCPM\n",
- "fit <- lmFit(v, design)\n",
- "fit <- eBayes(fit)\n",
- "\n",
- "# Calculate offset and residuals\n",
- "cat(\"Calculating offsets and residuals...\\n\")\n",
- "offset <- predictOffset(fit)\n",
- "resids <- residuals(fit, y=v)\n",
- "\n",
- "# Verify dimensions\n",
- "stopifnot(all(dim(offset) == dim(resids)))\n",
- "stopifnot(all(rownames(offset) == rownames(resids)))\n",
- "stopifnot(all(colnames(offset) == colnames(resids)))\n",
- "\n",
- "# Final adjusted data\n",
- "\n",
- "final_data <- offset + resids"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ef172cf6-555f-49e7-834f-b0f706b4b3bf",
- "metadata": {},
- "source": [
- "#### Save results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f002e9d7-d994-4cdf-9ecc-7a0f18210b58",
- "metadata": {},
- "outputs": [],
- "source": [
- "saveRDS(list(\n",
- " dge = dge,\n",
- " offset = offset,\n",
- " residuals = resids,\n",
- " final_data = final_data,\n",
- " valid_samples = colnames(dge),\n",
- " design = design,\n",
- " fit = fit,\n",
- " model = model\n",
- "), file = file.path(out_dir, paste0(celltype, \"_results.rds\")))\n",
- "\n",
- "# Write final residual data to file\n",
- "write.table(final_data,\n",
- " file = file.path(out_dir, paste0(celltype, \"_residuals.txt\")), \n",
- " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n",
- "\n",
- "# Write summary statistics\n",
- "sink(file = file.path(out_dir, paste0(celltype, \"_summary.txt\")))\n",
- "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n",
- "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n",
- "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n",
- "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n",
- "cat(\"\\nTechnical Variables Used:\\n\")\n",
- "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n",
- "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n",
- "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n",
- "cat(\"\\nOther Variables Used:\\n\")\n",
- "cat(\"- pmi: Post-mortem interval\\n\")\n",
- "cat(\"- study: Study cohort\\n\")\n",
- "sink()\n",
- "\n",
- "# Write an additional explanation file about the variables and log transformation\n",
- "sink(file = file.path(out_dir, paste0(celltype, \"_variable_explanation.txt\")))\n",
- "cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n",
- "\n",
- "cat(\"## Why Log Transformation?\\n\")\n",
- "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n",
- "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n",
- "cat(\"2. To stabilize variance across the range of values\\n\")\n",
- "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n",
- "cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n",
- "\n",
- "cat(\"## Variables and Their Meanings\\n\\n\")\n",
- "\n",
- "cat(\"### Technical Variables\\n\")\n",
- "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n",
- "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n",
- "\n",
- "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n",
- "cat(\" * Represents sequencing depth\\n\")\n",
- "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n",
- "\n",
- "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n",
- "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n",
- "\n",
- "cat(\"- med_peakwidth: Median width of peaks in each sample (weighted by counts)\\n\")\n",
- "cat(\" * Represents the typical size of accessible regions\\n\\n\")\n",
- "\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\" * Measures the degree of nucleosome positioning\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n",
- "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "cat(\"### Other Variables\\n\")\n",
- "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n",
- "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n",
- "\n",
- "cat(\"## Relationship to voom Transformation\\n\")\n",
- "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n",
- "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n",
- "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n",
- "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n",
- "sink()\n",
- "\n",
- "cat(\"Processing completed. Results and documentation saved to:\", out_dir, \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96",
- "metadata": {},
- "source": [
- "## Step 2: Format Output\n",
- "### Format A: Phenotype Reformatting \n",
- "\n",
- "**Input:**\n",
- "- `{celltype}_residuals.txt` from Step 1 Option A (in `2_residuals/{celltype}/`)\n",
- "\n",
- "**Process:**\n",
- "1. Reads residuals file with proper handling of peak IDs and sample columns\n",
- "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n",
- "3. Converts peaks to **midpoint coordinates**:\n",
- " ```r\n",
- " midpoint = (start + end) / 2\n",
- " start = midpoint\n",
- " end = midpoint + 1\n",
- " ```\n",
- "4. Creates BED format: `#chr`, `start`, `end`, `ID` (peak_id), followed by sample expression values\n",
- "5. Sorts by chromosome and genomic position using `setorder(bed_data, '#chr', start, end)`\n",
- "6. Writes BED file with headers\n",
- "7. Compresses with `bgzip -f`\n",
- "\n",
- "**Output:** `output/3_phenotype_processing/{celltype}/`\n",
- "\n",
- "- `{celltype}_kellis_snatac_phenotype.bed.gz`: QTL-ready, bgzip-compressed BED file with peak midpoint coordinates\n",
- "\n",
- "**Use Case:**\n",
- "Standard caQTL (chromatin accessibility QTL) mapping where you want to identify genetic variants affecting chromatin accessibility independent of demographic factors. Ready for FastQTL, TensorQTL, or QTLtools.\n",
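- "\n",
- "These tools require the phenotype BED to be bgzip-compressed and tabix-indexed. A minimal sketch of building the index from R (assuming `tabix` is on the PATH; the file name is the example Astro output from this step):\n",
- "\n",
- "```r\n",
- "system(\"tabix -p bed Astro_kellis_snatac_phenotype.bed.gz\")\n",
- "```\n",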
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4ed50732-7daf-409e-af3a-b3014808cb46",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "id": "0f4dbe14-2acf-4e8b-b63b-47c67f5f68e5",
- "metadata": {},
- "outputs": [],
- "source": [
- "library(data.table)\n",
- "library(stringr)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ff83118f-c112-4b03-9256-0a5e98322422",
- "metadata": {},
- "source": [
- "#### Load input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "id": "2f840ed2-3c7c-4a75-8a28-8c90c6f43d2e",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Get command line arguments\n",
- "#args <- commandArgs(trailingOnly = TRUE)\n",
- "#if (length(args) < 1) {\n",
- "# celltype <- \"Astro\" # Default cell type\n",
- "# cat(\"No cell type specified, using default:\", celltype, \"\\n\")\n",
- "#} else {\n",
- "# celltype <- args[1]\n",
- "# cat(\"Processing cell type:\", celltype, \"\\n\")\n",
- "#}\n",
- "\n",
- "celltype <- \"Astro\"\n",
- "\n",
- "# Define input and output paths\n",
- "reformat_input_dir <- file.path(output_dir,\"2_residuals\")\n",
- "#output_dir <- \"/home/al4225/project/kellis_snatac/output/3_phenotype_processing\"\n",
- "reformat_output_dir <- paste0(output_dir,\"/3_phenotype_processing/\", celltype)\n",
- "\n",
- "# Create output directory if it doesn't exist\n",
- "dir.create(reformat_output_dir, recursive = TRUE, showWarnings = FALSE)\n",
- "\n",
- "# Check if input directory exists\n",
- "celltype_dir <- file.path(reformat_input_dir, celltype)\n",
- "if (!dir.exists(celltype_dir)) {\n",
- " cat(\"Cell type directory not found:\", celltype_dir, \"\\n\")\n",
- " cat(\"Using backup directory...\\n\")\n",
- " celltype_dir <- file.path(reformat_input_dir, \"backup\", celltype)\n",
- " if (!dir.exists(celltype_dir)) {\n",
- " stop(\"Backup directory not found either: \", celltype_dir)\n",
- " }\n",
- "}\n",
- "\n",
- "input_file <- file.path(celltype_dir, paste0(celltype, \"_residuals.txt\"))\n",
- "output_bed <- file.path(reformat_output_dir, paste0(celltype, \"_kellis_snatac_phenotype.bed\"))\n",
- "\n",
- "# Check if input file exists\n",
- "if (!file.exists(input_file)) {\n",
- " stop(\"Input file not found: \", input_file)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "53d4672d-41e7-4533-b2e3-2eccf8c3b4d4",
- "metadata": {},
- "source": [
- "#### Processing data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "id": "90e6f9a9-8c97-4890-ae20-be758d8c7f1e",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Column names from first line: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT ...\n",
- "Reading residuals file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Astro/Astro_residuals.txt \n",
- "File has more data columns than header columns. Assuming first column is peak IDs.\n",
- "First few peak IDs: chr1-816945-817430, chr1-817852-818227, chr1-818626-819158, chr1-826625-827679, chr1-869475-870473, chr1-903568-904912 \n",
- "First few column names: SM-CJK5G, SM-CTDQN, SM-CJEIE, SM-CTEM5, SM-CJJ27, SM-CJIWT \n"
- ]
- }
- ],
- "source": [
- "# Read the first line manually to get the column names\n",
- "first_line <- readLines(input_file, n = 1)\n",
- "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n",
- "cat(\"Column names from first line:\", paste(head(col_names), collapse = \", \"), \"...\\n\")\n",
- "\n",
- "# Read the residuals file using fread but skip the header\n",
- "cat(\"Reading residuals file:\", input_file, \"\\n\")\n",
- "residuals <- fread(input_file, header = FALSE, skip = 1)\n",
- "\n",
- "# If we have an extra column compared to the header line (often happens with rownames)\n",
- "if (ncol(residuals) > length(col_names)) {\n",
- " cat(\"File has more data columns than header columns. Assuming first column is peak IDs.\\n\")\n",
- " peak_ids <- residuals[[1]]\n",
- " residuals <- residuals[, -1, with = FALSE]\n",
- " # Set proper column names excluding the first one which was for peak IDs\n",
- " if (length(col_names) >= 2) {\n",
- " setnames(residuals, col_names)\n",
- " }\n",
- "} else {\n",
- " # Normal case - columns match\n",
- " setnames(residuals, col_names)\n",
- " peak_ids <- residuals[[1]]\n",
- " residuals <- residuals[, -1, with = FALSE]\n",
- "}\n",
- "\n",
- "# Check that peak IDs and column names were properly extracted\n",
- "cat(\"First few peak IDs:\", paste(head(peak_ids), collapse = \", \"), \"\\n\")\n",
- "cat(\"First few column names:\", paste(head(colnames(residuals)), collapse = \", \"), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7cfa7749-21b4-4db8-b0d1-a82d7d3b3994",
- "metadata": {},
- "source": [
- "#### Parse peak ID"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "id": "cb6e8c9c-66f4-452e-b5ba-c88f9ca9de17",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Parsing peak IDs into BED format with midpoint coordinates\n"
- ]
- }
- ],
- "source": [
- "# Parse peak IDs to get chromosome, start, and end\n",
- "cat(\"Parsing peak IDs into BED format with midpoint coordinates\\n\")\n",
- "\n",
- "# Split each peak ID once instead of repeating strsplit() per field\n",
- "id_parts <- strsplit(peak_ids, \"-\")\n",
- "peak_start <- as.integer(sapply(id_parts, `[`, 2))\n",
- "peak_end <- as.integer(sapply(id_parts, `[`, 3))\n",
- "midpoint <- as.integer((peak_start + peak_end) / 2)\n",
- "\n",
- "parsed_peaks <- data.table(\n",
- "  '#chr' = sapply(id_parts, `[`, 1),\n",
- "  start = midpoint,\n",
- "  end = midpoint + 1L,\n",
- "  ID = peak_ids # Use peak_id as the ID column (4th column in BED)\n",
- ")\n",
- "\n",
- "\n",
- "# Add validation to ensure end > start\n",
- "if (any(parsed_peaks$end <= parsed_peaks$start)) {\n",
- " cat(\"Warning: Found records where end <= start. Fixing...\\n\")\n",
- " parsed_peaks[end <= start, end := start + 1]\n",
- "}\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "22eb8753-a035-4b01-906d-d552abf522d5",
- "metadata": {},
- "source": [
- "#### Create BED"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "id": "45df0f55-ce77-4e36-8af6-09511031d650",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing BED file to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_phenotype_processing/Astro/Astro_kellis_snatac_phenotype.bed \n",
- "Compressing BED file with bgzip...\n",
- "Process completed.\n",
- "Output file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_phenotype_processing/Astro/Astro_kellis_snatac_phenotype.bed.gz \n"
- ]
- }
- ],
- "source": [
- "# Create BED format with all data columns\n",
- "# BED format: chr, start, end, ID, followed by phenotype values with sample IDs as column names\n",
- "bed_data <- cbind(parsed_peaks, residuals)\n",
- "\n",
- "# Sort by chromosome and position\n",
- "setorder(bed_data, '#chr', start, end)\n",
- "\n",
- "# Write BED file with headers\n",
- "cat(\"Writing BED file to:\", output_bed, \"\\n\")\n",
- "fwrite(bed_data, output_bed, sep = \"\\t\", col.names = TRUE, quote = FALSE)\n",
- "\n",
- "# Compress the BED file with bgzip\n",
- "cat(\"Compressing BED file with bgzip...\\n\")\n",
- "bgzip_cmd <- paste(\"bgzip -f\", output_bed)\n",
- "system(bgzip_cmd)\n",
- "\n",
- "cat(\"Process completed.\\n\")\n",
- "cat(\"Output file:\", paste0(output_bed, \".gz\"), \"\\n\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "daba0745-5ce6-4a32-8ad0-3e0647f15052",
- "metadata": {},
- "source": [
- "### Format B: Regions Peak Filtering\n",
- "**Input:**\n",
- "- `{celltype}_filtered_raw_counts.txt` from Step 1 Option B (in `2_residuals/{celltype}/`)\n",
- "\n",
- "**Process:**\n",
- "1. Reads filtered raw counts for each cell type\n",
- "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n",
- "3. Calculates peak metrics:\n",
- " - `peakwidth`: End - Start\n",
- " - `midpoint`: (Start + End) / 2\n",
- "4. Filters for **specific genomic regions of interest**:\n",
- " - **Chr7:** 28,000,000 - 28,300,000 bp (300kb region)\n",
- " - **Chr11:** 85,050,000 - 86,200,000 bp (1.15Mb region)\n",
- "5. Includes peaks that overlap these regions (start, end, or span the boundaries)\n",
- "6. Calculates summary statistics:\n",
- " - `total_count`: Sum of counts across all samples per peak\n",
- " - `weighted_count`: total_count / peakwidth (normalizes for peak size)\n",
- "\n",
- "**Output:** `output/3_regions/{celltype}/`\n",
- "- `{celltype}_filtered_regions_of_interest.txt`: Full count data for peaks in target regions (all samples × selected peaks)\n",
- "- `{celltype}_filtered_regions_of_interest_summary.txt`: Peak metadata with coordinates and count statistics\n",
- "\n",
- "**Use Case:** \n",
- "Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci like APOE region, TREM2 locus) where biological variation should be preserved for interpretation."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4b87b48b-c7ec-4799-bba1-4574d4d660fe",
- "metadata": {},
- "source": [
- "#### Filter and save data for a specific cell type"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "id": "891a02b1-8f7c-4b97-b151-371b65ec52a3",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Mic/Mic_filtered_raw_counts.txt \n",
- "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Astro/Astro_filtered_raw_counts.txt \n",
- "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Oligo/Oligo_filtered_raw_counts.txt \n",
- "Processing Exc data from: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Exc/Exc_filtered_raw_counts.txt \n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Warning message in fread(input_file, check.names = TRUE):\n",
- "“Detected 83 column names but the data has 84 columns (i.e. invalid file). Added an extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.”\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Found 276 regions of interest for Exc \n",
- "Saved filtered data to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_regions/Exc/Exc_filtered_regions_of_interest.txt \n",
- "Saved summary data to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/3_regions/Exc/Exc_filtered_regions_of_interest_summary.txt \n",
- "\n",
- "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/Inh/Inh_filtered_raw_counts.txt \n",
- "File not found: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals/OPC/OPC_filtered_raw_counts.txt \n"
- ]
- }
- ],
- "source": [
- "# Function to filter and save data for a specific cell type with additional summary information\n",
- "filter_and_save_by_celltype <- function(celltype) {\n",
- " # Create output directory\n",
- " peak_output_dir <- file.path(output_dir,\"3_regions\", celltype)\n",
- " dir.create(peak_output_dir, recursive = TRUE, showWarnings = FALSE)\n",
- " \n",
- " # Load filtered raw counts for the cell type\n",
- " input_file <- file.path(output_dir,\"2_residuals\", celltype, paste0(celltype, \"_filtered_raw_counts.txt\"))\n",
- " \n",
- " # Check if file exists before reading\n",
- " if (!file.exists(input_file)) {\n",
- " cat(\"File not found:\", input_file, \"\\n\")\n",
- " return(FALSE)\n",
- " }\n",
- " \n",
- " cat(\"Processing\", celltype, \"data from:\", input_file, \"\\n\")\n",
- " \n",
- " # Read data - handling the row names issue\n",
- " cell_data <- fread(input_file, check.names = TRUE)\n",
- " \n",
- " # If the first column has no name (it's row names), give it a proper name\n",
- " if (names(cell_data)[1] == \"V1\") {\n",
- " setnames(cell_data, \"V1\", \"peak_id\")\n",
- " }\n",
- " \n",
- " # Parse coordinates from peak IDs\n",
- " cell_data$chr <- gsub(\"^(chr[^-]+)-.*$\", \"\\\\1\", cell_data$peak_id)\n",
- " cell_data$start <- as.numeric(gsub(\"^chr[^-]+-([0-9]+)-.*$\", \"\\\\1\", cell_data$peak_id))\n",
- " cell_data$end <- as.numeric(gsub(\"^chr[^-]+-[0-9]+-([0-9]+)$\", \"\\\\1\", cell_data$peak_id))\n",
- " \n",
- " # Calculate additional metrics\n",
- " cell_data$peakwidth <- cell_data$end - cell_data$start\n",
- " cell_data$midpoint <- (cell_data$start + cell_data$end) / 2\n",
- " \n",
- " # Filter for chr7 and chr11\n",
- " chr_filtered <- cell_data[cell_data$chr %in% c(\"chr7\", \"chr11\"), ]\n",
- " \n",
- " # Filter for the specific regions\n",
- " region_filtered <- chr_filtered[\n",
- " # Chr7: 28,000kb-28,300kb\n",
- " (chr_filtered$chr == \"chr7\" & \n",
- " ((chr_filtered$start >= 28000000 & chr_filtered$start <= 28300000) | \n",
- " (chr_filtered$end >= 28000000 & chr_filtered$end <= 28300000) |\n",
- " (chr_filtered$start <= 28000000 & chr_filtered$end >= 28300000))) |\n",
- " # Chr11: 85,050kb-86,200kb\n",
- " (chr_filtered$chr == \"chr11\" & \n",
- " ((chr_filtered$start >= 85050000 & chr_filtered$start <= 86200000) | \n",
- " (chr_filtered$end >= 85050000 & chr_filtered$end <= 86200000) |\n",
- " (chr_filtered$start <= 85050000 & chr_filtered$end >= 86200000))),\n",
- " ]\n",
- " \n",
- " # Report results\n",
- " cat(\"Found\", nrow(region_filtered), \"regions of interest for\", celltype, \"\\n\")\n",
- " \n",
- " # Save the original filtered data (with all columns)\n",
- " output_file <- file.path(peak_output_dir, paste0(celltype,\"_filtered_regions_of_interest.txt\"))\n",
- " write.table(region_filtered, output_file, sep=\"\\t\", quote=FALSE, row.names=FALSE)\n",
- " cat(\"Saved filtered data to:\", output_file, \"\\n\")\n",
- " \n",
- " # Calculate total count for each peak (sum across all samples)\n",
- " # Get only the numeric columns (exclude the metadata columns we added)\n",
- " meta_cols <- c(\"peak_id\", \"chr\", \"start\", \"end\", \"peakwidth\", \"midpoint\")\n",
- " count_cols <- setdiff(names(region_filtered), meta_cols)\n",
- " \n",
- " # Ensure all count columns are numeric\n",
- " region_filtered_counts <- region_filtered[, ..count_cols]\n",
- " region_filtered_counts <- as.data.frame(apply(region_filtered_counts, 2, as.numeric))\n",
- " \n",
- " # Calculate total count\n",
- " region_filtered$total_count <- rowSums(region_filtered_counts)\n",
- " \n",
- " # Calculate weighted count (total count / peakwidth)\n",
- " region_filtered$weighted_count <- region_filtered$total_count / region_filtered$peakwidth\n",
- " \n",
- " # Create a summary data frame with just the metadata columns\n",
- " summary_df <- data.table(\n",
- " peak_id = region_filtered$peak_id,\n",
- " chr = region_filtered$chr,\n",
- " start = region_filtered$start,\n",
- " end = region_filtered$end,\n",
- " midpoint = region_filtered$midpoint,\n",
- " peakwidth = region_filtered$peakwidth,\n",
- " total_count = region_filtered$total_count,\n",
- " weighted_count = region_filtered$weighted_count\n",
- " )\n",
- " \n",
- " # Save the summary data\n",
- " summary_file <- file.path(peak_output_dir, paste0(celltype,\"_filtered_regions_of_interest_summary.txt\"))\n",
- " write.table(summary_df, summary_file, sep=\"\\t\", quote=FALSE, row.names=FALSE)\n",
- " cat(\"Saved summary data to:\", summary_file, \"\\n\\n\")\n",
- " \n",
- " return(TRUE)\n",
- "}\n",
- "\n",
- "# List of cell types to process\n",
- "celltypes <- c(\"Mic\", \"Astro\", \"Oligo\", \"Exc\", \"Inh\", \"OPC\")\n",
- "\n",
- "\n",
- "# Process each cell type\n",
- "for (ct in celltypes) {\n",
- " filter_and_save_by_celltype(ct)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "84c7cc6f-5080-45c1-8204-3b77623a557e",
- "metadata": {},
- "source": [
- "## Alternative Pseudobulk Pipeline with Batch Correction\n",
- "\n",
- "This alternative preprocessing approach uses ComBat-seq for explicit batch correction. It was developed on a different dataset (multiome) but illustrates a strategy for cases where batch effects are severe.\n",
- "\n",
- "---\n",
- "\n",
- "#### When to Use This Approach:\n",
- "- Strong batch effects that need active correction (not just covariate adjustment)\n",
- "- Data from multiple sequencing runs with substantial technical artifacts\n",
- "- When batch confounds with biological variables of interest\n",
- "- Visible batch clusters in PCA/t-SNE plots\n",
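- "\n",
- "A quick visual check for batch clusters, sketched in R (assuming a log-CPM matrix `logCPM` and a batch factor `batches` as used elsewhere in this pipeline):\n",
- "\n",
- "```r\n",
- "pc <- prcomp(t(logCPM))\n",
- "plot(pc$x[, 1], pc$x[, 2], col = as.integer(factor(batches)),\n",
- "     pch = 19, xlab = \"PC1\", ylab = \"PC2\")\n",
- "```\n",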
- "\n",
- "---\n",
- "\n",
- "**Input:**\n",
- "- QC'd Seurat object with metadata: `{celltype}_qced.rds`\n",
- "- Pseudobulk peak counts: `{celltype}.rds`\n",
- "- Sample covariates: `rosmap_cov.txt`\n",
- "- Batch information: `SampleSheet.csv` and `sampleSheetAfterQc.csv`\n",
- "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n",
- "\n",
- "**Process:**\n",
- "1. Loads Seurat object and extracts metadata\n",
- "2. Loads pseudobulk peak count matrix\n",
- "3. Calculates technical QC metrics per sample:\n",
- " - `TSSEnrichment`: Median TSS enrichment\n",
- " - `NucleosomeRatio`: Median nucleosome ratio\n",
- " - `LogPercMt`: Log-transformed percent mitochondrial reads\n",
- " - `LogUniqueFrags`: Log-transformed unique fragments per sample\n",
- "4. Filters blacklisted genomic regions using `foverlaps()`\n",
- "5. Calculates peak metrics:\n",
- " - `LogTotalUniquePeaks`: Log-transformed count of unique peaks detected\n",
- "6. Merges with demographic covariates (msex, age_death, pmi, study)\n",
- "7. Creates DGEList object\n",
- "8. Applies expression filtering with `filterByExpr()`:\n",
- "   - `min.count = 5`: Minimum count a peak must reach (applied as a CPM-equivalent cutoff)\n",
- "   - `min.total.count = 7`: Minimum total reads for a peak across all samples\n",
- "   - `min.prop = 0.7`: Peak must pass the count cutoff in ≥70% of samples\n",
- "9. TMM normalization with `calcNormFactors()`\n",
- "10. **Batch processing:**\n",
- " - Loads sequencing batch information from sample sheets\n",
- " - Filters singleton batches (batches with only 1 sample)\n",
- " - Filters samples with low library sizes (< 5000 recommended)\n",
- "11. **ComBat-seq batch correction:**\n",
- " ```r\n",
- " adjusted_counts <- ComBat_seq(\n",
- " counts = dge$counts, \n",
- " batch = batches\n",
- " )\n",
- " ```\n",
- "12. Fits linear model on batch-corrected counts using `voom()` and `lmFit()`:\n",
- " ```r\n",
- " model <- ~ pmi + msex + age_death + \n",
- " TSSEnrichment + NucleosomeRatio + LogPercMt +\n",
- " LogUniqueFrags + LogTotalUniquePeaks + \n",
- " study\n",
- " ```\n",
- " Note: Batch is NOT in the model because it was corrected by ComBat-seq\n",
- "13. Calculates residuals using `predictOffset()`: `offset + residuals`\n",
- " - `offset`: Predicted expression at median/reference covariate values\n",
- " - `residuals`: Unexplained variation after removing covariate effects\n",
- "\n",
- "However, ComBat-seq encountered persistent errors with this dataset:\n",
- "```\n",
- "Error in .compressOffsets(y, lib.size = lib.size, offset = offset):\n",
- "offsets must be finite values\n",
- "```\n",
- "\n",
- "**Issues with ComBat-seq for this data:**\n",
- "- Dataset had 232 samples across 60 batches (many small batches)\n",
- "- Error persisted even after:\n",
- " - Filtering samples with low library sizes (< 5000)\n",
- " - Removing singleton batches\n",
- " - Ensuring all counts and library sizes were finite\n",
- " - Verifying no zero-sum peaks\n",
- "- Likely due to internal ComBat-seq edge case with highly fragmented batch structure\n",
- "\n",
- "**Solution:** Use limma's `removeBatchEffect` which operates on log-CPM values and is more robust to small batch sizes.\n",
- "\n",
- "**Process:**\n",
- "\n",
- "Steps 1-9 are identical to the ComBat-seq pipeline above: load the Seurat metadata and pseudobulk counts, compute per-sample QC metrics, filter blacklisted regions, calculate `LogTotalUniquePeaks`, merge demographic covariates, build the DGEList, filter with `filterByExpr()`, and TMM-normalize.\n",
- "\n",
- "10. **Batch processing:**\n",
- "   - Loads sequencing batch information from sample sheets\n",
- "   - Filters singleton batches (batches with only 1 sample)\n",
- "11. **Batch correction using limma's removeBatchEffect:**\n",
- " ```r\n",
- " # Get log-CPM values\n",
- " logCPM <- cpm(dge, log=TRUE, prior.count=1)\n",
- " \n",
- " # Remove batch effects\n",
- " adjusted_logCPM <- removeBatchEffect(\n",
- " logCPM,\n",
- " batch = batches,\n",
- " design = model.matrix(~1, data=dge$samples)\n",
- " )\n",
- " \n",
- " # Convert back to counts scale (approximate)\n",
- " adjusted_counts <- 2^adjusted_logCPM * mean(dge$$samples$$lib.size) / 1e6\n",
- " adjusted_counts <- round(adjusted_counts)\n",
- " adjusted_counts[adjusted_counts < 0] <- 0\n",
- " ```\n",
- "12. Updates sample alignment:\n",
- " - Ensures valid_samples match current filtered data\n",
- " - Aligns covariates with sample order\n",
- " - Converts tibble to data.frame and sets rownames\n",
- "13. Fits linear model on batch-corrected counts using `voom()` and `lmFit()`:\n",
- " ```r\n",
- " model <- ~ pmi + msex + age_death + TSSEnrichment + NucleosomeRatio + LogPercMt + LogUniqueFrags + LogTotalUniquePeaks + study\n",
- " ```\n",
- " Note: Batch is NOT in the model because it was corrected by removeBatchEffect\n",
- "14. Creates new DGEList with batch-corrected counts\n",
- "15. Recalculates library sizes and TMM normalization factors\n",
- "16. Calculates residuals using `predictOffset()`: `offset + residuals`\n",
- " - `offset`: Predicted expression at median/reference covariate values\n",
- " - `residuals`: Unexplained variation after removing covariate effects\n",
- "\n",
- "\n",
- "**Output:** `output/3_calculateResiduals/{celltype})`\n",
- "- `{celltype}_results.rds`: Complete results object containing:\n",
- " - `dge`: Batch-corrected DGEList\n",
- " - `offset`: Predicted offset values\n",
- " - `residuals`: Model residuals\n",
- " - `batch_adjusted_counts`: removeBatchEffect corrected counts\n",
- " - `final_data`: Final adjusted expression (offset + residuals)\n",
- " - `valid_samples`: Sample IDs after filtering\n",
- " - `design`: Design matrix\n",
- " - `fit`: Linear model fit object\n",
- "- `{celltype}_residuals.txt`: Final covariate-adjusted peak accessibility (log2-CPM scale)\n",
- "\n",
- "\n",
- "**Key Differences from ComBat-seq:**\n",
- "- Operates on log-CPM values (not integer counts)\n",
- "- More robust to small/unbalanced batch sizes\n",
- "- Does not model mean-variance relationship (simpler correction)\n",
- "- Approximate back-transformation to count scale\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fb2d3390-250e-43dc-848d-a02bcea6bbee",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "id": "780bafdb-0057-4b18-b5a7-7c7ea3450926",
- "metadata": {},
- "outputs": [],
- "source": [
- "library(data.table)\n",
- "library(stringr)\n",
- "library(Seurat)\n",
- "library(dplyr)\n",
- "library(sva)\n",
- "library(edgeR)\n",
- "library(limma)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5c26a146-535e-4123-aca0-f41e6f3f5a0b",
- "metadata": {},
- "source": [
- "#### Create output directory"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "id": "52e0c2f0-73da-4cd5-b43d-b744e4b0d726",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing celltype: Astro \n"
- ]
- },
- {
- "data": {
- "text/html": [
- "'/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro'"
- ],
- "text/latex": [
- "'/restricted/projectnb/xqtl/jaempawi/atac\\_seq/output/kellis/2\\_residuals\\_batch\\_corrected/Astro'"
- ],
- "text/markdown": [
- "'/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro'"
- ],
- "text/plain": [
- "[1] \"/restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro\""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "cat(\"Processing celltype:\", celltype, \"\\n\")\n",
- "\n",
- "residual_out_dir <- file.path(output_dir,\"2_residuals_batch_corrected\", celltype)\n",
- "dir.create(residual_out_dir, recursive = TRUE, showWarnings = FALSE)\n",
- "residual_out_dir"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9990a617-310c-47aa-9895-712db99b766f",
- "metadata": {},
- "source": [
- "#### Create predictOffset function "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "id": "f0a1ed02-4744-422e-9830-90886cb9ec04",
- "metadata": {},
- "outputs": [],
- "source": [
- "predictOffset <- function(fit) {\n",
- " # Define which variables are factors and which are continuous\n",
- " usedFactors <- c(\"study\") \n",
- " usedContinuous <- c(\"pmi\", \"msex\", \"age_death\", \n",
- " \"TSSEnrichment\", \"NucleosomeRatio\", \"LogPercMt\",\n",
- " \"LogUniqueFrags\", \"LogTotalUniquePeaks\")\n",
- " \n",
- " # Get indices for factor and continuous variables\n",
- " facInd <- unlist(lapply(as.list(usedFactors), \n",
- " function(f) {return(grep(paste(\"^\", f, sep=\"\"), \n",
- " colnames(fit$design)))}))\n",
- " contInd <- unlist(lapply(as.list(usedContinuous), \n",
- " function(f) {return(grep(paste(\"^\", f, sep=\"\"), \n",
- " colnames(fit$design)))}))\n",
- " \n",
- " # Verify design matrix structure\n",
- " stopifnot(!any(duplicated(c(1, facInd, contInd))))\n",
- " stopifnot(all(c(1, facInd, contInd) %in% 1:ncol(fit$design)))\n",
- " stopifnot(1:ncol(fit$design) %in% c(1, facInd, contInd))\n",
- " \n",
- " # Create new design matrix with median values\n",
- " D <- fit$design\n",
- " D[, facInd] <- 0\n",
- " medContVals <- apply(D[, contInd], 2, median)\n",
- " for (i in 1:length(medContVals)) {\n",
- " D[, names(medContVals)[i]] <- medContVals[i]\n",
- " }\n",
- " \n",
- " # Calculate offsets\n",
- " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n",
- " offsets <- apply(coefficients(fit), 1, function(c) {\n",
- " return(D %*% c)\n",
- " })\n",
- " offsets <- t(offsets)\n",
- " colnames(offsets) <- rownames(fit$design)\n",
- " \n",
- " return(offsets)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af303e3f-b918-4cc4-8155-1f7974b48cde",
- "metadata": {},
- "source": [
- "#### Load input data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "id": "e00e96b8-9e1f-4387-8a3c-fcca2ae2d342",
- "metadata": {},
- "outputs": [],
- "source": [
- "meta_data = readRDS(file.path(input_dir,\"Endothelial_qced.rds\"))\n",
- "meta = meta_data@meta.data\n",
- "peak <- readRDS(file.path(input_dir,'Endothelial.rds'))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d746195e-ca0a-4695-a42c-64bf9c677c85",
- "metadata": {},
- "source": [
- "#### Process technical variables"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "id": "bb915f93-9088-4db6-b663-dc893e25fe85",
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "A tibble: 6 × 7\n",
- "\n",
- "\t| demuxlet_SNG.BEST.GUESS | TSSEnrichment | NucleosomeRatio | PercMt | UniqueFrags | LogPercMt | LogUniqueFrags |
\n",
- "\t| <chr> | <dbl> | <dbl> | <dbl> | <int> | <dbl> | <dbl> |
\n",
- "\n",
- "\n",
- "\t| MAP26637867 | 7.315 | 1.4334204 | 0.5755396 | 87 | -0.5524456 | 4.465908 |
\n",
- "\t| MAP50106992 | 6.237 | 1.5703523 | 1.7753647 | 19 | 0.5740064 | 2.944439 |
\n",
- "\t| MAP61344957 | 14.587 | 0.7494390 | 0.2781486 | 7 | -1.2795960 | 1.945910 |
\n",
- "\t| ROS11430815 | 6.606 | 1.4644619 | 0.2029770 | 9 | -1.5946577 | 2.197225 |
\n",
- "\t| ROS15738428 | 12.620 | 0.9908817 | 0.1889933 | 32 | -1.6660383 | 3.465736 |
\n",
- "\t| ROS20945666 | 7.609 | 1.6842417 | 0.3885004 | 65 | -0.9454585 | 4.174387 |
\n",
- "\n",
- "
\n"
- ],
- "text/latex": [
- "A tibble: 6 × 7\n",
- "\\begin{tabular}{lllllll}\n",
- " demuxlet\\_SNG.BEST.GUESS & TSSEnrichment & NucleosomeRatio & PercMt & UniqueFrags & LogPercMt & LogUniqueFrags\\\\\n",
- " & & & & & & \\\\\n",
- "\\hline\n",
- "\t MAP26637867 & 7.315 & 1.4334204 & 0.5755396 & 87 & -0.5524456 & 4.465908\\\\\n",
- "\t MAP50106992 & 6.237 & 1.5703523 & 1.7753647 & 19 & 0.5740064 & 2.944439\\\\\n",
- "\t MAP61344957 & 14.587 & 0.7494390 & 0.2781486 & 7 & -1.2795960 & 1.945910\\\\\n",
- "\t ROS11430815 & 6.606 & 1.4644619 & 0.2029770 & 9 & -1.5946577 & 2.197225\\\\\n",
- "\t ROS15738428 & 12.620 & 0.9908817 & 0.1889933 & 32 & -1.6660383 & 3.465736\\\\\n",
- "\t ROS20945666 & 7.609 & 1.6842417 & 0.3885004 & 65 & -0.9454585 & 4.174387\\\\\n",
- "\\end{tabular}\n"
- ],
- "text/markdown": [
- "\n",
- "A tibble: 6 × 7\n",
- "\n",
- "| demuxlet_SNG.BEST.GUESS <chr> | TSSEnrichment <dbl> | NucleosomeRatio <dbl> | PercMt <dbl> | UniqueFrags <int> | LogPercMt <dbl> | LogUniqueFrags <dbl> |\n",
- "|---|---|---|---|---|---|---|\n",
- "| MAP26637867 | 7.315 | 1.4334204 | 0.5755396 | 87 | -0.5524456 | 4.465908 |\n",
- "| MAP50106992 | 6.237 | 1.5703523 | 1.7753647 | 19 | 0.5740064 | 2.944439 |\n",
- "| MAP61344957 | 14.587 | 0.7494390 | 0.2781486 | 7 | -1.2795960 | 1.945910 |\n",
- "| ROS11430815 | 6.606 | 1.4644619 | 0.2029770 | 9 | -1.5946577 | 2.197225 |\n",
- "| ROS15738428 | 12.620 | 0.9908817 | 0.1889933 | 32 | -1.6660383 | 3.465736 |\n",
- "| ROS20945666 | 7.609 | 1.6842417 | 0.3885004 | 65 | -0.9454585 | 4.174387 |\n",
- "\n"
- ],
- "text/plain": [
- " demuxlet_SNG.BEST.GUESS TSSEnrichment NucleosomeRatio PercMt UniqueFrags\n",
- "1 MAP26637867 7.315 1.4334204 0.5755396 87 \n",
- "2 MAP50106992 6.237 1.5703523 1.7753647 19 \n",
- "3 MAP61344957 14.587 0.7494390 0.2781486 7 \n",
- "4 ROS11430815 6.606 1.4644619 0.2029770 9 \n",
- "5 ROS15738428 12.620 0.9908817 0.1889933 32 \n",
- "6 ROS20945666 7.609 1.6842417 0.3885004 65 \n",
- " LogPercMt LogUniqueFrags\n",
- "1 -0.5524456 4.465908 \n",
- "2 0.5740064 2.944439 \n",
- "3 -1.2795960 1.945910 \n",
- "4 -1.5946577 2.197225 \n",
- "5 -1.6660383 3.465736 \n",
- "6 -0.9454585 4.174387 "
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- }
- ],
- "source": [
- "tech_vars <- meta %>%\n",
- " group_by(demuxlet_SNG.BEST.GUESS) %>%\n",
- " summarise(\n",
- " TSSEnrichment = median(TSSEnrichment),\n",
- " NucleosomeRatio = median(NucleosomeRatio),\n",
- " PercMt = median(percent.mt),\n",
- " UniqueFrags = n_distinct(demuxlet_BARCODE)\n",
- " ) %>%\n",
- " mutate(\n",
- " LogPercMt = log(PercMt + 1e-6),\n",
- " LogUniqueFrags = log(UniqueFrags + 1e-6)\n",
- " )\n",
- "head(tech_vars)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bdc81763-f662-4f49-a154-7002da0953dd",
- "metadata": {},
- "source": [
- "#### Process peaks "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "id": "1ba88004-7088-4b2d-b169-3fc6f424a540",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Load blacklist\n",
- "blacklist_df <- fread(file.path(input_dir,\"hg38-blacklist.v2.bed.gz\"))\n",
- "colnames(blacklist_df) <- c(\"chr\", \"start\", \"end\", \"label\")\n",
- "\n",
- "# Process peak coordinates\n",
- "peak_df <- data.table(\n",
- " peak_name = rownames(peak),\n",
- " chr = str_extract(rownames(peak), \"chr[0-9XY]+\"),\n",
- " start = as.integer(str_extract(rownames(peak), \"(?<=:)[0-9]+\")),\n",
- " end = as.integer(str_extract(rownames(peak), \"(?<=-)[0-9]+\"))\n",
- ")\n",
- "\n",
- "# Filter blacklisted peaks\n",
- "setkey(blacklist_df, chr, start, end)\n",
- "setkey(peak_df, chr, start, end)\n",
- "overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n",
- "blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n",
- "filtered_peak <- peak[!rownames(peak) %in% blacklisted_peaks,]\n",
- "\n",
- "# Calculate peak metrics\n",
- "peak_metrics <- data.frame(\n",
- " sample = colnames(filtered_peak),\n",
- " TotalUniquePeaks = colSums(filtered_peak > 0)\n",
- ") %>%\n",
- " mutate(LogTotalUniquePeaks = log(TotalUniquePeaks))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "087e03aa-3830-4b0b-a291-2e0447b36d99",
- "metadata": {},
- "source": [
- "#### Load and merge covariates"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "id": "970879a5-7964-468d-9b53-7fbcc88ac24c",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of samples after joining: 233 \n",
- "Sample IDs: MAP26637867, MAP50106992, MAP61344957, ROS11430815, ROS15738428, ROS20945666 ...\n"
- ]
- }
- ],
- "source": [
- "covariates <- fread(file.path(input_dir,'rosmap_cov.txt')) %>%\n",
- " select('#id', msex, age_death, pmi, study)\n",
- "\n",
- "all_covs <- tech_vars %>%\n",
- " inner_join(peak_metrics, by = c(\"demuxlet_SNG.BEST.GUESS\" = \"sample\")) %>%\n",
- " inner_join(covariates, by = c(\"demuxlet_SNG.BEST.GUESS\" = \"#id\"))\n",
- "\n",
- "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n",
- "cat(\"Sample IDs:\", paste(head(all_covs$demuxlet_SNG.BEST.GUESS), collapse=\", \"), \"...\\n\")\n",
- "\n",
- "# Impute missing values\n",
- "for(col in c(\"pmi\", \"age_death\")) {\n",
- " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "44c51fa4-c3e4-4e1c-84f4-599d8d1bf156",
- "metadata": {},
- "source": [
- "#### Create DGE object"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "id": "e6608fc5-a07c-47aa-8996-96996f9297ce",
- "metadata": {},
- "outputs": [],
- "source": [
- "dge <- DGEList(\n",
- " counts = filtered_peak,\n",
- " samples = all_covs\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ec4756f2-9410-4529-81f4-bca4e5ae510c",
- "metadata": {},
- "source": [
- "#### Filter low counts and normalize"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 54,
- "id": "d9ba397d-049e-4d80-bf6f-984ee8b50724",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks before filtering: 130930 \n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n",
- "“All samples appear to belong to the same group.”\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks after filtering: 21197 \n"
- ]
- }
- ],
- "source": [
- "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n",
- "\n",
- "# keep <- filterByExpr(dge) #only 2 peaks left in mic\n",
- "# default paramter:\n",
- "# keep <- filterByExpr(y, \n",
- "# min.count = 10, # for one sample, min reads \n",
- "# min.total.count = 15, # min reads overall\n",
- "# min.prop = 0.7) \n",
- "\n",
- "keep <- filterByExpr(dge, \n",
- " min.count = 5, # for one sample, min reads \n",
- " min.total.count = 15, # min reads overall\n",
- " min.prop = 0.1,\n",
- " group = NULL) \n",
- "\n",
- "dge <- dge[keep, , keep.lib.sizes=TRUE] #mic: from 130930 to 2\n",
- "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\")\n",
- "dge <- calcNormFactors(dge, method=\"TMM\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ce856212-efb8-4a52-b16f-fb178d39b47b",
- "metadata": {},
- "source": [
- "#### Load batch information"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 55,
- "id": "7a2bcb0a-9e56-4cab-9439-259cb880a66f",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\u001b[1m\u001b[22mJoining with `by = join_by(ProjID)`\n"
- ]
- }
- ],
- "source": [
- "sample_file <- file.path(input_dir,\"SampleSheet.csv\")\n",
- "wgs_qc_file <- file.path(input_dir,\"sampleSheetAfterQc.csv\")\n",
- "\n",
- "sample <- fread(sample_file, colClasses = \"character\")\n",
- "wgs_qc <- fread(wgs_qc_file, colClasses = \"character\")\n",
- "sample <- sample %>%\n",
- " inner_join(wgs_qc) %>%\n",
- " select(SequencingID, SampleID)\n",
- "\n",
- "# Extract batch information\n",
- "batches <- sample$SequencingID\n",
- "names(batches) <- sample$SampleID\n",
- "\n",
- "valid_samples <- colnames(dge$counts)\n",
- "batches <- batches[valid_samples]"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "42e9ccbe-c1c5-4f29-9f7e-bd9fa967e367",
- "metadata": {},
- "source": [
- "#### Run ComBat-seq"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "id": "d278d1bf-b8d9-47af-8cc3-8d0fb7d8eebe",
- "metadata": {},
- "outputs": [
- {
- "ename": "ERROR",
- "evalue": "Error: all(colnames(dge$counts) == names(batches)) is not TRUE\n",
- "output_type": "error",
- "traceback": [
- "Error: all(colnames(dge$counts) == names(batches)) is not TRUE\nTraceback:\n",
- "1. stop(simpleError(msg, call = if (p <- sys.parent(1L)) sys.call(p)))"
- ]
- }
- ],
- "source": [
- "# Filter batches with only one sample\n",
- "batch_counts <- table(batches)\n",
- "valid_batches <- names(batch_counts[batch_counts > 1])\n",
- "batches <- batches[batches %in% valid_batches]\n",
- "valid_samples <- names(batches)\n",
- "\n",
- "keep <- colnames(dge$counts) %in% names(batches)\n",
- "dge <- dge[keep, , keep.lib.sizes=TRUE]\n",
- "batches <- batches[colnames(dge$counts)]\n",
- "stopifnot(all(colnames(dge$counts) == names(batches)))\n",
- "\n",
- "cat(\"Number of samples after batch filtering:\", length(valid_samples), \"\\n\")\n",
- "cat(\"Number of batches:\", length(unique(batches)), \"\\n\")\n",
- "\n",
- "# Run ComBat-seq\n",
- "adjusted_counts <- ComBat_seq(\n",
- " counts = dge$counts, \n",
- " batch = batches\n",
- ")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0e219247-72d4-44e5-bd5c-4dffe90537e9",
- "metadata": {},
- "source": [
- "#### Create model and run voom"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "id": "fef7c108-0ca0-42ca-b6b8-4ca3410bb507",
- "metadata": {},
- "outputs": [
- {
- "ename": "ERROR",
- "evalue": "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels\n",
- "output_type": "error",
- "traceback": [
- "Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels\nTraceback:\n",
- "1. model.matrix.default(model, data = all_covs[valid_samples, ])",
- "2. `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])",
- "3. stop(\"contrasts can be applied only to factors with 2 or more levels\")",
- "4. .handleSimpleError(function (cnd) \n . {\n . watcher$capture_plot_and_output()\n . cnd <- sanitize_call(cnd)\n . watcher$push(cnd)\n . switch(on_error, continue = invokeRestart(\"eval_continue\"), \n . stop = invokeRestart(\"eval_stop\"), error = NULL)\n . }, \"contrasts can be applied only to factors with 2 or more levels\", \n . base::quote(`contrasts<-`(`*tmp*`, value = contr.funs[1 + \n . isOF[nn]])))"
- ]
- }
- ],
- "source": [
- "model <- ~ pmi + msex + age_death + \n",
- " TSSEnrichment + NucleosomeRatio + LogPercMt +\n",
- " LogUniqueFrags + LogTotalUniquePeaks + \n",
- " study\n",
- "\n",
- "# Update design matrix for remaining samples\n",
- "design <- model.matrix(model, data=all_covs[valid_samples,])\n",
- "stopifnot(is.fullrank(design))\n",
- "\n",
- "dge_adjusted <- dge[, valid_samples] \n",
- "dge_adjusted$counts <- adjusted_counts[, valid_samples] \n",
- "\n",
- "# Run voom and fit model\n",
- "v <- voom(dge_adjusted[, valid_samples], design, plot=FALSE)\n",
- "fit <- lmFit(v, design)\n",
- "fit <- eBayes(fit)\n",
- "\n",
- "# Calculate offset and residuals\n",
- "offset <- predictOffset(fit)\n",
- "resids <- residuals(fit, y=v)\n",
- "\n",
- "# Verify dimensions\n",
- "stopifnot(all(rownames(offset) == rownames(resids)) &\n",
- " all(colnames(offset) == colnames(resids)))\n",
- "\n",
- "# Final adjusted data\n",
- "stopifnot(all(dim(offset) == dim(resids)))\n",
- "stopifnot(all(colnames(offset) == colnames(resids)))\n",
- "\n",
- "final_data <- offset + resids"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7ba1b486-bfcf-4488-90b8-252d94d256a2",
- "metadata": {},
- "source": [
- "#### Run LIMMA as Combat-seq alternative "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "id": "dbbe0cd9-9375-40d4-8d44-1d6556d12dfa",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Alternative: Use limma's removeBatchEffect instead\n",
- "# Get log-CPM values\n",
- "logCPM <- cpm(dge, log=TRUE, prior.count=1)\n",
- "\n",
- "# Remove batch effects\n",
- "adjusted_logCPM <- removeBatchEffect(\n",
- " logCPM,\n",
- " batch = batches,\n",
- " design = model.matrix(~1, data=dge$samples)\n",
- ")\n",
- "\n",
- "# Convert back to counts scale (approximate)\n",
- "adjusted_counts <- 2^adjusted_logCPM * mean(dge$samples$lib.size) / 1e6\n",
- "adjusted_counts <- round(adjusted_counts)\n",
- "adjusted_counts[adjusted_counts < 0] <- 0"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0a8e252b-bf0d-467f-9990-3fc4b0b2f99c",
- "metadata": {},
- "source": [
- "#### Create model and run voom"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "id": "56c20501-f7db-43b2-bd7d-47d32cc43d6c",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Update valid_samples to match current data\n",
- "valid_samples <- colnames(dge)\n",
- "\n",
- "# Get aligned covariates\n",
- "filtered_covs <- all_covs[match(valid_samples, all_covs$demuxlet_SNG.BEST.GUESS), ]\n",
- "filtered_covs <- as.data.frame(filtered_covs) # Convert from tibble\n",
- "rownames(filtered_covs) <- valid_samples\n",
- "\n",
- "\n",
- "# Build model formula\n",
- "model_formula <- ~ pmi + msex + age_death + \n",
- " TSSEnrichment + NucleosomeRatio + LogPercMt +\n",
- " LogUniqueFrags + LogTotalUniquePeaks + \n",
- " study\n",
- "\n",
- "# Create design matrix\n",
- "design <- model.matrix(model_formula, data=filtered_covs)\n",
- "rownames(design) <- valid_samples\n",
- "\n",
- "stopifnot(is.fullrank(design))\n",
- "stopifnot(all(rownames(design) == colnames(dge)))\n",
- "\n",
- "# Create properly formatted DGEList with adjusted counts\n",
- "dge_adjusted <- DGEList(\n",
- " counts = adjusted_counts,\n",
- " samples = filtered_covs\n",
- ")\n",
- "\n",
- "# Recalculate library sizes and normalization factors\n",
- "dge_adjusted$samples$lib.size <- colSums(dge_adjusted$counts)\n",
- "dge_adjusted <- calcNormFactors(dge_adjusted, method=\"TMM\")\n",
- "\n",
- "stopifnot(all(rownames(design) == colnames(dge_adjusted)))\n",
- "\n",
- "# Run voom and fit model\n",
- "v <- voom(dge_adjusted, design, plot=FALSE)\n",
- "fit <- lmFit(v, design)\n",
- "fit <- eBayes(fit)\n",
- "\n",
- "# Calculate offset and residuals\n",
- "offset <- predictOffset(fit)\n",
- "resids <- residuals(fit, y=v)\n",
- "\n",
- "# Final adjusted data\n",
- "final_data <- offset + resids\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c0093455-7ff7-47cb-9e31-baf913bbb4cd",
- "metadata": {},
- "source": [
- "#### Save results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "id": "10cce409-a132-41a6-b294-877e55e136c5",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Results saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/kellis/2_residuals_batch_corrected/Astro \n"
- ]
- }
- ],
- "source": [
- "saveRDS(list(\n",
- " dge = dge_adjusted,\n",
- " offset = offset,\n",
- " residuals = resids,\n",
- " batch_adjusted_counts = adjusted_counts,\n",
- " final_data = final_data,\n",
- " valid_samples = valid_samples,\n",
- " design = design,\n",
- " fit = fit\n",
- "), file = file.path(residual_out_dir, paste0(celltype, \"_results.rds\")))\n",
- "\n",
- "# Write final residual data to file\n",
- "write.table(final_data,\n",
- " file = file.path(residual_out_dir, paste0(celltype, \"_residuals.txt\")), \n",
- " quote=FALSE)\n",
- "\n",
- "cat(\"Results saved to:\", residual_out_dir, \"\\n\") "
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "R",
- "language": "R",
- "name": "ir"
- },
- "language_info": {
- "codemirror_mode": "r",
- "file_extension": ".r",
- "mimetype": "text/x-r-source",
- "name": "R",
- "pygments_lexer": "r",
- "version": "4.4.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb
new file mode 100644
index 00000000..b2b5acb6
--- /dev/null
+++ b/code/molecular_phenotypes/QC/snatacseq_preprocessing.ipynb
@@ -0,0 +1,1453 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "27a44eb1-acb9-40f5-bcdb-7ede63d5db5e",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "# Single-nucleus ATAC-seq Preprocessing Pipeline\n",
+ "\n",
+ "## Overview\n",
+ "\n",
+ "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) pseudobulk peak count data\n",
+ "for downstream chromatin accessibility QTL (caQTL) analysis and region-specific studies.\n",
+ "\n",
+ "**Goals:**\n",
+ "- Transform raw pseudobulk peak counts into analysis-ready formats\n",
+ "- Remove technical confounders while optionally preserving biological covariates\n",
+ "- Generate QTL-ready phenotype files or region-specific datasets\n",
+ "\n",
+ "## Pipeline Structure\n",
+ "```\n",
+ "Step 0: Sample ID Mapping\n",
+ "↓\n",
+ "Step 1: Pseudobulk QC\n",
+ "├── Option A: BIOvar (regress out technical + biological covariates)\n",
+ "└── Option B: noBIOvar (regress out technical covariates only)\n",
+ "↓ (optional)\n",
+ "Batch Correction (ComBat-seq or limma::removeBatchEffect)\n",
+ "↓\n",
+ "Step 2: Format Output\n",
+ "├── Format A: Phenotype Reformatting → BED (genome-wide caQTL mapping)\n",
+ "└── Format B: Region Peak Filtering → TSV (locus-specific analysis)\n",
+ "\n",
+ "```\n",
+ "\n",
+ "## Input Files\n",
+ "\n",
+ "All input files required to run this pipeline can be downloaded\n",
+ "[here](https://drive.google.com/drive/folders/1UzJuHN8SotMn-PJTBp9uGShD25YxapKr?usp=drive_link).\n",
+ "\n",
+ "| File | Used in |\n",
+ "|------|---------|\n",
+ "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | Step 0, Step 1 |\n",
+ "| `metadata_{celltype}.csv` | Step 0, Step 1 |\n",
+ "| `rosmap_sample_mapping_data.csv` | Step 0 |\n",
+ "| `rosmap_cov.txt` | Step 1 |\n",
+ "| `hg38-blacklist.v2.bed.gz` | Step 1 |\n",
+ "| `SampleSheet.csv` | Step 1 (batch correction only) |\n",
+ "| `sampleSheetAfterQc.csv` | Step 1 (batch correction only) |\n",
+ "\n",
+ "\n",
+ "## Minimal Working Example"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "50e13a3f-ab64-4bd1-b47c-acca8d58a8b9",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Step 0: Sample ID Mapping\n",
+ "\n",
+ "Maps original sample identifiers (`individualID`) to standardized sample IDs (`sampleid`)\n",
+ "across metadata and count matrix files.\n",
+ "\n",
+ "### Input\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `rosmap_sample_mapping_data.csv` | Mapping reference: `individualID → sampleid` |\n",
+ "| `metadata_{celltype}.csv` × 6 | Per-cell-type sample metadata |\n",
+ "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Per-cell-type peak count matrices |\n",
+ "\n",
+ "Cell types: `Ast`, `Ex`, `In`, `Microglia`, `Oligo`, `OPC`\n",
+ "\n",
+ "### Process\n",
+ "\n",
+ "**Part 1 — Metadata files**\n",
+ "\n",
+ "For each `metadata_{celltype}.csv`:\n",
+ "1. Look up each `individualID` in the mapping reference\n",
+ "2. Assign `sampleid` — falls back to `individualID` if no mapping found\n",
+ "3. Insert `sampleid` as the first column\n",
+ "4. Save updated file\n",
+ "\n",
+ "**Part 2 — Count matrix files**\n",
+ "\n",
+ "For each `pseudobulk_peaks_counts_{celltype}.csv.gz`:\n",
+ "1. Extract the header row (column names only)\n",
+ "2. Keep `peak_id` (first column) unchanged\n",
+ "3. Map remaining column names (`individualID` → `sampleid`) where mapping exists,\n",
+ " otherwise keep original\n",
+ "4. Write new header and stream data rows unchanged\n",
+ "5. Recompress with gzip\n",
+ "\n",
+ "### Output\n",
+ "\n",
+ "Output directory: `output/1_files_with_sampleid/`\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `metadata_{celltype}.csv` × 6 | Metadata with `sampleid` column prepended |\n",
+ "| `pseudobulk_peaks_counts_{celltype}.csv.gz` × 6 | Count matrices with mapped column headers |\n",
+ "\n",
+ "**Timing:** < 1 min\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33f7cbe0-bf5e-4d7e-8a2b-216915dea78e",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n",
+ " --cwd output/atac_seq/1_files_with_sampleid \\\n",
+ " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n",
+ " --input_dir data/atac_seq/1_files_with_sampleid_xiong \\\n",
+ " --output_dir output/atac_seq/1_files_with_sampleid \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC\n",
+ "\n",
+ "\n",
+ "# For MIT input data\n",
+ "sos run pipeline/snatacseq_preprocessing.ipynb sampleid_mapping \\\n",
+ " --cwd output/atac_seq/1_files_with_sampleid \\\n",
+ " --map_file data/atac_seq/rosmap_sample_mapping_data.csv \\\n",
+ " --input_dir data/atac_seq/1_files_with_sampleid_MIT \\\n",
+ " --output_dir output/atac_seq/1_files_with_sampleid \\\n",
+ " --celltype Astro Exc Inh Mic Oligo OPC \\\n",
+ " --suffix _50nuc"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5540a4da-843a-4789-8123-47911cf519c5",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Step 1: Pseudobulk QC\n",
+ "\n",
+ "Two approaches are available depending on whether biological covariates should be regressed out.\n",
+ "Both options support an **optional batch correction** step after filtering and normalization.\n",
+ "\n",
+ "\n",
+ "### Option A: With Biological Covariates (BIOvar)\n",
+ "\n",
+ "Use when residuals should be adjusted for all technical **and** biological covariates (sex, age, PMI).\n",
+ "\n",
+ "**Input:**\n",
+ "\n",
+ "| File | Location |\n",
+ "|------|----------|\n",
+ "| `pseudobulk_peaks_counts_{celltype}.csv.gz` | `1_files_with_sampleid/` |\n",
+ "| `metadata_{celltype}.csv` | `1_files_with_sampleid/` |\n",
+ "| `rosmap_cov.txt` | `data/` |\n",
+ "| `hg38-blacklist.v2.bed.gz` | `data/` |\n",
+ "| `SampleSheet.csv` *(batch correction only)* | `data/` |\n",
+ "| `sampleSheetAfterQc.csv` *(batch correction only)* | `data/` |\n",
+ "\n",
+ "**Process:**\n",
+ "\n",
+ "1. Load pseudobulk peak count matrix and metadata per cell type\n",
+ "2. Filter samples with fewer than 20 nuclei\n",
+ "3. Calculate technical QC metrics per sample:\n",
+ " - `log_n_nuclei`: log-transformed nuclei count\n",
+ " - `med_nucleosome_signal`: median nucleosome signal\n",
+ " - `med_tss_enrich`: median TSS enrichment score\n",
+ " - `log_med_n_tot_fragment`: log-transformed median total fragments\n",
+ " - `log_total_unique_peaks`: log-transformed unique peak count\n",
+ "4. Filter blacklisted genomic regions\n",
+ "5. Merge with demographic covariates (`msex`, `age_death`, `pmi`, `study`)\n",
+ "6. Apply expression filtering (`filterByExpr`):\n",
+ " - `min_count = 5`: minimum reads in at least one sample\n",
+ " - `min_total_count = 15`: minimum total reads across all samples\n",
+ " - `min_prop = 0.1`: peak expressed in ≥10% of samples\n",
+ "7. TMM normalization\n",
+ "8. *(Optional)* Batch correction — see [Batch Correction](#batch-correction-optional) below\n",
+ "9. Fit linear model (`voom` + `lmFit`):~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich\n",
+ "\n",
+ "log_med_n_tot_fragment + log_total_unique_peaks\n",
+ "sequencingBatch + msex + age_death + pmi + study\n",
+ "\n",
+ " > If batch correction was applied, `sequencingBatch` is removed from the model.\n",
+ "10. Compute residuals adjusted for all covariates\n",
+ "11. Compute final adjusted values: `offset + residuals`\n",
+ " - `offset`: predicted expression at median/reference covariate values\n",
+ " - `residuals`: unexplained variation after removing all covariate effects\n",
+ "\n",
+ "**Output:** `output/2_residuals/{celltype}/`\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n",
+ "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n",
+ "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n",
+ "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n",
+ "\n",
+ "**Covariates regressed out:**\n",
+ "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch\n",
+ "- Biological: sex (`msex`), age at death (`age_death`), post-mortem interval (`pmi`), study cohort\n",
+ "\n",
+ "**Timing:** <5 min per celltype"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "21f80085-6d2c-4e1c-af35-454382d94de1",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Pseudobulk QC with BIOvar"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8569d816-d292-4512-85b6-fcd3ea1c9ba7",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+ " --output_dir output/atac_seq/2_residuals \\\n",
+ " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+ " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+ " --include_bio TRUE \\\n",
+ " --batch_correction FALSE \\\n",
+ " --min_count 5 \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d8270ee1-1f9b-439c-969b-ac20af6fadee",
+ "metadata": {},
+ "source": [
+ "### Option B: Without Biological Covariates (noBIOvar)\n",
+ "\n",
+ "Use when biological variation should be preserved (e.g., age/sex comparisons, region-specific analyses).\n",
+ "\n",
+ "**Input:** Same as Option A.\n",
+ "\n",
+ "**Process:**\n",
+ "\n",
+ "Steps 1–8 are identical to Option A. Key differences at the modelling stage:\n",
+ "- `msex` and `age_death` are **excluded** from the model\n",
+ "- `med_peakwidth` (weighted median peak width per sample) is added as a technical covariate\n",
+ "\n",
+ "**Model formula:**\n",
+ "```\n",
+ "Model: ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + med_peakwidth + sequencingBatch + pmi + study\n",
+ "```\n",
+ "\n",
+ "**Output:** `output/2_residuals/{celltype}/`\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `{celltype}_residuals.txt` | Covariate-adjusted peak accessibility (log2-CPM) |\n",
+ "| `{celltype}_results.rds` | Full results: DGEList, fit, offset, residuals, design |\n",
+ "| `{celltype}_filtered_raw_counts.txt` | Filtered raw counts before normalization |\n",
+ "| `{celltype}_summary.txt` | Filtering statistics and QC summary |\n",
+ "\n",
+ "**Variables deliberately NOT regressed out:**\n",
+ "- Sex (`msex`)\n",
+ "- Age at death (`age_death`)\n",
+ "\n",
+ "**Timing:** <5 min per cell type"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "15cbd39c-60f8-4e21-9915-14b30ebc02cd",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Pseudobulk QC noBIOvar "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e741ac2e-91b3-49e7-906a-2fd6b8d8d137",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+ " --output_dir output/atac_seq/2_residuals \\\n",
+ " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+ " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+ " --include_bio FALSE \\\n",
+ " --batch_correction FALSE \\\n",
+ " --min_count 5 \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "25e96ad2-1b75-43d0-978e-0757bc11f135",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Batch Correction (Optional)\n",
+ "\n",
+ "Applies to both Option A and Option B. Runs between TMM normalization and model fitting.\n",
+ "Use when batch effects are severe (e.g., visible batch clusters in PCA, multiple sequencing runs).\n",
+ "\n",
+ "> When batch correction is applied, `sequencingBatch` is **removed** from the model formula\n",
+ "> since batch variance has already been removed from the counts.\n",
+ "\n",
+ "**Method comparison:**\n",
+ "\n",
+ "| | ComBat-seq | limma `removeBatchEffect` |\n",
+ "|---|---|---|\n",
+ "| **Operates on** | Raw integer counts | log-CPM values |\n",
+ "| **Mean-variance modelling** | Yes | No |\n",
+ "| **Best for** | Large, balanced batches | Small or fragmented batches |\n",
+ "| **Robustness** | May fail with many small batches | More robust to unbalanced designs |\n",
+ "\n",
+ "**ComBat-seq:**\n",
+ "```r\n",
+ "adjusted_counts <- ComBat_seq(counts = dge$counts, batch = batches)\n",
+ "```\n",
+ "\n",
+ "**limma `removeBatchEffect`:**\n",
+ "```r\n",
+ "logCPM <- cpm(dge, log = TRUE, prior.count = 1)\n",
+ "adj_logCPM <- removeBatchEffect(logCPM, batch = batches)\n",
+ "adjusted_counts <- round(pmax(2^adj_logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n",
+ "```\n",
+ "\n",
+ "**Additional filtering applied before correction:**\n",
+ "- Singleton batches (only 1 sample) are removed\n",
+ "- Samples absent from the batch sheet are dropped\n",
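+ "\n",
+ "The singleton-batch filter amounts to the following (minimal Python sketch; the function name and sample data are hypothetical):\n",
+ "\n",
+ "```python\n",
+ "from collections import Counter\n",
+ "\n",
+ "def drop_singleton_batches(sample_batches):\n",
+ "    # sample_batches maps sample id -> batch label\n",
+ "    sizes = Counter(sample_batches.values())\n",
+ "    return {s: b for s, b in sample_batches.items() if sizes[b] > 1}\n",
+ "\n",
+ "batches = {'s1': 'A', 's2': 'A', 's3': 'B'}   # batch B has a single sample\n",
+ "print(drop_singleton_batches(batches))        # {'s1': 'A', 's2': 'A'}\n",
+ "```\n",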
+ "\n",
+ "**Additional output when batch correction is enabled:**\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `{celltype}_results.rds` | Includes `batch_adjusted_counts` and `batch_method` fields |\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4d582c85-2265-46ee-8080-0ec5d8423a1d",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Pseudobulk QC with BIOvar and batch correction"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d3676870-496d-4379-8d6b-acec08f1c0d7",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+ " --output_dir output/atac_seq/2_residuals \\\n",
+ " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+ " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+ " --include_bio TRUE \\\n",
+ " --batch_correction TRUE \\\n",
+ " --batch_method limma \\\n",
+ "    --min_count 2 \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "9bad900d-768d-45ee-815a-6847e8eba32e",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Pseudobulk QC noBIOvar with batch correction"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "799c0faa-4dd9-431d-a5cf-3e92d7256a3b",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb pseudobulk_qc \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/1_files_with_sampleid_xiong \\\n",
+ " --output_dir output/atac_seq/2_residuals \\\n",
+ " --blacklist_file data/atac_seq/hg38-blacklist.v2.bed.gz \\\n",
+ " --covariates_file data/atac_seq/rosmap_cov.txt \\\n",
+ " --include_bio FALSE \\\n",
+ " --batch_correction TRUE \\\n",
+ " --batch_method limma \\\n",
+ "    --min_count 5 \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "096f2b32-e80d-472b-9af8-5f3d4ebb9bf2",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "**Note:** For MIT data, add these parameters:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "ee860bb3-d628-4255-b222-f62b3c03a91a",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "--celltype Astro Exc Inh Mic Oligo OPC \\\n",
+ "--suffix _50nuc \\\n",
+ "--input_dir output/1_files_with_sampleid_MIT"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6d7ea976-ca30-48e4-811b-eaa0b5f246ed",
+ "metadata": {},
+ "source": [
+ "For additional parameters:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "45b41a7f-1d08-4174-858b-a0593aaadcfb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "--min_count 5\n",
+ "--min_total_count 15\n",
+ "--min_prop 0.1\n",
+ "--min_nuclei 20"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c85d5b04-ec21-4c0c-8879-d78563d5ed96",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Step 2: Format Output\n",
+ "### Phenotype Reformatting\n",
+ "\n",
+ "Converts residuals into a QTL-ready BED format for genome-wide caQTL mapping.\n",
+ "\n",
+ "**Input:**\n",
+ "\n",
+ "| File | Location |\n",
+ "|------|----------|\n",
+ "| `{celltype}_residuals.txt` | `output/2_residuals/{celltype}/` |\n",
+ "\n",
+ "**Process:**\n",
+ "\n",
+ "1. Read residuals file with proper handling of peak IDs and sample columns\n",
+ "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n",
+ "3. Convert to midpoint coordinates (standard for QTLtools):\n",
+ "```\n",
+ " start = floor((peak_start + peak_end) / 2)\n",
+ " end = start + 1\n",
+ "```\n",
+ "4. Build BED format: `#chr`, `start`, `end`, `ID` followed by per-sample expression values\n",
+ "5. Sort by chromosome and position\n",
+ "6. Compress with `bgzip` and index with `tabix`\n",
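+ "\n",
+ "The peak-ID parsing and midpoint convention in steps 2–3 can be sketched as follows (hypothetical helper, shown in Python for clarity):\n",
+ "\n",
+ "```python\n",
+ "import math\n",
+ "\n",
+ "def peak_to_bed_row(peak_id):\n",
+ "    # Peak IDs look like 'chr1-10500-11000'\n",
+ "    chrom, start, end = peak_id.split('-')\n",
+ "    mid = math.floor((int(start) + int(end)) / 2)\n",
+ "    return (chrom, mid, mid + 1, peak_id)\n",
+ "\n",
+ "print(peak_to_bed_row('chr1-10500-11000'))   # ('chr1', 10750, 10751, 'chr1-10500-11000')\n",
+ "```\n",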
+ "\n",
+ "**Output:** `output/3_pheno_reformat/{celltype}_snatac_phenotype.bed.gz`\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `{celltype}_snatac_phenotype.bed.gz` | bgzip-compressed BED with peak midpoint coordinates |\n",
+ "| `{celltype}_snatac_phenotype.bed.gz.tbi` | tabix index for random-access queries |\n",
+ "\n",
+ "**Use case:** Standard caQTL mapping to identify genetic variants affecting chromatin\n",
+ "accessibility independent of demographic factors. Compatible with FastQTL, TensorQTL, and QTLtools.\n",
+ "\n",
+ "**Timing:** <1 min"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "adaf9d56-b53b-4a6a-b0af-b5a5fb98907b",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb phenotype_formatting \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/2_residuals \\\n",
+ " --output_dir output/atac_seq/3_pheno_reformat \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7c874b17-9a77-4e7d-a0a3-3605f7005148",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "### Region Peak Filtering\n",
+ "\n",
+ "Filters peak counts to specific genomic regions of interest for locus-specific analysis.\n",
+ "\n",
+ "**Input:**\n",
+ "\n",
+ "| File | Location |\n",
+ "|------|----------|\n",
+ "| `{celltype}_filtered_raw_counts.txt` | `output/2_residuals/{celltype}/` |\n",
+ "\n",
+ "**Process:**\n",
+ "\n",
+ "1. Read filtered raw counts per cell type\n",
+ "2. Parse peak coordinates from peak IDs (`chr-start-end` format)\n",
+ "3. Calculate per-peak metrics:\n",
+ " - `peakwidth`: `end - start`\n",
+ " - `midpoint`: `(start + end) / 2`\n",
+ "4. Filter peaks overlapping target regions (includes peaks that start, end, or span boundaries):\n",
+ "\n",
+ " | Region | Coordinates | Size |\n",
+ " |--------|-------------|------|\n",
+ " | Chr7 | 28,000,000 – 28,300,000 bp | 300 kb |\n",
+ " | Chr11 | 85,050,000 – 86,200,000 bp | 1.15 Mb |\n",
+ "\n",
+ "5. Calculate summary statistics per peak:\n",
+ " - `total_count`: sum of counts across all samples\n",
+ " - `weighted_count`: `total_count / peakwidth` (normalizes for peak size)\n",
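+ "\n",
+ "The overlap test in step 4 and the per-peak statistics in steps 3 and 5 reduce to a few lines (minimal Python sketch; the function names are hypothetical):\n",
+ "\n",
+ "```python\n",
+ "def overlaps(peak_start, peak_end, region_start, region_end):\n",
+ "    # True if the peak starts, ends, or spans within the region\n",
+ "    return peak_start < region_end and peak_end > region_start\n",
+ "\n",
+ "def peak_summary(start, end, counts):\n",
+ "    total = sum(counts)\n",
+ "    return {'peakwidth': end - start,\n",
+ "            'midpoint': (start + end) / 2,\n",
+ "            'total_count': total,\n",
+ "            'weighted_count': total / (end - start)}\n",
+ "\n",
+ "# A peak spanning the Chr7 region boundary is kept\n",
+ "assert overlaps(27_999_500, 28_000_200, 28_000_000, 28_300_000)\n",
+ "assert not overlaps(27_000_000, 27_500_000, 28_000_000, 28_300_000)\n",
+ "```\n",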
+ "\n",
+ "**Output:** `output/3_region_filter/{celltype}/`\n",
+ "\n",
+ "| File | Description |\n",
+ "|------|-------------|\n",
+ "| `{celltype}_filtered_regions.txt` | Full count matrix for peaks in target regions |\n",
+ "| `{celltype}_filtered_regions_summary.txt` | Peak metadata with coordinates and count statistics |\n",
+ "\n",
+ "**Use case:** Hypothesis-driven analysis of specific genomic loci (e.g., AD risk loci such as\n",
+ "the APOE or TREM2 regions) where biological variation is preserved for downstream interpretation.\n",
+ "\n",
+ "**Timing:** <1 min"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f944afdd-fffc-4b56-863f-eee89408cfa1",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/2_residuals \\\n",
+ " --output_dir output/atac_seq/3_region_filter \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "10440301-99c6-4f0e-b6ce-efe5ac9281fb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Custom regions\n",
+ "sos run pipeline/snatacseq_preprocessing.ipynb region_filtering \\\n",
+ " --cwd output/atac_seq \\\n",
+ " --input_dir output/atac_seq/2_residuals \\\n",
+ " --output_dir output/atac_seq \\\n",
+ " --celltype Ast Ex In Microglia Oligo OPC \\\n",
+ " --regions \"chr1:1000000-2000000,chr5:50000000-51000000\""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1a676801-6845-4ca5-944b-7978a5ecbb1f",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Command interface"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "486664a9-55c2-4738-91a0-b63ffdcd6cfa",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "sos run pipeline/snatacseq_preprocessing.ipynb -h"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0e17a301-cca9-49a1-843b-4248546f1f79",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## Setup and global parameters"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3c57fe47-f2ca-4a6e-8789-f7dbe3a9fad2",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "[global]\n",
+ "# Output directory\n",
+ "parameter: cwd = path(\"output\")\n",
+ "# For cluster jobs, number of commands to run per job\n",
+ "parameter: job_size = 1\n",
+ "# Wall clock time expected\n",
+ "parameter: walltime = \"5h\"\n",
+ "# Memory expected\n",
+ "parameter: mem = \"16G\"\n",
+ "# Number of threads\n",
+ "parameter: numThreads = 8\n",
+ "# Software container\n",
+ "parameter: container = \"\"\n",
+ "\n",
+ "import re\n",
+ "parameter: entrypoint = (\n",
+ " 'micromamba run -a \"\" -n' + ' ' +\n",
+ " re.sub(r'(_apptainer:latest|_docker:latest|\\.sif)$', '', container.split('/')[-1])\n",
+ ") if container else \"\"\n",
+ "\n",
+ "from sos.utils import expand_size\n",
+ "cwd = path(f'{cwd:a}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb6024cd-28be-4fb0-994e-0460e3a3beae",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## `sampleid_mapping`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e0b1b7c0-2819-45d1-b2ce-8d117a6cc9eb",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "[sampleid_mapping]\n",
+ "parameter: map_file = str\n",
+ "parameter: input_dir = str\n",
+ "parameter: output_dir = str\n",
+ "parameter: celltype = ['Ast', 'Ex', 'In', 'Microglia', 'Oligo', 'OPC']\n",
+ "parameter: suffix = '' # e.g. '' for Xiong, '_50nuc' for Kellis\n",
+ "\n",
+ "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n",
+ "output: [f'{output_dir}/metadata_{ct}{suffix}.csv' for ct in celltype]\n",
+ "\n",
+ "python: expand = \"${ }\"\n",
+ "\n",
+ " import pandas as pd\n",
+ " import gzip\n",
+ " import os\n",
+ " import subprocess\n",
+ " import csv\n",
+ " import numpy as np\n",
+ "\n",
+ " map_df = pd.read_csv(\"${map_file}\")\n",
+ " id_map = dict(zip(map_df[\"individualID\"], map_df[\"sampleid\"]))\n",
+ "\n",
+ " celltype = ${celltype}\n",
+ " input_dir = \"${input_dir}\"\n",
+ "    output_dir = \"${output_dir}\"\n",
+ " suffix = \"${suffix}\"\n",
+ "\n",
+ " os.makedirs(output_dir, exist_ok=True)\n",
+ "\n",
+ " def map_id(ind_id):\n",
+ " return id_map.get(ind_id, ind_id)\n",
+ " \n",
+ " def format_value(val):\n",
+ " \"\"\"Format numeric values: remove .0 from integers, keep decimals\"\"\"\n",
+ " if pd.isna(val):\n",
+ " return ''\n",
+ " if isinstance(val, (int, np.integer)):\n",
+ " return str(val)\n",
+ " if isinstance(val, (float, np.floating)):\n",
+ " if val == int(val): # Check if it's a whole number\n",
+ " return str(int(val))\n",
+ " else:\n",
+ " return str(val)\n",
+ " return str(val)\n",
+ "\n",
+ " # ── Process metadata CSV files ────────────────────────────────────────────\n",
+ " for ct in celltype:\n",
+ " fname = f\"metadata_{ct}{suffix}.csv\"\n",
+ " in_path = os.path.join(input_dir, fname)\n",
+ " out_path = os.path.join(output_dir, fname)\n",
+ "\n",
+ " if not os.path.exists(in_path):\n",
+ " print(f\"Warning: Metadata file not found: {in_path}\")\n",
+ " continue\n",
+ "\n",
+ " meta = pd.read_csv(in_path)\n",
+ "\n",
+ " if \"individualID\" not in meta.columns:\n",
+ " print(f\"Warning: individualID column not found in {fname}\")\n",
+ " continue\n",
+ "\n",
+ " # Create or update sampleid column\n",
+ " meta[\"sampleid\"] = meta[\"individualID\"].map(map_id)\n",
+ " \n",
+ " # Always reorder: sampleid FIRST, then individualID, then rest\n",
+ " cols = meta.columns.tolist()\n",
+ " cols.remove(\"sampleid\")\n",
+ " cols.remove(\"individualID\")\n",
+ " new_cols = [\"sampleid\", \"individualID\"] + cols\n",
+ " meta = meta[new_cols]\n",
+ "\n",
+ " # Write CSV with custom formatting\n",
+ " with open(out_path, 'w', newline='') as f:\n",
+ " writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)\n",
+ " # Write header\n",
+ " writer.writerow(meta.columns)\n",
+ " # Write data rows with custom formatting\n",
+ " for _, row in meta.iterrows():\n",
+ " writer.writerow([format_value(val) for val in row])\n",
+ " \n",
+ " print(f\"Processed metadata: {fname}\")\n",
+ "\n",
+ " # ── Process count matrix .csv.gz files ───────────────────────────────────\n",
+ " for ct in celltype:\n",
+ " # Try both naming patterns: with and without underscore\n",
+ " patterns = [\n",
+ " f\"pseudobulk_peaks_counts_{ct}{suffix}.csv.gz\", # Xiong pattern\n",
+ " f\"pseudobulk_peaks_counts{ct}{suffix}.csv.gz\" # Kellis pattern\n",
+ " ]\n",
+ " \n",
+ " in_path = None\n",
+ " for pattern in patterns:\n",
+ " test_path = os.path.join(input_dir, pattern)\n",
+ " if os.path.exists(test_path):\n",
+ " in_path = test_path\n",
+ " fname = pattern\n",
+ " break\n",
+ " \n",
+ " if in_path is None:\n",
+ " print(f\"Warning: Count file not found for celltype {ct}\")\n",
+ " continue\n",
+ " \n",
+ " out_path = os.path.join(output_dir, fname)\n",
+ "\n",
+ " with gzip.open(in_path, \"rt\") as fh:\n",
+ " header_line = fh.readline().rstrip(\"\\n\")\n",
+ "\n",
+ " col_names = header_line.split(\",\")\n",
+ " peak_id_col = col_names[0]\n",
+ " sample_cols = col_names[1:]\n",
+ " new_sample_cols = [map_id(s) for s in sample_cols]\n",
+ " new_header = \",\".join([peak_id_col] + new_sample_cols)\n",
+ "\n",
+ " import tempfile\n",
+ " temp_header = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')\n",
+ " temp_header.write(new_header + \"\\n\")\n",
+ " temp_header.close()\n",
+ " \n",
+ " cmd = f\"zcat {in_path} | tail -n +2 | cat {temp_header.name} - | gzip -6 > {out_path}\"\n",
+ " subprocess.run(cmd, shell=True, check=True)\n",
+ " \n",
+ " os.unlink(temp_header.name)\n",
+ " print(f\"Processed counts: {fname}\")\n",
+ "\n",
+ " print(\"\\nSample ID mapping completed!\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f0884ae7-a851-425a-86dd-b606768a012e",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## `pseudobulk_qc`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0c46328b-c3d8-46f8-8c71-bad27820438e",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "[pseudobulk_qc]\n",
+ "parameter: celltype = ['Ast','Ex','In','Microglia','Oligo','OPC']\n",
+ "parameter: input_dir = str\n",
+ "parameter: output_dir = str\n",
+ "parameter: covariates_file = str\n",
+ "parameter: blacklist_file = ''\n",
+ "parameter: include_bio = \"FALSE\" # \"TRUE\" or \"FALSE\"\n",
+ "parameter: batch_correction = \"FALSE\" # \"TRUE\" or \"FALSE\"\n",
+ "parameter: batch_method = \"limma\" # \"limma\" or \"combat\"\n",
+ "parameter: min_count = 5\n",
+ "parameter: min_total_count = 15\n",
+ "parameter: min_prop = 0.1\n",
+ "parameter: min_nuclei = 20\n",
+ "parameter: suffix = ''\n",
+ "\n",
+ "input: [f'{input_dir}/metadata_{ct}{suffix}.csv' for ct in celltype], \\\n",
+ " [f'{input_dir}/pseudobulk_peaks_counts_{ct}{suffix}.csv.gz' for ct in celltype]\n",
+ "output: [f'{output_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n",
+ "\n",
+ "task: trunk_workers = 1, trunk_size = 1, walltime = '6:00:00', mem = '64G', cores = 4\n",
+ "\n",
+ "R: expand = \"${ }\", stdout = f'{_output[0]:n}.stdout', stderr = f'{_output[0]:n}.stderr'\n",
+ "\n",
+ " library(edgeR)\n",
+ " library(limma)\n",
+ " library(data.table)\n",
+ " library(GenomicRanges)\n",
+ " if (as.logical(\"${batch_correction}\") && \"${batch_method}\" == \"combat\") library(sva)\n",
+ "\n",
+ " # ── Helper: standardize metadata column names ─────────────────────────────\n",
+ " rename_if_found <- function(dt, target, candidates) {\n",
+ " found <- intersect(candidates, colnames(dt))[1]\n",
+ " if (!is.na(found) && found != target) setnames(dt, found, target)\n",
+ " }\n",
+ "\n",
+ " standardize_meta <- function(meta) {\n",
+ " rename_if_found(meta, \"n_nuclei\", c(\"n.nuclei\",\"nNuclei\",\"nuclei_count\"))\n",
+ " rename_if_found(meta, \"med_nucleosome_signal\", c(\"med.nucleosome_signal.ct\",\"NucleosomeRatio\",\"med_nucleosome_signal.ct\"))\n",
+ " rename_if_found(meta, \"med_tss_enrich\", c(\"med.tss.enrich.ct\",\"TSSEnrichment\",\"med_tss_enrich.ct\"))\n",
+ " rename_if_found(meta, \"med_n_tot_fragment\", c(\"med.n_tot_fragment.ct\",\"med_n_tot_fragment.ct\"))\n",
+ " return(meta)\n",
+ " }\n",
+ "\n",
+ " # ── Helper: blacklist filtering ───────────────────────────────────────────\n",
+ " filter_blacklist <- function(mat, bed) {\n",
+ " peaks <- data.table(id = rownames(mat))\n",
+ " peaks[, c(\"chr\",\"start\",\"end\") := tstrsplit(gsub(\"_\",\"-\",id), \"-\")]\n",
+ " peaks[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n",
+ " bl <- fread(bed)[, 1:3]\n",
+ " setnames(bl, c(\"chr\",\"start\",\"end\"))\n",
+ " bl[, `:=`(start = as.numeric(start), end = as.numeric(end))]\n",
+ " gr1 <- GRanges(peaks$chr, IRanges(peaks$start, peaks$end))\n",
+ " gr2 <- GRanges(bl$chr, IRanges(bl$start, bl$end))\n",
+ " blacklisted <- unique(queryHits(findOverlaps(gr1, gr2)))\n",
+ " if (length(blacklisted) > 0) {\n",
+ " message(\"Blacklisted peaks removed: \", length(blacklisted))\n",
+ " return(mat[-blacklisted, , drop=FALSE])\n",
+ " }\n",
+ " return(mat)\n",
+ " }\n",
+ "\n",
+ " # ── Helper: predictOffset ─────────────────────────────────────────────────\n",
+ " predictOffset <- function(fit) {\n",
+ " D <- fit$design\n",
+ " Dm <- D\n",
+ " for (col in colnames(D)) {\n",
+ " if (col == \"(Intercept)\") next\n",
+ " if (is.numeric(D[, col]) && !all(D[, col] %in% c(0, 1)))\n",
+ " Dm[, col] <- median(D[, col], na.rm=TRUE)\n",
+ " else\n",
+ " Dm[, col] <- 0\n",
+ " }\n",
+ " B <- fit$coefficients\n",
+ " B[is.na(B)] <- 0\n",
+ " B %*% t(Dm)\n",
+ " }\n",
+ "\n",
+ " # ── Main loop ─────────────────────────────────────────────────────────────\n",
+ " cts <- c(${', '.join([f\"'{x}'\" for x in celltype])})\n",
+ "\n",
+ " for (ct in cts) {\n",
+ " message(\"\\n\", paste(rep(\"=\", 40), collapse=\"\"))\n",
+ " message(\"Processing: \", ct)\n",
+ " message(\"Mode: \", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"))\n",
+ " message(\"Batch correction: \", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"))\n",
+ " message(paste(rep(\"=\", 40), collapse=\"\"))\n",
+ "\n",
+ "    outdir <- file.path(\"${output_dir}\", ct)\n",
+ " dir.create(outdir, recursive=TRUE, showWarnings=FALSE)\n",
+ "\n",
+ " # ── 1. Load data ───────────────────────────────────────────────────\n",
+ " meta <- fread(sprintf(\"${input_dir}/metadata_%s${suffix}.csv\", ct))\n",
+ " counts_raw <- fread(sprintf(\"${input_dir}/pseudobulk_peaks_counts_%s${suffix}.csv.gz\", ct))\n",
+ "\n",
+ " counts <- as.matrix(counts_raw[, -1, with=FALSE])\n",
+ " rownames(counts) <- counts_raw[[1]]\n",
+ " rm(counts_raw)\n",
+ " n_original <- nrow(counts)\n",
+ " message(\"Loaded: \", n_original, \" peaks x \", ncol(counts), \" samples\")\n",
+ "\n",
+ " # ── 2. Standardize metadata columns ───────────────────────────────\n",
+ " meta <- standardize_meta(meta)\n",
+ "\n",
+ " # ── 3. Identify sample ID column ──────────────────────────────────\n",
+ " idcol <- intersect(c(\"sampleid\",\"sampleID\",\"individualID\",\"projid\"), colnames(meta))[1]\n",
+ " if (is.na(idcol)) stop(\"Cannot find sample ID column in metadata.\")\n",
+ "\n",
+ " # ── 4. Nuclei filter ──────────────────────────────────────────────\n",
+ " if (\"n_nuclei\" %in% colnames(meta)) {\n",
+ " meta <- meta[meta$n_nuclei > ${min_nuclei}]\n",
+ " message(\"Samples after nuclei (>${min_nuclei}) filter: \", nrow(meta))\n",
+ " }\n",
+ " n_after_nuclei <- nrow(meta)\n",
+ "\n",
+ " # ── 5. Align samples ───────────────────────────────────────────────\n",
+ " common <- intersect(meta[[idcol]], colnames(counts))\n",
+ " if (length(common) == 0) stop(\"Zero sample overlap between metadata and count matrix.\")\n",
+ " meta <- meta[match(common, meta[[idcol]])]\n",
+ " counts <- counts[, common, drop=FALSE]\n",
+ " message(\"Samples after alignment: \", length(common))\n",
+ "\n",
+ " # ── 6. Blacklist filtering ─────────────────────────────────────────\n",
+ " if (\"${blacklist_file}\" != \"\" && file.exists(\"${blacklist_file}\")) {\n",
+ " counts <- filter_blacklist(counts, \"${blacklist_file}\")\n",
+ " message(\"Peaks after blacklist filter: \", nrow(counts))\n",
+ " } else {\n",
+ " message(\"No blacklist file provided - skipping blacklist filtering.\")\n",
+ " }\n",
+ " n_after_blacklist <- nrow(counts)\n",
+ "\n",
+ " # ── 7. Load and merge covariates ───────────────────────────────────\n",
+ " covs <- fread(\"${covariates_file}\")\n",
+ " id2 <- intersect(c(\"#id\",\"id\",\"projid\",\"individualID\"), colnames(covs))[1]\n",
+ " bio_cols <- if (as.logical(\"${include_bio}\")) c(\"msex\",\"age_death\",\"pmi\",\"study\") else c(\"pmi\",\"study\")\n",
+ " keep_cols <- c(id2, intersect(bio_cols, colnames(covs)))\n",
+ " covs <- covs[, ..keep_cols]\n",
+ " meta <- merge(meta, covs, by.x=idcol, by.y=id2, all.x=TRUE)\n",
+ "\n",
+ " # ── CRITICAL: re-order meta back to common sample order ────────────\n",
+ " meta <- meta[match(common, meta[[idcol]])]\n",
+ "\n",
+ " # ── 8. Impute missing covariate values ─────────────────────────────\n",
+ " for (col in intersect(c(\"pmi\",\"age_death\"), colnames(meta))) {\n",
+ " if (any(is.na(meta[[col]]))) {\n",
+ " message(\"Imputing missing values for: \", col)\n",
+ " meta[[col]][is.na(meta[[col]])] <- median(meta[[col]], na.rm=TRUE)\n",
+ " }\n",
+ " }\n",
+ "\n",
+ " # ── 9. Compute technical metrics ──────────────────────────────────\n",
+ " meta$log_n_nuclei <- log1p(meta$n_nuclei)\n",
+ " meta$log_med_n_tot_fragment <- log1p(meta$med_n_tot_fragment)\n",
+ " meta$log_total_unique_peaks <- log1p(colSums(counts > 0))\n",
+ "\n",
+ " # ── 10. Select model variables ────────────────────────────────────\n",
+ " tech_vars <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n",
+ " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\",\"pmi\",\"study\")\n",
+ " bio_vars <- c(\"msex\",\"age_death\")\n",
+ " all_vars <- if (as.logical(\"${include_bio}\")) c(tech_vars, bio_vars) else tech_vars\n",
+ " all_vars <- intersect(all_vars, colnames(meta))\n",
+ " message(\"Model terms: \", paste(all_vars, collapse=\", \"))\n",
+ "\n",
+ " # ── 11. Drop samples with NA in model variables ────────────────────\n",
+ " keep_rows <- complete.cases(meta[, ..all_vars])\n",
+ " meta <- meta[keep_rows]\n",
+ " counts <- counts[, meta[[idcol]], drop=FALSE]\n",
+ " message(\"Valid samples for modelling: \", nrow(meta))\n",
+ "\n",
+ " # ── 12. Expression filtering ───────────────────────────────────────\n",
+ " dge <- DGEList(counts=counts, samples=meta)\n",
+ " dge$samples$group <- factor(rep(\"all\", ncol(dge)))\n",
+ " message(\"Peaks before expression filter: \", nrow(dge))\n",
+ "\n",
+ " keep <- filterByExpr(dge, group=dge$samples$group,\n",
+ " min.count=${min_count},\n",
+ " min.total.count=${min_total_count},\n",
+ " min.prop=${min_prop})\n",
+ " dge <- dge[keep,, keep.lib.sizes=FALSE]\n",
+ " n_after_expr <- nrow(dge)\n",
+ " message(\"Peaks after expression filter: \", n_after_expr)\n",
+ "\n",
+ " # Save filtered raw counts\n",
+ " write.table(dge$counts,\n",
+ " file.path(outdir, paste0(ct, \"_filtered_raw_counts.txt\")),\n",
+ " sep=\"\\t\", quote=FALSE, col.names=NA)\n",
+ "\n",
+ " # ── 13. TMM normalization ──────────────────────────────────────────\n",
+ " dge <- calcNormFactors(dge, method=\"TMM\")\n",
+ "\n",
+ " # ── 14. Optional batch correction ─────────────────────────────────\n",
+ " if (as.logical(\"${batch_correction}\") && \"sequencingBatch\" %in% colnames(dge$samples)) {\n",
+ " batches <- dge$samples$sequencingBatch\n",
+ " batch_counts <- table(batches)\n",
+ " valid_batches <- names(batch_counts[batch_counts > 1])\n",
+ " keep_bc <- batches %in% valid_batches\n",
+ " dge <- dge[, keep_bc, keep.lib.sizes=FALSE]\n",
+ " batches <- batches[keep_bc]\n",
+ " message(\"Samples after singleton batch removal: \", ncol(dge))\n",
+ "\n",
+ " if (\"${batch_method}\" == \"combat\") {\n",
+ " dge$counts <- ComBat_seq(as.matrix(dge$counts), batch=batches)\n",
+ " message(\"ComBat-seq batch correction applied.\")\n",
+ " } else {\n",
+ "      logCPM <- cpm(dge, log=TRUE, prior.count=1)\n",
+ "      logCPM <- removeBatchEffect(logCPM, batch=factor(batches))\n",
+ "      # Back-transform adjusted log-CPM to the count scale (mean library size as reference)\n",
+ "      dge$counts <- round(pmax(2^logCPM * mean(dge$samples$lib.size) / 1e6, 0))\n",
+ " message(\"limma removeBatchEffect applied.\")\n",
+ " }\n",
+ " }\n",
+ "\n",
+ " # ── 15. Add sequencingBatch and Library to model if multi-level ───\n",
+ " # Insert after technical vars but before pmi/study to match original order\n",
+ " tech_only <- c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n",
+ " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\")\n",
+ " other_vars <- setdiff(all_vars, tech_only) # pmi, study, msex, age_death\n",
+ "\n",
+ " batch_vars <- c()\n",
+ " if (\"sequencingBatch\" %in% colnames(dge$samples) &&\n",
+ " length(unique(dge$samples$sequencingBatch)) > 1) {\n",
+ " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n",
+ " batch_vars <- c(batch_vars, \"sequencingBatch_factor\")\n",
+ " }\n",
+ "\n",
+ " if (\"Library\" %in% colnames(dge$samples) &&\n",
+ " length(unique(dge$samples$Library)) > 1) {\n",
+ " dge$samples$Library_factor <- factor(dge$samples$Library)\n",
+ " batch_vars <- c(batch_vars, \"Library_factor\")\n",
+ " }\n",
+ "\n",
+ " # Final order: technical + batch + other (pmi, study, bio)\n",
+ " all_vars <- c(tech_only, batch_vars, other_vars)\n",
+ " all_vars <- intersect(all_vars, c(colnames(dge$samples), colnames(meta)))\n",
+ "\n",
+ " # ── 16. Build design matrix ────────────────────────────────────────\n",
+ " form <- as.formula(paste(\"~\", paste(all_vars, collapse=\" + \")))\n",
+ " design <- model.matrix(form, data=dge$samples)\n",
+ " message(\"Formula: \", deparse(form))\n",
+ "\n",
+ " if (!is.fullrank(design)) {\n",
+ " message(\"Design not full rank - trimming.\")\n",
+ " qr_d <- qr(design)\n",
+ " design <- design[, qr_d$pivot[seq_len(qr_d$rank)], drop=FALSE]\n",
+ " }\n",
+ " message(\"Design matrix: \", nrow(design), \" x \", ncol(design))\n",
+ "\n",
+ " # ── 17. Voom + lmFit + eBayes ─────────────────────────────────────\n",
+ " v <- voom(dge, design, plot=FALSE)\n",
+ " fit <- lmFit(v, design)\n",
+ " fit <- eBayes(fit)\n",
+ "\n",
+ " # ── 18. Offset + residuals ─────────────────────────────────────────\n",
+ " off <- predictOffset(fit)\n",
+ " res <- residuals(fit, v)\n",
+ " final <- off + res\n",
+ "\n",
+ " # ── 19. Save outputs ───────────────────────────────────────────────\n",
+ " write.table(final,\n",
+ " file.path(outdir, paste0(ct, \"_residuals.txt\")),\n",
+ " sep=\"\\t\", quote=FALSE, col.names=NA)\n",
+ "\n",
+ " saveRDS(list(\n",
+ " dge = dge,\n",
+ " offset = off,\n",
+ " residuals = res,\n",
+ " final_data = final,\n",
+ " valid_samples = colnames(dge),\n",
+ " design = design,\n",
+ " fit = fit,\n",
+ " model = form,\n",
+ " mode = ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"),\n",
+ " batch_correction = as.logical(\"${batch_correction}\"),\n",
+ " batch_method = ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\")\n",
+ " ), file.path(outdir, paste0(ct, \"_results.rds\")))\n",
+ "\n",
+ " # ── 20. Summary report ─────────────────────────────────────────────\n",
+ " sink(file.path(outdir, paste0(ct, \"_summary.txt\")))\n",
+ " cat(\"*** Processing Summary for\", ct, \"***\\n\\n\")\n",
+ "\n",
+ " cat(\"=== Analysis Mode ===\\n\")\n",
+ " cat(\"Mode:\", ifelse(as.logical(\"${include_bio}\"), \"BIOvar\", \"noBIOvar\"), \"\\n\")\n",
+ " cat(\"Batch correction:\", ifelse(as.logical(\"${batch_correction}\"), \"${batch_method}\", \"none\"), \"\\n\")\n",
+ " cat(\"Model formula:\", deparse(form), \"\\n\\n\")\n",
+ "\n",
+ " cat(\"=== Filtering Parameters ===\\n\")\n",
+ " cat(\"Nuclei cutoff: >\", ${min_nuclei}, \"\\n\")\n",
+ " cat(\"Blacklist filtering:\", ifelse(\"${blacklist_file}\" != \"\", \"TRUE\", \"FALSE\"), \"\\n\")\n",
+ " if (\"${blacklist_file}\" != \"\") cat(\"Blacklist file:\", \"${blacklist_file}\", \"\\n\")\n",
+ " cat(\"min_count:\", ${min_count}, \"\\n\")\n",
+ " cat(\"min_total_count:\", ${min_total_count}, \"\\n\")\n",
+ " cat(\"min_prop:\", ${min_prop}, \"\\n\\n\")\n",
+ "\n",
+ " cat(\"=== Peak Counts ===\\n\")\n",
+ " cat(\"Original peak count:\", n_original, \"\\n\")\n",
+ " cat(\"Peaks after blacklist filtering:\", n_after_blacklist, \"\\n\")\n",
+ " cat(\"Peaks after expression filtering:\", n_after_expr, \"\\n\\n\")\n",
+ "\n",
+ " cat(\"=== Sample Counts ===\\n\")\n",
+ " cat(\"Number of samples after nuclei (>\", ${min_nuclei}, \") filtering:\", n_after_nuclei, \"\\n\")\n",
+ " cat(\"Number of samples in final model:\", ncol(final), \"\\n\\n\")\n",
+ "\n",
+ " cat(\"=== Technical Variables Used ===\\n\")\n",
+ " for (v in intersect(c(\"log_n_nuclei\",\"med_nucleosome_signal\",\"med_tss_enrich\",\n",
+ " \"log_med_n_tot_fragment\",\"log_total_unique_peaks\"), all_vars))\n",
+ " cat(\"-\", v, \"\\n\")\n",
+ " if (\"sequencingBatch_factor\" %in% all_vars) cat(\"- sequencingBatch: Sequencing batch ID\\n\")\n",
+ " if (\"Library_factor\" %in% all_vars) cat(\"- Library: Library ID\\n\")\n",
+ "\n",
+ " if (as.logical(\"${include_bio}\")) {\n",
+ " cat(\"\\n=== Biological Variables Used ===\\n\")\n",
+ " for (v in intersect(c(\"msex\",\"age_death\"), all_vars))\n",
+ " cat(\"-\", v, \"\\n\")\n",
+ " } else {\n",
+ " cat(\"\\n=== Biological Variables Used ===\\n\")\n",
+ " cat(\"None (noBIOvar mode - biological variation preserved)\\n\")\n",
+ " }\n",
+ "\n",
+ " cat(\"\\n=== Other Variables Used ===\\n\")\n",
+ " if (\"pmi\" %in% all_vars) cat(\"- pmi: Post-mortem interval\\n\")\n",
+ " if (\"study\" %in% all_vars) cat(\"- study: Study cohort\\n\")\n",
+ " sink()\n",
+ "\n",
+ " # ── 21. Variable explanation report ───────────────────────────────\n",
+ " sink(file.path(outdir, paste0(ct, \"_variable_explanation.txt\")))\n",
+ " cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n",
+ " cat(\"## Why Log Transformation?\\n\")\n",
+ " cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n",
+ " cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n",
+ " cat(\"2. To stabilize variance across the range of values\\n\")\n",
+ " cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n",
+ " cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n",
+ " cat(\"## Variables and Their Meanings\\n\\n\")\n",
+ " cat(\"### Technical Variables\\n\")\n",
+ " cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n",
+ " cat(\" * Filtered to include only samples with >\", ${min_nuclei}, \"nuclei\\n\")\n",
+ " cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n",
+ " cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n",
+ " cat(\" * Represents sequencing depth\\n\")\n",
+ " cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n",
+ " cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n",
+ " cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n",
+ " cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
+ " cat(\" * Measures the degree of nucleosome positioning\\n\")\n",
+ " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n",
+ " cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n",
+ " cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n",
+ " cat(\" * Not log-transformed as it is already a ratio/normalized metric\\n\\n\")\n",
+ " if (\"sequencingBatch_factor\" %in% all_vars)\n",
+ " cat(\"- sequencingBatch: Sequencing batch ID\\n * Treated as a factor to account for batch effects\\n\\n\")\n",
+ " if (\"Library_factor\" %in% all_vars)\n",
+ " cat(\"- Library: Library preparation batch ID\\n * Treated as a factor to account for library preparation effects\\n\\n\")\n",
+ " if (as.logical(\"${include_bio}\")) {\n",
+ " cat(\"### Biological Variables\\n\")\n",
+ " cat(\"- msex: Sex (male=1, female=0)\\n\")\n",
+ " cat(\"- age_death: Age at death\\n\\n\")\n",
+ " }\n",
+ " cat(\"### Other Variables\\n\")\n",
+ " cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n",
+ " cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n",
+ " cat(\"## Relationship to voom Transformation\\n\")\n",
+ " cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n",
+ " cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n",
+ " cat(\"covariates, we ensure they are on a similar scale to the transformed expression data, \")\n",
+ " cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n",
+ " sink()\n",
+ "\n",
+ " message(\"Completed: \", ct, \" -> \", outdir)\n",
+ " message(\" Peaks: \", nrow(final), \" | Samples: \", ncol(final))\n",
+ " }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "db56ffb1-6c07-47ac-9a1a-abbd37f253c9",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## `phenotype_reformatting`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5ef18e3e-fe77-486f-89e6-e724b7126b73",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "[phenotype_formatting]\n",
+ "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n",
+ "parameter: input_dir = str\n",
+ "parameter: output_dir = str\n",
+ "\n",
+ "input: [f'{input_dir}/{ct}/{ct}_residuals.txt' for ct in celltype]\n",
+ "output: [f'{output_dir}/{ct}_snatac_phenotype.bed.gz' for ct in celltype]\n",
+ "\n",
+ "task: trunk_workers = 1, trunk_size = 1, walltime = '2:00:00', mem = '16G', cores = 2\n",
+ "\n",
+ "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
+ "\n",
+ " import os\n",
+ " import subprocess\n",
+ " import pandas as pd\n",
+ "\n",
+ " celltypes = ${celltype}\n",
+ " input_dir = \"${input_dir}\"\n",
+ " output_dir = \"${output_dir}\"\n",
+ "\n",
+ " def read_residuals(path):\n",
+ " first_line = open(path).readline().rstrip(\"\\n\")\n",
+ " col_names = first_line.split(\"\\t\")\n",
+ " df = pd.read_csv(path, sep=\"\\t\", header=None, skiprows=1)\n",
+ " if df.shape[1] > len(col_names):\n",
+ " peak_ids = df.iloc[:, 0].values\n",
+ " df = df.iloc[:, 1:]\n",
+ " df.columns = col_names\n",
+ " else:\n",
+ " peak_ids = df.iloc[:, 0].values\n",
+ " df = df.iloc[:, 1:]\n",
+ " df.columns = col_names[1:]\n",
+ " return peak_ids, df\n",
+ "\n",
+ " def to_midpoint_bed(peak_ids, residuals):\n",
+ " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n",
+ " chrs = parts[0].values\n",
+ " starts = parts[1].astype(int).values\n",
+ " ends = parts[2].astype(int).values\n",
+ " mids = ((starts + ends) // 2).astype(int)\n",
+ " bed = pd.DataFrame({\n",
+ " \"#chr\": chrs,\n",
+ " \"start\": mids,\n",
+ " \"end\": mids + 1,\n",
+ " \"ID\": peak_ids\n",
+ " })\n",
+ " bed = pd.concat([bed, residuals.reset_index(drop=True)], axis=1)\n",
+ " return bed.sort_values([\"#chr\", \"start\"]).reset_index(drop=True)\n",
+ "\n",
+ " def run_cmd(cmd, label):\n",
+ " r = subprocess.run(cmd, capture_output=True)\n",
+ " if r.returncode != 0:\n",
+ " print(f\"WARNING: {label} failed: {r.stderr.decode()}\")\n",
+ " else:\n",
+ " print(f\"{label}: OK\")\n",
+ "\n",
+ " for ct in celltypes:\n",
+ " print(f\"\\n{'='*40}\\nPhenotype Formatting: {ct}\\n{'='*40}\")\n",
+ "\n",
+ " out_dir = os.path.join(output_dir, \"3_pheno_reformat\")\n",
+ " os.makedirs(out_dir, exist_ok=True)\n",
+ "\n",
+ " res_path = os.path.join(input_dir, ct, f\"{ct}_residuals.txt\")\n",
+ " if not os.path.exists(res_path):\n",
+ " print(f\"WARNING: {res_path} not found, skipping.\")\n",
+ " continue\n",
+ "\n",
+ " peak_ids, residuals = read_residuals(res_path)\n",
+ " print(f\"Loaded {len(peak_ids)} peaks x {residuals.shape[1]} samples\")\n",
+ "\n",
+ " bed = to_midpoint_bed(peak_ids, residuals)\n",
+ " out_bed = os.path.join(out_dir, f\"{ct}_snatac_phenotype.bed\")\n",
+ " bed.to_csv(out_bed, sep=\"\\t\", index=False, float_format=\"%.15f\")\n",
+ " print(f\"Written: {out_bed}\")\n",
+ "\n",
+ " run_cmd([\"bgzip\", \"-f\", out_bed], \"bgzip\")\n",
+ " run_cmd([\"tabix\", \"-p\", \"bed\", f\"{out_bed}.gz\"], \"tabix\")\n",
+ "\n",
+ " print(f\"Completed: {ct} -> {out_dir}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "038bc2ab-c412-40ef-a9b6-f5dddf5292ee",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "source": [
+ "## `region_filtering`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dd13567e-2d4c-48f6-83d7-62ab252421bf",
+ "metadata": {
+ "kernel": "SoS"
+ },
+ "outputs": [],
+ "source": [
+ "[region_filtering]\n",
+ "parameter: celltype = ['Ast','Ex','In','Mic','Oligo','OPC']\n",
+ "parameter: input_dir = str\n",
+ "parameter: output_dir = str\n",
+ "parameter: regions = \"chr7:28000000-28300000,chr11:85050000-86200000\"\n",
+ "\n",
+ "input: [f'{input_dir}/{ct}/{ct}_filtered_raw_counts.txt' for ct in celltype]\n",
+ "output: [f'{output_dir}/{ct}_filtered_regions_of_interest.txt' for ct in celltype]\n",
+ "\n",
+ "task: trunk_workers = 1, trunk_size = 1, walltime = '1:00:00', mem = '16G', cores = 2\n",
+ "\n",
+ "python: expand = \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout'\n",
+ "\n",
+ " import os\n",
+ " import pandas as pd\n",
+ "\n",
+ " celltypes = ${celltype}\n",
+ " input_dir = \"${input_dir}\"\n",
+ " output_dir = \"${output_dir}\"\n",
+ "\n",
+ " def parse_regions(region_str):\n",
+ " result = []\n",
+ " for r in region_str.split(\",\"):\n",
+ " chrom, coords = r.strip().split(\":\")\n",
+ " start, end = coords.split(\"-\")\n",
+ " result.append({\"chr\": chrom, \"start\": int(start), \"end\": int(end)})\n",
+ " return result\n",
+ "\n",
+ " regions = parse_regions(\"${regions}\")\n",
+ "\n",
+ " def parse_peak_ids(peak_ids):\n",
+ " parts = pd.Series(peak_ids).str.split(\"-\", expand=True)\n",
+ " return pd.DataFrame({\n",
+ " \"chr\": parts[0].values,\n",
+ " \"start\": parts[1].astype(int).values,\n",
+ " \"end\": parts[2].astype(int).values\n",
+ " })\n",
+ "\n",
+ " def overlaps_region(chr_col, start_col, end_col, reg):\n",
+ " return (\n",
+ " (chr_col == reg[\"chr\"]) &\n",
+ " (start_col < reg[\"end\"]) &\n",
+ " (end_col > reg[\"start\"])\n",
+ " )\n",
+ "\n",
+ " for ct in celltypes:\n",
+ " print(f\"\\n{'='*40}\\nRegion Filtering: {ct}\\n{'='*40}\")\n",
+ "\n",
+ " reg_dir = os.path.join(output_dir, \"3_region_filter\")\n",
+ " os.makedirs(reg_dir, exist_ok=True)\n",
+ "\n",
+ " counts_path = os.path.join(input_dir, ct, f\"{ct}_filtered_raw_counts.txt\")\n",
+ " if not os.path.exists(counts_path):\n",
+ " print(f\"WARNING: {counts_path} not found, skipping.\")\n",
+ " continue\n",
+ "\n",
+ " df = pd.read_csv(counts_path, sep=\"\\t\", index_col=0)\n",
+ " df.index.name = \"peak_id\"\n",
+ " df = df.reset_index()\n",
+ "\n",
+ " coords = parse_peak_ids(df[\"peak_id\"].values)\n",
+ " df[\"chr\"] = coords[\"chr\"].values\n",
+ " df[\"start\"] = coords[\"start\"].values\n",
+ " df[\"end\"] = coords[\"end\"].values\n",
+ " df[\"peakwidth\"] = df[\"end\"] - df[\"start\"]\n",
+ " df[\"midpoint\"] = ((df[\"start\"] + df[\"end\"]) / 2).astype(int)\n",
+ "\n",
+ " # Filter to regions of interest\n",
+ " mask = pd.Series(False, index=df.index)\n",
+ " for reg in regions:\n",
+ " mask |= overlaps_region(df[\"chr\"], df[\"start\"], df[\"end\"], reg)\n",
+ "\n",
+ " region_df = df[mask].copy()\n",
+ " print(f\"Peaks in regions of interest: {len(region_df)}\")\n",
+ "\n",
+ " # Save full filtered data\n",
+ " full_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest.txt\")\n",
+ " region_df.to_csv(full_out, sep=\"\\t\", index=False)\n",
+ " print(f\"Saved: {full_out}\")\n",
+ "\n",
+ " # Save summary\n",
+ " meta_cols = [\"peak_id\",\"chr\",\"start\",\"end\",\"peakwidth\",\"midpoint\"]\n",
+ " count_cols = [c for c in region_df.columns if c not in meta_cols]\n",
+ " count_mat = region_df[count_cols].apply(pd.to_numeric, errors=\"coerce\")\n",
+ "\n",
+ " summary = region_df[meta_cols].copy()\n",
+ " summary[\"total_count\"] = count_mat.sum(axis=1).values\n",
+ " summary[\"weighted_count\"] = (summary[\"total_count\"] / summary[\"peakwidth\"]).values\n",
+ "\n",
+ " summary_out = os.path.join(reg_dir, f\"{ct}_filtered_regions_of_interest_summary.txt\")\n",
+ " summary.to_csv(summary_out, sep=\"\\t\", index=False)\n",
+ " print(f\"Saved: {summary_out}\")\n",
+ "\n",
+ " print(f\"Completed: {ct} -> {reg_dir}\")"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "R",
+ "language": "R",
+ "name": "ir"
+ },
+ "language_info": {
+ "codemirror_mode": "r",
+ "file_extension": ".r",
+ "mimetype": "text/x-r-source",
+ "name": "R",
+ "pygments_lexer": "r",
+ "version": "4.4.3"
+ },
+ "sos": {
+ "kernels": [
+ [
+ "SoS",
+ "sos",
+ "sos",
+ "",
+ ""
+ ]
+ ],
+ "version": ""
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb b/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb
deleted file mode 100644
index 78e93a16..00000000
--- a/code/molecular_phenotypes/QC/xiong_atacseq_preprocessing.ipynb
+++ /dev/null
@@ -1,1828 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "id": "55783e3b-582f-4eb9-8d1a-f3647fef7c73",
- "metadata": {},
- "source": [
- "# Xiong Lab Single-nuclei ATAC-seq Preprocessing Pipeline\n",
- "---\n",
- "## Overview\n",
- "\n",
- "This pipeline preprocesses single-nucleus ATAC-seq (snATAC-seq) data from the Kellis lab (Xiong et al.) for downstream chromatin accessibility QTL (caQTL) analysis. It processes pseudobulk peak count data across six major brain cell types.\n",
- "\n",
- "**Pipeline Purpose:**\n",
- "- Transform raw pseudobulk peak counts into analysis-ready formats\n",
- "- Remove technical confounders while preserving biological variation\n",
- "- Generate QTL-ready phenotype files for genome-wide caQTL mapping\n",
- "\n",
- "**Supported Cell Types:**\n",
- "- **Mic** - Microglia\n",
- "- **Astro** - Astrocytes\n",
- "- **Oligo** - Oligodendrocytes\n",
- "- **Ex** - Excitatory neurons\n",
- "- **In** - Inhibitory neurons\n",
- "- **OPC** - Oligodendrocyte precursor cells\n",
- "\n",
- "---\n",
- "\n",
- "## Workflow Structure\n",
- "\n",
- "This pipeline consists of **three sequential steps**:\n",
- "\n",
- "#### Step 0: Sample ID Mapping\n",
- "\n",
- "**Input:**\n",
- "- Sample mapping file: `rosmap_sample_mapping_data.csv`\n",
- "- Original metadata files: `metadata_{celltype}.csv`\n",
- "- Original count files: `pseudobulk_peaks_counts_{celltype}.csv.gz`\n",
- "\n",
- "**Process:**\n",
- "1. Loads sample ID mapping between individualID and sampleid\n",
- "2. Processes metadata files:\n",
- " - Adds `sampleid` column after `individualID`\n",
- " - Maps individualID to sampleid where mapping exists\n",
- " - Keeps original individualID for unmapped samples\n",
- "3. Processes count matrix files:\n",
- " - Renames column headers from individualID to sampleid\n",
- " - Maintains count data integrity\n",
- "\n",
- "#### Step 1: Pseudobulk QC & Calculate Residuals with biological variation\n",
- "\n",
- "**Input:**\n",
- "- Mapped metadata: `metadata_{celltype}.csv` (from Step 0)\n",
- "- Mapped peak counts: `pseudobulk_peaks_counts_{celltype}.csv.gz` (from Step 0)\n",
- "- Sample covariates: `rosmap_cov.txt`\n",
- "- hg38 blacklist: `hg38-blacklist.v2.bed.gz`\n",
- "\n",
- "**Process:**\n",
- "1. Loads pseudobulk peak count matrix and metadata\n",
- "2. **Filters samples with n_nuclei > 20**\n",
- "3. Calculates technical QC metrics per sample:\n",
- " - `log_n_nuclei`: Log-transformed number of nuclei\n",
- " - `med_nucleosome_signal`: Median nucleosome signal\n",
- " - `med_tss_enrich`: Median TSS enrichment score\n",
- " - `log_med_n_tot_fragment`: Log-transformed median total fragments (sequencing depth)\n",
- " - `log_total_unique_peaks`: Log-transformed count of unique peaks detected\n",
- "4. Filters blacklisted genomic regions using `foverlaps()`\n",
- "5. Merges with covariates (pmi, study) - **excludes msex and age_death**\n",
- "6. Applies expression filtering with `filterByExpr()`:\n",
- " - `min.count = 5`: Minimum 5 reads in at least one sample\n",
- " - `min.total.count = 15`: Minimum 15 total reads across all samples\n",
- " - `min.prop = 0.1`: Peak must be expressed in ≥10% of samples\n",
- "7. TMM normalization with `calcNormFactors()`\n",
- "8. Saves **filtered raw counts** (used for region-specific analysis if needed)\n",
- "9. Handles sequencingBatch and Library as covariates\n",
- "10. Fits linear model using `voom()` and `lmFit()`:\n",
- " ```r\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + \n",
- " log_med_n_tot_fragment + log_total_unique_peaks + \n",
- " sequencingBatch_factor + Library_factor + pmi + study\n",
- " ```\n",
- "11. Calculates residuals using `predictOffset()`: `offset + residuals`\n",
- " - **Preserves biological variation** (sex, age)\n",
- " - Removes technical variation and study effects\n",
- "\n",
- "**Key Variables Regressed Out:**\n",
- "- Technical: sequencing depth, nuclei count, nucleosome signal, TSS enrichment, batch, library\n",
- "- Study effects: pmi, study cohort\n",
- "\n",
- "**Key Variables Preserved:**\n",
- "- Sex (msex)\n",
- "- Age at death (age_death)\n",
- "\n",
- "\n",
- "#### Step 2: Phenotype Reformatting\n",
- "\n",
- "**Input:**\n",
- "- `{celltype}_residuals.txt` from Step 1 (in `2_residuals/{celltype}/`)\n",
- "\n",
- "**Process:**\n",
- "1. Reads residuals file with proper handling of peak IDs and sample columns\n",
- "2. Parses peak coordinates from peak IDs (format: `chr-start-end`)\n",
- "3. Converts peaks to **midpoint coordinates**:\n",
- "\n",
- "Use for:\n",
- "Genome-wide caQTL mapping with FastQTL, TensorQTL, or MatrixEQTL\n",
- "Analysis that accounts for or investigates sex/age effects\n",
- "\n",
- "---\n",
- "\n",
- "### Pipeline Outputs\n",
- "\n",
- "**From Step 0:**\n",
- "`metadata_{celltype}.csv`: Metadata with mapped sampleid\n",
- "`pseudobulk_peaks_counts_{celltype}.csv.gz`: Counts with mapped sampleid headers\n",
- "\n",
- "**From Step 1:**\n",
- "`{celltype}_residuals.txt`: Covariate-adjusted residuals (log2-CPM scale)\n",
- "`{celltype}_filtered_raw_counts.txt`: TMM-normalized counts\n",
- "`{celltype}_results.rds`: Complete analysis results\n",
- "`{celltype}_summary.txt`: QC and filtering statistics\n",
- "`{celltype}_variable_explanation.txt`: Variable documentation\n",
- "\n",
- "**From Step 2:**\n",
- "`{celltype}_kellis_xiong_snatac_phenotype.bed.gz`: Genome-wide QTL-ready BED file\n",
- "\n",
- "---\n",
- "\n",
- "**Input files** needed to run this pipeline can be downloaded [here](https://drive.google.com/drive/folders/1l1RJx5toqg_WOlWW3gy-ynkrodi8oqXv?usp=drive_link)."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c58392fc-da3a-4032-9cc3-6f58fdf6c99b",
- "metadata": {},
- "source": [
- "#### Before you start, let's set your working paths."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "id": "3509701b-e03e-49a7-944f-14539b6a46a3",
- "metadata": {},
- "outputs": [],
- "source": [
- "input_dir <- \" \" # insert your input dir\n",
- "output_dir <- \" \" #insert your output dir"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6c6d5f6a-259b-4fa0-ba06-2a83fa19577e",
- "metadata": {},
- "source": [
- "## Step 0: Check sample ID\n",
- "\n",
- "**Purpose:** Maps original sample identifiers (individualID) to standardized sample IDs (sampleid) across metadata and count matrix files.\n",
- "\n",
- "---\n",
- "\n",
- "#### Input:\n",
- "\n",
- "**Sample Mapping Reference:**\n",
- "- `rosmap_sample_mapping_data.csv`: Contains mapping between individualID and sampleid\n",
- "\n",
- "**Metadata Files (per cell type):**\n",
- "- `metadata_Ast.csv`\n",
- "- `metadata_Ex.csv`\n",
- "- `metadata_In.csv`\n",
- "- `metadata_Microglia.csv`\n",
- "- `metadata_Oligo.csv`\n",
- "- `metadata_OPC.csv`\n",
- "\n",
- "**Count Matrix Files (per cell type):**\n",
- "- `pseudobulk_peaks_counts_Ast.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Ex.csv.gz`\n",
- "- `pseudobulk_peaks_counts_In.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Microglia.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Oligo.csv.gz`\n",
- "- `pseudobulk_peaks_counts_OPC.csv.gz`\n",
- "\n",
- "\n",
- "#### Process:\n",
- "\n",
- "**Part 1: Process Metadata Files**\n",
- "\n",
- "1. Loads sample mapping dictionary from `rosmap_sample_mapping_data.csv`\n",
- "2. Creates a keyed data.table for fast lookups: `individualID → sampleid`\n",
- "3. For each metadata file:\n",
- " - Reads the CSV file\n",
- " - Finds the position of the `individualID` column\n",
- " - Creates a new `sampleid` column\n",
- " - For each sample:\n",
- " - If mapping exists: uses the mapped sampleid\n",
- " - If no mapping: uses the original individualID (preserves unmapped samples)\n",
- " - Inserts `sampleid` column immediately after `individualID` column\n",
- " - Saves updated metadata file\n",
- "\n",
- "**Part 2: Process Count Matrix Files**\n",
- "\n",
- "1. For each count matrix file (gzipped):\n",
- " - Extracts header line (first row with column names)\n",
- " - First column is `peak_id` (kept as-is)\n",
- " - Remaining columns are sample IDs (individualID format)\n",
- " - Maps sample IDs to sampleid where mapping exists\n",
- " - Creates new header with mapped IDs\n",
- " - Replaces original header with new header\n",
- " - Recompresses with gzip\n",
- "\n",
- "#### Output:\n",
- "Output Directory: `output/1_files_with_sampleid/`\n",
- "\n",
- "Metadata Files (with sampleid):\n",
- "- `metadata_Ast.csv`\n",
- "- `metadata_Ex.csv`\n",
- "- `metadata_In.csv`\n",
- "- `metadata_Microglia.csv`\n",
- "- `metadata_Oligo.csv`\n",
- "- `metadata_OPC.csv`\n",
- "\n",
- "Count Matrix Files (with sampleid headers):\n",
- "- `pseudobulk_peaks_counts_Ast.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Ex.csv.gz`\n",
- "- `pseudobulk_peaks_counts_In.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Microglia.csv.gz`\n",
- "- `pseudobulk_peaks_counts_Oligo.csv.gz`\n",
- "- `pseudobulk_peaks_counts_OPC.csv.gz`\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b258d410-4973-4c25-b53e-9a2c3399ce28",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "id": "23eb41f4-134f-48dc-b8ce-b347fda8af48",
- "metadata": {},
- "outputs": [],
- "source": [
- "library(data.table)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "70a2a878-1fd0-4be2-ab1a-bcee12e9ebc1",
- "metadata": {},
- "source": [
- "#### Load input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "id": "f415cf24-b424-405b-9032-d225f0ed0310",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Read mapping file, rows: 1200 \n"
- ]
- }
- ],
- "source": [
- "# 3. Read mapping data\n",
- "map_file <- file.path(input_dir, \"data/rosmap_sample_mapping_data.csv\")\n",
- "map <- fread(map_file)\n",
- "cat(\"Read mapping file, rows:\", nrow(map), \"\\n\")\n",
- "\n",
- "# 4. Create mapping dictionary\n",
- "id_map <- map[, .(individualID, sampleid)]\n",
- "setkey(id_map, individualID)\n",
- "\n",
- "# Define cell types and paths\n",
- "celltype <- c(\"Ast\", \"Ex\", \"In\", \"Microglia\", \"Oligo\", \"OPC\")\n",
- "\n",
- "# Your specific metadata file paths\n",
- "metadata_files <- file.path(input_dir, paste0(\"1_files_with_sampleid/metadata_\", celltype, \".csv\"))\n",
- "\n",
- "\n",
- "for (ct in celltype) {\n",
- " specific_dir <- file.path(output_dir, \"1_files_with_sampleid\")\n",
- " if (!dir.exists(specific_dir)) {\n",
- " dir.create(specific_dir, recursive = TRUE)\n",
- " cat(\"Created directory:\", specific_dir, \"\\n\")\n",
- " }\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "599dbf62-9db1-4e0b-a689-71eb2f27c98d",
- "metadata": {},
- "source": [
- "### Process metadata files"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "id": "28e8a4ec-45f7-4678-8e43-d7d9246850bf",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Processing metadata file: metadata_Ast.csv \n",
- "Original rows: 93 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ast.csv \n",
- "Converted rows: 93 columns: 10 \n",
- "Mapped IDs: 84 Unmapped IDs: 9 \n",
- "\n",
- "Processing metadata file: metadata_Ex.csv \n",
- "Original rows: 92 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Ex.csv \n",
- "Converted rows: 92 columns: 10 \n",
- "Mapped IDs: 83 Unmapped IDs: 9 \n",
- "\n",
- "Processing metadata file: metadata_In.csv \n",
- "Original rows: 93 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_In.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_In.csv \n",
- "Converted rows: 93 columns: 10 \n",
- "Mapped IDs: 84 Unmapped IDs: 9 \n",
- "\n",
- "Processing metadata file: metadata_Microglia.csv \n",
- "Original rows: 93 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Microglia.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Microglia.csv \n",
- "Converted rows: 93 columns: 10 \n",
- "Mapped IDs: 84 Unmapped IDs: 9 \n",
- "\n",
- "Processing metadata file: metadata_Oligo.csv \n",
- "Original rows: 93 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Oligo.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_Oligo.csv \n",
- "Converted rows: 93 columns: 10 \n",
- "Mapped IDs: 84 Unmapped IDs: 9 \n",
- "\n",
- "Processing metadata file: metadata_OPC.csv \n",
- "Original rows: 93 columns: 10 \n",
- "Output file will be saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_OPC.csv \n",
- "Saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/metadata_OPC.csv \n",
- "Converted rows: 93 columns: 10 \n",
- "Mapped IDs: 84 Unmapped IDs: 9 \n",
- "\n",
- "Metadata file processing summary:\n",
- " file mapped_ids unmapped_ids total_ids\n",
- " \n",
- "1: metadata_Ast.csv 84 9 93\n",
- "2: metadata_Ex.csv 83 9 92\n",
- "3: metadata_In.csv 84 9 93\n",
- "4: metadata_Microglia.csv 84 9 93\n",
- "5: metadata_Oligo.csv 84 9 93\n",
- "6: metadata_OPC.csv 84 9 93\n"
- ]
- }
- ],
- "source": [
- "# Function to process metadata files - adds sampleid and uses individualID for unmapped cases\n",
- "process_metadata <- function(file_path, celltype_name) {\n",
- " cat(\"\\nProcessing metadata file:\", basename(file_path), \"\\n\")\n",
- " \n",
- " # Read data\n",
- " meta <- fread(file_path)\n",
- " cat(\"Original rows:\", nrow(meta), \"columns:\", ncol(meta), \"\\n\")\n",
- " \n",
- " # Find the position of individualID column\n",
- " id_col_index <- which(colnames(meta) == \"individualID\")\n",
- " if (length(id_col_index) == 0) {\n",
- " cat(\"Warning: individualID column not found\\n\")\n",
- " return(NULL)\n",
- " }\n",
- " \n",
- " # Find the mapped sampleids for each individualID\n",
- " meta$sampleid <- character(nrow(meta)) # Initialize with empty strings\n",
- " \n",
- " for (i in 1:nrow(meta)) {\n",
- " ind_id <- meta$individualID[i]\n",
- " mapped_id <- id_map[ind_id, sampleid]\n",
- " \n",
- " # If mapping found, use it; otherwise use the original individualID\n",
- " if (length(mapped_id) > 0 && !is.na(mapped_id)) {\n",
- " meta$sampleid[i] <- mapped_id\n",
- " } else {\n",
- " # Use the original individualID instead of NA\n",
- " meta$sampleid[i] <- ind_id\n",
- " }\n",
- " }\n",
- " \n",
- " # Move sampleid column to the front\n",
- " setcolorder(meta, c(\"sampleid\", setdiff(names(meta), \"sampleid\")))\n",
- " \n",
- " # Save results\n",
- " output_file <- file.path(output_dir, \"1_files_with_sampleid\",basename(file_path))\n",
- " cat(\"Output file will be saved to:\", output_file, \"\\n\")\n",
- " fwrite(meta, output_file)\n",
- " \n",
- " # Count mapped and unmapped IDs\n",
- " mapped_count <- sum(meta$sampleid != meta$individualID)\n",
- " unmapped_count <- sum(meta$sampleid == meta$individualID)\n",
- " \n",
- " cat(\"Saved to:\", output_file, \"\\n\")\n",
- " cat(\"Converted rows:\", nrow(meta), \"columns:\", ncol(meta), \"\\n\")\n",
- " cat(\"Mapped IDs:\", mapped_count, \"Unmapped IDs:\", unmapped_count, \"\\n\")\n",
- " \n",
- " # Return processing summary\n",
- " list(\n",
- " file = basename(file_path),\n",
- " mapped_ids = mapped_count,\n",
- " unmapped_ids = unmapped_count,\n",
- " total_ids = nrow(meta)\n",
- " )\n",
- "}\n",
- "\n",
- "# Process all metadata files\n",
- "meta_results <- mapply(process_metadata, metadata_files, celltype, SIMPLIFY = FALSE)\n",
- "meta_summary <- do.call(rbind, lapply(meta_results, as.data.table))\n",
- "\n",
- "cat(\"\\nMetadata file processing summary:\\n\")\n",
- "print(meta_summary)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4752c617-693c-4fd3-b9b4-a21c9326bec8",
- "metadata": {},
- "source": [
- "### Process count matrix files"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "id": "b8c39e1c-5913-411d-bb14-749371fe5368",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_Ast.csv.gz \n",
- "Original columns: 93 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a5b135eff - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 22M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 22M Feb 12 15:32 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ast.csv.gz \n",
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_Ex.csv.gz \n",
- "Original columns: 92 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a1b4f71a - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 24M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 24M Feb 12 15:32 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Ex.csv.gz \n",
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_In.csv.gz \n",
- "Original columns: 93 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a24fc9c54 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 24M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 24M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_In.csv.gz \n",
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_Microglia.csv.gz \n",
- "Original columns: 93 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a5e37a1a8 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 16M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 16M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Microglia.csv.gz \n",
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_Oligo.csv.gz \n",
- "Original columns: 93 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a522197c8 - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 28M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 28M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_Oligo.csv.gz \n",
- "\n",
- "Processing count matrix file: pseudobulk_peaks_counts_OPC.csv.gz \n",
- "Original columns: 93 \n",
- "Executing command: zcat /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz | tail -n +2 | cat /scratch/3114076.1.casaq/RtmpQcG6rV/file1bc75a68ad457e - | gzip > /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n",
- "Input file size: -rw-r--r-- 1 jaempawi xqtl 17M Jan 29 12:08 /restricted/projectnb/xqtl/jaempawi/atac_seq/atac_seq_data_xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n",
- "Output file size: -rw-r--r-- 1 jaempawi xqtl 17M Feb 12 15:33 /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n",
- "File processing completed and saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/1_files_with_sampleid/pseudobulk_peaks_counts_OPC.csv.gz \n",
- "\n",
- "Count matrix file processing summary:\n",
- " file total_columns mapped_columns\n",
- " \n",
- "1: pseudobulk_peaks_counts_Ast.csv.gz 93 0\n",
- "2: pseudobulk_peaks_counts_Ex.csv.gz 92 0\n",
- "3: pseudobulk_peaks_counts_In.csv.gz 93 0\n",
- "4: pseudobulk_peaks_counts_Microglia.csv.gz 93 0\n",
- "5: pseudobulk_peaks_counts_Oligo.csv.gz 93 0\n",
- "6: pseudobulk_peaks_counts_OPC.csv.gz 93 0\n",
- " unmapped_columns\n",
- " \n",
- "1: 92\n",
- "2: 91\n",
- "3: 92\n",
- "4: 92\n",
- "5: 92\n",
- "6: 92\n",
- "\n",
- "All files processed!\n"
- ]
- }
- ],
- "source": [
- "# Count matrix file paths for each cell type\n",
- "count_files <- file.path(input_dir, paste0(\"1_files_with_sampleid/pseudobulk_peaks_counts_\", celltype, \".csv.gz\"))\n",
- "\n",
- "\n",
- "# Direct column renaming for count matrix files\n",
- "process_counts_simple <- function(file_path) {\n",
- " cat(\"\\nProcessing count matrix file:\", basename(file_path), \"\\n\")\n",
- " \n",
- " # Get header line only\n",
- " header_command <- paste0(\"zcat \", file_path, \" | head -n 1\")\n",
- " header_line <- system(header_command, intern = TRUE)\n",
- " \n",
- " # Parse column names\n",
- " col_names <- unlist(strsplit(header_line, \",\"))\n",
- " cat(\"Original columns:\", length(col_names), \"\\n\")\n",
- " \n",
- " # First column is peak_id, remaining columns are sample IDs\n",
- " peak_id_col <- col_names[1]\n",
- " sample_cols <- col_names[-1]\n",
- " \n",
- " # Map sample IDs\n",
- " new_sample_cols <- character(length(sample_cols))\n",
- " mapped_count <- 0\n",
- " \n",
- " for (i in seq_along(sample_cols)) {\n",
- " ind_id <- sample_cols[i]\n",
- " mapped_id <- id_map[ind_id, sampleid]\n",
- " \n",
- " if (length(mapped_id) > 0 && !is.na(mapped_id)) {\n",
- " new_sample_cols[i] <- mapped_id\n",
- " mapped_count <- mapped_count + 1\n",
- " } else {\n",
- " # Keep original individualID if no mapping found\n",
- " new_sample_cols[i] <- ind_id\n",
- " }\n",
- " }\n",
- " \n",
- " # Create new header\n",
- " new_col_names <- c(peak_id_col, new_sample_cols)\n",
- " \n",
- " # Create temporary header file\n",
- " temp_header <- tempfile()\n",
- " writeLines(paste(new_col_names, collapse = \",\"), temp_header)\n",
- " \n",
- " # Output file path\n",
- " output_file <- file.path(output_dir, \"1_files_with_sampleid\", basename(file_path))\n",
- " \n",
- " # Use system command to process the file without chunking\n",
- " # This extracts the data (excluding header), prepends new header, and compresses\n",
- " cmd <- paste0(\n",
- " \"zcat \", file_path, \" | tail -n +2 | cat \", temp_header, \" - | gzip > \", output_file\n",
- " )\n",
- " \n",
- " cat(\"Executing command:\", cmd, \"\\n\")\n",
- " system_result <- system(cmd)\n",
- " \n",
- " # Check if command succeeded\n",
- " if (system_result != 0) {\n",
- " cat(\"ERROR: Command failed with exit code\", system_result, \"\\n\")\n",
- " cat(\"Attempting backup method...\\n\")\n",
- " \n",
- " # Backup method using R's built-in file handling\n",
- " tryCatch({\n",
- " # Create a named vector for mapping\n",
- " id_mapping <- setNames(new_sample_cols, sample_cols)\n",
- " \n",
- " # Open connections\n",
- " in_conn <- gzfile(file_path, \"r\")\n",
- " out_conn <- gzfile(output_file, \"w\")\n",
- " \n",
- " # Read and discard the header line\n",
- " readLines(in_conn, n = 1)\n",
- " \n",
- " # Write the new header\n",
- " writeLines(paste(new_col_names, collapse = \",\"), out_conn)\n",
- " \n",
- " # Copy the rest of the file line by line\n",
- " while (length(line <- readLines(in_conn, n = 1)) > 0) {\n",
- " writeLines(line, out_conn)\n",
- " }\n",
- " \n",
- " # Close connections\n",
- " close(in_conn)\n",
- " close(out_conn)\n",
- " \n",
- " cat(\"Backup method successful\\n\")\n",
- " }, error = function(e) {\n",
- " cat(\"Backup method also failed:\", e$message, \"\\n\")\n",
- " })\n",
- " } else {\n",
- " # Check file sizes to verify completion\n",
- " input_size <- system(paste(\"ls -lh\", file_path), intern = TRUE)\n",
- " output_size <- system(paste(\"ls -lh\", output_file), intern = TRUE)\n",
- " cat(\"Input file size: \", input_size, \"\\n\")\n",
- " cat(\"Output file size:\", output_size, \"\\n\")\n",
- " }\n",
- " \n",
- " # Delete temporary file\n",
- " file.remove(temp_header)\n",
- " \n",
- " cat(\"File processing completed and saved to:\", output_file, \"\\n\")\n",
- " \n",
- " # Return processing summary\n",
- " list(\n",
- " file = basename(file_path),\n",
- " total_columns = length(col_names),\n",
- " mapped_columns = mapped_count,\n",
- " unmapped_columns = length(sample_cols) - mapped_count\n",
- " )\n",
- "}\n",
- "\n",
- "# Process all count files\n",
- "count_results <- lapply(count_files, process_counts_simple)\n",
- "\n",
- "# Summarize results\n",
- "count_summary <- do.call(rbind, lapply(count_results, as.data.table))\n",
- "cat(\"\\nCount matrix file processing summary:\\n\")\n",
- "print(count_summary)\n",
- "\n",
- "cat(\"\\nAll files processed!\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9d97c736-2394-46c3-a76c-2d5bb82a1098",
- "metadata": {},
- "source": [
- "## Step 1: Pseudobulk QC noBIOvar\n",
- "**Purpose:** Performs quality control on pseudobulk ATAC-seq data: filters out low-quality samples and peaks, normalizes counts, and computes covariate-adjusted residuals while preserving biological variation (sex, age).\n",
- "\n",
- "---\n",
- "\n",
- "#### Input:\n",
- "\n",
- "**From Step 0 (required):**\n",
- "- `metadata_{celltype}.csv` (in `output/1_files_with_sampleid/`)\n",
- "- `pseudobulk_peaks_counts_{celltype}.csv.gz` (in `output/1_files_with_sampleid/`)\n",
- "\n",
- "**Reference Files:**\n",
- "- `rosmap_cov.txt`: Sample covariates (pmi, study)\n",
- "- `hg38-blacklist.v2.bed.gz`: ENCODE blacklist regions\n",
- "\n",
- "**Cell Types:**\n",
- "- `Mic` (Microglia)\n",
- "- `Astro` (Astrocytes)\n",
- "- `Oligo` (Oligodendrocytes)\n",
- "- `Ex` (Excitatory neurons)\n",
- "- `In` (Inhibitory neurons)\n",
- "- `OPC` (Oligodendrocyte precursor cells)\n",
- "\n",
- "#### Process:\n",
- "\n",
- "1. Load Data\n",
- "2. Sample Quality Filtering\n",
- "3. Calculate Technical QC Metrics\n",
- "4. Process Peak Coordinates\n",
- "5. Filter Blacklisted Regions\n",
- "6. Merge Covariates\n",
- "7. Create DGE Object\n",
- "8. Expression Filtering\n",
- "9. Save Filtered Raw Counts\n",
- "10. TMM Normalization\n",
- "11. Handle Batch and Library Variables\n",
- "12. Build Linear Model\n",
- "13. Voom Transformation & Model Fitting\n",
- "14. Calculate Offsets and Residuals\n",
- "\n",
- "#### Output:\n",
- "Output Directory: `output/2_residuals/{celltype}/`\n",
- "\n",
- "1. Residuals File: `{celltype}_residuals.txt`\n",
- "2. Results Object: `{celltype}_results.rds`\n",
- "3. Summary Report: `{celltype}_summary.txt`\n",
- "4. Variable Explanation: `{celltype}_variable_explanation.txt`\n",
- "5. Filtered Raw Counts: `{celltype}_filtered_raw_counts.txt`"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f4ef8b2d-64b4-4d49-9845-93a6ee4b8895",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "id": "e85cd96e-2357-41c8-90ab-bd61e14cf22e",
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\n",
- "Attaching package: ‘dplyr’\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:data.table’:\n",
- "\n",
- " between, first, last\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:stats’:\n",
- "\n",
- " filter, lag\n",
- "\n",
- "\n",
- "The following objects are masked from ‘package:base’:\n",
- "\n",
- " intersect, setdiff, setequal, union\n",
- "\n",
- "\n",
- "Loading required package: limma\n",
- "\n"
- ]
- }
- ],
- "source": [
- "library(data.table)\n",
- "library(stringr)\n",
- "library(dplyr)\n",
- "library(edgeR)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "id": "6d194542-2660-46cd-84ab-362cf147a4d9",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing celltype: Oligo \n"
- ]
- }
- ],
- "source": [
- "# Set cell type and create output directory\n",
- "#args <- commandArgs(trailingOnly = TRUE)\n",
- "\n",
- "celltype <- \"Oligo\"\n",
- "cat(\"Processing celltype:\", celltype, \"\\n\")\n",
- "\n",
- "# Create individual directories for each cell type\n",
- "for (ct in celltype) {\n",
- " specific_dir <- file.path(output_dir, \"2_residuals\", ct)\n",
- " if (!dir.exists(specific_dir)) {\n",
- " dir.create(specific_dir, recursive = TRUE)\n",
- " cat(\"Created directory:\", specific_dir, \"\\n\")\n",
- " }\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "65bcf867-e949-41a3-8afe-b66f71217ca7",
- "metadata": {},
- "source": [
- "#### Create predictOffset function"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "id": "2a63f3d9-c699-4e97-85f1-456f684e8b2e",
- "metadata": {},
- "outputs": [],
- "source": [
- "predictOffset <- function(fit) {\n",
- " # Define which variables are factors and which are continuous\n",
- " usedFactors <- c(\"sequencingBatch\", \"Library\", \"study\") \n",
- " usedContinuous <- c(\"log_n_nuclei\", \"med_nucleosome_signal\", \"med_tss_enrich\", \"log_med_n_tot_fragment\",\n",
- " \"log_total_unique_peaks\", \"pmi\")\n",
- " \n",
- " # Filter to only use variables actually in the design matrix\n",
- " usedFactors <- usedFactors[sapply(usedFactors, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " usedContinuous <- usedContinuous[sapply(usedContinuous, function(f) any(grepl(paste0(\"^\", f), colnames(fit$design))))]\n",
- " \n",
- " # Get indices for factor and continuous variables\n",
- " facInd <- unlist(lapply(as.list(usedFactors), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " contInd <- unlist(lapply(as.list(usedContinuous), \n",
- " function(f) {return(grep(paste0(\"^\", f), \n",
- " colnames(fit$design)))}))\n",
- " \n",
- " # Add the intercept\n",
- " all_indices <- c(1, facInd, contInd)\n",
- " \n",
- " # Verify design matrix structure (using sorted indices to avoid duplication warning)\n",
- " all_indices_sorted <- sort(unique(all_indices))\n",
- " stopifnot(all(all_indices_sorted %in% 1:ncol(fit$design)))\n",
- " \n",
- " # Create new design matrix with median values\n",
- " D <- fit$design\n",
- " D[, facInd] <- 0 # Set all factor levels to reference level\n",
- " \n",
- " # For continuous variables, set to median value\n",
- " if (length(contInd) > 0) {\n",
- " medContVals <- apply(D[, contInd, drop=FALSE], 2, median)\n",
- " for (i in seq_along(medContVals)) {\n",
- " D[, names(medContVals)[i]] <- medContVals[i]\n",
- " }\n",
- " }\n",
- " \n",
- " # Calculate offsets\n",
- " stopifnot(all(colnames(coefficients(fit)) == colnames(D)))\n",
- " offsets <- apply(coefficients(fit), 1, function(c) {\n",
- " return(D %*% c)\n",
- " })\n",
- " offsets <- t(offsets)\n",
- " colnames(offsets) <- rownames(fit$design)\n",
- " \n",
- " return(offsets)\n",
- "}\n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "35928e75-ca48-49ae-be1d-d1429c3171c3",
- "metadata": {},
- "source": [
- "#### Load data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "id": "4f3ffb5c-a52d-4f7d-ba1f-14241612be1d",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Loaded metadata with 93 samples\n",
- "Filtered to 92 samples with > 20 nuclei\n",
- "Loaded peak data with 363775 peaks\n",
- "Valid samples after nuclei filtering: 92 \n",
- "Valid samples present in peak data: 90 \n",
- "Original peak data dimensions: 363775 × 92 \n",
- "Filtered peak data dimensions: 363775 × 90 \n",
- "Final metadata samples after filtering: 90 \n"
- ]
- }
- ],
- "source": [
- "celltype <- \"Oligo\"\n",
- "meta_path <- paste0(output_dir, \"/1_files_with_sampleid/metadata_\", celltype, \".csv\")\n",
- "peak_path <- paste0(output_dir, \"/1_files_with_sampleid/pseudobulk_peaks_counts_\", celltype, \".csv.gz\")\n",
- "\n",
- "# Blacklist and covariate files live under the source input_dir\n",
- "blacklist_file <- file.path(input_dir, \"data/hg38-blacklist.v2.bed.gz\")\n",
- "covariates_file <- file.path(input_dir, \"data/rosmap_cov.txt\")\n",
- "\n",
- "# Load metadata\n",
- "meta <- fread(meta_path)\n",
- "cat(\"Loaded metadata with\", nrow(meta), \"samples\\n\")\n",
- "\n",
- "# Filter samples with n_nuclei > 20\n",
- "meta_filtered <- meta[n.nuclei > 20]\n",
- "cat(\"Filtered to\", nrow(meta_filtered), \"samples with > 20 nuclei\\n\")\n",
- "\n",
- "# Load peak data\n",
- "peak_data <- fread(peak_path)\n",
- "cat(\"Loaded peak data with\", nrow(peak_data), \"peaks\\n\")\n",
- "\n",
- "# Extract peak_id and set as rownames\n",
- "peak_id <- peak_data$peak_id\n",
- "peak_data <- peak_data[, -1, with = FALSE] # Remove peak_id column\n",
- "\n",
- "# Filter peak data to keep only samples with >20 nuclei\n",
- "valid_samples <- meta_filtered$sampleid\n",
- "cat(\"Valid samples after nuclei filtering:\", length(valid_samples), \"\\n\")\n",
- "\n",
- "# Find which valid samples actually exist in the peak data\n",
- "available_samples <- intersect(valid_samples, colnames(peak_data))\n",
- "cat(\"Valid samples present in peak data:\", length(available_samples), \"\\n\")\n",
- "\n",
- "# Create filtered peak matrix\n",
- "peak_data_filtered <- peak_data[, ..available_samples]\n",
- "cat(\"Original peak data dimensions:\", nrow(peak_data), \"×\", ncol(peak_data), \"\\n\")\n",
- "cat(\"Filtered peak data dimensions:\", nrow(peak_data_filtered), \"×\", ncol(peak_data_filtered), \"\\n\")\n",
- "\n",
- "# Convert to matrix for downstream analysis\n",
- "peak_matrix <- as.matrix(peak_data_filtered)\n",
- "rownames(peak_matrix) <- peak_id\n",
- "\n",
- "# Update metadata to match filtered samples\n",
- "meta_filtered <- meta_filtered[sampleid %in% available_samples]\n",
- "cat(\"Final metadata samples after filtering:\", nrow(meta_filtered), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8228d5e0-b459-421c-a3aa-e5e8a3a0f992",
- "metadata": {},
- "source": [
- "#### Process technical variables from metadata"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "id": "92bd6ab0-27e2-4218-8a35-8f826227d9fd",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Column name normalization (for easier handling)\n",
- "meta_clean <- meta_filtered %>%\n",
- " rename(\n",
- " med_nucleosome_signal = med.nucleosome_signal.ct,\n",
- " med_tss_enrich = med.tss.enrich.ct,\n",
- " med_n_tot_fragment = med.n_tot_fragment.ct,\n",
- " n_nuclei = n.nuclei\n",
- " )\n",
- "\n",
- "# Calculate peak metrics - total unique peaks detected per sample\n",
- "peak_metrics <- data.frame(\n",
- " sampleid = colnames(peak_matrix),\n",
- " total_unique_peaks = colSums(peak_matrix > 0)\n",
- ") %>%\n",
- " mutate(log_total_unique_peaks = log(total_unique_peaks + 1))\n",
- "\n",
- "# Calculate median peak width for each sample using count as weight\n",
- "calculate_median_peakwidth <- function(peak_matrix, peak_info) {\n",
- " # Create a data frame with peak widths\n",
- " peak_widths <- peak_info$end - peak_info$start\n",
- " \n",
- " # Initialize a vector to store median peak widths\n",
- " median_peak_widths <- numeric(ncol(peak_matrix))\n",
- " names(median_peak_widths) <- colnames(peak_matrix)\n",
- " \n",
- " # For each sample, calculate the weighted median peak width\n",
- " for (i in 1:ncol(peak_matrix)) {\n",
- " sample_counts <- peak_matrix[, i]\n",
- " # Only consider peaks with counts > 0\n",
- " idx <- which(sample_counts > 0)\n",
- " \n",
- " if (length(idx) > 0) {\n",
- " # Method 1: Use counts as weights\n",
- " weights <- sample_counts[idx]\n",
- " # Repeat each peak width by its count for weighted calculation\n",
- " all_widths <- rep(peak_widths[idx], times=weights)\n",
- " median_peak_widths[i] <- median(all_widths)\n",
- " } else {\n",
- " median_peak_widths[i] <- NA\n",
- " }\n",
- " }\n",
- " \n",
- " return(median_peak_widths)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3ab3869e-c624-4666-930f-97c6976c74da",
- "metadata": {},
- "source": [
- "#### Process peaks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "id": "b71aebba-d9c9-4836-8432-c3bba27e9864",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Sample of peak coordinates:\n",
- " peak_name chr start end\n",
- " \n",
- "1: chr1-817077-817577 chr1 817077 817577\n",
- "2: chr1-827285-827785 chr1 827285 827785\n",
- "3: chr1-850237-850737 chr1 850237 850737\n",
- "4: chr1-869660-870160 chr1 869660 870160\n",
- "5: chr1-903662-904162 chr1 903662 904162\n",
- "6: chr1-904504-905004 chr1 904504 905004\n",
- "Number of blacklisted peaks: 29 \n",
- "Number of peaks after blacklist filtering: 363746 \n"
- ]
- }
- ],
- "source": [
- "# Process peak coordinates (split once, reuse for all three fields)\n",
- "peak_parts <- strsplit(peak_id, \"-\")\n",
- "peak_df <- data.table(\n",
- " peak_name = peak_id,\n",
- " chr = sapply(peak_parts, `[`, 1),\n",
- " start = as.integer(sapply(peak_parts, `[`, 2)),\n",
- " end = as.integer(sapply(peak_parts, `[`, 3))\n",
- ")\n",
- "\n",
- "# Verify peak coordinates were extracted correctly\n",
- "cat(\"Sample of peak coordinates:\\n\")\n",
- "print(head(peak_df))\n",
- "\n",
- "if (file.exists(blacklist_file)) {\n",
- " blacklist_df <- fread(blacklist_file)\n",
- " if (ncol(blacklist_df) >= 4) {\n",
- " colnames(blacklist_df)[1:4] <- c(\"chr\", \"start\", \"end\", \"label\")\n",
- " } else {\n",
- " colnames(blacklist_df)[1:3] <- c(\"chr\", \"start\", \"end\")\n",
- " }\n",
- " \n",
- " # Filter blacklisted peaks\n",
- " setkey(blacklist_df, chr, start, end)\n",
- " setkey(peak_df, chr, start, end)\n",
- " overlapping_peaks <- foverlaps(peak_df, blacklist_df, nomatch=0)\n",
- " blacklisted_peaks <- unique(overlapping_peaks$peak_name)\n",
- " cat(\"Number of blacklisted peaks:\", length(blacklisted_peaks), \"\\n\")\n",
- " \n",
- " filtered_peak_idx <- !(peak_id %in% blacklisted_peaks)\n",
- " filtered_peak <- peak_matrix[filtered_peak_idx, ]\n",
- " cat(\"Number of peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "} else {\n",
- " cat(\"Warning: Blacklist file not found at\", blacklist_file, \"\\n\")\n",
- " cat(\"Proceeding without blacklist filtering\\n\")\n",
- " filtered_peak <- peak_matrix\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "fcabaa13-d809-4538-806d-d3aea0a37858",
- "metadata": {},
- "source": [
- "#### Load covariates"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "id": "d5caf744-bf0a-4aa9-95bc-601341111872",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Variable statistics before and after log transformation:\n",
- "n_nuclei: min=39.00, median=849.00, max=4394.00, SD=1080.03\n",
- "log_n_nuclei: min=3.66, median=6.74, max=8.39, SD=1.05\n",
- "med_n_tot_fragment: min=1308.50, median=7521.00, max=30629.00, SD=5373.50\n",
- "log_med_n_tot_fragment: min=7.18, median=8.93, max=10.33, SD=0.69\n",
- "Number of samples after joining: 83 \n",
- "Sample IDs: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...\n",
- "Available covariates: sampleid, individualID, sequencingBatch, Library, Celltype4, n_nuclei, avg.pct.read.in.peak.ct, med_nucleosome_signal, med_n_tot_fragment, med_tss_enrich, total_unique_peaks, log_total_unique_peaks, pmi, study, log_n_nuclei, log_med_n_tot_fragment \n"
- ]
- }
- ],
- "source": [
- "covariates_file <- file.path(input_dir,'data/rosmap_cov.txt')\n",
- "\n",
- "if (file.exists(covariates_file)) {\n",
- " covariates <- fread(covariates_file)\n",
- " # Check column names and adjust if needed\n",
- " if ('#id' %in% colnames(covariates)) {\n",
- " id_col <- '#id'\n",
- " } else if ('individualID' %in% colnames(covariates)) {\n",
- " id_col <- 'individualID'\n",
- " } else {\n",
- " cat(\"Warning: Could not identify ID column in covariates file. Available columns:\", \n",
- " paste(colnames(covariates), collapse=\", \"), \"\\n\")\n",
- " id_col <- colnames(covariates)[1]\n",
- " cat(\"Using\", id_col, \"as ID column\\n\")\n",
- " }\n",
- " \n",
- " # Select relevant columns - excluding msex and age_death\n",
- " cov_cols <- intersect(c(id_col, 'pmi', 'study'), colnames(covariates))\n",
- " covariates <- covariates[, ..cov_cols]\n",
- " \n",
- " # Merge with metadata\n",
- " meta_with_ind <- meta_clean %>%\n",
- " select(sampleid, everything())\n",
- " \n",
- " all_covs <- meta_with_ind %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\") %>%\n",
- " inner_join(covariates, by = setNames(id_col, \"sampleid\"))\n",
- " \n",
- " # Impute missing values\n",
- " for (col in c(\"pmi\")) {\n",
- " if (col %in% colnames(all_covs) && any(is.na(all_covs[[col]]))) {\n",
- " cat(\"Imputing missing values for\", col, \"\\n\")\n",
- " all_covs[[col]][is.na(all_covs[[col]])] <- median(all_covs[[col]], na.rm=TRUE)\n",
- " }\n",
- " }\n",
- "} else {\n",
- " cat(\"Warning: Covariates file\", covariates_file, \"not found.\\n\")\n",
- " cat(\"Proceeding with only technical variables.\\n\")\n",
- " all_covs <- meta_clean %>%\n",
- " inner_join(peak_metrics, by = \"sampleid\")\n",
- "}\n",
- "\n",
- "\n",
- "# Perform log transformations on necessary variables\n",
- "# Add a small constant to avoid log(0)\n",
- "epsilon <- 1e-6\n",
- "\n",
- "all_covs$log_n_nuclei <- log(all_covs$n_nuclei + epsilon)\n",
- "all_covs$log_med_n_tot_fragment <- log(all_covs$med_n_tot_fragment + epsilon)\n",
- "\n",
- "# Show distribution of original and log-transformed variables\n",
- "cat(\"\\nVariable statistics before and after log transformation:\\n\")\n",
- "for (var in c(\"n_nuclei\", \"med_n_tot_fragment\")) {\n",
- " orig_var <- all_covs[[var]]\n",
- " log_var <- all_covs[[paste0(\"log_\", var)]]\n",
- " \n",
- " cat(sprintf(\"%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(orig_var), median(orig_var), max(orig_var), sd(orig_var)))\n",
- " cat(sprintf(\"log_%s: min=%.2f, median=%.2f, max=%.2f, SD=%.2f\\n\", \n",
- " var, min(log_var), median(log_var), max(log_var), sd(log_var)))\n",
- "}\n",
- "\n",
- "cat(\"Number of samples after joining:\", nrow(all_covs), \"\\n\")\n",
- "cat(\"Sample IDs:\", paste(head(all_covs$sampleid), collapse=\", \"), \"...\\n\")\n",
- "cat(\"Available covariates:\", paste(colnames(all_covs), collapse=\", \"), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "af1a0588-5d0d-471e-857f-754b69836303",
- "metadata": {},
- "source": [
- "#### Create DGE object"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "id": "ccdc7318-b28e-4037-ac3a-c7794d4a72ba",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of valid samples: 83 \n"
- ]
- }
- ],
- "source": [
- "valid_samples <- intersect(colnames(filtered_peak), all_covs$sampleid)\n",
- "cat(\"Number of valid samples:\", length(valid_samples), \"\\n\")\n",
- "\n",
- "all_covs_filtered <- all_covs[all_covs$sampleid %in% valid_samples, ]\n",
- "filtered_peak_filtered <- filtered_peak[, valid_samples]\n",
- "\n",
- "dge <- DGEList(\n",
- " counts = filtered_peak_filtered,\n",
- " samples = all_covs_filtered\n",
- ")\n",
- "rownames(dge$samples) <- dge$samples$sampleid"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bd4a6650-bedd-4a93-ba36-fd4f091cbb99",
- "metadata": {},
- "source": [
- "#### Filter low counts and normalize"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "id": "630dc838-d78b-445c-846d-91f1fc0bc56f",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks before filtering: 176039 \n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "Warning message in filterByExpr.DGEList(dge, min.count = 5, min.total.count = 15, :\n",
- "“All samples appear to belong to the same group.”\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Number of peaks after filtering: 176039 \n",
- "Saved filtered raw counts to /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_filtered_raw_counts.txt \n"
- ]
- }
- ],
- "source": [
- "cat(\"Number of peaks before filtering:\", nrow(dge), \"\\n\")\n",
- "keep <- filterByExpr(dge, \n",
- " min.count = 5, # min count required per sample\n",
- " min.total.count = 15, # min total count across all samples\n",
- " min.prop = 0.1) # required in at least 10% of samples\n",
- "\n",
- "dge <- dge[keep, , keep.lib.sizes=FALSE]\n",
- "cat(\"Number of peaks after filtering:\", nrow(dge), \"\\n\") #66154 in OPC\n",
- "\n",
- "# Save filtered raw count data\n",
- "filtered_raw_counts <- dge$counts\n",
- "raw_counts_file <- file.path(output_dir, \"2_residuals\", celltype, paste0(celltype, \"_filtered_raw_counts.txt\"))\n",
- "write.table(filtered_raw_counts,\n",
- " file = raw_counts_file, \n",
- " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n",
- "cat(\"Saved filtered raw counts to\", raw_counts_file, \"\\n\")\n",
- "\n",
- "dge <- calcNormFactors(dge, method=\"TMM\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "533471bd-7b96-4bd7-b3b2-00694b69507b",
- "metadata": {},
- "source": [
- "#### Handle batch and library as technical variables"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "id": "152eaa2c-8856-4436-a27f-6064bd93dd93",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Handling sequencingBatch and Library as technical variables\n",
- "Found 2 unique sequencing batches\n",
- "Batch sizes:\n",
- "batches\n",
- "190820Kel 191203Kel \n",
- " 7 76 \n",
- "Found 7 unique libraries\n",
- "Library sizes:\n",
- "libraries\n",
- "Library10 Library11 Library2 Library4 Library5 Library7 Library9 \n",
- " 26 6 7 6 7 23 8 \n"
- ]
- }
- ],
- "source": [
- "# We'll handle batch and library as technical variables rather than doing batch adjustment\n",
- "cat(\"Handling sequencingBatch and Library as technical variables\\n\")\n",
- "\n",
- "# Check batch information\n",
- "batches <- dge$samples$sequencingBatch\n",
- "cat(\"Found\", length(unique(batches)), \"unique sequencing batches\\n\")\n",
- "\n",
- "# Check batch size\n",
- "batch_counts <- table(batches)\n",
- "cat(\"Batch sizes:\\n\")\n",
- "print(batch_counts)\n",
- "\n",
- "# Encode sequencingBatch as a factor (single-level dummy fallback when only one batch exists)\n",
- "if (length(unique(batches)) < 2) {\n",
- " cat(\"Only one sequencing batch found. Adding dummy batch for model compatibility.\\n\")\n",
- " # Create a dummy batch factor to avoid model errors\n",
- " dge$samples$sequencingBatch_factor <- factor(rep(\"batch1\", ncol(dge)))\n",
- "} else {\n",
- " # Use the existing batch information\n",
- " dge$samples$sequencingBatch_factor <- factor(dge$samples$sequencingBatch)\n",
- "}\n",
- "\n",
- "# Check library information\n",
- "libraries <- dge$samples$Library\n",
- "cat(\"Found\", length(unique(libraries)), \"unique libraries\\n\")\n",
- "\n",
- "# Check library size\n",
- "library_counts <- table(libraries)\n",
- "cat(\"Library sizes:\\n\")\n",
- "print(library_counts)\n",
- "\n",
- "# Convert Library to factor with at least 2 levels\n",
- "if (length(unique(libraries)) < 2) {\n",
- " cat(\"Only one library found. Adding dummy library for model compatibility.\\n\")\n",
- " # Create a dummy library factor to avoid model errors\n",
- " dge$samples$Library_factor <- factor(rep(\"lib1\", ncol(dge)))\n",
- "} else {\n",
- " # Use the existing library information\n",
- " dge$samples$Library_factor <- factor(dge$samples$Library)\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9bc8dda3-89ae-47ad-8785-e393695061dd",
- "metadata": {},
- "source": [
- "#### Create model and run voom"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "id": "c6e7f374-b7a5-4666-ac99-191807b7e8e2",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Using model with technical covariates plus pmi and study\n",
- "Model formula: ~log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment + log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi + study \n",
- "Warning: Factor variable group has only one level. Converting to character.\n",
- "Successfully created design matrix with 15 columns\n",
- "Design matrix is not full rank. Adjusting...\n",
- "Adjusted design matrix columns: 14 \n",
- "Calculating offsets and residuals...\n"
- ]
- }
- ],
- "source": [
- "# Define the model based on available covariates - using log-transformed variables\n",
- "# Removed msex and age_death from the model\n",
- "if (\"study\" %in% colnames(dge$samples) && \"pmi\" %in% colnames(dge$samples)) {\n",
- " # Technical model with pmi and study\n",
- " cat(\"Using model with technical covariates plus pmi and study\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi + study\n",
- "} else if (\"pmi\" %in% colnames(dge$samples)) {\n",
- " # Technical model with pmi only\n",
- " cat(\"Using model with technical covariates and pmi\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + sequencingBatch_factor + Library_factor + pmi\n",
- "} else {\n",
- " # Technical variables only model\n",
- " cat(\"Using model with technical covariates only\\n\")\n",
- " model <- ~ log_n_nuclei + med_nucleosome_signal + med_tss_enrich + log_med_n_tot_fragment +\n",
- " log_total_unique_peaks + sequencingBatch_factor + Library_factor\n",
- "}\n",
- "\n",
- "# Print the model formula\n",
- "cat(\"Model formula:\", deparse(model), \"\\n\")\n",
- "\n",
- "# Check for factor variables with only one level\n",
- "for (col in colnames(dge$samples)) {\n",
- " if (is.factor(dge$samples[[col]]) && nlevels(dge$samples[[col]]) < 2) {\n",
- " cat(\"Warning: Factor variable\", col, \"has only one level. Converting to character.\\n\")\n",
- " dge$samples[[col]] <- as.character(dge$samples[[col]])\n",
- " }\n",
- "}\n",
- "\n",
- "# Create design matrix with error checking\n",
- "tryCatch({\n",
- " design <- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- "}, error = function(e) {\n",
- " cat(\"Error in creating design matrix:\", e$message, \"\\n\")\n",
- " cat(\"Attempting to fix model formula...\\n\")\n",
- " \n",
- " # Check each term in the model\n",
- " all_terms <- all.vars(model)\n",
- " valid_terms <- character(0)\n",
- " \n",
- " for (term in all_terms) {\n",
- " if (term %in% colnames(dge$samples)) {\n",
- " # Check if it's a factor with at least 2 levels\n",
- " if (is.factor(dge$samples[[term]])) {\n",
- " if (nlevels(dge$samples[[term]]) >= 2) {\n",
- " valid_terms <- c(valid_terms, term)\n",
- " } else {\n",
- " cat(\"Skipping factor\", term, \"with only\", nlevels(dge$samples[[term]]), \"level\\n\")\n",
- " }\n",
- " } else {\n",
- " # Non-factor variables are fine\n",
- " valid_terms <- c(valid_terms, term)\n",
- " }\n",
- " } else {\n",
- " cat(\"Variable\", term, \"not found in sample data\\n\")\n",
- " }\n",
- " }\n",
- " \n",
- " # Create a simplified model with valid terms\n",
- " if (length(valid_terms) > 0) {\n",
- " model_str <- paste(\"~\", paste(valid_terms, collapse = \" + \"))\n",
- " # <<- so the repaired model and design escape the handler's local scope\n",
- " model <<- as.formula(model_str)\n",
- " cat(\"New model formula:\", model_str, \"\\n\")\n",
- " design <<- model.matrix(model, data=dge$samples)\n",
- " cat(\"Successfully created design matrix with\", ncol(design), \"columns\\n\")\n",
- " } else {\n",
- " stop(\"Could not create a valid model with the available variables\")\n",
- " }\n",
- "})\n",
- "\n",
- "# Check if the design matrix is full rank\n",
- "if (!is.fullrank(design)) {\n",
- " cat(\"Design matrix is not full rank. Adjusting...\\n\")\n",
- " # Find and remove the problematic columns\n",
- " qr_res <- qr(design)\n",
- " design <- design[, qr_res$pivot[1:qr_res$rank]]\n",
- " cat(\"Adjusted design matrix columns:\", ncol(design), \"\\n\")\n",
- "}\n",
- "\n",
- "# Run voom and fit model\n",
- "v <- voom(dge, design, plot=FALSE) #logCPM\n",
- "fit <- lmFit(v, design)\n",
- "fit <- eBayes(fit)\n",
- "\n",
- "# Calculate offset and residuals\n",
- "cat(\"Calculating offsets and residuals...\\n\")\n",
- "offset <- predictOffset(fit)\n",
- "resids <- residuals(fit, y=v)\n",
- "\n",
- "# Verify that offset and residual matrices align before combining them\n",
- "stopifnot(all(dim(offset) == dim(resids)))\n",
- "stopifnot(all(rownames(offset) == rownames(resids)))\n",
- "stopifnot(all(colnames(offset) == colnames(resids)))\n",
- "\n",
- "final_data <- offset + resids"
- ]
- },
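- {
- "cell_type": "markdown",
- "id": "b1f0c2d4-1111-4a2b-9c3d-5e6f7a8b9c0d",
- "metadata": {},
- "source": [
- "The full-rank adjustment above keeps only the linearly independent design columns via a pivoted QR decomposition. A minimal sketch of the same idea, using a small hypothetical design with one collinear column (not data from this pipeline):\n",
- "\n",
- "```r\n",
- "X <- cbind(intercept = 1, a = c(1, 2, 3, 4), b = 2 * c(1, 2, 3, 4))\n",
- "qr_res <- qr(X)\n",
- "# pivot orders columns so the first `rank` of them are independent\n",
- "X_fr <- X[, qr_res$pivot[seq_len(qr_res$rank)], drop = FALSE]\n",
- "ncol(X_fr)  # 2: the collinear column b is dropped\n",
- "```"
- ]
- },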
- {
- "cell_type": "markdown",
- "id": "fac57ec8-7559-4c60-94f4-73d190a2f11a",
- "metadata": {},
- "source": [
- "#### Save results"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "id": "6f3dbbf0-acdf-4257-8dad-529727dac1d2",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Processing completed. Results and documentation saved to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/ \n"
- ]
- }
- ],
- "source": [
- "# Save results\n",
- "saveRDS(list(\n",
- " dge = dge,\n",
- " offset = offset,\n",
- " residuals = resids,\n",
- " final_data = final_data,\n",
- " valid_samples = colnames(dge),\n",
- " design = design,\n",
- " fit = fit,\n",
- " model = model\n",
- "), file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_results.rds\"))\n",
- "\n",
- "# Write final residual data to file\n",
- "write.table(final_data,\n",
- " file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_residuals.txt\"), \n",
- " quote=FALSE, sep=\"\\t\", row.names=TRUE, col.names=TRUE)\n",
- "\n",
- "# Write summary statistics\n",
- "sink(file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_summary.txt\"))\n",
- "cat(\"*** Processing Summary for\", celltype, \"***\\n\\n\")\n",
- "cat(\"Original peak count:\", length(peak_id), \"\\n\")\n",
- "cat(\"Peaks after blacklist filtering:\", nrow(filtered_peak), \"\\n\")\n",
- "cat(\"Peaks after expression filtering:\", nrow(dge), \"\\n\\n\")\n",
- "cat(\"Number of samples:\", ncol(dge), \"\\n\")\n",
- "cat(\"Number of samples after nuclei (>20) filtering:\", ncol(peak_matrix), \"\\n\")\n",
- "cat(\"\\nTechnical Variables Used:\\n\")\n",
- "cat(\"- log_n_nuclei: Log-transformed number of nuclei per sample\\n\")\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\"- med_tss_enrich: Median TSS enrichment\\n\")\n",
- "cat(\"- log_med_n_tot_fragment: Log-transformed median number of total fragments\\n\")\n",
- "cat(\"- log_total_unique_peaks: Log-transformed count of unique peaks per sample\\n\")\n",
- "cat(\"- sequencingBatch_factor: Sequencing batch ID\\n\")\n",
- "cat(\"- Library_factor: Library ID\\n\")\n",
- "cat(\"\\nOther Variables Used:\\n\")\n",
- "cat(\"- pmi: Post-mortem interval\\n\")\n",
- "cat(\"- study: Study cohort\\n\")\n",
- "sink()\n",
- "\n",
- "# Write an additional explanation file about the variables and log transformation\n",
- "sink(file = paste0(output_dir, \"/2_residuals/\", celltype, \"/\", celltype, \"_variable_explanation.txt\"))\n",
- "cat(\"# ATAC-seq Technical Variables Explanation\\n\\n\")\n",
- "\n",
- "\n",
- "cat(\"## Why Log Transformation?\\n\")\n",
- "cat(\"Log transformation is applied to certain variables for several reasons:\\n\")\n",
- "cat(\"1. To make the distribution more symmetric and closer to normal\\n\")\n",
- "cat(\"2. To stabilize variance across the range of values\\n\")\n",
- "cat(\"3. To match the scale of voom-transformed peak counts, which are on log2-CPM scale\\n\")\n",
- "cat(\"4. To be consistent with the approach used in related studies like haQTL\\n\\n\")\n",
- "\n",
- "cat(\"## Variables and Their Meanings\\n\\n\")\n",
- "\n",
- "cat(\"### Technical Variables\\n\")\n",
- "cat(\"- n_nuclei: Number of nuclei that contributed to this pseudobulk sample\\n\")\n",
- "cat(\" * Filtered to include only samples with >20 nuclei\\n\")\n",
- "cat(\" * Log-transformed because count data typically has a right-skewed distribution\\n\\n\")\n",
- "\n",
- "cat(\"- med_n_tot_fragment: Median number of total fragments per cell\\n\")\n",
- "cat(\" * Represents sequencing depth\\n\")\n",
- "cat(\" * Log-transformed because sequencing depth typically has exponential effects\\n\\n\")\n",
- "\n",
- "cat(\"- total_unique_peaks: Number of unique peaks detected in each sample\\n\")\n",
- "cat(\" * Log-transformed similar to 'TotalNumPeaks' in haQTL pipeline\\n\\n\")\n",
- "\n",
- "cat(\"- med_nucleosome_signal: Median nucleosome signal\\n\")\n",
- "cat(\" * Measures the degree of nucleosome positioning\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "cat(\"- med_tss_enrich: Median transcription start site enrichment score\\n\")\n",
- "cat(\" * Indicates the quality of the ATAC-seq data\\n\")\n",
- "cat(\" * Not log-transformed as it's already a ratio/normalized metric\\n\\n\")\n",
- "\n",
- "\n",
- "cat(\"- sequencingBatch: Batch ID for the sequencing run\\n\")\n",
- "cat(\" * Treated as a factor to account for batch effects\\n\\n\")\n",
- "\n",
- "cat(\"- Library: Library preparation batch ID\\n\")\n",
- "cat(\" * Treated as a factor to account for library preparation effects\\n\\n\")\n",
- "\n",
- "cat(\"### Other Variables\\n\")\n",
- "cat(\"- pmi: Post-mortem interval (time between death and tissue collection)\\n\")\n",
- "cat(\"- study: Study cohort (ROSMAP, MAP, ROS)\\n\\n\")\n",
- "\n",
- "cat(\"## Relationship to voom Transformation\\n\")\n",
- "cat(\"The voom transformation converts count data to log2-CPM (counts per million) values \")\n",
- "cat(\"and estimates the mean-variance relationship. By log-transforming certain technical \")\n",
- "cat(\"covariates, we ensure they're on a similar scale to the transformed expression data, \")\n",
- "cat(\"which can improve the fit of the linear model used for removing unwanted variation.\\n\")\n",
- "sink()\n",
- "\n",
- "cat(\"Processing completed. Results and documentation saved to:\", paste0(output_dir, \"/2_residuals/\", celltype, \"/\"), \"\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "177f20f0-2d2e-4674-9894-15434681504d",
- "metadata": {},
- "source": [
- "## Step 2: Phenotype Reformat\n",
- "**Purpose:** Converts covariate-adjusted residuals from Step 1 into genome-wide BED format suitable for QTL mapping tools (FastQTL, TensorQTL, MatrixEQTL).\n",
- "\n",
- "---\n",
- "\n",
- "#### Input:\n",
- "\n",
- "**From Step 1 (required):**\n",
- "- `{celltype}_residuals.txt` (in `output/2_residuals/{celltype}/`)\n",
- "\n",
- "**Cell Types:**\n",
- "- `Mic` (Microglia)\n",
- "- `Astro` (Astrocytes)\n",
- "- `Oligo` (Oligodendrocytes)\n",
- "- `Ex` (Excitatory neurons)\n",
- "- `In` (Inhibitory neurons)\n",
- "- `OPC` (Oligodendrocyte precursor cells)\n",
- "\n",
- "\n",
- "#### Process:\n",
- "\n",
- "1. Set Cell Type and Paths\n",
- "2. Load residuals file\n",
- "3. Extract and parse peak IDs\n",
- "4. Convert to Midpoint Coordinates\n",
- "5. Create BED format\n",
- "6. Sort by genomic position\n",
- "7. Write BED file\n",
- "8. Compress with bgzip\n",
- "9. Index with tabix\n",
- "\n",
- "#### Output:\n",
- "Output Directory: `output/3_phenotype_reformatting/{celltype}/`\n",
- "\n",
- "Output File: `{celltype}_kellis_xiong_snatac_phenotype.bed.gz`"
- ]
- },
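- {
- "cell_type": "markdown",
- "id": "c2a1d3e5-2222-4b3c-8d4e-6f7a8b9c0d1e",
- "metadata": {},
- "source": [
- "For orientation, the resulting BED file is expected to look roughly like this (peak IDs and sample names here are made up for illustration):\n",
- "\n",
- "```\n",
- "#chr    start   end     ID              SM-AAAA   SM-BBBB   ...\n",
- "chr1    1500    1501    chr1-1000-2000  0.21      -0.87     ...\n",
- "```\n",
- "\n",
- "The `start`/`end` pair is a 1 bp window centered on the peak midpoint, and the remaining columns carry the covariate-adjusted residuals per sample."
- ]
- },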
- {
- "cell_type": "markdown",
- "id": "55814e99-5baf-4a24-b185-7ecfd2327ed8",
- "metadata": {},
- "source": [
- "#### Load libraries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "id": "858a632a-3ba8-4791-a2d5-b92110dc8ce3",
- "metadata": {},
- "outputs": [],
- "source": [
- "library(data.table)\n",
- "library(stringr)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "id": "bf1b630b-20f7-43f2-973a-28dbee1acc61",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Column names from first line: SM-CTECR, SM-CJK5G, SM-CJEKQ, SM-CJGGY, SM-CJK3S, SM-CTEGU ...\n"
- ]
- }
- ],
- "source": [
- "#!/usr/bin/env Rscript\n",
- "\n",
- "# Script to reformat ATAC-seq residuals into BED format and compress with bgzip\n",
- "# Usage: Rscript reformat_residuals.R [celltype]\n",
- "\n",
- "# Get command line arguments\n",
- "#args <- commandArgs(trailingOnly = TRUE)\n",
- "#if (length(args) < 1) {\n",
- "# celltype <- \"Ex\" # Default cell type\n",
- "# cat(\"No cell type specified, using default:\", celltype, \"\\n\")\n",
- "#} else {\n",
- "# celltype <- args[1]\n",
- "# cat(\"Processing cell type:\", celltype, \"\\n\")\n",
- "#}\n",
- "\n",
- "# Define input and output paths\n",
- "#input_dir <- \"/home/al4225/project/kellis_snatac/output/xiong/2_residuals\"\n",
- "#output_dir <- \"/home/al4225/project/kellis_snatac/output/3_phenotype_processing\"\n",
- "pheno_reformat_output_dir <- paste0(output_dir, \"/3_phenotype_reformatting/\", celltype)\n",
- "\n",
- "# Create output directory if it doesn't exist\n",
- "dir.create(pheno_reformat_output_dir, recursive = TRUE, showWarnings = FALSE)\n",
- "\n",
- "# Check if input directory exists\n",
- "celltype_dir <- paste0(output_dir,\"/2_residuals/\", celltype)\n",
- "if (!dir.exists(celltype_dir)) {\n",
- " cat(\"Cell type directory not found:\", celltype_dir, \"\\n\")\n",
- " cat(\"Using backup directory...\\n\")\n",
- " celltype_dir <- file.path(output_dir,paste0(\"2_residuals/backup/\", celltype))\n",
- " if (!dir.exists(celltype_dir)) {\n",
- " stop(\"Backup directory not found either: \", celltype_dir)\n",
- " }\n",
- "}\n",
- "\n",
- "input_file <- file.path(celltype_dir, paste0(celltype, \"_residuals.txt\"))\n",
- "output_bed <- file.path(output_dir, paste0(\"3_phenotype_reformatting/\",celltype ,\"/\", celltype,\"_kellis_xiong_snatac_phenotype.bed\"))\n",
- "\n",
- "# Check if input file exists\n",
- "if (!file.exists(input_file)) {\n",
- " stop(\"Input file not found: \", input_file)\n",
- "}\n",
- "\n",
- "# Read the first line manually to get the column names\n",
- "first_line <- readLines(input_file, n = 1)\n",
- "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n",
- "cat(\"Column names from first line:\", paste(head(col_names), collapse = \", \"), \"...\\n\")"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9282f3a2-650f-4a61-abd1-5038b324cfea",
- "metadata": {},
- "source": [
- "#### Load input"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "id": "95e4ee18-4411-4bd6-9b33-6bc426d9742b",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Reading residuals file: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/2_residuals/Oligo/Oligo_residuals.txt \n"
- ]
- }
- ],
- "source": [
- "cat(\"Reading residuals file:\", input_file, \"\\n\")\n",
- "first_line <- readLines(input_file, n = 1)\n",
- "col_names <- unlist(strsplit(first_line, split = \"\\t\"))\n",
- "\n",
- "residuals <- fread(input_file, header = FALSE, skip = 1)\n",
- "\n",
- "# The first data column holds the peak IDs (row names written by write.table).\n",
- "# If the header has fewer fields than the data, it has no name for that ID\n",
- "# column and col_names already matches the remaining columns; otherwise drop\n",
- "# the header's leading ID field before assigning the names.\n",
- "if (ncol(residuals) > length(col_names)) {\n",
- " peak_ids <- residuals[[1]]\n",
- " residuals <- residuals[, -1, with = FALSE]\n",
- " setnames(residuals, col_names)\n",
- "} else {\n",
- " peak_ids <- residuals[[1]]\n",
- " residuals <- residuals[, -1, with = FALSE]\n",
- " setnames(residuals, col_names[-1]) # Adjusting for leading empty/ID column\n",
- "}"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5fe44c41-8fde-4107-80fa-de5823e3f0ab",
- "metadata": {},
- "source": [
- "#### Coordinate Parsing (BED format)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "id": "34037cc5-ad0e-48c7-b528-f67ecbc0bec7",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Parsing peak IDs into BED format with midpoint coordinates\n"
- ]
- }
- ],
- "source": [
- "cat(\"Parsing peak IDs into BED format with midpoint coordinates\\n\")\n",
- "parts <- strsplit(peak_ids, \"-\")\n",
- "chrs <- sapply(parts, `[`, 1)\n",
- "starts_raw <- as.numeric(sapply(parts, `[`, 2))\n",
- "ends_raw <- as.numeric(sapply(parts, `[`, 3))\n",
- "\n",
- "# Calculate midpoints for a 1bp window (Standard for QTLtools)\n",
- "# This centers the peak signal on a single genomic coordinate\n",
- "mids <- as.integer((starts_raw + ends_raw) / 2)\n",
- "\n",
- "parsed_peaks <- data.table(\n",
- " '#chr' = chrs,\n",
- " start = mids,\n",
- " end = mids + 1,\n",
- " ID = peak_ids\n",
- ")\n",
- "\n",
- "# Combine and Sort\n",
- "bed_data <- cbind(parsed_peaks, residuals)\n",
- "setorder(bed_data, '#chr', start)\n"
- ]
- },
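- {
- "cell_type": "markdown",
- "id": "d3b2e4f6-3333-4c4d-9e5f-7a8b9c0d1e2f",
- "metadata": {},
- "source": [
- "A worked example of the midpoint conversion above, using a hypothetical peak ID:\n",
- "\n",
- "```r\n",
- "parts <- strsplit(\"chr1-1000-2000\", \"-\")[[1]]\n",
- "mid <- as.integer((as.numeric(parts[2]) + as.numeric(parts[3])) / 2)\n",
- "mid  # 1500, so the BED row becomes: chr1  1500  1501  chr1-1000-2000\n",
- "```"
- ]
- },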
- {
- "cell_type": "markdown",
- "id": "39221488-f744-402e-97c6-ef6f98c310e6",
- "metadata": {},
- "source": [
- "#### Save and compress "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "id": "e09883c1-9d6e-4447-ae27-e3d668c33ef2",
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing BED file to: /restricted/projectnb/xqtl/jaempawi/atac_seq/output/xiong/3_phenotype_reformatting/Oligo/Oligo_kellis_xiong_snatac_phenotype.bed \n",
- "Compressing with bgzip...\n",
- "Process completed for Oligo \n"
- ]
- }
- ],
- "source": [
- "cat(\"Writing BED file to:\", output_bed, \"\\n\")\n",
- "fwrite(bed_data, output_bed, sep = \"\\t\", col.names = TRUE, quote = FALSE)\n",
- "\n",
- "cat(\"Compressing with bgzip...\\n\")\n",
- "system(paste(\"bgzip -f\", output_bed))\n",
- "\n",
- "# Index with tabix so downstream QTL tools can make fast region queries\n",
- "system(paste(\"tabix -p bed\", paste0(output_bed, \".gz\")))\n",
- "\n",
- "cat(\"Process completed for\", celltype, \"\\n\")"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "R",
- "language": "R",
- "name": "ir"
- },
- "language_info": {
- "codemirror_mode": "r",
- "file_extension": ".r",
- "mimetype": "text/x-r-source",
- "name": "R",
- "pygments_lexer": "r",
- "version": "4.4.3"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}