IARCBiostat/OmicsProcessing

OmicsProcessing

Pre-analysis processing for metabolomics and proteomics data: missingness filtering, outlier handling, imputation, transformation, matched case-control handling, batch/plate correction, and SERRF-based normalisation across batches or strata. Please visit our website for more information and vignettes.

Choose your workflow

Semi-automated pipeline (process_data())

  • End-to-end wrapper that can filter on missingness, impute, transform, remove outliers (PCA + LOF), handle matched case-control designs, correct for plate/batch effects, and centre/scale.
  • Takes three data frames (feature data, feature metadata, sample metadata) and returns processed data plus exclusion IDs and PCA/LOF plots.
  • Full walk-through: Semi-automated pipeline.

Modular workflow (build your own)

Quick start

# install.packages("remotes")
remotes::install_github("IARCBiostat/OmicsProcessing")
library(OmicsProcessing)

Run the semi-automated pipeline with three input tables:

processed <- process_data(
  data = data_features,
  data_meta_features = data_meta_features,
  data_meta_samples = data_meta_samples,
  col_samples = "ID_sample",
  exclusion_extreme_feature = TRUE,
  exclusion_extreme_sample = TRUE,
  imputation = TRUE,
  transformation = TRUE,
  outlier = TRUE,
  plate_correction = TRUE
)

Or stitch together a modular workflow:

# Load data
df <- readr::read_csv("path/to/data")

# Filter by missingness
df_filtered <- filter_by_missingness(
  df,
  row_thresh = 0.5,
  col_thresh = 0.5,
  target_cols = "@",
  is_qc = grepl("^sQC", df$sample_type),
  filter_order = "iterative"
)

# Detect outlier samples (PCA + LOF)
outliers <- remove_outliers(
  df_filtered,
  target_cols = "@",
  is_qc = grepl("^sQC", df_filtered$sample_type),
  method = "pca-lof-overall",
  impute_method = "half-min-value",
  restore_missing_values = TRUE,
  return_ggplots = FALSE
)
df_clean <- outliers$df_filtered

# Log-transform features (log1p keeps zeros finite; namespaced calls need no attached pipe)
df_clean <- dplyr::mutate(
  df_clean,
  dplyr::across(tidyselect::contains("@"), log1p)
)

# Impute missing values (RF + LCMD)
df_imputed <- hybrid_imputation(
  df_clean,
  target_cols = "@",
  method = "RF-LCMD",
  oobe_threshold = 0.1
)$hybrid_rf_lcmd

# SERRF normalisation
df_normalised <- normalise_SERRF(
  df_imputed,
  target_cols = "@",
  is_qc = grepl("^sQC", df_imputed$sample_type),
  strata_col = "batch"
)

# Cluster features by RT using correlations
clusters <- cluster_features_by_retention_time(
  df = df_normalised,
  target_cols = "@",
  rt_height = 0.07,
  method = "correlations",
  cut_height = 0.26,
  corr_thresh = 0.75
)
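
The modular workflow above selects feature columns with target_cols = "@" and flags QC rows with grepl("^sQC", ...). The sketch below uses a made-up toy table and base R only to show what those two selectors pick out. It assumes feature columns carry "@" in their names, as the tidyselect::contains("@") call suggests. The column names and values are hypothetical, and this is not the package's internal logic.

```r
# Toy table (hypothetical values): two feature columns named with "@",
# plus the sample_type and batch columns used in the workflow above.
df_toy <- data.frame(
  sample_id   = c("S1", "S2", "sQC1", "sQC2"),
  sample_type = c("sample", "sample", "sQC", "sQC"),
  batch       = c("b1", "b1", "b1", "b1"),
  `100.5@1.2` = c(10, NA, 12, 11),
  `200.1@3.4` = c(0, 5, 4, NA),
  check.names = FALSE  # keep "@" in the column names
)

# Columns assumed to be matched by target_cols = "@"
feature_cols <- grep("@", names(df_toy), fixed = TRUE, value = TRUE)

# Rows flagged as QC samples, as in is_qc = grepl("^sQC", df$sample_type)
is_qc <- grepl("^sQC", df_toy$sample_type)

# Fraction of missing values per feature: the quantity a column
# missingness threshold such as col_thresh = 0.5 would be compared against
col_missing <- colMeans(is.na(df_toy[feature_cols]))

# Base-R equivalent of the log1p transform step
df_log <- df_toy
df_log[feature_cols] <- lapply(df_toy[feature_cols], log1p)
```

Checking feature_cols and is_qc on your own data before running the pipeline is a cheap way to confirm that the "@" pattern and the "^sQC" prefix match the columns and rows you expect.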

Developers & Contributors

We welcome contributions to OmicsProcessing. Our priorities are clean code and good documentation.

Please follow these guidelines: Developers & Contributors
