- `experiment_key_column`: Column in cell metadata used to generate pseudobulk datasets and calculate cell type proportions (combined with `proportion_covariate_column`)
- `anndata_cell_label`: Column in cell metadata representing the cell types for which to perform differential expression and gene set enrichment

Parameters are applied within each cell type as denoted by `anndata_cell_label`.

- `mean_cp10k_filter`: Remove genes with mean counts per 10,000 (CP10K) expression <= `mean_cp10k_filter`
- `models`: List of configurations for differential expression methods to run
- `filter_options`: Options for the optional pre-filter.
    - `filter`: Value to set as minimum for filter.
    - `modality`: [cp10k|counts]
    - `metric`: Default is `mean`. TODO -- alternate metrics implemented?
    - `by_comparison`: TODO -- what is this?
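
The `mean_cp10k_filter` rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the pipeline's implementation: the helper names (`mean_cp10k`, `genes_passing_filter`) are hypothetical, the input is assumed to be a small cells-by-genes table of raw counts for a single cell type, and CP10K is assumed to mean each cell's counts scaled to a total of 10,000.

```python
def mean_cp10k(counts):
    """Return the per-gene mean CP10K for a cells x genes list of raw counts.

    Assumption: CP10K scales each cell so that its counts sum to 10,000.
    """
    n_genes = len(counts[0])
    totals = [0.0] * n_genes
    for cell in counts:
        depth = sum(cell)
        if depth == 0:
            continue  # an empty cell contributes 0 to every gene
        for gene_index, count in enumerate(cell):
            totals[gene_index] += count * 10_000 / depth
    return [total / len(counts) for total in totals]


def genes_passing_filter(counts, gene_names, mean_cp10k_filter):
    """Keep genes whose mean CP10K is strictly greater than the threshold,
    i.e. remove genes with mean CP10K <= mean_cp10k_filter."""
    means = mean_cp10k(counts)
    return [name for name, mean in zip(gene_names, means) if mean > mean_cp10k_filter]


# Two cells of one (hypothetical) cell type, three genes.
counts = [
    [8, 2, 0],
    [5, 5, 0],
]
genes = ["geneA", "geneB", "geneC"]
kept = genes_passing_filter(counts, genes, mean_cp10k_filter=1)
print(kept)  # geneC has a mean CP10K of 0, so it is removed
```

Because parameters are applied within each cell type, a sketch like this would be run once per value of `anndata_cell_label`.
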
- `method`: String in the format `package::resolution::model`. Possible values for each:
    - `package`: "mast", "edger", "deseq"
    - `resolution`: "singlecell" or "pseudobulk"
    - `model`: Options:
        - For MAST, "bayesglm", "glmer" (needed for random effect models), or "glm"
        - For edgeR, "glmQLFit" or "glmLRT"
        - For DESeq2, "glmGamPoi" (recommended), "parametric", "local", or "mean"
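
To make the `package::resolution::model` format concrete, here is a minimal validation sketch. The function name `parse_method` and the lookup tables are hypothetical and simply mirror the options listed above; the sketch checks each field on its own and does not assume anything about which package and resolution combinations the pipeline actually accepts.

```python
# Options copied from the list above; the table names are illustrative.
VALID_MODELS = {
    "mast": {"bayesglm", "glmer", "glm"},
    "edger": {"glmQLFit", "glmLRT"},
    "deseq": {"glmGamPoi", "parametric", "local", "mean"},
}
VALID_RESOLUTIONS = {"singlecell", "pseudobulk"}


def parse_method(method):
    """Split a 'package::resolution::model' string and validate each part."""
    parts = method.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected package::resolution::model, got {method!r}")
    package, resolution, model = parts
    if package not in VALID_MODELS:
        raise ValueError(f"unknown package: {package!r}")
    if resolution not in VALID_RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if model not in VALID_MODELS[package]:
        raise ValueError(f"model {model!r} is not listed for package {package!r}")
    return {"package": package, "resolution": resolution, "model": model}


parsed = parse_method("deseq::pseudobulk::glmGamPoi")
print(parsed)
```
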
- `formula`: Formula to model the gene expression. Terms should be columns in cell metadata (e.g., "~ sex + age + disease_status"). The pipeline also supports the following operations:
    - R formula functions, such as "I(age^2)", and interaction effects, denoted with ":" (e.g., "time_point:disease_status"). See: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/formula
    - Random effects, denoted by "(1|participant_id)"
- `variable_target`: Term in `formula` to test (e.g., disease_status)
- `variable_continuous`: Terms in `formula` that should be cast as continuous covariates
- `variable_discrete`: Terms in `formula` that should be cast as discrete (i.e., categorical) covariates
- `variable_discrete_level`: Reference information for discrete covariates, formatted as "cov_1::ref;;cov_2::ref"
- `pre_filter_genes`: Logical (e.g., true or false) controlling whether `mean_cp10k_filter` is applied before or after performing differential expression
- `proportion_covariate_column`: Column in cell metadata used to calculate the proportion of cells from each experiment (defined by `experiment_key_column`) representing each value. For instance, if it is the same as `anndata_cell_label`, the pipeline will calculate the proportion of each cell type for each experiment key.
- `include_proportion_covariates`: Logical (e.g., true or false) to include proportions from `proportion_covariate_column` in `formula`
- `ruvseq`: Logical (e.g., true or false) to run RUVSeq
- `ruvseq_n_empirical_genes`: Number of empirical genes to use as input into RUVSeq. If value < 1, the pipeline takes that proportion of the total genes (value * total number of genes). If value > 1, the pipeline uses value as the number of genes.
- `ruvseq_min_pvalue`: Number representing the minimum p-value threshold for empirical genes. Only genes with p-value > value will be used as empirical genes.
- `ruvseq_k`: Number of RUVSeq factors to adjust for
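
Two of the parameters above encode small conventions that are easy to get wrong: the "cov_1::ref;;cov_2::ref" string and the proportion-or-count meaning of `ruvseq_n_empirical_genes`. The sketch below shows one way to read them. Both function names are hypothetical, rounding the proportion to the nearest integer is an assumption, and treating a value of exactly 1 as a count is also an assumption, since the description only covers values below and above 1.

```python
def parse_discrete_levels(spec):
    """Turn 'cov_1::ref;;cov_2::ref' into {'cov_1': 'ref', 'cov_2': 'ref'}."""
    levels = {}
    for entry in spec.split(";;"):
        if not entry:
            continue  # tolerate an empty string or a trailing ';;'
        covariate, separator, reference = entry.partition("::")
        if not separator or not covariate or not reference:
            raise ValueError(f"malformed entry: {entry!r}")
        levels[covariate] = reference
    return levels


def resolve_n_empirical_genes(value, n_genes_total):
    """Interpret value as a proportion (< 1) or an absolute gene count."""
    if value <= 0:
        raise ValueError("value must be positive")
    if value < 1:
        # Assumption: round to the nearest whole gene.
        return round(value * n_genes_total)
    # Assumption: a value of exactly 1 is treated as a count of one gene.
    return int(value)


levels = parse_discrete_levels("sex::F;;disease_status::healthy")
print(levels)
print(resolve_n_empirical_genes(0.5, 20000))
print(resolve_n_empirical_genes(500, 20000))
```
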
- `de_merge_config`: Configuration for merge settings
    - `ihw_correction`: Configuration for IHW correction
        - `covariates`: Comma-separated list of covariates to include in IHW correction (e.g., "cell_label,disease_status")
        - `alpha`: See IHW documentation
- `de_plot_config`: Parameters for plotting differential expression results
    - `mean_expression_filter`: List of mean expression thresholds used to drop genes from plots, applied separately to each group in `anndata_cell_label`. For example, if gene A has an expression of 0 counts in cluster 1 and 10 in cluster 2, it will be dropped from cluster 1 but not from cluster 2.
- `goenrich_config`:
    - `go_terms`: Ontology terms: MF (Molecular Function), CC (Cellular Component), BP (Biological Process). Multiple terms can be specified, separated by commas (e.g., 'BP,MF,CC').
    - `clustering_method`: Method to cluster terms. Options: "binary_cut", "louvain", "mclust"
- `gsea_config`: Parameters for running gene set analyses
    - `fgsea_parameters`: List of alternate configurations for fgsea
        - `sample_size`: See fGSEA documentation
        - `score_type`: See fGSEA documentation
        - `min_set_size`: See fGSEA documentation
        - `max_set_size`: See fGSEA documentation
        - `eps`: See fGSEA documentation
        - `database`: Comma-separated list of databases to test for enrichments. Detailed descriptions of databases can be found here. Options:
            - `c2.cgp`: Chemical and genetic perturbations
            - `c2.cp.biocarta`: BioCarta
            - `c2.cp.kegg`: KEGG
            - `c2.cp.reactome`: Reactome
            - `c2.cp`: PID
            - `c5.bp`: GO biological process
            - `c5.cc`: GO cellular component
            - `c5.mf`: GO molecular function
            - `c6.all`: Oncogenic signatures
            - `c7.all`: Immunologic signatures
            - `all`: All gene sets (c2.cp.reactome, c2.cp.kegg, c5.bp, c5.cc, c5.mf)
    - `gsea_summarize_parameters`: Parameters to summarize GSEA data
        - `distance_metric`: Metric to calculate distance between terms. Options: "kappa", "jaccard", "dice", "overlap"
        - `clustering_method`: Method to cluster terms. Options: "binary_cut", "louvain", "mclust"
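
For intuition about the four `distance_metric` options, the sketch below computes the textbook versions of these measures on two gene sets. It is not taken from the pipeline: the function names are hypothetical, the measures are written as similarities (a distance could be taken as 1 minus the similarity, which is an assumption), and the kappa version assumes both sets are subsets of a known gene universe.

```python
def jaccard(a, b):
    """Shared genes divided by the union of both sets."""
    return len(a & b) / len(a | b)


def dice(a, b):
    """Twice the shared genes divided by the summed set sizes."""
    return 2 * len(a & b) / (len(a) + len(b))


def overlap(a, b):
    """Shared genes divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))


def kappa(a, b, universe):
    """Cohen's kappa on gene membership; a and b must be subsets of universe.

    Undefined (division by zero) when expected agreement is 1,
    e.g. when both sets are empty.
    """
    n = len(universe)
    in_both = len(a & b)
    in_neither = n - len(a | b)
    observed = (in_both + in_neither) / n
    p_a, p_b = len(a) / n, len(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)


# Hypothetical gene sets for two enriched terms.
term_1 = {"g1", "g2", "g3", "g4"}
term_2 = {"g3", "g4", "g5", "g6"}
universe = {f"g{i}" for i in range(1, 11)}

for name, value in [
    ("jaccard", jaccard(term_1, term_2)),
    ("dice", dice(term_1, term_2)),
    ("overlap", overlap(term_1, term_2)),
    ("kappa", kappa(term_1, term_2, universe)),
]:
    print(f"{name}: similarity={value:.3f}, distance={1 - value:.3f}")
```

The measures differ mainly in how they treat sets of unequal size: overlap reaches 1 whenever one set contains the other, while Jaccard and Dice penalize the extra genes.
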