diff --git a/.github/workflows/publish.yml b/.github/workflows/publish.yml index 5093b87..4c4538f 100644 --- a/.github/workflows/publish.yml +++ b/.github/workflows/publish.yml @@ -4,6 +4,8 @@ on: push: tags: - "v*.*.*" + release: + types: [published] jobs: build-and-publish: @@ -12,6 +14,8 @@ jobs: steps: - name: Check out code uses: actions/checkout@v3 + with: + ref: ${{ github.event_name == 'release' && github.event.release.tag_name || github.ref }} - name: Set up Python uses: actions/setup-python@v4 @@ -32,7 +36,7 @@ jobs: TWINE_PASSWORD: ${{ secrets.GITHUB_TOKEN }} run: | twine upload \ - --repository-url https://upload.pypi.github.io/UCD-BDLab/BioNeuralNet \ + --repository-url https://api.github.com/orgs/UCD-BDLab/packages/pypi/upload \ dist/* - name: Publish to PyPI diff --git a/CHANGELOG.md b/CHANGELOG.md index c3da11f..8a0b681 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -66,9 +66,47 @@ and this project adheres to [Semantic Versioning](https://semver.org/). - **Updated Tutorials and Documentation**: New end to end jupiter notebook example. - **Updated Test**: All test have been updated and new ones have been added. -## [1.0.1] to [1.0.9] - 2025-04-24 +## [1.1.0] - 2025-07-12 -- **BUG**: A bug related to rdata files missing -- **Updated License**: BioNeuralNet is now distributed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0)](https://creativecommons.org/licenses/by-nc-nd/4.0/). +### **Added** +- **New Embedding Integration Utility** + - `_integrate_embeddings(reduced, method="multiply", alpha=2.0, beta=0.5)`: + - Integrates reduced embeddings with raw omics features via a multiplicative scheme: + - `enhanced = beta * raw + (1 - beta) * (alpha * normalized_weight * raw)` + - (default ensures ≥ 50 % of each feature’s final value is influenced by the learned weights). 
+ +- **Graph-Generation Algorithms** + - `gen_similarity_graph`: k-NN Cosine / Gaussian RBF similarity graph + - `gen_correlation_graph`: Pearson / Spearman co-expression graph + - `gen_threshold_graph`: soft-threshold (WGCNA-style) correlation graph + - `gen_gaussian_knn_graph`: Gaussian kernel k-NN graph + - `gen_mutual_info_graph`: mutual-information graph + +- **Preprocessing Utilities** + - Clinical data pipeline: `preprocess_clinical` + - Inf/NaN cleaning: `clean_inf_nan` + - Variance selection: `select_top_k_variance` + - Correlation selection (supervised / unsupervised): `select_top_k_correlation` + - RandomForest importance: `select_top_randomforest` + - ANOVA F-test selection: `top_anova_f_features` + - Network-pruning helpers: + - `prune_network`, `prune_network_by_quantile`, + - `network_remove_low_variance`, `network_remove_high_zero_fraction` + +- **Continuous-Deployment Workflow** + Updated `.github/workflows/publish.yml` to auto-publish releases to PyPI when a version tag (`v*.*.*`) is pushed or a GitHub release is published. + +- **Updated Homepage Image** + Replaced the index-page illustration to depict the full BioNeuralNet workflow. -- **New release**: A new release will include documentation for the other updates. (1.1.0) \ No newline at end of file +### **Changed** +- **Comprehensive Documentation Update** + - Rebuilt ReadTheDocs site with a new workflow diagram on the landing page. + - Synced API reference to include all new graph-generation, preprocessing, and embedding-integration functions. + - Added quick-start guide, expanded tutorials, and refreshed examples/notebooks. + - Updated narrative docs, docstrings, and licensing info for consistency. + +- **License**: Project is now distributed under the [Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0)](https://creativecommons.org/licenses/by-nc-nd/4.0/).
+ +### **Fixed** +- **Packaging Bug**: Missing `.csv` datasets and `.R` scripts in source distribution; `MANIFEST.in` updated to include all requisite data files. diff --git a/README.md b/README.md index 815be0e..cb6c8e2 100644 --- a/README.md +++ b/README.md @@ -8,7 +8,7 @@ [![Documentation](https://img.shields.io/badge/docs-read%20the%20docs-blue.svg)](https://bioneuralnet.readthedocs.io/en/latest/) -## Welcome to BioNeuralNet 1.0.9 +## Welcome to BioNeuralNet 1.1.0 ![BioNeuralNet Logo](assets/LOGO_WB.png) diff --git a/bioneuralnet/__init__.py b/bioneuralnet/__init__.py index 17af74f..3a8d59c 100644 --- a/bioneuralnet/__init__.py +++ b/bioneuralnet/__init__.py @@ -29,7 +29,7 @@ - `datasets`: Contains example (synthetic) datasets for testing and demonstration purposes. """ -__version__ = "1.0.9" +__version__ = "1.1.0" from .network_embedding import GNNEmbedding from .downstream_task import SubjectRepresentation diff --git a/docs/jupyter_execute/Quick_Start.ipynb b/docs/jupyter_execute/Quick_Start.ipynb index 827cc56..b7f01e0 100644 --- a/docs/jupyter_execute/Quick_Start.ipynb +++ b/docs/jupyter_execute/Quick_Start.ipynb @@ -913,7 +913,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "BioNeuralNet version: 1.0.9\n" + "BioNeuralNet version: 1.1.0\n" ] } ], diff --git a/docs/jupyter_execute/TCGA-BRCA_Dataset.ipynb b/docs/jupyter_execute/TCGA-BRCA_Dataset.ipynb index b95a5af..d3e6011 100644 --- a/docs/jupyter_execute/TCGA-BRCA_Dataset.ipynb +++ b/docs/jupyter_execute/TCGA-BRCA_Dataset.ipynb @@ -60,27 +60,6 @@ "- [Direct Download BRCA](http://firebrowse.org/?cohort=BRCA&download_dialog=true)\n" ] }, - { - "cell_type": "code", - "execution_count": 1, - "id": "60a6b53c", - "metadata": {}, - "outputs": [], - "source": [ - "# adjusting global pandas options for better display on web documentation\n", - "import pandas as pd\n", - "import warnings\n", - "import logging\n", - "\n", - "pd.set_option(\"display.max_columns\", 5)\n", - 
"pd.set_option(\"display.expand_frame_repr\", False)\n", - "warnings.filterwarnings(\"ignore\", category=UserWarning)\n", - "warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n", - "logging.getLogger(\"ray\").setLevel(logging.ERROR)\n", - "logging.getLogger(\"ray.tune\").setLevel(logging.ERROR)\n", - "logging.getLogger(\"torch_geometric\").setLevel(logging.ERROR)" - ] - }, { "cell_type": "markdown", "id": "c9698b74", diff --git a/docs/source/Quick_Start.ipynb b/docs/source/Quick_Start.ipynb index 5c93cfe..0bc7175 100644 --- a/docs/source/Quick_Start.ipynb +++ b/docs/source/Quick_Start.ipynb @@ -913,7 +913,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "BioNeuralNet version: 1.0.9\n" + "BioNeuralNet version: 1.1.0\n" ] } ], diff --git a/docs/source/_autosummary/bioneuralnet.utils.graph.rst b/docs/source/_autosummary/bioneuralnet.utils.graph.rst index f0bdf40..4b470a7 100644 --- a/docs/source/_autosummary/bioneuralnet.utils.graph.rst +++ b/docs/source/_autosummary/bioneuralnet.utils.graph.rst @@ -15,6 +15,7 @@ bioneuralnet.utils.graph gen_similarity_graph gen_snn_graph gen_threshold_graph + get_logger .. rubric:: Classes diff --git a/docs/source/_autosummary/bioneuralnet.utils.preprocess.rst b/docs/source/_autosummary/bioneuralnet.utils.preprocess.rst index ee6c2d4..1ea9552 100644 --- a/docs/source/_autosummary/bioneuralnet.utils.preprocess.rst +++ b/docs/source/_autosummary/bioneuralnet.utils.preprocess.rst @@ -15,7 +15,6 @@ bioneuralnet.utils.preprocess multipletests network_remove_high_zero_fraction network_remove_low_variance - overload preprocess_clinical prune_network prune_network_by_quantile @@ -28,7 +27,6 @@ bioneuralnet.utils.preprocess .. 
autosummary:: - OrdinalEncoder RandomForestClassifier RandomForestRegressor RobustScaler diff --git a/docs/source/_static/BioNeuralNet.png b/docs/source/_static/BioNeuralNet.png index 9377bfe..19845b4 100644 Binary files a/docs/source/_static/BioNeuralNet.png and b/docs/source/_static/BioNeuralNet.png differ diff --git a/docs/source/_static/BioNeuralNet_old1.png b/docs/source/_static/BioNeuralNet_old1.png new file mode 100644 index 0000000..9377bfe Binary files /dev/null and b/docs/source/_static/BioNeuralNet_old1.png differ diff --git a/docs/source/clustering.rst b/docs/source/clustering.rst index ee2c997..ecd8fb6 100644 --- a/docs/source/clustering.rst +++ b/docs/source/clustering.rst @@ -1,158 +1,95 @@ Correlated Clustering ===================== -BioNeuralNet includes internal modules for performing **correlated clustering** on complex networks. -These methods extend traditional community detection by integrating **phenotype correlation**, allowing users to extract **biologically relevant, phenotype-associated modules** from any network. +BioNeuralNet provides **correlated clustering methods** designed specifically to identify biologically relevant communities within multi-omics networks. By integrating **phenotype correlations**, these approaches enhance traditional community detection methods, capturing biologically meaningful network modules strongly associated with clinical or phenotypic outcomes. -Overview --------- +Key Features +------------ +- **Phenotype-Aware Clustering**: Incorporates external phenotype information directly into clustering algorithms, resulting in communities that are both structurally cohesive and biologically meaningful. +- **Flexible Application**: Methods are applicable to any network data represented as adjacency matrices, facilitating diverse research scenarios including biomarker discovery and functional module identification. 
+- **Integration with Downstream Analysis**: Clusters obtained can directly feed into downstream tasks such as disease prediction, feature selection, and biomarker identification. -Our framework supports three key **correlated clustering** approaches: +Supported Clustering Methods +---------------------------- -- **Correlated PageRank**: +Correlated PageRank +------------------- +A variant of PageRank that biases node rankings toward phenotype-relevant nodes, prioritizing features with strong phenotype associations: - - A **modified PageRank algorithm** that prioritizes nodes based on their correlation with an external phenotype. - - - The **personalization vector** is computed using phenotype correlation, ensuring that **biologically significant nodes receive more influence**. - - - This method is ideal for **identifying high-impact nodes** within a given network. +.. math:: -- **Correlated Louvain**: + \mathbf{r} = \alpha \cdot \mathbf{M} \mathbf{r} + (1 - \alpha) \mathbf{p} - - An adaptation of the **Louvain community detection algorithm**, modified to optimize for **both network modularity and phenotype correlation**. - - The objective function for community detection is given by: +- :math:`\mathbf{M}`: Normalized adjacency (transition probability matrix). +- :math:`\mathbf{p}`: Phenotype-informed personalization vector (based on correlation). +- Ideal for ranking biologically impactful nodes. - .. math:: +Correlated Louvain +------------------ +Modifies Louvain community detection to balance structural modularity and phenotype correlation, optimizing: - Q^* = k_L \cdot Q + (1 - k_L) \cdot \overline{\lvert \rho \rvert}, +.. math:: - where: + Q^* = k_L \cdot Q + (1 - k_L) \cdot \overline{\lvert \rho \rvert} - - :math:`Q` is the standard **Newman-Girvan modularity**, defined as: +- :math:`Q`: Newman-Girvan modularity, measuring network structural cohesiveness. 
+- :math:`\overline{\lvert \rho \rvert}`: Mean absolute Pearson correlation between cluster features and phenotype. +- :math:`k_L`: User-defined parameter balancing structure and phenotype relevance. +- Efficient for identifying phenotype-enriched communities. - .. math:: +Hybrid Louvain (Iterative Refinement) +------------------------------------- +Combines Correlated Louvain with Correlated PageRank iteratively to refine community assignments: - Q = \frac{1}{2m} \sum_{i,j} \bigl(A_{ij} - \frac{k_i k_j}{2m} \bigr) \delta(c_i, c_j), - - where :math:`A_{ij}` represents the adjacency matrix, :math:`k_i` and :math:`k_j` are node degrees, and :math:`\delta(c_i, c_j)` indicates whether nodes belong to the same community. - - :math:`\overline{\lvert \rho \rvert}` is the **mean absolute Pearson correlation** between the **first principal component (PC1) of the subgraph's features** and the phenotype. - - :math:`k_L` is a user-defined weight (e.g., :math:`k_L = 0.2`), balancing **network modularity and phenotype correlation**. - - - This method **detects communities** that are both **structurally cohesive and strongly associated with phenotype**. - -- **Hybrid Louvain**: - - - A **refinement approach** that combines **Correlated Louvain** and **Correlated PageRank** in an iterative process. - - - The key steps are: - - 1. **Initial Community Detection**: - - - The **input network (adjacency matrix)** is clustered using **Correlated Louvain**. - - This identifies **initial phenotype-associated modules**. - - 2. **Iterative Refinement with Correlated PageRank**: - - - In each iteration: - - - The **most correlated module** is **expanded** based on Correlated PageRank. - - The refined network is **re-clustered using Correlated Louvain**. - - This process continues **until convergence**. - - 3. **Final Cluster Extraction**: - - - The final **phenotype-optimized modules** are extracted and returned. 
- - The quality of the clustering is measured using **both modularity and phenotype correlation metrics**. +1. Initial clustering using Correlated Louvain identifies phenotype-associated modules. +2. Clusters iteratively refined by expanding highly correlated modules using Correlated PageRank. +3. Repeated until convergence, producing optimized phenotype-associated communities. .. figure:: _static/hybrid_clustering.png :align: center - :alt: Overview hybrid clustering workflow - - **Hybrid Clustering**: Precedure and steps for the hybrid clustering method. + :alt: Hybrid Clustering Workflow + Workflow: Hybrid Louvain iteratively integrates Correlated PageRank and Correlated Louvain to produce refined phenotype-associated clusters. -Mathematical Approach +Comparison of Methods --------------------- - -**Correlated PageRank:** - - - Correlated PageRank extends the traditional PageRank formulation by **biasing the random walk towards phenotype-associated nodes**. - - - The **ranking function** is defined as: - - .. math:: - - \mathbf{r} = \alpha \cdot \mathbf{M} \mathbf{r} + (1 - \alpha) \mathbf{p}, - - where: - - - :math:`\mathbf{M}` is the transition probability matrix, derived from the **normalized adjacency matrix**. - - :math:`\mathbf{p}` is the **personalization vector**, computed using **phenotype correlation**. - - :math:`\alpha` is the **teleportation factor** (default: :math:`\alpha = 0.85`). - -- Unlike standard PageRank, which assumes a **uniform teleportation distribution**, **Correlated PageRank prioritizes phenotype-relevant nodes**. - -Graphical Comparison --------------------- - -Below is an illustration of **different clustering approaches** on a sample network: +The figure below illustrates the difference between standard and correlated clustering methods, highlighting BioNeuralNet's ability to extract biologically meaningful modules. .. 
figure:: _static/clustercorrelation.png :align: center - :alt: Comparison of Correlated Clustering Methods - - **Figure 2:** Comparison between SmCCNet generated clusters and Correlated Louvain clusters - -Integration with BioNeuralNet ------------------------------- + :alt: Clustering Method Comparison -Our **correlated clustering methods** seamlessly integrate into **BioNeuralNet** and can be applied to **any network represented as an adjacency matrix**. + Comparison: Standard (SmCCNet) versus Correlated Louvain clusters. -Use cases include: +Applications and Use Cases +-------------------------- +BioNeuralNet correlated clustering is versatile and suitable for diverse network analyses: - - **Multi-Omics Networks**: Extracting **biologically relevant subgraphs** from gene expression, proteomics, or metabolomics data. - - **Brain Connectivity Graphs**: Identifying **functional modules associated with neurological disorders**. - - **Social & Disease Networks**: Detecting **community structures in epidemiology and patient networks**. +- **Multi-Omics Networks**: Extract biologically relevant gene/protein modules associated with clinical phenotypes. +- **Neuroimaging Networks**: Identify functional brain modules linked to neurological diseases. +- **Disease Networks**: Reveal patient or epidemiological network communities strongly linked to clinical outcomes. -Our framework supports: +Integration into BioNeuralNet Workflow +-------------------------------------- +Clustering outputs seamlessly feed into downstream BioNeuralNet modules: - - **Graph Neural Network Embedding**: Training GNNs on **phenotype-optimized clusters**. - - - **Predictive Biomarker Discovery**: Identifying key **features associated with disease outcomes**. - - - **Customizable Modularity Optimization**: Allowing users to **adjust the trade-off between structure and phenotype correlation**. +- **GNN Embedding Generation**: Train Graph Neural Networks on phenotype-enriched clusters. 
+- **Disease Prediction (DPMON)**: Utilize phenotype-associated modules for improved predictive accuracy. +- **Biomarker Discovery**: Extract features or modules strongly predictive of disease status. -Notes for Users ---------------- - -1. **Input Requirements**: - - - Any **graph-based dataset** can be used as input, provided as an **adjacency matrix**. - - - Phenotype data should be supplied in **numerical format** (e.g., disease severity scores, expression levels). - -2. **Cluster Comparison**: - - - **Correlated Louvain extracts phenotype-associated modules.** - - - **Hybrid Louvain iteratively refines clusters using Correlated PageRank.** - - - Users can compare results using **modularity scores and phenotype correlation metrics**. - -3. **Method Selection**: +User Recommendations +-------------------- +- **Correlated PageRank**: Best for prioritizing individual high-impact features or nodes. +- **Correlated Louvain**: Ideal for extracting phenotype-associated functional communities efficiently. +- **Hybrid Louvain**: Recommended for maximal biological interpretability, particularly in complex multi-omics scenarios. - - **Correlated PageRank** is ideal for **ranking high-impact nodes in a phenotype-aware manner**. - - - **Correlated Louvain** is best for **detecting phenotype-associated communities**. - - - **Hybrid Louvain** provides the most refined, **biologically meaningful clusters**. +Reference and Further Reading +----------------------------- +For detailed methodology and benchmarking, refer to our publication: -Conclusion ----------- +- Abdel-Hafiz et al., Frontiers in Big Data, 2022. [1]_ -The **correlated clustering methods** implemented in BioNeuralNet provide a **powerful, flexible framework** for extracting **highly structured, phenotype-associated modules** from any network. 
-By integrating **phenotype correlation directly into the clustering process**, these methods enable **more biologically relevant and disease-informative network analysis**. +Return to :doc:`../index` -paper link: https://doi.org/10.3389/fdata.2022.894632 +.. [1] Abdel-Hafiz, M., Najafi, M., et al. "Significant Subgraph Detection in Multi-omics Networks for Disease Pathway Identification." *Frontiers in Big Data*, 5 (2022). DOI: `10.3389/fdata.2022.894632 <https://doi.org/10.3389/fdata.2022.894632>`_. -Return to :doc:`../index` diff --git a/docs/source/conf.py b/docs/source/conf.py index 40953d5..a73203d 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -7,7 +7,7 @@ try: release = metadata.version("bioneuralnet") except metadata.PackageNotFoundError: - release = "1.0.9" + release = "1.1.0" project = "BioNeuralNet" version = release diff --git a/docs/source/downstream_tasks.rst b/docs/source/downstream_tasks.rst index 6bdad86..8951dba 100644 --- a/docs/source/downstream_tasks.rst +++ b/docs/source/downstream_tasks.rst @@ -1,33 +1,33 @@ Downstream Tasks ================ -BioNeuralNet's core innovation is the generation of low-dimensional embeddings that unlock a variety of downstream applications. By reducing the complexity of multi-omics data, these embeddings not only boost computational efficiency but also improve the accuracy of predictive models and enable exploratory analyses. +BioNeuralNet leverages **graph neural network (GNN)-based embeddings** to transform complex multi-omics networks into compact, biologically meaningful representations. These embeddings serve as a powerful foundation for diverse downstream analyses, such as disease prediction, enhanced subject-level profiling, biomarker discovery, and exploratory visualization. -Overview --------- +Core Downstream Applications +---------------------------- -The embeddings generated by Graph Neural Networks (GNNs) serve as a transformative feature of BioNeuralNet.
They allow users to: +By capturing both structural and functional relationships inherent in multi-omics data, BioNeuralNet-generated embeddings unlock key applications: -- **Predict Disease Outcomes**: Using the end-to-end DPMON pipeline. -- **Enhance Subject Representation**: By integrating embeddings back into omics data, thus enriching patient-level profiles. -- **Facilitate Other Downstream Analyses**: Such as clustering, biomarker discovery, and visualization. +- **Disease Prediction**: Leverage network-derived embeddings in an end-to-end pipeline (DPMON) for accurate disease classification. +- **Enhanced Subject Representation**: Integrate embeddings with original omics data to improve predictive modeling and patient stratification. +- **Exploratory Analysis**: Facilitate visualization, biomarker identification, and phenotype-associated clustering using low-dimensional embeddings. .. image:: _static/Overview.png :align: center - :alt: Overview of Downstream Tasks - :width: 90% + :alt: BioNeuralNet Embedding Applications + :width: 100% -DPMON: Disease Prediction Pipeline ----------------------------------- +Disease Prediction (DPMON) +-------------------------- -The **DPMON** module provides a seamless, end-to-end workflow for disease prediction. It combines the network adjacency matrix, GNN-based embeddings, and subject-level data to deliver robust predictions. With built-in hyperparameter tuning, DPMON adapts to your data for optimal performance. +BioNeuralNet's **DPMON** module provides a streamlined, end-to-end framework for disease classification tasks. It integrates multi-omics data, phenotype-informed network embeddings, and clinical covariates to deliver robust predictive models with optional automated hyperparameter tuning. .. image:: _static/DPMON.png :align: center - :alt: Disease Prediction with DPMON - :width: 90% + :alt: DPMON Disease Prediction Workflow + :width: 100% -Example Usage: +**Example Usage**: ..
code-block:: python @@ -36,14 +36,14 @@ Example Usage: from bioneuralnet.downstream_task import DPMON from bioneuralnet.datasets import DatasetLoader - # Step 1: Load your data or use one of the provided datasets - Example = DatasetLoader("example1") - omics_genes = Example.data["X1"] - omics_proteins = Example.data["X2"] - phenotype = Example.data["Y"] - clinical = Example.data["clinical_data"] + # Step 1: Load sample data + example = DatasetLoader("example1") + omics_genes = example.data["X1"] + omics_proteins = example.data["X2"] + phenotype = example.data["Y"] + clinical = example.data["clinical_data"] - # Step 2: Network Construction + # Step 2: Construct phenotype-aware network smccnet = SmCCNet( phenotype_df=phenotype, omics_dfs=[omics_genes, omics_proteins], @@ -52,9 +52,8 @@ Example Usage: summarization="PCA", ) global_network, clusters = smccnet.run() - print("Adjacency matrix generated.") - # Step 3: Disease Prediction (DPMON) + # Step 3: Disease prediction with DPMON dpmon = DPMON( adjacency_matrix=global_network, omics_list=[omics_genes, omics_proteins], @@ -65,49 +64,46 @@ Example Usage: predictions, avg_accuracy = dpmon.run() print("Disease phenotype predictions:\n", predictions) -Subject Representation & Embedding Integration ------------------------------------------------ +Enhanced Subject Representation +------------------------------- -Beyond disease prediction, the learned embeddings can be re-integrated into subject-level data to enrich the feature set. This enhanced subject representation supports downstream tasks such as: +Beyond predictive modeling, BioNeuralNet embeddings can be integrated directly into subject-level multi-omics data to enhance the discriminative power and interpretability of patient profiles. This enriched representation supports several analytical tasks: -- **Biomarker Discovery**: Identifying key omics features that drive disease. -- **Enhanced Clustering**: Grouping patients more effectively based on integrated data. 
-- **Data Visualization**: Leveraging low-dimensional representations for intuitive plotting and network analysis. +- **Biomarker Discovery**: Highlight key molecular features strongly associated with clinical outcomes. +- **Patient Stratification**: Improve clustering and subgroup identification based on embedding-enhanced profiles. +- **Visualization and Interpretation**: Facilitate intuitive exploration of high-dimensional data through embedding-based visualizations. .. image:: _static/SubjectRepresentation.png :align: center - :alt: Subject Representation Workflow - :width: 80% + :alt: Subject-Level Embedding Integration + :width: 100% -Unlocking Downstream Applications ---------------------------------- +Embedding-Based Exploratory Analysis +------------------------------------ -By lowering the dimensionality, BioNeuralNet's embeddings simplify complex multi-omics data into actionable insights. This approach: +BioNeuralNet's low-dimensional embeddings simplify complex omics relationships, providing intuitive entry points into exploratory data analysis, including: -- **Accelerates Predictive Modeling**: Making it easier to integrate with machine learning frameworks. -- **Improves Interpretability**: Allowing users to trace back the contribution of each omics feature. -- **Enables Custom Workflows**: While we support key pipelines like DPMON out-of-the-box, the embeddings can also be used in custom downstream applications. +- **Community Detection**: Uncover biologically relevant clusters linked to clinical phenotypes. +- **Feature Importance Analysis**: Evaluate embedding contributions to predictive models, enhancing interpretability. +- **Interactive Visualization**: Integrate embeddings seamlessly with common Python visualization libraries (e.g., Matplotlib, Seaborn) for insightful plots and network representations.
-Other downstream tasks include, but are not limited to: +Customization and Extensibility +------------------------------- -- **Predictive Analytics** -- **Community Detection in Networks** -- **Interactive Data Exploration** +BioNeuralNet is designed with modularity and flexibility in mind. Users can easily adapt embedding outputs for custom analytical workflows, integrating them into broader bioinformatics pipelines or developing novel downstream applications tailored to specific research goals. -Get Started ------------ +Getting Started +--------------- -BioNeuralNet is designed not only to provide robust downstream pipelines but also to empower researchers to develop their own custom analyses. With the combination of disease prediction, subject representation, and other downstream tools, users can seamlessly integrate these components into their broader multi-omics workflows. +To explore comprehensive end-to-end analyses and practical tutorials, refer to: -For further details, check out our end-to-end jupyter notebook and code tutorials: - - - :doc:`Quick_Start` - - :doc:`TCGA-BRCA_Dataset` - - :doc:`tutorials/example_1` - - :doc:`tutorials/example_2` +- :doc:`Quick_Start` +- :doc:`TCGA-BRCA_Dataset` +- :doc:`tutorials/example_1` +- :doc:`tutorials/example_2` References ---------- -For more in-depth information on the methodologies and models, please refer to the related documentation pages and our published works. +Further methodological details and model insights can be found in our documentation and accompanying publications. Return to :doc:`../index` diff --git a/docs/source/faq.rst b/docs/source/faq.rst index 5535c33..1b87a24 100644 --- a/docs/source/faq.rst +++ b/docs/source/faq.rst @@ -21,7 +21,7 @@ BioNeuralNet integrates multiple open-source libraries to deliver advanced multi We also acknowledge R-based tools for external network construction: -- **SmCCNet** - Sparse multiple canonical correlation network tool. 
`SmCCNet `_ +- **SmCCNet**: Sparse multiple canonical correlation network tool. `SmCCNet `_ These tools enhance BioNeuralNet's capabilities without being required for its core functionality. @@ -38,7 +38,7 @@ Frequently Asked Questions (FAQ) **Q1: What is BioNeuralNet?**: - - BioNeuralNet is a Python framework for integrating multi-omics data with Graph Neural Networks (GNNs). It provides end-to-end solutions for network embedding, clustering, subject representation, and disease prediction. + - BioNeuralNet is a **flexible, modular Python framework** developed to facilitate end-to-end **network-based multi-omics analysis** using **Graph Neural Networks (GNNs)**. It addresses the complexities associated with multi-omics data (such as high dimensionality, sparsity, and intricate molecular interactions) by converting biological networks into meaningful, low-dimensional embeddings suitable for downstream tasks. **Q2: What are the key features of BioNeuralNet?**: @@ -77,11 +77,13 @@ Frequently Asked Questions (FAQ) **Q8: How can I contribute to BioNeuralNet?**: - Contributions are welcome! You can: + - Report issues or bugs on our `GitHub Issues page `_. - Suggest new features or improvements. - Share your experiences or use cases with the community. - How to contribute: + - Fork the repository, add your features, components, or algorithms, and submit a pull request. - Please refer to our `contribution guidelines `_ for more details. @@ -95,6 +97,6 @@ Frequently Asked Questions (FAQ) **Q10: What license is BioNeuralNet released under?**: - - BioNeuralNet is released under the MIT License. You can find the full license text in the `MIT LICENSE `_ file in the repository. + - BioNeuralNet is distributed under the `Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0) <https://creativecommons.org/licenses/by-nc-nd/4.0/>`_.
Return to :doc:`../index` diff --git a/docs/source/gnns.rst b/docs/source/gnns.rst index 1e02d39..9e957e8 100644 --- a/docs/source/gnns.rst +++ b/docs/source/gnns.rst @@ -1,147 +1,102 @@ GNN Embeddings ============== -BioNeuralNet leverages Graph Neural Networks (GNNs) to generate rich, low-dimensional embeddings that capture the complex relationships inherent in multi-omics data. These embeddings not only preserve the network topology but also integrate biological signals, providing a robust foundation for downstream tasks such as disease prediction. +BioNeuralNet leverages **Graph Neural Networks (GNNs)** to generate biologically meaningful, low-dimensional embeddings from multi-omics network data. These embeddings integrate complex biological interactions and structural information, facilitating accurate downstream analyses, such as phenotype prediction and biomarker discovery. -Key Contributions: ------------------- -- **Enhanced Representation:** By training models such as GCN, GAT, GraphSAGE, and GIN, BioNeuralNet generates node embeddings that reflect both local connectivity and supervised signals each node (omics feature) is associated with a numeric label (e.g., Pearson correlation with phenotype) that guides learning. - -- **Modularity and Interoperability:** The framework is designed in a modular, Python-based fashion. Its outputs are returned as pandas DataFrames, allowing seamless integration with existing data analysis pipelines and facilitating further exploration with external tools. - -- **End-to-End Workflow:** Whether you start with raw multi-omics data or supply your own network, the pipeline proceeds through network construction, embedding generation, and ultimately disease prediction. Ensuring a streamlined workflow from data to actionable insights. 
+Core Features +------------- +- **Biologically Informed Embeddings:** Models like GCN, GAT, GraphSAGE, and GIN produce embeddings informed by network connectivity and biologically relevant supervised signals (e.g., phenotype correlations). +- **Flexible, Modular Integration:** Outputs are structured as pandas DataFrames, compatible with common bioinformatics workflows. +- **Comprehensive Workflow:** Handles data from initial network construction through embedding generation to disease prediction in a unified, end-to-end pipeline. -GNN Model Overviews -------------------- -**Graph Convolutional Network (GCN)**: GCN layers aggregate information from neighboring nodes via a spectral-based convolution: +Supported GNN Architectures +--------------------------- +**Graph Convolutional Network (GCN)**: +GCN aggregates node features based on local neighborhood structure using spectral-based convolution: .. math:: - X^{(l+1)} \;=\; \mathrm{ReLU}\!\Bigl(\widehat{D}^{-\tfrac{1}{2}}\,\widehat{A}\,\widehat{D}^{-\tfrac{1}{2}}\, - X^{(l)}\,W^{(l)}\Bigr), + X^{(l+1)} \;=\; \mathrm{ReLU}\!\Bigl(\widehat{D}^{-\tfrac{1}{2}}\,\widehat{A}\,\widehat{D}^{-\tfrac{1}{2}}\, + X^{(l)}\,W^{(l)}\Bigr) -where :math:`\widehat{A}` adds self-loops to the adjacency matrix, ensuring that each node also considers its own features. +- where :math:`\widehat{A}` is the adjacency matrix with self-loops added, ensuring that each node also considers its own features, and :math:`\widehat{D}` is the diagonal degree matrix of :math:`\widehat{A}`. -**Graph Attention Network (GAT)**: GAT layers learn attention weights to prioritize the most informative neighbors: +**Graph Attention Network (GAT)**: +GAT learns attention coefficients that weight each neighbor's contribution, prioritizing the most informative neighbors: ..
math:: - h_{i}^{(l+1)} \;=\; \mathrm{ELU}\!\Bigl(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\,W^{(l)}\,h_{j}^{(l)}\Bigr), + h_{i}^{(l+1)} \;=\; \mathrm{ELU}\!\Bigl(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)}\,W^{(l)}\,h_{j}^{(l)}\Bigr) -with :math:`\alpha_{ij}^{(l)}` representing the attention coefficient for node :math:`j`'s contribution to node :math:`i`. +- with :math:`\alpha_{ij}^{(l)}` representing the attention coefficient for node :math:`j`'s contribution to node :math:`i`. -**GraphSAGE**: GraphSAGE computes embeddings by concatenating a node's own features with an aggregated summary of its neighbors: +**GraphSAGE**: +GraphSAGE performs inductive learning by aggregating neighboring node features to generalize effectively to unseen data: .. math:: h_{i}^{(l+1)} \;=\; \sigma\!\Bigl(W^{(l)}\Bigl( h_{i}^{(l)} \,\|\, \mathrm{mean}_{j \,\in\, \mathcal{N}(i)}(h_{j}^{(l)}) - \Bigr)\Bigr), + \Bigr)\Bigr) -where the mean aggregator provides a simple yet effective way to summarize local neighborhood information. +- where the mean aggregator provides a simple yet effective way to summarize local neighborhood information. -**Graph Isomorphism Network (GIN)**: GIN uses a sum-aggregator combined with a learnable parameter and an MLP to capture subtle differences in network structure: +**Graph Isomorphism Network (GIN)**: +GIN leverages sum-aggregation and an MLP to discriminate subtle structural variations between graphs: .. math:: h_i^{(l+1)} \;=\; \mathrm{MLP}^{(l)}\!\Bigl(\,\bigl(1 + \epsilon^{(l)}\bigr) - h_{i}^{(l)} + \sum_{j \in \mathcal{N}(i)} h_{j}^{(l)}\Bigr), - -where :math:`\epsilon^{(l)}` is either learnable or fixed. - -Dimensionality Reduction and Downstream Integration ---------------------------------------------------- - -After obtaining high-dimensional node embeddings from the penultimate GNN layer, BioNeuralNet applies dimensionality reduction (using PCA or autoencoders) to summarize each node with a single value. 
These reduced embeddings are then integrated into subject-level omics data, yielding enhanced feature representations that boost the performance of predictive models (e.g., via DPMON for disease prediction). - -By using GNNs to capture both structural and biological signals, BioNeuralNet delivers embeddings that truly reflect the complexity of multi-omics networks. + h_{i}^{(l)} + \sum_{j \in \mathcal{N}(i)} h_{j}^{(l)}\Bigr) -Task-Driven (Supervised/Semi-Supervised) GNNs ---------------------------------------------- -In our work, the GNNs are primarily **task-driven**: +- where :math:`\epsilon^{(l)}` is either learnable or fixed. -- **Node Labeling via Phenotype Correlation:** For each node, we compute the Pearson correlation between the omics data and phenotype (or clinical) data. This correlation serves as the target label during training. +Task-Driven Embeddings for Phenotype Prediction +----------------------------------------------- +BioNeuralNet generates embeddings optimized for disease prediction through supervised and semi-supervised training: -- **Supervised Training Objective:** The GNN is trained to predict these correlation values using a Mean Squared Error (MSE) loss. This strategy aligns node embeddings with biological signals relevant to the disease phenotype. +- **Phenotype-Guided Labels:** Nodes labeled by correlation with clinical or phenotype data. +- **Supervised Training Objective:** Minimizes MSE between predicted node correlations and actual phenotype correlations, ensuring biologically relevant embeddings. +- **Subject-Level Integration:** Embeddings enhance patient-level datasets, significantly improving classification performance via DPMON (Disease Prediction using Multi-Omics Networks). -- **Downstream Integration:** The learned node embeddings can be integrated into patient-level datasets for sample-level classification tasks. 
For example, **DPMON** (Disease Prediction using Multi-Omics Networks) leverages these embeddings in an end-to-end pipeline where the final objective is to classify disease outcomes. - -Generating Low-Dimensional Embeddings for Multi-Omics ------------------------------------------------------ -The following figure illustrates an end-to-end workflow, from raw omics data to correlation-based node labeling, GNN-driven embedding generation, dimensionality reduction, and final integration into subject-level features: +Embedding Generation Workflow +----------------------------- +Embeddings produced by BioNeuralNet capture both topological and biological insights from multi-omics networks: .. figure:: _static/SubjectRepresentation.png :align: center :alt: Subject Representation Workflow - A high-level overview of BioNeuralNet's process for creating enhanced subject-level representations. Nodes represent omics features, labeled by correlation to a phenotype; GNNs learn embeddings that are reduced (PCA/autoencoder) and then reintegrated into the original omics data for improved predictive performance. + Workflow: Nodes labeled by phenotype correlation, embedded via GNNs, dimensionally reduced (PCA/Autoencoder), then integrated into subject-level data for enhanced predictive accuracy. `View full-size image: Subject Representation `_ -Key Insights into GNN Parameters and Outputs --------------------------------------------- -1. **Input Parameters:** - - - **Node Features Matrix:** Built by correlating omics data with clinical variables. - - - **Edge Index:** Derived from the network's adjacency matrix. - - - **Target Labels:** Numeric values representing the correlation between omics features and phenotype data. - -2. **Output Embeddings:** - - - The penultimate layer of the GNN produces dense node embeddings that capture both local connectivity and supervised signals. 
- - These embeddings can be further reduced (e.g., via PCA or an Autoencoder) for visualization or integrated into subject-level data. - -Dimensionality Reduction: PCA vs. Autoencoders ----------------------------------------------- - -After training a GNN, the resulting node embeddings are typically high-dimensional. To integrate these embeddings into the original omics data-by reweighting each feature-a further reduction step is performed to obtain a single summary value per feature. BioNeuralNet supports two primary approaches for this reduction: +Dimensionality Reduction +------------------------ +BioNeuralNet provides two dimensionality reduction techniques that are applied to the GNN embeddings: -**Principal Component Analysis (PCA):** +- **PCA**: Simple, linear, interpretable, suitable for datasets where linear assumptions hold. +- **Autoencoders**: Nonlinear, flexible neural-network-based approach that can capture complex biological patterns. Hyperparameter tuning (`tune=True`) is recommended for high-dimensional or complex data. -PCA is a linear dimensionality reduction technique that computes orthogonal components capturing the maximum variance in the data. The first principal component (PC1) is often used as a concise summary of each feature's variation. PCA is: +How DPMON Utilizes GNN Embeddings +--------------------------------- +**DPMON** extends embedding applications to patient-level phenotype prediction: -- **Deterministic and Fast:** A closed-form solution is computed from the covariance matrix. - -- **Simple and Interpretable:** The linear combination of the original variables is straightforward to understand. - -- **Limited to Linear Relationships:** It may not capture more complex, nonlinear structures in the data. - -**Autoencoders (AE):** - -Autoencoders are neural network models designed to learn a compressed representation (latent code) through a bottleneck architecture.
They use nonlinear activations (e.g., ReLU) to model complex relationships: - -- **Nonlinear Transformation:** The encoder learns to capture intricate patterns that a linear method might miss. - -- **Learned Representations:** The latent code is obtained by minimizing a reconstruction loss, making it adaptive to the data. - -- **Flexible and Tunable:** Being neural network-based, autoencoders allow tuning of architecture parameters (e.g., number of layers, hidden dimensions, epochs, learning rate) to better capture the signal. In our framework, we highly recommend using autoencoders (i.e., setting `tune=True`) to leverage their enhanced expressivity for complex multi-omics data. - -In practice, PCA offers simplicity and interpretability, whereas autoencoders may yield superior performance by capturing more nuanced nonlinear relationships. The choice depends on the complexity of your data and the computational resources available. Our recommendation is to enable tuning (using `tune=True`) to optimize the autoencoder parameters for your specific dataset. - -How DPMON Uses GNNs Differently -------------------------------- -**DPMON** (Disease Prediction using Multi-Omics Networks) reuses the same GNN architectures but with a different objective: - -- Instead of node-level MSE regression, DPMON aggregates node embeddings with patient-level omics data. - -- A downstream classification head (e.g., softmax layer with CrossEntropyLoss) is applied for sample-level disease prediction. - -- This end-to-end approach leverages both local (node-level) and global (patient-level) network information. +- Integrates node embeddings directly into patient-level features. +- Uses a classification head (e.g., softmax with cross-entropy) trained to predict clinical outcomes. +- Leverages both local molecular interaction information (node-level embeddings) and global omics data, yielding highly accurate phenotype predictions. .. 
figure:: _static/DPMON.png :align: center :alt: Disease Prediction (DPMON) - Embedding-enhanced subject data using DPMON for improved disease prediction. + DPMON leverages GNN embeddings integrated with patient data for robust disease prediction. `View full-size image: Disease Prediction (DPMON) `_ -Example Usage -------------- -Below is a simplified example that demonstrates the task-driven approach-where node labels are derived from phenotype correlations and used to train the GNN: +Example Code: Training a GNN Embedding Model +-------------------------------------------- +Below is a simplified example showing how to train GNN embeddings guided by phenotype correlations: .. code-block:: python diff --git a/docs/source/index.rst b/docs/source/index.rst index f9f4f9b..ed8d8a1 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,5 +1,5 @@ -BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Tool -================================================================================= +BioNeuralNet: Graph Neural Networks for Multi-Omics Network Analysis +==================================================================== .. image:: https://img.shields.io/badge/License-CC%20BY--NC--ND%204.0-lightgrey.svg :target: https://creativecommons.org/licenses/by-nc-nd/4.0/ @@ -18,89 +18,92 @@ BioNeuralNet: A Graph Neural Network based Multi-Omics Network Data Analysis Too :alt: BioNeuralNet Logo -Install BioNeuralNet via pip: ------------------------------ +Installation +------------ + +BioNeuralNet is available as a Python package on PyPI: .. code-block:: bash pip install bioneuralnet -For additional installation details, see :doc:`installation`. +For additional installation details and troubleshooting, see :doc:`installation`. + +Quick Start Examples +-------------------- -For end-to-end examples of `BioNeuralNet`: +Get started quickly with these end-to-end examples demonstrating the BioNeuralNet workflow: - - :doc:`Quick_Start`. 
- - :doc:`TCGA-BRCA_Dataset`. +- :doc:`Quick_Start` +- :doc:`TCGA-BRCA_Dataset` + +**BioNeuralNet Workflow Overview** +---------------------------------- -**BioNeuralNet Overview** -------------------------- .. figure:: _static/BioNeuralNet.png :align: center - :alt: BioNeuralNet Logo - - Embeddings form the core of BioNeuralNet, enabling a number of downstream applications. + :alt: BioNeuralNet Workflow Overview + Embeddings form the core of BioNeuralNet, supporting diverse downstream applications. + `View full-size image: Network-Based Multi-Omics Analysis for Disease Prediction `_ What is BioNeuralNet? --------------------- -BioNeuralNet is a **Python-based** framework designed to bridge the gap between **multi-omics data analysis** and **Graph Neural Networks (GNNs)**. By leveraging advanced techniques, it enables: -- **Graph Clustering**: Identifies biologically meaningful communities within omics networks. -- **GNN Embeddings**: Learns network-based feature representations from biological graphs, capturing both **biological structure** and **feature correlations** for enhanced analysis. -- **Subject Representation**: Generates high-quality embeddings for individuals based on multi-omics profiles. -- **Disease Prediction**: Builds predictive models using integrated multi-layer biological networks. -- **Interoperability**: Component outputs are structured as **pandas DataFrames**, ensuring easy integration with existing workflows and tools. +BioNeuralNet is a **flexible, modular Python framework** developed to facilitate end-to-end **network-based multi-omics analysis** using **Graph Neural Networks (GNNs)**. It addresses the complexities associated with multi-omics data—such as high dimensionality, sparsity, and intricate molecular interactions—by converting biological networks into meaningful, low-dimensional embeddings suitable for downstream tasks. + +BioNeuralNet provides: -Why GNNs? 
---------- -Traditional methods often struggle to model complex multi-omics relationships due to their inability to capture **biological interactions and dependencies**. BioNeuralNet addresses this challenge by utilizing **GNN-powered embeddings**, incorporating models such as: +- **Network Construction**: Easily build informative networks from multi-omics datasets to capture biologically relevant molecular interactions. +- **GNN Embeddings**: Transform complex biological networks into versatile embeddings, capturing both structural relationships and molecular interactions. +- **Phenotype-Aware Analysis**: Integrate phenotype or clinical variables to enhance the biological relevance of the embeddings. +- **Disease Prediction**: Utilize network-derived embeddings for accurate and scalable predictive modeling of diseases and phenotypes. +- **Interoperability**: Outputs structured as **pandas DataFrames**, ensuring compatibility with common Python tools and seamless integration into existing bioinformatics pipelines. -- **Graph Convolutional Networks (GCN)**: Aggregates features from neighboring nodes to capture local structure. -- **Graph Attention Networks (GAT)**: Applies attention mechanisms to prioritize important interactions between biomolecules. -- **GraphSAGE**: Enables inductive learning, making it applicable to unseen omics data. -- **Graph Isomorphism Networks (GIN)**: Improves expressiveness in graph-based learning tasks. +BioNeuralNet emphasizes usability, reproducibility, and adaptability, making advanced network-based multi-omics analyses accessible to researchers working in precision medicine and systems biology. -By integrating omics features within a **network-aware framework**, BioNeuralNet preserves biological interactions, leading to **more accurate and interpretable predictions**. -For a deeper dive into how BioNeuralNet applies GNN embeddings, see :doc:`gnns`. +Why Graph Neural Networks for Multi-Omics? 
+------------------------------------------ -Seamless Data Integration -------------------------- -One of BioNeuralNet's core strengths is **interoperability**: +Traditional machine learning methods often struggle with the complexity and high dimensionality of multi-omics data, particularly their inability to effectively capture intricate molecular interactions and dependencies. BioNeuralNet overcomes these limitations by using **graph neural networks (GNNs)**, which naturally encode biological structures and relationships. -- Outputs are structured as **pandas DataFrames**, ensuring easy downstream analysis. -- Supports integration with **external tools and machine learning frameworks**, making it adaptable to various research workflows. -- Works seamlessly with network-based and graph-learning pipelines. -- Our :doc:`user_api` provides detailed information on how to use BioNeuralNet's modules and functions. +BioNeuralNet supports several state-of-the-art GNN architectures optimized for biological applications: +- **Graph Convolutional Networks (GCN)**: Aggregate biological signals from neighboring molecules, effectively modeling local interactions such as gene co-expression or regulatory relationships. +- **Graph Attention Networks (GAT)**: Use attention mechanisms to dynamically prioritize important molecular interactions, highlighting the most biologically relevant connections. +- **GraphSAGE**: Facilitate inductive learning, enabling the model to generalize embeddings to previously unseen molecular data, thereby enhancing predictive power and scalability. +- **Graph Isomorphism Networks (GIN)**: Provide powerful and expressive graph embeddings, accurately distinguishing subtle differences in molecular interaction patterns. 
-**Example: Transforming Multi-Omics for Enhanced Disease Prediction** ---------------------------------------------------------------------- +For detailed explanations of BioNeuralNet's supported GNN architectures and their biological relevance, see :doc:`gnns`. -`View full-size image: Transforming Multi-Omics for Enhanced Disease Prediction `_ +Example: Network-Based Multi-Omics Analysis for Disease Prediction +------------------------------------------------------------------ + +`View full-size image: Network-Based Multi-Omics Analysis for Disease Prediction `_ .. figure:: _static/Overview.png :align: center - :alt: Overview of BioNeuralNet's multi-omics integration process + :alt: BioNeuralNet's workflow for network-based multi-omics analysis - **BioNeuralNet**: Transforming Multi-Omics for Enhanced Disease Prediction + **BioNeuralNet Workflow**: Network-Based Multi-Omics Analysis for Disease Prediction -Below is a quick example demonstrating the following steps: +Below is a concise example demonstrating the following key steps: 1. **Data Preparation**: - - - Input your multi-omics data (e.g., proteomics, metabolomics) along with phenotype and clinical data. + + - Load your multi-omics data (e.g., transcriptomics, proteomics) along with phenotype and clinical covariates. 2. **Network Construction**: + + - Here, we construct the multi-omics network using an external R package, **SmCCNet** [1]_. + - BioNeuralNet provides convenient wrappers for external tools (like SmCCNet) through `bioneuralnet.external_tools`. Note: R must be installed for these wrappers. - - In this example we generate the network using a external R package (SmCCNet[3]_). - - Lightweight wrappers (SmCCNet) are available in `bioneuralnet.external_tools` for convenience, R is required for their usage. +3. **Disease Prediction with DPMON**: + + - **DPMON** [2]_ integrates omics data and network structures to predict disease phenotypes. 
+ - It provides an end-to-end pipeline, complete with built-in hyperparameter tuning, and outputs predictions directly as pandas DataFrames for easy interoperability. -3. **Disease Prediction**: - - - Use **DPMON** to predict disease phenotypes by integrating the network information with omics data. - - DPMON supports an end-to-end pipeline with hyperparameter tuning that can return predictions as pandas DataFrames, enabling seamless integration with existing workflows. - -**Code Example**: +**Example Usage**: .. code-block:: python @@ -109,14 +112,14 @@ Below is a quick example demonstrating the following steps: from bioneuralnet.downstream_task import DPMON from bioneuralnet.datasets import DatasetLoader - # Step 1: Load your data or use one of the provided datasets - Example = DatasetLoader("example1") - omics_genes = Example.data["X1"] - omics_proteins = Example.data["X2"] - phenotype = Example.data["Y"] - clinical = Example.data["clinical_data"] + # Step 1: Load the dataset and access individual omics modalities + example = DatasetLoader("example1") + omics_genes = example.data["X1"] + omics_proteins = example.data["X2"] + phenotype = example.data["Y"] + clinical = example.data["clinical"] - # Step 2: Network Construction + # Step 2: Network Construction with SmCCNet smccnet = SmCCNet( phenotype_df=phenotype, omics_dfs=[omics_genes, omics_proteins], @@ -127,63 +130,39 @@ Below is a quick example demonstrating the following steps: global_network, clusters = smccnet.run() print("Adjacency matrix generated.") - # Step 3: Disease Prediction (DPMON) + # Step 3: Disease Prediction using DPMON dpmon = DPMON( adjacency_matrix=global_network, omics_list=[omics_genes, omics_proteins], phenotype_data=phenotype, clinical_data=clinical, model="GCN", + repeat_num=5, + tune=True, + gpu=True, + cuda=0, + output_dir="./output" ) predictions, avg_accuracy = dpmon.run() print("Disease phenotype predictions:\n", predictions) -**BioNeuralNet Core Features** 
------------------------------- - -For an End-to-End example example of BioNeuralNet, see :doc:`Quick_Start` and :doc:`TCGA-BRCA_Dataset`. - -:doc:`gnns`: - - Given a multi-omics network as input, BioNeuralNet can generate embeddings using Graph Neural Networks (GNNs). - - Generate embeddings using methods such as **GCN**, **GAT**, **GraphSAGE**, and **GIN**. - - Outputs can be obtained as native tensors or converted to pandas DataFrames for easy analysis and visualization. - - Embeddings unlock numerous downstream applications, including disease prediction, enhanced subject representation, clustering, and more. - -:doc:`clustering`: - - Identify functional modules or communities using **correlated clustering methods** (e.g., CorrelatedPageRank, CorrelatedLouvain, HybridLouvain) that integrate phenotype correlation to extract biologically relevant modules [1]_. - - Clustering methods can be applied to any network represented allowing flexible analysis across different domains. - - All clustering components return either raw partitions dictionaries or induced subnetwork adjacency matrices (as DataFrames) for visualization. - - Use cases include, feature selection, biomarker discovery, and network-based analysis. - -:doc:`downstream_tasks`: - - **Subject Representation**: - - Integrate node embeddings back into omics data to enrich subject-level profiles by weighting features with learned embedding. - - This embedding-enriched data can be used for downstream tasks such as disease prediction or biomarker discovery. - - The result can be returned as a DataFrame or a PyTorch tensor, fitting naturally into downstream analyses. - - - **Disease Prediction for Multi-Omics Network DPMON** [2]_: - - Classification End-to-End pipeline for disease prediction using Graph Neural Network embeddings. - - DPMON supports hyperparameter tuning-when enabled, it finds the best for the given data. 
- - This approach, along with the native pandas integration across modules, ensures that BioNeuralNet can be easily incorporated into your analysis workflows. - -:doc:`metrics`: - - Visualize embeddings, feature variance, clustering comparison, and network structure in 2D. - - Evaluate embedding quality and clustering relevance using correlation with phenotype. - - Performance benchmarking tools for classification tasks using various models. - - Useful for assessing feature importance, validating network structure, and comparing cluster outputs. - -:doc:`utils`: - - Build graphs using k-NN similarity, Pearson/Spearman correlation, RBF kernels, mutual information, or soft-thresholding. - - Filter and preprocess omics or clinical data by variance, correlation, random forest importance, or ANOVA F-test. - - Tools for network pruning, feature selection, and data cleaning. - - Quickly summarize datasets with variance, zero-fraction, expression level, or correlation overviews. - - Includes conversion tools for RData and integrated logging. - -:doc:`external_tools/index`: - - **Graph Construction**: - - BioNeuralNet provides additional tools in the `bioneuralnet.external_tools` module. - - Includes support for **SmCCNet** (Sparse Multiple Canonical Correlation Network), an R-based tool for constructing phenotype-informed correlation networks [3]_. - - These tools are optional but enhance BioNeuralNet's graph construction capabilities and are recommended for more integrative or exploratory workflows. +Explore BioNeuralNet's Documentation +------------------------------------ + +For detailed examples and tutorials, visit: + +- :doc:`Quick_Start`: A quick walkthrough demonstrating the BioNeuralNet workflow from start to finish. +- :doc:`TCGA-BRCA_Dataset`: A detailed real-world example applying BioNeuralNet to breast cancer subtype prediction. 
+ +**Documentation Sections:** + +- :doc:`gnns`: Overview of supported GNN architectures (GCN, GAT, GraphSAGE, GIN) and embedding generation. +- :doc:`clustering`: How to identify biologically relevant functional modules using correlated clustering methods. +- :doc:`downstream_tasks`: Performing downstream analyses such as subject representation and phenotype prediction (DPMON). +- :doc:`metrics`: Methods for visualization, quality evaluation, and performance benchmarking. +- :doc:`utils`: Tools for preprocessing, feature selection, network construction, and data summarization. +- :doc:`external_tools/index`: Integration of external methods, such as SmCCNet, for advanced network construction. + .. toctree:: :maxdepth: 2 @@ -209,6 +188,5 @@ Indices and References * :ref:`modindex` * :ref:`search` -.. [1] Abdel-Hafiz, M., Najafi, M., et al. "Significant Subgraph Detection in Multi-omics Networks for Disease Pathway Identification." *Frontiers in Big Data*, 5 (2022). DOI: `10.3389/fdata.2022.894632 `_. +.. [1] Liu, W., Vu, T., Konigsberg, I. R., Pratte, K. A., Zhuang, Y., & Kechris, K. J. (2023). "Network-Based Integration of Multi-Omics Data for Biomarker Discovery and Phenotype Prediction." *Bioinformatics*, 39(5), btat204. DOI: `10.1093/bioinformatics/btat204 `_. .. [2] Hussein, S., Ramos, V., et al. "Learning from Multi-Omics Networks to Enhance Disease Prediction: An Optimized Network Embedding and Fusion Approach." In *2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)*, Lisbon, Portugal, 2024, pp. 4371-4378. DOI: `10.1109/BIBM62325.2024.10822233 `_. -.. [3] Liu, W., Vu, T., Konigsberg, I. R., Pratte, K. A., Zhuang, Y., & Kechris, K. J. (2023). "Network-Based Integration of Multi-Omics Data for Biomarker Discovery and Phenotype Prediction." *Bioinformatics*, 39(5), btat204. DOI: `10.1093/bioinformatics/btat204 `_. 
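Note on the GCN update documented in the gnns.rst hunk of this diff: the rule X^(l+1) = ReLU(D̂^(-1/2) Â D̂^(-1/2) X^(l) W^(l)) can be illustrated with a small pure-Python sketch. This is not BioNeuralNet's implementation (the docs state that the package relies on PyTorch Geometric). The helper names `matmul` and `gcn_layer` are hypothetical, and the sketch assumes a non-negative adjacency matrix with a zero diagonal.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, x, w):
    """One GCN layer, ReLU(D^-1/2 (A + I) D^-1/2 X W), on plain lists.

    Hypothetical helper for illustration only; assumes `adj` is
    non-negative with a zero diagonal.
    """
    n = len(adj)
    # A_hat = A + I: add a self-loop to every node.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # D_hat^-1/2 from the row sums (degrees) of A_hat.
    d_inv_sqrt = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    # Symmetric normalization: D_hat^-1/2 A_hat D_hat^-1/2.
    a_norm = [
        [d_inv_sqrt[i] * a_hat[i][j] * d_inv_sqrt[j] for j in range(n)]
        for i in range(n)
    ]
    # Aggregate neighbor features, apply the weight matrix, then ReLU.
    out = matmul(matmul(a_norm, x), w)
    return [[max(0.0, v) for v in row] for row in out]


# Toy path graph 0 - 1 - 2 with one input feature and two output channels.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0], [2.0], [3.0]]
weights = [[1.0, -1.0]]
embeddings = gcn_layer(adjacency, features, weights)
print(embeddings)
```

In the toy graph, the second output channel is clipped to zero by the ReLU because its weight is negative, while the first channel holds the degree-normalized average over each node and its neighbors.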
diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 4bc0125..f90d3f0 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -1,61 +1,64 @@ Installation ============ -BioNeuralNet is fully compatible with Python 3.10, 3.11, and 3.12, and runs seamlessly on Windows, macOS, and Linux. Follow the steps below to set up BioNeuralNet and its dependencies. +BioNeuralNet is fully compatible with Python 3.10 or higher and supports Windows, macOS, and Linux platforms. Follow these steps to install BioNeuralNet along with necessary dependencies. -1. **Install BioNeuralNet via pip**: +1. **Install BioNeuralNet via pip** + + The core modules, including graph embeddings, disease prediction (DPMON), and clustering, can be installed directly: .. code-block:: bash pip install bioneuralnet - This installs the core BioNeuralNet modules for GNN embeddings, subject representation, - disease prediction (DPMON), and clustering. - -2. **Install PyTorch and PyTorch Geometric** (Separately): +2. **Install PyTorch and PyTorch Geometric (Required)** - BioNeuralNet relies on PyTorch and PyTorch Geometric for GNN operations: + BioNeuralNet utilizes PyTorch and PyTorch Geometric for graph neural network computations. Install these separately: .. code-block:: bash pip install torch pip install torch_geometric - For GPU-accelerated builds or other configurations visit the official sites: + For GPU-enabled installations or advanced configurations, please refer to the official documentation: - `PyTorch Installation Guide `_ - `PyTorch Geometric Installation Guide `_ - Select the appropriate build for your system (e.g., Stable, Linux, pip, Python, CPU). + Choose the appropriate build based on your system and GPU availability. .. figure:: _static/pytorch.png :align: center - :alt: PyTorch Installation + :alt: PyTorch Installation Guide Example .. 
figure:: _static/geometric.png :align: center - :alt: PyTorch Geometric Installation + :alt: PyTorch Geometric Installation Guide Example -3. **(Optional) Install R and External Tools**: +3. **Optional: Install R and External Tools (e.g., SmCCNet)** - If you plan to use **SmCCNet** for network construction: + For advanced network construction using external R tools like **SmCCNet**, follow these additional steps: - - Install R from `The R Project `_. - - Version 4.4.2 or higher is recommended. - - Install the required R packages. Open R and run: + - Install R (version 4.4.2 or newer recommended) from the `R Project `_. + - Within R, install the required packages: .. code-block:: r - if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") + if (!requireNamespace("BiocManager", quietly = TRUE)) + install.packages("BiocManager") + install.packages(c("dplyr", "jsonlite")) BiocManager::install(c("impute", "preprocessCore", "GO.db", "AnnotationDbi")) install.packages("SmCCNet") install.packages("WGCNA") -4. **Additional Notes for External Tools**: + See :doc:`external_tools/index` for further details on external tools. - Refer to the :doc:`external_tools/index`. +Next Steps +---------- -5. **Next Steps**: +After installation, explore our step-by-step tutorials: - - Explore :doc:`tutorials/index` and :doc:`Quick_Start` or :doc:`TCGA-BRCA_Dataset` for end-to-end workflows and examples. +- :doc:`tutorials/index` +- :doc:`Quick_Start` +- :doc:`TCGA-BRCA_Dataset` diff --git a/setup.cfg b/setup.cfg index 703b93e..572f2b4 100644 --- a/setup.cfg +++ b/setup.cfg @@ -1,6 +1,6 @@ [metadata] name = bioneuralnet -version = 1.0.9 +version = 1.1.0 author = Vicente Ramos author_email = vicente.ramos@ucdenver.edu description = A comprehensive framework for integrating multi-omics data with neural network embeddings.
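Note on the installation.rst changes in this diff: since PyTorch and PyTorch Geometric are installed separately from `bioneuralnet`, a quick environment check can catch a missing dependency before running the examples. The sketch below uses only the standard library. The function name `check_environment` is hypothetical, the import names (`bioneuralnet`, `torch`, `torch_geometric`) are taken from the pip commands and imports shown in the docs, and the Python 3.10 minimum follows the updated installation text.

```python
import importlib.util
import sys


def check_environment(
    packages=("bioneuralnet", "torch", "torch_geometric"), min_python=(3, 10)
):
    """Report whether the interpreter version and top-level packages look usable.

    Hypothetical helper for illustration only; it locates packages
    without importing them.
    """
    report = {"python_ok": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

The check only confirms that the packages can be located. It does not verify GPU support or version compatibility between PyTorch and PyTorch Geometric.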