NS-Forest v4.1

Documentation: https://nsforest.readthedocs.io/en/latest/

Citation: Liu A, Peng B, Pankajam A, Duong TE, Pryhuber G, Scheuermann RH, Zhang Y. (2024) Discovery of optimal cell type classification marker genes from single cell RNA sequencing data. BMC Methods. https://doi.org/10.1186/s44330-024-00015-2

To contribute, please open an issue on this GitHub repository.

Download and installation

In terminal:

git clone https://github.com/NLM-DIR/NSForest.git

cd NSForest

conda env create -f environment.yml

conda activate nsforest

pip install .

Tutorial

Please find tutorials in the documentation.

Prerequisites

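This package is written for and tested with Python 3.11+ and scanpy 1.9.6+.
Other required libraries: numpy, pandas, scikit-learn (imported as sklearn), plotly, and tqdm. The time module is also used; it is part of the Python standard library.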
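
A quick way to confirm that an environment meets these prerequisites is to print the installed versions. The snippet below is a minimal sketch that uses only the standard library; the helper name `check_environment` is hypothetical and is not part of NS-Forest.

```python
import sys
from importlib import metadata


def check_environment(packages=("scanpy", "numpy", "pandas",
                                "scikit-learn", "plotly", "tqdm")):
    """Return a dict mapping each name to its installed version (None if missing)."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the active environment.
            report[name] = None
    return report


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated nsforest conda environment and compare the printed versions with the minimums listed above.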

NS-Forest workflow

NS-Forest is an algorithm that identifies, for each cell type cluster found in a single-cell or single-nucleus RNA sequencing experiment, the minimum combination of necessary and sufficient marker genes that optimizes classification accuracy. With default settings, NS-Forest proceeds through the following steps:

Data input: An AnnData object (e.g., .h5ad file) with cell type cluster labels.
Binary score calculation: Each gene is assigned a binary score for every cluster. The binary score measures how binary a gene's expression pattern is. A higher binary score means the gene is expressed in one cluster and not in the others. A lower binary score means the gene is expressed in many clusters and would not be an ideal candidate for a cell type-specific marker gene.
Binary scoring criterion: NS-Forest then filters for genes with high binary scores. Candidate genes are selected if their binary scores are 2 standard deviations above the mean of all genes expressed in the cluster.
Random forest: The gene candidates pre-selected by binary scoring are used as input to a random forest classifier, which ranks the genes by Gini impurity while producing a classification model for each cluster.
Decision tree evaluation: The top 6 ranked random forest genes are used as input into decision trees where all combinations of input genes are evaluated and the combination with the highest F-beta score is selected.
Output: The NS-Forest algorithm outputs 1-6 marker genes per cluster along with the classification metrics (F-beta, precision, recall) and the On-Target Fraction expression metric.
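
The binary scoring criterion and the combination search described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's code: the function names (`select_candidates`, `fbeta`, `best_combination`) are hypothetical, the sample standard deviation is used for the cutoff, beta = 0.5 is an assumed value, and pre-computed per-gene positive calls stand in for fitted decision trees. Calls from several genes are combined with AND, borrowing the logic described for the marker gene evaluation module. The random forest ranking step is omitted.

```python
from itertools import combinations
from statistics import mean, stdev


def select_candidates(binary_scores, n_sd=2.0):
    """Keep genes whose binary score exceeds mean + n_sd * SD of all scores.

    Assumption: the sample standard deviation is used for the cutoff.
    """
    values = list(binary_scores.values())
    threshold = mean(values) + n_sd * stdev(values)
    return [gene for gene, score in binary_scores.items() if score > threshold]


def fbeta(y_true, y_pred, beta=0.5):
    """F-beta score from paired true/predicted labels (beta=0.5 is an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_combination(calls, y_true, beta=0.5):
    """Try every non-empty gene combination; return (genes, score) with top F-beta.

    `calls` maps gene -> per-cell positive calls (hypothetical stand-ins for
    the per-gene decision trees). A cell is predicted positive only when
    every gene in the combination calls it positive (AND logic).
    """
    best_genes, best_score = None, -1.0
    for size in range(1, len(calls) + 1):
        for combo in combinations(calls, size):
            y_pred = [all(calls[g][i] for g in combo) for i in range(len(y_true))]
            score = fbeta(y_true, y_pred, beta)
            if score > best_score:
                best_genes, best_score = combo, score
    return best_genes, best_score
```

With six input genes the search covers 2^6 - 1 = 63 combinations, so exhaustive enumeration stays cheap.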

Marker gene evaluation module

The final module in the NS-Forest workflow can also be used to assess the performance of any collection of marker genes identified using any approach. The marker gene evaluation module includes the following steps (default setting):

Data input: 1) An AnnData object (e.g., .h5ad file) with cell type cluster labels. 2) A dictionary of marker genes for every cluster to be evaluated.
Decision tree evaluation: One-vs-all decision trees are created for each gene in the marker combination. If there is more than one gene in the marker combination, AND logic is used when calculating the true positives from the multiple decision trees for one cell type cluster.
Output: The marker gene evaluation module outputs the classification metrics (F-beta, precision, recall) and On-Target Fraction for evaluated markers of every cluster.
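
The evaluation steps above can be illustrated with a small standard-library sketch. It is not the module's implementation: `fit_stump` and `evaluate_markers` are hypothetical names, each one-vs-all decision tree is approximated by a single expression cutoff chosen to minimize weighted Gini impurity, cells above the cutoff are assumed to be the positive side, and beta = 0.5 is an assumed value. On-Target Fraction is not computed here.

```python
def _gini(pos, n):
    """Gini impurity of a node holding `n` cells, `pos` of them in the target."""
    p = pos / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)


def fit_stump(values, is_target):
    """Return the cutoff with the lowest weighted Gini impurity, or None."""
    pairs = sorted(zip(values, is_target))
    n = len(pairs)
    total_pos = sum(1 for _, t in pairs if t)
    best_cut, best_gini, left_pos = None, float("inf"), 0
    for i in range(1, n):
        left_pos += 1 if pairs[i - 1][1] else 0
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical expression values
        weighted = (i * _gini(left_pos, i)
                    + (n - i) * _gini(total_pos - left_pos, n - i)) / n
        if weighted < best_gini:
            best_gini = weighted
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut


def evaluate_markers(expr, labels, markers, beta=0.5):
    """Score each cluster's marker list with one-vs-all stumps joined by AND.

    expr: gene -> per-cell expression values; labels: per-cell cluster labels;
    markers: cluster -> list of marker genes to evaluate.
    """
    results = {}
    for cluster, genes in markers.items():
        y_true = [label == cluster for label in labels]
        y_pred = [True] * len(labels)
        for gene in genes:
            cut = fit_stump(expr[gene], y_true)
            calls = [cut is not None and v > cut for v in expr[gene]]
            y_pred = [a and b for a, b in zip(y_pred, calls)]  # AND logic
        tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        b2 = beta ** 2
        denom = b2 * precision + recall
        f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
        results[cluster] = {"precision": precision, "recall": recall,
                            "f_beta": f_beta}
    return results
```

Because of the AND logic, adding a gene to a combination can only remove predicted positives, so it tends to raise precision at the possible cost of recall.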

Versions and citations

Earlier versions are managed in Releases.

Version 4.0:

Liu A, Peng B, Pankajam A, Duong TE, Pryhuber G, Scheuermann RH, Zhang Y. (2024) Discovery of optimal cell type classification marker genes from single cell RNA sequencing data. BMC Methods. https://doi.org/10.1186/s44330-024-00015-2

Version 2.0:

Aevermann BD, Zhang Y, Novotny M, Keshk M, Bakken TE, Miller JA, Hodge RD, Lelieveldt B, Lein ES, Scheuermann RH. (2021) A machine learning method for the discovery of minimum marker gene combinations for cell-type identification from single-cell RNA sequencing. Genome Res. https://pubmed.ncbi.nlm.nih.gov/34088715/

Version 1.3/1.0:

Aevermann BD, Novotny M, Bakken T, Miller JA, Diehl AD, Osumi-Sutherland D, Lasken RS, Lein ES, Scheuermann RH. (2018) Cell type discovery using single-cell transcriptomics: implications for ontological representation. Hum Mol Genet. https://pubmed.ncbi.nlm.nih.gov/29590361/

Authors

Angela Liu (aliu@jcvi.org)
Beverly Peng (bpeng@jcvi.org)
Brian Aevermann (baevermann@chanzuckerberg.com)
Richard Scheuermann (richard.scheuermann@nih.gov)
Yun (Renee) Zhang (yun.zhang@nih.gov)

Acknowledgments

Division of Intramural Research, National Library of Medicine

Our collaborators:

Allen Institute of Brain Science
Brain Initiative Cell Census Network
Chan Zuckerberg Initiative
California Institute for Regenerative Medicine
J. Craig Venter Institute

