PyPI-Server CI License: MIT

cellarr-se

cellarr-se is a read-only, out-of-core coordinator for TileDB-backed genomic datasets. It wraps the cellarr-array and cellarr-frame primitives in a lazy, SummarizedExperiment-compatible interface, so you can slice large on-disk datasets without first loading them into memory.

Single-cell and bulk RNA-seq datasets frequently exceed available RAM. cellarr-se keeps assay matrices and metadata tables on disk as TileDB arrays and reads from them only when you request a slice, applying the same row and column selection to every assay and to both metadata tables. The result is always a standard in-memory SummarizedExperiment object.
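The idea of a lazy, synchronized slice can be illustrated with a toy coordinator. This is a minimal sketch and not the cellarr-se implementation: LazyCoordinator, fake_reader, and read_log are hypothetical names, and plain Python lists stand in for TileDB arrays. It shows only that construction stores readers and labels, and that data is read when a window is requested.

```python
from typing import Callable, Dict, List, Sequence

Matrix = List[List[float]]
Reader = Callable[[Sequence[int], Sequence[int]], Matrix]


class LazyCoordinator:
    """Toy stand-in for an out-of-core coordinator (hypothetical, for illustration)."""

    def __init__(self, assay_readers: Dict[str, Reader],
                 row_names: List[str], col_names: List[str]) -> None:
        # Only the readers and axis labels are kept; no assay values are read here.
        self._readers = assay_readers
        self.row_names = row_names
        self.col_names = col_names

    def materialize(self, rows: Sequence[int], cols: Sequence[int]) -> dict:
        """Read the same row/column window from every assay and return it in memory."""
        return {
            "assays": {name: read(rows, cols) for name, read in self._readers.items()},
            "row_names": [self.row_names[i] for i in rows],
            "col_names": [self.col_names[j] for j in cols],
        }


# Hypothetical backing store: values are computed on demand to mimic a deferred read.
read_log = []


def fake_reader(rows: Sequence[int], cols: Sequence[int]) -> Matrix:
    read_log.append((len(rows), len(cols)))  # record the size of each read
    return [[float(i * 10 + j) for j in cols] for i in rows]


coord = LazyCoordinator({"counts": fake_reader}, ["g0", "g1", "g2"], ["s0", "s1"])
print(read_log)  # [] because nothing has been read yet

window = coord.materialize(rows=[0, 2], cols=[1])
print(window["assays"]["counts"])  # [[1.0], [21.0]]
```

Only the requested 2x1 window is read, and the row and column labels are subset with the same indices as the assay.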

Install

pip install cellarr-se

Usage

Construction

CellArraySE wraps existing TileDB arrays and frames; it does not create them. Use cellarr-array and cellarr-frame to build the backing stores first.

from cellarr_se import CellArraySE

se = CellArraySE(
    assays={"counts": my_cell_array, "tpm": my_tpm_array},
    row_data=my_row_frame,   # gene annotations (CellArrayFrame)
    col_data=my_col_frame,   # sample annotations (CellArrayFrame)
)

Inspection

se.shape          # (n_genes, n_samples)
se.assay_names    # ["counts", "tpm"]
se.row_names      # pd.Index of gene identifiers
se.col_names      # pd.Index of sample identifiers
se.row_columns    # list of gene metadata fields
se.col_columns    # list of sample metadata fields

se.show()         # print a summary with the first 5 rows of each metadata table
repr(se)          # <CellArraySE: 20000x500 | counts, tpm>

Slicing

Bracket notation supports integer indices, slices, name strings, and lists:

# Positional slice
subset = se[0:100, 0:50]

# Single row and single column by position (still returns a SummarizedExperiment)
single = se[5, 3]

# Lists of indices or names
subset = se[["BRCA1", "TP53"], ["sample_001", "sample_042"]]
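One way to picture how mixed selectors could be normalized before a read is to map each of them to positional indices. The helper below is a sketch under that assumption; resolve_selector is a hypothetical function, not part of the cellarr-se API, and the library may handle selectors differently.

```python
from typing import List, Sequence, Union

Selector = Union[int, str, slice, Sequence[Union[int, str]]]


def resolve_selector(selector: Selector, names: List[str]) -> List[int]:
    """Turn an int, a name, a slice, or a list of ints/names into positional indices."""
    if isinstance(selector, slice):
        # slice.indices clips start/stop/step to the axis length.
        return list(range(*selector.indices(len(names))))
    if isinstance(selector, (int, str)):
        selector = [selector]  # treat a scalar as a one-element list

    lookup = {name: pos for pos, name in enumerate(names)}
    positions = []
    for item in selector:
        if isinstance(item, str):
            if item not in lookup:
                raise KeyError(f"unknown name: {item!r}")
            positions.append(lookup[item])
        else:
            if not -len(names) <= item < len(names):
                raise IndexError(f"index {item} out of range for axis of length {len(names)}")
            positions.append(item % len(names))  # normalize negative indices
    return positions


genes = ["BRCA1", "TP53", "EGFR", "MYC"]
print(resolve_selector(slice(0, 2), genes))      # [0, 1]
print(resolve_selector(["TP53", "MYC"], genes))  # [1, 3]
print(resolve_selector("EGFR", genes))           # [2]
```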

For attribute-filtered access, use slice() with TileDB query strings:

# Filter rows and columns by metadata attributes
subset = se.slice(
    row_query="gene_type == 'protein_coding'",
    col_query="tissue == 'liver'",
)

# Combine a row query with a positional column subset, and restrict the assays and row metadata fields returned
subset = se.slice(
    row_query="gene_type == 'protein_coding'",
    col_subset=slice(0, 50),
    assays=["counts"],
    row_columns=["gene_id", "gene_name"],
)

Both se[...] and se.slice(...) return a standard in-memory SummarizedExperiment.
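Conceptually, an attribute-filtered slice has two steps: evaluate the filter against the metadata table to get matching positions, then read those positions from the assay. The sketch below imitates that flow with plain Python; matching_positions and take are hypothetical helpers, lambdas stand in for TileDB query strings, and the data is invented for illustration.

```python
from typing import Callable, Dict, List, Sequence

Record = Dict[str, str]


def matching_positions(table: List[Record], predicate: Callable[[Record], bool]) -> List[int]:
    """Return the positions of the metadata records that satisfy the predicate."""
    return [pos for pos, record in enumerate(table) if predicate(record)]


def take(matrix: List[List[int]], rows: Sequence[int], cols: Sequence[int]) -> List[List[int]]:
    """Select the given rows and columns from a dense matrix (list of rows)."""
    return [[matrix[i][j] for j in cols] for i in rows]


# Invented example data: 3 genes x 2 samples.
row_data = [
    {"gene_name": "BRCA1", "gene_type": "protein_coding"},
    {"gene_name": "MALAT1", "gene_type": "lncRNA"},
    {"gene_name": "TP53", "gene_type": "protein_coding"},
]
col_data = [
    {"sample": "sample_001", "tissue": "liver"},
    {"sample": "sample_002", "tissue": "lung"},
]
counts = [[5, 0], [12, 7], [3, 9]]

# Step 1: filter each metadata table to positions.
rows = matching_positions(row_data, lambda r: r["gene_type"] == "protein_coding")
cols = matching_positions(col_data, lambda c: c["tissue"] == "liver")

# Step 2: apply the same positions to the assay.
subset = take(counts, rows, cols)
print(rows, cols, subset)  # [0, 2] [0] [[5], [3]]
```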

Assay metadata

se.is_sparse("counts")        # True if backed by SparseCellArray
se.get_assay_type("counts")   # numpy dtype of the assay
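Because every slice is materialized in memory, it can help to estimate the size of a dense result before requesting it. The sketch below is plain arithmetic; dense_nbytes is a hypothetical helper, and the 4-byte item size (for example float32) is an assumption. In practice the shape could come from se.shape and the item size from the numpy dtype reported for the assay.

```python
from typing import Tuple


def dense_nbytes(shape: Tuple[int, int], itemsize: int) -> int:
    """Bytes needed to hold a dense matrix of the given shape and per-element size."""
    n_rows, n_cols = shape
    return n_rows * n_cols * itemsize


# 20000 genes x 500 samples, assuming 4 bytes per value.
full = dense_nbytes((20000, 500), itemsize=4)
print(full)                    # 40000000
print(round(full / 2**20, 1))  # 38.1 (MiB)

# A 100 x 50 slice of the same assay is far smaller.
print(dense_nbytes((100, 50), itemsize=4))  # 20000
```

Sparse assays can take much less space than this dense upper bound, depending on how the result is represented.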

Demo

A worked example covering construction, inspection, and slicing is available in the demo notebook.

Note

This project has been set up using BiocSetup and PyScaffold.
