This document outlines the available gene sets for modeling in the Bowel Cancer dataset, based on an analysis of 84 Human Bowel samples across different technologies (Visium, Visium HD, and Xenium).
Biological analysis is constrained by the intersection of genes available across different spatial transcriptomics platforms.
| Scope | Sample Count | Common Genes | Recommendation |
|---|---|---|---|
| All Human Bowel | 84 (Visium + Xenium) | ~405 | Cross-platform benchmarking |
| Visium Only | 78 | ~1060 | In-depth spatial profiling |
The cross-platform set (~405 genes) is the intersection of the Xenium panel and the Visium whole-transcriptome capture.
- Pros: Allows the model to be trained on Visium data and evaluated on high-resolution Xenium data.
- Cons: Limited to a smaller subset of genes, so pathways whose genes fall outside the Xenium panel may be missed.
The Visium-only set (~1060 genes) includes all genes present in every Visium sample in the HEST dataset.
- Pros: Provides a much larger feature space (predicting 1000+ genes).
- Cons: Cannot be directly evaluated on Xenium samples without imputation or subsetting.
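The two gene sets above are both intersections over a chosen group of samples. The following is a minimal sketch of that computation, not the project's actual implementation: the function name `common_genes`, the sample IDs, and the toy gene lists are all hypothetical and stand in for per-sample gene names read from the real data.

```python
def common_genes(sample_genes):
    """Return the sorted list of genes present in every sample.

    sample_genes: dict mapping a sample ID to an iterable of gene names.
    """
    gene_sets = [set(genes) for genes in sample_genes.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Toy data (hypothetical sample IDs and gene lists, for illustration only).
samples = {
    "visium_a": ["EPCAM", "KRT20", "MKI67", "CDX2"],
    "visium_b": ["EPCAM", "KRT20", "CDX2", "LGR5"],
    "xenium_a": ["EPCAM", "CDX2", "LGR5"],
}

# Restricting the scope to Visium samples yields a larger common set.
visium_only = {sid: g for sid, g in samples.items() if sid.startswith("visium")}

print(common_genes(visium_only))  # ['CDX2', 'EPCAM', 'KRT20']
print(common_genes(samples))      # ['CDX2', 'EPCAM']
```

Adding a sample with a smaller panel can only shrink the intersection, which is why the cross-platform scope has far fewer common genes than the Visium-only scope.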
The current dataloader implementation in `src/spatial_transcript_former/data/dataset.py` uses a "Gene Lock" mechanism:
- The first sample in the training loop determines the target gene list.
- All subsequent samples are aligned to this list (missing genes are filled with zeros).
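The alignment step described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `dataset.py`: the helper name `align_to_gene_list` is hypothetical, and expression values are shown as plain nested lists (spots by genes) rather than whatever array type the dataloader actually uses.

```python
def align_to_gene_list(rows, sample_genes, target_genes):
    """Reorder each row of expression values to match target_genes.

    rows: list of per-spot expression lists, ordered like sample_genes.
    Genes in target_genes that the sample lacks are filled with 0.0.
    """
    column_of = {gene: i for i, gene in enumerate(sample_genes)}
    return [
        [row[column_of[gene]] if gene in column_of else 0.0 for gene in target_genes]
        for row in rows
    ]


# Target list fixed in advance (hypothetical gene names).
target = ["CDX2", "EPCAM", "KRT20"]

# A sample that measures EPCAM and LGR5 only, with two spots.
sample_genes = ["EPCAM", "LGR5"]
rows = [[5.0, 1.0], [0.0, 3.0]]

aligned = align_to_gene_list(rows, sample_genes, target)
print(aligned)  # [[0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
```

In the output, a zero in the CDX2 column means "not measured", while the zero for EPCAM in the second spot is a real measurement. Zero filling cannot tell these apart, which is one reason to restrict the target list to genes that every sample in scope actually contains.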
For stable training, pass an explicit list of gene names to the `get_hest_dataloader` function instead of relying on the first sample's top genes, so that the target gene list does not depend on which sample is loaded first.
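```python
# Import path assumed from the module location given above.
from spatial_transcript_former.data.dataset import get_hest_dataloader

# Create a fixed gene list for the project
bowel_genes = [...]  # The 405 or 1060 common genes

dataloader = get_hest_dataloader(
    ids=sample_ids,
    selected_gene_names=bowel_genes,
    # ... remaining arguments
)
```

You can use the `inspection/analyze_gene_overlap.py` script to generate custom gene sets based on your specific sample filtering criteria.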