Commit 244c59d

feat: integrate auxiliary pathway loss and sparsity regularization
Implement key mathematical and architectural improvements to the spatial transcriptomics pipeline, focusing on pathway-driven interpretability.

Core Changes:
- Add Auxiliary Pathway Loss to ground the model's latent space in biological priors using MSigDB pathway membership markers.
- Introduce L1 sparsity regularization (--sparsity-lambda) on reconstruction weights to prune redundant pathway-to-gene mappings.
- Implement --log-transform data strategy (log1p) to stabilize MSE against high-variance housekeeping gene expression.

Architecture & Refactoring:
- Update dataloader and train/predict scripts to support auxiliary objectives and log-transformed targets.
- Refactor visualization modules to improve histology-pathway overlays.
- Remove deprecated Bowel Cancer download script.

Documentation & Compliance:
- Formally document auxiliary loss and training objectives in the IP Statement.
- Expand model documentation and training guides to reflect bottleneck transitions and latent discovery capabilities.
- Add Future Directions and Clinical Collaborations to project README.
- Resolve markdownlint error in LICENSE by adding a top-level H1 heading.

Testing:
- Add unit tests for pathway pruning logic and auxiliary loss calculation.
- Strengthen visualization tests to ensure coordinate alignment integrity.
1 parent a3ffc00 commit 244c59d

17 files changed

Lines changed: 171 additions & 227 deletions

CONTRIBUTING.md

Lines changed: 13 additions & 8 deletions
Original file line number | Diff line number | Diff line change
@@ -35,18 +35,23 @@ We use `black` for formatting and `flake8` for linting. Please ensure your code
3535

3636
```bash
3737
black .
38-
flake8 src/
3938
```
4039

41-
### 3. Testing
40+
### 4. AI-Assisted Development
4241

43-
All new features must include unit tests in the `tests/` directory. We use `pytest` for our test suite.
42+
We welcome contributions developed with the assistance of AI tools (e.g., Copilot, ChatGPT, Claude, or agentic frameworks). However, to ensure the long-term maintainability and integrity of the project:
4443

45-
```bash
46-
# Run all tests
47-
.\test.ps1 # Windows
48-
bash test.sh # Linux
49-
```
44+
- **Ownership**: You are ultimately responsible for the code you submit. Do not commit code you do not fully understand.
45+
- **Explainability**: During the review process, you must be able to explain the logic, design decisions, and any subtle side effects of the AI-suggested changes.
46+
- **Verification**: AI-generated code must strictly follow our coding standards, naming conventions, and architectural patterns. It must be accompanied by robust tests (see our [Testing Guide](docs/TESTING.md)).
47+
48+
### 3. Testing & Quality Assurance
49+
50+
All new features must be accompanied by relevant tests in the `tests/` directory, written with `pytest`.
51+
52+
We highly encourage rigorous testing approaches such as **Mutation Testing** (via `cosmic-ray`) for critical model components to prevent surviving mutants.
53+
54+
For full details on our testing requirements, how to run the test suites locally, and our guidelines on mutation testing, please read the [Testing Guide](docs/TESTING.md).
5055

5156
## Pull Request Process
5257

Download-BowelCancer.ps1

Lines changed: 0 additions & 105 deletions
This file was deleted.

LICENSE

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)
1+
# PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)
22

33
Copyright (c) 2026 Benjamin Isaac Wilson. All rights reserved.
44

README.md

Lines changed: 22 additions & 5 deletions
@@ -3,13 +3,14 @@
33
> [!WARNING]
44
> **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.
55
6-
A transformer-based model for spatial transcriptomics that bridges histology and biological pathways.
6+
**SpatialTranscriptFormer** bridges histology and biological pathways through a high-performance transformer architecture. By modeling the dense interplay between morphological features and gene expression signatures, it provides an interpretable and spatially-coherent mapping of the tissue microenvironment.
77

88
## Key Features
99

1010
- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
1111
- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
1212
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
13+
- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
1314
- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
1415

1516
## License
@@ -35,15 +36,24 @@ This project requires [Conda](https://docs.conda.io/en/latest/).
3536

3637
## Usage
3738

38-
### Download HEST Data
39+
### Dataset Access
3940

40-
Download specific subsets using filters or patterns:
41+
The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
4142

4243
```bash
43-
# Download only the Bowel Cancer subset (including ST data and WSIs)
44-
stf-download --organ Bowel --disease Cancer --local_dir hest_data
44+
# List available filtering options
45+
stf-download --list-options
46+
47+
# Download a specific subset (e.g., Breast Cancer samples from Visium)
48+
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
49+
50+
# Download all human samples
51+
stf-download --species "Homo sapiens" --local_dir hest_data
4552
```
4653

54+
> [!NOTE]
55+
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
56+
4757
### Train Models
4858

4959
We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.
@@ -95,6 +105,13 @@ Visualization plots will be saved to the `./results` directory.
95105
.\test.ps1
96106
```
97107

108+
## Future Directions & Clinical Collaborations
109+
110+
A major future direction for **SpatialTranscriptFormer** is to integrate this architecture into an **end-to-end pipeline for patient risk assessment** and prognosis tracking. By leveraging the model's predicted expression and pathway activations, we aim to build a downstream risk prediction module that allows users to directly evaluate how spatially-resolved expression relates to patient survival.
111+
112+
> [!NOTE]
113+
> **Call for Collaborators:** Rigorous risk assessment models require vast datasets of clinical metadata and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest! If you have access to large clinical cohorts and are interested in exploring how spatial pathway activation correlates with patient prognosis, we would love to partner with you.
114+
98115
## Contributing
99116

100117
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on our coding standards and the process for submitting pull requests. Note that this project is under a proprietary license; contributions involve an assignment of rights for non-academic use.

docs/IP_STATEMENT.md

Lines changed: 12 additions & 0 deletions
@@ -14,6 +14,18 @@ The primary innovation is the **multimodal bottleneck transformer** designed for
1414
- **Quadrant-Based Interaction Masking**: The logic used to zero out specific attention quadrants (e.g., $A_{H \to H}$) to optimize memory while maintaining multimodal context.
1515
- **Biologically-Informed Reconstruction Bottleneck**: The specific matrix decomposition approach where gene expression is reconstructed from a linear combination of pathway activations.
1616

17+
### Proposed Auxiliary Pathway Loss
18+
19+
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
20+
21+
The total objective becomes:
22+
$$\mathcal{L} = \mathcal{L}_{gene} + \lambda_{aux} (1 - \text{PCC}(\text{pathway\_scores}, \text{target\_pathways}))$$
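As a rough illustration of the objective above, here is a minimal pure-Python sketch of a `1 - PCC` auxiliary term. It is not the repository's `AuxiliaryPathwayLoss`: the function names, the choice to correlate each pathway across spots and then average, and the default `lambda_aux` value are all assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def auxiliary_pathway_loss(pathway_scores, target_pathways):
    """Mean of (1 - PCC) over pathways.

    Assumption: each argument is a list with one row per pathway,
    and each row holds that pathway's values across spots.
    """
    losses = [1.0 - pearson(s, t) for s, t in zip(pathway_scores, target_pathways)]
    return sum(losses) / len(losses)


def total_loss(gene_loss, pathway_scores, target_pathways, lambda_aux=0.1):
    """L = L_gene + lambda_aux * (1 - PCC); lambda_aux=0.1 is a hypothetical default."""
    return gene_loss + lambda_aux * auxiliary_pathway_loss(pathway_scores, target_pathways)
```

A perfectly correlated pathway contributes 0 to the auxiliary term and a perfectly anti-correlated one contributes 2, so the term is bounded in `[0, 2]` regardless of the expression scale.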
23+
24+
The `--log-transform` flag applies `log1p` to targets, mitigating the heavy-tailed gene expression distribution where housekeeping genes dominate MSE.
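The effect of `log1p` on the squared-error budget can be seen with a toy calculation. The counts below are made up for illustration (one housekeeping-like gene and two low-count genes); the sketch only compares how much of the total squared error each gene contributes before and after the transform.

```python
import math

# Hypothetical counts for one spot: gene 0 is housekeeping-like, genes 1-2 are low-count.
target_counts = [5000.0, 3.0, 0.0]
predicted_counts = [4000.0, 6.0, 1.0]  # 20% low on gene 0, 2x high on gene 1

# Per-gene squared error on raw counts.
raw_terms = [(p - t) ** 2 for p, t in zip(predicted_counts, target_counts)]

# Per-gene squared error after log1p(x) = log(1 + x) on both prediction and target.
log_terms = [
    (math.log1p(p) - math.log1p(t)) ** 2
    for p, t in zip(predicted_counts, target_counts)
]

# Fraction of the total squared error attributable to the high-count gene.
raw_share = raw_terms[0] / sum(raw_terms)
log_share = log_terms[0] / sum(log_terms)
```

On raw counts the high-count gene accounts for almost all of the error, while after `log1p` the relative errors on the low-count genes carry most of the weight.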
25+
26+
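With pathway sparsity regularisation (`--sparsity-lambda`), the full training objective adds an L1 penalty on the reconstruction weights to the task loss $\mathcal{L}_{task}$: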
27+
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \|W_{recon}\|_1$$
28+
1729
## 2. Spatial Context Methodologies
1830

1931
- **Euclidean-Gated Attention**: The implementation of spatial distance-based masking ($M_{spatial}$) to constrain model focus to local morphological regions.

docs/MODELS.md

Lines changed: 2 additions & 2 deletions
@@ -122,7 +122,7 @@ Together, these ensure the model learns *spatially-varying* pathway activation m
122122

123123
#### Frozen Backbone (Feature Extraction)
124124

125-
Pre-computed features from a pathology foundation model. The backbone is never fine-tuned.
125+
Pre-computed features from a pathology foundation model. The backbone is currently never fine-tuned, though this may change.
126126

127127
| Backbone | Feature Dim | Source |
128128
| :--- | :--- | :--- |
@@ -214,7 +214,7 @@ The Zero-Inflated Negative Binomial (ZINB) loss is designed for raw, highly disp
214214

215215
The model outputs these parameters, and the loss computes the negative log-likelihood of the ground truth counts given this distribution.
216216

217-
### Auxiliary Pathway Loss
217+
### Proposed Auxiliary Pathway Loss
218218

219219
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
220220

docs/PATHWAY_MAPPING.md

Lines changed: 34 additions & 27 deletions
@@ -21,49 +21,55 @@ After the model makes predictions (N spots x G genes), we run a statistical test
2121
- **Tool**: `gseapy` or a custom mapping script.
2222
- **Use Case**: Generating a "Pathway Activation Map" from a trained model's output.
2323

24-
### B. Pathway Bottleneck (Model Architecture)
24+
### B. Interaction Model via Multi-Task Learning (MTL)
2525

26-
The **SpatialTranscriptFormer** replaces the standard linear output head with a two-step projection that can be configured in two modes:
26+
The **SpatialTranscriptFormer** interaction model represents pathway activations within its attention mechanism and output stage. Rather than a simple linear bottleneck, it uses learnable pathway tokens trained with Multi-Task Learning (MTL).
2727

28-
#### 1. Informed Projection (Prior Knowledge)
28+
#### 1. Informed Supervision via Auxiliary Loss
2929

30-
In this mode, the **Gene Reconstruction Matrix** $\mathbf{W}_{recon}$ is guided by established biological databases (MSigDB, KEGG).
30+
In this mode, the network receives direct supervision on its pathway tokens, guided by established biological databases (e.g., MSigDB):
3131

32-
- **Implementation**: $\mathbf{W}_{recon}$ is initialized as a binary mask $M \in \{0, 1\}^{G \times P}$ where $M_{gk} = 1$ if gene $g$ belongs to pathway $k$.
33-
- **Benefit**: Predictions are guaranteed to be linear combinations of known biological processes, making them instantly interpretable by clinicians.
32+
- **Architecture Flow**:
33+
1. **Interaction**: Learnable pathway tokens $P$ interact with Histology patch features $H$ via self-attention (e.g., $p2h$, $h2p$).
34+
2. **Activation**: Pathway scores $S \in \mathbb{R}^P$ are computed using a learnable temperature-scaled cosine similarity between the pathway tokens and image patch tokens.
35+
3. **Gene Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$, where $\mathbf{W}_{recon}$ is initialized using the binary pathway membership matrix $M$.
36+
- **MTL Auxiliary Loss**: To prevent standard bottleneck collapse, an explicit auxiliary loss bridges the spatial representations directly to biological data. The pathway scores $S$ are supervised against a pathway ground truth ($Y_{genes} \cdot M^T$) using a Pearson Correlation Coefficient (PCC) loss.
37+
$$L_{total} = L_{gene} + \lambda_{pathway} \left(1 - \text{PCC}(S, Y_{genes} \cdot M^T)\right)$$
38+
- **Benefit**: The model is forced to explicitly align its internal interaction tokens with concrete biological pathways, granting direct interpretability.
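The architecture flow above can be sketched for a single spot in plain Python. This is a toy under stated assumptions, not the model code: it scores against one patch vector (the real aggregation over patches is not specified here), treats the temperature as a divisor, and stores both $M$ and $\mathbf{W}_{recon}$ as $P$ rows of $G$ entries so that $S \cdot \mathbf{W}_{recon}$ and $Y_{genes} \cdot M^T$ are well-formed. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pathway_scores(pathway_tokens, patch_token, temperature=1.0):
    """S_k = cos(pathway_k, patch) / temperature (assumed form of the scaling)."""
    return [cosine(p, patch_token) / temperature for p in pathway_tokens]


def reconstruct_genes(scores, w_recon, bias):
    """y_hat = S . W_recon + b, with W_recon given as P rows of G weights."""
    return [
        sum(scores[k] * w_recon[k][g] for k in range(len(scores))) + bias[g]
        for g in range(len(bias))
    ]


def pathway_targets(y_genes, membership):
    """Y_genes . M^T for one spot: sum expression over each pathway's member genes."""
    return [sum(y * m for y, m in zip(y_genes, row)) for row in membership]


# Toy setup: 2 pathways, 3 genes, 2-dimensional tokens.
M = [[1, 1, 0], [0, 0, 1]]            # binary membership, P x G
tokens = [[1.0, 0.0], [0.0, 1.0]]     # one token per pathway
patch = [1.0, 0.0]                    # a single histology patch feature

S = pathway_scores(tokens, patch, temperature=0.5)
y_hat = reconstruct_genes(S, M, bias=[0.0, 0.0, 0.0])  # W_recon initialized from M
targets = pathway_targets([2.0, 3.0, 5.0], M)
```

Initializing $\mathbf{W}_{recon}$ from $M$ means each predicted gene starts as the sum of the scores of the pathways it belongs to; training is then free to move the weights away from the binary prior.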
3439

35-
#### 2. Data-Driven Projection (Latent Discovery)
40+
#### 2. Data-Driven Discovery (Latent Projection)
3641

37-
In this mode, the model learns its own "latent pathways" based on morphological patterns.
42+
In the absence of a biological prior, the model can learn its own "latent pathways".
3843

39-
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and learned via backpropagation.
40-
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" gene sets: $L_{total} = L_{MSE} + \lambda \|\mathbf{W}_{recon}\|_1$.
44+
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and the auxiliary pathway loss is disabled.
45+
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" sparse gene sets: $L_{total} = L_{gene} + \lambda_{sparsity} \|\mathbf{W}_{recon}\|_1$.
4146
- **Benefit**: Can discover novel spatial-transcriptomic relationships that aren't yet captured in curated databases.
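A minimal sketch of the L1-regularized objective, assuming $\mathbf{W}_{recon}$ is available as a nested list of floats. The function names are hypothetical; the default of `0.01` simply mirrors the `--sparsity-lambda 0.01` value used in the training command shown in this document.

```python
def l1_penalty(w_recon):
    """Sum of absolute values of all entries of W_recon."""
    return sum(abs(w) for row in w_recon for w in row)


def sparsity_regularized_loss(gene_loss, w_recon, sparsity_lambda=0.01):
    """L_total = L_gene + lambda_sparsity * ||W_recon||_1."""
    return gene_loss + sparsity_lambda * l1_penalty(w_recon)


# Two weight matrices with the same largest entry: the dense one pays a larger penalty.
dense = [[0.5, -0.5], [0.5, 2.0]]
sparse = [[0.0, 0.0], [0.0, 2.0]]
dense_loss = sparsity_regularized_loss(1.0, dense)
sparse_loss = sparsity_regularized_loss(1.0, sparse)
```

Because the penalty grows linearly with every non-zero weight, gradient descent is pushed toward solutions where each latent pathway touches only a few genes.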
4247

43-
- **Architecture Flow**:
44-
1. **Interaction**: Pathway tokens $P$ query the Histology $H$.
45-
2. **Activation**: A linear layer reduces $P_{tokens}$ to activation scores $S \in \mathbb{R}^P$.
46-
3. **Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$.
48+
## 3. Generalizing to HEST1k Tissues
49+
50+
The model supports any dataset within the HEST1k collection (e.g., Breast, Kidney, Lung, Colon). Instead of being bound to a single disease context, users can leverage the `--custom-gmt` flag to map genes to pathways relevant to their specific investigation.
4751

48-
## 3. Clinical Application in Bowel Cancer
52+
### Example: Profiling the Tumor Microenvironment
4953

50-
For colorectal cancer, we should prioritize monitoring these specific pathways:
54+
Regardless of the tissue of origin (e.g., Kidney versus Breast), researchers often track core functional states within the tumor microenvironment. A user might define a `.gmt` file to explicitly monitor:
5155

52-
| Pathway | Clinically Relevant Genes | Clinical Significance |
56+
| Pathway Concept | Hallmarks / Relevant Genes | Interpretive Value across Tissues |
5357
| :--- | :--- | :--- |
54-
| **Wnt Signaling** | `CTNNB1`, `MYC`, `AXIN2` | Common driver in CRC (APC mutations) |
55-
| **MMR / DNA Repair** | `MLH1`, `MSH2`, `MSH6` | MSI vs MSS status (Immunotherapy response) |
56-
| **EMT** | `SNAI1`, `VIM`, `ZEB1` | Tumor invasion and metastasis risk |
57-
| **Angiogenesis** | `VEGFA`, `FLT1` | Potential for anti-angiogenic therapy |
58+
| **Hypoxia & Angiogenesis** | `VEGFA`, `FLT1`, `HIF1A` | Identifies oxygen-deprived or highly vascularized tumor cores. |
59+
| **Immune Infiltration** | `CD8A`, `GZMB`, `IFNG` | Maps regions of active anti-tumor immune response. |
60+
| **Stromal / EMT** | `VIM`, `SNAI1`, `ZEB1` | Highlights desmoplastic stroma and invasion fronts. |
61+
| **Proliferation** | `MKI67`, `PCNA`, `MYC` | Pinpoints highly active, dividing cell populations. |
62+
63+
By supplying these functional groupings via `--custom-gmt`, the model's MTL process explicitly aligns its spatial interaction tokens to monitor these exact states across any whole-slide image in the HEST1k dataset.
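For readers unfamiliar with the format, a `.gmt` file holds one gene set per line as tab-separated fields: set name, description, then the member genes. The sketch below parses such text and builds a binary membership matrix against a gene list. It is a standalone illustration, not the loader in `pathways.py`; the set names, the `custom` description, and the function names are hypothetical, and the genes are taken from the table above.

```python
def parse_gmt(text):
    """Parse GMT text into {set_name: [genes]}, skipping malformed lines."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # need at least name, description, and one gene
        gene_sets[fields[0]] = [g for g in fields[2:] if g]
    return gene_sets


def membership_matrix(gene_sets, gene_names):
    """Return (set_names, M) where M[k][g] = 1 if gene g belongs to set k."""
    index = {gene: i for i, gene in enumerate(gene_names)}
    names = list(gene_sets)
    matrix = [[0] * len(gene_names) for _ in names]
    for k, name in enumerate(names):
        for gene in gene_sets[name]:
            if gene in index:  # genes missing from the model's gene list are ignored
                matrix[k][index[gene]] = 1
    return names, matrix


EXAMPLE_GMT = (
    "HYPOXIA_ANGIOGENESIS\tcustom\tVEGFA\tFLT1\tHIF1A\n"
    "IMMUNE_INFILTRATION\tcustom\tCD8A\tGZMB\tIFNG\n"
)

names, M = membership_matrix(parse_gmt(EXAMPLE_GMT), ["VEGFA", "CD8A", "MYC", "GZMB"])
```

Genes listed in the `.gmt` file but absent from the model's gene list simply drop out of the matrix, so coverage depends on how well the custom sets overlap the genes being predicted.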
5864

5965
## 4. Implementation Status
6066

6167
### Implemented
6268

6369
- **MSigDB Hallmarks Initialization** (`--pathway-init` flag): Downloads the GMT file, matches genes against `global_genes.json`, and initializes `gene_reconstructor.weight` with the binary membership matrix. See [`pathways.py`](../src/spatial_transcript_former/data/pathways.py).
64-
- 50 Hallmark pathways (fixed when using `--pathway-init`)
65-
- ~54% gene coverage (542/1000 genes mapped to at least one pathway)
66-
- GMT file cached in `.cache/` after first download
70+
- 50 Hallmark pathways (the default when using `--pathway-init` without `--custom-gmt`).
71+
- GMT file cached in `.cache/` after first download.
72+
- **Custom Pathway Definitions** (`--custom-gmt` flag): Users can override the default Hallmarks by providing a URL or local path to a `.gmt` file, enabling custom database integrations (e.g., KEGG, Reactome, or highly specific tissue masks).
6773

6874
- **Sparsity Regularization** (`--sparsity-lambda` flag): L1 penalty on `gene_reconstructor` weights to encourage pathway-like groupings when using data-driven (random) initialization.
6975

@@ -79,8 +85,9 @@ python -m spatial_transcript_former.train \
7985
--model interaction --num-pathways 50 --sparsity-lambda 0.01 ...
8086
```
8187

88+
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology using `stf-predict`. See the [README](../README.md) for inference instructions.
89+
8290
### Future Work
8391

84-
- **KEGG/Reactome**: More granular pathway databases for finer-grained analysis.
85-
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs.
86-
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology.
92+
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs without architectural bottlenecks.
93+
- **End-to-End Risk Assessment Module**: Developing a downstream prediction system that takes the spatially-resolved pathway activations and gene expressions derived from the model and maps them directly to clinical risk and survival outcomes.
