feat: integrate auxiliary pathway loss and sparsity regularization
Implement key mathematical and architectural improvements to the spatial
transcriptomics pipeline, focusing on pathway-driven interpretability.
Core Changes:
- Add Auxiliary Pathway Loss to ground the model's latent space in biological
priors using MSigDB pathway membership markers.
- Introduce L1 sparsity regularization (--sparsity-lambda) on reconstruction
weights to prune redundant pathway-to-gene mappings.
- Implement --log-transform data strategy (log1p) to stabilize MSE against
high-variance housekeeping gene expression.
Architecture & Refactoring:
- Update dataloader and train/predict scripts to support auxiliary objectives
and log-transformed targets.
- Refactor visualization modules to improve histology-pathway overlays.
- Remove deprecated Bowel Cancer download script.
Documentation & Compliance:
- Formally document auxiliary loss and training objectives in the IP Statement.
- Expand model documentation and training guides to reflect bottleneck
transitions and latent discovery capabilities.
- Add Future Directions and Clinical Collaborations to project README.
- Resolve markdownlint error in LICENSE by adding a top-level H1 heading.
Testing:
- Add unit tests for pathway pruning logic and auxiliary loss calculation.
- Strengthen visualization tests to ensure coordinate alignment integrity.
CONTRIBUTING.md
+13 -8 (13 additions & 8 deletions)
@@ -35,18 +35,23 @@ We use `black` for formatting and `flake8` for linting. Please ensure your code
35
35
36
36
```bash
37
37
black .
38
-
flake8 src/
39
38
```
40
39
41
-
### 3. Testing
40
+
### 4. AI-Assisted Development
42
41
43
-
All new features must include unit tests in the `tests/` directory. We use `pytest` for our test suite.
42
+
We welcome contributions developed with the assistance of AI tools (e.g., Copilot, ChatGPT, Claude, or agentic frameworks). However, to ensure the long-term maintainability and integrity of the project:
44
43
45
-
```bash
46
-
# Run all tests
47
-
.\test.ps1 # Windows
48
-
bash test.sh # Linux
49
-
```
44
+
- **Ownership**: You are ultimately responsible for the code you submit. Do not commit code you do not fully understand.
45
+
- **Explainability**: During the review process, you must be able to explain the logic, design decisions, and any subtle side effects of the AI-suggested changes.
46
+
- **Verification**: AI-generated code must strictly follow our coding standards, naming conventions, and architectural patterns. It must be accompanied by robust tests (see our [Testing Guide](docs/TESTING.md)).
47
+
48
+
### 3. Testing & Quality Assurance
49
+
50
+
All new features must be accompanied by relevant tests in the `tests/` directory, written with `pytest`.
51
+
52
+
We highly encourage rigorous testing approaches such as **Mutation Testing** (via `cosmic-ray`) for critical model components, so that weak tests are exposed by surviving mutants.
53
+
54
+
For full details on our testing requirements, how to run the test suites locally, and our guidelines on mutation testing, please read the [Testing Guide](docs/TESTING.md).
README.md
+22 -5 (22 additions & 5 deletions)
@@ -3,13 +3,14 @@
3
3
> [!WARNING]
4
4
> **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.
5
5
6
-
A transformer-based model for spatial transcriptomics that bridges histology and biological pathways.
6
+
**SpatialTranscriptFormer** bridges histology and biological pathways through a high-performance transformer architecture. By modeling the dense interplay between morphological features and gene expression signatures, it provides an interpretable and spatially-coherent mapping of the tissue microenvironment.
7
7
8
8
## Key Features
9
9
10
10
- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
- **Spatial Pattern Coherence**: Optimized using a composite **MSE + PCC (Pearson Correlation) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
13
+
- **Foundation Model Ready**: Native support for **CTransPath**, **Phikon**, **Hibou**, and **GigaPath**.
13
14
- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
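
As a rough illustration of the Quad-Flow flags listed above, the sketch below builds a boolean attention mask over a combined token sequence. It assumes tokens are ordered with pathway tokens first and histology patches second, and that `p2h` means a pathway query attending to a histology key. Both conventions, and the helper name `quadrant_mask`, are assumptions made for this example rather than the project's actual API.

```python
def quadrant_mask(n_pathways, n_patches, enabled=("p2p", "p2h", "h2p", "h2h")):
    """Return an n x n boolean mask; True means the query may attend to the key.

    Assumed layout: indices [0, n_pathways) are pathway tokens ("p"),
    the remaining indices are histology patch tokens ("h").
    """
    n = n_pathways + n_patches

    def kind(index):
        return "p" if index < n_pathways else "h"

    return [
        [f"{kind(query)}2{kind(key)}" in enabled for key in range(n)]
        for query in range(n)
    ]


# Example: disable patch-to-patch attention (the h2h quadrant).
mask = quadrant_mask(n_pathways=2, n_patches=3, enabled=("p2p", "p2h", "h2p"))
```

Dropping `h2h` removes the largest quadrant when patches far outnumber pathways, which is where most of the attention cost would otherwise sit.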
14
15
15
16
## License
@@ -35,15 +36,24 @@ This project requires [Conda](https://docs.conda.io/en/latest/).
35
36
36
37
## Usage
37
38
38
-
### Download HEST Data
39
+
### Dataset Access
39
40
40
-
Download specific subsets using filters or patterns:
41
+
The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:
41
42
42
43
```bash
43
-
# Download only the Bowel Cancer subset (including ST data and WSIs)
44
-
stf-download --organ Bowel --disease Cancer --local_dir hest_data
44
+
# List available filtering options
45
+
stf-download --list-options
46
+
47
+
# Download a specific subset (e.g., Breast Cancer samples from Visium)
48
+
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
56
+
47
57
### Train Models
48
58
49
59
We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.
@@ -95,6 +105,13 @@ Visualization plots will be saved to the `./results` directory.
95
105
.\test.ps1
96
106
```
97
107
108
+
## Future Directions & Clinical Collaborations
109
+
110
+
A major future direction for **SpatialTranscriptFormer** is to integrate this architecture into an **end-to-end pipeline for patient risk assessment** and prognosis tracking. By leveraging the model's predicted expression and pathway activations, we aim to build a downstream risk prediction module that allows users to directly evaluate how spatially-resolved expression relates to patient survival.
111
+
112
+
> [!NOTE]
113
+
> **Call for Collaborators:** Rigorous risk assessment models require vast datasets of clinical metadata and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest! If you have access to large clinical cohorts and are interested in exploring how spatial pathway activation correlates with patient prognosis, we would love to partner with you.
114
+
98
115
## Contributing
99
116
100
117
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on our coding standards and the process for submitting pull requests. Note that this project is under a proprietary license; contributions involve an assignment of rights for non-academic use.
docs/IP_STATEMENT.md
+12 -0 (12 additions & 0 deletions)
@@ -14,6 +14,18 @@ The primary innovation is the **multimodal bottleneck transformer** designed for
14
14
- **Quadrant-Based Interaction Masking**: The logic used to zero out specific attention quadrants (e.g., $A_{H \to H}$) to optimize memory while maintaining multimodal context.
15
15
- **Biologically-Informed Reconstruction Bottleneck**: The specific matrix decomposition approach where gene expression is reconstructed from a linear combination of pathway activations.
16
16
17
+
### Proposed Auxiliary Pathway Loss
18
+
19
+
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
- **Euclidean-Gated Attention**: The implementation of spatial distance-based masking ($M_{spatial}$) to constrain model focus to local morphological regions.
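
To make the idea of a distance-based mask concrete, here is a minimal sketch of one way $M_{spatial}$ could be built from spot coordinates. The hard radius threshold, the function name `spatial_mask`, and the `radius` parameter are all assumptions for illustration; the source does not specify how the gating is computed.

```python
import math


def spatial_mask(coords, radius):
    """Return a boolean matrix; True where two spots lie within `radius` of each other.

    `coords` is a list of (x, y) positions. A hard Euclidean threshold is
    assumed here purely for illustration.
    """
    return [
        [math.dist(a, b) <= radius for b in coords]
        for a in coords
    ]


# Example: three spots, where only the first two are close enough to interact.
mask = spatial_mask([(0, 0), (3, 4), (10, 0)], radius=5.0)
```

Each spot always attends to itself, since its distance to itself is zero.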
docs/MODELS.md
+2 -2 (2 additions & 2 deletions)
@@ -122,7 +122,7 @@ Together, these ensure the model learns *spatially-varying* pathway activation m
122
122
123
123
#### Frozen Backbone (Feature Extraction)
124
124
125
-
Pre-computed features from a pathology foundation model. The backbone is never fine-tuned.
125
+
Pre-computed features from a pathology foundation model. The backbone is currently never fine-tuned, though this may change.
126
126
127
127
| Backbone | Feature Dim | Source |
128
128
| :--- | :--- | :--- |
@@ -214,7 +214,7 @@ The Zero-Inflated Negative Binomial (ZINB) loss is designed for raw, highly disp
214
214
215
215
The model outputs these parameters, and the loss computes the negative log-likelihood of the ground truth counts given this distribution.
216
216
217
-
### Auxiliary Pathway Loss
217
+
### Proposed Auxiliary Pathway Loss
218
218
219
219
To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.
docs/PATHWAY_MAPPING.md
+34 -27 (34 additions & 27 deletions)
@@ -21,49 +21,55 @@ After the model makes predictions (N spots x G genes), we run a statistical test
21
21
- **Tool**: `gseapy` or a custom mapping script.
22
22
- **Use Case**: Generating a "Pathway Activation Map" from a trained model's output.
23
23
24
-
### B. Pathway Bottleneck (Model Architecture)
24
+
### B. Interaction Model via Multi-Task Learning (MTL)
25
25
26
-
The **SpatialTranscriptFormer** replaces the standard linear output head with a two-step projection that can be configured in two modes:
26
+
The **SpatialTranscriptFormer** interaction model inherently represents pathway activations as part of its attention mechanism and output process. Rather than a simple linear bottleneck, it utilizes learnable pathway tokens and Multi-Task Learning (MTL).
27
27
28
-
#### 1. Informed Projection (Prior Knowledge)
28
+
#### 1. Informed Supervision via Auxiliary Loss
29
29
30
-
In this mode, the **Gene Reconstruction Matrix** $\mathbf{W}_{recon}$ is guided by established biological databases (MSigDB, KEGG).
30
+
In this mode, the network receives direct supervision on its pathway tokens, guided by established biological databases (e.g., MSigDB):
31
31
32
-
- **Implementation**: $\mathbf{W}_{recon}$ is initialized as a binary mask $M \in \{0, 1\}^{G \times P}$ where $M_{gk} = 1$ if gene $g$ belongs to pathway $k$.
33
-
- **Benefit**: Predictions are guaranteed to be linear combinations of known biological processes, making them instantly interpretable by clinicians.
32
+
- **Architecture Flow**:
33
+
1. **Interaction**: Learnable pathway tokens $P$ interact with Histology patch features $H$ via self-attention (e.g., $p2h$, $h2p$).
34
+
2. **Activation**: Pathway scores $S \in \mathbb{R}^P$ are computed using a learnable temperature-scaled cosine similarity between the pathway tokens and image patch tokens.
35
+
3. **Gene Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$, where $\mathbf{W}_{recon}$ is initialized using the binary pathway membership matrix $M$.
36
+
- **MTL Auxiliary Loss**: To prevent standard bottleneck collapse, an explicit auxiliary loss bridges the spatial representations directly to biological data. The pathway scores $S$ are supervised against a pathway ground truth ($Y_{genes} \cdot M^T$) using a Pearson Correlation Coefficient (PCC) loss.
- **Benefit**: The model is forced to explicitly align its internal interaction tokens with concrete biological pathways, granting direct interpretability.
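
The reconstruction and auxiliary-loss steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's `AuxiliaryPathwayLoss`: it takes $M$ as a $P \times G$ binary matrix (so that $Y_{genes} \cdot M^T$ has one column per pathway), computes the Pearson correlation across spots separately for each pathway, and uses `1 - mean(r)` as the loss. All function names here are hypothetical.

```python
import math


def pathway_targets(y_genes, membership):
    """Compute Y_genes . M^T: an N x P matrix of summed member-gene expression.

    y_genes: N x G expression values; membership: P x G binary matrix (assumed shape).
    """
    return [
        [sum(y * m for y, m in zip(spot, pathway)) for pathway in membership]
        for spot in y_genes
    ]


def reconstruct(scores, w_recon, bias):
    """Gene reconstruction for one spot: y_hat = S . W_recon + b (W_recon is P x G)."""
    return [
        sum(s * row[g] for s, row in zip(scores, w_recon)) + bias[g]
        for g in range(len(bias))
    ]


def pearson(a, b, eps=1e-8):
    """Pearson correlation of two equal-length sequences, with eps for stability."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (math.sqrt(var_a * var_b) + eps)


def pcc_loss(scores, targets):
    """1 - mean Pearson r over pathways, correlating across spots (an assumption)."""
    n_pathways = len(targets[0])
    total = sum(
        pearson([s[k] for s in scores], [t[k] for t in targets])
        for k in range(n_pathways)
    )
    return 1.0 - total / n_pathways
```

Because Pearson correlation is invariant to scale and offset, a loss of this form rewards pathway scores that follow the spatial pattern of the targets without forcing them to match the targets' magnitude.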
34
39
35
-
#### 2. Data-Driven Projection (Latent Discovery)
40
+
#### 2. Data-Driven Discovery (Latent Projection)
36
41
37
-
In this mode, the model learns its own "latent pathways" based on morphological patterns.
42
+
In the absence of a biological prior, the model can learn its own "latent pathways".
38
43
39
-
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and learned via backpropagation.
40
-
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" gene sets: $L_{total} = L_{MSE} + \lambda\|\mathbf{W}_{recon}\|_1$.
44
+
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and the auxiliary pathway loss is disabled.
45
+
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" sparse gene sets: $L_{total} = L_{gene} + \lambda_{sparsity}\|\mathbf{W}_{recon}\|_1$.
41
46
- **Benefit**: Can discover novel spatial-transcriptomic relationships that aren't yet captured in curated databases.
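
The sparsity-regularized objective above amounts to adding the sum of absolute reconstruction weights, scaled by $\lambda_{sparsity}$, to the gene loss. A minimal sketch follows; the function names are hypothetical, and the weight matrix is a plain nested list rather than the model's actual parameter tensor.

```python
def l1_norm(w_recon):
    """Entry-wise L1 norm: the sum of absolute values of all weights."""
    return sum(abs(value) for row in w_recon for value in row)


def total_loss(gene_loss, w_recon, sparsity_lambda):
    """L_total = L_gene + lambda_sparsity * ||W_recon||_1."""
    return gene_loss + sparsity_lambda * l1_norm(w_recon)


# Example: a 2 x 2 weight matrix with one weight already at zero.
weights = [[0.5, -1.0], [0.0, 2.0]]
loss = total_loss(gene_loss=1.0, w_recon=weights, sparsity_lambda=0.1)
```

Weights that are exactly zero contribute nothing to the penalty, which is why L1 regularization tends to push small pathway-to-gene weights all the way to zero instead of merely shrinking them.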
42
47
43
-
- **Architecture Flow**:
44
-
1. **Interaction**: Pathway tokens $P$ query the Histology $H$.
45
-
2. **Activation**: A linear layer reduces $P_{tokens}$ to activation scores $S \in \mathbb{R}^P$.
46
-
3. **Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$.
48
+
## 3. Generalizing to HEST1k Tissues
49
+
50
+
The model supports any dataset within the HEST1k collection (e.g., Breast, Kidney, Lung, Colon). Instead of being bound to a single disease context, users can leverage the `--custom-gmt` flag to map genes to pathways relevant to their specific investigation.
47
51
48
-
## 3. Clinical Application in Bowel Cancer
52
+
### Example: Profiling the Tumor Microenvironment
49
53
50
-
For colorectal cancer, we should prioritize monitoring these specific pathways:
54
+
Regardless of the tissue of origin (e.g., Kidney versus Breast), researchers often track core functional states within the tumor microenvironment. A user might define a `.gmt` file to explicitly monitor:
By supplying these functional groupings via `--custom-gmt`, the model's MTL process explicitly aligns its spatial interaction tokens to monitor these exact states across any whole-slide image in the HEST1k dataset.
58
64
59
65
## 4. Implementation Status
60
66
61
67
### Implemented
62
68
63
69
- **MSigDB Hallmarks Initialization** (`--pathway-init` flag): Downloads the GMT file, matches genes against `global_genes.json`, and initializes `gene_reconstructor.weight` with the binary membership matrix. See [`pathways.py`](../src/spatial_transcript_former/data/pathways.py).
64
-
- 50 Hallmark pathways (fixed when using `--pathway-init`)
65
-
- ~54% gene coverage (542/1000 genes mapped to at least one pathway)
66
-
- GMT file cached in `.cache/` after first download
70
+
- 50 Hallmark pathways (default fixed fallback when using `--pathway-init`).
71
+
- GMT file cached in `.cache/` after first download.
72
+
- **Custom Pathway Definitions** (`--custom-gmt` flag): Users can override the default Hallmarks by providing a URL or local path to a `.gmt` file, enabling custom database integrations (e.g., KEGG, Reactome, or highly specific tissue masks).
67
73
68
74
- **Sparsity Regularization** (`--sparsity-lambda` flag): L1 penalty on `gene_reconstructor` weights to encourage pathway-like groupings when using data-driven (random) initialization.
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology using `stf-predict`. See the [README](../README.md) for inference instructions.
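
For readers unfamiliar with the GMT format, the sketch below shows how a binary membership matrix and a gene-coverage figure could be derived from GMT lines. It relies only on the standard GMT layout (tab-separated: set name, description, then gene symbols). The function names are hypothetical and this is not the code in `pathways.py`; the pathway and gene names in the example are made up for illustration.

```python
def parse_gmt(lines):
    """Parse GMT lines into {pathway_name: set_of_gene_symbols}."""
    pathways = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        pathways[name] = {gene for gene in genes if gene}
    return pathways


def membership_matrix(pathways, gene_names):
    """Return (sorted pathway names, P x G binary matrix) for the given gene order."""
    names = sorted(pathways)
    matrix = [
        [1 if gene in pathways[name] else 0 for gene in gene_names]
        for name in names
    ]
    return names, matrix


def gene_coverage(matrix):
    """Fraction of genes that belong to at least one pathway."""
    n_genes = len(matrix[0])
    covered = sum(1 for g in range(n_genes) if any(row[g] for row in matrix))
    return covered / n_genes


# Example with two toy gene sets and a three-gene vocabulary.
toy_gmt = ["SET_A\tdescription\tTP53\tMYC", "SET_B\tdescription\tMYC\tEGFR"]
names, matrix = membership_matrix(parse_gmt(toy_gmt), ["TP53", "MYC", "GAPDH"])
```

Genes in the model vocabulary that appear in no pathway (here `GAPDH`) get an all-zero column, which is what a coverage figure below 100% reflects.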
89
+
82
90
### Future Work
83
91
84
-
- **KEGG/Reactome**: More granular pathway databases for finer-grained analysis.
85
-
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs.
86
-
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology.
92
+
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs without architectural bottlenecks.
93
+
- **End-to-End Risk Assessment Module**: Developing a downstream prediction system that takes the spatially-resolved pathway activations and gene expressions derived from the model and maps them directly to clinical risk and survival outcomes.