Merged
21 changes: 13 additions & 8 deletions CONTRIBUTING.md
@@ -35,18 +35,23 @@ We use `black` for formatting and `flake8` for linting. Please ensure your code

```bash
black .
flake8 src/
```

### 3. Testing
### 4. AI-Assisted Development

All new features must include unit tests in the `tests/` directory. We use `pytest` for our test suite.
We welcome contributions developed with the assistance of AI tools (e.g., Copilot, ChatGPT, Claude, or agentic frameworks). However, to ensure the long-term maintainability and integrity of the project:

```bash
# Run all tests
.\test.ps1 # Windows
bash test.sh # Linux
```
- **Ownership**: You are ultimately responsible for the code you submit. Do not commit code you do not fully understand.
- **Explainability**: During the review process, you must be able to explain the logic, design decisions, and any subtle side effects of the AI-suggested changes.
- **Verification**: AI-generated code must strictly follow our coding standards, naming conventions, and architectural patterns. It must be accompanied by robust tests (see our [Testing Guide](docs/TESTING.md)).

### 3. Testing & Quality Assurance

All new features must be accompanied by relevant tests in the `tests/` directory, written for `pytest`.

We strongly encourage rigorous testing approaches such as **Mutation Testing** (via `cosmic-ray`) for critical model components, so that the test suite detects injected mutants instead of letting them survive.

For full details on our testing requirements, how to run the test suites locally, and our guidelines on mutation testing, please read the [Testing Guide](docs/TESTING.md).

## Pull Request Process

105 changes: 0 additions & 105 deletions Download-BowelCancer.ps1

This file was deleted.

2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)
# PROPRIETARY SOURCE CODE LICENSE (NON-COMMERCIAL + NEGOTIATED COMMERCIAL)

Copyright (c) 2026 Benjamin Isaac Wilson. All rights reserved.

44 changes: 17 additions & 27 deletions README.md
@@ -36,40 +36,23 @@ This project requires [Conda](https://docs.conda.io/en/latest/).

## Usage

**Before running any commands**, you must activate the conda environment:
### Dataset Access

```bash
conda activate SpatialTranscriptFormer
```

### Download HEST Data

> [!CAUTION]
> **Authentication Required**: The HEST dataset is gated. You must accept the terms of use at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and authenticate with your Hugging Face account to download the data.

Please provide your token using ONE of the following methods before running the download tool:

1. **Persistent Login**: Run `huggingface-cli login` and paste your access token when prompted.
2. **Environment Variable**: Set the `HF_TOKEN` environment variable in your active terminal session.

Once authenticated, download specific subsets using filters or the entire dataset:
The model uses the **HEST1k** dataset. You can download specific subsets (by organ, technology, etc.) or the entire dataset using the `stf-download` utility:

```bash
# Option 1: Download the ENTIRE HEST dataset (requires confirmation)
stf-download --local_dir hest_data
# List available filtering options
stf-download --list-options

# Option 2: Download a specific subset (e.g., Bowel Cancer)
stf-download --organ Bowel --disease Cancer --local_dir hest_data
# Download a specific subset (e.g., Breast Cancer samples from Visium)
stf-download --organ Breast --disease Cancer --tech Visium --local_dir hest_data

# Option 3: Filter by technology (e.g., Visium)
stf-download --tech Visium --local_dir hest_data
# Download all human samples
stf-download --species "Homo sapiens" --local_dir hest_data
```

To see all available organs in the metadata:

```bash
stf-download --list_organs
```
> [!NOTE]
> The HEST dataset is gated on Hugging Face. Ensure you have accepted the terms at [MahmoodLab/hest](https://huggingface.co/datasets/MahmoodLab/hest) and are logged in via `huggingface-cli login`.
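
Before starting a large download, it can help to confirm that a Hugging Face token is actually visible to the tooling. The sketch below is not part of `stf-download`; `find_hf_token` is a hypothetical helper, and the token file location (`~/.cache/huggingface/token`, relocated by `HF_HOME`) is an assumption about `huggingface_hub` defaults rather than something this repository documents.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return a description of where a Hugging Face token was found, or None.

    Hypothetical helper: checks the HF_TOKEN environment variable first, then
    the token file that `huggingface-cli login` is assumed to write.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if env.get("HF_TOKEN", "").strip():
        return "HF_TOKEN environment variable"

    # Assumed default cache location; HF_HOME overrides it when set.
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return f"token file at {token_file}"
    return None


if __name__ == "__main__":
    source = find_hf_token()
    if source:
        print(f"Token found via {source}")
    else:
        print("No token found; run `huggingface-cli login` first.")
```

A found token does not prove that the dataset terms were accepted; the gated download can still be refused until that is done on the dataset page.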

### Train Models

@@ -122,6 +105,13 @@ Visualization plots will be saved to the `./results` directory.
.\test.ps1
```

## Future Directions & Clinical Collaborations

A major future direction for **SpatialTranscriptFormer** is to integrate this architecture into an **end-to-end pipeline for patient risk assessment** and prognosis tracking. By leveraging the model's predicted expression and pathway activations, we aim to build a downstream risk prediction module that allows users to directly evaluate how spatially-resolved expression relates to patient survival.

> [!NOTE]
> **Call for Collaborators:** Rigorous risk assessment models require vast datasets of clinical metadata and survival outcomes, which we currently lack access to. We are open to investigating *any* disease of interest! If you have access to large clinical cohorts and are interested in exploring how spatial pathway activation correlates with patient prognosis, we would love to partner with you.

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on our coding standards and the process for submitting pull requests. Note that this project is under a proprietary license; contributions involve an assignment of rights for non-academic use.
12 changes: 12 additions & 0 deletions docs/IP_STATEMENT.md
@@ -14,6 +14,18 @@ The primary innovation is the **multimodal bottleneck transformer** designed for
- **Quadrant-Based Interaction Masking**: The logic used to zero out specific attention quadrants (e.g., $A_{H \to H}$) to optimize memory while maintaining multimodal context.
- **Biologically-Informed Reconstruction Bottleneck**: The specific matrix decomposition approach where gene expression is reconstructed from a linear combination of pathway activations.

### Proposed Auxiliary Pathway Loss

To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.

The total objective becomes:
$$\mathcal{L} = \mathcal{L}_{gene} + \lambda_{aux} (1 - \text{PCC}(\text{pathway\_scores}, \text{target\_pathways}))$$

The `--log-transform` flag applies `log1p` to targets, mitigating the heavy-tailed gene expression distribution where housekeeping genes dominate MSE.
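
To illustrate why `log1p` helps, here is a small sketch in plain Python. The counts are made up for illustration (one housekeeping-like gene and three low-count genes) and `share_of_first` is a hypothetical helper, not project code. It measures how much of the summed squared error comes from the high-count gene before and after the transform.

```python
import math


def share_of_first(pred, target):
    """Fraction of the summed squared error contributed by the first gene."""
    errors = [(p - t) ** 2 for p, t in zip(pred, target)]
    return errors[0] / sum(errors)


# Hypothetical counts for a single spot; the first gene is highly expressed.
target = [5000.0, 3.0, 0.0, 12.0]
pred = [4000.0, 6.0, 1.0, 6.0]  # every gene is off, the first one by 20%

raw_share = share_of_first(pred, target)
log_share = share_of_first(
    [math.log1p(p) for p in pred],
    [math.log1p(t) for t in target],
)

print(f"share of error from gene 0, raw counts: {raw_share:.4f}")
print(f"share of error from gene 0, after log1p: {log_share:.4f}")
```

On raw counts the high-count gene accounts for almost the entire error, whereas after `log1p` the low-count genes contribute most of it, so they are no longer ignored by an MSE-style objective.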

With pathway sparsity regularisation, the full training objective is:
$$\mathcal{L} = \mathcal{L}_{task} + \lambda \|W_{recon}\|_1$$

## 2. Spatial Context Methodologies

- **Euclidean-Gated Attention**: The implementation of spatial distance-based masking ($M_{spatial}$) to constrain model focus to local morphological regions.
4 changes: 2 additions & 2 deletions docs/MODELS.md
@@ -122,7 +122,7 @@ Together, these ensure the model learns *spatially-varying* pathway activation m

#### Frozen Backbone (Feature Extraction)

Pre-computed features from a pathology foundation model. The backbone is never fine-tuned.
Pre-computed features from a pathology foundation model. The backbone is currently never fine-tuned, though this may change.

| Backbone | Feature Dim | Source |
| :--- | :--- | :--- |
@@ -214,7 +214,7 @@ The Zero-Inflated Negative Binomial (ZINB) loss is designed for raw, highly disp

The model outputs these parameters, and the loss computes the negative log-likelihood of the ground truth counts given this distribution.

### Auxiliary Pathway Loss
### Proposed Auxiliary Pathway Loss

To prevent bottleneck collapse and provide a direct gradient signal to the pathway tokens, we use the `AuxiliaryPathwayLoss`. This loss compares the model's internal pathway scores against "ground truth" pathway activations computed from the gene expression targets via MSigDB membership.

61 changes: 34 additions & 27 deletions docs/PATHWAY_MAPPING.md
@@ -21,49 +21,55 @@ After the model makes predictions (N spots x G genes), we run a statistical test
- **Tool**: `gseapy` or a custom mapping script.
- **Use Case**: Generating a "Pathway Activation Map" from a trained model's output.

### B. Pathway Bottleneck (Model Architecture)
### B. Interaction Model via Multi-Task Learning (MTL)

The **SpatialTranscriptFormer** replaces the standard linear output head with a two-step projection that can be configured in two modes:
The **SpatialTranscriptFormer** interaction model represents pathway activations explicitly, as part of both its attention mechanism and its output computation. Rather than a simple linear bottleneck, it uses learnable pathway tokens and Multi-Task Learning (MTL).

#### 1. Informed Projection (Prior Knowledge)
#### 1. Informed Supervision via Auxiliary Loss

In this mode, the **Gene Reconstruction Matrix** $\mathbf{W}_{recon}$ is guided by established biological databases (MSigDB, KEGG).
In this mode, the network receives direct supervision on its pathway tokens, guided by established biological databases (e.g., MSigDB):

- **Implementation**: $\mathbf{W}_{recon}$ is initialized as a binary mask $M \in \{0, 1\}^{G \times P}$ where $M_{gk} = 1$ if gene $g$ belongs to pathway $k$.
- **Benefit**: Predictions are guaranteed to be linear combinations of known biological processes, making them instantly interpretable by clinicians.
- **Architecture Flow**:
  1. **Interaction**: Learnable pathway tokens $P$ interact with Histology patch features $H$ via self-attention (e.g., $p2h$, $h2p$).
  2. **Activation**: Pathway scores $S \in \mathbb{R}^P$ are computed using a learnable temperature-scaled cosine similarity between the pathway tokens and image patch tokens.
  3. **Gene Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$, where $\mathbf{W}_{recon}$ is initialized using the binary pathway membership matrix $M$.
- **MTL Auxiliary Loss**: To prevent standard bottleneck collapse, an explicit auxiliary loss bridges the spatial representations directly to biological data. The pathway scores $S$ are supervised against a pathway ground truth ($Y_{genes} \cdot M^T$) using a Pearson Correlation Coefficient (PCC) loss.
$$L_{total} = L_{gene} + \lambda_{pathway} (1 - PCC(S, Y_{genes} \cdot M^T))$$
- **Benefit**: The model is forced to explicitly align its internal interaction tokens with concrete biological pathways, granting direct interpretability.
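
The auxiliary term above can be sketched in plain Python. This is a minimal illustration, not the project's `AuxiliaryPathwayLoss`: the function names are hypothetical, $M$ is taken to have shape $P \times G$ so that $Y_{genes} \cdot M^T$ is $N \times P$, and the PCC is assumed to be computed per pathway across spots and then averaged.

```python
import math


def pathway_targets(y_genes, membership):
    """Y_genes . M^T: sum each spot's expression over the genes of each pathway.

    y_genes: N spots x G genes; membership: P pathways x G genes (0/1 entries).
    """
    return [
        [sum(y * m for y, m in zip(spot, pathway)) for pathway in membership]
        for spot in y_genes
    ]


def pearson(a, b, eps=1e-8):
    """Pearson correlation of two equal-length sequences (eps avoids 0/0)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [x - mean_b for x in b]
    cov = sum(x * y for x, y in zip(da, db))
    norm = math.sqrt(sum(x * x for x in da)) * math.sqrt(sum(y * y for y in db))
    return cov / (norm + eps)


def auxiliary_pathway_loss(scores, targets):
    """Mean over pathways of 1 - PCC, correlating across spots (assumed reduction)."""
    num_pathways = len(targets[0])
    losses = []
    for k in range(num_pathways):
        s_k = [row[k] for row in scores]
        t_k = [row[k] for row in targets]
        losses.append(1.0 - pearson(s_k, t_k))
    return sum(losses) / num_pathways


# Toy data: 3 spots, 4 genes, 2 pathways (genes 0-1 and genes 2-3).
M = [[1, 1, 0, 0], [0, 0, 1, 1]]
Y = [[1.0, 2.0, 0.0, 1.0], [3.0, 1.0, 2.0, 0.0], [0.0, 0.0, 5.0, 4.0]]
T = pathway_targets(Y, M)

aligned = [[0.5 * t + 1.0 for t in row] for row in T]  # perfectly correlated scores
flipped = [[-t for t in row] for row in T]             # perfectly anti-correlated

loss_aligned = auxiliary_pathway_loss(aligned, T)
loss_flipped = auxiliary_pathway_loss(flipped, T)

lambda_pathway, l_gene = 0.5, 1.0  # illustrative values only
l_total = l_gene + lambda_pathway * loss_aligned
print(T, loss_aligned, loss_flipped, l_total)
```

Because PCC is invariant to scale and offset, the pathway scores only need to track the spatial pattern of the targets, not their magnitude: the loss is near 0 for the aligned scores and near 2 for the anti-correlated ones.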

#### 2. Data-Driven Projection (Latent Discovery)
#### 2. Data-Driven Discovery (Latent Projection)

In this mode, the model learns its own "latent pathways" based on morphological patterns.
In the absence of a biological prior, the model can learn its own "latent pathways".

- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and learned via backpropagation.
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" gene sets: $L_{total} = L_{MSE} + \lambda \|\mathbf{W}_{recon}\|_1$.
- **Implementation**: $\mathbf{W}_{recon}$ is randomly initialized and the auxiliary pathway loss is disabled.
- **Sparsity Constraint**: We apply an L1 penalty to force the model to identify "canonical" sparse gene sets: $L_{total} = L_{gene} + \lambda_{sparsity} \|\mathbf{W}_{recon}\|_1$.
- **Benefit**: Can discover novel spatial-transcriptomic relationships that aren't yet captured in curated databases.
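
As a rough sketch of the data-driven objective, the following plain-Python example computes $\hat{y} = S \cdot \mathbf{W}_{recon} + b$ for one spot and adds the L1 penalty. The function names are hypothetical, $\mathbf{W}_{recon}$ is assumed to have shape $P \times G$, and mean squared error stands in for $L_{gene}$, which the source does not pin down here.

```python
def reconstruct(scores, w_recon, bias):
    """y_hat = S . W_recon + b for one spot; W_recon has shape P x G."""
    num_genes = len(bias)
    return [
        sum(s * row[g] for s, row in zip(scores, w_recon)) + bias[g]
        for g in range(num_genes)
    ]


def l1_norm(w_recon):
    """Sum of absolute values of all reconstruction weights."""
    return sum(abs(w) for row in w_recon for w in row)


def total_loss(y_hat, y_true, w_recon, sparsity_lambda):
    """L_gene + lambda * ||W_recon||_1, with MSE as a stand-in for L_gene."""
    mse = sum((p - t) ** 2 for p, t in zip(y_hat, y_true)) / len(y_true)
    return mse + sparsity_lambda * l1_norm(w_recon)


# Toy example: 2 latent pathways, 3 genes.
scores = [2.0, 1.0]
w_recon = [[0.5, 0.0, -1.0], [0.0, 2.0, 0.0]]
bias = [0.1, 0.1, 0.1]

y_hat = reconstruct(scores, w_recon, bias)
penalty = l1_norm(w_recon)
loss = total_loss(y_hat, [1.1, 2.1, -1.9], w_recon, sparsity_lambda=0.01)
print(y_hat, penalty, loss)
```

With a perfect reconstruction the only remaining cost is the penalty itself, which is what pushes small weights toward exactly zero and leaves each latent pathway tied to a sparse gene set.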

- **Architecture Flow**:
  1. **Interaction**: Pathway tokens $P$ query the Histology $H$.
  2. **Activation**: A linear layer reduces $P_{tokens}$ to activation scores $S \in \mathbb{R}^P$.
  3. **Reconstruction**: $\hat{y} = S \cdot \mathbf{W}_{recon} + b$.
## 3. Generalizing to HEST1k Tissues

The model supports any dataset within the HEST1k collection (e.g., Breast, Kidney, Lung, Colon). Instead of being bound to a single disease context, users can leverage the `--custom-gmt` flag to map genes to pathways relevant to their specific investigation.

## 3. Clinical Application in Bowel Cancer
### Example: Profiling the Tumor Microenvironment

For colorectal cancer, we should prioritize monitoring these specific pathways:
Regardless of the tissue of origin (e.g., Kidney versus Breast), researchers often track core functional states within the tumor microenvironment. A user might define a `.gmt` file to explicitly monitor:

| Pathway | Clinically Relevant Genes | Clinical Significance |
| Pathway Concept | Hallmarks / Relevant Genes | Interpretive Value across Tissues |
| :--- | :--- | :--- |
| **Wnt Signaling** | `CTNNB1`, `MYC`, `AXIN2` | Common driver in CRC (APC mutations) |
| **MMR / DNA Repair** | `MLH1`, `MSH2`, `MSH6` | MSI vs MSS status (Immunotherapy response) |
| **EMT** | `SNAI1`, `VIM`, `ZEB1` | Tumor invasion and metastasis risk |
| **Angiogenesis** | `VEGFA`, `FLT1` | Potential for anti-angiogenic therapy |
| **Hypoxia & Angiogenesis** | `VEGFA`, `FLT1`, `HIF1A` | Identifies oxygen-deprived or highly vascularized tumor cores. |
| **Immune Infiltration** | `CD8A`, `GZMB`, `IFNG` | Maps regions of active anti-tumor immune response. |
| **Stromal / EMT** | `VIM`, `SNAI1`, `ZEB1` | Highlights desmoplastic stroma and invasion fronts. |
| **Proliferation** | `MKI67`, `PCNA`, `MYC` | Pinpoints highly active, dividing cell populations. |

By supplying these functional groupings via `--custom-gmt`, the model's MTL process explicitly aligns its spatial interaction tokens to monitor these exact states across any whole-slide image in the HEST1k dataset.

## 4. Implementation Status

### Implemented

- **MSigDB Hallmarks Initialization** (`--pathway-init` flag): Downloads the GMT file, matches genes against `global_genes.json`, and initializes `gene_reconstructor.weight` with the binary membership matrix. See [`pathways.py`](../src/spatial_transcript_former/data/pathways.py).
  - 50 Hallmark pathways (fixed when using `--pathway-init`)
  - ~54% gene coverage (542/1000 genes mapped to at least one pathway)
  - GMT file cached in `.cache/` after first download
  - 50 Hallmark pathways (default fixed fallback when using `--pathway-init`).
  - GMT file cached in `.cache/` after first download.
- **Custom Pathway Definitions** (`--custom-gmt` flag): Users can override the default Hallmarks by providing a URL or local path to a `.gmt` file, enabling custom database integrations (e.g., KEGG, Reactome, or highly specific tissue masks).

- **Sparsity Regularization** (`--sparsity-lambda` flag): L1 penalty on `gene_reconstructor` weights to encourage pathway-like groupings when using data-driven (random) initialization.

@@ -79,8 +85,9 @@ python -m spatial_transcript_former.train \
--model interaction --num-pathways 50 --sparsity-lambda 0.01 ...
```

- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology using `stf-predict`. See the [README](../README.md) for inference instructions.
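
The membership-matrix initialization listed above (used by `--pathway-init` and `--custom-gmt`) can be sketched as follows. This is an illustration only, not the code in `pathways.py`: `parse_gmt` and `membership_matrix` are hypothetical names, the small `gene_list` stands in for `global_genes.json`, and the GMT layout assumed is the usual tab-separated one (set name, description, then gene symbols).

```python
def parse_gmt(text):
    """Parse GMT text: each line is name<TAB>description<TAB>gene1<TAB>gene2..."""
    pathways = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        pathways[fields[0]] = [g for g in fields[2:] if g]
    return pathways


def membership_matrix(pathways, gene_list):
    """Binary P x G matrix: entry [k][g] is 1 if gene g belongs to pathway k."""
    index = {gene: i for i, gene in enumerate(gene_list)}
    names = list(pathways)
    matrix = [[0] * len(gene_list) for _ in names]
    for k, name in enumerate(names):
        for gene in pathways[name]:
            if gene in index:  # genes outside the model's gene list are ignored
                matrix[k][index[gene]] = 1
    return names, matrix


# Two custom gene sets built from the example table; names are illustrative.
gmt_text = (
    "HYPOXIA_ANGIOGENESIS\tcustom\tVEGFA\tFLT1\tHIF1A\n"
    "PROLIFERATION\tcustom\tMKI67\tPCNA\tMYC\n"
)
gene_list = ["VEGFA", "MYC", "CD8A", "HIF1A"]

names, matrix = membership_matrix(parse_gmt(gmt_text), gene_list)
covered = sum(1 for g in range(len(gene_list)) if any(row[g] for row in matrix))
print(names, matrix, f"coverage: {covered}/{len(gene_list)}")
```

Genes that appear in no pathway (here `CD8A`) get an all-zero column, which is why gene coverage is worth checking when supplying a custom `.gmt` file.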

### Future Work

- **KEGG/Reactome**: More granular pathway databases for finer-grained analysis.
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs.
- **Spatial Pathway Maps**: Visualize pathway activations as spatial heatmaps overlaid on histology.
- **Post-Hoc Enrichment**: `gseapy` integration for pathway activation maps from model outputs without architectural bottlenecks.
- **End-to-End Risk Assessment Module**: Developing a downstream prediction system that takes the spatially-resolved pathway activations and gene expressions derived from the model and maps them directly to clinical risk and survival outcomes.