Merged
17 changes: 17 additions & 0 deletions .github/CODEOWNERS
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Global owners (catch-all for all files)
* @BenjaminIsaac0111

# Core Model and Training Logic
src/spatial_transcript_former/models/ @BenjaminIsaac0111
src/spatial_transcript_former/training/ @BenjaminIsaac0111

# Data Management and Scripts
src/spatial_transcript_former/data/ @BenjaminIsaac0111
scripts/ @BenjaminIsaac0111

# Documentation
docs/ @BenjaminIsaac0111
*.md @BenjaminIsaac0111

# GitHub Actions and Infrastructure
.github/ @BenjaminIsaac0111
69 changes: 69 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,69 @@
# Contributing to SpatialTranscriptFormer

Thank you for your interest in contributing! This project sits at the intersection of deep learning and pathology, so we value rigorous, well-tested contributions.

## Project Status

> [!IMPORTANT]
> This project is a **Work in Progress**. We are actively refining the core interaction logic and scaling behaviors. Expect breaking changes in the CLI and data schemas.

## Intellectual Property & Licensing

SpatialTranscriptFormer is protected under a **Proprietary Source Code License**.

- **Academic/Non-Profit**: We encourage contributions from the research community. Contributions made under an academic affiliation are generally welcome.
- **Commercial/For-Profit**: Contributions from commercial entities or individuals intended for profit-seeking use require a separate agreement.
- **Assignment**: By submitting a Pull Request, you agree that your contributions will be licensed under the project's existing license, granting the author the right to include them in both the open-access and proprietary versions of the software.

## Development Workflow

### 1. Environment Setup

Use the provided setup scripts to ensure a consistent development environment:

```bash
# Windows
.\setup.ps1

# Linux/HPC
bash setup.sh
```

### 2. Coding Standards

We use `black` for formatting and `flake8` for linting. Please ensure your code passes these checks before submitting.

```bash
black .
flake8 src/
```

### 3. Testing

All new features must include unit tests in the `tests/` directory. We use `pytest` for our test suite.

```bash
# Run all tests
.\test.ps1 # Windows
bash test.sh # Linux
```
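
As an illustration of the testing requirement above, a unit test might look like the following minimal sketch. The helper `log1p_normalize` is hypothetical and is not part of this project; it stands in for whatever function a contribution adds. In a real contribution the test functions would live in a file under `tests/` and be collected by `pytest`.

```python
import math


def log1p_normalize(counts):
    """Hypothetical helper (not from the project): apply log(1 + x) to each count."""
    return [math.log1p(c) for c in counts]


# pytest collects any function whose name starts with `test_`.
def test_zero_counts_map_to_zero():
    # log(1 + 0) is exactly 0, so all-zero input stays all-zero.
    assert log1p_normalize([0, 0, 0]) == [0.0, 0.0, 0.0]


def test_order_is_preserved():
    # log1p is monotonic, so increasing input must give increasing output.
    out = log1p_normalize([1, 10, 100])
    assert out == sorted(out)
```

Each test checks a single property, which makes it easier to see what broke when a test fails.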

## Pull Request Process

1. **Open an Issue**: For major changes, please open an issue first to discuss the design.
2. **Branching**: Work on a descriptive feature branch (e.g., `feature/pathway-attention-mask`).
3. **Documentation**: Update relevant files in `docs/` and the `README.md` if your change affects usage.
4. **Verification**: Ensure all CI checks (GitHub Actions) pass.

### Branch Protections

To maintain code quality and stability, the following protections are enforced on the `main` branch:

- **Require Pull Request Reviews**: All merges to `main` require at least one approval from a project maintainer.
- **Required Status Checks**: The `CI` workflow must pass successfully before a PR can be merged. This includes formatting checks (`black`) and the full test suite (`pytest`).
- **No Direct Pushes**: Pushing directly to `main` is disabled. All changes must go through the Pull Request process.
- **Linear History**: We prefer **Squash and Merge** to keep the `main` branch history clean and concise.

## Contact

For questions regarding commercial licensing or complex architectural changes, please contact the author directly.
69 changes: 37 additions & 32 deletions README.md
@@ -1,6 +1,16 @@
# SpatialTranscriptFormer

A transformer-based model for spatial transcriptomics.
> [!WARNING]
> **Work in Progress**: This project is under active development. Core architectures, CLI flags, and data formats are subject to major changes.

A transformer-based model for spatial transcriptomics that bridges histology and biological pathways.

## Key Features

- **Quad-Flow Interaction**: Configurable attention between Pathways and Histology patches (`p2p`, `p2h`, `h2p`, `h2h`).
- **Pathway Bottleneck**: Interpretable gene expression prediction via 50 MSigDB Hallmark tokens.
- **Spatial Pattern Coherence**: Trained with a composite **MSE + PCC (Pearson correlation coefficient) loss** to prevent spatial collapse and ensure accurate morphology-expression mapping.
- **Biologically Informed Initialization**: Gene reconstruction weights derived from known hallmark memberships.
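
To make the composite MSE + PCC loss in the list above concrete, here is a minimal framework-free sketch for a single gene across spots. The exact form used by the project is not stated here. The `1 - PCC` penalty, the `pcc_weight` parameter, and the function names are assumptions for illustration only.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pearson(pred, target, eps=1e-8):
    """Pearson correlation; eps keeps the ratio finite for constant inputs."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / (math.sqrt(var_p * var_t) + eps)


def composite_loss(pred, target, pcc_weight=1.0):
    """Assumed form: MSE + pcc_weight * (1 - PCC)."""
    return mse(pred, target) + pcc_weight * (1.0 - pearson(pred, target))
```

Under this assumed form, a spatially constant prediction has zero correlation with the target and therefore pays the full `pcc_weight` penalty on top of its MSE. That is one way such a term can discourage collapsed outputs.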

## License

@@ -25,71 +35,66 @@ This project requires [Conda](https://docs.conda.io/en/latest/).

## Usage

After installation, the following command-line tools are available in your `SpatialTranscriptFormer` environment:

### Download HEST Data

Download specific subsets using filters or patterns:

```bash
# List available organs
stf-download --list_organs

# Download only the Bowel Cancer subset (including ST data and WSIs)
stf-download --organ Bowel --disease Cancer --local_dir hest_data

# Download any other organ
stf-download --organ Kidney
```

### Split Dataset
### Train Models

We provide presets for baseline models and scaled versions of the SpatialTranscriptFormer.

Perform patient-stratified splitting on the metadata:
```bash
# Recommended: Run the Interaction model with 4 transformer layers
python scripts/run_preset.py --preset stf_interaction_l4

```powershell
stf-split HEST_v1_3_0.csv --val_ratio 0.2
# Run the lightweight 2-layer version
python scripts/run_preset.py --preset stf_interaction_l2

# Run baselines
python scripts/run_preset.py --preset he2rna_baseline
```

### Train Models
For a complete list of configurations, see the [Training Guide](docs/TRAINING_GUIDE.md).

Train baseline models (HE2RNA, ViT) or the proposed interaction model. For a complete list of configurations and examples, see the [Training Guide](docs/TRAINING_GUIDE.md).
### Real-Time Monitoring

```bash
# Option 1: Using the standard command
stf-train --data-dir A:\hest_data --model he2rna --epochs 20
Monitor training progress, loss curves, and **prediction variance (collapse detector)** via the web dashboard:

# Option 2: Using the preset launcher (recommended for complex models)
python scripts/run_preset.py --preset stf_interaction --epochs 30
```bash
python scripts/monitor.py --run-dir runs/stf_interaction_l4
```
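
The README does not describe how the dashboard computes prediction variance. As a rough sketch of the idea, one could average the per-gene variance of predictions across spots and flag a run when it falls below a threshold. The function names and the `min_variance` default below are hypothetical and are not taken from `scripts/monitor.py`.

```python
from statistics import pvariance


def mean_prediction_variance(predictions):
    """Average per-gene variance across spots.

    `predictions` is a list of per-spot rows, each row holding one value per gene.
    """
    per_gene = list(zip(*predictions))  # transpose to genes x spots
    return sum(pvariance(values) for values in per_gene) / len(per_gene)


def is_collapsed(predictions, min_variance=1e-4):
    """Hypothetical check: True when predictions barely vary across spots."""
    return mean_prediction_variance(predictions) < min_variance
```

A value near zero means every spot receives almost the same prediction, which is the failure mode a collapse detector is meant to surface.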

### Inference & Visualization

Generate spatial maps comparing Ground Truth vs Predictions for specific samples:
Generate spatial maps comparing Ground Truth vs Predictions:

```bash
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model_he2rna.pth --model-type he2rna
stf-predict --data-dir A:\hest_data --sample-id MEND29 --model-path checkpoints/best_model.pth --model-type interaction
```

Visualization plots will be saved to the `./results` directory.

## Documentation

For detailed information on the data and code implementation, see:

- [Models](docs/MODELS.md): Detailed model architectures and scaling parameters.
- [Data Structure](docs/DATA_STRUCTURE.md): Organization of HEST data on disk.
- [Dataloader](docs/DATALOADER.md): Technical implementation of the PyTorch dataset and loaders.
- [Gene Analysis](docs/GENE_ANALYSIS.md): Analysis of available genes and modeling strategies.
- [Pathway Mapping](docs/PATHWAY_MAPPING.md): Strategies for clinical interpretability and pathway integration.
- [Latent Discovery](docs/LATENT_DISCOVERY.md): Unsupervised discovery of biological pathways from data.
- [Models](docs/MODELS.md): Model architectures and literature references.
- [Pathway Mapping](docs/PATHWAY_MAPPING.md): Clinical interpretability and pathway integration.
- [Gene Analysis](docs/GENE_ANALYSIS.md): Modeling strategies for high-dimensional gene space.

## Development

### Running Tests

Use the included test wrapper:

```bash
# Run all tests
# Run all tests (Pytest wrapper)
.\test.ps1
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for details on our coding standards and the process for submitting pull requests. Note that this project is under a proprietary license; contributions involve an assignment of rights for non-academic use.
5 changes: 1 addition & 4 deletions config.yaml
@@ -4,15 +4,12 @@
# Data Paths
# Candidates for the HEST data directory (checked in order)
data_dirs:
- "hest_data"
- "../hest_data"
- "./data"
- "A:\\hest_data"

# Training Defaults
training:
num_genes: 1000
batch_size: 32
batch_size: 8
learning_rate: 0.0001
output_dir: "./checkpoints"
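
The comment on `data_dirs` says the candidates are checked in order. A minimal sketch of that lookup, assuming the first existing directory wins, could look like this. The function name `resolve_data_dir` and the `base` parameter are illustrative and are not taken from the project's code.

```python
from pathlib import Path
from typing import Iterable, Optional


def resolve_data_dir(candidates: Iterable[str],
                     base: Optional[Path] = None) -> Optional[Path]:
    """Return the first candidate that exists as a directory, or None."""
    base = base or Path.cwd()
    for candidate in candidates:
        path = Path(candidate)
        if not path.is_absolute():
            # Relative entries such as "../hest_data" are resolved against base.
            path = base / path
        if path.is_dir():
            return path
    return None
```

With this ordering, a machine-specific entry only takes effect when no earlier candidate exists, so the order of the list is itself part of the configuration.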
