Gene-aware multimodal foundation modeling for spatial omics and pathology reasoning.
SciCore-Omics is a gene-aware multimodal framework for joint reasoning over histology images, natural language, and transcriptomic profiles. Built on the MiniCPM-V stack, it introduces a dedicated gene branch that encodes expression profiles with Nicheformer, compresses gene representations through a Gene Q-Former, and injects the resulting embeddings into the language-model token space. The project is designed for spatial omics and pathology scenarios where molecular signals and visual morphology should be interpreted together rather than handled as isolated modalities.
- Gene-aware tri-modal foundation model. SciCore-Omics extends MiniCPM-V from image-text modeling to gene-image-text reasoning with an explicit transcriptomic input pathway.
- Dedicated gene representation bridge. The gene branch uses Nicheformer, a Gene Q-Former, and a Gene Projector to transform variable-length gene-expression signals into fixed-length LLM-compatible embeddings.
- Staged training pipeline. The repository provides gene bridge distillation, Swift-based CPT/SFT, and GSPO/PPO-style RL refinement as separate training stages.
- Practical release path. The project includes a live demo, a public Hugging Face Space, reproducible training entrypoints, and a clear path toward future weight release.
The model augments a MiniCPM-V style vision-language model with a transcriptomics pathway:
gene expression (.h5ad)
-> gene tokenizer
-> Nicheformer gene encoder
-> Gene Q-Former bridge
-> Gene Projector
-> <gene> span embeddings in the LLM token space
image
-> vision tower
-> resampler
-> <image> span embeddings in the LLM token space
text prompt
-> tokenizer
all modalities
-> merged input embeddings
-> MiniCPM-V / Qwen2 language model
This design allows the model to consume transcriptomic context either alone or together with histology images and text instructions, while preserving the standard autoregressive language-model interface. The gene signal is not merely converted into plain text; it is encoded as multimodal embeddings and inserted into the LLM token embedding sequence.
The project is organized around four main code areas:
| Path | Role |
|---|---|
| model/ | Core model and processor definitions for the gene-aware MiniCPM-V variant. |
| train-distill-gene/ | Gene bridge distillation utilities for training gene_qformer and gene_projector, plus weight injection into a full model directory. |
| train-swift-cpt-sft/ | ms-swift based CPT/SFT example scripts and custom registration for gene-aware MiniCPM-V workflows. |
| train-rl/ | GSPO/PPO-style score-guided reinforcement learning pipeline for multimodal optimization. |
| environment.yml | Conda environment specification for the research stack. |
If you are new to the codebase, the most useful reading order is: model/, then train-distill-gene/, then train-swift-cpt-sft/, then train-rl/.
- Online demo

  A live demo is available here:

  Hugging Face: SciCore-Omics Space

  This is the quickest way to inspect the current behavior while public weights are not yet released.

- Environment setup

  To use the model locally, first create the project environment from environment.yml:

      conda env create -f environment.yml
      conda activate OMICS

  The reference environment was developed on Linux with NVIDIA A800-SXM4-80GB GPUs. The flash-attn package can be sensitive to the local CUDA, PyTorch, and GPU setup, so it may need to be adjusted for a different machine.

- Release status

  | Item | Status |
  |---|---|
  | Demo | Available |
  | Training code | Available |
  | Model weights | TODO |
The heart of the repository lives in model/, where the multimodal model is defined.
| File | Purpose |
|---|---|
| model/configuration_minicpm.py | Defines MiniCPMVConfig, extending Qwen2Config with vision_config, slice_config, and gene_config. |
| model/configuration_nicheformer.py | Defines NicheformerConfig, the configuration object for the gene encoder. |
| model/modeling_nicheformer.py | Implements NicheformerModel, a transformer encoder over gene tokens. |
| model/gene_qformer_module.py | Implements GeneQFormerBiomedBERT, a learnable-query bridge that compresses variable-length gene token sequences into a fixed set of query tokens. |
| model/gene_projector_module.py | Projects Q-Former outputs from the bridge hidden size into the language-model embedding dimension. |
| model/modeling_minicpmv.py | Integrates the LLM, vision tower, resampler, Nicheformer, Gene Q-Former, and Gene Projector into one multimodal model. |
| model/processing_minicpmv.py | Implements the processor that packages text, image, and gene inputs into model-ready tensors. |
| model/gene_tokenizer/ | Gene-tokenization resources, tokenizer logic, vocabulary, and reference .h5ad assets used by the processor and training scripts. |
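The learnable-query bridge listed in the table can be illustrated with a small NumPy sketch. It shows only the core idea: a fixed set of query vectors cross-attends over a padded, variable-length gene sequence, so the output size does not depend on the input length. This is not the GeneQFormerBiomedBERT implementation. The function name `compress_with_queries`, the hidden size of 16, and the single attention step without projections are assumptions made for illustration; the query count of 32 matches the gene span length quoted for the RL pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_with_queries(gene_states, attention_mask, queries):
    """Cross-attend fixed queries over a padded gene sequence.

    gene_states:    (seq_len, d) contextual gene embeddings
    attention_mask: (seq_len,)   1 for real tokens, 0 for padding
    queries:        (num_queries, d) learnable query vectors
    returns:        (num_queries, d), independent of seq_len
    """
    d = queries.shape[-1]
    scores = queries @ gene_states.T / np.sqrt(d)      # (num_queries, seq_len)
    # Padding positions get a very low score so they receive zero weight.
    scores = np.where(attention_mask[None, :] == 1, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ gene_states

rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 16))                    # hypothetical sizes

# A short sequence with no padding.
short = compress_with_queries(
    rng.normal(size=(50, 16)), np.ones(50, dtype=int), queries)

# A longer sequence where only the first 120 positions are real tokens.
padded_states = rng.normal(size=(200, 16))
mask = np.zeros(200, dtype=int)
mask[:120] = 1
long = compress_with_queries(padded_states, mask, queries)

print(short.shape, long.shape)
```

Both calls return a 32-row array, which is what lets the downstream projector and language model work with a fixed-length gene span.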
At a high level, the repository uses the following sequence:
- Gene expression is tokenized into a gene-token sequence.
- NicheformerModel encodes that sequence into contextual gene embeddings.
- GeneQFormerBiomedBERT compresses those embeddings into a fixed number of query tokens.
- GeneProjector maps the bridge outputs into the hidden space of the MiniCPM-V language model.
- The projected embeddings are inserted into the language-model input stream at the positions corresponding to the textual placeholder token span for "<gene>".
The multimodal merge happens inside the MiniCPM-V modeling logic, where image features and gene features are both converted into embedding spans and then scattered into the final inputs_embeds sequence before language-model forward or generation.
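As a rough illustration of the merge step, the sketch below overwrites placeholder positions in a text embedding matrix with gene and image embeddings. It is a simplified NumPy stand-in, not the code in model/modeling_minicpmv.py. The function name `scatter_spans`, the half-open `(start, end)` bound format, and all sizes are assumptions.

```python
import numpy as np

def scatter_spans(inputs_embeds, span_embeds, bounds):
    """Overwrite placeholder positions with modality embeddings.

    inputs_embeds: (seq_len, hidden) text token embeddings
    span_embeds:   list of (span_len, hidden) arrays, one per span
    bounds:        list of (start, end) half-open index pairs
    """
    merged = inputs_embeds.copy()
    for embeds, (start, end) in zip(span_embeds, bounds):
        if end - start != embeds.shape[0]:
            raise ValueError("span length does not match embedding count")
        merged[start:end] = embeds
    return merged

hidden = 8
text_embeds = np.zeros((12, hidden))
gene_embeds = np.ones((4, hidden))          # stands in for the projector output
image_embeds = np.full((3, hidden), 2.0)    # stands in for the resampler output

merged = scatter_spans(
    text_embeds, [gene_embeds, image_embeds], [(1, 5), (7, 10)])
print(merged[:, 0])
```

The merged matrix keeps the original sequence length, so the language model sees an ordinary `inputs_embeds` tensor regardless of which modalities are present.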
SciCore-Omics uses a staged training design rather than a single monolithic training script. Gene bridge distillation first aligns transcriptomic representations with the language-model space; Swift-based CPT/SFT then adapts the multimodal model to instruction-following data; RL refinement further optimizes selected modules with score-guided rollouts.
The train-distill-gene/ directory isolates training for the gene bridge modules:
- gene_qformer
- gene_projector
- optionally gene_cls_head in the more complete training path
This stage is useful when the core multimodal model already exists but the gene branch needs better alignment with the language-model representation space.
There are three main scripts:
| File | Purpose |
|---|---|
| train-distill-gene/train_gene_bridge_distill.py | Simplest single-GPU bridge distillation. |
| train-distill-gene/train_gene_bridge_distill_ddp.py | Distributed version with cross-rank negatives. |
| train-distill-gene/train_gene_bridge_distill_real_processor.py | Preferred current training path using the real processor and reference-gene alignment. |
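The loss used by these scripts is not described here. The mention of cross-rank negatives suggests a contrastive objective, so the sketch below shows one common form, a symmetric InfoNCE loss, purely as an assumption. The function name `info_nce`, the temperature value, and the use of concatenation to stand in for a distributed all-gather are all illustrative.

```python
import numpy as np

def info_nce(gene_vecs, text_vecs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired gene/text vectors.

    Row i of gene_vecs is the positive for row i of text_vecs; every
    other row in the batch acts as a negative.
    """
    g = gene_vecs / np.linalg.norm(gene_vecs, axis=1, keepdims=True)
    t = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
    logits = g @ t.T / temperature                     # (batch, batch)

    def cross_entropy_diag(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
rank0 = rng.normal(size=(4, 16))   # embeddings computed on one worker
rank1 = rng.normal(size=(4, 16))   # embeddings computed on another worker

# Concatenation stands in for an all-gather: each sample now sees
# negatives from both workers instead of only its local batch.
gathered = np.concatenate([rank0, rank1])

aligned = info_nce(gathered, gathered)
shuffled = info_nce(gathered, np.roll(gathered, 1, axis=0))
print(aligned < shuffled)
```

Gathering across ranks enlarges the pool of negatives without increasing the per-GPU batch size, which is the usual motivation for this design.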
After distillation, train-distill-gene/inject_gene_bridge_weights.py copies the trained bridge weights into a full sharded model directory.
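A first step in any such injection is finding which shard files hold the bridge parameters. The sketch below does that from the `weight_map` of a Hugging Face style `model.safetensors.index.json`. It is not the logic of inject_gene_bridge_weights.py. The parameter names and the `gene_qformer.` / `gene_projector.` prefixes are assumptions, and the actual tensor copy is left out.

```python
import json
from collections import defaultdict

# Assumed parameter-name prefixes for the bridge modules.
BRIDGE_PREFIXES = ("gene_qformer.", "gene_projector.")

def shards_to_update(index, prefixes=BRIDGE_PREFIXES):
    """Group bridge parameter names by the shard file that stores them."""
    plan = defaultdict(list)
    for name, shard in index["weight_map"].items():
        if name.startswith(prefixes):
            plan[shard].append(name)
    return {shard: sorted(names) for shard, names in plan.items()}

# Toy index in the layout of `model.safetensors.index.json`;
# every parameter name below is hypothetical.
index = json.loads("""
{
  "metadata": {"total_size": 0},
  "weight_map": {
    "llm.model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "gene_qformer.query_tokens": "model-00002-of-00002.safetensors",
    "gene_projector.proj.weight": "model-00002-of-00002.safetensors",
    "gene_projector.proj.bias": "model-00002-of-00002.safetensors"
  }
}
""")

print(shards_to_update(index))
```

Only the shards returned by this plan would need to be rewritten; the remaining shard files can stay untouched.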
The train-swift-cpt-sft/ directory contains Swift-based CPT/SFT entrypoints for the gene-aware MiniCPM-V model. These scripts use the ms-swift framework directly through:
- swift pt
- swift sft

The gene-specific logic is injected through Swift's custom registration mechanism rather than by modifying the Swift framework itself. In other words, the training scripts call swift pt or swift sft, and pass the custom register file through --custom_register_path.
The custom registration file lives under:
train-swift-cpt-sft/register/my_register_qformer.py
This register file defines the minicpm_v2_6_gene model/template path for Swift and already contains the gene handling logic. It reads .h5ad gene inputs, tokenizes gene names, builds gene_input_ids, gene_attention_mask, and gene_bound, expands the <gene> placeholder into the Q-Former gene span, and exposes the resulting fields to the model batch. Because this logic is handled in the register/template layer, no extra changes to the ms-swift source code are required as long as the model forward path supports the gene fields.
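The field-building and placeholder-expansion steps can be sketched in plain Python. This is a simplified illustration, not the code in my_register_qformer.py. The helper names, the toy vocabulary, the padding id, and the half-open `(start, end)` layout of `gene_bound` are assumptions. The span length of 32 is the value quoted for the RL pipeline and is assumed to apply here as well.

```python
GENE_SPAN_LEN = 32  # assumed to match the 32-token gene span used in train-rl/

def build_gene_inputs(gene_names, vocab, max_len, pad_id=0):
    """Map gene names to ids, drop unknown genes, and pad to max_len."""
    ids = [vocab[g] for g in gene_names if g in vocab][:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask

def expand_gene_placeholder(input_ids, gene_token_id, span_len=GENE_SPAN_LEN):
    """Replace each single <gene> id with span_len copies and record bounds."""
    expanded, bounds = [], []
    for tok in input_ids:
        if tok == gene_token_id:
            start = len(expanded)
            expanded.extend([gene_token_id] * span_len)
            bounds.append((start, start + span_len))
        else:
            expanded.append(tok)
    return expanded, bounds

vocab = {"EPCAM": 5, "KRT8": 6, "PTPRC": 7}   # hypothetical gene vocabulary

gene_input_ids, gene_attention_mask = build_gene_inputs(
    ["EPCAM", "UNKNOWN1", "PTPRC"], vocab, max_len=4)

# 99 is a hypothetical id for the <gene> placeholder token.
input_ids, gene_bound = expand_gene_placeholder([11, 99, 12], gene_token_id=99)

print(gene_input_ids, gene_attention_mask, gene_bound, len(input_ids))
```

Recording the bounds at expansion time means the model forward pass can locate the gene span without searching the token sequence again.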
| File / Folder | Purpose |
|---|---|
| train-swift-cpt-sft/register/my_register_qformer.py | Swift custom register file for the gene-aware MiniCPM-V + Q-Former model. |
| train-swift-cpt-sft/script/cpt-example.sh | Example continued-training launcher using the custom register path. |
| train-swift-cpt-sft/script/sft-example.sh | Example SFT launcher using LoRA, gene/Q-Former target modules, and the custom register path. |
The train-rl/ directory contains a GSPO/PPO-style reinforcement learning pipeline for score-guided multimodal optimization. It separates rollout preparation, reference-model scoring, and distributed actor updates:
- gen_worker.py samples examples, builds candidate batches with GSPODataset, expands single-token <gene> placeholders into 32-token gene spans when needed, computes old-policy token log probabilities for fixed outputs, and uploads packed rollouts.
- ref_server.py runs a Flask reference server, restores packed image/gene/text tensors, computes reference-model token log probabilities, and queues batches for training.
- finetune_gspo.py runs the DDP training loop, pulls rollout batches from the reference server, dynamically enables trainable parameter groups according to modality, and optimizes a clipped GSPO/PPO-style objective with a KL penalty.
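To make the three sets of log probabilities concrete, here is a minimal sketch of a clipped objective with a KL penalty for a single sampled sequence. It assumes a GSPO-style, length-normalised sequence-level importance ratio and a non-negative per-token KL estimate against the reference model. The exact formulation in finetune_gspo.py may differ, and `clip_eps` and `kl_coef` are illustrative values, not the repository's settings.

```python
import math

def gspo_loss(new_logps, old_logps, ref_logps, advantage,
              clip_eps=0.2, kl_coef=0.04):
    """Loss for one sampled sequence given per-token log probabilities.

    new_logps: log probs under the policy being trained
    old_logps: log probs under the policy that produced the rollout
    ref_logps: log probs under the frozen reference model
    """
    n = len(new_logps)
    # Sequence-level importance ratio, normalised by sequence length.
    log_ratio = sum(a - b for a, b in zip(new_logps, old_logps)) / n
    ratio = math.exp(log_ratio)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    # Non-negative per-token KL estimate against the reference model.
    kl = sum(math.exp(r - p) - (r - p) - 1.0
             for p, r in zip(new_logps, ref_logps)) / n
    return -(surrogate - kl_coef * kl)

# When all three models agree, the ratio is 1 and the KL term is 0.
unchanged = gspo_loss([-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0], advantage=1.0)

# When the policy has moved far from the old policy, the ratio is clipped.
moved = gspo_loss([0.0, -1.0], [-1.0, -2.0], [-1.0, -2.0], advantage=1.0)

print(unchanged, moved)
```

Clipping bounds how much a single rollout can shift the policy, while the KL term keeps the updated model close to the reference model served by ref_server.py.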
The RL script freezes the full model by default and selectively trains the gene bridge, image resampler, and final LLM layer depending on whether the current rollout contains gene and/or image inputs.
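The modality-dependent selection can be sketched as a function from rollout contents to parameter-name prefixes. This is an illustration of the idea only. All parameter names are hypothetical, and the choice to always train the final LLM layer is an assumption, since the text does not say how that layer is gated.

```python
def trainable_prefixes(has_gene, has_image):
    """Pick parameter-name prefixes to unfreeze for one rollout batch."""
    # Hypothetical name of the final LLM layer; assumed to be always trained.
    prefixes = ["llm.model.layers.27."]
    if has_gene:
        prefixes += ["gene_qformer.", "gene_projector."]
    if has_image:
        prefixes += ["resampler."]
    return prefixes

def apply_trainable(param_names, prefixes):
    """Return {name: requires_grad} with everything frozen by default."""
    return {name: name.startswith(tuple(prefixes)) for name in param_names}

names = [
    "llm.model.layers.0.mlp.weight",
    "llm.model.layers.27.mlp.weight",
    "gene_qformer.query_tokens",
    "gene_projector.proj.weight",
    "resampler.proj",
]

# A rollout that contains gene input but no image.
flags = apply_trainable(names, trainable_prefixes(has_gene=True, has_image=False))
for name, trainable in flags.items():
    print(name, trainable)
```

In a PyTorch training loop the resulting flags would typically be written to each parameter's `requires_grad` attribute before the optimizer step.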
If your goal is:
- understand the architecture: start with model/
- align or improve the gene bridge: start with train-distill-gene/
- run CPT/SFT with Swift custom registration: start with train-swift-cpt-sft/
- run score-guided RL optimization: start with train-rl/
If you find SciCore-Omics useful for your research, please consider citing our work:
@article{xiao2026scicoreomics,
title={SciCore-Omics: a tri-modal foundation model unifying histology, spatial transcriptomics and language for spatial biology},
author={},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}

If you have questions, suggestions, or bug reports, please open an issue in this repository or contact us by email:
- Xinyu Xiao: xinyuxiao1@outlook.com
