This repo contains the implementation of Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling structural tokens with structural complexity.
Large language models (LLMs) are enabling reasoning over biomolecular structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric groundings requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning.
Note: GitHub Markdown does not reliably render PDFs inline; the PNG above is for inline preview and the PDF link is the high-resolution version.
| Resource | Link |
|---|---|
| Cuttlefish (full LLM) | zihaojing/Cuttlefish |
| Cuttlefish-Encoder (graph encoder) | zihaojing/Cuttlefish-Encoder |
| Cuttlefish-SFT-Data (instruction tuning data) | zihaojing/Cuttlefish-SFT-Data |
| Cuttlefish-Encoder-Data (encoder pretraining data) | zihaojing/Cuttlefish-Encoder-Data |
# Install dependencies
# pip install -r requirements.txt
from huggingface_hub import snapshot_download
from src.models.octopus import OctopusModel # project model wrapper
# Download the full Cuttlefish model
local_dir = snapshot_download(repo_id="zihaojing/Cuttlefish")
# Or download just the encoder
encoder_dir = snapshot_download(repo_id="zihaojing/Cuttlefish-Encoder")Prepare your input as a .parquet file (or .jsonl) following the unified schema below, then run:
cd /path/to/cuttlefish
source env.sh
# Edit configs/inference/octopus_8B_s3_v1_5.yaml to point to your checkpoint and input file
bash scripts/inference/run_inference.shOr programmatically:
from scripts.inference.run_inference_example import run_inference
results = run_inference(
checkpoint_path="zihaojing/Cuttlefish", # HF repo id or local path
input_file="your_data.parquet",
output_dir="outputs/results.jsonl",
)from datasets import load_dataset
# SFT instruction-tuning data (molecules, proteins)
sft = load_dataset("zihaojing/Cuttlefish-SFT-Data")
# Encoder pretraining data (molecules, proteins, DNA/RNA)
enc = load_dataset("zihaojing/Cuttlefish-Encoder-Data")- Modalities: protein (PDB/CIF), molecule (SMILES→3D), DNA, RNA
- Unified parquet data format across pretraining and instruction tuning
- Masked reconstruction pretraining for the graph encoder
- Multimodal SFT / benchmarking via a single entrypoint (
src/runner/sft.py) - Inference on parquet/JSONL (
src/runner/inference.py)
For deeper diagrams and additional details, see docs/README.md (this top-level README is meant to be runnable on its own).
python -m pip install -r requirements.txtSome provided scripts expect these env vars to be defined:
BASE_DIRDATA_DIRCHECKPOINT_DIR
Use the provided env.sh template to set paths and activate your environment (do not commit secrets):
cd /path/to/EntroAdap
source env.shpytest -qBoth pretraining and SFT use the same structural columns. The only SFT-only addition is messages.
| Field | Required | Description |
|---|---|---|
modality |
✓ | "molecule", "protein", "dna", "rna" |
node_feat |
✓ | Node/atom features ((N \times d)) |
pos |
✓ | 3D coordinates in Å ((N \times 3)) |
edge_index |
✓ | Spatial graph edges in COO ((2 \times E_s)) |
edge_feat_dist |
optional | Distances for spatial edges ((E_s \times 1)) |
chem_edge_index |
optional | Chemical-bond edges for molecules ((2 \times E_c)) |
chem_edge_feat_cat |
optional | Chemical-bond categorical features ((E_c \times 3)) |
For instruction tuning / evaluation, each sample includes a chat-style field:
messages = [
{"role": "system", "content": "..."},
{"role": "user", "content": "Question about <STRUCTURE> ..."},
{"role": "assistant", "content": "..."}
]<STRUCTURE> marks where structure embeddings are injected. The runner supports multiple structure tokens (see src/runner/sft.py / src/runner/inference.py): ["<STRUCTURE>", "<mol>", "<STRUCT>", "<DNA>"].
The docs use (and most configs assume) paths like:
data/encoder/... # encoder pretraining parquet(s)
data/sft/... # SFT parquet(s) with messages
data/benchmark/... # benchmark parquet(s)
See also: docs/README.md for modality-specific feature details and examples.
Entrypoint: src/runner/aa_encoder.py
Script wrapper: scripts/training/encoder/mlm.sh
Default config: configs/aa_encoder/all_modality.yaml
Run via the wrapper script:
cd /path/to/EntroAdap
source env.sh
bash scripts/training/encoder/mlm.sh configs/aa_encoder/all_modality.yamlOr run directly:
python src/runner/aa_encoder.py --config configs/aa_encoder/all_modality.yamlNotes
- The config supports multiple parquet files under
data.train_data_path. - For multi-modality mixtures, the trainer can enforce same-modality batches (
use_modality_sampler: true). - Runtime filtering is supported (
max_atoms,max_edges,skip_on_error).
Entrypoint: src/runner/sft.py (train and/or evaluate)
Script wrapper: scripts/training/ft/octopus.sh
Configs: configs/sft/octopus/*.yaml
Run the provided SFT script (it sequentially runs two configs by default):
cd /path/to/EntroAdap
source env.sh
bash scripts/training/ft/octopus.shRun a single config explicitly:
python src/runner/sft.py --config configs/sft/octopus/octopus_8B_s3_v3.yamlWhat src/runner/sft.py does
- Loads a dataset specified by config (
dataset_name, or train/eval parquet paths depending on config style) - Tokenizes chat prompts; in multimodal mode it uses
MultiModalDataCollator - Trains and/or evaluates depending on config flags (
do_train,do_eval) - Supports evaluation metrics controlled by
eval_metrics:"qa": multiple-choice QA metrics"text": text metrics (BLEU/ROUGE/METEOR)
Benchmarks are run through the same src/runner/sft.py entrypoint using configs under configs/benchmark/.
Configs live in configs/benchmark/octupus/ (e.g., qa.yaml, molgen.yaml, retro.yaml, ...).
Wrapper scripts:
scripts/benchmark/forward.sh→configs/benchmark/octupus/forward.yamlscripts/benchmark/molqa.sh→configs/benchmark/octupus/qa.yamlscripts/benchmark/retro.sh→configs/benchmark/octupus/retro.yamlscripts/benchmark/molgen.sh→configs/benchmark/octupus/molgen.yamlscripts/benchmark/mol.sh→ (currently points toconfigs/benchmark/disease.yaml, which is not present in this artifact; use a config underconfigs/benchmark/octupus/instead)
Example (QA benchmark):
cd /path/to/EntroAdap
source env.sh
bash scripts/benchmark/molqa.shscripts/benchmark/llm.sh runs:
configs/benchmark/llama3/qa.yaml
cd /path/to/EntroAdap
source env.sh
bash scripts/benchmark/llm.shEntrypoint: src/runner/inference.py
Wrapper: scripts/inference/run_inference.sh
Example config: configs/inference/octopus_8B_s3_v1_5.yaml
Run inference via the wrapper:
cd /path/to/EntroAdap
source env.sh
bash scripts/inference/run_inference.shKey config fields:
input_file:.parquet(unified format) or.jsonloctopus_checkpoint_path: checkpoint directoryoutput_dir: output JSONL file path (despite the name)- generation params:
max_new_tokens,temperature,top_p
There is also a programmatic example in scripts/inference/run_inference_example.py.
- Full technical docs:
docs/README.md - Pretraining runner:
src/runner/aa_encoder.py - SFT/benchmark runner:
src/runner/sft.py - Inference runner:
src/runner/inference.py - Data collators / samplers:
src/data_loader/aa_dataset.py(masked reconstruction)src/data_loader/octopus_collator.py(SFT + inference collators, modality-aware batching)
- Data builders (optional; not required if you already have parquet):
src/data_factory/molecule/src/data_factory/protein/src/data_factory/nacid/
The datasets used in this paper are released on HuggingFace:
- Cuttlefish-SFT-Data — instruction-tuning data covering molecules, proteins, DNA, and RNA across diverse reasoning tasks.
- Cuttlefish-Encoder-Data — masked-reconstruction pretraining data for the all-atom graph encoder.
Both follow the unified parquet schema described above.
- Do not commit secrets (API keys, private paths). Keep any local environment setup scripts untracked.
@article{jing2026cuttlefish,
title = {Cuttlefish: Scaling-Aware Adapter for Structure-Grounded LLM Reasoning},
author = {Jing, Zihao and Zeng, Qiuhao and Fang, Ruiyi and Li, Yan Yi and Sun, Yan and Wang, Boyu and Hu, Pingzhao},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year = {2026},
url = {https://arxiv.org/abs/2602.02780}
}