GitHub - zihao-jing/Cuttlefish: Offical repo for work "Scaling-Aware Adapter for Structure-Grounded LLM Reasoning", model Cuttlefish, accepted by ICML 2026.

Cuttlefish: Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

This repo contains the implementation of Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling structural tokens with structural complexity.

Abstract

Large language models (LLMs) are enabling reasoning over biomolecular structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric groundings requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified all-atom LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across diverse all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning.

Note: GitHub Markdown does not reliably render PDFs inline; the PNG above is for inline preview and the PDF link is the high-resolution version.

Models & Datasets

Resource	Link
Cuttlefish (full LLM)	zihaojing/Cuttlefish
Cuttlefish-Encoder (graph encoder)	zihaojing/Cuttlefish-Encoder
Cuttlefish-SFT-Data (instruction tuning data)	zihaojing/Cuttlefish-SFT-Data
Cuttlefish-Encoder-Data (encoder pretraining data)	zihaojing/Cuttlefish-Encoder-Data

Quick start

Load the model

# Install dependencies
# pip install -r requirements.txt

from huggingface_hub import snapshot_download
from src.models.octopus import OctopusModel  # project model wrapper

# Download the full Cuttlefish model
local_dir = snapshot_download(repo_id="zihaojing/Cuttlefish")

# Or download just the encoder
encoder_dir = snapshot_download(repo_id="zihaojing/Cuttlefish-Encoder")

Run inference on your own structures

Prepare your input as a .parquet file (or .jsonl) following the unified schema below, then run:

cd /path/to/cuttlefish
source env.sh

# Edit configs/inference/octopus_8B_s3_v1_5.yaml to point to your checkpoint and input file
bash scripts/inference/run_inference.sh

Or programmatically:

from scripts.inference.run_inference_example import run_inference

results = run_inference(
    checkpoint_path="zihaojing/Cuttlefish",   # HF repo id or local path
    input_file="your_data.parquet",
    output_dir="outputs/results.jsonl",
)

Load the datasets

from datasets import load_dataset

# SFT instruction-tuning data (molecules, proteins)
sft = load_dataset("zihaojing/Cuttlefish-SFT-Data")

# Encoder pretraining data (molecules, proteins, DNA/RNA)
enc = load_dataset("zihaojing/Cuttlefish-Encoder-Data")

What this repo supports

Modalities: protein (PDB/CIF), molecule (SMILES→3D), DNA, RNA
Unified parquet data format across pretraining and instruction tuning
Masked reconstruction pretraining for the graph encoder
Multimodal SFT / benchmarking via a single entrypoint (src/runner/sft.py)
Inference on parquet/JSONL (src/runner/inference.py)

For deeper diagrams and additional details, see docs/README.md (this top-level README is meant to be runnable on its own).

Install

python -m pip install -r requirements.txt

Environment setup (recommended)

Some provided scripts expect these env vars to be defined:

BASE_DIR
DATA_DIR
CHECKPOINT_DIR

Use the provided env.sh template to set paths and activate your environment (do not commit secrets):

cd /path/to/EntroAdap
source env.sh

Run tests

pytest -q

Data format (unified parquet schema)

Both pretraining and SFT use the same structural columns. The only SFT-only addition is messages.

Core structural columns

Field	Required	Description
`modality`	✓	`"molecule"`, `"protein"`, `"dna"`, `"rna"`
`node_feat`	✓	Node/atom features ((N \times d))
`pos`	✓	3D coordinates in Å ((N \times 3))
`edge_index`	✓	Spatial graph edges in COO ((2 \times E_s))
`edge_feat_dist`	optional	Distances for spatial edges ((E_s \times 1))
`chem_edge_index`	optional	Chemical-bond edges for molecules ((2 \times E_c))
`chem_edge_feat_cat`	optional	Chemical-bond categorical features ((E_c \times 3))

SFT-only column: `messages`

For instruction tuning / evaluation, each sample includes a chat-style field:

messages = [
  {"role": "system", "content": "..."},
  {"role": "user", "content": "Question about <STRUCTURE> ..."},
  {"role": "assistant", "content": "..."}
]

<STRUCTURE> marks where structure embeddings are injected. The runner supports multiple structure tokens (see src/runner/sft.py / src/runner/inference.py): ["<STRUCTURE>", "<mol>", "<STRUCT>", "<DNA>"].

Where the repo expects data

The docs use (and most configs assume) paths like:

data/encoder/...   # encoder pretraining parquet(s)
data/sft/...       # SFT parquet(s) with messages
data/benchmark/... # benchmark parquet(s)

See also: docs/README.md for modality-specific feature details and examples.

Training: masked reconstruction pretraining (graph encoder)

Entrypoint: src/runner/aa_encoder.py
Script wrapper: scripts/training/encoder/mlm.sh
Default config: configs/aa_encoder/all_modality.yaml

Run via the wrapper script:

cd /path/to/EntroAdap
source env.sh

bash scripts/training/encoder/mlm.sh configs/aa_encoder/all_modality.yaml

Or run directly:

python src/runner/aa_encoder.py --config configs/aa_encoder/all_modality.yaml

Notes

The config supports multiple parquet files under data.train_data_path.
For multi-modality mixtures, the trainer can enforce same-modality batches (use_modality_sampler: true).
Runtime filtering is supported (max_atoms, max_edges, skip_on_error).

Training / tuning: multimodal SFT (Cuttlefish / Octopus wrapper)

Entrypoint: src/runner/sft.py (train and/or evaluate)
Script wrapper: scripts/training/ft/octopus.sh
Configs: configs/sft/octopus/*.yaml

Run the provided SFT script (it sequentially runs two configs by default):

cd /path/to/EntroAdap
source env.sh

bash scripts/training/ft/octopus.sh

Run a single config explicitly:

python src/runner/sft.py --config configs/sft/octopus/octopus_8B_s3_v3.yaml

What src/runner/sft.py does

Loads a dataset specified by config (dataset_name, or train/eval parquet paths depending on config style)
Tokenizes chat prompts; in multimodal mode it uses MultiModalDataCollator
Trains and/or evaluates depending on config flags (do_train, do_eval)
Supports evaluation metrics controlled by eval_metrics:
- "qa": multiple-choice QA metrics
- "text": text metrics (BLEU/ROUGE/METEOR)

Evaluation / benchmarking

Benchmarks are run through the same src/runner/sft.py entrypoint using configs under configs/benchmark/.

Octopus/Cuttlefish-style multimodal benchmarks

Configs live in configs/benchmark/octupus/ (e.g., qa.yaml, molgen.yaml, retro.yaml, ...).

Wrapper scripts:

scripts/benchmark/forward.sh → configs/benchmark/octupus/forward.yaml
scripts/benchmark/molqa.sh → configs/benchmark/octupus/qa.yaml
scripts/benchmark/retro.sh → configs/benchmark/octupus/retro.yaml
scripts/benchmark/molgen.sh → configs/benchmark/octupus/molgen.yaml
scripts/benchmark/mol.sh → (currently points to configs/benchmark/disease.yaml, which is not present in this artifact; use a config under configs/benchmark/octupus/ instead)

Example (QA benchmark):

cd /path/to/EntroAdap
source env.sh

bash scripts/benchmark/molqa.sh

Text-only LLM benchmark

scripts/benchmark/llm.sh runs:

configs/benchmark/llama3/qa.yaml

cd /path/to/EntroAdap
source env.sh

bash scripts/benchmark/llm.sh

Inference

Entrypoint: src/runner/inference.py
Wrapper: scripts/inference/run_inference.sh
Example config: configs/inference/octopus_8B_s3_v1_5.yaml

Run inference via the wrapper:

cd /path/to/EntroAdap
source env.sh

bash scripts/inference/run_inference.sh

Key config fields:

input_file: .parquet (unified format) or .jsonl
octopus_checkpoint_path: checkpoint directory
output_dir: output JSONL file path (despite the name)
generation params: max_new_tokens, temperature, top_p

There is also a programmatic example in scripts/inference/run_inference_example.py.

Useful pointers (code + docs)

Full technical docs: docs/README.md
Pretraining runner: src/runner/aa_encoder.py
SFT/benchmark runner: src/runner/sft.py
Inference runner: src/runner/inference.py
Data collators / samplers:
- src/data_loader/aa_dataset.py (masked reconstruction)
- src/data_loader/octopus_collator.py (SFT + inference collators, modality-aware batching)
Data builders (optional; not required if you already have parquet):
- src/data_factory/molecule/
- src/data_factory/protein/
- src/data_factory/nacid/

Dataset

The datasets used in this paper are released on HuggingFace:

Cuttlefish-SFT-Data — instruction-tuning data covering molecules, proteins, DNA, and RNA across diverse reasoning tasks.
Cuttlefish-Encoder-Data — masked-reconstruction pretraining data for the all-atom graph encoder.

Both follow the unified parquet schema described above.

Safety notes

Do not commit secrets (API keys, private paths). Keep any local environment setup scripts untracked.

Citation

@article{jing2026cuttlefish,
  title     = {Cuttlefish: Scaling-Aware Adapter for Structure-Grounded LLM Reasoning},
  author    = {Jing, Zihao and Zeng, Qiuhao and Fang, Ruiyi and Li, Yan Yi and Sun, Yan and Wang, Boyu and Hu, Pingzhao},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026},
  url       = {https://arxiv.org/abs/2602.02780}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cuttlefish: Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Abstract

Models & Datasets

Quick start

Load the model

Run inference on your own structures

Load the datasets

What this repo supports

Install

Environment setup (recommended)

Run tests

Data format (unified parquet schema)

Core structural columns

SFT-only column: `messages`

Where the repo expects data

Training: masked reconstruction pretraining (graph encoder)

Training / tuning: multimodal SFT (Cuttlefish / Octopus wrapper)

Evaluation / benchmarking

Octopus/Cuttlefish-style multimodal benchmarks

Text-only LLM benchmark

Inference

Useful pointers (code + docs)

Dataset

Safety notes

Citation

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
configs		configs
examples		examples
figs		figs
scripts		scripts
src		src
tests		tests
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
env.sh		env.sh
pytest.ini		pytest.ini
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Cuttlefish: Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Abstract

Models & Datasets

Quick start

Load the model

Run inference on your own structures

Load the datasets

What this repo supports

Install

Environment setup (recommended)

Run tests

Data format (unified parquet schema)

Core structural columns

SFT-only column: messages

Where the repo expects data

Training: masked reconstruction pretraining (graph encoder)

Training / tuning: multimodal SFT (Cuttlefish / Octopus wrapper)

Evaluation / benchmarking

Octopus/Cuttlefish-style multimodal benchmarks

Text-only LLM benchmark

Inference

Useful pointers (code + docs)

Dataset

Safety notes

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

SFT-only column: `messages`

Packages