LilyBench is an evaluation framework for large language models on LilyPond, a code-like textual score format. It accompanies the following paper, currently under review at Ital-IA 2026:
Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding. Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà. Submitted to Ital-IA 2026 — 6th National Conference on Artificial Intelligence (CINI), Rome, Italy.
The framework pairs
- a generation benchmark — a fixed metadata-conditioned prompt bank evaluated with compile rate, Jensen–Shannon similarity over three MusPy descriptors, and a LilyBERT-based Fréchet Music Distance, and
- an understanding benchmark — ten ABC-Eval-adapted tasks (bar count, metadata QA, bar sequencing, next-bar prediction, metadata prediction, music captioning, composer recognition, genre recognition, emotion recognition, error detection) scored with accuracy, exact match, penalised Kendall-τ, and macro-F1.
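Of the understanding metrics listed above, macro-F1 is the one most easily confused with plain accuracy. The sketch below shows the standard definition (unweighted mean of per-class F1). The function name `macro_f1` and the example labels are illustrative assumptions and are not taken from the LilyBench code base.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the answer but was missed
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up labels, for illustration only.
gold = ["baroque", "baroque", "romantic", "jazz"]
pred = ["baroque", "romantic", "romantic", "jazz"]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally, a rare class that is always missed lowers the score as much as a frequent one would.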
Both benchmarks are modular and extensible: new generation regimes plug in through `lilybench.generation.regimes.Regime`, and new understanding tasks through the `@register_task` decorator. The four backbones used in the submission (`phi4`, `qwen-coder`, `deepseek-coder`, `codestral`) are registered out of the box; adding a new HuggingFace model takes a single `register_model(ModelSpec(...))` call.
```bash
git clone https://github.com/CSCPadova/lilybench.git
cd lilybench
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Optional: bitsandbytes int4 / int8 decoding
pip install -e ".[quant]"
```

LilyBench shells out to the LilyPond binary for the compile-rate metric. Install LilyPond 2.24.4 from the official site or your distribution; set `LILYPOND_BIN` if it is not on `PATH`.
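The lookup described above can be sketched as follows. This is a minimal illustration, not the code in `lilybench/utils.py`: the function name `find_lilypond` is hypothetical, and giving `LILYPOND_BIN` priority over `PATH` is an assumption.

```python
import os
import shutil


def find_lilypond(env=None):
    """Return a path to the LilyPond executable, or None if none is found.

    Assumption: an explicit LILYPOND_BIN wins over a PATH search.
    """
    env = os.environ if env is None else env
    override = env.get("LILYPOND_BIN")
    if override:
        return override
    # shutil.which scans the directories listed in `path` for the command.
    return shutil.which("lilypond", path=env.get("PATH"))
```

Calling `find_lilypond()` before a long generation run is a cheap way to fail early if the binary is missing.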
The companion archive on Zenodo (10.5281/zenodo.20267079) bundles:
| Corpus | Use case | Source |
|---|---|---|
| BMdataset | Generation in-domain; understanding | Spanio et al. 2026 |
| Mutopia | Generation out-of-domain; understanding | Mutopia Project |
| EMOPIA | Emotion-recognition task | Hung et al. 2021 |
Unpack the archive under `data/` (see `data/README.md` for the expected layout). Two helper scripts, `scripts/convert_mutopia.py` and `scripts/prepare_emopia.py`, are shipped only for users who want to rebuild the archive from upstream sources. The LilyBERT checkpoint used for FMD lives at https://github.com/CSCPadova/lilybert.
LilyBench exposes a single CLI with four verbs. Every paper experiment fits in roughly the following pipeline.
```bash
lilybench prompt-bank build \
  --bmdataset-dir data/bmdataset/preprocessed \
  --bmdataset-metadata data/bmdataset/metadata.json \
  --n 200 --seed 1234 \
  --out data/prompt_bank.jsonl
```

The bank is reused byte-for-byte across every (model, regime) cell so comparisons are fair. The 200-prompt size matches the paper; pass a different `--n` to evaluate at a different budget.
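One way to check that two cells really saw the same bank is to compare file digests. The helper below is a sketch using only the standard library; `file_sha256` is a hypothetical name, and LilyBench itself is not known to record such a digest.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Storing `file_sha256("data/prompt_bank.jsonl")` next to each run directory makes it easy to spot a bank that was rebuilt between runs.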
```bash
# Zero-shot
lilybench generation run \
  --model phi4 --regime zero \
  --prompts data/prompt_bank.jsonl \
  --out runs/phi4_zero

# Few-shot from the training distribution
lilybench generation run \
  --model phi4 --regime few \
  --fewshot-file configs/fewshot_train_distribution.txt \
  --prompts data/prompt_bank.jsonl \
  --out runs/phi4_few

# Few-shot ablation (hand-written A-minor demos, §3.2 of the paper)
lilybench generation run \
  --model phi4 --regime few \
  --fewshot-file configs/fewshot_ablation_amin.txt \
  --prompts data/prompt_bank.jsonl \
  --out runs/phi4_few_ablation
```

Generated `.ly` files land under `runs/<cell>/samples/sample_####.ly`, with the raw decoded text alongside in `raw_####.txt` for debugging.
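When debugging a cell it can help to walk samples and raw outputs together. The sketch below assumes that the `raw_####.txt` files sit in the same directory as the samples and share the numeric suffix; `pair_samples` is a hypothetical helper, not part of the package.

```python
import re
from pathlib import Path

SAMPLE_RE = re.compile(r"sample_(\d+)\.ly$")


def pair_samples(samples_dir):
    """Return a list of (index, sample_path, raw_path_or_None), sorted by index."""
    samples_dir = Path(samples_dir)
    pairs = []
    for sample in samples_dir.glob("sample_*.ly"):
        match = SAMPLE_RE.search(sample.name)
        if not match:
            continue  # skip files that only look like samples
        raw = samples_dir / f"raw_{match.group(1)}.txt"
        pairs.append((int(match.group(1)), sample, raw if raw.exists() else None))
    return sorted(pairs, key=lambda item: item[0])
```

For example, `pair_samples("runs/phi4_zero/samples")` would list each sample together with the decoded text it was extracted from, or `None` where no raw file is present.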
```bash
# Mutopia tasks (eight of the ten)
lilybench understanding build \
  --corpus mutopia \
  --mutopia-manifest data/mutopia/dataset_mutopia.json \
  --tasks all --seed 1234 \
  --out data/understanding/mutopia.jsonl

# EMOPIA emotion task
lilybench understanding build \
  --corpus emopia \
  --emopia-manifest data/emopia/manifest.csv \
  --emopia-ly-root data/emopia/ly \
  --tasks emotion_recognition --seed 1234 \
  --out data/understanding/emopia.jsonl
```

```bash
lilybench understanding run \
  --model phi4 \
  --bench data/understanding/mutopia.jsonl \
  --out runs/understanding
```

One JSONL file of predictions per task is written to `runs/understanding/phi4/<task>.jsonl`.
```bash
lilybench metrics generation \
  --samples runs/phi4_zero/samples \
  --reference-test data/splits/test.jsonl \
  --reference-mutopia data/mutopia/dataset_mutopia.json \
  --reference-test-midi-dir data/splits/test_midi \
  --reference-mutopia-midi-dir data/mutopia/midi \
  --lilybert /path/to/lilybert \
  --out runs/phi4_zero/summary.json
```

The output JSON contains `compile_rate`, `js_similarity` (in-domain and out-of-domain), and `fmd` (in-domain and out-of-domain), matching the columns of Table 1 in the paper. Provide only the references for the metrics you care about: each `--reference-*` flag is optional.
```bash
lilybench metrics understanding \
  --bench data/understanding/mutopia.jsonl \
  --predictions runs/understanding/phi4 \
  --model-id phi4 \
  --out runs/understanding/phi4/summary.json
```

Per-task metrics plus a macro / weighted aggregate are written to `summary.json`.
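The difference between the two aggregates can be illustrated with a small sketch. It assumes that the weighted aggregate weights each task by its number of examples, which the text above does not state; the function name, task scores and task sizes are all made up for illustration.

```python
def aggregate(task_scores, task_sizes):
    """Combine per-task scores into macro and size-weighted means.

    Assumption: 'weighted' means weighted by the number of examples per task.
    """
    names = sorted(task_scores)
    macro = sum(task_scores[name] for name in names) / len(names)
    total = sum(task_sizes[name] for name in names)
    weighted = sum(task_scores[name] * task_sizes[name] for name in names) / total
    return {"macro": macro, "weighted": weighted}


# Hypothetical numbers, not results from the paper.
scores = {"bar_count": 0.5, "emotion_recognition": 1.0}
sizes = {"bar_count": 60, "emotion_recognition": 20}
print(aggregate(scores, sizes))
```

With these numbers the macro mean is 0.75 while the weighted mean is 0.625, because the larger task pulls the weighted figure toward its own score.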
The package is intentionally small and orthogonal:
```text
lilybench/
  cli.py                  unified argparse CLI (prompt-bank, generation, understanding, metrics)
  models.py               backbone registry + HF loader (+ DeepSeek shims)
  utils.py                JSONL I/O, HF env helpers, LilyPond binary lookup
  data/
    bmdataset.py          BMdataset preprocessed/*.ly loader
    mutopia.py            Mutopia manifest loader
    emopia.py             EMOPIA manifest-CSV loader (midi2ly outputs)
    splits.py             deterministic work-level splitter
    types.py              CorpusEntry dataclass (uniform input type)
  generation/
    prompt_bank.py        stratified prompt-bank builder (paper: 200 prompts)
    regimes.py            ZeroShot, FewShot, plus a `register_regime` hook
    runner.py             backbone + regime + bank -> sample directory
    metadata_block.py     %% === METADATA === block renderer
  understanding/
    base.py               UnderstandingTask abstract base class
    registry.py           @register_task decorator + lookup
    runner.py             greedy-decoding inference loop
    tasks/                ten registered tasks, one module each
    bar_utils.py          |-delimited bar splitting / counting
    corruptor.py          five error-injection categories
    score_metadata.py     extract/mask key / meter / note-length
    title_parser.py       extract title from \header blocks
    midi_to_lily.py       midi2ly subprocess wrapper (EMOPIA prep)
  metrics/
    compile_rate.py       LilyPond compile rate
    muspy_descriptors.py  the three MusPy descriptors used by JS-similarity
    js_similarity.py      JS-similarity over Gaussian-fit descriptors
    fmd.py                LilyBERT-based Fréchet Music Distance
    understanding.py      accuracy / Kendall-τ / F1 / bar-count tolerance
```
```python
from lilybench.generation.regimes import Regime, register_regime


class ChainOfThought(Regime):
    name = "cot"

    def build_prompt(self, prompt, *, tokenizer):
        ...  # build any chat-template string here


register_regime(ChainOfThought)
```

Then `lilybench generation run --regime cot ...` works without any other edits.
```python
from lilybench.understanding import UnderstandingTask, register_task


@register_task
class HarmonicFunctionQA(UnderstandingTask):
    name = "harmonic_function"
    template_kind = "multiple_choice"
    task_instruction = "Pick the most plausible Roman-numeral function..."
    default_n = 60

    def build(self, corpus, *, n, seed): ...
    def score(self, bench, predictions): ...
```

The task is picked up automatically by `lilybench understanding build` (`--tasks all`) and `lilybench metrics understanding`.
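For readers unfamiliar with decorator-based registries, the sketch below shows the general pattern that makes this kind of automatic pickup possible. It is a toy stand-in, not the code in `lilybench/understanding/registry.py`; the names `TOY_REGISTRY`, `toy_register` and `toy_resolve`, and the task id `bar_count`, are hypothetical.

```python
TOY_REGISTRY = {}


def toy_register(cls):
    """Class decorator: store the class under its `name` and return it unchanged."""
    name = getattr(cls, "name", "")
    if not name:
        raise ValueError(f"{cls.__name__} must define a non-empty 'name'")
    if name in TOY_REGISTRY:
        raise ValueError(f"duplicate task name: {name!r}")
    TOY_REGISTRY[name] = cls
    return cls


def toy_resolve(selection):
    """Map 'all' to every registered class, or a single name to that class."""
    if selection == "all":
        return [TOY_REGISTRY[name] for name in sorted(TOY_REGISTRY)]
    return [TOY_REGISTRY[selection]]


@toy_register
class HarmonicFunctionQA:
    name = "harmonic_function"


@toy_register
class BarCount:
    name = "bar_count"


print([cls.name for cls in toy_resolve("all")])
```

Registration happens as a side effect of importing the module that defines the class, which is why a decorated task needs no further wiring as long as its module is imported.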
```python
from lilybench.models import ModelSpec, register_model

register_model(ModelSpec(
    model_id="my-model",
    hf_id="org/my-model",
    dtype="bf16",
    family="general",
))
```

The paper reports four backbones × three generation regimes (zero, few-train, few-ablation) and four backbones × ten understanding tasks. With the Zenodo archive unpacked, the full sweep is twelve generation runs and four understanding runs. The `slurm/` directory contains three thin wrappers (`generation.slurm`, `understanding.slurm`, `metrics.slurm`) showing how to drive the CLI from a SLURM cluster.
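Outside SLURM, the twelve generation cells can be enumerated with a few lines of Python. The sketch below only builds the command lines, reusing the flags and file paths shown in the generation examples of this README; the `runs/<model>_<cell>` naming for models other than `phi4` is an assumption that follows the same pattern.

```python
import shlex
from itertools import product

MODELS = ["phi4", "qwen-coder", "deepseek-coder", "codestral"]
# Cell name -> few-shot file (None means zero-shot).
CELLS = {
    "zero": None,
    "few": "configs/fewshot_train_distribution.txt",
    "few_ablation": "configs/fewshot_ablation_amin.txt",
}


def generation_commands(prompts="data/prompt_bank.jsonl", out_root="runs"):
    """Return one argv list per (model, cell) pair."""
    commands = []
    for model, (cell, fewshot) in product(MODELS, CELLS.items()):
        regime = "zero" if fewshot is None else "few"
        cmd = ["lilybench", "generation", "run", "--model", model, "--regime", regime]
        if fewshot is not None:
            cmd += ["--fewshot-file", fewshot]
        cmd += ["--prompts", prompts, "--out", f"{out_root}/{model}_{cell}"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in generation_commands():
        print(shlex.join(cmd))
```

Each argv list can be passed to `subprocess.run(cmd, check=True)` to execute the cells sequentially on a single machine.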
Determinism notes:
- The prompt bank is built with `seed=1234`; per-prompt inference seeds are `seed_base + i`. Generation uses `do_sample=True`, so numeric metrics drift slightly between runs; compare trends, not exact text.
- Understanding decoding is greedy (`do_sample=False`, `T=0`, `max_new_tokens=20`) per ABC-Eval, so understanding predictions are deterministic on identical hardware/library versions.
- Splits are computed at the work level (no two parts of the same piece can fall on opposite sides of the split boundary). The Zenodo archive ships the splits used in the paper.
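The work-level constraint in the last note can be illustrated with a hash-based splitter. This is a sketch of the general idea only and not the algorithm in `lilybench/data/splits.py`; the `work_id` field, the `test_fraction` parameter and the example entries are hypothetical. To reproduce the paper, use the splits shipped in the Zenodo archive.

```python
import hashlib


def split_of(work_id, seed=1234, test_fraction=0.1):
    """Assign a whole work to 'train' or 'test' from a seeded hash of its id."""
    digest = hashlib.sha256(f"{seed}:{work_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_entries(entries, seed=1234, test_fraction=0.1):
    """Group entries by split; every entry of a given work shares one split."""
    out = {"train": [], "test": []}
    for entry in entries:
        out[split_of(entry["work_id"], seed, test_fraction)].append(entry)
    return out


# Hypothetical entries: two movements of one work, one movement of another.
entries = [
    {"work_id": "work-a", "movement": 1},
    {"work_id": "work-a", "movement": 2},
    {"work_id": "work-b", "movement": 1},
]
splits = split_entries(entries)
```

Because the assignment depends only on the seed and the work identifier, adding or removing other works never moves an existing work across the boundary.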
```bash
pytest          # sequential
pytest -n auto  # parallel via pytest-xdist
```

Code: MIT (see `LICENSE`). Datasets retain their upstream licenses; see the corresponding entries on Zenodo for redistribution terms.
The accompanying paper is currently under review. Until it is accepted, please cite the Zenodo companion archive:
```bibtex
@unpublished{spanio2026lilybench,
  title  = {Can LLMs understand LilyPond? A benchmark for symbolic music
            generation and understanding},
  author = {Spanio, Matteo and Torabi, Mohammad and Poltronieri, Andrea
            and Rod{\`a}, Antonio},
  year   = {2026},
  note   = {Under review at Ital-IA 2026},
}
```