PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models

Mingue Park* · Jisung Hwang* · Seungwoo Yoo* · Kyeongmin Yeo · Minhyuk Sung

(* Equal Contribution)

KAIST

ICLR 2026

TL;DR

PairFlow is a teacher-free acceleration framework for Discrete Flow Models that builds source–target training pairs via a closed-form inversion/backward velocity, so the model learns straighter, few-step paths. It’s cheap (≈0.2–1.7% of full training compute) yet improves few-step sampling and can even strengthen the base model for later distillation.

Environment and Requirements

Tested Environment

Python: 3.12
CUDA: 12.4
GPU: Tested on NVIDIA RTX 3090 and RTX A6000

Installation

conda create -n pairflow python=3.12
conda activate pairflow
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1
pip install rdkit pytorch_image_generation_metrics pillow

# Install local CUDA extension
cd df_cuda
pip install -e .
cd ..

Dataset Download and Pre-processing

In the commands below, <DATASET> is one of: qm9, zinc-250k, mnist-binary, cifar-10.

1. Dataset Download & Parsing

This script downloads the dataset and parses it into tokenized sequences.

cd data/preprocessed/<DATASET>
python download_dataset.py

2. Pairing with Closed-Form Backward Velocity

We provide a script that builds the source–target pairs using the closed-form backward velocity (Alg. 1 in the paper).

# Paths are relative to the project root.
cd data/preprocessed
python pairflow_preprocess.py --dataset_type=<DATASET> --batch_size=<BATCH_SIZE>
# For faster generation, choose the largest batch size that fits in your GPU memory.
# This script supports multi-GPU parallel generation.

# You can also customize the number of steps (--num_steps) and tau (--tau, default=1.0).
# python pairflow_preprocess.py --dataset_type=<DATASET> \
#   --batch_size=<BATCH_SIZE> --num_steps=<NUM_STEPS> --tau=<TAU>
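To run the pairing step for several datasets in a row, the command line can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_preprocess_cmd` is a hypothetical helper, and it relies only on the flags shown above (`--dataset_type`, `--batch_size`, `--num_steps`, `--tau`). The batch size in the example is an arbitrary placeholder.

```python
# Dataset names accepted by the commands in this README.
DATASETS = ("qm9", "zinc-250k", "mnist-binary", "cifar-10")


def build_preprocess_cmd(dataset, batch_size, num_steps=None, tau=None):
    """Assemble the argv list for pairflow_preprocess.py (hypothetical helper)."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    cmd = [
        "python",
        "pairflow_preprocess.py",
        f"--dataset_type={dataset}",
        f"--batch_size={batch_size}",
    ]
    # Optional flags are appended only when set, so the script's own defaults apply.
    if num_steps is not None:
        cmd.append(f"--num_steps={num_steps}")
    if tau is not None:
        cmd.append(f"--tau={tau}")
    return cmd


if __name__ == "__main__":
    # 256 is a placeholder; pick a batch size that fits your GPU.
    for name in DATASETS:
        print(" ".join(build_preprocess_cmd(name, batch_size=256)))
```

The returned list can be passed to `subprocess.run(cmd, check=True, cwd="data/preprocessed")` to launch the job.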

Training

Training scripts are provided at scripts/train/<DATASET>/<MODEL>.sh, where <MODEL> is one of mdlm, udlm, pairflow. Run them from the project root:

bash scripts/train/<DATASET>/<MODEL>.sh

Note. Training writes checkpoints to the Hydra output directory (under outputs/). Before running distillation or evaluation, you must place the trained checkpoint at the path expected by each script — checkpoints/<DATASET>/<MODEL>.ckpt (e.g., checkpoints/qm9/pairflow.ckpt). The DCD/ReDi/eval scripts will not find the model otherwise.
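The checkpoint placement described in the note can be scripted. Below is a minimal sketch under stated assumptions: `place_checkpoint` is a hypothetical helper (not part of the repository), and the source path is whatever checkpoint file you pick from your Hydra output directory, since the exact layout under outputs/ is not documented here.

```python
import shutil
from pathlib import Path


def place_checkpoint(src, dataset, model, root="."):
    """Copy a trained checkpoint to <root>/checkpoints/<dataset>/<model>.ckpt.

    Hypothetical helper: `src` is a checkpoint file chosen by the user.
    """
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src}")
    dst = Path(root) / "checkpoints" / dataset / f"{model}.ckpt"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copy, so the original Hydra output stays intact
    return dst
```

For example, `place_checkpoint("<path to your .ckpt under outputs/>", "qm9", "pairflow")` would create checkpoints/qm9/pairflow.ckpt relative to the project root.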

Distillation

We provide two distillation methods: Discrete Consistency Distillation (DCD) and Rectified Discrete Flow (ReDi). Each method requires a short preparation step before running. For both methods, <MODEL> is one of udlm, pairflow.

1. DCD

First, generate the pre-computed integral for each dataset. Please refer to the DUO codebase and follow its instructions. The DCD trainer resolves the cache path as integral/${tokenizer_name_or_path}.pkl, so place the generated file at one of the following paths (matching the tokenizer_name_or_path field in configs/data/<DATASET>.yaml):

| Dataset | Integral file path |
| --- | --- |
| `mnist-binary` | `integral/binary_pixels.pkl` |
| `cifar-10` | `integral/raw_pixels.pkl` |
| `qm9` | `integral/yairschiff/qm9-tokenizer.pkl` |
| `zinc-250k` | `integral/yairschiff/zinc250k-tokenizer.pkl` |
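The path rule `integral/${tokenizer_name_or_path}.pkl` can be checked before launching DCD. The sketch below is illustrative only: the `TOKENIZERS` mapping is inferred from the table above rather than read from configs/data/<DATASET>.yaml, and `integral_path` is a hypothetical helper.

```python
from pathlib import Path

# tokenizer_name_or_path per dataset, as implied by the table above (assumption:
# these match the values in configs/data/<DATASET>.yaml).
TOKENIZERS = {
    "mnist-binary": "binary_pixels",
    "cifar-10": "raw_pixels",
    "qm9": "yairschiff/qm9-tokenizer",
    "zinc-250k": "yairschiff/zinc250k-tokenizer",
}


def integral_path(dataset, root="integral"):
    """Return the expected cache path: <root>/<tokenizer_name_or_path>.pkl."""
    return Path(root) / f"{TOKENIZERS[dataset]}.pkl"


if __name__ == "__main__":
    for name in TOKENIZERS:
        path = integral_path(name)
        status = "found" if path.is_file() else "MISSING"
        print(f"{name}: {path} ({status})")
```

Note that for the molecular datasets the tokenizer name contains a slash, so the file lives in a `yairschiff/` subdirectory of `integral/`.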

Then run:

# DCD
bash scripts/dcd/<DATASET>/<MODEL>.sh

2. ReDi

First, generate the source–target pairs using a pre-trained model with the script below. Place your checkpoint at the expected path (checkpoints/<DATASET>/<MODEL>.ckpt) and run the script. It will automatically generate the pairs and save them to data/redi/<DATASET>/<MODEL>/, which the ReDi script reads from in the next step.

# Generate pairs (writes to data/redi/<DATASET>/<MODEL>/)
bash scripts/generate-pair/<DATASET>/<MODEL>.sh
# You can override the sampling steps, total samples, sampling type, and predictor.
bash scripts/generate-pair/<DATASET>/<MODEL>.sh \
  --SAMPLING_STEPS=<SAMPLING_STEPS> --TOTAL_SAMPLES=<TOTAL_SAMPLES> \
  --SAMPLING_TYPE=<SAMPLING_TYPE> --PREDICTOR=<PREDICTOR>
# ReDi (reads from data/redi/<DATASET>/<MODEL>/)
bash scripts/redi/<DATASET>/<MODEL>.sh

Sampling & Eval

We also provide evaluation scripts for each dataset. These scripts automatically generate samples and compute the metrics.

FID Reference Stats (image datasets only)

For image datasets (cifar-10, mnist-binary), FID is computed against a reference statistics file loaded from ./fid_features/<DATASET>.npz. This file is not included in the repository — you must generate it yourself from the real training images before running evaluation. We use pytorch_image_generation_metrics for the computation; please follow its documentation to precompute the Inception statistics and save them to:

| Dataset | Reference stats path |
| --- | --- |
| `cifar-10` | `fid_features/cifar-10.npz` |
| `mnist-binary` | `fid_features/mnist-binary.npz` |

QM9 and ZINC-250k use SMILES-based molecular metrics (validity / uniqueness / novelty) and do not require this file.
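Because a missing reference file only surfaces once evaluation starts, a quick pre-flight check can save a failed run. This is a minimal sketch with hypothetical helper names (`fid_reference_path`, `check_fid_reference`); it only encodes the path convention stated above and does not compute any statistics.

```python
from pathlib import Path

# Only image datasets need FID reference statistics.
IMAGE_DATASETS = ("cifar-10", "mnist-binary")


def fid_reference_path(dataset, root="fid_features"):
    """Return <root>/<dataset>.npz for image datasets, or None otherwise."""
    if dataset not in IMAGE_DATASETS:
        return None
    return Path(root) / f"{dataset}.npz"


def check_fid_reference(dataset, root="fid_features"):
    """Raise early if an image dataset lacks its reference statistics file."""
    path = fid_reference_path(dataset, root)
    if path is not None and not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; precompute the FID reference statistics first"
        )
    return path
```

Calling `check_fid_reference("qm9")` returns `None` without touching the filesystem, since the molecular datasets do not use FID.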

Running Evaluation

bash scripts/eval/<DATASET>/<MODEL>.sh \
  --SAMPLING_STEPS=<SAMPLING_STEPS> --TOTAL_SAMPLES=<TOTAL_SAMPLES> \
  --SAMPLING_TYPE=<SAMPLING_TYPE>
# For QM9 and ZINC-250k (<DATASET> is qm9 or zinc-250k), you can additionally pass --NUM_TRIALS to average over multiple sampling trials:
bash scripts/eval/<DATASET>/<MODEL>.sh \
  --SAMPLING_STEPS=<SAMPLING_STEPS> --TOTAL_SAMPLES=<TOTAL_SAMPLES> \
  --SAMPLING_TYPE=<SAMPLING_TYPE> --NUM_TRIALS=<NUM_TRIALS>

Generated samples and metrics are written to ./evaluation/<DATASET>/<MODEL>-<SAMPLING_TYPE>/num_steps-<SAMPLING_STEPS>_total-<TOTAL_SAMPLES>/.

Here, <MODEL> is one of:

mdlm — MDLM baseline
udlm — UDLM baseline
pairflow — PairFlow (ours)
udlm+dcd, pairflow+dcd — after DCD distillation
udlm+redi, pairflow+redi — after ReDi distillation
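To collect results across runs, it helps to reconstruct the output directory from the same arguments passed to the evaluation script. The sketch below is a hypothetical helper (`eval_output_dir` is not part of the repository) that mirrors the directory pattern and the model names listed above; the sampling type in the usage example is a placeholder, not a value confirmed by this README.

```python
from pathlib import Path

# Model names accepted by the evaluation scripts, as listed above.
EVAL_MODELS = (
    "mdlm", "udlm", "pairflow",
    "udlm+dcd", "pairflow+dcd",
    "udlm+redi", "pairflow+redi",
)


def eval_output_dir(dataset, model, sampling_type, sampling_steps,
                    total_samples, root="evaluation"):
    """Build <root>/<dataset>/<model>-<type>/num_steps-<steps>_total-<total>."""
    if model not in EVAL_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return (
        Path(root)
        / dataset
        / f"{model}-{sampling_type}"
        / f"num_steps-{sampling_steps}_total-{total_samples}"
    )


if __name__ == "__main__":
    # "my_sampler" stands in for whatever <SAMPLING_TYPE> you evaluated with.
    print(eval_output_dir("qm9", "pairflow+dcd", "my_sampler", 4, 1024))
```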

Acknowledgements & Citation

This repository is built on top of the DUO codebase. If you find our work useful, please cite:

@article{park2025pairflow,
  title={PairFlow: Closed-Form Source-Target Coupling for Few-Step Generation in Discrete Flow Models},
  author={Park, Mingue and Hwang, Jisung and Yoo, Seungwoo and Yeo, Kyeongmin and Sung, Minhyuk},
  journal={arXiv preprint arXiv:2512.20063},
  year={2025}
}
