[Feature] Support lmms-eval in unified eval script #35

Open
oscarqjh wants to merge 12 commits into main from feature/lmmseval-integration

Conversation

@oscarqjh
Collaborator

Summary

  • Add lmms-eval as a second evaluation backend alongside VLMEvalKit, behind a shared adapter interface
  • Refactor run_easi_eval.py from a VLMEvalKit-only script into a backend-agnostic orchestrator
  • Users can now run all EASI benchmarks with either backend using aligned CLI options

Changes

Backend adapter pattern (scripts/submissions/backends/):

  • base.py — BackendAdapter ABC defining the shared interface (command building, progress polling, score extraction, result archiving)
  • vlmevalkit.py — VLMEvalKit adapter extracted from the original run_easi_eval.py
  • lmmseval.py — New lmms-eval adapter with task mapping, metric extraction, SiteBench JSONL merge, and resume logic
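
A minimal sketch of what the shared interface in `base.py` might look like. The method names below are taken from this PR's commit messages; the signatures, attributes, and types are assumptions for illustration, not the actual code:

```python
from abc import ABC, abstractmethod


class BackendAdapter(ABC):
    """Shared interface each evaluation backend implements (sketch)."""

    name: str        # e.g. "vlmevalkit" or "lmms-eval"
    TASK_MAP: dict   # user-facing benchmark key -> backend task name

    @abstractmethod
    def build_cmd(self, benchmark: str, work_dir: str) -> list:
        """Return the subprocess argv for one benchmark run."""

    @abstractmethod
    def poll_progress(self, work_dir: str) -> dict:
        """Report per-benchmark progress for the display."""

    @abstractmethod
    def detect_completion(self, work_dir: str, benchmark: str) -> bool:
        """True if result files for this benchmark already exist."""

    @abstractmethod
    def get_result_files(self, work_dir: str) -> list:
        """Files to include in the results archive."""

    @abstractmethod
    def extract_scores(self, work_dir: str, benchmark: str) -> dict:
        """Parse backend output into the 0-100 payload schema."""
```

Each concrete adapter (`vlmevalkit.py`, `lmmseval.py`) subclasses this and fills in backend-specific behaviour, so the orchestrator never branches on the backend name directly.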

Orchestrator refactor (run_easi_eval.py):

  • New CLI flags: --backend {vlmevalkit,lmms-eval}, --model-args, --accelerate/--no-accelerate, --rerun
  • Backend-agnostic benchmark registry (user-facing keys mapped to backend task names by adapters)
  • Resume logic: skips completed benchmarks on rerun, --rerun forces fresh evaluation
  • ProgressDisplay adapted for lmms-eval (single status icon, no sample-level progress bars)
  • Benchmarks run individually per subprocess to avoid HuggingFace datasets filelock deadlocks
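
The resume and per-subprocess behaviour described above could be sketched roughly as follows (function and variable names here are illustrative, not the PR's actual code):

```python
import subprocess


def run_benchmarks(adapter, benchmarks, work_dir, rerun=False):
    """Run each benchmark in its own subprocess, skipping completed ones."""
    # Resume: unless --rerun was given, skip anything already finished.
    completed = set() if rerun else {
        b for b in benchmarks if adapter.detect_completion(work_dir, b)
    }
    for bench in benchmarks:
        if bench in completed:
            print(f"[skip] {bench} already completed")
            continue
        # One benchmark per subprocess avoids HF datasets filelock deadlocks.
        cmd = adapter.build_cmd(bench, work_dir)
        subprocess.run(cmd, check=True)
        # Check completion right after each run for real-time display.
        if adapter.detect_completion(work_dir, bench):
            completed.add(bench)
    return completed
```

After an OOM crash, rerunning the same command re-enters this loop and only the unfinished benchmarks execute; `--rerun` empties the completed set so everything runs fresh.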

Postprocess (postprocess.py):

  • build_payload() and build_results_archive() accept optional backend_adapter parameter
  • Result archive includes only the latest files per task (no stale results from previous runs)
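
The optional `backend_adapter` parameter might be wired in roughly like this (a sketch; the payload field names and the legacy helper are assumptions):

```python
def _extract_vlmevalkit_scores(work_dir, benchmark):
    """Placeholder for the original VLMEvalKit-only score parsing."""
    return {}


def build_payload(work_dir, benchmarks, backend_adapter=None):
    """Build the submission payload, delegating to the adapter if given."""
    if backend_adapter is not None:
        # New path: the adapter knows how to parse its backend's output.
        scores = {b: backend_adapter.extract_scores(work_dir, b)
                  for b in benchmarks}
        backend_name = backend_adapter.name
    else:
        # Legacy path: unchanged VLMEvalKit-only behaviour.
        scores = {b: _extract_vlmevalkit_scores(work_dir, b)
                  for b in benchmarks}
        backend_name = "vlmevalkit"
    return {"backend": backend_name, "scores": scores}
```

Keeping the parameter optional is what preserves backward compatibility: existing callers that pass no adapter hit the legacy branch untouched.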

Environment & docs:

  • scripts/setup.sh installs both backends with pinned filelock==3.20.3 and datasets==4.5.0
  • Updated docs/Submit_results.md with lmms-eval examples, task name mapping, and resume/rerun docs
  • Updated README.md and README_CN.md with unified eval script examples
  • Updated lmms-eval submodule to latest

Usage

# VLMEvalKit (default, unchanged)
python scripts/submissions/run_easi_eval.py \
  --model Qwen/Qwen2.5-VL-7B-Instruct --nproc 4

# lmms-eval
python scripts/submissions/run_easi_eval.py \
  --backend lmms-eval \
  --model qwen3_vl \
  --model-args "pretrained=Qwen/Qwen3-VL-8B-Instruct,attn_implementation=flash_attention_2" \
  --nproc 4

# OOM recovery (single-GPU, resumes from where it left off)
python scripts/submissions/run_easi_eval.py \
  --backend lmms-eval --model qwen3_vl \
  --model-args "pretrained=Qwen/Qwen3-VL-8B-Instruct,attn_implementation=flash_attention_2" \
  --no-accelerate

Test plan

  • VLMEvalKit path: CLI parsing, command construction, benchmark selection verified via dry-run
  • lmms-eval path: full EASI-8 evaluation completed on Qwen3-VL-8B-Instruct with 4 GPUs
  • Score extraction: all 8 benchmarks correctly scaled to 0-100 with sub-scores
  • Resume logic: completed benchmarks skipped on rerun, --rerun forces fresh evaluation
  • Result archive: only latest files per task included in zip
  • Submission: successfully submitted to EASI leaderboard

oscarqjh added 12 commits April 10, 2026 13:54
Move VLMEvalKit-specific constants, dataset preparation, verification,
and progress monitoring functions into backends/vlmevalkit.py. The
VLMEvalKitAdapter class implements the BackendAdapter interface with
build_cmd, prepare_datasets, poll_progress, detect_completion,
get_result_files, extract_scores, and get_env_overrides methods.

run_easi_eval.py now imports these from the adapter module while
keeping EASI_8, EXTRA, display classes, and main() unchanged.
Create LmmsEvalAdapter with task/metric maps, accelerate command
building, result detection from *_results.json, score extraction,
and SiteBench image+video JSONL merge logic.
For benchmarks like vsi_bench where the overall metric returns a dict
containing both the overall score and sub-scores, also search inside
that dict when looking up sub-score values.
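
The lookup described in this commit might look like the following sketch: check the top-level metrics dict first, then fall back to searching one level inside dict-valued metrics (the function name is illustrative):

```python
def find_subscore(metrics, key):
    """Look up `key` in metrics, also searching inside dict-valued entries.

    Some lmms-eval metrics (e.g. vsi_bench's overall metric) return a dict
    holding both the overall score and its sub-scores, so a flat lookup
    alone would miss the sub-score keys.
    """
    if key in metrics:
        return metrics[key]
    for value in metrics.values():
        if isinstance(value, dict) and key in value:
            return value[key]
    return None
```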
…ults_archive

When backend_adapter is provided, build_payload delegates score extraction
to adapter.extract_scores() and uses adapter.name for the backend field.
build_results_archive delegates file listing to adapter.get_result_files().
Both functions retain full backward compatibility when adapter is None.
Refactor the main orchestrator to support both VLMEvalKit and lmms-eval
backends via the adapter pattern introduced in earlier commits.

Key changes:
- Replace EASI_8/EXTRA tuples with key-only EASI_8_KEYS/EXTRA_KEYS lists;
  benchmark-to-task mapping now comes from adapter.TASK_MAP
- Add CLI arguments: --backend, --model-args, --accelerate/--no-accelerate,
  --rerun
- Create adapter early in main() via get_backend() factory
- Phase 1 (dataset prep): delegate to adapter for vlmevalkit; skip for
  lmms-eval (manages its own data)
- Phase 2 (subprocess): add _run_lmmseval() path alongside existing
  _run_vlmevalkit(); add resume logic via adapter.find_completed_tasks()
- Phase 3 (verification): vlmevalkit uses existing verify_results();
  lmms-eval uses adapter.detect_completion()
- Phase 4 (postprocess): pass backend_adapter to build_payload() and
  build_results_archive()
- ProgressDisplay: add backend parameter; simplified single-status rows
  for non-vlmevalkit backends (no dual infer/eval phases or progress bars)
- _build_cmd respects --rerun flag (omits --reuse when rerun is True)
- Remove unused tempfile import

The VLMEvalKit path remains functionally identical to before.
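
The `get_backend()` factory mentioned above might be as simple as the following sketch (class and module names follow the files listed in the Changes section; the local imports are an assumption):

```python
def get_backend(name):
    """Map the --backend CLI value to an adapter instance (sketch)."""
    # Import lazily so a missing backend install doesn't break the other path.
    if name == "vlmevalkit":
        from backends.vlmevalkit import VLMEvalKitAdapter
        return VLMEvalKitAdapter()
    if name == "lmms-eval":
        from backends.lmmseval import LmmsEvalAdapter
        return LmmsEvalAdapter()
    raise ValueError(f"Unknown backend: {name}")
```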
When running all benchmarks in a single --tasks call with accelerate
multi-GPU, all workers try to download/cache datasets simultaneously,
causing filelock deadlocks in the HuggingFace datasets library.

Run one benchmark per subprocess call instead. The resume logic
handles skipping already-completed benchmarks on rerun. Each
benchmark's completion is checked after it finishes so the display
shows real-time progress.
- Update lmms-eval submodule to v0.6-94 (latest main)
- Pin filelock==3.20.3 and datasets==4.5.0 in setup.sh to avoid
  deadlocks with accelerate multi-GPU dataset loading
- Install both VLMEvalKit and lmms-eval backends in setup.sh
- Disable HF_HUB_ENABLE_HF_TRANSFER in lmms-eval subprocess env
  to avoid filelock issues on shared HuggingFace caches
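
The subprocess environment override for the lmms-eval path might be sketched as (the function name is illustrative; the variable it sets is the one named in this commit):

```python
import os


def lmmseval_env_overrides():
    """Environment for the lmms-eval subprocess (sketch).

    Disabling hf_transfer avoids filelock contention when several
    workers share one HuggingFace cache directory.
    """
    env = os.environ.copy()
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
    return env
```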
- vsi_bench: overall metric changed from vsibench_score (dict) to
  vsibench_overall (float); MRA sub-score keys now lowercase without
  range suffix (e.g. object_abs_distance_mra)
- mindcube_tiny, mmsi_bench, viewspatial: fix scale from 1 to 100
  (lmms-eval returns 0-1 values, payload expects 0-100)
- 3dsrbench: expand sub-scores from 4 to all 12, fix metric key names
  to include _accuracy suffix (e.g. height_higher_accuracy)
- embspatial: add ai2thor/mp3d/scannet sub-scores
- viewspatial: add 5 perspective-level sub-scores
- vsi_debiased: same fixes as vsi_bench
- mmsi_video_bench, omnispatial: fix scale to 100
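
The 0-1 to 0-100 scaling described here could be captured by a per-benchmark scale factor in the metric map. The entries below are illustrative of the idea only (the metric names other than `vsibench_overall` are invented for the sketch, not the PR's actual table):

```python
# benchmark key -> (metric name, scale applied to the raw lmms-eval value)
METRIC_SCALE = {
    "mindcube_tiny": ("accuracy", 100),          # lmms-eval returns 0-1
    "mmsi_bench":    ("accuracy", 100),
    "vsi_bench":     ("vsibench_overall", 1),    # assumed already 0-100
}


def scale_score(benchmark, raw):
    """Convert a raw backend score to the payload's 0-100 range."""
    _, scale = METRIC_SCALE[benchmark]
    return raw * scale
```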
get_result_files now returns the latest samples JSONL per task
instead of all accumulated files from multiple runs. Ensures
the zip archive matches the scores in the submission payload.
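
Selecting only the latest samples file per task might be done with a glob plus a modification-time comparison (a sketch; the filename pattern is an assumption about lmms-eval's timestamped output files):

```python
import glob
import os


def latest_samples_per_task(work_dir, tasks):
    """Return the newest samples JSONL for each task, ignoring older runs."""
    latest = []
    for task in tasks:
        # lmms-eval writes one timestamped samples file per run (assumed pattern).
        pattern = os.path.join(work_dir, f"*{task}*samples*.jsonl")
        candidates = glob.glob(pattern)
        if candidates:
            latest.append(max(candidates, key=os.path.getmtime))
    return latest
```

Archiving only these files keeps the zip consistent with the scores in the payload, since both are derived from the same final run.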
