[Feature] Support lmms-eval in unified eval script #35

Open
oscarqjh wants to merge 12 commits into main from feature/lmmseval-integration

Conversation

@oscarqjh
Collaborator

Summary

  • Add lmms-eval as a second evaluation backend alongside VLMEvalKit, behind a shared adapter interface
  • Refactor run_easi_eval.py from a VLMEvalKit-only script into a backend-agnostic orchestrator
  • Users can now run all EASI benchmarks with either backend using aligned CLI options

Changes

Backend adapter pattern (scripts/submissions/backends/):

  • base.py — BackendAdapter ABC defining the shared interface (command building, progress polling, score extraction, result archiving)
  • vlmevalkit.py — VLMEvalKit adapter extracted from the original run_easi_eval.py
  • lmmseval.py — New lmms-eval adapter with task mapping, metric extraction, SiteBench JSONL merge, and resume logic
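
A minimal sketch of what the shared interface in `base.py` might look like. The method names below are taken from this PR's commit messages; the signatures, attributes, and types are assumptions for illustration, not the actual code:

```python
from abc import ABC, abstractmethod


class BackendAdapter(ABC):
    """Shared interface each evaluation backend implements (sketch)."""

    name: str        # e.g. "vlmevalkit" or "lmms-eval"
    TASK_MAP: dict   # user-facing benchmark key -> backend task name

    @abstractmethod
    def build_cmd(self, benchmark: str, work_dir: str) -> list:
        """Return the subprocess argv for one benchmark run."""

    @abstractmethod
    def poll_progress(self, work_dir: str) -> dict:
        """Report per-benchmark progress for the display."""

    @abstractmethod
    def detect_completion(self, work_dir: str, benchmark: str) -> bool:
        """True if result files for this benchmark already exist."""

    @abstractmethod
    def get_result_files(self, work_dir: str) -> list:
        """Files to include in the results archive."""

    @abstractmethod
    def extract_scores(self, work_dir: str, benchmark: str) -> dict:
        """Parse backend output into the 0-100 payload schema."""
```

Each concrete adapter (`vlmevalkit.py`, `lmmseval.py`) subclasses this and fills in backend-specific behaviour, so the orchestrator never branches on the backend name directly.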

Orchestrator refactor (run_easi_eval.py):

  • New CLI flags: --backend {vlmevalkit,lmms-eval}, --model-args, --accelerate/--no-accelerate, --rerun
  • Backend-agnostic benchmark registry (user-facing keys mapped to backend task names by adapters)
  • Resume logic: skips completed benchmarks on rerun, --rerun forces fresh evaluation
  • ProgressDisplay adapted for lmms-eval (single status icon, no sample-level progress bars)
  • Benchmarks run individually per subprocess to avoid HuggingFace datasets filelock deadlocks
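
The resume and per-subprocess behaviour described above could be sketched roughly as follows (function and variable names here are illustrative, not the PR's actual code):

```python
import subprocess


def run_benchmarks(adapter, benchmarks, work_dir, rerun=False):
    """Run each benchmark in its own subprocess, skipping completed ones."""
    # Resume: unless --rerun was given, skip anything already finished.
    completed = set() if rerun else {
        b for b in benchmarks if adapter.detect_completion(work_dir, b)
    }
    for bench in benchmarks:
        if bench in completed:
            print(f"[skip] {bench} already completed")
            continue
        # One benchmark per subprocess avoids HF datasets filelock deadlocks.
        cmd = adapter.build_cmd(bench, work_dir)
        subprocess.run(cmd, check=True)
        # Check completion right after each run for real-time display.
        if adapter.detect_completion(work_dir, bench):
            completed.add(bench)
    return completed
```

After an OOM crash, rerunning the same command re-enters this loop and only the unfinished benchmarks execute; `--rerun` empties the completed set so everything runs fresh.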

Postprocess (postprocess.py):

  • build_payload() and build_results_archive() accept optional backend_adapter parameter
  • Result archive includes only the latest files per task (no stale results from previous runs)
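
The optional `backend_adapter` parameter might be wired in roughly like this (a sketch; the payload field names and the legacy helper are assumptions):

```python
def _extract_vlmevalkit_scores(work_dir, benchmark):
    """Placeholder for the original VLMEvalKit-only score parsing."""
    return {}


def build_payload(work_dir, benchmarks, backend_adapter=None):
    """Build the submission payload, delegating to the adapter if given."""
    if backend_adapter is not None:
        # New path: the adapter knows how to parse its backend's output.
        scores = {b: backend_adapter.extract_scores(work_dir, b)
                  for b in benchmarks}
        backend_name = backend_adapter.name
    else:
        # Legacy path: unchanged VLMEvalKit-only behaviour.
        scores = {b: _extract_vlmevalkit_scores(work_dir, b)
                  for b in benchmarks}
        backend_name = "vlmevalkit"
    return {"backend": backend_name, "scores": scores}
```

Keeping the parameter optional is what preserves backward compatibility: existing callers that pass no adapter hit the legacy branch untouched.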

Environment & docs:

  • scripts/setup.sh installs both backends with pinned filelock==3.20.3 and datasets==4.5.0
  • Updated docs/Submit_results.md with lmms-eval examples, task name mapping, and resume/rerun docs
  • Updated README.md and README_CN.md with unified eval script examples
  • Updated lmms-eval submodule to latest

Usage

# VLMEvalKit (default, unchanged)
python scripts/submissions/run_easi_eval.py \
  --model Qwen/Qwen2.5-VL-7B-Instruct --nproc 4

# lmms-eval
python scripts/submissions/run_easi_eval.py \
  --backend lmms-eval \
  --model qwen3_vl \
  --model-args "pretrained=Qwen/Qwen3-VL-8B-Instruct,attn_implementation=flash_attention_2" \
  --nproc 4

# OOM recovery (single-GPU, resumes from where it left off)
python scripts/submissions/run_easi_eval.py \
  --backend lmms-eval --model qwen3_vl \
  --model-args "pretrained=Qwen/Qwen3-VL-8B-Instruct,attn_implementation=flash_attention_2" \
  --no-accelerate

Test plan

  • VLMEvalKit path: CLI parsing, command construction, benchmark selection verified via dry-run
  • lmms-eval path: full EASI-8 evaluation completed on Qwen3-VL-8B-Instruct with 4 GPUs
  • Score extraction: all 8 benchmarks correctly scaled to 0-100 with sub-scores
  • Resume logic: completed benchmarks skipped on rerun, --rerun forces fresh evaluation
  • Result archive: only latest files per task included in zip
  • Submission: successfully submitted to EASI leaderboard

oscarqjh added 12 commits April 10, 2026 13:54
Move VLMEvalKit-specific constants, dataset preparation, verification,
and progress monitoring functions into backends/vlmevalkit.py. The
VLMEvalKitAdapter class implements the BackendAdapter interface with
build_cmd, prepare_datasets, poll_progress, detect_completion,
get_result_files, extract_scores, and get_env_overrides methods.

run_easi_eval.py now imports these from the adapter module while
keeping EASI_8, EXTRA, display classes, and main() unchanged.
Create LmmsEvalAdapter with task/metric maps, accelerate command
building, result detection from *_results.json, score extraction,
and SiteBench image+video JSONL merge logic.
For benchmarks like vsi_bench where the overall metric returns a dict
containing both the overall score and sub-scores, also search inside
that dict when looking up sub-score values.
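
The lookup described in this commit might look like the following sketch: check the top-level metrics dict first, then fall back to searching one level inside dict-valued metrics (the function name is illustrative):

```python
def find_subscore(metrics, key):
    """Look up `key` in metrics, also searching inside dict-valued entries.

    Some lmms-eval metrics (e.g. vsi_bench's overall metric) return a dict
    holding both the overall score and its sub-scores, so a flat lookup
    alone would miss the sub-score keys.
    """
    if key in metrics:
        return metrics[key]
    for value in metrics.values():
        if isinstance(value, dict) and key in value:
            return value[key]
    return None
```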
…ults_archive

When backend_adapter is provided, build_payload delegates score extraction
to adapter.extract_scores() and uses adapter.name for the backend field.
build_results_archive delegates file listing to adapter.get_result_files().
Both functions retain full backward compatibility when adapter is None.
Refactor the main orchestrator to support both VLMEvalKit and lmms-eval
backends via the adapter pattern introduced in earlier commits.

Key changes:
- Replace EASI_8/EXTRA tuples with key-only EASI_8_KEYS/EXTRA_KEYS lists;
  benchmark-to-task mapping now comes from adapter.TASK_MAP
- Add CLI arguments: --backend, --model-args, --accelerate/--no-accelerate,
  --rerun
- Create adapter early in main() via get_backend() factory
- Phase 1 (dataset prep): delegate to adapter for vlmevalkit; skip for
  lmms-eval (manages its own data)
- Phase 2 (subprocess): add _run_lmmseval() path alongside existing
  _run_vlmevalkit(); add resume logic via adapter.find_completed_tasks()
- Phase 3 (verification): vlmevalkit uses existing verify_results();
  lmms-eval uses adapter.detect_completion()
- Phase 4 (postprocess): pass backend_adapter to build_payload() and
  build_results_archive()
- ProgressDisplay: add backend parameter; simplified single-status rows
  for non-vlmevalkit backends (no dual infer/eval phases or progress bars)
- _build_cmd respects --rerun flag (omits --reuse when rerun is True)
- Remove unused tempfile import

The VLMEvalKit path remains functionally identical to before.
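
The `get_backend()` factory mentioned above might be as simple as the following sketch (class and module names follow the files listed in the Changes section; the local imports are an assumption):

```python
def get_backend(name):
    """Map the --backend CLI value to an adapter instance (sketch)."""
    # Import lazily so a missing backend install doesn't break the other path.
    if name == "vlmevalkit":
        from backends.vlmevalkit import VLMEvalKitAdapter
        return VLMEvalKitAdapter()
    if name == "lmms-eval":
        from backends.lmmseval import LmmsEvalAdapter
        return LmmsEvalAdapter()
    raise ValueError(f"Unknown backend: {name}")
```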
When running all benchmarks in a single --tasks call with accelerate
multi-GPU, all workers try to download/cache datasets simultaneously,
causing filelock deadlocks in the HuggingFace datasets library.

Run one benchmark per subprocess call instead. The resume logic
handles skipping already-completed benchmarks on rerun. Each
benchmark's completion is checked after it finishes so the display
shows real-time progress.
- Update lmms-eval submodule to v0.6-94 (latest main)
- Pin filelock==3.20.3 and datasets==4.5.0 in setup.sh to avoid
  deadlocks with accelerate multi-GPU dataset loading
- Install both VLMEvalKit and lmms-eval backends in setup.sh
- Disable HF_HUB_ENABLE_HF_TRANSFER in lmms-eval subprocess env
  to avoid filelock issues on shared HuggingFace caches
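
The subprocess environment override for the lmms-eval path might be sketched as (the function name is illustrative; the variable it sets is the one named in this commit):

```python
import os


def lmmseval_env_overrides():
    """Environment for the lmms-eval subprocess (sketch).

    Disabling hf_transfer avoids filelock contention when several
    workers share one HuggingFace cache directory.
    """
    env = os.environ.copy()
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
    return env
```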
- vsi_bench: overall metric changed from vsibench_score (dict) to
  vsibench_overall (float); MRA sub-score keys now lowercase without
  range suffix (e.g. object_abs_distance_mra)
- mindcube_tiny, mmsi_bench, viewspatial: fix scale from 1 to 100
  (lmms-eval returns 0-1 values, payload expects 0-100)
- 3dsrbench: expand sub-scores from 4 to all 12, fix metric key names
  to include _accuracy suffix (e.g. height_higher_accuracy)
- embspatial: add ai2thor/mp3d/scannet sub-scores
- viewspatial: add 5 perspective-level sub-scores
- vsi_debiased: same fixes as vsi_bench
- mmsi_video_bench, omnispatial: fix scale to 100
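
The 0-1 to 0-100 scaling described here could be captured by a per-benchmark scale factor in the metric map. The entries below are illustrative of the idea only (the metric names other than `vsibench_overall` are invented for the sketch, not the PR's actual table):

```python
# benchmark key -> (metric name, scale applied to the raw lmms-eval value)
METRIC_SCALE = {
    "mindcube_tiny": ("accuracy", 100),          # lmms-eval returns 0-1
    "mmsi_bench":    ("accuracy", 100),
    "vsi_bench":     ("vsibench_overall", 1),    # assumed already 0-100
}


def scale_score(benchmark, raw):
    """Convert a raw backend score to the payload's 0-100 range."""
    _, scale = METRIC_SCALE[benchmark]
    return raw * scale
```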
get_result_files now returns the latest samples JSONL per task
instead of all accumulated files from multiple runs. Ensures
the zip archive matches the scores in the submission payload.
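
Selecting only the latest samples file per task might be done with a glob plus a modification-time comparison (a sketch; the filename pattern is an assumption about lmms-eval's timestamped output files):

```python
import glob
import os


def latest_samples_per_task(work_dir, tasks):
    """Return the newest samples JSONL for each task, ignoring older runs."""
    latest = []
    for task in tasks:
        # lmms-eval writes one timestamped samples file per run (assumed pattern).
        pattern = os.path.join(work_dir, f"*{task}*samples*.jsonl")
        candidates = glob.glob(pattern)
        if candidates:
            latest.append(max(candidates, key=os.path.getmtime))
    return latest
```

Archiving only these files keeps the zip consistent with the scores in the payload, since both are derived from the same final run.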
