[Feature] Support lmms-eval in unified eval script #35
Open
Move VLMEvalKit-specific constants, dataset preparation, verification, and progress monitoring functions into backends/vlmevalkit.py. The VLMEvalKitAdapter class implements the BackendAdapter interface with build_cmd, prepare_datasets, poll_progress, detect_completion, get_result_files, extract_scores, and get_env_overrides methods. run_easi_eval.py now imports these from the adapter module while keeping EASI_8, EXTRA, display classes, and main() unchanged.
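The interface described above can be sketched as an abstract base class. This is a hypothetical reconstruction from the method names listed in the commit message, not the actual code; only a subset of the methods is shown:

```python
from abc import ABC, abstractmethod

class BackendAdapter(ABC):
    """Sketch of the adapter interface (the real class also defines
    poll_progress, detect_completion, and get_result_files)."""
    name: str = "base"

    @abstractmethod
    def build_cmd(self, task: str, model: str) -> list[str]:
        """Argv for one benchmark subprocess."""

    @abstractmethod
    def extract_scores(self, result_dir: str) -> dict:
        """Parse the backend's result files into a scores dict."""

    def prepare_datasets(self, tasks: list[str]) -> None:
        """Pre-download/verify datasets; no-op for backends that self-manage."""

    def get_env_overrides(self) -> dict[str, str]:
        """Extra environment variables for the subprocess."""
        return {}
```

Keeping `prepare_datasets` and `get_env_overrides` non-abstract gives backends sensible defaults while forcing each one to define command building and score extraction.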
Create LmmsEvalAdapter with task/metric maps, accelerate command building, result detection from *_results.json, score extraction, and SiteBench image+video JSONL merge logic.
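The accelerate command building might look like the following sketch. The flag names (`--model`, `--model_args`, `--tasks`, `--output_path`) follow the public lmms-eval CLI; the function name, `num_gpus` parameter, and output path are assumptions for illustration:

```python
def build_lmms_cmd(task: str, model: str, model_args: str,
                   use_accelerate: bool = True, num_gpus: int = 8) -> list[str]:
    """Build the argv for one lmms-eval benchmark run (hypothetical sketch)."""
    launcher = (["accelerate", "launch", f"--num_processes={num_gpus}", "-m"]
                if use_accelerate else ["python", "-m"])
    return launcher + ["lmms_eval",
                       "--model", model,
                       "--model_args", model_args,
                       "--tasks", task,          # one benchmark per call
                       "--output_path", "results/"]
```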
For benchmarks like vsi_bench where the overall metric returns a dict containing both the overall score and sub-scores, also search inside that dict when looking up sub-score values.
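A minimal sketch of that lookup, assuming a hypothetical helper name: check the top level first, then descend into any dict-valued metric:

```python
def find_subscore(results: dict, key: str):
    """Look up a sub-score key at the top level of a results dict, then
    inside any dict-valued metric (e.g. an overall metric that bundles
    the overall score and its sub-scores, as vsi_bench did)."""
    if key in results:
        return results[key]
    for value in results.values():
        if isinstance(value, dict) and key in value:
            return value[key]
    return None
```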
When backend_adapter is provided, build_payload delegates score extraction to adapter.extract_scores() and uses adapter.name for the backend field. build_results_archive delegates file listing to adapter.get_result_files(). Both functions retain full backward compatibility when adapter is None.
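The delegation described above can be sketched as follows; `_legacy_extract_scores` is a hypothetical stand-in for the pre-existing VLMEvalKit code path, and the payload shape is illustrative:

```python
def build_payload(result_dir: str, backend_adapter=None) -> dict:
    """With an adapter, score extraction and the backend name come from it;
    with None, the legacy VLMEvalKit path runs unchanged (sketch)."""
    if backend_adapter is not None:
        scores = backend_adapter.extract_scores(result_dir)
        backend = backend_adapter.name
    else:
        scores = _legacy_extract_scores(result_dir)
        backend = "vlmevalkit"
    return {"backend": backend, "scores": scores}

def _legacy_extract_scores(result_dir: str) -> dict:
    return {}  # placeholder for the pre-existing extraction logic
```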
Refactor the main orchestrator to support both VLMEvalKit and lmms-eval backends via the adapter pattern introduced in earlier commits. Key changes:
- Replace EASI_8/EXTRA tuples with key-only EASI_8_KEYS/EXTRA_KEYS lists; benchmark-to-task mapping now comes from adapter.TASK_MAP
- Add CLI arguments: --backend, --model-args, --accelerate/--no-accelerate, --rerun
- Create the adapter early in main() via the get_backend() factory
- Phase 1 (dataset prep): delegate to the adapter for vlmevalkit; skip for lmms-eval, which manages its own data
- Phase 2 (subprocess): add a _run_lmmseval() path alongside the existing _run_vlmevalkit(); add resume logic via adapter.find_completed_tasks()
- Phase 3 (verification): vlmevalkit uses the existing verify_results(); lmms-eval uses adapter.detect_completion()
- Phase 4 (postprocess): pass backend_adapter to build_payload() and build_results_archive()
- ProgressDisplay: add a backend parameter; simplified single-status rows for non-vlmevalkit backends (no dual infer/eval phases or progress bars)
- _build_cmd respects the --rerun flag (omits --reuse when rerun is True)
- Remove the unused tempfile import

The VLMEvalKit path remains functionally identical to before.
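The get_backend() factory mentioned above might look like this sketch. The tiny adapter stand-ins here are placeholders for the real classes in backends/; the error behavior is an assumption:

```python
class VLMEvalKitAdapter:  # stand-in for the real class in backends/vlmevalkit.py
    name = "vlmevalkit"

class LmmsEvalAdapter:    # stand-in for the real class in backends/lmmseval.py
    name = "lmms-eval"

def get_backend(name: str):
    """Map a --backend value to an adapter instance, early in main() (sketch)."""
    registry = {"vlmevalkit": VLMEvalKitAdapter, "lmms-eval": LmmsEvalAdapter}
    if name not in registry:
        raise SystemExit(f"unknown --backend {name!r}; choose from {sorted(registry)}")
    return registry[name]()
```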
When running all benchmarks in a single --tasks call with accelerate multi-GPU, all workers try to download/cache datasets simultaneously, causing filelock deadlocks in the HuggingFace datasets library. Run one benchmark per subprocess call instead. The resume logic handles skipping already-completed benchmarks on rerun. Each benchmark's completion is checked after it finishes so the display shows real-time progress.
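The per-benchmark loop with resume can be sketched as follows; the function and parameter names are hypothetical, and `run_one` stands in for a `subprocess.run([...], check=True)` wrapper:

```python
def run_benchmarks(tasks, completed, run_one):
    """Run one benchmark per subprocess call (sketch). Skipping tasks in
    `completed` gives resume-on-rerun; adding to it after each run (rather
    than at the very end) lets the display show real-time progress."""
    for task in tasks:
        if task in completed:
            continue          # resume: finished on a previous run
        run_one(task)         # one `--tasks <task>` subprocess invocation
        completed.add(task)   # recorded as soon as this benchmark finishes
```

Serializing the benchmarks this way means only one accelerate worker group touches the HuggingFace datasets cache at a time, which is what avoids the filelock deadlock.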
- Update the lmms-eval submodule to v0.6-94 (latest main)
- Pin filelock==3.20.3 and datasets==4.5.0 in setup.sh to avoid deadlocks with accelerate multi-GPU dataset loading
- Install both the VLMEvalKit and lmms-eval backends in setup.sh
- Disable HF_HUB_ENABLE_HF_TRANSFER in the lmms-eval subprocess environment to avoid filelock issues on shared HuggingFace caches
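The environment override for the subprocess can be sketched as below; the function name is hypothetical, but HF_HUB_ENABLE_HF_TRANSFER is the real HuggingFace Hub variable:

```python
import os

def lmms_eval_env() -> dict[str, str]:
    """Environment passed to the lmms-eval subprocess (sketch): start from
    the current environment and disable hf_transfer to avoid filelock
    issues on a shared HuggingFace cache."""
    env = dict(os.environ)
    env["HF_HUB_ENABLE_HF_TRANSFER"] = "0"
    return env
```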
- vsi_bench: the overall metric changed from vsibench_score (a dict) to vsibench_overall (a float); MRA sub-score keys are now lowercase without the range suffix (e.g. object_abs_distance_mra)
- mindcube_tiny, mmsi_bench, viewspatial: fix scale from 1 to 100 (lmms-eval returns 0-1 values; the payload expects 0-100)
- 3dsrbench: expand sub-scores from 4 to all 12; fix metric key names to include the _accuracy suffix (e.g. height_higher_accuracy)
- embspatial: add ai2thor/mp3d/scannet sub-scores
- viewspatial: add 5 perspective-level sub-scores
- vsi_debiased: same fixes as vsi_bench
- mmsi_video_bench, omnispatial: fix scale to 100
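The metric-key-plus-scale pairing above suggests a per-benchmark spec table. This is a hypothetical sketch: vsibench_overall comes from the change listed above, but the mmsi_bench key name and the table shape are assumptions:

```python
# Hypothetical spec table: overall metric key and scale factor to 0-100.
METRIC_SPECS = {
    "vsi_bench": ("vsibench_overall", 1),     # already on a 0-100 scale
    "mmsi_bench": ("mmsi_bench_score", 100),  # lmms-eval reports 0-1
}

def extract_overall(bench: str, results: dict) -> float:
    """Read a benchmark's overall score and normalize it to 0-100."""
    key, scale = METRIC_SPECS[bench]
    return results[key] * scale
```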
get_result_files now returns the latest samples JSONL per task instead of all files accumulated across multiple runs, ensuring the zip archive matches the scores in the submission payload.
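Picking the latest file per task can be sketched with a glob sorted by modification time; the function name and the samples filename pattern are assumptions:

```python
from pathlib import Path

def latest_samples_per_task(result_dir: str, tasks: list[str]) -> list[Path]:
    """Return only the most recently modified samples JSONL per task, so a
    rerun's stale outputs never end up in the archive (sketch)."""
    latest = []
    for task in tasks:
        candidates = sorted(Path(result_dir).glob(f"*samples_{task}*.jsonl"),
                            key=lambda p: p.stat().st_mtime)
        if candidates:
            latest.append(candidates[-1])  # newest file wins
    return latest
```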
Summary
Changes
Backend adapter pattern (scripts/submissions/backends/):
Orchestrator refactor (run_easi_eval.py):
Postprocess (postprocess.py):
Environment & docs:
Usage
Test plan