feat(evaluation): add VLMMetrics #545

Open

davidberenstein1957 wants to merge 12 commits into main from feat/metrics-vlm-support

Conversation

@davidberenstein1957 (Member)

Add ImageRewardMetric for evaluating image-text alignment using the ImageReward library.

… support

- Add vlm_base.py with LitellmVLM and TransformersVLM
- Add metrics_vlm.py with VLM-based metrics:
  - VQAMetric
  - AlignmentScoreMetric
  - ImageEditScoreMetric
  - QAAccuracyMetric
  - TextScoreMetric
  - VieScoreMetric
- Uses litellm (default gpt-4o) or local transformers models
ARNIQA is not available in torchmetrics 1.7.4, so a simplified version is implemented with optional pretrained weight loading.
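The litellm path mentioned above takes OpenAI-style multimodal messages. As a hedged sketch of what such a request payload looks like (the helper name `build_vlm_messages` and the exact wiring in `vlm_base.py` are assumptions, not the PR's code):

```python
import base64


def build_vlm_messages(prompt: str, image_bytes: bytes) -> list:
    """Build an OpenAI-style multimodal message list as accepted by litellm."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Images are passed inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]


# The actual request would then be roughly:
#   response = litellm.completion(model="gpt-4o", messages=build_vlm_messages(...))
messages = build_vlm_messages("Does the image match the caption?", b"\x89PNG...")
```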

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.



```python
def compute(self) -> MetricResult:
    # Guard against division by zero when no samples were accumulated.
    result = self.total / self.count if self.count.item() != 0 else torch.zeros(1)
    return MetricResult(self.metric_name, self.__dict__.copy(), result.item())
```

AlignmentScoreMetric is an identical copy of VQAMetric

Medium Severity

AlignmentScoreMetric has an update() method completely identical to VQAMetric's: same prompt template, same binary Yes/No scoring logic. These are registered as two separate metrics ("alignment_score" vs "vqa") but produce exactly the same results. An alignment score metric would typically use a graded scale (such as 0–10) rather than a binary check, suggesting the update logic was copy-pasted from VQAMetric and never differentiated.
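As the reviewer suggests, a graded alignment metric would prompt for a 0–10 rating and parse it out. A minimal sketch of that idea (the prompt wording and `parse_alignment_score` helper are hypothetical, not the PR's code):

```python
import re

# Hypothetical graded prompt instead of a binary Yes/No question.
ALIGNMENT_PROMPT = (
    "On a scale from 0 to 10, how well does this image match the caption "
    "'{caption}'? Answer with a single integer."
)


def parse_alignment_score(response_text: str) -> float:
    """Extract the first integer 0-10 from the model response, normalized to [0, 1]."""
    match = re.search(r"\b(10|[0-9])\b", response_text)
    if match is None:
        raise ValueError(f"No 0-10 rating found in: {response_text!r}")
    return int(match.group(1)) / 10.0


print(parse_alignment_score("I would rate this 7 out of 10."))  # 0.7
```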

Additional Locations (1)


@davidberenstein1957 changed the title from feat(evaluation): add ImageRewardMetric to feat(evaluation): add VLMMetrics on Feb 21, 2026
- Use scores: List[float] instead of tensor total/count
- Add default_call_type and runs_on attributes
- Match SharpnessMetric pattern
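The scores-as-list change above replaces tensor `total`/`count` state with a plain Python list. A minimal sketch of that pattern (class and attribute names assumed for illustration, not the PR's actual classes):

```python
from typing import List


class ScoreListMetric:
    """Minimal sketch of the scores-as-list accumulation pattern."""

    metric_name = "example_score"

    def __init__(self) -> None:
        self.scores: List[float] = []

    def update(self, score: float) -> None:
        # Append one per-sample score instead of accumulating total/count tensors.
        self.scores.append(score)

    def compute(self) -> float:
        # Mean over all samples; 0.0 when nothing was accumulated.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


m = ScoreListMetric()
m.update(0.5)
m.update(1.0)
print(m.compute())  # 0.75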
The async version was returning a coroutine instead of the actual
response, causing all VLM metrics to silently fail.
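The "coroutine instead of the actual response" failure described above is the classic missing `await`. A minimal reproduction of the bug pattern and its fix (a stand-in stub, unrelated to the PR's actual client code):

```python
import asyncio


async def call_vlm(prompt: str) -> str:
    # Stand-in stub for an async VLM call.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def broken() -> object:
    # Bug: the coroutine is returned without being awaited, so callers
    # receive a coroutine object instead of the response text.
    return call_vlm("is the sky blue?")


async def fixed() -> str:
    # Fix: await the call so the actual response is returned.
    return await call_vlm("is the sky blue?")


bad = asyncio.run(broken())
good = asyncio.run(fixed())
print(type(bad).__name__)  # coroutine
print(good)  # answer to: is the sky blue?
bad.close()  # silence the "coroutine was never awaited" warning
```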
- Add pydantic models for structured output (VQAnswer, ScoreOutput)
- LitellmVLM: Use response_format parameter for stable outputs
- TransformersVLM: Add outlines support for constrained decoding
- Add structured_output flag to all VLM metrics
- Add proper paper references (VQAScore, VieScore)
- Add pydantic>=2.0.0 to dependencies
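The structured-output commit above names pydantic models VQAnswer and ScoreOutput; their exact fields are not shown in this thread, so the following is a hedged sketch with assumed field names:

```python
from pydantic import BaseModel, Field


class VQAnswer(BaseModel):
    """Hypothetical structured answer for VQA-style metrics (field names assumed)."""
    answer: str = Field(description="Yes or No")


class ScoreOutput(BaseModel):
    """Hypothetical structured score for graded metrics (field names assumed)."""
    score: float = Field(ge=0.0, le=10.0, description="Rating from 0 to 10")


# With litellm, such a model can be passed as response_format so the provider
# returns JSON matching the schema, e.g.:
#   litellm.completion(model="gpt-4o", messages=..., response_format=ScoreOutput)
parsed = ScoreOutput.model_validate_json('{"score": 8.5}')
print(parsed.score)  # 8.5
```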
- Add docstrings to update/compute methods
- Fix type hints
- Add ruff fixes
- Add PIL import at top
- Fix type hints
- D205 docstring issues are from multi-line examples
The metrics_vlm module uses a different docstring pattern for VLM
parameters that doesn't fit numpydoc's PR01 check. Skip this check
for the new VLM metrics.
- Added detailed parameter descriptions to VQAnswer, ScoreOutput, and various metric classes in metrics_vlm.py.
- Updated docstrings in base classes of vlm_base.py to include parameter details and return types.
- Improved clarity and consistency across all metric-related docstrings.