feat(evaluation): add VLMMetrics #545

Open

davidberenstein1957 wants to merge 12 commits into main from feat/metrics-vlm-support

Conversation

@davidberenstein1957 (Member)

Add ImageRewardMetric for evaluating image-text alignment using the ImageReward library.

… support

- Add vlm_base.py with LitellmVLM and TransformersVLM
- Add metrics_vlm.py with VLM-based metrics:
  - VQAMetric
  - AlignmentScoreMetric
  - ImageEditScoreMetric
  - QAAccuracyMetric
  - TextScoreMetric
  - VieScoreMetric
- Uses litellm (default gpt-4o) or local transformers models
ARNIQA is not available in torchmetrics 1.7.4, so a simplified version is implemented with optional pretrained weight loading.
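The litellm path mentioned above takes OpenAI-style multimodal messages. As a hedged sketch of what such a request payload looks like (the helper name `build_vlm_messages` and the exact wiring in `vlm_base.py` are assumptions, not the PR's code):

```python
import base64


def build_vlm_messages(prompt: str, image_bytes: bytes) -> list:
    """Build an OpenAI-style multimodal message list as accepted by litellm."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # Images are passed inline as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }
    ]


# The actual request would then be roughly:
#   response = litellm.completion(model="gpt-4o", messages=build_vlm_messages(...))
messages = build_vlm_messages("Does the image match the caption?", b"\x89PNG...")
```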

@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 4 potential issues.



```python
def compute(self) -> MetricResult:
    # Guard against division by zero when no samples were accumulated.
    result = self.total / self.count if self.count.item() != 0 else torch.zeros(1)
    return MetricResult(self.metric_name, self.__dict__.copy(), result.item())
```

AlignmentScoreMetric is an identical copy of VQAMetric

Medium Severity

AlignmentScoreMetric has an update() method completely identical to VQAMetric's: same prompt template, same binary Yes/No scoring logic. These are registered as two separate metrics ("alignment_score" vs "vqa") but produce exactly the same results. An alignment score metric would typically use a graded scale (such as 0–10) rather than a binary check, suggesting the update logic was copy-pasted from VQAMetric and never differentiated.
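As the reviewer suggests, a graded alignment metric would prompt for a 0–10 rating and parse it out. A minimal sketch of that idea (the prompt wording and `parse_alignment_score` helper are hypothetical, not the PR's code):

```python
import re

# Hypothetical graded prompt instead of a binary Yes/No question.
ALIGNMENT_PROMPT = (
    "On a scale from 0 to 10, how well does this image match the caption "
    "'{caption}'? Answer with a single integer."
)


def parse_alignment_score(response_text: str) -> float:
    """Extract the first integer 0-10 from the model response, normalized to [0, 1]."""
    match = re.search(r"\b(10|[0-9])\b", response_text)
    if match is None:
        raise ValueError(f"No 0-10 rating found in: {response_text!r}")
    return int(match.group(1)) / 10.0


print(parse_alignment_score("I would rate this 7 out of 10."))  # 0.7
```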

Additional Locations (1)


@davidberenstein1957 changed the title from feat(evaluation): add ImageRewardMetric to feat(evaluation): add VLMMetrics on Feb 21, 2026
- Use scores: List[float] instead of tensor total/count
- Add default_call_type and runs_on attributes
- Match SharpnessMetric pattern
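The scores-as-list change above replaces tensor `total`/`count` state with a plain Python list. A minimal sketch of that pattern (class and attribute names assumed for illustration, not the PR's actual classes):

```python
from typing import List


class ScoreListMetric:
    """Minimal sketch of the scores-as-list accumulation pattern."""

    metric_name = "example_score"

    def __init__(self) -> None:
        self.scores: List[float] = []

    def update(self, score: float) -> None:
        # Append one per-sample score instead of accumulating total/count tensors.
        self.scores.append(score)

    def compute(self) -> float:
        # Mean over all samples; 0.0 when nothing was accumulated.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0


m = ScoreListMetric()
m.update(0.5)
m.update(1.0)
print(m.compute())  # 0.75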
The async version was returning a coroutine instead of the actual
response, causing all VLM metrics to silently fail.
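The "coroutine instead of the actual response" failure described above is the classic missing `await`. A minimal reproduction of the bug pattern and its fix (a stand-in stub, unrelated to the PR's actual client code):

```python
import asyncio


async def call_vlm(prompt: str) -> str:
    # Stand-in stub for an async VLM call.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def broken() -> object:
    # Bug: the coroutine is returned without being awaited, so callers
    # receive a coroutine object instead of the response text.
    return call_vlm("is the sky blue?")


async def fixed() -> str:
    # Fix: await the call so the actual response is returned.
    return await call_vlm("is the sky blue?")


bad = asyncio.run(broken())
good = asyncio.run(fixed())
print(type(bad).__name__)  # coroutine
print(good)  # answer to: is the sky blue?
bad.close()  # silence the "coroutine was never awaited" warning
```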
- Add pydantic models for structured output (VQAnswer, ScoreOutput)
- LitellmVLM: Use response_format parameter for stable outputs
- TransformersVLM: Add outlines support for constrained decoding
- Add structured_output flag to all VLM metrics
- Add proper paper references (VQAScore, VieScore)
- Add pydantic>=2.0.0 to dependencies
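The structured-output commit above names pydantic models VQAnswer and ScoreOutput; their exact fields are not shown in this thread, so the following is a hedged sketch with assumed field names:

```python
from pydantic import BaseModel, Field


class VQAnswer(BaseModel):
    """Hypothetical structured answer for VQA-style metrics (field names assumed)."""
    answer: str = Field(description="Yes or No")


class ScoreOutput(BaseModel):
    """Hypothetical structured score for graded metrics (field names assumed)."""
    score: float = Field(ge=0.0, le=10.0, description="Rating from 0 to 10")


# With litellm, such a model can be passed as response_format so the provider
# returns JSON matching the schema, e.g.:
#   litellm.completion(model="gpt-4o", messages=..., response_format=ScoreOutput)
parsed = ScoreOutput.model_validate_json('{"score": 8.5}')
print(parsed.score)  # 8.5
```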
- Add docstrings to update/compute methods
- Fix type hints
- Add ruff fixes
- Add PIL import at top
- Fix type hints
- D205 docstring issues are from multi-line examples
The metrics_vlm module uses a different docstring pattern for VLM
parameters that doesn't fit numpydoc's PR01 check. Skip this check
for the new VLM metrics.
- Added detailed parameter descriptions to VQAnswer, ScoreOutput, and various metric classes in metrics_vlm.py.
- Updated docstrings in base classes of vlm_base.py to include parameter details and return types.
- Improved clarity and consistency across all metric-related docstrings.