Conversation
… support
- Add vlm_base.py with LitellmVLM and TransformersVLM
- Add metrics_vlm.py with VLM-based metrics:
  - VQAMetric
  - AlignmentScoreMetric
  - ImageEditScoreMetric
  - QAAccuracyMetric
  - TextScoreMetric
  - VieScoreMetric
- Uses litellm (default gpt-4o) or local transformers models
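The split described above — backend classes plus metrics that consume them — can be sketched as a minimal interface. All names here are illustrative (the PR's actual classes are `LitellmVLM` and `TransformersVLM`; a real backend would call e.g. `litellm.completion(model="gpt-4o", ...)` instead of the stub below):

```python
from abc import ABC, abstractmethod


class BaseVLM(ABC):
    """Hypothetical shared interface for VLM backends."""

    @abstractmethod
    def ask(self, image, prompt: str) -> str:
        """Return the model's free-text answer for an image + prompt."""


class EchoVLM(BaseVLM):
    """Stand-in backend for illustration only; a real LitellmVLM
    would forward the image and prompt to litellm here."""

    def ask(self, image, prompt: str) -> str:
        return "Yes"


def binary_score(answer: str) -> float:
    # VQA-style metrics map a Yes/No reply to 1.0/0.0.
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0
```

A metric's `update()` then only needs a backend instance and a prompt template, which is what makes the litellm/transformers backends interchangeable.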
ARNIQA is not available in torchmetrics 1.7.4. Implementing simplified version with optional pretrained weight loading.
Cursor Bugbot has reviewed your changes and found 4 potential issues.
```python
def compute(self) -> MetricResult:
    result = self.total / self.count if self.count.item() != 0 else torch.zeros(1)
    return MetricResult(self.metric_name, self.__dict__.copy(), result.item())
```
AlignmentScoreMetric is an identical copy of VQAMetric
Medium Severity
AlignmentScoreMetric has an update() method identical to VQAMetric's — same prompt template, same binary Yes/No scoring logic. The two are registered as separate metrics ("alignment_score" vs "vqa") but produce exactly the same results. An alignment score metric would typically use a graded scale (e.g. 0–10) rather than a binary check, suggesting the update logic was copy-pasted from VQAMetric and never differentiated.
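A graded variant, as the review suggests, would change both the prompt and the answer parsing. A minimal sketch (prompt wording and normalization are assumptions, not the PR's code):

```python
import re

# Hypothetical graded prompt; the PR's binary version asks a Yes/No question.
ALIGNMENT_PROMPT = (
    "On a scale of 0 to 10, how well does this image match the caption "
    "'{caption}'? Answer with a single integer."
)


def parse_graded_score(answer: str) -> float:
    """Extract the first integer from the VLM reply, clamp to 0-10,
    and normalize to [0, 1]."""
    match = re.search(r"\d+", answer)
    if match is None:
        return 0.0
    return min(int(match.group()), 10) / 10.0
```

This keeps the metric's accumulation logic unchanged while giving "alignment_score" a continuous signal that actually differs from the binary VQA check.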
- Use scores: List[float] instead of tensor total/count
- Add default_call_type and runs_on attributes
- Match SharpnessMetric pattern
The async version was returning a coroutine object instead of awaiting the actual response, causing all VLM metrics to silently fail.
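The bug class is easy to reproduce in isolation: calling an async function without `await` hands back a coroutine object, which downstream code then scores as garbage. A minimal sketch (function names are illustrative, not the PR's):

```python
import asyncio


async def query_vlm(prompt: str) -> str:
    # Stand-in for an async litellm/transformers call.
    await asyncio.sleep(0)
    return "Yes"


async def broken(prompt: str):
    return query_vlm(prompt)        # BUG: returns a coroutine object


async def fixed(prompt: str) -> str:
    return await query_vlm(prompt)  # FIX: await the underlying call
```

The failure is "silent" because a coroutine object is truthy and str()-able, so naive Yes/No parsing just sees a non-"Yes" string and records 0.0 everywhere.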
- Add pydantic models for structured output (VQAnswer, ScoreOutput)
- LitellmVLM: Use response_format parameter for stable outputs
- TransformersVLM: Add outlines support for constrained decoding
- Add structured_output flag to all VLM metrics
- Add proper paper references (VQAScore, VieScore)
- Add pydantic>=2.0.0 to dependencies
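The structured-output models named above might look roughly like this. The field names and the rationale field are assumptions; only the class names `VQAnswer` and `ScoreOutput` and the use of pydantic v2 come from the commit:

```python
from pydantic import BaseModel


class VQAnswer(BaseModel):
    """Hypothetical shape of the structured Yes/No VQA output."""

    answer: str  # expected "Yes" or "No"


class ScoreOutput(BaseModel):
    """Hypothetical shape of the graded-score output."""

    score: int
    rationale: str = ""  # assumed optional field


# With a schema in hand, the model reply parses deterministically
# instead of relying on regex over free text:
raw = '{"score": 7, "rationale": "partial match"}'
parsed = ScoreOutput.model_validate_json(raw)
```

With litellm, such a model can be passed via the `response_format` parameter so the backend enforces the schema; outlines plays the same role for local transformers models via constrained decoding.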
- Add docstrings to update/compute methods
- Fix type hints
- Add ruff fixes
- Add PIL import at top
- Fix type hints
- D205 docstring issues are from multi-line examples
The metrics_vlm module uses a different docstring pattern for VLM parameters that doesn't fit numpydoc's PR01 check. Skip this check for the new VLM metrics.
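One way to express such a skip, assuming numpydoc's pyproject-based validation config is in use (the exact setup in this repo is not shown here, so treat this as a sketch):

```toml
[tool.numpydoc_validation]
# "all" enables every check; entries after it are excluded.
checks = ["all", "PR01"]
```

If the check should only be relaxed for the new module rather than globally, numpydoc's `exclude` regex list (matched against fully qualified object names) is the usual alternative.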
- Added detailed parameter descriptions to VQAnswer, ScoreOutput, and various metric classes in metrics_vlm.py.
- Updated docstrings in base classes of vlm_base.py to include parameter details and return types.
- Improved clarity and consistency across all metric-related docstrings.


Add ImageRewardMetric for evaluating image-text alignment using the ImageReward library.