Enhance FinMME and Verifier with new features and improvements by aravind-3105 · Pull Request #16 · VectorInstitute/AgentFinVQA

aravind-3105 · 2026-06-11T19:47:02Z

Summary

Builds out the full AgentFinVQA pipeline from the initial skeleton: a multi-stage agentic framework (planner → vision → verifier), FinMME and ChartQAPro dataset loaders, the MEP schema and writer, an eval framework, Langfuse integration, zero-shot baselines, iterative fix experiments (v1–v10), and vLLM/Qwen backend support. This branch represents the complete working system.

Clickup Ticket(s): N/A

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
📝 Documentation update
🔧 Refactoring (no functional changes)
⚡ Performance improvement
🧪 Test improvements
🔒 Security fix

Changes Made

Agent pipeline — planner_agent, vision_agent, verifier_agent with CrewAI, plus tools: legend_grounder_tool, color_area_tool (OpenCV pixel-counting), ocr_reader_tool, verifier_tool, vision_qa_tool
Dataset loaders — FinMME and ChartQAPro loaders with perceived-sample abstraction and image utilities
MEP framework — mep/schema.py (dataclasses for full execution trace incl. MEPColorArea), mep/writer.py
Eval framework — rule-based scorer with numeric tolerance and MCQ partial credit, LLM judge, error taxonomy, trace evaluator, top-k eval, summary/report/dashboard
Langfuse integration — tracing, scoring, dataset ingestion, prompt management
vLLM / Qwen backend — openai_compat.py centralises endpoint resolution and vLLM dummy-key injection; model_compat.py handles reasoning-model temperature quirks; Qwen3.5 thinking-block suppression via extra_body
Strict JSON parsing — json_strict.py with thinking-block stripping, CoT preamble handling, json_repair fallback
Scripts — run_batch.py (single-pass generate + eval), run_finmme_batch.py, submission scripts, compare_mep_runs.py
Pre-commit — ruff, mypy, typos, nbqa-ruff, uv-lock hooks configured
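
The eval framework's rule-based scorer is described above only by its two features, numeric tolerance and MCQ partial credit. The following is a minimal sketch of what such rules could look like, not the scorer shipped in this PR: the function names, the 5% relative tolerance, and the partial-credit formula are all assumptions made for illustration.

```python
import math


def score_numeric(pred: float, gold: float, rel_tol: float = 0.05) -> float:
    """Return 1.0 if pred is within rel_tol of gold, else 0.0.

    The 5% default is an assumed value, not taken from the PR.
    """
    return 1.0 if math.isclose(pred, gold, rel_tol=rel_tol, abs_tol=1e-9) else 0.0


def score_mcq(pred: set[str], gold: set[str]) -> float:
    """Partial credit for multi-answer MCQs (hypothetical rule).

    Credit is the fraction of gold options selected, and drops to zero
    as soon as any incorrect option is selected.
    """
    if not gold or pred - gold:
        return 0.0
    return len(pred & gold) / len(gold)
```

Under these assumptions, `score_mcq({"A"}, {"A", "C"})` earns half credit, while adding a wrong option such as `"B"` forfeits it entirely. Other partial-credit schemes (for example, Jaccard overlap) would be equally plausible.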

Testing

Tests pass locally (uv run pytest tests/)
Type checking passes (uv run mypy src/)
Linting passes (uv run pre-commit run --all-files — all hooks green)
Manual testing performed (describe below)

Manual testing details:
Full pipeline runs validated end-to-end on FinMME train.

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

Requires .env with OPENAI_API_KEY / GEMINI_API_KEY (see .env.example)
For local vLLM: set OPENAI_BASE_URL to the vLLM endpoint; api_key is auto-substituted with "EMPTY" so the OpenAI SDK does not reject a blank string
Cache dirs must point to /fs02 paths — the /projects symlink resolves to broken /fs01
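
The endpoint handling described in these notes can be sketched roughly as follows. This is an illustration under assumptions, not the contents of openai_compat.py: the function name `resolve_openai_endpoint` and its return shape are hypothetical, and only the environment variable names and the "EMPTY" placeholder come from the notes above.

```python
import os


def resolve_openai_endpoint(env=None):
    """Return (base_url, api_key) for an OpenAI-compatible client.

    Hypothetical helper: if a custom base URL is set and no key is
    provided, substitute the placeholder "EMPTY" so the client is never
    handed a blank string.
    """
    env = os.environ if env is None else env
    base_url = env.get("OPENAI_BASE_URL") or None
    api_key = env.get("OPENAI_API_KEY") or ""
    if base_url and not api_key.strip():
        # Local endpoint with no key configured: inject the dummy key.
        api_key = "EMPTY"
    return base_url, api_key
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test; a real key is left untouched whenever one is present.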

Checklist

Code follows the project's style guidelines
Self-review of code completed
Documentation updated (results.md, updates.md, README.md)
No sensitive information (API keys, credentials) exposed

…nd enhance vision prompts
- Added color_area_tool for pixel-counting of chart areas based on legend colors.
- Introduced the MEPColorArea schema to capture color area measurement results.
- Updated vision agent and prompts to include color area information.
- Created scripts for repairing MEPs and handling confidence gating.
- Added unit tests for the new color area functionality.
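
The idea behind pixel-counting by legend color can be illustrated with a dependency-free sketch. The PR's tool is built on OpenCV; the version below deliberately uses plain Python lists so it runs anywhere, and the function name, the per-channel tolerance of 20, and the RGB tuple layout are assumptions rather than details from the PR. An OpenCV version would more likely rely on `cv2.inRange` and `cv2.countNonZero`.

```python
def color_area_fraction(pixels, target_rgb, tol=20):
    """Return the fraction of pixels close to target_rgb.

    pixels is a list of rows, each row a list of (r, g, b) tuples.
    A pixel matches when every channel is within tol of the target
    (tol=20 is an assumed default).
    """
    total = 0
    matched = 0
    for row in pixels:
        for r, g, b in row:
            total += 1
            if (
                abs(r - target_rgb[0]) <= tol
                and abs(g - target_rgb[1]) <= tol
                and abs(b - target_rgb[2]) <= tol
            ):
                matched += 1
    return matched / total if total else 0.0
```

Given the legend color for a series, the returned fraction approximates that series' share of the chart area, which can then be passed to the vision prompt as supporting evidence.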

…andling (v9)
- Added a new script for running FinMME fixes with related sentences.
- Updated VerifierAgent to incorporate related sentences for improved verification context.
- Enhanced VerifierTool with a function to format source sentences for prompts.
- Adjusted verifier prompts to include source sentences in the evaluation process.

…tionality - v10
- Introduced functions to manage high-confidence choice analysis and format ambiguity blocks for the verifier.
- Updated VerifierAgent to include ambiguity hints based on vision agent confidence.
- Enhanced VerifierTool to support ambiguity flags in prompts and added handling for existing metrics during evaluation.
- Modified scripts to allow resuming evaluations by merging with existing metrics, improving efficiency in processing MEPs.
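
Resuming an evaluation by merging with existing metrics could work along these lines. This is only a sketch: it assumes metrics are stored as a dict keyed by sample ID, and both helper names are hypothetical.

```python
def pending_ids(all_ids, existing):
    """Return the sample IDs that still need to be evaluated."""
    return [sid for sid in all_ids if sid not in existing]


def merge_metrics(existing, fresh):
    """Combine earlier results with newly computed ones.

    Fresh results win on conflict; neither input dict is mutated.
    """
    merged = dict(existing)
    merged.update(fresh)
    return merged
```

A resumed run would evaluate only `pending_ids(...)`, then write `merge_metrics(...)` back to disk, so MEPs scored in an earlier run are not re-processed.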

- results.md §8b: v9 full-scale row, McNemar p-values, latency tail
- updates.md: v9 full-eval paragraph; fair-baseline now lists all three agents
- camera_ready_metrics.md: v9 added to headline table, verifier subsection
- README.md: v9 row updated, fair-baseline rewritten
- scripts: add v10 runner, eval-resume tweak to submit_eval.sh

- Added support for local OpenAI-compatible endpoints, allowing the use of vLLM for Qwen models.
- Refactored LLM initialization across agents to centralize API key and base URL handling.
- Updated various tools to utilize the new OpenAI compatibility functions, ensuring seamless integration with local servers.
- Improved JSON parsing to handle Qwen-specific output formats, including thinking blocks and markdown fences.
- Added unit tests for strict JSON extraction from LLM outputs to ensure robustness.
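
Strict JSON extraction of this kind can be sketched with the standard library alone. The code below is not json_strict.py: the function name is hypothetical, the `<think>…</think>` tag format is an assumption about how thinking blocks appear in the output, and the json_repair fallback mentioned in the PR description is omitted.

```python
import json
import re

_FENCE = "`" * 3  # markdown code-fence marker, built indirectly for readability
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def extract_json(text: str) -> dict:
    """Parse the first JSON object found in an LLM reply.

    Steps: drop thinking blocks, prefer the body of a fenced block if
    one exists, skip any prose before the first "{", then decode.
    """
    cleaned = _THINK_RE.sub("", text)
    fenced = _FENCE_RE.search(cleaned)
    if fenced:
        cleaned = fenced.group(1)
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode stops at the end of the object, so trailing prose is ignored.
    obj, _ = json.JSONDecoder().raw_decode(cleaned[start:])
    return obj
```

One known limitation of this sketch: a chain-of-thought preamble that itself contains a literal `{` before the real payload will raise `json.JSONDecodeError`, which is where a repair fallback would come in.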

Remove from git tracking (files retained locally):
- results.md, updates.md, markdown/camera_ready_metrics.md
- notebooks/analysis.ipynb, notebooks/run_pipeline.ipynb
- baselines/fix_zeroshot_scores.py, baselines/submit_zeroshot.sh
- scripts/repair_meps_v6.py, scripts/repair_meps_v7.py
- scripts/run_finmme_fixes_v{8,9,10}.sh, run_finmme_legend_grounding.sh
- scripts/submit_eval.sh, scripts/submit_pipeline.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Upgraded jupyterlab to version 4.5.7 for improved security.
- Added security lower bounds for transitive dependencies: aiohttp, cryptography, gitpython, idna, jupyter-server, and mistune.
- Updated the version of aiohttp to 3.14.1 in the lock file.
- Added new CVE exclusions in the code checks workflow.

- Bump security lower-bounds: aiohttp, cryptography, gitpython, idna, jupyter-server, mistune, nbconvert, notebook, pillow, pyjwt, pymdown-extensions, python-multipart, requests, urllib3, xgrammar
- Upgrade vllm>=0.22.0; remove numpy<2.0 cap (vllm 0.22 requires numpy>=2)
- Add json-repair>=0.25.2 to main deps (was only transitive via crewai group)
- Move `from crewai import LLM` back inside build_crewai_llm() as lazy import (noqa: PLC0415) so unit-test CI doesn't fail without agentic-xai-eval group
- Bump uv to 0.11.15 in both CI workflows; add pip>=26.1.2, uv>=0.11.15 to dev deps for their own CVE fixes
- Add ignore-vulns entries for no-fix / acceptable-risk CVEs
- Regenerate uv.lock (391 packages)

Top-level `from crewai import Agent, Crew, Task` caused ModuleNotFoundError during pytest collection for test_legend_grounding.py because the agentic-xai-eval dependency group is not installed in CI (uv sync --all-extras --dev only installs extras, not named groups). Mirrors the same fix applied to openai_compat.build_crewai_llm().

Top-level `from crewai import Agent, Crew, Task` in vision_agent.py and `from crewai.tools import BaseTool` in all four tool files caused ModuleNotFoundError during pytest collection because the agentic-xai-eval dependency group is not installed in CI (uv sync --all-extras --dev only installs extras, not named groups).
- vision_agent.py: moved Agent/Crew/Task import inside VisionAgent.run() with # noqa: PLC0415
- vision_qa_tool.py, legend_grounder_tool.py, ocr_reader_tool.py, verifier_tool.py: wrapped BaseTool import in try/except ImportError, falling back to pydantic.BaseModel so the class definition succeeds without crewai installed
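
The two import patterns described in this commit message can be sketched as follows. `EchoTool` and `load_agent_class` are hypothetical names used only for illustration, and the innermost stub class is an extra assumption added so the sketch runs even when neither crewai nor pydantic is installed; the commit itself falls back to pydantic.BaseModel only.

```python
try:
    # Preferred: the real CrewAI base class, when the optional group is installed.
    from crewai.tools import BaseTool
except ImportError:
    try:
        # Fallback named in the commit message.
        from pydantic import BaseModel as BaseTool
    except ImportError:
        # Extra stub (assumption) so this sketch has no hard dependencies.
        class BaseTool:
            pass


class EchoTool(BaseTool):
    """Hypothetical tool; the class body is valid under any of the three bases."""

    name: str = "echo"
    description: str = "Returns its input unchanged."

    def _run(self, text: str) -> str:
        return text


def load_agent_class():
    """Lazy import: crewai is only required when this function is called."""
    from crewai import Agent  # noqa: PLC0415

    return Agent
```

With either pattern, importing the module during pytest collection succeeds without crewai; the `ImportError` surfaces only if a code path that actually needs crewai is executed.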

…lve test collection issues
Moved the import of `Agent`, `Crew`, and `Task` from crewai inside the respective classes in planner_agent.py and verifier_agent.py to prevent ModuleNotFoundError during pytest collection, as the agentic-xai-eval dependency group is not installed in CI.

aravind-3105 added 9 commits April 7, 2026 13:30

Add updated code for FinMME

6b46e11

Update pre-commit configuration, enhance README, and add zero-shot baseline scripts

8b47e0d

feat(finmme): legend grounding, eval/verifier/MCQ updates, compare_mep_runs, tests, and pre-commit order

3774505

chore: gitignore run artifacts (output/, meps/, logs/)

c622f5d

aravind-3105 self-assigned this Jun 11, 2026

aravind-3105 added the enhancement label (New feature or request) Jun 11, 2026

aravind-3105 and others added 6 commits June 11, 2026 15:56

aravind-3105 merged commit 6d9d00f into main Jun 14, 2026
5 checks passed

aravind-3105 merged 15 commits into main from feat/qwen-backend
