Enhance FinMME and Verifier with new features and improvements #16
Merged
…p_runs, tests, and pre-commit order
…nd enhance vision prompts
- Added pixel-counting of chart areas based on legend colors.
- Introduced a schema to capture color area measurement results.
- Updated the vision agent and prompts to include color area information.
- Created scripts for repairing MEPs and handling confidence gating.
- Added unit tests for the new color area functionality.
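The pixel-counting idea in this commit can be illustrated with a minimal, dependency-free sketch. The PR's own tool is described as using OpenCV; the function name `color_area_fractions`, the per-channel `tolerance`, and the nested-list image format below are assumptions made only for illustration.

```python
from collections import Counter


def color_area_fractions(pixels, legend, tolerance=30):
    """Estimate the share of the image covered by each legend color.

    pixels: list of rows, each row a list of (r, g, b) tuples (hypothetical format).
    legend: dict mapping a series label to its reference (r, g, b) color.
    tolerance: maximum per-channel difference for a pixel to count as a match.
    """
    counts = Counter()
    total = 0
    for row in pixels:
        for px in row:
            total += 1
            best_label, best_dist = None, None
            for label, ref in legend.items():
                # Largest per-channel difference between pixel and legend color.
                dist = max(abs(a - b) for a, b in zip(px, ref))
                if dist <= tolerance and (best_dist is None or dist < best_dist):
                    best_label, best_dist = label, dist
            if best_label is not None:
                counts[best_label] += 1
    if total == 0:
        return {label: 0.0 for label in legend}
    return {label: counts[label] / total for label in legend}


image = [
    [(255, 0, 0), (250, 5, 5)],      # two reddish pixels
    [(0, 0, 255), (255, 255, 255)],  # one blue pixel, one background pixel
]
fractions = color_area_fractions(image, {"A": (255, 0, 0), "B": (0, 0, 255)})
# fractions == {"A": 0.5, "B": 0.25}
```

Pixels that match no legend color within the tolerance (background, gridlines, text) are counted in the total but assigned to no series, so the fractions need not sum to 1.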
…andling (v9)
- Added a new script for running FinMME fixes with related sentences.
- Updated VerifierAgent to incorporate related sentences for improved verification context.
- Enhanced VerifierTool with a function to format source sentences for prompts.
- Adjusted verifier prompts to include source sentences in the evaluation process.
…tionality - v10
- Introduced functions to manage high-confidence choice analysis and format ambiguity blocks for the verifier.
- Updated VerifierAgent to include ambiguity hints based on vision agent confidence.
- Enhanced VerifierTool to support ambiguity flags in prompts and added handling for existing metrics during evaluation.
- Modified scripts to allow resuming evaluations by merging with existing metrics, improving efficiency in processing MEPs.
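The resume behaviour described in the last bullet can be sketched roughly as follows. This is not the PR's script: the function name `evaluate_with_resume`, the `mep_id` keys, and the metric dictionaries are hypothetical and only show the merge-and-skip pattern.

```python
def evaluate_with_resume(mep_ids, evaluate_fn, existing=None):
    """Evaluate only the MEPs that have no stored metrics yet.

    mep_ids: iterable of MEP identifiers to process (hypothetical key type).
    evaluate_fn: callable that takes an id and returns a metrics dict.
    existing: metrics from a previous run, keyed by MEP id.
    """
    merged = dict(existing or {})  # copy so the caller's dict is not mutated
    for mep_id in mep_ids:
        if mep_id in merged:
            continue  # already evaluated in an earlier run, keep the old result
        merged[mep_id] = evaluate_fn(mep_id)
    return merged


calls = []


def fake_eval(mep_id):
    # Stand-in for an expensive evaluation call.
    calls.append(mep_id)
    return {"correct": True}


previous = {"mep-001": {"correct": False}}
metrics = evaluate_with_resume(["mep-001", "mep-002"], fake_eval, previous)
# Only "mep-002" is evaluated; the stored result for "mep-001" is kept as-is.
```

Keeping earlier results untouched means an interrupted run can be restarted without repeating the evaluation calls it already paid for.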
- results.md §8b: v9 full-scale row, McNemar p-values, latency tail
- updates.md: v9 full-eval paragraph; fair-baseline now lists all three agents
- camera_ready_metrics.md: v9 added to headline table, verifier subsection
- README.md: v9 row updated, fair-baseline rewritten
- scripts: add v10 runner, eval-resume tweak to submit_eval.sh
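For readers unfamiliar with the McNemar p-values mentioned above, here is a minimal sketch of the exact (binomial) two-sided form of the test. Which variant the results files actually use is not stated in this PR, so the choice of the exact test and the function name `mcnemar_exact_p` are assumptions.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value.

    b: items system A got right and system B got wrong.
    c: items system A got wrong and system B got right.
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of seeing k or fewer in the smaller cell under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


print(mcnemar_exact_p(0, 5))  # 0.0625
print(mcnemar_exact_p(1, 9))  # 0.021484375
```

Only the discordant pairs enter the statistic; items that both systems answer correctly, or both answer incorrectly, do not affect the p-value.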
- Added support for local OpenAI-compatible endpoints, allowing the use of vLLM for Qwen models.
- Refactored LLM initialization across agents to centralize API key and base URL handling.
- Updated various tools to utilize the new OpenAI compatibility functions, ensuring seamless integration with local servers.
- Improved JSON parsing to handle Qwen-specific output formats, including thinking blocks and markdown fences.
- Added unit tests for strict JSON extraction from LLM outputs to ensure robustness.
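The JSON-parsing bullet above can be illustrated with a small standard-library sketch. It is not the PR's `json_strict.py`: the function name `extract_json`, the `<think>` tag format, and the order of the steps are assumptions, and the `json_repair` fallback is left out.

```python
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built this way to keep the example readable
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json(text):
    """Return the first JSON object found in raw LLM output."""
    cleaned = THINK_RE.sub("", text)  # 1. drop thinking blocks
    fenced = FENCE_RE.search(cleaned)
    if fenced:
        cleaned = fenced.group(1)     # 2. prefer the content of a fenced block
    start = cleaned.find("{")         # 3. skip any chain-of-thought preamble
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value and ignores whatever follows it.
    obj, _end = json.JSONDecoder().raw_decode(cleaned[start:])
    return obj


raw = (
    "<think>The legend {blue} maps to series B.</think>\n"
    "Here is the answer:\n"
    + FENCE + "json\n"
    + '{"answer": "B", "confidence": 0.9}\n'
    + FENCE
)
print(extract_json(raw))  # {'answer': 'B', 'confidence': 0.9}
```

A preamble that itself contains a stray `{` would still make this sketch raise `json.JSONDecodeError`, which is the kind of case a repair fallback is meant to catch.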
Remove from git tracking (files retained locally):
- results.md, updates.md, markdown/camera_ready_metrics.md
- notebooks/analysis.ipynb, notebooks/run_pipeline.ipynb
- baselines/fix_zeroshot_scores.py, baselines/submit_zeroshot.sh
- scripts/repair_meps_v6.py, scripts/repair_meps_v7.py
- scripts/run_finmme_fixes_v{8,9,10}.sh, run_finmme_legend_grounding.sh
- scripts/submit_eval.sh, scripts/submit_pipeline.sh
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Upgraded jupyterlab to version 4.5.7 for improved security.
- Added security lower bounds for transitive dependencies: aiohttp, cryptography, gitpython, idna, jupyter-server, and mistune.
- Updated the version of aiohttp to 3.14.1 in the lock file.
- Added new CVE exclusions in the code checks workflow.
- Bump security lower-bounds: aiohttp, cryptography, gitpython, idna, jupyter-server, mistune, nbconvert, notebook, pillow, pyjwt, pymdown-extensions, python-multipart, requests, urllib3, xgrammar
- Upgrade vllm>=0.22.0; remove numpy<2.0 cap (vllm 0.22 requires numpy>=2)
- Add json-repair>=0.25.2 to main deps (was only transitive via crewai group)
- Move `from crewai import LLM` back inside build_crewai_llm() as lazy import (noqa: PLC0415) so unit-test CI doesn't fail without agentic-xai-eval group
- Bump uv to 0.11.15 in both CI workflows; add pip>=26.1.2, uv>=0.11.15 to dev deps for their own CVE fixes
- Add ignore-vulns entries for no-fix / acceptable-risk CVEs
- Regenerate uv.lock (391 packages)
Top-level `from crewai import Agent, Crew, Task` caused ModuleNotFoundError during pytest collection for test_legend_grounding.py because the agentic-xai-eval dependency group is not installed in CI (uv sync --all-extras --dev only installs extras, not named groups). Mirrors the same fix applied to openai_compat.build_crewai_llm().
Top-level `from crewai import Agent, Crew, Task` in vision_agent.py and `from crewai.tools import BaseTool` in all four tool files caused ModuleNotFoundError during pytest collection because the agentic-xai-eval dependency group is not installed in CI (uv sync --all-extras --dev only installs extras, not named groups).
- vision_agent.py: moved Agent/Crew/Task import inside VisionAgent.run() with # noqa: PLC0415
- vision_qa_tool.py, legend_grounder_tool.py, ocr_reader_tool.py, verifier_tool.py: wrapped BaseTool import in try/except ImportError, falling back to pydantic.BaseModel so the class definition succeeds without crewai installed
…lve test collection issues
Moved the import of `Agent`, `Crew`, and `Task` from crewai inside the respective classes in planner_agent.py and verifier_agent.py to prevent ModuleNotFoundError during pytest collection, as the agentic-xai-eval dependency group is not installed in CI.
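The two import patterns used in these fixes (a guarded import with a fallback base class, and a lazy import inside the function that needs it) look roughly like the sketch below. `EchoTool` and `build_crew_classes` are hypothetical names, and the final fallback to `object` is added here only so the sketch also runs without pydantic; the commits describe falling back to `pydantic.BaseModel` only.

```python
try:
    # Real base class when the optional crewai dependency group is installed.
    from crewai.tools import BaseTool
except ImportError:
    try:
        # Fallback described in the commit: the class definition still succeeds.
        from pydantic import BaseModel as BaseTool
    except ImportError:
        BaseTool = object  # assumption for this sketch only, not in the PR


class EchoTool(BaseTool):
    """Hypothetical tool; importing this module never requires crewai."""

    name: str = "echo_tool"
    description: str = "Returns its input unchanged."

    def _run(self, text: str) -> str:
        return text


def build_crew_classes():
    """Lazy import: crewai is only resolved when this function is called."""
    from crewai import Agent, Crew, Task  # noqa: PLC0415

    return Agent, Crew, Task
```

With this layout, pytest can collect and import the module in an environment without crewai, and an ImportError only surfaces if code that actually needs crewai is executed.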
Summary
Builds out the full AgentFinVQA pipeline from the initial skeleton: a multi-stage agentic framework (planner → vision → verifier), FinMME and ChartQAPro dataset loaders, the MEP schema and writer, an eval framework, Langfuse integration, zero-shot baselines, iterative fix experiments (v1–v10), and vLLM/Qwen backend support. This branch represents the complete working system.
Clickup Ticket(s): N/A
Type of Change
Changes Made
- `planner_agent`, `vision_agent`, `verifier_agent` with CrewAI, plus tools: `legend_grounder_tool`, `color_area_tool` (OpenCV pixel-counting), `ocr_reader_tool`, `verifier_tool`, `vision_qa_tool`
- `mep/schema.py` (dataclasses for full execution trace incl. `MEPColorArea`), `mep/writer.py`
- `openai_compat.py` centralises endpoint resolution and vLLM dummy-key injection; `model_compat.py` handles reasoning-model temperature quirks; Qwen3.5 thinking-block suppression via `extra_body`
- `json_strict.py` with thinking-block stripping, CoT preamble handling, `json_repair` fallback
- `run_batch.py` (single-pass generate + eval), `run_finmme_batch.py`, submission scripts, `compare_mep_runs.py`
Testing
- `uv run pytest tests/`
- `uv run mypy src/`
- `uv run pre-commit run --all-files` (all hooks green)
Manual testing details:
Full pipeline runs validated end-to-end on FinMME train.
Screenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
- `.env` with `OPENAI_API_KEY` / `GEMINI_API_KEY` (see `.env.example`)
- `OPENAI_BASE_URL` to the vLLM endpoint; `api_key` is auto-substituted with `"EMPTY"` so the OpenAI SDK does not reject a blank string
- `/fs02` paths: the `/projects` symlink resolves to broken `/fs01`
Checklist
- `results.md`, `updates.md`, `README.md`
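The deployment note about `OPENAI_BASE_URL` and the `"EMPTY"` key substitution can be sketched as follows. This is not the code in `openai_compat.py`: the function name `resolve_openai_endpoint`, the returned dictionary shape, and the rule of substituting only when the key is blank are assumptions.

```python
import os


def resolve_openai_endpoint(env=None):
    """Resolve base URL and API key for an OpenAI-compatible client.

    env: mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    base_url = (env.get("OPENAI_BASE_URL") or "").strip() or None
    api_key = (env.get("OPENAI_API_KEY") or "").strip()
    if base_url and not api_key:
        # A local vLLM server typically does not check the key, but a blank
        # string is rejected client-side, so a dummy value is substituted.
        api_key = "EMPTY"
    return {"base_url": base_url, "api_key": api_key or None}


local = resolve_openai_endpoint({"OPENAI_BASE_URL": "http://localhost:8000/v1"})
# local == {"base_url": "http://localhost:8000/v1", "api_key": "EMPTY"}

hosted = resolve_openai_endpoint({"OPENAI_API_KEY": "sk-example"})
# hosted == {"base_url": None, "api_key": "sk-example"}
```

The URL and key values above are placeholders. Resolving both settings in one place means every agent and tool builds its client from the same values.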