Enhance FinMME and Verifier with new features and improvements#16

Merged
aravind-3105 merged 15 commits into main from feat/qwen-backend on Jun 14, 2026

Conversation

@aravind-3105

Member

Summary

Builds out the full AgentFinVQA pipeline from the initial skeleton: a multi-stage agentic framework (planner → vision → verifier), FinMME and ChartQAPro dataset loaders, the MEP schema and writer, an eval framework, Langfuse integration, zero-shot baselines, iterative fix experiments (v1–v10), and vLLM/Qwen backend support. This branch represents the complete working system.

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Agent pipeline: planner_agent, vision_agent, and verifier_agent built on CrewAI, plus the tools legend_grounder_tool, color_area_tool (OpenCV pixel-counting), ocr_reader_tool, verifier_tool, and vision_qa_tool
  • Dataset loaders: FinMME and ChartQAPro loaders with perceived-sample abstraction and image utilities
  • MEP framework: mep/schema.py (dataclasses for the full execution trace, incl. MEPColorArea) and mep/writer.py
  • Eval framework: rule-based scorer with numeric tolerance and MCQ partial credit, LLM judge, error taxonomy, trace evaluator, top-k eval, summary/report/dashboard
  • Langfuse integration: tracing, scoring, dataset ingestion, prompt management
  • vLLM / Qwen backend: openai_compat.py centralises endpoint resolution and vLLM dummy-key injection; model_compat.py handles reasoning-model temperature quirks; Qwen3.5 thinking-block suppression via extra_body
  • Strict JSON parsing: json_strict.py with thinking-block stripping, CoT preamble handling, and a json_repair fallback
  • Scripts: run_batch.py (single-pass generate + eval), run_finmme_batch.py, submission scripts, compare_mep_runs.py
  • Pre-commit: ruff, mypy, typos, nbqa-ruff, and uv-lock hooks configured
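
To make the "numeric tolerance and MCQ partial credit" scoring above concrete, here is a minimal sketch of how such a rule-based scorer could look. It is not the code from this PR: the function names, the 5% relative tolerance, and the partial-credit rule (fraction of gold options selected, zero if any wrong option is chosen) are all assumptions chosen for illustration.

```python
import re
from typing import Optional, Set


def _to_number(text: str) -> Optional[float]:
    """Parse a number from an answer string, ignoring commas, $, % and spaces."""
    cleaned = re.sub(r"[,$%\s]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None


def score_numeric(pred: str, gold: str, rel_tol: float = 0.05) -> float:
    """Return 1.0 if pred is within rel_tol of gold, else 0.0.

    rel_tol=0.05 is an assumed default, not a value taken from the PR.
    """
    p, g = _to_number(pred), _to_number(gold)
    if p is None or g is None:
        return 0.0
    if g == 0:
        # Relative error is undefined at zero, so fall back to an absolute check.
        return 1.0 if abs(p) <= rel_tol else 0.0
    return 1.0 if abs(p - g) / abs(g) <= rel_tol else 0.0


def score_mcq(pred: Set[str], gold: Set[str]) -> float:
    """Partial credit for multi-answer MCQs (hypothetical rule).

    Any wrong option scores 0.0; otherwise the score is the fraction of
    gold options that were selected.
    """
    if not gold or pred - gold:
        return 0.0
    return len(pred & gold) / len(gold)
```

Under these assumptions, `score_mcq({"A"}, {"A", "C"})` gives half credit, while selecting any option outside the gold set gives none.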

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy src/)
  • Linting passes (uv run pre-commit run --all-files — all hooks green)
  • Manual testing performed (describe below)

Manual testing details:
Full pipeline runs validated end-to-end on FinMME train.

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

  • Requires .env with OPENAI_API_KEY / GEMINI_API_KEY (see .env.example)
  • For local vLLM: set OPENAI_BASE_URL to the vLLM endpoint; api_key is auto-substituted with "EMPTY" so the OpenAI SDK does not reject a blank string
  • Cache dirs must point to /fs02 paths — the /projects symlink resolves to broken /fs01
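
The vLLM note above can be sketched roughly as follows. This is an illustrative approximation, not the contents of openai_compat.py: the function name `resolve_openai_endpoint` and its return shape are hypothetical; only the environment variable names and the "EMPTY" placeholder come from the notes above.

```python
import os
from typing import Mapping, Optional


def resolve_openai_endpoint(env: Optional[Mapping[str, str]] = None) -> dict:
    """Resolve base_url / api_key for an OpenAI-compatible client.

    Hypothetical helper: when a custom base URL is set (e.g. a local vLLM
    server) and no API key is present, substitute the placeholder "EMPTY"
    so the client is never handed a blank key.
    """
    env = os.environ if env is None else env
    base_url = (env.get("OPENAI_BASE_URL") or "").strip() or None
    api_key = (env.get("OPENAI_API_KEY") or "").strip()

    if base_url and not api_key:
        api_key = "EMPTY"  # dummy key for local servers that ignore auth

    return {"base_url": base_url, "api_key": api_key or None}


# Example: a local vLLM endpoint with no key configured.
local = resolve_openai_endpoint({"OPENAI_BASE_URL": "http://localhost:8000/v1"})
```

Passing the environment in as a mapping keeps the helper easy to unit-test without mutating `os.environ`.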

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (results.md, updates.md, README.md)
  • No sensitive information (API keys, credentials) exposed

…nd enhance vision prompts

- Added color_area_tool for pixel-counting of chart areas based on legend colors.
- Introduced MEPColorArea schema to capture color area measurement results.
- Updated vision agent and prompts to include color area information.
- Created scripts for repairing MEPs and handling confidence gating.
- Added unit tests for the new color area functionality.
…andling (v9)

- Added a new script for running FinMME fixes with related sentences.
- Updated VerifierAgent to incorporate related sentences for improved verification context.
- Enhanced VerifierTool with a function to format source sentences for prompts.
- Adjusted verifier prompts to include source sentences in the evaluation process.
…tionality - v10

- Introduced functions to manage high-confidence choice analysis and format ambiguity blocks for the verifier.
- Updated VerifierAgent to include ambiguity hints based on vision agent confidence.
- Enhanced VerifierTool to support ambiguity flags in prompts and added handling for existing metrics during evaluation.
- Modified scripts to allow resuming evaluations by merging with existing metrics, improving efficiency in processing MEPs.
- results.md §8b: v9 full-scale row, McNemar p-values, latency tail
- updates.md: v9 full-eval paragraph; fair-baseline now lists all three agents
- camera_ready_metrics.md: v9 added to headline table, verifier subsection
- README.md: v9 row updated, fair-baseline rewritten
- scripts: add v10 runner, eval-resume tweak to submit_eval.sh
- Added support for local OpenAI-compatible endpoints, allowing the use of vLLM for Qwen models.
- Refactored LLM initialization across agents to centralize API key and base URL handling.
- Updated various tools to utilize the new OpenAI compatibility functions, ensuring seamless integration with local servers.
- Improved JSON parsing to handle Qwen-specific output formats, including thinking blocks and markdown fences.
- Added unit tests for strict JSON extraction from LLM outputs to ensure robustness.
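
As a rough illustration of the strict JSON extraction described in this commit, the sketch below strips `<think>` blocks, unwraps a markdown fence, skips any chain-of-thought preamble, and falls back to json_repair when plain parsing fails. It is an assumption-laden approximation rather than the actual json_strict.py: the function name `extract_json`, the regexes, and the order of steps are hypothetical.

```python
import json
import re

_FENCE = "`" * 3  # built programmatically to avoid a literal fence in this example
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def extract_json(raw: str) -> dict:
    """Extract a single JSON object from noisy LLM output (hypothetical helper)."""
    # 1. Drop reasoning blocks emitted by thinking-style models.
    text = _THINK_RE.sub("", raw)

    # 2. Prefer the contents of a markdown code fence when one is present.
    fenced = _FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1)

    # 3. Skip preamble/postamble prose by slicing from the first { to the last }.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    candidate = text[start : end + 1]

    # 4. Strict parse first; only then try json_repair, if it is installed.
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        try:
            from json_repair import repair_json
        except ImportError:
            raise ValueError("invalid JSON and json_repair is unavailable") from None
        repaired = json.loads(repair_json(candidate))
        if not isinstance(repaired, dict):
            raise ValueError("repaired output is not a JSON object")
        return repaired
```

Trying `json.loads` before any repair keeps well-formed outputs on the strict path, so the repair step only ever touches responses that would otherwise be rejected.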
@aravind-3105 aravind-3105 self-assigned this Jun 11, 2026
@aravind-3105 aravind-3105 added the enhancement New feature or request label Jun 11, 2026
aravind-3105 and others added 6 commits June 11, 2026 15:56
Remove from git tracking (files retained locally):
- results.md, updates.md, markdown/camera_ready_metrics.md
- notebooks/analysis.ipynb, notebooks/run_pipeline.ipynb
- baselines/fix_zeroshot_scores.py, baselines/submit_zeroshot.sh
- scripts/repair_meps_v6.py, scripts/repair_meps_v7.py
- scripts/run_finmme_fixes_v{8,9,10}.sh, run_finmme_legend_grounding.sh
- scripts/submit_eval.sh, scripts/submit_pipeline.sh

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Upgraded jupyterlab to version 4.5.7 for improved security.
- Added security lower bounds for transitive dependencies: aiohttp, cryptography, gitpython, idna, jupyter-server, and mistune.
- Updated the version of aiohttp to 3.14.1 in the lock file.
- Added new CVE exclusions in the code checks workflow.
- Bump security lower-bounds: aiohttp, cryptography, gitpython, idna,
  jupyter-server, mistune, nbconvert, notebook, pillow, pyjwt,
  pymdown-extensions, python-multipart, requests, urllib3, xgrammar
- Upgrade vllm>=0.22.0; remove numpy<2.0 cap (vllm 0.22 requires numpy>=2)
- Add json-repair>=0.25.2 to main deps (was only transitive via crewai group)
- Move `from crewai import LLM` back inside build_crewai_llm() as lazy import
  (noqa: PLC0415) so unit-test CI doesn't fail without agentic-xai-eval group
- Bump uv to 0.11.15 in both CI workflows; add pip>=26.1.2, uv>=0.11.15
  to dev deps for their own CVE fixes
- Add ignore-vulns entries for no-fix / acceptable-risk CVEs
- Regenerate uv.lock (391 packages)
Top-level `from crewai import Agent, Crew, Task` caused ModuleNotFoundError
during pytest collection for test_legend_grounding.py because the
agentic-xai-eval dependency group is not installed in CI (uv sync
--all-extras --dev only installs extras, not named groups).

Mirrors the same fix applied to openai_compat.build_crewai_llm().
Top-level `from crewai import Agent, Crew, Task` in vision_agent.py and
`from crewai.tools import BaseTool` in all four tool files caused
ModuleNotFoundError during pytest collection because the agentic-xai-eval
dependency group is not installed in CI (uv sync --all-extras --dev only
installs extras, not named groups).

- vision_agent.py: moved Agent/Crew/Task import inside VisionAgent.run()
  with # noqa: PLC0415
- vision_qa_tool.py, legend_grounder_tool.py, ocr_reader_tool.py,
  verifier_tool.py: wrapped BaseTool import in try/except ImportError,
  falling back to pydantic.BaseModel so the class definition succeeds
  without crewai installed
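
The two import patterns described in this commit message can be sketched as below. This is a simplified illustration, not the PR's code: `OcrReaderStub` and `build_llm_sketch` are hypothetical names, and the final fallback to plain `object` is added only so the sketch runs in an environment with neither crewai nor pydantic installed.

```python
# Pattern 1: guarded base-class import, so the module can be imported
# (and tests collected) even when the optional crewai group is missing.
try:
    from crewai.tools import BaseTool
except ImportError:
    try:
        from pydantic import BaseModel as BaseTool  # fallback named in the commit
    except ImportError:
        BaseTool = object  # sketch-only fallback, not part of the PR


class OcrReaderStub(BaseTool):
    """Hypothetical tool; the class body is valid under any of the three bases."""

    name: str = "ocr_reader_stub"
    description: str = "Reads text from a chart image (illustrative stub)."

    def _run(self, image_path: str) -> str:
        return f"(no OCR performed for {image_path})"


# Pattern 2: lazy import inside the function that needs the dependency,
# so importing this module never requires crewai.
def build_llm_sketch(model: str):
    from crewai import LLM  # noqa: PLC0415

    return LLM(model=model)
```

The trade-off is that a missing dependency surfaces at call time rather than import time, which is what allows unit-test CI to run without the agentic-xai-eval group.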
…lve test collection issues

Moved the import of `Agent`, `Crew`, and `Task` from crewai inside the respective classes in planner_agent.py and verifier_agent.py to prevent ModuleNotFoundError during pytest collection, as the agentic-xai-eval dependency group is not installed in CI.
@aravind-3105 aravind-3105 merged commit 6d9d00f into main Jun 14, 2026
5 checks passed