Summary
I'd like to add jfinqa — a Japanese financial numerical reasoning QA benchmark — as a new task in lighteval.
About jfinqa
- 1,000 questions across 3 subtasks:
  - Numerical Reasoning (550): Calculate growth rates, margins, and ratios from financial statements
  - Consistency Checking (200): Verify internal consistency of figures
  - Temporal Reasoning (250): Analyze year-over-year trends
- 68 companies from EDINET (Japan's securities filing system)
- Covers J-GAAP, IFRS, and US-GAAP accounting standards
- HuggingFace Dataset: ajtgjmdjp/jfinqa
- GitHub: ajtgjmdjp/jfinqa
Metrics
Two metrics per subtask:
- Exact Match — with Japanese financial normalisation (fullwidth→halfwidth, △→minus, comma removal, NFKC)
- Numerical Match — 1% relative tolerance, handles kanji multipliers (千/百万/億/兆) and unit suffixes (円/ドル/bps)
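To make the two metrics concrete, here is a minimal sketch of how the normalisation and the tolerance check could look. It is an illustration under assumptions, not the code in the PR: the function names (`normalize`, `parse_number`, `exact_match`, `numerical_match`), the regex, and the zero-gold handling are all hypothetical, and only the multipliers and suffixes listed above are covered.

```python
import re
import unicodedata
from typing import Optional

# Hypothetical lookup table; covers only the multipliers named in this issue.
KANJI_MULTIPLIERS = {"千": 1_000, "百万": 1_000_000, "億": 100_000_000, "兆": 1_000_000_000_000}

# Signed decimal, optional kanji multiplier, optional unit suffix (円/ドル/bps).
_NUMBER_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(千|百万|億|兆)?(円|ドル|bps)?$")


def normalize(text: str) -> str:
    """NFKC-fold fullwidth characters, map △ to a minus sign, drop commas."""
    text = unicodedata.normalize("NFKC", text)
    return text.replace("△", "-").replace(",", "").strip()


def parse_number(text: str) -> Optional[float]:
    """Return the numeric value of an answer string, or None if it is not numeric."""
    match = _NUMBER_RE.match(normalize(text))
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):
        value *= KANJI_MULTIPLIERS[match.group(2)]
    return value


def exact_match(pred: str, gold: str) -> bool:
    """String equality after normalisation."""
    return normalize(pred) == normalize(gold)


def numerical_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True if pred is within rel_tol (1% by default) of gold, relative to gold."""
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    if g == 0:
        return p == 0  # assumption: a zero gold answer requires an exact zero
    return abs(p - g) <= rel_tol * abs(g)
```

With this sketch, `exact_match("△１，２３４", "-1234")` holds because both sides normalise to the same string, and `numerical_match("1.2百万円", "1200000")` holds because the kanji multiplier is expanded before the comparison.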
Prior Art
Baselines (zero-shot, temperature=0)
| Model | Overall | Numerical | Consistency | Temporal |
|---|---|---|---|---|
| GPT-4o | 87.0% | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B | 39.6% | 46.4% | 51.0% | 15.6% |
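The Overall figures above appear consistent with a question-count-weighted mean of the three subtask accuracies, using the subtask sizes 550/200/250. That reading is an inference from the reported numbers rather than a documented definition, and the helper name `overall_accuracy` below is hypothetical. A quick check:

```python
from typing import Mapping

# Subtask sizes as listed in this issue.
SUBTASK_SIZES = {"numerical": 550, "consistency": 200, "temporal": 250}


def overall_accuracy(per_subtask: Mapping[str, float]) -> float:
    """Weight each subtask accuracy (in percent) by its number of questions."""
    total = sum(SUBTASK_SIZES.values())
    return sum(per_subtask[name] * n for name, n in SUBTASK_SIZES.items()) / total


# Subtask accuracies copied from the baseline table.
baselines = {
    "GPT-4o": {"numerical": 80.2, "consistency": 90.5, "temporal": 99.2},
    "Gemini 2.0 Flash": {"numerical": 86.2, "consistency": 83.5, "temporal": 65.2},
    "GPT-4o-mini": {"numerical": 79.3, "consistency": 83.5, "temporal": 29.6},
    "Qwen2.5-3B": {"numerical": 46.4, "consistency": 51.0, "temporal": 15.6},
}

for model, scores in baselines.items():
    print(f"{model}: {overall_accuracy(scores):.2f}%")
```

Each weighted mean lands within 0.05 percentage points of the reported Overall value, which suggests the column is a micro-average over all 1,000 questions rather than an unweighted mean of the subtasks.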
I have a PR ready — happy to adjust the implementation based on your feedback (e.g., inspect-ai format if preferred).