Add jfinqa: Japanese Financial Numerical Reasoning QA #1168

## Summary

I'd like to add **jfinqa** — a Japanese financial numerical reasoning QA benchmark — as a new task in lighteval.

## About jfinqa

- **1,000 questions** across 3 subtasks:
  - Numerical Reasoning (550): Calculate growth rates, margins, and ratios from financial statements
  - Consistency Checking (200): Verify that reported figures are internally consistent
  - Temporal Reasoning (250): Analyze year-over-year trends
- **68 companies** from EDINET (Japan's securities filing system)
- Covers J-GAAP, IFRS, and US-GAAP accounting standards
- **HuggingFace Dataset**: [ajtgjmdjp/jfinqa](https://huggingface.co/datasets/ajtgjmdjp/jfinqa)
- **GitHub**: [ajtgjmdjp/jfinqa](https://github.com/ajtgjmdjp/jfinqa)

## Metrics

Two metrics per subtask:
1. **Exact Match** — with Japanese financial normalisation (fullwidth→halfwidth, △→minus, comma removal, NFKC)
2. **Numerical Match** — 1% relative tolerance, handles kanji multipliers (千/百万/億/兆) and unit suffixes (円/ドル/bps)
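
To make the two metrics concrete, here is a minimal sketch of how they could be implemented. It is not the code from the PR: the function names, the multiplier table, and the choice to strip unit suffixes rather than convert them are all assumptions made for illustration.

```python
import unicodedata
from typing import Optional

# Hypothetical tables; the actual jfinqa implementation may differ.
KANJI_MULTIPLIERS = {"千": 10**3, "百万": 10**6, "億": 10**8, "兆": 10**12}
UNIT_SUFFIXES = ("円", "ドル", "bps")


def normalize_answer(text: str) -> str:
    """NFKC-normalise (fullwidth -> halfwidth), map △ to minus, drop commas."""
    text = unicodedata.normalize("NFKC", text).strip()
    text = text.replace("△", "-")
    return text.replace(",", "")


def parse_number(text: str) -> Optional[float]:
    """Parse a normalised answer into a float, or None if it is not numeric."""
    s = normalize_answer(text)
    # Assumption: unit suffixes are stripped, not converted.
    for unit in UNIT_SUFFIXES:
        if s.endswith(unit):
            s = s[: -len(unit)]
            break
    scale = 1
    # Check longer multipliers first so multi-character ones are matched whole.
    for kanji in sorted(KANJI_MULTIPLIERS, key=len, reverse=True):
        if s.endswith(kanji):
            scale = KANJI_MULTIPLIERS[kanji]
            s = s[: -len(kanji)]
            break
    try:
        return float(s) * scale
    except ValueError:
        return None


def exact_match(pred: str, gold: str) -> bool:
    return normalize_answer(pred) == normalize_answer(gold)


def numerical_match(pred: str, gold: str, rel_tol: float = 0.01) -> bool:
    """True if pred is within rel_tol (1% by default) of gold."""
    p, g = parse_number(pred), parse_number(gold)
    if p is None or g is None:
        return False
    if g == 0:
        return p == 0
    return abs(p - g) <= rel_tol * abs(g)
```

Under these assumptions, `exact_match("１，２３４", "1234")` holds after normalisation, and `numerical_match("1.2億円", "120百万円")` holds because both sides parse to the same magnitude.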

## Prior Art

- [lm-evaluation-harness PR #3570](https://github.com/EleutherAI/lm-evaluation-harness/pull/3570) (open, mergeable)

## Baselines (zero-shot, temperature=0)

| Model | Overall | Numerical | Consistency | Temporal |
|-------|---------|-----------|-------------|----------|
| GPT-4o | **87.0%** | 80.2% | 90.5% | 99.2% |
| Gemini 2.0 Flash | 80.4% | 86.2% | 83.5% | 65.2% |
| GPT-4o-mini | 67.7% | 79.3% | 83.5% | 29.6% |
| Qwen2.5-3B | 39.6% | 46.4% | 51.0% | 15.6% |
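
The Overall column is consistent with a question-count-weighted average of the three subtask scores. That is an inference from the numbers in this table rather than a documented definition, so the sketch below is only a sanity check; `weighted_overall` is a hypothetical helper name.

```python
# Subtask sizes as listed under "About jfinqa".
SUBTASK_SIZES = {"numerical": 550, "consistency": 200, "temporal": 250}


def weighted_overall(scores: dict) -> float:
    """Average subtask accuracies (in %) weighted by question count."""
    total = sum(SUBTASK_SIZES.values())
    return sum(SUBTASK_SIZES[name] * scores[name] for name in SUBTASK_SIZES) / total


gpt4o = {"numerical": 80.2, "consistency": 90.5, "temporal": 99.2}
print(round(weighted_overall(gpt4o), 1))
```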

I have a PR ready — happy to adjust the implementation based on your feedback (e.g., inspect-ai format if preferred).
