Teigaku Genzei (the fixed-amount tax reduction) is an example of SDG with Japanese documents. by ryokawajp · Pull Request #26 · opendatahub-io/data-processing

ryokawajp · 2025-10-10T13:52:17Z

Description

This example is located in scripts/teigaku_genzei/, a new directory created by this PR.

This example contains a set of scripts, mostly Python, that do the following:

Preprocess documents (common utilities):
- json_glossary.py
- json_qa.py
- json_util.py
- jsonl_util.py
Convert the source documents to SDG seed data:
- context_util.py
- extract_icl_from_seed.py
- make_context.py
- make_icl.py
- make_seed.py
- seed_jsonl_2_qna_yaml.py
Convert the source documents to test data:
- extract_qa.sh
- qna_yaml_2_qa_csv.py
Evaluate the fine-tuned model with the test data:
- catalog_util.py
- llm_as_judge_direct_criteria_correctness2.py
- llm_as_judge_direct_positional_bias.py
- llm_as_judge_direct_provider_importer.py
- llm_as_judge_direct_rits_llama4.py
- llm_as_judge_direct_rits_phi.py
- remove_newline_from_csv.py
- test_qa.py
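
Because the pipeline passes Japanese text between steps as JSON Lines files, encoding matters at every hand-off. The following is a minimal sketch of how such a helper could look. It is not the content of jsonl_util.py: the function names `write_jsonl` and `read_jsonl` are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Iterator


def write_jsonl(path: Path, records: Iterable[dict[str, Any]]) -> None:
    """Write one JSON object per line (hypothetical helper, not from the PR)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Japanese text readable in the file
            # instead of converting it to \uXXXX escape sequences.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one record per non-empty line (hypothetical helper, not from the PR)."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Opening files with an explicit `encoding="utf-8"` avoids depending on the platform default, which can differ between environments.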

All shell scripts in this directory are examples of how to call the command-line interfaces of the Python scripts above. Their usage is described in detail in the README.
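
To illustrate the general shape of a command-line script that a shell wrapper can call, here is a small sketch that flattens line breaks inside CSV cells. It is loosely inspired by the name remove_newline_from_csv.py, but the actual interface of that script is not shown in this description, so the positional arguments `input` and `output` and the function names are assumptions.

```python
import argparse
import csv
from typing import Optional, Sequence


def strip_newlines(row: Sequence[str]) -> list[str]:
    """Replace line breaks inside each cell with a single space."""
    return [
        cell.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")
        for cell in row
    ]


def main(argv: Optional[Sequence[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Flatten multi-line CSV cells.")
    # Both arguments are hypothetical; see the README for the real usage.
    parser.add_argument("input", help="path to the source CSV")
    parser.add_argument("output", help="path to the cleaned CSV")
    args = parser.parse_args(argv)

    # newline="" lets the csv module handle line breaks inside quoted cells.
    with open(args.input, newline="", encoding="utf-8") as src, \
            open(args.output, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(strip_newlines(row))
    return 0

# A real script would finish with a __main__ guard that calls main().
```

Accepting `argv` as a parameter keeps the entry point easy to test from Python while still working with `sys.argv` when the argument is omitted.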

The evaluation code depends on Unitxt with a custom catalog. The custom catalog is stored in local_catalog/.

The source documents are not included in this directory, as described in the README. Each user must download them independently and comply with the terms of use of the NTA (National Tax Agency) of Japan.

How Has This Been Tested?

Testing was done manually by following the instructions in the README.

We tested this example in the following environment:

- Red Hat Enterprise Linux 8, Red Hat Enterprise Linux 9
- Python 3.12.11
- sdg_hub v0.2.x + Japanese support PR (#457, #458), and sdg_hub v0.7.1
- nvidia-cuda-runtime-cu12 12.8.90
- Student model: granite-3.3-8b-instruct (local)
- Teacher model: phi-4 (RITS)
- LLM-as-a-Judge: phi-4 (RITS)

This example does not alter sdg_hub or other examples in this repo. It is provided as an independent example.

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

Summary by CodeRabbit

New Features
- End-to-end SDG data pipeline: PDF extraction → QA & glossary → context assembly → ICL example generation → seed dataset export
- Large set of preconfigured multi-provider LLM evaluators including positional-bias variants
Documentation
- Comprehensive README and example notebook documenting workflows and CLI usage
Automation
- New CLI utilities and shell scripts for parsing, conversion, testing, and data preparation
Chores
- Added project ignore rules and dependency manifests for the notebook workspace


ryokawajp wants to merge 74 commits into opendatahub-io:main from ryokawajp:dev_ryokawa
