Teigaku Genzei (the fixed-amount tax reduction) is an example of SDG with Japanese documents.#26

Open
ryokawajp wants to merge 74 commits into opendatahub-io:main from ryokawajp:dev_ryokawa

ryokawajp commented Oct 10, 2025

Description

This example is located in scripts/teigaku_genzei/, a new directory added by this PR.

This example contains a set of Python scripts to

  • preprocess documents (common)
    • json_glossary.py
    • json_qa.py
    • json_util.py
    • jsonl_util.py
  • convert the source documents to SDG seed data,
    • context_util.py
    • extract_icl_from_seed.py
    • make_context.py
    • make_icl.py
    • make_seed.py
    • seed_jsonl_2_qna_yaml.py
  • convert the source documents to test data, and
    • extract_qa.sh
    • qna_yaml_2_qa_csv.py
  • evaluate the fine-tuned model with the test data.
    • catalog_util.py
    • llm_as_judge_direct_criteria_correctness2.py
    • llm_as_judge_direct_positional_bias.py
    • llm_as_judge_direct_provider_importer.py
    • llm_as_judge_direct_rits_llama4.py
    • llm_as_judge_direct_rits_phi.py
    • remove_newline_from_csv.py
    • test_qa.py

All the shell scripts in this directory are examples of calling the command-line interfaces of the Python scripts above. Their usage is described in detail in the README.
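
As a rough illustration of the kind of step a script such as remove_newline_from_csv.py might perform (its actual implementation is not shown in this description), the sketch below flattens line breaks embedded in CSV fields using only the Python standard library. The function name `flatten_newlines` and the `question`/`answer` columns are hypothetical and chosen only for this example.

```python
import csv
import io


def flatten_newlines(csv_text: str, replacement: str = " ") -> str:
    """Return csv_text with line breaks inside fields replaced by `replacement`.

    Hypothetical helper for illustration; not taken from the PR's scripts.
    """
    # newline="" keeps embedded \r\n intact so the csv module can parse them.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # splitlines() splits on \n, \r\n, \r and other Unicode line
        # boundaries; the pieces are rejoined with the replacement string.
        writer.writerow([replacement.join(field.splitlines()) for field in row])
    return out.getvalue()


sample = (
    "question,answer\n"
    '"What is\nTeigaku Genzei?","A fixed-amount\r\ntax reduction."\n'
)
print(flatten_newlines(sample))
```

Parsing with the `csv` module rather than replacing characters in the raw text keeps the record separators intact and only touches line breaks that sit inside quoted fields.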

The evaluation code depends on Unitxt with a custom catalog. The custom catalog is stored in local_catalog/.
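
The script name llm_as_judge_direct_positional_bias.py suggests that the evaluators include a positional-bias check. The general idea behind such a check, sketched here under assumptions and without using Unitxt's actual API, is to query the judge twice with the answer options in opposite order and to flag the verdict when the selected option changes. The `Judge` type, `check_positional_bias`, and both toy judges are hypothetical names introduced only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: takes a response and an ordered list of option
# names, and returns the name of the option it selects.
Judge = Callable[[str, Sequence[str]], str]


def check_positional_bias(judge: Judge, response: str, options: Sequence[str]) -> dict:
    """Run the judge with the options in forward and in reversed order."""
    forward = judge(response, list(options))
    backward = judge(response, list(reversed(options)))
    return {
        "selected": forward,
        "selected_reversed": backward,
        # A verdict that depends on option order is treated as biased.
        "positional_bias": forward != backward,
    }


def first_option_judge(response: str, options: Sequence[str]) -> str:
    """Toy judge that always picks whichever option is listed first."""
    return options[0]


def keyword_judge(response: str, options: Sequence[str]) -> str:
    """Toy judge that ignores option order and looks only at the response."""
    return "correct" if "tax" in response else "incorrect"


options = ["correct", "incorrect"]
biased = check_positional_bias(first_option_judge, "a tax reduction", options)
stable = check_positional_bias(keyword_judge, "a tax reduction", options)
print(biased["positional_bias"], stable["positional_bias"])
```

In this toy setup the first judge is flagged because its choice follows the option order, while the second judge returns the same option in both runs.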

The source documents are not included in this directory, as described in the README. Each user must comply with the terms of use of the NTA (National Tax Agency) of Japan and download the documents themselves.

How Has This Been Tested?

Testing was done manually by following the instructions in the README.

We tested this example in the following environment:

  • Red Hat Enterprise Linux 8, Red Hat Enterprise Linux 9
  • Python 3.12.11
  • sdg_hub v0.2.x with the Japanese support PRs (#457, #458), and sdg_hub v0.7.1
  • nvidia-cuda-runtime-cu12 12.8.90
  • Student model: granite-3.3-8b-instruct (local)
  • Teacher model: phi-4 (RITS)
  • LLM-as-a-Judge: phi-4 (RITS)

This example does not alter sdg_hub or other examples in this repo. It is provided as an independent example.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • End-to-end SDG data pipeline: PDF extraction → QA & glossary → context assembly → ICL example generation → seed dataset export
    • Large set of preconfigured multi-provider LLM evaluators including positional-bias variants
  • Documentation

    • Comprehensive README and example notebook documenting workflows and CLI usage
  • Automation

    • New CLI utilities and shell scripts for parsing, conversion, testing, and data preparation
  • Chores

    • Added project ignore rules and dependency manifests for the notebook workspace

