CultranAI: Arabic Cultural Question Answering

CultranAI is a framework for fine-tuning models on culturally grounded Arabic multiple-choice question (MCQ) tasks. It expands the original PalmX dataset using the NativQA curation framework and also converts the Palm instructional dataset into MCQ format. CultranAI offers an integrated environment to train models, benchmark performance, and curate or augment datasets.

Project Structure

CultranAI/
├── configs/                 # Example configs for finetuning scripts
├── data/                    # Datasets organized by processing stage
│   ├── datasets/            # Original datasets
│   ├── identified_countries/# PalmX data labeled by country
│   ├── processed_qa/        # Culturally filtered Q&A pairs
│   └── mcq_generated/       # Generated multiple-choice questions
├── models/                  # Trained model outputs (created during training)
├── outputs/                 # Evaluation results and logs
│   ├── predictions/         # Model predictions
│   └── logs/                # Training and evaluation logs
├── scripts/                 # Processing, training, and evaluation scripts
│   ├── utils/               # Data processing utilities
│   ├── finetuning/          # Model training scripts
│   └── evaluation/          # Evaluation scripts
├── requirements.txt         # Python dependencies
├── .env_example             # Environment variables template
├── .gitignore               # Ignore rules
└── README.md

Setup

# Clone and navigate to the project
git clone https://github.com/hunzed/CultranAI.git
cd CultranAI

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env_example .env
# Edit .env with OpenAI API credentials
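The utility scripts that call the OpenAI API read these credentials from the environment. A minimal sketch of how a script might load them, assuming python-dotenv and a variable named OPENAI_API_KEY (both are assumptions; check requirements.txt and .env_example for the actual names):

import os
from dotenv import load_dotenv  # assumes python-dotenv is among the dependencies

load_dotenv()                           # copy key=value pairs from .env into the environment
api_key = os.environ["OPENAI_API_KEY"]  # hypothetical variable name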

Data Processing Pipeline

The pipeline follows these stages:

1. Label PalmX Data by Country

python scripts/utils/identify_location.py
  • Purpose: Assigns each PalmX question to the most relevant Arab country.
  • Output: Country-labeled CSV files in data/identified_countries/

NativQA Step: Use the country-labeled CSV as input to the NativQA framework. The .tsv output from NativQA is then passed to the cultural relevance step.
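For orientation, a hedged sketch of the kind of LLM-backed labeling identify_location.py performs; the model name, country list, and prompt are illustrative, not the script's actual implementation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

COUNTRIES = ["Egypt", "Saudi Arabia", "Morocco", "Jordan", "UAE"]  # illustrative subset

def identify_country(question: str) -> str:
    """Ask the model which Arab country a question is most relevant to."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Reply with the single most relevant Arab country, "
                        f"chosen from: {', '.join(COUNTRIES)}."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()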

2. Filter for Cultural Relevance

python scripts/utils/cultural_filter_reformatting.py
  • Purpose:
    • Classifies questions by Arabic cultural relevance
    • Refines and shortens answers
  • Output:
    • culture_relevant.csv – Arabic culture questions
    • not_culture_relevant.csv – Irrelevant to Arabic culture
    • unsure_culture.csv – Ambiguous cases
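The three-way split can be pictured as a simple routing step. A sketch assuming (question, answer) rows and a pluggable classifier (both assumptions; the real script may combine this with the answer-refinement pass):

import csv

# Output files mirror the three relevance buckets listed above.
OUTPUTS = {
    "relevant": "data/processed_qa/culture_relevant.csv",
    "not_relevant": "data/processed_qa/not_culture_relevant.csv",
    "unsure": "data/processed_qa/unsure_culture.csv",
}

def route_rows(rows, classify):
    """Append each (question, answer) pair to the CSV matching its label.

    `classify` is any callable mapping a question to one of the OUTPUTS
    keys -- for example an LLM-backed relevance classifier."""
    for question, answer in rows:
        label = classify(question)
        with open(OUTPUTS[label], "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([question, answer])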

3. Generate Distractors

python scripts/utils/generating_distractors.py
  • Purpose: Adds three plausible but incorrect options to each Q&A pair.
  • Input: culture_relevant.csv and unsure_culture.csv
  • Output: MCQ datasets in data/mcq_generated/
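A hedged sketch of the distractor step, prompting an LLM for three incorrect options per pair; the model choice and prompt wording are illustrative:

from openai import OpenAI

client = OpenAI()

def generate_distractors(question: str, answer: str) -> list[str]:
    """Ask for three plausible but incorrect options, one per line."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nCorrect answer: {answer}\n"
                        "Write three plausible but incorrect answer options, "
                        "one per line, without numbering."),
        }],
    )
    options = [l.strip() for l in resp.choices[0].message.content.splitlines() if l.strip()]
    return options[:3]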

Palm Dataset Augmentation: In addition to MCQs derived from the NativQA pipeline, this script also converts the Palm instructional dataset directly into MCQ format for augmentation.

4. Prepare Final Dataset

python scripts/utils/prepare_dataset.py
  • Purpose: Merges and formats augmented data into a .jsonl file for fine-tuning.
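As a sketch of the merge, assuming MCQ CSVs with columns question, A, B, C, D, and answer (a hypothetical schema; the real one is set by the earlier scripts):

import csv
import glob
import json

def to_jsonl(csv_paths, out_path):
    """Merge MCQ CSVs into one JSONL file for fine-tuning."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    record = {
                        "question": row["question"],
                        "options": [row["A"], row["B"], row["C"], row["D"]],
                        "answer": row["answer"],
                    }
                    # ensure_ascii=False keeps the Arabic text readable in the .jsonl
                    out.write(json.dumps(record, ensure_ascii=False) + "\n")

to_jsonl(glob.glob("data/mcq_generated/*.csv"), "data/train.jsonl")  # placeholder output path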

Model Training

Three fine-tuning modes are supported:

LoRA Fine-tuning

python scripts/finetuning/finetuning-lora.py
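For orientation, a minimal sketch of the kind of LoRA setup finetuning-lora.py likely builds with the Hugging Face peft library; the rank, scaling, dropout, and target modules below are illustrative stand-ins for values defined in /configs:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("QCRI/Fanar-1-9B-Instruct")
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # sanity check: only adapter weights train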

4-bit Quantized Training

python scripts/finetuning/finetuning-quantized.py
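4-bit training usually means loading the base model through bitsandbytes. A sketch of such a load with the Hugging Face BitsAndBytesConfig (an assumption about the script's internals; see the script and /configs for the actual setup):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "QCRI/Fanar-1-9B-Instruct", quantization_config=bnb_config
)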

Full Precision Training

python scripts/finetuning/finetuning-full.py

Configuration:

  • Example configs in /configs
  • Configs define hyperparameters, training schedules, Weights & Biases tracking, and output directories.
  • Dataset path must be set manually before training.
  • Training configs are saved alongside trained models.

Model Evaluation

LoRA Model Evaluation

# Base model
python scripts/evaluation/evaluate.py QCRI/Fanar-1-9B-Instruct --dataset hf --split test

# LoRA adapter
python scripts/evaluation/evaluate.py QCRI/Fanar-1-9B-Instruct /path/to/lora/adapter --dataset local --jsonl_path /path/to/test.jsonl
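Evaluating an adapter amounts to attaching it to its base model before inference. A sketch using peft's PeftModel (an assumption about evaluate.py's internals; paths are placeholders):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("QCRI/Fanar-1-9B-Instruct")
model = PeftModel.from_pretrained(base, "/path/to/lora/adapter")  # applies adapter weights on top of the base
tokenizer = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")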

Full Fine-tuned Model Evaluation

python scripts/evaluation/evaluate-full.py /path/to/full/model --dataset both --split dev --jsonl_path /path/to/test.jsonl

Options:

  • --dataset hf – Hugging Face dataset
  • --dataset local – Local JSONL file
  • --dataset both – Both sources
  • --batch_size – Adjust the batch size to fit available memory

Data Analysis

Token Length Distribution

python scripts/utils/analyze_token_lengths.py
  • Reports token distributions and flags sequences exceeding model context limits.
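A minimal sketch of such a check with the target model's tokenizer; the context limit, dataset path, and record fields are assumptions:

import json
from transformers import AutoTokenizer

MAX_LEN = 4096  # illustrative limit; use the target model's actual context size
tokenizer = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")

lengths = []
with open("data/train.jsonl", encoding="utf-8") as f:  # placeholder path
    for line in f:
        record = json.loads(line)
        # The serialized record is a rough proxy for the formatted training example.
        n = len(tokenizer.encode(json.dumps(record, ensure_ascii=False)))
        lengths.append(n)
        if n > MAX_LEN:
            print(f"over limit ({n} tokens): {record.get('question', '')[:60]}")

print(f"max={max(lengths)}  mean={sum(lengths) / len(lengths):.1f}")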
