CultranAI is a framework for fine-tuning models on culturally grounded Arabic multiple-choice question (MCQ) tasks. It expands the original PalmX dataset using the NativQA curation framework and also converts the Palm instructional dataset into MCQ format. CultranAI offers an integrated environment to train models, benchmark performance, and curate or augment datasets.
```
CultranAI/
├── configs/                  # Example configs for fine-tuning scripts
├── data/                     # Datasets organized by processing stage
│   ├── datasets/             # Original datasets
│   ├── identified_countries/ # PalmX data labeled by country
│   ├── processed_qa/         # Culturally filtered Q&A pairs
│   └── mcq_generated/        # Generated multiple-choice questions
├── models/                   # Trained model outputs (created during training)
├── outputs/                  # Evaluation results and logs
│   ├── predictions/          # Model predictions
│   └── logs/                 # Training and evaluation logs
├── scripts/                  # Processing, training, and evaluation scripts
│   ├── utils/                # Data processing utilities
│   ├── finetuning/           # Model training scripts
│   └── evaluation/           # Evaluation scripts
├── requirements.txt          # Python dependencies
├── .env_example              # Environment variables template
├── .gitignore                # Ignore rules
└── README.md
```
```bash
# Clone and navigate to the project
git clone https://github.com/hunzed/CultranAI.git
cd CultranAI

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env_example .env
# Edit .env with your OpenAI API credentials
```

The pipeline follows these stages:
```bash
python scripts/utils/identify_location.py
```

- Purpose: Assigns each PalmX question to the most relevant Arab country (a sketch of this step follows below).
- Output: Country-labeled CSV files in `data/identified_countries/`
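For illustration, a minimal sketch of what such a labeling call might look like, assuming the OpenAI chat completions client (hence the API credentials in `.env`); the model name, prompt, column names, and file paths are placeholders, not the repo's exact code:

```python
# Hypothetical sketch of the country-labeling step, not the repo's exact code.
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are given an Arabic multiple-choice question. Reply with the single "
    "Arab country it is most relevant to, or 'General' if it applies "
    "region-wide.\n\nQuestion: {question}"
)

def identify_country(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the actual model is set in the script
        messages=[{"role": "user", "content": PROMPT.format(question=question)}],
    )
    return resp.choices[0].message.content.strip()

with open("palmx_questions.csv", newline="", encoding="utf-8") as fin, \
     open("data/identified_countries/labeled.csv", "w", newline="", encoding="utf-8") as fout:
    reader = csv.DictReader(fin)          # assumes a "question" column
    writer = csv.writer(fout)
    writer.writerow(["question", "country"])
    for row in reader:
        writer.writerow([row["question"], identify_country(row["question"])])
```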
NativQA Step: Use the country-labeled CSV as input to the NativQA framework. The `.tsv` output from NativQA is then passed to the cultural relevance step.
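If you need to hand the NativQA export to the next script as CSV, a conversion along these lines works; the column names here are assumptions about the NativQA output, not its documented schema:

```python
# Illustrative only: load a NativQA .tsv export and save it as CSV for the next stage.
import pandas as pd

nativqa = pd.read_csv("nativqa_output.tsv", sep="\t")
# Hypothetical column names; adjust to the actual NativQA export.
nativqa = nativqa.rename(columns={"query": "question", "response": "answer"})
nativqa.to_csv("data/processed_qa/nativqa_pairs.csv", index=False)
```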
```bash
python scripts/utils/cultural_filter_reformatting.py
```

- Purpose:
  - Classifies questions by Arabic cultural relevance (sketched below)
  - Refines and shortens answers
- Output:
  - `culture_relevant.csv` – Questions relevant to Arabic culture
  - `not_culture_relevant.csv` – Questions irrelevant to Arabic culture
  - `unsure_culture.csv` – Ambiguous cases
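A rough sketch of how the relevance split might be implemented; the label set, prompt, model name, and file paths are illustrative assumptions:

```python
# Hypothetical sketch of the cultural-relevance split, not the repo's exact code.
import pandas as pd
from openai import OpenAI

client = OpenAI()

def classify(question: str, answer: str) -> str:
    """Return one of 'relevant', 'not_relevant', 'unsure' (assumed label set)."""
    prompt = (
        "Does the following Q&A pair concern Arabic culture? Answer with "
        "exactly one word: relevant, not_relevant, or unsure.\n"
        f"Q: {question}\nA: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip().lower()

df = pd.read_csv("data/processed_qa/nativqa_pairs.csv")  # placeholder path
df["label"] = [classify(q, a) for q, a in zip(df["question"], df["answer"])]
df[df["label"] == "relevant"].to_csv("culture_relevant.csv", index=False)
df[df["label"] == "not_relevant"].to_csv("not_culture_relevant.csv", index=False)
df[df["label"] == "unsure"].to_csv("unsure_culture.csv", index=False)
```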
```bash
python scripts/utils/generating_distractors.py
```

- Purpose: Adds three plausible but incorrect options to each Q&A pair (see the sketch below).
- Input: `culture_relevant.csv` and `unsure_culture.csv`
- Output: MCQ datasets in `data/mcq_generated/`
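The core of such a step is a single generation call per Q&A pair; this sketch assumes the model returns a JSON list, which the real script may handle differently:

```python
# Illustrative distractor-generation call; prompt and parsing are assumptions.
import json
from openai import OpenAI

client = OpenAI()

def make_distractors(question: str, answer: str) -> list[str]:
    prompt = (
        "Given this question and its correct answer, propose three plausible "
        "but incorrect options. Return a JSON list of three strings.\n"
        f"Question: {question}\nCorrect answer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```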
Palm Dataset Augmentation: In addition to the MCQs derived from NativQA, the code also converts the Palm instructional dataset directly into MCQ format for augmentation.
```bash
python scripts/utils/prepare_dataset.py
```

- Purpose: Merges and formats the augmented data into a `.jsonl` file for fine-tuning (an example record is shown below).
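For a sense of what the `.jsonl` file contains, here is one plausible record layout; the field names are illustrative, not necessarily the repo's exact schema:

```python
# One possible MCQ record layout for the fine-tuning .jsonl file
# (field names are illustrative, not the repo's exact schema).
import json

record = {
    "question": "ما هو الطبق الوطني في المغرب؟",  # "What is Morocco's national dish?"
    "options": {"A": "الكسكس", "B": "المنسف", "C": "الكبسة", "D": "المقلوبة"},
    "answer": "A",
}
print(json.dumps(record, ensure_ascii=False))
```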
Three fine-tuning modes are supported (a LoRA setup sketch follows the configuration notes):

```bash
python scripts/finetuning/finetuning-lora.py
python scripts/finetuning/finetuning-quantized.py
python scripts/finetuning/finetuning-full.py
```

Configuration:

- Example configs in `/configs`
- Configs define hyperparameters, training schedules, Weights & Biases tracking, and output directories.
- Dataset path must be set manually before training.
- Training configs are saved alongside trained models.
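As a rough illustration of what the LoRA mode sets up, here is a minimal sketch with Hugging Face `transformers` and `peft`; the hyperparameters are placeholders, and the repo's actual values live in the `/configs` files:

```python
# Minimal LoRA setup sketch; hyperparameters are illustrative, not the repo's defaults.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("QCRI/Fanar-1-9B-Instruct")

lora_config = LoraConfig(
    r=16,                                # adapter rank (placeholder)
    lora_alpha=32,                       # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"], # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training itself would then proceed with a standard `Trainer`-style loop over the prepared `.jsonl` data.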
Evaluation:

```bash
# Base model
python scripts/evaluation/evaluate.py QCRI/Fanar-1-9B-Instruct --dataset hf --split test

# LoRA adapter
python scripts/evaluation/evaluate.py QCRI/Fanar-1-9B-Instruct /path/to/lora/adapter --dataset local --jsonl_path /path/to/test.jsonl

# Fully fine-tuned model
python scripts/evaluation/evaluate-full.py /path/to/full/model --dataset both --split dev --jsonl_path /path/to/test.jsonl
```

Options:

- `--dataset hf` – HuggingFace dataset
- `--dataset local` – Local JSONL file
- `--dataset both` – Both sources
- `--batch_size` – Batch size for memory constraints
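Under the hood, evaluating a LoRA run means loading the base model and then the adapter on top. A minimal sketch with `transformers` and `peft` (paths are placeholders; the repo's `evaluate.py` handles this internally):

```python
# Sketch: load base model + LoRA adapter for evaluation (paths are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "QCRI/Fanar-1-9B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "/path/to/lora/adapter")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")
```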
Token Length Analysis:

```bash
python scripts/utils/analyze_token_lengths.py
```

- Reports token distributions and flags sequences exceeding model context limits (a minimal equivalent is sketched below).
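A minimal version of that check, assuming a `text` field in the `.jsonl` records and an illustrative context limit:

```python
# Minimal equivalent of the token-length check; field name, path, and limit are assumptions.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")
MAX_LEN = 4096  # illustrative; use the target model's real context limit

lengths = []
with open("data/train.jsonl", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]  # assumed field name
        lengths.append(len(tokenizer(text)["input_ids"]))

print(f"n={len(lengths)} max={max(lengths)} mean={sum(lengths) / len(lengths):.1f}")
print(f"sequences over limit: {sum(l > MAX_LEN for l in lengths)}")
```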