This repository contains the full implementation of the experiments described in the manuscript:
“Transformer-Based Personality Trait Recognition Enhanced by Contextual Augmentation” (Saberi & Ravanmehr, 2025)
The project provides a complete pipeline for data preprocessing, augmentation, model training, k-fold cross-validation, grid search, evaluation, and inference for the Big Five Personality Traits using ELECTRA-based classifiers.
Each personality trait (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) is modeled independently, following the single-trait design described in the manuscript.
```
├── augment.py
├── data_analyzer.py
├── data_splitter.py
├── eval.py
├── infer.py
├── train.py
├── kfold_train.py
├── utils/
├── nltk_res/
│   └── nltk_res_download.py
├── grid_search/
│   ├── grid_search.py
│   └── grid_search_results_results/   # automatically generated during grid search
├── dataset/
│   ├── essay_training.csv             # original dataset
│   ├── folds/                         # 5-fold cross-validation splits
│   ├── split/                         # standard train/val/test split
│   └── augmentation/                  # synonym + contextual augmentation results
```
```shell
conda create --name personality python=3.11
conda activate personality
pip install -r requirements.txt
python nltk_res/nltk_res_download.py
```

This repository uses the Pennebaker & King Essays dataset, which contains 2,467 essays labeled with the Big Five personality traits. Two augmented versions are created through WordNet synonym replacement and contextual augmentation using gemma-27b-it, consistent with the manuscript methodology.
All processed data (splits, folds, augmented essays) are located in ./dataset/.
Before any training, you must create the dataset split:

```shell
python data_splitter.py
```

Once the split is created, set your target trait inside the training script, for example:

```python
self.trait_name = "Openness"
```

Then start training:

```shell
python train.py
```

This trains one ELECTRA classifier for the selected trait.
Repeat for all five traits by changing `self.trait_name` accordingly.
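The single-trait design means each classifier sees the essay text paired with exactly one binary trait label. A minimal sketch of that labeling step (the row format and helper name are illustrative, not the repo's actual code; it assumes the dataset's usual "y"/"n" trait encoding):

```python
TRAITS = ["Openness", "Conscientiousness", "Extraversion",
          "Agreeableness", "Neuroticism"]

def single_trait_labels(rows, trait):
    """Reduce multi-trait rows to (text, 0/1 label) pairs for one trait."""
    return [(row["text"], 1 if row[trait] == "y" else 0) for row in rows]

rows = [
    {"text": "I love new ideas.", "Openness": "y", "Conscientiousness": "y",
     "Extraversion": "n", "Agreeableness": "y", "Neuroticism": "n"},
    {"text": "I keep to a strict routine.", "Openness": "n", "Conscientiousness": "y",
     "Extraversion": "n", "Agreeableness": "y", "Neuroticism": "y"},
]
print(single_trait_labels(rows, "Openness"))
# → [('I love new ideas.', 1), ('I keep to a strict routine.', 0)]
```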
To reproduce the manuscript’s k-fold training experiments:

```shell
python kfold_train.py
```

This uses the pre-generated 5 folds located in:
./dataset/folds/
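For orientation, the pre-generated folds partition the 2,467 essays into five near-equal parts. A toy sketch of such an index split (contiguous and unshuffled; the repo's actual folds may be shuffled or stratified):

```python
def kfold_indices(n, k=5):
    """Partition indices 0..n-1 into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(2467, k=5)
print([len(f) for f in folds])  # → [494, 494, 493, 493, 493]
```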
Grid search over ELECTRA hyperparameters using k-fold CV:

```shell
python grid_search/grid_search.py
```

Grid search results are saved automatically to:
./grid_search/
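A grid search enumerates the Cartesian product of the hyperparameter values and trains one k-fold run per combination. A minimal sketch of the enumeration step (the grid values here are illustrative placeholders, not the manuscript's actual search space):

```python
import itertools

# Hypothetical hyperparameter grid -- values are placeholders.
grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "epochs": [3, 4],
}

# One dict per combination; each would drive a full k-fold CV run.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
print(len(configs))  # → 12
```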
To evaluate any trained model on the test set:

```shell
python eval.py
```

This computes accuracy, precision, recall, F1, ROC-AUC, and PR-AUC (as described in the manuscript).
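To make the threshold-based metrics concrete, here is a pure-Python sketch of accuracy, precision, recall, F1, and ROC-AUC for one binary trait (the repo itself presumably uses a metrics library; PR-AUC is omitted here for brevity):

```python
def binary_metrics(y_true, y_prob, thresh=0.5):
    """Illustrative re-implementation of the reported evaluation metrics."""
    y_pred = [1 if p >= thresh else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    # ROC-AUC equals the probability that a random positive is scored
    # above a random negative (ties count half).
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg)) if pos and neg else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "roc_auc": auc}

print(binary_metrics([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))
```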
For running inference on new text samples using a trained model:

```shell
python infer.py
```

Load the desired trait-specific checkpoint to generate a predicted high/low label.
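The final high/low decision for a trait reduces to thresholding the classifier's score. A minimal sketch assuming a single-logit sigmoid head (the actual output format of the repo's checkpoints may differ, e.g. two-class softmax):

```python
import math

def logit_to_label(logit, threshold=0.5):
    """Map one classifier logit to a high/low trait label via the sigmoid."""
    prob = 1.0 / (1.0 + math.exp(-logit))
    return ("high" if prob >= threshold else "low", prob)

print(logit_to_label(2.0))   # confident "high"
print(logit_to_label(-1.0))  # "low"
```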
To apply the augmentation pipeline:

```shell
python augment.py
```

This produces augmented essays using:
- Synonym Replacement (WordNet)
- Contextual Augmentation (Gemma-based)
Augmented outputs populate:
./dataset/augmentation/
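The synonym-replacement branch can be illustrated with a self-contained toy: a hard-coded synonym map stands in for NLTK WordNet lookups, and the Gemma-based contextual branch (which requires a model call) is omitted:

```python
import random

# Toy synonym map standing in for WordNet lookups -- illustrative only.
SYNONYMS = {"happy": ["glad", "cheerful"], "sad": ["unhappy", "gloomy"]}

def synonym_replace(text, p=1.0, rng=None):
    """Replace each known word with a random synonym with probability p."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    out = []
    for word in text.split():
        subs = SYNONYMS.get(word.lower())
        if subs and rng.random() < p:
            out.append(rng.choice(subs))
        else:
            out.append(word)
    return " ".join(out)

print(synonym_replace("I feel happy today"))
```

The real pipeline additionally handles casing, POS-tag filtering, and punctuation, which this sketch skips.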
If NLTK resources are missing:

```shell
python nltk_res/nltk_res_download.py
```

This downloads WordNet, punkt, averaged_perceptron_tagger, and other required resources.
This repository implements the full experimental pipeline from the paper, which developed five independent ELECTRA-based binary classifiers (one per personality trait), achieving:
- Average Accuracy: ~72.4%
- Test AUC Scores: >0.75 for all traits
- Dataset Expanded: 2,467 → 4,934 samples (synonym + contextual augmentation)
Using a trait-isolated approach reduces cross-trait interference and improves generalization, as detailed in the manuscript. All training/evaluation curves, PR/ROC results, and metrics are reproducible through the scripts in this repo.
If you use this repository, please cite the associated papers:
```bibtex
@article{saberi2026personality,
  title={Transformer-Based Personality Trait Recognition Enhanced by Contextual Augmentation},
  author={Saberi, Hossein and Ravanmehr, Reza},
  journal={International Journal of Web Research},
  volume={9},
  pages={1--24},
  year={2026},
  publisher={University of Science and Culture},
  doi={http://dx.doi.org/10.22133/ijwr.2025.543527.1305}
}
```
```bibtex
@inproceedings{saberi2025personality,
  title={Personality Recognition Using Transformer Model: A Study on the Big Five Traits},
  author={Saberi, Hossein and Ghofrani, Sara and Ravanmehr, Reza},
  booktitle={2025 11th International Conference on Web Research (ICWR)},
  pages={228--234},
  year={2025},
  publisher={IEEE},
  doi={https://doi.org/10.1109/ICWR65219.2025.11006181}
}
```

Contributions, suggestions, or improvements are welcome. Feel free to open issues or pull requests.