hsisaberi/single-trait-electra

Transformer-Based Personality Trait Recognition

Using ELECTRA + Data Augmentation + K-Fold Cross Validation

This repository contains the full implementation of the experiments described in the manuscript:

“Transformer-Based Personality Trait Recognition Enhanced by Contextual Augmentation” Saberi & Ravanmehr, 2025

The project provides a complete pipeline for data preprocessing, augmentation, model training, k-fold cross-validation, grid search, evaluation, and inference for the Big Five Personality Traits using ELECTRA-based classifiers.

Each personality trait (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) is modeled independently, following the single-trait design described in the manuscript.


Repository Overview

├── augment.py
├── data_analyzer.py
├── data_splitter.py
├── eval.py
├── infer.py
├── train.py
├── kfold_train.py
├── utils/
├── nltk_res/
│   └── nltk_res_download.py
├── grid_search/
│   ├── grid_search.py
│   └── grid_search_results_results/   # automatically generated during grid search
├── dataset/
│   ├── essay_training.csv       # original dataset
│   ├── folds/                   # 5-fold cross validation splits
│   ├── split/                   # standard train/val/test split
│   └── augmentation/            # synonym + contextual augmentation results

Installation

1. Create the Conda Environment (Python 3.11)

conda create --name personality python=3.11
conda activate personality

2. Install Project Dependencies

pip install -r requirements.txt

3. Install NLTK Resources (WordNet, tokenizers, etc.)

python nltk_res/nltk_res_download.py

Dataset Description

This repository uses the Pennebaker & King Essays dataset, containing 2,467 essays labeled with the Big Five personality traits. Two augmented versions are created, one through WordNet synonym replacement and one through contextual augmentation using gemma-27b-it, consistent with the manuscript methodology. All processed data (splits, folds, augmented essays) are located in ./dataset/.
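For orientation, the Essays dataset is distributed as a CSV with one row per essay and binary y/n labels per trait. The column names below (cEXT, cNEU, cAGR, cCON, cOPN) follow the public release of the dataset and are an assumption here; verify them against dataset/essay_training.csv. A minimal loading sketch:

```python
import csv
import io

# Sketch: parse essay rows and map y/n trait labels to 1/0.
# Column names follow the public Essays release (an assumption);
# verify against dataset/essay_training.csv.
SAMPLE = """#AUTHID,TEXT,cEXT,cNEU,cAGR,cCON,cOPN
1997_504851.txt,"Well, right now I'm thinking...",n,y,y,n,y
1997_605191.txt,"I can't believe it, a second essay...",y,n,y,y,n
"""

def load_essays(fh):
    rows = []
    for row in csv.DictReader(fh):
        labels = {k: 1 if row[k] == "y" else 0
                  for k in ("cEXT", "cNEU", "cAGR", "cCON", "cOPN")}
        rows.append({"text": row["TEXT"], **labels})
    return rows

essays = load_essays(io.StringIO(SAMPLE))
```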


Usage

1. Data Splitting

Before any training, you must create the dataset split:

python data_splitter.py
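data_splitter.py writes the train/val/test split to ./dataset/split/. The general idea, stratifying on the binary trait label so both splits keep the same class balance, can be sketched with the standard library (the fractions and seed below are illustrative, not the script's actual settings):

```python
import random

def stratified_split(labels, test_frac=0.1, val_frac=0.1, seed=42):
    """Return (train, val, test) index lists, stratified on binary labels."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls in set(labels):
        # Shuffle each class separately, then slice off test/val portions.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = int(len(idx) * test_frac)
        n_val = int(len(idx) * val_frac)
        test += idx[:n_test]
        val += idx[n_test:n_test + n_val]
        train += idx[n_test + n_val:]
    return train, val, test

labels = [0] * 50 + [1] * 50
train, val, test = stratified_split(labels)
```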

2. Training a Model (Single-Trait)

Once the split is created, set your target trait inside the training script, for example:

self.trait_name = "Openness"
python train.py

This trains one ELECTRA classifier for the selected trait. Repeat for all five traits by changing self.trait_name accordingly.
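The single-trait setup boils down to one binary sequence classifier per trait. A hedged sketch of how such a classifier can be built with Hugging Face transformers (the checkpoint name is an assumption; the manuscript's actual base model and hyperparameters live in train.py):

```python
# Sketch of the single-trait design: one binary ELECTRA classifier per trait.
# The checkpoint name below is an illustrative assumption, not necessarily
# the one used in train.py.
TRAITS = ["Openness", "Conscientiousness", "Extraversion",
          "Agreeableness", "Neuroticism"]

def build_classifier(trait_name, checkpoint="google/electra-base-discriminator"):
    """Build a tokenizer + 2-class (low/high) classifier for one trait."""
    assert trait_name in TRAITS
    # Imported lazily so the module stays importable without transformers.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=2)  # binary head for this trait only
    return tokenizer, model
```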


3. Training With K-Fold Cross-Validation

To reproduce the manuscript’s k-fold training experiments:

python kfold_train.py

This uses the pre-generated 5 folds located in:

./dataset/folds/
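The repo ships pre-generated folds, so regenerating them is unnecessary; for reference, a stratified 5-fold index assignment can be sketched with the standard library (the round-robin scheme and seed are illustrative):

```python
import random

def make_folds(labels, k=5, seed=0):
    """Assign each index to one of k folds, stratified on binary labels."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # round-robin keeps folds balanced
    return folds

labels = [0] * 60 + [1] * 40
folds = make_folds(labels)
```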

4. Grid Search (Hyperparameter Optimization)

Grid search over ELECTRA hyperparameters using k-fold CV:

python grid_search/grid_search.py

Grid search results are saved automatically to:

./grid_search/grid_search_results_results/
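Conceptually, a grid search enumerates every combination of the candidate hyperparameters and runs k-fold CV for each. The grid below is purely illustrative; the actual search space is defined in grid_search/grid_search.py:

```python
import itertools

# Illustrative hyperparameter grid; the real search space lives in
# grid_search/grid_search.py.
GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [8, 16],
    "epochs": [3, 4],
}

def grid_configs(grid):
    """Yield one dict per point in the Cartesian product of the grid."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(GRID))  # each config would get a k-fold CV run
```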

5. Evaluation

To evaluate any trained model on the test set:

python eval.py

This computes accuracy, precision, recall, F1, ROC-AUC, and PR-AUC (as described in the manuscript).
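The threshold metrics all derive from the confusion matrix; a standard-library sketch of those definitions (eval.py's actual implementation may differ, e.g. using scikit-learn, and ROC-AUC/PR-AUC additionally need predicted scores rather than hard labels):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 from binary labels (stdlib only).
    ROC-AUC and PR-AUC need predicted scores and are typically computed
    with a library such as scikit-learn."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

m = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])
```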


6. Inference

For running inference on new text samples using the trained model:

python infer.py

Load a trait-specific checkpoint to generate a predicted high/low label for the chosen trait.
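Inference reduces to tokenizing the text, running the binary head, and mapping the larger logit to "low" or "high". A hedged sketch (the checkpoint directory is a placeholder, and the label order is an assumption; infer.py defines the authoritative mapping):

```python
# Sketch of trait inference: logits from the binary head map to low/high.
# The index-to-label order is an assumption; check infer.py for the real one.
LABELS = {0: "low", 1: "high"}

def to_label(logits):
    """Map a [low, high] logit pair to its label string."""
    return LABELS[max(range(len(logits)), key=logits.__getitem__)]

def predict(text, checkpoint_dir):
    """Run one text through a saved trait checkpoint (placeholder path)."""
    # Assumes a Hugging Face-style checkpoint saved by the training script.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    tok = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint_dir)
    with torch.no_grad():
        logits = model(**tok(text, return_tensors="pt",
                             truncation=True)).logits[0].tolist()
    return to_label(logits)
```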


7. Data Augmentation

To apply the augmentation pipeline:

python augment.py

This produces augmented essays using:

  • Synonym Replacement (WordNet)
  • Contextual Augmentation (Gemma-based)

Augmented outputs populate:

./dataset/augmentation/
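The WordNet branch of the pipeline swaps a fraction of tokens for synonyms. A sketch of that idea, with the replacement step kept separate from the WordNet lookup (the rate and seed are illustrative; augment.py's actual logic may differ):

```python
import random

def synonym_swap(tokens, synonyms, rate=0.1, seed=0):
    """Replace each token with a known synonym with probability `rate`.
    `synonyms` maps word -> list of alternatives (e.g. from WordNet)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in synonyms and rng.random() < rate:
            out.append(rng.choice(synonyms[tok]))
        else:
            out.append(tok)
    return out

def wordnet_synonyms(word):
    """Collect WordNet lemmas for a word (requires the WordNet corpus;
    see nltk_res/nltk_res_download.py)."""
    from nltk.corpus import wordnet
    lemmas = {l.name().replace("_", " ")
              for s in wordnet.synsets(word) for l in s.lemmas()}
    lemmas.discard(word)
    return sorted(lemmas)

aug = synonym_swap(["a", "quick", "test"], {"quick": ["fast"]}, rate=1.0)
```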

8. NLTK Resource Downloader

If NLTK resources are missing:

python nltk_res/nltk_res_download.py

This downloads WordNet, punkt, averaged_perceptron_tagger, etc.


Manuscript Summary

This repository implements the full experimental pipeline from the paper, which developed five independent ELECTRA-based binary classifiers, one per personality trait, achieving:

  • Average Accuracy: ~72.4%
  • Test AUC Scores: >0.75 for all traits
  • Dataset Expanded: 2,467 → 4,934 samples (synonym + contextual augmentation)

Using a trait-isolated approach reduces cross-trait interference and improves generalization, as detailed in the manuscript. All training/evaluation curves, PR/ROC results, and metrics are reproducible through the scripts in this repo.


Citation

If you use this repository, please cite the associated papers:

@article{saberi2026personality,
  title={Transformer-Based Personality Trait Recognition Enhanced by Contextual Augmentation},
  author={Saberi, Hossein and Ravanmehr, Reza},
  journal={International Journal of Web Research},
  volume={9},
  pages={1--24},
  year={2026},
  publisher={University of Science and Culture},
  doi={10.22133/ijwr.2025.543527.1305}
}

@inproceedings{saberi2025personality,
  title={Personality Recognition Using Transformer Model: A Study on the Big Five Traits},
  author={Saberi, Hossein and Ghofrani, Sara and Ravanmehr, Reza},
  booktitle={2025 11th International Conference on Web Research (ICWR)},
  pages={228--234},
  year={2025},
  publisher={IEEE},
  doi={10.1109/ICWR65219.2025.11006181}
}

🤝 Contributions

Contributions, suggestions, or improvements are welcome. Feel free to open issues or pull requests.


About

A complete ELECTRA-based framework for Big Five personality trait recognition, featuring data augmentation, single-trait model training, k-fold cross-validation, grid search optimization, inference tools, and full reproducibility of the associated research article.
