Semi-Supervised Cross-Domain Learning Framework for Multitask Multimodal Psychological States Recognition
This repository contains a configurable training pipeline for multimodal psychological state recognition across multiple datasets. The project combines several input modalities, builds cached feature representations with pretrained encoders, fuses them in a multitask model, and predicts:
- emotions
- personality traits
- AH (Ambivalence/Hesitancy) as a binary target
At a high level, the pipeline works as follows:
- Read dataset metadata from CSV files.
- Find matching video and audio files for each sample.
- Extract modality-specific embeddings for face, audio, text, and behavior descriptions.
- Cache extracted features to avoid recomputing them on every run.
- Merge enabled datasets into a shared training pipeline.
- Train a multitask fusion model with optional ablations and optional hyperparameter search.
- Evaluate on dev/test splits and save checkpoints and logs.
The main entry point is main.py.
- `main.py`: orchestration entry point; loads config, initializes extractors, builds datasets/loaders, launches training or hyperparameter search.
- `config.toml`: main experiment configuration.
- `search_params.toml`: search space and defaults for greedy/exhaustive hyperparameter search.
- `data/`: local folder for annotation and metadata CSV files.
- `pytorch_Qwen2.5-VL.py`: utility script for generating `text_llm` behavior descriptions from videos with Qwen2.5-VL.
- `src/train.py`: training loop, validation/test evaluation, metric aggregation, checkpointing, and early stopping.
- `src/data_loading/dataset_builder.py`: dataset and dataloader creation, split fractions, and collate logic.
- `src/data_loading/dataset_multimodal.py`: sample indexing, label assembly, per-modality feature extraction, and feature caching.
- `src/data_loading/pretrained_extractors.py`: pretrained encoders for face/video, audio, text, and behavior modalities.
- `src/models/models.py`: multitask fusion architectures and ablation-aware variants.
- `src/utils/feature_store.py`: cache storage for extracted features and metadata.
The project expects a local data/ directory with split annotation tables prepared for this pipeline. A typical layout is:
data/
cmu_mosei/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
fiv2/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
bah/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
These files store annotation and metadata tables used by the training pipeline. They do not replace the original raw datasets and do not include the source video or audio files.
The current configuration supports three datasets:
- `cmu_mosei`: emotion labels
- `fiv2`: personality labels
- `bah`: BAH (Behavioural Ambivalence/Hesitancy) dataset with AH (Ambivalence/Hesitancy) labels
Each dataset is configured independently in config.toml through:
- `base_dir`
- `csv_path`
- `video_dir`
- `audio_dir`
- `train_fraction`, `dev_fraction`, `test_fraction`
The training loader is built as a concatenation of enabled training subsets from all configured datasets.
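The concatenated training loader behaves like `torch.utils.data.ConcatDataset`: samples from each enabled dataset are addressed through one contiguous index space. A minimal pure-Python sketch of that indexing (the toy dataset names and sizes are illustrative, not the repo's actual classes):

```python
# Sketch of concatenating enabled training subsets into one index space,
# mirroring the behavior of torch.utils.data.ConcatDataset.
from bisect import bisect_right

class ConcatDataset:
    """Index into several datasets as if they were one contiguous dataset."""
    def __init__(self, datasets):
        self.datasets = datasets
        self.cumulative = []
        total = 0
        for d in datasets:
            total += len(d)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, idx):
        # Find which underlying dataset the global index falls into.
        ds_idx = bisect_right(self.cumulative, idx)
        prev = self.cumulative[ds_idx - 1] if ds_idx > 0 else 0
        return self.datasets[ds_idx][idx - prev]

# Toy "datasets": each sample is (dataset_name, local_index).
mosei = [("cmu_mosei", i) for i in range(3)]
fiv2 = [("fiv2", i) for i in range(2)]
train = ConcatDataset([mosei, fiv2])
```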
This project does not create, own, or redistribute the original source datasets. It relies on previously released third-party research datasets and project-specific CSV annotations prepared for this pipeline.
When using this repository, make sure that your use of each dataset complies with its original license, access policy, consent terms, and any institutional or ethics requirements that may apply.
Official sources for the datasets used here:
- CMU-MOSEI: official paper: https://aclanthology.org/P18-1208/ ; CMU MultiComp / SDK resources: https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK
- First Impressions V2: official ChaLearn dataset page: https://chalearnlap.cvc.uab.cat/dataset/24/description/
- BAH (Behavioural Ambivalence/Hesitancy): official dataset page: https://liviaets.github.io/bah-dataset/ ; paper page: https://openreview.net/forum?id=jYDHVscRO3
Notes on access conditions:
- CMU-MOSEI is a public research dataset described by CMU MultiComp.
- First Impressions V2 is publicly documented through ChaLearn and may require registration or sign-in to access the files.
- BAH (Behavioural Ambivalence/Hesitancy) is publicly documented, but the dataset page states that access is provided under a research-only license and requires following the authors' request instructions.
The demographic or sensitive attributes available in the source datasets, such as age, sex, nationality, or ethnicity where applicable, originate from the original dataset providers. This repository does not create new demographic annotations and does not claim authorship over those labels.
Each dataset split is expected to provide:
- a CSV file with sample metadata
- a directory with video files
- a directory with audio files
Inside this repository, the data/ folder is intended for local copies of the annotation CSV files only. It does not contain the raw video or audio data.
The code expects a `video_name` column in each CSV. Depending on the target task and enabled modalities, the CSV should also contain:
- emotion columns for `cmu_mosei`: `Neutral`, `Anger`, `Disgust`, `Fear`, `Happiness`, `Sadness`, `Surprise`
- personality columns for `fiv2`: `openness`, `conscientiousness`, `extraversion`, `agreeableness`, `non-neuroticism`
- AH (Ambivalence/Hesitancy) columns for `bah`: `absence_full`, `presence_full`
- a `text` column for the text modality
- a behavior-description column for the behavior modality: by default `text_llm`, configurable via `dataloader.text_description_column`
Example path pattern from the default config:
[datasets.cmu_mosei]
base_dir = "/path/to/CMU-MOSEI/"
csv_path = "{base_dir}/{split}_full_with_description.csv"
video_dir = "{base_dir}/video/{split}/"
audio_dir = "{base_dir}/audio/{split}/"

Each dataset should be organized so that the configured CSV file, video directory, and audio directory exist for every split used by the experiment.
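The `{base_dir}` and `{split}` placeholders can be resolved with plain string formatting. A sketch of that expansion (the `resolve` helper is hypothetical; the pipeline's actual interpolation logic may differ):

```python
# Illustrative expansion of the {base_dir}/{split} placeholders from
# config.toml into concrete paths. base_dir is written without a trailing
# slash here so the joined paths do not contain "//".
base_dir = "/path/to/CMU-MOSEI"

templates = {
    "csv_path": "{base_dir}/{split}_full_with_description.csv",
    "video_dir": "{base_dir}/video/{split}/",
    "audio_dir": "{base_dir}/audio/{split}/",
}

def resolve(template: str, split: str) -> str:
    """Fill in the placeholders for one dataset split."""
    return template.format(base_dir=base_dir, split=split)

train_csv = resolve(templates["csv_path"], "train")
dev_video_dir = resolve(templates["video_dir"], "dev")
```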
Example for one dataset:
DATASET_ROOT/
|- train_full_with_description.csv
|- dev_full_with_description.csv
|- test_full_with_description.csv
|- video/
| |- train/
| | |- sample_0001.mp4
| | |- sample_0002.mp4
| | `- ...
| |- dev/
| | `- ...
| `- test/
| `- ...
`- audio/
|- train/
| |- sample_0001.wav
| |- sample_0002.wav
| `- ...
|- dev/
| `- ...
`- test/
`- ...
For the default configuration, the repository assumes the following structure pattern for each dataset:
- `base_dir`: dataset root directory
- `{base_dir}/{split}_full_with_description.csv`: metadata for a split
- `{base_dir}/video/{split}/`: video files for that split
- `{base_dir}/audio/{split}/`: audio files for that split
- CSV file: one row per sample, with `video_name` and the labels and text fields required by the enabled tasks and modalities
- `video/<split>/`: source videos used for face-frame extraction
- `audio/<split>/`: audio tracks used for audio embedding extraction
Inside this repository, data/ is intended for annotation tables such as train_full_with_description.csv, dev_full_with_description.csv, and test_full_with_description.csv.
The loader matches files by `video_name` without requiring a specific extension: it searches for a file whose basename matches the value from the CSV.
For example, if a CSV row contains:
video_name = sample_0001
then the loader will look for files such as:
- `video/train/sample_0001.mp4`
- `audio/train/sample_0001.wav`
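That extension-agnostic lookup can be sketched as follows (the `find_by_basename` helper is illustrative, not the repo's actual loader code):

```python
# Sketch of extension-agnostic matching by basename: given a video_name from
# the CSV, find any file in the split directory whose stem matches it.
import tempfile
from pathlib import Path

def find_by_basename(directory: Path, video_name: str):
    """Return the first file whose stem equals video_name, or None."""
    for path in sorted(directory.iterdir()):
        if path.stem == video_name:
            return path
    return None

# Build a throwaway directory that imitates video/train/.
root = Path(tempfile.mkdtemp())
(root / "sample_0001.mp4").touch()
(root / "sample_0002.avi").touch()

match = find_by_basename(root, "sample_0001")
miss = find_by_basename(root, "sample_9999")
```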
At minimum, each CSV should contain:
- `video_name`
Depending on which labels and modalities are used, the CSV may additionally need:
- `text`: source text for the text modality
- `text_llm` or another configured description column: source text for the behavior modality
- emotion label columns for `cmu_mosei`
- personality label columns for `fiv2`
- AH (Ambivalence/Hesitancy) label columns for `bah`
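A pre-flight check of this schema can catch missing columns before a long run. A sketch using only the column names listed in this README (the `missing_columns` helper itself is hypothetical):

```python
# Sketch of a schema check for the per-dataset CSV files: verify that
# video_name and the task/modality columns described above are present.
import csv
import io

REQUIRED = {"video_name"}
TASK_COLUMNS = {
    "cmu_mosei": {"Neutral", "Anger", "Disgust", "Fear",
                  "Happiness", "Sadness", "Surprise"},
    "fiv2": {"openness", "conscientiousness", "extraversion",
             "agreeableness", "non-neuroticism"},
    "bah": {"absence_full", "presence_full"},
}

def missing_columns(header, dataset, modal_columns=("text", "text_llm")):
    """Return the sorted list of required columns absent from the header."""
    present = set(header)
    needed = REQUIRED | TASK_COLUMNS[dataset] | set(modal_columns)
    return sorted(needed - present)

# A tiny in-memory CSV standing in for bah/train_full_with_description.csv.
sample = "video_name,text,text_llm,absence_full,presence_full\nclip_01,hi,calm,1,0\n"
header = next(csv.reader(io.StringIO(sample)))
gaps = missing_columns(header, "bah")
```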
The repository also contains pytorch_Qwen2.5-VL.py, a standalone utility for generating the text_llm column used by the behavior modality.
This script:
- loads a video-language model based on Qwen2.5-VL
- processes input videos
- generates a short natural-language description of visible nonverbal behavior
- writes the generated text back into a CSV column named `text_llm`
In practice, this script can be used to augment dataset annotations with LLM-generated behavioral descriptions before training. Those generated descriptions can then be consumed by the behavior modality through dataloader.text_description_column.
The script is not part of the default training entry point in main.py. It is a preprocessing utility that should be run separately when you want to create or refresh text_llm annotations.
Before running the script, update its local configuration values such as:
- `video_dir`
- `input_csv`
- `output_csv`
- `model_name`
The system supports four modalities:
- `face`: extracted from video frames after face detection
- `audio`: extracted from audio files
- `text`: extracted from the `text` column
- `behavior`: extracted from the configured description column such as `text_llm`
Available extractor families in the current codebase include:
- video/face: CLIP-based image encoder
- audio: CLAP or Wav2Vec2-style encoder
- text/behavior: CLIP text, CLAP text, RoBERTa/XLM-R style models, or `michellejieli/emotion_text_classifier`
Feature extraction is configured in the [embeddings] section of config.toml. Extracted embeddings can be stored in the local feature cache to speed up repeated experiments.
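The caching idea can be sketched as a small disk-backed store keyed by sample, modality, and extractor. This mirrors the spirit of `src/utils/feature_store.py` without reproducing its actual layout or API:

```python
# Sketch of a disk-backed feature cache: recompute embeddings only on a
# cache miss, otherwise load the pickled result from disk.
import pickle
import tempfile
from pathlib import Path

class FeatureCache:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, sample_id: str, modality: str, extractor: str) -> Path:
        return self.root / f"{sample_id}__{modality}__{extractor}.pkl"

    def get_or_compute(self, sample_id, modality, extractor, compute):
        path = self._path(sample_id, modality, extractor)
        if path.exists():                      # cache hit: skip extraction
            return pickle.loads(path.read_bytes())
        features = compute()                   # cache miss: run the extractor
        path.write_bytes(pickle.dumps(features))
        return features

cache = FeatureCache(Path(tempfile.mkdtemp()) / "features")
calls = []
embed = lambda: calls.append(1) or [0.1, 0.2, 0.3]   # stand-in extractor
first = cache.get_or_compute("sample_0001", "audio", "clap", embed)
second = cache.get_or_compute("sample_0001", "audio", "clap", embed)
```

The second call returns the cached vector without invoking the extractor again, which is the point of caching features across repeated experiments.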
The fusion model is defined in src/models/models.py. The model:
- projects each modality into a shared hidden space
- optionally applies graph-based interaction between modalities
- optionally applies cross-attention between task-specific and modality-level representations
- predicts multiple tasks jointly
- optionally uses guide-bank representations for task heads
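The projection-and-heads structure can be illustrated with a dependency-free toy. Dimensions and weights are made up, and the real models in `src/models/models.py` also include graph interaction, cross-attention, and guide banks, which this sketch omits:

```python
# Toy illustration of multitask fusion: project each modality embedding into
# a shared hidden space, mean-pool, then apply per-task linear heads.
def matvec(w, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Toy projections: every modality lands in the same 2-d shared space.
projections = {
    "face": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # 3-d -> 2-d
    "audio": [[0.5, 0.5], [0.0, 1.0]],            # 2-d -> 2-d
}
inputs = {"face": [1.0, 2.0, 3.0], "audio": [4.0, 2.0]}

projected = {m: matvec(projections[m], x) for m, x in inputs.items()}
# Mean-pool the modality representations into one shared vector.
shared = [sum(vals) / len(projected) for vals in zip(*projected.values())]

# Each task head is just another linear map over the shared vector.
task_heads = {"emotion": [[1.0, 0.0]], "ah": [[0.0, 1.0]]}
outputs = {task: matvec(w, shared) for task, w in task_heads.items()}
```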
Implemented model variants:
- `MultiModalFusionModel_v1`
- `MultiModalFusionModel_v2`
- `MultiModalFusionModel_v3`
The default config currently uses MultiModalFusionModel_v2.
Most experiment behavior is controlled from config.toml:
- `[datasets.*]`: dataset paths and subset fractions
- `[dataloader]`: worker count, shuffle, `prepare_only`, behavior-text column
- `[train.general]`: seed, batch size, epochs, patience, checkpointing, cache saving, device, search mode
- `[train.model]`: model architecture hyperparameters
- `[train.losses]`: multitask and semi-supervised loss settings
- `[train.optimizer]`: optimizer and learning rate
- `[train.scheduler]`: scheduler setup
- `[embeddings]`: extractor choice and embedding aggregation strategy
- `[cache]`: cache behavior and forced re-extraction
- `[ablation]`: module, task, and modality ablations
Use Python 3.10+ and install dependencies from requirements.txt.
python -m venv .venv
.venv\Scripts\activate   # Windows; on Linux/macOS use: source .venv/bin/activate
pip install -r requirements.txt

If you plan to use Telegram notifications, also install:

pip install python-dotenv

Then create a .env file with:

TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...

Telegram is optional. You can disable it with `use_telegram = false` in `config.toml`.
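A sketch of how the optional Telegram settings can be resolved at runtime (the `telegram_settings` helper is hypothetical; python-dotenv loads `.env` into the process environment, after which plain `os.environ` lookups suffice):

```python
# Sketch of optional notification config: read credentials from environment
# variables and disable notifications when either is missing or when the
# config turns them off.
import os

def telegram_settings(use_telegram: bool):
    token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    enabled = use_telegram and bool(token) and bool(chat_id)
    return {"enabled": enabled, "token": token, "chat_id": chat_id}

os.environ["TELEGRAM_BOT_TOKEN"] = "123:abc"   # illustrative values
os.environ["TELEGRAM_CHAT_ID"] = "42"
on = telegram_settings(use_telegram=True)
off = telegram_settings(use_telegram=False)    # use_telegram = false in config
```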
Update dataset paths in config.toml so they point to your local copies of cmu_mosei, fiv2, and bah.
The pipeline supports three main modes through train.general.search_type:
- `none`: single training run
- `greedy`: greedy hyperparameter search
- `exhaustive`: exhaustive hyperparameter search
python main.py

If you want to build caches without starting training, set:

[dataloader]
prepare_only = true

Then run:

python main.py

To train after caches are built, set:

[dataloader]
prepare_only = false

[train.general]
search_type = "none"

Then run:

python main.py

For hyperparameter search, set `search_type = "greedy"` or `search_type = "exhaustive"` in `config.toml`. Search values are read from `search_params.toml`.
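The difference between the two search modes can be illustrated on a toy space. The space and scoring function below are invented for the example; real values come from `search_params.toml`:

```python
# Toy comparison of the two search strategies: exhaustive evaluates every
# combination of the space, while greedy tunes one parameter at a time,
# keeping the best value found so far (cheaper, but not guaranteed optimal).
from itertools import product

space = {"lr": [0.1, 0.01], "hidden": [64, 128]}
# Invented objective: the closer to lr=0.01 and hidden=128, the better.
score = lambda cfg: -abs(cfg["lr"] - 0.01) - abs(cfg["hidden"] - 128) / 1000

def exhaustive(space):
    keys = list(space)
    return max((dict(zip(keys, vals)) for vals in product(*space.values())),
               key=score)

def greedy(space):
    # Start from the first value of each parameter, then improve per-parameter.
    cfg = {k: v[0] for k, v in space.items()}
    for param, values in space.items():
        cfg[param] = max(values, key=lambda v: score({**cfg, param: v}))
    return cfg

best_exhaustive = exhaustive(space)
best_greedy = greedy(space)
```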
Each run creates a timestamped directory under results/, for example:
results/results_multimodalfusionmodel_v2_YYYY-MM-DD_HH-MM-SS/
The run directory contains:
- `config_copy.toml`: snapshot of the run configuration
- `session_log.txt`: full log output
- `metrics_by_epoch/`: metric logs
- `checkpoints/`: saved best model checkpoints
- `overrides.txt`: search overrides when search mode is used
Cached modality features are stored separately under the path configured by train.general.save_feature_path, which defaults to ./features/.
To reproduce the reported or best configuration results:
- Install the dependencies listed above.
- Prepare the datasets with the expected CSV schema and folder layout.
- Set dataset paths in `config.toml`.
- Use the best-performing settings in `config.toml` and, if applicable, `search_params.toml`.
- Run `python main.py`.
- Collect metrics from the log file and the saved checkpoints in the generated `results/` directory.
If the paper or benchmark section reports a specific best setup, it is recommended to explicitly mark that setup in the config or document it in a dedicated subsection.
- The default sample paths in `config.toml` are local machine paths and should be changed before running the project elsewhere.
- The repository includes `yolov8n-face.pt`, but face extraction also supports MediaPipe-based detection depending on config.
- If a dataset split does not provide a `test` file, the code falls back to the dev loader for test-time evaluation.