Semi-Supervised Cross-Domain Learning Framework for Multitask Multimodal Psychological States Recognition
This repository contains a configurable training pipeline for multimodal psychological state recognition across multiple datasets. The project combines several input modalities, builds cached feature representations with pretrained encoders, fuses them in a multitask model, and predicts:
- emotions
- personality traits
- AH (Ambivalence/Hesitancy) as a binary target
At a high level, the pipeline works as follows:
- Read dataset metadata from CSV files.
- Find matching video and audio files for each sample.
- Extract modality-specific embeddings for face, audio, text, and behavior descriptions.
- Cache extracted features to avoid recomputing them on every run.
- Merge enabled datasets into a shared training pipeline.
- Train a multitask fusion model with optional ablations and optional hyperparameter search.
- Evaluate on dev/test splits and save checkpoints and logs.
The main entry point is main.py.
- `main.py`: orchestration entry point; loads config, initializes extractors, builds datasets/loaders, launches training or hyperparameter search.
- `config.toml`: main experiment configuration.
- `search_params.toml`: search space and defaults for greedy/exhaustive hyperparameter search.
- `data/`: local folder for annotation and metadata CSV files.
- `pytorch_Qwen2.5-VL.py`: utility script for generating `text_llm` behavior descriptions from videos with Qwen2.5-VL.
- `src/train.py`: training loop, validation/test evaluation, metric aggregation, checkpointing, and early stopping.
- `src/data_loading/dataset_builder.py`: dataset and dataloader creation, split fractions, and collate logic.
- `src/data_loading/dataset_multimodal.py`: sample indexing, label assembly, per-modality feature extraction, and feature caching.
- `src/data_loading/pretrained_extractors.py`: pretrained encoders for face/video, audio, text, and behavior modalities.
- `src/models/models.py`: multitask fusion architectures and ablation-aware variants.
- `src/utils/feature_store.py`: cache storage for extracted features and metadata.
The project expects a local data/ directory with split annotation tables prepared for this pipeline. A typical layout is:
data/
cmu_mosei/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
fiv2/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
bah/
train_full_with_description.csv
dev_full_with_description.csv
test_full_with_description.csv
These files store annotation and metadata tables used by the training pipeline. They do not replace the original raw datasets and do not include the source video or audio files.
The current configuration supports three datasets:
- `cmu_mosei`: emotion labels
- `fiv2`: personality labels
- `bah`: BAH (Behavioural Ambivalence/Hesitancy) dataset with AH (Ambivalence/Hesitancy) labels
Each dataset is configured independently in config.toml through:
- `base_dir`
- `csv_path`
- `video_dir`
- `audio_dir`
- `train_fraction`, `dev_fraction`, `test_fraction`
The training loader is built as a concatenation of enabled training subsets from all configured datasets.
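The concatenated training loader behaves like `torch.utils.data.ConcatDataset`: samples from each enabled dataset are addressed through one contiguous index space. A minimal pure-Python sketch of that indexing (the toy dataset names and sizes are illustrative, not the repo's actual classes):

```python
# Sketch of concatenating enabled training subsets into one index space,
# mirroring the behavior of torch.utils.data.ConcatDataset.
from bisect import bisect_right

class ConcatDataset:
    """Index into several datasets as if they were one contiguous dataset."""
    def __init__(self, datasets):
        self.datasets = datasets
        self.cumulative = []
        total = 0
        for d in datasets:
            total += len(d)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, idx):
        # Find which underlying dataset the global index falls into.
        ds_idx = bisect_right(self.cumulative, idx)
        prev = self.cumulative[ds_idx - 1] if ds_idx > 0 else 0
        return self.datasets[ds_idx][idx - prev]

# Toy "datasets": each sample is (dataset_name, local_index).
mosei = [("cmu_mosei", i) for i in range(3)]
fiv2 = [("fiv2", i) for i in range(2)]
train = ConcatDataset([mosei, fiv2])
```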
This project does not create, own, or redistribute the original source datasets. It relies on previously released third-party research datasets and project-specific CSV annotations prepared for this pipeline.
When using this repository, make sure that your use of each dataset complies with its original license, access policy, consent terms, and any institutional or ethics requirements that may apply.
Official sources for the datasets used here:
- CMU-MOSEI: official paper: https://aclanthology.org/P18-1208/ ; CMU MultiComp / SDK resources: https://github.com/CMU-MultiComp-Lab/CMU-MultimodalSDK
- First Impressions V2: official ChaLearn dataset page: https://chalearnlap.cvc.uab.cat/dataset/24/description/
- BAH (Behavioural Ambivalence/Hesitancy): official dataset page: https://liviaets.github.io/bah-dataset/ ; paper page: https://openreview.net/forum?id=jYDHVscRO3
Notes on access conditions:
- CMU-MOSEI is a public research dataset described by CMU MultiComp.
- First Impressions V2 is publicly documented through ChaLearn and may require registration or sign-in to access the files.
- BAH (Behavioural Ambivalence/Hesitancy) is publicly documented, but the dataset page states that access is provided under a research-only license and requires following the authors' request instructions.
The demographic or sensitive attributes available in the source datasets, such as age, sex, nationality, or ethnicity where applicable, originate from the original dataset providers. This repository does not create new demographic annotations and does not claim authorship over those labels.
Each dataset split is expected to provide:
- a CSV file with sample metadata
- a directory with video files
- a directory with audio files
Inside this repository, the data/ folder is intended for local copies of the annotation CSV files only. It does not contain the raw video or audio data.
The code expects a `video_name` column in each CSV. Depending on the target task and enabled modalities, the CSV should also contain:
- emotion columns for `cmu_mosei`: `Neutral`, `Anger`, `Disgust`, `Fear`, `Happiness`, `Sadness`, `Surprise`
- personality columns for `fiv2`: `openness`, `conscientiousness`, `extraversion`, `agreeableness`, `non-neuroticism`
- AH (Ambivalence/Hesitancy) columns for `bah`: `absence_full`, `presence_full`
- a `text` column for the text modality
- a behavior-description column for the behavior modality: by default `text_llm`, configurable via `dataloader.text_description_column`
Example path pattern from the default config:
[datasets.cmu_mosei]
base_dir = "/path/to/CMU-MOSEI/"
csv_path = "{base_dir}/{split}_full_with_description.csv"
video_dir = "{base_dir}/video/{split}/"
audio_dir = "{base_dir}/audio/{split}/"

Each dataset should be organized so that the configured CSV file, video directory, and audio directory exist for every split used by the experiment.
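The `{base_dir}` and `{split}` placeholders can be resolved with plain string formatting. A sketch of that expansion (the `resolve` helper is hypothetical; the pipeline's actual interpolation logic may differ):

```python
# Illustrative expansion of the {base_dir}/{split} placeholders from
# config.toml into concrete paths. base_dir is written without a trailing
# slash here so the joined paths do not contain "//".
base_dir = "/path/to/CMU-MOSEI"

templates = {
    "csv_path": "{base_dir}/{split}_full_with_description.csv",
    "video_dir": "{base_dir}/video/{split}/",
    "audio_dir": "{base_dir}/audio/{split}/",
}

def resolve(template: str, split: str) -> str:
    """Fill in the placeholders for one dataset split."""
    return template.format(base_dir=base_dir, split=split)

train_csv = resolve(templates["csv_path"], "train")
dev_video_dir = resolve(templates["video_dir"], "dev")
```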
Example for one dataset:
DATASET_ROOT/
|- train_full_with_description.csv
|- dev_full_with_description.csv
|- test_full_with_description.csv
|- video/
| |- train/
| | |- sample_0001.mp4
| | |- sample_0002.mp4
| | `- ...
| |- dev/
| | `- ...
| `- test/
| `- ...
`- audio/
|- train/
| |- sample_0001.wav
| |- sample_0002.wav
| `- ...
|- dev/
| `- ...
`- test/
`- ...
For the default configuration, the repository assumes the following structure pattern for each dataset:
- `base_dir`: dataset root directory
- `{base_dir}/{split}_full_with_description.csv`: metadata for a split
- `{base_dir}/video/{split}/`: video files for that split
- `{base_dir}/audio/{split}/`: audio files for that split
- CSV file: one row per sample, with `video_name` and the labels and text fields required by the enabled tasks and modalities
- `video/<split>/`: source videos used for face-frame extraction
- `audio/<split>/`: audio tracks used for audio embedding extraction
Inside this repository, data/ is intended for annotation tables such as train_full_with_description.csv, dev_full_with_description.csv, and test_full_with_description.csv.
The loader matches files by `video_name` without requiring a specific extension: it searches for a file whose basename matches the value from the CSV.
For example, if a CSV row contains:
video_name = sample_0001
then the loader will look for files such as:
- `video/train/sample_0001.mp4`
- `audio/train/sample_0001.wav`
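That extension-agnostic lookup can be sketched as follows (the `find_by_basename` helper is illustrative, not the repo's actual loader code):

```python
# Sketch of extension-agnostic matching by basename: given a video_name from
# the CSV, find any file in the split directory whose stem matches it.
import tempfile
from pathlib import Path

def find_by_basename(directory: Path, video_name: str):
    """Return the first file whose stem equals video_name, or None."""
    for path in sorted(directory.iterdir()):
        if path.stem == video_name:
            return path
    return None

# Build a throwaway directory that imitates video/train/.
root = Path(tempfile.mkdtemp())
(root / "sample_0001.mp4").touch()
(root / "sample_0002.avi").touch()

match = find_by_basename(root, "sample_0001")
miss = find_by_basename(root, "sample_9999")
```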
At minimum, each CSV should contain:
- `video_name`
Depending on which labels and modalities are used, the CSV may additionally need:
- `text`: source text for the text modality
- `text_llm` or another configured description column: source text for the behavior modality
- emotion label columns for `cmu_mosei`
- personality label columns for `fiv2`
- AH (Ambivalence/Hesitancy) label columns for `bah`
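A pre-flight check of this schema can catch missing columns before a long run. A sketch using only the column names listed in this README (the `missing_columns` helper itself is hypothetical):

```python
# Sketch of a schema check for the per-dataset CSV files: verify that
# video_name and the task/modality columns described above are present.
import csv
import io

REQUIRED = {"video_name"}
TASK_COLUMNS = {
    "cmu_mosei": {"Neutral", "Anger", "Disgust", "Fear",
                  "Happiness", "Sadness", "Surprise"},
    "fiv2": {"openness", "conscientiousness", "extraversion",
             "agreeableness", "non-neuroticism"},
    "bah": {"absence_full", "presence_full"},
}

def missing_columns(header, dataset, modal_columns=("text", "text_llm")):
    """Return the sorted list of required columns absent from the header."""
    present = set(header)
    needed = REQUIRED | TASK_COLUMNS[dataset] | set(modal_columns)
    return sorted(needed - present)

# A tiny in-memory CSV standing in for bah/train_full_with_description.csv.
sample = "video_name,text,text_llm,absence_full,presence_full\nclip_01,hi,calm,1,0\n"
header = next(csv.reader(io.StringIO(sample)))
gaps = missing_columns(header, "bah")
```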
The repository also contains pytorch_Qwen2.5-VL.py, a standalone utility for generating the text_llm column used by the behavior modality.
This script:
- loads a video-language model based on Qwen2.5-VL
- processes input videos
- generates a short natural-language description of visible nonverbal behavior
- writes the generated text back into a CSV column named `text_llm`
In practice, this script can be used to augment dataset annotations with LLM-generated behavioral descriptions before training. Those generated descriptions can then be consumed by the behavior modality through dataloader.text_description_column.
The script is not part of the default training entry point in main.py. It is a preprocessing utility that should be run separately when you want to create or refresh text_llm annotations.
Before running the script, update its local configuration values such as:
- `video_dir`
- `input_csv`
- `output_csv`
- `model_name`
The system supports four modalities:
- `face`: extracted from video frames after face detection
- `audio`: extracted from audio files
- `text`: extracted from the `text` column
- `behavior`: extracted from the configured description column such as `text_llm`
Available extractor families in the current codebase include:
- video/face: CLIP-based image encoder
- audio: CLAP or Wav2Vec2-style encoder
- text/behavior: CLIP text, CLAP text, RoBERTa/XLM-R style models, or `michellejieli/emotion_text_classifier`
Feature extraction is configured in the [embeddings] section of config.toml. Extracted embeddings can be stored in the local feature cache to speed up repeated experiments.
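The caching idea can be sketched as a small disk-backed store keyed by sample, modality, and extractor. This mirrors the spirit of `src/utils/feature_store.py` without reproducing its actual layout or API:

```python
# Sketch of a disk-backed feature cache: recompute embeddings only on a
# cache miss, otherwise load the pickled result from disk.
import pickle
import tempfile
from pathlib import Path

class FeatureCache:
    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, sample_id: str, modality: str, extractor: str) -> Path:
        return self.root / f"{sample_id}__{modality}__{extractor}.pkl"

    def get_or_compute(self, sample_id, modality, extractor, compute):
        path = self._path(sample_id, modality, extractor)
        if path.exists():                      # cache hit: skip extraction
            return pickle.loads(path.read_bytes())
        features = compute()                   # cache miss: run the extractor
        path.write_bytes(pickle.dumps(features))
        return features

cache = FeatureCache(Path(tempfile.mkdtemp()) / "features")
calls = []
embed = lambda: calls.append(1) or [0.1, 0.2, 0.3]   # stand-in extractor
first = cache.get_or_compute("sample_0001", "audio", "clap", embed)
second = cache.get_or_compute("sample_0001", "audio", "clap", embed)
```

The second call returns the cached vector without invoking the extractor again, which is the point of caching features across repeated experiments.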
The fusion model is defined in src/models/models.py. The model:
- projects each modality into a shared hidden space
- optionally applies graph-based interaction between modalities
- optionally applies cross-attention between task-specific and modality-level representations
- predicts multiple tasks jointly
- optionally uses guide-bank representations for task heads
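The projection-and-heads structure can be illustrated with a dependency-free toy. Dimensions and weights are made up, and the real models in `src/models/models.py` also include graph interaction, cross-attention, and guide banks, which this sketch omits:

```python
# Toy illustration of multitask fusion: project each modality embedding into
# a shared hidden space, mean-pool, then apply per-task linear heads.
def matvec(w, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Toy projections: every modality lands in the same 2-d shared space.
projections = {
    "face": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # 3-d -> 2-d
    "audio": [[0.5, 0.5], [0.0, 1.0]],            # 2-d -> 2-d
}
inputs = {"face": [1.0, 2.0, 3.0], "audio": [4.0, 2.0]}

projected = {m: matvec(projections[m], x) for m, x in inputs.items()}
# Mean-pool the modality representations into one shared vector.
shared = [sum(vals) / len(projected) for vals in zip(*projected.values())]

# Each task head is just another linear map over the shared vector.
task_heads = {"emotion": [[1.0, 0.0]], "ah": [[0.0, 1.0]]}
outputs = {task: matvec(w, shared) for task, w in task_heads.items()}
```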
Implemented model variants:
- `MultiModalFusionModel_v1`
- `MultiModalFusionModel_v2`
- `MultiModalFusionModel_v3`
The default config currently uses MultiModalFusionModel_v2.
Most experiment behavior is controlled from config.toml:
- `[datasets.*]`: dataset paths and subset fractions
- `[dataloader]`: worker count, shuffle, `prepare_only`, behavior-text column
- `[train.general]`: seed, batch size, epochs, patience, checkpointing, cache saving, device, search mode
- `[train.model]`: model architecture hyperparameters
- `[train.losses]`: multitask and semi-supervised loss settings
- `[train.optimizer]`: optimizer and learning rate
- `[train.scheduler]`: scheduler setup
- `[embeddings]`: extractor choice and embedding aggregation strategy
- `[cache]`: cache behavior and forced re-extraction
- `[ablation]`: module, task, and modality ablations
Use Python 3.10+ and install dependencies from requirements.txt.
python -m venv .venv
.venv\Scripts\activate   # Windows; on Linux/macOS use: source .venv/bin/activate
pip install -r requirements.txt

If you plan to use Telegram notifications, also install:

pip install python-dotenv

Then create a .env file with:

TELEGRAM_BOT_TOKEN=...
TELEGRAM_CHAT_ID=...

Telegram is optional. You can disable it with `use_telegram = false` in `config.toml`.
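A sketch of how the optional Telegram settings can be resolved at runtime (the `telegram_settings` helper is hypothetical; python-dotenv loads `.env` into the process environment, after which plain `os.environ` lookups suffice):

```python
# Sketch of optional notification config: read credentials from environment
# variables and disable notifications when either is missing or when the
# config turns them off.
import os

def telegram_settings(use_telegram: bool):
    token = os.environ.get("TELEGRAM_BOT_TOKEN")
    chat_id = os.environ.get("TELEGRAM_CHAT_ID")
    enabled = use_telegram and bool(token) and bool(chat_id)
    return {"enabled": enabled, "token": token, "chat_id": chat_id}

os.environ["TELEGRAM_BOT_TOKEN"] = "123:abc"   # illustrative values
os.environ["TELEGRAM_CHAT_ID"] = "42"
on = telegram_settings(use_telegram=True)
off = telegram_settings(use_telegram=False)    # use_telegram = false in config
```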
Update dataset paths in config.toml so they point to your local copies of cmu_mosei, fiv2, and bah.
The pipeline supports three main modes through train.general.search_type:
- `none`: single training run
- `greedy`: greedy hyperparameter search
- `exhaustive`: exhaustive hyperparameter search
python main.py

If you want to build caches without starting training, set:

[dataloader]
prepare_only = true

Then run:

python main.py

To train after caches are built, set:

[dataloader]
prepare_only = false

[train.general]
search_type = "none"

Then run:

python main.py

For hyperparameter search, set `search_type = "greedy"` or `search_type = "exhaustive"` in `config.toml`. Search values are read from `search_params.toml`.
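The difference between the two search modes can be illustrated on a toy space. The space and scoring function below are invented for the example; real values come from `search_params.toml`:

```python
# Toy comparison of the two search strategies: exhaustive evaluates every
# combination of the space, while greedy tunes one parameter at a time,
# keeping the best value found so far (cheaper, but not guaranteed optimal).
from itertools import product

space = {"lr": [0.1, 0.01], "hidden": [64, 128]}
# Invented objective: the closer to lr=0.01 and hidden=128, the better.
score = lambda cfg: -abs(cfg["lr"] - 0.01) - abs(cfg["hidden"] - 128) / 1000

def exhaustive(space):
    keys = list(space)
    return max((dict(zip(keys, vals)) for vals in product(*space.values())),
               key=score)

def greedy(space):
    # Start from the first value of each parameter, then improve per-parameter.
    cfg = {k: v[0] for k, v in space.items()}
    for param, values in space.items():
        cfg[param] = max(values, key=lambda v: score({**cfg, param: v}))
    return cfg

best_exhaustive = exhaustive(space)
best_greedy = greedy(space)
```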
Each run creates a timestamped directory under results/, for example:
results/results_multimodalfusionmodel_v2_YYYY-MM-DD_HH-MM-SS/
The run directory contains:
- `config_copy.toml`: snapshot of the run configuration
- `session_log.txt`: full log output
- `metrics_by_epoch/`: metric logs
- `checkpoints/`: saved best model checkpoints
- `overrides.txt`: search overrides when search mode is used
Cached modality features are stored separately under the path configured by train.general.save_feature_path, which defaults to ./features/.
To reproduce the reported or best configuration results:
- Install the dependencies listed above.
- Prepare the datasets with the expected CSV schema and folder layout.
- Set dataset paths in `config.toml`.
- Use the best-performing settings in `config.toml` and, if applicable, `search_params.toml`.
- Run `python main.py`.
- Collect metrics from the log file and the saved checkpoints in the generated `results/` directory.
If the paper or benchmark section reports a specific best setup, it is recommended to explicitly mark that setup in the config or document it in a dedicated subsection.
- The default sample paths in `config.toml` are local machine paths and should be changed before running the project elsewhere.
- The repository includes `yolov8n-face.pt`, but face extraction also supports MediaPipe-based detection depending on config.
- If a dataset split does not provide a `test` file, the code falls back to the dev loader for test-time evaluation.