A comprehensive topic modeling pipeline for consciousness-related textual data across multiple datasets, using BERTopic, BERT embeddings, and UMAP-HDBSCAN clustering with multilingual support.
MOSAIC is a research framework that analyzes subjective experiential reports from various consciousness studies through:
- Advanced NLP with BERT embeddings and multilingual models
- Dimensionality reduction via UMAP
- Density-based clustering with HDBSCAN
- Hyperparameter optimization with Optuna
- Topic coherence optimization
- Large Language Model integration via llama.cpp for deeper insights
- Support for multiple datasets and languages
This repository contains analysis pipelines for several consciousness research datasets:
- Dreamachine: Stroboscopic light-induced altered states of consciousness
- Inner Speech: Japanese phenomenological reports on inner speech experiences
- Depression/MPE: Mental health and psychological experience reports
- NDE: Near-death experience accounts
- Ganzfeld: Sensory deprivation experimental reports
```
MOSAIC/
├── src/                             # Core functionality
│   ├── preprocessor.py              # Text cleaning, sentence splitting
│   ├── model.py                     # BERTopic configuration
│   ├── utils.py                     # Metrics and helper functions
│   ├── optuna_search.py             # Hyperparameter search with Optuna
│   └── optuna_search_allmetrics.py  # Multi-objective optimization
├── configs/                         # Dataset-specific configurations
│   └── dreamachine2.py              # Dreamachine dataset settings
├── preproc/                         # Data preprocessing utilities
│   ├── prepare_data.ipynb           # Data preparation notebook
│   └── preprocess_data_*.ipynb      # Dataset-specific preprocessing
├── scripts/                         # Analysis notebooks and tools
├── EVAL/                            # Model evaluation and analysis
│   ├── dreamachine/                 # Dreamachine-specific evaluations
│   │   ├── demographics.ipynb       # Demographic analysis
│   │   └── stability_tests/         # Model stability testing
│   ├── conditions_similarity.ipynb  # Cross-condition comparisons
│   └── optuna_search/               # Hyperparameter optimization results
├── MULTILINGUAL/                    # Multilingual analysis pipeline
│   ├── DREAMACHINE/                 # Multilingual Dreamachine analysis
│   ├── INNERSPEECH/                 # Japanese inner speech analysis
│   │   ├── app.py                   # Streamlit dashboard
│   │   ├── app_hosted.py            # Hosted version of dashboard
│   │   └── local_translator.py      # Local translation utilities
│   ├── translate/                   # Translation utilities
│   └── prepare_data.ipynb           # Multilingual data preparation
├── DATA/                            # Local data storage
├── pyproject.toml                   # Project configuration and dependencies
├── requirements.txt                 # Python dependencies
└── .mosaicvenv/                     # Virtual environment
```
- Preprocessing: Text cleaning, sentence splitting, duplicate removal
- Embedding: Support for multiple transformer models (Qwen, E5, BGE, etc.)
- Clustering: UMAP dimensionality reduction + HDBSCAN clustering
- Topic Modeling: BERTopic with custom representation models
- Evaluation: Coherence metrics, stability testing, bootstrap analysis
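To make the coherence metrics above concrete, here is a small self-contained NPMI (normalized pointwise mutual information) scorer based on document-level co-occurrence. This is a sketch of the idea, not the project's actual metric code in `src/utils.py`, which may use gensim or BERTopic built-ins instead:

```python
from itertools import combinations
from math import log

def npmi_coherence(topic_words, docs):
    """Average NPMI over all word pairs in a topic, scored by document
    co-occurrence. Ranges from -1 (pair never co-occurs) to 1 (always)."""
    n = len(docs)
    doc_sets = [set(doc.split()) for doc in docs]

    def prob(*words):
        # Fraction of documents containing every word in `words`
        return sum(all(w in s for w in words) for s in doc_sets) / n

    scores = []
    for wi, wj in combinations(topic_words, 2):
        p_ij = prob(wi, wj)
        if p_ij == 0.0:
            scores.append(-1.0)  # the pair never co-occurs
        elif p_ij == 1.0:
            scores.append(0.0)   # degenerate: both words appear in every doc
        else:
            scores.append(log(p_ij / (prob(wi) * prob(wj))) / -log(p_ij))
    return sum(scores) / len(scores)

# e.g. npmi_coherence(["light", "flicker"], reports) for a topic's top words
```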
- Translation pipelines for non-English datasets
- Support for Japanese text processing
- API-based and local translation options using Llama models
- Streamlit applications for real-time analysis (`MULTILINGUAL/INNERSPEECH/app.py`)
- Parameter tuning interfaces
- Visualization tools with datamapplot integration
- Optuna-based search for optimal model parameters
- Multi-objective optimization across multiple metrics
- Dataset-specific parameter spaces
- Clone the repository:

```bash
git clone https://github.com/romybeaute/MOSAIC.git
cd MOSAIC
```

- Create and activate a virtual environment:

```bash
python3 -m venv .mosaicvenv
source .mosaicvenv/bin/activate  # On Windows: .mosaicvenv\Scripts\activate
```

- Install the package in development mode:

```bash
pip install -e .
```

This installs all dependencies specified in `pyproject.toml` and makes the `MOSAIC` package available for import.
The primary way to use MOSAIC is through Jupyter notebooks in the scripts/ directory and dataset-specific folders:
```bash
# Navigate to analysis notebooks
cd EVAL/dreamachine/
jupyter lab demographics.ipynb
```

Run a hyperparameter search with Optuna:

```bash
python src/optuna_search.py --dataset dreamachine --condition DL --sentences --n_trials 200
```

Parameters:

- `--dataset`: Dataset name (dreamachine, innerspeech, etc.)
- `--condition`: Experimental condition (HS, DL, HW)
- `--sentences`: Enable sentence-level analysis
- `--n_trials`: Number of optimization trials
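The flags above can be reproduced with a few lines of `argparse`; this is a sketch of the interface, and the actual defaults and help strings in `src/optuna_search.py` may differ:

```python
import argparse

def build_parser():
    # Mirrors the CLI flags described above (a sketch, not the script itself)
    p = argparse.ArgumentParser(description="MOSAIC hyperparameter search")
    p.add_argument("--dataset", required=True, help="dreamachine, innerspeech, ...")
    p.add_argument("--condition", choices=["HS", "DL", "HW"], help="experimental condition")
    p.add_argument("--sentences", action="store_true", help="enable sentence-level analysis")
    p.add_argument("--n_trials", type=int, default=100, help="number of Optuna trials")
    return p

args = build_parser().parse_args(
    ["--dataset", "dreamachine", "--condition", "DL", "--sentences", "--n_trials", "200"]
)
```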
Launch the Streamlit dashboard for interactive analysis:
```bash
streamlit run MULTILINGUAL/INNERSPEECH/app.py
```

For multilingual datasets, use the translation utilities:

```bash
python MULTILINGUAL/translate/local_translator.py \
    --dataset nde \
    --input-csv NDE_reflection_reports.csv \
    --text-column reflection_answer \
    --model llama \
    --task translate \
    --num-samples 100
```

Dataset-specific configurations are stored in the `configs/` directory. Example configuration for the Dreamachine dataset in `configs/dreamachine2.py`:
```python
class DreamachineConfig:
    def __init__(self):
        self.transformer_model = "Qwen/Qwen3-Embedding-0.6B"
        self.ngram_range = (1, 3)
        self.top_n_words = 15
        # ... other parameters
```

The project uses a modular structure:

- Core functionality in `src/`
- Dataset-specific code in respective folders
- Shared utilities in `DATA/helpers.py`
- Path management through `mosaic.path_utils`
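A path-management helper of the kind `mosaic.path_utils` provides might look like the following. This is a hypothetical sketch only; the module's real API, root-resolution logic, and the `MOSAIC_ROOT` environment variable are assumptions, not documented behavior:

```python
import os
from pathlib import Path

# Hypothetical: resolve the repo root from an env var, falling back to cwd
PROJECT_ROOT = Path(os.environ.get("MOSAIC_ROOT", ".")).resolve()

def data_path(*parts: str) -> Path:
    """Build a path under DATA/ without hard-coding separators."""
    return PROJECT_ROOT.joinpath("DATA", *parts)
```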
If using this code, please cite:
- Beauté, R. et al. (2024). Analysing the phenomenology of stroboscopically induced phenomena using natural language topic modelling
- Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure
See LICENSE for details.