MindMirror Personality is the personality-only slice of the MindMirror system. It trains, registers, serves, and persists Big Five weekly personality direction predictions derived from Reddit activity.
The repository now focuses on one domain only: **personality**.
The active production path is:
- A user links a Reddit username through the API.
- The API creates a signup inference job in Postgres.
- The worker polls queued jobs from the database.
- Reddit posts are loaded from a static demo bundle or extracted live.
- Raw posts are normalized into weekly trait and embedding features.
- Five per-trait inference models predict weekly direction states.
- Results are stored in the database and exposed through prediction endpoints.
This codebase combines four concerns in one repository:
- API application: FastAPI service for authentication, user profile data, job status, and prediction retrieval.
- ML pipelines: trait-specific data preparation, training, and offline inference entry points.
- Inference runtime: shared inference engine, feature builders, artifact loading, and signup prediction orchestration.
- MLOps and deployment: MLflow integration, Docker packaging, compose-based local orchestration, SQL bootstrap, and operational scripts.
The repository intentionally excludes the older non-personality domains. Everything active here is built around the Big Five traits:
- openness
- conscientiousness
- extraversion
- agreeableness
- neuroticism
Each trait has its own training, artifact paths, and MLflow registry mapping in config.yaml.
client -> FastAPI -> Postgres job row -> worker poll loop -> Reddit extraction -> feature building -> MLflow model loading -> prediction persistence -> API readback
raw Reddit posts -> preprocessing -> weekly aggregation -> delta target labeling -> train/test split -> trait model training -> evaluation -> MLflow logging -> MLflow model registry
- `api`: request handling, auth, persistence, job orchestration
- `worker`: signup inference background consumer
- `src/personality`: preprocessing, features, training, evaluation, inference
- `src/inference`: shared model loading and schema logic
- `utils`: configuration and MLflow helpers
- `pipelines/personality/*`: trait entry points for data prep, training, inference
This repository predicts trait direction state, not raw trait score regression. For each Big Five trait, the model predicts a weekly directional label derived from future-vs-past movement over a rolling time window.
The pipeline is built around Reddit post data with fields such as:
- author or equivalent username column
- timestamp (`created_date`, `created_at`, or `created_utc`)
- text fields (`title`, `selftext`)
- optional precomputed Big Five scores
- optional precomputed embeddings
When upstream signals are missing, the runtime can compute them on demand:
- Big Five text scores from the scoring model
- BERT-style text embeddings from the embedding model
The preprocessing code in `src/personality` and the trait data pipelines follow this sequence:
- Normalize input columns and author identity.
- Concatenate `title` and `selftext` into a processed text field.
- Clean text and replace emoji textually.
- Parse or generate Big Five scores.
- Parse or generate text embeddings.
- Aggregate post-level data into user-week records.
- Compute week gaps and temporal spacing features.
- Fill standard deviation gaps safely for sparse weeks.
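The normalization steps above can be sketched in pandas. The helper below is illustrative only: the input column names (`author`, `created_utc`, `title`, `selftext`) match the raw fields listed earlier, but the cleaning rules are simplified assumptions standing in for the real cleaning and emoji-replacement transforms in `src/personality`.

```python
import pandas as pd

def normalize_posts(raw: pd.DataFrame) -> pd.DataFrame:
    """Toy version of the normalization sequence (real logic: src/personality)."""
    df = raw.rename(columns={"author": "username"}).copy()
    # Concatenate title and selftext into one processed text field.
    df["text"] = df["title"].fillna("") + " " + df["selftext"].fillna("")
    # Minimal cleaning: drop URLs and collapse whitespace.
    df["text"] = (
        df["text"]
        .str.replace(r"https?://\S+", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    # Parse the epoch timestamp into a timezone-aware datetime.
    df["created_at"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)
    return df
```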
The system converts post-level Reddit data into weekly user summaries:
- weekly trait means
- weekly trait standard deviations
- weekly embedding means
- week gap features
This is the bridge between raw social content and model-ready longitudinal data.
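Conceptually, that bridge is a user-week groupby. The sketch below assumes trait score columns are already attached to each post (the column names are assumptions), and it omits the embedding means for brevity; the real aggregation lives in `src/personality/weekly_preprocess.py`.

```python
import pandas as pd

def aggregate_weekly(posts: pd.DataFrame, traits: list[str]) -> pd.DataFrame:
    """Toy user-week aggregation with per-trait mean/std and week-gap features."""
    posts = posts.copy()
    # Bucket every post into the Monday-anchored week it belongs to.
    posts["week"] = posts["created_at"].dt.to_period("W").dt.start_time
    agg = {t: ["mean", "std"] for t in traits}
    weekly = posts.groupby(["username", "week"]).agg(agg)
    weekly.columns = [f"{t}_{stat}" for t, stat in weekly.columns]
    weekly = weekly.reset_index().sort_values(["username", "week"])
    # Week-gap feature: distance (in weeks) from the user's previous active week.
    weekly["time_gap_weeks"] = (
        weekly.groupby("username")["week"].diff().dt.days.div(7).fillna(0)
    )
    return weekly
```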
Training labels are generated from temporal movement in each trait:
- `past_mean` is computed over the configured lookback window
- `future_mean` is computed over the configured forward window
- `delta_future` captures the future-minus-past change
- a quantile threshold `tau` converts the delta into directional classes

Those labels become the trait-specific targets, such as `openness_dir`.
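A compact single-user sketch of that labeling follows. The window sizes, the quantile value, and the class names are assumptions; the production logic in `src/personality/delta_targets.py` also handles multiple authors and NaN policies.

```python
import numpy as np
import pandas as pd

def label_direction(weekly: pd.DataFrame, trait: str, lookback: int = 2,
                    forward: int = 2, tau_quantile: float = 0.5) -> pd.DataFrame:
    """Toy directional labeling for one user's weekly trait series."""
    df = weekly.sort_values("week").copy()
    s = df[f"{trait}_mean"]
    # past_mean over the lookback window (inclusive of the current week).
    past = s.rolling(lookback, min_periods=1).mean()
    # future_mean over the next `forward` weeks (exclusive of the current week):
    # reverse, take a trailing window, reverse back, then shift off the current week.
    future = s[::-1].rolling(forward, min_periods=1).mean()[::-1].shift(-1)
    delta = future - past                      # delta_future
    tau = delta.abs().quantile(tau_quantile)   # quantile threshold tau
    labels = np.select([delta > tau, delta < -tau], ["up", "down"],
                       default="stable").astype(object)
    labels[delta.isna().to_numpy()] = None     # no forward window available
    df[f"{trait}_dir"] = labels
    return df
```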
The main feature families are:
- temporal gap features: `time_gap_weeks`, `log_time_gap_weeks`
- lag features: current observation shifted to prior-week context
- rolling features: rolling mean and rolling std over prior weeks
- per-week dispersion features: trait weekly std
- shifted embedding vector features: prior weekly embedding representation
At inference time, the system drops rows that do not have enough historical context. In practice, at least two weekly observations are needed to produce lagged predictions.
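The lag and rolling families above can be sketched as follows; the window size and column names are assumptions, and the production versions live in `src/personality/lagged_features.py`.

```python
import pandas as pd

def add_lagged_features(weekly: pd.DataFrame, trait: str, window: int = 3) -> pd.DataFrame:
    """Toy lag/rolling feature builder over per-user weekly records."""
    df = weekly.sort_values(["username", "week"]).copy()
    g = df.groupby("username")[f"{trait}_mean"]
    # Lag feature: the prior week's observation.
    df[f"{trait}_lag1"] = g.shift(1)
    # Rolling mean over prior weeks only (shift before rolling avoids leakage).
    df[f"{trait}_roll_mean"] = g.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    # A user's first week has no lagged context, which is why at least two
    # weekly observations are needed before lagged predictions are possible.
    return df.dropna(subset=[f"{trait}_lag1"])
```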
The default configured model type is xgboost. The training stack also keeps compatibility helpers for:
- `random_forest`
- `hist_gradient_boosting`
Trait training loads or creates dataset splits, trains a classifier, evaluates it, saves artifacts locally, and registers the trained model in MLflow.
Online inference is registry-first:
- models are resolved from MLflow registry URIs such as `models:/.../Production`
- local model paths are intentionally disabled for the shared runtime inference engine
- label encoders are downloaded from MLflow model artifacts when available
This keeps runtime serving aligned with promoted registry versions rather than ad hoc local files.
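The registry-first contract reduces to a small resolution step. The mapping value and the helper below are hypothetical; the real engine would pass the resulting URI to `mlflow.pyfunc.load_model` against the configured tracking backend.

```python
import os

# Hypothetical mirror of the mlflow.models mapping in config.yaml;
# the registered model name is an assumption.
MODEL_REGISTRY = {"big5_openness_state": "Big5OpennessState"}

def registry_uri(model_key: str, default_stage: str = "Production") -> str:
    """Resolve a model key to an MLflow registry URI, honoring MODEL_STAGE."""
    name = MODEL_REGISTRY[model_key]
    stage = os.environ.get("MODEL_STAGE", default_stage)
    # Registry URIs, not local file paths, are the only supported resolution path.
    return f"models:/{name}/{stage}"
```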
Each trait under `pipelines/personality/<trait>/` follows the same pattern:
- `data_pipeline.py`: builds or loads trait-specific train/test datasets
- `train_pipeline.py`: trains and evaluates the model, then registers it in MLflow
- `inference_pipeline.py`: command-line wrapper over shared inference for one trait
The shared offline inference entry point is pipelines/personality/run_trait_inference.py.
The trait data pipelines do the heavy transformation work:
- ingest raw JSON or CSV
- detect whether the input is already weekly
- preprocess text if needed
- parse score payloads into trait columns
- generate embeddings if not already present
- aggregate post data by user-week
- compute future delta labels
- create lag and rolling features
- split by author and time
- persist `X_train`, `X_test`, `Y_train`, `Y_test`
- log dataset metadata and thresholds to MLflow
The trait training pipelines:
- ensure split artifacts exist
- instantiate the configured model builder
- train the model
- evaluate it on the held-out split
- persist evaluation reports
- log metrics, datasets, parameters, and artifacts to MLflow
- optionally save and log a label encoder
- register the trained model version in MLflow Model Registry
The trait inference wrappers:
- load posts from CSV, JSON, Python data structures, or DataFrames
- build inference features from raw Reddit posts
- resolve the correct registered model
- make predictions
- optionally export JSON or CSV prediction outputs
The API app is created in api/app/main.py. On startup it:
- configures structured logging
- sets CORS
- installs request ID middleware
- optionally creates database tables
- mounts the versioned API router
- exposes `/health`
- `POST /api/v1/users/me/reddit`: saves a Reddit username, queues signup inference, and optionally waits for completion.
- `GET /api/v1/inference/jobs/{job_id}`: returns queued, failed, or completed job state.
- `GET /api/v1/users/me/predictions/personality`: returns the latest personality inference run for the current user.
- `GET /api/v1/users/me/predictions`: returns all available domain payloads; in this repo that means personality only.
- `GET /api/v1/users/me`: returns the current user plus the linked Reddit username and latest signup inference timestamp.
Legacy extraction and caption routes are still present in jobs.py, but the core product path is the signup inference route plus prediction retrieval.
The end-to-end signup path spans multiple modules:
- `users.py` accepts the Reddit username.
- `signup_inference_service.py` upserts the Reddit identity and creates an inference job.
- `src/workers/signup_inference_consumer.py` polls the next queued job.
- `src/services/reddit_extractor.py` loads cached, static, or live Reddit post bundles.
- `src/services/personality_signup_inference.py` builds features and runs all five traits.
- `prediction_repository.py` persists the run header and weekly trait predictions.
- `predictions.py` reads the latest run back for clients.
The live extraction path uses src/extractors/reddit.py through src/services/reddit_extractor.py.
Notable behavior:
- static demo user `ok_celery_4705` reads from a committed JSON bundle
- cached bundles are reused from `artifacts/extractions/...`
- image URLs can be downloaded locally
- BLIP captioning can enrich media summaries when vision dependencies are available
config.yaml is the central project configuration.
It defines:
- project metadata
- service naming
- database settings
- artifact locations
- active domains
- trait-specific data paths
- training settings
- model hyperparameters
- MLflow tracking and registry mappings
- inference loading stage
- `domains.personality.traits.<trait>`: holds per-trait paths and direction-label settings.
- `mlflow.models`: maps model keys like `big5_openness_state` to registry model names.
- `inference_loading.model_stage`: controls which MLflow stage the runtime uses, unless `MODEL_STAGE` overrides it.
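Put together, those keys suggest a config shape like the following. Only the key paths named in this section come from the repo; every concrete value below is a placeholder.

```yaml
domains:
  personality:
    traits:
      openness:                    # one block per trait (values are placeholders)
        data_path: artifacts/personality/openness
        direction_label: openness_dir
mlflow:
  models:
    big5_openness_state: Big5OpennessState   # registry model name (assumed)
inference_loading:
  model_stage: Production          # runtime stage unless MODEL_STAGE overrides it
```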
This repository treats MLflow as both an experiment tracker and the serving registry boundary.
Depending on the pipeline stage, MLflow captures:
- dataset sizes
- feature names
- train/test split stats
- label threshold `tau`
- model hyperparameters
- evaluation metrics
- saved model artifacts
- label encoder artifacts
- pipeline metadata JSON
The training pipelines register trait models with model keys such as:
- `big5_openness_state`
- `big5_conscientiousness_state`
- `big5_extraversion_state`
- `big5_agreeableness_state`
- `big5_neuroticism_state`
The inference engine resolves those model keys into registry URIs and loads them from the configured stage, typically Production.
utils/mlflow_utils.py and utils/mlflow_registry.py provide:
- tracking URI setup
- file-store fallback support
- experiment bootstrap
- run naming and tagging
- registry model registration helpers
- registry URI validation
This is not a full enterprise orchestration stack, but it does include practical MLOps building blocks:
- versioned configuration
- reproducible train/test split artifacts
- metrics and artifact tracking
- model registry promotion boundary
- containerized serving
- compose-based local orchestration
- SQL bootstrap for persistence
Airflow and Kafka compose overlays are referenced by scripts if those files exist, but they are optional and not required for the core personality flow.
The root Dockerfile:
- uses `python:3.11-slim`
- installs system build dependencies
- creates a virtual environment at `/opt/venv`
- installs Python dependencies from `requirements.txt`
- copies `api`, `src`, `utils`, `pipelines`, and `config.yaml`
- starts the API with Uvicorn on port `8000`
The active compose file defines two services:
- `api`: runs the FastAPI application and exposes port `8000`.
- `worker`: runs the signup inference consumer loop.
Both services mount the project code so local changes are reflected in the containers. The worker also installs demoji defensively before starting.
The repository is designed to work with Postgres through SQLAlchemy async sessions. SQL bootstrap details live under sql/.
- API and worker are logically separate processes.
- The worker currently polls the database instead of consuming an external queue.
- Model serving depends on access to the configured MLflow tracking and registry backend.
- `CREATE_TABLES_ON_STARTUP` can bootstrap tables for development, but real deployments should prefer migrations or SQL bootstrap scripts.
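The worker's poll-execute-persist cycle can be pictured with an in-memory stand-in for the Postgres jobs table. Everything here (field names, status strings, the sleep interval) is illustrative rather than the repo's actual implementation.

```python
import time
from collections import deque

def poll_once(jobs: deque, handler) -> bool:
    """Claim the next queued job and run it; return True if one was processed."""
    if not jobs:
        return False
    job = jobs.popleft()  # the real worker claims a queued row in Postgres
    job["status"] = "running"
    try:
        handler(job)
        job["status"] = "completed"
    except Exception as exc:  # failures are recorded on the job, not re-raised
        job["status"] = "failed"
        job["error"] = str(exc)
    return True

def run_loop(jobs: deque, handler, idle_sleep: float = 2.0, max_iters: int = 10) -> None:
    """Polling loop: sleep when the queue is empty instead of busy-waiting."""
    for _ in range(max_iters):
        if not poll_once(jobs, handler):
            time.sleep(idle_sleep)
```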
- Target Python: `3.11`
- Package metadata is declared in `pyproject.toml`
- Create and activate a Python 3.11 environment.
- Install dependencies.
- Configure `.env`.
- Ensure Postgres and MLflow are reachable.
- Start the API and worker.
- `scripts/run_local.sh`: brings up the compose stack, optionally layering Kafka and Airflow compose files if they exist.
- `scripts/up_full_stack.sh`: same idea, with service URLs echoed after startup.
- `scripts/down_full_stack.sh`: stops the compose stack.
- `scripts/migrate_db.sh`: applies database migration/bootstrap steps.
- `scripts/extract_reddit_posts.py`: utility for collecting Reddit posts outside the API path.
- `scripts/export_signup_flow_predictions.py`: exports persisted signup inference predictions.
- `scripts/save_signup_predictions_direct.py`: saves signup predictions directly into storage for backfill or debugging flows.
The current SQL bootstrap notes are in sql/README.md.
sql/001_init.sql initializes the main persistence objects used by the app:
- users
- jobs
- predictions
- related uniqueness constraints
At the ORM layer, the app also maintains models for:
- Reddit-linked platform identities
- inference runs
- weekly predictions
- posts, comments, and media metadata
- `api/`: FastAPI app, DB models, repositories, schemas, services
- `pipelines/`: trait-specific offline data/train/inference entry points
- `src/`: shared ML, extraction, and worker runtime code
- `scripts/`: local orchestration and data utility scripts
- `sql/`: SQL bootstrap assets
- `tests/`: focused extractor and auth tests
- `utils/`: central config and MLflow helpers
The list below reflects the current personality-only repository contents that matter to the runtime and ML pipelines.
| File | Purpose |
|---|---|
| `README.md` | Project documentation and architecture guide. |
| `config.yaml` | Central configuration for personality traits, artifacts, MLflow, and inference loading. |
| `Dockerfile` | API container image definition. |
| `docker-compose.yml` | Local multi-service runtime for API plus worker. |
| `pyproject.toml` | Python package metadata, dependencies, and tooling configuration. |
| File | Purpose |
|---|---|
| `api/app/__init__.py` | Marks the FastAPI app package. |
| `api/app/__main__.py` | CLI/module entry point for the packaged app. |
| `api/app/main.py` | Builds the FastAPI app, middleware, startup DB initialization, and health endpoint. |
| File | Purpose |
|---|---|
| `api/app/api/__init__.py` | API package marker. |
| `api/app/api/deps.py` | Shared dependency helpers for authenticated route access. |
| `api/app/api/router.py` | Mounts auth, jobs, users, and predictions routers. |
| File | Purpose |
|---|---|
| `api/app/api/v1/__init__.py` | Versioned API package marker. |
| File | Purpose |
|---|---|
| `api/app/api/v1/routes/auth.py` | Authentication endpoints and token-related flows. |
| `api/app/api/v1/routes/jobs.py` | Extraction, captioning, and job-status endpoints. |
| `api/app/api/v1/routes/predictions.py` | Endpoints for latest personality predictions. |
| `api/app/api/v1/routes/users.py` | Current-user profile endpoints plus signup inference trigger. |
| File | Purpose |
|---|---|
| `api/app/core/__init__.py` | Core package marker. |
| `api/app/core/config.py` | Pydantic settings and environment-backed runtime configuration. |
| `api/app/core/logging.py` | Logging setup utilities. |
| `api/app/core/security.py` | JWT and current-user security helpers. |
| File | Purpose |
|---|---|
| `api/app/clients/__init__.py` | Reserved package marker for external client integrations. |
| File | Purpose |
|---|---|
| `api/app/db/__init__.py` | Database package marker. |
| `api/app/db/base.py` | Shared SQLAlchemy declarative base. |
| `api/app/db/bootstrap_schema.py` | Database schema bootstrap helpers. |
| `api/app/db/deps.py` | FastAPI database-session dependency functions. |
| `api/app/db/session.py` | Async engine and session factory creation. |
| File | Purpose |
|---|---|
| `api/app/db/models/__init__.py` | Re-exports ORM models used across the app. |
| `api/app/db/models/comment.py` | ORM model for extracted Reddit comments. |
| `api/app/db/models/job.py` | ORM model for queued and completed inference jobs. |
| `api/app/db/models/media.py` | ORM model for downloaded or described media assets. |
| `api/app/db/models/post.py` | ORM model for Reddit posts. |
| `api/app/db/models/prediction.py` | ORM models for inference runs and weekly personality predictions. |
| `api/app/db/models/user.py` | ORM model for application users. |
| `api/app/db/models/user_inference_snapshot.py` | ORM model for user-level inference snapshot state. |
| `api/app/db/models/user_platform_identity.py` | ORM model linking app users to external identities such as Reddit usernames. |
| File | Purpose |
|---|---|
| `api/app/repositories/__init__.py` | Repository package marker. |
| `api/app/repositories/job_repository.py` | CRUD helpers for inference job state transitions and queue polling. |
| `api/app/repositories/prediction_repository.py` | Persistence and formatting helpers for personality inference runs and weekly predictions. |
| `api/app/repositories/user_repository.py` | User lookup plus platform identity helpers. |
| File | Purpose |
|---|---|
| `api/app/schemas/__init__.py` | Schema package marker. |
| `api/app/schemas/auth.py` | Request and response schemas for authentication. |
| `api/app/schemas/job.py` | Schemas for inference job payloads and status responses. |
| `api/app/schemas/user.py` | Core user response schemas. |
| `api/app/schemas/user_reddit.py` | Reddit username input and signup inference output schemas. |
| File | Purpose |
|---|---|
| `api/app/services/__init__.py` | Services package marker. |
| `api/app/services/auth_service.py` | Authentication service logic. |
| `api/app/services/job_service.py` | Wait/retry helpers for job completion payload resolution. |
| `api/app/services/reddit_oauth_service.py` | Reddit OAuth-related service utilities. |
| `api/app/services/signup_inference_service.py` | Queues personality signup inference and optionally waits for completion. |
| File | Purpose |
|---|---|
| `api/app/services/core/__init__.py` | Core service package marker. |
| `api/app/services/core/extraction.py` | User-data extraction service abstraction used by legacy extraction routes. |
| `api/app/services/core/vision.py` | Media captioning and image-processing service used by legacy caption routes. |
| File | Purpose |
|---|---|
| `api/app/utils/__init__.py` | Utility package marker. |
| `api/app/utils/middleware.py` | Request ID middleware. |
| `api/app/utils/security.py` | Lower-level security helpers used by the app. |
| File | Purpose |
|---|---|
| `pipelines/__init__.py` | Root pipelines package marker. |
| `pipelines/personality/__init__.py` | Personality pipeline package marker. |
| `pipelines/personality/run_trait_inference.py` | Shared offline inference loader, feature prep, and prediction exporter for one trait. |
| File | Purpose |
|---|---|
| `pipelines/personality/agreeableness/__init__.py` | Trait package marker. |
| `pipelines/personality/agreeableness/data_pipeline.py` | Builds agreeableness train/test datasets and logs dataset metadata. |
| `pipelines/personality/agreeableness/inference_pipeline.py` | CLI inference wrapper for agreeableness. |
| `pipelines/personality/agreeableness/train_pipeline.py` | Trains, evaluates, and registers the agreeableness model. |
| File | Purpose |
|---|---|
| `pipelines/personality/conscientiousness/__init__.py` | Trait package marker. |
| `pipelines/personality/conscientiousness/data_pipeline.py` | Builds conscientiousness train/test datasets and logs dataset metadata. |
| `pipelines/personality/conscientiousness/inference_pipeline.py` | CLI inference wrapper for conscientiousness. |
| `pipelines/personality/conscientiousness/train_pipeline.py` | Trains, evaluates, and registers the conscientiousness model. |
| File | Purpose |
|---|---|
| `pipelines/personality/extraversion/__init__.py` | Trait package marker. |
| `pipelines/personality/extraversion/data_pipeline.py` | Builds extraversion train/test datasets and logs dataset metadata. |
| `pipelines/personality/extraversion/inference_pipeline.py` | CLI inference wrapper for extraversion. |
| `pipelines/personality/extraversion/train_pipeline.py` | Trains, evaluates, and registers the extraversion model. |
| File | Purpose |
|---|---|
| `pipelines/personality/neuroticism/__init__.py` | Trait package marker. |
| `pipelines/personality/neuroticism/data_pipeline.py` | Builds neuroticism train/test datasets and logs dataset metadata. |
| `pipelines/personality/neuroticism/inference_pipeline.py` | CLI inference wrapper for neuroticism. |
| `pipelines/personality/neuroticism/train_pipeline.py` | Trains, evaluates, and registers the neuroticism model. |
| File | Purpose |
|---|---|
| `pipelines/personality/openness/__init__.py` | Trait package marker. |
| `pipelines/personality/openness/data_pipeline.py` | Builds openness train/test datasets and logs dataset metadata. |
| `pipelines/personality/openness/inference_pipeline.py` | CLI inference wrapper for openness. |
| `pipelines/personality/openness/train_pipeline.py` | Trains, evaluates, and registers the openness model. |
| File | Purpose |
|---|---|
| `src/__init__.py` | Root source package marker. |
| `src/extractors/__init__.py` | Extractor package marker. |
| `src/extractors/base.py` | Base extractor abstractions. |
| `src/extractors/models.py` | Data models for extractor accounts and extracted records. |
| `src/extractors/reddit.py` | Reddit post extraction logic. |
| `src/extractors/registry.py` | Extractor registry and lookup helpers. |
| `src/extractors/service.py` | Higher-level service wrapper over extractor implementations. |
| `src/extractors/storage.py` | Persistence or caching helpers for extracted bundles. |
| File | Purpose |
|---|---|
| `src/infra/__init__.py` | Reserved package marker for infrastructure-layer helpers. |
| File | Purpose |
|---|---|
| `src/inference/__init__.py` | Shared inference package marker. |
| `src/inference/artifact_resolver.py` | Resolves model URIs, with MLflow registry as the supported path. |
| `src/inference/base_engine.py` | Shared MLflow-backed inference engine for loading models and predicting. |
| `src/inference/schema_registry.py` | Resolves feature and target schemas for inference scopes. |
| File | Purpose |
|---|---|
| `src/personality/__init__.py` | Personality ML package marker. |
| `src/personality/bigfive_scores.py` | Transformer-based Big Five scoring from text. |
| `src/personality/data_ingestion.py` | CSV and JSON dataset ingestion utilities. |
| `src/personality/data_parser.py` | Parsers for score payloads, embeddings, and related structured fields. |
| `src/personality/data_preprocess.py` | Text cleaning, column dropping, emoji replacement, and concatenation transforms. |
| `src/personality/data_splitter.py` | Author-aware longitudinal train/test splitting logic. |
| `src/personality/delta_targets.py` | Future delta target creation, label generation, and NaN handling. |
| `src/personality/feature_scaling.py` | Feature scaling utilities for model inputs. |
| `src/personality/inference_engine.py` | Personality-specific wrapper over the shared inference engine. |
| `src/personality/inference_feature_builder.py` | Builds model-ready weekly inference features directly from raw posts. |
| `src/personality/lagged_features.py` | Rolling, lag, and shift feature generators. |
| `src/personality/model_building.py` | Factory/builders for supported classifier types. |
| `src/personality/model_evaluation.py` | Evaluation metrics and report persistence. |
| `src/personality/model_inference.py` | Backward-compatible alias over the shared inference engine. |
| `src/personality/model_training.py` | Simple model fit/save/load wrapper. |
| `src/personality/preprocessing.py` | Shared inference preprocessing flow for Big Five pipelines. |
| `src/personality/smote_preprocessing.py` | Oversampling or imbalance-preparation helpers for training data. |
| `src/personality/vector_embedding.py` | BERT-style embedding generation utilities. |
| `src/personality/weekly_preprocess.py` | Weekly trait and embedding aggregation plus gap-week calculations. |
| File | Purpose |
|---|---|
| `src/services/__init__.py` | Service package marker. |
| `src/services/domain_inference_utils.py` | Shared helpers for domain inference orchestration. |
| `src/services/personality_signup_inference.py` | Runs all five trait predictions for one signup job and persists results. |
| `src/services/reddit_extractor.py` | Loads static, cached, or live Reddit bundles and optionally enriches media summaries. |
| File | Purpose |
|---|---|
| `src/workers/__init__.py` | Worker package marker. |
| `src/workers/signup_inference_consumer.py` | Database-polling worker that executes queued signup inference jobs. |
| File | Purpose |
|---|---|
| `scripts/down_full_stack.sh` | Stops the compose-based local stack. |
| `scripts/export_signup_flow_predictions.py` | Exports stored signup prediction results. |
| `scripts/extract_reddit_posts.py` | Utility script for collecting Reddit posts outside the API. |
| `scripts/migrate_db.sh` | Database migration/bootstrap helper script. |
| `scripts/run_local.sh` | Starts the local compose stack with optional overlays. |
| `scripts/save_signup_predictions_direct.py` | Direct persistence/backfill helper for signup predictions. |
| `scripts/up_full_stack.sh` | Starts the local stack and prints key service URLs. |
| File | Purpose |
|---|---|
| `sql/001_init.sql` | Initial SQL bootstrap for the persistence schema. |
| `sql/README.md` | Short notes about the SQL bootstrap file. |
| File | Purpose |
|---|---|
| `tests/test_extractors.py` | Coverage for extractor behavior. |
| `tests/test_reddit_oauth_auth.py` | Coverage for Reddit OAuth/auth-related behavior. |
| File | Purpose |
|---|---|
| `utils/__init__.py` | Utility package marker. |
| `utils/config.py` | YAML-backed configuration accessors used across API, training, and inference. |
| `utils/mlflow_registry.py` | Minimal MLflow registry URI setup and resolution helpers for serving. |
| `utils/mlflow_utils.py` | Rich MLflow experiment, logging, and registration helper utilities. |
| `utils/model_loader.py` | Generic model loading helper utilities. |
If you are new to the repository, read it in this order:
1. `config.yaml`
2. `api/app/main.py`
3. `api/app/api/v1/routes/users.py`
4. `api/app/services/signup_inference_service.py`
5. `src/workers/signup_inference_consumer.py`
6. `src/services/personality_signup_inference.py`
7. `src/personality/inference_feature_builder.py`
8. `src/inference/base_engine.py`
9. one trait pipeline under `pipelines/personality/openness/`
10. `utils/mlflow_utils.py`
That path shows the system from request handling through prediction persistence and then back to training and MLOps.
- The repository still contains some generic naming around “social” or extraction abstractions even though the active provider path is Reddit.
- The worker is queue-less in the external sense; it polls the database rather than consuming Kafka or a dedicated broker.
- Inference requires MLflow registry availability for the configured models and stage.
- Lagged weekly features mean single isolated weeks are generally not enough for full inference output.
- Legacy extraction and vision routes remain in the API, but the personality signup flow is the main maintained path.
This repository is a personality-focused ML application that combines:
- Reddit data extraction
- Big Five feature generation
- weekly longitudinal feature engineering
- per-trait direction classification
- MLflow-based training and registry management
- FastAPI serving
- Postgres-backed job and prediction persistence
- Docker-based local deployment
If you need to extend or debug the project, start from config.yaml, then follow the signup inference path and one trait pipeline end to end.