Companion code for the SocialSys'26 workshop paper "Towards Uncovering Indoor Satisfaction Profiles: LLM Capacity for Structured Extraction from Occupant Feedback."
This repository contains the analysis pipeline, paper source, and a synthetic demo dataset for a study that combines latent profile analysis (LPA) of occupant satisfaction ratings with locally deployed LLM extraction of structured complaint dimensions (tone, severity, attribution, work impact) from free-text feedback.
The published numerical results — eight occupant profiles, model-size thresholds for kappa ≥ 0.6 with human coders, and the classification benchmark showing text/rating orthogonality — were produced on the licensed CBE Occupant Survey database, which cannot be redistributed. The repository ships a fully synthetic dataset so that the pipeline can be exercised end-to-end on a fresh clone.
office_profiler/
├── _targets.R # targets pipeline definition
├── R/ # pure functions sourced by targets
├── scripts/ # one-shot scripts for the real-data path
├── data/
│ ├── raw/README.md # how to obtain the real CBE database
│ └── synthetic/ # synthetic demo dataset + generator
├── paper/ # ACM sigconf source for the workshop paper
│ ├── main-text.tex
│ ├── img/ # figures (regenerated by tar_make)
│ └── references.bib
├── LICENSE # MIT (code)
├── LICENSE-paper # CC-BY 4.0 (paper text + figures)
├── CITATION.cff # citation metadata
├── CONTRIBUTING.md # issue / PR guidance
├── renv.lock # locked package versions
└── .Rprofile # renv activation
The synthetic demo runs the full pipeline on simulated data without needing access to the CBE database. Targets that call Ollama use a pre-computed embedding cache, so the demo runs CPU-only.
git clone https://github.com/IEQLab/office_profiler.git
cd office_profiler
# in R, from the repo root
renv::restore() # install locked package versions
# regenerate the synthetic dataset (already checked in; this is optional)
source("data/synthetic/generate_synthetic.R")
targets::tar_make() # build all figures

After tar_make() completes, the figures are written to paper/img/:
- 6_validation_kappa.png: LLM-vs-human agreement by model size
- 8_classification_comparison.png: 5-model classification benchmark
- 1_fit_comparison.png, 1_classification_quality.png, 1_split_half_ari.png: LPA diagnostics
A successful demo run takes about 3–5 minutes on a recent laptop.
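One way to confirm that the demo finished cleanly is to check that the figures named above exist on disk. The following is a minimal base-R sketch, not part of the repository: the helper name `missing_figures` and the vector `expected_figures` are hypothetical, and the file list covers only the figures mentioned in this README, which may not be every file the pipeline writes.

```r
# Figure names copied from this README; the pipeline may write others too.
expected_figures <- c(
  "6_validation_kappa.png",
  "8_classification_comparison.png",
  "1_fit_comparison.png",
  "1_classification_quality.png",
  "1_split_half_ari.png"
)

# Hypothetical helper: returns the expected file names that are absent
# from img_dir (an empty character vector means all were found).
missing_figures <- function(img_dir = file.path("paper", "img"),
                            expected = expected_figures) {
  expected[!file.exists(file.path(img_dir, expected))]
}

# Run from the repo root after targets::tar_make().
missing <- missing_figures()
if (length(missing) == 0) {
  message("All expected figures are present.")
} else {
  message("Missing figures: ", paste(missing, collapse = ", "))
}
```

If any figure is reported missing, inspecting the pipeline state with the targets package (for example `targets::tar_progress()`) is a reasonable next step.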
- Request access to the CBE Occupant Survey database (see data/raw/README.md).
- Place the export at data/raw/db_all.rds.
- Pull the LLM and embedding models in Ollama:
  ollama pull gemma3:27b        # main extraction model (validated)
  ollama pull llama3.1:8b       # alt model variant for kappa benchmark
  ollama pull llama3.2:3b       # alt model variant
  ollama pull nomic-embed-text  # embedding model
- Run the real-data path:
  source("scripts/01-data.R") # cleans data/raw/db_all.rds
  # the LPA-driven profile assignments must already exist at
  # data/processed/df_profiles.rds. Run the LPA targets via
  # targets::tar_make(c(model_lpa, df_profiles)) and write df_profiles
  # from the model output, or supply them manually.
  source("scripts/02-llm-extraction.R") # runs Ollama, ~1 hour
  source("scripts/03-llm-validation.R") # kappa across models
  targets::tar_make() # rebuilds figures from real outputs
On a workstation with a single 24 GB GPU, the gemma3:27b extraction step takes roughly 60 minutes for ~6,000 stratified responses; embedding generation is cached after the first run.
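The validation step compares LLM labels against human coders using kappa. For readers unfamiliar with the statistic, here is a minimal base-R sketch of Cohen's kappa for two raters. It is an illustration only and not necessarily how scripts/03-llm-validation.R computes agreement; the function name `cohen_kappa` and the severity labels in the example are hypothetical.

```r
# Cohen's kappa for two raters who each assign one nominal label per item.
# kappa = (p_observed - p_expected) / (1 - p_expected)
# Returns NaN when both raters use a single identical label throughout,
# because chance agreement is then 1.
cohen_kappa <- function(rater_a, rater_b) {
  stopifnot(length(rater_a) == length(rater_b), length(rater_a) > 0)
  # Use a shared level set so the confusion table is square.
  levels_all <- sort(unique(c(as.character(rater_a), as.character(rater_b))))
  tab <- table(factor(rater_a, levels = levels_all),
               factor(rater_b, levels = levels_all))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                      # raw agreement
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical severity labels for eight comments (not real study data).
human <- c("low", "low", "high", "high", "medium", "medium", "low", "high")
llm   <- c("low", "low", "high", "medium", "medium", "medium", "low", "high")

kappa <- cohen_kappa(human, llm)
print(round(kappa, 3))  # 0.814 (7 of 8 labels agree; chance agreement 21/64)
print(kappa >= 0.6)     # TRUE: meets the 0.6 threshold used in the paper
```

Kappa corrects raw agreement for the agreement expected by chance given each rater's label frequencies, which is why it is lower than the 87.5% raw agreement in this toy example.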
- CPU-only: the synthetic demo runs in a few minutes on any modern laptop. The full real-data pipeline also works CPU-only, but the gemma3:27b extraction then takes many hours.
- GPU: an NVIDIA card with ≥ 16 GB VRAM (or an Apple Silicon machine with ≥ 32 GB unified memory) is recommended for the real-data path.
If you use this repository or the synthetic demo data, please cite the workshop paper (preferred) and the repository:
@inproceedings{parkinson2026towards,
title = {Towards Uncovering Indoor Satisfaction Profiles:
LLM Capacity for Structured Extraction from
Occupant Feedback},
author = {Parkinson, Thomas and Schiavon, Stefano and
Zhang, Wenhao and Miller, Clayton},
booktitle = {Proceedings of the SocialSys'26 Workshop},
year = {2026},
publisher = {ACM}
}

We thank the Center for the Built Environment (UC Berkeley) for access to the Occupant Survey database, and the ACM SocialSys'26 reviewers for their feedback.