@griffinsharps
Contributor

Methodology Update: Log-Ratio Regression on Block Group Composition

Summary

This PR refactors Stage 2 regression to use log-ratio regression instead of multinomial logistic regression, addressing feedback on the appropriate unit of analysis for demographic modeling.

What Changed

Previous Approach (Multinomial Logit)

  • Unit of analysis: Individual household-day observations
  • Question answered: "Given a household's demographics, what cluster is it in?"
  • Model: Multinomial logistic regression with frequency weights
  • Interpretation: Odds ratios for cluster membership

New Approach (Log-Ratio Regression)

  • Unit of analysis: Block group (aggregated)
  • Question answered: "How do block group demographics affect the distribution of clusters?"
  • Model: Separate linear regressions, one per non-baseline cluster i, on log(proportion_i / proportion_baseline)
  • Interpretation: Multiplicative effects on cluster composition ratios
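
For readers new to the approach, here is a minimal sketch of the idea, not the code in stage2_logratio_regression.py. It assumes one (block group, cluster) pair per household-day, a single numeric predictor, and a pseudocount to avoid log(0). The function names, the pseudocount, and the one-predictor OLS are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def log_ratios(assignments, baseline, clusters, min_obs=10, pseudocount=0.5):
    """Return {block_group: {cluster: log(share_cluster / share_baseline)}}.

    assignments: iterable of (block_group_id, cluster_label) pairs.
    pseudocount: added to every count so an empty cluster does not give log(0);
                 this smoothing choice is an assumption, not taken from the PR.
    """
    counts = defaultdict(Counter)
    for block_group, cluster in assignments:
        counts[block_group][cluster] += 1

    out = {}
    for block_group, c in counts.items():
        if sum(c.values()) < min_obs:
            continue  # drop sparsely sampled block groups
        # Shares are counts / total, so the totals cancel in the ratio.
        base = c[baseline] + pseudocount
        out[block_group] = {
            k: math.log((c[k] + pseudocount) / base)
            for k in clusters
            if k != baseline
        }
    return out


def ols_1d(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b
```

Fitting ols_1d once per non-baseline cluster, with the block-group predictor as x and that cluster's log-ratio as y, gives a slope b where exp(b) is the multiplicative change in the cluster-to-baseline ratio per unit of the predictor.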

Infrastructure Improvements (handling high-volume processing of the raw AWS CSVs)

Also includes memory-efficient processing improvements for large datasets:

  • Multi-file input support for clustering preparation
  • Batched CSV processing (handles 30k+ files without OOM)
  • Block group sampling density diagnostic tool
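
A rough sketch of the batching pattern, assuming the goal is to bound memory by touching only one batch of files at a time and writing a small intermediate result per batch. The function names and the row-count "summary" are placeholders; the real logic in process_csvs_batched_optimized.py is not reproduced here.

```python
import csv
from itertools import islice
from pathlib import Path


def batched(paths, batch_size):
    """Yield successive lists of at most batch_size paths."""
    it = iter(paths)
    while batch := list(islice(it, batch_size)):
        yield batch


def process_in_batches(paths, batch_size, out_dir):
    """Process each batch on its own and write one small output file per batch."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, batch in enumerate(batched(paths, batch_size)):
        n_rows = 0
        for path in batch:
            with open(path, newline="") as fh:
                reader = csv.reader(fh)
                next(reader, None)  # skip the header row
                # Stream row by row; no file is ever fully held in memory.
                n_rows += sum(1 for _ in reader)
        part = out_dir / f"batch_{i:04d}.csv"
        with open(part, "w", newline="") as fh:
            csv.writer(fh).writerows([["n_files", "n_rows"], [len(batch), n_rows]])
        written.append(part)
    return written
```

Peak memory then depends on the batch size and the per-batch result, not on the total number of input files.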

Griffin Sharps added 2 commits December 11, 2025 22:40
Infrastructure improvements for handling full-year data within an 8 GB container:

Added:
- scripts/process_csvs_batched_optimized.py: Batched CSV processing (5k files/batch)
  * Prevents OOM on 30k+ file datasets
  * Streams through data without full materialization
  * Tested on July 2023 (30,027 files)

- tests/diagnose_bg_density_v2.py: Block-group sampling density diagnostic
  * Checks household density per block group
  * Validates cluster share variance
  * Provides recommendations for sample size

Modified:
- analysis/clustering/prepare_clustering_data_households.py: Multi-file input support
  * Accepts multiple parquet files via --input file1 file2 ...
  * Loads and merges manifests from all input files
  * Enables analysis on split datasets (e.g., jan_nov.parquet + dec.parquet)

- .gitignore: Exclude generated Quarto HTML files

This checkpoint precedes the Stage 2 regression refactor (multinomial logit → log-ratio regression).
…Batch clustering

Changed Stage 2 from a household-level multinomial logit to a block-group-level
log-ratio regression on cluster composition (measured in household-day units).
Added MiniBatch K-means for memory-efficient clustering of 100k+ households.
Updated the pipeline to support large-scale analysis with streaming and chunked
processing throughout.
@griffinsharps
Contributor Author

This commit is a bit messy, so here is a short breakdown of the intended usage.

Pipeline Infrastructure

prepare_clustering_data_households.py

  • Enhanced chunked streaming mode for large datasets
  • Improved memory tracking and logging
  • Better manifest handling for multi-file inputs
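
To illustrate the multi-file interface, here is a small sketch of how several --input paths could be accepted and their manifests combined. It assumes a manifest maps an account ID to the set of dates seen for it; that structure, the helper names, and the default chunk size are hypothetical and may not match the script.

```python
import argparse


def build_parser():
    """Argument parser sketch; only flags shown in this PR are included."""
    parser = argparse.ArgumentParser(description="Prepare clustering data (sketch)")
    # nargs="+" collects one or more paths into a list: --input a.parquet b.parquet
    parser.add_argument("--input", nargs="+", required=True)
    parser.add_argument("--streaming", action="store_true")
    # Default is an assumption; the PR only shows 300 being passed explicitly.
    parser.add_argument("--chunk-size", type=int, default=300)
    return parser


def merge_manifests(manifests):
    """Union per-file manifests of the assumed form {account_id: set_of_dates}."""
    merged = {}
    for manifest in manifests:
        for account, dates in manifest.items():
            merged.setdefault(account, set()).update(dates)
    return merged
```

With this shape, an account that appears in both jan_nov.parquet and dec.parquet ends up with a single entry covering the dates from both files.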

Supporting modules:

  • aws_loader.py: Enhanced S3 batch operations
  • manifests.py: Improved caching for account/date manifests
  • process_csvs_batched_optimized.py: Batch processing optimizations

Workflow

1. Prepare household-day profiles

python analysis/clustering/prepare_clustering_data_households.py \
    --input data/july_2023/month_07.parquet \
    --output-dir data/runs/run_name/phase1_train_31d \
    --sample-households 100000 \
    --sample-days 31 \
    --streaming \
    --chunk-size 300

2. Cluster profiles (use MiniBatch for 100k+ households)

python analysis/clustering/euclidean_clustering_minibatch.py \
    --input data/runs/run_name/phase1_train_31d/sampled_profiles.parquet \
    --output-dir data/runs/run_name/phase2_k4 \
    --k 4 \
    --normalize \
    --batch-size 50000
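
For context on why the mini-batch variant helps with memory, here is a dependency-free sketch of the mini-batch K-means update (per-center running means over sampled batches). It is a teaching example under stated assumptions, not the code in euclidean_clustering_minibatch.py, whose internals are not shown in this PR. In practice a library implementation such as scikit-learn's MiniBatchKMeans would be the usual choice.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )


def minibatch_kmeans(points, k, batch_size, n_iter, seed=0):
    """Return k centers fitted by mini-batch K-means.

    Each iteration samples one batch, assigns its points to the current
    centers, then moves each center toward its assigned points with a
    per-center learning rate of 1 / (points seen so far by that center).
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        assigned = [nearest(p, centers) for p in batch]  # assign first, update after
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers
```

Only one batch is used per update, so the working set is governed by the batch size (the role --batch-size plays above) and not by the full number of profiles.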

3. Block-group log-ratio regression

python analysis/clustering/stage2_logratio_regression.py \
    --clusters data/runs/run_name/phase2_k4/cluster_assignments.parquet \
    --crosswalk data/reference/2023_comed_zip4_census_crosswalk.txt \
    --census-cache data/reference/census_17_2023.parquet \
    --output-dir data/runs/run_name/stage2_logratio_k4 \
    --baseline-cluster 1 \
    --min-obs-per-bg 10

@griffinsharps griffinsharps requested a review from mshron December 18, 2025 16:48
@griffinsharps griffinsharps linked an issue Dec 18, 2025 that may be closed by this pull request
…ient stop-gaps to be corrected in immediate future.
@griffinsharps griffinsharps merged commit 7733692 into main Dec 20, 2025
7 checks passed

Development

Successfully merging this pull request may close these issues.

Refactoring to Log-Ratio Regression