@griffinsharps (Contributor) commented Dec 16, 2025

Methodology Update: Log-Ratio Regression on Block Group Composition

Summary

This PR refactors Stage 2 regression to use log-ratio regression instead of multinomial logistic regression, addressing feedback on the appropriate unit of analysis for demographic modeling.

What Changed

Previous Approach (Multinomial Logit)

  • Unit of analysis: Individual household-day observations
  • Question answered: "Given a household's demographics, what cluster is it in?"
  • Model: Multinomial logistic regression with frequency weights
  • Interpretation: Odds ratios for cluster membership

New Approach (Log-Ratio Regression)

  • Unit of analysis: Block group (aggregated)
  • Question answered: "How do block group demographics affect the distribution of clusters?"
  • Model: Separate linear regressions on log(proportion_i / proportion_baseline)
  • Interpretation: Multiplicative effects on cluster composition ratios
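As a rough illustration of the new approach, the sketch below fits one linear regression per non-baseline cluster on log(proportion_i / proportion_baseline), using a single block group predictor. It is not the code in this PR: the field names (`median_income_k`, `shares`), the cluster labels, the example numbers, and the small `eps` added to avoid log(0) are all assumptions made for illustration, and the PR does not say how zero proportions are handled.

```python
import math


def log_ratio_targets(shares, baseline, eps=1e-6):
    """Return log(share_k / share_baseline) for every non-baseline cluster.

    eps is a hypothetical guard against zero shares; it is not taken from the PR.
    """
    base = shares[baseline] + eps
    return {k: math.log((p + eps) / base) for k, p in shares.items() if k != baseline}


def fit_line(x, y):
    """Ordinary least squares for a single predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation across block groups")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_log_ratio_models(rows, predictor, baseline, eps=1e-6):
    """Fit a separate regression per non-baseline cluster (one row per block group)."""
    x = [row[predictor] for row in rows]
    targets = [log_ratio_targets(row["shares"], baseline, eps) for row in rows]
    models = {}
    for cluster in targets[0]:
        intercept, slope = fit_line(x, [t[cluster] for t in targets])
        models[cluster] = {
            "intercept": intercept,
            "slope": slope,
            # exp(slope): multiplicative change in share_k / share_baseline
            # for a one-unit increase in the predictor.
            "ratio_multiplier": math.exp(slope),
        }
    return models


# Made-up block groups: one predictor and cluster shares that sum to 1.
rows = [
    {"median_income_k": 40.0, "shares": {"c0": 0.50, "c1": 0.30, "c2": 0.20}},
    {"median_income_k": 60.0, "shares": {"c0": 0.40, "c1": 0.35, "c2": 0.25}},
    {"median_income_k": 80.0, "shares": {"c0": 0.30, "c1": 0.40, "c2": 0.30}},
    {"median_income_k": 100.0, "shares": {"c0": 0.25, "c1": 0.40, "c2": 0.35}},
]

models = fit_log_ratio_models(rows, "median_income_k", baseline="c0")
for cluster, m in models.items():
    print(f"{cluster} vs c0: slope={m['slope']:.4f}, ratio x{m['ratio_multiplier']:.4f} per unit")
```

With several predictors, the same idea would use a multiple regression per log-ratio. The interpretation stays the same: each exponentiated coefficient is a multiplicative effect on a cluster composition ratio.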

Infrastructure Improvements (high-volume processing of the raw AWS CSVs)

This PR also includes memory-efficient processing improvements for large datasets:

  • Multi-file input support for clustering preparation
  • Batched CSV processing (handles 30k+ files without OOM)
  • Block group sampling density diagnostic tool
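The batched CSV processing mentioned above could look roughly like the following. This is a minimal standard-library sketch, not the script from this PR: the function names, the output layout (one combined CSV per batch), and the assumption that every input file shares the same columns are hypothetical. The point it illustrates is that only one row is held in memory at a time, so peak memory does not grow with the number of files.

```python
import csv
from itertools import islice
from pathlib import Path


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def iter_rows(paths):
    """Stream rows from each CSV in turn without loading whole files."""
    for path in paths:
        with open(path, newline="") as handle:
            yield from csv.DictReader(handle)


def process_in_batches(paths, out_dir, batch_size=5000):
    """Write one combined CSV per batch of input files; return the output paths.

    Assumes (hypothetically) that all inputs have identical column names.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for index, batch in enumerate(batched(paths, batch_size)):
        out_path = out_dir / f"batch_{index:04d}.csv"
        writer = None
        with open(out_path, "w", newline="") as out:
            for row in iter_rows(batch):
                if writer is None:
                    # Header comes from the first row seen in this batch.
                    writer = csv.DictWriter(out, fieldnames=list(row.keys()))
                    writer.writeheader()
                writer.writerow(row)
        outputs.append(out_path)
    return outputs
```

A call such as `process_in_batches(sorted(Path("raw").glob("*.csv")), "batches")` would then produce one output file per 5,000 inputs; the directory names here are placeholders.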

Griffin Sharps added 2 commits December 11, 2025 22:40
Infrastructure improvements for handling full-year data within an 8GB container:

Added:
- scripts/process_csvs_batched_optimized.py: Batched CSV processing (5k files/batch)
  * Prevents OOM on 30k+ file datasets
  * Streams through data without full materialization
  * Tested on July 2023 (30,027 files)

- tests/diagnose_bg_density_v2.py: Block-group sampling density diagnostic
  * Checks household density per block group
  * Validates cluster share variance
  * Provides recommendations for sample size
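The checks described for the density diagnostic can be sketched as below. This is an illustrative outline only, not the contents of tests/diagnose_bg_density_v2.py: the record fields (`block_group`, `household_id`, `cluster`), the function names, and the `min_households` threshold of 30 are assumptions.

```python
from collections import Counter, defaultdict
from statistics import pvariance


def household_density(records):
    """Count distinct households observed in each block group."""
    households = defaultdict(set)
    for rec in records:
        households[rec["block_group"]].add(rec["household_id"])
    return {bg: len(ids) for bg, ids in households.items()}


def cluster_shares(records):
    """Share of observations falling in each cluster, per block group."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["block_group"]][rec["cluster"]] += 1
    return {
        bg: {cluster: n / sum(c.values()) for cluster, n in c.items()}
        for bg, c in counts.items()
    }


def share_variance(shares):
    """Population variance of each cluster's share across block groups."""
    clusters = {cluster for s in shares.values() for cluster in s}
    return {
        cluster: pvariance([s.get(cluster, 0.0) for s in shares.values()])
        for cluster in sorted(clusters)
    }


def diagnose(records, min_households=30):
    """Flag sparsely sampled block groups (threshold is a hypothetical default)."""
    density = household_density(records)
    sparse = sorted(bg for bg, n in density.items() if n < min_households)
    return {
        "density": density,
        "share_variance": share_variance(cluster_shares(records)),
        "sparse_block_groups": sparse,
    }
```

Block groups flagged as sparse yield noisy cluster proportions, which matters once those proportions become the regression targets.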

Modified:
- analysis/clustering/prepare_clustering_data_households.py: Multi-file input support
  * Accepts multiple parquet files via --input file1 file2 ...
  * Loads and merges manifests from all input files
  * Enables analysis on split datasets (e.g., jan_nov.parquet + dec.parquet)

- .gitignore: Exclude generated Quarto HTML files
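For the multi-file input support noted above, a command-line interface that accepts `--input file1 file2 ...` can be expressed with `argparse` and `nargs="+"`. The sketch below shows only that parsing step; the parser description, the help text, and the `unique_paths` helper are hypothetical. Loading the parquet files and merging their manifests is left out because the manifest format is not described in this PR.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser that accepts one or more input files after --input."""
    parser = argparse.ArgumentParser(
        description="Prepare clustering data from one or more parquet files."
    )
    parser.add_argument(
        "--input",
        nargs="+",          # one or more paths after a single --input flag
        required=True,
        type=Path,
        metavar="PARQUET",
        help="Input parquet file(s), e.g. jan_nov.parquet dec.parquet",
    )
    return parser


def unique_paths(paths):
    """Drop repeated paths while keeping the order given on the command line."""
    return list(dict.fromkeys(paths))


if __name__ == "__main__":
    args = build_parser().parse_args()
    for path in unique_paths(args.input):
        print(f"would load {path}")
```

Invoked as `--input jan_nov.parquet dec.parquet`, `args.input` is a list of two `Path` objects that can then be loaded and merged in order.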

This checkpoint precedes the Stage 2 regression refactor (multinomial logit → log-ratio regression).
@griffinsharps griffinsharps linked an issue Dec 16, 2025 that may be closed by this pull request
@griffinsharps (Contributor, Author)

Duplicate of #44, closing

@griffinsharps griffinsharps deleted the 45-refactoring-to-log-ratio-regression branch December 16, 2025 21:25