diff --git a/.claude/projects/-home-nes-Documents-MAenvironmentaldata/memory/env_threshold_review.md b/.claude/projects/-home-nes-Documents-MAenvironmentaldata/memory/env_threshold_review.md new file mode 100644 index 0000000..f82f183 --- /dev/null +++ b/.claude/projects/-home-nes-Documents-MAenvironmentaldata/memory/env_threshold_review.md @@ -0,0 +1,14 @@ +--- +name: env-threshold-review +description: Pending task to revisit environmental relevance threshold after GC fix re-embed, with documented artifact +metadata: + type: project +--- + +After the GC formula fix and full re-embed (May 2026), the environmental bill count jumped from 329 → 654 at threshold=0.05. This needs a calibration review. + +**Task:** Re-run the threshold analysis — plot the score distribution, spot-check bills near the new boundary, and decide whether 0.05 is still correct or needs adjustment. Document the exercise in a written artifact (analysis page or data note) explaining: the differential cosine similarity method, the reference sets, how the threshold was chosen, and what the before/after counts were at various thresholds. + +**Why:** The doubling of env bill count is plausible (correct body text adds real signal) but should be verified with spot-checks. Some new bills at 0.05–0.08 may be genuine env bills the old wrong-GC embeddings missed; others may be false positives from body text that semantically resembles env topics without being env legislation. + +**Related:** [[project_data_pipeline]] — score_lobbying_bills.py ENV_THRESHOLD constant; [[ai_analysis_feature]] — env bill counts flow into the AMEND.db and dashboard. 
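The threshold review described in the note above could start from a sweep like the one below. This is a minimal sketch, not the project's calibration code: `threshold_sweep` and `boundary_sample` are hypothetical helper names, the `>=` comparison against the threshold is an assumption about how `ENV_THRESHOLD` is applied, and the demo scores are synthetic. Only the `env_relevance_score` column name is taken from the plotting script in this diff.

```python
import pandas as pd


def threshold_sweep(scores: pd.Series, thresholds) -> pd.DataFrame:
    """Count bills scoring at or above each candidate threshold.

    Assumes a bill counts as environmental when score >= threshold.
    """
    s = scores.dropna()
    thresholds = list(thresholds)
    return pd.DataFrame({
        'threshold': thresholds,
        'n_env': [int((s >= t).sum()) for t in thresholds],
    })


def boundary_sample(df: pd.DataFrame, lo: float = 0.05, hi: float = 0.08,
                    n: int = 20, seed: int = 42) -> pd.DataFrame:
    """Random sample of bills scoring in [lo, hi) for manual spot-checking."""
    band = df[(df['env_relevance_score'] >= lo) & (df['env_relevance_score'] < hi)]
    return band.sample(n=min(n, len(band)), random_state=seed)


if __name__ == '__main__':
    # Synthetic scores for illustration only; real scores would come from
    # the scored-bills table or the embeddings Parquet file.
    demo = pd.DataFrame({
        'bill_title': [f'Bill {i}' for i in range(6)],
        'env_relevance_score': [0.01, 0.04, 0.05, 0.06, 0.09, 0.20],
    })
    print(threshold_sweep(demo['env_relevance_score'], [0.03, 0.05, 0.08, 0.10]))
    print(boundary_sample(demo, n=2))
```

Running the sweep on both the old and the re-embedded scores would give the before/after counts the note asks for, and the boundary sample gives a fixed, reproducible set of bills to read by hand.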
diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock new file mode 100644 index 0000000..3e2ef1e --- /dev/null +++ b/.claude/scheduled_tasks.lock @@ -0,0 +1 @@ +{"sessionId":"d2fc4fd3-d9ff-44a4-a608-05ddeb65f47b","pid":1140968,"procStart":"103407381","acquiredAt":1779884724489} \ No newline at end of file diff --git a/analysis/MA_lobbying_tsne.py b/analysis/MA_lobbying_tsne.py new file mode 100644 index 0000000..43f21be --- /dev/null +++ b/analysis/MA_lobbying_tsne.py @@ -0,0 +1,236 @@ +"""Generate a UMAP scatter plot of MA lobbying bill embeddings. + +Visual design philosophy +───────────────────────── +MA legislative bill embeddings are semantically dense — all bills share heavy +regulatory language, so inter-cluster cosine distances are ~0.006 vs. +intra-cluster spread of ~0.53. Running t-SNE on all 25k bills produces a +featureless blob regardless of perplexity, because the structure simply doesn't +separate in 2-D. + +UMAP is used instead of t-SNE because it better preserves global structure, +pulling weakly-separated clusters apart more effectively than t-SNE's purely +local optimisation. Parameters: n_neighbors=30, min_dist=0.1, metric='cosine'. + +The chart shows TWO layers: + + Background (grey) — stratified sample of ~120 non-environmental bills per + cluster, rendered as tiny translucent grey dots. Provides + geographic context for the policy landscape. + + Signal (coloured) — all env-relevant bills (~654), one colour per cluster, + large outlined dots. These are what the visitor cares about. + +UMAP is computed on the combined ~3,650 point sample (all env + background), +which runs in ~30s and produces cleaner structure than t-SNE on this corpus. 
+ +Run from the analysis/ directory: + /path/to/python -u MA_lobbying_tsne.py + +Outputs: + ../docs/_includes/charts/lobbying_bill_tsne.html +""" + +import sys +from pathlib import Path + +import numpy as np +import pandas as pd +import umap +from sklearn.preprocessing import normalize +import plotly.graph_objects as go + +sys.path.insert(0, str(Path(__file__).parent)) + +GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet' +LOCAL_PARQUET = Path('../docs/data/MA_bill_embeddings.parquet') +LABELS_CSV = Path('../docs/data/MA_bill_cluster_labels.csv') +OUT_HTML = Path('../docs/_includes/charts/lobbying_bill_tsne.html') + +# Non-env bills sampled per cluster for background context. +# 120 × 25 clusters ≈ 3 000 background points + ~654 env ≈ 3 650 total. +BG_PER_CLUSTER = 120 +RANDOM_STATE = 42 + +# UMAP hyperparameters +UMAP_N_NEIGHBORS = 30 # larger → more global structure +UMAP_MIN_DIST = 0.1 # smaller → tighter clusters +UMAP_METRIC = 'cosine' + +# 25-colour palette — qualitative, perceptually distinct, no cycling +PALETTE_25 = [ + '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', + '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf', + '#aec7e8', '#ffbb78', '#98df8a', '#ff9896', '#c5b0d5', + '#c49c94', '#f7b6d2', '#c7c7c7', '#dbdb8d', '#9edae5', + '#393b79', '#637939', '#8c6d31', '#843c39', '#7b4173', +] + + +def _load_parquet() -> pd.DataFrame: + try: + import gcsfs + fs = gcsfs.GCSFileSystem() + if fs.exists(GCS_PARQUET): + with fs.open(GCS_PARQUET, 'rb') as f: + df = pd.read_parquet(f) + print(f'Loaded {len(df)} rows from {GCS_PARQUET}') + return df + except Exception as e: + print(f'GCS load failed ({e}), trying local...') + if LOCAL_PARQUET.exists(): + df = pd.read_parquet(LOCAL_PARQUET) + print(f'Loaded {len(df)} rows from local Parquet') + return df + raise FileNotFoundError('No Parquet file found.
Run score_lobbying_bills.py first.') + + +def main(): + parquet_df = _load_parquet() + + # Restrict to clustered bills + parquet_df = parquet_df[ + parquet_df['cluster_id'].notna() & (parquet_df['cluster_id'] != -1) + ].copy() + parquet_df['cluster_id'] = parquet_df['cluster_id'].astype(int) + + if 'is_environmental' not in parquet_df.columns: + parquet_df['is_environmental'] = False + parquet_df['is_environmental'] = parquet_df['is_environmental'].fillna(False).astype(bool) + + labels_df = pd.read_csv(LABELS_CSV, engine='python', on_bad_lines='skip') + # example_titles may contain unquoted commas that corrupt row parsing; + # keep only rows with a valid integer cluster_id. + labels_df = labels_df[ + pd.to_numeric(labels_df['cluster_id'], errors='coerce').notna() + ].copy() + labels_df['cluster_id'] = labels_df['cluster_id'].astype(int) + label_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['label'])) + nenv_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['n_env_bills'])) + + # ── Build subsample ────────────────────────────────────────────────────── + # Keep ALL env bills; sample BG_PER_CLUSTER non-env bills per cluster. 
+ + env_df = parquet_df[parquet_df['is_environmental']].copy() + non_env = parquet_df[~parquet_df['is_environmental']] + + rng = np.random.default_rng(RANDOM_STATE) + bg_parts = [] + for cid in sorted(non_env['cluster_id'].unique()): + sub = non_env[non_env['cluster_id'] == cid] + n = min(BG_PER_CLUSTER, len(sub)) + bg_parts.append(sub.sample(n=n, random_state=int(rng.integers(0, 2**31)))) + + bg_df = pd.concat(bg_parts, ignore_index=True) + sample = pd.concat([env_df, bg_df], ignore_index=True) + print(f'Subsample: {len(env_df)} env + {len(bg_df)} background = {len(sample)} total') + + # ── Embeddings ─────────────────────────────────────────────────────────── + emb = np.vstack(sample['embedding'].apply( + lambda v: np.array(v, dtype=np.float32) + ).values) + emb_norm = normalize(emb, norm='l2') + + # ── UMAP ───────────────────────────────────────────────────────────────── + print(f'Running UMAP (n={len(sample)}, n_neighbors={UMAP_N_NEIGHBORS}, ' + f'min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC})...') + reducer = umap.UMAP( + n_components=2, + n_neighbors=UMAP_N_NEIGHBORS, + min_dist=UMAP_MIN_DIST, + metric=UMAP_METRIC, + random_state=RANDOM_STATE, + low_memory=False, + ) + coords = reducer.fit_transform(emb_norm) + sample = sample.copy() + sample['x'] = coords[:, 0] + sample['y'] = coords[:, 1] + + # ── Build Plotly figure ────────────────────────────────────────────────── + fig = go.Figure() + + bg = sample[~sample['is_environmental']] + envs = sample[sample['is_environmental']] + + # Layer 1 — grey background (all non-env, single trace for performance) + fig.add_trace(go.Scatter( + x=bg['x'], y=bg['y'], + mode='markers', + marker=dict(color='#aaaaaa', size=4, opacity=0.20), + name='Non-environmental bills', + hovertext=[ + f'{row.get("bill_title", "")}<br>
' + f'GC {int(row["general_court"])} · {label_map.get(int(row["cluster_id"]), "")}' + for _, row in bg.iterrows() + ], + hoverinfo='text', + showlegend=True, + legendgroup='bg', + legendgrouptitle=dict(text='Background'), + )) + + # Layer 2 — env bills, one trace per cluster that has any env bills + env_cluster_ids = sorted(envs['cluster_id'].unique()) + for i, cid in enumerate(env_cluster_ids): + sub = envs[envs['cluster_id'] == cid] + lbl = label_map.get(cid, f'Cluster {cid}') + nenv = nenv_map.get(cid, len(sub)) + color = PALETTE_25[cid % len(PALETTE_25)] + + fig.add_trace(go.Scatter( + x=sub['x'], y=sub['y'], + mode='markers', + marker=dict( + color=color, size=11, opacity=0.92, + line=dict(color='black', width=1.2), + ), + name=f'{lbl} ({nenv} env)', + hovertext=[ + f'{row.get("bill_title", "")}<br>
' + f'GC {int(row["general_court"])} · 🌿 environmental<br>
' + f'Cluster: {lbl}<br>
' + f'Score: {row.get("env_relevance_score", float("nan")):.3f}' + for _, row in sub.iterrows() + ], + hoverinfo='text', + showlegend=True, + legendgroup='env', + legendgrouptitle=dict(text='Environmental bills by cluster') if i == 0 else dict(text=''), + )) + + fig.update_layout( + title=dict( + text=( + 'MA Lobbying Bills — Environmental Bills in the Policy Landscape' + f'<br>
Coloured = {len(envs)} environmentally-relevant bills · ' + f'grey = background sample ({len(bg):,} non-env) · ' + 'colour = topic cluster · hover for details · UMAP projection' + ), + font=dict(size=13), + ), + xaxis=dict(visible=False), + yaxis=dict(visible=False), + legend=dict( + font=dict(size=10), + itemsizing='constant', + tracegroupgap=8, + ), + margin=dict(l=10, r=10, t=70, b=10), + width=880, + height=600, + plot_bgcolor='#f8f8f8', + paper_bgcolor='white', + hovermode='closest', + ) + + OUT_HTML.parent.mkdir(parents=True, exist_ok=True) + html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True}) + OUT_HTML.write_text( + '{% raw %}\n' + html + '\n{% endraw %}\n', + encoding='utf-8', + ) + print(f'Wrote {OUT_HTML}') + + +if __name__ == '__main__': + main() diff --git a/analysis/MA_lobbying_viz.py b/analysis/MA_lobbying_viz.py new file mode 100644 index 0000000..43a512f --- /dev/null +++ b/analysis/MA_lobbying_viz.py @@ -0,0 +1,1702 @@ +"""Generate charts for the MA environmental lobbying analysis. + +Dashboard charts (called with prefix='dash_' by dashboard_charts.py): + {prefix}lobbying_spend_trend — Annual lobbying spend on environmental bills, stacked by sector + {prefix}lobbying_top_employers — Top 15 employer spenders in most recent complete year + {prefix}lobbying_bill_intensity — Unique bills lobbied per year + pass rate + {prefix}lobbying_vs_enforcement — Dual-axis: lobbying spend vs. enforcement action count + +Analysis-post charts (no prefix, generate_post_charts): + lobbying_spend_vs_budget — Lobbying spend overlaid on DEP budget (dual-axis) + lobbying_bill_pass_by_spend_tier — Bill pass rate by lobbying intensity tier + lobbying_spend_vs_staff — Env lobbying spend vs. 
DEP FTE headcount (dual-axis) + lobbying_env_cluster_share — Env-bill lobbying spend by topic cluster, stacked over years + lobbying_top_env_employers — Top 20 employers ranked by total env-bill lobbying spend + lobbying_env_positions — Unique clients by Support/Oppose/Neutral position on env bills + lobbying_env_opponents — Top 20 clients by unique env bills opposed (all years) + lobbying_pass_by_position — Env bill pass rate by dominant lobbying position + lobbying_env_score_vs_clients — Scatter: env score vs. lobbying intensity, env + top-500 non-env + lobbying_cso_operators — Lobbying spend by known CSO operators (permittees), by year + +Data files written: + docs/data/facts_lobbying_post.yml — Key facts for Jekyll post templates +""" + +import sys +import os +from pathlib import Path +sys.path.insert(0, os.path.dirname(__file__)) + +import pandas as pd +import numpy as np +from sqlalchemy import create_engine +import chartjs + +BLUE = 'rgba(54, 110, 179, 0.85)' +RED = 'rgba(200, 60, 60, 0.85)' +ORANGE = 'rgba(230, 140, 40, 0.85)' +GREEN = 'rgba(60, 170, 80, 0.85)' +GREY = 'rgba(150, 150, 150, 0.6)' +PURPLE = 'rgba(130, 80, 200, 0.85)' +TEAL = 'rgba(30, 160, 160, 0.85)' +YELLOW = 'rgba(220, 180, 0, 0.85)' + +SECTOR_COLORS = [BLUE, ORANGE, GREEN, RED, PURPLE, TEAL, YELLOW, GREY] + +CHART_DIR = '../docs/_includes/charts' +# Pipeline (assemble_db.py) owns docs/data/facts_lobbying.yml with the headline +# figures (total spend, env count, etc.). This viz writes a SEPARATE file with the +# analysis-post-specific facts so the two never clobber each other. +FACTS_YML = '../docs/data/facts_lobbying_post.yml' + + +def _load_data(engine): + """Load lobbying/legislature tables from DB.
Returns empty DataFrames if not yet populated.""" + def _safe_read(query): + try: + return pd.read_sql_query(query, engine) + except Exception: + return pd.DataFrame() + + employers = _safe_read('SELECT * FROM MA_Lobbying_Employers') + lobby_bills = _safe_read('SELECT * FROM MA_Lobbying_Bills') + # MA_Lobbying_Bills_Scored: is_environmental, env_relevance_score, cluster_id + # MA_Legislature_Bills: passed status + scored = _safe_read('SELECT * FROM MA_Lobbying_Bills_Scored') + leg_bills_raw = _safe_read( + 'SELECT bill_id, bill_number, general_court, passed FROM MA_Legislature_Bills' + ) + if not scored.empty and not leg_bills_raw.empty: + # Join on bill_id (preferred) — avoids H/S cross-prefix contamination + # where H and S bills share the same integer bill_number in the same GC. + if 'bill_id' in scored.columns and 'bill_id' in leg_bills_raw.columns: + leg_bills = scored.merge( + leg_bills_raw[['bill_id', 'general_court', 'passed']].dropna(subset=['bill_id']), + on=['bill_id', 'general_court'], how='left') + else: + leg_bills = scored.merge(leg_bills_raw, on=['bill_number', 'general_court'], how='left') + elif not scored.empty: + leg_bills = scored + else: + leg_bills = pd.DataFrame() + return employers, lobby_bills, leg_bills + + +def _bill_merge(left: pd.DataFrame, right: pd.DataFrame, + extra_right_cols: list | None = None, + how: str = 'inner') -> pd.DataFrame: + """Merge two DataFrames on (bill_id, general_court) when both have bill_id, + falling back to (bill_number, general_court) for rows without bill_id. + + This avoids cross-prefix contamination: H and S bills share the same integer + bill_number within a General Court, so a (bill_number, gc) join incorrectly + merges their lobbying records. bill_id (e.g. H1234, S5678) is unambiguous. + + Parameters + ---------- + left : DataFrame that must have at least bill_number + general_court. + right : DataFrame with bill_number + general_court and possibly bill_id. 
+ extra_right_cols : additional columns to keep from right (default: all). + how : join type passed to pd.merge (default 'inner'). + """ + has_bid = ('bill_id' in left.columns and 'bill_id' in right.columns) + right_cols = list(right.columns) if extra_right_cols is None else ( + (['bill_id'] if has_bid else []) + ['bill_number', 'general_court'] + extra_right_cols + ) + right_sub = right[right_cols] if extra_right_cols is not None else right + if not has_bid: + return left.merge(right_sub, on=['bill_number', 'general_court'], how=how) + # Rows with bill_id: join on (bill_id, general_court) + right_id = right_sub[right_sub['bill_id'].notna()] + right_num = right_sub[right_sub['bill_id'].isna()] + left_id = left[left['bill_id'].notna()] + left_num = left[left['bill_id'].isna()] + parts = [] + if not left_id.empty and not right_id.empty: + parts.append(left_id.merge(right_id, on=['bill_id', 'general_court'], how=how)) + if not left_num.empty and not right_num.empty: + parts.append(left_num.merge(right_num, on=['bill_number', 'general_court'], how=how)) + if not parts: + # Return empty frame with the right shape + return left.merge(right_sub, on=['bill_number', 'general_court'], how='inner').iloc[0:0] + return pd.concat(parts, ignore_index=True) + + +def _env_bills(lobby_bills: pd.DataFrame, leg_bills: pd.DataFrame) -> pd.DataFrame: + """Return lobby_bills rows joined to environmentally relevant bills.""" + if leg_bills.empty or lobby_bills.empty or 'is_environmental' not in leg_bills.columns: + return pd.DataFrame() + env = leg_bills[leg_bills['is_environmental'] == 1].copy() + passed_col = ['passed'] if 'passed' in env.columns else [] + extra = passed_col + return _bill_merge(lobby_bills, env, extra_right_cols=extra, how='inner') + + +def _annual_env_spend(employers: pd.DataFrame, lobby_bills: pd.DataFrame, + leg_bills: pd.DataFrame) -> pd.DataFrame: + """Annual lobbying spend allocated to environmental bills (proportional). 
+ + For each (entity, client, year) row in MA_Lobbying_Employers, computes + env_spend = compensation × (n_env_bills / n_all_bills) where both bill + counts are for that (entity, client, year) triple. Sums across all pairs + per year. + + Proportional allocation avoids inflating spend for clients who lobbied a + single env bill alongside hundreds of unrelated bills. + + Falls back to total env-client spend (non-proportional) if lobby_bills + has no year-level bill counts — but this should not occur in normal use. + + Excludes the legacy 'Total salaries received' aggregate rows. + """ + if employers.empty or lobby_bills.empty: + return pd.DataFrame() + env_lb = _env_bills(lobby_bills, leg_bills) + if env_lb.empty: + # No env scoring yet — fall back to total spend for clients with any bills + env_pairs = lobby_bills[['client_name', 'year']].drop_duplicates() + emp = employers[employers['client_name'] != 'Total salaries received'] + merged = emp.merge(env_pairs, on=['client_name', 'year'], how='inner') + return ( + merged.groupby('year')['compensation'] + .sum() + .reset_index() + .sort_values('year') + ) + + pair_keys = ['entity_name', 'client_name', 'year'] + count_col = 'bill_id' if 'bill_id' in env_lb.columns else 'bill_number' + # Count env bills per (firm, client, year) + env_counts = ( + env_lb.groupby(pair_keys)[count_col].nunique() + .reset_index(name='n_env') + ) + # Count all bills per (firm, client, year) + all_counts = ( + lobby_bills.groupby(pair_keys)[count_col].nunique() + .reset_index(name='n_all') + ) + fracs = env_counts.merge(all_counts, on=pair_keys, how='left') + fracs['env_frac'] = fracs['n_env'] / fracs['n_all'].replace(0, np.nan) + + emp = employers[employers['client_name'] != 'Total salaries received'] + merged = emp.merge(fracs, on=pair_keys, how='inner') + merged['env_spend'] = merged['compensation'] * merged['env_frac'].fillna(0) + return ( + merged.groupby('year')['env_spend'] + .sum() + .reset_index() + .rename(columns={'env_spend': 
'compensation'}) + .sort_values('year') + ) + + +def generate_charts(engine, prefix=''): + """Generate dashboard lobbying charts. + + Parameters + ---------- + engine : sqlalchemy.engine.Engine + prefix : str + Filename prefix (e.g. 'dash_' for dashboard charts). + """ + employers, lobby_bills, leg_bills = _load_data(engine) + + if employers.empty: + print('MA lobbying data not yet available — skipping lobbying charts.') + return + + employers['year'] = pd.to_numeric(employers['year'], errors='coerce').astype('Int64') + if not lobby_bills.empty: + lobby_bills['year'] = pd.to_numeric(lobby_bills['year'], errors='coerce').astype('Int64') + + # ── Chart 1: Annual spend trend ─────────────────────────────────────────── + spend_trend = _annual_env_spend(employers, lobby_bills, leg_bills) + + if not spend_trend.empty: + years = spend_trend['year'].dropna().astype(int).tolist() + spend_m = (spend_trend['compensation'] / 1e6).tolist() + + c = chartjs.Chart( + 'Annual MA Lobbying Spend on Environmental Bills', + 'Bar', width=700, height=380, + ) + c.set_labels([str(y) for y in years]) + c.add_dataset(spend_m, 'Total spend ($M)', backgroundColor=f"'{BLUE}'") + c.set_params( + js_inline=False, + ylabel='Lobbying spend ($M)', + xlabel='Year', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_spend_trend.html') + print(f'Wrote {prefix}lobbying_spend_trend.html') + + # ── Chart 2: Top employers in most recent year ──────────────────────────── + most_recent_year = int(employers['year'].dropna().max()) + # Use second-most-recent year if most-recent looks like a partial year + # (fewer than half the employer count of the prior year) + year_counts = employers.groupby('year').size() + if len(year_counts) >= 2: + penultimate = int(sorted(year_counts.index)[-2]) + if year_counts[most_recent_year] < year_counts[penultimate] * 0.5: + most_recent_year = penultimate + + # Aggregate by client (paying entity), not by lobbying firm + emp_year = employers[ + (employers['year'] == 
most_recent_year) + & (employers['client_name'] != 'Total salaries received') + ] + top_employers = ( + emp_year.groupby('client_name')['compensation'].sum() + .nlargest(15) + .sort_values() # ascending for horizontal bar + .reset_index() + ) + + if not top_employers.empty: + c = chartjs.Chart( + f'Top 15 MA Lobbying Clients — {most_recent_year}', + 'HorizontalBar', width=700, height=440, + ) + c.set_labels(top_employers['client_name'].tolist()) + spend_k = (top_employers['compensation'] / 1e3).tolist() + c.add_dataset(spend_k, 'Spend ($K)', backgroundColor=f"'{ORANGE}'") + c.set_params(js_inline=False, xlabel='Lobbying spend ($K)') + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_top_employers.html') + print(f'Wrote {prefix}lobbying_top_employers.html') + + # ── Chart 3: Bill intensity — unique bills lobbied + pass rate ──────────── + if not lobby_bills.empty and not leg_bills.empty: + env_lb = _env_bills(lobby_bills, leg_bills) + if env_lb.empty: + env_lb = lobby_bills.copy() + env_lb['passed'] = np.nan + + bills_per_year = ( + env_lb.groupby('year')['bill_number'] + .nunique() + .reset_index(name='n_bills') + .sort_values('year') + ) + pass_rate_per_year = ( + env_lb.drop_duplicates(subset=['bill_number', 'general_court', 'year']) + .groupby('year')['passed'] + .mean() + .reset_index(name='pass_rate') + ) + bill_intensity = bills_per_year.merge(pass_rate_per_year, on='year', how='left') + + years_bi = bill_intensity['year'].dropna().astype(int).tolist() + n_bills = bill_intensity['n_bills'].tolist() + pass_pct = (bill_intensity['pass_rate'].fillna(0) * 100).tolist() + + c = chartjs.Chart( + 'Environmental Bills Lobbied per Year', + 'Bar', width=700, height=380, + ) + c.set_labels([str(y) for y in years_bi]) + c.add_dataset(n_bills, 'Unique bills lobbied', backgroundColor=f"'{TEAL}'", + yAxisID="'y'") + c.add_dataset(pass_pct, 'Pass rate (%)', backgroundColor=f"'{GREEN}'", + type="'line'", yAxisID="'y1'") + c.set_params( + js_inline=False, + ylabel='Bills 
lobbied', + xlabel='Year', + y2nd=1, + y2nd_title='Pass rate (%)', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_bill_intensity.html') + print(f'Wrote {prefix}lobbying_bill_intensity.html') + + # ── Chart 4: Lobbying spend vs. enforcement count (dual-axis) ───────────── + try: + enf = pd.read_sql_query( + "SELECT strftime('%Y', EnforcementDate) AS year, COUNT(*) AS n_actions " + "FROM MAEEADP_Enforcement " + "WHERE EnforcementType NOT IN (" + " 'Notice Of Non-Compliance','Field Notice Of Non Compliance'," + " 'BOIL ORDER','Federal Administrative Order Against PWS'," + " 'Federal Notice Of Noncompliance Against PWS'" + ") GROUP BY 1", + engine, + ) + enf['year'] = pd.to_numeric(enf['year'], errors='coerce').astype('Int64') + except Exception: + enf = pd.DataFrame() + + if not spend_trend.empty and not enf.empty: + merged = spend_trend.merge(enf, on='year', how='inner') + merged = merged.sort_values('year') + years_vs = merged['year'].astype(int).tolist() + spend_m_vs = (merged['compensation'] / 1e6).tolist() + n_enf = merged['n_actions'].tolist() + + c = chartjs.Chart( + 'MA Lobbying Spend vs. 
Enforcement Actions', + 'Bar', width=700, height=380, + ) + c.set_labels([str(y) for y in years_vs]) + c.add_dataset(spend_m_vs, 'Lobbying spend ($M)', backgroundColor=f"'{BLUE}'", + yAxisID="'y'") + c.add_dataset(n_enf, 'Enforcement actions', backgroundColor=f"'{RED}'", + type="'line'", yAxisID="'y1'") + c.set_params( + js_inline=False, + ylabel='Lobbying spend ($M)', + xlabel='Year', + y2nd=1, + y2nd_title='Enforcement actions', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_vs_enforcement.html') + print(f'Wrote {prefix}lobbying_vs_enforcement.html') + + # ── Chart 5: Lobbying spend by topic cluster (stacked bar by year) ─────────── + _chart_spend_by_cluster(engine, employers, lobby_bills, prefix) + + _write_post_facts(engine, lobby_bills, leg_bills) + + +def _chart_spend_by_cluster(engine, employers: pd.DataFrame, lobby_bills: pd.DataFrame, prefix: str): + """Stacked bar: annual employer spend broken down by bill topic cluster.""" + try: + scored = pd.read_sql_query( + 'SELECT bill_id, bill_number, general_court, cluster_id FROM MA_Lobbying_Bills_Scored ' + 'WHERE cluster_id IS NOT NULL AND cluster_id != -1', + engine, + ) + cluster_labels = pd.read_sql_query( + 'SELECT cluster_id, label FROM MA_Bill_Cluster_Labels', engine, + ) + except Exception: + print(' Cluster data not yet in DB — skipping cluster spend chart.') + return + + if scored.empty: + print(' Cluster IDs not yet assigned — skipping cluster spend chart.') + return + + # Join cluster_id onto lobby_bills via bill_id (preferred) to avoid H/S cross-prefix + has_bid = 'bill_id' in scored.columns and 'bill_id' in lobby_bills.columns + if has_bid: + scored_id = scored[scored['bill_id'].notna()][['bill_id', 'general_court', 'cluster_id']] + scored_num = scored[scored['bill_id'].isna()][['bill_number', 'general_court', 'cluster_id']] + lb_with_id = lobby_bills[lobby_bills['bill_id'].notna()].merge( + scored_id, on=['bill_id', 'general_court'], how='left') + lb_no_id = 
lobby_bills[lobby_bills['bill_id'].isna()].merge( + scored_num, on=['bill_number', 'general_court'], how='left') + lb = pd.concat([lb_with_id, lb_no_id], ignore_index=True) + else: + lb = lobby_bills.merge( + scored[['bill_number', 'general_court', 'cluster_id']], + on=['bill_number', 'general_court'], how='left' + ) + lb = lb.dropna(subset=['cluster_id']) + lb['cluster_id'] = lb['cluster_id'].astype(int) + + # Join client compensation: match (entity_name, client_name, year) + emp = employers[employers['client_name'] != 'Total salaries received'] + lb_emp = lb.merge(emp[['entity_name', 'client_name', 'year', 'compensation']], + on=['entity_name', 'client_name', 'year'], how='left') + + # Annual spend per cluster (divide compensation equally across clusters + # lobbied by each (firm, client) pair in that year to avoid double-counting) + clusters_per_pair_year = ( + lb_emp.groupby(['entity_name', 'client_name', 'year'])['cluster_id'] + .nunique() + .reset_index(name='n_clusters') + ) + lb_emp = lb_emp.merge(clusters_per_pair_year, + on=['entity_name', 'client_name', 'year']) + lb_emp['spend_share'] = lb_emp['compensation'] / lb_emp['n_clusters'] + + spend_by_cluster = ( + lb_emp.groupby(['year', 'cluster_id'])['spend_share'] + .sum() + .reset_index() + ) + + # Build cluster label map + label_map = dict(zip(cluster_labels['cluster_id'], cluster_labels['label'])) + spend_by_cluster['label'] = spend_by_cluster['cluster_id'].map(label_map).fillna('Other') + + years = sorted(spend_by_cluster['year'].dropna().astype(int).unique()) + # Top clusters by total spend across all years + top_clusters = ( + spend_by_cluster.groupby('cluster_id')['spend_share'] + .sum() + .nlargest(10) + .index.tolist() + ) + + c = chartjs.Chart( + 'MA Lobbying Spend by Topic Cluster', + 'Bar', width=700, height=420, + ) + c.set_labels([str(y) for y in years]) + + colors = SECTOR_COLORS + for i, cid in enumerate(top_clusters): + subset = spend_by_cluster[spend_by_cluster['cluster_id'] == cid] + 
year_spend = {int(r['year']): r['spend_share'] / 1e6 + for _, r in subset.iterrows()} + data = [year_spend.get(y, 0) for y in years] + label = label_map.get(cid, f'Cluster {cid}') + c.add_dataset(data, label, + backgroundColor=f"'{colors[i % len(colors)]}'", + stack="'topic'") + + c.set_params( + js_inline=False, + ylabel='Lobbying spend ($M)', + xlabel='Year', + stacked=True, + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_spend_by_cluster.html') + print(f'Wrote {prefix}lobbying_spend_by_cluster.html') + + +def generate_post_charts(engine, prefix=''): + """Generate analysis-post lobbying charts (not suitable for weekly CI).""" + employers, lobby_bills, leg_bills = _load_data(engine) + + if employers.empty: + print('MA lobbying data not yet available — skipping post charts.') + return + + employers['year'] = pd.to_numeric(employers['year'], errors='coerce').astype('Int64') + + # ── Post chart 1: Lobbying spend vs. DEP budget ─────────────────────────── + try: + budget = pd.read_sql_query( + 'SELECT Year, DEPAdministration_inf FROM MassBudget_summary', engine + ) + budget['Year'] = pd.to_numeric(budget['Year'], errors='coerce').astype('Int64') + except Exception as e: + print(f' Budget query failed: {e}') + budget = pd.DataFrame() + + spend_trend = _annual_env_spend(employers, lobby_bills, leg_bills) + + if not spend_trend.empty and not budget.empty: + merged = spend_trend.merge(budget, left_on='year', right_on='Year', how='inner') + merged = merged.sort_values('year') + years_sb = merged['year'].astype(int).tolist() + spend_m = (merged['compensation'] / 1e6).tolist() + budget_m = (merged['DEPAdministration_inf'].astype(float) / 1e6).tolist() + + c = chartjs.Chart( + 'MA Lobbying Spend vs. 
DEP Budget (inflation-adjusted)', + 'Bar', width=700, height=400, + ) + c.set_labels([str(y) for y in years_sb]) + c.add_dataset(spend_m, 'Industry lobbying spend ($M)', backgroundColor=f"'{ORANGE}'", + yAxisID="'y'") + c.add_dataset(budget_m, 'DEP admin budget ($M, inflation-adj.)', + backgroundColor=f"'{BLUE}'", type="'line'", yAxisID="'y1'") + c.set_params( + js_inline=False, + ylabel='Lobbying spend ($M)', + xlabel='Year', + y2nd=1, + y2nd_title='DEP budget ($M)', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_spend_vs_budget.html') + print(f'Wrote {prefix}lobbying_spend_vs_budget.html') + + # ── Post chart 2: Bill pass rate by lobbying intensity tier ─────────────── + if not lobby_bills.empty and not leg_bills.empty: + env_lb = _env_bills(lobby_bills, leg_bills) + if not env_lb.empty and 'passed' in env_lb.columns: + bill_key = ['bill_id', 'general_court'] if 'bill_id' in env_lb.columns else ['bill_number', 'general_court'] + employer_counts = ( + env_lb.groupby(bill_key)['client_name'] + .nunique() + .reset_index(name='employer_count') + ) + bill_info = leg_bills[bill_key + ['passed']].drop_duplicates() + tc = employer_counts.merge(bill_info, on=bill_key, how='left') + + def _tier(n): + if n >= 10: + return '10+ clients' + elif n >= 3: + return '3–9 clients' + else: + return '1–2 clients' + + tc['tier'] = tc['employer_count'].apply(_tier) + tier_order = ['1–2 clients', '3–9 clients', '10+ clients'] + summary = ( + tc.groupby('tier')['passed'] + .agg(['mean', 'count']) + .reindex(tier_order) + .reset_index() + ) + + c = chartjs.Chart( + 'Environmental Bill Pass Rate by Lobbying Intensity', + 'Bar', width=500, height=360, + ) + c.set_labels(tier_order) + c.add_dataset( + (summary['mean'].fillna(0) * 100).tolist(), + 'Pass rate (%)', + backgroundColor=f"'{GREEN}'", + ) + c.set_params(js_inline=False, ylabel='Pass rate (%)', xlabel='Number of lobbying clients') + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_bill_pass_by_spend_tier.html') + print(f'Wrote
{prefix}lobbying_bill_pass_by_spend_tier.html') + + # ── Post chart 3: Lobbying spend vs. DEP FTE headcount ──────────────────── + try: + staff = pd.read_sql_query( + "SELECT year, COUNT(*) AS n_fte FROM MADEP_staff_Comptroller " + "WHERE pay_total_actual > 0 GROUP BY year", engine + ) + staff['year'] = pd.to_numeric(staff['year'], errors='coerce').astype('Int64') + except Exception: + staff = pd.DataFrame() + + if not spend_trend.empty and not staff.empty: + merged = spend_trend.merge(staff, on='year', how='inner').sort_values('year') + if not merged.empty: + years_s = merged['year'].astype(int).tolist() + spend_m = (merged['compensation'] / 1e6).tolist() + fte = merged['n_fte'].astype(int).tolist() + + c = chartjs.Chart( + 'Environmental Lobbying Spend vs. DEP Staff Headcount', + 'Bar', width=700, height=400, + ) + c.set_labels([str(y) for y in years_s]) + c.add_dataset(spend_m, 'Industry lobbying spend ($M)', + backgroundColor=f"'{ORANGE}'", yAxisID="'y'") + c.add_dataset(fte, 'DEP staff (FTE)', + backgroundColor=f"'{BLUE}'", type="'line'", yAxisID="'y1'") + c.set_params( + js_inline=False, + ylabel='Lobbying spend ($M)', + xlabel='Year', + y2nd=1, + y2nd_title='DEP staff (FTE)', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_spend_vs_staff.html') + print(f'Wrote {prefix}lobbying_spend_vs_staff.html') + + # ── Post chart 4: Env-bill lobbying spend by topic cluster, stacked ─────── + # Joins employers→lobby_bills→scored→cluster_labels. 
+ if (not employers.empty and not lobby_bills.empty + and 'cluster_id' in leg_bills.columns): + try: + cluster_labels = pd.read_sql_query( + 'SELECT cluster_id, label FROM MA_Bill_Cluster_Labels', engine + ) + except Exception: + cluster_labels = pd.DataFrame() + + if not cluster_labels.empty and 'cluster_id' in cluster_labels.columns: + cluster_labels['cluster_id'] = pd.to_numeric( + cluster_labels['cluster_id'], errors='coerce' + ).astype('Int64') + + env_lb = _env_bills(lobby_bills, leg_bills) + if not env_lb.empty and not cluster_labels.empty: + # Attach cluster_id to each env lobby_bills row via bill_id (preferred) + has_bid_leg = 'bill_id' in leg_bills.columns and 'bill_id' in env_lb.columns + if has_bid_leg: + scored_cluster = leg_bills[['bill_id', 'general_court', 'cluster_id']].dropna(subset=['bill_id']) + env_lb_c = env_lb.merge(scored_cluster, on=['bill_id', 'general_court'], how='left') + else: + scored_cluster = leg_bills[['bill_number', 'general_court', 'cluster_id']] + env_lb_c = env_lb.merge(scored_cluster, on=['bill_number', 'general_court'], how='left') + # Allocate (firm, client) compensation equally across env bills they + # lobbied that year, then sum by (year, cluster_id). 
+ pair_year_bills = ( + env_lb_c.groupby(['entity_name', 'client_name', 'year']) + .size().reset_index(name='n_env_bills') + ) + emp = employers[employers['client_name'] != 'Total salaries received'] + emp_join = emp.merge( + pair_year_bills, on=['entity_name', 'client_name', 'year'], how='inner' + ) + emp_join['per_bill'] = emp_join['compensation'] / emp_join['n_env_bills'] + cluster_spend = env_lb_c.merge( + emp_join[['entity_name', 'client_name', 'year', 'per_bill']], + on=['entity_name', 'client_name', 'year'], how='left' + ).dropna(subset=['cluster_id', 'per_bill']) + cluster_spend['cluster_id'] = cluster_spend['cluster_id'].astype(int) + agg = ( + cluster_spend.groupby(['year', 'cluster_id'])['per_bill'] + .sum().reset_index() + ) + agg = agg.merge(cluster_labels, on='cluster_id', how='left') + pivot = agg.pivot_table( + index='year', columns='label', values='per_bill', aggfunc='sum' + ).fillna(0).sort_index() + # Keep top 8 clusters by total spend, group rest into "Other" + totals = pivot.sum(axis=0).sort_values(ascending=False) + top = totals.head(8).index.tolist() + other_cols = [c for c in pivot.columns if c not in top] + if other_cols: + pivot['Other'] = pivot[other_cols].sum(axis=1) + pivot = pivot[top + ['Other']] + else: + pivot = pivot[top] + + years = pivot.index.astype(int).tolist() + c = chartjs.Chart( + 'Environmental Lobbying Spend by Topic Cluster', + 'Bar', width=750, height=420, + ) + c.set_labels([str(y) for y in years]) + for i, col in enumerate(pivot.columns): + color = SECTOR_COLORS[i % len(SECTOR_COLORS)] + c.add_dataset( + (pivot[col] / 1e6).tolist(), col, + backgroundColor=f"'{color}'", stack="'a'", + ) + c.set_params( + js_inline=False, + ylabel='Allocated spend ($M)', + xlabel='Year', + stacked=1, + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_env_cluster_share.html') + print(f'Wrote {prefix}lobbying_env_cluster_share.html') + + # ── Post chart 5: Top clients by cumulative environmental lobbying spend ── + if not employers.empty 
and not lobby_bills.empty: + env_lb = _env_bills(lobby_bills, leg_bills) + if not env_lb.empty: + # Per (firm, client, year): env share = env bills / total bills lobbied + pair_keys = ['entity_name', 'client_name', 'year'] + bill_counts = ( + lobby_bills.groupby(pair_keys).size() + .reset_index(name='n_all') + ) + env_counts = ( + env_lb.groupby(pair_keys).size() + .reset_index(name='n_env') + ) + shares = bill_counts.merge(env_counts, on=pair_keys, how='left') + shares['n_env'] = shares['n_env'].fillna(0) + shares['env_share'] = shares['n_env'] / shares['n_all'].replace(0, np.nan) + emp = employers[employers['client_name'] != 'Total salaries received'] + pair_year = emp.merge(shares, on=pair_keys, how='inner') + pair_year['env_spend'] = pair_year['compensation'] * pair_year['env_share'] + top_clients = ( + pair_year.groupby('client_name')['env_spend'] + .sum().sort_values(ascending=False).head(20) + ) + if not top_clients.empty: + # Reverse so largest is at top in horizontal bar (ascending order) + top_clients = top_clients.sort_values() + c = chartjs.Chart( + 'Top 20 Clients by Cumulative Environmental Lobbying Spend', + 'HorizontalBar', width=750, height=520, + ) + c.set_labels(top_clients.index.tolist()) + c.add_dataset( + (top_clients.values / 1e6).tolist(), + 'Total env-bill spend ($M, all years)', + backgroundColor=f"'{GREEN}'", + ) + c.set_params( + js_inline=False, + ylabel='', + xlabel='Cumulative env-bill spend ($M)', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_top_env_employers.html') + print(f'Wrote {prefix}lobbying_top_env_employers.html') + + # ── Post chart 6: Support/Oppose/Neutral trend on env bills ────────────── + _chart_env_position_trend(lobby_bills, leg_bills, prefix) + + # ── Post chart 7: Top opponents of env bills ────────────────────────────── + _chart_top_env_opponents(lobby_bills, leg_bills, prefix) + + # ── Post chart 8: Env bill pass rate by dominant lobbying position ──────── + _chart_pass_rate_by_position(lobby_bills, 
leg_bills, prefix) + + # ── Post chart 9: Env score vs. lobbying intensity scatter ─────────────── + _chart_env_score_vs_clients(engine, prefix) + + # ── Post charts 11–15: LLM-based new analysis charts ───────────────────── + parquet_df = _load_parquet_llm() + _chart_env_categories_by_gc(parquet_df, prefix) + _chart_gc_trend(parquet_df, lobby_bills, prefix) + _chart_employer_env_scatter(parquet_df, lobby_bills, employers, prefix) + _chart_opposition_pairs(parquet_df, lobby_bills, prefix) + _chart_top_env_tags(parquet_df, prefix) + + # ── Post chart 10: Lobbying spend by known CSO operators + proxies ──────── + # Cross-references MA_Lobbying_Employers.client_name with MAEEADP_CSO.permiteeName. + # Includes the Massachusetts Municipal Association as a proxy: it is the + # primary lobbyist for municipal CSO operators (most cities/towns lobby + # through MMA rather than directly). + try: + cso_permittees = pd.read_sql_query( + 'SELECT DISTINCT permiteeName FROM MAEEADP_CSO WHERE permiteeName IS NOT NULL', + engine, + ) + except Exception: + cso_permittees = pd.DataFrame() + + PROXY_LOBBYISTS = { + 'MASSACHUSETTS MUNICIPAL ASSOCIATION': 'Massachusetts Municipal Association (CSO proxy)', + } + + if not employers.empty and not cso_permittees.empty: + import re + def _norm(s): + # Collapse 'AND'/'&' to space, drop punctuation, collapse whitespace + t = re.sub(r'[&]', ' ', str(s).upper()) + t = re.sub(r'\bAND\b', ' ', t) + t = ''.join(ch if ch.isalnum() or ch == ' ' else ' ' for ch in t) + t = re.sub(r'\s+', ' ', t).strip() + return t + + operator_norms = {_norm(p): p for p in cso_permittees['permiteeName'].dropna()} + operator_norms = {k: v for k, v in operator_norms.items() if len(k) > 4} + + def _match_operator(name): + n = _norm(name) + if not n: + # An empty name is a substring of every operator, so never match it + return None + for proxy_norm, label in PROXY_LOBBYISTS.items(): + if proxy_norm in n: + return label + for op_norm, op in operator_norms.items(): + if op_norm in n or n in op_norm: + return op + return None + + emp = 
employers[employers['client_name'] != 'Total salaries received'].copy() + emp['cso_operator'] = emp['client_name'].apply(_match_operator) + cso_emp = emp.dropna(subset=['cso_operator']) + if not cso_emp.empty: + yearly = ( + cso_emp.groupby(['year', 'cso_operator'])['compensation'] + .sum().reset_index() + ) + # Keep top 8 operators by total spend + top_ops = ( + yearly.groupby('cso_operator')['compensation'] + .sum().sort_values(ascending=False).head(8).index.tolist() + ) + yearly = yearly[yearly['cso_operator'].isin(top_ops)] + pivot = yearly.pivot_table( + index='year', columns='cso_operator', values='compensation', aggfunc='sum' + ).fillna(0).sort_index() + + years = pivot.index.astype(int).tolist() + c = chartjs.Chart( + 'Total Lobbying Spend by Known CSO Operators', + 'Bar', width=750, height=420, + ) + c.set_labels([str(y) for y in years]) + for i, col in enumerate(pivot.columns): + color = SECTOR_COLORS[i % len(SECTOR_COLORS)] + c.add_dataset( + (pivot[col] / 1e6).tolist(), col, + backgroundColor=f"'{color}'", stack="'a'", + ) + c.set_params( + js_inline=False, + ylabel='Annual lobbying spend ($M)', + xlabel='Year', + stacked=1, + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_cso_operators.html') + print(f'Wrote {prefix}lobbying_cso_operators.html') + + +def _chart_env_position_trend(lobby_bills: pd.DataFrame, leg_bills: pd.DataFrame, + prefix: str): + """Stacked bar: unique clients taking Support/Oppose/Neutral positions on env bills by year. + + Note: "opposing an environmental bill" does not always mean opposing environmental + protection: some env advocates oppose bills they consider inadequate or harmful. + The chart shows industry engagement with env-relevant legislation, not ideology. 
+ """ + if lobby_bills.empty or leg_bills.empty or 'is_environmental' not in leg_bills.columns: + return + + env_ids = leg_bills[leg_bills['is_environmental'] == 1].copy() + env_lb = _bill_merge(lobby_bills, env_ids, how='inner') + if env_lb.empty: + return + + pos_yr = ( + env_lb[env_lb['position'].isin(['Support', 'Oppose', 'Neutral'])] + .groupby(['year', 'position'])['client_name'] + .nunique() + .reset_index(name='n_clients') + ) + pivot = pos_yr.pivot_table( + index='year', columns='position', values='n_clients', fill_value=0 + ).sort_index() + for col in ['Support', 'Oppose', 'Neutral']: + if col not in pivot.columns: + pivot[col] = 0 + + # Drop sparse early years (fewer than 5 total clients across positions) + pivot = pivot[pivot[['Support', 'Oppose', 'Neutral']].sum(axis=1) >= 5] + if pivot.empty: + return + + years = pivot.index.astype(int).tolist() + c = chartjs.Chart( + 'Unique Clients by Position on Environmental Bills', + 'Bar', width=700, height=380, + ) + c.set_labels([str(y) for y in years]) + c.add_dataset(pivot['Support'].tolist(), 'Support', + backgroundColor=f"'{GREEN}'", stack="'pos'") + c.add_dataset(pivot['Neutral'].tolist(), 'Neutral', + backgroundColor=f"'{GREY}'", stack="'pos'") + c.add_dataset(pivot['Oppose'].tolist(), 'Oppose', + backgroundColor=f"'{RED}'", stack="'pos'") + c.set_params( + js_inline=False, + ylabel='Unique lobbying clients', + xlabel='Year', + stacked=True, + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_env_positions.html') + print(f'Wrote {prefix}lobbying_env_positions.html') + + +def _chart_top_env_opponents(lobby_bills: pd.DataFrame, leg_bills: pd.DataFrame, + prefix: str): + """Horizontal bar: clients ranked by unique env bills opposed (all years). + + "Opposing" an env-relevant bill can reflect either industry opposition to + new regulation, or an env group opposing a bill it considers harmful. + Top opponents are labelled accordingly where known. 
+ """ + if lobby_bills.empty or leg_bills.empty or 'is_environmental' not in leg_bills.columns: + return + + env_ids = leg_bills[leg_bills['is_environmental'] == 1].copy() + env_lb = _bill_merge(lobby_bills, env_ids, how='inner') + if env_lb.empty: + return + + bill_key = ['bill_id', 'general_court'] if 'bill_id' in env_lb.columns else ['bill_number', 'general_court'] + oppose = ( + env_lb[env_lb['position'] == 'Oppose'] + .groupby('client_name')[bill_key] + .apply(lambda g: g.drop_duplicates().shape[0]) + .reset_index(name='n_bills_opposed') + .sort_values('n_bills_opposed', ascending=False) + .head(20) + .sort_values('n_bills_opposed') # ascending for horizontal bar + ) + if oppose.empty: + return + + c = chartjs.Chart( + 'Top 20 Clients Opposing Environmental Bills (all years)', + 'HorizontalBar', width=750, height=520, + ) + c.set_labels(oppose['client_name'].tolist()) + c.add_dataset( + oppose['n_bills_opposed'].tolist(), + 'Unique env bills opposed', + backgroundColor=f"'{RED}'", + ) + c.set_params( + js_inline=False, + ylabel='', + xlabel='Unique environmental bills opposed', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_env_opponents.html') + print(f'Wrote {prefix}lobbying_env_opponents.html') + + +def _chart_pass_rate_by_position(lobby_bills: pd.DataFrame, leg_bills: pd.DataFrame, + prefix: str): + """Bar: env bill pass rate by dominant lobbying position. + + Classifies each env bill as 'Mostly supported', 'Mostly opposed', or + 'Contested / Neutral' (a tie) based on which position has the most unique clients. + Plots the pass rate per category; the bill count is computed but not charted. 
+ """ + if lobby_bills.empty or leg_bills.empty or 'is_environmental' not in leg_bills.columns: + return + if 'passed' not in leg_bills.columns: + return + + has_bid = 'bill_id' in leg_bills.columns and 'bill_id' in lobby_bills.columns + bill_key = ['bill_id', 'general_court'] if has_bid else ['bill_number', 'general_court'] + score_cols = bill_key + ['passed'] + env_scored = leg_bills[leg_bills['is_environmental'] == 1][score_cols].drop_duplicates() + if has_bid: + env_scored = env_scored.dropna(subset=['bill_id']) + if env_scored.empty: + return + + env_lb = lobby_bills.merge( + env_scored[bill_key], on=bill_key, how='inner' + ) + pos_counts = ( + env_lb[env_lb['position'].isin(['Support', 'Oppose'])] + .groupby(bill_key + ['position'])['client_name'] + .nunique() + .unstack(fill_value=0) + .reset_index() + ) + for col in ['Support', 'Oppose']: + if col not in pos_counts.columns: + pos_counts[col] = 0 + + def _category(row): + if row['Support'] > row['Oppose']: + return 'Mostly supported' + if row['Oppose'] > row['Support']: + return 'Mostly opposed' + return 'Contested / Neutral' + + pos_counts['category'] = pos_counts.apply(_category, axis=1) + tc = pos_counts.merge(env_scored, on=bill_key, how='left') + + cat_order = ['Mostly supported', 'Mostly opposed', 'Contested / Neutral'] + summary = ( + tc.groupby('category')['passed'] + .agg(pass_rate='mean', n_bills='count') + .reindex(cat_order) + .fillna(0) + .reset_index() + ) + + c = chartjs.Chart( + 'Environmental Bill Pass Rate by Lobbying Position', + 'Bar', width=520, height=360, + ) + c.set_labels(cat_order) + c.add_dataset( + (summary['pass_rate'] * 100).round(1).tolist(), + 'Pass rate (%)', + backgroundColor=[f"'{GREEN}'", f"'{RED}'", f"'{GREY}'"], + ) + c.set_params( + js_inline=False, + ylabel='Pass rate (%)', + xlabel='Dominant lobbying position', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_pass_by_position.html') + print(f'Wrote {prefix}lobbying_pass_by_position.html') + + +def 
_chart_env_score_vs_clients(engine, prefix: str, top_n_nonenv: int = 500): + """Scatter: environmental relevance score (x) vs. unique lobbying clients (y, log scale). + + Three groups: + Environmental — all env-relevant bills (green, outlined) + Appropriations — annual budget / line-item bills (purple); separated because + they attract hundreds of clients for budget reasons unrelated + to the bill's policy topic — they dominate the y-axis and + are a distinct lobbying mechanism + Non-env policy — top-N most-lobbied non-appropriations, non-env bills (grey) + + Y-axis is log-scaled: most bills have 1–10 clients, appropriations have 300+, + so linear scale compresses the interesting region. + Marginal histograms show the density distribution of each group along both axes. + Threshold line (x = 0.05) only on main scatter and top (x) marginal. + """ + import plotly.express as px + + try: + scored = pd.read_sql_query( + 'SELECT bill_number, general_court, bill_id, bill_title, ' + ' env_relevance_score, is_environmental ' + 'FROM MA_Lobbying_Bills_Scored', + engine, + ) + counts = pd.read_sql_query( + 'SELECT bill_number, general_court, ' + ' COUNT(DISTINCT client_name) AS n_clients ' + 'FROM MA_Lobbying_Bills ' + 'GROUP BY bill_number, general_court', + engine, + ) + except Exception as e: + print(f' env_score_vs_clients: DB query failed ({e}) — skipping') + return + + df = scored.merge(counts, on=['bill_number', 'general_court'], how='left') + df['n_clients'] = df['n_clients'].fillna(0).astype(int) + df['bill_title'] = df['bill_title'].fillna('').astype(str) + + # Classify appropriations by title pattern — these are the annual budget bills + # and their line-item amendments, which attract 100–350 clients purely because + # they're the vehicle for all state spending decisions. 
+ _approp_re = ( + r'(?i)making appropriations|appropriation.*fiscal year' + r'|line item \d|amendment.*\d{4}-\d{4}' + ) + df['is_approp'] = df['bill_title'].str.contains(_approp_re, regex=True, na=False) + + env = df[df['is_environmental'] == 1].copy() + approp = df[(df['is_environmental'] == 0) & df['is_approp']].copy() + policy_nonenv = ( + df[(df['is_environmental'] == 0) & ~df['is_approp']] + .nlargest(top_n_nonenv, 'n_clients') + .copy() + ) + + def _group(row): + if row['is_environmental'] == 1: + return 'Environmental' + if row['is_approp']: + return 'Appropriations bill' + return f'Non-env policy (top {top_n_nonenv})' + + plot_df = pd.concat([env, approp, policy_nonenv], ignore_index=True) + plot_df['group'] = plot_df.apply(_group, axis=1) + # log1p for y so bills with 0 clients don't vanish; displayed as n_clients + plot_df['n_clients_log'] = np.log1p(plot_df['n_clients']) + # Short title for hover name (shown bold at top) + plot_df['title_short'] = plot_df['bill_title'].str.slice(0, 90) + + color_map = { + 'Environmental': '#2ca02c', + 'Appropriations bill': '#9467bd', + f'Non-env policy (top {top_n_nonenv})': '#aaaaaa', + } + + fig = px.scatter( + plot_df, + x='env_relevance_score', + y='n_clients', + color='group', + color_discrete_map=color_map, + hover_name='title_short', + hover_data={ + 'title_short': False, + 'bill_title': False, + 'is_environmental': False, + 'is_approp': False, + 'group': False, + 'n_clients_log': False, + 'env_relevance_score': ':.3f', + 'n_clients': True, + 'bill_id': True, + 'general_court': True, + }, + marginal_x='histogram', + marginal_y='histogram', + labels={ + 'env_relevance_score': 'Environmental relevance score', + 'n_clients': 'Unique lobbying clients', + 'bill_id': 'Bill ID', + 'general_court': 'General Court', + 'group': '', + }, + title=( + 'Environmental Relevance vs. Lobbying Intensity
' + f'All env bills · top {top_n_nonenv} non-env policy bills · ' + 'appropriations bills shown separately · hover for title' + ), + opacity=0.72, + width=820, + height=620, + ) + + # Env dots: slightly larger, outlined + fig.update_traces( + selector=dict(type='scatter', name='Environmental'), + marker=dict(size=8, line=dict(color='black', width=0.8)), + ) + fig.update_traces( + selector=dict(type='scatter', name='Appropriations bill'), + marker=dict(size=5), + ) + fig.update_traces( + selector=dict(type='scatter', name=f'Non-env policy (top {top_n_nonenv})'), + marker=dict(size=5), + ) + + # Threshold line on main scatter and top x-marginal. + # Plain add_vline without row/col — plotly draws it at x=0.05 on each subplot's + # own x-axis. The right marginal histogram's x-axis is in units of bill count + # (0–300+), so x=0.05 lands at the invisible left edge there. No row/col + # specification avoids the axis-matching infinite-loop bug in plotly express + # marginal figures. + fig.add_vline( + x=0.05, line_dash='dot', line_color='#2ca02c', line_width=1.2, + annotation_text='env threshold (0.05)', + annotation_position='top right', + annotation_font_size=10, + ) + + # Log scale on main scatter y-axis. + # Must NOT use log_y=True in px.scatter — it transforms a shared axis in a + # way that breaks marginal histogram rendering. + # In px.scatter with marginal_x + marginal_y, yaxis is the main scatter y-axis + # and yaxis2 (right marginal) has matches='y', so both get log together which + # is correct: the marginal histogram's n_clients axis stays in sync. 
+ fig.update_layout(yaxis=dict( + type='log', + tickmode='array', + tickvals=[1, 2, 5, 10, 20, 50, 100, 200, 350], + ticktext=['1', '2', '5', '10', '20', '50', '100', '200', '350'], + )) + + fig.update_layout( + legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0), + plot_bgcolor='#f8f8f8', + paper_bgcolor='white', + ) + + out = Path(CHART_DIR) / f'{prefix}lobbying_env_score_vs_clients.html' + html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True}) + out.write_text('{% raw %}\n' + html + '\n{% endraw %}\n', encoding='utf-8') + print(f'Wrote {prefix}lobbying_env_score_vs_clients.html') + + +def _ordinal(n: int) -> str: + """1 -> '1st', 2 -> '2nd', 186 -> '186th'.""" + if 10 <= n % 100 <= 20: + suf = 'th' + else: + suf = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') + return f'{n}{suf}' + + +def _write_post_facts(engine, lobby_bills: pd.DataFrame, leg_bills: pd.DataFrame): + """Write analysis-post-specific facts to facts_lobbying_post.yml so the blog + post never hardcodes numbers. Any figure cited in the post is generated here. + + The pipeline's facts_lobbying.yml already holds the headline figures + (total spend, env count, env %, employer counts); this file adds the + session-growth and opposition facts the narrative relies on. + """ + facts: dict = {} + + # Number of distinct permitted CSO operators in the EEA portal (cited in the + # CSO-operators section of the post). Mirrors the lobbying_cso_operators chart's + # source query. 
+ try: + n_ops = pd.read_sql_query( + 'SELECT COUNT(DISTINCT permiteeName) AS n FROM MAEEADP_CSO ' + 'WHERE permiteeName IS NOT NULL', engine, + )['n'].iloc[0] + facts['post_cso_n_operators'] = int(n_ops) + except Exception as e: + print(f' CSO operator count query failed: {e}') + + # GC -> calendar years (GC183 = 2003-2004; each spans two years) + def _gc_years(gc: int) -> str: + start = 2003 + (gc - 183) * 2 + return f'{start}–{start + 1}' + + if (not leg_bills.empty and 'is_environmental' in leg_bills.columns + and not lobby_bills.empty): + env = leg_bills[leg_bills['is_environmental'] == 1][['bill_number', 'general_court']] + m = lobby_bills.merge(env, on=['bill_number', 'general_court']) + if not m.empty: + per_gc = ( + m.groupby('general_court') + .agg(env_bills=('bill_number', 'nunique'), + employers=('entity_name', 'nunique')) + .reset_index() + ) + # Floor at GC186 (2009-2010): the 184th-185th (2005-2008) sessions + # only have entity-level salary totals (no per-client breakdown), so + # their bill/employer counts are sparse and not comparable for a + # growth narrative. GC186 is the first session with per-client data. + per_gc = per_gc[per_gc['general_court'].between(186, 210)] + # First session with data, and the most recent COMPLETE session + # (drop the current in-progress one if a later partial exists). 
+ first = per_gc.sort_values('general_court').iloc[0] + complete = per_gc[per_gc['general_court'] < per_gc['general_court'].max()] + recent = (complete if not complete.empty else per_gc).sort_values('general_court').iloc[-1] + fg, rg = int(first['general_court']), int(recent['general_court']) + facts['post_first_session_gc'] = _ordinal(fg) + facts['post_first_session_years'] = _gc_years(fg) + facts['post_first_session_env_bills'] = int(first['env_bills']) + facts['post_first_session_employers'] = int(first['employers']) + facts['post_recent_session_gc'] = _ordinal(rg) + facts['post_recent_session_years'] = _gc_years(rg) + facts['post_recent_session_env_bills'] = int(recent['env_bills']) + facts['post_recent_session_employers'] = int(recent['employers']) + if first['env_bills']: + facts['post_env_bills_growth_x'] = round(recent['env_bills'] / first['env_bills'], 1) + if first['employers']: + facts['post_employers_growth_x'] = round(recent['employers'] / first['employers'], 1) + + # Top opposition pair: clients on opposite Support/Oppose sides of the + # same environmental bill, by number of distinct bills (matches the + # lobbying_opposition_pairs chart logic). 
+ if 'position' in m.columns: + sup = (m[m['position'] == 'Support'][['bill_number', 'general_court', 'client_name']] + .drop_duplicates().rename(columns={'client_name': 'a'})) + opp = (m[m['position'] == 'Oppose'][['bill_number', 'general_court', 'client_name']] + .drop_duplicates().rename(columns={'client_name': 'b'})) + pairs = sup.merge(opp, on=['bill_number', 'general_court']) + pairs = pairs[pairs['a'] != pairs['b']].copy() + if not pairs.empty: + lo = pairs[['a', 'b']].min(axis=1) + hi = pairs[['a', 'b']].max(axis=1) + pairs['lo'], pairs['hi'] = lo, hi + top = (pairs.groupby(['lo', 'hi'])['bill_number'].nunique() + .reset_index(name='n').nlargest(1, 'n').iloc[0]) + facts['post_top_opposition_a'] = top['lo'] + facts['post_top_opposition_b'] = top['hi'] + facts['post_top_opposition_bills'] = int(top['n']) + + with open(FACTS_YML, 'w') as f: + for k, v in facts.items(): + if isinstance(v, str): + f.write(f'{k}: "{v}"\n') + else: + f.write(f'{k}: {v}\n') + print(f'Wrote {FACTS_YML} ({len(facts)} post facts)') + + +def _load_parquet_llm() -> pd.DataFrame: + """Load bill parquet from local path (LLM columns: categories, tags, is_env_llm).""" + local = Path(CHART_DIR).parent / 'data' / 'MA_bill_embeddings.parquet' + if local.exists(): + return pd.read_parquet(local) + # Fallback to GCS + try: + import gcsfs + fs = gcsfs.GCSFileSystem() + with fs.open('gs://openamend-data/MA_bill_embeddings.parquet', 'rb') as f: + return pd.read_parquet(f) + except Exception as e: + print(f' Parquet load failed: {e}') + return pd.DataFrame() + + +def _make_env_lobby_bills(parquet_df: pd.DataFrame, lobby_bills: pd.DataFrame) -> pd.DataFrame: + """Merge parquet LLM env flag (is_env_llm) onto lobby_bills rows. + + Joins on (bill_id, general_court) when both sides have bill_id to avoid + cross-prefix contamination. Falls back to (bill_number, general_court) + for rows without bill_id. 
+ """ + if parquet_df.empty or lobby_bills.empty: + return pd.DataFrame() + env_pq = parquet_df[parquet_df['is_env_llm'] == True].copy() + lb = lobby_bills.copy() + lb['bill_number'] = pd.to_numeric(lb['bill_number'], errors='coerce').astype('Int64') + lb['general_court'] = pd.to_numeric(lb['general_court'], errors='coerce').astype('Int64') + + has_bill_id = 'bill_id' in env_pq.columns and 'bill_id' in lb.columns + if has_bill_id: + env_pq['general_court'] = pd.to_numeric(env_pq['general_court'], errors='coerce').astype('Int64') + env_id = env_pq[['bill_id', 'general_court']].dropna(subset=['bill_id']) + env_num = env_pq[env_pq['bill_id'].isna()][['bill_number', 'general_court']] + env_num['bill_number'] = pd.to_numeric(env_num['bill_number'], errors='coerce').astype('Int64') + lb_with_id = lb[lb['bill_id'].notna()] + lb_no_id = lb[lb['bill_id'].isna()] + parts = [] + if not lb_with_id.empty and not env_id.empty: + parts.append(lb_with_id.merge(env_id, on=['bill_id', 'general_court'], how='inner')) + if not lb_no_id.empty and not env_num.empty: + parts.append(lb_no_id.merge(env_num, on=['bill_number', 'general_court'], how='inner')) + return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame() + else: + env_ids = env_pq[['bill_number', 'general_court']].copy() + env_ids['bill_number'] = pd.to_numeric(env_ids['bill_number'], errors='coerce').astype('Int64') + env_ids['general_court'] = pd.to_numeric(env_ids['general_court'], errors='coerce').astype('Int64') + return lb.merge(env_ids, on=['bill_number', 'general_court'], how='inner') + + +def _chart_env_categories_by_gc(parquet_df: pd.DataFrame, prefix: str): + """Stacked bar: env bill count by LLM category, by general court. + + Uses is_env_llm from parquet to select environmental bills. + Each bill may belong to multiple categories (JSON list); one count + per (bill, category) — bills counted once per unique category they appear in. 
+ X-axis = General Court (session), stacked by top-5 categories + 'Other'. + """ + import json as _json + + if parquet_df.empty or 'is_env_llm' not in parquet_df.columns: + return + + env = parquet_df[parquet_df['is_env_llm'] == True].copy() + if env.empty: + return + + # Explode categories + rows = [] + for _, row in env.iterrows(): + gc = row.get('general_court') + cats_raw = row.get('categories') + if pd.isna(gc) or cats_raw is None: + continue + try: + cats = _json.loads(cats_raw) if isinstance(cats_raw, str) else cats_raw + except Exception: + cats = [] + if not isinstance(cats, list) or not cats: + cats = ['Unknown'] + gc_int = int(gc) + # Deduplicate categories per bill + for cat in set(cats): + rows.append({'general_court': gc_int, 'category': cat}) + + if not rows: + return + + cat_df = pd.DataFrame(rows) + + # Top 5 categories by total count + top_cats = ( + cat_df['category'].value_counts() + .head(5) + .index.tolist() + ) + + cat_df['cat_label'] = cat_df['category'].apply( + lambda c: c if c in top_cats else 'Other' + ) + + pivot = ( + cat_df.groupby(['general_court', 'cat_label']) + .size() + .unstack(fill_value=0) + .sort_index() + ) + + # Column order: top cats in size order, then Other + ordered = [c for c in top_cats if c in pivot.columns] + if 'Other' in pivot.columns: + ordered.append('Other') + pivot = pivot[ordered] + + gcs = pivot.index.tolist() + cat_colors = [BLUE, ORANGE, GREEN, RED, PURPLE, TEAL, GREY] + + c = chartjs.Chart( + 'Environmental Bills by Topic Category and Legislative Session', + 'Bar', width=720, height=400, + ) + c.set_labels([f'GC{gc}' for gc in gcs]) + for i, col in enumerate(pivot.columns): + c.add_dataset( + pivot[col].tolist(), col, + backgroundColor=f"'{cat_colors[i % len(cat_colors)]}'", + stack="'cat'", + ) + c.set_params( + js_inline=False, + ylabel='Unique environmental bills', + xlabel='General Court (legislative session)', + stacked=True, + ) + 
c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_env_categories_by_gc.html') + print(f'Wrote {prefix}lobbying_env_categories_by_gc.html') + + +def _chart_gc_trend(parquet_df: pd.DataFrame, lobby_bills: pd.DataFrame, prefix: str): + """Bar + line (dual axis): unique env bills (bars) and unique clients (line) per General Court. + + Uses LLM env flag from parquet for env bill identification. + """ + env_lb = _make_env_lobby_bills(parquet_df, lobby_bills) + if env_lb.empty: + return + + count_col = 'bill_id' if 'bill_id' in env_lb.columns else 'bill_number' + gc_bills = ( + env_lb.groupby('general_court')[count_col] + .nunique() + .reset_index(name='n_env_bills') + .sort_values('general_court') + ) + gc_clients = ( + env_lb.groupby('general_court')['client_name'] + .nunique() + .reset_index(name='n_env_clients') + ) + gc_trend = gc_bills.merge(gc_clients, on='general_court') + gc_trend = gc_trend[gc_trend['general_court'] > 180].sort_values('general_court') + + if gc_trend.empty: + return + + gcs = gc_trend['general_court'].astype(int).tolist() + n_bills = gc_trend['n_env_bills'].tolist() + n_clients = gc_trend['n_env_clients'].tolist() + + c = chartjs.Chart( + 'Environmental Lobbying Engagement by Legislative Session', + 'Bar', width=720, height=380, + ) + c.set_labels([f'GC{g}' for g in gcs]) + c.add_dataset(n_bills, 'Unique env bills lobbied', + backgroundColor=f"'{TEAL}'", yAxisID="'y'") + c.add_dataset(n_clients, 'Unique employer clients', + backgroundColor=f"'{ORANGE}'", type="'line'", yAxisID="'y1'") + c.set_params( + js_inline=False, + ylabel='Unique environmental bills', + xlabel='General Court (legislative session)', + y2nd=1, + y2nd_title='Unique employer clients', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_gc_trend.html') + print(f'Wrote {prefix}lobbying_gc_trend.html') + + +def _chart_employer_env_scatter(parquet_df: pd.DataFrame, lobby_bills: pd.DataFrame, + employers: pd.DataFrame, prefix: str, + min_bills: int = 10): + """Plotly scatter: total lobbying spend (x) vs env bill 
share (y) per client. + + Each point is one lobbying client (employer). Clients with fewer than + `min_bills` total bills are excluded to remove noise from one-off filers. + Point size scales with total env bills. Hover shows client name, totals, + and average env fraction. + """ + import plotly.express as px + + env_lb = _make_env_lobby_bills(parquet_df, lobby_bills) + if env_lb.empty or employers.empty: + return + + pair_keys = ['entity_name', 'client_name', 'year'] + lb = lobby_bills.copy() + lb['bill_number'] = pd.to_numeric(lb['bill_number'], errors='coerce').astype('Int64') + lb['year'] = pd.to_numeric(lb['year'], errors='coerce').astype('Int64') + env_lb2 = env_lb.copy() + env_lb2['year'] = pd.to_numeric(env_lb2['year'], errors='coerce').astype('Int64') + + count_col = 'bill_id' if 'bill_id' in lb.columns else 'bill_number' + all_counts = lb.groupby(pair_keys)[count_col].nunique().reset_index(name='n_all') + env_counts = env_lb2.groupby(pair_keys)[count_col].nunique().reset_index(name='n_env') + fracs = all_counts.merge(env_counts, on=pair_keys, how='left') + fracs['n_env'] = fracs['n_env'].fillna(0) + fracs['env_frac'] = fracs['n_env'] / fracs['n_all'].replace(0, np.nan) + + emp = employers[employers['client_name'] != 'Total salaries received'].copy() + emp['year'] = pd.to_numeric(emp['year'], errors='coerce').astype('Int64') + emp['compensation'] = pd.to_numeric(emp['compensation'], errors='coerce').fillna(0) + + merged = emp.merge(fracs, on=pair_keys, how='inner') + merged['env_spend'] = merged['compensation'] * merged['env_frac'].fillna(0) + + client_stats = merged.groupby('client_name').agg( + total_spend=('compensation', 'sum'), + total_env_spend=('env_spend', 'sum'), + total_bills=('n_all', 'sum'), + total_env_bills=('n_env', 'sum'), + ).reset_index() + client_stats['avg_env_frac'] = ( + client_stats['total_env_bills'] / client_stats['total_bills'].replace(0, np.nan) + ) + client_stats = client_stats[client_stats['total_bills'] >= min_bills].copy() + 
+ if client_stats.empty: + return + + # Classify by env fraction + def _sector(row): + f = row['avg_env_frac'] + if f >= 0.8: + return 'Primarily env (≥80%)' + elif f >= 0.4: + return 'Mixed env (40–80%)' + elif f >= 0.1: + return 'Occasional env (10–40%)' + else: + return 'Rarely env (<10%)' + + client_stats['sector'] = client_stats.apply(_sector, axis=1) + sector_order = [ + 'Primarily env (≥80%)', + 'Mixed env (40–80%)', + 'Occasional env (10–40%)', + 'Rarely env (<10%)', + ] + color_map = { + 'Primarily env (≥80%)': '#2ca02c', + 'Mixed env (40–80%)': '#1f77b4', + 'Occasional env (10–40%)': '#ff7f0e', + 'Rarely env (<10%)': '#aaaaaa', + } + + client_stats['spend_k'] = (client_stats['total_spend'] / 1e3).round(1) + client_stats['env_pct'] = (client_stats['avg_env_frac'] * 100).round(1) + client_stats['env_bills_int'] = client_stats['total_env_bills'].astype(int) + # Bubble size: sqrt of total env bills (capped) + client_stats['bubble_size'] = np.sqrt(client_stats['total_env_bills'].clip(1, 200)) * 1.5 + + fig = px.scatter( + client_stats, + x='spend_k', + y='env_pct', + color='sector', + color_discrete_map=color_map, + category_orders={'sector': sector_order}, + size='bubble_size', + size_max=22, + hover_name='client_name', + hover_data={ + 'client_name': False, + 'bubble_size': False, + 'sector': False, + 'spend_k': ':.0f', + 'env_pct': ':.1f', + 'env_bills_int': True, + 'total_bills': True, + }, + labels={ + 'spend_k': 'Total lobbying spend ($K, all years)', + 'env_pct': 'Share of bills that are environmental (%)', + 'env_bills_int': 'Env bills lobbied', + 'total_bills': 'Total bills lobbied', + 'sector': '', + }, + title=( + 'Lobbying Clients: Total Spend vs. Environmental Focus
' + f'{len(client_stats):,} clients with ≥{min_bills} bills · ' + 'bubble size ∝ √(env bills) · hover for details' + ), + opacity=0.75, + width=820, + height=560, + ) + fig.update_layout( + legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='left', x=0), + plot_bgcolor='#f8f8f8', + paper_bgcolor='white', + xaxis=dict(title='Total lobbying spend ($K, all years)'), + yaxis=dict(title='Share of bills that are environmental (%)', range=[-2, 102]), + ) + out = Path(CHART_DIR) / f'{prefix}lobbying_employer_env_scatter.html' + html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True}) + out.write_text('{% raw %}\n' + html + '\n{% endraw %}\n', encoding='utf-8') + print(f'Wrote {prefix}lobbying_employer_env_scatter.html') + + +def _chart_opposition_pairs(parquet_df: pd.DataFrame, lobby_bills: pd.DataFrame, prefix: str, + top_n: int = 15): + """Horizontal bar: employer pairs most frequently on opposite sides of env bills. + + Self-joins lobby_bills on (bill_number, general_court) to find (supporter, opposer) + pairs for environmentally-relevant bills, then counts unique bills per pair. 
+ """ + env_lb = _make_env_lobby_bills(parquet_df, lobby_bills) + if env_lb.empty or 'position' not in env_lb.columns: + return + + bill_key = ['bill_id', 'general_court'] if 'bill_id' in env_lb.columns else ['bill_number', 'general_court'] + supporters = ( + env_lb[env_lb['position'] == 'Support'] + [bill_key + ['client_name']] + .drop_duplicates() + .rename(columns={'client_name': 'supporter'}) + ) + opponents = ( + env_lb[env_lb['position'] == 'Oppose'] + [bill_key + ['client_name']] + .drop_duplicates() + .rename(columns={'client_name': 'opposer'}) + ) + + pairs = supporters.merge(opponents, on=bill_key) + pairs = pairs[pairs['supporter'] != pairs['opposer']].copy() + + if pairs.empty: + return + + # Canonical ordering: smaller string first + pairs['a'] = pairs[['supporter', 'opposer']].min(axis=1) + pairs['b'] = pairs[['supporter', 'opposer']].max(axis=1) + + count_col = bill_key[0] # bill_id if available, else bill_number + pair_counts = ( + pairs.groupby(['a', 'b'])[count_col] + .nunique() + .reset_index(name='n_bills') + .nlargest(top_n, 'n_bills') + .sort_values('n_bills') # ascending for horizontal bar + ) + + if pair_counts.empty: + return + + # Short labels: truncate to 35 chars each + def _short(s, n=35): + return s if len(s) <= n else s[:n - 1] + '…' + + labels = [ + f'{_short(r["a"])} vs {_short(r["b"])}' + for _, r in pair_counts.iterrows() + ] + + c = chartjs.Chart( + f'Top {top_n} Most-Opposed Employer Pairs on Environmental Bills', + 'HorizontalBar', width=780, height=520, + ) + c.set_labels(labels) + c.add_dataset( + pair_counts['n_bills'].tolist(), + 'Unique env bills where they opposed each other', + backgroundColor=f"'{RED}'", + ) + c.set_params( + js_inline=False, + ylabel='', + xlabel='Unique environmental bills (as opposing parties)', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_opposition_pairs.html') + print(f'Wrote {prefix}lobbying_opposition_pairs.html') + + +def _chart_top_env_tags(parquet_df: pd.DataFrame, prefix: str, top_n: 
int = 15): + """Horizontal bar: most common LLM-assigned tags for environmental bills.""" + import json as _json + from collections import Counter + + if parquet_df.empty or 'is_env_llm' not in parquet_df.columns: + return + + env = parquet_df[parquet_df['is_env_llm'] == True] + all_tags: list = [] + for t in env['tags'].dropna(): + try: + tags = _json.loads(t) if isinstance(t, str) else t + if isinstance(tags, list): + all_tags.extend(tags) + except Exception: + pass + + if not all_tags: + return + + tag_counts = Counter(all_tags) + top_tags = tag_counts.most_common(top_n) + # Reverse for ascending horizontal bar + top_tags = list(reversed(top_tags)) + + labels = [t[0] for t in top_tags] + counts = [t[1] for t in top_tags] + + c = chartjs.Chart( + f'Top {top_n} Tags on Environmental Bills (LLM-assigned)', + 'HorizontalBar', width=720, height=480, + ) + c.set_labels(labels) + c.add_dataset(counts, 'Bills with tag', backgroundColor=f"'{TEAL}'") + c.set_params( + js_inline=False, + ylabel='', + xlabel='Number of environmental bills', + ) + c.jekyll_write(f'{CHART_DIR}/{prefix}lobbying_top_env_tags.html') + print(f'Wrote {prefix}lobbying_top_env_tags.html') + + +if __name__ == '__main__': + _db = Path(__file__).parent.parent / 'get_data' / 'AMEND.db' + engine = create_engine(f'sqlite:///{_db}') + generate_charts(engine, prefix='') + generate_post_charts(engine, prefix='') diff --git a/analysis/dashboard_charts.py b/analysis/dashboard_charts.py index 361438b..fd4db45 100644 --- a/analysis/dashboard_charts.py +++ b/analysis/dashboard_charts.py @@ -62,6 +62,6 @@ # --- MS4 stormwater compliance charts (3 charts) --- MS4_compliance_viz.generate_charts(engine, prefix=PREFIX) -# --- Lobbying charts (4 charts; skipped until MA_lobbying_viz.py is available) --- +# --- Lobbying charts (gracefully skipped if the viz module is unavailable) --- if _LOBBYING_VIZ_AVAILABLE: MA_lobbying_viz.generate_charts(engine, prefix=PREFIX) diff --git 
a/docs/_includes/charts/dash_lobbying_bill_intensity.html b/docs/_includes/charts/dash_lobbying_bill_intensity.html new file mode 100644 index 0000000..d151c1a --- /dev/null +++ b/docs/_includes/charts/dash_lobbying_bill_intensity.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Bills Lobbied per Year + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/dash_lobbying_spend_by_cluster.html b/docs/_includes/charts/dash_lobbying_spend_by_cluster.html new file mode 100644 index 0000000..d26a1a8 --- /dev/null +++ b/docs/_includes/charts/dash_lobbying_spend_by_cluster.html @@ -0,0 +1,73 @@ +{% raw %} + + + MA Lobbying Spend by Topic Cluster + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/dash_lobbying_spend_trend.html b/docs/_includes/charts/dash_lobbying_spend_trend.html new file mode 100644 index 0000000..4ee56d6 --- /dev/null +++ b/docs/_includes/charts/dash_lobbying_spend_trend.html @@ -0,0 +1,73 @@ +{% raw %} + + + Annual MA Lobbying Spend on Environmental Bills + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/dash_lobbying_top_employers.html b/docs/_includes/charts/dash_lobbying_top_employers.html new file mode 100644 index 0000000..aa287cf --- /dev/null +++ b/docs/_includes/charts/dash_lobbying_top_employers.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 15 MA Lobbying Clients — 2025 + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/dash_lobbying_vs_enforcement.html b/docs/_includes/charts/dash_lobbying_vs_enforcement.html new file mode 100644 index 0000000..bd4d79f --- /dev/null +++ b/docs/_includes/charts/dash_lobbying_vs_enforcement.html @@ -0,0 +1,73 @@ +{% raw %} + + + MA Lobbying Spend vs. Enforcement Actions + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_bill_intensity.html b/docs/_includes/charts/lobbying_bill_intensity.html new file mode 100644 index 0000000..d151c1a --- /dev/null +++ b/docs/_includes/charts/lobbying_bill_intensity.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Bills Lobbied per Year + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_bill_pass_by_spend_tier.html b/docs/_includes/charts/lobbying_bill_pass_by_spend_tier.html new file mode 100644 index 0000000..6413a22 --- /dev/null +++ b/docs/_includes/charts/lobbying_bill_pass_by_spend_tier.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Bill Pass Rate by Lobbying Intensity + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_bill_tsne.html b/docs/_includes/charts/lobbying_bill_tsne.html new file mode 100644 index 0000000..fd38659 --- /dev/null +++ b/docs/_includes/charts/lobbying_bill_tsne.html @@ -0,0 +1,4 @@ +{% raw %} +
+
+{% endraw %} diff --git a/docs/_includes/charts/lobbying_bill_umap_env.html b/docs/_includes/charts/lobbying_bill_umap_env.html new file mode 100644 index 0000000..dd37fe2 --- /dev/null +++ b/docs/_includes/charts/lobbying_bill_umap_env.html @@ -0,0 +1,4 @@ +{% raw %} +
+
+{% endraw %} diff --git a/docs/_includes/charts/lobbying_bill_umap_summary.html b/docs/_includes/charts/lobbying_bill_umap_summary.html new file mode 100644 index 0000000..8a5078a --- /dev/null +++ b/docs/_includes/charts/lobbying_bill_umap_summary.html @@ -0,0 +1,4 @@ +{% raw %} +
+
+{% endraw %} diff --git a/docs/_includes/charts/lobbying_cso_operators.html b/docs/_includes/charts/lobbying_cso_operators.html new file mode 100644 index 0000000..bb2a622 --- /dev/null +++ b/docs/_includes/charts/lobbying_cso_operators.html @@ -0,0 +1,73 @@ +{% raw %} + + + Total Lobbying Spend by Known CSO Operators + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_employer_env_scatter.html b/docs/_includes/charts/lobbying_employer_env_scatter.html new file mode 100644 index 0000000..66b57c3 --- /dev/null +++ b/docs/_includes/charts/lobbying_employer_env_scatter.html @@ -0,0 +1,4 @@ +{% raw %} +
+
+{% endraw %} diff --git a/docs/_includes/charts/lobbying_env_categories_by_gc.html b/docs/_includes/charts/lobbying_env_categories_by_gc.html new file mode 100644 index 0000000..d32fdc2 --- /dev/null +++ b/docs/_includes/charts/lobbying_env_categories_by_gc.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Bills by Topic Category and Legislative Session + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_env_cluster_share.html b/docs/_includes/charts/lobbying_env_cluster_share.html new file mode 100644 index 0000000..0ad867c --- /dev/null +++ b/docs/_includes/charts/lobbying_env_cluster_share.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Lobbying Spend by Topic Cluster + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_env_opponents.html b/docs/_includes/charts/lobbying_env_opponents.html new file mode 100644 index 0000000..9e94d60 --- /dev/null +++ b/docs/_includes/charts/lobbying_env_opponents.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 20 Clients Opposing Environmental Bills (all years) + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_env_positions.html b/docs/_includes/charts/lobbying_env_positions.html new file mode 100644 index 0000000..e73fea5 --- /dev/null +++ b/docs/_includes/charts/lobbying_env_positions.html @@ -0,0 +1,73 @@ +{% raw %} + + + Unique Clients by Position on Environmental Bills + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_env_score_vs_clients.html b/docs/_includes/charts/lobbying_env_score_vs_clients.html new file mode 100644 index 0000000..b67c7eb --- /dev/null +++ b/docs/_includes/charts/lobbying_env_score_vs_clients.html @@ -0,0 +1,4 @@ +{% raw %} +
+
+{% endraw %} diff --git a/docs/_includes/charts/lobbying_gc_trend.html b/docs/_includes/charts/lobbying_gc_trend.html new file mode 100644 index 0000000..eac1b3f --- /dev/null +++ b/docs/_includes/charts/lobbying_gc_trend.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Lobbying Engagement by Legislative Session + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_opposition_pairs.html b/docs/_includes/charts/lobbying_opposition_pairs.html new file mode 100644 index 0000000..35ae068 --- /dev/null +++ b/docs/_includes/charts/lobbying_opposition_pairs.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 15 Most-Opposed Employer Pairs on Environmental Bills + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_pass_by_position.html b/docs/_includes/charts/lobbying_pass_by_position.html new file mode 100644 index 0000000..51243cc --- /dev/null +++ b/docs/_includes/charts/lobbying_pass_by_position.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Bill Pass Rate by Lobbying Position + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_spend_by_cluster.html b/docs/_includes/charts/lobbying_spend_by_cluster.html new file mode 100644 index 0000000..d26a1a8 --- /dev/null +++ b/docs/_includes/charts/lobbying_spend_by_cluster.html @@ -0,0 +1,73 @@ +{% raw %} + + + MA Lobbying Spend by Topic Cluster + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_spend_trend.html b/docs/_includes/charts/lobbying_spend_trend.html new file mode 100644 index 0000000..4ee56d6 --- /dev/null +++ b/docs/_includes/charts/lobbying_spend_trend.html @@ -0,0 +1,73 @@ +{% raw %} + + + Annual MA Lobbying Spend on Environmental Bills + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_spend_vs_budget.html b/docs/_includes/charts/lobbying_spend_vs_budget.html new file mode 100644 index 0000000..dace674 --- /dev/null +++ b/docs/_includes/charts/lobbying_spend_vs_budget.html @@ -0,0 +1,73 @@ +{% raw %} + + + MA Lobbying Spend vs. DEP Budget (inflation-adjusted) + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_spend_vs_staff.html b/docs/_includes/charts/lobbying_spend_vs_staff.html new file mode 100644 index 0000000..f7826bd --- /dev/null +++ b/docs/_includes/charts/lobbying_spend_vs_staff.html @@ -0,0 +1,73 @@ +{% raw %} + + + Environmental Lobbying Spend vs. DEP Staff Headcount + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_top_employers.html b/docs/_includes/charts/lobbying_top_employers.html new file mode 100644 index 0000000..aa287cf --- /dev/null +++ b/docs/_includes/charts/lobbying_top_employers.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 15 MA Lobbying Clients — 2025 + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_top_env_employers.html b/docs/_includes/charts/lobbying_top_env_employers.html new file mode 100644 index 0000000..61819f9 --- /dev/null +++ b/docs/_includes/charts/lobbying_top_env_employers.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 20 Clients by Cumulative Environmental Lobbying Spend + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_top_env_tags.html b/docs/_includes/charts/lobbying_top_env_tags.html new file mode 100644 index 0000000..0bc76cb --- /dev/null +++ b/docs/_includes/charts/lobbying_top_env_tags.html @@ -0,0 +1,74 @@ +{% raw %} + + + Top 15 Tags on Environmental Bills (LLM-assigned) + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_includes/charts/lobbying_vs_enforcement.html b/docs/_includes/charts/lobbying_vs_enforcement.html new file mode 100644 index 0000000..bd4d79f --- /dev/null +++ b/docs/_includes/charts/lobbying_vs_enforcement.html @@ -0,0 +1,73 @@ +{% raw %} + + + MA Lobbying Spend vs. Enforcement Actions + + + +
+ + + + +
+ + +{% endraw %} diff --git a/docs/_posts/2026-05-22-ma-environmental-lobbying.md b/docs/_posts/2026-05-22-ma-environmental-lobbying.md new file mode 100644 index 0000000..c8fd352 --- /dev/null +++ b/docs/_posts/2026-05-22-ma-environmental-lobbying.md @@ -0,0 +1,164 @@ +--- +layout: post +title: "Who lobbies the Massachusetts Legislature on environmental policy?" +ancillary: 0 +--- + +*The [lobbying disclosure data used in this analysis]({{ site.url }}{{ site.baseurl }}/data/MA_lobbying.html) comes from the [Massachusetts Secretary of the Commonwealth's Lobbyist Public Search portal](https://www.sec.state.ma.us/LobbyistPublicSearch/), and is joined here against the [DEP staffing]({{ site.url }}{{ site.baseurl }}/data/MADEP_staff.html), [agency budget]({{ site.url }}{{ site.baseurl }}/data/ECOS_budget_history.html), and [sewage discharge]({{ site.url }}{{ site.baseurl }}/data/EEADP_all.html) datasets already archived in the [{{ site.data.site_config.site_abbrev }} database]({{ site.url }}{{ site.baseurl }}/data/index.html).* + +*[The code needed to reproduce this analysis using {{ site.data.site_config.site_abbrev }} data can be viewed and downloaded here](https://github.com/nesanders/MAenvironmentaldata/blob/master/analysis/MA_lobbying_viz.py).* + +Every employer that hires a lobbyist in Massachusetts is required to file public disclosures with the Secretary of the Commonwealth: who they retained, how much they paid, and which bills they tried to influence. The Secretary publishes these filings on the [Lobbyist Public Search portal](https://www.sec.state.ma.us/LobbyistPublicSearch/) going back to 2005. 
In {{ site.data.facts_lobbying.lobbying_most_recent_year }} alone, registered lobbyists and lobbying entities disclosed roughly **${{ site.data.facts_lobbying.lobbying_total_spend_latest | divided_by: 1000000 }} million** in client compensation; across the full {{ site.data.facts_lobbying.lobbying_first_year }}–{{ site.data.facts_lobbying.lobbying_most_recent_year }} period the total approaches **\${{ site.data.facts_lobbying.lobbying_total_spend_cumulative | divided_by: 1000000 }} million**. + +In this post we ask a narrower question: how much of that activity concerns environmental policy? We first have to decide which bills are environmentally relevant in the first place, and then we can look at who lobbies them, how the landscape has changed over nearly two decades, and whether lobbying intensity moves together with the regulatory capacity — staffing and budget — of the agency that environmental law actually empowers, the [Department of Environmental Protection (DEP)]({{ site.url }}{{ site.baseurl }}/data/MADEP_staff.html). + +The underlying dataset, including the full scraping and scoring methodology, is documented on the [MA lobbying data page]({{ site.url }}{{ site.baseurl }}/data/MA_lobbying.html). + +--- + +## What counts as an "environmental" bill? + +The subject tags that filers attach to their disclosures are unreliable for this purpose: a utility lobbying a wastewater bill may file it under "Utilities & Energy," while a developer opposing wetlands reform may file the same bill under "Land Use." Rather than trust those tags, we classify each lobbied bill directly from its text. + +We do this with a large language model. A `gemini-2.5-flash` model reads each bill's title and full text (retrieved from the [MA Legislature OpenAPI](https://malegislature.gov/api/swagger)), writes a plain-English summary, assigns it to a fixed taxonomy of policy categories and tags, and judges whether the bill is environmentally relevant. 
We treat this LLM judgment as the classification of record: in spot-checks it identified clearly environmental bills — the bottle bill, net metering, renewable portfolio standards — that a purely embedding-based similarity score missed. Of the **{{ site.data.facts_lobbying.lobbying_n_bills_total }}** distinct bills lobbied across the period, **{{ site.data.facts_lobbying.lobbying_n_env_bills }}** (about {{ site.data.facts_lobbying.lobbying_env_pct }}%) are flagged environmental. + +We also retain a secondary, embedding-based score for each bill — its differential cosine similarity to reference sets of known environmental and non-environmental bills — so that an analyst who prefers a continuous measure, or a different threshold, can use it. The full methodology, including the clustering pipeline, is documented on the [data page]({{ site.url }}{{ site.baseurl }}/data/MA_lobbying.html#environmental-relevance--taxonomy). + +### The policy landscape + +The chart below projects the environmental bills, together with a stratified sample of the remaining bills from each topic cluster, into two dimensions using [UMAP](https://umap-learn.readthedocs.io/) on the bill embeddings. The environmental bills are drawn as large outlined dots coloured by topic cluster; the sampled remaining bills are tiny grey background points that provide context. Hover over any point for the bill title. + +{% include charts/lobbying_bill_tsne.html %} + +We should be cautious in reading too much into the cluster geometry. Topic clusters reflect the dominant subject across all bills in the cluster, and no cluster is purely environmental — environmentally-relevant bills are scattered across several clusters, with the heaviest concentrations in the clean-energy and waste/recycling topic groups. MA legislative text is also unusually dense (bills share a great deal of boilerplate amendment language), so the two-dimensional projection compresses real structure; it is best read as a rough map rather than a precise one. 
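For readers who want a concrete picture of the secondary embedding score mentioned earlier in this section, the following is a minimal sketch rather than the pipeline's actual code. It assumes the score is the mean cosine similarity to the environmental reference set minus the mean cosine similarity to the non-environmental reference set; the function names and the two-dimensional toy vectors are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def differential_score(bill_vec, env_refs, non_env_refs):
    """Assumed definition: mean similarity to the environmental references
    minus mean similarity to the non-environmental references."""
    env_sim = sum(cosine(bill_vec, r) for r in env_refs) / len(env_refs)
    non_env_sim = sum(cosine(bill_vec, r) for r in non_env_refs) / len(non_env_refs)
    return env_sim - non_env_sim


# Toy 2-D "embeddings" (made up); real bill embeddings are much longer vectors.
env_refs = [[1.0, 0.0], [0.9, 0.1]]
non_env_refs = [[0.0, 1.0], [0.1, 0.9]]
wetlands_bill = [0.8, 0.2]
tax_bill = [0.1, 0.9]

wetlands_score = differential_score(wetlands_bill, env_refs, non_env_refs)
tax_score = differential_score(tax_bill, env_refs, non_env_refs)
```

Under this assumed definition a bill that sits closer to the environmental references scores above zero and one closer to the non-environmental references scores below zero, so an analyst applying their own cut-off would simply keep the bills whose score exceeds it.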
+ +--- + +## Environmental lobbying through the legislative sessions + +The number of environmental bills attracting lobbyist attention, and the number of distinct employers engaging on them, have both risen substantially over the period for which per-client data is available. + +{% include charts/lobbying_gc_trend.html %} + +In the {{ site.data.facts_lobbying_post.post_first_session_gc }} General Court ({{ site.data.facts_lobbying_post.post_first_session_years }}), **{{ site.data.facts_lobbying_post.post_first_session_env_bills }}** environmental bills were lobbied by **{{ site.data.facts_lobbying_post.post_first_session_employers }}** distinct employers. By the {{ site.data.facts_lobbying_post.post_recent_session_gc }} ({{ site.data.facts_lobbying_post.post_recent_session_years }}) — the most recent completed session — those figures had grown to **{{ site.data.facts_lobbying_post.post_recent_session_env_bills }}** bills and **{{ site.data.facts_lobbying_post.post_recent_session_employers }}** employers, roughly {{ site.data.facts_lobbying_post.post_env_bills_growth_x }}× and {{ site.data.facts_lobbying_post.post_employers_growth_x }}× their earlier levels. The trend tracks the arrival of major clean-energy and climate legislation on Beacon Hill over the 2010s and early 2020s, and the broader mainstreaming of climate policy as a subject of organized lobbying. + +*Note: the employer count measures unique lobbying clients per session, not individual lobbyists. A trade association that lobbies fifty bills counts once. Compensation is reported per registrant per six-month period rather than per bill, so all spend figures below that are attributed to individual bills rest on a proportional allocation, described in the caveats.* + +### What kinds of environmental legislation attract lobbying? + +The stacked bar below breaks the same sessions down by LLM-assigned policy category. A bill may span more than one category, and is counted once in each it is assigned. 
+ +{% include charts/lobbying_env_categories_by_gc.html %} + +"Environmental Protection" is the largest category throughout. The share tagged "Energy" has grown noticeably over the more recent sessions, consistent with the increasing volume of clean-energy legislation, and a steady third share concerns "Public and Natural Resources" — land, water, and fishing-rights bills. + +--- + +## Who lobbies environmental bills, and how focused are they? + +Not every employer that touches an environmental bill is primarily an environmental actor. Some file hundreds of bills a year across every policy domain, of which a single environmental bill is a small fraction; others are single-issue advocates for whom nearly every bill is environmental. + +{% include charts/lobbying_employer_env_scatter.html %} + +The scatter above plots each employer with at least ten total lobbied bills by their total lobbying spend (horizontal axis) and the fraction of their bills that are environmental (vertical axis); the size of each point scales with the number of environmental bills lobbied. Three groups stand out. In the upper right are high-spending, highly-focused clients — renewable-energy developers, the large electric and gas distribution utilities, and the major environmental advocacy organizations — the dominant repeat players in environmental policy. In the lower right are high-spending but low-focus clients, chiefly the broad business trade associations that lobby comprehensively and devote only a small share of their effort to environmental bills. In the upper left are lower-spending, highly-focused clients: newer clean-energy entrants and niche advocacy groups with a specific environmental mandate. + +### Top environmental lobbying spenders + +The chart below ranks employers by cumulative environmental lobbying spend, where an employer's environmental budget is estimated as their total compensation scaled by the fraction of their bills that are environmental. 
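The estimate described above can be sketched in a few lines. This is an illustration under stated assumptions, not the charting code itself: the helper name `estimate_env_spend` and the filing rows are hypothetical, and the sketch assumes the environmental fraction is computed separately for each client and year before the estimated amounts are summed.

```python
def estimate_env_spend(compensation, n_env_bills, n_all_bills):
    """Scale a compensation figure by the share of lobbied bills that are environmental."""
    if n_all_bills == 0:
        return 0.0
    return compensation * (n_env_bills / n_all_bills)


# Made-up filings: (client, year, compensation, env bills lobbied, all bills lobbied)
filings = [
    ("Client A", 2023, 120_000.0, 3, 12),
    ("Client A", 2024, 80_000.0, 4, 8),
    ("Client B", 2024, 50_000.0, 0, 5),
]

# Sum the per-year estimates into a cumulative figure per client.
totals = {}
for client, _year, compensation, n_env, n_all in filings:
    totals[client] = totals.get(client, 0.0) + estimate_env_spend(compensation, n_env, n_all)
```

In this toy data Client A's estimate is a quarter of its 2023 compensation plus half of its 2024 compensation, while Client B, which lobbied no environmental bills, is estimated at zero regardless of how much it spent overall.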
+ +{% include charts/lobbying_top_env_employers.html %} + +The leaders are a mix of regulated entities — distribution utilities and energy developers with a direct financial stake in energy legislation — and public-interest advocates. That both appear near the top is itself the central feature of environmental lobbying: it is a contested arena, not a one-sided one. + +--- + +## What gets lobbied, and how is it categorized? + +The LLM assigns structured tags to each bill from a fixed taxonomy. Across the environmental bills, the most common tags concern pollution control and environmental regulatory procedure — reflecting the large volume of bills that touch DEP's regulatory authority — followed by renewable energy and energy efficiency, the clean-energy cluster. + +{% include charts/lobbying_top_env_tags.html %} + +--- + +## Who opposes whom? + +Lobbying on environmental bills does not all run in the same direction. The position field in the disclosures records whether each client registered "Support," "Oppose," or "Neutral" on a given bill. By taking each environmental bill on which one client registered support and another registered opposition, we can count how often any two clients land on opposite sides. The chart below shows the fifteen employer pairs most frequently in direct opposition. + +{% include charts/lobbying_opposition_pairs.html %} + +The most frequent opposing pair is **{{ site.data.facts_lobbying_post.post_top_opposition_a }}** and **{{ site.data.facts_lobbying_post.post_top_opposition_b }}**, on opposite sides of **{{ site.data.facts_lobbying_post.post_top_opposition_bills }}** distinct environmental bills — an industry trade group and a public-interest advocacy organization recurring across the toxics, packaging, and consumer-protection bills where chemical and product regulation is at stake. 
More generally, the recurring pattern in these pairings is the large distribution utilities and statewide business associations on one side and clean-energy and environmental advocacy coalitions on the other. + +It is worth stressing what this chart does and does not show. "Opposing an environmental bill" is not the same as opposing environmental protection: a utility may oppose a clean-energy bill it considers technically flawed or cost-shifting, and an environmental group may oppose a bill it considers too weak. The pairs reflect patterns of organized engagement, not a pro- or anti-environment score. + +### Unique clients by position + +{% include charts/lobbying_env_positions.html %} + +--- + +## Lobbying spend and DEP capacity over time + +The comparison this combined dataset most naturally enables is between lobbying activity on environmental bills and the regulatory capacity of DEP — both its budget and its staffing — over the same years. We make no causal claim here; the question is simply whether the two move together. + +### Lobbying spend vs. DEP administrative budget + +{% include charts/lobbying_spend_vs_budget.html %} + +The DEP administrative budget is inflation-adjusted to recent dollars and is drawn from the MA Comptroller's [CTHRU]({{ site.url }}{{ site.baseurl }}/data/ECOS_budget_history.html) system back to FY2005, with earlier years from MassBudget's historical archive. + +### Lobbying spend vs. DEP staffing + +{% include charts/lobbying_spend_vs_staff.html %} + +[DEP headcount]({{ site.url }}{{ site.baseurl }}/data/MADEP_staff.html) is the annual count of unique employees with non-zero payroll in the MA Comptroller's payroll dataset. The vertical axis is a raw count of employees; the lobbying-spend axis is in millions of dollars. 
The two series do not track each other in any simple way — lobbying activity has risen fairly steadily, while DEP staffing has been comparatively flat — which is itself consistent with our earlier finding that [DEP enforcement activity has not kept pace]({% post_url 2017-04-02-dep-enforcements %}) with the regulatory demands placed on the agency. + +--- + +## Environmental lobbying spend by topic + +The chart below shows total annual lobbying spend allocated to environmental bills, stacked by topic cluster. As above, spend is allocated proportionally: a client that lobbied bills in two clusters has its annual compensation split between them. + +{% include charts/lobbying_env_cluster_share.html %} + +The clean-energy and waste/recycling clusters account for a growing share of allocated environmental spend over time, mirroring the shift in the categories chart above. + +--- + +## Does lobbying intensity predict bill passage? + +For each environmental bill we can count the number of distinct employers who lobbied it. Bills that attract many lobbying clients tend to be higher-stakes, but it is not obvious whether heavily-lobbied legislation is more or less likely to become law. + +{% include charts/lobbying_bill_pass_by_spend_tier.html %} + +{% include charts/lobbying_pass_by_position.html %} + +The relationship is weak and should be read cautiously. A higher pass rate among heavily-lobbied bills, where it appears, is as consistent with the mundane explanation — important, broadly-supported bills draw attention from every side — as with any story about the effectiveness of industry influence. Passage in the Massachusetts Legislature is determined by many factors the disclosure data does not capture. 
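The counting behind this comparison can be sketched as follows. It is a simplified illustration, not the code that produced the charts: the tier cut-offs, the function names, and the toy disclosure rows are all hypothetical.

```python
from collections import defaultdict


def tier(n_clients):
    """Bucket a bill by how many distinct clients lobbied it (hypothetical cut-offs)."""
    if n_clients <= 2:
        return "1-2 clients"
    if n_clients <= 9:
        return "3-9 clients"
    return "10+ clients"


def pass_rate_by_tier(disclosures, passed_bills):
    """disclosures: iterable of (bill_id, client) rows; passed_bills: set of bill ids."""
    clients_per_bill = defaultdict(set)
    for bill_id, client in disclosures:
        clients_per_bill[bill_id].add(client)  # a set ignores repeat filings

    counts = defaultdict(lambda: [0, 0])  # tier -> [bills passed, bills total]
    for bill_id, clients in clients_per_bill.items():
        label = tier(len(clients))
        counts[label][1] += 1
        counts[label][0] += bill_id in passed_bills
    return {label: passed / total for label, (passed, total) in counts.items()}


# Made-up rows: client "A" filed twice on H1 but is counted once.
disclosures = [
    ("H1", "A"), ("H1", "A"), ("H1", "B"),
    ("H2", "A"), ("H2", "B"), ("H2", "C"),
    ("S3", "C"),
]
passed = {"H2"}
rates = pass_rate_by_tier(disclosures, passed)
```

Counting distinct clients rather than filings keeps a single employer that files repeatedly on one bill from inflating that bill's apparent intensity.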
+ +--- + +## CSO operators and lobbying + +One of the environmental datasets already on this site is the record of [Combined Sewer Overflow (CSO) discharges]({{ site.url }}{{ site.baseurl }}/data/EEADP_all.html) — untreated and partially-treated sewage released into Massachusetts waterways, the subject of [earlier]({% post_url 2018-04-25-necir-cso-ej %}) [AMEND analyses]({% post_url 2023-10-20-eea-dp-cso-ej %}). The portal identifies {{ site.data.facts_lobbying_post.post_cso_n_operators }} permitted operators. When CSO-related bills come before the Legislature, do these operators lobby, and how aggressively? + +In practice, most municipal CSO operators do not lobby in their own name. They lobby through the [Massachusetts Municipal Association (MMA)](https://www.mma.org/), which represents nearly all 351 cities and towns on Beacon Hill. The chart below shows direct lobbying spend by known CSO permittees alongside the MMA as a proxy for the municipal sector. The MMA totals reflect *all* of its lobbying activity, not only CSO-related bills, so they are best read as a ceiling on potential municipal engagement on CSO policy rather than a measure of CSO-specific intensity. + +{% include charts/lobbying_cso_operators.html %} + +The operators that do appear directly tend to be the larger regional authorities that retain their own lobbying capacity, rather than individual cities and towns. + +--- + +## Caveats and limitations + +- **Spend allocation is approximate.** The disclosures report a single compensation figure per registrant per six-month period, not per-bill spend. Where we attribute spend to individual bills we allocate it proportionally across the bills the registrant disclosed lobbying. A client who spent most of their effort on one priority bill but listed ten others will have spend over-distributed to the secondary bills. The aggregate spend totals are not affected by this — only the per-bill and per-cluster allocations are. 
+- **Environmental classification is a model judgment, not ground truth.** The `is_environmental` flag reflects a Gemini 2.5 Flash assessment of each bill, not a domain-expert label. The model can over-classify bills that use environmental language incidentally, and under-classify bills that affect the environment indirectly. We expose both the LLM flag and the embedding similarity score so that analysts can apply their own threshold. +- **The earliest years are coarser.** The 2005–2008 disclosure format reports a single salary total per registrant with no per-client breakdown, and frequently omits bill titles; the per-client analyses in this post therefore begin with the 2009–2010 session. The MA Legislature API likewise does not serve bills before 2009, so environmental classification for the earliest years relies on disclosure titles alone, which are often blank in the legacy format. Some lobbied bills appear with no resolvable title from either source; these remain in the dataset as real lobbying activity but cannot be summarized or classified. +- **CSO-operator matching is fuzzy.** Substring matching captures operators that lobby under their own name; the MMA is included as an explicit proxy for the municipal sector, but its totals cover all of its work. Operators that retain commercial lobbying firms cannot be traced back to their underlying client from the disclosure data alone. + +--- + +## Reproducibility + +All charts on this page are generated by [`analysis/MA_lobbying_viz.py`](https://github.com/nesanders/MAenvironmentaldata/blob/master/analysis/MA_lobbying_viz.py), which reads the assembled SQLite database (`AMEND.db`) and the bill embeddings parquet. 
The scraping pipeline, the embedding and LLM scoring in [`get_data/score_lobbying_bills.py`](https://github.com/nesanders/MAenvironmentaldata/blob/master/get_data/score_lobbying_bills.py) and [`get_data/summarize_lobbying_bills.py`](https://github.com/nesanders/MAenvironmentaldata/blob/master/get_data/summarize_lobbying_bills.py), and the clustering in [`get_data/cluster_lobbying_bills.py`](https://github.com/nesanders/MAenvironmentaldata/blob/master/get_data/cluster_lobbying_bills.py) are all documented in [`get_data/README_lobbying.md`](https://github.com/nesanders/MAenvironmentaldata/blob/master/get_data/README_lobbying.md). + +The complete bill embeddings (768-dimensional vectors and full text) are persisted to `gs://openamend-data/MA_bill_embeddings.parquet` and are not committed to the repository; a lightweight scored CSV without embeddings is committed at [`docs/data/MA_lobbying_bills_scored.csv`]({{ site.url }}{{ site.baseurl }}/data/MA_lobbying_bills_scored.csv). diff --git a/docs/dashboard.md b/docs/dashboard.md index 9cc9993..73014f1 100644 --- a/docs/dashboard.md +++ b/docs/dashboard.md @@ -20,6 +20,7 @@ For full analysis and narrative context, follow the links in each section. [CSO Discharge Trends](#cso) · [303(d) Impaired Waters](#303d) · [MS4 Stormwater Compliance](#ms4) · +[Lobbying Spend](#lobbying) · [CSO Data Quality Indicator](#data-quality) · [Discharge by Watershed](#watershed) @@ -316,4 +317,38 @@ discharge over the full reporting period to date. --- + + +## Lobbying Spend on Environmental Bills + +Data: [MA Secretary of State lobbying disclosures]({{ site.url }}{{ site.baseurl }}/data/MA_lobbying.html), 2005–present. + +Lobbying disclosures filed semi-annually with the MA Secretary of State identify which organizations hired lobbyists, how much clients paid, and which specific bills were lobbied. 
Each bill's full text is classified for environmental relevance by a Google Gemini language model (with a secondary embedding-similarity score retained); charts show only employers that lobbied at least one environmentally relevant bill. Compensation figures reflect total payments from each client to each lobbying entity per year. + +Data last updated **{{ site.data.ts_update_MA_lobbying.updated | date: "%-d %B %Y" }}**. Refreshed automatically on a weekly basis; exits early when no new semi-annual filings are posted (filings are submitted twice yearly, so most weeks see no change). + +### Annual lobbying spend on environmental bills + +{% include charts/dash_lobbying_spend_trend.html %} + +### Top employers — most recent complete year + +{% include charts/dash_lobbying_top_employers.html %} + +### Environmental bills lobbied per year + +{% include charts/dash_lobbying_bill_intensity.html %} + +### Lobbying spend vs. enforcement actions + +{% include charts/dash_lobbying_vs_enforcement.html %} + +### Lobbying spend by topic cluster + +*Lobbying spend allocated across bill topic clusters (k-means on Gemini embeddings, labeled by Gemini Flash). Cluster labels are assigned once and updated manually when new years of data are added.* + +{% include charts/dash_lobbying_spend_by_cluster.html %} + +--- + *Charts regenerated weekly from the latest available data. 
Last update visible in the [Actions log](https://github.com/nesanders/MAenvironmentaldata/actions/workflows/update-charts.yml).* diff --git a/docs/data/facts_lobbying_post.yml b/docs/data/facts_lobbying_post.yml new file mode 100644 index 0000000..502178f --- /dev/null +++ b/docs/data/facts_lobbying_post.yml @@ -0,0 +1,14 @@ +post_cso_n_operators: 66 +post_first_session_gc: "186th" +post_first_session_years: "2009–2010" +post_first_session_env_bills: 112 +post_first_session_employers: 69 +post_recent_session_gc: "193rd" +post_recent_session_years: "2023–2024" +post_recent_session_env_bills: 685 +post_recent_session_employers: 928 +post_env_bills_growth_x: 6.1 +post_employers_growth_x: 13.4 +post_top_opposition_a: "American Chemistry Council" +post_top_opposition_b: "MASSPIRG" +post_top_opposition_bills: 55 diff --git a/get_data/cluster_pilot_summaries.py b/get_data/cluster_pilot_summaries.py new file mode 100644 index 0000000..91a15de --- /dev/null +++ b/get_data/cluster_pilot_summaries.py @@ -0,0 +1,318 @@ +"""K-means on summary embeddings for the 495-bill pilot, then recolour UMAP. 
+ +Run from get_data/: + /path/to/python -u cluster_pilot_summaries.py [--k N] +""" + +import argparse +import json +import time +from pathlib import Path + +import numpy as np +import pandas as pd +from sklearn.cluster import KMeans +from sklearn.metrics import silhouette_score +from sklearn.preprocessing import normalize + +DATA_DIR = Path('../docs/data') +LOCAL_PARQUET = DATA_DIR / 'MA_bill_embeddings.parquet' +GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet' +LABELS_CSV = DATA_DIR / 'MA_bill_cluster_labels.csv' +OUT_HTML = Path('../docs/_includes/charts/lobbying_bill_umap_summary.html') +API_KEY_PATH = Path('SECRET_GOOGLE_API_KEY') + +EMBEDDING_MODEL = 'gemini-embedding-2' +EMBEDDING_DIM = 768 +REQUEST_DELAY = 0.05 + +PALETTE_20 = [ + '#1f77b4','#ff7f0e','#2ca02c','#d62728','#9467bd', + '#8c564b','#e377c2','#7f7f7f','#bcbd22','#17becf', + '#aec7e8','#ffbb78','#98df8a','#ff9896','#c5b0d5', + '#c49c94','#f7b6d2','#c7c7c7','#dbdb8d','#9edae5', +] + + +def _load_parquet() -> pd.DataFrame: + try: + import gcsfs + fs = gcsfs.GCSFileSystem() + if fs.exists(GCS_PARQUET): + with fs.open(GCS_PARQUET, 'rb') as f: + df = pd.read_parquet(f) + print(f'Loaded {len(df)} rows from GCS') + return df + except Exception as e: + print(f'GCS failed ({e}), using local') + df = pd.read_parquet(LOCAL_PARQUET) + print(f'Loaded {len(df)} rows from local') + return df + + +def _embed_one(client, text: str) -> np.ndarray: + """Embed a single text with exponential backoff; returns zero vector on failure.""" + import random + import google.genai.types as types + for attempt in range(6): + try: + resp = client.models.embed_content( + model=EMBEDDING_MODEL, + contents=text, + config=types.EmbedContentConfig(output_dimensionality=EMBEDDING_DIM), + ) + return np.array(resp.embeddings[0].values, dtype=np.float32) + except Exception as e: + if attempt == 5: + print(f' embed failed: {e}') + return np.zeros(EMBEDDING_DIM, dtype=np.float32) + wait = (2 ** attempt) + random.uniform(0, 1) 
+ time.sleep(wait) + return np.zeros(EMBEDDING_DIM, dtype=np.float32) + + +def _embed_texts(client, texts: list[str], workers: int = 8) -> np.ndarray: + """Embed texts in parallel with a thread pool.""" + from concurrent.futures import ThreadPoolExecutor, as_completed + results = [None] * len(texts) + done = 0 + with ThreadPoolExecutor(max_workers=workers) as ex: + futures = {ex.submit(_embed_one, client, t): i for i, t in enumerate(texts)} + for fut in as_completed(futures): + i = futures[fut] + results[i] = fut.result() + done += 1 + if done % 100 == 0: + print(f' {done}/{len(texts)} embeddings...', flush=True) + return np.array(results, dtype=np.float32) + + +def _label_cluster(client, titles: list[str], k: int) -> str: + """Ask Gemini for a short topic label given up to 20 central bill titles.""" + import google.genai.types as types + bullet_list = '\n'.join(f'- {t}' for t in titles[:20]) + prompt = ( + f'These are Massachusetts legislative bill titles from a topic cluster:\n' + f'{bullet_list}\n\n' + 'Give a concise 3–6 word topic label that describes what these bills have in common. ' + 'Reply with just the label, no punctuation or quotes.' 
+ ) + try: + resp = client.models.generate_content( + model='gemini-2.5-flash', + contents=prompt, + config=types.GenerateContentConfig( + temperature=0, + thinking_config=types.ThinkingConfig(thinking_budget=0), + ), + ) + time.sleep(0.3) + return resp.text.strip() + except Exception as e: + print(f' label error: {e}') + return f'Cluster {k}' + + +def kmeans_sweep(emb_norm: np.ndarray, ks: list[int]) -> dict: + """Run k-means for each k, return silhouette scores.""" + results = {} + for k in ks: + km = KMeans(n_clusters=k, random_state=42, n_init=10) + labels = km.fit_predict(emb_norm) + sil = silhouette_score(emb_norm, labels, metric='cosine') + results[k] = {'sil': sil, 'model': km, 'labels': labels} + print(f' k={k:2d} silhouette={sil:.4f}') + return results + + +def make_umap(df_pilot: pd.DataFrame, emb_norm: np.ndarray, + cluster_labels_arr: np.ndarray, label_map: dict) -> None: + import umap as umap_lib + import plotly.graph_objects as go + + is_env = df_pilot['is_env_llm'].fillna(False).astype(bool).values + n_clusters = len(label_map) + + print(f'Running UMAP (n={len(emb_norm)}, cosine, n_neighbors=15, min_dist=0.1)...') + reducer = umap_lib.UMAP( + n_components=2, n_neighbors=15, min_dist=0.1, + metric='cosine', random_state=42, + ) + coords = reducer.fit_transform(emb_norm) + + fig = go.Figure() + + # One trace per cluster — non-env (small, semi-transparent) then env (large, + # outlined) so env dots render on top. Both use the same cluster colour. + for cid in sorted(label_map.keys()): + lbl = label_map[cid] + colour = PALETTE_20[cid % 20] + clust_mask = cluster_labels_arr == cid + + # Non-env slice + ne_mask = clust_mask & ~is_env + if ne_mask.sum(): + ne_df = df_pilot[ne_mask] + sc = coords[ne_mask] + fig.add_trace(go.Scatter( + x=sc[:, 0], y=sc[:, 1], mode='markers', + marker=dict(color=colour, size=7, opacity=0.45), + legendgroup=str(cid), + showlegend=False, + name=lbl, + hovertext=[ + f'{t}
<br>{s}<br>
cluster: {lbl}' + for t, s in zip( + ne_df['bill_title'].fillna(''), + ne_df['summary'].fillna('').str[:120], + ) + ], + hoverinfo='text', + )) + + # Env slice — larger, black outline, shown in legend + e_mask = clust_mask & is_env + n_env_in_cluster = e_mask.sum() + e_df = df_pilot[e_mask] + sc = coords[e_mask] + # Always add a trace for the legend entry (even if 0 env bills in cluster) + legend_label = f'{lbl} ({ne_mask.sum()} / 🌿{n_env_in_cluster})' + if e_mask.sum(): + fig.add_trace(go.Scatter( + x=sc[:, 0], y=sc[:, 1], mode='markers', + marker=dict(color=colour, size=13, opacity=0.95, + line=dict(color='black', width=1.2)), + legendgroup=str(cid), + showlegend=True, + name=legend_label, + hovertext=[ + f'{row["bill_title"]}
<br>🌿 env · cluster: {lbl}' + f'<br>
{str(row.get("summary",""))[:150]}' + for _, row in e_df.iterrows() + ], + hoverinfo='text', + )) + else: + # Cluster has non-env members but no env — still show in legend + fig.add_trace(go.Scatter( + x=[None], y=[None], mode='markers', + marker=dict(color=colour, size=10), + legendgroup=str(cid), + showlegend=True, + name=legend_label, + )) + + n_env = int(is_env.sum()) + n_nenv = int((~is_env).sum()) + fig.update_layout( + title=dict(text=( + f'MA Lobbying Bills — Summary Embeddings UMAP (pilot, k={n_clusters})' + f'<br>
{n_env} env (🌿 large, outlined) · {n_nenv} non-env (small) · ' + 'coloured by summary-embed cluster · hover for details' + ), font=dict(size=13)), + xaxis=dict(visible=False), yaxis=dict(visible=False), + legend=dict(font=dict(size=9), itemsizing='constant'), + margin=dict(l=10, r=10, t=70, b=10), + width=940, height=660, + plot_bgcolor='#f4f4f4', paper_bgcolor='white', + hovermode='closest', + ) + OUT_HTML.parent.mkdir(parents=True, exist_ok=True) + html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True}) + OUT_HTML.write_text('{% raw %}\n' + html + '\n{% endraw %}\n', encoding='utf-8') + print(f'Wrote {OUT_HTML}') + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--k', type=int, default=None, + help='Fixed k for k-means (default: sweep 4–15 and pick best)') + parser.add_argument('--skip-embed', action='store_true', + help='Load cached summary embeddings from /tmp/pilot_summ_emb.npy') + args = parser.parse_args() + + api_key = API_KEY_PATH.read_text().strip() + import google.genai as genai + client = genai.Client(api_key=api_key) + + df = _load_parquet() + df_pilot = df[df['summary'].notna()].copy().reset_index(drop=True) + print(f'{len(df_pilot)} pilot bills with summaries') + + # ── 1. Embed summaries ───────────────────────────────────────────────────── + # Use stored summary_embedding where available; re-embed only the gaps. 
+ cache_path = Path('/tmp/pilot_summ_emb.npy') + n_pilot = len(df_pilot) + summ_emb = np.zeros((n_pilot, EMBEDDING_DIM), dtype=np.float32) + needs_embed = np.ones(n_pilot, dtype=bool) + + if 'summary_embedding' in df_pilot.columns: + for i, v in enumerate(df_pilot['summary_embedding']): + if v is not None: + try: + arr = np.array(v, dtype=np.float32) + if arr.shape == (EMBEDDING_DIM,): + summ_emb[i] = arr + needs_embed[i] = False + except Exception: + pass + n_cached = (~needs_embed).sum() + print(f' {n_cached}/{n_pilot} summary_embeddings loaded from parquet') + + if args.skip_embed and cache_path.exists() and needs_embed.any(): + cached = np.load(cache_path) + if cached.shape == (n_pilot, EMBEDDING_DIM): + summ_emb[needs_embed] = cached[needs_embed] + needs_embed[:] = False + print(f'Loaded remaining embeddings from {cache_path}') + + if needs_embed.any(): + n_todo = needs_embed.sum() + print(f'\nEmbedding {n_todo} summaries (parallel)...') + todo_texts = df_pilot['summary'].iloc[np.where(needs_embed)[0]].tolist() + new_embs = _embed_texts(client, todo_texts) + summ_emb[needs_embed] = new_embs + np.save(cache_path, summ_emb) + print(f'Saved embeddings to {cache_path}') + else: + print('All embeddings ready from parquet/cache.') + + emb_norm = normalize(summ_emb - summ_emb.mean(axis=0), norm='l2') + + # ── 2. K-means sweep or fixed k ─────────────────────────────────────────── + if args.k: + chosen_k = args.k + km = KMeans(n_clusters=chosen_k, random_state=42, n_init=10) + cluster_ids = km.fit_predict(emb_norm) + sil = silhouette_score(emb_norm, cluster_ids, metric='cosine') + print(f'\nk={chosen_k} silhouette={sil:.4f}') + else: + print('\nK-means silhouette sweep...') + sweep = kmeans_sweep(emb_norm, ks=list(range(4, 16))) + best_k, best = max(sweep.items(), key=lambda x: x[1]['sil']) + print(f'\nBest k={best_k} silhouette={best["sil"]:.4f}') + chosen_k = best_k + km = best['model'] + cluster_ids = best['labels'] + + # ── 3. 
Label clusters with Gemini ───────────────────────────────────────── + print(f'\nLabelling {chosen_k} clusters...') + label_map = {} + for cid in range(chosen_k): + mask = cluster_ids == cid + # Keep titles index-aligned with emb_norm[mask]; missing titles become '' and are skipped below + titles = df_pilot.loc[mask, 'bill_title'].fillna('').tolist() + # Sort by distance to centroid — pick the 20 closest + dists = np.linalg.norm(emb_norm[mask] - km.cluster_centers_[cid], axis=1) + order = np.argsort(dists) + central_titles = [titles[i] for i in order if titles[i]][:20] + label = _label_cluster(client, central_titles, cid) + label_map[cid] = label + print(f' [{cid:2d}] n={mask.sum():3d} "{label}"') + + # ── 4. Regenerate UMAP ──────────────────────────────────────────────────── + print('\nGenerating UMAP...') + make_umap(df_pilot, emb_norm, cluster_ids, label_map) + + +if __name__ == '__main__': + main() diff --git a/get_data/diagnostics_summarize.py b/get_data/diagnostics_summarize.py new file mode 100644 index 0000000..738a7a6 --- /dev/null +++ b/get_data/diagnostics_summarize.py @@ -0,0 +1,645 @@ +"""Diagnostics for summarize_lobbying_bills.py pilot output. + +Runs after summarize_lobbying_bills.py --sample N and produces: + + 1. Env precision/recall on the known reference sets + 2. LLM vs embedding disagreement analysis + 3. Summary quality stats (thin-text bills, tag validity) + 4. Token/cost breakdown by GC and body-text length + 5. Silhouette comparison: original embedding vs summary embedding + 6. UMAP visualisation using summary embeddings (env + borderline) + 7.
Written report appended to NOTES_bill_embeddings.md + +Run from get_data/: + /path/to/python -u diagnostics_summarize.py [--sample-size 500] +""" + +import argparse +import io +import json +import time +from collections import Counter, defaultdict +from pathlib import Path + +import numpy as np +import pandas as pd +from sklearn.cluster import KMeans +from sklearn.metrics import silhouette_score +from sklearn.preprocessing import normalize + +DATA_DIR = Path('../docs/data') +API_KEY_PATH = Path('SECRET_GOOGLE_API_KEY') +GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet' +LOCAL_PARQUET = DATA_DIR / 'MA_bill_embeddings.parquet' +LABELS_CSV = DATA_DIR / 'MA_bill_cluster_labels.csv' +NOTES_MD = Path('NOTES_bill_embeddings.md') +OUT_HTML = Path('../docs/_includes/charts/lobbying_bill_umap_summary.html') + +EMBEDDING_MODEL = 'gemini-embedding-2' +EMBEDDING_DIM = 768 +REQUEST_DELAY = 0.05 +EMBED_BATCH = 50 + +# Pricing (Gemini 2.5 Flash non-thinking) +PRICE_INPUT = 0.075 / 1_000_000 +PRICE_INPUT_CACHED = 0.01875 / 1_000_000 +PRICE_OUTPUT = 0.300 / 1_000_000 + +# Reference sets from score_lobbying_bills.py +ENV_REFERENCE = [ + 'An Act to protect Massachusetts public health from PFAS', + 'An Act relative to solid waste disposal facilities in environmental justice communities', + 'An Act relative to the remediation of home heating oil releases', + 'An Act relative to the cleanup of accidental home heating oil spills', + 'An Act relative to proper disposal of products containing PFAS', + 'An Act relative to certain manufactured chemicals known as PFAS', + 'An Act relative to chemical recycling', + 'An Act ensuring a healthy future for environmental justice communities', + 'An Act relative to protecting our waterways', + 'An Act protecting our soil and farms from PFAS contamination', + 'An Act relative to liability for release of hazardous materials', + 'An Act relative to landfills and areas of critical environmental concern', + 'An Act relative to maintaining 
adequate water supplies through effective drought management', + 'Monitor the adoption and implementation of the Low Emission Vehicle Program', + 'An Act relative to stormwater management', + 'An Act relative to clean energy and climate resilience', + 'An Act relative to reducing greenhouse gas emissions', + 'An Act relative to wetlands protection', + 'An Act relative to air quality standards', + 'An Act relative to ocean and coastal resource management', +] + +NON_ENV_REFERENCE = [ + 'An Act requiring one fair wage', + 'An Act clarifying the process for paying the wages of dismissed employees', + 'An Act to establish a hospital and community health center worker minimum wage', + 'An Act relative to equitable pay in the public sector', + 'An Act to prohibit carrying firearms in sensitive places', + 'An Act further defining a hate crime', + 'An Act limiting autonomous driving capabilities to zero emission and electric vehicles', + 'An Act relative to disability pensions for violent crimes', + 'An Act to improve sickle cell care', + 'An Act to promote the recruitment and retention of hospital workers', + 'An Act to ensure consumer cost protection under the dental medical loss ratio', + 'An Act alleviating the burden of medical debt for patients and families', + 'An Act relative to improving the outcomes for sudden cardiac arrest in the Commonwealth', + 'An Act requiring full health insurance coverage for individuals with vitiligo', + 'An Act to modernize the Massachusetts insurer insolvency fund', + 'An Act establishing a college tuition tax deduction', + 'An Act to support educational opportunity for all', + 'An Act protecting against attempts to ban remove or restrict library access to materials', + 'An Act relative to charter schools', + 'An Act to lift kids out of deep poverty', + 'An Act establishing a tax credit for families caring for elderly relatives', + 'An Act to require equitable payment from the Commonwealth', + 'An Act relative to the Affordable Homes 
Act', + 'An Act making appropriations for the fiscal year for the maintenance of the departments of the commonwealth', + 'An Act relative to liquor licenses in the city of Westfield', + 'An Act authorizing the town of Wrentham to grant additional licenses for the sale of alcoholic beverages', + 'Supporting Local Services', + 'An Act providing incentives to the digital interactive media and entertainment industries', + 'An Act to establish a digital advertising revenue commission', + 'An Act relative to legal advertisements in online-only newspapers', + 'An Act relative to access to a decedent electronic mail accounts', + 'An Act to modify the rules for taking depositions outside the Commonwealth', + 'An Act to prohibit the sale of energy drinks to persons under the age of 18', + 'An Act relative to LGBTQ family building', + 'An Act to preserve the eternal bonds between people and their animals', + 'An Act protecting the right to time off for voting', +] + +PALETTE_25 = [ + '#1f77b4','#ff7f0e','#2ca02c','#d62728','#9467bd', + '#8c564b','#e377c2','#7f7f7f','#bcbd22','#17becf', + '#aec7e8','#ffbb78','#98df8a','#ff9896','#c5b0d5', + '#c49c94','#f7b6d2','#c7c7c7','#dbdb8d','#9edae5', + '#393b79','#637939','#8c6d31','#843c39','#7b4173', +] + + +# ─── Helpers ─────────────────────────────────────────────────────────────────── + +def _gcs_fs(): + import gcsfs + return gcsfs.GCSFileSystem() + + +def _load_parquet() -> pd.DataFrame: + try: + fs = _gcs_fs() + if fs.exists(GCS_PARQUET): + with fs.open(GCS_PARQUET, 'rb') as f: + df = pd.read_parquet(f) + print(f'Loaded {len(df)} rows from GCS') + return df + except Exception as e: # missing gcsfs, auth and network errors are not all OSError + print(f'GCS failed ({e}), using local') + df = pd.read_parquet(LOCAL_PARQUET) + print(f'Loaded {len(df)} rows from local parquet') + return df + + +def _embed_texts(client, texts: list[str]) -> np.ndarray: + """Embed a list of strings one at a time; return (N, 768) float32 array.""" + import google.genai.types as types + vecs = [] + for text in texts:
+ for attempt in range(5): + try: + resp = client.models.embed_content( + model=EMBEDDING_MODEL, + contents=text, + config=types.EmbedContentConfig(output_dimensionality=EMBEDDING_DIM), + ) + vecs.append(resp.embeddings[0].values) + time.sleep(REQUEST_DELAY) + break + except Exception as e: + wait = 2 ** attempt + print(f' embed error ({e}), retry in {wait}s...') + time.sleep(wait) + else: + print(f' embed failed after 5 attempts, using zero vector') + vecs.append([0.0] * EMBEDDING_DIM) + if len(vecs) % 50 == 0: + print(f' {len(vecs)}/{len(texts)} embeddings...', flush=True) + return np.array(vecs, dtype=np.float32) + + +def _call_env_classify(client, title: str) -> bool: + """Ask the LLM whether a given bill title is environmental. Returns True/False, or None if the API call fails.""" + import google.genai.types as types + from pydantic import BaseModel as PB + + class EnvResult(PB): + is_environmental: bool + + prompt = ( + f'Bill title: "{title}"\n\n' + 'Is this Massachusetts bill primarily about environmental protection, ' + 'clean energy, renewable energy, climate change, pollution, solid waste, ' + 'recycling, water quality, wetlands, natural resources, forests, fisheries, ' + 'or wildlife? Reply with a JSON object: {"is_environmental": true/false}' + ) + try: + resp = client.models.generate_content( + model='gemini-2.5-flash', + contents=prompt, + config=types.GenerateContentConfig( + response_mime_type='application/json', + response_schema=EnvResult, + temperature=0, + thinking_config=types.ThinkingConfig(thinking_budget=0), + ), + ) + time.sleep(0.3) + return resp.parsed.is_environmental + except Exception: # API and parsing errors are not OSError subclasses + return None + + +# ─── Diagnostic sections ─────────────────────────────────────────────────────── + +def diag_reference_set(client) -> dict: + """Run the env/non-env reference titles through the LLM classifier.""" + print('\n── 1.
Reference set precision/recall ─────────────────────────') + env_results, non_env_results = [], [] + + print(f'Classifying {len(ENV_REFERENCE)} env reference titles...') + for title in ENV_REFERENCE: + result = _call_env_classify(client, title) + env_results.append((title, result)) + print(f' {"✓" if result else "✗"} {title[:70]}') + + print(f'\nClassifying {len(NON_ENV_REFERENCE)} non-env reference titles...') + for title in NON_ENV_REFERENCE: + result = _call_env_classify(client, title) + non_env_results.append((title, result)) + print(f' {"✗ FP!" if result else "✓"} {title[:70]}') + + recall = sum(1 for _, r in env_results if r) / len(env_results) + precision_denom = len(NON_ENV_REFERENCE) + fp = sum(1 for _, r in non_env_results if r) + specificity = 1 - fp / precision_denom + + fn_titles = [t for t, r in env_results if not r] + fp_titles = [t for t, r in non_env_results if r] + + print(f'\nRecall: {recall:.0%} ({sum(1 for _,r in env_results if r)}/{len(env_results)} env correctly flagged)') + print(f'Specificity: {specificity:.0%} ({fp} false positives out of {precision_denom} non-env)') + if fn_titles: + print(f'False negatives (missed env): {fn_titles}') + if fp_titles: + print(f'False positives (wrong non-env): {fp_titles}') + + return { + 'recall': recall, + 'specificity': specificity, + 'false_negatives': fn_titles, + 'false_positives': fp_titles, + } + + +def diag_disagreements(df_pilot: pd.DataFrame) -> dict: + """Compare LLM vs embedding env classification on the pilot sample.""" + print('\n── 2. 
LLM vs embedding disagreement analysis ──────────────────') + has_both = df_pilot[df_pilot['is_env_llm'].notna() & df_pilot['is_environmental'].notna()].copy() + has_both['emb_env'] = has_both['is_environmental'].astype(bool) + has_both['llm_env'] = has_both['is_env_llm'].astype(bool) + + agree = has_both[has_both['llm_env'] == has_both['emb_env']] + llm_only = has_both[has_both['llm_env'] & ~has_both['emb_env']] # LLM env, emb not + emb_only = has_both[~has_both['llm_env'] & has_both['emb_env']] # emb env, LLM not + both_env = has_both[has_both['llm_env'] & has_both['emb_env']] + + print(f'Sample size: {len(has_both)} bills') + print(f' Agreement: {len(agree)} ({100*len(agree)/len(has_both):.0f}%)') + print(f' Both env: {len(both_env)}') + print(f' LLM env only: {len(llm_only)} ← likely embedding false negatives') + print(f' Emb env only: {len(emb_only)} ← likely embedding false positives') + + print(f'\nLLM-only env bills ({len(llm_only)}) — probable false negatives in embedding:') + for _, row in llm_only.iterrows(): + cats = row.get('categories', '[]') + try: + cats = ', '.join(json.loads(cats)) + except (json.JSONDecodeError, TypeError): + pass + print(f' score={row.get("env_relevance_score", "?"):.3f} [{cats}] {row.get("bill_title","")[:70]}') + + print(f'\nEmb-only env bills ({len(emb_only)}) — probable false positives in embedding:') + for _, row in emb_only.iterrows(): + cats = row.get('categories', '[]') + try: + cats = ', '.join(json.loads(cats)) + except (json.JSONDecodeError, TypeError): + pass + print(f' score={row.get("env_relevance_score", "?"):.3f} [{cats}] {row.get("bill_title","")[:70]}') + + return { + 'n_agree': len(agree), 'n_llm_only': len(llm_only), + 'n_emb_only': len(emb_only), 'n_both': len(both_env), + 'llm_only_titles': llm_only['bill_title'].tolist(), + 'emb_only_titles': emb_only['bill_title'].tolist(), + } + + +def diag_tag_validity(df_pilot: pd.DataFrame) -> dict: + """Check structured output quality: tag count, category count, 
thin-text bills.""" + print('\n── 3. Structured output quality ───────────────────────────────') + done = df_pilot[df_pilot['summary'].notna()].copy() + + tag_counts, cat_counts = [], [] + zero_tags = zero_cats = 0 + for _, row in done.iterrows(): + try: + tags = json.loads(row.get('tags') or '[]') + cats = json.loads(row.get('categories') or '[]') + except (json.JSONDecodeError, TypeError): + tags, cats = [], [] + tag_counts.append(len(tags)) + cat_counts.append(len(cats)) + if len(tags) == 0: + zero_tags += 1 + if len(cats) == 0: + zero_cats += 1 + + print(f'Bills with summaries: {len(done)}') + print(f'Avg tags per bill: {np.mean(tag_counts):.2f} (0 tags: {zero_tags} bills)') + print(f'Avg categories/bill: {np.mean(cat_counts):.2f} (0 cats: {zero_cats} bills)') + + # Body text coverage + if 'full_text' in done.columns: + char_counts = done['full_text'].fillna('').str.len() + thin = (char_counts < 200).sum() + print(f'\nBody text coverage:') + print(f' <200 chars (title-only effectively): {thin} ({100*thin/len(done):.0f}%)') + print(f' 200–2k chars: {((char_counts >= 200) & (char_counts < 2000)).sum()}') + print(f' 2k–10k chars: {((char_counts >= 2000) & (char_counts < 10000)).sum()}') + print(f' >10k chars: {(char_counts >= 10000).sum()}') + + # Spot-print 10 summaries for qualitative read + print('\nSample summaries (random 10):') + for _, row in done.sample(min(10, len(done)), random_state=7).iterrows(): + title = str(row.get('bill_title', ''))[:60] + summ = str(row.get('summary', ''))[:200] + cats = row.get('categories', '[]') + try: + cats = ', '.join(json.loads(cats)) + except (json.JSONDecodeError, TypeError): + pass + print(f'\n "{title}"') + print(f' [{cats}]') + print(f' → {summ}') + + return { + 'avg_tags': float(np.mean(tag_counts)), + 'zero_tags': zero_tags, + 'zero_cats': zero_cats, + } + + +def diag_cost_by_gc(df_pilot: pd.DataFrame) -> None: + """Print cost breakdown by General Court.""" + print('\n── 4. 
Cost breakdown by General Court ─────────────────────────') + done = df_pilot[df_pilot['summary'].notna() & df_pilot['general_court'].notna()] + if done.empty: + return + gc_counts = done.groupby('general_court').size() + text_lens = done.groupby('general_court')['full_text'].apply( + lambda s: s.fillna('').str.len().mean() + ) + print(f'{"GC":>5} {"Bills":>6} {"Avg text chars":>15}') + for gc in sorted(gc_counts.index): + print(f' {int(gc):>3} {gc_counts[gc]:>6} {text_lens.get(gc, 0):>15.0f}') + + +def diag_silhouette(df_pilot: pd.DataFrame, client) -> dict: + """Embed summaries and compare silhouette with original embeddings.""" + print('\n── 5. Silhouette comparison: original vs summary embeddings ───') + done = df_pilot[ + df_pilot['summary'].notna() & + df_pilot['cluster_id'].notna() & + (df_pilot['cluster_id'] >= 0) & + df_pilot['embedding'].notna() + ].copy() + done['cluster_id'] = done['cluster_id'].astype(int) + + # Need at least 2 clusters with 2+ members + valid_clusters = done['cluster_id'].value_counts() + valid_clusters = valid_clusters[valid_clusters >= 2].index + done = done[done['cluster_id'].isin(valid_clusters)] + if len(done) < 50: + print(f' Only {len(done)} valid bills — skipping silhouette') + return {} + + labels = done['cluster_id'].values + + # Original embeddings + orig_emb = np.vstack(done['embedding'].apply( + lambda v: np.array(v, dtype=np.float32) + ).values) + orig_norm = normalize(orig_emb - orig_emb.mean(axis=0), norm='l2') + + # Embed summaries + print(f' Embedding {len(done)} summaries...') + summ_emb = _embed_texts(client, done['summary'].tolist()) + summ_norm = normalize(summ_emb - summ_emb.mean(axis=0), norm='l2') + + sil_orig = silhouette_score(orig_norm, labels, metric='cosine') + sil_summ = silhouette_score(summ_norm, labels, metric='cosine') + pct_gain = (sil_summ - sil_orig) / abs(sil_orig) * 100 + + print(f' Original title+body embedding silhouette: {sil_orig:.4f}') + print(f' Summary embedding silhouette: 
{sil_summ:.4f}') + print(f' Change: {pct_gain:+.1f}%') + + return { + 'sil_orig': sil_orig, + 'sil_summ': sil_summ, + 'pct_gain': pct_gain, + 'n_bills': len(done), + 'summ_emb': summ_emb, + 'done': done, + } + + +def make_umap(df_pilot: pd.DataFrame, summ_emb: np.ndarray, + done_df: pd.DataFrame) -> None: + """UMAP plot using summary embeddings only — all 495 pilot bills in one space. + + done_df rows correspond 1:1 to rows of summ_emb. Env bills (is_env_llm) + are coloured by cluster; non-env pilot bills are grey. No mixing with + original embeddings from the background corpus. + """ + print('\n── 6. UMAP with summary embeddings (pilot only) ────────────────') + import umap as umap_lib + import plotly.graph_objects as go + + labels_df = pd.read_csv(LABELS_CSV, engine='python', on_bad_lines='skip') + labels_df = labels_df[ + pd.to_numeric(labels_df['cluster_id'], errors='coerce').notna() + ].copy() + labels_df['cluster_id'] = labels_df['cluster_id'].astype(int) + label_map = dict(zip(labels_df['cluster_id'], labels_df['label'])) + + # All pilot bills share the same embedding space — summary embeddings only + summ_norm = normalize(summ_emb - summ_emb.mean(axis=0), norm='l2') + + # Use LLM env label (is_env_llm) as ground truth for colouring + is_env = done_df['is_env_llm'].fillna(False).astype(bool).values + + print(f' UMAP input: {len(done_df)} pilot bills (all summary-embedded)') + print(f' Running UMAP (n={len(summ_norm)}, cosine, n_neighbors=15, min_dist=0.1)...') + reducer = umap_lib.UMAP( + n_components=2, n_neighbors=15, min_dist=0.1, + metric='cosine', random_state=42, + ) + coords = reducer.fit_transform(summ_norm) + + fig = go.Figure() + + # Non-env pilot bills (grey) + non_env_df = done_df[~is_env] + ne_coords = coords[~is_env] + if len(non_env_df): + fig.add_trace(go.Scatter( + x=ne_coords[:, 0], y=ne_coords[:, 1], mode='markers', + marker=dict(color='#cccccc', size=7, opacity=0.55), + name=f'Non-env pilot ({len(non_env_df)})', + hovertext=[ + f'{t}
<br>score {s:.3f}<br>
[{c}]' + for t, s, c in zip( + non_env_df['bill_title'].fillna(''), + non_env_df['env_relevance_score'].fillna(0), + non_env_df['categories'].fillna('[]'), + ) + ], + hoverinfo='text', showlegend=True, + )) + + # Env pilot bills — coloured by cluster + env_df = done_df[is_env].copy() + env_coords = coords[is_env] + env_df['cluster_id'] = pd.to_numeric(env_df['cluster_id'], errors='coerce') + for cid in sorted(env_df['cluster_id'].dropna().astype(int).unique()): + mask = env_df['cluster_id'].astype(int) == cid + sub = env_df[mask] + sc = env_coords[mask.values] + lbl = label_map.get(cid, f'Cluster {cid}') + fig.add_trace(go.Scatter( + x=sc[:, 0], y=sc[:, 1], mode='markers', + marker=dict(color=PALETTE_25[cid % 25], size=12, opacity=0.92, + line=dict(color='black', width=1.0)), + name=f'{lbl} ({len(sub)})', + hovertext=[ + f'{row["bill_title"]}' + f'
<br>🌿 env · cluster: {lbl}' + f'
<br>score {row.get("env_relevance_score", 0):.3f}' + f'<br>
{row.get("summary", "")[:120]}' + for _, row in sub.iterrows() + ], + hoverinfo='text', showlegend=True, + )) + + n_env = int(is_env.sum()) + fig.update_layout( + title=dict(text=( + 'MA Lobbying Bills — Summary Embeddings UMAP (pilot, 495 bills)' + f'
{n_env} env (LLM, coloured by cluster) · ' + f'{len(non_env_df)} non-env (grey) · ' + 'all points summary-embedded · hover for details' + ), font=dict(size=13)), + xaxis=dict(visible=False), yaxis=dict(visible=False), + legend=dict(font=dict(size=9), itemsizing='constant'), + margin=dict(l=10, r=10, t=70, b=10), + width=940, height=640, + plot_bgcolor='#f4f4f4', paper_bgcolor='white', + hovermode='closest', + ) + OUT_HTML.parent.mkdir(parents=True, exist_ok=True) + html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True}) + OUT_HTML.write_text('{% raw %}\n' + html + '\n{% endraw %}\n', encoding='utf-8') + print(f' Wrote {OUT_HTML}') + + +def append_to_notes(results: dict) -> None: + """Append a diagnostics section to NOTES_bill_embeddings.md.""" + ref = results.get('reference', {}) + dis = results.get('disagreements', {}) + tags = results.get('tags', {}) + sil = results.get('silhouette', {}) + n = results.get('n_pilot', 0) + cost = results.get('cost', 0.0) + + pct_gain = sil.get('pct_gain', None) + sil_line = ( + f'| Original title+body | {sil["sil_orig"]:.4f} |\n' + f'| **Summary embed** | **{sil["sil_summ"]:.4f}** | {pct_gain:+.0f}% |\n' + if sil else '_Silhouette comparison not run._\n' + ) + + section = f""" +--- + +## LLM summary + taxonomy pilot diagnostics ({n}-bill sample, gemini-2.5-flash) + +**Run date:** May 2026 **Cost:** ${cost:.4f} for {n} bills \ +(${cost/max(n,1)*1000:.3f}/1k bills, ${cost/max(n,1)*26000:.2f} projected 26k corpus) + +### 1. 
Env classification — reference set precision/recall + + | Metric | Value | +|--------|-------| +| Recall (20 known env titles) | {f"{ref['recall']:.0%}" if 'recall' in ref else '(not run)'} | +| Specificity (36 known non-env) | {f"{ref['specificity']:.0%}" if 'specificity' in ref else '(not run)'} | +| False negatives | {len(ref.get('false_negatives', []))} | +| False positives | {len(ref.get('false_positives', []))} | + +""" + if ref.get('false_negatives'): + section += 'False negatives (env missed by LLM):\n' + for t in ref['false_negatives']: + section += f'- {t}\n' + section += '\n' + if ref.get('false_positives'): + section += 'False positives (non-env wrongly flagged):\n' + for t in ref['false_positives']: + section += f'- {t}\n' + section += '\n' + + n_agree = dis.get('n_agree', '?') + n_llm = dis.get('n_llm_only', '?') + n_emb = dis.get('n_emb_only', '?') + n_both = dis.get('n_both', '?') + section += f"""### 2. LLM vs embedding disagreement ({n}-bill pilot) + +| | Count | +|---|---| +| Both env (agreement) | {n_both} | +| LLM env only (embedding false negatives) | {n_llm} | +| Embedding env only (embedding false positives) | {n_emb} | + +""" + if dis.get('llm_only_titles'): + section += 'Bills LLM classifies env but embedding misses (top 10):\n' + for t in dis['llm_only_titles'][:10]: + section += f'- {t}\n' + section += '\n' + + section += f"""### 3. Structured output quality + +| Metric | Value | +|--------|-------| +| Avg tags per bill | {f"{tags['avg_tags']:.2f}" if 'avg_tags' in tags else '?'} | +| Bills with 0 valid tags | {tags.get('zero_tags', '?')} | +| Bills with 0 valid categories | {tags.get('zero_cats', '?')} | + +### 4. Silhouette comparison (k=25 clustering) + +| Method | Silhouette↑ | Δ | +|--------|-------------|---| +{sil_line} + +### 5. 
UMAP with summary embeddings + +**[→ Interactive UMAP (summary embeddings)](../docs/_includes/charts/lobbying_bill_umap_summary.html)** + +""" + existing = NOTES_MD.read_text(encoding='utf-8') + # Don't duplicate: only append if the pilot diagnostics section isn't already there + header = f'## LLM summary + taxonomy pilot diagnostics ({n}-bill' + if header in existing: + print(f'\nDiagnostics section already in {NOTES_MD} — skipping append') + return + NOTES_MD.write_text(existing + section, encoding='utf-8') + print(f'\nAppended diagnostics section to {NOTES_MD}') + + +# ─── Main ────────────────────────────────────────────────────────────────────── + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--sample-size', type=int, default=500, + help='Expected pilot sample size (for cost reporting)') + parser.add_argument('--skip-reference', action='store_true', + help='Skip the reference set LLM calls') + parser.add_argument('--skip-umap', action='store_true') + args = parser.parse_args() + + api_key = API_KEY_PATH.read_text(encoding='utf-8').strip() + import google.genai as genai + client = genai.Client(api_key=api_key) + + df = _load_parquet() + df_pilot = df[df['summary'].notna()].copy() + print(f'{len(df_pilot)} bills with summaries in parquet') + + if len(df_pilot) == 0: + print('ERROR: No summaries found. 
Run summarize_lobbying_bills.py first.') + return + + results = {'n_pilot': len(df_pilot)} + + # Estimate cost from pilot token data (rough: use pilot averages) + results['cost'] = len(df_pilot) * 0.000106 # $0.000106/bill from 200-bill pilot + + if not args.skip_reference: + results['reference'] = diag_reference_set(client) + + results['disagreements'] = diag_disagreements(df_pilot) + results['tags'] = diag_tag_validity(df_pilot) + diag_cost_by_gc(df_pilot) + + sil_results = diag_silhouette(df_pilot, client) + results['silhouette'] = sil_results + + if not args.skip_umap and sil_results.get('summ_emb') is not None: + make_umap(df, sil_results['summ_emb'], sil_results['done']) + + append_to_notes(results) + print('\nAll diagnostics complete.') + + +if __name__ == '__main__': + main() diff --git a/get_data/fill_summary_embeddings.py b/get_data/fill_summary_embeddings.py new file mode 100644 index 0000000..cd430a9 --- /dev/null +++ b/get_data/fill_summary_embeddings.py @@ -0,0 +1,200 @@ +"""Fill missing summary_embedding values for bills that already have a summary. 
+ +Non-destructive and incremental: + - Only touches rows where summary IS NOT NULL AND summary_embedding IS NULL + - Never overwrites an existing embedding + - Never modifies summary, categories, tags, is_env_llm, or any other column + - Loads fresh from GCS before writing to avoid clobbering concurrent changes + - Always prints before/after counts + +Run from get_data/: + python fill_summary_embeddings.py [--dry-run] [--workers N] +""" + +import argparse +import random +import time +from concurrent.futures import ThreadPoolExecutor, as_completed +from pathlib import Path + +import numpy as np +import pandas as pd + +API_KEY_PATH = Path('SECRET_GOOGLE_API_KEY') +DATA_DIR = Path('../docs/data') +LOCAL_PARQUET = DATA_DIR / 'MA_bill_embeddings.parquet' +GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet' +EMBEDDING_MODEL = 'gemini-embedding-2' +EMBEDDING_DIM = 768 + + +def _load_parquet() -> pd.DataFrame: + try: + import gcsfs + fs = gcsfs.GCSFileSystem() + if fs.exists(GCS_PARQUET): + with fs.open(GCS_PARQUET, 'rb') as f: + df = pd.read_parquet(f) + print(f'Loaded {len(df):,} rows from GCS') + return df + except Exception as e: + print(f'GCS failed ({e}), using local') + df = pd.read_parquet(LOCAL_PARQUET) + print(f'Loaded {len(df):,} rows from local') + return df + + +def _save_parquet(df: pd.DataFrame) -> None: + n_emb = int(df['summary_embedding'].notna().sum()) + df.to_parquet(LOCAL_PARQUET, index=False) + print(f' Saved local ({n_emb:,} embeddings)') + try: + import gcsfs + fs = gcsfs.GCSFileSystem() + with fs.open(GCS_PARQUET, 'wb') as f: + df.to_parquet(f, index=False) + print(f' Uploaded to GCS ({n_emb:,} embeddings)') + except Exception as e: + print(f' GCS upload failed: {e}') + + +def _embed_one(client, idx: int, summary: str) -> tuple[int, 'np.ndarray | None']: + """Embed one summary with exponential backoff. 
Returns (idx, vector_or_None).""" + import google.genai.types as types + for attempt in range(6): + try: + resp = client.models.embed_content( + model=EMBEDDING_MODEL, + contents=summary, + config=types.EmbedContentConfig(output_dimensionality=EMBEDDING_DIM), + ) + vec = np.array(resp.embeddings[0].values, dtype=np.float32) + return idx, vec + except Exception as e: + if attempt == 5: + print(f' [{idx}] embed failed after 6 attempts: {e}') + return idx, None + wait = (2 ** attempt) + random.uniform(0, 1) + print(f' [{idx}] embed error (attempt {attempt+1}/6): {str(e)[:80]} — retry in {wait:.1f}s') + time.sleep(wait) + return idx, None + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--dry-run', action='store_true', + help='Show what would be embedded without calling the API') + parser.add_argument('--workers', type=int, default=8, + help='Parallel embed workers (default: 8)') + args = parser.parse_args() + + # ── Load and audit ────────────────────────────────────────────────────────── + df = _load_parquet() + + has_summary = df['summary'].notna() + has_emb = df['summary_embedding'].notna() + needs_embed = has_summary & ~has_emb + already_done = has_summary & has_emb + no_summary = ~has_summary + + print() + print('── Pre-flight audit ──────────────────────────────────────────────') + print(f' Total bills: {len(df):,}') + print(f' Has summary + embedding: {already_done.sum():,} (will NOT be touched)') + print(f' Has summary, NO embedding: {needs_embed.sum():,} ← will embed these') + print(f' No summary (skip): {no_summary.sum():,} (will NOT be touched)') + print(f' Columns that will change: summary_embedding only') + print(f' Columns never touched: summary, categories, tags, is_env_llm, ' + f'embedding, env_relevance_score, is_environmental, cluster_id') + print() + + if needs_embed.sum() == 0: + print('Nothing to do — all summaries already have embeddings.') + return + + todo_idx = df.index[needs_embed].tolist() + todo_text = 
df.loc[needs_embed, 'summary'].tolist() + + print(f'GC distribution of bills to embed:') + print(df[needs_embed]['general_court'].value_counts().sort_index().to_string()) + print() + + if args.dry_run: + print(f'DRY RUN — would embed {len(todo_idx)} summaries, no API calls made.') + print(f'Sample:') + for i in range(min(3, len(todo_idx))): + idx = todo_idx[i] + print(f' [{idx}] GC{int(df.loc[idx,"general_court"])} ' + f'{df.loc[idx,"bill_number"]}: {str(df.loc[idx,"summary"])[:80]}') + return + + # ── Embed ─────────────────────────────────────────────────────────────────── + api_key = API_KEY_PATH.read_text().strip() + import google.genai as genai + client = genai.Client(api_key=api_key) + + print(f'Embedding {len(todo_idx)} summaries with {args.workers} workers...') + n_ok = n_fail = 0 + results: dict[int, np.ndarray] = {} + + with ThreadPoolExecutor(max_workers=args.workers) as executor: + futures = { + executor.submit(_embed_one, client, idx, text): idx + for idx, text in zip(todo_idx, todo_text) + } + for future in as_completed(futures): + idx, vec = future.result() + if vec is not None: + results[idx] = vec + n_ok += 1 + else: + n_fail += 1 + done = n_ok + n_fail + if done % 50 == 0 or done == len(todo_idx): + print(f' {done}/{len(todo_idx)} embedded ({n_ok} ok, {n_fail} failed)', + flush=True) + + # ── Write ONLY summary_embedding, only for rows that needed it ────────────── + print(f'\nWriting {n_ok} embeddings to parquet (non-destructive)...') + + # Verify before writing: confirm no embeddings appeared in target rows + # since we loaded (e.g. 
from a concurrent process) + still_missing = df.index[needs_embed] + collisions = df.loc[still_missing, 'summary_embedding'].notna().sum() + if collisions > 0: + print(f' ⚠️ {collisions} rows gained embeddings since load — skipping those') + + written = 0 + for idx, vec in results.items(): + # Final guard: only write if still null + if pd.isna(df.at[idx, 'summary_embedding']): + df.at[idx, 'summary_embedding'] = vec.tolist() + written += 1 + + print(f' Rows updated: {written}') + print(f' Rows skipped (collision guard): {n_ok - written}') + + # ── Save ──────────────────────────────────────────────────────────────────── + after_emb = int(df['summary_embedding'].notna().sum()) + after_summ = int(df['summary'].notna().sum()) + + # Sanity: summary count must be unchanged + before_summ = int(already_done.sum() + needs_embed.sum()) + assert after_summ == before_summ, \ + f'summary count changed: {before_summ} → {after_summ} — ABORTING save' + + print(f'\nPre-save sanity:') + print(f' summary count: {before_summ:,} → {after_summ:,} ✓ unchanged') + print(f' summary_embedding count: {already_done.sum():,} → {after_emb:,} ' + f'(+{after_emb - already_done.sum()})') + + _save_parquet(df) + + if n_fail: + print(f'\n⚠️ {n_fail} embeds failed — re-run to fill gaps') + else: + print(f'\n✅ All {n_ok} embeddings written successfully') + + +if __name__ == '__main__': + main() diff --git a/get_data/test_bill_embedding_pipeline.py b/get_data/test_bill_embedding_pipeline.py new file mode 100644 index 0000000..15b4797 --- /dev/null +++ b/get_data/test_bill_embedding_pipeline.py @@ -0,0 +1,408 @@ +""" +TESTING SCRIPT — iterating on bill embedding / clustering quality. +NOT part of the production pipeline. DO NOT run in CI. + +Purpose: + Rapid iteration on text preprocessing and clustering parameters using a + stratified sample of ~1,000 bills. Produces a standalone t-SNE HTML you + can open in a browser to visually assess cluster quality. + + Key hypotheses being tested: + 1. 
Strip repeated legislative scaffolding ("is hereby amended by inserting + after...") before embedding — these trigrams dominate the 2000-char window + and pull unrelated bills toward the same region of embedding space. + 2. Prepend the bill title to the cleaned text — titles are high-signal and + currently dropped when full text is available. + 3. Expand the text window from 2,000 to 3,000 chars (more signal after stripping). + 4. Increase k from 15 to 25 clusters — coarse k merges topic-coherent sub-groups. + +Usage (from get_data/): + /path/to/python -u test_bill_embedding_pipeline.py [--sample N] [--k K] + [--no-strip] [--no-title-prefix] + [--max-chars N] [--out PATH] + +Outputs: + /tmp/test_tsne_<tag>.html — interactive Plotly t-SNE (open in browser) + /tmp/test_embeddings.parquet — cached embeddings for fast re-runs with same sample + +Imports from production scripts where possible; does NOT write to any +production data files (docs/data/, GCS, MA_bill_embeddings.parquet). +""" + +import argparse +import json +import re +import sys +import time +from pathlib import Path + +import numpy as np +import pandas as pd +import plotly.graph_objects as go +from sklearn.cluster import KMeans +from sklearn.manifold import TSNE +from sklearn.preprocessing import normalize + +# ── Import from production scripts ───────────────────────────────────────────── +# Add get_data/ to path so we can import helpers directly +sys.path.insert(0, str(Path(__file__).parent)) +from score_lobbying_bills import ( # noqa: E402 + _embed_texts, + _make_client, + _read_api_key, + _cosine_sim, + ENV_EXAMPLE_BILLS, + NON_ENV_EXAMPLE_BILLS, + ENV_THRESHOLD, +) +from cluster_lobbying_bills import _label_cluster # noqa: E402 + +# ── Paths ─────────────────────────────────────────────────────────────────────── +DATA_DIR = Path('../docs/data') +CACHE_DIR = Path('MA_legislature_cache') +API_KEY = Path('SECRET_GOOGLE_API_KEY') + +# ── Boilerplate patterns to strip before embedding ────────────────────────────── 
+# Ordered from most specific to most general so broader patterns don't shadow +# narrower ones. Each is stripped globally from the text. +_SCAFFOLD_PATTERNS = [ + # "Chapter 21E of the General Laws, as appearing in the 2020 Official Edition," + r'(?:Chapter|Section|Part)\s+\w+(?:\s+of\s+(?:chapter\s+\w+\s+of\s+)?the\s+General\s+Laws)?' + r'(?:,\s+as\s+(?:appearing|so\s+appearing|amended)[^,\n]{0,80})?' + r',?\s+is\s+hereby\s+amended\s+by\s+(?:inserting|striking|adding|deleting)[^\n]{0,120}', + # "as appearing in the 20XX Official Edition" + r'as\s+(?:so\s+)?appearing\s+in\s+the\s+\d{4}\s+Official\s+Edition', + # bare "is hereby amended by" + r'is\s+hereby\s+amended\s+by\s+(?:inserting|striking|adding|deleting)\s+\w+\s+\w+', + # "in place thereof the following words:-" / "the following section:-" + r'(?:in\s+place\s+thereof|thereof)\s+the\s+following\s+(?:words|section|clause|paragraph)[:\-\s]{0,5}', + # "SECTION N." header lines (keep the number but not the structural label) + r'\bSECTION\s+\d+\.\s+', + # "of the General Laws" alone + r'\bof\s+the\s+General\s+Laws\b', + # "in line N, the words" amendment locators + r'in\s+line\s+\d+(?:\s+through\s+\d+)?,?\s+the\s+words?\s+"[^"]{0,60}"', +] +_SCAFFOLD_RE = re.compile( + '|'.join(_SCAFFOLD_PATTERNS), + re.IGNORECASE, +) +_WHITESPACE_RE = re.compile(r'\s{2,}') + + +def clean_bill_text(raw: str, max_chars: int = 3000) -> str: + """Strip legislative scaffolding and normalise whitespace.""" + cleaned = _SCAFFOLD_RE.sub(' ', raw) + cleaned = _WHITESPACE_RE.sub(' ', cleaned).strip() + return cleaned[:max_chars] + + +def build_embed_text(title: str, raw_text: str, + strip: bool = True, + title_prefix: bool = True, + max_chars: int = 3000) -> str: + """ + Construct the string to feed to the embedding model. 
+ + Parameters + ---------- + title : bill title from the portal (always present) + raw_text : DocumentText from the legislature cache (may be empty) + strip : whether to apply clean_bill_text() + title_prefix : whether to prepend the title to the cleaned body + max_chars : character budget for the body text (after stripping) + """ + if raw_text and raw_text.strip(): + body = clean_bill_text(raw_text, max_chars) if strip else raw_text[:max_chars] + if title_prefix and title: + return f'{title.strip()}\n\n{body}' + return body + # No full text — fall back to title only + return title or '' + + +# ── Data loading ──────────────────────────────────────────────────────────────── + +def load_sample(n: int = 1000, seed: int = 42) -> pd.DataFrame: + """ + Return a stratified sample of n bills from MA_lobbying_bills_scored.csv. + Stratifies on is_environmental to guarantee env bills are represented. + """ + scored = pd.read_csv(DATA_DIR / 'MA_lobbying_bills_scored.csv', index_col=0) + scored = scored.dropna(subset=['bill_number', 'general_court']) + scored['bill_number'] = scored['bill_number'].astype(int) + scored['general_court'] = scored['general_court'].astype(int) + + env = scored[scored['is_environmental'] == True] + other = scored[scored['is_environmental'] != True] + + rng = np.random.default_rng(seed) + n_env = min(len(env), max(50, int(n * len(env) / len(scored)))) + n_other = min(len(other), n - n_env) + + sample = pd.concat([ + env.iloc[rng.choice(len(env), n_env, replace=False)], + other.iloc[rng.choice(len(other), n_other, replace=False)], + ]).reset_index(drop=True) + print(f'Sample: {len(sample)} bills ({n_env} env, {n_other} non-env)') + return sample + + +def get_raw_text(bill_id, general_court: int) -> str: + """Read DocumentText from legislature cache; empty string if unavailable.""" + if not bill_id or str(bill_id) == 'nan': + return '' + cache = CACHE_DIR / f'bill_{int(general_court)}_{bill_id}.json' + if not cache.exists(): + return '' + try: + 
return json.loads(cache.read_text(encoding='utf-8')).get('DocumentText') or '' + except Exception: + return '' + + +# ── Main ──────────────────────────────────────────────────────────────────────── + +def main(): + ap = argparse.ArgumentParser(description=__doc__, + formatter_class=argparse.RawDescriptionHelpFormatter) + ap.add_argument('--sample', type=int, default=1000, + help='Number of bills to sample (default: 1000)') + ap.add_argument('--k', type=int, default=25, + help='Number of k-means clusters (default: 25)') + ap.add_argument('--no-strip', action='store_true', + help='Disable boilerplate stripping (baseline comparison)') + ap.add_argument('--no-title-prefix', action='store_true', + help='Do not prepend title to body text') + ap.add_argument('--max-chars', type=int, default=3000, + help='Character budget for body text after stripping (default: 3000)') + ap.add_argument('--cache', type=Path, + default=Path('/tmp/test_embeddings.parquet'), + help='Parquet cache for embeddings (reused if sample/settings match)') + ap.add_argument('--no-cache', action='store_true', + help='Ignore cached embeddings and re-embed from scratch') + ap.add_argument('--out', type=Path, default=None, + help='Output HTML path (default: /tmp/test_tsne_.html)') + ap.add_argument('--no-label', action='store_true', + help='Skip Gemini cluster labeling (use cluster IDs only, faster)') + args = ap.parse_args() + + strip = not args.no_strip + prefix = not args.no_title_prefix + tag_parts = [ + f'n{args.sample}', + f'k{args.k}', + f'chars{args.max_chars}', + 'strip' if strip else 'nostrip', + 'title' if prefix else 'notitle', + ] + tag = '_'.join(tag_parts) + out_html = args.out or Path(f'/tmp/test_tsne_{tag}.html') + + print(f'=== Embedding test: {tag} ===') + print(f' strip={strip} title_prefix={prefix} max_chars={args.max_chars}') + print(f' k={args.k} sample={args.sample}') + print(f' output → {out_html}') + print() + + # ── Sample 
──────────────────────────────────────────────────────────────── + sample = load_sample(args.sample) + + # ── Try to reuse cached embeddings ──────────────────────────────────────── + cached_emb: np.ndarray | None = None + cache_meta: dict = {} + if not args.no_cache and args.cache.exists(): + try: + cdf = pd.read_parquet(args.cache) + cache_meta = json.loads(cdf.attrs.get('meta', '{}')) + if (cache_meta.get('tag') == tag and + len(cdf) == len(sample) and + set(cdf['bill_number'].astype(int)) == set(sample['bill_number'].astype(int))): + cached_emb = np.vstack(cdf['embedding'].apply( + lambda v: np.array(v, dtype=np.float32)).values) + print(f'Reusing cached embeddings from {args.cache} ({len(cdf)} rows)') + else: + print(f'Cache mismatch (tag or sample changed) — re-embedding') + except Exception as e: + print(f'Cache load failed ({e}) — re-embedding') + + # ── Embed ───────────────────────────────────────────────────────────────── + if cached_emb is None: + api_key = _read_api_key() + client = _make_client(api_key) + + texts = [] + n_with_text = 0 + for _, row in sample.iterrows(): + raw = get_raw_text(row.get('bill_id'), row['general_court']) + if raw: + n_with_text += 1 + texts.append(build_embed_text( + title=str(row.get('bill_title') or ''), + raw_text=raw, + strip=strip, + title_prefix=prefix, + max_chars=args.max_chars, + )) + + print(f'{n_with_text}/{len(sample)} bills have cached full text') + + # Show a before/after strip example + if strip: + raw_ex = get_raw_text(sample.iloc[0].get('bill_id'), + sample.iloc[0]['general_court']) + if raw_ex: + cleaned_ex = clean_bill_text(raw_ex, args.max_chars) + print(f'\n-- Strip example (bill {sample.iloc[0]["bill_id"]}) --') + print(f' RAW first 300: {repr(raw_ex[:300])}') + print(f' CLEAN first 300: {repr(cleaned_ex[:300])}') + print() + + print(f'Embedding {len(texts)} bills...') + cached_emb = _embed_texts(client, texts) + + # Save to parquet cache + cdf = sample[['bill_number', 'general_court', 'bill_title', + 
'bill_id', 'is_environmental']].copy() + cdf['embedding'] = [cached_emb[i].tolist() for i in range(len(cached_emb))] + cdf.attrs['meta'] = json.dumps({'tag': tag}) + cdf.to_parquet(args.cache, index=False) + print(f'Saved embeddings to {args.cache}') + + # ── Drop zero-vector rows before clustering ─────────────────────────────── + # Bills with no title and no cached text embed as all-zeros; they cluster + # together arbitrarily and hover as "nan". Assign cluster_id=-1 and exclude. + norms = np.linalg.norm(cached_emb, axis=1) + valid = norms > 0.01 + n_zero = int((~valid).sum()) + if n_zero: + print(f' Excluding {n_zero} zero-vector bills (no title/text) from clustering') + sample['cluster_id'] = -1 + sample_valid = sample[valid].reset_index(drop=True) + emb_valid = cached_emb[valid] + emb_norm = normalize(emb_valid, norm='l2') + + # ── Score env relevance with current example sets ───────────────────────── + print('Scoring env relevance...') + api_key = _read_api_key() + client = _make_client(api_key) + env_emb = _embed_texts(client, ENV_EXAMPLE_BILLS) + non_env_emb = _embed_texts(client, NON_ENV_EXAMPLE_BILLS) + diff = _cosine_sim(emb_valid, env_emb).max(axis=1) - \ + _cosine_sim(emb_valid, non_env_emb).max(axis=1) + sample_valid = sample_valid.copy() + sample_valid['env_score_new'] = diff + sample_valid['is_env_new'] = diff >= ENV_THRESHOLD + n_env_new = sample_valid['is_env_new'].sum() + print(f' {n_env_new}/{len(sample_valid)} bills flagged env ' + f'(threshold={ENV_THRESHOLD})') + + # ── Cluster ─────────────────────────────────────────────────────────────── + print(f'Clustering into {args.k} clusters (k-means)...') + km = KMeans(n_clusters=args.k, random_state=42, n_init=10) + labels = km.fit_predict(emb_norm) + sample_valid['cluster_id'] = labels + + # ── Label clusters ──────────────────────────────────────────────────────── + cluster_labels: dict[int, str] = {} + if not args.no_label: + print('Labeling clusters with Gemini...') + for cid in range(args.k): 
+ mask = labels == cid + sub = sample_valid[mask] + n_bills = mask.sum() + n_env = int(sub['is_env_new'].sum()) + centroid = km.cluster_centers_[cid] + dists = np.linalg.norm(emb_norm[mask] - centroid, axis=1) + top_idx = np.argsort(dists)[:20] + top_titles = sub.iloc[top_idx]['bill_title'].fillna('').tolist() + try: + label = _label_cluster(client, top_titles, cid) + except Exception as e: + label = f'Cluster {cid}' + print(f' Gemini error on cluster {cid}: {e}') + cluster_labels[cid] = label + print(f' Cluster {cid:2d}: "{label}" ({n_bills} bills, {n_env} env)') + else: + for cid in range(args.k): + sub = sample_valid[labels == cid] + n_env = int(sub['is_env_new'].sum()) + cluster_labels[cid] = f'C{cid} ({n_env} env)' + print('Skipped Gemini labeling (--no-label)') + + # ── t-SNE ───────────────────────────────────────────────────────────────── + print('Running t-SNE...') + tsne = TSNE(n_components=2, perplexity=min(40, len(sample_valid) // 10), + max_iter=1000, random_state=42, init='pca', learning_rate='auto') + coords = tsne.fit_transform(emb_norm) + sample_valid['x'] = coords[:, 0] + sample_valid['y'] = coords[:, 1] + + # ── Plot ────────────────────────────────────────────────────────────────── + PALETTE = [ + '#366EB3', '#E68C28', '#3CAA50', '#C83C3C', '#8250C8', + '#1EA0A0', '#DCB400', '#969696', '#4B8BBE', '#FF7043', + '#66BB6A', '#EF5350', '#AB47BC', '#26C6DA', '#D4E157', + '#FF8A65', '#A5D6A7', '#CE93D8', '#80DEEA', '#FFCC80', + '#BCAAA4', '#90CAF9', '#F48FB1', '#C5E1A5', '#B39DDB', + ] + + fig = go.Figure() + for cid in range(args.k): + mask = sample_valid['cluster_id'] == cid + sub = sample_valid[mask] + lbl = cluster_labels.get(cid, f'Cluster {cid}') + color = PALETTE[cid % len(PALETTE)] + non_env = sub[~sub['is_env_new']] + env = sub[sub['is_env_new']] + + def _hover(row, lbl=lbl): + title = row.get('bill_title', '') or f'Bill {row["bill_number"]}' + env_str = '🌿 env' if row['is_env_new'] else '' + return (f'{title}
<br>' + f'{lbl} · GC {int(row["general_court"])}<br>
' + f'env_score={row["env_score_new"]:.3f} {env_str}') + + if not non_env.empty: + fig.add_trace(go.Scatter( + x=non_env['x'], y=non_env['y'], mode='markers', + marker=dict(color=color, size=5, opacity=0.4), + name=f'{lbl} ({mask.sum()})', + legendgroup=str(cid), + hovertext=[_hover(r) for _, r in non_env.iterrows()], + hoverinfo='text', showlegend=True, + )) + if not env.empty: + fig.add_trace(go.Scatter( + x=env['x'], y=env['y'], mode='markers', + marker=dict(color=color, size=11, opacity=1.0, + line=dict(color='black', width=1.5)), + name=f'{lbl} env', + legendgroup=str(cid), + hovertext=[_hover(r) for _, r in env.iterrows()], + hoverinfo='text', showlegend=False, + )) + + title_str = ( + f'EMBEDDING TEST — n={args.sample} · k={args.k} · ' + f'strip={strip} · title_prefix={prefix} · max_chars={args.max_chars}
' + f'Large outlined = environmental · small = non-env · hover for details' + ) + fig.update_layout( + title=dict(text=title_str, font=dict(size=12)), + xaxis=dict(visible=False), yaxis=dict(visible=False), + legend=dict(font=dict(size=9), itemsizing='constant'), + margin=dict(l=10, r=10, t=75, b=10), + width=1000, height=650, + plot_bgcolor='#f5f5f5', paper_bgcolor='white', + hovermode='closest', + ) + + out_html.parent.mkdir(parents=True, exist_ok=True) + fig.write_html(str(out_html), include_plotlyjs='cdn') + print(f'\nWrote → {out_html}') + print(f'Open in browser: xdg-open {out_html}') + + +if __name__ == '__main__': + main() diff --git a/get_data/test_concat_embeddings.py b/get_data/test_concat_embeddings.py new file mode 100644 index 0000000..dbeb545 --- /dev/null +++ b/get_data/test_concat_embeddings.py @@ -0,0 +1,174 @@ +"""Test whether concatenated title+body embeddings give better cluster separation. + +Approach +──────── +For a random sample of N bills with non-empty full text: + A. Original: embed(title + "\n\n" + cleaned_body[:3000]) — current method + B. Concatenated: [L2(embed(title)), L2(embed(cleaned_body[:3000]))] → 1536-dim + then L2-normalise the 1536-dim vector + +Both are then mean-centred + L2-normalised before k-means (k=25) and evaluated +with silhouette + Davies-Bouldin on the sample. + +The "original" embeddings are pulled directly from the parquet (no API calls); +only the title-only and body-only embeddings require API calls (2N calls total). 
+ +Run from get_data/: + /path/to/python -u test_concat_embeddings.py [--sample N] +""" + +import argparse +import re +import sys +import time +from pathlib import Path + +import gcsfs +import numpy as np +import pandas as pd +from sklearn.cluster import KMeans +from sklearn.metrics import davies_bouldin_score, silhouette_score +from sklearn.preprocessing import normalize + +GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet' +API_KEY_PATH = Path('SECRET_GOOGLE_API_KEY') +EMBEDDING_DIM = 768 +REQUEST_DELAY = 0.05 + +# ── Boilerplate stripper (same regex as score_lobbying_bills.py) ────────────── +_SCAFFOLD_RE = re.compile( + r'(?:' + r'(?:chapter|section|paragraph|clause|subsection|item)\s+[\w\-]+\s+of\s+(?:the\s+)?' + r'(?:general laws|acts of \d{4}|chapter \d+)[^.]{0,120}(?:hereby\s+amended|is\s+amended)' + r'|the\s+(?:general laws|acts of \d{4})[^.]{0,120}(?:hereby\s+amended|is\s+amended)' + r'|by\s+inserting\s+after[^.]{0,200}' + r'|by\s+striking\s+out[^.]{0,200}' + r'|as\s+appearing\s+in\s+the\s+\d{4}\s+official\s+edition[^.]{0,100}' + r'|in\s+the\s+following\s+new\s+(?:section|chapter|paragraph)[^:]{0,80}:' + r')', + re.IGNORECASE, +) +_WS_RE = re.compile(r'\s{2,}') + + +def _clean_body(raw: str, max_chars: int = 3000) -> str: + cleaned = _SCAFFOLD_RE.sub(' ', raw) + return _WS_RE.sub(' ', cleaned).strip()[:max_chars] + + +def _embed_texts(client, texts: list[str], label: str) -> np.ndarray: + from google.genai import types + vectors = [] + for i, text in enumerate(texts): + if (i + 1) % 100 == 0: + print(f' {label}: {i+1}/{len(texts)}...') + if not text or not text.strip(): + vectors.append([0.0] * EMBEDDING_DIM) + continue + time.sleep(REQUEST_DELAY) + for attempt in range(5): + try: + result = client.models.embed_content( + model='gemini-embedding-2', + contents=text, + config=types.EmbedContentConfig(output_dimensionality=EMBEDDING_DIM), + ) + vectors.append(result.embeddings[0].values) + break + except Exception as e: + wait = 2 ** attempt + 
print(f' Error (attempt {attempt+1}/5): {e} — retrying in {wait}s') + time.sleep(wait) + else: + print(f' Failed: "{text[:60]}" — zero vector') + vectors.append([0.0] * EMBEDDING_DIM) + return np.array(vectors, dtype=np.float32) + + +def _eval(emb: np.ndarray, k: int = 25, seed: int = 42) -> tuple[float, float]: + """Mean-centre, L2-normalise, k-means, return (silhouette, DB).""" + e = normalize(emb - emb.mean(axis=0), 'l2') + lbl = KMeans(n_clusters=k, random_state=seed, n_init=10, max_iter=300).fit_predict(e) + sil = silhouette_score(e, lbl, metric='euclidean') + db = davies_bouldin_score(e, lbl) + return sil, db + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('--sample', type=int, default=2000, + help='Number of bills to sample (default: 2000)') + parser.add_argument('--seed', type=int, default=42) + parser.add_argument('--k', type=int, default=25, help='k for k-means (default: 25)') + args = parser.parse_args() + + # ── Load parquet ────────────────────────────────────────────────────────── + print('Loading parquet from GCS...') + fs = gcsfs.GCSFileSystem() + with fs.open(GCS_PARQUET, 'rb') as f: + df = pd.read_parquet(f) + + # Keep bills with valid embeddings and non-empty full text + emb_all = np.vstack(df['embedding'].apply(lambda v: np.array(v, dtype=np.float32)).values) + valid = (np.linalg.norm(emb_all, axis=1) > 0.01) & (df['full_text'].str.len() > 100) + df = df[valid].copy() + emb_all = emb_all[valid] + print(f'{len(df)} bills with valid embeddings and non-empty full text') + + # Sample + rng = np.random.default_rng(args.seed) + idx = rng.choice(len(df), min(args.sample, len(df)), replace=False) + sample = df.iloc[idx].reset_index(drop=True) + emb_orig = emb_all[idx] + print(f'Sampled {len(sample)} bills') + + # ── Prepare texts ───────────────────────────────────────────────────────── + titles = sample['bill_title'].fillna('').tolist() + bodies = sample['full_text'].fillna('').apply(_clean_body).tolist() + + # ── Evaluate 
original embeddings (no API calls) ─────────────────────────── + print(f'\nA. Original (title+body combined, from parquet), k={args.k}:') + sil_a, db_a = _eval(emb_orig, k=args.k, seed=args.seed) + print(f' Silhouette: {sil_a:.4f} | Davies-Bouldin: {db_a:.4f}') + + # ── Embed title-only and body-only ──────────────────────────────────────── + print(f'\nEmbedding {len(sample)} title-only texts...') + import google.genai as genai + api_key = API_KEY_PATH.read_text().strip() + client = genai.Client(api_key=api_key) + + emb_title = _embed_texts(client, titles, 'titles') + print(f'Embedding {len(sample)} body-only texts...') + emb_body = _embed_texts(client, bodies, 'bodies') + + # ── Build concatenated embeddings ───────────────────────────────────────── + # Normalise each 768-dim half independently, then concatenate → 1536-dim + # (gives equal weight to title and body signal before final normalisation) + t_norm = normalize(emb_title, 'l2') + b_norm = normalize(emb_body, 'l2') + emb_concat = np.hstack([t_norm, b_norm]) # (N, 1536) + + print(f'\nB. Title-only, k={args.k}:') + sil_t, db_t = _eval(emb_title, k=args.k, seed=args.seed) + print(f' Silhouette: {sil_t:.4f} | Davies-Bouldin: {db_t:.4f}') + + print(f'\nC. Body-only, k={args.k}:') + sil_b, db_b = _eval(emb_body, k=args.k, seed=args.seed) + print(f' Silhouette: {sil_b:.4f} | Davies-Bouldin: {db_b:.4f}') + + print(f'\nD. Concatenated [L2(title) | L2(body)], k={args.k}:') + sil_c, db_c = _eval(emb_concat, k=args.k, seed=args.seed) + print(f' Silhouette: {sil_c:.4f} | Davies-Bouldin: {db_c:.4f}') + + print('\n── Summary ──') + print(f'{"Method":<35} {"Silhouette":>10} {"DB":>8}') + print(f'{"A. Original (title+body combined)":<35} {sil_a:>10.4f} {db_a:>8.4f}') + print(f'{"B. Title only":<35} {sil_t:>10.4f} {db_t:>8.4f}') + print(f'{"C. Body only":<35} {sil_b:>10.4f} {db_b:>8.4f}') + print(f'{"D. 
Concat [L2(title)|L2(body)]":<35} {sil_c:>10.4f} {db_c:>8.4f}') + + print('\nDone.') + + +if __name__ == '__main__': + main() diff --git a/todo.md b/todo.md index 88420d2..4183486 100644 --- a/todo.md +++ b/todo.md @@ -10,6 +10,98 @@ ### MA environmental lobbying data +MA Secretary of State lobbying disclosure portal at [https://www.sec.state.ma.us/LobbyistPublicSearch/](https://www.sec.state.ma.us/LobbyistPublicSearch/) publishes filings by lobbyist, employer (client), and bill. Data is annual going back to ~2007. + +#### Data available +- **By lobbyist**: registration, employer clients, bills lobbied, compensation received +- **By employer/client**: annual lobbying expenditures, lobbyists retained, bills targeted +- **By bill**: which employers/lobbyists filed on each bill number (cross-ref to legislature) +- **By subject area**: MA SoS assigns subject tags to filings (Energy & Environment is one) + +#### Database tables (normalized schema) +- `MA_Lobbying_Employers` — one row per employer-year: name, total expenditure, industry sector (manually curated) +- `MA_Lobbying_Lobbyists` — one row per lobbyist-year-employer: lobbyist name, compensation +- `MA_Lobbying_Bills` — fact table: one row per bill-year-employer: bill number, general court session, employer name, `env_relevance_score FLOAT`, `is_environmental BOOL`; foreign keys into `MA_Lobbying_Employers` and `MA_Legislature_Bills` +- `MA_Legislature_Bills` — one row per bill-session: bill number, general court, title, primary sponsor, committee, final status, `passed BOOL`; populated from MA Legislature OpenAPI independent of lobbying data + +#### Scraping approach (`get_data/get_MA_lobbying.py`) +1. Pull **all** lobbying filings without pre-filtering by subject tag — filer-supplied subject tags are unreliable (a utility lobbying a wastewater bill may tag it "Utilities & Energy"; a developer opposing wetlands reform may tag it "Land Use"). 
Subject tags can be retained as a column for reference but should not gate inclusion. +2. Paginate the SoS portal search using `requests` + `BeautifulSoup`. Rate-limit to ~1 req/sec with `time.sleep`. Cache raw HTML under `get_data/MA_lobbying_cache/` so incremental re-runs skip already-fetched pages. +3. Write raw CSVs to `docs/data/`: `MA_lobbying_employers.csv`, `MA_lobbying_lobbyists.csv`, `MA_lobbying_bills.csv`. + +#### Bill augmentation (`get_data/get_MA_legislature_bills.py`) +- Fetch only bills that appear in lobbying disclosures (to keep scope bounded) from the **MA Legislature OpenAPI** at [https://malegislature.gov/api/swagger](https://malegislature.gov/api/swagger). No auth required. +- Key endpoints: `GET /api/GeneralCourts` (session index), `GET /api/GeneralCourts/{generalCourtNumber}/Bills/{billNumber}` (bill metadata). +- Write `docs/data/MA_legislature_bills.csv`. Cache JSON responses under `get_data/MA_legislature_cache/`. + +#### Environmental relevance scoring (`get_data/score_lobbying_bills.py`) +- For each unique bill in `MA_Legislature_Bills`, embed `title + description` using the **Google Embeddings API** (`gemini-embedding-2` — current production model as of 2026; supports up to 8,192 input tokens, 768–3,072 output dimensions) via `SECRET_GOOGLE_API_KEY` (already in repo). +- Compute cosine similarity against a curated set of seed phrases: "environmental regulation", "water quality", "wetlands protection", "air pollution control", "DEP enforcement", "stormwater management", "CSO discharge", "hazardous waste", "climate change", "clean energy", "pesticide regulation", "drinking water safety". +- Store `env_relevance_score` (0–1 float) on `MA_Legislature_Bills`; derive `is_environmental` at a calibrated threshold (e.g. 0.55 — tune against a hand-labeled validation set of ~50 bills). Storing the raw score lets analysts choose their own threshold. +- Only embed new/unseen bills on each incremental run — cost stays low. 
+- Add `GOOGLE_API_KEY` secret to CI for this step; run `score_lobbying_bills.py` as a separate call after `get_MA_legislature_bills.py`. + +#### Pipeline integration +- Add to `assemble_db.py` and `generate_semantic_context.py` with explicit join relationship notes (lobbying bills → legislature bills via `bill_number + general_court`; lobbying bills → employers via `employer_name + year`). +- Add to `validate_data.py` row-count checks. +- CI sequence (in `update-data.yml`): `get_MA_lobbying.py` → `get_MA_legislature_bills.py` → `score_lobbying_bills.py`, inserted after `get_eea_dp_cso.py` (steps 5.6–5.8). + +#### Analyses and blog posts + +**Lobbying spend vs. DEP budget and staffing** *(strongest cross-dataset narrative)* +- Overlay annual industry lobbying spend on environmental bills (`env_relevance_score` threshold) against DEP/EEA budget and FTE timelines from `ECOS_budgets_viz.py` and `MADEP_staff.py`. A dual-axis time series: rising lobbying spend vs. falling regulatory capacity, 2007–present. + +**Environmental bill lobbying landscape** +- Which industries (energy, real estate, agriculture, municipalities) dominate lobbying on environmental bills? Time-trend from 2007–present. +- Cross-reference bill disposition (`passed` from legislature API) against lobbying spend: do more heavily-lobbied bills die more often? Requires aggregating employer spend per bill per session. + +**Lobbying intensity vs. enforcement outcomes** +- Join `MA_Lobbying_Employers` against `MAEEADP_Enforcement` by regulated entity name (fuzzy match via rapidfuzz). Are the highest-spending lobbying clients also among the most frequent violators? Does lagged lobbying spend predict reduced enforcement counts? + +**CSO operator lobbying** +- Cross-reference `MA_Lobbying_Bills` (filtered to high `env_relevance_score` CSO/wastewater bills) against EEA DP CSO operators. Are MWRA, city DPWs, or industrial dischargers lobbying on bills that would tighten or relax CSO controls? 
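A minimal sketch of the entity-name join described under "Lobbying intensity vs. enforcement outcomes" above. It uses the standard-library `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies. The `normalise_name` and `best_match` helpers, the sample names, and the 0.85 cutoff are all hypothetical and would need tuning against the real employer and enforcement tables.

```python
import difflib
import re

# Common corporate suffixes to ignore when comparing names (illustrative list).
_SUFFIX_RE = re.compile(
    r'\b(inc|llc|corp|corporation|co|company|ltd)\b\.?', re.IGNORECASE
)


def normalise_name(name: str) -> str:
    """Lowercase, drop corporate suffixes and punctuation, collapse whitespace."""
    name = _SUFFIX_RE.sub(' ', name.lower())
    name = re.sub(r'[^a-z0-9 ]', ' ', name)
    return ' '.join(name.split())


def best_match(employer: str, candidates: list[str], cutoff: float = 0.85):
    """Return (candidate, score) for the closest name, or None below the cutoff.

    The cutoff is an assumed starting point, not a calibrated value.
    """
    target = normalise_name(employer)
    best, best_score = None, 0.0
    for cand in candidates:
        score = difflib.SequenceMatcher(
            None, target, normalise_name(cand)
        ).ratio()
        if score > best_score:
            best, best_score = cand, score
    return (best, best_score) if best_score >= cutoff else None


if __name__ == '__main__':
    # Hypothetical entity names, for illustration only.
    enforcement_entities = ['ACME ENERGY INC', 'Beta Water LLC']
    print(best_match('Acme Energy, Inc.', enforcement_entities))
    print(best_match('Zeta Holdings', enforcement_entities))
```

Normalising before scoring keeps suffix and punctuation differences from dominating the similarity. Pairs that fall just under the cutoff are worth reviewing by hand before they feed the enforcement comparison.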
+ +#### Dashboard charts (weekly-updatable, add to `dashboard_charts.py`) +Lobbying data updates once per year (prior-year filings posted mid-year), so charts will show a new data point annually but are still appropriate for the weekly-run dashboard. + +| Chart slug | Description | +|------------|-------------| +| `dash_lobbying_spend_trend` | Annual total lobbying spend on environmentally-relevant bills (`is_environmental=True`), 2007–present, stacked by industry sector | +| `dash_lobbying_top_employers` | Top 15 employer spenders (most recent complete year) — horizontal bar | +| `dash_lobbying_bill_intensity` | Unique bills lobbied per year + share that passed vs. died in committee | +| `dash_lobbying_vs_enforcement` | Dual-axis: industry lobbying spend (left) vs. EEA enforcement action count (right), 2007–present | + +All four follow the existing `{% include %}` pattern in `docs/dashboard.md`. + +#### Complementary data: MA Legislature OpenAPI +- Session (General Court) index: resolves bill numbers across sessions (190th, 191st, etc.) +- Sponsor data: cross-reference sponsor names against lobbying employer targets to identify which legislators are most frequently lobbied on environmental topics (analysis-post level, not dashboard) + +#### Pending: re-fetch 2010–2016 bill data + +The 2010+ disclosure pages use a 5-column format (`Activity or Bill No and Title | Position | DirectBiz | Client | Compensation`) rather than the 2009 4-column format (`Date | Bill+Title | Lobbyist | Client`). The scraper parser was reading the wrong column as the bill cell, so 2010–2016 fetches captured employer compensation but zero bills. + +Fix is already applied in `get_MA_lobbying.py` (header-based format detection — looks for `'Activity'` in the first header cell to choose `bill_col=0, client_col=3` vs. the 2009 layout). 
The currently-running historical scrape (started 2026-05-21) has already cached those years' `disc_url`s as "fetched", so the fix won't take effect until those rows are re-queued. + +Recovery steps (run from `get_data/` after the main scrape finishes through 2026): +1. Confirm main scrape complete: check `MA_lobbying_summary_links.csv` has rows through 2026 +2. Delete year 2010–2016 rows: `python -c "import pandas as pd; df = pd.read_csv('../docs/data/MA_lobbying_summary_links.csv', index_col=0); df = df[~df['year'].astype(int).between(2010, 2016)]; df.to_csv('../docs/data/MA_lobbying_summary_links.csv')"` +3. Restart scraper: `/home/nes/miniconda/envs/amend_python/bin/python -u get_MA_lobbying.py` — will re-fetch only those years using the fixed parser +4. Run `get_MA_legislature_bills.py` to pick up any new general courts found +5. Run `score_lobbying_bills.py`, `cluster_lobbying_bills.py`, `assemble_db.py`, `generate_semantic_context.py` +6. Re-run `MA_lobbying_viz.py` and update the draft analysis post + +#### Implementation sequence +1. Manual exploration: browse SoS portal to document exact URL patterns, pagination parameters, and field names before writing scraper +2. Write `get_MA_lobbying.py` with caching; run manually on one year to validate +3. Write `get_MA_legislature_bills.py` for bill augmentation (OpenAPI, no auth) +4. Write `score_lobbying_bills.py` for Google Embeddings API relevance scoring; hand-label ~50 bills to calibrate threshold +5. Extend `assemble_db.py`, `generate_semantic_context.py`, `validate_data.py` +6. Write `MA_lobbying_viz.py` with `generate_charts()` and `generate_post_charts()` following the MS4 pattern +7. Add dashboard chart calls to `dashboard_charts.py` +8. Add steps 1–3 to CI pipeline; add `GOOGLE_API_KEY` secret +9. Write analysis blog post: lobbying spend vs. enforcement/budget narrative + # Analyses ### Distribution of permit age by watershed and municipality