50 commits
c1e708a
feat: MA lobbying data pipeline and dashboard charts
nesanders May 20, 2026
e09b6b4
docs: add MA lobbying dataset page, dashboard section, and CLAUDE.md …
nesanders May 20, 2026
2b33f77
data: add 2024 MA lobbying disclosures (1,650 employer rows, 14,822 b…
nesanders May 20, 2026
db46876
fix: correct column names, API endpoint, and dual-axis params in lobb…
nesanders May 20, 2026
1952cc9
feat: bill embeddings, topic clustering, and cluster spend chart
nesanders May 20, 2026
6724e99
fix: store lobbying bill_number as integer in DB; load all analysis d…
nesanders May 20, 2026
6e181c3
fix: make lobbying and legislature fetch scripts fully resumable
nesanders May 20, 2026
280a08b
feat: t-SNE cluster plot, embedding docs, and fetch retry/cleanup
nesanders May 20, 2026
2fd4d92
docs: add get_data/README.md covering lobbying pipeline and embeddings
nesanders May 21, 2026
8d516d2
rename: README.md → README_lobbying.md in get_data/
nesanders May 21, 2026
6002fa7
docs: trim README_lobbying.md to lobbying pipeline only
nesanders May 21, 2026
fbfd386
feat: lobbying charts, semantic context, and pipeline fixes
nesanders May 22, 2026
839e9b2
feat: MA environmental lobbying analysis — charts, post draft, and pi…
nesanders May 27, 2026
636e051
fix: use sample CSVs in MA_lobbying.md; upload large CSVs to GCS in a…
nesanders May 27, 2026
88ab4ac
feat: normalize entity/client names at DB assembly; show 10 rows on d…
nesanders May 27, 2026
2b3ef74
fix: improve env scoring — expanded non-env examples, raise threshold…
nesanders May 27, 2026
f1ad513
fix: lower env threshold to 0.06 after calibrating with expanded non-…
nesanders May 27, 2026
8ea8e9c
fix: quote legislature CSV with QUOTE_NONNUMERIC to prevent C parser …
nesanders May 27, 2026
b92971f
data: regenerate lobbying charts and samples with full historical data
nesanders May 27, 2026
ff7032b
fix: add actions:write permission to dispatch update-charts workflow
nesanders May 27, 2026
42c7d04
data: regenerate t-SNE bill embedding visualization (25,928 bills)
nesanders May 27, 2026
4ae21b3
test: add test_bill_embedding_pipeline.py for iterating on embedding …
nesanders May 27, 2026
ae99d5b
feat: strip legislative boilerplate, prepend title, expand to 3000 ch…
nesanders May 27, 2026
b53db2b
data: regenerate bill embeddings with boilerplate stripping and updat…
nesanders May 27, 2026
5abb6eb
analysis: fix MA_lobbying_viz.py - proportional env spend, 3 new posi…
nesanders May 27, 2026
804e8f7
fix: correct General Court formula (FIRST_GC_START_YEAR 2005→2003) an…
nesanders May 28, 2026
fa226d1
feat: persist k-means model for incremental cluster assignment
nesanders May 29, 2026
46c658f
fix: QUOTE_NONNUMERIC + engine=python for long-field CSVs; update pip…
nesanders May 29, 2026
28a8dd7
data: commit k-means model and training mean to repo
nesanders May 29, 2026
8fc1ed7
feat: LLM summary+taxonomy pipeline, UMAP clustering, diagnostic suite
nesanders May 31, 2026
bad7937
feat: parallel summarization, thinking token tracking, corrected cost…
nesanders May 31, 2026
afbc63c
fix: correct Gemini output token price (/bin/bash.30→.50/1M, verified…
nesanders Jun 1, 2026
81c3df9
fix: catch checkpoint save errors to prevent silent budget burn
nesanders Jun 1, 2026
9a0edd5
feat: log SUMMARY per bill + add recover_from_log.py for crash recovery
nesanders Jun 1, 2026
310dd22
data: regenerate full lobbying analysis with 25,915-bill summaries
nesanders Jun 1, 2026
efcc913
feat: add 5 new analysis charts to lobbying post + static site proposal
nesanders Jun 2, 2026
182499c
docs: revise static site proposal — all bills, query-string routing
nesanders Jun 2, 2026
cac82f4
chore: remove static site proposal from repo, add to .gitignore
nesanders Jun 2, 2026
b9949c2
fix: correct lobbying table schema in semantic context
nesanders Jun 2, 2026
91ae9d4
fix: deduplicate MA_Lobbying_Bills and MA_Lobbying_Bills_Scored in as…
nesanders Jun 2, 2026
46609ab
fix: replace concatenated multi-bill titles in MA_Lobbying_Bills_Scored
nesanders Jun 2, 2026
f14d54c
fix: correct H/S bill dedup, add bill_id join key, link to lobbying e…
nesanders Jun 2, 2026
d4511a0
docs: update frontpage dataset list with all current sources
nesanders Jun 3, 2026
4406077
fix: improve entity name normalization in assemble_db.py
nesanders Jun 3, 2026
34284fe
ci: replace cross-workflow dispatch with three self-contained workflows
nesanders Jun 3, 2026
0c09d55
fix: pin scikit-learn, bump joblib, add gcsfs/pyarrow to requirements…
nesanders Jun 3, 2026
4f4c5ae
fix: correct H/S bill dedup in score_lobbying_bills.py; split CI fetc…
nesanders Jun 4, 2026
2690110
data: backfill H/S collision bills — 7k new summaries/tags + cost docs
nesanders Jun 5, 2026
67c17fd
Merge remote-tracking branch 'origin/main' into feat/ma-lobbying-data
nesanders Jun 22, 2026
93720c5
feat: finalize lobbying analysis post + facts on rebuilt LLM-classifi…
nesanders Jun 22, 2026
@@ -0,0 +1,14 @@
---
name: env-threshold-review
description: Pending task to revisit environmental relevance threshold after GC fix re-embed, with documented artifact
metadata:
type: project
---

After the GC formula fix and full re-embed (May 2026), the environmental bill count jumped from 329 → 654 at threshold=0.05. This needs a calibration review.

**Task:** Re-run the threshold analysis — plot the score distribution, spot-check bills near the new boundary, and decide whether 0.05 is still correct or needs adjustment. Document the exercise in a written artifact (analysis page or data note) explaining: the differential cosine similarity method, the reference sets, how the threshold was chosen, and what the before/after counts were at various thresholds.

**Why:** The doubling of env bill count is plausible (correct body text adds real signal) but should be verified with spot-checks. Some new bills at 0.05–0.08 may be genuine env bills the old wrong-GC embeddings missed; others may be false positives from body text that semantically resembles env topics without being env legislation.

**Related:** [[project_data_pipeline]] — score_lobbying_bills.py ENV_THRESHOLD constant; [[ai_analysis_feature]] — env bill counts flow into the AMEND.db and dashboard.
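
**Sketch:** The threshold review described above could start from something like the following. It assumes, hypothetically, that the differential score is the mean cosine similarity to an environmental reference set minus the mean similarity to a non-environmental one. The function names and toy vectors are illustrative and are not taken from score_lobbying_bills.py, and plain Python stands in for whatever array library the pipeline uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def differential_score(vec, env_refs, non_env_refs):
    """ASSUMED definition: mean similarity to env refs minus mean to non-env refs."""
    env_sim = sum(cosine(vec, r) for r in env_refs) / len(env_refs)
    non_sim = sum(cosine(vec, r) for r in non_env_refs) / len(non_env_refs)
    return env_sim - non_sim


def counts_at_thresholds(scores, thresholds):
    """Number of bills scoring at or above each candidate threshold."""
    return {t: sum(s >= t for s in scores) for t in thresholds}


def near_boundary(scored, low, high):
    """Bills with low <= score < high, lowest score first, for manual spot-checks."""
    return sorted(
        ((bill, s) for bill, s in scored.items() if low <= s < high),
        key=lambda pair: pair[1],
    )


# Toy 2-D "embeddings"; the real embeddings and reference sets are not shown here.
env_refs = [[1.0, 0.0]]
non_env_refs = [[0.0, 1.0]]
bills = {'A': [1.0, 0.0], 'B': [1.0, 1.0], 'C': [0.0, 1.0]}
scored = {name: differential_score(v, env_refs, non_env_refs) for name, v in bills.items()}

print(counts_at_thresholds(scored.values(), [0.05, 0.06, 0.08]))
print(near_boundary(scored, 0.05, 0.08))
```

Running `counts_at_thresholds` on the real scores before and after the re-embed would give the before/after table the task asks for, and `near_boundary(scored, 0.05, 0.08)` would list the bills worth spot-checking by hand.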
1 change: 1 addition & 0 deletions .claude/scheduled_tasks.lock
@@ -0,0 +1 @@
{"sessionId":"d2fc4fd3-d9ff-44a4-a608-05ddeb65f47b","pid":1140968,"procStart":"103407381","acquiredAt":1779884724489}
236 changes: 236 additions & 0 deletions analysis/MA_lobbying_tsne.py
@@ -0,0 +1,236 @@
"""Generate a UMAP scatter plot of MA lobbying bill embeddings.

Visual design philosophy
─────────────────────────
MA legislative bill embeddings are semantically dense — all bills share heavy
regulatory language, so inter-cluster cosine distances are ~0.006 vs.
intra-cluster spread of ~0.53. Running t-SNE on all 25k bills produces a
featureless blob regardless of perplexity, because the structure simply doesn't
separate in 2-D.

UMAP is used instead of t-SNE because it better preserves global structure,
pulling weakly-separated clusters apart more effectively than t-SNE's purely
local optimisation. Parameters: n_neighbors=30, min_dist=0.1, metric='cosine'.

The chart shows TWO layers:

Background (grey) — stratified sample of ~120 non-environmental bills per
cluster, rendered as tiny translucent grey dots. Provides
geographic context for the policy landscape.

Signal (coloured) — all env-relevant bills (~654), one colour per cluster,
large outlined dots. These are what the visitor cares about.

UMAP is computed on the combined ~3,650 point sample (all env + background),
which runs in ~30s and produces cleaner structure than t-SNE on this corpus.

Run from the analysis/ directory:
    /path/to/python -u MA_lobbying_tsne.py

Outputs:
    ../docs/_includes/charts/lobbying_bill_tsne.html
"""

import sys
from pathlib import Path

import numpy as np
import pandas as pd
import umap
from sklearn.preprocessing import normalize
import plotly.graph_objects as go

sys.path.insert(0, str(Path(__file__).parent))

GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet'
LOCAL_PARQUET = Path('../docs/data/MA_bill_embeddings.parquet')
LABELS_CSV = Path('../docs/data/MA_bill_cluster_labels.csv')
OUT_HTML = Path('../docs/_includes/charts/lobbying_bill_tsne.html')

# Non-env bills sampled per cluster for background context.
# 120 × 25 clusters ≈ 3 000 background points + ~654 env ≈ 3 650 total.
BG_PER_CLUSTER = 120
RANDOM_STATE = 42

# UMAP hyperparameters
UMAP_N_NEIGHBORS = 30 # larger → more global structure
UMAP_MIN_DIST = 0.1 # smaller → tighter clusters
UMAP_METRIC = 'cosine'

# 25-colour palette — qualitative, perceptually distinct, no cycling
PALETTE_25 = [
    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
    '#aec7e8', '#ffbb78', '#98df8a', '#ff9896', '#c5b0d5',
    '#c49c94', '#f7b6d2', '#c7c7c7', '#dbdb8d', '#9edae5',
    '#393b79', '#637939', '#8c6d31', '#843c39', '#7b4173',
]


def _load_parquet() -> pd.DataFrame:
    try:
        import gcsfs
        fs = gcsfs.GCSFileSystem()
        if fs.exists(GCS_PARQUET):
            with fs.open(GCS_PARQUET, 'rb') as f:
                df = pd.read_parquet(f)
            print(f'Loaded {len(df)} rows from {GCS_PARQUET}')
            return df
    except Exception as e:
        print(f'GCS load failed ({e}), trying local...')
    if LOCAL_PARQUET.exists():
        df = pd.read_parquet(LOCAL_PARQUET)
        print(f'Loaded {len(df)} rows from local Parquet')
        return df
    raise FileNotFoundError('No Parquet file found. Run score_lobbying_bills.py first.')


def main():
    parquet_df = _load_parquet()

    # Restrict to clustered bills
    parquet_df = parquet_df[
        parquet_df['cluster_id'].notna() & (parquet_df['cluster_id'] != -1)
    ].copy()
    parquet_df['cluster_id'] = parquet_df['cluster_id'].astype(int)

    if 'is_environmental' not in parquet_df.columns:
        parquet_df['is_environmental'] = False
    parquet_df['is_environmental'] = parquet_df['is_environmental'].fillna(False).astype(bool)

    labels_df = pd.read_csv(LABELS_CSV, engine='python', on_bad_lines='skip')
    # example_titles may contain unquoted commas that corrupt row parsing;
    # keep only rows with a valid integer cluster_id.
    labels_df = labels_df[
        pd.to_numeric(labels_df['cluster_id'], errors='coerce').notna()
    ].copy()
    labels_df['cluster_id'] = labels_df['cluster_id'].astype(int)
    label_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['label']))
    nenv_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['n_env_bills']))

    # ── Build subsample ──────────────────────────────────────────────────────
    # Keep ALL env bills; sample BG_PER_CLUSTER non-env bills per cluster.
    env_df = parquet_df[parquet_df['is_environmental']].copy()
    non_env = parquet_df[~parquet_df['is_environmental']]

    rng = np.random.default_rng(RANDOM_STATE)
    bg_parts = []
    for cid in sorted(non_env['cluster_id'].unique()):
        sub = non_env[non_env['cluster_id'] == cid]
        n = min(BG_PER_CLUSTER, len(sub))
        bg_parts.append(sub.sample(n=n, random_state=int(rng.integers(0, 2**31))))

    bg_df = pd.concat(bg_parts, ignore_index=True)
    sample = pd.concat([env_df, bg_df], ignore_index=True)
    print(f'Subsample: {len(env_df)} env + {len(bg_df)} background = {len(sample)} total')

    # ── Embeddings ───────────────────────────────────────────────────────────
    emb = np.vstack(sample['embedding'].apply(
        lambda v: np.array(v, dtype=np.float32)
    ).values)
    emb_norm = normalize(emb, norm='l2')

    # ── UMAP ─────────────────────────────────────────────────────────────────
    print(f'Running UMAP (n={len(sample)}, n_neighbors={UMAP_N_NEIGHBORS}, '
          f'min_dist={UMAP_MIN_DIST}, metric={UMAP_METRIC})...')
    reducer = umap.UMAP(
        n_components=2,
        n_neighbors=UMAP_N_NEIGHBORS,
        min_dist=UMAP_MIN_DIST,
        metric=UMAP_METRIC,
        random_state=RANDOM_STATE,
        low_memory=False,
    )
    coords = reducer.fit_transform(emb_norm)
    sample = sample.copy()
    sample['x'] = coords[:, 0]
    sample['y'] = coords[:, 1]

    # ── Build Plotly figure ──────────────────────────────────────────────────
    fig = go.Figure()

    bg = sample[~sample['is_environmental']]
    envs = sample[sample['is_environmental']]

    # Layer 1 — grey background (all non-env, single trace for performance)
    fig.add_trace(go.Scatter(
        x=bg['x'], y=bg['y'],
        mode='markers',
        marker=dict(color='#aaaaaa', size=4, opacity=0.20),
        name='Non-environmental bills',
        hovertext=[
            f'<b>{row.get("bill_title", "")}</b><br>'
            f'GC {int(row["general_court"])} · {label_map.get(int(row["cluster_id"]), "")}'
            for _, row in bg.iterrows()
        ],
        hoverinfo='text',
        showlegend=True,
        legendgroup='bg',
        legendgrouptitle=dict(text='Background'),
    ))

    # Layer 2 — env bills, one trace per cluster that has any env bills
    env_cluster_ids = sorted(envs['cluster_id'].unique())
    for i, cid in enumerate(env_cluster_ids):
        sub = envs[envs['cluster_id'] == cid]
        lbl = label_map.get(cid, f'Cluster {cid}')
        nenv = nenv_map.get(cid, len(sub))
        color = PALETTE_25[cid % len(PALETTE_25)]

        fig.add_trace(go.Scatter(
            x=sub['x'], y=sub['y'],
            mode='markers',
            marker=dict(
                color=color, size=11, opacity=0.92,
                line=dict(color='black', width=1.2),
            ),
            name=f'{lbl} ({nenv} env)',
            hovertext=[
                f'<b>{row.get("bill_title", "")}</b><br>'
                f'GC {int(row["general_court"])} · 🌿 environmental<br>'
                f'Cluster: {lbl}<br>'
                f'Score: {float(row.get("env_relevance_score", float("nan"))):.3f}'  # NaN default: '' cannot be formatted with :.3f
                for _, row in sub.iterrows()
            ],
            hoverinfo='text',
            showlegend=True,
            legendgroup='env',
            legendgrouptitle=dict(text='Environmental bills by cluster') if i == 0 else dict(text=''),
        ))

    fig.update_layout(
        title=dict(
            text=(
                'MA Lobbying Bills — Environmental Bills in the Policy Landscape'
                f'<br><sup>Coloured = {len(envs)} environmentally-relevant bills · '
                f'grey = background sample ({len(bg):,} non-env) · '
                'colour = topic cluster · hover for details · UMAP projection</sup>'
            ),
            font=dict(size=13),
        ),
        xaxis=dict(visible=False),
        yaxis=dict(visible=False),
        legend=dict(
            font=dict(size=10),
            itemsizing='constant',
            tracegroupgap=8,
        ),
        margin=dict(l=10, r=10, t=70, b=10),
        width=880,
        height=600,
        plot_bgcolor='#f8f8f8',
        paper_bgcolor='white',
        hovermode='closest',
    )

    OUT_HTML.parent.mkdir(parents=True, exist_ok=True)
    html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True})
    OUT_HTML.write_text(
        '{% raw %}\n' + html + '\n{% endraw %}\n',
        encoding='utf-8',
    )
    print(f'Wrote {OUT_HTML}')


if __name__ == '__main__':
    main()
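
The two-layer subsample built in `main()` (every environmental bill plus a capped random draw of non-environmental bills from each cluster) can be sketched without pandas as follows. This is a simplified stand-in that uses plain dicts and the standard library, not the script's own code; the `stratified_subsample` name and the toy records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(bills, per_cluster, seed=42):
    """Keep every environmental bill; cap non-environmental bills per cluster.

    `bills` is a list of dicts with 'cluster_id' and 'is_environmental' keys
    (a hypothetical stand-in for rows of the embeddings Parquet file).
    """
    rng = random.Random(seed)
    env = [b for b in bills if b['is_environmental']]

    # Group the remaining bills by cluster so each cluster is sampled separately.
    by_cluster = defaultdict(list)
    for b in bills:
        if not b['is_environmental']:
            by_cluster[b['cluster_id']].append(b)

    background = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        k = min(per_cluster, len(members))  # small clusters contribute all members
        background.extend(rng.sample(members, k))

    return env, background


# Toy data: cluster 0 has 5 non-env bills, cluster 1 has 2, plus 3 env bills.
bills = (
    [{'cluster_id': 0, 'is_environmental': False} for _ in range(5)]
    + [{'cluster_id': 1, 'is_environmental': False} for _ in range(2)]
    + [{'cluster_id': 0, 'is_environmental': True} for _ in range(3)]
)
env, background = stratified_subsample(bills, per_cluster=3)
print(len(env), len(background))
```

With the toy data this keeps all 3 environmental bills and 5 background bills (3 drawn from cluster 0, both from cluster 1). Capping per cluster rather than sampling globally keeps small clusters visible in the grey layer instead of letting the largest clusters dominate it.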