This project scrapes legislative bills from the Idaho Legislature, converts them to HTML (preserving strikethrough/underline markup), uses the OpenAI API to detect potential constitutional issues, and presents results in an interactive Streamlit dashboard.
- uv — Python package and project manager (installs the correct Python version for you).
- OpenAI API key — required only for Step 3 (ML analysis).
```bash
uv sync                                      # install deps + correct Python
uv run python scrape.py                      # step 1: scrape bills
uv run python pdf_to_html.py                 # step 2: convert PDFs → HTML
export OPENAI_API_KEY="sk-..."               # step 3 prerequisite
uv run python ml_analysis.py                 # step 3: constitutional analysis
uv run streamlit run bill_data_explorer.py   # step 4: launch dashboard
```

```bash
uv run python scrape.py
```

Downloads bill metadata (number, title, status, sponsor) and PDF files from the Idaho Legislature website into a date-stamped `Data/<DATARUN>/` directory. On completion the date string is saved to `Data/.datarun` so that subsequent steps can find it automatically. Override at any time:
```bash
export DATARUN=04_30_2025
```

```bash
uv run python pdf_to_html.py
```

Converts each PDF to DOCX and then to HTML (via mammoth), preserving the `<u>` (additions) and `<s>` (deletions) formatting used by the Idaho Legislature.
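The local conversion path can be sketched as follows. This is a minimal illustration, not the project's actual `pdf_to_html.py`: the helper name and the mammoth style map are assumptions, using mammoth's documented `u`/`strike` matchers to emit the `<u>`/`<s>` tags described above.

```python
from pathlib import Path

# Assumed style map: mammoth's "u" and "strike" matchers target explicitly
# underlined and struck-through runs and map them to <u>/<s> HTML tags.
STYLE_MAP = "u => u\nstrike => s"

def pdf_to_html(pdf_path: str, out_dir: str = ".") -> Path:
    """Convert one bill PDF to HTML via an intermediate DOCX (pdf2docx mode)."""
    from pdf2docx import Converter   # local, layout-aware PDF -> DOCX
    import mammoth                   # DOCX -> HTML with style mapping

    pdf = Path(pdf_path)
    docx = Path(out_dir) / pdf.with_suffix(".docx").name
    html = Path(out_dir) / pdf.with_suffix(".html").name

    cv = Converter(str(pdf))
    cv.convert(str(docx))
    cv.close()

    with docx.open("rb") as f:
        result = mammoth.convert_to_html(f, style_map=STYLE_MAP)
    html.write_text(result.value, encoding="utf-8")
    return html
```

The third-party imports are deferred into the function so the module can be inspected without `pdf2docx` or `mammoth` installed.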
Supported conversion modes:
- `pdf2docx` (default, local conversion)
- `adobe` (Adobe PDF Services API)
Select mode with:
```bash
export PDF_CONVERSION_MODE=pdf2docx   # default
# or
export PDF_CONVERSION_MODE=adobe
```

If using Adobe mode, set credentials first:

```bash
export PDF_SERVICES_CLIENT_ID="your_client_id"
export PDF_SERVICES_CLIENT_SECRET="your_client_secret"
```

And ensure the Adobe SDK is installed in your environment:

```bash
uv sync --extra adobe
```

```bash
export OPENAI_API_KEY="sk-***********************"
uv run python ml_analysis.py
```

Sends each bill's HTML to OpenAI GPT-4o to identify potential constitutional issues. Bills that fail on the first pass are retried with GPT-4o-mini.
Produces two JSONL files in `Data/`:

| File | Contents |
|---|---|
| `idaho_bills_enriched_<DATARUN>.jsonl` | Bills with detected issues, sorted by issue count |
| `idaho_bills_failed_<DATARUN>.jsonl` | Bills where analysis returned no data |
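Since the output is JSONL (one JSON object per line), downstream tools can consume it with a few lines of stdlib code. A sketch, assuming each record carries an `"issues"` list as described above (the exact schema is the project's, not shown here):

```python
import json
from pathlib import Path

def load_enriched(path: str) -> list[dict]:
    """Read an enriched JSONL file, sorting bills by descending issue count."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    bills = [json.loads(line) for line in lines if line.strip()]
    return sorted(bills, key=lambda b: len(b.get("issues", [])), reverse=True)
```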
```bash
uv run streamlit run bill_data_explorer.py
```

Opens a multi-page Streamlit app with:
- Main page — bills ranked by number of constitutional issues, filterable by status and sponsor, with a detail dialog for each bill.
- Issue-type histogram — distribution of issue types across all bills.
- Sponsor histogram — total issues grouped by sponsor.
- Status codes — reference table of Idaho bill status abbreviations.
Explore the dashboard online: https://danielrmeyer-idaho-legislation-analys-bill-data-explorer-qxzijs.streamlit.app/
```text
├── scrape.py                 # Step 1 — web scraper (requests + BeautifulSoup)
├── pdf_to_html.py            # Step 2 — PDF → DOCX → HTML conversion
├── ml_analysis.py            # Step 3 — OpenAI constitutional analysis
├── bill_data_explorer.py     # Step 4 — Streamlit dashboard (multipage entrypoint)
├── pages/
│   ├── issue_type_histogram.py
│   ├── issues_by_sponsor_histogram.py
│   └── status_codes.py
├── config.py                 # DATARUN resolution (env var → .datarun → auto-detect)
├── utils.py                  # Shared data-loading helper (@st.cache_data)
├── Data/
│   ├── .datarun              # auto-generated by scrape.py
│   ├── idaho_bills_enriched_<DATARUN>.jsonl   # enriched output
│   └── idaho_bills_failed_<DATARUN>.jsonl     # failed analyses
├── pyproject.toml            # uv project config + dependencies
├── uv.lock                   # deterministic lockfile
├── .python-version           # Python 3.13
├── .devcontainer/            # GitHub Codespaces / VS Code devcontainer
├── copilot-instructions.md   # context for AI-assisted development
├── .gitignore
├── LICENSE
└── README.md
```
| Variable | Required | Purpose |
|---|---|---|
| `DATARUN` | No | Override the date string (e.g. `04_30_2025`). See resolution order below. |
| `PDF_CONVERSION_MODE` | No | PDF converter for Step 2: `pdf2docx` (default) or `adobe`. |
| `PDF_SERVICES_CLIENT_ID` | Adobe mode only | Adobe PDF Services client ID. |
| `PDF_SERVICES_CLIENT_SECRET` | Adobe mode only | Adobe PDF Services client secret. |
| `OPENAI_API_KEY` | Step 3 only | OpenAI API key for GPT-4o analysis. |
`config.get_datarun()` resolves the active date string using three sources, checked in order:

1. `DATARUN` environment variable — highest priority, useful for one-off overrides or CI pipelines.
2. `Data/.datarun` file — written automatically by `scrape.py` at the end of a successful run.
3. Auto-detection from `Data/` files — scans for `idaho_bills_enriched_*.jsonl` and extracts the date from the filename. When multiple files exist, the most recent date wins. This allows the dashboard to work out of the box on a fresh clone without running the scraper or setting any environment variable.
| Problem | Solution |
|---|---|
| `Could not determine DATARUN` | Run `scrape.py` first, or `export DATARUN=<date>`. |
| Adobe mode errors in Step 2 | Install `pdfservices-sdk` and set `PDF_SERVICES_CLIENT_ID` + `PDF_SERVICES_CLIENT_SECRET`. |
| `OPENAI_API_KEY` not set | Export the key before running `ml_analysis.py`. |
| PDF conversion warnings | Safe to ignore — `pdf2docx` prints layout heuristics. |
| Dashboard shows no data | Ensure all four pipeline steps completed successfully. |
- Fine-tune an OpenAI or Mistral model on historical Idaho legislation
- Migrate the frontend to Django + Bootstrap for richer interactivity
- Provide a searchable legislative history for citizens and advocacy groups
This project is open-source. See LICENSE for more information.
Contributions are welcome! Please open an issue or pull request with ideas or improvements.