This project scrapes legislative bills from the Idaho Legislature, converts them to HTML (preserving strikethrough/underline markup), uses the OpenAI API to detect potential constitutional issues, and presents results in an interactive Streamlit dashboard.
- uv — Python package and project manager (installs the correct Python version for you).
- OpenAI API key — required only for Step 3 (ML analysis).
```bash
uv sync                                      # install deps + correct Python
uv run python scrape.py                      # step 1: scrape bills
uv run python pdf_to_html.py                 # step 2: convert PDFs → HTML
export OPENAI_API_KEY="sk-..."               # step 3 prerequisite
uv run python ml_analysis.py                 # step 3: constitutional analysis
uv run streamlit run bill_data_explorer.py   # step 4: launch dashboard
```

```bash
uv run python scrape.py
```

Downloads bill metadata (number, title, status, sponsor) and PDF files from the Idaho Legislature website into a date-stamped `Data/<DATARUN>/` directory. On completion the date string is saved to `Data/.datarun` so that subsequent steps can find it automatically. Override at any time:
```bash
export DATARUN=04_30_2025
```

```bash
uv run python pdf_to_html.py
```

Converts each PDF to DOCX and then to HTML (via mammoth), preserving the `<u>` (additions) and `<s>` (deletions) formatting used by the Idaho Legislature.
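The local conversion path can be sketched as follows. This is a minimal illustration, not the project's actual `pdf_to_html.py`: the helper name and the mammoth style map are assumptions, using mammoth's documented `u`/`strike` matchers to emit the `<u>`/`<s>` tags described above.

```python
from pathlib import Path

# Assumed style map: mammoth's "u" and "strike" matchers target explicitly
# underlined and struck-through runs and map them to <u>/<s> HTML tags.
STYLE_MAP = "u => u\nstrike => s"

def pdf_to_html(pdf_path: str, out_dir: str = ".") -> Path:
    """Convert one bill PDF to HTML via an intermediate DOCX (pdf2docx mode)."""
    from pdf2docx import Converter   # local, layout-aware PDF -> DOCX
    import mammoth                   # DOCX -> HTML with style mapping

    pdf = Path(pdf_path)
    docx = Path(out_dir) / pdf.with_suffix(".docx").name
    html = Path(out_dir) / pdf.with_suffix(".html").name

    cv = Converter(str(pdf))
    cv.convert(str(docx))
    cv.close()

    with docx.open("rb") as f:
        result = mammoth.convert_to_html(f, style_map=STYLE_MAP)
    html.write_text(result.value, encoding="utf-8")
    return html
```

The third-party imports are deferred into the function so the module can be inspected without `pdf2docx` or `mammoth` installed.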
Supported conversion modes:
- `pdf2docx` (default, local conversion)
- `adobe` (Adobe PDF Services API)
Select mode with:
```bash
export PDF_CONVERSION_MODE=pdf2docx   # default
# or
export PDF_CONVERSION_MODE=adobe
```

If using Adobe mode, set credentials first:

```bash
export PDF_SERVICES_CLIENT_ID="your_client_id"
export PDF_SERVICES_CLIENT_SECRET="your_client_secret"
```

And ensure the Adobe SDK is installed in your environment:

```bash
uv sync --extra adobe
```

```bash
export OPENAI_API_KEY="sk-***********************"
uv run python ml_analysis.py
```

Sends each bill's HTML to OpenAI GPT-4o to identify potential constitutional issues. Bills that fail on the first pass are retried with GPT-4o-mini.
Produces two JSONL files in `Data/`:

| File | Contents |
|---|---|
| `idaho_bills_enriched_<DATARUN>.jsonl` | Bills with detected issues, sorted by issue count |
| `idaho_bills_failed_<DATARUN>.jsonl` | Bills where analysis returned no data |
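Since the output is JSONL (one JSON object per line), downstream tools can consume it with a few lines of stdlib code. A sketch, assuming each record carries an `"issues"` list as described above (the exact schema is the project's, not shown here):

```python
import json
from pathlib import Path

def load_enriched(path: str) -> list[dict]:
    """Read an enriched JSONL file, sorting bills by descending issue count."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    bills = [json.loads(line) for line in lines if line.strip()]
    return sorted(bills, key=lambda b: len(b.get("issues", [])), reverse=True)
```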
```bash
uv run streamlit run bill_data_explorer.py
```

Opens a multi-page Streamlit app with:
- Main page — bills ranked by number of constitutional issues, filterable by status and sponsor, with a detail dialog for each bill.
- Issue-type histogram — distribution of issue types across all bills.
- Sponsor histogram — total issues grouped by sponsor.
- Status codes — reference table of Idaho bill status abbreviations.
Explore the dashboard online: https://danielrmeyer-idaho-legislation-analys-bill-data-explorer-qxzijs.streamlit.app/
```text
├── scrape.py                 # Step 1 — web scraper (requests + BeautifulSoup)
├── pdf_to_html.py            # Step 2 — PDF → DOCX → HTML conversion
├── ml_analysis.py            # Step 3 — OpenAI constitutional analysis
├── bill_data_explorer.py     # Step 4 — Streamlit dashboard (multipage entrypoint)
├── pages/
│   ├── issue_type_histogram.py
│   ├── issues_by_sponsor_histogram.py
│   └── status_codes.py
├── config.py                 # DATARUN resolution (env var → .datarun → auto-detect)
├── utils.py                  # Shared data-loading helper (@st.cache_data)
├── Data/
│   ├── .datarun              # auto-generated by scrape.py
│   ├── idaho_bills_enriched_<DATARUN>.jsonl   # enriched output
│   └── idaho_bills_failed_<DATARUN>.jsonl     # failed analyses
├── pyproject.toml            # uv project config + dependencies
├── uv.lock                   # deterministic lockfile
├── .python-version           # Python 3.13
├── .devcontainer/            # GitHub Codespaces / VS Code devcontainer
├── copilot-instructions.md   # context for AI-assisted development
├── .gitignore
├── LICENSE
└── README.md
```
| Variable | Required | Purpose |
|---|---|---|
| `DATARUN` | No | Override the date string (e.g. `04_30_2025`). See resolution order below. |
| `PDF_CONVERSION_MODE` | No | PDF converter for Step 2: `pdf2docx` (default) or `adobe`. |
| `PDF_SERVICES_CLIENT_ID` | Adobe mode only | Adobe PDF Services client ID. |
| `PDF_SERVICES_CLIENT_SECRET` | Adobe mode only | Adobe PDF Services client secret. |
| `OPENAI_API_KEY` | Step 3 only | OpenAI API key for GPT-4o analysis. |
`config.get_datarun()` resolves the active date string using three sources, checked in order:

1. `DATARUN` environment variable — highest priority, useful for one-off overrides or CI pipelines.
2. `Data/.datarun` file — written automatically by `scrape.py` at the end of a successful run.
3. Auto-detection from `Data/` files — scans for `idaho_bills_enriched_*.jsonl` and extracts the date from the filename. When multiple files exist, the most recent date wins. This allows the dashboard to work out of the box on a fresh clone without running the scraper or setting any environment variable.
| Problem | Solution |
|---|---|
| `Could not determine DATARUN` | Run `scrape.py` first, or `export DATARUN=<date>`. |
| Adobe mode errors in Step 2 | Install `pdfservices-sdk` and set `PDF_SERVICES_CLIENT_ID` + `PDF_SERVICES_CLIENT_SECRET`. |
| `OPENAI_API_KEY` not set | Export the key before running `ml_analysis.py`. |
| PDF conversion warnings | Safe to ignore — `pdf2docx` prints layout heuristics. |
| Dashboard shows no data | Ensure all four pipeline steps completed successfully. |
- Fine-tune an OpenAI or Mistral model on historical Idaho legislation
- Migrate the frontend to Django + Bootstrap for richer interactivity
- Provide a searchable legislative history for citizens and advocacy groups
This project is open-source. See LICENSE for more information.
Contributions are welcome! Please open an issue or pull request with ideas or improvements.