diff --git a/.github/skills/document-retrieval/SKILL.md b/.github/skills/document-retrieval/SKILL.md
new file mode 100644
index 00000000..9c077424
--- /dev/null
+++ b/.github/skills/document-retrieval/SKILL.md
@@ -0,0 +1,173 @@
---
name: document-retrieval
description: Build and tune retrieval configs that search, rank, and collect ordinance documents in COMPASS. Use whenever a user asks to improve retrieval precision/recall, tune search queries/keywords, or debug acquisition quality before extraction tuning.
---

# Document Retrieval Skill

Use this skill to improve retrieval precision/recall before extraction tuning.
Applies to both one-shot (schema-driven) and legacy decision-tree extraction
pipelines.

## When to use

- Download step returns noisy sources (one-shot extraction).
- Ordinance recall is weak across jurisdictions (one-shot extraction).
- LLM filtering is compensating for poor search quality.

## Do not use

- Schema feature definition or value extraction logic design.
- Post-extraction feature/value debugging when retrieval is already correct.

## Expected assistant output

When using this skill, return:

1. The retrieval axis changed (queries, keyword weights, or heuristics).
2. Evidence from artifacts/logs showing why the change was needed.
3. The next run command against the same jurisdiction sample.

## Canonical reference

Consult example plugin configurations in `examples/`:
- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard one-shot config
- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-document edge cases

When creating new tech configs, use `<tech>_plugin_config.yaml` as a recommended
naming convention (e.g. `geothermal_plugin_config.yaml`).

## Scope

- Query-template strategy.
- URL ranking and filtering patterns.
- Heuristic phrase controls before LLM validation.

## Two retrieval phases

COMPASS runs two sequential acquisition passes per jurisdiction:

1. 
**Search-engine phase** — queries `SerpAPIGoogleSearch` (or configured + engine) using `query_templates`. This phase is the primary source of + ordinance documents. +2. **Website crawl phase** — crawls the jurisdiction's official website, + ranking pages using `website_keywords`. This phase is a secondary pass + and runs only if the search-engine phase did not yield an ordinance + context. + +Key behaviors: +- Playwright browser errors during the website crawl phase are **non-fatal**. + COMPASS logs the error and continues. +- `Found 0 potential documents` at the end of the crawl phase is **expected** + for jurisdictions without relevant online ordinances. +- Disable the crawl phase with `perform_website_search: false` in run config + when you want faster smoke tests or Playwright is unavailable. + +## Key management + +For SerpAPI-backed search, keep `api_key` out of committed config and provide +`SERPAPI_KEY` via environment (for example through `.env` loaded in shell). + +Recommended shell setup: + +```bash +set -a +source .env +set +a +``` + +Avoid spaces around `=` in `.env` assignments. + +## Retrieval design pattern + +1. Create 3-7 jurisdiction queries with `{jurisdiction}`. +2. Weight legal document indicators in URL keywords. +3. Apply exclusions for templates/reports/slides. +4. Add focused negative tech terms to reduce false positives. +5. Start with dynamic search, then switch to deterministic known URLs when + search infrastructure is unstable. + +When using `heuristic_keywords`, use these four lists to guide pre-LLM filtering: +- `GOOD_TECH_KEYWORDS` — strong indicators of the target technology + (e.g., facility types, deployment modes). Documents matching even a + few keywords are marked as candidates. +- `GOOD_TECH_PHRASES` — multi-word phrases that signal relevant + ordinance content. Keep specific to avoid false positives. +- `GOOD_TECH_ACRONYMS` — industry-standard abbreviations for the + technology. 
Narrow list; include only widely recognized acronyms. +- `NOT_TECH_WORDS` — pre-heuristic filter that rejects documents + before keyword matching. Use to exclude adjacent technologies and + irrelevant domains (e.g., residential HVAC, unrelated industries). + Runs first; prevents wasted keyword evaluation on clearly-wrong + documents. + +If any required list is missing or empty, COMPASS raises a plugin +configuration error and extraction quality should be treated as failed. + +For first-pass reliability, test retrieval with deterministic known URLs +before using live web search. + +## Technology-specific retrieval controls (template) + +- Include target-technology facility/deployment terms. +- Exclude adjacent and non-target terms (residential/HVAC/PV/etc as needed). +- Favor jurisdictional legal-code signals like `land use code`, + `code of ordinances`, `use table`, and `special use permit`. + +## Deterministic smoke-test mode +For this smoke test, at least one of the following documentation sources must be provided: + +- **`known_doc_urls`**: A list of URLs pointing to external documentation that the scraper can access and parse +- **`known_local_docs`**: A collection of local documentation files available in the repository or system + +Use run-config controls to bypass flaky search while tuning: + +- supply `known_doc_urls` or `known_local_docs`, +- set `perform_se_search: false`, +- set `perform_website_search: false`. + +Then validate: + +- download artifacts exist, +- cleaned text exists, +- ordinance DB rows are non-empty. + +## Tuning loop + +1. Run SE-search phase on small sample. +2. Inspect kept vs discarded PDFs (`ordinance_files/`). +3. Run heuristic filter and review false rejects/accepts (`cleaned_text/`). +4. Check website crawl phase independently if needed (enable, run, inspect logs). +5. 
Update one axis only: + - query templates (affects SE phase), + - URL weights (affects both phases), + - include/exclude heuristic patterns (pre-LLM filter), + - `NOT_TECH_WORDS` (upstream document rejection). +6. Re-run same sample and compare. + +## Cross-tech onboarding + +When reusing this workflow for any technology: + +- keep legal retrieval tokens (`ordinance`, `zoning`, `code`), +- replace all technology terms in `query_templates`, `website_keywords`, + and `heuristic_keywords`, +- seed `known_doc_urls` with authoritative regulatory documents for smoke + testing, +- avoid copying negatives from previous technologies into the new tech config, +- verify `NOT_TECH_WORDS` excludes adjacent technologies for your domain. + +## Phase gates + +- **3 jurisdictions**: ensure major source classes are found. +- **10 jurisdictions**: verify stability across regions. + + +## Guardrails + +- Keep feature extraction logic out of retrieval config. +- Do not overfit to one county's document style. +- Preserve auditable rationale for each retrieval change. +- Keep one canonical retrieval config per active technology. +- Ensure each run uses a unique `out_dir` to avoid COMPASS aborting early. + diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md new file mode 100644 index 00000000..c2fafa8b --- /dev/null +++ b/.github/skills/extraction-run/SKILL.md @@ -0,0 +1,273 @@ +--- +name: extraction-run +description: Execute one-shot extraction with COMPASS and iterate quickly with low cost. Use whenever a user asks to run, smoke-test, validate, debug, or scale one-shot schema extraction for any technology. +--- + +# Extraction Run Skill + +**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction. +For decision-tree extraction (solar, wind, small wind), consult COMPASS +architecture docs. 
Use this skill to run one-shot extraction in a repeatable, low-risk way,
then iterate quickly until you have stable structured outputs.

## When to use

- Schema exists and plugin config points to it.
- You need a reliable smoke-test workflow before scaling.
- You are NOT using decision-tree extraction.

## Do not use

- Decision-tree extraction feature engineering.
- Python parser implementation in `compass/extraction/<tech>/parse.py`.
- Non-extraction tasks (for example docs-only updates).

## Expected assistant output

When using this skill, return:

1. The exact `pixi run compass process ...` command used.
2. A pass/fail decision against extraction-quality gates.
3. The smallest next config/schema change and why.

## Canonical reference

- `examples/one_shot_schema_extraction/` — complete working examples
- `examples/one_shot_schema_extraction/README.rst` — general one-shot overview
- `examples/water_rights_demo/one-shot/` — multi-doc extraction example

## Two-pipeline modes

COMPASS supports two distinct extraction pipelines. Choose one and do not mix
them for the same technology:

| Mode | Where code lives | Good for |
|---|---|---|
| **One-shot (schema-based)** | `examples/` → `compass/extraction/<tech>/` | New techs, no Python changes |
| **Decision-tree** | Python code in `compass/extraction/<tech>/` | Existing solar, wind, small wind |

One-shot is the correct path for all new technology onboarding. It requires
only a schema JSON, a plugin YAML, and a run config — no Python source changes.

## Tech promotion lifecycle

New technology assets start in `examples/` and finish in `compass/extraction/`:

1. **Develop** — place all assets in `examples/one_shot_schema_extraction/`
2. **Stabilize** — iterate schema/plugin until smoke and robustness gates pass
3. 
**Promote** — copy the three finalized files into `compass/extraction/<tech>/`:
   - `<tech>_schema.json`
   - `<tech>_plugin_config.yaml`
   - `__init__.py` — registers the plugin via `create_schema_based_one_shot_extraction_plugin`

   After creating the package, add an import in `compass/extraction/__init__.py`
   to register the plugin at startup. See `compass/extraction/ghp/__init__.py`
   for a reference implementation.

## Required inputs

- Run config for `compass process`.
- Plugin config containing `schema`.
- API keys in environment (never hardcode in configs).
- A jurisdiction set sized to the current phase.

## Preflight checks (must pass before run)

- Jurisdiction CSV has headers `County,State` or `County,State,Subdivision,Jurisdiction Type`.
- `out_dir` is unique for this run.
- At least one acquisition step is enabled:
  `perform_se_search: true`, `perform_website_search: true`,
  `known_doc_urls`, or `known_local_docs`.
- If `heuristic_keywords` exists, all four required lists are present and
  non-empty.

## Naming convention

Use tech-first names for all one-shot assets:

- `<tech>_plugin_config.yaml`
- `<tech>_schema.json`
- `<tech>_jurisdictions*.csv`

The `tech` value in the run config must be a string that becomes the plugin
registry identifier. It must be unique, lowercase, and underscore-separated
(for example `concentrating_solar`, `geothermal_electricity`). COMPASS will
raise `Unknown tech input` if this key does not match any registered plugin.

## Canonical development pattern

For early development, start with the proven dynamic baseline, then fall back
to deterministic mode only when search infrastructure is unstable:

1. Use one small jurisdiction file (1-3 rows).
2. Use your preferred configured search engine.
3. Load `.env` into shell (`set -a && source .env && set +a`).
4. Run with verbose logs:
   - `pixi run compass process -c config.json5 -p plugin.yaml -v`
5. Confirm output artifacts exist before tuning schema semantics.
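Step 5 of the pattern above can be automated with a small helper. This is a sketch: the directory and file names follow the expected-output-artifacts table in this skill, and the helper name itself is illustrative, not part of COMPASS.

```shell
# Sketch: fail fast when a run's out_dir lacks the expected artifacts
check_artifacts() {
  local out_dir="$1"
  [ -d "$out_dir/ordinance_files" ] || { echo "missing ordinance_files"; return 1; }
  [ -d "$out_dir/cleaned_text" ] || { echo "missing cleaned_text"; return 1; }
  [ -s "$out_dir/quantitative_ordinances.csv" ] || { echo "empty quantitative CSV"; return 1; }
  echo "artifacts ok"
}
```

Run it against the run's `out_dir` (e.g. `check_artifacts ./outputs_smoke`) before spending time on schema semantics.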
+ +Fallback mode when needed: + +- Add `known_doc_urls` (or `known_local_docs`) in run config. +- Set `perform_se_search: false` and `perform_website_search: false`. + +## Adaptation rule + +When adapting this workflow for a new technology, keep the run structure +unchanged and swap only technology-specific inputs: + +- `tech` in run config, +- schema file, +- plugin descriptor (`data_type_short_desc`), +- retrieval query/keyword vocabulary, +- known document URL set. + +Change one axis per run unless debugging infrastructure failures. + +## Environment setup + +Load secrets from `.env` before running. Never commit key values in config files. + +```bash +set -a && source .env && set +a # no spaces around = in .env assignments +``` + +## Core command + +```bash +pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v +``` + +## Phase-gated workflow + +1. **Smoke test (1 jurisdiction)** + - Goal: verify wiring and output contract. +2. **Robustness (5 jurisdictions)** + - Goal: verify feature stability and edge-case handling. 
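The phase gates above can be checked mechanically. This sketch applies the pass condition used later in this skill (at least one per-jurisdiction CSV with more than a header row); the function name and layout assumptions are illustrative.

```shell
# Sketch: pass the smoke gate only if some jurisdiction CSV has data rows
smoke_gate() {
  local out_dir="$1"
  for f in "$out_dir"/jurisdiction_dbs/*.csv; do
    [ -f "$f" ] || continue
    if [ "$(wc -l < "$f")" -gt 1 ]; then
      echo "pass"
      return 0
    fi
  done
  echo "fail"
  return 1
}
```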
+ +## Validation checklist + +Evaluate each run on: + +- document relevance (exclude off-domain content), +- feature coverage vs expected ordinance topics, +- section/summary traceability, +- unit consistency, +- null discipline, + +## Expected output artifacts + +A successful run produces these files under `out_dir`: + +| Artifact | Meaning | +|---|---| +| `ordinance_files/*.pdf` | Downloaded source documents | +| `cleaned_text/*.txt` | Heuristic-filtered extracted text | +| `jurisdiction_dbs/*.csv` | Per-jurisdiction raw extraction rows | +| `quantitative_ordinances.csv` | Final compiled numeric features | +| `qualitative_ordinances.csv` | Final compiled qualitative features | +| `usage.json` | Per-jurisdiction LLM token and request counts | +| `meta.json` | Run metadata (cost, timing, version) | + +Final CSV columns: `county`, `state`, `subdivision`, `jurisdiction_type`, +`FIPS`, `feature`, `value`, `units`, `adder`, `min_dist`, `max_dist`, +`summary`, `year`, `section`, `source`. + +## Interpreting output status correctly + +`cleaned_text` files can exist while `Number of documents found` is `0`. + +This means acquisition/text collection worked, but no final structured ordinance +rows were emitted into consolidated DB outputs. + +Check in order: + +1. `outputs/*/cleaned_text/*.txt` (text extraction present) +2. `outputs/*/jurisdiction_dbs/*.csv` (per-jurisdiction parsed rows) +3. 
`outputs/*/quantitative_ordinances.csv` and + `outputs/*/qualitative_ordinances.csv` (final compiled results) + +Treat the run as **failed for extraction quality** when either is true: +- `Number of jurisdictions with extracted data: 0` +- any configuration exception appears in logs (even if process exits 0) + +Only treat a run as passing when both are true: +- at least one jurisdiction has extracted data +- at least one jurisdiction CSV in `jurisdiction_dbs/` has more than header row + +## Root-cause triage + +- **Wrong or noisy documents** + - Tune query templates, URL keywords, and exclusions. + - Prefer `known_doc_urls` while stabilizing. +- **Right documents, wrong fields** + - Tune schema descriptions/examples and ambiguity rules. + - Check `extraction_system_prompt` in plugin YAML — it is the primary + guard against scope bleed from generic legal documents. +- **Correct values, unstable formatting** + - Tighten enums, unit vocabulary, and null behavior. +- **Nothing downloaded / unstable search** + - Disable live search and use deterministic known URLs/local docs. +- **0 documents found for a jurisdiction during website crawl** + - Expected for jurisdictions with few online ordinances. The website + crawl is a second acquisition pass after search-engine retrieval; + 0 results there is not a pipeline failure. + +## Acceptance gates + +Do not advance phases until all are true: + +- Output rows conform to required contract. +- High share of rows include useful `section` and `summary`. +- Feature names are stable and machine-consistent. +- Repeated runs on same sample show minimal drift. + +## Cost and speed controls + +- Keep sample size minimal while tuning. +- Change one variable per run. +- Archive run command, input set, and output path for each iteration. 
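The last control can be as simple as a log file. A minimal sketch, assuming a `runs.log` in the working directory (the log name and output path are arbitrary choices, not COMPASS conventions):

```shell
# Append one line per iteration: timestamp, exact command, output path
run_cmd='pixi run compass process -c config.json5 -p plugin.yaml -v'
echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) | $run_cmd | ./outputs_smoke" >> runs.log
```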
+ +## Workspace hygiene (important) + +Keep one canonical working set per technology in `examples/`: + +- one run config, +- one plugin config, +- one schema, +- one jurisdiction file, +- one known docs file. + +Delete stale `_migrated`, `_smoke`, and duplicate output folders to avoid +configuration drift and debugging confusion. + +## Known infrastructure issues + +### Playwright timeouts + +Web search via `rebrowser_playwright` may fail with 60s timeouts on +`Page.wait_for_selector`. Symptoms: +- `TimeoutError: Page.wait_for_selector: Timeout 60000ms exceeded` +- All search queries fail consistently +- Browser session crashes with `ProtocolError: Internal server error, session closed` + +These errors during the **website crawl phase** (second acquisition pass) are +**non-fatal**. COMPASS logs them and continues. They do not block the +search-engine phase or extraction. + +If search itself is failing, verify provider credentials are loaded and fall +back to deterministic mode. + +**Workaround**: Use `known_local_docs` or `known_doc_urls` and disable +search/website steps while validating extraction logic. + +### known_local_docs loading failures + +`known_local_docs` may fail silently with `ERROR: Failed to read file` in +jurisdiction logs due to external loader behavior. + +**Workaround**: Prefer `known_doc_urls` for deterministic smoke tests and +pre-validate local docs before pipeline runs. + diff --git a/.github/skills/plugin-config-setup/SKILL.md b/.github/skills/plugin-config-setup/SKILL.md new file mode 100644 index 00000000..0c83b5f9 --- /dev/null +++ b/.github/skills/plugin-config-setup/SKILL.md @@ -0,0 +1,279 @@ +--- +name: plugin-config-setup +description: Author and tune one-shot plugin YAML for COMPASS document discovery, filtering, and text collection. Use whenever a user asks to create, clean up, standardize, or troubleshoot one-shot plugin YAML for technology onboarding. 
---

# YAML Setup Skill

**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction.
For legacy decision-tree extraction, consult COMPASS architecture docs.

Use this skill to create or tune one-shot plugin YAML that controls retrieval,
filtering, and text collection behavior.

## When to use

- New technology onboarding in one-shot extraction (NOT decision-tree extraction).
- Schema exists but source relevance is weak.
- You need reproducible config handoff across teams.

## Do not use

- Legacy decision-tree parser implementation changes.
- Schema feature semantics work that belongs in `<tech>_schema.json`.
- Run-result diagnosis after outputs are generated (use iteration loop skill).

## Expected assistant output

When using this skill, return:

1. The finalized plugin YAML content or exact diff.
2. Any required paired run-config changes.
3. A validation command and pass/fail checks for the edited YAML.

## Canonical reference

Consult the working examples in `examples/`:
- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard working example
- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-doc edge case

When creating new tech configs, `<tech>_plugin_config.yaml` is the recommended
naming convention (e.g. `geothermal_plugin_config.yaml`). The existing
`plugin_config.yaml` examples use a generic name; new tech-specific assets
should use the tech-first naming pattern.

Refer to any complete example in `examples/` that matches your retrieval goals.

## Naming convention

Use tech-first file names when creating new one-shot assets:
`<tech>_config*.json5`, `<tech>_plugin_config.yaml`,
`<tech>_schema.json`, `<tech>_jurisdictions*.csv`.

## Secret handling

Keep API keys in environment variables (for example `SERPAPI_KEY`,
`AZURE_OPENAI_API_KEY`) rather than in plugin or run config files.
Load them per shell session with `set -a && source .env && set +a`.
+Avoid spaces around `=` in `.env` assignments. + +## Required minimum + +```yaml +schema: ./my_schema.json +``` + +## Non-negotiable runtime constraints + +- Jurisdiction CSV headers are case-sensitive: use `County,State`. +- If `heuristic_keywords` is present, it must include all four lists and + none may be empty. +- A run is not considered passing if logs show config errors or if + extracted jurisdiction count is zero. + +## Key plugin YAML fields + +| Field | Type | Code Reference | +|---|---|---| +| `schema` | string (path) | [base.py#L124–L131](../../../compass/plugin/one_shot/base.py) | +| `data_type_short_desc` | string | [base.py#L483](../../../compass/plugin/one_shot/base.py#L483) | +| `query_templates` | list | [base.py#L217–L240](../../../compass/plugin/one_shot/base.py#L217) | +| `website_keywords` | dict | [base.py#L281–L338](../../../compass/plugin/one_shot/base.py#L281) | +| `heuristic_keywords` | dict or `true` | [base.py#L340–L390](../../../compass/plugin/one_shot/base.py#L340); [base.py#L512](../../../compass/plugin/one_shot/base.py#L512) | +| `collection_prompts` | list or `true` | [base.py#L413–L436](../../../compass/plugin/one_shot/base.py#L413) | +| `text_extraction_prompts` | list or `true` | [base.py#L438–L468](../../../compass/plugin/one_shot/base.py#L438) | +| `extraction_system_prompt` | string | [base.py#L476–L488](../../../compass/plugin/one_shot/base.py#L476) | +| `cache_llm_generated_content` | bool | [base.py#L107–L117](../../../compass/plugin/one_shot/base.py#L107) | + +**For the complete list of all configuration options (including `allow_multi_doc_extraction` and any future additions), consult the docstring of [`create_schema_based_one_shot_extraction_plugin()`](../../../compass/plugin/one_shot/base.py#L51).** + +## Required `heuristic_keywords` shape + +When using `heuristic_keywords`, use these four lists to guide pre-LLM filtering: +- `GOOD_TECH_KEYWORDS` — strong indicators of the target technology + (e.g., facility types, 
deployment modes). Documents matching even a + few keywords are marked as candidates. +- `GOOD_TECH_PHRASES` — multi-word phrases that signal relevant + ordinance content. Keep specific to avoid false positives. +- `GOOD_TECH_ACRONYMS` — industry-standard abbreviations for the + technology. Narrow list; include only widely recognized acronyms. +- `NOT_TECH_WORDS` — pre-heuristic filter that rejects documents + before keyword matching. Use to exclude adjacent technologies and + irrelevant domains (e.g., residential HVAC, unrelated industries). + Runs first; prevents wasted keyword evaluation on clearly-wrong + documents. + +Use this exact structure when defining `heuristic_keywords`: + +```yaml +heuristic_keywords: + GOOD_TECH_KEYWORDS: + - "" + GOOD_TECH_PHRASES: + - "" + GOOD_TECH_ACRONYMS: + - "" + NOT_TECH_WORDS: + - "" +``` + +Notes: +- Keys are normalized, but using canonical key names reduces mistakes. +- All four lists are required and must be non-empty. + +### `collection_prompts: true` and `text_extraction_prompts: true` + +Setting either flag to `true` (not a list) instructs COMPASS to use the LLM +to auto-generate the prompts from the schema content. This is the recommended +shortcut during development — do not write manual prompt lists until +auto-generated ones prove insufficient. + +### `extraction_system_prompt` + +This is the primary control for preventing scope bleed from generic land-use +code documents. Write it as a multi-line YAML literal block: + +```yaml +extraction_system_prompt: |- + You are a legal scholar extracting structured data from + utility-scale ordinances. + + Extract only enacted requirements for utility-scale facilities. + Exclude adjacent technologies and non-target use cases. + Prefer explicit values. Use null for qualitative obligations. +``` + +See `compass/extraction/ghp/plugin_config.yaml` for a complete example. + +## Progressive config path + +1. **Minimal** + - Confirm schema path and extraction invocation work. +2. 
**Simple**
   - Add `query_templates`, `heuristic_keywords`, and `cache_llm_generated_content`.
3. **Full**
   - Add `extraction_system_prompt` if scope bleed or off-domain extraction
     is observed.
   - Set `collection_prompts: true` and `text_extraction_prompts: true` to
     let the LLM auto-generate prompts from the schema.
   - Replace `heuristic_keywords: true` with an explicit list if precision
     is insufficient.

Use the same progression for any technology.

## Baseline YAML pattern

```yaml
schema: ./my_schema.json
data_type_short_desc: utility-scale ordinance
cache_llm_generated_content: true
query_templates:
  - "filetype:pdf {jurisdiction} ordinance"
  - "{jurisdiction} zoning ordinance"
  - "{jurisdiction} permitting requirements"
website_keywords:
  pdf: 92160
  <tech>: 46080
  ordinance: 23040
  zoning: 2880
  permit: 1440
heuristic_keywords:
  GOOD_TECH_KEYWORDS:
    - ""
    - ""
  GOOD_TECH_ACRONYMS:
    - ""
  GOOD_TECH_PHRASES:
    - ""
    - ""
  NOT_TECH_WORDS:
    - ""
    - ""
```

Swap vocabulary for any technology while keeping the same structure.

## Stable development mode

Use run-config controls for deterministic smoke tests while iterating schema:

- `known_doc_urls` or `known_local_docs` — bypass live search
- `perform_se_search: false` — disable search-engine phase
- `perform_website_search: false` — disable website crawl phase

Re-enable search only after extraction quality is stable on known documents.

Recommended baseline: use dynamic search first, then use deterministic mode
if search infrastructure fails.
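As a sketch, the deterministic-mode bullets above map onto these run-config keys (the URL is a placeholder, not a real document source):

```json5
{
  known_doc_urls: ["https://example.gov/zoning/ordinance.pdf"],  // placeholder URL
  perform_se_search: false,      // skip search-engine phase
  perform_website_search: false, // skip website crawl phase
}
```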
## Minimal run-config contract (to pair with plugin YAML)

Use this pattern and require users to provide their own model and client
values:

```json5
{
  out_dir: "./outputs_<tech>_<phase>",
  tech: "<tech>",
  jurisdiction_fp: "./<tech>_jurisdictions.csv",
  perform_se_search: true,
  perform_website_search: false,
  model: [
    {
      name: "<model-deployment-name>",
      llm_call_kwargs: { temperature: 0, timeout: 600 },
      client_kwargs: {
        api_version: "<api-version>",
        azure_endpoint: "<azure-endpoint>"
      }
    }
  ]
}
```

## Acquisition phases

COMPASS acquisition runs in two sequential phases per jurisdiction:

1. **Search-engine phase** — uses `SerpAPIGoogleSearch` or similar; driven by
   `query_templates`.
2. **Website crawl phase** — crawls the jurisdiction's main website using
   `website_keywords` for ranking. Playwright browser errors during this
   phase are **non-fatal**; COMPASS logs them and moves on.

`perform_website_search: false` skips phase 2. Use it during smoke tests to
keep run time short and avoid Playwright dependency issues.

## Validation checklist

- Schema path resolves from runtime working directory.
- Query templates include `{jurisdiction}` consistently.
- URL weights favor legal and government documents.
- Heuristic exclusions are precise and not over-broad.
- Prompt overrides are only added when default behavior fails.

## Cross-tech adaptation checklist

When adapting to another technology:

- replace vocabulary in `query_templates` and `website_keywords`,
- keep legal-code terms (`ordinance`, `zoning`, `code of ordinances`),
- keep non-target exclusions explicit in `NOT_TECH_WORDS`,
- do not carry terms from a previous technology into new tech configs,
- write a technology-specific `extraction_system_prompt`.

## Run command

```bash
pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v
```

If running outside the tech folder, use absolute paths for `-c` and `-p`.

## Guardrails

- Retrieval behavior belongs in plugin YAML.
+- Feature logic belongs in schema. +- Adjust one tuning axis per run for clean attribution. +- Keep one canonical plugin file per technology in the active example folder. + diff --git a/.github/skills/schema-creation/SKILL.md b/.github/skills/schema-creation/SKILL.md new file mode 100644 index 00000000..08981132 --- /dev/null +++ b/.github/skills/schema-creation/SKILL.md @@ -0,0 +1,178 @@ +--- +name: schema-creation +description: Author and iterate one-shot extraction schemas for native COMPASS. Use whenever a user asks to create, expand, or debug schema feature definitions, value/unit rules, or extraction instructions. +--- + +# Schema Creation Skill + +**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction +(new technology onboarding with JSON schema + plugin YAML). For legacy decision-tree +extraction (existing solar/wind/small-wind in `compass/extraction//`), +consult COMPASS architecture docs. + +Use this skill to define what the LLM extracts and how it formats results. +The schema is the single most important config file for output quality. + +## When to use + +- Starting a new one-shot technology extraction (NOT decision-tree legacy extraction). +- Fixing inconsistent or incorrect extracted values in one-shot extraction. +- Adding new features to an existing one-shot extraction. + +## Do not use + +- Retrieval tuning tasks that belong in plugin YAML. +- Legacy decision-tree extraction parser implementation. + +## Expected assistant output + +When using this skill, return: + +1. The proposed schema diff (or full schema block) for the targeted features. +2. The rationale for VALUE, UNITS, and IGNORE wording. +3. A smoke-test check plan for validating the schema change. 
## Canonical reference

For complete examples, see the `examples/` directory:
- `examples/one_shot_schema_extraction/wind_schema.json`
- `examples/water_rights_demo/one-shot/water_rights_schema.json5`

Each follows the pattern: `<tech>_schema.json` or `<tech>_schema.json5`.

## Required output contract

Every schema must define `outputs` as an array. Each item must require
exactly these five fields and set `additionalProperties: false`:

```json
{
  "type": "object",
  "required": ["outputs"],
  "additionalProperties": false,
  "properties": {
    "outputs": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["feature", "value", "units", "section", "summary"],
        "additionalProperties": false,
        "properties": {
          "feature": { "type": "string", "enum": ["..."] },
          "value": { "anyOf": [{"type": "number"}, {"type": "string"}, {"type": "boolean"}, {"type": "array", "items": {"type": "string"}}, {"type": "null"}] },
          "units": { "type": ["string", "null"] },
          "section": { "type": ["string", "null"] },
          "summary": { "type": ["string", "null"] }
        }
      }
    }
  }
}
```

These five fields map directly to the output CSV columns. COMPASS adds
`county`, `state`, `FIPS`, and other metadata columns automatically.

## Build sequence

1. **Define the feature enum** — one stable lowercase ID per siting-relevant
   requirement. Keep naming consistent across iterations and group IDs by
   family (setbacks, noise, zoning, permitting).
2. **Define `value` and `units` rules per feature family** — in each
   feature's `description`, state the expected value type and accepted unit
   vocabulary explicitly.
3. **Add `$definitions`** — group related feature descriptions here to keep
   the `feature` enum block clean.
4. **Add `$instructions`** — encode global extraction policy (scope, null
   handling, one-row-per-feature contract, verbatim quote preference).
5. 
**Smoke-test on one jurisdiction** — validate all enum items appear in + output and null rows are correctly populated for missing features. + +## Feature definition template + +Every feature description must answer four questions: + +1. **What is this?** One sentence identifying the regulatory concept. +2. **VALUE rule:** What type is the value and what specific values/ranges are + valid? +3. **UNITS rule:** What unit string is accepted, or `null` if not applicable? +4. **IGNORE / CLARIFICATION:** What near-miss concepts must NOT match this + feature? + +Example (abbreviated): + +```json +"structure setback": { + "description": "Minimum distance from the generator to an occupied building. VALUE: numerical distance. UNITS: 'feet' or 'meters'. IGNORE: setbacks from property lines or roads — those are separate features." +} +``` + +## Feature family taxonomy + +Organize `$definitions` by these families: + +| Family | Example features | +|---|---| +| Setbacks | `structure setback`, `property line setback`, `road setback` | +| Noise/Emissions | `noise limit`, `emissions standard`, `vibration limit` | +| Operational | `hours of operation` | +| Physical design | `screening requirement`, `enclosure requirement`, `exhaust stack height` | +| Zoning | `primary use districts`, `conditional use districts`, `prohibited use districts` | +| Permitting | `permit requirement`, `capacity threshold` | +| Compliance | `decommissioning` | + +## `$instructions` block + +Always include a `$instructions` object at the top level with these keys: + +```json +"$instructions": { + "scope": "Describe exactly what to extract and what to ignore.", + "null_handling": "Output every enum feature. Use null value and null summary when a feature is not found in the document. Do not omit features.", + "verbatim_quotes": "In summary fields, prefer verbatim quotes from the source. Enclose in double quotation marks.", + "units_discipline": "Do not convert units. 
Record them exactly as they appear in the document."
}
```

## Scope bleed control

When COMPASS retrieves a large land-use code instead of a tech-specific
ordinance, the LLM may extract off-domain provisions.

Fix order (most powerful first):
1. `extraction_system_prompt` in plugin YAML — state explicitly what is in
   scope and what is excluded.
2. `$instructions.scope` in schema — reinforce with exclusion language.
3. `heuristic_keywords.NOT_TECH_WORDS` — reject documents upstream.

Do not expand the feature enum to absorb scope bleed. Narrow the prompt.

## Cross-technology adaptation checklist

When cloning a schema for a new technology:

- [ ] Replace all feature IDs with technology-specific names.
- [ ] Replace value/units rules in every feature description.
- [ ] Replace exclusion terms in `$instructions.scope` and feature IGNORE
  clauses.
- [ ] Replace `$definitions` group names to match new feature families.
- [ ] Smoke-test before widening to 10+ jurisdictions.

## Quality checklist

- [ ] Feature enum uses stable, consistent IDs across all runs.
- [ ] Every feature description contains VALUE, UNITS, and IGNORE clauses.
- [ ] `$instructions` block is present with all four keys.
- [ ] `additionalProperties: false` is set on the top-level object and on
  each item in the `outputs` array.
- [ ] Schema validates cleanly against a JSON Schema validator.
- [ ] A smoke run using this schema produces extracted rows (not just
  successful process exit logs).

## Anti-patterns to avoid

- Feature IDs that change names between iterations.
- Implicit unit assumptions not stated in description text.
- Missing IGNORE clauses for common near-miss features.
- Examples in descriptions that contradict field rules.
- Widening the enum to absorb scope bleed instead of tightening the prompt.
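A couple of the structural checklist items can be pre-checked cheaply before reaching for a full validator. This is a grep-based sketch (the helper name is illustrative, and it is not a substitute for running a real JSON Schema validator):

```shell
# Sketch: cheap textual sanity checks on a schema file before a smoke run
schema_sanity() {
  local schema="$1"
  grep -q '"additionalProperties": false' "$schema" &&
    grep -q '"\$instructions"' "$schema" &&
    echo "schema sanity ok"
}
```

Pair it with a proper JSON Schema validator for the full contract check.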