From d9e868de5a567380a17c62c07f6c2d7b3b0ad318 Mon Sep 17 00:00:00 2001 From: Byron Pullutasig <115118857+bpulluta@users.noreply.github.com> Date: Tue, 17 Mar 2026 12:11:12 -0600 Subject: [PATCH 1/7] Add COMPASS workflow skills --- .github/skills/extraction-run/SKILL.md | 253 ++++++++++++++++++++++++ .github/skills/schema-creation/SKILL.md | 173 ++++++++++++++++ .github/skills/web-scraper/SKILL.md | 134 +++++++++++++ .github/skills/yaml-setup/SKILL.md | 199 +++++++++++++++++++ 4 files changed, 759 insertions(+) create mode 100644 .github/skills/extraction-run/SKILL.md create mode 100644 .github/skills/schema-creation/SKILL.md create mode 100644 .github/skills/web-scraper/SKILL.md create mode 100644 .github/skills/yaml-setup/SKILL.md diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md new file mode 100644 index 00000000..23321b46 --- /dev/null +++ b/.github/skills/extraction-run/SKILL.md @@ -0,0 +1,253 @@ +--- +name: extraction-run +description: Execute one-shot extraction with COMPASS, evaluate outputs, and iterate schema/config changes with minimal cost. +--- + +# Extraction Run Skill + +Use this skill to run one-shot extraction in a repeatable, low-risk way, +then iterate quickly until you have stable structured outputs. + +## When to use + +- Schema exists and plugin config points to it. +- You are onboarding a new technology (for example geothermal, CHP, hydrogen). +- You need a reliable smoke-test workflow before scaling. + +## Two-pipeline modes + +COMPASS supports two distinct extraction pipelines. Choose one and do not mix +them for the same technology: + +| Mode | Where code lives | Good for | +|---|---|---| +| **One-shot (schema-based)** | `examples/` → `compass/extraction//` | New techs, no Python changes | +| **Legacy decision-tree** | Python code in `compass/extraction//` | Existing solar, wind, small wind | + +One-shot is the correct path for all new technology onboarding. 
It requires +only a schema JSON, a plugin YAML, and a run config — no Python source changes. + +## Tech promotion lifecycle + +New technology assets start in `examples/` and finish in `compass/extraction/`: + +1. **Develop** — place all assets in `examples/one_shot_schema_extraction_/` +2. **Stabilize** — iterate schema/plugin until smoke and robustness gates pass +3. **Promote** — copy the three finalized files into `compass/extraction//`: + - `_schema.json` + - `_plugin_config.yaml` + - `_config.json5` (optional; useful as a reference run config) + +The promoted extraction folder contains only config files — no Python code is +needed for one-shot techs. + +## Required inputs + +- Run config for `compass process`. +- Plugin config containing `schema`. +- API keys in environment (never hardcode in configs). +- A jurisdiction set sized to the current phase. + +## Naming convention + +Use tech-first names for all one-shot assets: + +- `_config*.json5` +- `_plugin_config.yaml` +- `_schema.json` +- `_jurisdictions*.csv` + +The `tech` value in the run config must be a string that becomes the plugin +registry identifier. It must be unique, lowercase, and underscore-separated +(for example `concentrating_solar`, `geothermal_electricity`). COMPASS will +raise `Unknown tech input` if this key does not match any registered plugin. + +## Canonical development pattern + +For early development, start with the proven dynamic baseline, then fall back +to deterministic mode only when search infrastructure is unstable: + +1. Use one small jurisdiction file (1-3 rows). +2. Use your preferred configured search engine. +3. Load `.env` into shell (`set -a && source .env && set +a`). +4. Run with verbose logs: + - `pixi run compass process -c config.json5 -p plugin.yaml -v` +5. Confirm output artifacts exist before tuning schema semantics. + +Fallback mode when needed: + +- Add `known_doc_urls` (or `known_local_docs`) in run config. 
+- Set `perform_se_search: false` and `perform_website_search: false`. + +## Adaptation rule + +When adapting this workflow for a new technology, keep the run structure +unchanged and swap only technology-specific inputs: + +- `tech` in run config, +- schema file, +- plugin descriptor (`data_type_short_desc`), +- retrieval query/keyword vocabulary, +- known document URL set. + +Change one axis per run unless debugging infrastructure failures. + +## Example references (optional) + +- `examples/one_shot_schema_extraction/README.rst` +- `examples/one_shot_schema_extraction_geothermal/geothermal_config.json5` +- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` +- `examples/one_shot_schema_extraction_geothermal/geothermal_schema.json` +- `examples/one_shot_schema_extraction_geothermal/geothermal_jurisdictions_one.csv` +- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` +- `examples/one_shot_schema_extraction_cst/cst_config.json5` (CST reference) +- `examples/one_shot_schema_extraction_cst/cst_plugin_config.yaml` (CST reference) +- `examples/one_shot_schema_extraction_cst/cst_schema.json` (CST reference) +- `compass/extraction/geothermal_electricity/` (finalized one-shot tech example) +- `docs/source/examples/one_shot_schema_extraction/plugin_config_minimal.json` +- `docs/source/examples/one_shot_schema_extraction/plugin_config.yaml` +- `examples/compass_tech_pipeline/README.md` + +## Environment setup reminder + +Before running, load secrets from `.env` (for example `SERPAPI_KEY`, +`AZURE_OPENAI_API_KEY`) into the current shell. Do not commit secret values +inside config files. + +Common `.env` gotcha: avoid spaces around `=` in variable assignments. + +## Core command + +```bash +pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v +``` + +## Phase-gated workflow + +1. **Smoke test (3 jurisdictions)** + - Goal: verify wiring and output contract. +2. 
**Robustness (10-25 jurisdictions)** + - Goal: verify feature stability and edge-case handling. +3. **Scale (full set)** + - Goal: only after earlier phases pass acceptance gates. + +## Validation checklist + +Evaluate each run on: + +- document relevance (exclude off-domain content), +- feature coverage vs expected ordinance topics, +- section/summary traceability, +- unit consistency, +- null discipline, +- **scope bleed** — check that no features appear in the output CSVs that + fall outside the schema enum; generic land-use-code documents can cause + unrelated provisions to leak through. Tighten `extraction_system_prompt` + in plugin YAML to fix this. + +## Expected output artifacts + +A successful run produces these files under `out_dir`: + +| Artifact | Meaning | +|---|---| +| `ordinance_files/*.pdf` | Downloaded source documents | +| `cleaned_text/*.txt` | Heuristic-filtered extracted text | +| `jurisdiction_dbs/*.csv` | Per-jurisdiction raw extraction rows | +| `quantitative_ordinances.csv` | Final compiled numeric features | +| `qualitative_ordinances.csv` | Final compiled qualitative features | +| `usage.json` | Per-jurisdiction LLM token and request counts | +| `meta.json` | Run metadata (cost, timing, version) | + +Final CSV columns: `county`, `state`, `subdivision`, `jurisdiction_type`, +`FIPS`, `feature`, `value`, `units`, `adder`, `min_dist`, `max_dist`, +`summary`, `year`, `section`, `source`. + +## Interpreting output status correctly + +`cleaned_text` files can exist while `Number of documents found` is `0`. + +This means acquisition/text collection worked, but no final structured ordinance +rows were emitted into consolidated DB outputs. + +Check in order: + +1. `outputs/*/cleaned_text/*.txt` (text extraction present) +2. `outputs/*/jurisdiction_dbs/*.csv` (per-jurisdiction parsed rows) +3. 
`outputs/*/quantitative_ordinances.csv` and + `outputs/*/qualitative_ordinances.csv` (final compiled results) + +## Root-cause triage + +- **Wrong or noisy documents** + - Tune query templates, URL keywords, and exclusions. + - Prefer `known_doc_urls` while stabilizing. +- **Right documents, wrong fields** + - Tune schema descriptions/examples and ambiguity rules. + - Check `extraction_system_prompt` in plugin YAML — it is the primary + guard against scope bleed from generic legal documents. +- **Correct values, unstable formatting** + - Tighten enums, unit vocabulary, and null behavior. +- **Nothing downloaded / unstable search** + - Disable live search and use deterministic known URLs/local docs. +- **0 documents found for a jurisdiction during website crawl** + - Expected for jurisdictions with few online ordinances. The website + crawl is a second acquisition pass after search-engine retrieval; + 0 results there is not a pipeline failure. + +## Acceptance gates + +Do not advance phases until all are true: + +- Output rows conform to required contract. +- High share of rows include useful `section` and `summary`. +- Feature names are stable and machine-consistent. +- Repeated runs on same sample show minimal drift. + +## Cost and speed controls + +- Keep sample size minimal while tuning. +- Change one variable per run. +- Archive run command, input set, and output path for each iteration. + +## Workspace hygiene (important) + +Keep one canonical working set per technology in `examples/`: + +- one run config, +- one plugin config, +- one schema, +- one jurisdiction file, +- one known docs file. + +Delete stale `_migrated`, `_smoke`, and duplicate output folders to avoid +configuration drift and debugging confusion. + +## Known infrastructure issues + +### Playwright timeouts + +Web search via `rebrowser_playwright` may fail with 60s timeouts on +`Page.wait_for_selector`. 
Symptoms: +- `TimeoutError: Page.wait_for_selector: Timeout 60000ms exceeded` +- All search queries fail consistently +- Browser session crashes with `ProtocolError: Internal server error, session closed` + +These errors during the **website crawl phase** (second acquisition pass) are +**non-fatal**. COMPASS logs them and continues. They do not block the +search-engine phase or extraction. + +If search itself is failing, verify provider credentials are loaded and fall +back to deterministic mode. + +**Workaround**: Use `known_local_docs` or `known_doc_urls` and disable +search/website steps while validating extraction logic. + +### known_local_docs loading failures + +`known_local_docs` may fail silently with `ERROR: Failed to read file` in +jurisdiction logs due to external loader behavior. + +**Workaround**: Prefer `known_doc_urls` for deterministic smoke tests and +pre-validate local docs before pipeline runs. + diff --git a/.github/skills/schema-creation/SKILL.md b/.github/skills/schema-creation/SKILL.md new file mode 100644 index 00000000..951593a3 --- /dev/null +++ b/.github/skills/schema-creation/SKILL.md @@ -0,0 +1,173 @@ +--- +name: schema-creation +description: Author and iterate one-shot extraction schemas that replace legacy decision-tree extraction logic in native COMPASS. +--- + +# Schema Creation Skill + +Use this skill to encode extraction logic in schema so behavior is repeatable +across jurisdictions and technologies. + +## When to use + +- Creating a new one-shot technology plugin. +- Migrating from decision-tree logic to schema-driven extraction. +- Stabilizing inconsistent model outputs. 
+ +## Example references (optional) + +- `examples/one_shot_schema_extraction_geothermal/geothermal_schema.json` +- `examples/one_shot_schema_extraction_geothermal/README.rst` +- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` +- `docs/source/examples/one_shot_schema_extraction/wind_schema.json` + +## Required output contract + +Top-level object must define `outputs` and each item must require: + +- `feature` +- `value` +- `units` +- `section` +- `summary` + +```json +{ + "type": "object", + "required": ["outputs"], + "properties": { + "outputs": { + "type": "array", + "items": { + "type": "object", + "required": ["feature", "value", "units", "section", "summary"], + "additionalProperties": false + } + } + } +} +``` + +## Build sequence + +1. Copy baseline schema and rename for target tech. +2. Replace `feature` enum with target-tech IDs. +3. Define `value`/`units` rules per feature family. +4. Add `$definitions` for reusable decision logic. +5. Add `$examples` for top failure modes. +6. Add `$instructions` for global extraction policy. + +For new technologies (for example CHP or CST), clone a working schema and +perform a strict vocabulary swap (features, units, exclusions) before adding +new logic. + +## Output column mapping + +Schema field names map directly to the final output CSV columns: + +| Schema field | CSV column | +|---|---| +| `feature` | `feature` | +| `value` | `value` | +| `units` | `units` | +| `section` | `section` | +| `summary` | `summary` | + +Additional columns added by COMPASS finalization: `county`, `state`, +`subdivision`, `jurisdiction_type`, `FIPS`, `adder`, `min_dist`, `max_dist`, +`year`, `source`. These do not need to appear in the schema. + +## Scope bleed from generic legal documents + +When COMPASS retrieves a large generic land-use code rather than a +technology-specific ordinance, the LLM may extract provisions that are +outside the schema enum. 
This is most visible when unfamiliar feature names +appear in the output CSV. + +Primary controls: +- `extraction_system_prompt` in plugin YAML — this is the strongest signal. + State explicitly what is in scope and what is out. +- `$instructions.scope` in schema — reinforce exclusion language here. +- `heuristic_keywords.not_tech_words` — filter documents upstream. + +Do not widen the feature enum to accommodate scope bleed; narrow the prompt +and upstream filters instead. + +## Technology adaptation guidance + +When adapting a baseline schema to any new technology: + +- Separate core utility-scale requirements from adjacent/non-target systems. +- Keep district/permit features distinct from numerical constraints. +- Encode jurisdiction/governance handling where relevant in summaries. +- Require explicit nulls when a feature is not enacted. + +## Cross-technology adaptation checklist + +Apply this for any new domain: + +1. Define technology-specific `feature` enum with stable IDs. +2. Define allowed unit vocabulary for each feature family. +3. Add explicit exclusion language for adjacent-but-out-of-scope systems. +4. Ensure summaries preserve legal traceability (section + source-faithful text). +5. Validate on deterministic docs before tuning retrieval. +6. Consider including `enactment date` in the enum — COMPASS naturally surfaces it + from documents and it provides important temporal context in outputs. + +## Example specialization patterns (optional) + +Use examples only to shape exclusion strategy: + +- separate core utility-scale requirements from adjacent technologies, +- add explicit exclusion terms in `not_tech_words`, +- preserve legal traceability via `section` and `summary`. + +## Reuse safeguards + +- Keep tech-first file names consistent across assets: + `_config*.json5`, `_plugin_config.yaml`, + `_schema.json`, `_jurisdictions*.csv`. +- Keep credentials out of schema content and examples. 
+- Validate schema behavior with a small smoke run before scaling. + +## High-value authoring patterns + +- Put restrictive-value selection rules directly in descriptions. +- Explicitly define accepted unit vocabulary. +- Clarify near-miss terms that should not be treated as equivalent. +- State whether qualitative features should keep `value`/`units` null. + +## Anti-patterns + +- Retrieval instructions embedded in schema semantics. +- Feature IDs that change names across iterations. +- Implicit unit assumptions not declared in text. +- Examples that contradict field descriptions. +- Feature enums that include placeholders with no extraction logic. + +## Quality checklist + +- Enum matches target output columns. +- Every feature has deterministic extraction rules. +- `section` and `summary` preserve legal traceability. +- Repeated sample runs produce stable feature typing. + +## Iteration loop + +1. Run 3-jurisdiction smoke sample. +2. Catalog failure modes by feature. +3. Patch only affected descriptions/examples. +4. Re-run same sample before expanding scope. + +Save iterated schema versions as `_schemav2.json`, `_schemav3.json` +etc. to preserve a diff history. The active version is what `schema:` in the +plugin YAML points to. + +## Practical quality signal + +Treat a schema as "working" when all are true on the smoke sample: + +- final ordinance CSV outputs are non-empty, +- extracted rows include stable feature IDs, +- most non-null rows have useful `section` and `summary`, +- repeated runs do not shift feature semantics materially. diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/web-scraper/SKILL.md new file mode 100644 index 00000000..f021bdb8 --- /dev/null +++ b/.github/skills/web-scraper/SKILL.md @@ -0,0 +1,134 @@ +--- +name: web-scraper +description: Build and tune one-shot plugin configs that search, rank, and collect ordinance documents with native COMPASS pipeline settings. 
+--- + +# Web Scraper Skill + +Use this skill to improve retrieval precision/recall before extraction tuning. + +## When to use + +- Download step returns noisy sources. +- Ordinance recall is weak across jurisdictions. +- LLM filtering is compensating for poor search quality. + +## Scope + +- Query-template strategy. +- URL ranking and filtering patterns. +- Heuristic phrase controls before LLM validation. + +## Example references (optional) + +- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` +- `examples/one_shot_schema_extraction_geothermal/geothermal_config.json5` +- `examples/one_shot_schema_extraction_geothermal/geothermal_jurisdictions_one.csv` +- `examples/one_shot_schema_extraction_cst/cst_plugin_config.yaml` +- `examples/compass_tech_pipeline/README.md` + +## Two retrieval phases + +COMPASS runs two sequential acquisition passes per jurisdiction: + +1. **Search-engine phase** — queries `SerpAPIGoogleSearch` (or configured + engine) using `query_templates`. This phase is the primary source of + ordinance documents. +2. **Website crawl phase** — crawls the jurisdiction's official website, + ranking pages using `website_keywords`. This phase is a secondary pass + and runs even if the SE phase found documents. + +Key behaviors: +- Playwright browser errors during the website crawl phase are **non-fatal**. + COMPASS logs the error and continues. +- `Found 0 potential documents` at the end of the crawl phase is **expected** + for jurisdictions without relevant online ordinances. +- Disable the crawl phase with `perform_website_search: false` in run config + when you want faster smoke tests or Playwright is unavailable. + +## Key management + +For SerpAPI-backed search, keep `api_key` out of committed config and provide +`SERPAPI_KEY` via environment (for example through `.env` loaded in shell). + +Recommended shell setup: + +```bash +set -a +source .env +set +a +``` + +Avoid spaces around `=` in `.env` assignments. 
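The `.env` formatting rule above can be checked before the file is sourced. What follows is a minimal sketch and not part of COMPASS: the helper name `find_spaced_assignments` is hypothetical, and it only inspects simple `KEY=value` lines (optionally prefixed with `export`).

```python
import re

# Hypothetical helper, not part of COMPASS: flags .env lines where
# whitespace sits directly before or after the "=" of an assignment.
_SPACED = re.compile(r"^\s*(?:export\s+)?[A-Za-z_][A-Za-z0-9_]*(\s+=|=\s)")


def find_spaced_assignments(env_text):
    """Return (line_number, line) pairs whose assignment has spaces around '='."""
    problems = []
    for number, line in enumerate(env_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are ignored
        if _SPACED.match(line):
            problems.append((number, line))
    return problems
```

Running it on the contents of `.env` before `source .env` gives the line numbers to fix; an empty list means no spaced assignments were detected.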
+ +## Retrieval design pattern + +1. Create 3-7 jurisdiction queries with `{jurisdiction}`. +2. Weight legal document indicators in URL keywords. +3. Apply exclusions for templates/reports/slides. +4. Add focused negative tech terms to reduce false positives. +5. Start with dynamic search, then switch to deterministic known URLs when + search infrastructure is unstable. + +For first-pass reliability, test retrieval with deterministic known URLs +before using live web search. + +## Technology-specific retrieval controls (template) + +- Include target-technology facility/deployment terms. +- Exclude adjacent and non-target terms (residential/HVAC/PV/etc as needed). +- Favor jurisdictional legal-code signals like `land use code`, + `code of ordinances`, `use table`, and `special use permit`. + +## Deterministic smoke-test mode + +Use run-config controls to bypass flaky search while tuning: + +- supply `known_doc_urls` or `known_local_docs`, +- set `perform_se_search: false`, +- set `perform_website_search: false`. + +Then validate: + +- download artifacts exist, +- cleaned text exists, +- ordinance DB rows are non-empty. + +## Tuning loop + +1. Run SE-search phase on small sample. +2. Inspect kept vs discarded PDFs (`ordinance_files/`). +3. Run heuristic filter and review false rejects/accepts (`cleaned_text/`). +4. Check website crawl phase independently if needed (enable, run, inspect logs). +5. Update one axis only: + - query templates (affects SE phase), + - URL weights (affects both phases), + - include/exclude heuristic patterns (pre-LLM filter), + - `not_tech_words` (upstream document rejection). +6. Re-run same sample and compare. 
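The "re-run same sample and compare" step in the tuning loop can be made less manual. The sketch below is an illustration only, not a COMPASS feature: `compare_downloads` is a hypothetical helper, and it assumes that downloaded file names under `ordinance_files/` stay stable between runs, which may not hold for every source.

```python
from pathlib import Path


def _file_names(directory):
    """Return the set of file names in a directory, or an empty set if it is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return set()
    return {p.name for p in directory.iterdir() if p.is_file()}


def compare_downloads(old_out_dir, new_out_dir, subdir="ordinance_files"):
    """Report which downloaded files were gained, lost, or kept between two runs."""
    old = _file_names(Path(old_out_dir) / subdir)
    new = _file_names(Path(new_out_dir) / subdir)
    return {
        "gained": sorted(new - old),  # present only in the newer run
        "lost": sorted(old - new),    # present only in the older run
        "kept": sorted(old & new),    # present in both runs
    }
```

Because each run is expected to use its own `out_dir`, the two directories can be compared directly after changing a single tuning axis.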
+ +## Cross-tech onboarding + +When reusing this workflow for any technology: + +- keep legal retrieval tokens (`ordinance`, `zoning`, `code`), +- replace all technology terms in `query_templates`, `website_keywords`, + and `heuristic_keywords`, +- seed `known_doc_urls` with authoritative regulatory documents for smoke + testing, +- avoid copying negatives from previous technologies into the new tech config, +- verify `not_tech_words` excludes adjacent technologies for your domain. + +## Phase gates + +- **3 jurisdictions**: ensure major source classes are found. +- **10-25 jurisdictions**: verify stability across regions. +- **Full scale**: only once false positive/negative rates stabilize. + +## Guardrails + +- Keep feature extraction logic out of retrieval config. +- Do not overfit to one county's document style. +- Preserve auditable rationale for each retrieval change. +- Keep one canonical retrieval config per active technology. +- Ensure each run uses a unique `out_dir` to avoid COMPASS aborting early. diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/yaml-setup/SKILL.md new file mode 100644 index 00000000..a9f93d17 --- /dev/null +++ b/.github/skills/yaml-setup/SKILL.md @@ -0,0 +1,199 @@ +--- +name: yaml-setup +description: Author and tune one-shot plugin YAML configs for COMPASS-native document discovery, filtering, and text collection. +--- + +# YAML Setup Skill + +Use this skill to create or tune one-shot plugin YAML that controls retrieval, +filtering, and text collection behavior. + +## When to use + +- New technology onboarding in one-shot extraction. +- Schema exists but source relevance is weak. +- You need reproducible config handoff across teams. 
+ +## Example references (optional) + +- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` +- `examples/one_shot_schema_extraction_geothermal/README.rst` +- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` +- `docs/source/examples/one_shot_schema_extraction/plugin_config_minimal.json` +- `docs/source/examples/one_shot_schema_extraction/plugin_config_simple.json5` +- `docs/source/examples/one_shot_schema_extraction/plugin_config.yaml` + +## Naming convention + +Use tech-first file names when creating new one-shot assets: +`_config*.json5`, `_plugin_config.yaml`, +`_schema.json`, `_jurisdictions*.csv`. + +## Secret handling + +Keep API keys in environment variables (for example `SERPAPI_KEY`, +`AZURE_OPENAI_API_KEY`) rather than in plugin or run config files. +Load them per shell session with `set -a && source .env && set +a`. +Avoid spaces around `=` in `.env` assignments. + +## Required minimum + +```yaml +schema: ./my_schema.json +``` + +## Key plugin YAML fields + +| Field | Type | Behavior | +|---|---|---| +| `schema` | string (path) | **Required.** Path to JSON schema file, relative to plugin YAML location. | +| `data_type_short_desc` | string | Short description used in LLM prompts (e.g. `utility-scale ordinance`). | +| `query_templates` | list | Search query templates; `{jurisdiction}` is replaced at runtime. | +| `website_keywords` | dict | Keyword → score map for URL ranking during website crawl. | +| `heuristic_keywords` | dict or `true` | Pre-LLM text filter. If `true`, LLM generates lists from schema. | +| `collection_prompts` | list or `true` | Text collection prompt(s). If **`true`**, LLM auto-generates from schema. | +| `text_extraction_prompts` | list or `true` | Text consolidation prompt(s). If **`true`**, LLM auto-generates from schema. | +| `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. 
Use this to scope extraction tightly to the target technology. | +| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates` and `website_keywords`. Set to `false` when iterating schema to see live changes. | + +### `collection_prompts: true` and `text_extraction_prompts: true` + +Setting either flag to `true` (not a list) instructs COMPASS to use the LLM +to auto-generate the prompts from the schema content. This is the recommended +shortcut during development — do not write manual prompt lists until +auto-generated ones prove insufficient. + +### `extraction_system_prompt` + +This is the primary control for preventing scope bleed from generic land-use +code documents. Write it as a multi-line YAML literal block: + +```yaml +extraction_system_prompt: |- + You are a legal scholar extracting structured data from + utility-scale ordinances. + + Extract only enacted requirements for utility-scale facilities. + Exclude adjacent technologies and non-target use cases. + Prefer explicit values. Use null for qualitative obligations. +``` + +See `compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml` +for a complete example. + +## Progressive config path + +1. **Minimal** + - Confirm schema path and extraction invocation work. +2. **Simple** + - Add `query_templates`, `heuristic_keywords`, and `cache_llm_generated_content`. + - Set `collection_prompts: true` and `text_extraction_prompts: true` to + let the LLM auto-generate prompts from the schema. +3. **Full** + - Add `extraction_system_prompt` if scope bleed or off-domain extraction + is observed. + - Replace `heuristic_keywords: true` with an explicit list if precision + is insufficient. + +Use the same progression for any technology. 
+ +## Baseline YAML pattern + +```yaml +schema: ./my_schema.json +data_type_short_desc: utility-scale ordinance +cache_llm_generated_content: true +query_templates: + - "filetype:pdf {jurisdiction} ordinance" + - "{jurisdiction} zoning ordinance" + - "{jurisdiction} permitting requirements" +website_keywords: + pdf: 92160 + : 46080 + ordinance: 23040 + zoning: 2880 + permit: 1440 +heuristic_keywords: + good_tech_keywords: + - "" + - "" + good_tech_acronyms: + - "" + good_tech_phrases: + - "" + - "" + not_tech_words: + - "" + - "" +collection_prompts: true +text_extraction_prompts: true +extraction_system_prompt: |- + You are a legal scholar extracting structured data from + utility-scale ordinances. + + Extract only requirements for utility-scale facilities. + Exclude adjacent technologies and non-target use cases. +``` + +Swap vocabulary for any technology while keeping the same structure. + +## Stable development mode + +Plugin YAML controls retrieval behavior, but deterministic acquisition for +smoke tests belongs in run config: + +- `known_doc_urls` or `known_local_docs` +- `perform_se_search: false` +- `perform_website_search: false` (disables the website crawl second phase) + +Use this mode first, then re-enable search once schema extraction quality is +stable. + +Recommended baseline: use dynamic search first, then use deterministic mode +if search infrastructure fails. + +## Acquisition phases + +COMPASS acquisition runs in two sequential phases per jurisdiction: + +1. **Search-engine phase** — uses `SerpAPIGoogleSearch` or similar; driven by + `query_templates`. +2. **Website crawl phase** — crawls the jurisdiction's main website using + `website_keywords` for ranking. Playwright browser errors during this + phase are **non-fatal**; COMPASS logs them and moves on. + +`perform_website_search: false` skips phase 2. Use it during smoke tests to +keep run time short and avoid Playwright dependency issues. 
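When switching between dynamic search and deterministic mode, it is easy to disable every acquisition path at once. The check below is a hedged sketch of a guard against that. `has_acquisition_source` is a hypothetical helper, not a COMPASS function. It assumes the run config has already been parsed into a dictionary, and it treats a missing key as disabled, which may differ from COMPASS's actual defaults.

```python
def has_acquisition_source(run_config):
    """Return True if the run config enables at least one way to obtain documents.

    Assumption: keys that are absent are treated as disabled. COMPASS's real
    defaults may differ, so set the flags explicitly when relying on this check.
    """
    return any(
        (
            run_config.get("perform_se_search") is True,
            run_config.get("perform_website_search") is True,
            bool(run_config.get("known_doc_urls")),
            bool(run_config.get("known_local_docs")),
        )
    )
```

A config with both search flags set to `false` and no known documents returns `False`, which signals that the run would have nothing to extract from.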
+ +## Validation checklist + +- Schema path resolves from runtime working directory. +- Query templates include `{jurisdiction}` consistently. +- URL weights favor legal and government documents. +- Heuristic exclusions are precise and not over-broad. +- Prompt overrides are only added when default behavior fails. + +## Cross-tech adaptation checklist + +When adapting to another technology: + +- replace vocabulary in `query_templates` and `website_keywords`, +- keep legal-code terms (`ordinance`, `zoning`, `code of ordinances`), +- keep non-target exclusions explicit in `not_tech_words`, +- do not carry terms from a previous technology into new tech configs, +- write a technology-specific `extraction_system_prompt`. + +## Run command + +```bash +pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v +``` + +If running outside the tech folder, use absolute paths for `-c` and `-p`. + +## Guardrails + +- Retrieval behavior belongs in plugin YAML. +- Feature logic belongs in schema. +- Adjust one tuning axis per run for clean attribution. +- Keep one canonical plugin file per technology in the active example folder. 
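The validation checklist item about `{jurisdiction}` can be checked mechanically. This is a small illustrative sketch: `templates_missing_placeholder` is a hypothetical name, and the function expects the `query_templates` list already loaded from the plugin YAML.

```python
def templates_missing_placeholder(query_templates, placeholder="{jurisdiction}"):
    """Return the query templates that do not contain the jurisdiction placeholder."""
    return [template for template in query_templates if placeholder not in template]
```

An empty result means every template will be specialized per jurisdiction at runtime; any returned template would issue the same query for all jurisdictions.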
From 54b8d290a082bbf18310c84efa3c97df61cadc18 Mon Sep 17 00:00:00 2001 From: Byron Pullutasig <115118857+bpulluta@users.noreply.github.com> Date: Tue, 17 Mar 2026 13:13:14 -0600 Subject: [PATCH 2/7] Added one-shot skills --- .github/skills/extraction-run/SKILL.md | 57 +++--- .github/skills/schema-creation/SKILL.md | 228 ++++++++++++------------ .github/skills/web-scraper/SKILL.md | 29 +-- .github/skills/yaml-setup/SKILL.md | 107 +++++++---- 4 files changed, 241 insertions(+), 180 deletions(-) diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md index 23321b46..e77a10cc 100644 --- a/.github/skills/extraction-run/SKILL.md +++ b/.github/skills/extraction-run/SKILL.md @@ -5,14 +5,19 @@ description: Execute one-shot extraction with COMPASS, evaluate outputs, and ite # Extraction Run Skill +**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction. +For legacy decision-tree extraction (solar, wind, small wind), consult COMPASS +architecture docs. + Use this skill to run one-shot extraction in a repeatable, low-risk way, then iterate quickly until you have stable structured outputs. ## When to use - Schema exists and plugin config points to it. -- You are onboarding a new technology (for example geothermal, CHP, hydrogen). +- You are onboarding a new technology (diesel generator, geothermal, CHP, hydrogen). - You need a reliable smoke-test workflow before scaling. +- You are NOT using legacy decision-tree extraction. ## Two-pipeline modes @@ -48,6 +53,16 @@ needed for one-shot techs. - API keys in environment (never hardcode in configs). - A jurisdiction set sized to the current phase. +## Preflight checks (must pass before run) + +- Jurisdiction CSV has headers `County,State`. +- `out_dir` is unique for this run. +- At least one acquisition step is enabled: + `perform_se_search: true`, `perform_website_search: true`, + `known_doc_urls`, or `known_local_docs`. 
+- If `heuristic_keywords` exists, all four required lists are present and + non-empty. + ## Naming convention Use tech-first names for all one-shot assets: @@ -92,29 +107,19 @@ unchanged and swap only technology-specific inputs: Change one axis per run unless debugging infrastructure failures. -## Example references (optional) +## Canonical reference -- `examples/one_shot_schema_extraction/README.rst` -- `examples/one_shot_schema_extraction_geothermal/geothermal_config.json5` -- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` -- `examples/one_shot_schema_extraction_geothermal/geothermal_schema.json` -- `examples/one_shot_schema_extraction_geothermal/geothermal_jurisdictions_one.csv` -- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` -- `examples/one_shot_schema_extraction_cst/cst_config.json5` (CST reference) -- `examples/one_shot_schema_extraction_cst/cst_plugin_config.yaml` (CST reference) -- `examples/one_shot_schema_extraction_cst/cst_schema.json` (CST reference) -- `compass/extraction/geothermal_electricity/` (finalized one-shot tech example) -- `docs/source/examples/one_shot_schema_extraction/plugin_config_minimal.json` -- `docs/source/examples/one_shot_schema_extraction/plugin_config.yaml` -- `examples/compass_tech_pipeline/README.md` +- `examples/one_shot_schema_extraction/` — complete working examples +- `examples/one_shot_schema_extraction/README.rst` — general one-shot overview +- `examples/water_rights_demo/one-shot/` — multi-doc extraction example -## Environment setup reminder +## Environment setup -Before running, load secrets from `.env` (for example `SERPAPI_KEY`, -`AZURE_OPENAI_API_KEY`) into the current shell. Do not commit secret values -inside config files. +Load secrets from `.env` before running. Never commit key values in config files. -Common `.env` gotcha: avoid spaces around `=` in variable assignments. 
+```bash +set -a && source .env && set +a # no spaces around = in .env assignments +``` ## Core command @@ -124,9 +129,9 @@ pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v ## Phase-gated workflow -1. **Smoke test (3 jurisdictions)** +1. **Smoke test (1 jurisdiction)** - Goal: verify wiring and output contract. -2. **Robustness (10-25 jurisdictions)** +2. **Robustness (5 jurisdictions)** - Goal: verify feature stability and edge-case handling. 3. **Scale (full set)** - Goal: only after earlier phases pass acceptance gates. @@ -177,6 +182,14 @@ Check in order: 3. `outputs/*/quantitative_ordinances.csv` and `outputs/*/qualitative_ordinances.csv` (final compiled results) +Treat the run as **failed for extraction quality** when either is true: +- `Number of jurisdictions with extracted data: 0` +- any configuration exception appears in logs (even if process exits 0) + +Only treat a run as passing when both are true: +- at least one jurisdiction has extracted data +- at least one jurisdiction CSV in `jurisdiction_dbs/` has more than header row + ## Root-cause triage - **Wrong or noisy documents** diff --git a/.github/skills/schema-creation/SKILL.md b/.github/skills/schema-creation/SKILL.md index 951593a3..805d8dc9 100644 --- a/.github/skills/schema-creation/SKILL.md +++ b/.github/skills/schema-creation/SKILL.md @@ -5,169 +5,161 @@ description: Author and iterate one-shot extraction schemas that replace legacy # Schema Creation Skill -Use this skill to encode extraction logic in schema so behavior is repeatable -across jurisdictions and technologies. +**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction +(new technology onboarding with JSON schema + plugin YAML). For legacy decision-tree +extraction (existing solar/wind/small-wind in `compass/extraction//`), +consult COMPASS architecture docs. + +Use this skill to define what the LLM extracts and how it formats results. 
+The schema is the single most important config file for output quality. ## When to use -- Creating a new one-shot technology plugin. -- Migrating from decision-tree logic to schema-driven extraction. -- Stabilizing inconsistent model outputs. +- Starting a new one-shot technology extraction (NOT decision-tree legacy extraction). +- Fixing inconsistent or incorrect extracted values in one-shot extraction. +- Adding new features to an existing one-shot extraction. -## Example references (optional) +## Canonical reference -- `examples/one_shot_schema_extraction_geothermal/geothermal_schema.json` -- `examples/one_shot_schema_extraction_geothermal/README.rst` -- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` -- `docs/source/examples/one_shot_schema_extraction/wind_schema.json` +For complete examples, see the `examples/` directory: +- `examples/one_shot_schema_extraction/wind_schema.json` +- `examples/water_rights_demo/one-shot/water_rights_schema.json5` -## Required output contract +Each follows the pattern: `_schema.json` or `_schema.json5`. -Top-level object must define `outputs` and each item must require: +## Required output contract -- `feature` -- `value` -- `units` -- `section` -- `summary` +Every schema must define `outputs` as an array. 
Each item must require +exactly these five fields and set `additionalProperties: false`: ```json { "type": "object", "required": ["outputs"], + "additionalProperties": false, "properties": { "outputs": { "type": "array", "items": { "type": "object", "required": ["feature", "value", "units", "section", "summary"], - "additionalProperties": false + "additionalProperties": false, + "properties": { + "feature": { "type": "string", "enum": ["..."] }, + "value": { "anyOf": [{"type": "number"}, {"type": "string"}, {"type": "boolean"}, {"type": "array", "items": {"type": "string"}}, {"type": "null"}] }, + "units": { "type": ["string", "null"] }, + "section": { "type": ["string", "null"] }, + "summary": { "type": ["string", "null"] } + } } } } } ``` -## Build sequence - -1. Copy baseline schema and rename for target tech. -2. Replace `feature` enum with target-tech IDs. -3. Define `value`/`units` rules per feature family. -4. Add `$definitions` for reusable decision logic. -5. Add `$examples` for top failure modes. -6. Add `$instructions` for global extraction policy. - -For new technologies (for example CHP or CST), clone a working schema and -perform a strict vocabulary swap (features, units, exclusions) before adding -new logic. - -## Output column mapping +These five fields map directly to the output CSV columns. COMPASS adds +`county`, `state`, `FIPS`, and other metadata columns automatically. -Schema field names map directly to the final output CSV columns: - -| Schema field | CSV column | -|---|---| -| `feature` | `feature` | -| `value` | `value` | -| `units` | `units` | -| `section` | `section` | -| `summary` | `summary` | - -Additional columns added by COMPASS finalization: `county`, `state`, -`subdivision`, `jurisdiction_type`, `FIPS`, `adder`, `min_dist`, `max_dist`, -`year`, `source`. These do not need to appear in the schema. +## Build sequence -## Scope bleed from generic legal documents +1. 
**Define the feature enum** — one stable lowercase ID per siting-relevant + requirement. Group IDs by family (setbacks, noise, zoning, permitting). +2. **Define `value` and `units` rules per feature family** — in each + feature's `description`, state the expected value type and accepted unit + vocabulary explicitly. +3. **Add `$definitions`** — group related feature descriptions here to keep + the `feature` enum block clean. +4. **Add `$instructions`** — encode global extraction policy (scope, null + handling, one-row-per-feature contract, verbatim quote preference). +5. **Smoke-test on one jurisdiction** — validate all enum items appear in + output and null rows are correctly populated for missing features. -When COMPASS retrieves a large generic land-use code rather than a -technology-specific ordinance, the LLM may extract provisions that are -outside the schema enum. This is most visible when unfamiliar feature names -appear in the output CSV. +## Feature definition template -Primary controls: -- `extraction_system_prompt` in plugin YAML — this is the strongest signal. - State explicitly what is in scope and what is out. -- `$instructions.scope` in schema — reinforce exclusion language here. -- `heuristic_keywords.not_tech_words` — filter documents upstream. +Every feature description must answer four questions: -Do not widen the feature enum to accommodate scope bleed; narrow the prompt -and upstream filters instead. +1. **What is this?** One sentence identifying the regulatory concept. +2. **VALUE rule:** What type is the value and what specific values/ranges are + valid? +3. **UNITS rule:** What unit string is accepted, or `null` if not applicable? +4. **IGNORE / CLARIFICATION:** What near-miss concepts must NOT match this + feature? -## Technology adaptation guidance +Example (abbreviated): -When adapting a baseline schema to any new technology: +```json +"structure setback": { + "description": "Minimum distance from the generator to an occupied building. 
VALUE: numerical distance. UNITS: 'feet' or 'meters'. IGNORE: setbacks from property lines or roads — those are separate features." +} +``` -- Separate core utility-scale requirements from adjacent/non-target systems. -- Keep district/permit features distinct from numerical constraints. -- Encode jurisdiction/governance handling where relevant in summaries. -- Require explicit nulls when a feature is not enacted. +## Feature family taxonomy -## Cross-technology adaptation checklist +Organize `$definitions` by these families: -Apply this for any new domain: +| Family | Example features | +|---|---| +| Setbacks | `structure setback`, `property line setback`, `road setback` | +| Noise/Emissions | `noise limit`, `emissions standard`, `vibration limit` | +| Operational | `hours of operation` | +| Physical design | `screening requirement`, `enclosure requirement`, `exhaust stack height` | +| Zoning | `primary use districts`, `conditional use districts`, `prohibited use districts` | +| Permitting | `permit requirement`, `capacity threshold` | +| Compliance | `decommissioning`, `enactment date` | -1. Define technology-specific `feature` enum with stable IDs. -2. Define allowed unit vocabulary for each feature family. -3. Add explicit exclusion language for adjacent-but-out-of-scope systems. -4. Ensure summaries preserve legal traceability (section + source-faithful text). -5. Validate on deterministic docs before tuning retrieval. -6. Consider including `enactment date` in the enum — COMPASS naturally surfaces it - from documents and it provides important temporal context in outputs. +## `$instructions` block -## Example specialization patterns (optional) +Always include a `$instructions` object at the top level with these keys: -Use examples only to shape exclusion strategy: +```json +"$instructions": { + "scope": "Describe exactly what to extract and what to ignore.", + "null_handling": "Output every enum feature. 
Use null value and null summary when a feature is not found in the document. Do not omit features.", + "one_row_per_feature": "Output exactly one row per feature. If multiple values apply, use the most restrictive and describe variants in summary.", + "verbatim_quotes": "In summary fields, prefer verbatim quotes from the source. Enclose in double quotation marks.", + "units_discipline": "Do not convert units. Record them exactly as they appear in the document." +} +``` -- separate core utility-scale requirements from adjacent technologies, -- add explicit exclusion terms in `not_tech_words`, -- preserve legal traceability via `section` and `summary`. +## Scope bleed control -## Reuse safeguards +When COMPASS retrieves a large land-use code instead of a tech-specific +ordinance, the LLM may extract off-domain provisions. -- Keep tech-first file names consistent across assets: - `_config*.json5`, `_plugin_config.yaml`, - `_schema.json`, `_jurisdictions*.csv`. -- Keep credentials out of schema content and examples. -- Validate schema behavior with a small smoke run before scaling. +Fix order (most powerful first): +1. `extraction_system_prompt` in plugin YAML — state explicitly what is in + scope and what is excluded. +2. `$instructions.scope` in schema — reinforce with exclusion language. +3. `heuristic_keywords.NOT_TECH_WORDS` — reject documents upstream. -## High-value authoring patterns +Do not expand the feature enum to absorb scope bleed. Narrow the prompt. -- Put restrictive-value selection rules directly in descriptions. -- Explicitly define accepted unit vocabulary. -- Clarify near-miss terms that should not be treated as equivalent. -- State whether qualitative features should keep `value`/`units` null. +## Cross-technology adaptation checklist -## Anti-patterns +When cloning this schema for a new technology: -- Retrieval instructions embedded in schema semantics. -- Feature IDs that change names across iterations. 
-- Implicit unit assumptions not declared in text. -- Examples that contradict field descriptions. -- Feature enums that include placeholders with no extraction logic. +- [ ] Replace all feature IDs with technology-specific names. +- [ ] Replace value/units rules in every feature description. +- [ ] Replace exclusion terms in `$instructions.scope` and feature IGNORE + clauses. +- [ ] Replace `$definitions` group names to match new feature families. +- [ ] Smoke-test before widening to 10+ jurisdictions. ## Quality checklist -- Enum matches target output columns. -- Every feature has deterministic extraction rules. -- `section` and `summary` preserve legal traceability. -- Repeated sample runs produce stable feature typing. - -## Iteration loop - -1. Run 3-jurisdiction smoke sample. -2. Catalog failure modes by feature. -3. Patch only affected descriptions/examples. -4. Re-run same sample before expanding scope. - -Save iterated schema versions as `_schemav2.json`, `_schemav3.json` -etc. to preserve a diff history. The active version is what `schema:` in the -plugin YAML points to. - -## Practical quality signal - -Treat a schema as "working" when all are true on the smoke sample: - -- final ordinance CSV outputs are non-empty, -- extracted rows include stable feature IDs, -- most non-null rows have useful `section` and `summary`, -- repeated runs do not shift feature semantics materially. +- [ ] Feature enum uses stable, lowercase, underscore-separated IDs. +- [ ] Every feature description contains VALUE, UNITS, and IGNORE clauses. +- [ ] `$instructions` block is present with all five keys. +- [ ] `additionalProperties: false` is set on the top-level object and on + each item in the `outputs` array. +- [ ] Schema validates cleanly against a JSON Schema validator. +- [ ] A smoke run using this schema produces extracted rows (not just + successful process exit logs). + +## Anti-patterns to avoid + +- Feature IDs that change names between iterations. 
+- Implicit unit assumptions not stated in description text. +- Missing IGNORE clauses for common near-miss features. +- Examples in descriptions that contradict field rules. +- Widening the enum to absorb scope bleed instead of tightening the prompt. diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/web-scraper/SKILL.md index f021bdb8..f5149364 100644 --- a/.github/skills/web-scraper/SKILL.md +++ b/.github/skills/web-scraper/SKILL.md @@ -6,11 +6,13 @@ description: Build and tune one-shot plugin configs that search, rank, and colle # Web Scraper Skill Use this skill to improve retrieval precision/recall before extraction tuning. +Applies to both one-shot (schema-driven) and legacy decision-tree extraction +pipelines. ## When to use -- Download step returns noisy sources. -- Ordinance recall is weak across jurisdictions. +- Download step returns noisy sources (one-shot extraction). +- Ordinance recall is weak across jurisdictions (one-shot extraction). - LLM filtering is compensating for poor search quality. ## Scope @@ -19,13 +21,11 @@ Use this skill to improve retrieval precision/recall before extraction tuning. - URL ranking and filtering patterns. - Heuristic phrase controls before LLM validation. -## Example references (optional) +## Canonical reference -- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` -- `examples/one_shot_schema_extraction_geothermal/geothermal_config.json5` -- `examples/one_shot_schema_extraction_geothermal/geothermal_jurisdictions_one.csv` -- `examples/one_shot_schema_extraction_cst/cst_plugin_config.yaml` -- `examples/compass_tech_pipeline/README.md` +Consult example plugin configurations in `examples/` following the tech-first naming pattern: +- `_plugin_config.yaml` — standard one-shot config +- See `examples/water_rights_demo/one-shot/plugin_config.yaml` for multi-document edge cases ## Two retrieval phases @@ -70,6 +70,15 @@ Avoid spaces around `=` in `.env` assignments. 5. 
Start with dynamic search, then switch to deterministic known URLs when search infrastructure is unstable. +When using `heuristic_keywords`, include all required lists: +- `GOOD_TECH_KEYWORDS` +- `GOOD_TECH_PHRASES` +- `GOOD_TECH_ACRONYMS` +- `NOT_TECH_WORDS` + +If any required list is missing or empty, COMPASS raises a plugin +configuration error and extraction quality should be treated as failed. + For first-pass reliability, test retrieval with deterministic known URLs before using live web search. @@ -104,7 +113,7 @@ Then validate: - query templates (affects SE phase), - URL weights (affects both phases), - include/exclude heuristic patterns (pre-LLM filter), - - `not_tech_words` (upstream document rejection). + - `NOT_TECH_WORDS` (upstream document rejection). 6. Re-run same sample and compare. ## Cross-tech onboarding @@ -117,7 +126,7 @@ When reusing this workflow for any technology: - seed `known_doc_urls` with authoritative regulatory documents for smoke testing, - avoid copying negatives from previous technologies into the new tech config, -- verify `not_tech_words` excludes adjacent technologies for your domain. +- verify `NOT_TECH_WORDS` excludes adjacent technologies for your domain. ## Phase gates diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/yaml-setup/SKILL.md index a9f93d17..11a360af 100644 --- a/.github/skills/yaml-setup/SKILL.md +++ b/.github/skills/yaml-setup/SKILL.md @@ -5,23 +5,25 @@ description: Author and tune one-shot plugin YAML configs for COMPASS-native doc # YAML Setup Skill +**ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction. +For legacy decision-tree extraction, consult COMPASS architecture docs. + Use this skill to create or tune one-shot plugin YAML that controls retrieval, filtering, and text collection behavior. ## When to use -- New technology onboarding in one-shot extraction. +- New technology onboarding in one-shot extraction (NOT decision-tree extraction). 
- Schema exists but source relevance is weak. - You need reproducible config handoff across teams. -## Example references (optional) +## Canonical reference + +With tech-first naming, configuration examples follow this pattern: +- `examples/one_shot_schema_extraction/_plugin_config.yaml` — standard working example +- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-doc edge case -- `examples/one_shot_schema_extraction_geothermal/geothermal_plugin_config.yaml` -- `examples/one_shot_schema_extraction_geothermal/README.rst` -- `examples/one_shot_schema_extraction_geothermal/geothermal_one_shot_guide.md` -- `docs/source/examples/one_shot_schema_extraction/plugin_config_minimal.json` -- `docs/source/examples/one_shot_schema_extraction/plugin_config_simple.json5` -- `docs/source/examples/one_shot_schema_extraction/plugin_config.yaml` +Refer to any complete example in `examples/` that matches your retrieval goals. ## Naming convention @@ -42,6 +44,14 @@ Avoid spaces around `=` in `.env` assignments. schema: ./my_schema.json ``` +## Non-negotiable runtime constraints + +- Jurisdiction CSV headers are case-sensitive: use `County,State`. +- If `heuristic_keywords` is present, it must include all four lists and + none may be empty. +- A run is not considered passing if logs show config errors or if + extracted jurisdiction count is zero. + ## Key plugin YAML fields | Field | Type | Behavior | @@ -56,6 +66,26 @@ schema: ./my_schema.json | `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. Use this to scope extraction tightly to the target technology. | | `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates` and `website_keywords`. Set to `false` when iterating schema to see live changes. 
| +## Required `heuristic_keywords` shape + +Use this exact structure when defining `heuristic_keywords`: + +```yaml +heuristic_keywords: + GOOD_TECH_KEYWORDS: + - "" + GOOD_TECH_PHRASES: + - "" + GOOD_TECH_ACRONYMS: + - "" + NOT_TECH_WORDS: + - "" +``` + +Notes: +- Keys are normalized, but using canonical key names reduces mistakes. +- All four lists are required and must be non-empty. + ### `collection_prompts: true` and `text_extraction_prompts: true` Setting either flag to `true` (not a list) instructs COMPASS to use the LLM @@ -87,11 +117,11 @@ for a complete example. - Confirm schema path and extraction invocation work. 2. **Simple** - Add `query_templates`, `heuristic_keywords`, and `cache_llm_generated_content`. - - Set `collection_prompts: true` and `text_extraction_prompts: true` to - let the LLM auto-generate prompts from the schema. 3. **Full** - Add `extraction_system_prompt` if scope bleed or off-domain extraction is observed. + - Set `collection_prompts: true` and `text_extraction_prompts: true` to + let the LLM auto-generate prompts from the schema. - Replace `heuristic_keywords: true` with an explicit list if precision is insufficient. @@ -114,44 +144,61 @@ website_keywords: zoning: 2880 permit: 1440 heuristic_keywords: - good_tech_keywords: + GOOD_TECH_KEYWORDS: - "" - "" - good_tech_acronyms: + GOOD_TECH_ACRONYMS: - "" - good_tech_phrases: + GOOD_TECH_PHRASES: - "" - "" - not_tech_words: + NOT_TECH_WORDS: - "" - "" -collection_prompts: true -text_extraction_prompts: true -extraction_system_prompt: |- - You are a legal scholar extracting structured data from - utility-scale ordinances. - - Extract only requirements for utility-scale facilities. - Exclude adjacent technologies and non-target use cases. ``` Swap vocabulary for any technology while keeping the same structure. 
## Stable development mode -Plugin YAML controls retrieval behavior, but deterministic acquisition for -smoke tests belongs in run config: +Use run-config controls for deterministic smoke tests while iterating schema: -- `known_doc_urls` or `known_local_docs` -- `perform_se_search: false` -- `perform_website_search: false` (disables the website crawl second phase) +- `known_doc_urls` or `known_local_docs` — bypass live search +- `perform_se_search: false` — disable search-engine phase +- `perform_website_search: false` — disable website crawl phase -Use this mode first, then re-enable search once schema extraction quality is -stable. +Re-enable search only after extraction quality is stable on known documents. Recommended baseline: use dynamic search first, then use deterministic mode if search infrastructure fails. +## Minimal run-config contract (to pair with plugin YAML) + +Use this pattern and require users to provide their own model and client +values: + +```json5 +{ + out_dir: "./outputs__", + tech: "", + jurisdiction_fp: "./_jurisdictions.csv", + perform_se_search: true, + perform_website_search: false, + model: [ + { + name: "", + llm_call_kwargs: { temperature: 0, timeout: 600 }, + client_kwargs: { + api_version: "", + azure_endpoint: "" + } + } + ] +} +``` + +Do not hardcode model names in skills. Prompt the user to supply `name`. + ## Acquisition phases COMPASS acquisition runs in two sequential phases per jurisdiction: @@ -179,7 +226,7 @@ When adapting to another technology: - replace vocabulary in `query_templates` and `website_keywords`, - keep legal-code terms (`ordinance`, `zoning`, `code of ordinances`), -- keep non-target exclusions explicit in `not_tech_words`, +- keep non-target exclusions explicit in `NOT_TECH_WORDS`, - do not carry terms from a previous technology into new tech configs, - write a technology-specific `extraction_system_prompt`. 
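Reviewer sketch (not part of the patch): the preflight rules that this patch adds to the yaml-setup and extraction-run skills can be checked mechanically before spending a run. The helper below is a minimal illustration under stated assumptions, not COMPASS code. The function names are hypothetical, the configs are assumed to be already parsed into plain dicts, key normalization is approximated by upper-casing, and an absent `perform_*` flag is treated as disabled, which may not match COMPASS defaults.

```python
import csv
import io

# The four list names are taken from the skill text in this patch.
REQUIRED_KEYWORD_LISTS = (
    "GOOD_TECH_KEYWORDS",
    "GOOD_TECH_PHRASES",
    "GOOD_TECH_ACRONYMS",
    "NOT_TECH_WORDS",
)


def check_heuristic_keywords(plugin_config):
    """Return a list of problems with `heuristic_keywords` (empty if OK)."""
    keywords = plugin_config.get("heuristic_keywords")
    if keywords is None or keywords is True:
        # Absent, or set to `true` so the lists are generated for us.
        return []
    if not isinstance(keywords, dict):
        return ["heuristic_keywords must be a mapping or true"]
    # Assumption: normalization is approximated by upper-casing the keys.
    normalized = {str(key).upper(): value for key, value in keywords.items()}
    problems = []
    for name in REQUIRED_KEYWORD_LISTS:
        values = normalized.get(name)
        if not isinstance(values, list) or not values:
            problems.append(f"{name} is missing or empty")
    return problems


def check_jurisdiction_headers(csv_text):
    """Check for the case-sensitive `County` and `State` headers."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [col for col in ("County", "State") if col not in header]
    return [f"jurisdiction CSV is missing header {col!r}" for col in missing]


def check_acquisition(run_config):
    """Require at least one enabled acquisition step in the run config."""
    enabled = (
        run_config.get("perform_se_search") is True
        or run_config.get("perform_website_search") is True
        or bool(run_config.get("known_doc_urls"))
        or bool(run_config.get("known_local_docs"))
    )
    return [] if enabled else ["no acquisition step is enabled"]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    plugin = {"heuristic_keywords": {"good_tech_keywords": ["geothermal"]}}
    run = {"perform_se_search": False, "perform_website_search": False}
    found = (
        check_heuristic_keywords(plugin)
        + check_jurisdiction_headers("county,state\n")
        + check_acquisition(run)
    )
    for problem in found:
        print("PREFLIGHT FAIL:", problem)
```

Each check returns a list of messages rather than raising, so a wrapper can report every preflight failure at once and refuse to start the run if the combined list is non-empty.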
From a71447f1dbf7197cf651538d720f88faaf73a7e8 Mon Sep 17 00:00:00 2001 From: Byron Pullutasig <115118857+bpulluta@users.noreply.github.com> Date: Tue, 17 Mar 2026 13:32:28 -0600 Subject: [PATCH 3/7] update one-shot SKILL.md structure and trigger contracts --- .github/skills/extraction-run/SKILL.md | 28 ++++++++++++++++++------- .github/skills/schema-creation/SKILL.md | 20 +++++++++++++++--- .github/skills/web-scraper/SKILL.md | 24 ++++++++++++++++----- .github/skills/yaml-setup/SKILL.md | 17 ++++++++++++++- 4 files changed, 73 insertions(+), 16 deletions(-) diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md index e77a10cc..00be8f92 100644 --- a/.github/skills/extraction-run/SKILL.md +++ b/.github/skills/extraction-run/SKILL.md @@ -1,6 +1,6 @@ --- name: extraction-run -description: Execute one-shot extraction with COMPASS, evaluate outputs, and iterate schema/config changes with minimal cost. +description: Execute one-shot extraction with COMPASS and iterate quickly with low cost. Use whenever a user asks to run, smoke-test, validate, debug, or scale one-shot schema extraction for any technology. --- # Extraction Run Skill @@ -19,6 +19,26 @@ then iterate quickly until you have stable structured outputs. - You need a reliable smoke-test workflow before scaling. - You are NOT using legacy decision-tree extraction. +## Do not use + +- Legacy decision-tree extraction feature engineering. +- Python parser implementation in `compass/extraction//parse.py`. +- Non-extraction tasks (for example docs-only updates). + +## Expected assistant output + +When using this skill, return: + +1. The exact `pixi run compass process ...` command used. +2. A pass/fail decision against extraction-quality gates. +3. The smallest next config/schema change and why. 
+ +## Canonical reference + +- `examples/one_shot_schema_extraction/` — complete working examples +- `examples/one_shot_schema_extraction/README.rst` — general one-shot overview +- `examples/water_rights_demo/one-shot/` — multi-doc extraction example + ## Two-pipeline modes COMPASS supports two distinct extraction pipelines. Choose one and do not mix @@ -107,12 +127,6 @@ unchanged and swap only technology-specific inputs: Change one axis per run unless debugging infrastructure failures. -## Canonical reference - -- `examples/one_shot_schema_extraction/` — complete working examples -- `examples/one_shot_schema_extraction/README.rst` — general one-shot overview -- `examples/water_rights_demo/one-shot/` — multi-doc extraction example - ## Environment setup Load secrets from `.env` before running. Never commit key values in config files. diff --git a/.github/skills/schema-creation/SKILL.md b/.github/skills/schema-creation/SKILL.md index 805d8dc9..c4941bc1 100644 --- a/.github/skills/schema-creation/SKILL.md +++ b/.github/skills/schema-creation/SKILL.md @@ -1,6 +1,6 @@ --- name: schema-creation -description: Author and iterate one-shot extraction schemas that replace legacy decision-tree extraction logic in native COMPASS. +description: Author and iterate one-shot extraction schemas for native COMPASS. Use whenever a user asks to create, expand, or debug schema feature definitions, value/unit rules, or extraction instructions. --- # Schema Creation Skill @@ -19,6 +19,19 @@ The schema is the single most important config file for output quality. - Fixing inconsistent or incorrect extracted values in one-shot extraction. - Adding new features to an existing one-shot extraction. +## Do not use + +- Retrieval tuning tasks that belong in plugin YAML. +- Legacy decision-tree extraction parser implementation. + +## Expected assistant output + +When using this skill, return: + +1. The proposed schema diff (or full schema block) for the targeted features. +2. 
The rationale for VALUE, UNITS, and IGNORE wording. +3. A smoke-test check plan for validating the schema change. + ## Canonical reference For complete examples, see the `examples/` directory: @@ -63,7 +76,8 @@ These five fields map directly to the output CSV columns. COMPASS adds ## Build sequence 1. **Define the feature enum** — one stable lowercase ID per siting-relevant - requirement. Group IDs by family (setbacks, noise, zoning, permitting). + requirement. Keep naming consistent across iterations and group IDs by + family (setbacks, noise, zoning, permitting). 2. **Define `value` and `units` rules per feature family** — in each feature's `description`, state the expected value type and accepted unit vocabulary explicitly. @@ -147,7 +161,7 @@ When cloning this schema for a new technology: ## Quality checklist -- [ ] Feature enum uses stable, lowercase, underscore-separated IDs. +- [ ] Feature enum uses stable, consistent IDs across all runs. - [ ] Every feature description contains VALUE, UNITS, and IGNORE clauses. - [ ] `$instructions` block is present with all five keys. - [ ] `additionalProperties: false` is set on the top-level object and on diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/web-scraper/SKILL.md index f5149364..27a3fa37 100644 --- a/.github/skills/web-scraper/SKILL.md +++ b/.github/skills/web-scraper/SKILL.md @@ -1,6 +1,6 @@ --- name: web-scraper -description: Build and tune one-shot plugin configs that search, rank, and collect ordinance documents with native COMPASS pipeline settings. +description: Build and tune retrieval configs that search, rank, and collect ordinance documents in COMPASS. Use whenever a user asks to improve retrieval precision/recall, tune search queries/keywords, or debug acquisition quality before extraction tuning. --- # Web Scraper Skill @@ -15,11 +15,18 @@ pipelines. - Ordinance recall is weak across jurisdictions (one-shot extraction). - LLM filtering is compensating for poor search quality. 
-## Scope +## Do not use -- Query-template strategy. -- URL ranking and filtering patterns. -- Heuristic phrase controls before LLM validation. +- Schema feature definition or value extraction logic design. +- Post-extraction feature/value debugging when retrieval is already correct. + +## Expected assistant output + +When using this skill, return: + +1. The retrieval axis changed (queries, keyword weights, or heuristics). +2. Evidence from artifacts/logs showing why the change was needed. +3. The next run command against the same jurisdiction sample. ## Canonical reference @@ -27,6 +34,12 @@ Consult example plugin configurations in `examples/` following the tech-first na - `_plugin_config.yaml` — standard one-shot config - See `examples/water_rights_demo/one-shot/plugin_config.yaml` for multi-document edge cases +## Scope + +- Query-template strategy. +- URL ranking and filtering patterns. +- Heuristic phrase controls before LLM validation. + ## Two retrieval phases COMPASS runs two sequential acquisition passes per jurisdiction: @@ -141,3 +154,4 @@ When reusing this workflow for any technology: - Preserve auditable rationale for each retrieval change. - Keep one canonical retrieval config per active technology. - Ensure each run uses a unique `out_dir` to avoid COMPASS aborting early. + diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/yaml-setup/SKILL.md index 11a360af..af2a82e5 100644 --- a/.github/skills/yaml-setup/SKILL.md +++ b/.github/skills/yaml-setup/SKILL.md @@ -1,6 +1,6 @@ --- name: yaml-setup -description: Author and tune one-shot plugin YAML configs for COMPASS-native document discovery, filtering, and text collection. +description: Author and tune one-shot plugin YAML for COMPASS document discovery, filtering, and text collection. Use whenever a user asks to create, clean up, standardize, or troubleshoot one-shot plugin YAML for technology onboarding. --- # YAML Setup Skill @@ -17,6 +17,20 @@ filtering, and text collection behavior. 
- Schema exists but source relevance is weak. - You need reproducible config handoff across teams. +## Do not use + +- Legacy decision-tree parser implementation changes. +- Schema feature semantics work that belongs in `_schema.json`. +- Run-result diagnosis after outputs are generated (use iteration loop skill). + +## Expected assistant output + +When using this skill, return: + +1. The finalized plugin YAML content or exact diff. +2. Any required paired run-config changes. +3. A validation command and pass/fail checks for the edited YAML. + ## Canonical reference With tech-first naming, configuration examples follow this pattern: @@ -244,3 +258,4 @@ If running outside the tech folder, use absolute paths for `-c` and `-p`. - Feature logic belongs in schema. - Adjust one tuning axis per run for clean attribution. - Keep one canonical plugin file per technology in the active example folder. + From 74495a6648822194a966b0e48c80a4b37413c5b8 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 17 Mar 2026 15:21:58 -0600 Subject: [PATCH 4/7] Initial plan (#398) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> From 81fcbff8b786d1167527d0e0b5a22841ab05055c Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Tue, 17 Mar 2026 16:26:22 -0600 Subject: [PATCH 5/7] Fix skills documentation: correct paths, caching behavior, and tab formatting (#399) * Initial plan * Fix all review comments in skills documentation Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com> --- .github/skills/extraction-run/SKILL.md | 44 ++++++++++++++------------ .github/skills/web-scraper/SKILL.md | 20 +++++++----- .github/skills/yaml-setup/SKILL.md | 14 +++++--- 3 files changed, 44 insertions(+), 34 
deletions(-) diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md index 00be8f92..a356eea2 100644 --- a/.github/skills/extraction-run/SKILL.md +++ b/.github/skills/extraction-run/SKILL.md @@ -56,15 +56,17 @@ only a schema JSON, a plugin YAML, and a run config — no Python source changes New technology assets start in `examples/` and finish in `compass/extraction/`: -1. **Develop** — place all assets in `examples/one_shot_schema_extraction_/` +1. **Develop** — place all assets in `examples/one_shot_schema_extraction/` 2. **Stabilize** — iterate schema/plugin until smoke and robustness gates pass 3. **Promote** — copy the three finalized files into `compass/extraction//`: - `_schema.json` - `_plugin_config.yaml` - `_config.json5` (optional; useful as a reference run config) + - `__init__.py` — registers the plugin via `create_schema_based_one_shot_extraction_plugin` -The promoted extraction folder contains only config files — no Python code is -needed for one-shot techs. + After creating the package, add an import in `compass/extraction/__init__.py` + to register the plugin at startup. See `compass/extraction/ghp/__init__.py` + for a reference implementation. ## Required inputs @@ -78,10 +80,10 @@ needed for one-shot techs. - Jurisdiction CSV has headers `County,State`. - `out_dir` is unique for this run. - At least one acquisition step is enabled: - `perform_se_search: true`, `perform_website_search: true`, - `known_doc_urls`, or `known_local_docs`. + `perform_se_search: true`, `perform_website_search: true`, + `known_doc_urls`, or `known_local_docs`. - If `heuristic_keywords` exists, all four required lists are present and - non-empty. + non-empty. ## Naming convention @@ -106,7 +108,7 @@ to deterministic mode only when search infrastructure is unstable: 2. Use your preferred configured search engine. 3. Load `.env` into shell (`set -a && source .env && set +a`). 4. 
Run with verbose logs: - - `pixi run compass process -c config.json5 -p plugin.yaml -v` + - `pixi run compass process -c config.json5 -p plugin.yaml -v` 5. Confirm output artifacts exist before tuning schema semantics. Fallback mode when needed: @@ -144,11 +146,11 @@ pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v ## Phase-gated workflow 1. **Smoke test (1 jurisdiction)** - - Goal: verify wiring and output contract. + - Goal: verify wiring and output contract. 2. **Robustness (5 jurisdictions)** - - Goal: verify feature stability and edge-case handling. + - Goal: verify feature stability and edge-case handling. 3. **Scale (full set)** - - Goal: only after earlier phases pass acceptance gates. + - Goal: only after earlier phases pass acceptance gates. ## Validation checklist @@ -194,7 +196,7 @@ Check in order: 1. `outputs/*/cleaned_text/*.txt` (text extraction present) 2. `outputs/*/jurisdiction_dbs/*.csv` (per-jurisdiction parsed rows) 3. `outputs/*/quantitative_ordinances.csv` and - `outputs/*/qualitative_ordinances.csv` (final compiled results) + `outputs/*/qualitative_ordinances.csv` (final compiled results) Treat the run as **failed for extraction quality** when either is true: - `Number of jurisdictions with extracted data: 0` @@ -207,20 +209,20 @@ Only treat a run as passing when both are true: ## Root-cause triage - **Wrong or noisy documents** - - Tune query templates, URL keywords, and exclusions. - - Prefer `known_doc_urls` while stabilizing. + - Tune query templates, URL keywords, and exclusions. + - Prefer `known_doc_urls` while stabilizing. - **Right documents, wrong fields** - - Tune schema descriptions/examples and ambiguity rules. - - Check `extraction_system_prompt` in plugin YAML — it is the primary - guard against scope bleed from generic legal documents. + - Tune schema descriptions/examples and ambiguity rules. 
+ - Check `extraction_system_prompt` in plugin YAML — it is the primary + guard against scope bleed from generic legal documents. - **Correct values, unstable formatting** - - Tighten enums, unit vocabulary, and null behavior. + - Tighten enums, unit vocabulary, and null behavior. - **Nothing downloaded / unstable search** - - Disable live search and use deterministic known URLs/local docs. + - Disable live search and use deterministic known URLs/local docs. - **0 documents found for a jurisdiction during website crawl** - - Expected for jurisdictions with few online ordinances. The website - crawl is a second acquisition pass after search-engine retrieval; - 0 results there is not a pipeline failure. + - Expected for jurisdictions with few online ordinances. The website + crawl is a second acquisition pass after search-engine retrieval; + 0 results there is not a pipeline failure. ## Acceptance gates diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/web-scraper/SKILL.md index 27a3fa37..05a078f0 100644 --- a/.github/skills/web-scraper/SKILL.md +++ b/.github/skills/web-scraper/SKILL.md @@ -30,9 +30,12 @@ When using this skill, return: ## Canonical reference -Consult example plugin configurations in `examples/` following the tech-first naming pattern: -- `_plugin_config.yaml` — standard one-shot config -- See `examples/water_rights_demo/one-shot/plugin_config.yaml` for multi-document edge cases +Consult example plugin configurations in `examples/`: +- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard one-shot config +- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-document edge cases + +When creating new tech configs, use `_plugin_config.yaml` as a recommended +naming convention (e.g. `geothermal_plugin_config.yaml`). ## Scope @@ -49,7 +52,8 @@ COMPASS runs two sequential acquisition passes per jurisdiction: ordinance documents. 2. 
**Website crawl phase** — crawls the jurisdiction's official website, ranking pages using `website_keywords`. This phase is a secondary pass - and runs even if the SE phase found documents. + and runs only if the search-engine phase did not yield an ordinance + context. Key behaviors: - Playwright browser errors during the website crawl phase are **non-fatal**. @@ -123,10 +127,10 @@ Then validate: 3. Run heuristic filter and review false rejects/accepts (`cleaned_text/`). 4. Check website crawl phase independently if needed (enable, run, inspect logs). 5. Update one axis only: - - query templates (affects SE phase), - - URL weights (affects both phases), - - include/exclude heuristic patterns (pre-LLM filter), - - `NOT_TECH_WORDS` (upstream document rejection). + - query templates (affects SE phase), + - URL weights (affects both phases), + - include/exclude heuristic patterns (pre-LLM filter), + - `NOT_TECH_WORDS` (upstream document rejection). 6. Re-run same sample and compare. ## Cross-tech onboarding diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/yaml-setup/SKILL.md index af2a82e5..1502085c 100644 --- a/.github/skills/yaml-setup/SKILL.md +++ b/.github/skills/yaml-setup/SKILL.md @@ -33,10 +33,15 @@ When using this skill, return: ## Canonical reference -With tech-first naming, configuration examples follow this pattern: -- `examples/one_shot_schema_extraction/_plugin_config.yaml` — standard working example +Consult the working examples in `examples/`: +- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard working example - `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-doc edge case +When creating new tech configs, `_plugin_config.yaml` is the recommended +naming convention (e.g. `geothermal_plugin_config.yaml`). The existing +`plugin_config.yaml` examples use a generic name; new tech-specific assets +should use the tech-first naming pattern. 
+ Refer to any complete example in `examples/` that matches your retrieval goals. ## Naming convention @@ -78,7 +83,7 @@ schema: ./my_schema.json | `collection_prompts` | list or `true` | Text collection prompt(s). If **`true`**, LLM auto-generates from schema. | | `text_extraction_prompts` | list or `true` | Text consolidation prompt(s). If **`true`**, LLM auto-generates from schema. | | `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. Use this to scope extraction tightly to the target technology. | -| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates` and `website_keywords`. Set to `false` when iterating schema to see live changes. | +| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates`, `website_keywords`, and `heuristic_keywords`. Set to `false` when iterating schema to see live changes. | ## Required `heuristic_keywords` shape @@ -122,8 +127,7 @@ extraction_system_prompt: |- Prefer explicit values. Use null for qualitative obligations. ``` -See `compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml` -for a complete example. +See `compass/extraction/ghp/plugin_config.yaml` for a complete example. 
## Progressive config path From 1b8571f283056d906ee91ccbb8f933966edd6cc8 Mon Sep 17 00:00:00 2001 From: Byron Pullutasig <115118857+bpulluta@users.noreply.github.com> Date: Thu, 19 Mar 2026 17:07:05 -0600 Subject: [PATCH 6/7] renamed skills and fixed minor comments --- .../SKILL.md | 28 ++- .github/skills/extraction-run/SKILL.md | 18 +- .github/skills/iteration-development/SKILL.md | 224 ++++++++++++++++++ .../SKILL.md | 18 +- .github/skills/schema-creation/SKILL.md | 5 +- 5 files changed, 266 insertions(+), 27 deletions(-) rename .github/skills/{web-scraper => document-retrieval}/SKILL.md (81%) create mode 100644 .github/skills/iteration-development/SKILL.md rename .github/skills/{yaml-setup => plugin-config-setup}/SKILL.md (91%) diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/document-retrieval/SKILL.md similarity index 81% rename from .github/skills/web-scraper/SKILL.md rename to .github/skills/document-retrieval/SKILL.md index 05a078f0..9c077424 100644 --- a/.github/skills/web-scraper/SKILL.md +++ b/.github/skills/document-retrieval/SKILL.md @@ -1,5 +1,5 @@ --- -name: web-scraper +name: document-retrieval description: Build and tune retrieval configs that search, rank, and collect ordinance documents in COMPASS. Use whenever a user asks to improve retrieval precision/recall, tune search queries/keywords, or debug acquisition quality before extraction tuning. --- @@ -87,11 +87,19 @@ Avoid spaces around `=` in `.env` assignments. 5. Start with dynamic search, then switch to deterministic known URLs when search infrastructure is unstable. -When using `heuristic_keywords`, include all required lists: -- `GOOD_TECH_KEYWORDS` -- `GOOD_TECH_PHRASES` -- `GOOD_TECH_ACRONYMS` -- `NOT_TECH_WORDS` +When using `heuristic_keywords`, use these four lists to guide pre-LLM filtering: +- `GOOD_TECH_KEYWORDS` — strong indicators of the target technology + (e.g., facility types, deployment modes). Documents matching even a + few keywords are marked as candidates. 
+- `GOOD_TECH_PHRASES` — multi-word phrases that signal relevant + ordinance content. Keep specific to avoid false positives. +- `GOOD_TECH_ACRONYMS` — industry-standard abbreviations for the + technology. Narrow list; include only widely recognized acronyms. +- `NOT_TECH_WORDS` — pre-heuristic filter that rejects documents + before keyword matching. Use to exclude adjacent technologies and + irrelevant domains (e.g., residential HVAC, unrelated industries). + Runs first; prevents wasted keyword evaluation on clearly-wrong + documents. If any required list is missing or empty, COMPASS raises a plugin configuration error and extraction quality should be treated as failed. @@ -107,6 +115,10 @@ before using live web search. `code of ordinances`, `use table`, and `special use permit`. ## Deterministic smoke-test mode +For this smoke test, at least one of the following documentation sources must be provided: + +- **`known_doc_urls`**: A list of URLs pointing to external documentation that the scraper can access and parse +- **`known_local_docs`**: A collection of local documentation files available in the repository or system Use run-config controls to bypass flaky search while tuning: @@ -148,8 +160,8 @@ When reusing this workflow for any technology: ## Phase gates - **3 jurisdictions**: ensure major source classes are found. -- **10-25 jurisdictions**: verify stability across regions. -- **Full scale**: only once false positive/negative rates stabilize. +- **10 jurisdictions**: verify stability across regions. + ## Guardrails diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md index a356eea2..ed41439d 100644 --- a/.github/skills/extraction-run/SKILL.md +++ b/.github/skills/extraction-run/SKILL.md @@ -6,7 +6,7 @@ description: Execute one-shot extraction with COMPASS and iterate quickly with l # Extraction Run Skill **ONE-SHOT EXTRACTION ONLY.** This skill applies only to schema-driven extraction. 
-For legacy decision-tree extraction (solar, wind, small wind), consult COMPASS +For decision-tree extraction (solar, wind, small wind), consult COMPASS architecture docs. Use this skill to run one-shot extraction in a repeatable, low-risk way, @@ -17,11 +17,11 @@ then iterate quickly until you have stable structured outputs. - Schema exists and plugin config points to it. - You are onboarding a new technology (diesel generator, geothermal, CHP, hydrogen). - You need a reliable smoke-test workflow before scaling. -- You are NOT using legacy decision-tree extraction. +- You are NOT using decision-tree extraction. ## Do not use -- Legacy decision-tree extraction feature engineering. +- Decision-tree extraction feature engineering. - Python parser implementation in `compass/extraction//parse.py`. - Non-extraction tasks (for example docs-only updates). @@ -47,7 +47,7 @@ them for the same technology: | Mode | Where code lives | Good for | |---|---|---| | **One-shot (schema-based)** | `examples/` → `compass/extraction//` | New techs, no Python changes | -| **Legacy decision-tree** | Python code in `compass/extraction//` | Existing solar, wind, small wind | +| **decision-tree** | Python code in `compass/extraction//` | Existing solar, wind, small wind | One-shot is the correct path for all new technology onboarding. It requires only a schema JSON, a plugin YAML, and a run config — no Python source changes. @@ -61,7 +61,6 @@ New technology assets start in `examples/` and finish in `compass/extraction/`: 3. 
**Promote** — copy the three finalized files into `compass/extraction//`: - `_schema.json` - `_plugin_config.yaml` - - `_config.json5` (optional; useful as a reference run config) - `__init__.py` — registers the plugin via `create_schema_based_one_shot_extraction_plugin` After creating the package, add an import in `compass/extraction/__init__.py` @@ -77,7 +76,7 @@ New technology assets start in `examples/` and finish in `compass/extraction/`: ## Preflight checks (must pass before run) -- Jurisdiction CSV has headers `County,State`. +- Jurisdiction CSV has headers `County,State` or `County,State,Subdivision,Jurisdiction Type`. - `out_dir` is unique for this run. - At least one acquisition step is enabled: `perform_se_search: true`, `perform_website_search: true`, @@ -89,7 +88,6 @@ New technology assets start in `examples/` and finish in `compass/extraction/`: Use tech-first names for all one-shot assets: -- `_config*.json5` - `_plugin_config.yaml` - `_schema.json` - `_jurisdictions*.csv` @@ -149,8 +147,6 @@ pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v - Goal: verify wiring and output contract. 2. **Robustness (5 jurisdictions)** - Goal: verify feature stability and edge-case handling. -3. **Scale (full set)** - - Goal: only after earlier phases pass acceptance gates. ## Validation checklist @@ -161,10 +157,6 @@ Evaluate each run on: - section/summary traceability, - unit consistency, - null discipline, -- **scope bleed** — check that no features appear in the output CSVs that - fall outside the schema enum; generic land-use-code documents can cause - unrelated provisions to leak through. Tighten `extraction_system_prompt` - in plugin YAML to fix this. 
## Expected output artifacts diff --git a/.github/skills/iteration-development/SKILL.md b/.github/skills/iteration-development/SKILL.md new file mode 100644 index 00000000..120bbaef --- /dev/null +++ b/.github/skills/iteration-development/SKILL.md @@ -0,0 +1,224 @@ +--- +name: iteration-development +description: Run → inspect → fix cycle for one-shot extraction after initial setup. Use whenever a user asks to diagnose poor output, reduce scope bleed, improve precision/recall, or scale from smoke tests. +--- + +# Iteration Development Skill + +Use this skill after you have a working schema, plugin YAML, and run config +and want to improve extraction quality through systematic iteration. + +## When to use + +- First smoke run produced output that needs diagnosis or improvement. +- Feature values or units are wrong, missing, or inconsistent. +- Retrieval is returning off-target documents. +- Scaling from 3 jurisdictions to 10–25 or full production. + +## Do not use + +- First-time setup before any successful smoke run. +- Legacy decision-tree extraction development. + +## Expected assistant output + +When using this skill, return: + +1. The observed failure class (retrieval, extraction scope, value/units, or null handling). +2. One concrete fix on a single axis. +3. The re-run command and pass/fail gate check. + +## Canonical reference + +- `examples/one_shot_schema_extraction/` — working examples + to use as a baseline for comparing output quality. + + +## The run → inspect → fix loop + +**Three Phases:** This skill guides you through three phases, all built into +example plugin configurations in the `examples/` directory. + +Repeat this cycle once per iteration. Change exactly one axis per cycle. + +``` +Run → Inspect outputs → Identify failure → Fix one axis → Re-run same sample +``` + +**Never change multiple axes in the same iteration.** You will not know +which change caused the result. 
+ +**Phases encoded in plugin YAML comments:** + +- **Phase 1 (Initial):** Includes query templates, website keywords, and + basic heuristic filters to avoid obvious off-domain results. + **This is ready to run immediately.** +- **Phase 2 (Optional Refinement):** Uncomment advanced heuristic tuning + if Phase 1 retrieval produces off-target documents. +- **Phase 3 (Optional Refinement):** Uncomment extraction_system_prompt + if Phase 1-2 retrieval works but extracted features are wrong (scope bleed). + +Start with Phase 1. Only add Phase 2 / 3 if Phase 1 results need improvement. +See README.rst for the progression path. + + +## Step 1: Inspect output artifacts + +After each run, check these locations inside `out_dir`: + +| Artifact | What to look for | +|---|---| +| `ordinance_files/*.pdf` | Are these on-target documents? | +| `cleaned_text/*.txt` | Does page text contain target technology language? | +| `jurisdiction_dbs/*.csv` | Are feature rows present? Are values correct? | +| `quantitative_ordinances.csv` and `qualitative_ordinances.csv` | Final compiled output — check feature coverage and null rate | +| `logs//*.log` | Error messages, 0-document warnings | + +Minimum passing state for a smoke run: +- At least one `ordinance_files/` PDF per jurisdiction. +- At least one `cleaned_text/` file per jurisdiction. +- Compiled ordinance CSV outputs contain rows for most jurisdictions. + +Immediate fail conditions (fix before any tuning): +- Jurisdiction CSV header mismatch (must include at least `County,State`). +- Plugin configuration exceptions in logs (for example missing required + `heuristic_keywords` lists). +- `Number of jurisdictions with extracted data: 0`. + + +## Step 2: Classify the failure + +Use this decision tree for any defect: + +``` +Is the right document being retrieved? + └─ No → retrieval problem → fix query templates / heuristic_keywords + └─ Yes + Is the document text present in cleaned_text/? 
+ └─ No → text extraction problem → check PDF quality / OCR + └─ Yes + Are the right features being extracted? + └─ No, wrong feature names → schema enum or description problem + └─ No, off-domain features → scope bleed → fix extraction_system_prompt + └─ Yes, but wrong values/units → schema description or units problem + └─ Yes, but nulls where values should be → schema IGNORE clause too broad +``` + + +## Step 3: Fix the right axis + +### Retrieval problems (wrong or missing documents) + +Fix in plugin YAML: +- Add more specific `query_templates` with legal code terms + (e.g., `"filetype:pdf {jurisdiction} generator zoning code"`). +- Add target technology terms to `GOOD_TECH_KEYWORDS` and + `GOOD_TECH_PHRASES`. +- Add adjacent-technology terms being confounded to `NOT_TECH_WORDS`. +- Increase `website_keywords` score for the most discriminating terms. + +Required `heuristic_keywords` keys when present: +- `GOOD_TECH_KEYWORDS` +- `GOOD_TECH_PHRASES` +- `GOOD_TECH_ACRONYMS` +- `NOT_TECH_WORDS` + +### Scope bleed (off-domain features extracted) + +Fix in plugin YAML `extraction_system_prompt`: +- State explicitly what is excluded (e.g., "Do not extract requirements for + residential portable generators"). +- Add the same language to `$instructions.scope` in the schema for + reinforcement. + +### Wrong values or units + +Fix in schema JSON, in the affected feature's `description`: +- Add or sharpen the `VALUE` rule. +- Expand the `UNITS` vocabulary list. +- Add a `IGNORE` clause for the near-miss case. + +### Missing values (nulls where data exists) + +Fix in schema JSON: +- Broaden the feature description to cover the phrasing used in source + documents. +- Remove overly restrictive IGNORE clauses. +- Check that the feature ID is spelled exactly as it appears in the enum. + +### Text extraction failures (blank cleaned_text) + +- Verify the PDF is readable (not scanned without OCR). +- Add `from_ocr: true` to the doc entry in `known_local_docs`. 
+- Set `pytesseract_exe_fp` in run config if OCR is needed. + + +## Iteration hygiene + +- Use a **unique `out_dir`** per iteration run. COMPASS aborts early if the + output directory already contains results. +- Keep the same small jurisdiction sample across all iterations until + quality gates pass. +- Record what changed and why in a short comment in the config file or + a separate `CHANGELOG.md` in the example folder. +- Save schema versions as `_schema_v2.json` etc. to + preserve a diff history. Point `schema:` in plugin YAML to the active + version. + + +## Scale-up protocol + +Only advance to the next phase when the current phase passes all gates. + +| Phase | Jurisdictions | Gates | +|---|---|---| +| Smoke | 1–3 | Output rows exist; feature names match schema enum; section/summary present for most rows | +| Robustness | 10–25 | Feature value types are stable; null rate is explainable; no scope bleed | +| Production | Full national set | False positive/negative rates acceptable; repeated runs show minimal drift | + +When advancing, keep the same config files. Only change the jurisdictions CSV. 
+ + +## Diagnostic commands + +```bash +# Check if cleaned text was produced +ls outputs/*/cleaned_text/ + +# Count output rows per jurisdiction +wc -l outputs/*/jurisdiction_dbs/*.csv + +# Check for scope bleed — feature values that are off-domain +grep -v "diesel\|generator\|backup\|emergency" outputs/ordinances.csv | head -20 + +# View logs for a specific jurisdiction +cat outputs/logs/San\ Diego*/run.log | grep -i "error\|warning\|found 0" +``` + + +## Common failure modes + +| Symptom | Most likely cause | Fix axis | +|---|---|---| +| 0 documents for all jurisdictions | Credentials not loaded / search API down | Load `.env`; use `known_doc_urls` | +| Downloaded PDFs are from wrong domain | `query_templates` too generic | Narrow queries with `filetype:pdf` and legal code terms | +| `cleaned_text` present but no output CSV rows | Schema enum mismatch or extraction prompt failing | Check schema path in plugin YAML; verify `tech` value in run config | +| Off-domain feature names in output | Scope bleed from large land-use code | Add exclusion language to `extraction_system_prompt` | +| Correct features but wrong values | Feature description lacks VALUE rule | Add explicit VALUE rule to affected descriptions | +| Setback in wrong units | UNITS rule missing or implicit | Add explicit UNITS vocabulary to description | +| Null rows for features that are in the document | IGNORE clause too broad, or feature description doesn't match source phrasing | Broaden description; remove over-strict IGNORE clause | +| Playwright timeout errors in logs | Website crawl phase browser failure | Non-fatal; COMPASS continues. Use `known_doc_urls` while iterating | + + +## Acceptance criteria before promotion + +A technology is ready to promote from `examples/` to +`compass/extraction//` when all of the following are true on the +robustness run (10–25 jurisdictions): + +- [ ] Output CSV rows conform to required schema contract. 
+- [ ] Feature IDs are stable and match the schema enum exactly. +- [ ] Most non-null rows include a useful `section` and `summary`. +- [ ] Repeated runs on the same sample show minimal drift. +- [ ] No scope bleed (off-domain features) is observed. +- [ ] Null rate for common features is explainable (jurisdiction has no rule). diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/plugin-config-setup/SKILL.md similarity index 91% rename from .github/skills/yaml-setup/SKILL.md rename to .github/skills/plugin-config-setup/SKILL.md index 1502085c..57ec663d 100644 --- a/.github/skills/yaml-setup/SKILL.md +++ b/.github/skills/plugin-config-setup/SKILL.md @@ -1,5 +1,5 @@ --- -name: yaml-setup +name: plugin-config-setup description: Author and tune one-shot plugin YAML for COMPASS document discovery, filtering, and text collection. Use whenever a user asks to create, clean up, standardize, or troubleshoot one-shot plugin YAML for technology onboarding. --- @@ -87,6 +87,20 @@ schema: ./my_schema.json ## Required `heuristic_keywords` shape +When using `heuristic_keywords`, use these four lists to guide pre-LLM filtering: +- `GOOD_TECH_KEYWORDS` — strong indicators of the target technology + (e.g., facility types, deployment modes). Documents matching even a + few keywords are marked as candidates. +- `GOOD_TECH_PHRASES` — multi-word phrases that signal relevant + ordinance content. Keep specific to avoid false positives. +- `GOOD_TECH_ACRONYMS` — industry-standard abbreviations for the + technology. Narrow list; include only widely recognized acronyms. +- `NOT_TECH_WORDS` — pre-heuristic filter that rejects documents + before keyword matching. Use to exclude adjacent technologies and + irrelevant domains (e.g., residential HVAC, unrelated industries). + Runs first; prevents wasted keyword evaluation on clearly-wrong + documents. 
+ Use this exact structure when defining `heuristic_keywords`: ```yaml @@ -215,8 +229,6 @@ values: } ``` -Do not hardcode model names in skills. Prompt the user to supply `name`. - ## Acquisition phases COMPASS acquisition runs in two sequential phases per jurisdiction: diff --git a/.github/skills/schema-creation/SKILL.md b/.github/skills/schema-creation/SKILL.md index c4941bc1..08981132 100644 --- a/.github/skills/schema-creation/SKILL.md +++ b/.github/skills/schema-creation/SKILL.md @@ -119,7 +119,7 @@ Organize `$definitions` by these families: | Physical design | `screening requirement`, `enclosure requirement`, `exhaust stack height` | | Zoning | `primary use districts`, `conditional use districts`, `prohibited use districts` | | Permitting | `permit requirement`, `capacity threshold` | -| Compliance | `decommissioning`, `enactment date` | +| Compliance | `decommissioning` | ## `$instructions` block @@ -129,7 +129,6 @@ Always include a `$instructions` object at the top level with these keys: "$instructions": { "scope": "Describe exactly what to extract and what to ignore.", "null_handling": "Output every enum feature. Use null value and null summary when a feature is not found in the document. Do not omit features.", - "one_row_per_feature": "Output exactly one row per feature. If multiple values apply, use the most restrictive and describe variants in summary.", "verbatim_quotes": "In summary fields, prefer verbatim quotes from the source. Enclose in double quotation marks.", "units_discipline": "Do not convert units. Record them exactly as they appear in the document." } @@ -150,7 +149,7 @@ Do not expand the feature enum to absorb scope bleed. Narrow the prompt. ## Cross-technology adaptation checklist -When cloning this schema for a new technology: +When cloning a schema for a new technology: - [ ] Replace all feature IDs with technology-specific names. - [ ] Replace value/units rules in every feature description. 
From 3288e2663d174cc4f15cf8f4d1c2d777ea509c3f Mon Sep 17 00:00:00 2001 From: Byron Pullutasig <115118857+bpulluta@users.noreply.github.com> Date: Thu, 26 Mar 2026 16:35:41 -0600 Subject: [PATCH 7/7] updated skills Paul review march 26 --- .github/skills/extraction-run/SKILL.md | 1 - .github/skills/iteration-development/SKILL.md | 224 ------------------ .github/skills/plugin-config-setup/SKILL.md | 22 +- 3 files changed, 12 insertions(+), 235 deletions(-) diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md index ed41439d..c2fafa8b 100644 --- a/.github/skills/extraction-run/SKILL.md +++ b/.github/skills/extraction-run/SKILL.md @@ -15,7 +15,6 @@ then iterate quickly until you have stable structured outputs. ## When to use - Schema exists and plugin config points to it. -- You are onboarding a new technology (diesel generator, geothermal, CHP, hydrogen). - You need a reliable smoke-test workflow before scaling. - You are NOT using decision-tree extraction. diff --git a/.github/skills/iteration-development/SKILL.md b/.github/skills/iteration-development/SKILL.md deleted file mode 100644 index 120bbaef..00000000 --- a/.github/skills/iteration-development/SKILL.md +++ /dev/null @@ -1,224 +0,0 @@ ---- -name: iteration-development -description: Run → inspect → fix cycle for one-shot extraction after initial setup. Use whenever a user asks to diagnose poor output, reduce scope bleed, improve precision/recall, or scale from smoke tests. ---- - -# Iteration Development Skill - -Use this skill after you have a working schema, plugin YAML, and run config -and want to improve extraction quality through systematic iteration. - -## When to use - -- First smoke run produced output that needs diagnosis or improvement. -- Feature values or units are wrong, missing, or inconsistent. -- Retrieval is returning off-target documents.
-- Scaling from 3 jurisdictions to 10–25 or full production. - -## Do not use - -- First-time setup before any successful smoke run. -- Legacy decision-tree extraction development. - -## Expected assistant output - -When using this skill, return: - -1. The observed failure class (retrieval, extraction scope, value/units, or null handling). -2. One concrete fix on a single axis. -3. The re-run command and pass/fail gate check. - -## Canonical reference - -- `examples/one_shot_schema_extraction/` — working examples - to use as a baseline for comparing output quality. - - -## The run → inspect → fix loop - -**Three Phases:** This skill guides you through three phases, all built into -example plugin configurations in the `examples/` directory. - -Repeat this cycle once per iteration. Change exactly one axis per cycle. - -``` -Run → Inspect outputs → Identify failure → Fix one axis → Re-run same sample -``` - -**Never change multiple axes in the same iteration.** You will not know -which change caused the result. - -**Phases encoded in plugin YAML comments:** - -- **Phase 1 (Initial):** Includes query templates, website keywords, and - basic heuristic filters to avoid obvious off-domain results. - **This is ready to run immediately.** -- **Phase 2 (Optional Refinement):** Uncomment advanced heuristic tuning - if Phase 1 retrieval produces off-target documents. -- **Phase 3 (Optional Refinement):** Uncomment extraction_system_prompt - if Phase 1-2 retrieval works but extracted features are wrong (scope bleed). - -Start with Phase 1. Only add Phase 2 / 3 if Phase 1 results need improvement. -See README.rst for the progression path. - - -## Step 1: Inspect output artifacts - -After each run, check these locations inside `out_dir`: - -| Artifact | What to look for | -|---|---| -| `ordinance_files/*.pdf` | Are these on-target documents? | -| `cleaned_text/*.txt` | Does page text contain target technology language? | -| `jurisdiction_dbs/*.csv` | Are feature rows present? 
Are values correct? | -| `quantitative_ordinances.csv` and `qualitative_ordinances.csv` | Final compiled output — check feature coverage and null rate | -| `logs//*.log` | Error messages, 0-document warnings | - -Minimum passing state for a smoke run: -- At least one `ordinance_files/` PDF per jurisdiction. -- At least one `cleaned_text/` file per jurisdiction. -- Compiled ordinance CSV outputs contain rows for most jurisdictions. - -Immediate fail conditions (fix before any tuning): -- Jurisdiction CSV header mismatch (must include at least `County,State`). -- Plugin configuration exceptions in logs (for example missing required - `heuristic_keywords` lists). -- `Number of jurisdictions with extracted data: 0`. - - -## Step 2: Classify the failure - -Use this decision tree for any defect: - -``` -Is the right document being retrieved? - └─ No → retrieval problem → fix query templates / heuristic_keywords - └─ Yes - Is the document text present in cleaned_text/? - └─ No → text extraction problem → check PDF quality / OCR - └─ Yes - Are the right features being extracted? - └─ No, wrong feature names → schema enum or description problem - └─ No, off-domain features → scope bleed → fix extraction_system_prompt - └─ Yes, but wrong values/units → schema description or units problem - └─ Yes, but nulls where values should be → schema IGNORE clause too broad -``` - - -## Step 3: Fix the right axis - -### Retrieval problems (wrong or missing documents) - -Fix in plugin YAML: -- Add more specific `query_templates` with legal code terms - (e.g., `"filetype:pdf {jurisdiction} generator zoning code"`). -- Add target technology terms to `GOOD_TECH_KEYWORDS` and - `GOOD_TECH_PHRASES`. -- Add adjacent-technology terms being confounded to `NOT_TECH_WORDS`. -- Increase `website_keywords` score for the most discriminating terms. 
- -Required `heuristic_keywords` keys when present: -- `GOOD_TECH_KEYWORDS` -- `GOOD_TECH_PHRASES` -- `GOOD_TECH_ACRONYMS` -- `NOT_TECH_WORDS` - -### Scope bleed (off-domain features extracted) - -Fix in plugin YAML `extraction_system_prompt`: -- State explicitly what is excluded (e.g., "Do not extract requirements for - residential portable generators"). -- Add the same language to `$instructions.scope` in the schema for - reinforcement. - -### Wrong values or units - -Fix in schema JSON, in the affected feature's `description`: -- Add or sharpen the `VALUE` rule. -- Expand the `UNITS` vocabulary list. -- Add an `IGNORE` clause for the near-miss case. - -### Missing values (nulls where data exists) - -Fix in schema JSON: -- Broaden the feature description to cover the phrasing used in source - documents. -- Remove overly restrictive IGNORE clauses. -- Check that the feature ID is spelled exactly as it appears in the enum. - -### Text extraction failures (blank cleaned_text) - -- Verify the PDF is readable (not scanned without OCR). -- Add `from_ocr: true` to the doc entry in `known_local_docs`. -- Set `pytesseract_exe_fp` in run config if OCR is needed. - - -## Iteration hygiene - -- Use a **unique `out_dir`** per iteration run. COMPASS aborts early if the - output directory already contains results. -- Keep the same small jurisdiction sample across all iterations until - quality gates pass. -- Record what changed and why in a short comment in the config file or - a separate `CHANGELOG.md` in the example folder. -- Save schema versions as `_schema_v2.json` etc. to - preserve a diff history. Point `schema:` in plugin YAML to the active - version. - - -## Scale-up protocol - -Only advance to the next phase when the current phase passes all gates.
- -| Phase | Jurisdictions | Gates | -|---|---|---| -| Smoke | 1–3 | Output rows exist; feature names match schema enum; section/summary present for most rows | -| Robustness | 10–25 | Feature value types are stable; null rate is explainable; no scope bleed | -| Production | Full national set | False positive/negative rates acceptable; repeated runs show minimal drift | - -When advancing, keep the same config files. Only change the jurisdictions CSV. - - -## Diagnostic commands - -```bash -# Check if cleaned text was produced -ls outputs/*/cleaned_text/ - -# Count output rows per jurisdiction -wc -l outputs/*/jurisdiction_dbs/*.csv - -# Check for scope bleed — feature values that are off-domain -grep -v "diesel\|generator\|backup\|emergency" outputs/ordinances.csv | head -20 - -# View logs for a specific jurisdiction -cat outputs/logs/San\ Diego*/run.log | grep -i "error\|warning\|found 0" -``` - - -## Common failure modes - -| Symptom | Most likely cause | Fix axis | -|---|---|---| -| 0 documents for all jurisdictions | Credentials not loaded / search API down | Load `.env`; use `known_doc_urls` | -| Downloaded PDFs are from wrong domain | `query_templates` too generic | Narrow queries with `filetype:pdf` and legal code terms | -| `cleaned_text` present but no output CSV rows | Schema enum mismatch or extraction prompt failing | Check schema path in plugin YAML; verify `tech` value in run config | -| Off-domain feature names in output | Scope bleed from large land-use code | Add exclusion language to `extraction_system_prompt` | -| Correct features but wrong values | Feature description lacks VALUE rule | Add explicit VALUE rule to affected descriptions | -| Setback in wrong units | UNITS rule missing or implicit | Add explicit UNITS vocabulary to description | -| Null rows for features that are in the document | IGNORE clause too broad, or feature description doesn't match source phrasing | Broaden description; remove over-strict IGNORE clause | -| Playwright 
timeout errors in logs | Website crawl phase browser failure | Non-fatal; COMPASS continues. Use `known_doc_urls` while iterating | - - -## Acceptance criteria before promotion - -A technology is ready to promote from `examples/` to -`compass/extraction//` when all of the following are true on the -robustness run (10–25 jurisdictions): - -- [ ] Output CSV rows conform to required schema contract. -- [ ] Feature IDs are stable and match the schema enum exactly. -- [ ] Most non-null rows include a useful `section` and `summary`. -- [ ] Repeated runs on the same sample show minimal drift. -- [ ] No scope bleed (off-domain features) is observed. -- [ ] Null rate for common features is explainable (jurisdiction has no rule). diff --git a/.github/skills/plugin-config-setup/SKILL.md b/.github/skills/plugin-config-setup/SKILL.md index 57ec663d..0c83b5f9 100644 --- a/.github/skills/plugin-config-setup/SKILL.md +++ b/.github/skills/plugin-config-setup/SKILL.md @@ -73,17 +73,19 @@ schema: ./my_schema.json ## Key plugin YAML fields -| Field | Type | Behavior | +| Field | Type | Code Reference | |---|---|---| -| `schema` | string (path) | **Required.** Path to JSON schema file, relative to plugin YAML location. | -| `data_type_short_desc` | string | Short description used in LLM prompts (e.g. `utility-scale ordinance`). | -| `query_templates` | list | Search query templates; `{jurisdiction}` is replaced at runtime. | -| `website_keywords` | dict | Keyword → score map for URL ranking during website crawl. | -| `heuristic_keywords` | dict or `true` | Pre-LLM text filter. If `true`, LLM generates lists from schema. | -| `collection_prompts` | list or `true` | Text collection prompt(s). If **`true`**, LLM auto-generates from schema. | -| `text_extraction_prompts` | list or `true` | Text consolidation prompt(s). If **`true`**, LLM auto-generates from schema. | -| `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. 
Use this to scope extraction tightly to the target technology. | -| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates`, `website_keywords`, and `heuristic_keywords`. Set to `false` when iterating schema to see live changes. | +| `schema` | string (path) | [base.py#L124–L131](../../../compass/plugin/one_shot/base.py) | +| `data_type_short_desc` | string | [base.py#L483](../../../compass/plugin/one_shot/base.py#L483) | +| `query_templates` | list | [base.py#L217–L240](../../../compass/plugin/one_shot/base.py#L217) | +| `website_keywords` | dict | [base.py#L281–L338](../../../compass/plugin/one_shot/base.py#L281) | +| `heuristic_keywords` | dict or `true` | [base.py#L340–L390](../../../compass/plugin/one_shot/base.py#L340); [base.py#L512](../../../compass/plugin/one_shot/base.py#L512) | +| `collection_prompts` | list or `true` | [base.py#L413–L436](../../../compass/plugin/one_shot/base.py#L413) | +| `text_extraction_prompts` | list or `true` | [base.py#L438–L468](../../../compass/plugin/one_shot/base.py#L438) | +| `extraction_system_prompt` | string | [base.py#L476–L488](../../../compass/plugin/one_shot/base.py#L476) | +| `cache_llm_generated_content` | bool | [base.py#L107–L117](../../../compass/plugin/one_shot/base.py#L107) | + +**For the complete list of all configuration options (including `allow_multi_doc_extraction` and any future additions), consult the docstring of [`create_schema_based_one_shot_extraction_plugin()`](../../../compass/plugin/one_shot/base.py#L51).** ## Required `heuristic_keywords` shape