From d4d580b06bed34f4bc7c88b608ff837ad14d97be Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 17 Mar 2026 22:16:59 +0000
Subject: [PATCH 1/2] Initial plan

From 38f7a724637c5bec2aacaf9277128b028f492a04 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 17 Mar 2026 22:24:19 +0000
Subject: [PATCH 2/2] Fix all review comments in skills documentation

Co-authored-by: bpulluta <115118857+bpulluta@users.noreply.github.com>
---
 .github/skills/extraction-run/SKILL.md | 44 ++++++++++++++------------
 .github/skills/web-scraper/SKILL.md    | 20 +++++++-----
 .github/skills/yaml-setup/SKILL.md     | 14 +++++---
 3 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/.github/skills/extraction-run/SKILL.md b/.github/skills/extraction-run/SKILL.md
index 00be8f92..a356eea2 100644
--- a/.github/skills/extraction-run/SKILL.md
+++ b/.github/skills/extraction-run/SKILL.md
@@ -56,15 +56,17 @@ only a schema JSON, a plugin YAML, and a run config — no Python source changes

 New technology assets start in `examples/` and finish in `compass/extraction/`:

-1. **Develop** — place all assets in `examples/one_shot_schema_extraction_/`
+1. **Develop** — place all assets in `examples/one_shot_schema_extraction/`
 2. **Stabilize** — iterate schema/plugin until smoke and robustness gates pass
 3. **Promote** — copy the three finalized files into `compass/extraction//`:
 - `_schema.json`
 - `_plugin_config.yaml`
 - `_config.json5` (optional; useful as a reference run config)
+ - `__init__.py` — registers the plugin via `create_schema_based_one_shot_extraction_plugin`

-The promoted extraction folder contains only config files — no Python code is
-needed for one-shot techs.
+ After creating the package, add an import in `compass/extraction/__init__.py`
+ to register the plugin at startup. See `compass/extraction/ghp/__init__.py`
+ for a reference implementation.

 ## Required inputs

@@ -78,10 +80,10 @@ needed for one-shot techs.
 - Jurisdiction CSV has headers `County,State`.
 - `out_dir` is unique for this run.
 - At least one acquisition step is enabled:
- `perform_se_search: true`, `perform_website_search: true`,
- `known_doc_urls`, or `known_local_docs`.
+ `perform_se_search: true`, `perform_website_search: true`,
+ `known_doc_urls`, or `known_local_docs`.
 - If `heuristic_keywords` exists, all four required lists are present and
- non-empty.
+ non-empty.

 ## Naming convention

@@ -106,7 +108,7 @@ to deterministic mode only when search infrastructure is unstable:
 2. Use your preferred configured search engine.
 3. Load `.env` into shell (`set -a && source .env && set +a`).
 4. Run with verbose logs:
- - `pixi run compass process -c config.json5 -p plugin.yaml -v`
+ - `pixi run compass process -c config.json5 -p plugin.yaml -v`
 5. Confirm output artifacts exist before tuning schema semantics.

 Fallback mode when needed:
@@ -144,11 +146,11 @@ pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v
 ## Phase-gated workflow

 1. **Smoke test (1 jurisdiction)**
- - Goal: verify wiring and output contract.
+ - Goal: verify wiring and output contract.
 2. **Robustness (5 jurisdictions)**
- - Goal: verify feature stability and edge-case handling.
+ - Goal: verify feature stability and edge-case handling.
 3. **Scale (full set)**
- - Goal: only after earlier phases pass acceptance gates.
+ - Goal: only after earlier phases pass acceptance gates.

 ## Validation checklist

@@ -194,7 +196,7 @@ Check in order:
 1. `outputs/*/cleaned_text/*.txt` (text extraction present)
 2. `outputs/*/jurisdiction_dbs/*.csv` (per-jurisdiction parsed rows)
 3. `outputs/*/quantitative_ordinances.csv` and
- `outputs/*/qualitative_ordinances.csv` (final compiled results)
+ `outputs/*/qualitative_ordinances.csv` (final compiled results)

 Treat the run as **failed for extraction quality** when either is true:
 - `Number of jurisdictions with extracted data: 0`
@@ -207,20 +209,20 @@ Only treat a run as passing when both are true:
 ## Root-cause triage

 - **Wrong or noisy documents**
- - Tune query templates, URL keywords, and exclusions.
- - Prefer `known_doc_urls` while stabilizing.
+ - Tune query templates, URL keywords, and exclusions.
+ - Prefer `known_doc_urls` while stabilizing.
 - **Right documents, wrong fields**
- - Tune schema descriptions/examples and ambiguity rules.
- - Check `extraction_system_prompt` in plugin YAML — it is the primary
- guard against scope bleed from generic legal documents.
+ - Tune schema descriptions/examples and ambiguity rules.
+ - Check `extraction_system_prompt` in plugin YAML — it is the primary
+ guard against scope bleed from generic legal documents.
 - **Correct values, unstable formatting**
- - Tighten enums, unit vocabulary, and null behavior.
+ - Tighten enums, unit vocabulary, and null behavior.
 - **Nothing downloaded / unstable search**
- - Disable live search and use deterministic known URLs/local docs.
+ - Disable live search and use deterministic known URLs/local docs.
 - **0 documents found for a jurisdiction during website crawl**
- - Expected for jurisdictions with few online ordinances. The website
- crawl is a second acquisition pass after search-engine retrieval;
- 0 results there is not a pipeline failure.
+ - Expected for jurisdictions with few online ordinances. The website
+ crawl is a second acquisition pass after search-engine retrieval;
+ 0 results there is not a pipeline failure.

 ## Acceptance gates

diff --git a/.github/skills/web-scraper/SKILL.md b/.github/skills/web-scraper/SKILL.md
index 27a3fa37..05a078f0 100644
--- a/.github/skills/web-scraper/SKILL.md
+++ b/.github/skills/web-scraper/SKILL.md
@@ -30,9 +30,12 @@ When using this skill, return:

 ## Canonical reference

-Consult example plugin configurations in `examples/` following the tech-first naming pattern:
-- `_plugin_config.yaml` — standard one-shot config
-- See `examples/water_rights_demo/one-shot/plugin_config.yaml` for multi-document edge cases
+Consult example plugin configurations in `examples/`:
+- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard one-shot config
+- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-document edge cases
+
+When creating new tech configs, use `_plugin_config.yaml` as a recommended
+naming convention (e.g. `geothermal_plugin_config.yaml`).

 ## Scope

@@ -49,7 +52,8 @@ COMPASS runs two sequential acquisition passes per jurisdiction:
 ordinance documents.
 2. **Website crawl phase** — crawls the jurisdiction's official website,
 ranking pages using `website_keywords`. This phase is a secondary pass
- and runs even if the SE phase found documents.
+ and runs only if the search-engine phase did not yield an ordinance
+ context.

 Key behaviors:
 - Playwright browser errors during the website crawl phase are **non-fatal**.
@@ -123,10 +127,10 @@ Then validate:
 3. Run heuristic filter and review false rejects/accepts (`cleaned_text/`).
 4. Check website crawl phase independently if needed (enable, run, inspect logs).
 5. Update one axis only:
- - query templates (affects SE phase),
- - URL weights (affects both phases),
- - include/exclude heuristic patterns (pre-LLM filter),
- - `NOT_TECH_WORDS` (upstream document rejection).
+ - query templates (affects SE phase),
+ - URL weights (affects both phases),
+ - include/exclude heuristic patterns (pre-LLM filter),
+ - `NOT_TECH_WORDS` (upstream document rejection).
 6. Re-run same sample and compare.

 ## Cross-tech onboarding
diff --git a/.github/skills/yaml-setup/SKILL.md b/.github/skills/yaml-setup/SKILL.md
index af2a82e5..1502085c 100644
--- a/.github/skills/yaml-setup/SKILL.md
+++ b/.github/skills/yaml-setup/SKILL.md
@@ -33,10 +33,15 @@ When using this skill, return:

 ## Canonical reference

-With tech-first naming, configuration examples follow this pattern:
-- `examples/one_shot_schema_extraction/_plugin_config.yaml` — standard working example
+Consult the working examples in `examples/`:
+- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard working example
 - `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-doc edge case

+When creating new tech configs, `_plugin_config.yaml` is the recommended
+naming convention (e.g. `geothermal_plugin_config.yaml`). The existing
+`plugin_config.yaml` examples use a generic name; new tech-specific assets
+should use the tech-first naming pattern.
+
 Refer to any complete example in `examples/` that matches your retrieval goals.

 ## Naming convention
@@ -78,7 +83,7 @@ schema: ./my_schema.json
 | `collection_prompts` | list or `true` | Text collection prompt(s). If **`true`**, LLM auto-generates from schema. |
 | `text_extraction_prompts` | list or `true` | Text consolidation prompt(s). If **`true`**, LLM auto-generates from schema. |
 | `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. Use this to scope extraction tightly to the target technology. |
-| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates` and `website_keywords`. Set to `false` when iterating schema to see live changes. |
+| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates`, `website_keywords`, and `heuristic_keywords`. Set to `false` when iterating schema to see live changes. |

 ## Required `heuristic_keywords` shape

@@ -122,8 +127,7 @@ extraction_system_prompt: |-
 Prefer explicit values. Use null for qualitative obligations.
 ```

-See `compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml`
-for a complete example.
+See `compass/extraction/ghp/plugin_config.yaml` for a complete example.

 ## Progressive config path