44 changes: 23 additions & 21 deletions .github/skills/extraction-run/SKILL.md
@@ -56,15 +56,17 @@ only a schema JSON, a plugin YAML, and a run config — no Python source changes

New technology assets start in `examples/` and finish in `compass/extraction/`:

-1. **Develop** — place all assets in `examples/one_shot_schema_extraction_<tech>/`
+1. **Develop** — place all assets in `examples/one_shot_schema_extraction/`
2. **Stabilize** — iterate schema/plugin until smoke and robustness gates pass
3. **Promote** — copy the three finalized files into `compass/extraction/<tech>/`:
- `<tech>_schema.json`
- `<tech>_plugin_config.yaml`
- `<tech>_config.json5` (optional; useful as a reference run config)
+- `__init__.py` — registers the plugin via `create_schema_based_one_shot_extraction_plugin`

-The promoted extraction folder contains only config files — no Python code is
-needed for one-shot techs.
+After creating the package, add an import in `compass/extraction/__init__.py`
+to register the plugin at startup. See `compass/extraction/ghp/__init__.py`
+for a reference implementation.
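
The file-copy part of the promote step could be sketched as below. This is a minimal illustration under stated assumptions, not part of COMPASS: `promote_assets` is a hypothetical helper, and it assumes the assets in the examples folder already carry the `<tech>_` prefix. The `__init__.py` that registers the plugin still has to be written separately.

```python
from pathlib import Path
import shutil


def promote_assets(tech: str, src_dir: Path, dest_root: Path) -> list:
    """Copy finalized <tech>_* assets into <dest_root>/<tech>/ (hypothetical helper)."""
    required = [f"{tech}_schema.json", f"{tech}_plugin_config.yaml"]
    optional = [f"{tech}_config.json5"]  # reference run config, may be absent

    # Refuse to promote a half-finished asset set.
    missing = [name for name in required if not (src_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required assets: {missing}")

    dest = dest_root / tech
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for name in required + optional:
        src = src_dir / name
        if src.is_file():
            copied.append(Path(shutil.copy2(src, dest / name)))
    return copied
```

Calling `promote_assets("geothermal", Path("examples/one_shot_schema_extraction"), Path("compass/extraction"))` would copy whichever of the three files exist and raise if the schema or plugin config is missing.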

## Required inputs

@@ -78,10 +80,10 @@ needed for one-shot techs.
- Jurisdiction CSV has headers `County,State`.
- `out_dir` is unique for this run.
- At least one acquisition step is enabled:
`perform_se_search: true`, `perform_website_search: true`,
`known_doc_urls`, or `known_local_docs`.
- If `heuristic_keywords` exists, all four required lists are present and
non-empty.
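
The checks listed above could be automated along these lines. This is a sketch under assumptions: `preflight_errors` is a hypothetical helper, "unique" is interpreted as "the directory does not exist yet", `heuristic_keywords` is assumed to live in the plugin config as a mapping, and because the four required list names are not spelled out here, only their count and non-emptiness are checked.

```python
from pathlib import Path

ACQUISITION_KEYS = (
    "perform_se_search",
    "perform_website_search",
    "known_doc_urls",
    "known_local_docs",
)


def preflight_errors(run_config: dict, csv_header: list, plugin_config: dict) -> list:
    """Return human-readable pre-flight problems; an empty list means go."""
    errors = []

    if [h.strip() for h in csv_header] != ["County", "State"]:
        errors.append("jurisdiction CSV headers must be County,State")

    # Assumption: a unique out_dir is one that does not exist yet.
    out_dir = run_config.get("out_dir")
    if not out_dir or Path(out_dir).exists():
        errors.append("out_dir must be set and must not already exist")

    # A flag counts when true; a URL/doc collection counts when non-empty.
    if not any(run_config.get(key) for key in ACQUISITION_KEYS):
        errors.append("enable at least one acquisition step")

    keywords = plugin_config.get("heuristic_keywords")
    if isinstance(keywords, dict):
        non_empty = [v for v in keywords.values() if isinstance(v, list) and v]
        if len(keywords) < 4 or len(non_empty) != len(keywords):
            errors.append("heuristic_keywords needs four non-empty lists")

    return errors
```

Returning a list of messages rather than raising on the first problem lets one run surface every misconfiguration at once.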

## Naming convention

@@ -106,7 +108,7 @@ to deterministic mode only when search infrastructure is unstable:
2. Use your preferred configured search engine.
3. Load `.env` into shell (`set -a && source .env && set +a`).
4. Run with verbose logs:
- `pixi run compass process -c config.json5 -p plugin.yaml -v`
5. Confirm output artifacts exist before tuning schema semantics.

Fallback mode when needed:
@@ -144,11 +146,11 @@ pixi run compass process -c config.json5 -p path/to/plugin_config.yaml -v
## Phase-gated workflow

1. **Smoke test (1 jurisdiction)**
- Goal: verify wiring and output contract.
2. **Robustness (5 jurisdictions)**
- Goal: verify feature stability and edge-case handling.
3. **Scale (full set)**
- Goal: only after earlier phases pass acceptance gates.
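
One way to produce the per-phase jurisdiction files is to slice the full CSV. The sketch below is illustrative only: `write_phase_csv` and `PHASE_SIZES` are hypothetical names, and taking the first N rows is an assumption, since the phases above only fix the sample sizes.

```python
import csv
from pathlib import Path

# Sample sizes from the three phases; None means the full set.
PHASE_SIZES = {"smoke": 1, "robustness": 5, "scale": None}


def write_phase_csv(source: Path, phase: str, dest: Path) -> int:
    """Write the first N jurisdictions for a phase; return the row count."""
    limit = PHASE_SIZES[phase]
    with source.open(newline="") as fh:
        rows = list(csv.DictReader(fh))

    selected = rows if limit is None else rows[:limit]

    with dest.open("w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["County", "State"], extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(selected)
    return len(selected)
```

Keeping the smoke jurisdiction inside the robustness sample makes it easy to confirm that a fix for one phase did not regress the earlier one.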

## Validation checklist

@@ -194,7 +196,7 @@ Check in order:
1. `outputs/*/cleaned_text/*.txt` (text extraction present)
2. `outputs/*/jurisdiction_dbs/*.csv` (per-jurisdiction parsed rows)
3. `outputs/*/quantitative_ordinances.csv` and
`outputs/*/qualitative_ordinances.csv` (final compiled results)
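
The ordered check above can be scripted with `pathlib` globbing. This is a minimal sketch: `first_missing_artifact` is a hypothetical helper, and the patterns are the ones listed above, taken relative to the `outputs/` directory.

```python
from pathlib import Path
from typing import Optional

# Same order as the checklist: text, per-jurisdiction rows, compiled results.
ARTIFACT_PATTERNS = [
    "*/cleaned_text/*.txt",
    "*/jurisdiction_dbs/*.csv",
    "*/quantitative_ordinances.csv",
    "*/qualitative_ordinances.csv",
]


def first_missing_artifact(outputs_root: Path) -> Optional[str]:
    """Return the first pattern with no match, or None if all stages produced files."""
    for pattern in ARTIFACT_PATTERNS:
        if not any(outputs_root.glob(pattern)):
            return pattern
    return None
```

Stopping at the first missing stage points at where the pipeline broke: no cleaned text suggests an acquisition problem, while cleaned text without per-jurisdiction rows suggests an extraction problem.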

Treat the run as **failed for extraction quality** when either is true:
- `Number of jurisdictions with extracted data: 0`
@@ -207,20 +209,20 @@ Only treat a run as passing when both are true:
## Root-cause triage

- **Wrong or noisy documents**
- Tune query templates, URL keywords, and exclusions.
- Prefer `known_doc_urls` while stabilizing.
- **Right documents, wrong fields**
- Tune schema descriptions/examples and ambiguity rules.
- Check `extraction_system_prompt` in plugin YAML — it is the primary
guard against scope bleed from generic legal documents.
- **Correct values, unstable formatting**
- Tighten enums, unit vocabulary, and null behavior.
- **Nothing downloaded / unstable search**
- Disable live search and use deterministic known URLs/local docs.
- **0 documents found for a jurisdiction during website crawl**
- Expected for jurisdictions with few online ordinances. The website
crawl is a second acquisition pass after search-engine retrieval;
0 results there is not a pipeline failure.

## Acceptance gates

20 changes: 12 additions & 8 deletions .github/skills/web-scraper/SKILL.md
@@ -30,9 +30,12 @@ When using this skill, return:

## Canonical reference

-Consult example plugin configurations in `examples/` following the tech-first naming pattern:
-- `<tech>_plugin_config.yaml` — standard one-shot config
-- See `examples/water_rights_demo/one-shot/plugin_config.yaml` for multi-document edge cases
+Consult example plugin configurations in `examples/`:
+- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard one-shot config
+- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-document edge cases
+
+When creating new tech configs, use `<tech>_plugin_config.yaml` as a recommended
+naming convention (e.g. `geothermal_plugin_config.yaml`).

## Scope

Expand All @@ -49,7 +52,8 @@ COMPASS runs two sequential acquisition passes per jurisdiction:
ordinance documents.
2. **Website crawl phase** — crawls the jurisdiction's official website,
ranking pages using `website_keywords`. This phase is a secondary pass
-and runs even if the SE phase found documents.
+and runs only if the search-engine phase did not yield an ordinance
+context.
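
The ordering in the revised wording, where the website crawl runs only when the search-engine phase yields no ordinance context, can be sketched as a simple fallback. The function and parameter names here are hypothetical and do not correspond to COMPASS internals; the two phases are passed in as plain callables so the control flow stands on its own.

```python
from typing import Callable, Optional


def acquire_ordinance_context(
    jurisdiction: str,
    search_engine_phase: Callable[[str], Optional[str]],
    website_crawl_phase: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Run the search-engine phase first; crawl the website only if it found nothing."""
    context = search_engine_phase(jurisdiction)
    if context is not None:
        return context
    # Secondary pass: may also return None, which is not a pipeline failure.
    return website_crawl_phase(jurisdiction)
```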

Key behaviors:
- Playwright browser errors during the website crawl phase are **non-fatal**.
Expand Down Expand Up @@ -123,10 +127,10 @@ Then validate:
3. Run heuristic filter and review false rejects/accepts (`cleaned_text/`).
4. Check website crawl phase independently if needed (enable, run, inspect logs).
5. Update one axis only:
- query templates (affects SE phase),
- URL weights (affects both phases),
- include/exclude heuristic patterns (pre-LLM filter),
- `NOT_TECH_WORDS` (upstream document rejection).
6. Re-run same sample and compare.
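
The "one axis only" rule in step 5 can be enforced mechanically before a re-run. The sketch below assumes the tunable settings are available as top-level keys of a plain dict; `changed_axes` is a hypothetical helper, not a COMPASS function.

```python
_MISSING = object()  # distinguishes an absent key from an explicit None


def changed_axes(before: dict, after: dict) -> list:
    """Return the top-level config keys whose values differ between two runs."""
    keys = sorted(set(before) | set(after))
    return [
        key
        for key in keys
        if before.get(key, _MISSING) != after.get(key, _MISSING)
    ]
```

If `changed_axes(previous_config, new_config)` returns more than one key, the comparison in step 6 can no longer attribute a difference in results to a single change.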

## Cross-tech onboarding
14 changes: 9 additions & 5 deletions .github/skills/yaml-setup/SKILL.md
@@ -33,10 +33,15 @@ When using this skill, return:

## Canonical reference

-With tech-first naming, configuration examples follow this pattern:
-- `examples/one_shot_schema_extraction/<tech>_plugin_config.yaml` — standard working example
+Consult the working examples in `examples/`:
+- `examples/one_shot_schema_extraction/plugin_config.yaml` — standard working example
- `examples/water_rights_demo/one-shot/plugin_config.yaml` — multi-doc edge case

+When creating new tech configs, `<tech>_plugin_config.yaml` is the recommended
+naming convention (e.g. `geothermal_plugin_config.yaml`). The existing
+`plugin_config.yaml` examples use a generic name; new tech-specific assets
+should use the tech-first naming pattern.

Refer to any complete example in `examples/` that matches your retrieval goals.

## Naming convention
@@ -78,7 +83,7 @@ schema: ./my_schema.json
| `collection_prompts` | list or `true` | Text collection prompt(s). If **`true`**, LLM auto-generates from schema. |
| `text_extraction_prompts` | list or `true` | Text consolidation prompt(s). If **`true`**, LLM auto-generates from schema. |
| `extraction_system_prompt` | string | Overrides default LLM system prompt for the extraction step. Use this to scope extraction tightly to the target technology. |
-| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates` and `website_keywords`. Set to `false` when iterating schema to see live changes. |
+| `cache_llm_generated_content` | bool | Cache LLM-generated `query_templates`, `website_keywords`, and `heuristic_keywords`. Set to `false` when iterating schema to see live changes. |

## Required `heuristic_keywords` shape

Expand Down Expand Up @@ -122,8 +127,7 @@ extraction_system_prompt: |-
Prefer explicit values. Use null for qualitative obligations.
```

-See `compass/extraction/geothermal_electricity/geothermal_plugin_config.yaml`
-for a complete example.
+See `compass/extraction/ghp/plugin_config.yaml` for a complete example.

## Progressive config path
