datacircuits/prompt-dissector

prompt-dissector

Python pipeline for parsing and analyzing AI steering documents (system prompts). Extracts sections, counts deontic operators, maps corrective scars, identifies internal conflicts, and tracks concept frequency. Powers the quantitative analysis behind datacircuits.org.

Quick start

make install                  # pip install -e "analysis/[dev]"
make test                     # 20 tests
make analyze                  # run pipeline on the example specimen

Requires Python 3.10+.

What it does

The pipeline takes a raw system prompt text file and produces structured JSON across seven analytical axes:

Axis                Method                                   Output
Budget allocation   Word counts per section                  sections.json
Category grouping   Human-curated YAML mapping               sections.json
Modality profile    Regex-based deontic operator density     modality.json
Concept recurrence  Lexicon-based frequency matching         concepts.json
Scar inference      Human-curated corrective clause mapping  scars.json
Conflict detection  Human-curated directive tension pairs    conflicts.json
Summary stats       Aggregated from above                    summary.json

Automated stages (parsing, modality, concepts) run without curation. Interpretive stages (scars, conflicts) load from human-edited YAML and are validated against the parsed specimen.
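
To make the two fully automated measures concrete, here is a minimal sketch of regex-based modality density and lexicon-based concept counting. It is an illustration, not the package's implementation: the operator list, the lexicon shape, and the function names modality_density and concept_counts are all assumptions, since the project's actual patterns and lexicon are not shown here.

```python
import re
from collections import Counter

# Hypothetical operator list; the pipeline's real pattern set may differ.
# Multi-word operators come first so "must not" is counted once, not as "must".
DEONTIC_PATTERN = re.compile(
    r"\b(must not|must|never|always|should not|should|may not|may|cannot|can)\b",
    re.IGNORECASE,
)


def modality_density(text: str) -> float:
    """Return deontic operators per 100 whitespace-separated words."""
    words = text.split()
    if not words:
        return 0.0
    hits = DEONTIC_PATTERN.findall(text)
    return 100.0 * len(hits) / len(words)


def concept_counts(text: str, lexicon: dict[str, list[str]]) -> dict[str, int]:
    """Count whole-word, case-insensitive matches of each concept's terms.

    `lexicon` maps a concept name to the terms that signal it
    (an assumed shape, chosen for illustration).
    """
    lowered = text.lower()
    counts: Counter[str] = Counter()
    for concept, terms in lexicon.items():
        for term in terms:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    return dict(counts)
```

Whole-word matching keeps "safe" from being counted inside "safety", which matters when a lexicon lists both a stem and its derivatives.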

Repository layout

analysis/                     Python package (steering_doc_reader)
  steering_doc_reader/        Source modules
  tests/                      20 tests (pytest)
  pyproject.toml              Package metadata, deps, entry point
data/
  specimens/                  Raw steering documents (one per file)
  curated/                    Human-edited YAML (categories, concepts, scars, conflicts)
  derived/                    Auto-generated JSON (pipeline output)
docs/
  methodology.md              Seven-axis analytical framework
  contributing.md             How to contribute specimens and annotations
  analyzing_a_new_specimen.md Step-by-step guide for new analyses
scripts/
  sync_site_data.py           Converts derived JSON → JS module for Astro site

Usage

CLI

steering-doc-reader specimen.txt \
  --curated-dir data/curated \
  --output data/derived \
  --verbose

Library

from steering_doc_reader import Specimen, analyze

specimen = Specimen.load("prompt.txt", categories_file="curated/categories.yaml")
report = analyze(specimen, curated_dir="curated/")

for section in specimen:
    print(f"{section.name} ({section.category}): {section.words} words")

for m in report.modality_scores:
    print(f"{m.section_name}: {m.density_per_100w:.1f} ops/100w")

report.to_json("output/")

Analyzing a new specimen

  1. Drop a system prompt text file into data/specimens/
  2. Run make analyze with no curation — check that parsing works
  3. Add section categories to data/curated/categories.yaml
  4. Curate scars and conflicts in YAML (the hard, interpretive part)
  5. Re-run the pipeline, inspect the JSON

See docs/analyzing_a_new_specimen.md for the full guide.

Design decisions

YAML for curation. Scar and conflict annotations are interpretive work done by humans. YAML suits structured data with long prose values because block scalars keep multi-line text readable without escaping.

Validate, don't auto-infer. Scar inference cannot be automated responsibly. The pipeline helps curators avoid mistakes (typos in section names, non-canonical clusters) without replacing their judgment.
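
As an illustration of the validate-only approach, the sketch below checks that each curated entry names a section that actually exists in the parsed specimen and suggests a close match when it does not. The entry shape (a dict with a "section" key) and the function name validate_section_refs are hypothetical; the pipeline's real validator may work differently.

```python
from difflib import get_close_matches


def validate_section_refs(curated_entries, known_sections):
    """Return error strings for entries whose 'section' is not a parsed section.

    Assumes each curated entry is a dict with a 'section' key
    (a hypothetical shape chosen for this example).
    """
    known = list(known_sections)
    errors = []
    for index, entry in enumerate(curated_entries):
        name = entry.get("section")
        if name in known:
            continue
        # Suggest the nearest known name so a typo is easy to fix by hand.
        suggestion = get_close_matches(str(name), known, n=1)
        hint = f" (did you mean {suggestion[0]!r}?)" if suggestion else ""
        errors.append(f"entry {index}: unknown section {name!r}{hint}")
    return errors
```

The check only reports problems. It never rewrites the annotation, so the decision about what the curator meant stays with the curator.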

Two parse formats. Specimens arrive in one of two shapes: XML-tagged sections (<section>...</section>) or ALL-CAPS headers. The parser detects which one it has, with no configuration.
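
One way such auto-detection could work is sketched below: if the text contains a matched pair of XML-style tags, treat it as XML-tagged; otherwise fall back to ALL-CAPS header lines. The regexes, the fallback rule, and the names detect_format and split_sections are assumptions made for illustration, and the sketch does not handle nested tags.

```python
import re

# A matched open/close tag pair, e.g. <tone>...</tone> (nested tags not handled).
XML_SECTION = re.compile(r"<([A-Za-z_][\w-]*)>(.*?)</\1>", re.DOTALL)
# A line made only of capitals, digits, spaces, and a few separators.
CAPS_HEADER = re.compile(r"^[A-Z][A-Z0-9 &/-]{2,}$")


def detect_format(text: str) -> str:
    """Return 'xml' if any matched tag pair is present, else 'caps'."""
    return "xml" if XML_SECTION.search(text) else "caps"


def split_sections(text: str) -> list[tuple[str, str]]:
    """Split a specimen into (name, body) pairs using the detected format."""
    if detect_format(text) == "xml":
        return [(m.group(1), m.group(2).strip()) for m in XML_SECTION.finditer(text)]

    sections: list[tuple[str, str]] = []
    name, lines = None, []
    for line in text.splitlines():
        stripped = line.strip()
        if CAPS_HEADER.match(stripped):
            # A new header closes the previous section, if any.
            if name is not None:
                sections.append((name, "\n".join(lines).strip()))
            name, lines = stripped, []
        elif name is not None:
            lines.append(line)
    if name is not None:
        sections.append((name, "\n".join(lines).strip()))
    return sections
```

With sections split this way, a per-section word count is just len(body.split()) for each pair.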

License

Code: MIT. Essays, documentation, and curated YAML: CC-BY-4.0.

About

Companion code for "The Documents That Govern the Models" @ datacircuits.org
