Python pipeline for parsing and analyzing AI steering documents (system prompts). Extracts sections, counts deontic operators, maps corrective scars, identifies internal conflicts, and tracks concept frequency. Powers the quantitative analysis behind datacircuits.org.
make install # pip install -e "analysis/[dev]"
make test # 20 tests
make analyze # run pipeline on the example specimen

Requires Python 3.10+.
The pipeline takes a raw system prompt text file and produces structured JSON across seven analytical axes:
| Axis | Method | Output |
|---|---|---|
| Budget allocation | Word counts per section | sections.json |
| Category grouping | Human-curated YAML mapping | sections.json |
| Modality profile | Regex-based deontic operator density | modality.json |
| Concept recurrence | Lexicon-based frequency matching | concepts.json |
| Scar inference | Human-curated corrective clause mapping | scars.json |
| Conflict detection | Human-curated directive tension pairs | conflicts.json |
| Summary stats | Aggregated from above | summary.json |
Automated stages (parsing, modality, concepts) run without curation. Interpretive stages (scars, conflicts) load from human-edited YAML and are validated against the parsed specimen.
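As an illustration of the modality axis, here is a minimal sketch of regex-based deontic operator density. The operator list and the `density_per_100w` helper are assumptions made for this example, not the package's actual lexicon or API; only the "operators per 100 words" unit comes from the pipeline's output.

```python
import re

# Hypothetical operator list; the project's real lexicon may differ.
# Longer alternatives come first so "must not" is counted once, not as "must".
DEONTIC_PATTERN = re.compile(
    r"\b(must not|must|should not|should|shall|may not|may|never|always)\b",
    re.IGNORECASE,
)


def density_per_100w(text: str) -> float:
    """Return the number of deontic operator matches per 100 words."""
    words = text.split()
    if not words:
        return 0.0
    hits = DEONTIC_PATTERN.findall(text)
    return 100.0 * len(hits) / len(words)


sample = "You must cite sources. You should never guess. Answers may be brief."
print(f"{density_per_100w(sample):.1f} ops/100w")  # 4 operators in 12 words
```

Normalizing by word count lets a short, dense section be compared with a long, permissive one.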
analysis/                        Python package (steering_doc_reader)
  steering_doc_reader/           Source modules
  tests/                         20 tests (pytest)
  pyproject.toml                 Package metadata, deps, entry point
data/
  specimens/                     Raw steering documents (one per file)
  curated/                       Human-edited YAML (categories, concepts, scars, conflicts)
  derived/                       Auto-generated JSON (pipeline output)
docs/
  methodology.md                 Seven-axis analytical framework
  contributing.md                How to contribute specimens and annotations
  analyzing_a_new_specimen.md    Step-by-step guide for new analyses
scripts/
  sync_site_data.py              Converts derived JSON → JS module for Astro site
steering-doc-reader specimen.txt \
--curated-dir data/curated \
--output data/derived \
--verbose

from steering_doc_reader import Specimen, analyze
specimen = Specimen.load("prompt.txt", categories_file="curated/categories.yaml")
report = analyze(specimen, curated_dir="curated/")
for section in specimen:
    print(f"{section.name} ({section.category}): {section.words} words")
for m in report.modality_scores:
    print(f"{m.section_name}: {m.density_per_100w:.1f} ops/100w")
report.to_json("output/")

- Drop a system prompt text file into data/specimens/
- Run make analyze with no curation; check that parsing works
- Add section categories to data/curated/categories.yaml
- Curate scars and conflicts in YAML (the hard, interpretive part)
- Re-run the pipeline, inspect the JSON
See docs/analyzing_a_new_specimen.md for the full guide.
YAML for curation. Scar and conflict annotations are interpretive work by humans. YAML is the right format for structured data with long prose values.
Validate, don't auto-infer. Scar inference cannot be automated responsibly. The pipeline helps curators avoid mistakes (typos in section names, non-canonical clusters) without replacing their judgment.
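The kind of check described above can be sketched as follows. This is an illustrative example only: `check_section_refs` and the sample section names are hypothetical, and the package's real validation may work differently. It flags curated section names that are missing from the parsed specimen and suggests the closest match, leaving the decision to the curator.

```python
import difflib


def check_section_refs(referenced: list[str], parsed: list[str]) -> list[str]:
    """Return one message per curated section name missing from the parsed specimen."""
    known = set(parsed)
    errors = []
    for name in referenced:
        if name in known:
            continue
        # Suggest the closest parsed name so a typo is easy to spot and fix.
        close = difflib.get_close_matches(name, parsed, n=1)
        hint = f" (did you mean {close[0]!r}?)" if close else ""
        errors.append(f"unknown section {name!r}{hint}")
    return errors


# Hypothetical data: names parsed from a specimen vs. names used in curated YAML.
parsed_sections = ["tone_and_formatting", "safety", "tool_use"]
curated_refs = ["safety", "tone_and_formating"]
for message in check_section_refs(curated_refs, parsed_sections):
    print(message)
```

The check reports problems and never rewrites the annotation, which keeps the interpretive work with the human curator.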
Two parse formats. Specimens arrive in different shapes — XML-tagged (<section>...</section>) or ALL-CAPS headers. The parser auto-detects with no configuration.
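One way such auto-detection could work is sketched below. The two signals used here (a paired XML-style tag, or a line consisting only of capitals) and the `detect_format` helper are assumptions for illustration; the actual parser's heuristics are not documented here and may differ.

```python
import re

# Assumed signal 1: an opening tag with a matching closing tag somewhere after it.
XML_TAG = re.compile(r"<([A-Za-z_][\w-]*)>.*?</\1>", re.DOTALL)
# Assumed signal 2: a whole line of capitals, digits, spaces and a few separators.
CAPS_HEADER = re.compile(r"^[A-Z][A-Z0-9 _&/-]{2,}$", re.MULTILINE)


def detect_format(text: str) -> str:
    """Guess the specimen layout: 'xml', 'caps', or 'unknown'."""
    if XML_TAG.search(text):
        return "xml"
    if CAPS_HEADER.search(text):
        return "caps"
    return "unknown"


print(detect_format("<safety>\nBe careful.\n</safety>"))
print(detect_format("SAFETY RULES\nBe careful.\n"))
```

Checking for XML tags first means a tagged specimen that happens to contain an all-caps line is still treated as XML-tagged.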
Code: MIT. Essays, documentation, and curated YAML: CC-BY-4.0.