Generate synthetic document sets from domain-specific configuration with traceable evidence.
This is not yet published to PyPI. You are welcome to test it out. All the code has been written by AI as this is an experimental project. I might graduate it to a more official status later.
synthdocs init myproject
cd myproject
This creates four configuration files and an examples/ folder.
Edit the configuration files to describe your document domain:
domain_description.md: Describe the domain narrative
- What kind of cases exist and why documents are created
- Typical case flow from start to finish
- How documents look (sections, tone, terminology)
domain-variables.yaml: Define case facts for randomization
case_facts:
  - name: diagnosis
    description: Primary diagnosis for the case.
    values:
      - value: Depression
      - value: Anxiety
    selection:
      mode: single
metadata.yaml: Control document types and quantities
documents_per_case:
  min: 3
  max: 5
doc_type_mix:
  - type: medical_report
    description: Formal report from a medical professional.
    weight: 3
fact-spec.yaml: Specify facts to extract with evidence
facts:
  - name: primary_diagnosis
    type: enum
    values: [Depression, Anxiety]
Place sample PDFs in examples/ to help the system learn document patterns.
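One way to read the documents_per_case and doc_type_mix settings in metadata.yaml is as a per-case random draw: pick a document count between min and max, then pick each document's type in proportion to its weight. The sketch below illustrates that reading with the standard library only. It is an assumption about how the weights behave, not synthdocs' actual sampler, and plan_case_documents and the discharge_summary type are hypothetical names used for illustration.

```python
import random

# Mirrors the metadata.yaml example as a plain dict.
# "discharge_summary" is a hypothetical second type added for illustration.
metadata = {
    "documents_per_case": {"min": 3, "max": 5},
    "doc_type_mix": [
        {"type": "medical_report", "weight": 3},
        {"type": "discharge_summary", "weight": 1},
    ],
}


def plan_case_documents(metadata, rng):
    """Return a list of document types for one case (hypothetical helper)."""
    bounds = metadata["documents_per_case"]
    count = rng.randint(bounds["min"], bounds["max"])  # inclusive on both ends
    types = [entry["type"] for entry in metadata["doc_type_mix"]]
    weights = [entry["weight"] for entry in metadata["doc_type_mix"]]
    # random.choices samples with replacement, proportionally to the weights.
    return rng.choices(types, weights=weights, k=count)


rng = random.Random(42)  # fixed seed so the draw is reproducible
doc_types = plan_case_documents(metadata, rng)
print(doc_types)
```

Under this reading, a weight of 3 against a weight of 1 means roughly three medical reports for every discharge summary across many cases, while any single case may deviate.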
from synthdocs import SynthDocs
sd = SynthDocs()
plan_path = sd.generate_plan("myproject")
This creates a timestamped folder under plans/ containing:
- Normalized YAML configuration
- .prompty templates for each document type
- Suggested items (marked approved: false)
Open the plan YAML files and set approved: true on items you want to include.
Alternatively, pass approve_all=True to skip manual approval during rapid prototyping.
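For selective approval without opening each file by hand, the approved flag can be flipped programmatically. The sketch below works on plain text with the standard library, so it needs no YAML parser. The plan layout it uses (a suggested_items list with name and approved keys) is an assumption made for illustration, as is the approve_items helper; only the approved: false marker comes from the plan description.

```python
import re

# Hypothetical excerpt of a plan YAML file; the real layout may differ.
plan_text = """\
suggested_items:
  - name: referral_letter
    approved: false
  - name: lab_result
    approved: false
"""


def approve_items(text, names):
    """Flip approved: false to approved: true for the named items only."""
    updated_lines = []
    current = None  # name of the item the current line belongs to
    for line in text.splitlines():
        match = re.match(r"\s*-\s*name:\s*(\S+)", line)
        if match:
            current = match.group(1)
        elif current in names and re.match(r"\s*approved:\s*false\s*$", line):
            line = line.replace("false", "true")
        updated_lines.append(line)
    return "\n".join(updated_lines) + "\n"


updated = approve_items(plan_text, {"lab_result"})
print(updated)
```

A line-based rewrite like this keeps comments and key order intact, which a parse-and-dump round trip through a YAML library may not.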
result = sd.generate_documents("myproject", plan_path=plan_path, num_cases=5)
myproject/
  plans/2026-01-22-001/
    metadata.yaml
    fact-spec.yaml
    document_types/*.prompty
  output/2026-01-22-001/
    case_0001/
      document_001.md
      document_001.evidence.yaml
      metadata.yaml
Evidence files contain line and span pointers for each fact mention:
facts:
  - fact_name: diagnosis
    value: Depression
    line_start: 12
    span_start: 45
    span_end: 55
from synthdocs import SynthDocs
sd = SynthDocs()
# Generate plan from project config
plan_path = sd.generate_plan("myproject")
# Generate documents (pass approve_all=True to skip manual approval)
result = sd.generate_documents("myproject", plan_path=plan_path, num_cases=10)
# List generated cases and facts
cases = sd.list_cases("myproject", plan_path=plan_path)
facts = sd.list_facts("myproject", plan_path=plan_path)
# Export as archive
archive = sd.export_archive("myproject", plan_path=plan_path, format="zip")
Set MISTRAL_API_KEY for LLM generation and PDF OCR.
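The line and span pointers in the evidence files can be checked against the generated document text. The sketch below shows one way to do that, assuming line_start is 1-based and span_start/span_end are 0-based, end-exclusive character offsets within that line. That convention is a guess (it fits the 10-character value "Depression" in the evidence example), and resolve_evidence is a hypothetical helper, not part of synthdocs.

```python
def resolve_evidence(document_text, pointer):
    """Return the text an evidence pointer refers to.

    Assumes line_start is 1-based and span_start/span_end are 0-based,
    end-exclusive character offsets within that line.
    """
    lines = document_text.splitlines()
    line = lines[pointer["line_start"] - 1]
    return line[pointer["span_start"]:pointer["span_end"]]


# Build a small stand-in document whose 12th line mentions the diagnosis.
filler = ["(filler line)"] * 11
prefix = "The patient was assessed and diagnosed with: "  # 45 characters
document_text = "\n".join(filler + [prefix + "Depression on first visit."])

pointer = {
    "fact_name": "diagnosis",
    "value": "Depression",
    "line_start": 12,
    "span_start": 45,
    "span_end": 55,
}

mention = resolve_evidence(document_text, pointer)
print(mention == pointer["value"])
```

Comparing the resolved text with the recorded value is a cheap consistency check to run over every evidence file before using a generated set for evaluation.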