soetang/synthdocs

Synthdocs

Generate synthetic document sets from domain-specific configuration with traceable evidence.

Install

Synthdocs is not yet published to PyPI, but you are welcome to try it out. This is an experimental project and all of the code was written by AI. It may graduate to a more official status later.

Quick Start

1. Initialize a Project

synthdocs init myproject
cd myproject

This creates four configuration files and an examples/ folder.

2. Configure Your Domain

Edit the configuration files to describe your document domain:

domain_description.md — Describe the domain narrative

  • What kind of cases exist and why documents are created
  • Typical case flow from start to finish
  • How documents look (sections, tone, terminology)

domain-variables.yaml — Define case facts for randomization

case_facts:
  - name: diagnosis
    description: Primary diagnosis for the case.
    values:
      - value: Depression
      - value: Anxiety
    selection:
      mode: single
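
To make the idea of randomized case facts concrete, here is a minimal sketch of how one value per fact could be drawn for a case. It is not Synthdocs' own implementation: the `sample_case` helper is hypothetical, the configuration is written as a Python literal rather than parsed from YAML, and only the `single` selection mode shown above is handled.

```python
import random

# Mirrors the case_facts entry shown above, written as a Python literal
# so the sketch needs no YAML parser.
case_facts = [
    {
        "name": "diagnosis",
        "description": "Primary diagnosis for the case.",
        "values": [{"value": "Depression"}, {"value": "Anxiety"}],
        "selection": {"mode": "single"},
    }
]


def sample_case(facts, rng):
    """Pick one value per fact (hypothetical helper, 'single' mode only)."""
    case = {}
    for fact in facts:
        mode = fact["selection"]["mode"]
        if mode != "single":
            raise ValueError(f"unsupported selection mode: {mode}")
        case[fact["name"]] = rng.choice(fact["values"])["value"]
    return case


# A seeded generator keeps the draw reproducible between runs.
sampled = sample_case(case_facts, random.Random(0))
print(sampled)
```

Passing the random generator in explicitly makes it easy to reproduce a given set of cases.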

metadata.yaml — Control document types and quantities

documents_per_case:
  min: 3
  max: 5

doc_type_mix:
  - type: medical_report
    description: Formal report from a medical professional.
    weight: 3
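
One plausible reading of these settings is that each case gets between `min` and `max` documents, with types drawn in proportion to `weight`. The sketch below illustrates that reading only; treating `weight` as a relative sampling weight is an assumption, and both the `plan_case_documents` helper and the second document type are hypothetical.

```python
import random

# The settings above as a Python literal. "referral_letter" is a
# hypothetical second type, added only so the weights have an effect.
metadata = {
    "documents_per_case": {"min": 3, "max": 5},
    "doc_type_mix": [
        {"type": "medical_report", "weight": 3},
        {"type": "referral_letter", "weight": 1},
    ],
}


def plan_case_documents(meta, rng):
    """Choose a document count within bounds, then draw types by weight."""
    bounds = meta["documents_per_case"]
    count = rng.randint(bounds["min"], bounds["max"])  # inclusive on both ends
    types = [entry["type"] for entry in meta["doc_type_mix"]]
    weights = [entry["weight"] for entry in meta["doc_type_mix"]]
    return rng.choices(types, weights=weights, k=count)


doc_types = plan_case_documents(metadata, random.Random(0))
print(doc_types)
```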

fact-spec.yaml — Specify facts to extract with evidence

facts:
  - name: primary_diagnosis
    type: enum
    values: [Depression, Anxiety]
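
An `enum` fact restricts extracted values to a fixed list. As a rough illustration of how such a spec could be checked downstream, the sketch below validates a candidate value against the allowed values. The `check_enum_fact` helper is hypothetical and not part of the Synthdocs API.

```python
# The fact spec above as a Python literal.
fact_spec = {
    "facts": [
        {
            "name": "primary_diagnosis",
            "type": "enum",
            "values": ["Depression", "Anxiety"],
        }
    ]
}


def check_enum_fact(spec, name, value):
    """Return True if value is allowed for the named enum fact."""
    for fact in spec["facts"]:
        if fact["name"] == name:
            if fact["type"] != "enum":
                raise ValueError(f"{name} is not an enum fact")
            return value in fact["values"]
    raise KeyError(name)


print(check_enum_fact(fact_spec, "primary_diagnosis", "Anxiety"))
print(check_enum_fact(fact_spec, "primary_diagnosis", "Psychosis"))
```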

3. (Optional) Add Example Documents

Place sample PDFs in examples/ to help the system learn document patterns.

4. Generate a Plan

from synthdocs import SynthDocs

sd = SynthDocs()
plan_path = sd.generate_plan("myproject")

This creates a timestamped folder under plans/ containing:

  • Normalized YAML configuration
  • .prompty templates for each document type
  • Suggested items (marked approved: false)

5. Review and Approve the Plan

Open the plan YAML files and set approved: true on the items you want to include. For rapid prototyping, use approve_all=True when generating documents to skip manual approval.

6. Generate Documents

result = sd.generate_documents("myproject", plan_path=plan_path, num_cases=5)

Output Structure

myproject/
  plans/2026-01-22-001/
    metadata.yaml
    fact-spec.yaml
    document_types/*.prompty
  output/2026-01-22-001/
    case_0001/
      document_001.md
      document_001.evidence.yaml
      metadata.yaml
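
The layout above can be walked with the standard library alone. This sketch pairs each generated document with its evidence file. It assumes the naming pattern shown in the tree (`case_*` folders, `document_NNN.md` next to `document_NNN.evidence.yaml`), and `pair_documents` is a hypothetical helper rather than part of Synthdocs.

```python
from pathlib import Path


def pair_documents(run_dir):
    """Return {case_name: [(document_path, evidence_path_or_None), ...]}."""
    pairs = {}
    for case_dir in sorted(Path(run_dir).glob("case_*")):
        if not case_dir.is_dir():
            continue
        entries = []
        for doc in sorted(case_dir.glob("document_*.md")):
            # document_001.md -> document_001.evidence.yaml
            evidence = doc.with_name(doc.stem + ".evidence.yaml")
            entries.append((doc, evidence if evidence.exists() else None))
        pairs[case_dir.name] = entries
    return pairs
```

For the tree above, this would be called as `pair_documents("myproject/output/2026-01-22-001")`.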

Evidence files contain line and span pointers for each fact mention:

facts:
  - fact_name: diagnosis
    value: Depression
    line_start: 12
    span_start: 45
    span_end: 55
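
A pointer like this can be checked by slicing the document text. The sketch below assumes that `line_start` is 1-indexed and that `span_start` and `span_end` are character offsets within that line, with the end exclusive. That reading matches the example (55 - 45 is the length of "Depression"), but the README does not confirm it. `resolve_evidence` is a hypothetical helper.

```python
def resolve_evidence(document_text, entry):
    """Return the text an evidence entry points at (assumed semantics)."""
    lines = document_text.splitlines()
    line = lines[entry["line_start"] - 1]  # assumes 1-indexed line numbers
    # Assumes per-line character offsets with an exclusive end.
    return line[entry["span_start"]:entry["span_end"]]


entry = {
    "fact_name": "diagnosis",
    "value": "Depression",
    "line_start": 12,
    "span_start": 45,
    "span_end": 55,
}

# Toy document: 11 filler lines, then a 12th line that holds the value
# starting at character offset 45.
toy_line = "Diagnosis:".ljust(45) + "Depression"
document = "\n".join(["filler"] * 11 + [toy_line])

# True when the pointer and the recorded value agree.
print(resolve_evidence(document, entry) == entry["value"])
```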

Python API Reference

from synthdocs import SynthDocs

sd = SynthDocs()

# Generate plan from project config
plan_path = sd.generate_plan("myproject")

# Generate documents (pass approve_all=True to skip manual approval)
result = sd.generate_documents("myproject", plan_path=plan_path, num_cases=10)

# List generated cases and facts
cases = sd.list_cases("myproject", plan_path=plan_path)
facts = sd.list_facts("myproject", plan_path=plan_path)

# Export as archive
archive = sd.export_archive("myproject", plan_path=plan_path, format="zip")

Environment

Set MISTRAL_API_KEY for LLM generation and PDF OCR.
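
Because generation depends on this variable, it can help to fail early when it is missing. The following is a small sketch of such a check. `require_api_key` is a hypothetical helper, not something Synthdocs is known to provide.

```python
import os


def require_api_key(env=None):
    """Return MISTRAL_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; export it before generating documents."
        )
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any plan or document generation starts.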
