Fetches Hebrew and Aramaic text from the Sefaria API, translates it using large language models, and generates formatted Word documents.
Works with any text available on Sefaria.
pip install -r requirements.txt
export ANTHROPIC_API_KEY='your-key-here'

# Basic usage: fetch, translate, clean, and generate a Word doc
python pipeline.py --text "Kessef Mishneh on Mishneh Torah, Rebels" --chapters 7
# With a human-readable display name and translator credit
python pipeline.py \
--text "Kessef Mishneh on Mishneh Torah, Rebels" \
--chapters 7 \
--display-name "Hilchot Mamrim" \
--translator "Your Name"
# Use a different model
python pipeline.py --text "Rashi on Genesis" --chapters 50 --model claude-sonnet-4-20250514
# Only fetch and translate, skip Word doc generation
python pipeline.py --text "Mishnah Sanhedrin" --chapters 11 --no-docx
# Re-run cleaning and doc generation on already-translated text
python pipeline.py --text "Kessef Mishneh on Mishneh Torah, Rebels" --chapters 7 \
--skip-fetch --skip-translate

- Fetch: Downloads Hebrew text chapter by chapter from Sefaria's API v3
- Translate: Sends each chapter for scholarly translation, with review flags for uncertain terms
- Clean: Replaces Hebrew transliterations with English equivalents and removes markdown artifacts
- Generate: Produces a formatted Word document and an editorial review sheet
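
As an illustration of the Clean step, the sketch below replaces transliterated terms and strips a few markdown artifacts. The glossary entries and the name `clean_translation` are hypothetical, chosen for this example; the actual rules in clean.py are not shown here and may differ.

```python
import re

# Hypothetical glossary: transliterated term -> English equivalent.
# The real mapping used by clean.py is an unknown; these are placeholders.
GLOSSARY = {
    "beit din": "court",
    "halakhah": "law",
}

def clean_translation(text: str, glossary: dict[str, str] = GLOSSARY) -> str:
    """Replace transliterations and strip common markdown artifacts."""
    # Drop bold/italic markers such as **word** or *word*, keeping the word.
    text = re.sub(r"(\*{1,3})(.+?)\1", r"\2", text)
    # Drop leading markdown heading markers ("## Title" -> "Title").
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Replace each transliteration on word boundaries, ignoring case.
    for term, english in glossary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        text = re.sub(pattern, english, text, flags=re.IGNORECASE)
    return text
```

Matching on word boundaries keeps a glossary term from being replaced inside a longer word.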
All intermediate data is cached in data/, so you can re-run any stage without repeating earlier work.
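
One way to get this re-run behavior is a cache-or-compute helper per chapter. The sketch below is an assumption about how such caching might work, not a copy of pipeline.py: the one-JSON-file-per-chapter layout and the names `cached_stage` and `run_chapters` are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

def cached_stage(cache_dir: Path, chapter: int,
                 compute: Callable[[int], dict]) -> dict:
    """Return the cached result for a chapter, computing and saving it if absent."""
    path = cache_dir / f"{chapter}.json"  # hypothetical file layout
    if path.exists():
        # Cache hit: reuse earlier work instead of repeating the stage.
        return json.loads(path.read_text(encoding="utf-8"))
    result = compute(chapter)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Hebrew text readable in the cache file.
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result

def run_chapters(cache_dir: Path, chapters: int,
                 compute: Callable[[int], dict], workers: int = 4) -> list[dict]:
    """Process chapters 1..chapters in parallel; results come back in chapter order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda ch: cached_stage(cache_dir, ch, compute),
                             range(1, chapters + 1)))
```

With this shape, a second run over the same cache directory never calls `compute`, which is the property the caching in data/ is described as providing.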
| Flag | Description |
|---|---|
| --text | Sefaria text reference (required) |
| --chapters | Number of chapters (required) |
| --display-name | Human-readable name (defaults to the --text value) |
| --translator | Translator name for title page |
| --output-dir | Output directory (default: ./outputs) |
| --data-dir | Cache directory (default: ./data) |
| --model | Claude model (default: claude-opus-4-6) |
| --workers | Parallel translation threads (default: 4) |
| --skip-fetch | Skip Hebrew fetch stage |
| --skip-translate | Skip translation stage |
| --skip-clean | Skip cleaning stage |
| --no-docx | Skip Word doc generation |
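
The flags above map naturally onto an argparse parser. This is a sketch reconstructed from the table alone, not the parser in pipeline.py; the function name `build_parser` and the help strings are illustrative.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("--text", required=True, help="Sefaria text reference")
    parser.add_argument("--chapters", type=int, required=True, help="Number of chapters")
    parser.add_argument("--display-name", help="Human-readable name")
    parser.add_argument("--translator", help="Translator name for title page")
    parser.add_argument("--output-dir", default="./outputs")
    parser.add_argument("--data-dir", default="./data")
    parser.add_argument("--model", default="claude-opus-4-6")
    parser.add_argument("--workers", type=int, default=4)
    # One boolean switch per skippable stage.
    for stage in ("fetch", "translate", "clean"):
        parser.add_argument(f"--skip-{stage}", action="store_true",
                            help=f"Skip {stage} stage")
    parser.add_argument("--no-docx", action="store_true",
                        help="Skip Word doc generation")
    return parser

args = build_parser().parse_args(
    ["--text", "Mishnah Sanhedrin", "--chapters", "11", "--no-docx"]
)
# Fall back to the --text value when no display name is given.
display_name = args.display_name or args.text
```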
# Clean translations directly
python clean.py data/translation_cache

pipeline.py          # Main CLI entry point
clean.py             # Translation cleaning module
docx_generator.py    # Word document generation
requirements.txt     # Python dependencies
data/                # Cached Hebrew text and translations
  hebrew_cache/
  translation_cache/
outputs/             # Generated Word documents
Original pipeline and translations by Jacob Goldman.
Browse https://www.sefaria.org to find the text you want. The --text argument should match Sefaria's title for the text. You can find this title in the text's Sefaria URL or through the API.
For example:
- "Rashi on Genesis" for Rashi's Torah commentary
- "Mishnah Sanhedrin" for Mishnah Sanhedrin
- "Kessef Mishneh on Mishneh Torah, Rebels" for Kesef Mishneh on Hilchot Mamrim