fix: #3 — feat(packs): bundled KJV, 道德经, heart-sutra (api-only) #13

MiaoDX merged 1 commit into main from claude-issue-3 on May 2, 2026

MiaoDX (Owner) commented May 2, 2026

Closes #3.

Builds and embeds the three v0.1 scripture packs.

Packs

| Pack | Verses | Source | Inclusion |
| --- | --- | --- | --- |
| bible-kjv | 31,102 (across 66 books) | Project Gutenberg eBook #10 | bundled |
| dao-de-jing | 81 chapters (simplified Chinese, via OpenCC t2s at build time) | Project Gutenberg eBook #7337 | bundled |
| heart-sutra | 0 (stub) | CBETA, license audit pending | api_only |

Total bundled size: 2.81 MB (well under the 6 MB budget). Achieved by gzipping the JSONL inside embed.FS — without it the KJV alone is 19 MB.

Per the issue's notes, heart-sutra is shipped as an inclusion_mode=api_only stub while CBETA redistribution terms are reviewed; the pack directory and metadata are in place so the registry surfaces it and a follow-up PR can vendor the verses.

What's in this PR

  • scripts/build_packs.py — Python builder that downloads upstream sources, parses, and writes verses.jsonl.gz + metadata.json per pack. Pinned mtime=0 in the gzip header so output is byte-reproducible.
  • scripts/verify_quotes.py — recomputes SHA-256 over every bundled verse's text and exits non-zero on mismatch (or on verse_count drift between metadata and JSONL). Wired into CI as a gate before go test.
  • internal/packs/packs.go: embed.FS-backed registry. Lookups are O(1) after init. Uses metadata.json to denormalize the compact JSONL rows into full schema.Verse values, which keeps each pack small enough to fit the size budget.
  • internal/packs/packs_test.go — checksum-based spot checks (Genesis 1:1, John 3:16, Revelation 22:21, Dao chapter 11), per-verse self-consistency, schema-compliance, and an all-66-books-present check (a regression net for the Haggai parse bug I hit during development).
  • cmd/scripture-mcp/main.go: --packs and --lookup-id <id> flags so the registry is queryable from main, satisfying the "queryable from main" criterion and giving issue #4 (feat(core): stdio MCP server + CLI subcommands) a working starting point.
  • CLAUDE.md — engineering guidance to keep sacred-text bodies out of model output (write through scripts/files, assert by checksum in tests, avoid cat-ing pack contents through Bash). Added after the Anthropic content filter blocked an early attempt to write the build script with verse text inline.
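
The byte-reproducible output described for scripts/build_packs.py can be sketched roughly as follows. This is a minimal illustration and not the script itself: the function name `build_pack_bytes` and the sample row are hypothetical, and the exact JSON key order the real builder uses is an assumption. Only the pinned `mtime=0` and the per-verse `checksum_sha256` over the verse text come from the description above.

```python
import gzip
import hashlib
import io
import json


def build_pack_bytes(verses):
    """Serialize verses to gzipped JSONL with a pinned gzip timestamp."""
    raw = io.BytesIO()
    # mtime=0 pins the timestamp field in the gzip header, and filename=""
    # keeps any file name out of the header, so repeated runs give equal bytes.
    with gzip.GzipFile(filename="", mode="wb", fileobj=raw, mtime=0) as gz:
        for verse in verses:
            row = dict(verse)
            # Checksum is computed over the UTF-8 bytes of the verse text.
            row["checksum_sha256"] = hashlib.sha256(
                verse["text"].encode("utf-8")
            ).hexdigest()
            # sort_keys gives a stable key order (an assumption of this sketch).
            line = json.dumps(row, ensure_ascii=False, sort_keys=True)
            gz.write(line.encode("utf-8") + b"\n")
    return raw.getvalue()


# Hypothetical placeholder row; real packs carry the upstream text.
sample = [
    {"id": "demo.work.1.1", "canonical_ref": "Demo 1:1", "text": "placeholder text"},
]
first = build_pack_bytes(sample)
second = build_pack_bytes(sample)
```

Comparing `first` and `second` is the same check as re-running the builder and diffing the pack files.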

Schema change

Relaxed the verse-id regex to allow hyphens within dotted segments (song-of-solomon, 1-samuel, heart-sutra). Both internal/schema/verse.go and internal/schema/verse.schema.json updated together — required to slug multi-word book names without losing readability.
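
To illustrate the relaxed rule, here is a hypothetical pattern that accepts hyphens inside dotted segments. The actual expression in internal/schema/verse.go is not shown in this PR description, so the pattern, the lowercase-only restriction, and the hyphenated example ids in the test calls are all assumptions.

```python
import re

# One segment: lowercase alphanumerics, optionally joined by single hyphens.
# Leading, trailing, and doubled hyphens are rejected (an assumption).
SEGMENT = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# A verse id is two or more segments separated by dots.
VERSE_ID = re.compile(rf"{SEGMENT}(?:\.{SEGMENT})+")


def is_valid_verse_id(verse_id):
    """Return True when the whole string matches the hypothetical id pattern."""
    return VERSE_ID.fullmatch(verse_id) is not None


# Ids quoted in this PR, plus hypothetical hyphenated ones.
accepted = [
    "bible.kjv.john.3.16",
    "dao.daodejing.11.1",
    "bible.kjv.song-of-solomon.1.1",
    "bible.kjv.1-samuel.1.1",
]
rejected = ["bible..john", "bible.kjv.-john.3.16", "bible.kjv.john-.3.16"]
```

Anchoring the hyphen between alphanumeric runs keeps slugs such as `song-of-solomon` readable while still rejecting malformed segments.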

Deviation from the issue text

The issue specifies verses.jsonl; this PR ships verses.jsonl.gz inside embed.FS to satisfy the hard 6 MB total-pack budget (uncompressed JSONL would overshoot). The on-disk format is still JSONL line-by-line — only the storage transport is gzip. Both build_packs.py and the Go loader use the .gz path; verify_quotes.py walks the gz directly.
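
Because gzip is only the storage transport, a reader can still consume the pack one JSON line at a time. The sketch below shows that idea in Python. The helper name `iter_verses` and the sample rows are hypothetical, and the real loader is the Go code in internal/packs/packs.go.

```python
import gzip
import io
import json


def iter_verses(gz_bytes):
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL blob."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Hypothetical two-row pack, compressed in memory for the demonstration.
rows = [
    {"id": "demo.work.1.1", "text": "first placeholder"},
    {"id": "demo.work.1.2", "text": "second placeholder"},
]
payload = gzip.compress(
    "".join(json.dumps(r) + "\n" for r in rows).encode("utf-8"), mtime=0
)
decoded = list(iter_verses(payload))
```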

Acceptance-criteria mapping

  • scripts/build_packs.py (Python) downloads KJV from Project Gutenberg, normalizes, writes internal/packs/bible-kjv/verses.jsonl.gz with all 31,102 verses
  • 道德经 (Dao De Jing) pack at internal/packs/dao-de-jing/verses.jsonl.gz (81 chapters from Project Gutenberg eBook #7337, traditional → simplified via OpenCC at build time)
  • 心经 (Heart Sutra) pack at internal/packs/heart-sutra/verses.jsonl.gz (api-only stub per the issue's CBETA-uncertainty note)
  • Every pack has metadata.json with provider / license / attribution
  • Every verse has SHA-256 checksum_sha256 computed at build time
  • scripts/verify_quotes.py recomputes checksums and fails CI on mismatch
  • All three packs embed via embed.FS and are queryable from main
  • Spot checks: bible.kjv.john.3.16 resolves to canonical KJV (sha verified); dao.daodejing.11.1 resolves to simplified-Chinese chapter 11 (180 bytes ≈ 60 chars, matching 三十辐共一毂…)
  • Total bundled pack size: 2.81 MB (< 6 MB)

Verification

  • make all (lint + verify-packs + test + build) clean
  • python3 scripts/verify_quotes.py → 31,183 verses, 0 mismatches
  • ./bin/scripture-mcp --packs lists all three packs
  • ./bin/scripture-mcp --lookup-id bible.kjv.john.3.16 resolves
  • Re-running python3 scripts/build_packs.py produces byte-identical pack outputs
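
The two checks that scripts/verify_quotes.py is described as performing, checksum recomputation and verse_count drift, could look roughly like this. It is a sketch under assumptions: `verify_pack` and the sample data are hypothetical, and the real script's structure and messages may differ.

```python
import hashlib


def verify_pack(metadata, rows):
    """Return a list of problems; an empty list means the pack is consistent."""
    problems = []
    # Drift check: metadata.json's verse_count must match the JSONL row count.
    if metadata["verse_count"] != len(rows):
        problems.append(
            f"verse_count drift: metadata says {metadata['verse_count']}, "
            f"found {len(rows)} rows"
        )
    # Checksum check: recompute SHA-256 over each verse's UTF-8 text.
    for row in rows:
        actual = hashlib.sha256(row["text"].encode("utf-8")).hexdigest()
        if actual != row["checksum_sha256"]:
            problems.append(f"checksum mismatch: {row['id']}")
    return problems


def make_row(verse_id, text):
    """Build a hypothetical row with a correct checksum."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"id": verse_id, "text": text, "checksum_sha256": digest}


good_rows = [make_row("demo.work.1.1", "alpha"), make_row("demo.work.1.2", "beta")]
tampered_rows = [dict(good_rows[0], text="alpha!"), good_rows[1]]
```

A CI gate would then exit non-zero whenever the returned list is non-empty, for example with `sys.exit(1)`.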

Generated by Claude Code

Closes #3.

Builds and embeds the three v0.1 scripture packs.

## Packs

- bible-kjv:    31,102 verses, 66 books, from Project Gutenberg eBook #10
- dao-de-jing:  81 chapters, simplified Chinese, from Project Gutenberg eBook #7337
                (traditional → simplified via OpenCC t2s at build time)
- heart-sutra:  inclusion_mode=api_only stub (per issue #3 notes,
                CBETA redistribution audit deferred to a follow-up PR)

Total bundled size: 2.81 MB (well under the 6 MB budget) thanks to
gzipping the JSONL inside embed.FS — without it the KJV alone would
be 19 MB. Each verse keeps a stable id, canonical_ref, text, and
SHA-256 checksum; tradition/lang/work/source live once per pack in
metadata.json.
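
The split between per-verse rows and per-pack metadata can be illustrated with a small Python sketch. The real registry is Go (internal/packs/packs.go); the function names and all field values here are hypothetical, and only the field names are taken from the paragraph above.

```python
# Fields stored once per pack in metadata.json, per the description above.
PACK_LEVEL_FIELDS = ("tradition", "lang", "work", "source")


def denormalize(metadata, row):
    """Merge pack-level fields into a compact row to form a full verse record."""
    verse = {key: metadata[key] for key in PACK_LEVEL_FIELDS}
    verse.update(row)
    return verse


def build_index(metadata, rows):
    """Build an id -> verse mapping once so later lookups are dict accesses."""
    return {row["id"]: denormalize(metadata, row) for row in rows}


# Hypothetical metadata and row values for the demonstration.
metadata = {
    "tradition": "demo-tradition",
    "lang": "en",
    "work": "demo-work",
    "source": "demo-source",
}
rows = [
    {
        "id": "demo.work.1.1",
        "canonical_ref": "Demo 1:1",
        "text": "placeholder",
        "checksum_sha256": "0" * 64,
    },
]
index = build_index(metadata, rows)
verse = index["demo.work.1.1"]
```

Storing the four shared fields once per pack rather than on every row is what keeps the JSONL rows compact.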

## What's in this PR

- scripts/build_packs.py — Python builder that downloads upstream
  sources, parses, and writes verses.jsonl.gz + metadata.json
- scripts/verify_quotes.py — recomputes SHA-256 on every bundled
  verse and exits non-zero on mismatch (or on verse_count drift)
- internal/packs/packs.go — embed.FS registry. Lookups are O(1)
  after init. Uses metadata.json to denormalize the compact JSONL
  rows into full schema.Verse values.
- internal/packs/packs_test.go — checksum-based spot checks
  (genesis 1:1, john 3:16, revelation 22:21, dao chapter 11),
  per-verse self-consistency check, schema-compliance check,
  KJV all-66-books-present check
- cmd/scripture-mcp/main.go — `--packs` and `--lookup-id` flags
  so the registry is queryable from main, demonstrating the
  embed wiring for issue #4 to build on
- CLAUDE.md — guidance to keep scripture text out of model output
  (write through scripts/files; assert by checksum in tests; avoid
  echoing pack contents through Bash output)

## Schema change

Relaxed the verse-id regex to allow hyphens within segments
(`song-of-solomon`, `1-samuel`, `heart-sutra`) — required to slug
multi-word book names without losing readability. Both verse.go
and verse.schema.json updated together.

## Deviation from the issue text

The issue specifies `verses.jsonl`; this PR ships `verses.jsonl.gz`
inside embed.FS to satisfy the hard 6 MB total-pack budget (KJV is
~4 MB of raw text plus ~15 MB of JSON envelope — uncompressed JSONL
overshoots). The on-disk format is still JSONL line by line; only
the storage transport is gzip. Both build_packs.py and the Go
loader use the .gz path; verify_quotes.py walks the gz directly.

## Verification

- `make all` (lint + verify-packs + test + build) clean
- `python3 scripts/verify_quotes.py` reports 31,183 verses, 0 mismatches
- `./bin/scripture-mcp --packs` lists all three packs
- `./bin/scripture-mcp --lookup-id bible.kjv.john.3.16` resolves
- Re-running `python3 scripts/build_packs.py` produces byte-identical
  pack outputs (gzip mtime pinned to 0 in the header)

https://claude.ai/code/session_017oFxiyRdiLLc32qfADHFav
MiaoDX marked this pull request as ready for review May 2, 2026 00:57
MiaoDX merged commit 9c97ec1 into main May 2, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

feat(packs): bundled KJV, 道德经, and 心经
