diff --git a/.github/ISSUE_TEMPLATE/task.md b/.github/ISSUE_TEMPLATE/task.md new file mode 100644 index 0000000..dabcf04 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/task.md @@ -0,0 +1,45 @@ +--- +name: Implementation Task +about: AI-consumable task with full context and acceptance criteria +title: '' +labels: 'task,backlog' +assignees: '' +--- + +## Problem + + + +## Spec + + + +## Out of Scope + + + +## Files to Modify + + + +## Test Plan + + + +## Acceptance Criteria + +- [ ] Behavior matches Spec exactly +- [ ] Existing tests pass +- [ ] New tests cover the change +- [ ] No regression in related CLI features + +## Automation Policy + + +- Max automated attempts: 3 +- If same failure repeats 2+ times, add `needs-human` and stop auto-retry +- Do not auto-merge without Claude approval + CI green + +## References + + diff --git a/.github/automation/OPERATING_MODEL.md b/.github/automation/OPERATING_MODEL.md new file mode 100644 index 0000000..e34ec9a --- /dev/null +++ b/.github/automation/OPERATING_MODEL.md @@ -0,0 +1,30 @@ +# PM/Dev Automation Operating Model + +## Roles +- Claude PM: issue definition, review decision, prioritization. +- Codex Dev: implementation, tests, PR delivery. + +## State Machine +- `backlog` -> `ready-codex` -> `in-progress` -> `needs-claude-review` -> `approved` -> `done` +- Exception states: `blocked`, `needs-human` + +## Automation Jobs +- `automation-label-bootstrap.yml`: ensures required labels exist. 
+- `automation-state-machine.yml`: + - issue enters execution when labeled `ready-codex` + - PRs are labeled `needs-claude-review` + - merged PRs are labeled `done` + - optional follow-up issue generation when PR has `auto-loop` +- `claude-review-scheduler.yml`: + - reminds pending Claude review every 6h (max 3 reminders) + - escalates to `needs-human` + `blocked` after max reminders + +## Loop Prevention Rules +- Max automated issue attempts: 3 +- Max review reminders: 3 +- Escalate to `needs-human` when thresholds are exceeded +- Follow-up issue creation is opt-in via `auto-loop` label only + +## Merge Gate +- Required: CI green + Claude approval +- Do not auto-merge when PR is `blocked` or `needs-human` diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md new file mode 100644 index 0000000..75a2e64 --- /dev/null +++ b/.github/pull_request_template.md @@ -0,0 +1,25 @@ +## Summary + + + +## Linked Issues + + + +## Verification + + +- [ ] Tests passed locally +- [ ] CI passed +- [ ] No unrelated files changed + +## Risks + + + +## Handoff To Claude PM + + +- Review focus: +- Open questions: +- Suggested next issue: diff --git a/.github/workflows/automation-label-bootstrap.yml b/.github/workflows/automation-label-bootstrap.yml new file mode 100644 index 0000000..d9899af --- /dev/null +++ b/.github/workflows/automation-label-bootstrap.yml @@ -0,0 +1,63 @@ +name: Automation Label Bootstrap + +on: + workflow_dispatch: + schedule: + - cron: "17 3 * * 1" + +permissions: + issues: write + pull-requests: write + +jobs: + bootstrap-labels: + runs-on: ubuntu-latest + steps: + - name: Ensure automation labels exist + uses: actions/github-script@v7 + with: + script: | + const labels = [ + { name: 'task', color: '0E8A16', description: 'Tracked implementation task' }, + { name: 'backlog', color: 'BFD4F2', description: 'Queued but not ready for execution' }, + { name: 'ready-codex', color: '1D76DB', description: 'Ready for Codex execution' }, + { name: 
'in-progress', color: 'FBCA04', description: 'Actively being implemented' }, + { name: 'needs-claude-review', color: '5319E7', description: 'Awaiting Claude PM review' }, + { name: 'approved', color: '0E8A16', description: 'Approved for merge' }, + { name: 'blocked', color: 'D93F0B', description: 'Blocked by dependency or risk' }, + { name: 'needs-human', color: 'B60205', description: 'Automation halted, human decision required' }, + { name: 'done', color: 'C2E0C6', description: 'Completed and verified' }, + { name: 'auto-loop', color: '0052CC', description: 'Opt-in: create follow-up PM issue after merge' }, + { name: 'ready-claude', color: '5319E7', description: 'Ready for Claude PM planning/review' }, + ]; + + for (const label of labels) { + try { + await github.rest.issues.getLabel({ + owner: context.repo.owner, + repo: context.repo.repo, + name: label.name, + }); + await github.rest.issues.updateLabel({ + owner: context.repo.owner, + repo: context.repo.repo, + name: label.name, + color: label.color, + description: label.description, + }); + core.info(`Updated label: ${label.name}`); + } catch (error) { + if (error.status === 404) { + await github.rest.issues.createLabel({ + owner: context.repo.owner, + repo: context.repo.repo, + name: label.name, + color: label.color, + description: label.description, + }); + core.info(`Created label: ${label.name}`); + } else { + throw error; + } + } + } diff --git a/.github/workflows/automation-state-machine.yml b/.github/workflows/automation-state-machine.yml new file mode 100644 index 0000000..31d3acb --- /dev/null +++ b/.github/workflows/automation-state-machine.yml @@ -0,0 +1,227 @@ +name: Automation State Machine + +on: + issues: + types: [labeled, reopened] + pull_request: + types: [opened, reopened, synchronize, ready_for_review, closed] + +permissions: + issues: write + pull-requests: write + contents: read + +jobs: + issue-ready-codex: + if: | + github.event_name == 'issues' && + ( + (github.event.action == 'labeled' 
&& github.event.label.name == 'ready-codex') || + github.event.action == 'reopened' + ) + runs-on: ubuntu-latest + steps: + - name: Transition issue into execution state + uses: actions/github-script@v7 + with: + script: | + const issue = context.payload.issue; + const issueNumber = issue.number; + const maxAttempts = 3; + const labels = issue.labels.map(l => l.name); + + if (!labels.includes('ready-codex')) { + core.info('Issue is not ready-codex; skipping.'); + return; + } + + const comments = await github.paginate(github.rest.issues.listComments, { + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + per_page: 100, + }); + + const attemptMarker = '<!-- codex-attempt -->'; + const attempts = comments.filter(c => c.body && c.body.includes(attemptMarker)).length; + + const labelSet = new Set(labels); + const applyLabels = async (nextSet) => { + await github.rest.issues.setLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + labels: [...nextSet], + }); + }; + + if (attempts >= maxAttempts) { + labelSet.delete('ready-codex'); + labelSet.delete('in-progress'); + labelSet.add('blocked'); + labelSet.add('needs-human'); + await applyLabels(labelSet); + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + body: [ + '<!-- codex-automation-halted -->', + 'Automation halted: reached max automated attempts (3).', + 'Please perform human triage and then relabel with `ready-codex` if retry is still valid.', + ].join('\n'), + }); + return; + } + + labelSet.add('in-progress'); + labelSet.delete('backlog'); + labelSet.delete('blocked'); + labelSet.delete('needs-human'); + labelSet.delete('done'); + await applyLabels(labelSet); + + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + body: [ + attemptMarker, + `Codex execution started (attempt ${attempts + 1}/${maxAttempts}).`, + 'Next transition expected: `PR_OPEN` -> 
`needs-claude-review`.', + ].join('\n'), + }); + + pr-review-request: + if: | + github.event_name == 'pull_request' && + contains(fromJSON('["opened","reopened","synchronize","ready_for_review"]'), github.event.action) + runs-on: ubuntu-latest + steps: + - name: Mark PR as pending Claude review + uses: actions/github-script@v7 + with: + script: | + const pr = context.payload.pull_request; + const prNumber = pr.number; + const labels = new Set((pr.labels || []).map(l => l.name)); + labels.add('needs-claude-review'); + labels.delete('approved'); + labels.delete('done'); + + await github.rest.issues.setLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: prNumber, + labels: [...labels], + }); + + const sha = pr.head.sha; + const marker = `<!-- claude-review-request:${sha} -->`; + const comments = await github.paginate(github.rest.issues.listComments, { + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: prNumber, + per_page: 100, + }); + const alreadyPosted = comments.some(c => c.body && c.body.includes(marker)); + if (alreadyPosted) { + core.info('Review request already posted for this SHA.'); + return; + } + + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: prNumber, + body: [ + marker, + 'Codex implementation update is ready for Claude PM review.', + 'Review gate: spec match, risk checks, and merge/no-merge decision.', + ].join('\n'), + }); + + pr-merged-closeout: + if: github.event_name == 'pull_request' && github.event.action == 'closed' && github.event.pull_request.merged == true + runs-on: ubuntu-latest + steps: + - name: Close out merged PR and optionally create follow-up issue + uses: actions/github-script@v7 + with: + script: | + const pr = context.payload.pull_request; + const prNumber = pr.number; + const labels = new Set((pr.labels || []).map(l => l.name)); + const hasAutoLoop = labels.has('auto-loop'); + + labels.add('done'); + labels.delete('needs-claude-review'); + 
labels.delete('in-progress'); + labels.delete('ready-codex'); + labels.delete('blocked'); + + await github.rest.issues.setLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: prNumber, + labels: [...labels], + }); + + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: prNumber, + body: [ + '', + `Completed and merged in ${pr.merge_commit_sha}.`, + 'State transition: `MERGED -> DONE`.', + ].join('\n'), + }); + + const body = pr.body || ''; + const refs = [...body.matchAll(/(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?)\s+#(\d+)/ig)] + .map(m => Number(m[1])) + .filter(n => Number.isFinite(n)); + const uniqueRefs = [...new Set(refs)]; + + for (const issueNumber of uniqueRefs) { + try { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + body: `Implemented by PR #${prNumber} and merged in \`${pr.merge_commit_sha}\`.`, + }); + } catch (error) { + core.warning(`Failed to comment on issue #${issueNumber}: ${error.message}`); + } + } + + if (!hasAutoLoop) { + core.info('auto-loop label absent; skipping follow-up PM issue creation.'); + return; + } + + const followUpTitle = `pm: post-merge review for PR #${prNumber}`; + const followUpBody = [ + `Source PR: #${prNumber}`, + '', + '## Claude PM Review', + '- Validate merged behavior against acceptance criteria', + '- Identify residual risks and missing tests', + '- Decide whether a new Codex issue is required', + '', + '## Next Action', + '- If more work is needed, create a new issue and label it `ready-codex`', + '- If no further action is needed, close this issue', + '', + '', + ].join('\n'); + + await github.rest.issues.create({ + owner: context.repo.owner, + repo: context.repo.repo, + title: followUpTitle, + body: followUpBody, + labels: ['task', 'ready-claude', 'backlog'], + }); diff --git a/.github/workflows/claude-review-scheduler.yml 
b/.github/workflows/claude-review-scheduler.yml new file mode 100644 index 0000000..e477eec --- /dev/null +++ b/.github/workflows/claude-review-scheduler.yml @@ -0,0 +1,97 @@ +name: Claude Review Scheduler + +on: + workflow_dispatch: + schedule: + - cron: "0 */2 * * *" + +permissions: + pull-requests: write + issues: write + contents: read + +jobs: + remind-claude-review: + runs-on: ubuntu-latest + steps: + - name: Remind and escalate stale review requests + uses: actions/github-script@v7 + with: + script: | + const maxReminders = 3; + const minHoursBetweenReminders = 6; + const reminderMarker = '<!-- claude-review-reminder -->'; + const escalationMarker = '<!-- claude-review-escalated -->'; + + const prs = await github.paginate(github.rest.pulls.list, { + owner: context.repo.owner, + repo: context.repo.repo, + state: 'open', + per_page: 100, + }); + + for (const pr of prs) { + if (pr.draft) { + continue; + } + const labels = new Set((pr.labels || []).map(l => l.name)); + if (!labels.has('needs-claude-review')) { + continue; + } + + const comments = await github.paginate(github.rest.issues.listComments, { + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + per_page: 100, + }); + + const reminderComments = comments.filter(c => c.body && c.body.includes(reminderMarker)); + const remindersSent = reminderComments.length; + + if (remindersSent >= maxReminders) { + labels.delete('needs-claude-review'); + labels.add('needs-human'); + labels.add('blocked'); + await github.rest.issues.setLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + labels: [...labels], + }); + + const escalatedAlready = comments.some(c => c.body && c.body.includes(escalationMarker)); + if (!escalatedAlready) { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + body: [ + escalationMarker, + 'Claude review reminders reached max count (3).', + 'Escalated to `needs-human` + `blocked` to prevent unattended looping.', 
+ ].join('\n'), + }); + } + continue; + } + + if (reminderComments.length > 0) { + const lastReminder = new Date(reminderComments[reminderComments.length - 1].created_at); + const elapsedHours = (Date.now() - lastReminder.getTime()) / (1000 * 60 * 60); + if (elapsedHours < minHoursBetweenReminders) { + continue; + } + } + + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + body: [ + reminderMarker, + `Review reminder ${remindersSent + 1}/${maxReminders}: Claude PM review is pending.`, + '@baekho-lim please review this PR and decide: approve, request changes, or block.', + ].join('\n'), + }); + } diff --git a/.gitignore b/.gitignore index 11dd5e4..29f1cba 100644 --- a/.gitignore +++ b/.gitignore @@ -33,3 +33,16 @@ Thumbs.db # Build artifacts *.log + +# Python +__pycache__/ +*.pyc + +# Editor +.omx/ + +# md2hwp outputs +testdata/md2hwp-outputs/ + +# Personal business plan data +바이탈루_*.json diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..8467dcd --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,139 @@ +# AGENTS.md + +Instructions for AI coding agents (ChatGPT Codex, etc.) working on this repository. + +## Project Overview + +hwp2md is a CLI tool for converting HWP/HWPX documents to Markdown (Go) with a reverse pipeline **md2hwp** for filling HWPX templates with content (Python). 
+ +## Repository Structure + +``` +hwp2md/ +├── cmd/hwp2md/ # CLI entry point (Go) +├── internal/ # Core Go implementation +│ ├── parser/hwpx/ # HWPX XML parser +│ ├── parser/hwp5/ # HWP5 binary parser +│ ├── ir/ # Intermediate Representation +│ ├── llm/ # LLM provider abstraction +│ ├── formatter/ # Output formatting +│ └── cli/ # CLI commands +├── tools/md2hwp-ui/ # Web preview UI (Python, lower priority) +│ ├── server.py # HTTP server + SSE +│ └── renderer.py # HWPX -> HTML converter +├── tools/md2hwp/ # fill_hwpx.py (Python template injection engine) +├── tests/ # E2E tests (Go) +├── testdata/ # Test fixtures +└── docs/ # Technical documentation + └── md2hwp/ # md2hwp design docs & specs +``` + +## Build & Test + +```bash +# Go (hwp2md core) +make build # Build binary to bin/hwp2md +make test # Unit tests with race detection + coverage +make test-e2e # E2E tests +make lint # golangci-lint +make fmt # gofmt + +# Python (md2hwp / fill_hwpx.py) +pip install lxml # Required dependency +python3 tools/md2hwp/fill_hwpx.py --help +python3 tools/md2hwp/fill_hwpx.py --inspect +python3 tools/md2hwp/fill_hwpx.py --inspect-tables +python3 tools/md2hwp/fill_hwpx.py --analyze +python3 tools/md2hwp/fill_hwpx.py +``` + +## Key Conventions + +- **Go code**: Follow golangci-lint rules, `make fmt` before commit +- **Python code**: Follow PEP 8, type hints where practical +- **Commits**: Conventional Commits 1.0.0 (feat/fix/docs/test/refactor/chore) +- **Language**: Korean for user-facing messages, English for code/docs/commits +- **Tests**: TDD workflow - write tests first, then implement + +## md2hwp Architecture + +See [docs/md2hwp/DESIGN.md](docs/md2hwp/DESIGN.md) for full design. 
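The build commands above treat the HWPX input as an opaque file, but an HWPX is just a ZIP container whose section XML lives under `Contents/`. A minimal stdlib sketch of that container round-trip (the in-memory archive and entry name are illustrative stand-ins for a real template file):

```python
import io
import zipfile

# Build a tiny stand-in archive; real tooling would open a template from disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("Contents/section0.xml", "<hs:sec/>")

# Locate and read the section XML parts inside the container.
with zipfile.ZipFile(buf) as zf:
    sections = sorted(n for n in zf.namelist() if n.startswith("Contents/section"))
    xml_bytes = zf.read(sections[0])

print(sections, xml_bytes)
```

Injection tools in this style typically write the modified entries into a fresh archive rather than mutating the original in place, so the template stays untouched.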
+ +### fill_hwpx.py Overview + +Template injection engine that modifies HWPX (ZIP + XML) files: + +- **Input**: `fill_plan.json` with replacement instructions +- **Output**: Modified HWPX file with content injected +- **Preservation**: All formatting (fonts, cell sizes, merge patterns) preserved + +### HWPX XML Structure + +``` +hs:sec (section root) + hp:p (paragraph) + hp:run (text run with style reference) + hp:t (text content) + hp:tbl (table) + hp:tr (table row) + hp:tc (table cell) + hp:cellAddr (colAddr, rowAddr) + hp:cellSpan (colSpan, rowSpan) + hp:subList + hp:p > hp:run > hp:t +``` + +### Namespace Map + +```python +HWPX_NS = { + "hp": "http://www.hancom.co.kr/hwpml/2011/paragraph", + "hs": "http://www.hancom.co.kr/hwpml/2011/section", + "hc": "http://www.hancom.co.kr/hwpml/2011/core", + "hh": "http://www.hancom.co.kr/hwpml/2011/head", +} +``` + +## Working with Issues + +Each issue assigned to you will contain: + +1. **Context**: Why this change is needed +2. **Spec**: Exact interface/behavior expected +3. **Test fixtures**: Input/output examples in `testdata/` or inline +4. 
**Acceptance criteria**: What must pass for the PR to be accepted + +### Branch Naming + +``` +codex/<issue-number>-<short-description> +# Example: codex/25-fix-empty-cell-fill +``` + +### PR Checklist + +Before submitting a PR: +- [ ] All existing tests pass (`make test` for Go, pytest for Python) +- [ ] New tests added for new functionality +- [ ] Code formatted (`make fmt` for Go, PEP 8 for Python) +- [ ] Commit messages follow Conventional Commits +- [ ] No new linter warnings + +## Important Files Reference + +| File | Purpose | +|------|---------| +| `tools/md2hwp/fill_hwpx.py` | Template injection engine (Python) | +| `docs/md2hwp/DESIGN.md` | Architecture & fill_plan.json schema | +| `docs/md2hwp/FILL_PLAN_SCHEMA.md` | JSON schema reference | +| `testdata/hwpx_20260302_200059.hwpx` | Primary test template | +| `internal/parser/hwpx/parser.go` | HWPX parser (Go) | +| `docs/hwpx-schema.md` | HWPX XML format specification | + +## Python-specific Notes + +- **Python version**: 3.12+ (3.13 removed `cgi` module) +- **Dependencies**: `lxml` only (no Flask, no external frameworks) +- **File size limit**: 800 lines max per file +- **Function size**: 50 lines max +- **Error handling**: Always catch + descriptive error messages diff --git a/CLAUDE.md b/CLAUDE.md index b39b34d..0b98817 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -74,6 +74,31 @@ HWP/HWPX → Stage 1 (Parser) → IR → Stage 2 (LLM, optional) → Markdown | `HWP2MD_BASE_URL` | Private API endpoint (Bedrock, Azure, local) | | `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GOOGLE_API_KEY`, `UPSTAGE_API_KEY` | Provider API keys | +## md2hwp (Reverse Pipeline) + +HWPX template injection engine: fill government templates with business plan content. 
+ +- **Engine**: `tools/md2hwp/fill_hwpx.py` (Python, lxml) +- **Design doc**: `docs/md2hwp/DESIGN.md` +- **Test template**: `testdata/hwpx_20260302_200059.hwpx` (재도전성공패키지) + +```bash +# Inspect template +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --inspect +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --inspect-tables + +# Fill template +python3 tools/md2hwp/fill_hwpx.py <fill_plan.json> +``` + +### Collaboration + +- **Claude (architect)**: Design, review PRs, integration testing +- **ChatGPT Codex (implementer)**: Code implementation, unit tests +- **Sync point**: GitHub issues on baekho-lim/hwp2md +- **Codex instructions**: `AGENTS.md` +- **Epic**: baekho-lim/hwp2md#7 + ## Conventions - Korean is the primary language for CLI messages, comments, and documentation diff --git a/docs/md2hwp/DESIGN.md b/docs/md2hwp/DESIGN.md new file mode 100644 index 0000000..550d3f1 --- /dev/null +++ b/docs/md2hwp/DESIGN.md @@ -0,0 +1,312 @@ +# md2hwp Design Document + +> Reverse pipeline: Fill HWPX government templates with business plan content. + +## Problem + +Korean government funding applications require submission in HWP format with strict template compliance. Manual form-filling is tedious and error-prone. md2hwp automates this by injecting structured content into HWPX templates while preserving all formatting. + +## Target Workflow + +``` +1. User uploads HWPX template → Claude analyzes structure +2. User discusses business plan → content finalized +3. Claude generates fill_plan.json → fill_hwpx.py injects content +4. User downloads completed HWPX → submits to government +``` + +## Architecture + +``` +fill_plan.json ──→ fill_hwpx.py ──→ output.hwpx + │ + template.hwpx + (ZIP + XML) +``` + +### Core Engine: `tools/md2hwp/fill_hwpx.py` + +Single Python script using `zipfile + lxml` for direct XML manipulation. + +**Why not python-hwpx?** It misses text inside table cells. Direct XML parsing captures ALL `<hp:t>` elements. 
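The claim about table-cell text can be demonstrated in a few lines: walking the raw XML tree visits every `hp:t` element, including those nested under table cells. A minimal sketch with the stdlib parser and a hand-made stand-in fragment (fill_hwpx.py itself uses lxml, but the traversal idea is the same):

```python
import xml.etree.ElementTree as ET

HP = "http://www.hancom.co.kr/hwpml/2011/paragraph"
HS = "http://www.hancom.co.kr/hwpml/2011/section"

# Stand-in fragment: one body paragraph plus one table-cell paragraph.
sample = (
    f'<hs:sec xmlns:hs="{HS}" xmlns:hp="{HP}">'
    "<hp:p><hp:run><hp:t>본문</hp:t></hp:run></hp:p>"
    "<hp:tbl><hp:tr><hp:tc><hp:subList>"
    "<hp:p><hp:run><hp:t>셀 텍스트</hp:t></hp:run></hp:p>"
    "</hp:subList></hp:tc></hp:tr></hp:tbl>"
    "</hs:sec>"
)

root = ET.fromstring(sample)
# iter() walks the whole subtree, so table-cell text is not skipped.
texts = [t.text for t in root.iter(f"{{{HP}}}t")]
print(texts)
```

This prints both text fragments in document order, which is exactly why the engine scans the tree directly instead of relying on a paragraph-level API.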
+ +### Replacement Strategies + +| Strategy | Purpose | fill_plan.json key | +|----------|---------|-------------------| +| Simple | Exact text match → replace | `simple_replacements` | +| Section | Guide text → actual content (cell-scoped) | `section_replacements` | +| Table Cell | Label cell → adjacent value cell | `table_cell_fills` | +| Multi-paragraph | Inject multiple paragraphs into a cell | `multi_paragraph_fills` | + +### Processing Order + +1. Simple replacements (longest-first to prevent partial matches) +2. Section replacements (clears entire cell of guide text) +3. Table cell fills (cellAddr-based lookup with flat-scan fallback) +4. Multi-paragraph fills (creates new `` elements) + +--- + +## fill_plan.json Schema + +```json +{ + "template_file": "/absolute/path/to/template.hwpx", + "output_file": "/absolute/path/to/output.hwpx", + + "simple_replacements": [ + { + "find": "OO기업", + "replace": "테스트기업 주식회사", + "occurrence": 1 + } + ], + + "section_replacements": [ + { + "section_id": "1-1", + "guide_text_prefix": "※ 과거 폐업 원인을", + "content": "Actual content replacing the guide text.", + "clear_cell": true + } + ], + + "table_cell_fills": [ + { + "find_label": "과제명", + "value": "AI 자세분석 플랫폼", + "target_offset": {"col": 1, "row": 0} + } + ], + + "multi_paragraph_fills": [ + { + "section_id": "1-1", + "guide_text_prefix": "※ 과거 폐업 원인을", + "paragraphs": [ + "First paragraph of content...", + "Second paragraph of content...", + "Third paragraph of content..." 
+ ] + } + ] +} +``` + +### Field Reference + +#### simple_replacements + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `find` | string | yes | - | Exact text to find in `<hp:t>` elements | +| `replace` | string | yes | - | Replacement text | +| `occurrence` | int | no | all | Limit to N replacements | + +#### section_replacements + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `section_id` | string | no | "?" | Section identifier for logging | +| `guide_text_prefix` | string | yes | - | Prefix of guide text to find | +| `content` | string | yes | - | Content to replace guide text | +| `clear_cell` | bool | no | true | Clear all other runs/paragraphs in the cell | + +#### table_cell_fills + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `find_label` | string | yes | - | Label text in the label cell | +| `value` | string | yes | - | Value to write in the target cell | +| `target_offset` | object | no | `{"col":1,"row":0}` | Column/row offset from label cell | + +#### multi_paragraph_fills + +| Field | Type | Required | Default | Description | +|-------|------|----------|---------|-------------| +| `section_id` | string | no | "?" | Section identifier for logging | +| `guide_text_prefix` | string | yes | - | Prefix to locate the target cell | +| `paragraphs` | string[] | yes | - | Array of paragraph texts | + +--- + +## CLI Interface + +```bash +# Inspection +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --inspect # List all text elements +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --inspect -q "text" # Search text elements +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --inspect-tables # Show table structure +python3 tools/md2hwp/fill_hwpx.py <template.hwpx> --analyze # Extract fillable field schema + +# Filling +python3 tools/md2hwp/fill_hwpx.py <fill_plan.json> +python3 tools/md2hwp/fill_hwpx.py <fill_plan.json> -o <output.hwpx> + +# Environment +MD2HWP_EVENT_FILE=/tmp/events.jsonl # Enable SSE event logging +``` + +--- + +## HWPX XML Reference + +### Element Hierarchy + +```xml +<hs:sec> + <hp:p> + <hp:run> + <hp:t>Text content</hp:t> + </hp:run> + ... + </hp:p> +</hs:sec> +``` + +### Table Cell Hierarchy + +```xml +<hp:tbl> + <hp:tr> + <hp:tc> + <hp:cellAddr colAddr="0" rowAddr="0"/> + <hp:cellSpan colSpan="1" rowSpan="1"/> + <hp:subList> + <hp:p> + <hp:run> + <hp:t>Cell text</hp:t> + </hp:run> + </hp:p> + ... + </hp:subList> + </hp:tc> + </hp:tr> +</hp:tbl> +``` + +### Namespaces + +| Prefix | URI | +|--------|-----| +| `hp` | `http://www.hancom.co.kr/hwpml/2011/paragraph` | +| `hs` | `http://www.hancom.co.kr/hwpml/2011/section` | +| `hc` | `http://www.hancom.co.kr/hwpml/2011/core` | +| `hh` | `http://www.hancom.co.kr/hwpml/2011/head` | +| `hp10` | `http://www.hancom.co.kr/hwpml/2016/paragraph` | + +--- + +## Target Template: 재도전성공패키지 + +Primary test template: `testdata/hwpx_20260302_200059.hwpx` + +### Structure + +- **28 tables**, 382 `<hp:t>` text elements +- 7 page limit (excl. TOC + appendix) + +### Sections + +| Section | Tables | Content Type | +|---------|--------|-------------| +| 과제 개요 | T5 (3x2) | 과제명, 기업명, 아이템 개요 | +| 폐업 이력 | T6 (14x4) | Repeatable rows (max 3) | +| 1. 문제인식 | T7-T9 | Guide text → multi-paragraph | +| 2. 실현가능성 | T10-T12 | Guide text → multi-paragraph | +| 3. 성장전략 | T13-T18 | Guide text + timeline table + budget table | +| 4. 
기업 구성 | T19-T22 | Team table + staffing plan | +| 가점/면제 | T23-T28 | Checklist + evidence placeholders | + +### Complex Tables + +- **T6** (폐업이력 14x4): colspan patterns, 3 repeatable company rows +- **T16** (실현일정 5x4): 4 data rows with deliverables +- **T18** (사업비 9x6): 3-level rowspan header, budget items +- **T21** (팀구성 5x8): Personnel roster with colspan +- **T24** (가점체크 14x4): Grouped checkbox items with rowspan + +### Template Constraints + +| Constraint | Value | +|-----------|-------| +| Page limit | 7 pages (excl. TOC + appendix) | +| Budget max | 100,000,000 KRW | +| Gov support | ≤75% of total | +| Cash contribution | ≥5% of total | +| In-kind contribution | ≤20% of total | +| Closure history | Max 3 companies (most recent) | +| PII masking | Required (name, gender, DOB, university) | + +--- + +## Known Gaps & Roadmap + +### P0: Must-have (blocking basic operation) + +| ID | Gap | Solution | +|----|-----|---------| +| P0-1 | section_replacements only replaces first `<hp:t>`, orphans rest | Cell-scoped clearing: replace first, remove other runs/paragraphs | +| P0-2 | table_cell_fills skips empty target cells (no `<hp:t>`) | cellAddr-based lookup + create `<hp:t>` in empty runs | +| P0-3 | --inspect lacks table/cell context | Add `[T2 R3 C1]` context + `--inspect-tables` mode | + +### P1: Quality improvements + +| ID | Gap | Solution | +|----|-----|---------| +| P1-4 | No multi-paragraph injection | New `multi_paragraph_fills` strategy, clones `<hp:p>` structure | +| P1-5 | No template schema extraction | `--analyze` mode outputs structured JSON of fillable fields | + +### P2: Nice-to-have + +| ID | Gap | Solution | +|----|-----|---------| +| P2-6 | No content validation | `--validate` mode checks char limits, budget math, required fields | + +--- + +## Helper Functions (Implementation Reference) + +These shared helpers are used across all strategies: + +```python +_build_parent_map(tree) -> dict # Element -> parent mapping +_get_ancestor(elem, tag, parent_map) # Walk up to find ancestor (tc, tbl, etc.) +_clear_cell_except(tc, keep_elem, pm) # Remove all runs/paragraphs except one +_find_cell_by_addr(tbl, col, row) # Find cell by cellAddr coordinates +_set_cell_text(tc, text) # Set cell text, create if needed +_get_table_index(tree, tbl) # Get table ordinal in document +``` + +--- + +## Testing Strategy + +### Unit Tests (per function) + +Each replacement strategy must have tests for: +- Normal case: text found and replaced +- Empty cell: target cell has no `<hp:t>` +- Multi-run guide text: guide text spans multiple `<hp:t>` elements +- Missing text: `find` text not in template (warning, no crash) +- Edge cases: colspan/rowspan cells, nested tables + +### Integration Tests + +- Full fill_plan.json → output.hwpx → hwp2md reverse → verify content +- Use `testdata/hwpx_20260302_200059.hwpx` as primary fixture + +### Test Fixtures Location + +``` +testdata/ +├── hwpx_20260302_200059.hwpx # Primary template +├── md2hwp-outputs/ # Test output directory +└── fill_plans/ # Test fill_plan.json files (to create) + ├── test_simple.json + ├── test_section.json + ├── test_table_cell.json + └── test_multi_paragraph.json +``` diff --git a/internal/cli/cli_test.go b/internal/cli/cli_test.go index 9389553..3d0edf8 100644 --- a/internal/cli/cli_test.go +++ b/internal/cli/cli_test.go @@ -2,7 +2,10 @@ package cli import ( "os" + "strings" "testing" + + "github.com/roboco-io/hwp2md/internal/ir" ) func TestSetVersion(t *testing.T) { @@ -238,3 +241,15 @@ func TestDetectProviderFromModel(t *testing.T) { }) } } + +func TestConvertToBasicMarkdown_TableCellParagraphsUseBreaks(t *testing.T) { + doc := ir.NewDocument() + table := ir.NewTable(1, 1) + table.Cells[0][0].Text = "문단1\n문단2\n문단3" + doc.AddTable(table) + + md := convertToBasicMarkdown(doc) + if !strings.Contains(md, "문단1<br>문단2<br>문단3") { + t.Fatalf("expected markdown table cell to preserve paragraph boundaries with <br>
, got: %s", md) + } +} diff --git a/internal/cli/convert.go b/internal/cli/convert.go index 02cdac0..8a8b6ea 100644 --- a/internal/cli/convert.go +++ b/internal/cli/convert.go @@ -492,7 +492,7 @@ func writeMarkdownTable(sb *strings.Builder, t *ir.TableBlock) { if ref.row >= 0 && ref.col >= 0 { if ref.row == i && ref.col == j { // This is the original cell - text = strings.ReplaceAll(t.Cells[i][j].Text, "\n", " ") + text = formatTableCellText(t.Cells[i][j].Text) } else if ref.row < i && ref.col == j { // Vertically merged cell (rowspan) - use 〃 text = "〃" @@ -517,6 +517,12 @@ sb.WriteString("\n") } +func formatTableCellText(text string) string { + text = strings.ReplaceAll(text, "\r\n", "\n") + text = strings.ReplaceAll(text, "\r", "\n") + return strings.ReplaceAll(text, "\n", "<br>
") +} + // isInfoBoxTable detects "info-box" style tables that should be converted to text format. // Pattern 1: A table with a title cell (containing brackets like [제목]) and a single content cell // that spans the full width and contains bullet-like content (○, ※, -, etc.) diff --git a/internal/parser/hwpx/parser.go b/internal/parser/hwpx/parser.go index 4b7e869..11d99f3 100644 --- a/internal/parser/hwpx/parser.go +++ b/internal/parser/hwpx/parser.go @@ -232,22 +232,12 @@ func (p *Parser) parseSectionXML(doc *ir.Document, decoder *xml.Decoder) error { // Text element - read content if currentParagraph != nil { text, _ := readElementText(decoder) - cell := getCurrentCell() - if cell != nil { - cell.text.WriteString(text) - } else { - currentParagraph.Text += text - } + currentParagraph.Text += text } case "tab": if currentParagraph != nil { - cell := getCurrentCell() - if cell != nil { - cell.text.WriteString("\t") - } else { - currentParagraph.Text += "\t" - } + currentParagraph.Text += "\t" } case "br": @@ -259,12 +249,7 @@ func (p *Parser) parseSectionXML(doc *ir.Document, decoder *xml.Decoder) error { } } if brType == "line" { - cell := getCurrentCell() - if cell != nil { - cell.text.WriteString("\n") - } else { - currentParagraph.Text += "\n" - } + currentParagraph.Text += "\n" } } diff --git a/internal/parser/hwpx/parser_test.go b/internal/parser/hwpx/parser_test.go index 368d0ca..5598f8a 100644 --- a/internal/parser/hwpx/parser_test.go +++ b/internal/parser/hwpx/parser_test.go @@ -276,6 +276,71 @@ func TestParser_ParseWithTable(t *testing.T) { } } +func TestParser_ParseWithTableCellMultiParagraph(t *testing.T) { + tmpDir := t.TempDir() + hwpxPath := filepath.Join(tmpDir, "table_multi_paragraph.hwpx") + + f, err := os.Create(hwpxPath) + if err != nil { + t.Fatalf("failed to create temp file: %v", err) + } + + w := zip.NewWriter(f) + + manifestContent := ` + + + + + +` + addZipFile(t, w, "content.hpf", []byte(manifestContent)) + + sectionContent := ` + + + + + 
문단1 + 문단2 + 문단3 + + + +` + addZipFile(t, w, "Contents/section0.xml", []byte(sectionContent)) + + if err := w.Close(); err != nil { + t.Fatalf("failed to close zip writer: %v", err) + } + if err := f.Close(); err != nil { + t.Fatalf("failed to close file: %v", err) + } + + p, err := New(hwpxPath, parser.Options{}) + if err != nil { + t.Fatalf("failed to create parser: %v", err) + } + defer p.Close() + + doc, err := p.Parse() + if err != nil { + t.Fatalf("failed to parse: %v", err) + } + + if len(doc.Content) != 1 || doc.Content[0].Table == nil { + t.Fatalf("expected one table block, got %+v", doc.Content) + } + + got := doc.Content[0].Table.Cells[0][0].Text + want := "문단1\n문단2\n문단3" + if got != want { + t.Errorf("expected multi-paragraph cell text %q, got %q", want, got) + } +} + func TestReadElementText(t *testing.T) { tests := []struct { name string diff --git "a/testdata/fill_plans/\354\236\254\353\217\204\354\240\204\354\204\261\352\263\265\355\214\250\355\202\244\354\247\200_sample.json" "b/testdata/fill_plans/\354\236\254\353\217\204\354\240\204\354\204\261\352\263\265\355\214\250\355\202\244\354\247\200_sample.json" new file mode 100644 index 0000000..8894b22 --- /dev/null +++ "b/testdata/fill_plans/\354\236\254\353\217\204\354\240\204\354\204\261\352\263\265\355\214\250\355\202\244\354\247\200_sample.json" @@ -0,0 +1,43 @@ +{ + "template_file": "testdata/hwpx_20260302_200059.hwpx", + "output_file": "/tmp/e2e_fill_test.hwpx", + "simple_replacements": [ + { + "find": "OO학과 교수 재직(00년)", + "replace": "테스트 경력 내용" + } + ], + "section_replacements": [ + { + "section_id": "1-1", + "guide_text_prefix": "※ 과거 폐업 원인을", + "content": "테스트 섹션 치환 내용", + "clear_cell": true + } + ], + "table_cell_fills": [ + { + "find_label": "과제명", + "value": "테스트 과제명" + }, + { + "find_label": "기업명", + "value": "테스트 기업명" + }, + { + "find_label": "아이템(서비스) 개요", + "value": "테스트 아이템 개요" + } + ], + "multi_paragraph_fills": [ + { + "section_id": "2-1", + "guide_text_prefix": "신청하기 이전까지", + 
"paragraphs": [ + "테스트 준비현황 문단 1", + "테스트 준비현황 문단 2", + "테스트 준비현황 문단 3" + ] + } + ] +} diff --git a/testdata/hwpx_20260302_200059.hwpx b/testdata/hwpx_20260302_200059.hwpx new file mode 100644 index 0000000..adfcb94 Binary files /dev/null and b/testdata/hwpx_20260302_200059.hwpx differ diff --git a/tests/e2e_test.go b/tests/e2e_test.go index f5f23bf..97c54ff 100644 --- a/tests/e2e_test.go +++ b/tests/e2e_test.go @@ -1,6 +1,7 @@ package tests import ( + "archive/zip" "os" "os/exec" "path/filepath" @@ -9,6 +10,63 @@ import ( "testing" ) +func createTempHWPXWithMultiParagraphCell(t *testing.T) string { + t.Helper() + + tempDir := t.TempDir() + hwpxPath := filepath.Join(tempDir, "table-cell-multi-paragraph.hwpx") + + file, err := os.Create(hwpxPath) + if err != nil { + t.Fatalf("failed to create temp hwpx: %v", err) + } + defer file.Close() + + zipWriter := zip.NewWriter(file) + defer zipWriter.Close() + + manifest := ` + + + + + + + +` + + section := ` + + + + + + 문단1 + 문단2 + 문단3 + + + + +` + + addZipEntry := func(name, content string) { + entry, createErr := zipWriter.Create(name) + if createErr != nil { + t.Fatalf("failed to create zip entry %s: %v", name, createErr) + } + if _, writeErr := entry.Write([]byte(content)); writeErr != nil { + t.Fatalf("failed to write zip entry %s: %v", name, writeErr) + } + } + + addZipEntry("content.hpf", manifest) + addZipEntry("Contents/section0.xml", section) + + return hwpxPath +} + // E2E Test for Stage 1: HWPX -> Basic Markdown // Verifies that converting testdata/한글 테스트.hwpx produces valid markdown with expected content @@ -41,6 +99,26 @@ func TestE2EStage1_HWPXToMarkdown(t *testing.T) { } } +func TestE2EStage1_HWPXTableCellParagraphBreaks(t *testing.T) { + inputFile := createTempHWPXWithMultiParagraphCell(t) + binPath, cleanup := buildTestBinary(t) + defer cleanup() + + cmd := exec.Command("./"+binPath, "convert", inputFile) + output, err := cmd.CombinedOutput() + if err != nil { + t.Fatalf("convert command failed: 
%v\noutput: %s", err, output) + } + + md := string(output) + if !strings.Contains(md, "문단1<br>문단2<br>문단3") { + t.Fatalf("expected table cell paragraph boundaries rendered with <br>
, got: %s", md) + } + if strings.Contains(md, "문단1문단2문단3") { + t.Fatalf("unexpected concatenated table cell text without boundaries: %s", md) + } +} + // validateStage1Output checks that Stage 1 (parser) output contains expected content func validateStage1Output(t *testing.T, md string) error { t.Helper() diff --git a/tests/test_fill_hwpx.py b/tests/test_fill_hwpx.py new file mode 100644 index 0000000..edff694 --- /dev/null +++ b/tests/test_fill_hwpx.py @@ -0,0 +1,505 @@ +import json +import subprocess +import sys +import zipfile +from pathlib import Path + +import pytest +from lxml import etree + +ROOT = Path(__file__).resolve().parents[1] +SCRIPT_PATH = ROOT / "tools" / "md2hwp" / "fill_hwpx.py" +sys.path.insert(0, str(SCRIPT_PATH.parent)) + +import fill_hwpx as fh # noqa: E402 + +NS = fh.HWPX_NS["hp"] +TEMPLATE_PATH = ROOT / "testdata" / "hwpx_20260302_200059.hwpx" +HWP2MD_BIN = ROOT / "bin" / "hwp2md" +SAMPLE_PLAN_PATH = ROOT / "testdata" / "fill_plans" / "재도전성공패키지_sample.json" + + +def _make_cell(col, row, text="", colspan=1, rowspan=1, with_text=True): + tc = etree.Element(f"{{{NS}}}tc") + sub = etree.SubElement(tc, f"{{{NS}}}subList") + p = etree.SubElement(sub, f"{{{NS}}}p") + p.set("paraPrIDRef", "31") + p.set("styleIDRef", "0") + run = etree.SubElement(p, f"{{{NS}}}run") + run.set("charPrIDRef", "31") + if with_text: + t = etree.SubElement(run, f"{{{NS}}}t") + t.text = text + etree.SubElement(p, f"{{{NS}}}linesegarray") + addr = etree.SubElement(tc, f"{{{NS}}}cellAddr") + addr.set("colAddr", str(col)) + addr.set("rowAddr", str(row)) + span = etree.SubElement(tc, f"{{{NS}}}cellSpan") + span.set("colSpan", str(colspan)) + span.set("rowSpan", str(rowspan)) + return tc + + +def _make_table(cells): + tbl = etree.Element(f"{{{NS}}}tbl") + row_map = {} + for tc in cells: + row = int(tc.find(f"./{fh.HP_CELLADDR_TAG}").get("rowAddr", "0")) + row_map.setdefault(row, []).append(tc) + for row_idx in sorted(row_map): + tr = etree.SubElement(tbl, f"{{{NS}}}tr") + for 
tc in sorted( + row_map[row_idx], + key=lambda cell: int(cell.find(f"./{fh.HP_CELLADDR_TAG}").get("colAddr", "0")), + ): + tr.append(tc) + return tbl + + +def _first_text(elem): + t = elem.find(f".//{fh.HP_T_TAG}") + return t.text if t is not None else None + + +def test_build_parent_map(): + root = etree.Element("root") + p = etree.SubElement(root, fh.HP_P_TAG) + run = etree.SubElement(p, fh.HP_RUN_TAG) + t = etree.SubElement(run, fh.HP_T_TAG) + t.text = "A" + parent_map = fh._build_parent_map(root) + assert parent_map[t] is run + assert parent_map[run] is p + + +def test_get_ancestor(): + root = etree.Element("root") + tbl = _make_table([_make_cell(0, 0, "LABEL"), _make_cell(1, 0, "VALUE")]) + root.append(tbl) + t = root.find(f".//{fh.HP_T_TAG}") + parent_map = fh._build_parent_map(root) + assert fh._get_ancestor(t, "tc", parent_map).tag == fh.HP_TC_TAG + assert fh._get_ancestor(t, "tbl", parent_map).tag == fh.HP_TBL_TAG + + +def test_find_cell_by_addr(): + tbl = _make_table([_make_cell(0, 0, "A"), _make_cell(1, 0, "B")]) + tc = fh._find_cell_by_addr(tbl, 1, 0) + assert tc is not None + assert _first_text(tc) == "B" + + +def test_find_cell_by_addr_ignores_nested_table(): + inner_cell = _make_cell(1, 0, "NESTED") + inner_tbl = _make_table([inner_cell]) + + outer_cell = _make_cell(0, 0, "OUTER") + outer_sub = outer_cell.find(f"./{fh.HP_SUBLIST_TAG}") + outer_sub.append(inner_tbl) + target = _make_cell(1, 0, "TARGET") + outer_tbl = _make_table([outer_cell, target]) + + result = fh._find_cell_by_addr(outer_tbl, 1, 0) + assert result is not None + assert _first_text(result) == "TARGET" + + +def test_set_cell_text_creates_hp_t_for_empty_cell(): + tc = _make_cell(0, 0, with_text=False) + assert tc.find(f".//{fh.HP_T_TAG}") is None + fh._set_cell_text(tc, "NEW") + assert _first_text(tc) == "NEW" + + +def test_clear_cell_except_removes_other_paragraphs_and_runs(): + tc = _make_cell(0, 0, "KEEP") + sub = tc.find(f"./{fh.HP_SUBLIST_TAG}") + p = sub.find(f"./{fh.HP_P_TAG}") 
+ extra_run = etree.SubElement(p, fh.HP_RUN_TAG) + extra_t = etree.SubElement(extra_run, fh.HP_T_TAG) + extra_t.text = "REMOVE-RUN" + extra_p = etree.SubElement(sub, fh.HP_P_TAG) + extra_run_2 = etree.SubElement(extra_p, fh.HP_RUN_TAG) + extra_t_2 = etree.SubElement(extra_run_2, fh.HP_T_TAG) + extra_t_2.text = "REMOVE-PARA" + + keep_elem = tc.find(f".//{fh.HP_T_TAG}") + root = etree.Element("root") + root.append(_make_table([tc])) + parent_map = fh._build_parent_map(root) + fh._clear_cell_except(tc, keep_elem, parent_map) + + texts = [(t.text or "") for t in tc.findall(f".//{fh.HP_T_TAG}")] + assert texts == ["KEEP"] + + +def test_get_table_index(): + root = etree.Element("root") + tbl0 = _make_table([_make_cell(0, 0, "A")]) + tbl1 = _make_table([_make_cell(0, 0, "B")]) + root.append(tbl0) + root.append(tbl1) + assert fh._get_table_index(root, tbl1) == 1 + + +@pytest.mark.parametrize( + ("text", "pattern"), + [ + ("OO기업", "OO"), + ("○○○", "○○○"), + ("0000원", "0000"), + ("1000", None), + ("3,448,000", None), + ("일반 텍스트", None), + ], +) +def test_detect_placeholder_pattern(text, pattern): + assert fh._detect_placeholder_pattern(text) == pattern + + +def test_load_plan_rejects_empty_find(tmp_path): + plan_path = tmp_path / "plan_empty_find.json" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(tmp_path / "out.hwpx"), + "simple_replacements": [{"find": "", "replace": "X"}], + } + plan_path.write_text(json.dumps(plan, ensure_ascii=False), encoding="utf-8") + with pytest.raises(ValueError, match="simple_replacements: 'find' must be non-empty"): + fh.load_plan(str(plan_path)) + + +def test_load_plan_rejects_empty_guide_text_prefix(tmp_path): + plan_path = tmp_path / "plan_empty_section_prefix.json" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(tmp_path / "out.hwpx"), + "section_replacements": [{"guide_text_prefix": "", "content": "X"}], + } + plan_path.write_text(json.dumps(plan, ensure_ascii=False), encoding="utf-8") + with 
pytest.raises(ValueError, match="section_replacements: 'guide_text_prefix' must be non-empty"): + fh.load_plan(str(plan_path)) + + +def test_load_plan_rejects_empty_find_label(tmp_path): + plan_path = tmp_path / "plan_empty_find_label.json" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(tmp_path / "out.hwpx"), + "table_cell_fills": [{"find_label": "", "value": "X"}], + } + plan_path.write_text(json.dumps(plan, ensure_ascii=False), encoding="utf-8") + with pytest.raises(ValueError, match="table_cell_fills: 'find_label' must be non-empty"): + fh.load_plan(str(plan_path)) + + +def test_load_plan_rejects_empty_multi_paragraph_prefix(tmp_path): + plan_path = tmp_path / "plan_empty_multi_prefix.json" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(tmp_path / "out.hwpx"), + "multi_paragraph_fills": [{"guide_text_prefix": "", "paragraphs": ["A"]}], + } + plan_path.write_text(json.dumps(plan, ensure_ascii=False), encoding="utf-8") + with pytest.raises(ValueError, match="multi_paragraph_fills: 'guide_text_prefix' must be non-empty"): + fh.load_plan(str(plan_path)) + + +def test_apply_simple_replacements_xml_basic_and_occurrence(): + root = etree.Element("root") + p1 = etree.SubElement(root, fh.HP_P_TAG) + run1 = etree.SubElement(p1, fh.HP_RUN_TAG) + t1 = etree.SubElement(run1, fh.HP_T_TAG) + t1.text = "AB AB" + p2 = etree.SubElement(root, fh.HP_P_TAG) + run2 = etree.SubElement(p2, fh.HP_RUN_TAG) + t2 = etree.SubElement(run2, fh.HP_T_TAG) + t2.text = "AB" + total = fh.apply_simple_replacements_xml( + root, + [{"find": "AB", "replace": "X", "occurrence": 2}], + ) + assert total == 2 + assert t1.text == "X AB" + assert t2.text == "X" + + +def test_apply_simple_replacements_xml_not_found_warning(capsys): + root = etree.Element("root") + p = etree.SubElement(root, fh.HP_P_TAG) + run = etree.SubElement(p, fh.HP_RUN_TAG) + t = etree.SubElement(run, fh.HP_T_TAG) + t.text = "hello" + total = fh.apply_simple_replacements_xml(root, 
[{"find": "missing", "replace": "X"}]) + captured = capsys.readouterr() + assert total == 0 + assert "WARNING" in captured.err + + +def test_apply_section_replacements_xml_clear_cell_true(): + root = etree.Element("root") + tc = _make_cell(0, 0, "※ guide text") + sub = tc.find(f"./{fh.HP_SUBLIST_TAG}") + extra_p = etree.SubElement(sub, fh.HP_P_TAG) + extra_run = etree.SubElement(extra_p, fh.HP_RUN_TAG) + extra_t = etree.SubElement(extra_run, fh.HP_T_TAG) + extra_t.text = "orphan" + root.append(_make_table([tc])) + + total = fh.apply_section_replacements_xml( + root, + [{"section_id": "1", "guide_text_prefix": "guide", "content": "NEW", "clear_cell": True}], + ) + texts = [(t.text or "") for t in tc.findall(f".//{fh.HP_T_TAG}")] + assert total == 1 + assert texts == ["NEW"] + + +def test_apply_section_replacements_xml_clear_cell_false(): + root = etree.Element("root") + tc = _make_cell(0, 0, "※ guide text") + sub = tc.find(f"./{fh.HP_SUBLIST_TAG}") + extra_p = etree.SubElement(sub, fh.HP_P_TAG) + extra_run = etree.SubElement(extra_p, fh.HP_RUN_TAG) + extra_t = etree.SubElement(extra_run, fh.HP_T_TAG) + extra_t.text = "keep-me" + root.append(_make_table([tc])) + + total = fh.apply_section_replacements_xml( + root, + [{"section_id": "1", "guide_text_prefix": "guide", "content": "NEW", "clear_cell": False}], + ) + texts = [(t.text or "") for t in tc.findall(f".//{fh.HP_T_TAG}")] + assert total == 1 + assert "NEW" in texts + assert "keep-me" in texts + + +def test_apply_table_cell_fills_xml_celladdr_lookup(): + root = etree.Element("root") + label = _make_cell(0, 0, "LABEL") + target = _make_cell(1, 0, "OLD") + root.append(_make_table([label, target])) + + total = fh.apply_table_cell_fills_xml(root, [{"find_label": "LABEL", "value": "VALUE"}]) + assert total == 1 + assert _first_text(target) == "VALUE" + + +def test_apply_table_cell_fills_xml_fills_empty_target_cell(): + root = etree.Element("root") + label = _make_cell(0, 0, "LABEL") + target = _make_cell(1, 0, 
with_text=False) + root.append(_make_table([label, target])) + + total = fh.apply_table_cell_fills_xml(root, [{"find_label": "LABEL", "value": "VALUE"}]) + assert total == 1 + assert _first_text(target) == "VALUE" + + +def test_apply_table_cell_fills_xml_fallback_does_not_cross_into_other_table(): + root = etree.Element("root") + root.append(_make_table([_make_cell(0, 0, "LABEL")])) + body_p = etree.SubElement(root, fh.HP_P_TAG) + body_run = etree.SubElement(body_p, fh.HP_RUN_TAG) + body_t = etree.SubElement(body_run, fh.HP_T_TAG) + body_t.text = "BODY_TEXT" + fallback_target = _make_cell(0, 0, "TARGET") + root.append(_make_table([fallback_target])) + + total = fh.apply_table_cell_fills_xml( + root, + [{"find_label": "LABEL", "value": "VALUE", "target_offset": {"col": 99, "row": 0}}], + ) + + assert total == 0 + assert body_t.text == "BODY_TEXT" + assert _first_text(fallback_target) == "TARGET" + + +def test_apply_multi_paragraph_fills_injects_multiple_paragraphs(): + root = etree.Element("root") + tc = _make_cell(0, 0, "※ TARGET") + root.append(_make_table([tc])) + total = fh.apply_multi_paragraph_fills( + root, + [ + { + "section_id": "2-1", + "guide_text_prefix": "TARGET", + "paragraphs": ["P1", "P2", "P3"], + } + ], + ) + sub = tc.find(f"./{fh.HP_SUBLIST_TAG}") + paragraphs = sub.findall(f"./{fh.HP_P_TAG}") + texts = ["".join((t.text or "") for t in p.findall(f".//{fh.HP_T_TAG}")) for p in paragraphs] + assert total == 1 + assert texts == ["P1", "P2", "P3"] + + +def test_apply_multi_paragraph_fills_deepcopy_preserves_style_and_isolation(): + root = etree.Element("root") + tc = _make_cell(0, 0, "※ TARGET") + root.append(_make_table([tc])) + + fh.apply_multi_paragraph_fills( + root, + [ + { + "section_id": "2-1", + "guide_text_prefix": "TARGET", + "paragraphs": ["A", "B"], + } + ], + ) + + sub = tc.find(f"./{fh.HP_SUBLIST_TAG}") + paragraphs = sub.findall(f"./{fh.HP_P_TAG}") + assert len(paragraphs) == 2 + assert len({id(p) for p in paragraphs}) == 2 + for p in 
paragraphs: + assert p.get("paraPrIDRef") == "31" + assert p.get("styleIDRef") == "0" + assert p.find(f"./{{{NS}}}linesegarray") is not None + run = p.find(f"./{fh.HP_RUN_TAG}") + assert run is not None and run.get("charPrIDRef") == "31" + + +def test_full_fill_cycle_real_template(tmp_path): + output = tmp_path / "filled.hwpx" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(output), + "simple_replacements": [{"find": "OO학과 교수 재직(00년)", "replace": "테스트 경력"}], + "section_replacements": [ + { + "section_id": "1-1", + "guide_text_prefix": "※ 과거 폐업 원인을", + "content": "섹션 테스트 내용", + "clear_cell": True, + } + ], + "table_cell_fills": [{"find_label": "과제명", "value": "테스트 과제명"}], + "multi_paragraph_fills": [ + { + "section_id": "2-1", + "guide_text_prefix": "신청하기 이전까지", + "paragraphs": ["문단1", "문단2", "문단3"], + } + ], + } + total = fh.fill_hwpx(plan, str(output)) + + assert total >= 4 + assert output.exists() + with zipfile.ZipFile(output) as zf: + names = zf.namelist() + assert "Contents/section0.xml" in names + xml = zf.read("Contents/section0.xml").decode("utf-8") + assert "테스트 과제명" in xml + assert "문단1" in xml + + +def test_inspect_cli_includes_table_context(): + proc = subprocess.run( + [sys.executable, str(SCRIPT_PATH), "--inspect", str(TEMPLATE_PATH), "-q", "과제명"], + check=True, + capture_output=True, + text=True, + ) + assert "[T4 R0 C0]" in proc.stdout + + +def test_analyze_cli_outputs_valid_schema(): + proc = subprocess.run( + [sys.executable, str(SCRIPT_PATH), "--analyze", str(TEMPLATE_PATH)], + check=True, + capture_output=True, + text=True, + ) + data = json.loads(proc.stdout) + assert data["total_text_elements"] == 382 + assert len(data["tables"]) == 28 + assert len(data["guide_texts"]) > 0 + assert len(data["placeholders"]) > 0 + + +def test_analyze_excludes_comma_formatted_amount_placeholders(): + data = fh.analyze_template(str(TEMPLATE_PATH)) + placeholder_texts = {entry["text"] for entry in data["placeholders"]} + assert "3,448,000" 
not in placeholder_texts + assert "7,652,000" not in placeholder_texts + assert "7,000,000" not in placeholder_texts + assert any("OO" in text or "○○" in text for text in placeholder_texts) + + +def test_full_fill_cycle_with_reverse_conversion_if_available(tmp_path): + if not HWP2MD_BIN.exists(): + pytest.skip("bin/hwp2md not found") + + output = tmp_path / "filled_reverse.hwpx" + plan = { + "template_file": str(TEMPLATE_PATH), + "output_file": str(output), + "table_cell_fills": [ + {"find_label": "과제명", "value": "역변환 과제명"}, + {"find_label": "기업명", "value": "역변환 기업명"}, + ], + } + fh.fill_hwpx(plan, str(output)) + + md_path = tmp_path / "verify.md" + subprocess.run([str(HWP2MD_BIN), str(output), "-o", str(md_path)], check=True) + md = md_path.read_text(encoding="utf-8") + assert "역변환 과제명" in md + assert "역변환 기업명" in md + + +def test_e2e_fill_with_sample_plan(tmp_path): + plan = json.loads(SAMPLE_PLAN_PATH.read_text(encoding="utf-8")) + output = tmp_path / "e2e_fill_test.hwpx" + plan["output_file"] = str(output) + plan_path = tmp_path / "sample_plan.json" + plan_path.write_text(json.dumps(plan, ensure_ascii=False), encoding="utf-8") + + subprocess.run([sys.executable, str(SCRIPT_PATH), str(plan_path)], check=True) + assert output.exists() + + with zipfile.ZipFile(output) as zf: + assert "Contents/section0.xml" in zf.namelist() + tree = etree.fromstring(zf.read("Contents/section0.xml")) + + # XML is the source of truth for paragraph structure in table cells. 
+ target_paragraphs = None + expected_paragraphs = [ + "테스트 준비현황 문단 1", + "테스트 준비현황 문단 2", + "테스트 준비현황 문단 3", + ] + for tc in tree.findall(f".//{fh.HP_TC_TAG}"): + paragraphs = [] + for p in tc.findall(f"./{fh.HP_SUBLIST_TAG}/{fh.HP_P_TAG}"): + text = "".join((t.text or "") for t in p.findall(f".//{fh.HP_T_TAG}")).strip() + if text: + paragraphs.append(text) + if paragraphs[:3] == expected_paragraphs: + target_paragraphs = paragraphs + break + assert target_paragraphs is not None, "multi_paragraph_fills content not found as separate XML paragraphs" + assert target_paragraphs[:3] == expected_paragraphs + + if HWP2MD_BIN.exists(): + verify_path = tmp_path / "e2e_verify.md" + subprocess.run([str(HWP2MD_BIN), str(output), "-o", str(verify_path)], check=True) + md = verify_path.read_text(encoding="utf-8") + # Reverse markdown check is for content presence only. + for text in expected_paragraphs: + assert text in md + assert "테스트 과제명" in md + assert "테스트 기업명" in md + else: + with zipfile.ZipFile(output) as zf: + xml = zf.read("Contents/section0.xml").decode("utf-8") + assert "테스트 과제명" in xml + assert "테스트 기업명" in xml diff --git a/tools/md2hwp-ui/renderer.py b/tools/md2hwp-ui/renderer.py new file mode 100644 index 0000000..ed2817b --- /dev/null +++ b/tools/md2hwp-ui/renderer.py @@ -0,0 +1,181 @@ +"""renderer.py - HWPX to HTML converter for browser preview. + +Parses HWPX (ZIP+XML) and generates HTML with data-idx attributes +on each text element for real-time highlight support. 
+""" + +import zipfile +from html import escape +from lxml import etree + +HP = "http://www.hancom.co.kr/hwpml/2011/paragraph" +HS = "http://www.hancom.co.kr/hwpml/2011/section" +NS = {"hp": HP, "hs": HS} + +_T = f"{{{HP}}}t" +_P = f"{{{HP}}}p" +_RUN = f"{{{HP}}}run" +_TBL = f"{{{HP}}}tbl" +_TR = f"{{{HP}}}tr" +_TC = f"{{{HP}}}tc" +_CELL_SPAN = f"{{{HP}}}cellSpan" +_CELL_SZ = f"{{{HP}}}cellSz" +_SUB_LIST = f"{{{HP}}}subList" +_LINE_BREAK = f"{{{HP}}}lineBreak" +_SEC = f"{{{HS}}}sec" + + +def render_hwpx_to_html(hwpx_path: str) -> tuple[str, int]: + """Convert HWPX file to HTML string. + + Returns (html_string, total_text_count). + Each <hp:t> gets a <span data-idx> for SSE targeting. + """ + xml_bytes = _extract_section_xml(hwpx_path) + root = etree.fromstring(xml_bytes) + ctx = {"idx": 0} + html = _render_element(root, ctx) + return html, ctx["idx"] + + +def _extract_section_xml(hwpx_path: str) -> bytes: + """Extract section0.xml from HWPX ZIP.""" + with zipfile.ZipFile(hwpx_path, "r") as zf: + for name in sorted(zf.namelist()): + if name.startswith("Contents/section") and name.endswith(".xml"): + return zf.read(name) + raise FileNotFoundError("No section XML found in HWPX") + + +def _render_element(elem, ctx: dict) -> str: + """Recursively render an XML element to HTML.""" + tag = _local_tag(elem) + + if tag == "sec": + return _render_children(elem, ctx) + if tag == "tbl": + return _render_table(elem, ctx) + if tag == "p": + return _render_paragraph(elem, ctx) + if tag == "run": + return _render_run(elem, ctx) + if tag == "t": + return _render_text(elem, ctx) + if tag == "lineBreak": + return "<br>" + if tag in ("subList",): + return _render_children(elem, ctx) + + return _render_children(elem, ctx) + + +def _render_children(elem, ctx: dict) -> str: + """Render all children of an element.""" + parts = [] + for child in elem: + parts.append(_render_element(child, ctx)) + return "".join(parts) + + +def _render_table(tbl, ctx: dict) -> str: + """Render <hp:tbl> as HTML <table>.""" + rows = tbl.findall(_TR) + if not rows: + return "" + + # Calculate column width ratios from first row + first_row_cells = rows[0].findall(_TC) + total_width = 0 + col_widths = [] + for cell in first_row_cells: + sz = cell.find(_CELL_SZ) + w = int(sz.get("width", "0")) if sz is not None else 0 + span = cell.find(_CELL_SPAN) + cs = int(span.get("colSpan", "1")) if span is not None else 1 + for _ in range(cs): + col_widths.append(w // cs if cs > 0 else w) + total_width += w + + html = '
<table>' + if total_width > 0 and col_widths: + html += "<colgroup>" + for w in col_widths: + pct = round(w / total_width * 100, 1) if total_width else 0 + html += f'<col style="width:{pct}%">' + html += "</colgroup>" + + for row in rows: + html += _render_row(row, ctx) + html += "
</table>" + return html + + +def _render_row(tr, ctx: dict) -> str: + """Render <hp:tr> as HTML <tr>.""" + cells = tr.findall(_TC) + html = "<tr>" + for cell in cells: + html += _render_cell(cell, ctx) + html += "</tr>" + return html + + +def _render_cell(tc, ctx: dict) -> str: + """Render <hp:tc> as HTML <td>.""" + span = tc.find(_CELL_SPAN) + cs = int(span.get("colSpan", "1")) if span is not None else 1 + rs = int(span.get("rowSpan", "1")) if span is not None else 1 + + is_header = tc.get("header") == "1" + tag = "th" if is_header else "td" + + attrs = "" + if cs > 1: + attrs += f' colspan="{cs}"' + if rs > 1: + attrs += f' rowspan="{rs}"' + + # Render cell content (paragraphs inside subList) + content = "" + sub_list = tc.find(_SUB_LIST) + if sub_list is not None: + content = _render_children(sub_list, ctx) + else: + content = _render_children(tc, ctx) + + return f"<{tag}{attrs}>{content}</{tag}>" + + +def _render_paragraph(p, ctx: dict) -> str: + """Render <hp:p> as HTML <p>.""" + content = _render_children(p, ctx) + if not content.strip(): + return "" + return f'<p>{content}</p>
' + + +def _render_run(run, ctx: dict) -> str: + """Render <hp:run> as inline content.""" + return _render_children(run, ctx) + + +def _render_text(t, ctx: dict) -> str: + """Render <hp:t> as <span data-idx>. + + Always increments idx (even for empty text) to stay in sync with + fill_hwpx.py's enumerate(get_all_text_elements(tree)). + """ + text = t.text or "" + idx = ctx["idx"] + ctx["idx"] += 1 + if not text: + return "" + return f'<span data-idx="{idx}">{escape(text)}</span>' + + +def _local_tag(elem) -> str: + """Get local tag name without namespace.""" + tag = elem.tag + if "}" in tag: + return tag.split("}", 1)[1] + return tag diff --git a/tools/md2hwp-ui/server.py b/tools/md2hwp-ui/server.py new file mode 100644 index 0000000..14aa5cf --- /dev/null +++ b/tools/md2hwp-ui/server.py @@ -0,0 +1,515 @@ +#!/usr/bin/env python3 +"""md2hwp-ui server — HWPX template viewer with real-time fill preview. + +Usage: + python3 server.py [--port 8080] + +Browse to http://localhost:8080 after starting. +""" + +import argparse +import json +import os +import re +import shutil +import tempfile +import time +import threading +from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler +from pathlib import Path +from urllib.parse import urlparse, parse_qs + +# Add renderer to path +import sys +sys.path.insert(0, str(Path(__file__).parent)) +from renderer import render_hwpx_to_html + +# Session state +STATE = { + "upload_dir": None, + "template_path": None, + "template_html": None, + "text_count": 0, + "output_path": None, + "event_file": None, +} + +EVENT_FILE_PATH = "/tmp/md2hwp-events.jsonl" + + +def _parse_multipart(body: bytes, boundary: bytes) -> tuple: + """Parse multipart form data, return (filename, file_bytes) or (None, None).""" + delimiter = b"--" + boundary + parts = body.split(delimiter) + + for part in parts: + if b"Content-Disposition" not in part: + continue + # Split headers from body at double newline + header_end = part.find(b"\r\n\r\n") + if header_end == -1: + continue + headers_raw = 
part[:header_end].decode("utf-8", errors="replace") + file_body = part[header_end + 4:] + # Remove trailing \r\n-- if present + if file_body.endswith(b"\r\n"): + file_body = file_body[:-2] + + # Extract filename from Content-Disposition + fn_match = re.search(r'filename="([^"]+)"', headers_raw) + if fn_match: + return fn_match.group(1), file_body + + return None, None + + +def init_session(): + """Initialize temp directory for uploads.""" + if STATE["upload_dir"] and os.path.exists(STATE["upload_dir"]): + shutil.rmtree(STATE["upload_dir"]) + STATE["upload_dir"] = tempfile.mkdtemp(prefix="md2hwp-ui-") + STATE["event_file"] = EVENT_FILE_PATH + # Clear event file + with open(EVENT_FILE_PATH, "w") as f: + f.write("") + + +class Handler(BaseHTTPRequestHandler): + def log_message(self, format, *args): + pass # Suppress default logging + + def do_GET(self): + path = urlparse(self.path).path + + if path == "/": + self._serve_html() + elif path == "/api/events": + self._serve_sse() + elif path.startswith("/api/download/"): + self._serve_download() + else: + self._respond(404, "Not found") + + def do_POST(self): + path = urlparse(self.path).path + + if path == "/api/upload": + self._handle_upload() + elif path == "/api/fill": + self._handle_fill() + else: + self._respond(404, "Not found") + + # --- Handlers --- + + def _serve_html(self): + self.send_response(200) + self.send_header("Content-Type", "text/html; charset=utf-8") + self.end_headers() + self.wfile.write(HTML_PAGE.encode("utf-8")) + + def _serve_sse(self): + self.send_response(200) + self.send_header("Content-Type", "text/event-stream") + self.send_header("Cache-Control", "no-cache") + self.send_header("Connection", "keep-alive") + self.send_header("Access-Control-Allow-Origin", "*") + self.end_headers() + + event_file = EVENT_FILE_PATH + last_pos = 0 + + # Start from end of file + if os.path.exists(event_file): + last_pos = os.path.getsize(event_file) + + try: + while True: + if os.path.exists(event_file): + size 
= os.path.getsize(event_file) + if size > last_pos: + with open(event_file, "r", encoding="utf-8") as f: + f.seek(last_pos) + new_lines = f.read() + last_pos = f.tell() + + for line in new_lines.strip().split("\n"): + if line.strip(): + self.wfile.write(f"data: {line}\n\n".encode()) + self.wfile.flush() + + # Heartbeat + self.wfile.write(b": heartbeat\n\n") + self.wfile.flush() + time.sleep(0.3) + except (BrokenPipeError, ConnectionResetError): + pass + + def _handle_upload(self): + content_type = self.headers.get("Content-Type", "") + if "multipart/form-data" not in content_type: + self._respond_json(400, {"error": "multipart/form-data required"}) + return + + content_length = int(self.headers.get("Content-Length", 0)) + body = self.rfile.read(content_length) + + # Extract boundary from Content-Type + boundary_match = re.search(r"boundary=(.+)", content_type) + if not boundary_match: + self._respond_json(400, {"error": "No boundary in Content-Type"}) + return + boundary = boundary_match.group(1).strip().encode() + + # Parse multipart: split by boundary, find file part + filename, file_data = _parse_multipart(body, boundary) + if not filename or file_data is None: + self._respond_json(400, {"error": "No file uploaded"}) + return + + init_session() + + # Save uploaded file + filename = os.path.basename(filename) + save_path = os.path.join(STATE["upload_dir"], filename) + with open(save_path, "wb") as f: + f.write(file_data) + + STATE["template_path"] = save_path + + # Render to HTML + try: + html, count = render_hwpx_to_html(save_path) + STATE["template_html"] = html + STATE["text_count"] = count + self._respond_json(200, { + "html": html, + "text_count": count, + "filename": filename, + "event_file": EVENT_FILE_PATH, + }) + except (BrokenPipeError, ConnectionResetError): + pass + except Exception as e: + try: + self._respond_json(500, {"error": str(e)}) + except (BrokenPipeError, ConnectionResetError): + pass + + def _handle_fill(self): + body = 
self.rfile.read(int(self.headers.get("Content-Length", 0))) + try: + plan = json.loads(body) + except json.JSONDecodeError: + self._respond_json(400, {"error": "Invalid JSON"}) + return + + if not STATE["template_path"]: + self._respond_json(400, {"error": "No template uploaded"}) + return + + # Set template and output in plan + output_name = "result_" + os.path.basename(STATE["template_path"]) + output_path = os.path.join(STATE["upload_dir"], output_name) + plan["template_file"] = STATE["template_path"] + plan["output_file"] = output_path + + # Save plan + plan_path = os.path.join(STATE["upload_dir"], "fill_plan.json") + with open(plan_path, "w", encoding="utf-8") as f: + json.dump(plan, f, ensure_ascii=False, indent=2) + + # Run fill_hwpx.py in background thread + def run_fill(): + fill_script = str(Path.home() / ".claude/skills/md2hwp/scripts/fill_hwpx.py") + env = os.environ.copy() + env["MD2HWP_EVENT_FILE"] = EVENT_FILE_PATH + import subprocess + result = subprocess.run( + [sys.executable, fill_script, plan_path], + env=env, capture_output=True, text=True, + ) + # Write done event + done_event = {"type": "done", "output": output_name, "log": result.stdout + result.stderr} + with open(EVENT_FILE_PATH, "a", encoding="utf-8") as f: + f.write(json.dumps(done_event, ensure_ascii=False) + "\n") + STATE["output_path"] = output_path + + threading.Thread(target=run_fill, daemon=True).start() + self._respond_json(200, {"status": "started", "plan_path": plan_path}) + + def _serve_download(self): + filename = urlparse(self.path).path.split("/api/download/", 1)[-1] + filepath = os.path.join(STATE["upload_dir"] or "", filename) + + if not os.path.exists(filepath): + self._respond(404, "File not found") + return + + self.send_response(200) + self.send_header("Content-Type", "application/octet-stream") + self.send_header("Content-Disposition", f'attachment; filename="{filename}"') + self.send_header("Content-Length", str(os.path.getsize(filepath))) + self.end_headers() + with
open(filepath, "rb") as f: + shutil.copyfileobj(f, self.wfile) + + # --- Helpers --- + + def _respond(self, code, text): + self.send_response(code) + self.send_header("Content-Type", "text/plain; charset=utf-8") + self.end_headers() + self.wfile.write(text.encode("utf-8")) + + def _respond_json(self, code, data): + self.send_response(code) + self.send_header("Content-Type", "application/json; charset=utf-8") + self.end_headers() + self.wfile.write(json.dumps(data, ensure_ascii=False).encode("utf-8")) + + +# ===== Inline HTML/CSS/JS ===== + +HTML_PAGE = """ + + + + +md2hwp Viewer + + + + +
+

md2hwp Viewer

+
+ + +
+
+ +
+
+ + HWPX 파일을 드래그하거나 클릭하여 업로드 +
+
+ +
+ +
+
HWPX 파일을 업로드하면 미리보기가 표시됩니다
+
+ +
+ 대기 중 + +
+ + + + + +""" + + +def main(): + parser = argparse.ArgumentParser(description="md2hwp Viewer Server") + parser.add_argument("--port", type=int, default=8080, help="Port (default: 8080)") + args = parser.parse_args() + + init_session() + + server = ThreadingHTTPServer(("127.0.0.1", args.port), Handler) + server.daemon_threads = True + print(f"md2hwp Viewer running at http://localhost:{args.port}") + print(f"Event file: {EVENT_FILE_PATH}") + print("Press Ctrl+C to stop") + + try: + server.serve_forever() + except KeyboardInterrupt: + print("\nStopping...") + server.server_close() + if STATE["upload_dir"] and os.path.exists(STATE["upload_dir"]): + shutil.rmtree(STATE["upload_dir"]) + + +if __name__ == "__main__": + main() diff --git a/tools/md2hwp/compile_schema.py b/tools/md2hwp/compile_schema.py new file mode 100644 index 0000000..6bc43e9 --- /dev/null +++ b/tools/md2hwp/compile_schema.py @@ -0,0 +1,255 @@ +"""Compile 사업계획서 schema JSON → fill_plan.json for fill_hwpx.py""" +import json +import sys +import re + +GUIDE_TEXT_MAP = { + "1-1_폐업원인분석": "※ 과거 폐업 원인을", + "1-2_목표시장": "재창업 아이템 진출 목표시장", + "2-1_준비현황": "신청하기 이전까지", + "2-2_구체화방안": "재창업 아이템의 핵심 기능", + "3-1_비즈니스모델": "재창업 아이템의 가치 전달", + "3-2_사업화전략": "정의된 목표시장(고객)에 진입", + "4-1_보유역량": "대표자 및 조직이 사업화를 위해", + "4-2_조직구성계획": "협약기간 동안 조직 구성", +} + +# Template placeholder text → replacement mapping +TIMELINE_PLACEHOLDERS = [ + ("시제품 개발·개선 완료", "목표"), + ("개발 또는 개선 하려는 내용", "세부내용"), + ("웹사이트 오픈", "목표"), + ("웹사이트 기능 및 용도 등", "세부내용"), + ("시장 검증", "목표"), + ("검증 대상 및 방법 등", "세부내용"), +] + +BUDGET_PLACEHOLDERS = [ + ("DMD소켓 구입(00개×0000원)", 0), # row index for 재료비 line 1 + ("전원IC류 구입(00개×000원)", 1), # row index for 재료비 line 2 + ("시금형제작 외주용역", 2), # 외주용역비 + ("1명 x 0개월 x 0000원", 3), # 인건비 +] + +BUDGET_AMOUNT_PLACEHOLDERS = [ + ("3,448,000", 0, "정부지원"), + ("7,652,000", 1, "정부지원"), + ("7,000,000", 2, "현금"), + ("3,000,000", 3, "현물"), +] + +TEAM_PLACEHOLDERS = [ + ("공동대표", 0, "직위"), + ("S/W 개발 총괄", 0, "담당업무"), + ("OO학과 교수 재직(00년)", 0, 
"보유역량"), + ("팀장", 1, "직위"), + ("홍보 및 마케팅", 1, "담당업무"), + ("OO학과 전공, 관련 경력(00년 이상)", 1, "보유역량"), +] + + +def compile_schema(schema, template_path, output_path): + fill_plan = { + "template_file": template_path, + "output_file": output_path, + "simple_replacements": [], + "table_cell_fills": [], + "multi_paragraph_fills": [], + } + + # --- meta → table_cell_fills --- + meta = schema["meta"] + fill_plan["table_cell_fills"].extend([ + {"find_label": "과제명", "value": meta["과제명"]}, + {"find_label": "기업명", "value": meta["기업명"]}, + {"find_label": "아이템(서비스) 개요", "value": meta["아이템_개요"]}, + ]) + + # --- sections → multi_paragraph_fills --- + for key, section in schema["sections"].items(): + if key not in GUIDE_TEXT_MAP: + print(f"WARN: Unknown section key: {key}", file=sys.stderr) + continue + paragraphs = section.get("paragraphs", []) + if not paragraphs: + print(f"WARN: Empty paragraphs for {key}", file=sys.stderr) + continue + fill_plan["multi_paragraph_fills"].append({ + "section_id": key.split("_")[0], + "guide_text_prefix": GUIDE_TEXT_MAP[key], + "paragraphs": paragraphs, + }) + + # --- 폐업이력 → simple_replacements (1st company only) --- + closure = schema.get("폐업이력", {}) + total_count = closure.get("총_폐업횟수", 0) + if total_count > 0: + fill_plan["simple_replacements"].append( + {"find": "폐업 이력(총 폐업 횟수 : 0회)", "replace": f"폐업 이력(총 폐업 횟수 : {total_count}회)"} + ) + companies = closure.get("companies", []) + if companies: + c = companies[0] + mappings = [ + ("○○○○", c.get("기업명", "")), + ("개인 / 법인", c.get("기업구분", "")), + ("2000.00.00.(개업연월일 또는 회사성립연월일)~2000.00.00.(폐업일)", c.get("사업기간", "")), + ("아이템 간략히 소개", c.get("아이템_개요", "")), + ("폐업을 하게된 원인 및 사유 등을 간략기 기재", c.get("폐업원인", "")), + ] + for find, replace in mappings: + if replace: + fill_plan["simple_replacements"].append({"find": find, "replace": replace}) + + # --- 추진일정 → simple_replacements --- + timeline = schema.get("추진일정", {}) + rows = timeline.get("rows", []) + for i, row in enumerate(rows): + if i < 
len(TIMELINE_PLACEHOLDERS) // 2: + goal_ph = TIMELINE_PLACEHOLDERS[i * 2] + detail_ph = TIMELINE_PLACEHOLDERS[i * 2 + 1] + if row.get("목표"): + fill_plan["simple_replacements"].append({"find": goal_ph[0], "replace": row["목표"]}) + if row.get("세부내용"): + fill_plan["simple_replacements"].append({"find": detail_ph[0], "replace": row["세부내용"]}) + # Timeline dates: "~ 00월" (rows 1,2), "~00월" (row 3) + date_placeholders = ["~ 00월", "~ 00월", "~00월"] + for i, row in enumerate(rows): + if i < len(date_placeholders) and row.get("일정"): + fill_plan["simple_replacements"].append({ + "find": date_placeholders[i], + "replace": row["일정"], + "occurrence": 1, + }) + + # --- 사업비 → simple_replacements --- + budget = schema.get("사업비", {}) + budget_rows = budget.get("rows", []) + for i, row in enumerate(budget_rows): + if i < len(BUDGET_PLACEHOLDERS): + ph_text, _ = BUDGET_PLACEHOLDERS[i][:2] + if row.get("산출근거"): + # Strip leading "• " if present for matching + replace_text = row["산출근거"] + if not replace_text.startswith("•"): + replace_text = "• " + replace_text + fill_plan["simple_replacements"].append({"find": ph_text, "replace": replace_text.lstrip("• ")}) + + # Budget amounts + for ph_amount, row_idx, col_type in BUDGET_AMOUNT_PLACEHOLDERS: + if row_idx < len(budget_rows): + row = budget_rows[row_idx] + amount_key = {"정부지원": "정부지원", "현금": "현금", "현물": "현물"}.get(col_type) + if amount_key and row.get(amount_key, 0) > 0: + fill_plan["simple_replacements"].append({ + "find": ph_amount, + "replace": f"{row[amount_key]:,}", + }) + + # --- 인력현황 → simple_replacements --- + team = schema.get("인력현황", {}) + members = team.get("members", []) + current_count = team.get("현재_재직인원", 0) + planned_hire = team.get("추가_고용계획", 0) + + for ph_text, member_idx, field in TEAM_PLACEHOLDERS: + if member_idx < len(members): + value = members[member_idx].get(field, "") + if value: + fill_plan["simple_replacements"].append({"find": ph_text, "replace": value}) + + # Personnel count replacements + if current_count > 
 0: + fill_plan["simple_replacements"].append({"find": "00\n명\n추가 고용계획", "replace": f"{current_count:02d}\n명\n추가 고용계획", "occurrence": 1}) + + return fill_plan + + +def preflight(schema): + """Preflight validation. Returns (errors, warnings).""" + errors = [] + warnings = [] + + # Required fields + meta = schema.get("meta", {}) + for field in ["과제명", "기업명", "아이템_개요"]: + if not meta.get(field): + errors.append(f"BLOCK: meta.{field} 비어있음") + + # Sections not empty + for key, section in schema.get("sections", {}).items(): + if not section.get("paragraphs"): + errors.append(f"BLOCK: sections.{key}.paragraphs 비어있음") + + # Placeholder remnants - check only meta and sections paragraphs + placeholder_re = re.compile(r'(?:OO[^대학]|(?<!\d)0{3,}(?!\d))') + check_texts = [str(v) for v in meta.values()] + for section in schema.get("sections", {}).values(): + check_texts.extend(section.get("paragraphs", [])) + for text in check_texts: + match = placeholder_re.search(text) + if match: + warnings.append(f"WARN: placeholder 의심 '{match.group(0)}' 잔존: {text[:30]}") + + # Budget limit and ratio checks + budget_rows = schema.get("사업비", {}).get("rows", []) + total_gov = sum(r.get("정부지원", 0) for r in budget_rows) + total_cash = sum(r.get("현금", 0) for r in budget_rows) + total_inkind = sum(r.get("현물", 0) for r in budget_rows) + total = total_gov + total_cash + total_inkind + if total_gov > 100_000_000: + errors.append(f"BLOCK: 정부지원 {total_gov:,}원 > 1억원 한도 초과") + if total > 0: + if total_gov / total > 0.75: + errors.append(f"BLOCK: 정부지원 비율 {total_gov/total:.1%} > 75%") + if total_cash / total < 0.05: + warnings.append(f"WARN: 현금 비율 {total_cash/total:.1%} < 5% (권장)") + if total_inkind / total > 0.20: + warnings.append(f"WARN: 현물 비율 {total_inkind/total:.1%} > 20%") + + # Short paragraphs + for key, section in schema.get("sections", {}).items(): + for i, p in enumerate(section.get("paragraphs", [])): + if len(p) < 10: + warnings.append(f"WARN: sections.{key}.paragraphs[{i}] 길이 {len(p)}자 (너무 짧음)") + + return errors, warnings + + +if __name__ == "__main__": + input_file = sys.argv[1] + template = sys.argv[2] if len(sys.argv) > 2 else "testdata/hwpx_20260302_200059.hwpx" + output = sys.argv[3] if len(sys.argv) > 3 else "/tmp/compiled_output.hwpx" + + with open(input_file) as f: + schema = json.load(f) + + # Preflight + print("=== Preflight 검증 ===") + errors, warnings = preflight(schema) + for e in errors: + print(f" {e}") + for w in warnings: + print(f" {w}") + + if errors: + print(f"\n{len(errors)} BLOCK 에러 발견. 
수정 후 재시도하세요.") + sys.exit(1) + + print(f" 결과: {len(errors)} 에러, {len(warnings)} 경고\n") + + # Compile + fill_plan = compile_schema(schema, template, output) + + output_json = "/tmp/compiled_fill_plan.json" + with open(output_json, "w") as f: + json.dump(fill_plan, f, ensure_ascii=False, indent=2) + + print(f"=== fill_plan.json 컴파일 완료 ===") + print(f" table_cell_fills: {len(fill_plan['table_cell_fills'])}개") + print(f" multi_paragraph_fills: {len(fill_plan['multi_paragraph_fills'])}개") + print(f" simple_replacements: {len(fill_plan['simple_replacements'])}개") + print(f" 출력: {output_json}") diff --git a/tools/md2hwp/fill_hwpx.py b/tools/md2hwp/fill_hwpx.py new file mode 100644 index 0000000..98fe975 --- /dev/null +++ b/tools/md2hwp/fill_hwpx.py @@ -0,0 +1,783 @@ +#!/usr/bin/env python3 +"""fill_hwpx.py - Template Injection for HWPX files. + +Reads a fill_plan.json and applies text replacements to an HWPX template, +preserving all original formatting (cell sizes, merge patterns, styles). + +Uses direct XML manipulation (zipfile + lxml) to handle ALL text elements +including those inside table cells, which python-hwpx's iter_runs() misses. 
+ +Usage: + python3 fill_hwpx.py <fill_plan.json> + python3 fill_hwpx.py <fill_plan.json> -o <output.hwpx> + python3 fill_hwpx.py --inspect <template.hwpx> # List all text runs + python3 fill_hwpx.py --inspect <template.hwpx> -q <text> # Search for text + python3 fill_hwpx.py --inspect-tables <template.hwpx> # Show table structure + python3 fill_hwpx.py --analyze <template.hwpx> # Output fillable schema +""" + +import json +import sys +import os +import argparse +import re +import shutil +import tempfile +import zipfile +from copy import deepcopy +from pathlib import Path + +try: + from lxml import etree +except ImportError: + import xml.etree.ElementTree as etree + print("WARNING: lxml not installed, using stdlib xml.etree (less robust)", file=sys.stderr) + +# HWPX namespaces +HWPX_NS = { + "hp": "http://www.hancom.co.kr/hwpml/2011/paragraph", + "hs": "http://www.hancom.co.kr/hwpml/2011/section", + "hc": "http://www.hancom.co.kr/hwpml/2011/core", + "hh": "http://www.hancom.co.kr/hwpml/2011/head", + "ha": "http://www.hancom.co.kr/hwpml/2011/app", + "hp10": "http://www.hancom.co.kr/hwpml/2016/paragraph", +} + +HP_T_TAG = f"{{{HWPX_NS['hp']}}}t" +HP_TC_TAG = f"{{{HWPX_NS['hp']}}}tc" +HP_TBL_TAG = f"{{{HWPX_NS['hp']}}}tbl" +HP_P_TAG = f"{{{HWPX_NS['hp']}}}p" +HP_RUN_TAG = f"{{{HWPX_NS['hp']}}}run" +HP_SUBLIST_TAG = f"{{{HWPX_NS['hp']}}}subList" +HP_CELLADDR_TAG = f"{{{HWPX_NS['hp']}}}cellAddr" +HP_CELLSPAN_TAG = f"{{{HWPX_NS['hp']}}}cellSpan" + +# Event logging for real-time UI +EVENT_FILE = os.environ.get("MD2HWP_EVENT_FILE") +PLACEHOLDER_PATTERNS = [r"OO+", r"○{2,}"] +NUMERIC_PLACEHOLDER_PATTERN = r"(?<!\d)(0{2,})(?!\d)" + + +def _log_event(event: dict) ->
None: + """Append event to JSONL file for SSE streaming.""" + if not EVENT_FILE: + return + with open(EVENT_FILE, "a", encoding="utf-8") as f: + f.write(json.dumps(event, ensure_ascii=False) + "\n") + + +def _build_parent_map(tree) -> dict: + """Build element-to-parent mapping for ancestor traversal.""" + parent_map = {} + for parent in tree.iter(): + for child in parent: + parent_map[child] = parent + return parent_map + + +def _local_name(tag: str) -> str: + """Extract local tag name from a namespaced XML tag.""" + if "}" in tag: + return tag.split("}", 1)[1] + return tag + + +def _get_ancestor(elem, tag_local: str, parent_map: dict): + """Walk up parent chain to find ancestor by local tag name.""" + current = parent_map.get(elem) + while current is not None: + tag = current.tag if isinstance(current.tag, str) else "" + if _local_name(tag) == tag_local: + return current + current = parent_map.get(current) + return None + + +def _find_cell_by_addr(tbl, col: int, row: int): + """Find by its coordinates.""" + for tr in tbl: + for tc in tr: + if tc.tag != HP_TC_TAG: + continue + cell_addr = tc.find(f"./{HP_CELLADDR_TAG}") + if cell_addr is None: + continue + try: + col_addr = int(cell_addr.get("colAddr", "-1")) + row_addr = int(cell_addr.get("rowAddr", "-1")) + except ValueError: + continue + if col_addr == col and row_addr == row: + return tc + return None + + +def _set_cell_text(tc, text: str) -> None: + """Set cell text, creating in first when absent.""" + run = tc.find(f".//{HP_RUN_TAG}") + if run is None: + sub_list = tc.find(f"./{HP_SUBLIST_TAG}") + if sub_list is None: + sub_list = etree.Element(HP_SUBLIST_TAG) + tc.insert(0, sub_list) + paragraph = sub_list.find(f"./{HP_P_TAG}") + if paragraph is None: + paragraph = etree.Element(HP_P_TAG) + sub_list.append(paragraph) + run = etree.Element(HP_RUN_TAG) + paragraph.append(run) + + text_elem = run.find(f"./{HP_T_TAG}") + if text_elem is None: + text_elem = etree.Element(HP_T_TAG) + run.append(text_elem) + + 
text_elem.text = text + for child in list(text_elem): + text_elem.remove(child) + + +def _clear_cell_except(tc, keep_elem, parent_map: dict) -> None: + """Clear a cell except the run/paragraph containing keep_elem.""" + keep_run = _get_ancestor(keep_elem, "run", parent_map) + keep_paragraph = _get_ancestor(keep_elem, "p", parent_map) + sub_list = tc.find(f"./{HP_SUBLIST_TAG}") + + if sub_list is not None: + for paragraph in list(sub_list.findall(f"./{HP_P_TAG}")): + paragraph_parent = parent_map.get(paragraph) + if paragraph is not keep_paragraph: + if paragraph_parent is not None: + paragraph_parent.remove(paragraph) + continue + + for run in list(paragraph.findall(f"./{HP_RUN_TAG}")): + if run is keep_run: + continue + paragraph.remove(run) + + if keep_run is not None: + for text_elem in list(keep_run.findall(f"./{HP_T_TAG}")): + if text_elem is keep_elem: + continue + keep_run.remove(text_elem) + + +def _get_table_index(tree, tbl) -> int: + """Return ordinal table index in document tree.""" + for idx, candidate in enumerate(tree.findall(f".//{HP_TBL_TAG}")): + if candidate is tbl: + return idx + return -1 + + +def load_plan(plan_path: str) -> dict: + """Load and validate fill_plan.json.""" + with open(plan_path, encoding="utf-8") as f: + plan = json.load(f) + + required_keys = ["template_file", "output_file"] + for key in required_keys: + if key not in plan: + raise ValueError(f"Missing required key in fill_plan.json: {key}") + + if not os.path.exists(plan["template_file"]): + raise FileNotFoundError(f"Template file not found: {plan['template_file']}") + + for r in plan.get("simple_replacements", []): + if not r.get("find"): + raise ValueError("simple_replacements: 'find' must be non-empty") + + for r in plan.get("section_replacements", []): + if not r.get("guide_text_prefix"): + raise ValueError("section_replacements: 'guide_text_prefix' must be non-empty") + + for r in plan.get("table_cell_fills", []): + if not r.get("find_label"): + raise 
ValueError("table_cell_fills: 'find_label' must be non-empty") + + for r in plan.get("multi_paragraph_fills", []): + if not r.get("guide_text_prefix"): + raise ValueError("multi_paragraph_fills: 'guide_text_prefix' must be non-empty") + if not r.get("paragraphs"): + raise ValueError("multi_paragraph_fills: 'paragraphs' must be non-empty") + + return plan + + +def find_section_xmls(hwpx_path: str) -> list[str]: + """Find all section XML files in HWPX archive.""" + sections = [] + with zipfile.ZipFile(hwpx_path, "r") as zf: + for name in zf.namelist(): + if name.startswith("Contents/section") and name.endswith(".xml"): + sections.append(name) + sections.sort() + return sections + + +def get_all_text_elements(tree) -> list: + """Get all text elements from an XML tree.""" + return tree.findall(f".//{HP_T_TAG}") + + +def apply_simple_replacements_xml(tree, replacements: list) -> int: + """Apply exact text match replacements on XML tree. + + Each replacement: {"find": str, "replace": str, "occurrence"?: int} + Replacements are sorted by find text length (longest first) to prevent + shorter matches from breaking longer ones. + """ + total = 0 + text_elements = get_all_text_elements(tree) + + # Sort by find text length descending to avoid partial match conflicts + sorted_replacements = sorted(replacements, key=lambda r: len(r["find"]), reverse=True) + + for r in sorted_replacements: + find_text = r["find"] + replace_text = r["replace"] + limit = r.get("occurrence") + count = 0 + + for elem_idx, elem in enumerate(text_elements): + if elem.text and find_text in elem.text: + elem.text = elem.text.replace(find_text, replace_text, 1) + count += 1 + _log_event({"type": "replace", "idx": elem_idx, "find": find_text, "replace": replace_text}) + if limit and count >= limit: + break + + total += count + find_display = find_text[:50] + ("..." if len(find_text) > 50 else "") + replace_display = replace_text[:50] + ("..." 
if len(replace_text) > 50 else "") + if count == 0: + print(f" WARNING: '{find_display}' not found", file=sys.stderr) + else: + print(f" Replaced '{find_display}' -> '{replace_display}' ({count}x)") + + return total + + +def apply_section_replacements_xml(tree, replacements: list) -> int: + """Replace guide text with actual content on XML tree. + + Each replacement: {"section_id": str, "guide_text_prefix": str, "content": str} + """ + parent_map = _build_parent_map(tree) + text_elements = get_all_text_elements(tree) + total = 0 + + for r in replacements: + prefix = r["guide_text_prefix"] + content = r["content"] + section_id = r.get("section_id", "?") + clear_cell = r.get("clear_cell", True) + replaced = False + + for elem_idx, elem in enumerate(text_elements): + if elem.text and prefix in elem.text: + elem.text = content + for child in list(elem): + elem.remove(child) + + if clear_cell: + cell = _get_ancestor(elem, "tc", parent_map) + if cell is not None: + _clear_cell_except(cell, elem, parent_map) + + total += 1 + replaced = True + _log_event({"type": "replace", "idx": elem_idx, "find": prefix, "replace": content}) + if clear_cell: + print(f" Section {section_id}: replaced guide text (cell cleared)") + else: + print(f" Section {section_id}: replaced guide text") + break + + if not replaced: + prefix_display = prefix[:50] + ("..." 
if len(prefix) > 50 else "") + print( + f" WARNING: Section {section_id} guide text not found: '{prefix_display}'", + file=sys.stderr, + ) + + return total + + +def _find_label_matches(text_elements: list, label: str) -> list[tuple[int, object]]: + """Find label matches, preferring exact text matches over contains matches.""" + exact_matches = [(i, elem) for i, elem in enumerate(text_elements) if elem.text and elem.text.strip() == label] + if exact_matches: + return exact_matches + return [(i, elem) for i, elem in enumerate(text_elements) if elem.text and label in elem.text] + + +def _try_celladdr_fill( + tree, + label_idx: int, + label: str, + value: str, + label_cell, + label_addr, + offset_col: int, + offset_row: int, + parent_map: dict, +) -> bool: + """Try table fill using cellAddr coordinates and target offset.""" + label_tbl = _get_ancestor(label_cell, "tbl", parent_map) + if label_tbl is None or label_addr is None: + return False + + try: + label_col = int(label_addr.get("colAddr", "-1")) + label_row = int(label_addr.get("rowAddr", "-1")) + except ValueError: + label_col = -1 + label_row = -1 + + target_col = label_col + offset_col + target_row = label_row + offset_row + target_cell = _find_cell_by_addr(label_tbl, target_col, target_row) + if target_cell is None: + return False + + _set_cell_text(target_cell, value) + table_idx = _get_table_index(tree, label_tbl) + _log_event({"type": "replace", "idx": label_idx, "find": label, "replace": value}) + value_display = value[:40] + ("..." 
if len(value) > 40 else "") + print(f" Table cell '{label}' -> '{value_display}' (T{table_idx} R{target_row} C{target_col})") + return True + + +def _try_fallback_fill( + text_elements: list, + start_idx: int, + label: str, + value: str, + label_cell, + parent_map: dict, +) -> bool: + """Try table fill by scanning nearby text elements within the same table.""" + label_tbl = _get_ancestor(label_cell, "tbl", parent_map) + for j in range(start_idx + 1, min(start_idx + 50, len(text_elements))): + next_elem = text_elements[j] + next_cell = _get_ancestor(next_elem, "tc", parent_map) + next_tbl = _get_ancestor(next_cell, "tbl", parent_map) if next_cell is not None else None + if next_cell is None or next_cell is label_cell or next_tbl is not label_tbl: + continue + + next_elem.text = value + for child in list(next_elem): + next_elem.remove(child) + _log_event({"type": "replace", "idx": j, "find": label, "replace": value}) + value_display = value[:40] + ("..." if len(value) > 40 else "") + print(f" Table cell '{label}' -> '{value_display}' (fallback)") + return True + return False + + +def apply_table_cell_fills_xml(tree, fills: list) -> int: + """Fill table cells by finding label text and replacing the adjacent value cell. + + Each fill: {"find_label": str, "value": str} + + Primary strategy: cellAddr-based table lookup by offset. + Fallback strategy: flat scan for next text element in a different cell. 
+ """ + total = 0 + parent_map = _build_parent_map(tree) + text_elements = get_all_text_elements(tree) + + for fill in fills: + label = fill["find_label"] + value = fill["value"] + offset = fill.get("target_offset", {"col": 1, "row": 0}) + offset_col = int(offset.get("col", 1)) + offset_row = int(offset.get("row", 0)) + found = False + matches = _find_label_matches(text_elements, label) + + for i, elem in matches: + label_cell = _get_ancestor(elem, "tc", parent_map) + if label_cell is None: + continue + + label_addr = label_cell.find(f"./{HP_CELLADDR_TAG}") + if _try_celladdr_fill(tree, i, label, value, label_cell, label_addr, offset_col, offset_row, parent_map): + total += 1 + found = True + break + if _try_fallback_fill(text_elements, i, label, value, label_cell, parent_map): + total += 1 + found = True + break + + if not found: + print(f" WARNING: Table label '{label}' not found or no adjacent cell", file=sys.stderr) + + return total + + +def _create_paragraph(ref_p, text: str): + """Create paragraph by cloning reference paragraph style/layout.""" + new_p = deepcopy(ref_p) + + runs = new_p.findall(f"./{HP_RUN_TAG}") + if not runs: + run = etree.Element(HP_RUN_TAG) + new_p.append(run) + runs = [run] + + for run in runs[1:]: + new_p.remove(run) + + t_elem = runs[0].find(f"./{HP_T_TAG}") + if t_elem is None: + t_elem = etree.SubElement(runs[0], HP_T_TAG) + t_elem.text = text + for child in list(t_elem): + t_elem.remove(child) + + return new_p + + +def apply_multi_paragraph_fills(tree, fills: list) -> int: + """Inject multi-paragraph content into a target cell.""" + parent_map = _build_parent_map(tree) + text_elements = get_all_text_elements(tree) + total = 0 + + for fill in fills: + section_id = fill.get("section_id", "?") + prefix = fill["guide_text_prefix"] + paragraphs = fill.get("paragraphs", []) + replaced = False + + for elem_idx, elem in enumerate(text_elements): + if not (elem.text and prefix in elem.text): + continue + + cell = _get_ancestor(elem, "tc", 
parent_map) + if cell is None: + continue + sub_list = cell.find(f"./{HP_SUBLIST_TAG}") + if sub_list is None: + continue + ref_p = sub_list.find(f"./{HP_P_TAG}") + if ref_p is None: + continue + + for paragraph in list(sub_list.findall(f"./{HP_P_TAG}")): + sub_list.remove(paragraph) + for paragraph_text in paragraphs: + sub_list.append(_create_paragraph(ref_p, paragraph_text)) + + total += 1 + replaced = True + _log_event({"type": "replace", "idx": elem_idx, "find": prefix, "replace": f"{len(paragraphs)} paragraphs"}) + print(f" Section {section_id}: inserted {len(paragraphs)} paragraph(s)") + break + + if not replaced: + print(f" WARNING: Section {section_id} multi-paragraph target not found", file=sys.stderr) + + return total + + +def fill_hwpx(plan: dict, output_path: str) -> int: + """Main fill operation: copy template, modify XML, save.""" + template_path = plan["template_file"] + shutil.copy2(template_path, output_path) + section_files = find_section_xmls(template_path) + if not section_files: + raise ValueError("No section XML files found in HWPX") + print(f"Found {len(section_files)} section(s): {', '.join(section_files)}") + + total_replacements = 0 + with zipfile.ZipFile(template_path, "r") as zf_in: + for section_file in section_files: + tree = etree.fromstring(zf_in.read(section_file)) + section_total = 0 + if plan.get("simple_replacements"): + print(f"\n--- Simple Replacements ({section_file}) ---") + section_total += apply_simple_replacements_xml(tree, plan["simple_replacements"]) + if plan.get("section_replacements"): + print(f"\n--- Section Replacements ({section_file}) ---") + section_total += apply_section_replacements_xml(tree, plan["section_replacements"]) + if plan.get("table_cell_fills"): + print(f"\n--- Table Cell Fills ({section_file}) ---") + section_total += apply_table_cell_fills_xml(tree, plan["table_cell_fills"]) + if plan.get("multi_paragraph_fills"): + print(f"\n--- Multi Paragraph Fills ({section_file}) ---") + section_total += 
apply_multi_paragraph_fills(tree, plan["multi_paragraph_fills"]) + + total_replacements += section_total + if section_total > 0: + modified_xml = etree.tostring(tree, xml_declaration=True, encoding="UTF-8") + _update_zip_file(output_path, section_file, modified_xml) + print(f" Updated {section_file} ({section_total} replacements)") + + _log_event({"type": "done", "total": total_replacements, "output": output_path}) + return total_replacements + + +def _update_zip_file(zip_path: str, target_file: str, new_content: bytes) -> None: + """Replace a single file inside a ZIP archive.""" + tmp_fd, tmp_path = tempfile.mkstemp(suffix=".hwpx") + os.close(tmp_fd) + + with zipfile.ZipFile(zip_path, "r") as zf_in, \ + zipfile.ZipFile(tmp_path, "w") as zf_out: + for item in zf_in.infolist(): + if item.filename == target_file: + zf_out.writestr(item, new_content) + else: + zf_out.writestr(item, zf_in.read(item.filename)) + + shutil.move(tmp_path, zip_path) + + +def inspect_template(template_path: str, query: str | None = None) -> None: + """List all text elements in a template for debugging. + + Uses direct XML parsing to find ALL text including table cells. + """ + section_files = find_section_xmls(template_path) + + total_elements = 0 + with zipfile.ZipFile(template_path, "r") as zf: + for section_file in section_files: + xml_bytes = zf.read(section_file) + tree = etree.fromstring(xml_bytes) + text_elements = get_all_text_elements(tree) + parent_map = _build_parent_map(tree) + total_elements += len(text_elements) + + print(f"Section: {section_file} ({len(text_elements)} text elements)\n") + + for i, elem in enumerate(text_elements): + text = elem.text or "" + if not text.strip(): + continue + if query and query.lower() not in text.lower(): + continue + display = text[:100] + ("..." 
if len(text) > 100 else "") + context = "" + cell = _get_ancestor(elem, "tc", parent_map) + if cell is not None: + table = _get_ancestor(cell, "tbl", parent_map) + cell_addr = cell.find(f"./{HP_CELLADDR_TAG}") + if table is not None and cell_addr is not None: + table_idx = _get_table_index(tree, table) + try: + col = int(cell_addr.get("colAddr", "-1")) + row = int(cell_addr.get("rowAddr", "-1")) + except ValueError: + col = -1 + row = -1 + context = f"[T{table_idx} R{row} C{col}] " + + print(f" [{i:4d}] {context}{display}") + + print(f"\nTotal elements: {total_elements}") + + +def _collect_table_cell_infos(table) -> list[tuple[int, int, int, int, str]]: + """Collect row/col/span/text info from table cells.""" + cell_infos = [] + for cell in table.findall(f".//{HP_TC_TAG}"): + col, row = _parse_cell_addr(cell) + if col is None or row is None: + continue + cell_span = cell.find(f"./{HP_CELLSPAN_TAG}") + if cell_span is not None: + try: + col_span = int(cell_span.get("colSpan", "1")) + row_span = int(cell_span.get("rowSpan", "1")) + except ValueError: + col_span = 1 + row_span = 1 + else: + col_span = 1 + row_span = 1 + text = _get_cell_text(cell) or "[EMPTY]" + cell_infos.append((row, col, col_span, row_span, text)) + return sorted(cell_infos, key=lambda x: (x[0], x[1])) + + +def _inspect_table_structure(template_path: str) -> None: + """Inspect table layout with cell coordinates and spans.""" + with zipfile.ZipFile(template_path, "r") as zf: + for section_file in find_section_xmls(template_path): + tree = etree.fromstring(zf.read(section_file)) + tables = tree.findall(f".//{HP_TBL_TAG}") + print(f"Section: {section_file} ({len(tables)} tables)\n") + for table_idx, table in enumerate(tables): + row_cnt = table.get("rowCnt", "?") + col_cnt = table.get("colCnt", "?") + print(f" Table {table_idx}: {row_cnt} rows x {col_cnt} cols") + for row, col, col_span, row_span, text in _collect_table_cell_infos(table): + span = f" (span {col_span}x{row_span})" if col_span > 1 or 
row_span > 1 else "" + print(f" R{row} C{col}{span}: {text}") + print() + + +def _get_cell_text(tc) -> str: + """Get normalized text content of a table cell.""" + return "".join((t.text or "") for t in tc.findall(f".//{HP_T_TAG}")).strip() + + +def _parse_cell_addr(tc) -> tuple[int | None, int | None]: + """Parse cell address (colAddr, rowAddr) from .""" + cell_addr = tc.find(f"./{HP_CELLADDR_TAG}") + if cell_addr is None: + return None, None + try: + return int(cell_addr.get("colAddr", "-1")), int(cell_addr.get("rowAddr", "-1")) + except ValueError: + return None, None + + +def _detect_placeholder_pattern(text: str) -> str | None: + """Detect placeholder-like patterns such as OO/○○/000.""" + for pattern in PLACEHOLDER_PATTERNS: + match = re.search(pattern, text) + if match: + return match.group(0) + numeric_match = re.search(NUMERIC_PLACEHOLDER_PATTERN, text) + if numeric_match: + return numeric_match.group(1) + return None + + +def _extract_table_schema(tree) -> list[dict]: + """Extract table layout and cell fillability metadata.""" + tables = [] + for table_idx, table in enumerate(tree.findall(f".//{HP_TBL_TAG}")): + row_cnt = table.get("rowCnt", "0") + col_cnt = table.get("colCnt", "0") + table_info = { + "index": table_idx, + "rows": int(row_cnt) if row_cnt.isdigit() else 0, + "cols": int(col_cnt) if col_cnt.isdigit() else 0, + "cells": [], + } + for cell in table.findall(f".//{HP_TC_TAG}"): + col, row = _parse_cell_addr(cell) + if col is None or row is None: + continue + text = _get_cell_text(cell) + cell_info = {"row": row, "col": col, "text": text} + if not text: + cell_info["is_empty"] = True + elif col == 0: + cell_info["is_label"] = True + table_info["cells"].append(cell_info) + + table_info["cells"].sort(key=lambda c: (c["row"], c["col"])) + tables.append(table_info) + return tables + + +def _extract_text_markers(tree, index_offset: int) -> tuple[list[dict], list[dict]]: + """Extract guide text markers and placeholders from text elements.""" + 
guide_texts = [] + placeholders = [] + parent_map = _build_parent_map(tree) + text_elements = get_all_text_elements(tree) + + for local_idx, elem in enumerate(text_elements): + text = (elem.text or "").strip() + if not text: + continue + element_index = index_offset + local_idx + cell = _get_ancestor(elem, "tc", parent_map) + table = _get_ancestor(cell, "tbl", parent_map) if cell is not None else None + table_index = _get_table_index(tree, table) if table is not None else -1 + col, row = _parse_cell_addr(cell) if cell is not None else (None, None) + + if text.startswith("※"): + guide = {"element_index": element_index, "prefix": text[:100]} + if table_index >= 0: + guide["table_index"] = table_index + if row is not None and col is not None: + guide["cell"] = f"R{row}C{col}" + guide_texts.append(guide) + + pattern = _detect_placeholder_pattern(text) + if pattern: + placeholders.append({"element_index": element_index, "text": text, "pattern": pattern}) + + return guide_texts, placeholders + + +def analyze_template(template_path: str) -> dict: + """Analyze template and return fillable schema metadata.""" + schema = { + "template_file": template_path, + "total_text_elements": 0, + "tables": [], + "guide_texts": [], + "placeholders": [], + } + index_offset = 0 + + with zipfile.ZipFile(template_path, "r") as zf: + for section_file in find_section_xmls(template_path): + tree = etree.fromstring(zf.read(section_file)) + text_elements = get_all_text_elements(tree) + schema["total_text_elements"] += len(text_elements) + schema["tables"].extend(_extract_table_schema(tree)) + guide_texts, placeholders = _extract_text_markers(tree, index_offset) + schema["guide_texts"].extend(guide_texts) + schema["placeholders"].extend(placeholders) + index_offset += len(text_elements) + + return schema + + +def main(): + parser = argparse.ArgumentParser(description="Fill HWPX template with content") + parser.add_argument("plan", nargs="?", help="Path to fill_plan.json") + 
parser.add_argument("-o", "--output", help="Override output path") + parser.add_argument("--inspect", metavar="HWPX", help="Inspect template text runs") + parser.add_argument("--inspect-tables", metavar="HWPX", help="Show table structure of template") + parser.add_argument("--analyze", metavar="HWPX", help="Analyze template and output JSON schema") + parser.add_argument("-q", "--query", help="Filter runs by text (with --inspect)") + args = parser.parse_args() + + if args.inspect: + inspect_template(args.inspect, args.query) + return + if args.inspect_tables: + _inspect_table_structure(args.inspect_tables) + return + if args.analyze: + print(json.dumps(analyze_template(args.analyze), ensure_ascii=False, indent=2)) + return + + if not args.plan: + parser.error("fill_plan.json is required (or use --inspect / --inspect-tables / --analyze)") + + # Load plan + plan = load_plan(args.plan) + template_path = plan["template_file"] + output_path = args.output or plan["output_file"] + + print(f"Template: {template_path}") + print(f"Output: {output_path}") + print() + + # Fill template + total = fill_hwpx(plan, output_path) + + # Report + size = os.path.getsize(output_path) + print(f"\n--- Done ---") + print(f"Saved: {output_path} ({size:,} bytes)") + print(f"Total replacements: {total}") + + +if __name__ == "__main__": + main() diff --git "a/\354\202\254\354\227\205\352\263\204\355\232\215\354\204\234_schema.json" "b/\354\202\254\354\227\205\352\263\204\355\232\215\354\204\234_schema.json" new file mode 100644 index 0000000..35d85fe --- /dev/null +++ "b/\354\202\254\354\227\205\352\263\204\355\232\215\354\204\234_schema.json" @@ -0,0 +1,140 @@ +{ + "_instructions": { + "purpose": "2026년 재도전성공패키지 (예비)재창업기업 사업계획서 작성용 스키마", + "how_to_use": [ + "1. 모든 빈 문자열(\"\")과 빈 배열([])을 실제 내용으로 채워주세요", + "2. instruction 필드는 작성 지침이므로 수정하지 마세요", + "3. sections의 paragraphs 배열: 각 원소가 한글 파일에서 독립 문단이 됩니다", + "4. 
완성된 JSON을 Claude Code에 전달하면 자동으로 HWPX 파일을 생성합니다" + ], + "constraints": [ + "사업계획서 본문은 목차 제외 7페이지 이내", + "개인정보는 반드시 마스킹 (성명→○○○, 대학→○○대, 생년→YYYY년 등)", + "과제명은 K-Startup 신청 시 입력한 과제명과 동일하게 기재" + ] + }, + + "meta": { + "과제명": "", + "기업명": "", + "아이템_개요": "" + }, + + "sections": { + "1-1_폐업원인분석": { + "instruction": "과거 폐업 원인을 분석한 결과 및 이를 극복하기 위한 개선 방안. 폐업의 근본 원인(시장/재무/경영/기술)을 구체적으로 분석하고, 재창업에서의 극복 방안을 제시.", + "paragraphs": [] + }, + "1-2_목표시장": { + "instruction": "목표시장(고객) 현황 및 경쟁사 분석, 재창업 아이템의 경쟁력. TAM/SAM/SOM, 경쟁사 대비 차별점, 고객 페인포인트 포함.", + "paragraphs": [] + }, + "2-1_준비현황": { + "instruction": "신청 이전까지 시제품 제작, 시장 반응 조사 등 사전 준비 현황. 개발 단계, 프로토타입, 고객 피드백, 기술 검증 포함.", + "paragraphs": [] + }, + "2-2_구체화방안": { + "instruction": "핵심 기능 및 개발/개선 구체적 계획. 기술 아키텍처, 개발 로드맵, 핵심 기능 명세 포함.", + "paragraphs": [] + }, + "3-1_비즈니스모델": { + "instruction": "가치 전달 체계 및 수익 창출 방법. 수익 모델(구독/수수료/라이선스 등), 가격 전략, 핵심 파트너, 비용 구조 포함.", + "paragraphs": [] + }, + "3-2_사업화전략": { + "instruction": "목표시장 진입/진출 방안과 고객 확보 전략. GTM 전략, 마케팅 채널, 초기 고객 확보, 파트너십 전략 포함.", + "paragraphs": [] + }, + "4-1_보유역량": { + "instruction": "대표자 및 조직의 사업화 역량. 예비재창업자는 대표자 중심. 기술 역량, 사업 경험, 도메인 전문성 포함.", + "paragraphs": [] + }, + "4-2_조직구성계획": { + "instruction": "협약기간 조직 구성 계획. 필요 직무, 보완 역량, 채용 전략 포함.", + "paragraphs": [] + } + }, + + "폐업이력": { + "instruction": "과거 폐업기업 개요. 최대 3개 기업, 최근순. 1개 기업만 있으면 companies 배열에 1개만.", + "총_폐업횟수": 0, + "companies": [ + { + "기업명": "", + "기업구분": "", + "사업기간": "", + "아이템_개요": "", + "폐업원인": "" + } + ] + }, + + "추진일정": { + "instruction": "협약기간('26.4~10월) 목표 및 추진 일정. 3~5개 행.", + "rows": [ + {"순번": 1, "목표": "", "세부내용": "", "일정": ""}, + {"순번": 2, "목표": "", "세부내용": "", "일정": ""}, + {"순번": 3, "목표": "", "세부내용": "", "일정": ""} + ] + }, + + "사업비": { + "instruction": "사업비 구성. 
비목별 산출근거와 금액.", + "constraints": { + "정부지원_한도": "최대 1억원", + "정부지원_비율": "총사업비의 75% 이하", + "현금_비율": "총사업비의 5% 이상", + "현물_비율": "총사업비의 20% 이하" + }, + "비목_options": ["재료비", "외주용역비", "인건비", "지식재산권 등 무형자산 취득비", "여비", "기타"], + "rows": [ + {"비목": "재료비", "산출근거": "", "정부지원": 0, "현금": 0, "현물": 0}, + {"비목": "외주용역비", "산출근거": "", "정부지원": 0, "현금": 0, "현물": 0}, + {"비목": "인건비", "산출근거": "", "정부지원": 0, "현금": 0, "현물": 0} + ] + }, + + "인력현황": { + "instruction": "재직 인력 고용현황. 채용 완료 인력만. 개인정보 마스킹 필수.", + "현재_재직인원": 0, + "추가_고용계획": 0, + "members": [ + {"순번": 1, "직위": "", "담당업무": "", "보유역량": ""}, + {"순번": 2, "직위": "", "담당업무": "", "보유역량": ""} + ] + }, + + "가점": { + "instruction": "해당 항목만 true로 변경. 해당 없으면 false 유지.", + "서류평가가점": { + "특별지원지역_소재": false, + "노란우산공제_가입": false, + "3년이상_업력": false, + "재도전_사례공모전_수상": false + }, + "발표평가가점": { + "유공포상_수상": false, + "타사업자등록_없음": false + }, + "서류평가면제": { + "K스타트업_왕중왕전_진출": false, + "중진공_심층평가_통과": false, + "중진공_특화교육_우수": false, + "유공포상_수상": false, + "TIPS_선정이력": false + }, + "우선선정": { + "K스타트업_왕중왕전_대상": false + } + }, + + "_manual_only": { + "_note": "아래 항목만 한글에서 직접 수정 필요", + "items": [ + "폐업이력 2~3번째 기업 (빈 셀 직접 입력)", + "파란색 안내문구 삭제", + "붙임: 증빙서류 이미지 삽입", + "전체 7페이지 이내 확인" + ] + } +} diff --git "a/\354\213\234\354\212\244\355\205\234\355\224\204\353\241\254\355\224\204\355\212\270.md" "b/\354\213\234\354\212\244\355\205\234\355\224\204\353\241\254\355\224\204\355\212\270.md" new file mode 100644 index 0000000..35c93ee --- /dev/null +++ "b/\354\213\234\354\212\244\355\205\234\355\224\204\353\241\254\355\224\204\355\212\270.md" @@ -0,0 +1,210 @@ +# 재도전성공패키지 사업계획서 작성 어시스턴트 v2 + +## 역할 + +당신은 중소벤처기업부 정부지원사업 사업계획서 전문 컨설턴트입니다. 사용자와 대화하며 2026년 재도전성공패키지 (예비)재창업기업 사업계획서를 완성합니다. + +## 핵심 규칙 + +1. **프로젝트에 업로드된 `사업계획서_schema.json`이 마스터 양식입니다.** 섹션 구조와 필드를 숙지하세요. +2. **초안 작성·토론·수정 단계에서는 반드시 마크다운으로 작업합니다.** JSON 변환은 사용자가 "최종 확정" 또는 "JSON 변환"을 명시적으로 요청할 때만 수행합니다. +3. **사용자가 최종 승인하기 전까지 JSON으로 변환하지 마세요.** 어떤 경우에도 사용자의 명시적 승인 없이 JSON을 출력하지 않습니다. 
+ +--- + +## 작업 흐름 + +### Phase 1: 컨텍스트 수집 + +사용자에게 다음을 질문하세요 (이미 알고 있는 것은 건너뛰기): +- 사업 아이템 개요 (무엇을 만드는가?) +- 이전 폐업 경험 (몇 회? 원인?) +- 현재 준비 상태 (MVP, 프로토타입, 팀 구성) +- 목표 시장과 경쟁 환경 +- 수익 모델 +- 예산 규모 (정부지원 희망 금액) + +### Phase 2: 마크다운 초안 작성 + 토론 + +**이 단계에서 모든 내용은 마크다운으로 작성합니다.** + +#### 마크다운 포맷 + +```markdown +# 재도전성공패키지 사업계획서 + +## 과제 개요 +- **과제명**: (내용) +- **기업명**: (내용) +- **아이템 개요**: (내용) + +## 폐업 이력 (총 N회) + +### 1번째 기업 +- 기업명: / 기업구분: 개인·법인 +- 사업기간: YYYY.MM.DD.~YYYY.MM.DD. +- 아이템 개요: +- 폐업 원인: + +## 1-1. 폐업 원인 분석 및 개선 방안 + +(여러 단락으로 자유롭게 작성. 각 단락 사이에 빈 줄.) + +## 1-2. 목표시장(고객) 현황 및 필요성 + +(내용) + +## 2-1. 재창업 아이템 준비 현황 + +(내용) + +## 2-2. 재창업 아이템 실현 및 구체화 방안 + +(내용) + +## 3-1. 비즈니스 모델 + +(내용) + +## 3-2. 사업화 추진 전략 + +(내용) + +## 3-3. 추진 일정 및 자금 운용 계획 + +### 추진 일정 ('26.4~10월) + +| 순번 | 목표 | 세부 내용 | 일정 | +|------|------|----------|------| +| 1 | ... | ... | ~ 6월 | +| 2 | ... | ... | ~ 8월 | +| 3 | ... | ... | ~ 10월 | + +### 사업비 + +| 비목 | 산출근거 | 정부지원 | 현금 | 현물 | 합계 | +|------|---------|---------|------|------|------| +| 재료비 | • ... | 0 | 0 | 0 | 0 | +| ... | | | | | | +| **합계** | | **0** | **0** | **0** | **0** | + +## 4-1. 조직 구성 및 보유 역량 + +(내용) + +### 인력 현황 + +| 순번 | 직위 | 담당 업무 | 보유역량 | +|------|------|----------|---------| +| 1 | ... | ... | ... | + +- 현재 재직 인원: N명 +- 추가 고용 계획: N명 + +## 4-2. 조직 구성 계획 + +(내용) + +## 가점/면제 해당 사항 + +- [ ] 서류평가 가점 ①~④: (해당 항목 체크) +- [ ] 발표평가 가점 ①~②: (해당 항목 체크) +- [ ] 서류평가 면제 ①~⑤: (해당 항목 체크) +- [ ] 우선선정: (해당 시) +``` + +#### 토론 진행 방식 + +1. **섹션 단위로 작성** → 사용자 피드백 → 수정 반복 +2. 사용자가 "OK", "다음", "넘어가" 등으로 승인하면 다음 섹션으로 이동 +3. 사용자가 수정 요청하면 해당 섹션만 마크다운으로 재작성 +4. 전체 초안 완성 후, **전문을 한번에 마크다운으로 보여주고** 최종 리뷰 요청 + +### Phase 3: 최종 확정 → JSON 변환 + +**트리거**: 사용자가 다음 중 하나를 말할 때만 JSON 변환을 수행합니다: +- "최종 확정" +- "JSON 변환해줘" +- "확정이야" +- "이대로 진행" + +**변환 시 수행할 작업:** + +1. 확정된 마크다운 내용을 `사업계획서_schema.json` 구조에 맞춰 JSON으로 변환 +2. 마크다운의 각 `##` 섹션의 단락들 → `paragraphs` 배열로 매핑 (빈 줄 기준으로 단락 분리) +3. 테이블 데이터 → 해당 JSON 필드로 매핑 +4. 출력 전 자체 검증 (체크리스트 참조) +5. 
**완성된 JSON 전체를 코드 블록으로 출력** + +--- + +## 작성 가이드라인 + +### 서술 섹션 (1-1 ~ 4-2) +- 각 섹션 2~4단락 (마크다운에서 빈 줄로 구분) +- 한 단락 3~5문장 +- 구체적 숫자, 데이터, 근거 포함 +- "~할 것이다" → "~를 통해 ~를 달성한다" 식의 구체적 서술 +- 전문 용어는 심사위원이 이해 가능한 수준으로 + +### 폐업이력 +- 최대 3개 기업, 최근순 +- 기업구분: "개인" 또는 "법인" +- 사업기간: "YYYY.MM.DD.~YYYY.MM.DD." 형식 + +### 추진일정 +- 3~5개 행, 협약기간 '26.4월~10월 +- 일정: "~ 6월", "~ 8월" 형식 + +### 사업비 +- 정부지원 합계 ≤ 1억원 +- 정부지원 ≤ 총사업비의 75% +- 현금 ≥ 총사업비의 5% +- 현물 ≤ 총사업비의 20% +- 산출근거: "• 항목명(수량×단가)" 형식 +- 금액 단위: 원 + +### 인력현황 +- **개인정보 마스킹 필수**: 성명→○○○, 대학→○○대 +- 보유역량: 학력과 경력 중심 (마스킹 형태) +- 채용 완료 인력만 기재 +- 예비재창업자는 작성 불필요 + +### 가점 +- 해당 항목만 체크 + +## 제약사항 + +- 본문 7페이지 이내 (목차·붙임 제외) +- 개인정보(성명, 성별, 생년월일, 대학명, 소재지, 직장명) 반드시 마스킹 +- 과제명은 K-Startup 신청 과제명과 동일 + +--- + +## JSON 변환 시 체크리스트 + +출력 전 자체 검증: +- [ ] meta 3개 필드 비어있지 않은가? +- [ ] 8개 섹션 모두 paragraphs가 채워져 있는가? +- [ ] 폐업이력 companies에 최소 1개 기업이 있는가? +- [ ] 추진일정 rows가 3개 이상인가? +- [ ] 사업비 정부지원 합계 ≤ 1억인가? +- [ ] 사업비 비율 조건을 충족하는가? (정부지원≤75%, 현금≥5%, 현물≤20%) +- [ ] 인력현황 members가 최소 1명 있는가? +- [ ] "OO", "00개", "0000원" 같은 플레이스홀더가 남아있지 않은가? +- [ ] 개인정보가 마스킹되어 있는가? +- [ ] `사업계획서_schema.json`의 모든 키가 빠짐없이 포함되어 있는가? + +## JSON 구조 매핑 참조 + +마크다운 → JSON 변환 시 참조: + +| 마크다운 섹션 | JSON 경로 | +|-------------|-----------| +| 과제명/기업명/아이템 개요 | `meta.과제명`, `meta.기업명`, `meta.아이템_개요` | +| ## 1-1 ~ ## 4-2 | `sections.{key}.paragraphs` (빈 줄 기준 단락 분리) | +| 폐업 이력 | `폐업이력.총_폐업횟수`, `폐업이력.companies[]` | +| 추진 일정 테이블 | `추진일정.rows[]` | +| 사업비 테이블 | `사업비.rows[]` (금액은 정수, 콤마 없이) | +| 인력 현황 테이블 | `인력현황.members[]`, `현재_재직인원`, `추가_고용계획` | +| 가점 체크 | `가점.서류평가가점.*`, `가점.발표평가가점.*` 등 (true/false) |
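The two mechanical parts of the mapping above — paragraph splitting on blank lines and the 사업비 ratio checks — can be sketched in Python. This is a minimal illustration, not part of the repository's tooling; the helper names `split_paragraphs` and `check_budget` are invented here, and the thresholds mirror the constraints listed in the checklist (amounts are integers in KRW, no commas):

```python
import re

def split_paragraphs(section_md: str) -> list[str]:
    """Split a markdown section body into paragraphs on blank lines,
    matching the `sections.{key}.paragraphs` mapping rule."""
    return [p.strip() for p in re.split(r"\n\s*\n", section_md) if p.strip()]

def check_budget(rows: list[dict]) -> list[str]:
    """Validate the 사업비 ratio constraints from the checklist."""
    gov = sum(r["정부지원"] for r in rows)
    cash = sum(r["현금"] for r in rows)
    inkind = sum(r["현물"] for r in rows)
    total = gov + cash + inkind
    errors = []
    if gov > 100_000_000:
        errors.append("정부지원 합계가 1억원 초과")
    if total:
        if gov > total * 0.75:
            errors.append("정부지원이 총사업비의 75% 초과")
        if cash < total * 0.05:
            errors.append("현금이 총사업비의 5% 미만")
        if inkind > total * 0.20:
            errors.append("현물이 총사업비의 20% 초과")
    return errors
```

Running `check_budget` on the converted `사업비.rows` before emitting the final JSON is one way to satisfy the ratio items in the self-check list.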