Commit 791abaf

Add workflow CLI gap-closure tools
1 parent 9e04422 commit 791abaf

13 files changed

Lines changed: 1026 additions & 316 deletions

CHANGELOG.md

Lines changed: 13 additions & 0 deletions
@@ -2,6 +2,19 @@

 All notable changes are recorded in this document. The format follows [Keep a Changelog](https://keepachangelog.com/ko/1.1.0/) and [Semantic Versioning](https://semver.org/lang/ko/).

+## [2.7] - 2026-03-08
+### Added
+- Added the `hwpx-unpack`, `hwpx-pack`, and `hwpx-analyze-template` CLIs.
+- Added `src/hwpx/tools/archive_cli.py`, promoting the unpack/pack workflow to package-level tooling.
+- Unpack now writes `.hwpx-pack-metadata.json`, and pack uses it to preserve the original ZIP entry order and compression methods as far as possible.
+- Added `src/hwpx/tools/template_analyzer.py` and `DevDoc/hwpxskill_gap_audit.md`.
+
+### Changed
+- Reworked `scripts/office/unpack.py`, `scripts/office/pack.py`, and `scripts/analyze_template.py` into wrappers around the package tools.
+- Added shape/control counts and histogram comparison to `page_guard`, and stated in the docs and CLI description that it is a layout-drift proxy, not a rendered page count.
+- Added usage examples for the new CLIs to the README and `docs/usage.md`.
+- Strengthened the CLI, extraction, overwrite, and page-guard regression tests for the new tooling.
+
 ## [2.6] - 2026-03-08
 ### Added
 - Added the `hwpx-validate-package` CLI and `hwpx.tools.package_validator` to check ZIP/OPC/HWPX package structure, `mimetype`, `container.xml`, manifest/spine references, and XML well-formedness.

DevDoc/hwpxskill_gap_audit.md

Lines changed: 193 additions & 0 deletions
@@ -0,0 +1,193 @@
# hwpxskill Gap Audit

## Scope

This audit compares the practical workflow surface of `python-hwpx` against
`Canine89/hwpxskill` without assuming the competitor's README claims are
correct. The goal is to identify real workflow gaps, reuse existing engine
abstractions where possible, and separate reproduced bugs from unverified
assertions.

## Current Repository Summary

- Core engine: `src/hwpx/document.py`, `src/hwpx/opc/package.py`,
  `src/hwpx/oxml/*`
- Existing high-level tooling before this patch set:
  - schema validator: `src/hwpx/tools/validator.py`
  - text extraction engine: `src/hwpx/tools/text_extractor.py`
  - object finder / exporters: `src/hwpx/tools/object_finder.py`,
    `src/hwpx/tools/exporter.py`
- Workflow-gap code already present at audit start:
  - package validator: `src/hwpx/tools/package_validator.py`
  - page guard: `src/hwpx/tools/page_guard.py`
  - text extraction CLI: `src/hwpx/tools/text_extract_cli.py`
  - script-only unpack/pack/analyze tools under `scripts/`

## Confirmed Gaps vs hwpxskill

These were confirmed by inspecting both repos and the current local checkout.

1. Public unpack/pack workflow was incomplete.
   - `python-hwpx` had script files for unpack/pack, but no package-level CLI
     entry points such as `hwpx-unpack` / `hwpx-pack`.
   - The existing pack/unpack scripts did not record archive entry order or
     compression metadata, so they could not preserve original ZIP layout
     details when repacking.
   - Overwrite behavior was not explicit or safe.

2. Template analysis workflow was incomplete.
   - `python-hwpx` had a script for analyzing reference documents, but it was
     not promoted to a package-level CLI, did not emit a structured JSON
     summary, and had only a smoke test instead of extraction-focused tests.

3. Page-guard coverage was narrower than requested.
   - The existing page guard already acted as a structural drift detector, but
     it did not count shape/control deltas yet.
   - It also needed clearer documentation that it is a proxy/risk heuristic,
     not a rendered page counter.

4. Public docs lagged behind the implemented tooling.
   - `README.md` still documented only `hwpx-validate` in the CLI section.
   - The main usage guide did not document unpack/pack/analyze/page-guard/text
     extraction workflows.

5. Audit documentation itself was missing.
   - There was no repository-local audit note separating verified findings from
     competitor marketing claims.

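The metadata gap in item 1 can be illustrated with the standard library alone. The sketch below is not the `archive_cli` implementation: the function names `unpack` and `pack` and the JSON layout are assumptions made for illustration, and only the `.hwpx-pack-metadata.json` file name comes from this commit's changelog. It records each entry's position and compression method on unpack, then replays them on pack and appends any files that were added afterwards.

```python
import json
import zipfile
from pathlib import Path

# File name taken from the changelog; the JSON layout below is hypothetical.
METADATA_NAME = ".hwpx-pack-metadata.json"


def unpack(archive: Path, dest: Path) -> None:
    """Extract all entries and record their order and compression method."""
    with zipfile.ZipFile(archive) as zf:
        entries = [
            {"name": info.filename, "compress_type": info.compress_type}
            for info in zf.infolist()
        ]
        zf.extractall(dest)
    (dest / METADATA_NAME).write_text(
        json.dumps({"entries": entries}, indent=2), encoding="utf-8"
    )


def pack(src: Path, archive: Path) -> None:
    """Rebuild the archive, replaying the recorded order where possible."""
    meta = json.loads((src / METADATA_NAME).read_text(encoding="utf-8"))
    recorded = {e["name"]: e["compress_type"] for e in meta["entries"]}
    on_disk = sorted(
        p.relative_to(src).as_posix()
        for p in src.rglob("*")
        if p.is_file() and p.name != METADATA_NAME
    )
    present = set(on_disk)
    # Recorded entries keep their original order; new files go at the end.
    ordered = [n for n in recorded if n in present]
    ordered += [n for n in on_disk if n not in recorded]
    with zipfile.ZipFile(archive, "w") as zf:
        for name in ordered:
            zf.write(
                src / name,
                arcname=name,
                compress_type=recorded.get(name, zipfile.ZIP_DEFLATED),
            )
```

Without the recorded order, a repack that walks the directory would typically emit entries alphabetically and with a single compression method, which is the layout loss the audit describes.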
## Reusable Internals Confirmed

These existing internals made it unnecessary to cargo-cult `hwpxskill`'s raw
XML-first approach.

- `src/hwpx/opc/package.py`
  - `HwpxPackage.open()`
  - `HwpxPackage.part_names()`
  - `HwpxPackage.get_part()`
  - `HwpxPackage.get_xml()`
  - `HwpxPackage.header_paths()`
  - `HwpxPackage.section_paths()`
  - `HwpxPackage.main_content`
- `src/hwpx/tools/validator.py`
  - existing schema validation path
- `src/hwpx/tools/text_extractor.py`
  - existing traversal and text extraction engine
- `src/hwpx/tools/page_guard.py`
  - existing metrics collection shape that could be extended instead of replaced

Conclusion: `python-hwpx` already had enough engine-level primitives to add the
missing workflows without switching to competitor-style "raw XML everywhere".

## Real Reproduced Bugs

### 1. Validation dirty-state mutation (historical, now fixed)

The concrete bug candidate worth treating seriously was whether validation
mutated document state. That bug was real in the earlier implementation:

- `HwpxDocument.validate()` serialized via `_to_bytes_raw()`
- `_to_bytes_raw()` called `self._root.reset_dirty()`
- Result: validating a modified document could clear the dirty state even when
  the user had not saved yet

That behavior is now covered by a regression test:

- `tests/test_gap_closure_tools.py::test_validate_preserves_dirty_state`

At the time of this audit, current `main` already contains the fix, so the bug
does not reproduce anymore on HEAD.

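The call chain above can be reproduced with a toy model. The classes below are stand-ins written for this note, not the real `HwpxDocument`; only the method names `validate()`, `_to_bytes_raw()`, and `reset_dirty()` come from the audit. The `reset_dirty` keyword in `FixedDocument` is one plausible shape for a fix and is an assumption, since the audit does not describe how `main` actually fixed it.

```python
class Root:
    """Stand-in for the XML root object that tracks unsaved edits."""

    def __init__(self) -> None:
        self.text = ""
        self.dirty = False

    def set_text(self, text: str) -> None:
        self.text = text
        self.dirty = True

    def reset_dirty(self) -> None:
        self.dirty = False


class BuggyDocument:
    """Earlier behavior as described in the audit: serializing clears the flag."""

    def __init__(self) -> None:
        self._root = Root()

    def _to_bytes_raw(self) -> bytes:
        data = self._root.text.encode("utf-8")
        self._root.reset_dirty()  # side effect hidden inside serialization
        return data

    def validate(self) -> bool:
        # Validation only needs the bytes, but it inherits the side effect.
        return isinstance(self._to_bytes_raw(), bytes)


class FixedDocument(BuggyDocument):
    """Hypothetical fix: only save() is allowed to reset the dirty flag."""

    def _to_bytes_raw(self, *, reset_dirty: bool = False) -> bytes:
        data = self._root.text.encode("utf-8")
        if reset_dirty:
            self._root.reset_dirty()
        return data

    def save(self) -> bytes:
        return self._to_bytes_raw(reset_dirty=True)
```

A regression test in the spirit of `test_validate_preserves_dirty_state` would edit the document, call `validate()`, and assert that the dirty flag is still set.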
## Bugs I Could Not Reproduce

These claims appeared in or were implied by `hwpxskill`, but I could not
substantiate them from evidence in the current `python-hwpx` checkout.

1. "python-hwpx API has many bugs"
   - Too vague to verify.
   - Current tests and integration flows do not support that broad claim.

2. "High-level API editing necessarily destroys styles/structure"
   - Not reproduced for ordinary paragraph/table editing in the current test
     suite.
   - Existing tests already cover roundtrip and style-preserving behavior.

3. "page_guard detects actual page count changes"
   - Not supported by the competitor implementation itself.
   - Their script measures structural/text drift in `section0.xml`; it does not
     compute rendered page count.

4. Header/footer instability or TypeError complaints
   - No current reproduction from repository tests.
   - Existing `tests/test_section_headers.py` covers the public API surface.

## Competitor Claims That Remain Unverified

1. "XML-direct workflow preserves formatting almost exactly"
   - Plausible for some templates, but not benchmarked here.
   - No controlled comparison was performed in this patch.

2. "Their workflow is more reliable for all existing documents"
   - Not established.
   - The competitor repo does not provide a broad evidence matrix for this.

3. "Template replacement quality is universally better than the object API"
   - Not established.
   - Likely document-dependent.

## Exact Files / Functions Inspected

### Local repository

- `pyproject.toml`
- `README.md`
- `docs/usage.md`
- `src/hwpx/document.py`
  - `HwpxDocument.validate`
  - `HwpxDocument._to_bytes_raw`
- `src/hwpx/opc/package.py`
  - `HwpxPackage.open`
  - `HwpxPackage.part_names`
  - `HwpxPackage.get_part`
  - `HwpxPackage.get_xml`
  - `HwpxPackage.main_content`
  - `HwpxPackage.header_paths`
  - `HwpxPackage.section_paths`
  - `HwpxPackage.save`
- `src/hwpx/tools/validator.py`
  - `validate_document`
- `src/hwpx/tools/package_validator.py`
  - `validate_package`
- `src/hwpx/tools/page_guard.py`
  - `collect_metrics`
  - `compare_metrics`
- `src/hwpx/tools/text_extractor.py`
  - `TextExtractor.iter_sections`
  - `TextExtractor.iter_paragraphs`
  - `TextExtractor.extract_text`
- `src/hwpx/tools/text_extract_cli.py`
- `scripts/office/unpack.py`
- `scripts/office/pack.py`
- `scripts/analyze_template.py`
- `tests/test_gap_closure_tools.py`
- `tests/test_section_headers.py`
- `.github/workflows/release.yml`
- `.github/workflows/tests.yml`

### Competitor repository (`Canine89/hwpxskill`)

- `README.md`
- `scripts/validate.py`
- `scripts/page_guard.py`
- `scripts/text_extract.py`
- `scripts/analyze_template.py`

## Patch Direction Chosen

This first PR-equivalent patch should:

1. promote unpack/pack/analyze into package-level tooling with CLI entry points
2. keep using `python-hwpx` engine abstractions for package inspection and text
   extraction
3. extend page guard as a proxy detector, not as a fake page counter
4. keep backward compatibility with existing `HwpxDocument` APIs
5. strengthen tests and docs around the new tooling
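
To make the "proxy detector" wording in item 3 concrete, here is a minimal sketch of what a structural drift comparison can look like. It is not the `page_guard` code: the function names `collect_proxy_metrics` and `compare_proxy_metrics`, the returned dictionary keys, and the assumption that paragraphs and tables use the local tag names `p` and `tbl` are all illustrative. The point is that every number comes from counting XML elements and text, so nothing here can know the rendered page count.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def collect_proxy_metrics(section_xml: str) -> dict:
    """Count structural elements and text length in one section document."""
    root = ET.fromstring(section_xml)
    tags = Counter(local_name(el.tag) for el in root.iter())
    return {
        "paragraphs": tags["p"],   # assumed paragraph tag name
        "tables": tags["tbl"],     # assumed table tag name
        "text_length": sum(len(t) for t in root.itertext()),
        "tag_histogram": dict(tags),
    }


def compare_proxy_metrics(reference: dict, output: dict) -> dict:
    """Return output-minus-reference deltas for the scalar metrics."""
    keys = ("paragraphs", "tables", "text_length")
    return {key: output[key] - reference[key] for key in keys}
```

A caller would flag a document as risky when a delta exceeds some tolerance; choosing that tolerance is a policy decision that this sketch leaves out.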

README.md

Lines changed: 26 additions & 2 deletions
@@ -98,7 +98,8 @@ doc.save_to_path("결과물.hwpx")
 | 🔎 **Object search** | Tag/attribute/XPath | Locate specific elements, annotation iterators |
 | 🎨 **Style replacement** | Format-based filters | Find and replace runs by color/underline/charPrIDRef |
 | 📤 **Export** | Text/HTML/Markdown | Document conversion output |
-| **Validation** | XSD schema | CLI (`hwpx-validate`) and API |
+| **Validation** | XSD + package structure | CLI (`hwpx-validate`, `hwpx-validate-package`) and API |
+| 🧰 **Workflow tools** | unpack/pack/template analyze/page guard | Aids for template-preserving, XML-first work |
 | 🏗️ **Low-level XML** | Dataclass mapping | Direct manipulation of OWPML schema ↔ Python objects |
 | 🔄 **Namespace compatibility** | Automatic normalization | Automatic HWPML 2016 → 2011 conversion |

@@ -195,10 +196,15 @@ python-hwpx
 │   ├── body.py          # typed body model
 │   └── common.py        # generic XML ↔ dataclass
 ├── hwpx.tools
+│   ├── archive_cli      # unpack/pack CLI and repacking metadata
 │   ├── text_extractor   # text extraction pipeline
+│   ├── text_extract_cli # text extraction CLI
 │   ├── object_finder    # object search utilities
 │   ├── exporter         # text/HTML/Markdown export
-│   └── validator        # schema validation (hwpx-validate CLI)
+│   ├── validator        # schema validation (hwpx-validate CLI)
+│   ├── package_validator# ZIP/OPC/HWPX structure checks
+│   ├── page_guard       # layout-drift proxy
+│   └── template_analyzer# reference document analysis/extraction
 └── hwpx.templates       # built-in blank document template
 ```

@@ -207,8 +213,26 @@ python-hwpx
 ```bash
 # Validate an HWPX document against the schema
 hwpx-validate 문서.hwpx
+
+# Check ZIP/OPC/HWPX package structure
+hwpx-validate-package 문서.hwpx
+
+# Unpack / repack an HWPX file
+hwpx-unpack 문서.hwpx ./unpacked
+hwpx-pack ./unpacked ./repacked.hwpx
+
+# Analyze a reference template and extract its parts
+hwpx-analyze-template 문서.hwpx --extract-dir ./template-parts --json
+
+# Extract text as plain text or Markdown
+hwpx-text-extract 문서.hwpx --format markdown --output 문서.md
+
+# Compare layout-drift proxy metrics
+hwpx-page-guard --reference 원본.hwpx --output 결과.hwpx
 ```

+`hwpx-page-guard` does not compute the actual rendered page count. Instead, it is a proxy tool that detects layout-drift risk by comparing paragraph counts, table counts, shape/control counts, explicit page/column breaks, and text-length statistics.
+
 ## Documentation

 | | |

docs/usage.md

Lines changed: 24 additions & 0 deletions
@@ -2,6 +2,30 @@

 python-hwpx provides several API layers for validating and editing HWPX containers. This document introduces the core usage patterns, from opening a document at the package level to the high-level tools for working with paragraphs and annotations.

+## CLI Workflow
+
+In addition to the library API, CLIs are provided for template-preserving workflows.
+
+```bash
+# Check package structure
+hwpx-validate-package sample.hwpx
+
+# Unpack / pack for XML-first editing
+hwpx-unpack sample.hwpx ./sample-unpacked
+hwpx-pack ./sample-unpacked ./sample-repacked.hwpx
+
+# Template analysis and part extraction
+hwpx-analyze-template sample.hwpx --extract-dir ./template-parts --json
+
+# Text extraction
+hwpx-text-extract sample.hwpx --format markdown --output sample.md
+
+# Layout-drift proxy
+hwpx-page-guard --reference sample.hwpx --output edited.hwpx
+```
+
+`hwpx-page-guard` does not compute the page count of an actual renderer; it is a proxy checker that detects layout-change risk by comparing structural and text statistics.
+
 ## Quick Examples

 ### Example 1: Counting paragraphs

pyproject.toml

Lines changed: 4 additions & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "python-hwpx"
-version = "2.6"
+version = "2.7"
 description = "Hancom HWPX 패키지를 로드하고 편집하기 위한 Python 유틸리티 모음"
 readme = { file = "README.md", content-type = "text/markdown" }
 license = { file = "LICENSE" }
@@ -49,9 +49,12 @@ Documentation = "https://github.com/airmang/python-hwpx/tree/main/docs"
 Issues = "https://github.com/airmang/python-hwpx/issues"

 [project.scripts]
+hwpx-unpack = "hwpx.tools.archive_cli:unpack_main"
+hwpx-pack = "hwpx.tools.archive_cli:pack_main"
 hwpx-validate = "hwpx.tools.validator:main"
 hwpx-validate-package = "hwpx.tools.package_validator:main"
 hwpx-page-guard = "hwpx.tools.page_guard:main"
+hwpx-analyze-template = "hwpx.tools.template_analyzer:main"
 hwpx-text-extract = "hwpx.tools.text_extract_cli:main"

 [tool.setuptools]
