|
| 1 | +# hwpxskill Gap Audit |
| 2 | + |
| 3 | +## Scope |
| 4 | + |
| 5 | +This audit compares the practical workflow surface of `python-hwpx` against |
| 6 | +`Canine89/hwpxskill` without assuming the competitor's README claims are |
| 7 | +correct. The goal is to identify real workflow gaps, reuse existing engine |
| 8 | +abstractions where possible, and separate reproduced bugs from unverified |
| 9 | +assertions. |
| 10 | + |
| 11 | +## Current Repository Summary |
| 12 | + |
| 13 | +- Core engine: `src/hwpx/document.py`, `src/hwpx/opc/package.py`, |
| 14 | + `src/hwpx/oxml/*` |
| 15 | +- Existing high-level tooling before this patch set: |
| 16 | + - schema validator: `src/hwpx/tools/validator.py` |
| 17 | + - text extraction engine: `src/hwpx/tools/text_extractor.py` |
| 18 | + - object finder / exporters: `src/hwpx/tools/object_finder.py`, |
| 19 | + `src/hwpx/tools/exporter.py` |
| 20 | +- Workflow-gap code already present at audit start: |
| 21 | + - package validator: `src/hwpx/tools/package_validator.py` |
| 22 | + - page guard: `src/hwpx/tools/page_guard.py` |
| 23 | + - text extraction CLI: `src/hwpx/tools/text_extract_cli.py` |
| 24 | + - script-only unpack/pack/analyze tools under `scripts/` |
| 25 | + |
| 26 | +## Confirmed Gaps vs hwpxskill |
| 27 | + |
| 28 | +These were confirmed by inspecting both repos and the current local checkout. |
| 29 | + |
| 30 | +1. Public unpack/pack workflow was incomplete. |
| 31 | + - `python-hwpx` had script files for unpack/pack, but no package-level CLI |
| 32 | + entry points such as `hwpx-unpack` / `hwpx-pack`. |
| 33 | + - The existing pack/unpack scripts did not record archive entry order or |
| 34 | + compression metadata, so they could not preserve original ZIP layout |
| 35 | + details when repacking. |
| 36 | + - Overwrite behavior was not explicit or safe. |
| 37 | + |
| 38 | +2. Template analysis workflow was incomplete. |
| 39 | + - `python-hwpx` had a script for analyzing reference documents, but it was |
| 40 | + not promoted to a package-level CLI, did not emit a structured JSON |
| 41 | + summary, and had only a smoke test instead of extraction-focused tests. |
| 42 | + |
| 43 | +3. Page-guard coverage was narrower than requested. |
| 44 | + - The existing page guard already acted as a structural drift detector, but |
| 45 | + it did not yet count shape/control deltas. |
| 46 | + - It also needed clearer documentation that it is a proxy/risk heuristic, |
| 47 | + not a rendered page counter. |
| 48 | + |
| 49 | +4. Public docs lagged behind the implemented tooling. |
| 50 | + - `README.md` still documented only `hwpx-validate` in the CLI section. |
| 51 | + - The main usage guide did not document unpack/pack/analyze/page-guard/text |
| 52 | + extraction workflows. |
| 53 | + |
| 54 | +5. Audit documentation itself was missing. |
| 55 | + - There was no repository-local audit note separating verified findings from |
| 56 | + competitor marketing claims. |
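
The ZIP layout gap in item 1 can be illustrated with the standard library alone. This is a minimal sketch and not the code in `scripts/office/unpack.py` or `scripts/office/pack.py`: the function names `record_zip_layout` and `repack_with_layout`, the manifest shape, and the sample entry names are all assumptions made for illustration.

```python
import io
import zipfile


def record_zip_layout(archive_bytes: bytes) -> list[dict]:
    """Return one record per entry, in the order the archive stores them."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return [
            {"name": info.filename, "compress_type": info.compress_type}
            for info in zf.infolist()
        ]


def repack_with_layout(archive_bytes: bytes, layout: list[dict]) -> bytes:
    """Rewrite the archive, replaying the recorded order and compression."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
            zipfile.ZipFile(out, "w") as dst:
        for entry in layout:
            dst.writestr(
                entry["name"],
                src.read(entry["name"]),
                compress_type=entry["compress_type"],
            )
    return out.getvalue()


def build_sample_archive() -> bytes:
    """Build a tiny archive with mixed compression (illustrative names only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", b"placeholder",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("Contents/section0.xml", b"<sec/>" * 50,
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    original = build_sample_archive()
    layout = record_zip_layout(original)
    repacked = repack_with_layout(original, layout)
    print(record_zip_layout(repacked) == layout)
```

A manifest like this only captures entry order and compression method. Other details such as timestamps or extra fields would need to be recorded separately if byte-level fidelity matters.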
| 57 | + |
| 58 | +## Reusable Internals Confirmed |
| 59 | + |
| 60 | +These existing internals made it unnecessary to cargo-cult `hwpxskill`'s raw |
| 61 | +XML-first approach. |
| 62 | + |
| 63 | +- `src/hwpx/opc/package.py` |
| 64 | + - `HwpxPackage.open()` |
| 65 | + - `HwpxPackage.part_names()` |
| 66 | + - `HwpxPackage.get_part()` |
| 67 | + - `HwpxPackage.get_xml()` |
| 68 | + - `HwpxPackage.header_paths()` |
| 69 | + - `HwpxPackage.section_paths()` |
| 70 | + - `HwpxPackage.main_content` |
| 71 | +- `src/hwpx/tools/validator.py` |
| 72 | + - existing schema validation path |
| 73 | +- `src/hwpx/tools/text_extractor.py` |
| 74 | + - existing traversal and text extraction engine |
| 75 | +- `src/hwpx/tools/page_guard.py` |
| 76 | + - existing metrics collection shape that could be extended instead of replaced |
| 77 | + |
| 78 | +Conclusion: `python-hwpx` already had enough engine-level primitives to add the |
| 79 | +missing workflows without switching to competitor-style "raw XML everywhere". |
| 80 | + |
| 81 | +## Real Reproduced Bugs |
| 82 | + |
| 83 | +### 1. Validation dirty-state mutation (historical, now fixed) |
| 84 | + |
| 85 | +The one concrete bug candidate worth treating seriously was that validation |
| 86 | +mutated document state. That bug was real in the earlier implementation: |
| 87 | + |
| 88 | +- `HwpxDocument.validate()` serialized via `_to_bytes_raw()` |
| 89 | +- `_to_bytes_raw()` called `self._root.reset_dirty()` |
| 90 | +- Result: validating a modified document could clear the dirty state even when |
| 91 | + the user had not saved yet |
| 92 | + |
| 93 | +That behavior is now covered by a regression test: |
| 94 | + |
| 95 | +- `tests/test_gap_closure_tools.py::test_validate_preserves_dirty_state` |
| 96 | + |
| 97 | +At the time of this audit, current `main` already contains the fix, so the bug |
| 98 | +no longer reproduces on HEAD. |
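
The failure pattern described in this subsection can be reproduced with a toy model. `ToyDocument` below is a hypothetical stand-in, not `HwpxDocument`, and the `reset_dirty` keyword is only one possible way to separate the two code paths. The actual fix on `main` may look different.

```python
class ToyDocument:
    """Hypothetical document with a dirty flag; not the real HwpxDocument."""

    def __init__(self) -> None:
        self.text = ""
        self.dirty = False

    def edit(self, text: str) -> None:
        """Change the content and mark the document as unsaved."""
        self.text = text
        self.dirty = True

    def _to_bytes_raw(self, *, reset_dirty: bool = True) -> bytes:
        """Serialize; optionally clear the dirty flag as a side effect."""
        data = self.text.encode("utf-8")
        if reset_dirty:
            self.dirty = False
        return data

    def validate_buggy(self) -> bool:
        # Mirrors the reported pattern: serializing for validation
        # clears the dirty flag even though nothing was saved.
        self._to_bytes_raw()
        return True

    def validate_fixed(self) -> bool:
        # Serializes without touching the dirty flag.
        self._to_bytes_raw(reset_dirty=False)
        return True


if __name__ == "__main__":
    doc = ToyDocument()
    doc.edit("draft")
    doc.validate_fixed()
    print("dirty after fixed validate:", doc.dirty)
    doc.validate_buggy()
    print("dirty after buggy validate:", doc.dirty)
```

A regression test for this class of bug only needs to edit a document, call validation, and assert that the dirty flag is still set.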
| 99 | + |
| 100 | +## Bugs I Could Not Reproduce |
| 101 | + |
| 102 | +These claims appeared in or were implied by `hwpxskill`, but I could not |
| 103 | +substantiate them from evidence in the current `python-hwpx` checkout. |
| 104 | + |
| 105 | +1. "python-hwpx API has many bugs" |
| 106 | + - Too vague to verify. |
| 107 | + - Current tests and integration flows do not support that broad claim. |
| 108 | + |
| 109 | +2. "High-level API editing necessarily destroys styles/structure" |
| 110 | + - Not reproduced for ordinary paragraph/table editing in the current test |
| 111 | + suite. |
| 112 | + - Existing tests already cover roundtrip and style-preserving behavior. |
| 113 | + |
| 114 | +3. "page_guard detects actual page count changes" |
| 115 | + - Not supported by the competitor implementation itself. |
| 116 | + - Their script measures structural/text drift in `section0.xml`; it does not |
| 117 | + compute rendered page count. |
| 118 | + |
| 119 | +4. Header/footer instability or TypeError complaints |
| 120 | + - No current reproduction from repository tests. |
| 121 | + - Existing `tests/test_section_headers.py` covers the public API surface. |
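
To make the distinction in item 3 concrete, here is a minimal sketch of a structural drift check. It counts elements and text length in an XML fragment and reports deltas, and it never computes a rendered page count. The function names, the un-namespaced tags, and the 15% threshold are illustrative assumptions. They do not reproduce either repository's `page_guard` implementation or the real HWPX schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def collect_toy_metrics(xml_text: str) -> dict:
    """Count elements per tag and total stripped text length."""
    root = ET.fromstring(xml_text)
    tags = Counter(el.tag for el in root.iter())
    text_len = sum(len(chunk.strip()) for chunk in root.itertext())
    return {"tags": dict(tags), "text_len": text_len}


def compare_toy_metrics(before: dict, after: dict,
                        max_text_growth: float = 0.15) -> list[str]:
    """Report structural deltas; an empty list means no drift was detected."""
    findings = []
    for tag in sorted(set(before["tags"]) | set(after["tags"])):
        delta = after["tags"].get(tag, 0) - before["tags"].get(tag, 0)
        if delta:
            findings.append(f"{tag}: count changed by {delta:+d}")
    base = max(before["text_len"], 1)  # avoid division by zero
    growth = (after["text_len"] - before["text_len"]) / base
    if abs(growth) > max_text_growth:
        findings.append(f"text length changed by {growth:+.0%}")
    return findings


if __name__ == "__main__":
    before = collect_toy_metrics("<sec><p>hello world</p><p>bye</p></sec>")
    after = collect_toy_metrics(
        "<sec><p>hello world</p><p>bye</p>"
        "<p>an extra paragraph of text</p></sec>"
    )
    for finding in compare_toy_metrics(before, after):
        print(finding)
```

A check like this can flag edits that are likely to change pagination, but a document can pass it and still reflow, which is why it is better described as a risk heuristic.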
| 122 | + |
| 123 | +## Competitor Claims That Remain Unverified |
| 124 | + |
| 125 | +1. "XML-direct workflow preserves formatting almost exactly" |
| 126 | + - Plausible for some templates, but not benchmarked here. |
| 127 | + - No controlled comparison was performed in this patch. |
| 128 | + |
| 129 | +2. "Their workflow is more reliable for all existing documents" |
| 130 | + - Not established. |
| 131 | + - The competitor repo does not provide a broad evidence matrix for this. |
| 132 | + |
| 133 | +3. "Template replacement quality is universally better than the object API" |
| 134 | + - Not established. |
| 135 | + - Likely document-dependent. |
| 136 | + |
| 137 | +## Exact Files / Functions Inspected |
| 138 | + |
| 139 | +### Local repository |
| 140 | + |
| 141 | +- `pyproject.toml` |
| 142 | +- `README.md` |
| 143 | +- `docs/usage.md` |
| 144 | +- `src/hwpx/document.py` |
| 145 | + - `HwpxDocument.validate` |
| 146 | + - `HwpxDocument._to_bytes_raw` |
| 147 | +- `src/hwpx/opc/package.py` |
| 148 | + - `HwpxPackage.open` |
| 149 | + - `HwpxPackage.part_names` |
| 150 | + - `HwpxPackage.get_part` |
| 151 | + - `HwpxPackage.get_xml` |
| 152 | + - `HwpxPackage.main_content` |
| 153 | + - `HwpxPackage.header_paths` |
| 154 | + - `HwpxPackage.section_paths` |
| 155 | + - `HwpxPackage.save` |
| 156 | +- `src/hwpx/tools/validator.py` |
| 157 | + - `validate_document` |
| 158 | +- `src/hwpx/tools/package_validator.py` |
| 159 | + - `validate_package` |
| 160 | +- `src/hwpx/tools/page_guard.py` |
| 161 | + - `collect_metrics` |
| 162 | + - `compare_metrics` |
| 163 | +- `src/hwpx/tools/text_extractor.py` |
| 164 | + - `TextExtractor.iter_sections` |
| 165 | + - `TextExtractor.iter_paragraphs` |
| 166 | + - `TextExtractor.extract_text` |
| 167 | +- `src/hwpx/tools/text_extract_cli.py` |
| 168 | +- `scripts/office/unpack.py` |
| 169 | +- `scripts/office/pack.py` |
| 170 | +- `scripts/analyze_template.py` |
| 171 | +- `tests/test_gap_closure_tools.py` |
| 172 | +- `tests/test_section_headers.py` |
| 173 | +- `.github/workflows/release.yml` |
| 174 | +- `.github/workflows/tests.yml` |
| 175 | + |
| 176 | +### Competitor repository (`Canine89/hwpxskill`) |
| 177 | + |
| 178 | +- `README.md` |
| 179 | +- `scripts/validate.py` |
| 180 | +- `scripts/page_guard.py` |
| 181 | +- `scripts/text_extract.py` |
| 182 | +- `scripts/analyze_template.py` |
| 183 | + |
| 184 | +## Patch Direction Chosen |
| 185 | + |
| 186 | +This first PR-equivalent patch should: |
| 187 | + |
| 188 | +1. promote unpack/pack/analyze into package-level tooling with CLI entry points |
| 189 | +2. keep using `python-hwpx` engine abstractions for package inspection and text |
| 190 | + extraction |
| 191 | +3. extend page guard as a proxy detector, not as a fake page counter |
| 192 | +4. keep backward compatibility with existing `HwpxDocument` APIs |
| 193 | +5. strengthen tests and docs around the new tooling |