Evaluator: ZIP content inspection for compound document detection #51

@unclesp1d3r

Summary

Implement content inspection within ZIP archives to detect compound document formats (DOCX, XLSX, HWP, etc.) that use ZIP as a container.

Context

Five tests in the compatibility corpus detect only the outer container format (ZIP, or TAR in one case) and miss the specific document type:

  • `escapevel` - ZIP detected, expected "Zip data (MIME type ...)"
  • `gpkg-1-zst` - TAR detected, expected "Gentoo GLEP 78 binary package"
  • `HWP2016.hwpx.zip` - ZIP detected, expected "Hancom HWP file, HWPX"
  • `issue311docx` - ZIP detected, expected "Microsoft Word 2007+"
  • `issue359xlsx` - ZIP detected, expected "Microsoft Excel 2007+"

GNU file achieves this through magic rules that use indirect offsets and string matching within ZIP entries (checking for `[Content_Types].xml`, `word/`, `xl/` paths).
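To make the entry-name approach concrete, here is a minimal sketch of walking ZIP local file headers from offset 0 and collecting entry names, which is roughly what chained indirect offsets accomplish in a magic rule. It is an illustration only, not this project's evaluator: the function names, the `limit` parameter, and the sample archive are hypothetical, and the sketch stops at any entry that uses a data descriptor because the next header cannot be located from the sizes in that header.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = b"PK\x03\x04"  # ZIP local file header signature
FIXED_HEADER_LEN = 30             # bytes in the header before the entry name


def local_entry_names(data: bytes, limit: int = 8) -> list[str]:
    """Collect entry names by hopping from one local file header to the next."""
    names = []
    offset = 0
    while (
        len(names) < limit
        and offset + FIXED_HEADER_LEN <= len(data)
        and data[offset:offset + 4] == LOCAL_HEADER_SIG
    ):
        (flags,) = struct.unpack_from("<H", data, offset + 6)
        (comp_size,) = struct.unpack_from("<I", data, offset + 18)
        name_len, extra_len = struct.unpack_from("<HH", data, offset + 26)
        start = offset + FIXED_HEADER_LEN
        names.append(data[start:start + name_len].decode("utf-8", "replace"))
        if flags & 0x08:
            # Sizes live in a trailing data descriptor, so the next header
            # cannot be found from this one; stop here.
            break
        # Skip name, extra field, and compressed payload to reach the next header.
        offset = start + name_len + extra_len + comp_size
    return names


def build_sample(entries: list[str]) -> bytes:
    """Hypothetical helper: build a small in-memory ZIP for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in entries:
            zf.writestr(name, b"<x/>")
    return buf.getvalue()


sample = build_sample(["[Content_Types].xml", "_rels/.rels", "word/document.xml"])
names = local_entry_names(sample)
looks_like_docx = "[Content_Types].xml" in names and any(
    n.startswith("word/") for n in names
)
```

Capping the walk with `limit` keeps the cost bounded on large archives, at the price of missing a marker entry that appears late in the file.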

Acceptance Criteria

  • Magic rules can inspect ZIP entry names for compound document detection
  • DOCX, XLSX, PPTX correctly identified
  • Other ZIP-based formats (HWP, ODF, JAR) identifiable with appropriate rules
  • Built-in rules updated with compound document detection
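As a rough illustration of how the criteria above might be exercised, the sketch below classifies an archive from its entry names and, where present, the contents of a `mimetype` entry. It deliberately swaps in Python's `zipfile` module (which reads the central directory) instead of magic-rule offsets, so it shows the decision logic only. The function and helper names are hypothetical. The Word, Excel, and HWPX strings come from the expected outputs listed in this issue; the PowerPoint, OpenDocument, JAR, and fallback strings are placeholders, and the `application/hwp+zip` value is an assumption that should be checked against the corpus file.

```python
import io
import zipfile

# OOXML packages carry [Content_Types].xml plus a format-specific directory.
OOXML_PREFIXES = [
    ("word/", "Microsoft Word 2007+"),
    ("xl/", "Microsoft Excel 2007+"),
    ("ppt/", "Microsoft PowerPoint 2007+"),  # placeholder description
]


def classify_zip(data: bytes) -> str:
    """Return a description for a ZIP-based compound document."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "[Content_Types].xml" in names:
            for prefix, description in OOXML_PREFIXES:
                if any(n.startswith(prefix) for n in names):
                    return description
        if "mimetype" in names:
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
            if mime == "application/hwp+zip":  # assumed HWPX mimetype value
                return "Hancom HWP file, HWPX"
            if mime.startswith("application/vnd.oasis.opendocument."):
                return f"OpenDocument ({mime})"  # placeholder description
        if "META-INF/MANIFEST.MF" in names:
            return "Java archive data (JAR)"  # placeholder description
    return "Zip archive data"  # placeholder fallback


def make_zip(entries: dict[str, bytes]) -> bytes:
    """Hypothetical helper: build an in-memory ZIP from name -> content pairs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in entries.items():
            zf.writestr(name, content)
    return buf.getvalue()


docx = make_zip({"[Content_Types].xml": b"", "word/document.xml": b""})
hwpx = make_zip({"mimetype": b"application/hwp+zip", "Contents/header.xml": b""})
docx_result = classify_zip(docx)
hwpx_result = classify_zip(hwpx)
```

The OOXML check runs first because those packages are identified by entry names alone, while the `mimetype` formats need one entry's content read as well.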

Depends On

References

Metadata

Assignees

No one assigned

    Labels

      • compatibility (libmagic compatibility and migration)
      • enhancement (New feature or request)
      • evaluator (Rule evaluation engine and logic)
      • priority:low (Nice to have, can defer)
