| name | office-archive-search | ||||
|---|---|---|---|---|---|
| description | Archive and search folders of Office and PDF documents when the user wants to organize a folder, build a per-file archive list, search files by contents instead of filename, or find which document mentions a topic, number, clause, or phrase. Supports doc, docx, xls, xlsx, xlsm, ppt, pptx, and pdf. | ||||
| metadata |
|
Use this skill when the user asks to:
- organize a folder of documents
- generate an archive file listing each document and a short content preview
- search a folder by file contents instead of only filename
- find which Word, Excel, PowerPoint, or PDF file mentions a topic, number, or phrase
- build a document inventory, archive list, filing list, or per-file summary list
- search inside Office or PDF files by keyword, clause number, invoice number, contract number, or policy title
Typical examples:
- "Organize this document folder"
- "Build me an archive list for this folder"
- "Which file mentions this contract number?"
- "Search these Office files for invoice April"
This skill is deterministic. It extracts text, stores an incremental SQLite index, writes archive outputs, and then searches that index.
Prefer this skill before attempting an LLM-only folder summary. The deterministic index is faster, cheaper, and easier to verify. If the user later wants a polished narrative report, generate it from archive.md or archive.jsonl after the archive step succeeds.
Important safety default: legacy Office COM extraction is disabled by default because it can trigger repair popups or interactive Office prompts on user machines. Only opt in with --allow-legacy-com when the operator explicitly accepts that risk.
Do not use this skill when:
- the user only wants a conversational summary of text already pasted into chat
- the target is mainly images, audio, or video rather than Office or PDF documents
- the user asked about one web page or one URL instead of a local folder
- the task is editing document formatting rather than extracting or searching content
docdocxxlsxlsxxlsmpptpptxpdf
Build or refresh the archive for a folder:
python {baseDir}/scripts/office_archive.py archive /path/to/folderSearch the folder by content:
python {baseDir}/scripts/office_archive.py search /path/to/folder "invoice april"Inspect one file:
python {baseDir}/scripts/office_archive.py inspect /path/to/file.docxCheck machine support before promising legacy doc/xls/ppt coverage:
python {baseDir}/scripts/office_archive.py checkOnly if the operator explicitly wants to probe whether Office COM can start on that machine:
python {baseDir}/scripts/office_archive.py check --probe-legacy-comRunning archive creates <root>/.office-archive/ with:
archive.md: human-readable inventoryarchive.jsonl: one JSON object per fileindex.sqlite: incremental search index
- If the user gives a folder path, run
archivefirst. - If the user gives one file path only, run
inspectfirst. - After
archive, read the generated outputs in<root>/.office-archive/. - If the user asked a search question, run
searchafter the archive step and answer from the results. - If the user asked for a folder overview, summarize from
archive.mdorarchive.jsonl. - Use
checkonly for debugging, environment validation, or legacy-format support checks. - Do not enable
--allow-legacy-comcasually on end-user machines.
Archive a Windows folder:
python {baseDir}/scripts/office_archive.py archive "D:\Docs\Contracts"Search the same folder:
python {baseDir}/scripts/office_archive.py search "D:\Docs\Contracts" "contract number"
python {baseDir}/scripts/office_archive.py search "D:\Docs\Contracts" "invoice april"Force a full rebuild when extraction code changed or cached results look stale:
python {baseDir}/scripts/office_archive.py archive "D:\Docs\Contracts" --forceInspect one file directly:
python {baseDir}/scripts/office_archive.py inspect "D:\Docs\Contracts\renewal.doc"- Re-runs are incremental: unchanged files are reused from SQLite instead of re-extracted.
- PDF extraction prefers
pdftotext; if that is unavailable, the script triespypdf. - Modern zipped formats (
docx/xlsx/xlsm/pptx) work cross-platform. - Legacy binary formats are Windows-first and handled conservatively.
- Search is best-effort token search, not semantic retrieval. Exact phrases, IDs, vendor names, invoice numbers, and contract numbers work best.