doc_knowledge doc_parser
Document parser — discovers and parses .md, .pdf, and .txt files into chunks with metadata (doc name, section, page number) for semantic search.
| Term | Definition | Example |
|---|---|---|
| chunk | A piece of a document stored as a unit for search. For markdown = one heading section, for PDF = one page, for txt = character window. | ## Deploy\nSteps to deploy... is one chunk from a markdown file. |
| PyMuPDF (fitz) | A Python library for reading PDF files. Extracts text page-by-page. | fitz.open("guide.pdf") → iterate pages → page.get_text() |
| Constant | Value | Purpose |
|---|---|---|
| DOC_EXTENSIONS | {".md", ".pdf", ".txt"} | Supported document file types |
| TXT_CHUNK_SIZE | 3000 | Max characters per text chunk |
| TXT_CHUNK_OVERLAP | 200 | Character overlap between text chunks |
One-line: Walk a directory and find all supported document files.
Input: docs_path = "/Users/me/team-docs"
Directory contains: deploy.md, arch.pdf, notes.txt, logo.png
Line 28: docs_path = "/Users/me/team-docs"
Line 32: os.walk → finds deploy.md (.md ✓), arch.pdf (.pdf ✓), notes.txt (.txt ✓), logo.png (.png ✗)
Return:
[
{"file_path": "deploy.md", "absolute_path": "/Users/me/team-docs/deploy.md", "ext": ".md"},
{"file_path": "arch.pdf", "absolute_path": "/Users/me/team-docs/arch.pdf", "ext": ".pdf"},
{"file_path": "notes.txt", "absolute_path": "/Users/me/team-docs/notes.txt", "ext": ".txt"},
]
One-line: Parse a markdown file into chunks split by headings.
Input: file content =
"# Intro\nWelcome to the project.\n## Deploy\nRun deploy.sh to deploy.\n## Rollback\nRevert with git."
Line 75: lines = ["# Intro", "Welcome to the project.", "## Deploy", "Run deploy.sh to deploy.", "## Rollback", "Revert with git."]
Line 80: "# Intro" matches heading regex → save previous (empty), start section "Intro"
Line 80: "## Deploy" matches → save "Intro" section, start "Deploy"
Line 80: "## Rollback" matches → save "Deploy" section, start "Rollback"
Line 90: End of lines → save "Rollback" section
Return: 3 chunks with metadata.section set to "Intro", "Deploy", and "Rollback" respectively; each has metadata.source_type = "markdown".
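The heading walk traced above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the heading regex, the `doc_name` parameter, the chunk dictionary layout, and the choice to skip empty sections are all assumptions.

```python
import re

# Assumed heading pattern: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def parse_markdown(text, doc_name):
    """Split markdown text into one chunk per heading section (sketch)."""
    chunks = []
    section = None  # title of the section being collected (None before any heading)
    body = []       # lines collected for the current section

    def flush():
        # Save the section collected so far; empty sections are skipped.
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "text": content,
                "metadata": {
                    "doc_name": doc_name,
                    "section": section,
                    "source_type": "markdown",
                },
            })

    for line in text.split("\n"):
        match = HEADING_RE.match(line)
        if match:
            flush()                      # close the previous section
            section = match.group(2).strip()
            body = [line]                # the heading line stays in the chunk text
        else:
            body.append(line)
    flush()                              # end of lines: save the last section
    return chunks
```

Keeping the heading line inside the chunk text matches the glossary example, where `## Deploy` and its body form one chunk.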
One-line: Parse a PDF file into chunks — one chunk per page.
Uses PyMuPDF (fitz) to extract text. Each page becomes one chunk with metadata.page set to the page number.
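A page-per-chunk parser along these lines might look like the sketch below. The helper `pages_to_chunks`, the `doc_name` parameter, the chunk layout, 1-based page numbers, and skipping blank pages are assumptions made for illustration; only `fitz.open()` and `page.get_text()` come from the text above.

```python
def pages_to_chunks(page_texts, doc_name):
    """Build one chunk per page from already-extracted page texts (hypothetical helper)."""
    chunks = []
    for page_number, text in enumerate(page_texts, start=1):  # assumed 1-based numbering
        text = text.strip()
        if not text:
            continue  # assumption: pages without extractable text are skipped
        chunks.append({
            "text": text,
            "metadata": {
                "doc_name": doc_name,
                "page": page_number,
                "source_type": "pdf",  # assumed label, mirroring "markdown"
            },
        })
    return chunks


def parse_pdf(path, doc_name):
    """Extract text page by page with PyMuPDF and turn each page into a chunk."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    with fitz.open(path) as doc:
        page_texts = [page.get_text() for page in doc]
    return pages_to_chunks(page_texts, doc_name)
```

Splitting extraction from chunk building keeps the chunking logic testable without a PDF file on disk.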
One-line: Parse a plain .txt file into overlapping character-window chunks.
Uses _split_text_by_chars() with 3000-char windows and 200-char overlap. Small files (≤3000 chars) become a single chunk.
One-line: Route a document to the right parser based on file extension.
| Extension | Parser |
|---|---|
| .md | parse_markdown() |
| .pdf | parse_pdf() |
| .txt | parse_text() |
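The routing table maps naturally onto a dictionary lookup. In the sketch below the three parsers are stand-in stubs, and the `parse_doc(path)` signature, the lowercasing of the extension, and the empty-list result for unsupported files are assumptions rather than documented behavior.

```python
import os


# Stand-in stubs so the example runs on its own; the real parsers return full chunks.
def parse_markdown(path):
    return [{"text": "", "metadata": {"source_type": "markdown"}}]


def parse_pdf(path):
    return [{"text": "", "metadata": {"source_type": "pdf"}}]


def parse_text(path):
    return [{"text": "", "metadata": {"source_type": "text"}}]


# Extension-to-parser routing, mirroring the table above.
PARSERS = {
    ".md": parse_markdown,
    ".pdf": parse_pdf,
    ".txt": parse_text,
}


def parse_doc(path):
    """Pick a parser from the file extension and run it."""
    ext = os.path.splitext(path)[1].lower()  # assumption: matching is case-insensitive
    parser = PARSERS.get(ext)
    if parser is None:
        return []  # assumption: unsupported types produce no chunks
    return parser(path)
```

A dictionary keeps the routing in one place, so supporting a new file type means adding one entry rather than another branch.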
One-line: Discover all documents in a folder and parse them all into chunks.
Calls discover_docs() → iterates results → calls parse_doc() for each → returns flat list of all chunks.
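The discover-then-parse flow could be sketched as below. The name `parse_all_docs` is hypothetical (this document does not name the function), `parse_doc` is passed in as a callable to keep the example self-contained, and the sorting of file names and lowercasing of extensions are assumptions.

```python
import os

DOC_EXTENSIONS = {".md", ".pdf", ".txt"}


def discover_docs(docs_path):
    """Walk docs_path and describe every file with a supported extension."""
    found = []
    for root, _dirs, files in os.walk(docs_path):
        for name in sorted(files):  # sorted only to make the order deterministic
            ext = os.path.splitext(name)[1].lower()
            if ext not in DOC_EXTENSIONS:
                continue  # e.g. logo.png is ignored
            absolute_path = os.path.join(root, name)
            found.append({
                "file_path": os.path.relpath(absolute_path, docs_path),
                "absolute_path": absolute_path,
                "ext": ext,
            })
    return found


def parse_all_docs(docs_path, parse_doc):
    """Discover every document under docs_path and return one flat list of chunks.

    `parse_doc` is any callable that takes a discovered-document dict and
    returns a list of chunks (hypothetical interface).
    """
    chunks = []
    for doc in discover_docs(docs_path):
        chunks.extend(parse_doc(doc))  # extend, not append, keeps the list flat
    return chunks
```

Using `extend` rather than `append` is what makes the result a flat list of chunks instead of a list of per-document lists.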