
doc_knowledge doc_parser

aakash-anko edited this page May 28, 2026 · 1 revision

doc_knowledge/doc_parser.py

Document parser: discovers .md, .pdf, and .txt files and parses them into chunks with metadata (doc name, section, page number) for semantic search.


Key Concepts

Term | Definition | Example
chunk | A piece of a document stored as a unit for search. Markdown: one heading section; PDF: one page; .txt: one character window. | "## Deploy\nSteps to deploy..." is one chunk from a markdown file.
PyMuPDF (fitz) | A Python library for reading PDF files; extracts text page by page. | fitz.open("guide.pdf") → iterate pages → page.get_text()

Constants

Constant | Value | Purpose
DOC_EXTENSIONS | {".md", ".pdf", ".txt"} | Supported document file types
TXT_CHUNK_SIZE | 3000 | Max characters per text chunk
TXT_CHUNK_OVERLAP | 200 | Character overlap between text chunks

Function: discover_docs(docs_path)

One-line: Walk a directory and find all supported document files.

Example

Input: docs_path = "/Users/me/team-docs"
       Directory contains: deploy.md, arch.pdf, notes.txt, logo.png
Line 28: docs_path = "/Users/me/team-docs"
Line 32: os.walk → finds deploy.md (.md ✓), arch.pdf (.pdf ✓), notes.txt (.txt ✓), logo.png (.png ✗)

Return:

[
    {"file_path": "deploy.md", "absolute_path": "/Users/me/team-docs/deploy.md", "ext": ".md"},
    {"file_path": "arch.pdf", "absolute_path": "/Users/me/team-docs/arch.pdf", "ext": ".pdf"},
    {"file_path": "notes.txt", "absolute_path": "/Users/me/team-docs/notes.txt", "ext": ".txt"},
]
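
The walk above can be sketched roughly as follows. This is a minimal illustration, not the code in doc_parser.py: the name discover_docs_sketch is hypothetical, and lower-casing the extension before matching is an assumption. The order of results follows os.walk and is not guaranteed.

```python
import os

DOC_EXTENSIONS = {".md", ".pdf", ".txt"}


def discover_docs_sketch(docs_path):
    """Walk docs_path and return one dict per supported document file."""
    docs = []
    for root, _dirs, files in os.walk(docs_path):
        for name in files:
            # Assumption: extensions are matched case-insensitively.
            ext = os.path.splitext(name)[1].lower()
            if ext not in DOC_EXTENSIONS:
                continue  # e.g. logo.png is skipped
            absolute_path = os.path.abspath(os.path.join(root, name))
            docs.append({
                # Path relative to the docs root, as in the Return example.
                "file_path": os.path.relpath(absolute_path, docs_path),
                "absolute_path": absolute_path,
                "ext": ext,
            })
    return docs
```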

Function: parse_markdown(file_path, relative_path)

One-line: Parse a markdown file into chunks split by headings.

Example

Input: file content =
  "# Intro\nWelcome to the project.\n## Deploy\nRun deploy.sh to deploy.\n## Rollback\nRevert with git."
Line 75: lines = ["# Intro", "Welcome to the project.", "## Deploy", "Run deploy.sh to deploy.", "## Rollback", "Revert with git."]
Line 80: "# Intro" matches heading regex → save previous (empty), start section "Intro"
Line 80: "## Deploy" matches → save "Intro" section, start "Deploy"
Line 80: "## Rollback" matches → save "Deploy" section, start "Rollback"
Line 90: End of lines → save "Rollback" section

Return: 3 chunks whose metadata.section values are "Intro", "Deploy", and "Rollback" respectively. Each chunk has metadata.source_type = "markdown".
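
The heading-split trace above can be sketched as below. It is only an approximation under stated assumptions: it takes the file content as a string instead of a path, the function name, the regex, and the "text" and "doc_name" keys are hypothetical, and text before the first heading gets an empty section name. It does not special-case "#" lines inside fenced code blocks.

```python
import re

# Assumed heading pattern: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown_by_headings(text, relative_path):
    """Split markdown text into one chunk per heading section."""
    chunks = []
    section = ""   # assumption: preamble before the first heading has no section name
    buffer = []

    def flush():
        # Save the section collected so far; empty sections are dropped.
        body = "\n".join(buffer).strip()
        if body:
            chunks.append({
                "text": body,
                "metadata": {
                    "doc_name": relative_path,
                    "section": section,
                    "source_type": "markdown",
                },
            })

    for line in text.split("\n"):
        match = HEADING_RE.match(line)
        if match:
            flush()                   # save the previous section
            section = match.group(2)  # start a new one
            buffer = [line]           # the heading line stays in the chunk text
        else:
            buffer.append(line)
    flush()                           # end of lines: save the last section
    return chunks
```

Run on the example input above, this yields three chunks, and the "Deploy" chunk text is "## Deploy\nRun deploy.sh to deploy.".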


Function: parse_pdf(file_path, relative_path)

One-line: Parse a PDF file into chunks — one chunk per page.

Uses PyMuPDF (fitz) to extract text. Each page becomes one chunk with metadata.page set to the page number.
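
A rough sketch of the page-per-chunk idea, under stated assumptions: the helper names, the "text" and "doc_name" keys, the "pdf" source_type value, and 1-based page numbers are all hypothetical and may differ from doc_parser.py. The page loop is kept separate from the PyMuPDF call so it works with any object that has a get_text() method.

```python
def pages_to_chunks(pages, relative_path):
    """Turn an iterable of page objects into one chunk per page."""
    chunks = []
    # Assumption: page numbers are 1-based in the metadata.
    for page_number, page in enumerate(pages, start=1):
        chunks.append({
            "text": page.get_text(),
            "metadata": {
                "doc_name": relative_path,
                "page": page_number,
                "source_type": "pdf",  # assumed value
            },
        })
    return chunks


def parse_pdf_sketch(file_path, relative_path):
    """Open a PDF with PyMuPDF and chunk it page by page."""
    import fitz  # PyMuPDF, imported lazily so the helper above needs no install

    with fitz.open(file_path) as doc:
        return pages_to_chunks(doc, relative_path)
```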


Function: parse_text(file_path, relative_path)

One-line: Parse a plain .txt file into overlapping character-window chunks.

Uses _split_text_by_chars() with 3000-char windows and 200-char overlap. Small files (≤3000 chars) become a single chunk.
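
One way the character windows could work is sketched below. This is not the actual _split_text_by_chars(); the function name and the exact stopping rule are assumptions. With a 3000-char window and 200-char overlap, each window starts 2800 characters after the previous one, so a 6000-char file gives windows starting at 0, 2800, and 5600.

```python
TXT_CHUNK_SIZE = 3000
TXT_CHUNK_OVERLAP = 200


def split_text_by_chars_sketch(text, chunk_size=TXT_CHUNK_SIZE, overlap=TXT_CHUNK_OVERLAP):
    """Split text into overlapping character windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if len(text) <= chunk_size:
        return [text]  # small files become a single chunk

    step = chunk_size - overlap  # 2800 with the default constants
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```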


Function: parse_doc(doc_info)

One-line: Route a document to the right parser based on file extension.

Extension | Parser
.md | parse_markdown()
.pdf | parse_pdf()
.txt | parse_text()
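
The routing table above maps naturally onto a dictionary lookup. The sketch below is illustrative only: the stub parsers stand in for the real ones, passing absolute_path and file_path as the two parser arguments is an assumption, and so is returning an empty list for an unsupported extension.

```python
def _stub_parser(name):
    """Build a stand-in parser that records which parser handled the file."""
    def parse(file_path, relative_path):
        return [(name, relative_path)]
    return parse


# Stand-ins for parse_markdown(), parse_pdf(), and parse_text().
PARSERS = {
    ".md": _stub_parser("parse_markdown"),
    ".pdf": _stub_parser("parse_pdf"),
    ".txt": _stub_parser("parse_text"),
}


def parse_doc_sketch(doc_info, parsers=PARSERS):
    """Route one discovered document to the parser for its extension."""
    parser = parsers.get(doc_info["ext"])
    if parser is None:
        return []  # assumption: unsupported extensions yield no chunks
    return parser(doc_info["absolute_path"], doc_info["file_path"])
```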

Function: parse_all_docs(docs_path)

One-line: Discover all documents in a folder and parse them all into chunks.

Calls discover_docs() → iterates results → calls parse_doc() for each → returns flat list of all chunks.
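
The discover-then-parse pipeline can be sketched as a simple loop. To keep the example self-contained, the discovery and parsing functions are passed in as arguments; that injection, and the name parse_all_docs_sketch, are assumptions made for illustration. The real function presumably calls discover_docs() and parse_doc() directly.

```python
def parse_all_docs_sketch(docs_path, discover, parse):
    """Discover every document under docs_path and flatten all parsed chunks."""
    all_chunks = []
    for doc_info in discover(docs_path):
        # extend (not append) keeps the result a flat list of chunks
        all_chunks.extend(parse(doc_info))
    return all_chunks
```

For example, passing a discover function that returns two doc_info dicts and a parse function that returns two chunks per document gives a flat list of four chunks.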
