doc_knowledge doc_parser

doc_knowledge/doc_parser.py

Document parser: discovers .md, .pdf, and .txt files and parses them into chunks with metadata (doc name, section, page number) for semantic search.

Key Concepts

Term	Definition	Example
chunk	A piece of a document stored as a single unit for search. Markdown files produce one chunk per heading section, PDFs one chunk per page, and .txt files one chunk per character window.	`## Deploy\nSteps to deploy...` is one chunk from a markdown file.
PyMuPDF (fitz)	A Python library for reading PDF files. Extracts text page-by-page.	`fitz.open("guide.pdf")` → iterate pages → `page.get_text()`

Constants

Constant	Value	Purpose
`DOC_EXTENSIONS`	`{".md", ".pdf", ".txt"}`	Supported document file types
`TXT_CHUNK_SIZE`	`3000`	Max characters per text chunk
`TXT_CHUNK_OVERLAP`	`200`	Character overlap between text chunks

Function: `discover_docs(docs_path)`

One-line: Walk a directory and find all supported document files.

Example

Input: docs_path = "/Users/me/team-docs"
       Directory contains: deploy.md, arch.pdf, notes.txt, logo.png

Line 28: docs_path = "/Users/me/team-docs"
Line 32: os.walk → finds deploy.md (.md ✓), arch.pdf (.pdf ✓), notes.txt (.txt ✓), logo.png (.png ✗)

Return:

[
    {"file_path": "deploy.md", "absolute_path": "/Users/me/team-docs/deploy.md", "ext": ".md"},
    {"file_path": "arch.pdf", "absolute_path": "/Users/me/team-docs/arch.pdf", "ext": ".pdf"},
    {"file_path": "notes.txt", "absolute_path": "/Users/me/team-docs/notes.txt", "ext": ".txt"},
]
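
The walk above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the case-insensitive extension match and the sorted file order are assumptions, while the returned keys follow the example above.

```python
import os

DOC_EXTENSIONS = {".md", ".pdf", ".txt"}


def discover_docs(docs_path):
    """Walk docs_path and return one dict per supported document file."""
    docs = []
    for root, _dirs, files in os.walk(docs_path):
        for name in sorted(files):  # assumption: sorted for a stable order
            ext = os.path.splitext(name)[1].lower()  # assumption: case-insensitive
            if ext not in DOC_EXTENSIONS:
                continue  # e.g. logo.png is skipped
            absolute_path = os.path.join(root, name)
            docs.append({
                "file_path": os.path.relpath(absolute_path, docs_path),
                "absolute_path": absolute_path,
                "ext": ext,
            })
    return docs
```

Files in subdirectories would get a `file_path` relative to `docs_path`, such as `guides/deploy.md`, under this sketch.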

Function: `parse_markdown(file_path, relative_path)`

One-line: Parse a markdown file into chunks split by headings.

Example

Input: file content =
  "# Intro\nWelcome to the project.\n## Deploy\nRun deploy.sh to deploy.\n## Rollback\nRevert with git."

Line 75: lines = ["# Intro", "Welcome to the project.", "## Deploy", "Run deploy.sh to deploy.", "## Rollback", "Revert with git."]
Line 80: "# Intro" matches heading regex → save previous (empty), start section "Intro"
Line 80: "## Deploy" matches → save "Intro" section, start "Deploy"
Line 80: "## Rollback" matches → save "Deploy" section, start "Rollback"
Line 90: End of lines → save "Rollback" section

Return: 3 chunks whose metadata.section values are "Intro", "Deploy", and "Rollback" respectively. Every chunk has metadata.source_type = "markdown".
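
A hedged sketch of the heading split traced above. The regex, the helper name `split_markdown_by_headings`, and the `content` and `doc` keys are assumptions made for illustration. Only `metadata.section` and `metadata.source_type` come from this page.

```python
import re

# Assumption: ATX-style headings, one to six '#' characters followed by whitespace.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_headings(text, relative_path):
    """Split markdown text into one chunk per heading section."""
    chunks = []
    section, body = None, []

    def flush():
        # Save the section collected so far; skip it when there is nothing to save.
        content = "\n".join(body).strip()
        if section is None and not content:
            return
        chunks.append({
            "content": content,
            "metadata": {
                "doc": relative_path,        # hypothetical key for the doc name
                "section": section,
                "source_type": "markdown",
            },
        })

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # a new heading closes the previous section
            section, body = match.group(2).strip(), [line]
        else:
            body.append(line)
    flush()  # end of lines: save the last open section
    return chunks
```

In this sketch the heading line stays inside its chunk, matching the `## Deploy\nSteps to deploy...` example under Key Concepts.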

Function: `parse_pdf(file_path, relative_path)`

One-line: Parse a PDF file into chunks — one chunk per page.

Uses PyMuPDF (fitz) to extract text. Each page becomes one chunk with metadata.page set to the page number.
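
One way the page-per-chunk rule could look, as a sketch only. The helper `pages_to_chunks`, the 1-based page numbering, the skipping of empty pages, and the `content`, `doc`, and `source_type` values are all assumptions. `fitz.open()` and `page.get_text()` are the PyMuPDF calls named under Key Concepts.

```python
def pages_to_chunks(page_texts, relative_path):
    """Turn a list of per-page strings into one chunk per non-empty page."""
    chunks = []
    for page_number, text in enumerate(page_texts, start=1):  # assumption: 1-based
        text = text.strip()
        if not text:
            continue  # assumption: pages with no extractable text are skipped
        chunks.append({
            "content": text,
            "metadata": {
                "doc": relative_path,   # hypothetical key for the doc name
                "page": page_number,
                "source_type": "pdf",   # assumption, mirroring "markdown"
            },
        })
    return chunks


def parse_pdf(file_path, relative_path):
    """Extract text page by page with PyMuPDF, then chunk it."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    with fitz.open(file_path) as doc:
        page_texts = [page.get_text() for page in doc]
    return pages_to_chunks(page_texts, relative_path)
```

Keeping the extraction separate from the chunking lets the chunk logic be exercised without a real PDF.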

Function: `parse_text(file_path, relative_path)`

One-line: Parse a plain .txt file into overlapping character-window chunks.

Uses `_split_text_by_chars()` with windows of `TXT_CHUNK_SIZE` (3000) characters that overlap by `TXT_CHUNK_OVERLAP` (200) characters. Files of 3000 characters or fewer become a single chunk.
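
The windowing can be sketched as below. The body of the real `_split_text_by_chars()` is not shown on this page, so the function here (named `split_text_by_chars`) is an assumption about how such a splitter might work; only the two constants come from the Constants table.

```python
TXT_CHUNK_SIZE = 3000
TXT_CHUNK_OVERLAP = 200


def split_text_by_chars(text, size=TXT_CHUNK_SIZE, overlap=TXT_CHUNK_OVERLAP):
    """Return overlapping character windows that together cover text."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if len(text) <= size:
        return [text] if text else []  # small files become a single chunk
    step = size - overlap  # with the defaults, windows start 2800 chars apart
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return windows
```

With the defaults, a 7000-character file yields three windows starting at offsets 0, 2800, and 5600, and the last 200 characters of each window repeat at the start of the next.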

Function: `parse_doc(doc_info)`

One-line: Route a document to the right parser based on file extension.

Extension	Parser
`.md`	`parse_markdown()`
`.pdf`	`parse_pdf()`
`.txt`	`parse_text()`
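
The routing table above maps naturally onto a dictionary lookup. In this sketch the three parsers are stand-in stubs, and two details are assumptions: that an unsupported extension returns an empty list, and that the parser's `file_path` argument receives `doc_info["absolute_path"]` while `relative_path` receives `doc_info["file_path"]`.

```python
# Stand-in parsers so the sketch runs on its own; the real ones read the file.
def parse_markdown(file_path, relative_path):
    return [f"markdown:{relative_path}"]


def parse_pdf(file_path, relative_path):
    return [f"pdf:{relative_path}"]


def parse_text(file_path, relative_path):
    return [f"text:{relative_path}"]


PARSERS = {".md": parse_markdown, ".pdf": parse_pdf, ".txt": parse_text}


def parse_doc(doc_info):
    """Route a discovered document to the parser for its extension."""
    parser = PARSERS.get(doc_info["ext"])
    if parser is None:
        return []  # assumption: unsupported extensions yield no chunks
    return parser(doc_info["absolute_path"], doc_info["file_path"])
```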

Function: `parse_all_docs(docs_path)`

One-line: Discover all documents in a folder and parse them all into chunks.

Calls discover_docs() → iterates results → calls parse_doc() for each → returns flat list of all chunks.
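
A minimal sketch of that pipeline. The documented signature takes only `docs_path`; the extra `discover` and `parse` parameters are hypothetical and exist only so the example runs without the rest of the module.

```python
def parse_all_docs(docs_path, discover, parse):
    """Discover documents under docs_path and flatten their chunks into one list."""
    all_chunks = []
    for doc_info in discover(docs_path):
        # extend, not append: the result is a flat list, not a list of lists
        all_chunks.extend(parse(doc_info))
    return all_chunks
```

Calling it with stand-ins shows the flattening:

```python
fake_docs = [{"file_path": "a.md"}, {"file_path": "b.txt"}]
chunks = parse_all_docs(
    "/docs",
    discover=lambda path: fake_docs,
    parse=lambda doc: [doc["file_path"] + "#1", doc["file_path"] + "#2"],
)
# chunks == ["a.md#1", "a.md#2", "b.txt#1", "b.txt#2"]
```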
