This project provides a configurable tool (`doc2vec`) to crawl specified websites, repositories, local directories, and Zendesk instances, and convert their content into vector embeddings.

The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.
> **⚠️ Version 2.0.0 Breaking Change:** Version 2.0.0 introduced enhanced chunking with new metadata fields (`chunk_index` and `total_chunks`) that enable page reconstruction and improved chunk ordering. The database schema has changed, and databases created with versions prior to 2.0.0 use a different format. **If you're upgrading to version 2.0.0 or later, you should start with fresh databases** to take advantage of the new features. While the MCP server maintains backward compatibility for querying old databases, doc2vec itself will create databases in the new format. If you need to migrate existing data, consider re-running doc2vec on your sources to regenerate the databases with the enhanced chunking format.
## Key Features

* **Website Crawling:** Recursively crawls websites starting from a given base URL.
* **Flexible Filtering:** Filter tickets by status and priority.
* **Local Directory Processing:** Scans local directories for files, converts content to searchable chunks.
* **PDF Support:** Automatically extracts text from PDF files and converts them to Markdown format using Mozilla's PDF.js.
* **Word Document Support:** Processes both legacy `.doc` and modern `.docx` files, extracting text and formatting.
* **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
* **Smart H1 Preservation:** Automatically extracts and preserves page titles (H1 headings) that Readability might strip as "page chrome", ensuring proper heading hierarchy.
* **Flexible Content Selectors:** Supports multiple content container patterns (`.docs-content`, `.doc-content`, `.markdown-body`, `article`, etc.) for better compatibility with various documentation sites.
* **HTML to Markdown:** Converts extracted HTML to clean Markdown using `turndown`, preserving code blocks and basic formatting.
* **Clean Heading Text:** Automatically removes anchor links (like `[](#section-id)`) from heading text for cleaner hierarchy display.
* **Intelligent Chunking:** Splits Markdown content into manageable chunks based on headings and token limits, preserving context.
* **Vector Embeddings:** Generates embeddings for each chunk using OpenAI's `text-embedding-3-large` model.
* **Vector Storage:** Supports storing chunks, metadata, and embeddings in:
* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
* **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.
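As a rough illustration of the chunking step above, a minimal heading-plus-token-limit splitter might look like this (a simplified sketch, not doc2vec's actual implementation; the real tool would use the embedding model's tokenizer rather than a plain word count):

```typescript
// Simplified sketch: split Markdown into sections at heading lines,
// then cap each section by an approximate token budget.
function chunkMarkdown(markdown: string, maxTokens = 256): string[] {
  // Lookahead split keeps each heading attached to its own section
  const sections = markdown.split(/(?=^#{1,6} )/m).filter(s => s.trim());
  const chunks: string[] = [];
  for (const section of sections) {
    // Word count stands in for real tokenization in this sketch
    const words = section.trim().split(/\s+/);
    for (let i = 0; i < words.length; i += maxTokens) {
      chunks.push(words.slice(i, i + maxTokens).join(" "));
    }
  }
  return chunks;
}
```

Because the split uses a lookahead, each heading stays at the front of its chunk, which is what makes the heading-hierarchy metadata described below possible.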
## Chunk Metadata & Page Reconstruction

Each chunk stored in the database includes rich metadata that enables powerful retrieval and page reconstruction capabilities.

### Metadata Fields

| Field | Type | Description |
|-------|------|-------------|
| `product_name` | string | Product identifier from config |
| `version` | string | Version identifier from config |
| `heading_hierarchy` | string[] | Breadcrumb of headings above the chunk |
| `chunk_index` | number | Zero-based position of the chunk within its page |
| `total_chunks` | number | Total number of chunks for the page |
The `chunk_index` and `total_chunks` fields let a client walk neighboring chunks and reassemble an entire page:

```typescript
// Check if there are more chunks after the current one
if (currentChunk.chunk_index < currentChunk.total_chunks - 1) {
  // More chunks available - fetch the next one
  const nextChunkIndex = currentChunk.chunk_index + 1;
}

// Reconstruct full page content
const fullPageContent = chunks
  .sort((a, b) => a.chunk_index - b.chunk_index)
  .map(c => c.content)
  .join("\n\n");
```
### Heading Hierarchy (Breadcrumbs)

Each chunk includes a `heading_hierarchy` array that provides context about where the content appears in the document structure. This is injected as a `[Topic: ...]` prefix in the chunk content to improve vector search relevance.

For example, a chunk under "Installation > Prerequisites > Docker" carries all three headings in its hierarchy. This ensures that searches for parent topics (like "Installation") will also match relevant child content.
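As an illustrative sketch (the exact record shape is an assumption, not taken from this README), such a chunk might be stored as:

```typescript
// Hypothetical chunk record (shape assumed for illustration): the
// heading_hierarchy array plus the [Topic: ...] prefix injected into content.
const chunk = {
  heading_hierarchy: ["Installation", "Prerequisites", "Docker"],
  content:
    "[Topic: Installation > Prerequisites > Docker]\n\n" +
    "Docker must be installed before continuing.",
};

// Rebuilding the prefix from the hierarchy shows how the two correspond
const topicPrefix = `[Topic: ${chunk.heading_hierarchy.join(" > ")}]`;
```

A vector search for "Installation" now has lexical and semantic overlap with this chunk even though its body text only discusses Docker.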
## Prerequisites
* **Node.js:** Version 18 or higher recommended (check `.nvmrc` if available).
Configuration is managed through two files:
For local directories (`type: 'local_directory'`):
* `path`: Path to the local directory to process.
* `include_extensions`: (Optional) Array of file extensions to include (e.g., `['.md', '.txt', '.pdf', '.doc', '.docx']`). Defaults to `['.md', '.txt', '.html', '.htm', '.pdf']`.
* `exclude_extensions`: (Optional) Array of file extensions to exclude.
* `recursive`: (Optional) Whether to traverse subdirectories (defaults to `true`).
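Putting the local-directory options together, a source entry in `config.yaml` might look like this (the top-level key and values are illustrative assumptions; the option names match those documented above):

```yaml
# Hypothetical config.yaml fragment for a local_directory source
sources:
  - type: 'local_directory'
    path: './docs'
    include_extensions: ['.md', '.txt', '.pdf', '.doc', '.docx']
    exclude_extensions: ['.tmp']
    recursive: true
```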