Commit f0720f9

doc2vec v2.0.0 (#47)

* Adding topics and multiple parsing improvements
* Adding chunk index and total
* Fixing sqlite
* Adding doc and docx support
* Updating the MCP server

Signed-off-by: Denis Jannot <denis.jannot@solo.io>

1 parent e189db1 commit f0720f9

11 files changed

Lines changed: 1072 additions & 103 deletions

Dockerfile

Lines changed: 11 additions & 1 deletion
```diff
@@ -12,6 +12,7 @@ RUN apt-get update && apt-get install -y \
     curl \
     ca-certificates \
     chromium \
+    chromium-sandbox \
     fonts-freefont-ttf \
     fonts-ipafont-gothic \
     fonts-kacst \
@@ -26,4 +27,13 @@ RUN apt-get update && apt-get install -y \
     libxrandr2 \
     libxshmfence1 \
     libxtst6 \
-    && apt-get clean
+    && apt-get clean \
+    && ln -s /usr/bin/chromium /usr/bin/chromium-browser || true
+
+COPY package*.json ./
+RUN npm install --ignore-scripts
+# Install Chrome via Puppeteer as fallback (system Chromium will be used first)
+RUN npx puppeteer browsers install chrome || true
+COPY . .
+
+RUN npm run build
```
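The fallback chain in the Dockerfile (system Chromium first, Puppeteer-managed Chrome second) can be sketched from the Node side. This is an illustrative helper, not code from the repository; the candidate paths are assumptions taken from the Dockerfile above.

```typescript
import { existsSync } from "node:fs";

// Illustrative only: pick the first browser binary that actually exists,
// mirroring the Dockerfile's symlink + Puppeteer-install fallback.
function resolveBrowserPath(candidates: string[]): string | undefined {
  return candidates.find((p) => existsSync(p));
}

// Prefer the system Chromium; if neither path exists, leave executablePath
// undefined so Puppeteer falls back to the Chrome it installed itself.
const executablePath = resolveBrowserPath([
  "/usr/bin/chromium",
  "/usr/bin/chromium-browser",
]);
```

With `executablePath` undefined, `puppeteer.launch()` uses Puppeteer's own downloaded browser, which is why the Dockerfile runs `npx puppeteer browsers install chrome` as a safety net.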

README.md

Lines changed: 153 additions & 4 deletions
```diff
@@ -6,6 +6,8 @@ This project provides a configurable tool (`doc2vec`) to crawl specified website
 
 The primary goal is to prepare documentation content for Retrieval-Augmented Generation (RAG) systems or semantic search applications.
 
+> **⚠️ Version 2.0.0 Breaking Change:** Version 2.0.0 introduces enhanced chunking with new metadata fields (`chunk_index` and `total_chunks`) that enable page reconstruction and improved chunk ordering. The database schema has changed, and databases created with versions prior to 2.0.0 use a different format. **If you're upgrading to version 2.0.0 or later, you should start with fresh databases** to take advantage of the new features. While the MCP server maintains backward compatibility for querying old databases, doc2vec itself will create databases in the new format. To migrate existing data, re-run doc2vec on your sources to regenerate the databases with the enhanced chunking format.
+
 ## Key Features
 
 * **Website Crawling:** Recursively crawls websites starting from a given base URL.
```
```diff
@@ -19,8 +21,12 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **Flexible Filtering:** Filter tickets by status and priority.
 * **Local Directory Processing:** Scans local directories for files, converts content to searchable chunks.
 * **PDF Support:** Automatically extracts text from PDF files and converts them to Markdown format using Mozilla's PDF.js.
+* **Word Document Support:** Processes both legacy `.doc` and modern `.docx` files, extracting text and formatting.
 * **Content Extraction:** Uses Puppeteer for rendering JavaScript-heavy pages and `@mozilla/readability` to extract the main article content.
+* **Smart H1 Preservation:** Automatically extracts and preserves page titles (H1 headings) that Readability might strip as "page chrome", ensuring proper heading hierarchy.
+* **Flexible Content Selectors:** Supports multiple content container patterns (`.docs-content`, `.doc-content`, `.markdown-body`, `article`, etc.) for better compatibility with various documentation sites.
 * **HTML to Markdown:** Converts extracted HTML to clean Markdown using `turndown`, preserving code blocks and basic formatting.
+* **Clean Heading Text:** Automatically removes anchor links (like `[](#section-id)`) from heading text for cleaner hierarchy display.
 * **Intelligent Chunking:** Splits Markdown content into manageable chunks based on headings and token limits, preserving context.
 * **Vector Embeddings:** Generates embeddings for each chunk using OpenAI's `text-embedding-3-large` model.
 * **Vector Storage:** Supports storing chunks, metadata, and embeddings in:
```
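The "Clean Heading Text" behavior added above can be approximated with a one-line regex. This is a hedged sketch of the idea, not the project's actual implementation:

```typescript
// Sketch: strip empty anchor links such as "[](#section-id)" that
// documentation generators often append to headings.
function cleanHeadingText(heading: string): string {
  return heading.replace(/\[\]\(#[^)]*\)/g, "").trim();
}

cleanHeadingText("Installation [](#installation)"); // "Installation"
```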
````diff
@@ -32,6 +38,58 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 * **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
 * **Structured Logging:** Uses a custom logger (`logger.ts`) with levels, timestamps, colors, progress bars, and child loggers for clear execution monitoring.
 
+## Chunk Metadata & Page Reconstruction
+
+Each chunk stored in the database includes rich metadata that enables powerful retrieval and page reconstruction capabilities.
+
+### Metadata Fields
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `product_name` | string | Product identifier from config |
+| `version` | string | Version identifier from config |
+| `heading_hierarchy` | string[] | Hierarchical breadcrumb trail (e.g., `["Installation", "Prerequisites", "Docker"]`) |
+| `section` | string | Current section heading |
+| `chunk_id` | string | Unique hash identifier for the chunk |
+| `url` | string | Source URL/path of the original document |
+| `hash` | string | Content hash for change detection |
+| `chunk_index` | number | Position of this chunk within the page (0-based) |
+| `total_chunks` | number | Total number of chunks for this page |
+
+### Page Reconstruction
+
+The `chunk_index` and `total_chunks` fields enable you to reconstruct full pages from chunks:
+
+```typescript
+// Example: Retrieve all chunks for a URL and reconstruct the page
+const chunks = await db.query({
+  filter: { url: "https://docs.example.com/guide" },
+  sort: { chunk_index: "asc" }
+});
+
+// Check if there are more chunks after the current one
+if (currentChunk.chunk_index < currentChunk.total_chunks - 1) {
+  // More chunks available - fetch the next one
+  const nextChunkIndex = currentChunk.chunk_index + 1;
+}
+
+// Reconstruct full page content
+const fullPageContent = chunks
+  .sort((a, b) => a.chunk_index - b.chunk_index)
+  .map(c => c.content)
+  .join("\n\n");
+```
+
+### Heading Hierarchy (Breadcrumbs)
+
+Each chunk includes a `heading_hierarchy` array that provides context about where the content appears in the document structure. This is injected as a `[Topic: ...]` prefix in the chunk content to improve vector search relevance.
+
+For example, a chunk under "Installation > Prerequisites > Docker" will have:
+- `heading_hierarchy`: `["Installation", "Prerequisites", "Docker"]`
+- Content prefix: `[Topic: Installation > Prerequisites > Docker]`
+
+This ensures that searches for parent topics (like "Installation") will also match relevant child content.
+
 ## Prerequisites
 
 * **Node.js:** Version 18 or higher recommended (check `.nvmrc` if available).
````
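The `[Topic: ...]` prefix described in the added section can be derived from `heading_hierarchy` as follows. The helper name is illustrative, not necessarily doc2vec's internal function:

```typescript
// Build the "[Topic: A > B > C]" prefix injected into chunk content
// to improve vector search relevance for parent-topic queries.
function topicPrefix(headingHierarchy: string[]): string {
  return `[Topic: ${headingHierarchy.join(" > ")}]`;
}

topicPrefix(["Installation", "Prerequisites", "Docker"]);
// "[Topic: Installation > Prerequisites > Docker]"
```

Because the prefix text contains every ancestor heading, a query that mentions only "Installation" can still embed close to chunks nested under it.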
```diff
@@ -99,7 +157,7 @@ Configuration is managed through two files:
 
 For local directories (`type: 'local_directory'`):
 * `path`: Path to the local directory to process.
-* `include_extensions`: (Optional) Array of file extensions to include (e.g., `['.md', '.txt', '.pdf']`). Defaults to `['.md', '.txt', '.html', '.htm', '.pdf']`.
+* `include_extensions`: (Optional) Array of file extensions to include (e.g., `['.md', '.txt', '.pdf', '.doc', '.docx']`). Defaults to `['.md', '.txt', '.html', '.htm', '.pdf']`.
 * `exclude_extensions`: (Optional) Array of file extensions to exclude.
 * `recursive`: (Optional) Whether to traverse subdirectories (defaults to `true`).
 * `url_rewrite_prefix` (Optional) URL prefix to rewrite `file://` URLs (e.g., `https://mydomain.com`)
```
```diff
@@ -161,9 +219,9 @@ Configuration is managed through two files:
     product_name: 'project-docs'
     version: 'current'
     path: './docs'
-    include_extensions: ['.md', '.txt', '.pdf']
+    include_extensions: ['.md', '.txt', '.pdf', '.doc', '.docx']
     recursive: true
-    max_size: 10485760 # 10MB recommended for PDF files
+    max_size: 10485760 # 10MB recommended for PDF/Word files
     database_config:
       type: 'sqlite'
       params:
```
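A minimal sketch of how the `include_extensions`/`exclude_extensions` options interact when selecting files. The function is hypothetical, written only to illustrate the config semantics above, and it assumes every file name has an extension:

```typescript
// Hypothetical filter: a file is processed when its extension is in the
// include list and not in the (optional) exclude list.
function shouldProcess(
  fileName: string,
  include: string[],
  exclude: string[] = [],
): boolean {
  const ext = fileName.slice(fileName.lastIndexOf(".")).toLowerCase();
  return include.includes(ext) && !exclude.includes(ext);
}

shouldProcess("guide.docx", [".md", ".pdf", ".doc", ".docx"]); // true
shouldProcess("notes.html", [".md", ".pdf"]);                  // false
```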
````diff
@@ -291,6 +349,67 @@ A PDF file named "user-guide.pdf" will be converted to Markdown format like:
 
 The resulting Markdown is then chunked and embedded using the same process as other text content.
 
+## Word Document Processing
+
+Doc2Vec supports processing Microsoft Word documents in both the legacy `.doc` format and the modern `.docx` format.
+
+### Supported Formats
+
+| Extension | Format | Library Used |
+|-----------|--------|--------------|
+| `.doc` | Legacy Word (97-2003) | [word-extractor](https://github.com/morungos/node-word-extractor) |
+| `.docx` | Modern Word (2007+) | [mammoth](https://github.com/mwilliamson/mammoth.js) |
+
+### Features
+
+* **Legacy .doc Support:** Extracts plain text from older Word documents using binary parsing
+* **Modern .docx Support:** Converts DOCX files to HTML first (preserving formatting), then to clean Markdown
+* **Formatting Preservation:** For `.docx` files, headings, lists, bold, italic, and links are preserved
+* **Automatic Title:** Uses the filename as an H1 heading for proper document structure
+* **Local File Support:** Processes Word files found in local directories alongside other documents
+
+### Configuration
+
+Include `.doc` and/or `.docx` in your `include_extensions` array:
+
+```yaml
+- type: 'local_directory'
+  product_name: 'company-docs'
+  version: 'current'
+  path: './documents'
+  include_extensions: ['.doc', '.docx', '.pdf', '.md']
+  recursive: true
+  max_size: 10485760 # 10MB recommended
+  database_config:
+    type: 'sqlite'
+    params:
+      db_path: './company-docs.db'
+```
+
+### Example Output
+
+A Word document named "meeting-notes.docx" will be converted to Markdown like:
+
+```markdown
+# meeting-notes
+
+## Agenda
+
+1. Review Q4 results
+2. Discuss roadmap
+
+## Action Items
+
+- **John:** Prepare budget report
+- **Sarah:** Schedule follow-up meeting
+```
+
+### Notes
+
+* **`.doc` files:** Only plain text is extracted. Formatting like bold/italic is not preserved in the legacy Word format.
+* **`.docx` files:** Full formatting is preserved, including headings, lists, bold, italic, links, and tables.
+* **Embedded Images:** Images embedded in Word documents are not extracted (text-only).
+
 ## Now Available via npx
 
 You can run `doc2vec` without cloning the repo or installing it globally. Just use:
````
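The two-library split described in the Word Document Processing section above can be sketched as a routing step. The helpers below are illustrative and dependency-free; the real extraction calls (word-extractor for `.doc`, mammoth followed by `turndown` for `.docx`) are only named in comments:

```typescript
import path from "node:path";

// "Automatic Title": the filename (minus extension) becomes the H1.
function wordDocTitle(filePath: string): string {
  return path.basename(filePath, path.extname(filePath));
}

// Route a Word file to the appropriate extractor, per the table above:
//   .doc  -> word-extractor (plain text, binary parsing)
//   .docx -> mammoth (HTML first, then Markdown via turndown)
function wordExtractorFor(filePath: string): "word-extractor" | "mammoth" {
  return path.extname(filePath).toLowerCase() === ".doc"
    ? "word-extractor"
    : "mammoth";
}

wordDocTitle("/documents/meeting-notes.docx"); // "meeting-notes"
wordExtractorFor("report.doc");                // "word-extractor"
```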
```diff
@@ -333,6 +452,7 @@ If you don't specify a config path, it will look for config.yaml in the current
 * Recursively scan directories for files matching the configured extensions.
 * Read file content, converting HTML to Markdown if needed.
 * For PDF files, extract text using Mozilla's PDF.js and convert to Markdown format with proper page structure.
+* For Word documents, extract text from `.doc` files or convert `.docx` files to Markdown with formatting.
 * Process each file's content.
 - **For Zendesk:**
 * Fetch tickets and articles using the Zendesk API.
```
```diff
@@ -345,4 +465,33 @@ If you don't specify a config path, it will look for config.yaml in the current
 * **Embed (if needed):** If the chunk is new or changed, call the OpenAI API (`createEmbeddings`) to get the vector embedding.
 * **Store:** Insert or update the chunk, metadata, hash, and embedding in the database (SQLite `vec_items` table or Qdrant collection).
 4. **Cleanup:** After processing, remove any obsolete chunks from the database.
-4. **Complete:** Log completion status.
+4. **Complete:** Log completion status.
+
+## Recent Changes
+
+### Word Document Support
+- Added support for legacy `.doc` files using the `word-extractor` library
+- Added support for modern `.docx` files using the `mammoth` library
+- DOCX files preserve formatting (headings, lists, bold, italic, links)
+- Both formats are converted to clean Markdown for embedding
+
+### Page Reconstruction Support
+- Added `chunk_index` field to track each chunk's position within a page (0-based)
+- Added `total_chunks` field to indicate the total number of chunks per page
+- Enables AI agents and applications to fetch additional context or reconstruct full pages
+- Works consistently across all content types: websites, GitHub, Zendesk, and local directories
+
+### Improved H1/Title Handling
+- Smart H1 preservation ensures page titles aren't stripped by Readability
+- Falls back to `article.title` when H1 extraction fails
+- Proper heading hierarchy starting from H1 through the document structure
+
+### Enhanced Content Extraction
+- Added support for multiple content container selectors (`.docs-content`, `.doc-content`, `.markdown-body`, `article`)
+- Cleaner heading text by removing anchor links like `[](#section-id)`
+- Better handling of pages where H1 is outside the main content container
+
+### Heading Hierarchy Improvements
+- Fixed sparse array issues that caused `NULL` values in heading hierarchy
+- Proper breadcrumb generation for nested sections
+- Hierarchical context preserved across chunk boundaries
```
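The sparse-array fix noted under "Heading Hierarchy Improvements" can be illustrated as follows. This is a reconstruction of the symptom and a generic remedy, not the actual patch from the commit:

```typescript
// Symptom: indexing breadcrumbs by heading level leaves holes when a
// document jumps levels (H1 straight to H3), and holes serialize as NULL.
const byLevel: (string | undefined)[] = [];
byLevel[0] = "Installation"; // H1
byLevel[2] = "Docker";       // H3 -- index 1 was never set

// Remedy: compact the array so the stored heading_hierarchy is dense.
const breadcrumb = byLevel.filter((h): h is string => h !== undefined);
// breadcrumb: ["Installation", "Docker"] -- no NULL placeholders
```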
