Optional feature that stores the generated markdown for each crawled website URL in a Postgres table. This provides a searchable, raw-text copy of all documentation pages alongside the vector embeddings.
## How it works
| Sync | URL in Postgres? | Lastmod/ETag unchanged? | What happens |
|------|------------------|-------------------------|--------------|
| Any | N/A | HEAD returns 404 | Skipped, deleted from Postgres if present |

On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) because no URLs exist in the Postgres table yet. This ensures the table is fully populated. On subsequent syncs, the normal caching layers apply and only pages with detected changes get their rows updated.
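
The behavior described above can be sketched as a small decision function. This is an illustration only: the type and function names (`PageState`, `decideAction`) are hypothetical and do not come from the project's code, and the sketch covers just the cases stated in this document (404 handling, URLs missing from the table, and the normal caching path).

```typescript
// Hypothetical names; a sketch of the documented behavior, not the project's code.
type Action = "skip" | "skip-and-delete" | "process";

interface PageState {
  headStatus?: number;     // status of the HEAD request, if one was made
  inPostgres: boolean;     // whether a row for this URL already exists
  cacheUnchanged: boolean; // whether lastmod/ETag matches the stored value
}

function decideAction(page: PageState): Action {
  // A 404 takes precedence: skip the page and remove any stored row.
  if (page.headStatus === 404) return "skip-and-delete";
  // URLs not yet in the table bypass the lastmod/ETag skip logic.
  if (!page.inPostgres) return "process";
  // Otherwise the normal caching layers apply.
  return page.cacheUnchanged ? "skip" : "process";
}
```

On a first sync every URL has `inPostgres: false`, so every reachable page takes the `"process"` branch, which matches the force-processing described above.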
## Postgres setup
### 1. Create a user and database
Connect as a Postgres superuser (e.g., `postgres`):
```sql
CREATE USER doc2vec WITH PASSWORD 'your_password_here';
CREATE DATABASE doc2vec OWNER doc2vec;
```
Or if the database already exists:
```sql
CREATE USER doc2vec WITH PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE doc2vec TO doc2vec;
```
Then connect to the `doc2vec` database and grant schema permissions:
```sql
\c doc2vec
GRANT USAGE, CREATE ON SCHEMA public TO doc2vec;
```
The `CREATE` grant on the `public` schema is required so that the application can create the `markdown_pages` table automatically on the first run.
### 2. Table creation
The table is created automatically via `CREATE TABLE IF NOT EXISTS` when the application starts. You do **not** need to create it manually. The schema is:
```sql
CREATE TABLE IF NOT EXISTS markdown_pages (
  url TEXT PRIMARY KEY,
  product_name TEXT NOT NULL,
  markdown TEXT NOT NULL,
  updated_at TIMESTAMPTZ DEFAULT NOW()
);
```
The table name defaults to `markdown_pages` but can be overridden via `table_name` in the config.
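
Because `url` is the primary key, one natural way to write a row is an upsert. The sketch below builds such a statement with `$1`-style placeholders, as used by Postgres clients such as `pg`. It is an assumption about how the write could be done, not the statement the application actually runs, and `buildUpsert` is a hypothetical helper name. Since a configurable table name cannot be bound as a query parameter, the sketch accepts only plain identifiers.

```typescript
// Hypothetical helper; the application's real statement may differ.
interface UpsertQuery {
  text: string;
  values: string[];
}

function buildUpsert(
  tableName: string,
  url: string,
  productName: string,
  markdown: string
): UpsertQuery {
  // Table names cannot be passed as bind parameters, so allow only
  // simple identifiers before interpolating the name into the SQL text.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(tableName)) {
    throw new Error(`Invalid table name: ${tableName}`);
  }
  const text =
    `INSERT INTO ${tableName} (url, product_name, markdown, updated_at) ` +
    `VALUES ($1, $2, $3, NOW()) ` +
    `ON CONFLICT (url) DO UPDATE SET ` +
    `product_name = EXCLUDED.product_name, ` +
    `markdown = EXCLUDED.markdown, ` +
    `updated_at = NOW()`;
  return { text, values: [url, productName, markdown] };
}
```

The page content and URL travel as bound values, so only the table name needs explicit validation.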
```yaml
markdown_store: true # Enable storing markdown in Postgres
database_config:
  type: 'sqlite'
  params:
    db_path: './vector-dbs/istio.db'
```
Only website sources with `markdown_store: true` will store their markdown. The feature is disabled by default and has no effect on non-website source types.
README.md (45 additions, 1 deletion)
@@ -36,13 +36,19 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
* **Vector Storage:** Supports storing chunks, metadata, and embeddings in:
    * **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
    * **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest`.
* **[Postgres Markdown Store](MARKDOWN_STORE.md):** Optionally stores the generated markdown for each crawled URL in a Postgres table. Useful for maintaining a searchable, raw-text copy of all documentation pages alongside the vector embeddings.
    * **Automatic population:** On the first sync, all pages are force-processed (bypassing lastmod/ETag caching) to fully populate the store. Subsequent syncs only update rows when a change is detected.
    * **404 cleanup:** Pages that return 404 are automatically removed from the store.
    * **Shared table:** A single table (configurable name, default `markdown_pages`) is shared across all sources, with a `product_name` column to distinguish them.
* **Multi-Layer Change Detection:** Four layers of change detection minimize unnecessary re-processing:
    1. **Sitemap `lastmod`:** When available, compares the sitemap's `<lastmod>` date against the stored value and, if it is unchanged, skips the page without any HTTP request. Child URLs inherit `lastmod` from their most specific parent directory.
    2. **ETag via HEAD request:** For URLs without `lastmod`, sends a lightweight HEAD request and compares the ETag header against the stored value. Adaptive backoff prevents rate limiting (starts at 0ms delay, increases on 429 responses, decays on success).
    3. **Content hash comparison:** After full page load, compares chunk content hashes against stored values and skips embedding if the content is unchanged.
    4. **Embedding:** Only re-embeds chunks when content has actually changed.

    ETag and lastmod values are only stored when chunking and embedding succeed, ensuring failed pages are retried on the next run.

    A `sync_complete` metadata flag tracks whether a full sync has ever completed successfully. If a sync is interrupted (process killed), the next run force-processes all pages regardless of lastmod/ETag values, ensuring no pages are permanently skipped.
* **Incremental Updates:** For GitHub and Zendesk sources, tracks the last run date to only fetch new or updated issues/tickets.
* **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
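
The adaptive backoff mentioned for HEAD requests (start at 0ms, increase on 429, decay on success) could look roughly like the sketch below. The step size, the cap, the doubling, and the halving are all assumptions chosen for illustration, and `nextDelay` is a hypothetical name; the README does not state the actual values or formula.

```typescript
// Assumed tuning values; the project's real numbers are not documented here.
const STEP_MS = 500;  // first delay applied after a 429
const MAX_MS = 10000; // upper bound on the delay

// Returns the delay to use before the next HEAD request,
// given the current delay and the status of the last response.
function nextDelay(currentMs: number, status: number): number {
  if (status === 429) {
    // Rate limited: start at STEP_MS, then double up to the cap.
    return Math.min(MAX_MS, currentMs === 0 ? STEP_MS : currentMs * 2);
  }
  if (status >= 200 && status < 400) {
    // Success: decay toward zero by halving.
    return Math.floor(currentMs / 2);
  }
  // Other statuses leave the delay unchanged in this sketch.
  return currentMs;
}
```

Keeping the update as a pure function of the current delay and the last status makes the policy easy to test separately from the HTTP code.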
@@ -175,6 +181,7 @@ Configuration is managed through two files:
For websites (`type: 'website'`):
* `url`: The starting URL for crawling the documentation site.
* `sitemap_url`: (Optional) URL to the site's XML sitemap for discovering additional pages not linked in navigation.
* `markdown_store`: (Optional) Set to `true` to store generated markdown in the Postgres markdown store (requires top-level `markdown_store` config). Defaults to `false`.

For GitHub repositories (`type: 'github'`):
* `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
@@ -230,6 +237,17 @@ Configuration is managed through two files:
* `embedding.provider`: Provider for embeddings (`openai` or `azure`).
* `embedding.dimension`: Embedding vector size. Defaults to `3072` when not set.

Optional Postgres markdown store (top-level):
* `markdown_store.connection_string`: (Optional) Full Postgres connection string (e.g., `'postgres://user:pass@host:5432/db'`). Takes priority over individual fields.
* `markdown_store.password`: (Optional) Postgres password. Supports `${PG_PASSWORD}` env var substitution.
* `markdown_store.table_name`: (Optional) Table name. Defaults to `'markdown_pages'`.

When configured, website sources with `markdown_store: true` will store the generated markdown for each URL in this Postgres table. On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) to populate the table. On subsequent syncs, only pages with detected changes get their rows updated.
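
The precedence rules for these settings can be illustrated with a short sketch. It assumes the individual fields are `host`, `port`, `database`, `user`, and `password`, and that an unset environment variable leaves the `${NAME}` placeholder untouched; the latter is a guess, not documented behavior. `substituteEnv` and `resolveStore` are hypothetical names, and the environment is passed in as a plain object so the sketch has no runtime dependencies.

```typescript
interface MarkdownStoreConfig {
  connection_string?: string;
  host?: string;
  port?: number;
  database?: string;
  user?: string;
  password?: string;
  table_name?: string;
}

interface ResolvedStore {
  connectionString?: string;
  host?: string;
  port?: number;
  database?: string;
  user?: string;
  password?: string;
  tableName: string;
}

type Env = Record<string, string | undefined>;

// Replace ${NAME} with env[NAME]; unset variables are left as-is (assumption).
function substituteEnv(value: string, env: Env): string {
  return value.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (match: string, name: string) => {
    const found = env[name];
    return found === undefined ? match : found;
  });
}

function resolveStore(cfg: MarkdownStoreConfig, env: Env): ResolvedStore {
  const tableName = cfg.table_name ?? "markdown_pages";
  // connection_string takes priority over the individual fields.
  if (cfg.connection_string) {
    return { connectionString: cfg.connection_string, tableName };
  }
  return {
    host: cfg.host,
    port: cfg.port,
    database: cfg.database,
    user: cfg.user,
    password: cfg.password === undefined ? undefined : substituteEnv(cfg.password, env),
    tableName,
  };
}
```

For example, `resolveStore({ host: "localhost", password: "${PG_PASSWORD}" }, { PG_PASSWORD: "s3cret" })` yields a password of `s3cret` and the default table name.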

**Example (`config.yaml`):**
```yaml
# Optional: Configure embedding provider
@@ -248,13 +266,25 @@ Configuration is managed through two files:
markdown_store: true # Store generated markdown in Postgres
max_size: 1048576
database_config:
  type: 'sqlite'
@@ -548,6 +578,20 @@ If you don't specify a config path, it will look for config.yaml in the current
## Recent Changes

### Postgres Markdown Store
- **New feature:** Optionally store the generated markdown for each crawled website URL in a Postgres table (`url`, `product_name`, `markdown`, `updated_at`)
- **Top-level configuration:** Configure Postgres connection once at the top level via `connection_string` or individual `host`/`port`/`database`/`user`/`password` fields, with environment variable substitution support
- **Per-source opt-in:** Enable per website source with `markdown_store: true` (disabled by default)
- **First-sync force-processing:** When the markdown store is enabled, pages that aren't yet in the Postgres table bypass lastmod and ETag skip logic, ensuring all pages are stored on the first sync
- **Change-only updates:** On subsequent syncs, only pages with detected content changes (via lastmod/ETag) have their Postgres rows updated
- **404 cleanup:** Pages that return 404 during HEAD checks are automatically removed from the Postgres store

### Incomplete Sync Recovery
- **New feature:** Tracks whether a full sync has ever completed successfully for each website source via a `sync_complete:<url_prefix>` metadata key
- **Interrupted sync handling:** If a sync is killed mid-crawl (process terminated, crash, etc.), the stored ETags/lastmods from the partial run would otherwise cause remaining pages to be skipped permanently. The `sync_complete` flag prevents this: when it is absent, all pages are force-processed regardless of caching signals
- **Gated on clean completion:** The flag is only set when the crawl completes without network errors (DNS failures, connection refused, timeouts). If the site is unreachable, the next run will force a full sync again
- **Scoped per source:** Each website source has its own `sync_complete` key based on its URL prefix. Changing the source URL naturally triggers a new full sync
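
A minimal sketch of the flag logic described in this list follows. The key format `sync_complete:<url_prefix>` is from the notes above; the plain object standing in for the metadata storage, the stored value `"true"`, and the function names are assumptions made for illustration.

```typescript
// Stand-in for the real metadata storage (assumption).
type Metadata = Record<string, string>;

function syncCompleteKey(urlPrefix: string): string {
  return `sync_complete:${urlPrefix}`;
}

// When the flag is absent, every page is force-processed.
function shouldForceFullSync(meta: Metadata, urlPrefix: string): boolean {
  return meta[syncCompleteKey(urlPrefix)] !== "true";
}

// The flag is only set when the crawl finished without network errors.
function finishCrawl(meta: Metadata, urlPrefix: string, hadNetworkError: boolean): void {
  if (!hadNetworkError) {
    meta[syncCompleteKey(urlPrefix)] = "true";
  }
}
```

Because the key embeds the URL prefix, pointing a source at a different URL produces a key that has never been set, so that source gets a fresh full sync.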

### Multi-Layer Change Detection for Websites
- **Sitemap `lastmod` support:** When a sitemap includes `<lastmod>` dates, pages are skipped entirely if the date hasn't changed: no HEAD request, no Puppeteer load, no chunking. One sitemap fetch replaces hundreds of individual HEAD requests.
- **`lastmod` inheritance:** Child URLs without their own `<lastmod>` inherit from the most specific parent directory URL in the sitemap (e.g., `/docs/2.10.x/reference/cli/` inherits from `/docs/2.10.x/`).
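
The `lastmod` inheritance rule can be sketched as a longest-prefix lookup. This is one possible reading of "most specific parent directory", not the project's implementation; `resolveLastmod` is a hypothetical name, and the sitemap is modeled as a plain object mapping URL to `<lastmod>` value.

```typescript
// Maps sitemap URLs to their <lastmod> values (illustrative model).
type SitemapLastmods = Record<string, string>;

function resolveLastmod(url: string, sitemap: SitemapLastmods): string | undefined {
  // A URL's own <lastmod> always wins.
  if (Object.prototype.hasOwnProperty.call(sitemap, url)) {
    return sitemap[url];
  }
  let best: string | undefined;
  let bestLength = -1;
  for (const entryUrl of Object.keys(sitemap)) {
    // Treat each sitemap entry as a directory prefix.
    const dir = entryUrl.slice(-1) === "/" ? entryUrl : entryUrl + "/";
    // Keep the longest (most specific) prefix that contains the URL.
    if (url.indexOf(dir) === 0 && dir.length > bestLength) {
      best = sitemap[entryUrl];
      bestLength = dir.length;
    }
  }
  return best;
}
```

With entries for `/docs/` and `/docs/2.10.x/`, a page at `/docs/2.10.x/reference/cli/` picks up the value from `/docs/2.10.x/` because that prefix is longer.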