Commit 4ab0a65

Add Postgres markdown store and incomplete sync recovery (#60)
Signed-off-by: Denis Jannot <denis.jannot@solo.io>
1 parent d9c4cf6 commit 4ab0a65

10 files changed

Lines changed: 1264 additions & 137 deletions

MARKDOWN_STORE.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Postgres Markdown Store

Optional feature that stores the generated markdown for each crawled website URL in a Postgres table. This provides a searchable, raw-text copy of all documentation pages alongside the vector embeddings.

## How it works

| Sync | URL in Postgres? | Lastmod/ETag unchanged? | What happens |
|------|------------------|-------------------------|--------------|
| 1st | No | Yes or No | Force-processed, markdown stored |
| 2nd+ | Yes | Yes | Skipped (normal caching) |
| 2nd+ | Yes | No (change detected) | Processed, markdown updated |
| Any | N/A | HEAD returns 404 | Skipped, deleted from Postgres if present |

On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) because no URLs exist in the Postgres table yet. This ensures the table is fully populated. On subsequent syncs, the normal caching layers apply and only pages with detected changes get their rows updated.
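
The decision table above can be sketched as a small function. This is a minimal illustration in Python, not the tool's actual implementation; the function name, its parameters, and the returned labels are all hypothetical, and only the outcomes are taken from the table.

```python
from typing import Optional


def decide_action(in_store: bool, unchanged: bool, head_status: Optional[int] = None) -> str:
    """Return what a sync would do with one URL, following the table above.

    All names here are hypothetical; only the outcomes come from the table.
    """
    if head_status == 404:
        # Page is gone: skip it and remove any stored row.
        return "skip_and_delete"
    if not in_store:
        # Not stored yet: process regardless of lastmod/ETag.
        return "force_process"
    if unchanged:
        # Stored and unchanged: normal caching applies.
        return "skip"
    # Stored but changed: reprocess and update the row.
    return "process_and_update"
```

Checking the 404 case first mirrors the "Any" row of the table, which applies whether or not the URL is already stored.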

## Postgres setup

### 1. Create a user and database

Connect as a Postgres superuser (e.g., `postgres`):

```sql
CREATE USER doc2vec WITH PASSWORD 'your_password_here';
CREATE DATABASE doc2vec OWNER doc2vec;
```

Or, if the database already exists:

```sql
CREATE USER doc2vec WITH PASSWORD 'your_password_here';
GRANT ALL PRIVILEGES ON DATABASE doc2vec TO doc2vec;
```

Then connect to the `doc2vec` database (`\c` is a `psql` meta-command) and grant schema permissions:

```sql
\c doc2vec
GRANT USAGE, CREATE ON SCHEMA public TO doc2vec;
```

The `CREATE` grant on the `public` schema is required so that the application can create the `markdown_pages` table automatically on the first run.

### 2. Table creation

The table is created automatically via `CREATE TABLE IF NOT EXISTS` when the application starts. You do **not** need to create it manually. The schema is:

```sql
CREATE TABLE IF NOT EXISTS markdown_pages (
    url          TEXT PRIMARY KEY,
    product_name TEXT NOT NULL,
    markdown     TEXT NOT NULL,
    updated_at   TIMESTAMPTZ DEFAULT NOW()
);
```

The table name defaults to `markdown_pages` but can be overridden via `table_name` in the config.
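
Because `url` is the primary key, each page maps to exactly one row, so storing a page behaves like an upsert. The sketch below illustrates that behavior in Python, using an in-memory SQLite database as a stand-in for Postgres (an assumption made only so the example runs without a server). The `store_page` and `delete_page` helpers are hypothetical, and the source does not say which SQL statements the application actually issues.

```python
import sqlite3

# Stand-in schema: SQLite has no TIMESTAMPTZ/NOW(), so CURRENT_TIMESTAMP is used.
SCHEMA = """
CREATE TABLE IF NOT EXISTS markdown_pages (
    url          TEXT PRIMARY KEY,
    product_name TEXT NOT NULL,
    markdown     TEXT NOT NULL,
    updated_at   TEXT DEFAULT CURRENT_TIMESTAMP
)
"""

# Insert a new row, or overwrite the existing row with the same url.
UPSERT = """
INSERT INTO markdown_pages (url, product_name, markdown)
VALUES (?, ?, ?)
ON CONFLICT (url) DO UPDATE SET
    product_name = excluded.product_name,
    markdown     = excluded.markdown,
    updated_at   = CURRENT_TIMESTAMP
"""


def store_page(conn, url, product_name, markdown):
    """Insert a page, or overwrite the existing row for the same URL."""
    conn.execute(UPSERT, (url, product_name, markdown))


def delete_page(conn, url):
    """Remove a page, e.g. after a HEAD request returned 404."""
    conn.execute("DELETE FROM markdown_pages WHERE url = ?", (url,))


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
store_page(conn, "https://example.com/docs/", "demo", "# v1")
store_page(conn, "https://example.com/docs/", "demo", "# v2")  # same URL: update
rows = conn.execute("SELECT url, markdown FROM markdown_pages").fetchall()
```

Postgres accepts the same `INSERT ... ON CONFLICT (url) DO UPDATE` form, with `NOW()` in place of `CURRENT_TIMESTAMP` and driver-specific placeholders instead of `?`.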

## Configuration

### Top-level Postgres connection (`config.yaml`)

Using a connection string:

```yaml
markdown_store:
  connection_string: 'postgres://doc2vec:${PG_PASSWORD}@localhost:5432/doc2vec'
```

Or using individual fields:

```yaml
markdown_store:
  host: 'localhost'
  port: 5432
  database: 'doc2vec'
  user: 'doc2vec'
  password: '${PG_PASSWORD}'
  # table_name: 'markdown_pages'  # Optional, defaults to 'markdown_pages'
```

`connection_string` takes priority if both are provided. Environment variable substitution (`${VAR_NAME}`) works in all fields.
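
The precedence and substitution rules can be sketched as follows. This is a Python illustration under stated assumptions, not the tool's code: `substitute_env` and `resolve_connection` are hypothetical names, leaving unknown variables untouched is an assumed behavior (the source does not specify it), and URL-encoding the user and password when building the string is a precaution added here.

```python
import re
from urllib.parse import quote

# Matches ${VAR_NAME} placeholders.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env):
    """Replace each ${VAR_NAME} with env[VAR_NAME]; unknown variables are left as-is."""
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)


def resolve_connection(cfg, env):
    """Return a Postgres connection string for a markdown_store config mapping."""
    if cfg.get("connection_string"):
        # connection_string wins when both forms are present.
        return substitute_env(cfg["connection_string"], env)
    # Otherwise build the string from the individual fields.
    names = ("user", "password", "host", "port", "database")
    f = {name: substitute_env(str(cfg[name]), env) for name in names}
    return "postgres://{}:{}@{}:{}/{}".format(
        quote(f["user"], safe=""),
        quote(f["password"], safe=""),
        f["host"],
        f["port"],
        f["database"],
    )
```

In practice `env` would be `os.environ`; passing it as a parameter keeps the sketch easy to test.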

### Per-source opt-in

Enable the markdown store on individual website sources:

```yaml
sources:
  - type: 'website'
    product_name: 'istio'
    version: 'latest'
    url: 'https://istio.io/latest/docs/'
    sitemap_url: 'https://istio.io/latest/docs/sitemap.xml'
    markdown_store: true  # Enable storing markdown in Postgres
    database_config:
      type: 'sqlite'
      params:
        db_path: './vector-dbs/istio.db'
```

Only website sources with `markdown_store: true` will store their markdown. The feature is disabled by default and has no effect on non-website source types.
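
As a rough illustration of the opt-in rule, the Python sketch below selects the sources whose markdown would be stored from an already-parsed config mapping. The function name is hypothetical, and treating a missing top-level `markdown_store` section as "nothing is stored" is an assumption based on the per-source flag requiring the top-level connection settings.

```python
def markdown_store_sources(config):
    """Return the sources whose markdown would be stored.

    A source qualifies only if it is a website source with markdown_store: true
    and a top-level markdown_store section is present.
    """
    if not config.get("markdown_store"):
        # No Postgres connection configured: nothing can be stored.
        return []
    return [
        source
        for source in config.get("sources", [])
        if source.get("type") == "website" and source.get("markdown_store") is True
    ]
```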

## Full example

```yaml
markdown_store:
  host: 'localhost'
  port: 5432
  database: 'doc2vec'
  user: 'doc2vec'
  password: '${PG_PASSWORD}'

sources:
  - type: 'website'
    product_name: 'argo'
    version: 'stable'
    url: 'https://argo-cd.readthedocs.io/en/stable/'
    sitemap_url: 'https://argo-cd.readthedocs.io/en/stable/sitemap.xml'
    markdown_store: true
    max_size: 1048576
    database_config:
      type: 'sqlite'
      params:
        db_path: './vector-dbs/argo-cd.db'

  - type: 'website'
    product_name: 'istio'
    version: 'latest'
    url: 'https://istio.io/latest/docs/'
    markdown_store: true
    max_size: 1048576
    database_config:
      type: 'sqlite'
      params:
        db_path: './vector-dbs/istio.db'

  # This source does NOT store markdown (markdown_store not set)
  - type: 'website'
    product_name: 'kubernetes'
    version: '1.30'
    url: 'https://kubernetes.io/docs/'
    max_size: 1048576
    database_config:
      type: 'sqlite'
      params:
        db_path: './vector-dbs/k8s.db'
```

README.md

Lines changed: 45 additions & 1 deletion
@@ -36,13 +36,19 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
* **Vector Storage:** Supports storing chunks, metadata, and embeddings in:
    * **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
    * **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest` client.
* **[Postgres Markdown Store](MARKDOWN_STORE.md):** Optionally stores the generated markdown for each crawled URL in a Postgres table. Useful for maintaining a searchable, raw-text copy of all documentation pages alongside the vector embeddings.
    * **Automatic population:** On the first sync, all pages are force-processed (bypassing lastmod/ETag caching) to fully populate the store. Subsequent syncs only update rows when a change is detected.
    * **404 cleanup:** Pages that return 404 are automatically removed from the store.
    * **Shared table:** A single table (configurable name, default `markdown_pages`) is shared across all sources, with a `product_name` column to distinguish them.
* **Multi-Layer Change Detection:** Four layers of change detection minimize unnecessary re-processing:
    1. **Sitemap `lastmod`:** When available, compares the sitemap's `<lastmod>` date against the stored value and, if it is unchanged, skips the page without any HTTP request. Child URLs inherit `lastmod` from their most specific parent directory.
    2. **ETag via HEAD request:** For URLs without `lastmod`, sends a lightweight HEAD request and compares the ETag header against the stored value. Adaptive backoff prevents rate limiting (starts at 0ms delay, increases on 429 responses, decays on success).
    3. **Content hash comparison:** After full page load, compares chunk content hashes against stored values and skips embedding if content is unchanged.
    4. **Embedding:** Only re-embeds chunks when content has actually changed.

    ETag and lastmod values are only stored when chunking and embedding succeed, ensuring failed pages are retried on the next run.

    A `sync_complete` metadata flag tracks whether a full sync has ever completed successfully. If a sync is interrupted (process killed), the next run force-processes all pages regardless of lastmod/ETag values, ensuring no pages are permanently skipped.
* **Incremental Updates:** For GitHub and Zendesk sources, tracks the last run date to only fetch new or updated issues/tickets.
* **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
* **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
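
The adaptive backoff mentioned for the ETag layer above (start at 0 ms, increase on 429, decay on success) could look roughly like the Python sketch below. The class name, the 100 ms step, the 5000 ms cap, the doubling rule, and the 0.5 decay factor are all illustrative assumptions; the source describes only the qualitative behavior.

```python
class AdaptiveBackoff:
    """Delay between HEAD requests that grows on HTTP 429 and shrinks otherwise."""

    def __init__(self, step_ms=100, max_ms=5000, decay=0.5):
        # step_ms, max_ms and decay are illustrative values, not the tool's.
        self.delay_ms = 0  # starts with no delay
        self.step_ms = step_ms
        self.max_ms = max_ms
        self.decay = decay

    def record(self, status):
        """Update the delay from a response status and return the new delay."""
        if status == 429:
            # Rate limited: double the delay (or start at step_ms), capped at max_ms.
            self.delay_ms = min(self.max_ms, max(self.step_ms, self.delay_ms * 2))
        else:
            # Any other status is treated as success in this sketch: decay toward zero.
            self.delay_ms = int(self.delay_ms * self.decay)
        return self.delay_ms
```

A caller would sleep for `delay_ms` before each HEAD request and feed the response status back through `record`.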
@@ -175,6 +181,7 @@ Configuration is managed through two files:
For websites (`type: 'website'`):
* `url`: The starting URL for crawling the documentation site.
* `sitemap_url`: (Optional) URL to the site's XML sitemap for discovering additional pages not linked in navigation.
* `markdown_store`: (Optional) Set to `true` to store generated markdown in the Postgres markdown store (requires top-level `markdown_store` config). Defaults to `false`.

For GitHub repositories (`type: 'github'`):
* `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
@@ -230,6 +237,17 @@ Configuration is managed through two files:
* `embedding.provider`: Provider for embeddings (`openai` or `azure`).
* `embedding.dimension`: Embedding vector size. Defaults to `3072` when not set.

Optional Postgres markdown store (top-level):
* `markdown_store.connection_string`: (Optional) Full Postgres connection string (e.g., `'postgres://user:pass@host:5432/db'`). Takes priority over individual fields.
* `markdown_store.host`: (Optional) Postgres host.
* `markdown_store.port`: (Optional) Postgres port.
* `markdown_store.database`: (Optional) Postgres database name.
* `markdown_store.user`: (Optional) Postgres user.
* `markdown_store.password`: (Optional) Postgres password. Supports environment variable substitution (e.g., `${PG_PASSWORD}`).
* `markdown_store.table_name`: (Optional) Table name. Defaults to `'markdown_pages'`.

When configured, website sources with `markdown_store: true` will store the generated markdown for each URL in this Postgres table. On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) to populate the table. On subsequent syncs, only pages with detected changes get their rows updated.

**Example (`config.yaml`):**
```yaml
# Optional: Configure embedding provider
```
@@ -248,13 +266,25 @@ Configuration is managed through two files:
```yaml
# deployment_name: 'text-embedding-3-large'
# api_version: '2024-10-21'  # Optional

# Optional: Store generated markdown in Postgres
# markdown_store:
#   connection_string: 'postgres://user:pass@host:5432/db'
#   # OR use individual fields:
#   # host: 'localhost'
#   # port: 5432
#   # database: 'doc2vec'
#   # user: 'myuser'
#   # password: '${PG_PASSWORD}'
#   # table_name: 'markdown_pages'  # Optional, defaults to 'markdown_pages'

sources:
  # Website source example (with markdown store enabled)
  - type: 'website'
    product_name: 'argo'
    version: 'stable'
    url: 'https://argo-cd.readthedocs.io/en/stable/'
    sitemap_url: 'https://argo-cd.readthedocs.io/en/stable/sitemap.xml'
    markdown_store: true  # Store generated markdown in Postgres
    max_size: 1048576
    database_config:
      type: 'sqlite'
```
@@ -548,6 +578,20 @@ If you don't specify a config path, it will look for config.yaml in the current

## Recent Changes

### Postgres Markdown Store
- **New feature:** Optionally store the generated markdown for each crawled website URL in a Postgres table (`url`, `product_name`, `markdown`, `updated_at`)
- **Top-level configuration:** Configure the Postgres connection once at the top level via `connection_string` or individual `host`/`port`/`database`/`user`/`password` fields, with environment variable substitution support
- **Per-source opt-in:** Enable per website source with `markdown_store: true` (disabled by default)
- **First-sync force-processing:** When the markdown store is enabled, pages that aren't yet in the Postgres table bypass lastmod and ETag skip logic, ensuring all pages are stored on the first sync
- **Change-only updates:** On subsequent syncs, only pages with detected content changes (via lastmod/ETag) have their Postgres rows updated
- **404 cleanup:** Pages that return 404 during HEAD checks are automatically removed from the Postgres store

### Incomplete Sync Recovery
- **New feature:** Tracks whether a full sync has ever completed successfully for each website source via a `sync_complete:<url_prefix>` metadata key
- **Interrupted sync handling:** If a sync is killed mid-crawl (process terminated, crash, etc.), the stored ETags/lastmods from the partial run would otherwise cause remaining pages to be skipped permanently. The `sync_complete` flag prevents this: when it is absent, all pages are force-processed regardless of caching signals
- **Gated on clean completion:** The flag is only set when the crawl completes without network errors (DNS failures, connection refused, timeouts). If the site is unreachable, the next run will force a full sync again
- **Scoped per source:** Each website source has its own `sync_complete` key based on its URL prefix. Changing the source URL naturally triggers a new full sync
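
The recovery rules above can be sketched as three small helpers. This Python sketch uses a plain dict as a stand-in for the metadata store, and the helper names and the stored value `"true"` are assumptions; only the `sync_complete:<url_prefix>` key format and the gating rules come from the list above.

```python
def sync_complete_key(url_prefix):
    """Metadata key for one website source, in the sync_complete:<url_prefix> form."""
    return "sync_complete:" + url_prefix


def should_force_full_sync(metadata, url_prefix):
    """Force-process every page unless a previous sync finished cleanly."""
    return metadata.get(sync_complete_key(url_prefix)) != "true"


def finish_sync(metadata, url_prefix, had_network_error):
    """Set the flag only after a crawl that completed without network errors."""
    if not had_network_error:
        metadata[sync_complete_key(url_prefix)] = "true"
```

Because the key embeds the URL prefix, a source whose URL changes has no flag yet and therefore starts with a full sync.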

### Multi-Layer Change Detection for Websites
- **Sitemap `lastmod` support:** When a sitemap includes `<lastmod>` dates, pages are skipped entirely if the date hasn't changed: no HEAD request, no Puppeteer load, no chunking. One sitemap fetch replaces hundreds of individual HEAD requests.
- **`lastmod` inheritance:** Child URLs without their own `<lastmod>` inherit from the most specific parent directory URL in the sitemap (e.g., `/docs/2.10.x/reference/cli/` inherits from `/docs/2.10.x/`).
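
The inheritance rule can be read as a longest-prefix match over the sitemap's directory URLs. The Python sketch below assumes plain string-prefix matching and a dict mapping sitemap URLs to `<lastmod>` strings; the function name and the example URLs are hypothetical, and the tool may resolve parents differently.

```python
def inherited_lastmod(url, sitemap_lastmods):
    """Return the lastmod for url, inheriting from the deepest parent directory.

    sitemap_lastmods maps sitemap URLs to their <lastmod> strings.
    """
    if url in sitemap_lastmods:
        # The URL has its own <lastmod>; no inheritance needed.
        return sitemap_lastmods[url]
    best = None
    for candidate, lastmod in sitemap_lastmods.items():
        # Only directory-style URLs (ending in "/") can act as parents.
        if candidate.endswith("/") and url.startswith(candidate):
            # Keep the longest, i.e. most specific, matching parent.
            if best is None or len(candidate) > len(best[0]):
                best = (candidate, lastmod)
    return best[1] if best else None
```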
