kagent-dev
diff --git a/MARKDOWN_STORE.md b/MARKDOWN_STORE.md
Lines changed: 147 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 45 additions & 1 deletion
@@ -0,0 +1,147 @@
+# Postgres Markdown Store
+
+Optional feature that stores the generated markdown for each crawled website URL in a Postgres table. This provides a searchable, raw-text copy of all documentation pages alongside the vector embeddings.
+
+## How it works
+
+| Sync | URL in Postgres? | Change-detection result | What happens |
+|------|------------------|-------------------------|--------------|
+| 1st  | No               | Ignored (changed or not) | Force-processed, markdown stored |
+| 2nd+ | Yes              | Lastmod/ETag unchanged  | Skipped (normal caching) |
+| 2nd+ | Yes              | Lastmod/ETag changed    | Processed, markdown updated |
+| Any  | Yes or No        | HEAD returns 404        | Skipped, deleted from Postgres if present |
+
+On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) because no URLs exist in the Postgres table yet. This ensures the table is fully populated. On subsequent syncs, the normal caching layers apply and only pages with detected changes get their rows updated.
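
The table above can be read as a small decision function. The following is a minimal sketch, not the project's actual code: `PageState`, `SyncAction`, and `decideAction` are hypothetical names, and the sketch assumes the HEAD status is only known when a HEAD request was actually made.

```typescript
// Hypothetical types and names; this mirrors the table, not the real implementation.
type SyncAction = "force-process" | "process" | "skip" | "delete-and-skip";

interface PageState {
  inStore: boolean;      // is the URL already a row in the Postgres table?
  unchanged: boolean;    // did lastmod/ETag match the stored value?
  headStatus?: number;   // HTTP status of the HEAD check, if one was sent
}

function decideAction(page: PageState): SyncAction {
  // A 404 always wins: the page is skipped and any stored row is removed.
  if (page.headStatus === 404) return "delete-and-skip";
  // URLs missing from the store bypass the lastmod/ETag skip logic.
  if (!page.inStore) return "force-process";
  // Otherwise the normal caching layers apply.
  return page.unchanged ? "skip" : "process";
}
```

Ordering the 404 check first reflects the last table row, which applies regardless of whether the URL is already stored.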
+
+## Postgres setup
+
+### 1. Create a user and database
+
+Connect as a Postgres superuser (e.g., `postgres`):
+
+```sql
+CREATE USER doc2vec WITH PASSWORD 'your_password_here';
+CREATE DATABASE doc2vec OWNER doc2vec;
+```
+
+Or if the database already exists:
+
+```sql
+CREATE USER doc2vec WITH PASSWORD 'your_password_here';
+GRANT ALL PRIVILEGES ON DATABASE doc2vec TO doc2vec;
+```
+
+Then connect to the `doc2vec` database and grant schema permissions:
+
+```sql
+\c doc2vec
+GRANT USAGE, CREATE ON SCHEMA public TO doc2vec;
+```
+
+The `CREATE` grant on the `public` schema is required so that the application can create the `markdown_pages` table automatically on the first run.
+
+### 2. Table creation
+
+The table is created automatically via `CREATE TABLE IF NOT EXISTS` when the application starts. You do **not** need to create it manually. The schema is:
+
+```sql
+CREATE TABLE IF NOT EXISTS markdown_pages (
+    url          TEXT PRIMARY KEY,
+    product_name TEXT NOT NULL,
+    markdown     TEXT NOT NULL,
+    updated_at   TIMESTAMPTZ DEFAULT NOW()
+);
+```
+
+The table name defaults to `markdown_pages` but can be overridden via `table_name` in the config.
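
Because `url` is the primary key, writing a page can be expressed as a single upsert. The sketch below is an assumption about how such a write might look, not the statement the application actually issues: `buildUpsert` and `UpsertQuery` are hypothetical names, and the identifier check on the table name is an added precaution because table names cannot be bound as query parameters.

```typescript
// Hypothetical helper: builds a parameterized upsert for the schema shown above.
interface UpsertQuery {
  text: string;
  values: string[];
}

function buildUpsert(
  url: string,
  productName: string,
  markdown: string,
  tableName: string = "markdown_pages"
): UpsertQuery {
  // Table names cannot be passed as $n parameters, so only allow plain identifiers.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(tableName)) {
    throw new Error("Invalid table name: " + tableName);
  }
  const text =
    "INSERT INTO " + tableName + " (url, product_name, markdown, updated_at) " +
    "VALUES ($1, $2, $3, NOW()) " +
    "ON CONFLICT (url) DO UPDATE SET " +
    "product_name = EXCLUDED.product_name, " +
    "markdown = EXCLUDED.markdown, " +
    "updated_at = NOW()";
  return { text, values: [url, productName, markdown] };
}
```

The `{ text, values }` shape matches the query config object accepted by node-postgres, so the result could be passed to a client's `query` method if that driver is in use.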
+
+## Configuration
+
+### Top-level Postgres connection (`config.yaml`)
+
+Using a connection string:
+
+```yaml
+markdown_store:
+  connection_string: 'postgres://doc2vec:${PG_PASSWORD}@localhost:5432/doc2vec'
+```
+
+Or using individual fields:
+
+```yaml
+markdown_store:
+  host: 'localhost'
+  port: 5432
+  database: 'doc2vec'
+  user: 'doc2vec'
+  password: '${PG_PASSWORD}'
+  # table_name: 'markdown_pages'  # Optional, defaults to 'markdown_pages'
+```
+
+`connection_string` takes priority if both are provided. Environment variable substitution (`${VAR_NAME}`) works in all fields.
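
The precedence and substitution rules can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: `resolveConnectionString` and `substituteEnv` are hypothetical names, unset variables are replaced with an empty string, and the `localhost`/`5432` fallbacks are assumed defaults that the documentation does not confirm.

```typescript
interface MarkdownStoreConfig {
  connection_string?: string;
  host?: string;
  port?: number;
  database?: string;
  user?: string;
  password?: string;
}

type Env = Record<string, string | undefined>;

// Replace each ${VAR_NAME} with its value from env (unset becomes "", an assumption).
function substituteEnv(value: string, env: Env): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => env[name] ?? ""
  );
}

function resolveConnectionString(cfg: MarkdownStoreConfig, env: Env): string {
  // connection_string takes priority over the individual fields.
  if (cfg.connection_string) {
    return substituteEnv(cfg.connection_string, env);
  }
  const host = substituteEnv(cfg.host ?? "localhost", env); // assumed default
  const port = cfg.port ?? 5432;                            // assumed default
  const database = substituteEnv(cfg.database ?? "", env);
  // Percent-encode credentials so characters like "@" do not break the URL.
  const user = encodeURIComponent(substituteEnv(cfg.user ?? "", env));
  const password = encodeURIComponent(substituteEnv(cfg.password ?? "", env));
  return `postgres://${user}:${password}@${host}:${port}/${database}`;
}
```

Passing the environment in as a parameter, instead of reading a global, keeps the function easy to test; a caller would typically hand it the process environment.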
+
+### Per-source opt-in
+
+Enable the markdown store on individual website sources:
+
+```yaml
+sources:
+  - type: 'website'
+    product_name: 'istio'
+    version: 'latest'
+    url: 'https://istio.io/latest/docs/'
+    sitemap_url: 'https://istio.io/latest/docs/sitemap.xml'
+    markdown_store: true  # Enable storing markdown in Postgres
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './vector-dbs/istio.db'
+```
+
+Only website sources with `markdown_store: true` will store their markdown. The feature is disabled by default and has no effect on non-website source types.
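
Put as a predicate, the opt-in rule might look like the sketch below. `SourceConfig` is trimmed to the fields relevant here, and `storesMarkdown` is a hypothetical name; the `storeConfigured` argument stands for whether a top-level `markdown_store` block is present.

```typescript
// Trimmed, hypothetical view of a source entry from config.yaml.
interface SourceConfig {
  type: string;
  product_name: string;
  markdown_store?: boolean;
}

function storesMarkdown(source: SourceConfig, storeConfigured: boolean): boolean {
  // Requires the top-level store, a website source, and an explicit opt-in.
  return storeConfigured && source.type === "website" && source.markdown_store === true;
}
```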
+
+## Full example
+
+```yaml
+markdown_store:
+  host: 'localhost'
+  port: 5432
+  database: 'doc2vec'
+  user: 'doc2vec'
+  password: '${PG_PASSWORD}'
+
+sources:
+  - type: 'website'
+    product_name: 'argo'
+    version: 'stable'
+    url: 'https://argo-cd.readthedocs.io/en/stable/'
+    sitemap_url: 'https://argo-cd.readthedocs.io/en/stable/sitemap.xml'
+    markdown_store: true
+    max_size: 1048576
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './vector-dbs/argo-cd.db'
+
+  - type: 'website'
+    product_name: 'istio'
+    version: 'latest'
+    url: 'https://istio.io/latest/docs/'
+    markdown_store: true
+    max_size: 1048576
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './vector-dbs/istio.db'
+
+  # This source does NOT store markdown (markdown_store not set)
+  - type: 'website'
+    product_name: 'kubernetes'
+    version: '1.30'
+    url: 'https://kubernetes.io/docs/'
+    max_size: 1048576
+    database_config:
+      type: 'sqlite'
+      params:
+        db_path: './vector-dbs/k8s.db'
+```
@@ -36,13 +36,19 @@ The primary goal is to prepare documentation content for Retrieval-Augmented Gen
 *   **Vector Storage:** Supports storing chunks, metadata, and embeddings in:
     *   **SQLite:** Using `better-sqlite3` and the `sqlite-vec` extension for efficient vector search.
     *   **Qdrant:** A dedicated vector database, using the `@qdrant/js-client-rest`.
+*   **[Postgres Markdown Store](MARKDOWN_STORE.md):** Optionally stores the generated markdown for each crawled URL in a Postgres table. Useful for maintaining a searchable, raw-text copy of all documentation pages alongside the vector embeddings.
+    *   **Automatic population:** On the first sync, all pages are force-processed (bypassing lastmod/ETag caching) to fully populate the store. Subsequent syncs only update rows when a change is detected.
+    *   **404 cleanup:** Pages that return 404 are automatically removed from the store.
+    *   **Shared table:** A single table (configurable name, default `markdown_pages`) is shared across all sources, with a `product_name` column to distinguish them.
 *   **Multi-Layer Change Detection:** Four layers of change detection minimize unnecessary re-processing:
     1. **Sitemap `lastmod`:** When available, compares the sitemap's `<lastmod>` date against the stored value — skips without any HTTP request. Child URLs inherit `lastmod` from their most specific parent directory.
     2. **ETag via HEAD request:** For URLs without `lastmod`, sends a lightweight HEAD request and compares the ETag header against the stored value. Adaptive backoff prevents rate limiting (starts at 0ms delay, increases on 429 responses, decays on success).
     3. **Content hash comparison:** After full page load, compares chunk content hashes against stored values — skips embedding if content is unchanged.
     4. **Embedding:** Only re-embeds chunks when content has actually changed.
 
     ETag and lastmod values are only stored when chunking and embedding succeed, ensuring failed pages are retried on the next run.
+    
+    A `sync_complete` metadata flag tracks whether a full sync has ever completed successfully. If a sync is interrupted (process killed), the next run force-processes all pages regardless of lastmod/ETag values, ensuring no pages are permanently skipped.
 *   **Incremental Updates:** For GitHub and Zendesk sources, tracks the last run date to only fetch new or updated issues/tickets.
 *   **Cleanup:** Removes obsolete chunks from the database corresponding to pages or files that are no longer found during processing.
 *   **Configuration:** Driven by a YAML configuration file (`config.yaml`) specifying sites, repositories, local directories, Zendesk instances, database types, metadata, and other parameters.
@@ -175,6 +181,7 @@ Configuration is managed through two files:
         For websites (`type: 'website'`):
         *   `url`: The starting URL for crawling the documentation site.
         *   `sitemap_url`: (Optional) URL to the site's XML sitemap for discovering additional pages not linked in navigation.
+        *   `markdown_store`: (Optional) Set to `true` to store generated markdown in the Postgres markdown store (requires top-level `markdown_store` config). Defaults to `false`.
         
         For GitHub repositories (`type: 'github'`):
         *   `repo`: Repository name in the format `'owner/repo'` (e.g., `'istio/istio'`).
@@ -230,6 +237,17 @@ Configuration is managed through two files:
         *   `embedding.provider`: Provider for embeddings (`openai` or `azure`).
         *   `embedding.dimension`: Embedding vector size. Defaults to `3072` when not set.
 
+        Optional Postgres markdown store (top-level):
+        *   `markdown_store.connection_string`: (Optional) Full Postgres connection string (e.g., `'postgres://user:pass@host:5432/db'`). Takes priority over individual fields.
+        *   `markdown_store.host`: (Optional) Postgres host.
+        *   `markdown_store.port`: (Optional) Postgres port.
+        *   `markdown_store.database`: (Optional) Postgres database name.
+        *   `markdown_store.user`: (Optional) Postgres user.
+        *   `markdown_store.password`: (Optional) Postgres password. Supports `${PG_PASSWORD}` env var substitution.
+        *   `markdown_store.table_name`: (Optional) Table name. Defaults to `'markdown_pages'`.
+        
+        When configured, website sources with `markdown_store: true` will store the generated markdown for each URL in this Postgres table. On the first sync, all pages are force-processed (bypassing lastmod/ETag skip logic) to populate the table. On subsequent syncs, only pages with detected changes get their rows updated.
+
     **Example (`config.yaml`):**
     ```yaml
     # Optional: Configure embedding provider
@@ -248,13 +266,25 @@ Configuration is managed through two files:
       #   deployment_name: 'text-embedding-3-large'
       #   api_version: '2024-10-21'  # Optional
 
+    # Optional: Store generated markdown in Postgres
+    # markdown_store:
+    #   connection_string: 'postgres://user:pass@host:5432/db'
+    #   # OR use individual fields:
+    #   # host: 'localhost'
+    #   # port: 5432
+    #   # database: 'doc2vec'
+    #   # user: 'myuser'
+    #   # password: '${PG_PASSWORD}'
+    #   # table_name: 'markdown_pages'  # Optional, defaults to 'markdown_pages'
+
     sources:
-      # Website source example
+      # Website source example (with markdown store enabled)
       - type: 'website'
         product_name: 'argo'
         version: 'stable'
         url: 'https://argo-cd.readthedocs.io/en/stable/'
         sitemap_url: 'https://argo-cd.readthedocs.io/en/stable/sitemap.xml'
+        markdown_store: true  # Store generated markdown in Postgres
         max_size: 1048576
         database_config:
           type: 'sqlite'
@@ -548,6 +578,20 @@ If you don't specify a config path, it will look for config.yaml in the current
 
 ## Recent Changes
 
+### Postgres Markdown Store
+- **New feature:** Optionally store the generated markdown for each crawled website URL in a Postgres table (`url`, `product_name`, `markdown`, `updated_at`)
+- **Top-level configuration:** Configure Postgres connection once at the top level via `connection_string` or individual `host`/`port`/`database`/`user`/`password` fields, with environment variable substitution support
+- **Per-source opt-in:** Enable per website source with `markdown_store: true` (disabled by default)
+- **First-sync force-processing:** When the markdown store is enabled, pages that aren't yet in the Postgres table bypass lastmod and ETag skip logic, ensuring all pages are stored on the first sync
+- **Change-only updates:** On subsequent syncs, only pages with detected content changes (via lastmod/ETag) have their Postgres rows updated
+- **404 cleanup:** Pages that return 404 during HEAD checks are automatically removed from the Postgres store
+
+### Incomplete Sync Recovery
+- **New feature:** Tracks whether a full sync has ever completed successfully for each website source via a `sync_complete:<url_prefix>` metadata key
+- **Interrupted sync handling:** If a sync is killed mid-crawl (process terminated, crash, etc.), the ETags/lastmods stored by the partial run would otherwise cause the remaining pages to be skipped permanently. The `sync_complete` flag prevents this: when it is absent, all pages are force-processed regardless of caching signals
+- **Gated on clean completion:** The flag is only set when the crawl completes without network errors (DNS failures, connection refused, timeouts). If the site is unreachable, the next run will force a full sync again
+- **Scoped per source:** Each website source has its own `sync_complete` key based on its URL prefix. Changing the source URL naturally triggers a new full sync
+
 ### Multi-Layer Change Detection for Websites
 - **Sitemap `lastmod` support:** When a sitemap includes `<lastmod>` dates, pages are skipped entirely if the date hasn't changed — no HEAD request, no Puppeteer load, no chunking. One sitemap fetch replaces hundreds of individual HEAD requests.
 - **`lastmod` inheritance:** Child URLs without their own `<lastmod>` inherit from the most specific parent directory URL in the sitemap (e.g., `/docs/2.10.x/reference/cli/` inherits from `/docs/2.10.x/`).