feat: add PGVector knowledge source #280

Open
esafwan wants to merge 18 commits into develop from feat/pgvector-knowledge-doctypes

esafwan commented Jun 1, 2026

feat: add PGVector knowledge source

Adds PostgreSQL/pgvector as a vector backend for HUF Knowledge Sources, alongside the existing Chroma and SQLite-vec backends.

What's in this PR

New DocTypes

  • Knowledge Source — pgvector_* connection fields, embedding model/provider/dimension config
  • Knowledge Input — file, text, URL input types with background indexing queue
  • Integration Credential, Integration Recipient, Integration Service, Integration Settings

Backend

  • pgvector_backend.py — LlamaIndex PGVectorStore adapter: add_chunks, search, delete_chunks, get_stats, health
  • embedding.py — LiteLLM provider-agnostic embeddings (OpenAI, Gemini, Ollama, etc.)
  • indexer.py — background job: extract → chunk → embed → insert into pgvector
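The extract → chunk → embed → insert flow listed for indexer.py can be pictured roughly as below. This is a minimal sketch, not the PR's code: `chunk_text`, `index_document`, the `Chunk` dataclass, the character-window chunking, and the default sizes are all hypothetical, and the `embed` and `insert` callables stand in for the LiteLLM and pgvector layers.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    """Hypothetical container pairing a text piece with its embedding."""
    text: str
    embedding: List[float]


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows with overlap.

    Assumption: the real indexer may chunk by tokens or sentences instead.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    pieces = (text[i:i + size] for i in range(0, len(text), step))
    # Drop windows that contain only whitespace.
    return [p for p in pieces if p.strip()]


def index_document(
    text: str,
    embed: Callable[[Sequence[str]], List[List[float]]],
    insert: Callable[[List[Chunk]], None],
) -> int:
    """Chunk, embed, and insert a document; return the number of chunks stored."""
    pieces = chunk_text(text)
    if not pieces:
        return 0  # nothing to index (e.g. empty input)
    vectors = embed(pieces)
    if len(vectors) != len(pieces):
        raise ValueError("embedding count does not match chunk count")
    insert([Chunk(t, v) for t, v in zip(pieces, vectors)])
    return len(pieces)
```

Passing the embedding and storage steps in as callables keeps the sketch testable without a database or an embedding provider.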

Fixes included (post-test)

  • sslmode bug: sslmode is now stripped from the connection parameters before they are passed to LlamaIndex, but kept for the SQLAlchemy health check
  • Configurable timeout: litellm_embedding_timeout site_config key (default 600s) — discovered during 3.26MB document indexing that hit LiteLLM’s effective 300s wall
  • Ollama 400 retry: single retry on intermittent /api/embed 400 errors; batch_size=1 for ollama/ models
  • Test Connection: whitelist API + form button on Knowledge Source (pgvector only) with results dialog
  • Embedding provider change warning: validate() msgprint when model/provider changes on a source with existing chunks
  • Large document UX hint: description on the file field in Knowledge Input referencing litellm_embedding_timeout (replaced misleading “split files >1.5MB” guidance)
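To illustrate the sslmode fix in the list above, here is a minimal sketch that assumes the connection is expressed as a URL. The PR may instead assemble it from the separate pgvector_* fields. The helper name `split_sslmode` is hypothetical; only the standard library is used.

```python
from typing import Optional, Tuple
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_sslmode(dsn: str) -> Tuple[str, Optional[str]]:
    """Return (dsn_without_sslmode, sslmode_value_or_None).

    Hypothetical helper: the stripped DSN would go to the vector store,
    while the original DSN (with sslmode) would be used for the health check.
    """
    parts = urlsplit(dsn)
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Remember the first sslmode value, if any.
    sslmode = next((v for k, v in params if k == "sslmode"), None)
    # Rebuild the query string without any sslmode entries.
    kept = [(k, v) for k, v in params if k != "sslmode"]
    stripped = urlunsplit(parts._replace(query=urlencode(kept)))
    return stripped, sslmode
```

For example, `split_sslmode("postgresql://u:p@db:5432/huf?sslmode=require")` yields the URL without its query string together with the value `"require"`, so the caller can still apply the SSL setting wherever it is accepted.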

Testing Extent

All tests were run on Frappe 16.18.2, pgvector 0.8.2, and bench 5.28.0 in a Docker devcontainer.

| Test | Status | Notes |
| --- | --- | --- |
| T1 Create pgvector Knowledge Source | | Form saves, connection established, table auto-created |
| T2 Text Knowledge Input | | Background queue → chunks → embeddings → pgvector insert |
| T3 File Knowledge Input | | File doc lookup works, chunks indexed correctly |
| T4 Semantic search retrieval | | Cosine similarity 0.72–0.77 on relevant queries |
| T5 Ollama local embedding | | nomic-embed-text, air-gapped, ~2–10 min for large files |
| T6 Large document scale | | 3.26MB Paul Graham essays → 1,564 chunks indexed end-to-end (after timeout fix; pre-fix required splitting into 3 parts as workaround) |
| T7 Error recovery / reprocess | | Wrong password → fix → reprocess clears old data and re-indexes |
| T8 Agent integration (backend) | | Knowledge source linked to agent nivo via agent_knowledge; backend retrieval validated |
| T9 Unit tests | | test_tool, test_chroma_backend (skip), direct smoke add/search/stats/delete/health |
| Frontend build | | yarn typecheck, yarn build, bench build --app huf pass |
| Edge cases | | Empty text, duplicate hash, 1-char text, invalid table name, non-ASCII query all handled |
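
For readers interpreting the T4 scores, this is what a cosine similarity value measures, as a minimal plain-Python sketch. The function name is illustrative and not taken from the PR. Note that pgvector's `<=>` operator returns cosine distance, which is 1 minus this value.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```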

Not yet validated / caveats

  • OpenAI/Gemini provider switches: Config paths exist and were reviewed; OpenAI not yet smoke-tested end-to-end. Gemini key hit a billing/project permission issue during validation — blocked, not a code bug.
  • Browser chat flow: Backend retrieval and agent linkage are verified; full browser-side chat test with nivo was not explicitly completed in this session.

Related PRs

Checklist

  • llama-index-vector-stores-postgres in requirements
  • pgvector container documented in deployment setup
  • temp_test_docs/ removed
  • Manual test guide available at development/edge16/HUF_pgvector_PR280_manual_test.md

esafwan changed the title from "feat: add PGVector knowledge source schema" to "feat: add PGVector knowledge source" Jun 1, 2026
esafwan added 6 commits June 1, 2026 20:00
- embedding.py: configurable litellm_embedding_timeout (default 600s) passed
  as request_timeout; Ollama 400 single retry; batch_size=1 for ollama/ models
- knowledge_source.py: test_connection whitelist API (tests pgvector + embedding)
- knowledge_source.py: _warn_if_embedding_changed() warning when provider/model
  changes on a source with existing chunks
- knowledge_source.js: Test Connection button handler with results dialog
- knowledge_source.json: Test Connection button field (pgvector only)
- knowledge_input.json: file field description updated to mention litellm_embedding_timeout
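
The embedding.py changes in this commit message (configurable timeout, single retry on Ollama 400s, batch_size=1 for ollama/ models) could look roughly like the sketch below. It is an illustration under assumptions, not the committed code: `BadRequest`, `resolve_timeout`, `batch_size_for`, `embed_all`, and the default batch size of 32 are hypothetical, and the `call` argument stands in for the real LiteLLM request.

```python
from typing import Callable, List, Mapping, Sequence

DEFAULT_TIMEOUT_S = 600  # default stated in the PR description


class BadRequest(Exception):
    """Hypothetical stand-in for an HTTP 400 from the embedding endpoint."""


def resolve_timeout(site_config: Mapping[str, object]) -> int:
    """Read litellm_embedding_timeout from a site_config-like mapping."""
    return int(site_config.get("litellm_embedding_timeout", DEFAULT_TIMEOUT_S))


def batch_size_for(model: str, default: int = 32) -> int:
    """Use one text per request for ollama/ models; `default` is an assumption."""
    return 1 if model.startswith("ollama/") else default


def embed_all(
    texts: Sequence[str],
    model: str,
    call: Callable[[Sequence[str], str, int], List[List[float]]],
    timeout: int,
) -> List[List[float]]:
    """Embed texts in batches, retrying an Ollama batch once on a 400 error."""
    vectors: List[List[float]] = []
    size = batch_size_for(model)
    for start in range(0, len(texts), size):
        batch = texts[start:start + size]
        try:
            vectors.extend(call(batch, model, timeout))
        except BadRequest:
            if not model.startswith("ollama/"):
                raise
            # Single retry only; a second failure propagates to the caller.
            vectors.extend(call(batch, model, timeout))
    return vectors
```

Keeping the retry to one attempt matches the "single retry" wording above and avoids masking persistent errors behind a loop.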