Refactor PgvectorDocumentStore making it reusable for future PostgreSQL-related integrations #3239

@davidsbatista

Context

We currently have three PostgreSQL-backed document stores, two already in the repo and one in an open PR:

  • PgvectorDocumentStore — pure PostgreSQL access via psycopg
  • SupabasePgvectorDocumentStore — subclasses PgvectorDocumentStore and overrides only the connection layer
  • AlloyDBDocumentStore — currently in an open PR, requires the google-cloud-alloydb-connector

PgvectorDocumentStore can be conceptually split into two layers:

  1. Connection layer — responsible for the psycopg connection, cursor management, etc.
  2. Data layer — SQL schema, filters, converters, retrieval logic (pure PostgreSQL)

Problem

The Supabase integration follows a clean pattern: it subclasses PgvectorDocumentStore and overrides only the connection layer, inheriting all SQL/data logic.

AlloyDB duplicates the entire data layer from pgvector. This means any bug fixed or feature added in one store must be manually mirrored to the other, which is error-prone and unsustainable as more PostgreSQL-backed integrations are added.

Proposal

Separate the connection and data layers in the pgvector package:

  • Extract the data layer into a base class inside the pgvector package, exposing abstract connection hooks: _ensure_db_setup() and its async counterpart _ensure_db_setup_async() (plus any other connection-specific hooks needed).
  • PgvectorDocumentStore becomes a thin subclass that implements those hooks using psycopg directly.
  • Future PostgreSQL-backed integrations (AlloyDB, and any others) follow the same pattern as Supabase: one new class that overrides connection-related methods only, with no duplicated SQL or data-layer code.
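The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual pgvector code: the schema is a simplified, hypothetical one, the `_cursor` helper and `InMemoryStandInStore` are invented for the example, and the standard-library `sqlite3` module stands in for psycopg so the snippet runs without a database server. Only the hook name `_ensure_db_setup` is taken from the proposed structure.

```python
import sqlite3
from abc import ABC, abstractmethod


class PostgreSQLDocumentStore(ABC):
    """Data layer: SQL and document logic shared by every subclass."""

    # Hypothetical, simplified schema used only for this sketch.
    CREATE_TABLE_STATEMENT = (
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, content TEXT)"
    )

    def __init__(self) -> None:
        self._connection = None

    @abstractmethod
    def _ensure_db_setup(self) -> None:
        """Connection layer hook: subclasses must set self._connection."""

    def _cursor(self):
        # Connect lazily on first use, then make sure the table exists.
        if self._connection is None:
            self._ensure_db_setup()
            self._connection.execute(self.CREATE_TABLE_STATEMENT)
        return self._connection.cursor()

    def write_documents(self, docs: dict) -> int:
        # docs maps document id -> content.
        cur = self._cursor()
        cur.executemany(
            "INSERT INTO documents (id, content) VALUES (?, ?)",
            list(docs.items()),
        )
        self._connection.commit()
        return len(docs)

    def count_documents(self) -> int:
        cur = self._cursor()
        cur.execute("SELECT COUNT(*) FROM documents")
        return cur.fetchone()[0]


class InMemoryStandInStore(PostgreSQLDocumentStore):
    """Thin subclass: implements only the connection hook.

    sqlite3 is an assumption made for runnability; the real
    PgvectorDocumentStore would open a psycopg connection here.
    """

    def _ensure_db_setup(self) -> None:
        self._connection = sqlite3.connect(":memory:")
```

The subclass contains no SQL at all, which is the property the refactor is after: a fix to `write_documents` or `count_documents` in the base class reaches every store that inherits from it.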

The proposed structure

  pgvector/
  ├── _base.py  ← new file: PostgreSQLDocumentStore (abstract)
  │     - all SQL constants (CREATE_TABLE_STATEMENT, etc.)
  │     - all data methods (count, filter, write, delete, retrieve...)
  │     - abstract _ensure_db_setup(self) -> None
  │     - abstract _ensure_db_setup_async(self) -> None  (optional: NotImplementedError)
  │
  └── document_store.py  ← PgvectorDocumentStore(PostgreSQLDocumentStore)
        __init__: takes connection_string
        _ensure_db_setup: Connection.connect(conn_str)
        _ensure_db_setup_async: AsyncConnection.connect(conn_str)

  alloydb/
  └── document_store.py  ← AlloyDBDocumentStore(PostgreSQLDocumentStore)
        __init__: takes instance_uri, user, password, ip_type, enable_iam_auth
        _ensure_db_setup: Connector(...).connect(instance_uri, ...)
        _ensure_db_setup_async: NotImplementedError (until connector supports it)

  supabase/
  └── document_store.py  ← SupabasePgvectorDocumentStore(PgvectorDocumentStore)
        __init__: reads SUPABASE_DB_URL, create_extension=False  ← unchanged, already correct

The base class lives inside the pgvector package (no new package needed). alloydb-haystack would depend on pgvector-haystack rather than reimplementing it. supabase-haystack already does this correctly and requires no change.
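One way the "optional" async hook could behave is sketched below, assuming the base class ships a default `_ensure_db_setup_async` that raises `NotImplementedError`, so a sync-only store overrides nothing on the async side. `FakeConnection`, `SyncOnlyStore`, and the two `count_documents*` methods are hypothetical stand-ins written for this example; no real connector API is used.

```python
import asyncio
from abc import ABC, abstractmethod


class FakeConnection:
    """Hypothetical stand-in for a database connection."""

    def __init__(self) -> None:
        self.rows = []

    def count(self) -> int:
        return len(self.rows)


class PostgreSQLDocumentStore(ABC):
    def __init__(self) -> None:
        self._connection = None
        self._async_connection = None

    @abstractmethod
    def _ensure_db_setup(self) -> None:
        """Required hook: set self._connection."""

    async def _ensure_db_setup_async(self) -> None:
        # Optional hook: stores without async support inherit this default.
        raise NotImplementedError(
            f"{type(self).__name__} does not support async connections"
        )

    def count_documents(self) -> int:
        if self._connection is None:
            self._ensure_db_setup()
        return self._connection.count()

    async def count_documents_async(self) -> int:
        if self._async_connection is None:
            await self._ensure_db_setup_async()
        return self._async_connection.count()


class SyncOnlyStore(PostgreSQLDocumentStore):
    """Sketch of a store whose connector has no async support yet."""

    def __init__(self, instance_uri: str) -> None:
        super().__init__()
        self.instance_uri = instance_uri

    def _ensure_db_setup(self) -> None:
        # A real store would call its connector with self.instance_uri here.
        self._connection = FakeConnection()
```

With this shape, `SyncOnlyStore("some-instance-uri").count_documents()` works, while `asyncio.run(store.count_documents_async())` raises `NotImplementedError` with the store's class name in the message. Whether the async hook should be truly abstract or have a raising default is a design choice left open by the structure above.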

Benefits

  • All SQL, filtering, and conversion logic is inherited and tested once.
  • Bug fixes and improvements to the data layer automatically propagate to all PostgreSQL-backed stores.
  • New integrations only need to implement the connection layer.
  • Consistent behavior across Pgvector, Supabase, AlloyDB, and future variants.

Related

  • Open PR for AlloyDB integration (to be updated to follow this pattern once the refactor lands).
