Conversation

@ahmed-bhs
Contributor

@ahmed-bhs ahmed-bhs commented Oct 15, 2025

| Q            | A   |
| ------------ | --- |
| Bug fix?     | no  |
| New feature? | yes |
| Docs?        | no  |
| License      | MIT |

Problem

Vector search and full-text search each have limitations:

  • Vector search: Great for semantic similarity, but may rank exact term matches lower
  • Full-text search: Great for exact matches, but misses semantic relationships

Users often need both: "Find documents about space travel that mention Apollo"

Solution

New PostgreSQL HybridStore combining three search methods with Reciprocal Rank Fusion (RRF):

| Method   | Extension          | Purpose               |
| -------- | ------------------ | --------------------- |
| Semantic | pgvector           | Conceptual similarity |
| Keyword  | BM25 or native FTS | Exact term matching   |
| Fuzzy    | pg_trgm            | Typo tolerance        |
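
To make the fusion step concrete, here is a minimal sketch of plain Reciprocal Rank Fusion in PHP. It is an illustration only, not the code in this PR: the function name `rrfFuse`, the document IDs, and the input shape are assumptions, and `k = 60` is the default mentioned in the review overview.

```php
<?php

/**
 * Illustrative Reciprocal Rank Fusion (hypothetical helper, not the PR's class).
 *
 * @param array<string, list<string>> $rankings search method => document IDs, best first
 * @param int                         $k        smoothing constant
 *
 * @return array<string, float> document ID => fused score, best first
 */
function rrfFuse(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ids) {
        foreach ($ids as $position => $id) {
            // Ranks are 1-based, so the top hit of each list contributes 1 / (k + 1).
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }

    // Highest fused score first; keys (document IDs) are preserved.
    arsort($scores);

    return $scores;
}

// Made-up document IDs for the three methods listed in the table above.
$fused = rrfFuse([
    'semantic' => ['apollo-13', 'interstellar', 'gravity'],
    'keyword' => ['apollo-13', 'first-man'],
    'fuzzy' => ['first-man', 'apollo-13'],
]);
```

A document that ranks well in several lists accumulates several contributions, which is why it rises above documents that appear in only one list.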

Why BM25 over native PostgreSQL FTS?

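PostgreSQL's native ranking functions (ts_rank, ts_rank_cd) score by term frequency and cover density rather than true TF-IDF, and they have known limitations:

  • No corpus-wide statistics (there is no IDF term, so rare and common words are weighted alike)
  • No document length normalization by default (long documents score higher unfairly unless a normalization flag is passed)
  • Term frequency is not saturated (repeating a word 100x keeps inflating the score)

BM25 addresses these issues with IDF weighting, term-frequency saturation, and length normalization, which is why Lucene and Elasticsearch use it as their default ranking function.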
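
As a rough illustration of saturation and length normalization, the sketch below computes only the term-frequency part of the textbook BM25 formula. The function name is hypothetical, and `k1 = 1.2` and `b = 0.75` are commonly cited defaults, not values taken from this PR or from the plpgsql_bm25 extension.

```php
<?php

/**
 * Term-frequency component of textbook BM25 (IDF factor omitted for brevity).
 * Hypothetical helper; the defaults for $k1 and $b are assumptions.
 */
function bm25TermWeight(int $tf, int $docLength, float $avgDocLength, float $k1 = 1.2, float $b = 0.75): float
{
    // Documents longer than average get a factor above 1, which lowers the weight.
    $lengthNorm = 1.0 - $b + $b * ($docLength / $avgDocLength);

    // The ratio approaches, but never exceeds, k1 + 1 as $tf grows.
    return ($tf * ($k1 + 1.0)) / ($tf + $k1 * $lengthNorm);
}

// Saturation: the weight flattens out as the term is repeated.
foreach ([1, 10, 100] as $tf) {
    printf("tf=%3d  weight=%.3f\n", $tf, bm25TermWeight($tf, 100, 100.0));
}

// Length normalization: the same single match counts for less in a longer document.
printf("short=%.3f  long=%.3f\n", bm25TermWeight(1, 100, 100.0), bm25TermWeight(1, 400, 100.0));
```

With these defaults the weight can never exceed `k1 + 1 = 2.2`, no matter how often the term is repeated.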

Fallback strategy

BM25 requires the plpgsql_bm25 extension. For users without it:

  • Default: PostgresTextSearchStrategy using native ts_rank_cd (works everywhere)
  • Optional: Bm25TextSearchStrategy for better ranking (requires extension)

// Native FTS fallback (default)
$store = new HybridStore($pdo, 'movies');

// BM25 for better ranking (requires extension)
$store = new HybridStore($pdo, 'movies',
    textSearchStrategy: new Bm25TextSearchStrategy(bm25Language: 'en')
);

Features

  • Pluggable text search: BM25 (plpgsql_bm25) or native PostgreSQL FTS
  • RRF fusion: Merges vector + keyword + fuzzy rankings
  • Configurable ratio: 0.0 = keyword-only → 1.0 = vector-only
  • Fuzzy matching: Typo tolerance via pg_trgm
  • Field boosting: title: 2x, overview: 1x
  • Score normalization: 0-100 range
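
One possible reading of the ratio and the 0-100 normalization is sketched below. This is an assumption for illustration, not the formula used in the PR: the vector list is weighted by `semanticRatio`, the keyword list by `1 - semanticRatio`, and scores are scaled so the best hit gets 100. The fuzzy weight is left out, and the name `weightedFuse` is hypothetical.

```php
<?php

/**
 * Hypothetical weighted RRF with 0-100 scaling (not taken from the PR).
 *
 * @param list<string> $vectorIds  document IDs from vector search, best first
 * @param list<string> $keywordIds document IDs from keyword search, best first
 *
 * @return array<string, float> document ID => score, best hit scaled to 100
 */
function weightedFuse(array $vectorIds, array $keywordIds, float $semanticRatio, int $k = 60): array
{
    $weightedLists = [
        [$vectorIds, $semanticRatio],
        [$keywordIds, 1.0 - $semanticRatio],
    ];

    $scores = [];
    foreach ($weightedLists as [$ids, $weight]) {
        if ($weight <= 0.0) {
            continue; // ratio 0.0 drops the vector list, ratio 1.0 drops the keyword list
        }
        foreach ($ids as $position => $id) {
            $scores[$id] = ($scores[$id] ?? 0.0) + $weight / ($k + $position + 1);
        }
    }

    if ([] === $scores) {
        return [];
    }

    arsort($scores);
    $max = max($scores);

    // Scale relative to the best hit; array_map keeps the string keys.
    return array_map(static fn (float $score): float => 100.0 * $score / $max, $scores);
}

$balanced = weightedFuse(['a', 'b'], ['b', 'c'], 0.7);
```

Under this reading, document `b` comes first at ratio 0.7 because it appears in both lists, while ratios of 0.0 and 1.0 reduce to keyword-only and vector-only rankings.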

Configuration

framework:
    ai:
        stores:
            hybrid:
                postgres:
                    connection: doctrine.dbal.default_connection
                    table_name: movies
                    semantic_ratio: 0.7
                    fuzzy_weight: 0.3
                    normalize_scores: true
                    searchable_attributes:
                        title: { boost: 2.0, metadata_key: 'title' }
                        overview: { boost: 1.0, metadata_key: 'overview' }

Usage

$store = new HybridStore($pdo, 'movies', semanticRatio: 0.7);
$results = $store->query($vector, ['q' => 'space adventure', 'limit' => 10]);

@carsonbot carsonbot added the labels Feature (New feature), Store (Issues & PRs about the AI Store component), and Status: Needs Review on Oct 15, 2025
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from 3807878 to 8d4ccfe Compare October 16, 2025 07:36
@chr-hertel chr-hertel requested a review from Copilot October 23, 2025 19:06
Copilot AI left a comment

Pull Request Overview

This PR introduces PostgresHybridStore, a new vector store implementation that combines semantic vector search (pgvector) with PostgreSQL Full-Text Search (FTS) using Reciprocal Rank Fusion (RRF), following Supabase's hybrid search approach.

Key changes:

  • Implements configurable hybrid search with adjustable semantic ratio (0.0 for pure FTS, 1.0 for pure vector, 0.5 for balanced)
  • Uses RRF algorithm with k=60 default to merge vector similarity and ts_rank_cd rankings
  • Supports multilingual content through configurable PostgreSQL text search configurations

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

| File | Description |
| ---- | ----------- |
| src/store/src/Bridge/Postgres/PostgresHybridStore.php | Core implementation of hybrid store with vector/FTS query building, RRF fusion logic, and table setup with tsvector generation |
| src/store/tests/Bridge/Postgres/PostgresHybridStoreTest.php | Comprehensive test coverage for constructor validation, setup, pure vector/FTS queries, hybrid RRF queries, and various configuration options |

Member

@chr-hertel chr-hertel left a comment

In general this is a super cool feature - some copilot findings seem valid to me - please check.

On top of that, I was unsure whether every sprintf really needs to be a sprintf, or whether some values can/should be bound as prepared-statement parameters instead - it'd be great to double-check that as well, please.
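
To illustrate the distinction, here is a minimal sketch (not code from this PR): identifiers such as the table name cannot be bound, so they are validated and interpolated with sprintf, while user-supplied values travel as named parameters. The function name, the `content_tsv` column, and the query shape are hypothetical.

```php
<?php

/**
 * Hypothetical query builder separating identifiers from bound values.
 *
 * @return array{0: string, 1: array<string, string|int>} SQL text and parameters to bind
 */
function buildKeywordQuery(string $table, string $searchText, int $limit): array
{
    // Identifiers cannot be bound as parameters, so validate before interpolating.
    if (1 !== preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $table)) {
        throw new InvalidArgumentException(sprintf('Invalid table name "%s".', $table));
    }

    // Only the validated identifier goes through sprintf; values stay as placeholders.
    $sql = sprintf(
        'SELECT id, metadata FROM %s WHERE content_tsv @@ plainto_tsquery(:query) LIMIT :limit',
        $table
    );

    return [$sql, ['query' => $searchText, 'limit' => $limit]];
}

[$sql, $params] = buildKeywordQuery('movies', "space'; DROP TABLE movies; --", 10);
```

With PDO, the returned SQL would go through `$pdo->prepare($sql)` and each entry through `bindValue()`, using `PDO::PARAM_INT` for the limit, so the search text never becomes part of the SQL string.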

ahmed-bhs added a commit to ahmed-bhs/ai-demo that referenced this pull request Oct 30, 2025
Side-by-side comparison of FTS, Hybrid (RRF), and Semantic search.
Uses Supabase (pgvector + PostgreSQL FTS).
30 sample articles with interactive Live Component.

Related: symfony/ai#783
Author: Ahmed EBEN HASSINE <ahmedbhs123@gmail.com>
@chr-hertel
Member

@ahmed-bhs could you please have a look at the pipeline failures - I think there are still some minor parts open

Combines pgvector semantic search with PostgreSQL Full-Text Search
using Reciprocal Rank Fusion (RRF), following Supabase approach.

Features:
- Configurable semantic/keyword ratio (0.0 to 1.0)
- RRF fusion with customizable k parameter
- Multilingual FTS support (default: 'simple')
- Optional relevance filtering with defaultMaxScore
- All pgvector distance metrics supported

- Extract WHERE clause logic into addFilterToWhereClause() helper method
- Fix embedding param logic: ensure it's set before maxScore uses it
- Replace fragile str_replace() with robust str_starts_with() approach
- Remove code duplication between buildFtsOnlyQuery and buildHybridQuery

This addresses review feedback about fragile WHERE clause manipulation
and centralizes the logic in a single, reusable method.

- Rename class from PostgresHybridStore to HybridStore
- The namespace already indicates it's Postgres-specific

- Add postgres-hybrid.php RAG example demonstrating:
  * Different semantic ratios (0.0, 0.5, 1.0)
  * RRF (Reciprocal Rank Fusion) hybrid search
  * Full-text search with 'q' parameter
  * Per-query semanticRatio override
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch 2 times, most recently from c75380e to 19623bb Compare November 7, 2025 13:56
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 19623bb to 2c7b49a Compare November 7, 2025 13:57
Replace ts_rank_cd (PostgreSQL Full-Text Search) with BM25 algorithm
for better keyword search ranking in hybrid search.

Changes:
- Add bm25Language parameter (configurable via YAML)
- Replace FTS CTEs with bm25topk() function calls
- Add DISTINCT ON fixes to prevent duplicate results
- Add fuzzy matching with word_similarity (pg_trgm)
- Add score normalization (0-100 range)
- Add searchable attributes with field-specific boosting
- Bundle configuration in options.php and AiBundle.php

Tests:
- Update 6 existing tests for BM25 compatibility
- Add 3 new tests for fuzzy matching and searchable attributes
- All 19 tests passing (132 assertions)

Breaking changes:
- Requires plpgsql_bm25 extension instead of native FTS
- BM25 uses short language codes ('en', 'fr') vs FTS full names

Add 3 new tests covering newly introduced functionality:

- testFuzzyMatchingWithWordSimilarity: Verifies pg_trgm fuzzy matching
  with word_similarity() and custom thresholds (primary, secondary, strict)

- testSearchableAttributesWithBoost: Ensures field-specific tsvector
  columns are created with proper GIN indexes (title_tsv, overview_tsv)

- testFuzzyWeightParameter: Validates fuzzy weight distribution in RRF
  formula when combining vector, BM25, and fuzzy scores

All tests verify SQL generation via callback assertions.
Test suite: 19 tests, 132 assertions, all passing.
@ahmed-bhs ahmed-bhs changed the title from "[Store] Add PostgresHybridStore with RRF following Supabase approach" to "[Store] Add HybridStore with BM25 ranking for PostgreSQL" on Nov 23, 2025
@OskarStark
Contributor

Are you open to finishing this PR, @ahmed-bhs?

…dStore

- Extract RRF logic into dedicated ReciprocalRankFusion class
- Introduce TextSearchStrategyInterface for pluggable search strategies
- Remove debug code (file_put_contents calls)
- Replace empty() with strict comparisons ([] !==) per PHPStan rules
- Add missing PHPDoc types for array parameters
- Mark properties as readonly for immutability
- Extract helper methods (buildTsvectorColumns, createSearchTextTrigger)
- Use NullVector for results without embeddings
- Update tests to reflect new setup() execution order
@ahmed-bhs
Contributor Author

Hi @OskarStark,

The work on my side is complete, you can take a look whenever you have time.

To give you a bit of context on the evolution of this work:

Initially, I wanted to propose a hybrid search implementation on PostgreSQL, combining semantic search (pgvector) with PostgreSQL’s native text search ranking (ts_rank_cd).

However, the native ranking has well-known scoring limitations:

  • No length normalization by default: longer documents are unfairly favored
  • Unbounded term frequency: repeating a word 100 times artificially inflates the score

That’s why I then suggested using BM25 (the default ranking algorithm in Lucene and Elasticsearch), which addresses these issues through saturation and document length normalization.

BM25, however, requires the plpgsql_bm25 extension, which is not installed by default. So I implemented a fallback architecture:

  • Default: PostgresTextSearchStrategy using native FTS (works everywhere)
  • Optional: Bm25TextSearchStrategy for better ranking (requires the extension)

I also extracted the RRF (Reciprocal Rank Fusion) logic into a dedicated class for reusability.

Feel free to reach out if you have any questions or feedback!

@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 171ba50 to 27954f9 Compare November 26, 2025 04:05
- Demonstrate Bm25TextSearchStrategy vs native PostgreSQL FTS
- Show explicit ReciprocalRankFusion configuration
- Add comparison between both text search strategies
- Simplify summary and improve clarity
@ahmed-bhs ahmed-bhs force-pushed the feature/postgres-hybrid-search branch from 27954f9 to d7446d5 Compare November 26, 2025 04:12
Member

@chr-hertel chr-hertel left a comment

Hi @ahmed-bhs, coming back to this proposal now - I think it's great to promote Postgres more - since for quite some folks this is a great use case to combine it with given infrastructure.

One thing I'd like to see here though is to identify and leverage synergies with the Symfony\AI\Store\Bridge\Postgres\Store implementation.

For example:

  1. do we need separate bundle configs for postgres and postgres_hybrid - I'd say hybrid could be a keyword below postgres instead.
  2. let's extract some code into separate classes please, see toPgvector + fromPgvector or maybe the different queries?

Thanks already!
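
For reference, the kind of shared helper meant by the second point could look roughly like the sketch below. It is a standalone illustration on plain arrays, assuming pgvector's text format `[1,2,3]`; the real methods in the existing Store may differ in signature and types.

```php
<?php

/**
 * Serialize a list of numbers into pgvector's text format, e.g. "[0.1,2,3.5]".
 *
 * @param list<float|int> $values
 */
function toPgvector(array $values): string
{
    return '['.implode(',', $values).']';
}

/**
 * Parse pgvector's text format back into floats; the literal is also valid JSON.
 *
 * @return list<float>
 */
function fromPgvector(string $literal): array
{
    return array_map('floatval', json_decode($literal, true, 512, JSON_THROW_ON_ERROR));
}

$literal = toPgvector([0.1, 2, 3.5]);
$values = fromPgvector($literal);
```

Keeping both conversions in one small class would let the plain Store and the HybridStore share them instead of duplicating the logic.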

Labels

Feature (New feature), Status: Needs Work, Store (Issues & PRs about the AI Store component)

4 participants