
Vector Search: hybrid search #3060

Merged
shanbady merged 32 commits into shanbady/qdrant-upgrade from shanbady/sparse-hybrid-search
Mar 19, 2026

Conversation

@shanbady (Contributor) commented Mar 17, 2026

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/10380

Description (What does it do?)

This PR integrates and enables the following:

  • generation of sparse embeddings for both local and deployed environments (sklearn.HashingVectorizer locally; bm25 via Qdrant Cloud inference in deployed environments)
  • use of hybrid search when searching the contentfile and resource vector endpoints
  • configuration changes/optimizations for Qdrant collection performance
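For context on what a hybrid query does: the sparse (lexical) and dense (semantic) result lists have to be fused into one ranking, and Qdrant can perform this fusion server-side. A conceptual sketch of one common fusion method, reciprocal rank fusion (illustrative only, not the PR's actual code):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked id lists into one.

    Each document scores 1 / (k + rank) in every list it appears in,
    so documents ranked highly by either retriever float to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

dense = ["doc-a", "doc-b", "doc-c"]   # semantic ranking
sparse = ["doc-a", "doc-d"]           # lexical (bm25/hashing) ranking
print(rrf_fuse([dense, sparse]))      # "doc-a" ranks first: top of both lists
```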

How can this be tested?

testing local hybrid search

  1. checkout this branch.
  2. make sure settings.QDRANT_SPARSE_MODEL defaults to "sklearn/hashing_vectorizer_sparse_model" and settings.QDRANT_SPARSE_ENCODER defaults to "vector_search.encoders.sparse_hash.SparseHashEncoder"
  3. rebuild your web and celery containers and do a down/up on them
  4. delete your local qdrant collections from your local qdrant dashboard
  5. make sure you have resources and contentfiles locally and generate embeddings via ./manage.py generate_embeddings --all
  6. go back to your qdrant dashboard and see that the collections have been created with hashing_vectorizer_sparse_model as the sparse model and whatever your settings.QDRANT_DENSE_MODEL has been set to as the dense model.
  7. go into the contentfiles collection on the dashboard and grab some qdrant point id
  8. run the following in the qdrant console replacing the point id with the one you found:
GET collections/resource_embeddings.content_files/points/00e468bb-93dc-576f-9df1-f045eb6c394c
  9. under the "vector" attribute of the response you should see that both the sparse and dense vectors have values populated
  10. performing searches using the vector endpoints should behave as expected, although this time they are using hybrid search. the "hybrid_search=true" parameter toggles hybrid search
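As a quick sanity check of the toggle from the host machine, hybrid search is just a query parameter on the existing endpoint. A standard-library sketch (the local port and base path are assumptions; adjust to your dev setup):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL is an assumption for a local dev environment; adjust as needed.
BASE = "http://localhost:8063/api/v0/vector_learning_resources_search/"

params = {"q": "machine learning", "hybrid_search": "true", "limit": 3}
url = BASE + "?" + urlencode(params)
# results = json.load(urlopen(url))  # uncomment with the stack running
print(url)
```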

testing deployed/cloud-inferenced hybrid search

  1. perform steps 1 and 2 from the local instructions above
  2. sign up / log in on Qdrant Cloud
  3. you may need to ask @blarghmatey to add you to our cloud account if it is not visible
  [Screenshot from 2026-03-17 omitted]
  4. open the mitol-learn-qa cluster and go to the "api keys" section. create a new api key and set settings.QDRANT_API_KEY. Set settings.QDRANT_HOST to "https://3cd6878c-6d1a-4c75-9056-840e277a0f8b.us-east-1-0.aws.cloud.qdrant.io"
  5. set settings.QDRANT_SPARSE_MODEL to "qdrant/bm25" and settings.QDRANT_SPARSE_ENCODER to "vector_search.encoders.qdrant_cloud.QdrantCloudEncoder"
  6. restart celery
  7. run ./manage.py generate_embeddings --all
  8. you should see new collections appear in the qdrant cluster (named resource_embeddings.content_files, resource_embeddings.resources, etc. - you may need to paginate to see them in the list)
  9. perform the same console query from steps 7 and 8 of the local instructions; you should see the point has a bm25 vector populated in addition to the dense vector
  10. when finished testing, make sure you delete the api key and collections you just created (be careful to delete the correct ones!)
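The cloud-inference steps above boil down to a handful of Django settings. A sketch of the relevant fragment (the host and key values are placeholders, not real credentials; the model and encoder strings come from the steps above):

```python
# settings.py fragment for cloud-inferenced hybrid search (sketch)
QDRANT_HOST = "https://<cluster-id>.<region>.aws.cloud.qdrant.io"  # your cluster URL
QDRANT_API_KEY = "<api-key-created-in-the-dashboard>"  # delete the key after testing
QDRANT_SPARSE_MODEL = "qdrant/bm25"
QDRANT_SPARSE_ENCODER = "vector_search.encoders.qdrant_cloud.QdrantCloudEncoder"
```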

Additional Context

  1. for local environments we use sklearn.HashingVectorizer to generate the sparse vectors; it is fast, does not require pre-fitting on an entire dataset, and is well suited to local testing
  2. deployed environments use Qdrant Cloud inference, which we get at no extra cost since we use their paid offering
  3. when performing a search, hybrid search is activated via the "hybrid_search" GET parameter
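The local sparse encoder relies on the hashing trick: each token is hashed straight to a column index, so no vocabulary needs to be fitted up front. A minimal pure-Python illustration of the idea (sklearn's HashingVectorizer differs in hash function, signing, and normalization):

```python
from hashlib import md5

def hashing_trick_embed(text: str, n_features: int = 2**16) -> dict[int, float]:
    """Map tokens to (index, count) pairs without any fitted vocabulary."""
    vec: dict[int, float] = {}
    for token in text.lower().split():
        # hash the token directly to a column index in [0, n_features)
        idx = int(md5(token.encode()).hexdigest(), 16) % n_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

# the nonzero (index, value) pairs are exactly the shape a sparse
# vector store like Qdrant expects
print(hashing_trick_embed("hybrid search beats plain search"))
```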

@shanbady shanbady changed the base branch from main to shanbady/qdrant-upgrade March 17, 2026 15:51
github-actions bot commented Mar 17, 2026

OpenAPI Changes

2 changes: 0 error, 0 warning, 2 info
info	[new-optional-request-parameter] at head/openapi/specs/v0.yaml	
	in API GET /api/v0/vector_content_files_search/
		added the new optional 'query' request parameter 'hybrid_search'

info	[new-optional-request-parameter] at head/openapi/specs/v0.yaml	
	in API GET /api/v0/vector_learning_resources_search/
		added the new optional 'query' request parameter 'hybrid_search'


Unexpected changes? Ensure your branch is up-to-date with main (consider rebasing).

@shanbady shanbady marked this pull request as ready for review March 17, 2026 19:01
@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Mar 17, 2026
@abeglova abeglova self-assigned this Mar 18, 2026
"""
Return the sparse encoder based on settings
"""
Encoder = import_string(settings.QDRANT_SPARSE_ENCODER)
Contributor:

what does import_string do here?

shanbady (author):

settings.QDRANT_SPARSE_ENCODER specifies the encoder class to instantiate (this is also how the dense encoder works)
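For context, Django's import_string resolves a dotted path such as "vector_search.encoders.sparse_hash.SparseHashEncoder" to the class object so it can be instantiated. A minimal stand-in with the same behavior (the real Django helper also wraps failures in a friendlier ImportError):

```python
from importlib import import_module

def import_string(dotted_path: str):
    """Resolve 'pkg.module.Attr' to the attribute object (Django-style)."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    return getattr(import_module(module_path), attr_name)

# turn a settings string into a class, then instantiate it
Encoder = import_string("collections.OrderedDict")
print(Encoder())  # an empty OrderedDict
```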

Contributor:

oh i get it! That makes sense

collection_name=search_collection,
count_filter=search_filter,
exact=True,
exact=False,
Contributor:

Will this make the counts incorrect for non-hybrid queries? Will that be a problem for paging?

shanbady (author):

It could be an issue for collections with a very large number of points (the contentfile chunks collection), but I don't expect we will ever need to accurately paginate through that collection (or rely on its count). The performance gain from an inexact count outweighs the need for an accurate count of chunks.



def vector_search(
def vector_search( # noqa: PLR0913
Contributor:

Can you add a test for vector_search with hybrid_search=True, or just group_by=null in general?

Also, this does not need to be addressed in this PR, but vector_search shouldn't be in utils; it should be in its own file, since it is only called by the views and is not a utility function.

@abeglova left a review comment:

Works great locally. I'm having trouble joining the group Tobias set up to test the qdrant cloud encoder

Comment on lines +21 to +26
try:
self.token_encoding_name = tiktoken.encoding_name_for_model(model_name)
except KeyError:
msg = f"Model {model_name} not found in tiktoken. defaulting to None"
log.warning(msg)

Automated review comment:

Bug: In QdrantCloudEncoder.__init__, a KeyError exception logs a fallback to None but doesn't assign self.token_encoding_name, leading to a potential AttributeError.
Severity: MEDIUM

Suggested Fix

In the except KeyError block of the QdrantCloudEncoder.__init__ method, add the line self.token_encoding_name = None after the log warning. This will ensure the attribute is set as intended by the log message and prevent subsequent AttributeError exceptions.

Location: vector_search/encoders/qdrant_cloud.py#L21-L26

Potential issue: In the `QdrantCloudEncoder.__init__` method, if
`tiktoken.encoding_name_for_model(model_name)` raises a `KeyError`, the `except` block
logs a warning message stating it is "defaulting to None" but fails to actually assign
`self.token_encoding_name = None`. Unlike `LiteLLMEncoder`, which has a class-level
fallback, `QdrantCloudEncoder` has no such safety net. Consequently, any downstream code
attempting to access the `token_encoding_name` attribute on the instance will trigger an
`AttributeError`, causing a runtime crash.
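The fix is the standard pattern of assigning the fallback inside the except branch itself. A self-contained sketch (the lookup table below is a hypothetical stand-in for tiktoken.encoding_name_for_model, which raises KeyError for unknown models):

```python
import logging

log = logging.getLogger(__name__)

# hypothetical stand-in for tiktoken.encoding_name_for_model
_MODEL_ENCODINGS = {"gpt-4o": "o200k_base"}

def encoding_name_for_model(model_name: str) -> str:
    return _MODEL_ENCODINGS[model_name]  # raises KeyError for unknown models

class CloudEncoder:
    def __init__(self, model_name: str):
        try:
            self.token_encoding_name = encoding_name_for_model(model_name)
        except KeyError:
            log.warning("Model %s not found in tiktoken; defaulting to None", model_name)
            self.token_encoding_name = None  # the assignment the original code missed
```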

Comment on lines +929 to 938
else:
# fallback to dense only search
search_params["using"] = encoder_dense.model_short_name()
search_params["query"] = encoder_dense.embed_query(query_string)

if "group_by" in params:
search_params.pop("search_params", None)
search_params["group_by"] = params.get("group_by")
search_params["group_size"] = params.get("group_size", 1)
group_result = client.query_points_groups(**search_params)
Automated review comment:

Bug: The vector_search function incorrectly passes with_payload and with_vectors parameters to client.query_points_groups() during grouped searches, which will cause a TypeError.
Severity: HIGH

Suggested Fix

Before calling client.query_points_groups within the if "group_by" in params: block, remove the with_payload and with_vectors keys from the search_params dictionary, similar to how search_params is removed. Add search_params.pop("with_payload", None) and search_params.pop("with_vectors", None).

Location: vector_search/utils.py#L929-L938

Potential issue: In the `vector_search` function, when a `group_by` parameter is
present, the code correctly removes the `search_params` key before calling
`client.query_points_groups`. However, it fails to also remove the `with_payload` and
`with_vectors` keys from the parameters dictionary. These keys are not intended for
grouped queries in this context. Passing these extraneous keyword arguments will cause
`client.query_points_groups` to raise a `TypeError` at runtime, causing grouped searches
to fail.
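The corresponding fix is to strip every kwarg that query_points_groups does not accept before the call, in one place. A small helper sketch (names are illustrative, not the PR's actual code):

```python
def to_group_query_kwargs(search_params: dict, params: dict) -> dict:
    """Copy search kwargs, dropping keys a grouped query would reject."""
    kwargs = dict(search_params)  # leave the caller's dict untouched
    for key in ("search_params", "with_payload", "with_vectors"):
        kwargs.pop(key, None)  # absent keys are ignored
    kwargs["group_by"] = params["group_by"]
    kwargs["group_size"] = params.get("group_size", 1)
    return kwargs

base = {"collection_name": "resource_embeddings.resources",
        "with_payload": True, "with_vectors": False, "limit": 10}
grouped = to_group_query_kwargs(base, {"group_by": "readable_id"})
print(sorted(grouped))  # with_payload / with_vectors are gone
```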

@abeglova (Contributor) commented:

works now!

@shanbady shanbady merged commit 299752c into shanbady/qdrant-upgrade Mar 19, 2026
10 checks passed
@shanbady shanbady deleted the shanbady/sparse-hybrid-search branch March 19, 2026 16:22