fix: support custom model architectures and restore model param wiring in search #79
Open
eugenepro2 wants to merge 1 commit into tirth8205:main from
Summary
- Add `trust_remote_code=True` in `LocalEmbeddingProvider` so models with custom architectures (e.g. `jinaai/jina-embeddings-v2-base-code`) load correctly via `SentenceTransformer`
- Restore `model` parameter wiring through the search path (`semantic_search_nodes` -> `hybrid_search` -> `_embedding_search` -> `EmbeddingStore`), lost during the v2 refactoring

Problem
1. Broken embeddings with custom-architecture models
`jinaai/jina-embeddings-v2-base-code` uses a custom `JinaBertModel` with ALiBi positional embeddings. Without `trust_remote_code=True`, `transformers` falls back to a standard `BertModel` and randomly initializes the missing weights (`position_embeddings`, all encoder layers). This produces garbage embeddings in which all cosine similarities converge to ~0.77, making semantic search return effectively random results. The issue is silent: no errors are raised, embeddings are stored and searched, but the results are meaningless.
Before fix (broken):
After fix:
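The symptom is easy to check with a quick sanity test: healthy embeddings of unrelated texts produce pairwise cosine similarities spread over a wide range, while the broken load collapses them all to roughly one value. A minimal sketch of such a check (numpy only; the function name and thresholds are illustrative, not part of the PR):

```python
import numpy as np

def cosine_similarity_spread(embeddings: np.ndarray) -> float:
    """Max minus min off-diagonal pairwise cosine similarity.

    A spread near zero means every pair of texts looks equally similar,
    which is the signature of randomly initialized (garbage) embeddings.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    off_diag = sims[~np.eye(len(sims), dtype=bool)]
    return float(off_diag.max() - off_diag.min())

rng = np.random.default_rng(0)

# Healthy embeddings of unrelated texts: directions differ, wide spread.
healthy = rng.standard_normal((8, 64))

# Degenerate embeddings: one shared direction plus tiny noise, so every
# pairwise similarity lands near the same value (cf. the ~0.77 symptom).
base = rng.standard_normal(64)
degenerate = base + 0.01 * rng.standard_normal((8, 64))
```

Running this check after switching embedding models is a cheap way to catch the silent failure mode, since no exception is ever raised.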
2. `model` parameter dropped during v2 refactoring

In PR #55 (v1.x), `semantic_search_nodes` passed `model` directly to `EmbeddingStore`. In v2.0.0, `tools.py` was split into `tools/` sub-modules and `search.py` was extracted. The `model` parameter is accepted by the `semantic_search_nodes_tool` MCP wrapper but dropped before it reaches `EmbeddingStore`, which forces users to rely solely on the `CRG_EMBEDDING_MODEL` env var.

Changes
- `embeddings.py`: pass `trust_remote_code=True` and `model_kwargs={"trust_remote_code": True}` to `SentenceTransformer()`. Safe for all models (the flag is ignored when no custom code exists).
- `search.py`: add a `model` parameter to `_embedding_search()` and `hybrid_search()`, forwarded to `EmbeddingStore`.
- `tools/query.py`: forward `model` from `semantic_search_nodes()` to `hybrid_search()`.

Affected models
Any HuggingFace model with `auto_map` in its config, including:

- `jinaai/jina-embeddings-v2-base-code`
- `jinaai/jina-embeddings-v2-base-en`

Models without custom code (e.g. `all-MiniLM-L6-v2`, `BAAI/bge-small-en-v1.5`) are unaffected: the flag is silently ignored.
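The restored parameter wiring can be illustrated with stubs. Everything below is a hypothetical sketch, not the project's actual code: `EmbeddingStore` is replaced by a stand-in that only records which model name reaches it, and the `CRG_EMBEDDING_MODEL` env-var fallback is simulated with a constant.

```python
from typing import Optional

# Stand-in for the CRG_EMBEDDING_MODEL env-var fallback.
DEFAULT_MODEL = "all-MiniLM-L6-v2"

class EmbeddingStore:
    """Stand-in for the real store; records the model it was built with."""
    def __init__(self, model: Optional[str] = None) -> None:
        self.model = model or DEFAULT_MODEL

def _embedding_search(query: str, model: Optional[str] = None) -> str:
    # After the fix, an explicitly requested model reaches the store.
    return EmbeddingStore(model=model).model

def hybrid_search(query: str, model: Optional[str] = None) -> str:
    return _embedding_search(query, model=model)

def semantic_search_nodes(query: str, model: Optional[str] = None) -> str:
    # Before the fix, `model` was accepted at this layer but dropped,
    # so the store always fell back to the env-var default.
    return hybrid_search(query, model=model)
```

With the parameter forwarded at every hop, a per-call model such as `jinaai/jina-embeddings-v2-base-code` overrides the default, while omitting it still falls back to the env-var setting.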