fix: support custom model architectures and restore model param wiring in search #79

Open
eugenepro2 wants to merge 1 commit into tirth8205:main from eugenepro2:fix/trust-remote-code

Conversation

@eugenepro2

Title: fix: support custom model architectures and restore model param wiring in search

Summary

  • Pass trust_remote_code=True in LocalEmbeddingProvider so models with custom architectures (e.g. jinaai/jina-embeddings-v2-base-code) load correctly via SentenceTransformer
  • Restore model parameter wiring through the search path (semantic_search_nodes -> hybrid_search -> _embedding_search -> EmbeddingStore) lost during the v2 refactoring

Problem

1. Broken embeddings with custom-architecture models

jinaai/jina-embeddings-v2-base-code uses a custom JinaBertModel with ALiBi positional embeddings. Without trust_remote_code=True, transformers falls back to standard BertModel and randomly initializes missing weights (position_embeddings, all encoder layers).

This produces garbage embeddings where all cosine similarities converge to ~0.77, making semantic search return effectively random results. The issue is silent — no errors are raised, embeddings are stored and searched, but results are meaningless.

Before fix (broken):

Query: "authentication and session management"
  0.7738  renderIndex          (unrelated)
  0.7738  escapeRegExp         (unrelated)
  0.7738  CCookies             (unrelated)

After fix:

Query: "authentication and session management"
  0.7012  getSession           ✓
  0.5719  useSession           ✓
  0.5317  AuthPage             ✓
  0.4914  TrpcAuthMiddleware   ✓
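
The two transcripts differ sharply in similarity spread, which makes the failure mode detectable. As a quick numpy check (illustrative only, not part of this PR; `similarities_collapsed` is a hypothetical helper):

```python
import numpy as np

def similarities_collapsed(embeddings: np.ndarray, tol: float = 1e-2) -> bool:
    """Return True when all pairwise cosine similarities are nearly identical,
    the signature of degenerate (e.g. randomly initialized) embeddings."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T                                # pairwise cosine matrix
    off_diag = sims[~np.eye(len(e), dtype=bool)]  # drop self-similarity
    return float(off_diag.max() - off_diag.min()) < tol
```

A healthy model produces a wide spread of similarities (as in the "after" transcript); the broken load produces near-constant values (~0.77 everywhere).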

2. model parameter dropped during v2 refactoring

In PR #55 (v1.x), semantic_search_nodes passed model directly to EmbeddingStore:

```python
emb_store = EmbeddingStore(db_path, model=model)  # v1.x — worked
```

In v2.0.0, tools.py was split into tools/ sub-modules and search.py was extracted. The model parameter is still accepted by the semantic_search_nodes_tool MCP wrapper, but it is dropped before it reaches EmbeddingStore:

```python
# search.py — v2.0.0
emb_store = EmbeddingStore(store.db_path)  # model lost
```

This forces users to rely solely on the CRG_EMBEDDING_MODEL environment variable.
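
The restored wiring can be sketched with stubbed-out functions (names mirror the PR's call chain; the bodies are illustrative, not the real implementations):

```python
import os

class EmbeddingStore:
    """Stub of the real store: records whichever model name it is given."""
    def __init__(self, db_path, model=None):
        self.db_path = db_path
        # None falls back to the CRG_EMBEDDING_MODEL env var, as before
        self.model = model or os.environ.get("CRG_EMBEDDING_MODEL")

def _embedding_search(store, query, model=None):
    # The fix: forward model instead of dropping it here
    return EmbeddingStore(store.db_path, model=model)

def hybrid_search(store, query, model=None):
    return _embedding_search(store, query, model=model)

def semantic_search_nodes(store, query, model=None):
    return hybrid_search(store, query, model=model)
```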

Changes

  • embeddings.py — Pass trust_remote_code=True and model_kwargs={"trust_remote_code": True} to SentenceTransformer(). Safe for all models (flag is ignored when no custom code exists).
  • search.py — Add model parameter to _embedding_search() and hybrid_search(), forward to EmbeddingStore.
  • tools/query.py — Forward model from semantic_search_nodes() to hybrid_search().
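
In outline, the embeddings.py change amounts to two extra keyword arguments on the SentenceTransformer() call. A minimal sketch (the `build_st_kwargs` helper is hypothetical; the real code passes these inline):

```python
def build_st_kwargs(model_name: str) -> dict:
    """Kwargs for SentenceTransformer(**build_st_kwargs(name)).

    trust_remote_code=True lets sentence-transformers run the model's
    auto_map code (e.g. JinaBertModel); model_kwargs forwards the same
    flag to the underlying transformers AutoModel call.  Both are ignored
    by models that ship no custom code.
    """
    return {
        "model_name_or_path": model_name,
        "trust_remote_code": True,
        "model_kwargs": {"trust_remote_code": True},
    }
```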

Affected models

Any HuggingFace model with auto_map in its config, including:

  • jinaai/jina-embeddings-v2-base-code
  • jinaai/jina-embeddings-v2-base-en
  • Other models with custom architectures

Models without custom code (e.g. all-MiniLM-L6-v2, BAAI/bge-small-en-v1.5) are unaffected — the flag is silently ignored.
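
Whether a given model falls into the affected category can be read off its config.json. A hedged sketch (pure dict inspection; fetching the config file is left to the caller, and the sample configs below are illustrative, not copied from the real repos):

```python
import json

def needs_trust_remote_code(config_json: str) -> bool:
    """True when a model's config.json declares an auto_map section,
    i.e. it ships custom modeling code that only loads faithfully
    with trust_remote_code=True."""
    return "auto_map" in json.loads(config_json)
```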

