Pluggable embedding backend — allow OpenAI (and other) encoders alongside Sentence-Transformers #368

@aminsmd

Motivation

FeatureBuilder currently depends on a single, hardcoded embedding model: SentenceTransformer('all-MiniLM-L6-v2'), initialized at module import in src/team_comm_tools/utils/check_embeddings.py:19. Every vector-based feature (Forward Flow, BERT Mimicry, Discursive Diversity, user centroids) reads message_embedding produced by this model.

There are a few reasons a user might want to swap this out:

  • Consistency with an existing pipeline. Some research groups already embed chat data with OpenAI's text-embedding-3-* (or another provider) for retrieval/clustering/LLM-adjacent tasks, and want their TCT-derived features computed over the same vectors so downstream comparisons aren't confounded by the embedding model.
  • Quality / dimensionality. MiniLM is 384-dim and tuned for semantic similarity on general English. Larger or domain-specific encoders may produce more discriminative mimicry/forward-flow signals for longer, technical, or multilingual transcripts.
  • Privacy-constrained environments. Some teams must run entirely on-device (no cloud calls) and want to pin a different local model — e.g., a larger mpnet or a domain-tuned model — without forking the package.

The current code doesn't support any of these without either monkey-patching the module-level model_vect or forking.

Proposal

Add an optional embedding_fn (or encoder) parameter to FeatureBuilder.__init__. The default stays None, which falls through to today's SentenceTransformer('all-MiniLM-L6-v2') — so this is fully backward-compatible.

Sketch:

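from typing import Callable, List, Optional

import numpy as np

class FeatureBuilder:
    def __init__(
        self,
        input_df,
        # ... existing parameters unchanged ...
        embedding_fn: Optional[Callable[[List[str]], np.ndarray]] = None,
        embedding_dim: Optional[int] = None,  # for sanity checks / cache invalidation
        # ... remaining parameters unchanged ...
    ):
        # None falls back to today's MiniLM encoder
        # (_default_minilm_encoder is a placeholder name for that wrapper)
        self.embedding_fn = embedding_fn or _default_minilm_encoder
        self.embedding_dim = embedding_dim
        ...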

Then check_embeddings.generate_vect calls the injected encoder, embedding_fn(messages), instead of the module-global model; FeatureBuilder passes self.embedding_fn down to it. Consumers supply whatever encoder they want:

# OpenAI (requires `pip install openai` and OPENAI_API_KEY set in the environment)
import numpy as np
import openai

def openai_encoder(texts):
    # Sends all texts in one request; returns an array of shape (len(texts), 1536)
    r = openai.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in r.data])

fb = FeatureBuilder(..., embedding_fn=openai_encoder, embedding_dim=1536)

# Custom sentence-transformers model (all-mpnet-base-v2 produces 768-dim vectors)
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
fb = FeatureBuilder(..., embedding_fn=lambda t: m.encode(t, show_progress_bar=False), embedding_dim=768)
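
Because the parameter is a bare callable, the package would probably want to validate whatever comes back before it reaches the feature code. A minimal sketch of such a check follows; encode_checked is a hypothetical helper name, not something that exists in the package today, and where it would be called from (generate_vect or FeatureBuilder) is left open.

```python
from typing import Callable, List, Optional

import numpy as np

Encoder = Callable[[List[str]], np.ndarray]


def encode_checked(
    embedding_fn: Encoder,
    messages: List[str],
    embedding_dim: Optional[int] = None,
) -> np.ndarray:
    """Run a user-supplied encoder and verify the shape of what it returns.

    Hypothetical helper for illustration only.
    """
    vectors = np.asarray(embedding_fn(messages), dtype=np.float32)

    # Expect exactly one row per input message.
    if vectors.ndim != 2 or vectors.shape[0] != len(messages):
        raise ValueError(
            f"encoder returned shape {vectors.shape}; "
            f"expected ({len(messages)}, dim)"
        )

    # If the caller declared a dimension, enforce it.
    if embedding_dim is not None and vectors.shape[1] != embedding_dim:
        raise ValueError(
            f"encoder returned {vectors.shape[1]}-dim vectors; "
            f"expected embedding_dim={embedding_dim}"
        )
    return vectors
```

Failing fast here would turn a mismatched encoder (for example, embedding_dim=1536 declared but a 768-dim model supplied) into an immediate error rather than a confusing failure inside a downstream feature.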

Cache correctness

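The vector_directory cache is the main thing that needs care: switching encoders between runs must invalidate the cached vectors, otherwise features are silently computed over embeddings from the previous encoder. Two options: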

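  1. Include an encoder fingerprint (embedding_fn.__qualname__ or a user-supplied embedding_backend_id string, plus embedding_dim) in the cache-file name / header. The user-supplied ID is the more reliable of the two, since a lambda's __qualname__ ends in "<lambda>" regardless of which model it wraps.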
  2. Keep the current behavior (user sets regenerate_vectors=True when they switch) and just document it.

I'd lean toward option 1 since silent cache poisoning is easy to miss; happy to do either.
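
A rough sketch of what option 1 could look like, assuming the fingerprint is built from a user-supplied embedding_backend_id plus embedding_dim and appended to the cache file's base name. The function names and the naming scheme are hypothetical; the existing cache layout may call for a different place to store the fingerprint.

```python
import hashlib
from typing import Optional


def encoder_fingerprint(backend_id: str, embedding_dim: Optional[int]) -> str:
    """Short, stable hash identifying the encoder that produced a cache file."""
    raw = f"{backend_id}|{embedding_dim}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]


def cache_stem(base_name: str, backend_id: str, embedding_dim: Optional[int]) -> str:
    """Cache file stem that changes whenever the encoder identity changes.

    Hypothetical naming scheme: <base_name>__<fingerprint>.
    """
    return f"{base_name}__{encoder_fingerprint(backend_id, embedding_dim)}"


# Switching encoders yields a different stem, so the old cache is never reused.
minilm = cache_stem("chats", "all-MiniLM-L6-v2", 384)
openai_small = cache_stem("chats", "text-embedding-3-small", 1536)
```

With this scheme a run with a new encoder simply misses the old cache file and regenerates, and switching back finds the earlier vectors still in place.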

Scope / non-goals

  • Not proposing a dependency on openai or any specific SDK. The parameter is a bare callable — the package stays dependency-free on that axis. OpenAI integration would be a user-land pip install openai + ~3-line wrapper function, exactly as in the sketch above.
  • Not proposing to change the default behavior or break any existing notebook.
  • Not touching feature math — the downstream consumers of message_embedding (forward flow, mimicry, discursive diversity, user centroids) should be agnostic to the embedding source as long as vectors are consistent within a run.

Prior issues / context

Willing to contribute

Happy to send a PR against main with:

  • The embedding_fn parameter + default fallback
  • Cache-fingerprint update (option 1 above)
  • Unit tests covering (a) default-path equivalence vs current behavior, (b) a fake-encoder injection path, (c) cache-invalidation-on-encoder-change
  • No new required dependencies
  • Docs snippet
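
For the fake-encoder injection test in (b), a deterministic stand-in along these lines might be enough. This is a sketch only: fake_encoder and its default dim of 8 are illustrative choices, and the vectors carry no semantic meaning.

```python
import hashlib
from typing import List

import numpy as np


def fake_encoder(texts: List[str], dim: int = 8) -> np.ndarray:
    """Deterministic stand-in encoder: the same text always maps to the same unit vector."""
    rows = []
    for text in texts:
        # Seed from a stable hash of the text (Python's built-in hash() is salted per process).
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        vec = np.random.default_rng(seed).standard_normal(dim)
        rows.append(vec / np.linalg.norm(vec))
    # reshape keeps the (n, dim) shape even when texts is empty
    return np.array(rows, dtype=np.float32).reshape(len(texts), dim)
```

A test could pass this as embedding_fn, run the pipeline twice, and assert that the vector-based features are identical across runs without downloading any model or calling an external API.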

Does this direction look reasonable? If yes, I'll open a PR; if you'd rather I approach this differently (e.g., a plugin registry, a config file, a subclass API instead of a callable), I'd rather calibrate on the API shape before writing the code.
