Motivation
FeatureBuilder currently depends on a single, hardcoded embedding model: SentenceTransformer('all-MiniLM-L6-v2'), initialized at module import in src/team_comm_tools/utils/check_embeddings.py:19. Every vector-based feature (Forward Flow, BERT Mimicry, Discursive Diversity, user centroids) reads message_embedding produced by this model.
There are a few reasons a user might want to swap this out:
- Consistency with an existing pipeline. Some research groups already embed chat data with OpenAI's text-embedding-3-* (or another provider) for retrieval/clustering/LLM-adjacent tasks, and want their TCT-derived features computed over the same vectors so downstream comparisons aren't confounded by the embedding model.
- Quality / dimensionality. MiniLM is 384-dim and tuned for semantic similarity on general English. Larger or domain-specific encoders may produce more discriminative mimicry/forward-flow signals for longer, technical, or multilingual transcripts.
- Privacy-constrained environments. Some teams must run entirely on-device (no cloud calls) and want to pin a different local model — e.g., a larger mpnet or a domain-tuned model — without forking the package.
The current code doesn't support any of these without either monkey-patching the module-level model_vect or forking.
Proposal
Add an optional embedding_fn (or encoder) parameter to FeatureBuilder.__init__. The default stays None, which falls through to today's SentenceTransformer('all-MiniLM-L6-v2') — so this is fully backward-compatible.
Sketch:
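    from typing import Callable, List, Optional

    import numpy as np

    class FeatureBuilder:
        def __init__(
            self,
            input_df,
            # ... existing parameters unchanged ...
            embedding_fn: Optional[Callable[[List[str]], np.ndarray]] = None,
            embedding_dim: Optional[int] = None,  # for sanity checks / cache invalidation
            # ...
        ):
            # Fall back to today's MiniLM encoder when no callable is supplied.
            self.embedding_fn = embedding_fn or _default_minilm_encoder
            ...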
Then check_embeddings.generate_vect calls self.embedding_fn(messages) instead of a module-global model. Consumers supply whatever encoder they want:
    # OpenAI (needs a user-side `pip install openai`)
    import numpy as np
    import openai

    def openai_encoder(texts):
        # One request per batch; returns an array of shape (len(texts), 1536).
        r = openai.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in r.data])

    fb = FeatureBuilder(..., embedding_fn=openai_encoder, embedding_dim=1536)

    # Custom sentence-transformers model
    from sentence_transformers import SentenceTransformer

    m = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    fb = FeatureBuilder(..., embedding_fn=lambda t: m.encode(t, show_progress_bar=False))
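One way the embedding step could consume the injected callable is sketched below. The helper name embed_messages and its error messages are hypothetical, not existing package code; the sketch only illustrates how embedding_dim might act as a sanity check on whatever the user's encoder returns.

```python
from typing import Callable, List, Optional

import numpy as np


def embed_messages(
    messages: List[str],
    embedding_fn: Callable[[List[str]], np.ndarray],
    embedding_dim: Optional[int] = None,
) -> np.ndarray:
    """Run a user-supplied encoder and validate its output (hypothetical helper)."""
    if not messages:
        # Nothing to encode; return an empty matrix with the expected width.
        return np.empty((0, embedding_dim or 0))

    vectors = np.asarray(embedding_fn(list(messages)), dtype=float)

    # The encoder must return exactly one row per input message.
    if vectors.ndim != 2 or vectors.shape[0] != len(messages):
        raise ValueError(
            f"expected one vector per message, got array of shape {vectors.shape}"
        )

    # Optional width check, only applied when the caller declared a dimension.
    if embedding_dim is not None and vectors.shape[1] != embedding_dim:
        raise ValueError(
            f"expected {embedding_dim}-dim vectors, got {vectors.shape[1]}"
        )
    return vectors
```

Failing fast here would surface a misconfigured encoder before any vectors are written to the cache or read by the downstream features.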
Cache correctness
The vector_directory cache is the main thing that needs care — switching encoders between runs must invalidate. Two options:
- Include an encoder fingerprint (embedding_fn.__qualname__ or a user-supplied embedding_backend_id string, plus embedding_dim) in the cache-file name / header.
- Keep the current behavior (user sets regenerate_vectors=True when they switch) and just document it.
I'd lean toward option 1 since silent cache poisoning is easy to miss; happy to do either.
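A rough sketch of what the fingerprint in option 1 could look like follows. The function names, the 12-character hash prefix, and the file-naming scheme are all assumptions made for illustration; the package's actual cache layout may differ.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def encoder_fingerprint(
    embedding_fn: Callable,
    embedding_dim: Optional[int] = None,
    embedding_backend_id: Optional[str] = None,
) -> str:
    """Short, stable ID for an encoder configuration (hypothetical helper)."""
    # Prefer the explicit ID: every module-level lambda shares the
    # __qualname__ "<lambda>", and objects such as functools.partial
    # have no __qualname__ at all, so the name alone can collide.
    name = embedding_backend_id or getattr(
        embedding_fn, "__qualname__", type(embedding_fn).__name__
    )
    raw = f"{name}|{embedding_dim}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def cache_path(vector_directory: str, dataset_name: str, fingerprint: str) -> Path:
    """Build a cache-file path that changes whenever the fingerprint changes."""
    # Hypothetical naming scheme: <dataset>__<fingerprint>.csv
    return Path(vector_directory) / f"{dataset_name}__{fingerprint}.csv"
```

Because the fingerprint is part of the file name, switching encoders produces a cache miss rather than silently reusing vectors from the previous model. The lambda collision noted in the comments is one reason a user-supplied embedding_backend_id may be worth supporting alongside __qualname__.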
Scope / non-goals
- Not proposing a dependency on openai or any specific SDK. The parameter is a bare callable, so the package stays dependency-free on that axis. OpenAI integration would be a user-land pip install openai + ~3-line wrapper function, exactly as in the sketch above.
- Not proposing to change the default behavior or break any existing notebook.
- Not touching feature math: the downstream consumers of message_embedding (forward flow, mimicry, discursive diversity, user centroids) should be agnostic to the embedding source as long as vectors are consistent within a run.
Prior issues / context
- The hardcoded model is on main (v0.1.7) at utils/check_embeddings.py:19.
Willing to contribute
Happy to send a PR against main with:
- The embedding_fn parameter + default fallback
- Cache-fingerprint update (option 1 above)
- Unit tests covering (a) default-path equivalence vs current behavior, (b) a fake-encoder injection path, (c) cache-invalidation-on-encoder-change
- No new required dependencies
- Docs snippet
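For the fake-encoder injection test in (b), something like the following deterministic stand-in might be enough. The name make_fake_encoder and the hash-seeded approach are assumptions for illustration; the point is only that tests can exercise the injection path without downloading a model or calling an API.

```python
import hashlib
from typing import Callable, List

import numpy as np


def make_fake_encoder(dim: int = 8) -> Callable[[List[str]], np.ndarray]:
    """Return a deterministic, dependency-free encoder for tests (hypothetical)."""

    def fake_encoder(texts: List[str]) -> np.ndarray:
        rows = []
        for text in texts:
            # Seed a generator from the text's hash so the same text always
            # yields the same vector, across calls and across processes.
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            seed = int.from_bytes(digest[:8], "big")
            rows.append(np.random.default_rng(seed).standard_normal(dim))
        return np.vstack(rows) if rows else np.empty((0, dim))

    return fake_encoder
```

A test could pass make_fake_encoder(dim=8) as embedding_fn and then assert that the resulting message_embedding values have width 8, which the default 384-dim MiniLM path could not produce.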
Does this direction look reasonable? If yes, I'll open a PR; if you'd rather I approach this differently (e.g., a plugin registry, a config file, a subclass API instead of a callable), I'd rather calibrate on the API shape before writing the code.