Pluggable embedding backend: allow OpenAI (and other) encoders alongside Sentence-Transformers #368

## Motivation

`FeatureBuilder` currently depends on a single, hardcoded embedding model: `SentenceTransformer('all-MiniLM-L6-v2')`, initialized at module import in `src/team_comm_tools/utils/check_embeddings.py:19`. Every vector-based feature (Forward Flow, BERT Mimicry, Discursive Diversity, user centroids) reads `message_embedding` produced by this model.

There are a few reasons a user might want to swap this out:

- **Consistency with an existing pipeline.** Some research groups already embed chat data with OpenAI's `text-embedding-3-*` (or another provider) for retrieval/clustering/LLM-adjacent tasks, and want their TCT-derived features computed over the *same* vectors so downstream comparisons aren't confounded by the embedding model.
- **Quality / dimensionality.** MiniLM is 384-dim and tuned for semantic similarity on general English. Larger or domain-specific encoders may produce more discriminative mimicry/forward-flow signals for longer, technical, or multilingual transcripts.
- **Privacy-constrained environments.** Some teams must run entirely on-device (no cloud calls) and want to pin a *different* local model — e.g., a larger mpnet or a domain-tuned model — without forking the package.

The current code doesn't support any of these without either monkey-patching the module-level `model_vect` or forking.

## Proposal

Add an optional `embedding_fn` (or `encoder`) parameter to `FeatureBuilder.__init__`. The default stays `None`, which falls through to today's `SentenceTransformer('all-MiniLM-L6-v2')` — so **this is fully backward-compatible**.

Sketch:

```python
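from typing import Callable, List, Optional

import numpy as np


class FeatureBuilder:
    def __init__(
        self,
        input_df,
        # ... (existing parameters unchanged)
        embedding_fn: Optional[Callable[[List[str]], np.ndarray]] = None,
        embedding_dim: Optional[int] = None,  # for sanity checks / cache invalidation
        # ... (remaining parameters unchanged)
    ):
        # None keeps today's behavior; `_default_minilm_encoder` is a proposed
        # helper that wraps the current all-MiniLM-L6-v2 model.
        self.embedding_fn = embedding_fn if embedding_fn is not None else _default_minilm_encoder
        self.embedding_dim = embedding_dim
        ...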
```

Then `check_embeddings.generate_vect` calls the configured encoder, `embedding_fn(messages)`, passed down from `FeatureBuilder`, instead of a module-global model. Consumers supply whatever encoder they want:

```python
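# OpenAI (user-land `pip install openai`; the SDK reads OPENAI_API_KEY from the environment)
import numpy as np
import openai

def openai_encoder(texts):
    r = openai.embeddings.create(model="text-embedding-3-small", input=list(texts))
    # Sort by the index each item carries so rows line up with the input texts
    return np.array([d.embedding for d in sorted(r.data, key=lambda d: d.index)])

fb = FeatureBuilder(..., embedding_fn=openai_encoder, embedding_dim=1536)

# Custom sentence-transformers model (all-mpnet-base-v2 produces 768-dim vectors)
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
fb = FeatureBuilder(
    ...,
    embedding_fn=lambda t: m.encode(t, show_progress_bar=False),
    embedding_dim=768,
)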
```
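
To make the `embedding_dim` sanity check concrete, here is a minimal sketch of a wrapper that could sit between `FeatureBuilder` and whatever callable the user supplies. `validated_encoder` and the `Encoder` alias are hypothetical names I'm using for illustration, not existing package API, and the exact checks are open to discussion:

```python
from typing import Callable, List, Optional, Sequence

import numpy as np

# Hypothetical alias for the proposed callable contract: list of texts -> 2-D array.
Encoder = Callable[[List[str]], np.ndarray]


def validated_encoder(embedding_fn: Encoder, embedding_dim: Optional[int] = None) -> Encoder:
    """Wrap an encoder so malformed output raises instead of reaching the features."""

    def encode(texts: Sequence[str]) -> np.ndarray:
        texts = list(texts)
        if not texts:
            # Nothing to embed; return an empty matrix with a consistent rank.
            return np.empty((0, embedding_dim or 0), dtype=np.float32)

        vectors = np.asarray(embedding_fn(texts), dtype=np.float32)

        # One row per input text is the minimum contract downstream code needs.
        if vectors.ndim != 2 or vectors.shape[0] != len(texts):
            raise ValueError(
                f"encoder returned shape {vectors.shape}, expected ({len(texts)}, dim)"
            )
        # Only enforce the width when the caller declared one.
        if embedding_dim is not None and vectors.shape[1] != embedding_dim:
            raise ValueError(
                f"encoder returned dim {vectors.shape[1]}, expected {embedding_dim}"
            )
        if not np.isfinite(vectors).all():
            raise ValueError("encoder returned NaN or infinite values")
        return vectors

    return encode
```

With something like this, a mismatch between a declared `embedding_dim=1536` and an encoder that actually returns 768-dim vectors would fail on the first batch rather than silently producing features over the wrong vectors.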

## Cache correctness

The `vector_directory` cache is the main thing that needs care: switching encoders between runs must invalidate any previously cached vectors. Two options:

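1. Include an encoder fingerprint (`embedding_fn.__qualname__` or a user-supplied `embedding_backend_id` string, plus `embedding_dim`) in the cache-file name / header. The user-supplied ID is the more reliable of the two, since every lambda's `__qualname__` ends in `<lambda>`, so two different inline encoders can share a fingerprint.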
2. Keep the current behavior (user sets `regenerate_vectors=True` when they switch) and just document it.

I'd lean toward option 1 since silent cache poisoning is easy to miss; happy to do either.
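
For option 1, a rough sketch of what the fingerprint could look like. The function names, the 12-character hash length, and the file-naming scheme are all assumptions on my part; the real cache layout under `vector_directory` would dictate where the fingerprint actually goes:

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def encoder_fingerprint(
    embedding_fn: Optional[Callable] = None,
    embedding_dim: Optional[int] = None,
    embedding_backend_id: Optional[str] = None,
) -> str:
    """Short, stable identifier for an encoder configuration (hypothetical helper)."""
    if embedding_backend_id is not None:
        # Preferred: an explicit, user-chosen ID such as "openai/text-embedding-3-small".
        name = embedding_backend_id
    else:
        # Fallback: module + qualified name. Weak for lambdas, which all end in "<lambda>".
        module = getattr(embedding_fn, "__module__", "unknown")
        qualname = getattr(embedding_fn, "__qualname__", type(embedding_fn).__qualname__)
        name = f"{module}.{qualname}"
    raw = f"{name}|dim={embedding_dim}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


def fingerprinted_cache_path(vector_directory: str, stem: str, fingerprint: str) -> Path:
    """Illustrative naming scheme only: <vector_directory>/<stem>.<fingerprint>.csv"""
    return Path(vector_directory) / f"{stem}.{fingerprint}.csv"
```

A cache file written under one fingerprint is simply not found when the encoder changes, so the vectors get regenerated without the user having to remember `regenerate_vectors=True`. The fallback branch shows why I'd want `embedding_backend_id` to be available: two different lambdas defined in the same scope hash to the same value.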

## Scope / non-goals

- Not proposing a dependency on `openai` or any specific SDK. The parameter is a bare callable — the package stays dependency-free on that axis. OpenAI integration would be a user-land `pip install openai` + ~3-line wrapper function, exactly as in the sketch above.
- Not proposing to change the *default* behavior or break any existing notebook.
- Not touching feature math — the downstream consumers of `message_embedding` (forward flow, mimicry, discursive diversity, user centroids) should be agnostic to the embedding source as long as vectors are consistent within a run.

## Prior issues / context

- Looked through closed issues — #120, #122, #198 all touch the initial embedding implementation but none discuss pluggable backends. I don't see an open issue for this; happy to be pointed at one if I missed it.
- Installed version on my side is 0.1.0; confirmed the same pattern is still present on `main` (v0.1.7) at `utils/check_embeddings.py:19`.

## Willing to contribute

Happy to send a PR against `main` with:

- The `embedding_fn` parameter + default fallback
- Cache-fingerprint update (option 1 above)
- Unit tests covering (a) default-path equivalence vs current behavior, (b) a fake-encoder injection path, (c) cache-invalidation-on-encoder-change
- No new required dependencies
- Docs snippet
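
For the fake-encoder tests in (b), I have something like the following in mind: a deterministic stand-in that needs no model download or network call, plus a small wrapper to assert that the injected callable was actually used. Both `fake_encoder` and `CountingEncoder` are hypothetical test helpers, not existing code:

```python
import hashlib

import numpy as np


def fake_encoder(texts, dim=8):
    """Deterministic stand-in encoder: identical text always maps to the same unit vector."""
    rows = []
    for text in texts:
        # Seed a generator from a hash of the text so results are stable across runs.
        seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
        vec = np.random.default_rng(seed).standard_normal(dim)
        rows.append(vec / np.linalg.norm(vec))
    return np.array(rows).reshape(len(rows), dim)


class CountingEncoder:
    """Wraps an encoder and records how often, and on how many texts, it was called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0
        self.texts_seen = 0

    def __call__(self, texts):
        texts = list(texts)
        self.calls += 1
        self.texts_seen += len(texts)
        return self.fn(texts)
```

A test would then pass `embedding_fn=CountingEncoder(fake_encoder)` through the proposed parameter and assert that `calls > 0` and that `message_embedding` has the fake encoder's dimensionality rather than MiniLM's 384.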

Does this direction look reasonable? If yes, I'll open a PR; if you'd rather I approach this differently (e.g., a plugin registry, a config file, a subclass API instead of a callable), I'd rather calibrate on the API shape before writing the code.
