Skip to content

Feature/compound keys#3

Open
whilo wants to merge 9 commits into
mainfrom
feature/compound-keys
Open

Feature/compound keys#3
whilo wants to merge 9 commits into
mainfrom
feature/compound-keys

Conversation

@whilo
Copy link
Copy Markdown
Member

@whilo whilo commented Feb 19, 2026

No description provided.

Enables multi-field indexing (Cozo-style) and ColBERT token-level
indexing. External IDs can now be vectors like ["doc-1" :title]
or ["doc-1" :content 0].

- Added compare-coll for lexicographic comparison of nested collections
- Updated external-id-comparator to support both simple and compound IDs
- Backward compatible: simple string/UUID/Long IDs still work
Tests cover:
- compare-coll with simple values, vectors, nested structures
- Compound keys in external-id-index
- Mixed compound and simple keys
- Compound key uniqueness enforcement
Enables multi-vector document representation with MaxSim scoring:

- insert-document: Insert token vectors with compound keys [doc-id :token idx]
- insert-documents: Batch insert multiple documents
- maxsim-search: Token-level HNSW search + MaxSim aggregation
- maxsim-search-filtered: MaxSim with document ID filtering

ColBERT outperforms single-vector models on complex queries by
capturing token-level semantic interactions.

References:
- ColBERT (SIGIR'20): https://arxiv.org/abs/2004.12832
- ColBERTv2 (NAACL'22): https://arxiv.org/abs/2112.01488
Weighted Field Search:
- Combine results from different semantic fields (title, content, metadata)
- User-specified weights for each field
- Optional required field constraints

Hybrid Search:
- Combine weighted field search with ColBERT MaxSim
- Configurable alpha parameter for field vs token-level balance
- Returns both field and MaxSim scores for transparency

API:
- weighted-field-search: Single query across multiple fields
- weighted-field-search-with-constraints: Require matches in specific fields
- hybrid-search: Combine field + token-level matching
- Uses sentence-transformers to generate real embeddings
- Tests ColBERT MaxSim search with token-level embeddings
- Tests weighted field search with real semantic vectors
- Adds data.json to test dependencies
- Marks integration tests with ^:integration metadata

Integration tests use simulated token embeddings (chunked text)
since ColBERT models require C++ extension compilation.
- docs/multi-field-colbert.md: Comprehensive documentation of
  compound keys, weighted field search, ColBERT, and hybrid search
- test/data/colbert_minimal.json: Minimal synthetic test data (666 bytes)
  for CI/CD without Python dependency
- test/generate_colbert_data.py: Script to generate real embeddings
  with sentence-transformers (optional)
- Updated integration test to prefer committed minimal data

Users can run tests without Python. To test with real embeddings:
  python -m venv .venv && . .venv/bin/activate
  pip install sentence-transformers
  python test/generate_colbert_data.py > /tmp/colbert_data.json
  clojure -M:test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant