Feature/compound keys #3
Open
whilo wants to merge 9 commits into
Enables multi-field indexing (Cozo-style) and ColBERT token-level indexing. External IDs can now be vectors like ["doc-1" :title] or ["doc-1" :content 0].

- Added compare-coll for lexicographic comparison of nested collections
- Updated external-id-comparator to support both simple and compound IDs
- Backward compatible: simple string/UUID/Long IDs still work
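To illustrate what lexicographic comparison of nested collections means for compound keys, here is a minimal Python sketch. It is not the PR's Clojure `compare-coll`: the function name `compare_coll`, the helper `_rank`, the use of strings in place of keywords such as `:title`, and the ordering chosen between values of different kinds (numbers, then strings, then collections) are all assumptions made for illustration.

```python
from functools import cmp_to_key


def _rank(x):
    """Hypothetical tie-break between kinds: numbers < strings < collections."""
    if isinstance(x, (list, tuple)):
        return 2
    if isinstance(x, str):
        return 1
    return 0


def compare_coll(a, b):
    """Return -1, 0, or 1. Collections compare element by element, recursively;
    when one is a prefix of the other, the shorter one sorts first."""
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        for x, y in zip(a, b):
            c = compare_coll(x, y)
            if c != 0:
                return c
        return (len(a) > len(b)) - (len(a) < len(b))
    if _rank(a) != _rank(b):
        return (_rank(a) > _rank(b)) - (_rank(a) < _rank(b))
    return (a > b) - (a < b)


# Simple and compound IDs mixed in one index, as the PR describes.
keys = [
    ["doc-1", "title"],
    ["doc-1", "content", 1],
    "doc-0",
    ["doc-1", "content", 0],
    ["doc-1", "content"],
]
ordered = sorted(keys, key=cmp_to_key(compare_coll))
for k in ordered:
    print(k)
```

Under this ordering, all keys that share the prefix `["doc-1", "content"]` end up adjacent, which is what makes range scans over one document's fields or tokens possible in a sorted index.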
Tests cover:

- compare-coll with simple values, vectors, nested structures
- Compound keys in external-id-index
- Mixed compound and simple keys
- Compound key uniqueness enforcement
Enables multi-vector document representation with MaxSim scoring:

- insert-document: Insert token vectors with compound keys [doc-id :token idx]
- insert-documents: Batch insert multiple documents
- maxsim-search: Token-level HNSW search + MaxSim aggregation
- maxsim-search-filtered: MaxSim with document ID filtering

ColBERT outperforms single-vector models on complex queries by capturing token-level semantic interactions.

References:

- ColBERT (SIGIR'20): https://arxiv.org/abs/2004.12832
- ColBERTv2 (NAACL'22): https://arxiv.org/abs/2112.01488
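The MaxSim aggregation can be sketched as follows: for each query token, take the best similarity against any token of a document, then sum those maxima per document. This Python sketch is an illustration only. It scans every stored token by brute force instead of using HNSW candidate retrieval, uses dot-product similarity, and the names `maxsim_scores` and `dot` are hypothetical, not the PR's API.

```python
from collections import defaultdict


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_scores(query_tokens, index):
    """Score documents by summing, over query tokens, the maximum similarity
    to any of the document's token vectors.

    `index` maps compound keys (doc_id, "token", idx) to token vectors,
    mirroring the [doc-id :token idx] key layout described above.
    """
    best = defaultdict(lambda: [float("-inf")] * len(query_tokens))
    for (doc_id, _field, _idx), vec in index.items():
        row = best[doc_id]
        for qi, q in enumerate(query_tokens):
            row[qi] = max(row[qi], dot(q, vec))
    return {doc_id: sum(row) for doc_id, row in best.items()}


index = {
    ("doc-1", "token", 0): [1.0, 0.0],
    ("doc-1", "token", 1): [0.0, 1.0],
    ("doc-2", "token", 0): [0.6, 0.8],
}
query = [[1.0, 0.0], [0.0, 1.0]]
scores = maxsim_scores(query, index)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Here `doc-1` scores 2.0 because each query token has an exact match among its tokens, while `doc-2` has a single token that only partially matches both and scores about 1.4. In an HNSW-backed version, only the tokens returned by the approximate search would take part in the aggregation, so scores could differ from this exhaustive variant.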
Weighted Field Search:

- Combine results from different semantic fields (title, content, metadata)
- User-specified weights for each field
- Optional required field constraints

Hybrid Search:

- Combine weighted field search with ColBERT MaxSim
- Configurable alpha parameter for field vs token-level balance
- Returns both field and MaxSim scores for transparency

API:

- weighted-field-search: Single query across multiple fields
- weighted-field-search-with-constraints: Require matches in specific fields
- hybrid-search: Combine field + token-level matching
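The commit message does not give the exact scoring formulas, so the following Python sketch rests on two stated assumptions: field scores are combined as a weighted sum, and the hybrid score is `alpha * field + (1 - alpha) * maxsim`. The function names `weighted_field_scores` and `hybrid_scores` and the shape of the inputs are hypothetical and do not reflect the PR's signatures.

```python
def weighted_field_scores(field_hits, weights, required=()):
    """Combine per-field similarities into one score per document.

    field_hits: {field: {doc_id: similarity}}, one entry per searched field.
    weights:    {field: weight}; fields without a weight contribute nothing.
    required:   fields in which a document must have a hit to be returned.
    """
    totals = {}
    for field, hits in field_hits.items():
        w = weights.get(field, 0.0)
        for doc_id, sim in hits.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + w * sim
    return {
        doc_id: score
        for doc_id, score in totals.items()
        if all(doc_id in field_hits.get(f, {}) for f in required)
    }


def hybrid_scores(field_scores, maxsim, alpha=0.5):
    """Blend field-level and token-level scores (assumed linear interpolation),
    keeping both component scores in the result for transparency."""
    result = {}
    for doc_id in set(field_scores) | set(maxsim):
        f = field_scores.get(doc_id, 0.0)
        m = maxsim.get(doc_id, 0.0)
        result[doc_id] = {
            "field": f,
            "maxsim": m,
            "score": alpha * f + (1 - alpha) * m,
        }
    return result


field_hits = {
    "title": {"doc-1": 0.5, "doc-2": 1.0},
    "content": {"doc-1": 1.0},
}
weights = {"title": 2.0, "content": 1.0}

all_docs = weighted_field_scores(field_hits, weights)
with_content = weighted_field_scores(field_hits, weights, required=("content",))
hybrid = hybrid_scores(all_docs, {"doc-1": 1.0, "doc-2": 0.0}, alpha=0.5)
print(all_docs, with_content, hybrid["doc-1"]["score"])
```

With these inputs both documents reach a field score of 2.0, but requiring a `content` match drops `doc-2`. Because field and MaxSim scores generally live on different scales, a real implementation would likely need some normalization before interpolating; the sketch leaves that out.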
- Uses sentence-transformers to generate real embeddings
- Tests ColBERT MaxSim search with token-level embeddings
- Tests weighted field search with real semantic vectors
- Adds data.json to test dependencies
- Marks integration tests with ^:integration metadata

Integration tests use simulated token embeddings (chunked text) since ColBERT models require C++ extension compilation.
- docs/multi-field-colbert.md: Comprehensive documentation of compound keys, weighted field search, ColBERT, and hybrid search
- test/data/colbert_minimal.json: Minimal synthetic test data (666 bytes) for CI/CD without Python dependency
- test/generate_colbert_data.py: Script to generate real embeddings with sentence-transformers (optional)
- Updated integration test to prefer committed minimal data

Users can run tests without Python. To test with real embeddings:

    python -m venv .venv && . .venv/bin/activate
    pip install sentence-transformers
    python test/generate_colbert_data.py > /tmp/colbert_data.json
    clojure -M:test
No description provided.