Feature/compound keys by whilo · Pull Request #3 · replikativ/proximum

whilo · 2026-02-19T08:35:32Z

No description provided.

Enables multi-field indexing (Cozo-style) and ColBERT token-level indexing. External IDs can now be vectors like ["doc-1" :title] or ["doc-1" :content 0].

- Added compare-coll for lexicographic comparison of nested collections
- Updated external-id-comparator to support both simple and compound IDs
- Backward compatible: simple string/UUID/Long IDs still work
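
To make the ordering of compound IDs concrete, here is a minimal, language-neutral sketch in Python of a lexicographic comparison over nested keys. It is not the PR's compare-coll implementation: the function name `compare_coll`, the tuple representation of compound keys, the string spelling of keywords (":title"), and the type ranking used to order mixed simple and compound keys are all assumptions made for illustration.

```python
from functools import cmp_to_key

# Hypothetical type ranking so that IDs of different types still have a
# total order. Only int, str, and tuple are handled in this sketch.
_TYPE_RANK = {int: 0, str: 1, tuple: 2}

def compare_coll(a, b):
    """Return -1, 0, or 1 comparing two IDs lexicographically.

    Simple IDs are ints or strs; compound IDs are tuples that may nest.
    """
    rank_a, rank_b = _TYPE_RANK[type(a)], _TYPE_RANK[type(b)]
    if rank_a != rank_b:
        return -1 if rank_a < rank_b else 1
    if isinstance(a, tuple):
        # Compare element by element; the first difference decides.
        for x, y in zip(a, b):
            c = compare_coll(x, y)
            if c != 0:
                return c
        # All shared positions are equal, so the shorter key sorts first.
        return (len(a) > len(b)) - (len(a) < len(b))
    return (a > b) - (a < b)

keys = [
    ("doc-1", ":title"),
    "doc-0",
    ("doc-1", ":content", 1),
    ("doc-1", ":content", 0),
    42,
]
ordered = sorted(keys, key=cmp_to_key(compare_coll))
```

With a total order like this, all keys sharing the prefix "doc-1" sort next to each other, which is the property that makes compound keys useful for grouping fields or tokens by document.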

Tests cover:

- compare-coll with simple values, vectors, nested structures
- Compound keys in external-id-index
- Mixed compound and simple keys
- Compound key uniqueness enforcement

Enables multi-vector document representation with MaxSim scoring:

- insert-document: Insert token vectors with compound keys [doc-id :token idx]
- insert-documents: Batch insert multiple documents
- maxsim-search: Token-level HNSW search + MaxSim aggregation
- maxsim-search-filtered: MaxSim with document ID filtering

ColBERT outperforms single-vector models on complex queries by capturing token-level semantic interactions.

References:

- ColBERT (SIGIR'20): https://arxiv.org/abs/2004.12832
- ColBERTv2 (NAACL'22): https://arxiv.org/abs/2112.01488
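
As a rough illustration of MaxSim aggregation, the Python sketch below scores each document as the sum, over query tokens, of the best dot product against any of that document's token vectors. It is a brute-force sketch under stated assumptions, not the PR's maxsim-search: it scans every stored token instead of retrieving candidates through HNSW, and the names `maxsim_scores` and `dot`, along with the dictionary-based index, are hypothetical.

```python
from collections import defaultdict

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim_scores(query_tokens, index):
    """Score each document by summing, per query token, the maximum
    similarity to any of that document's token vectors.

    `index` maps compound keys (doc_id, ":token", idx) to token vectors.
    """
    # Group token vectors by the document ID in the compound key.
    by_doc = defaultdict(list)
    for (doc_id, _field, _idx), vec in index.items():
        by_doc[doc_id].append(vec)
    return {
        doc_id: sum(max(dot(q, d) for d in vecs) for q in query_tokens)
        for doc_id, vecs in by_doc.items()
    }

index = {
    ("doc-1", ":token", 0): [1.0, 0.0],
    ("doc-1", ":token", 1): [0.0, 1.0],
    ("doc-2", ":token", 0): [0.5, 0.5],
}
query = [[1.0, 0.0], [0.0, 1.0]]
scores = maxsim_scores(query, index)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here doc-1 ranks first because each query token finds an exact match among its tokens, while doc-2's single averaged vector matches both query tokens only partially.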

Weighted Field Search:

- Combine results from different semantic fields (title, content, metadata)
- User-specified weights for each field
- Optional required field constraints

Hybrid Search:

- Combine weighted field search with ColBERT MaxSim
- Configurable alpha parameter for field vs token-level balance
- Returns both field and MaxSim scores for transparency

API:

- weighted-field-search: Single query across multiple fields
- weighted-field-search-with-constraints: Require matches in specific fields
- hybrid-search: Combine field + token-level matching
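
The score combination described above might look roughly like the following Python sketch. Everything here is an assumption for illustration: the function names, the weighted-sum fusion, and in particular the direction of alpha (alpha = 1.0 meaning field scores only). The PR states only that alpha balances field against token-level matching.

```python
def weighted_field_scores(field_scores, weights, required=()):
    """Combine per-field similarity scores into one score per document.

    field_scores: {field: {doc_id: score}}; weights: {field: weight}.
    Documents without a hit in every `required` field are dropped.
    """
    combined = {}
    for field, weight in weights.items():
        for doc_id, score in field_scores.get(field, {}).items():
            combined[doc_id] = combined.get(doc_id, 0.0) + weight * score
    return {
        doc_id: total
        for doc_id, total in combined.items()
        if all(doc_id in field_scores.get(f, {}) for f in required)
    }

def hybrid_scores(field, maxsim, alpha):
    """Blend field and MaxSim scores, keeping both components visible.

    Assumed convention: alpha=1.0 uses field scores only, 0.0 MaxSim only.
    """
    docs = set(field) | set(maxsim)
    return {
        d: {
            "field": field.get(d, 0.0),
            "maxsim": maxsim.get(d, 0.0),
            "hybrid": alpha * field.get(d, 0.0)
            + (1 - alpha) * maxsim.get(d, 0.0),
        }
        for d in docs
    }

field_scores = {
    ":title": {"doc-1": 0.5, "doc-2": 1.0},
    ":content": {"doc-1": 1.0},
}
weights = {":title": 2.0, ":content": 1.0}
all_docs = weighted_field_scores(field_scores, weights)
constrained = weighted_field_scores(field_scores, weights, required=[":content"])
blended = hybrid_scores(all_docs, {"doc-1": 1.0, "doc-2": 0.0}, alpha=0.5)
```

In this example both documents tie on the weighted field score, the `:content` constraint removes doc-2, and the hybrid blend separates them through the MaxSim component. In practice the two score types would likely need to be normalized to a comparable range before blending.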

- Uses sentence-transformers to generate real embeddings
- Tests ColBERT MaxSim search with token-level embeddings
- Tests weighted field search with real semantic vectors
- Adds data.json to test dependencies
- Marks integration tests with ^:integration metadata

Integration tests use simulated token embeddings (chunked text) since ColBERT models require C++ extension compilation.

- docs/multi-field-colbert.md: Comprehensive documentation of compound keys, weighted field search, ColBERT, and hybrid search
- test/data/colbert_minimal.json: Minimal synthetic test data (666 bytes) for CI/CD without Python dependency
- test/generate_colbert_data.py: Script to generate real embeddings with sentence-transformers (optional)
- Updated integration test to prefer committed minimal data

Users can run tests without Python. To test with real embeddings:

    python -m venv .venv && . .venv/bin/activate
    pip install sentence-transformers
    python test/generate_colbert_data.py > /tmp/colbert_data.json
    clojure -M:test

whilo added 9 commits February 18, 2026 23:52

Add tests for compound key support

85f84d4

Tests cover:

- compare-coll with simple values, vectors, nested structures
- Compound keys in external-id-index
- Mixed compound and simple keys
- Compound key uniqueness enforcement

Add Eclipse files to .gitignore

4237b4a

Follow markdown file name convention.

37f5581

Fix format.

e97cc91

whilo wants to merge 9 commits into main from feature/compound-keys
