Added embedding benchmark along with new config file and plot updates #14
base: main
Conversation
Pull request overview
This PR adds optional cross-encoder reranking functionality to improve retrieval quality, implements top-k candidate retrieval in embedding matching, normalizes embeddings before Redis operations, enhances the Redis index handling with dimension probing, and improves visualization with dual-subplot charts. The changes span the core matching logic, benchmarking infrastructure, and result visualization.
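The cross-encoder reranking step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `rerank_top_k` and `score_pairs` are hypothetical names, and the pair scorer is injected so the sketch stays self-contained (in the real pipeline it would be something like a sentence-transformers `CrossEncoder.predict`).

```python
import numpy as np

def rerank_top_k(query, candidates, score_pairs):
    """Rerank top-k retrieved candidates with a cross-encoder-style pair scorer.

    `score_pairs` maps a list of (query, candidate) pairs to relevance scores.
    Illustrative helper only -- not the repository's implementation.
    """
    pairs = [(query, c) for c in candidates]
    scores = np.asarray(score_pairs(pairs))
    order = np.argsort(-scores)  # descending by cross-encoder score
    return [candidates[i] for i in order], scores[order]
```

Batching all (query, candidate) pairs into a single scorer call, as the PR does for CE scoring, avoids per-pair model overhead.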
Key Changes
- Adds cross-encoder reranking with configurable top-k retrieval for both Redis-based and standard neural embedding matching
- Implements top-k retrieval logic in blockwise embedding matching with support for variable k values
- Normalizes embeddings to float32 and unit length before Redis operations, with dimension probing to handle models with incorrect config dimensions
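The normalization and dimension-probing steps above might look roughly like this. Both helper names are illustrative assumptions, not the PR's code; probing by encoding a dummy input sidesteps models whose config reports the wrong dimension.

```python
import numpy as np

def probe_embedding_dim(model):
    """Infer the true output dimension by encoding a dummy sentence.

    Some models report an incorrect dimension in their config, so probing
    the actual output is more reliable. (Illustrative helper.)
    """
    vec = np.asarray(model.encode(["probe"]))
    return vec.shape[-1]

def normalize_for_redis(embeddings):
    """Cast to float32 and scale rows to unit length before Redis writes,
    so inner-product search behaves like cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return emb / norms
```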
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| src/customer_analysis/query_engine.py | Adds embedding dimension probing, enables trust_remote_code, and always overwrites the Redis index on initialization |
| src/customer_analysis/embedding_interface.py | Implements top-k retrieval in blockwise matching methods with support for variable k; updates dimension inference to probe models first |
| src/customer_analysis/data_processing.py | Integrates cross-encoder reranking in both run_matching and run_matching_redis; adds embedding normalization before Redis operations |
| scripts/plot_multiple_precision_vs_cache_hit_ratio.py | Creates dual-subplot visualization with precision-CHR curves and an AUC bar chart; adds numpy version compatibility for trapezoid/trapz |
| run_benchmark.sh | Provides an example shell script for running benchmarks with cross-encoder models |
| run_benchmark.py | Extends the benchmark loop to iterate over cross-encoder models and include them in output paths |
| evaluation.py | Adds command-line arguments for the cross-encoder model and rerank_k parameter |
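The numpy version compatibility mentioned for the plotting script can be handled with a small shim, since `np.trapz` was renamed to `np.trapezoid` in NumPy 2.0. A sketch under that assumption; the PR's exact code may differ, and `auc` is an illustrative name:

```python
import numpy as np

# np.trapz was deprecated in favor of np.trapezoid in NumPy 2.0;
# fall back for older NumPy versions that lack the new name.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

def auc(precision, cache_hit_ratio):
    """Area under a precision-vs-cache-hit-ratio curve (illustrative helper)."""
    return float(trapezoid(precision, cache_hit_ratio))
```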
Comments suppressed due to low confidence (2)
src/customer_analysis/embedding_interface.py:414
- The docstring for `calculate_best_matches_with_cache_large_dataset` does not document the new `k` parameter or how it affects the return value shapes. The function returns arrays with shape `(num_sentences,)` when `k=1` but `(num_sentences, k)` when `k>1`. This should be documented to avoid confusion.
"""Large-dataset variant: find best cache match for each sentence using memmaps.
Writes two memmaps (rows for sentences, cols for cache), normalised, and
performs blockwise dot-products. If `sentence_offset` is provided and the
cache corresponds to the same corpus, the self-similarity diagonal is masked.
"""
src/customer_analysis/embedding_interface.py:490
- The docstrings for `calculate_best_matches_with_cache` and `calculate_best_matches_from_embeddings_with_cache` do not document the new `k` parameter or how it affects return value shapes. When `k=1`, arrays have shape `(N,)`, but when `k>1`, they have shape `(N, k)`. This should be documented.
"""
Calculate the best similarity match for each sentence against all other
sentences using a neural embedding model.
"""
```python
else:
    # Top-k logic: if this block has fewer than k columns, take all valid ones
    curr_block_size = col_end - col_start
    if curr_block_size <= k:
        # Small block: a full descending sort yields all (<= k) candidates in order
        top_k_in_block_idx = np.argsort(-sim, axis=1)
        top_k_in_block_val = np.take_along_axis(sim, top_k_in_block_idx, axis=1)
    else:
        # Larger block: argpartition finds the k largest (unordered),
        # then sort just those k to have an ordered top-k for merging
        part_idx = np.argpartition(-sim, k, axis=1)[:, :k]
        part_val = np.take_along_axis(sim, part_idx, axis=1)
        sorted_sub_idx = np.argsort(-part_val, axis=1)
        top_k_in_block_val = np.take_along_axis(part_val, sorted_sub_idx, axis=1)
        top_k_in_block_idx = np.take_along_axis(part_idx, sorted_sub_idx, axis=1)

    # Adjust block-local indices to global column indices
    top_k_in_block_idx_global = top_k_in_block_idx + col_start

    # Merge with the accumulated bests:
    #   chunk_best_scores:  (batch, k)
    #   top_k_in_block_val: (batch, min(block, k))
    combined_vals = np.concatenate([chunk_best_scores, top_k_in_block_val], axis=1)
    combined_idxs = np.concatenate([chunk_best_indices, top_k_in_block_idx_global], axis=1)

    # Keep the top k of the combined candidates
    best_combined_args = np.argsort(-combined_vals, axis=1)[:, :k]
    chunk_best_scores = np.take_along_axis(combined_vals, best_combined_args, axis=1)
    chunk_best_indices = np.take_along_axis(combined_idxs, best_combined_args, axis=1)
```
Copilot AI · Dec 9, 2025
The top-k retrieval logic in blockwise matching lacks test coverage. Since the repository has comprehensive tests and this is a significant new feature that changes the shape of return values and introduces complex merging logic, it should have tests covering: 1) k > 1 with various cache sizes, 2) edge cases where block size < k, 3) self-similarity masking with k > 1, and 4) correct sorting and merging across blocks.
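A test along the lines the reviewer suggests could compare the blockwise merge against a brute-force argsort. Since the PR's internal API is not shown here, this sketch checks a self-contained reimplementation of the merging logic (`blockwise_top_k` is an illustrative stand-in, not the repository's function):

```python
import numpy as np

def blockwise_top_k(sim, k, block=4):
    """Reference blockwise top-k over columns, merging block by block
    (mirrors the PR's merging pattern; not the repository's function)."""
    n = sim.shape[0]
    best_vals = np.full((n, k), -np.inf)          # accumulated best scores
    best_idxs = np.zeros((n, k), dtype=int)        # accumulated global indices
    for col_start in range(0, sim.shape[1], block):
        col_end = min(col_start + block, sim.shape[1])
        blk = sim[:, col_start:col_end]
        take = min(k, blk.shape[1])                # block may hold fewer than k
        idx = np.argsort(-blk, axis=1)[:, :take]
        vals = np.take_along_axis(blk, idx, axis=1)
        cand_vals = np.concatenate([best_vals, vals], axis=1)
        cand_idxs = np.concatenate([best_idxs, idx + col_start], axis=1)
        order = np.argsort(-cand_vals, axis=1)[:, :k]
        best_vals = np.take_along_axis(cand_vals, order, axis=1)
        best_idxs = np.take_along_axis(cand_idxs, order, axis=1)
    return best_vals, best_idxs

def test_blockwise_matches_bruteforce():
    rng = np.random.default_rng(0)
    sim = rng.random((5, 11))
    for k in (1, 3):               # covers k == 1 and k > 1
        for block in (2, 11, 20):  # blocks smaller and larger than k / the matrix
            vals, idxs = blockwise_top_k(sim, k, block)
            expect = np.sort(sim, axis=1)[:, ::-1][:, :k]
            assert np.allclose(vals, expect)
            assert np.allclose(np.take_along_axis(sim, idxs, axis=1), expect)
```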
Note
Adds optional cross-encoder reranking (top-k) across evaluation/benchmark pipelines, normalizes embeddings, hardens Redis index setup, and upgrades multi-model plotting with AUC bar charts.
- `--cross_encoder_model`/`--cross_encoder_models` and `--rerank_k` flags to enable optional cross-encoder reranking; `_rerank_<model>`.
- `run_matching`/`run_matching_redis`: support top-k candidate retrieval and cross-encoder reranking; batch CE scoring and selection.
- (`NeuralEmbedding`) `k` support across large/small dataset paths, including blockwise two-set search; `k`.
- (`RedisVectorIndex`) `embed_dim`; `overwrite=True` to match current dims.
- (`scripts/plot_multiple_precision_vs_cache_hit_ratio.py`)
- `run_benchmark.sh` example with new flags.

Written by Cursor Bugbot for commit 069650a.