feat: integrate OMEGA adaptive early termination into zvec #301

Open
driPyf wants to merge 138 commits into alibaba:main from driPyf:feat/omega-integration

driPyf commented Apr 1, 2026

Closes #300

Greptile Summary

This PR introduces a new OMEGA index type to zvec, integrating adaptive early termination on top of HNSW. Instead of adding a separate search engine, the implementation keeps HNSW as the underlying graph traversal path and adds a learned query-time stopping policy that decides whether the current search state is already sufficient for a target recall. The implementation spans a new omega algorithm directory, OMEGA-aware searcher/streamer/index support, DB-layer training orchestration, Python bindings, benchmark integration, and Python workflow tests.

Key changes:

  • OMEGA is now exposed as a first-class index type with dedicated index params and query params in both the core interfaces and Python bindings.
  • OmegaSearcher and OmegaStreamer integrate OMEGA with the existing HNSW search loop through hook callbacks, so online search remains HNSW-based while query-time stop decisions come from OMEGALib.
  • The offline pipeline now includes held-out query generation, ground-truth collection, search-trace collection, model training, and persistence of omega_model/ artifacts such as model.txt, threshold_table.txt, interval_table.txt, gt_collected_table.txt, and gt_cmps_all_table.txt.
  • Runtime behavior still falls back to plain HNSW when no model is available or the dataset is below min_vector_threshold.
  • Python workflow tests now cover insert → optimize/train → online OMEGA query, as well as the fallback-to-HNSW path when OMEGA is inactive.
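
The activation rule described above (model available, dataset at or above `min_vector_threshold`, otherwise plain HNSW) can be sketched as follows. This is a minimal illustration, not the code in `omega_searcher.cc`: `SearcherState` and `search_path` are hypothetical names introduced here, and only `should_use_omega` and `min_vector_threshold` appear in the PR.

```python
from dataclasses import dataclass


@dataclass
class SearcherState:
    """Hypothetical snapshot of what the searcher knows at query time."""
    model_loaded: bool          # were the omega_model/ artifacts loaded?
    vector_count: int           # number of vectors in the index
    min_vector_threshold: int   # below this, OMEGA stays inactive


def should_use_omega(state: SearcherState) -> bool:
    # OMEGA is active only when a model exists AND the dataset is large enough.
    return state.model_loaded and state.vector_count >= state.min_vector_threshold


def search_path(state: SearcherState) -> str:
    # Assumption: the fallback is the unmodified HNSW search path.
    return "omega" if should_use_omega(state) else "hnsw"
```

Either missing condition alone is enough to route the query to plain HNSW, which is why the fallback tests can compare results against an HNSW index directly.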

Issues found:

  • The current integration makes OMEGALib a required build dependency, so environments without working OpenMP or LightGBM toolchains will fail at configure time rather than transparently building without OMEGA.
  • The Python workflow tests validate that the OMEGA query path runs and that fallback results match HNSW, but they do not directly assert that adaptive early termination actually triggered inside OMEGALib.
  • The current query-time validation is intentionally behavioral rather than log- or metric-driven, so regressions in “OMEGA active but not stopping early” may not be caught by Python tests alone.
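
One way to close the gap noted above would be a metric-driven check. The sketch below assumes per-query distance-comparison counts are available for both the OMEGA and the plain HNSW run. The PR does not state that zvec exposes such counts, so `early_stop_rate`, `assert_early_stop_triggered`, and their inputs are hypothetical.

```python
from typing import Sequence


def early_stop_rate(omega_cmps: Sequence[int], hnsw_cmps: Sequence[int]) -> float:
    """Fraction of queries where OMEGA used fewer comparisons than plain HNSW."""
    if len(omega_cmps) != len(hnsw_cmps) or not omega_cmps:
        raise ValueError("need two non-empty sequences of equal length")
    stopped = sum(1 for o, h in zip(omega_cmps, hnsw_cmps) if o < h)
    return stopped / len(omega_cmps)


def assert_early_stop_triggered(omega_cmps: Sequence[int],
                                hnsw_cmps: Sequence[int],
                                min_rate: float = 0.1) -> float:
    # Fails when OMEGA is "active but not stopping early" on most queries.
    # min_rate is an arbitrary illustrative threshold, not a value from the PR.
    rate = early_stop_rate(omega_cmps, hnsw_cmps)
    assert rate >= min_rate, f"early stop rate {rate:.2f} below {min_rate:.2f}"
    return rate
```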

Confidence Score: 4/5

  • Likely safe to merge after normal build validation. The overall design is coherent: HNSW remains the underlying search engine, OMEGA is layered as query-time control, fallback behavior is preserved, and the offline training loop is integrated into the DB lifecycle rather than living in an external script.
  • The remaining risks are mainly around build environment requirements and the fact that Python tests validate end-to-end behavior more strongly than internal early-stop activation.
  • Pay close attention to src/core/algorithm/omega/omega_searcher.cc, src/core/algorithm/omega/omega_streamer.cc, and src/db/training/omega_training_coordinator.cc. These files define the runtime activation/fallback logic, the HNSW hook integration, and the offline training lifecycle that make the feature work end to end.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/core/algorithm/omega/omega_searcher.cc | OMEGA-aware searcher that loads persisted model artifacts, decides whether OMEGA should be active for a query, and falls back to plain HNSW when the model is unavailable or the dataset is below threshold. |
| src/core/algorithm/omega/omega_streamer.cc | Streaming/search-time integration layer that wires OMEGA into the HNSW loop through hooks, supports both inference and training modes, and persists the searcher-side OMEGA params alongside the index. |
| src/core/algorithm/omega/omega_context.h | Extends HnswContext with OMEGA-specific per-query state such as target_recall, training_query_id, and collected training outputs. |
| src/core/interface/indexes/omega_index.cc | Framework-facing OMEGA index wrapper that routes build/load/search lifecycle through the correct OMEGA-aware streamer/searcher path. |
| src/db/training/omega_training_coordinator.cc | Coordinates held-out query reuse, training-data collection, retraining flow, and model output management under omega_model/. |
| src/db/training/omega_model_trainer.cc | Bridges zvec-collected training records into OMEGALib’s C++ training API and writes the full set of OMEGA model artifacts. |
| src/db/index/segment/segment.cc | Adds OMEGA auto-training and retraining hooks into the segment optimize/build lifecycle. |
| src/db/index/column/vector_column/engine_helper.hpp | DB-to-core translation layer for OMEGA index params and runtime searcher config. |
| src/binding/python/model/param/python_param.cc | Python bindings for OmegaIndexParam, OmegaQueryParam, and optimize-time retraining options. |
| python/tests/test_collection.py | Python workflow tests covering OMEGA training, OMEGA query execution, and fallback-to-HNSW behavior. |
| thirdparty/CMakeLists.txt | Build integration for OMEGALib and its dependency wiring into zvec. |
| CMakeLists.txt | Top-level build wiring for OMEGA-related core libraries and tools. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Collection
    participant Segment
    participant OmegaTrainingCoordinator
    participant OMEGALib
    participant OmegaSearcher
    participant HNSW

    Note over User,Collection: Offline build / optimize
    User->>Collection: insert(docs)
    Collection->>Segment: build / persist vector index
    User->>Collection: optimize()
    Collection->>Segment: collect held-out queries and traces
    Segment->>OmegaTrainingCoordinator: training records + gt data
    OmegaTrainingCoordinator->>OMEGALib: train model
    OMEGALib-->>OmegaTrainingCoordinator: model + auxiliary tables
    OmegaTrainingCoordinator-->>Segment: persist omega_model/

    Note over User,Collection: Online query
    User->>Collection: query(VectorQuery, OmegaQueryParam)
    Collection->>OmegaSearcher: search(...)
    OmegaSearcher->>OmegaSearcher: should_use_omega()

    alt model available and threshold satisfied
        OmegaSearcher->>HNSW: search with OMEGA hooks
        HNSW->>OMEGALib: update SearchContext during traversal
        OMEGALib-->>HNSW: stop / continue
        HNSW-->>OmegaSearcher: results
    else fallback
        OmegaSearcher->>HNSW: plain HNSW search
        HNSW-->>OmegaSearcher: results
    end

    OmegaSearcher-->>Collection: results
    Collection-->>User: query results
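
The online branch of the diagram (HNSW traversal that consults a stop/continue hook) can be illustrated with a simplified single-layer best-first search. This is a sketch under stated assumptions: `hooked_search` and the `should_stop(comparisons, best_distance)` signature are hypothetical, real HNSW is multi-layer, and in the PR the decision comes from a learned model in OMEGALib rather than a plain callable.

```python
import heapq
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical hook: (comparisons so far, best distance so far) -> stop?
StopHook = Callable[[int, float], bool]


def hooked_search(graph: Dict[int, List[int]],
                  vectors: Sequence[Sequence[float]],
                  query: Sequence[float],
                  entry: int,
                  ef: int,
                  k: int,
                  should_stop: Optional[StopHook] = None) -> Tuple[List[int], int]:
    """Best-first graph search that returns (top-k ids, distance comparisons)."""
    def dist(node: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(vectors[node], query))

    d0 = dist(entry)
    comparisons = 1
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexpanded node first
    results = [(-d0, entry)]     # max-heap via negated distance, at most ef items

    while candidates:
        d, node = heapq.heappop(candidates)
        # Standard termination: nearest candidate is worse than the worst result.
        if len(results) >= ef and d > -results[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            comparisons += 1
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # drop the current worst result
        # Adaptive termination: ask the hook after each expanded node.
        if should_stop is not None:
            best = -max(nd for nd, _ in results)
            if should_stop(comparisons, best):
                break

    top = sorted((-nd, node) for nd, node in results)[:k]
    return [node for _, node in top], comparisons
```

Passing `should_stop=None` reproduces the plain search, which mirrors the fallback branch of the diagram: the traversal code is shared and only the stop decision differs.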

driPyf and others added 30 commits January 29, 2026 02:27
Integrate OMEGALib repository as a submodule to provide OMEGA adaptive search functionality. The submodule includes GBDT inference, feature extraction, model management, and search context components.
Add OMEGA index components that wrap HNSW with adaptive search capability:
- OmegaSearcher: Wraps HnswSearcher with OMEGA model integration and automatic fallback
- OmegaBuilder: Wraps HnswBuilder for index construction
- OmegaStreamer: Wraps HnswStreamer for streaming operations
- Factory registration for all components
- CMakeLists.txt integration with omega library dependency

OMEGA mode activates when vector count >= threshold and model is loaded, otherwise falls back to standard HNSW transparently.
- Add OMEGA index type to zvec type system
- Implement OmegaIndexParams class for index configuration
- Add Python bindings for OmegaIndexParam
- Integrate OMEGA searcher with HNSW fallback mechanism
- Add comprehensive Python unit tests for OMEGA functionality
- Update schema validation to support OMEGA index type

Tests verify that OMEGA index correctly falls back to HNSW behavior
when OMEGA-specific features are not enabled, ensuring full compatibility.
…y recall

- Implement OmegaIndex with ITrainingCapable interface for training support
- Create OmegaStreamer with training mode for feature collection during search
- Add OmegaSearcher adaptive search with OMEGA early stopping prediction
- Implement training data export and collection APIs
- Add OmegaQueryParams and OmegaContext for per-query target_recall specification
- Create omega_params.h and omega_context.h for parameter management
- Update engine_helper to convert and extract OMEGA query parameters
- Integrate training mode with Collection API (enable/disable/export methods)
- Add training data collector, query generator, and model trainer components
- Add Python training API with OmegaTrainer class
- Add debug logging for OMEGA index creation and merge operations
- Adjust HnswSearcher member access modifiers for OMEGA inheritance
- Remove test_omega_fallback.py (replaced by test_collection.py tests)
- Fix memory explosion in training data collection by clearing records after copy
- Add omega_model directory creation before training to fix CSV write failure
- Remove all debug fprintf/fflush statements and empty code blocks
… params

- Parallelize ground truth computation and training searches with std::thread
- Add training_query_id support for thread-safe parallel training
- Add num_training_queries param to OmegaIndexParams (default: 1000)
- Use ef_construction as training search ef instead of hardcoded 1000
Build System Changes:
- Add ZVEC_ENABLE_OMEGA option for conditional OMEGA compilation (default: OFF)
- Add -DZVEC_ENABLE_OMEGA definition when enabled
- Update thirdparty/CMakeLists.txt to conditionally build omega library
- Update src/core/CMakeLists.txt to conditionally compile omega sources
- Update omega submodule to version with LightGBM C API support

Training System Refactor:
- Replace Python subprocess training with native LightGBM C API
  * Remove CSV export and Python _omega_training.py invocation
  * Add direct omega::OmegaTrainer integration via C++ API
  * Remove ExportToCSV, ExportGtCmpsToCSV, InvokePythonTrainer methods
- Add configurable training parameters to OmegaModelTrainerOptions:
  * num_iterations (default: 100)
  * num_leaves (default: 31)
  * learning_rate (default: 0.1)
  * num_threads (default: 8)
- Add type conversion helpers (ConvertRecord, ConvertGtCmpsData)
- Improve training performance
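
The configurable training parameters and their defaults listed above can be mirrored in a small options object. This is a Python sketch, not the C++ `OmegaModelTrainerOptions` struct: `TrainerOptions` and `to_lightgbm_params` are hypothetical names, and although the four keys are valid LightGBM parameter names, how the PR actually forwards them to the LightGBM C API is an assumption.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Union


@dataclass
class TrainerOptions:
    """Defaults copied from the commit message above."""
    num_iterations: int = 100
    num_leaves: int = 31
    learning_rate: float = 0.1
    num_threads: int = 8


def to_lightgbm_params(opts: TrainerOptions) -> Dict[str, Union[int, float]]:
    # Basic sanity checks before handing the values to a trainer.
    if opts.num_iterations <= 0 or opts.num_leaves <= 1:
        raise ValueError("num_iterations must be > 0 and num_leaves > 1")
    if not 0.0 < opts.learning_rate <= 1.0:
        raise ValueError("learning_rate must be in (0, 1]")
    if opts.num_threads <= 0:
        raise ValueError("num_threads must be > 0")
    return asdict(opts)
```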

Training Data Collection Improvements:
- Move training record storage from OmegaStreamer to OmegaContext
  * Remove shared collected_records_ vector and training_mutex_ from OmegaStreamer
  * Store records per-query in OmegaContext via add_training_record()
  * Eliminate lock contention during parallel training searches
- Remove legacy GetTrainingRecords/ClearTrainingRecords from OmegaStreamer
- Simplify OmegaIndex training interface (return empty vectors)
- Update omega_streamer.cc to use context-based record collection
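
The move from a shared, mutex-guarded record vector to per-query storage can be sketched as below. Each worker writes only into its own context, so no lock is needed during the searches and the records are merged once at the end. `QueryContext`, `run_training_search`, and `collect_records` are hypothetical stand-ins; only `add_training_record` and `training_query_id` are names from the PR.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Tuple

Record = Tuple[int, int]  # (query id, step); a placeholder for real features


@dataclass
class QueryContext:
    training_query_id: int
    records: List[Record] = field(default_factory=list)

    def add_training_record(self, record: Record) -> None:
        # Only the owning query's thread touches this list, so no mutex.
        self.records.append(record)


def run_training_search(query_id: int, steps: int = 3) -> QueryContext:
    ctx = QueryContext(training_query_id=query_id)
    for step in range(steps):
        ctx.add_training_record((query_id, step))
    return ctx


def collect_records(num_queries: int, workers: int = 4) -> List[Record]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the merge is deterministic.
        contexts = list(pool.map(run_training_search, range(num_queries)))
    return [rec for ctx in contexts for rec in ctx.records]
```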

Code Cleanup:
- Wrap all OMEGA-dependent code with #ifdef ZVEC_ENABLE_OMEGA guards
- Update OmegaModelTrainerOptions documentation
- Add detailed logging for training record collection
- Improve error handling for missing OmegaContext
- Expose target_recall parameter for OMEGA adaptive early stopping
- Update OMEGA tests with 100k docs and recall validation
- Remove deprecated _omega_training.py
Major optimization:
- Move training data collection before Flush() to use in-memory graph
- Eliminate ~2 minute disk reload delay for 1M vectors
- Fix GT computation to use correct indexers (was using empty flushed ones)

Training improvements:
- Add ef_groundtruth parameter for faster GT computation using HNSW
- Support parallel training searches with per-query ground truth
- Add window_size parameter for early stopping control
- Expose all OMEGA params through Python API (OmegaIndexParam, OmegaQueryParam)

Code quality:
- Add TIMING logs for performance debugging
- Refactor TrainingDataCollector to use passed indexers instead of segment's
- Clean up training flow in merge_vector_indexer()
…query-side search path

OMEGA integration updates:
- wire the updated omega training and search behavior into zvec index build, load and query execution paths
- expose and propagate OMEGA training/query parameters through the Python API, index params and engine helper conversions
- update omega builder, searcher, streamer and context handling to match the reference behavior more closely

Training and validation updates:
- update training data collection and model training integration for the reference-aligned OMEGA workflow

Performance and debugging updates:
- add an OMEGA prediction microbenchmark for query-side inference analysis
- improve storage/index plumbing needed by the OMEGA workflow
- add query-side diagnostics to investigate early-stop calibration and repeated prediction overhead
OmegaStreamer &operator=(const OmegaStreamer &streamer) = delete;

// Training-mode configuration forwarded into per-search contexts.
void EnableTrainingMode(bool enable) {
Collaborator

Use a lowercase function name.

std::vector<uint64_t> query_doc_ids;
{
ScopedTimer timer("Step1: GenerateHeldOutQueries");
auto sampled = TrainingQueryGenerator::GenerateHeldOutQueries(
Collaborator

Should tell whether the result was loaded successfully.

@feihongxu0824
Collaborator

Hi @driPyf, thanks for the PR! Could you please resolve the merge conflicts with the main branch? Once that's done, I'll continue with the review. Thanks!

@feihongxu0824
Collaborator

Also, could you please submit a PR to zvec-web beforehand? That PR would give a more intuitive view of OMEGA's features and help users understand what this capability does.

driPyf and others added 7 commits May 20, 2026 13:27
# Conflicts:
#	.github/workflows/04-android-build.yml
#	examples/c++/CMakeLists.txt
#	examples/c/CMakeLists.txt
#	python/tests/test_collection.py
#	python/tests/test_params.py
#	python/zvec/__init__.py
#	python/zvec/model/param/query.py
#	python/zvec/model/schema/field_schema.py
#	src/binding/c/c_api.cc
#	src/binding/python/typing/python_type.cc
#	src/core/algorithm/CMakeLists.txt
#	src/core/algorithm/hnsw/hnsw_algorithm.cc
#	src/core/algorithm/hnsw/hnsw_algorithm.h
#	src/core/algorithm/hnsw/hnsw_dist_calculator.h
#	src/core/algorithm/hnsw/hnsw_searcher.cc
#	src/core/algorithm/hnsw/hnsw_searcher.h
#	src/core/algorithm/hnsw/hnsw_streamer.cc
#	src/core/algorithm/hnsw/hnsw_streamer.h
#	src/db/collection.cc
#	src/db/index/column/vector_column/vector_column_indexer.h
#	src/db/index/common/proto_converter.cc
#	src/db/index/common/proto_converter.h
#	src/db/index/common/schema.cc
#	src/db/proto/zvec.proto
#	src/include/zvec/c_api.h
#	src/include/zvec/core/framework/index_storage.h
#	src/include/zvec/core/interface/index_param.h
#	src/include/zvec/db/index_params.h
#	src/include/zvec/db/query_params.h
#	tests/core/interface/CMakeLists.txt
#	tools/core/CMakeLists.txt
Apply ruff format and clang-format to resolve CI lint violations.
- Add missing ZVEC_DEPENDENCY_LIB_DIR definition (was used but never set)
- Change FATAL_ERROR to WARNING since omega-example links libzvec.so
  which already bundles OMEGA internally
- Fix omega-example to link zvec-lib instead of non-existent zvec-core
- Add omega-example to CI workflow

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The Cache dependencies step in Android CI was using actions/cache@v5
without the required 'key' and 'path' parameters, causing CI failure.
Python's setup-python action sets LD_LIBRARY_PATH to its own lib
directory, which can contain a libgomp that conflicts with the
OpenMP runtime (libomp) used by Clang builds. This causes heap
corruption during test teardown in omega_index_integration_test.

Unsetting LD_LIBRARY_PATH ensures the system OpenMP library is
loaded. C++ test binaries find their own shared libraries via
RPATH and do not need Python's library path.
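
The workaround described above amounts to launching the C++ test binaries with `LD_LIBRARY_PATH` removed from the environment. A minimal sketch of that idea follows. The CI workflow itself most likely does this in shell, and `env_without_ld_library_path` and `run_test_binary` are hypothetical helpers, not part of the PR.

```python
import os
import subprocess
from typing import Dict, List, Mapping, Optional


def env_without_ld_library_path(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Copy an environment, dropping LD_LIBRARY_PATH if it is set."""
    env = dict(os.environ if base is None else base)
    env.pop("LD_LIBRARY_PATH", None)
    return env


def run_test_binary(cmd: List[str]) -> int:
    # The binary resolves its own shared libraries via RPATH, so the
    # Python-provided library directory is not needed here.
    completed = subprocess.run(cmd, env=env_without_ld_library_path(), check=False)
    return completed.returncode
```
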
Replace old type names (ZVecErrorCode, ZVecCollection, etc.) with
the correct _t suffixed types (zvec_error_code_t, zvec_collection_t,
etc.) to match the C API header definitions.
Collaborator

feihongxu0824 left a comment

CMake Warning at thirdparty/omega/OMEGALib/CMakeLists.txt:30 (message):
  OMEGA: OpenMP not found, building LightGBM without OpenMP


CMake Error at thirdparty/omega/OMEGALib/lightgbm/CMakeLists.txt:27 (cmake_minimum_required):
  CMake 3.28 or higher is required.  You are running version 3.26.5

Does the CMake version need to be updated to 3.28?

@feihongxu0824
Collaborator

feihongxu0824 commented May 22, 2026

After updating CMake to 3.28, the local macOS build still fails. Environment:

Apple clang version 16.0.0 (clang-1600.0.26.6)
Target: arm64-apple-darwin24.1.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Error messages:

alias declarations are a C++11 extension
no template named 'function' in namespace 'std'
thread_local unknown type name
delegating constructors are permitted only in C++11

It seems that C++11/17 was not specified when compiling Eigen.

Development

Successfully merging this pull request may close these issues.

[Feature]: Integrate OMEGA adaptive early termination into zvec