CNDB-15527: Upgrade jvector to 4.0.0-rc.8; make FusedPQ configurable #2042
michaeljmarshall merged 66 commits into main
Conversation
```diff
  VectorCompression.CompressionType compressionType = VectorCompression.CompressionType.values()[reader.readByte()];
- if (features.contains(FeatureId.FUSED_ADC))
+ if (features.contains(FeatureId.FUSED_PQ))
```
@marianotepper - I just noticed that the features map already has logic that loads the ProductQuantization, meaning this branch currently keeps two identical maps in memory. I think it would make sense to expose the features map in the OnDiskGraphIndex so we can remove the duplicate cost. Any reason we can't do that?
Looks like we actually already use the one from the header; I just didn't catch it. I think we'll be able to get rid of it with a little extra work in CC.
Built from the branch in this PR: datastax/jvector#588
This is a hack because it increases memory utilization and may impact other threads in the CC host. However, this will help unblock certain test scenarios.
Improve hashCode of the Key to prevent collisions. Implement Comparable<Key> in case the ConcurrentHashMap falls back into tree mode. (cherry picked from commit 091b636)
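For context, here is a minimal sketch of the pattern that commit describes (the names are illustrative, not the actual CNDB Key class): a well-mixed hashCode reduces bin collisions, and implementing Comparable gives ConcurrentHashMap a total order to fall back on if a bin is treeified.

```java
import java.util.Comparator;
import java.util.Objects;

final class Key implements Comparable<Key>
{
    final long sstableId;
    final int segment;

    Key(long sstableId, int segment)
    {
        this.sstableId = sstableId;
        this.segment = segment;
    }

    @Override
    public int hashCode()
    {
        // mixes both fields instead of, e.g., xor-ing raw values, spreading keys across bins
        return Objects.hash(sstableId, segment);
    }

    @Override
    public boolean equals(Object o)
    {
        return o instanceof Key && ((Key) o).sstableId == sstableId && ((Key) o).segment == segment;
    }

    @Override
    public int compareTo(Key other)
    {
        // gives ConcurrentHashMap an ordering to use when a heavily collided bin becomes a tree
        return Comparator.comparingLong((Key k) -> k.sstableId)
                         .thenComparingInt(k -> k.segment)
                         .compare(this, other);
    }
}
```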
I have cherry-picked this into this feature branch: #2156
```
Benchmark                                         (dimension)  Mode  Cnt    Score    Error  Units
OnDiskVectorValuesBench.readAllVectorsSequential          384  avgt    5   55.764 ±  1.522  ms/op
OnDiskVectorValuesBench.readAllVectorsSequential         1536  avgt    5  198.640 ±  3.520  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk                384  avgt    5  113.735 ±  2.768  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk               1536  avgt    5  500.743 ± 18.353  ms/op
```
…allow it
Note that the main motivation for updating this logic comes from the implementation of the RandomAccessVectorValues method threadLocalSupplier, which infers from the result of isValueShared whether the class instance can be shared across threads. Since this one cannot, it is safest to just mark the value as shared. Given how we use the resulting vector at the moment, it is not a problem to make it shared.
```
Benchmark                                         (dimension)  Mode  Cnt    Score    Error  Units
OnDiskVectorValuesBench.readAllVectorsSequential          384  avgt    5   52.610 ±  2.211  ms/op
OnDiskVectorValuesBench.readAllVectorsSequential         1536  avgt    5  184.552 ±  4.006  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk                384  avgt    5  113.262 ±  3.906  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk               1536  avgt    5  498.166 ± 14.529  ms/op
```
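To make the isValueShared inference above concrete, here is a minimal sketch (an assumed shape, not a copy of jvector's actual default implementation) of how a threadLocalSupplier can branch on isValueShared:

```java
import java.util.function.Supplier;

interface VectorValuesSketch
{
    boolean isValueShared();

    VectorValuesSketch copy();

    default Supplier<VectorValuesSketch> threadLocalSupplier()
    {
        if (isValueShared())
        {
            // the returned value is reused across calls, so each thread must get its own copy
            ThreadLocal<VectorValuesSketch> perThread = ThreadLocal.withInitial(this::copy);
            return perThread::get;
        }
        // each call returns a fresh value, so the same instance is safe to hand to every thread
        return () -> this;
    }
}
```

Reporting the value as shared therefore steers callers onto the per-thread-copy path, which is the safe choice for this class.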
Note that the score is slightly worse, but the error ratio is high. There is certainly a cost to fetching the thread from the hash map. The main motivation here is to prevent leaked resources. See datastax/jvector#635

```
Benchmark                                         (dimension)  Mode  Cnt    Score    Error  Units
OnDiskVectorValuesBench.readAllVectorsSequential          384  avgt    5   59.925 ±  4.414  ms/op
OnDiskVectorValuesBench.readAllVectorsSequential         1536  avgt    5  212.512 ± 28.455  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk                384  avgt    5  117.685 ±  4.208  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk               1536  avgt    5  526.014 ± 14.489  ms/op
```
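For context, a minimal sketch of the map-backed thread-local idea under discussion (an assumption about the general shape, not jvector's actual ExplicitThreadLocal): the per-access map lookup is the cost visible in the benchmark above, and a deterministic close() is what prevents the leaks.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class ExplicitThreadLocalSketch<T extends AutoCloseable> implements AutoCloseable
{
    private final ConcurrentHashMap<Long, T> values = new ConcurrentHashMap<>();
    private final Supplier<T> initialValue;

    ExplicitThreadLocalSketch(Supplier<T> initialValue)
    {
        this.initialValue = initialValue;
    }

    T get()
    {
        // unlike ThreadLocal, every per-thread value lives in one map; this lookup is the
        // extra per-call cost traded for the ability to release everything deterministically
        return values.computeIfAbsent(Thread.currentThread().getId(), id -> initialValue.get());
    }

    @Override
    public void close() throws Exception
    {
        // releases all per-thread resources instead of leaking them when threads die
        for (T value : values.values())
            value.close();
        values.clear();
    }
}
```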
@pkolaczk - I needed to make a few tweaks to the implementation to make the
I ran it again, and the results seem in the same ballpark, but with higher error bounds.
```
Benchmark                                         (dimension)  Mode  Cnt    Score    Error  Units
OnDiskVectorValuesBench.readAllVectorsSequential          384  avgt    5   61.719 ±  17.843  ms/op
OnDiskVectorValuesBench.readAllVectorsSequential         1536  avgt    5  217.288 ±  37.908  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk                384  avgt    5  117.555 ±   5.090  ms/op
OnDiskVectorValuesBench.writeVectorsToDisk               1536  avgt    5  560.162 ± 251.997  ms/op
```
❌ Build ds-cassandra-pr-gate/PR-2042 rejected by Butler: 6 regressions found.
Found 6 new test failures
Found 2 known test failures
Test failures are all timeouts. Merging.
### What is the issue
Fixes: riptano/cndb#16693
Test: riptano/cndb#16694
### What does this PR fix and why was it fixed
Binary quantization was never used in production. We have a different quantization coming, but it is not ready yet. Removing this logic eagerly simplifies some upcoming PRs, namely #2042.
…skAnn (#2222)
Fixes: riptano/cndb#15926
CNDB Test PR: riptano/cndb#16685
We do not count the compressed vector bytes in the `CassandraDiskAnn#ramBytesUsed()` method, which leads to significantly underreported utilization in the metrics. This change fixes the logic and adds a test that fails without it. Notably, the test fails with `Expected at least 196608 bytes but got 76` without the `compressedVectorBytes()` method in place, which shows how significant a miss this is. When we merge #2042, the added test will need to be modified to account for the reduced memory consumption, but the test is valid for now.
MarkWolters left a comment
Overall looks good, excited to see these features being put into use.
```java
.with(nvq != null ? new NVQ(nvq) : new InlineVectors(vectorValues.dimension()));
```
```java
if (compressor instanceof ProductQuantization && JVectorVersionUtil.shouldWriteFused(context.version()))
    indexWriterBuilder.with(new FusedPQ(context.getIndexWriterConfig().getAnnMaxDegree(), (ProductQuantization) compressor));
```
As far as I can tell compressor is always going to be either null or an instance of ProductQuantization here. It is instantiated on line 473 by calling writePQ, which either returns a ProductQuantization or null. Here there is no null check and it's also being checked and cast to ProductQuantization which seems dangerous.
Line 531: `null instanceof ProductQuantization` returns false, so I think we are good. To make it clearer, we could maybe add some assertions?
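A tiny self-contained demo of why the missing null check is safe here (standard Java semantics, not project code):

```java
public class InstanceOfNullDemo
{
    public static void main(String[] args)
    {
        Object compressor = null;
        // instanceof is null-safe: a null reference is never an instance of anything,
        // so the branch guarding the cast is simply skipped and no NPE is possible.
        System.out.println(compressor instanceof String); // prints false
    }
}
```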
```java
private OnDiskGraphIndexWriter createIndexWriter(File indexFile, long termsOffset, IndexContext context, OrdinalMapper ordinalMapper, VectorCompressor<?> compressor, NVQuantization nvq) throws IOException
```
CassandraRelevantProperties defines configuration that allows for using parallel graph writes, but those properties are not referenced here and the serial writer is instantiated.
Good point. I was reading in the PR:

> Another experimental feature introduced in this commit via the jvector upgrade is parallel graph encoding and writing to disk. Writing the fused graph requires increased CPU time to encode the graph node and we write more bytes to disk, so this parallelism is likely necessary to keep vector index creation/compaction times down. The key configurations available with their associated defaults:

From "Writing the fused graph requires increased CPU time" I understand we would want to enable this config. I also have a note from looking into the testing to check where it is tested. Thanks for flagging!
@MarkWolters, are the parallel graph writes still experimental, as the comments say, or are they in the same boat as FusedPQ - your test framework was fixed, you got good results for both, and it is OK to potentially enable them by default after we finalize testing on your end too?
@ekaterinadimitrova2 I would categorize them as still experimental. The feature has been tested internally, but not to the extent of fused PQ. I believe it is ready to use; I'd just feel more comfortable once it goes through some testing after integration to validate the correctness of the written graphs.
I see.

> I'd just feel more comfortable once it goes through some testing after integration

You mean in Cassandra/Astra, I guess, and not additional JVector testing you plan? Did I get that right?
```java
@TearDown(Level.Trial)
public void tearDown()
```
Why 2 teardowns with different capitalization?
```diff
  {
      // we need a long timeout because we are adding many vectors
-     String index = createIndexAsync("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex' WITH OPTIONS = {'similarity_function' : 'euclidean'}");
+     String index = createIndexAsync("CREATE CUSTOM INDEX ON %s(val) USING 'StorageAttachedIndex' WITH OPTIONS = {'similarity_function' : 'euclidean', 'enable_hierarchy': 'true'}");
```
For the future, I think it would give more thorough coverage to test both with hierarchy and without.
```java
@Override
public void processNeighbors(int i, int i1, ScoreFunction scoreFunction, ImmutableGraphIndex.IntMarker intMarker, ImmutableGraphIndex.NeighborProcessor neighborProcessor)
{
```
Is this intentionally empty? If so, should there be a comment to that effect?
### What is the issue
Fixes: https://github.com/riptano/cndb/issues/15527
CNDB test PR: https://github.com/riptano/cndb/pull/16797
### What does this PR fix and why was it fixed
This PR upgrades jvector, which brings several improvements. Here are the git commits brought in:
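```
8b3e93cf (tag: 4.0.0-rc.8) chore: update changelog for 4.0.0-rc.8 (#627)
9d0488e5 release 4.0.0-rc.8 (#626)
570bd118 Refactor parallel writer (#608)
20c348ec Move buffer position in ByteBufferIndexWriter#writeFloats (#607)
d9ddce51 Ensure extractTrainingVectors return a list of at most MAX_PQ_TRAINING_SET_SIZE (#610)
d663b4f7 add config options for regression testing (#609)
7e493eee On-disk index cache for the Grid benchmark harness (#612)
e263cc80 Improved dataset loading; fixes, safeties, diagnostics, and better feedback (#613)
6b235ce7 bump to next SNAPSHOT (#605)
84bf5708 (tag: 4.0.0-rc.7) chore: update changelog for 4.0.0-rc.7 (#604)
fceeb885 release 4.0.0-rc.7 (#603)
51807cba add protection against bad ordinal mappings (#602)
6ca3b5e2 adding memory and disk usage stats to bench tests (#591)
a66fd914 Fix OnDiskGraphIndex#ramBytesUsed NPE (#588)
0ca5a392 Move float bulk-write into IndexWriter to enforce endianness (#577)
a6c6c09b Add diversityScoreFunctionFor to avoid creation of wrapper object (#592)
977c21d4 Relax the threshold of a flaky test related to an experimental feature (#598)
fa808d69 adding average nodes visited to benchmark tests (#552)
3bd15e70 Virtualize and Modularize DataSetLoader logic (#593)
42259e9f Speed up ivec reads by buffering (#584)
f967f1c9 virtualize DataSet (#589)
55f902f4 turn off parallel writes in grid (#582)
019a241d Parallelize graph writes (#542)
02fea879 Save allocation of a large array in PQVectors.encodeAndBuild (#574)
32a51821 javadoc for base [graph] (#548)
4eb607f8 javadoc for base [disk,exceptions] (#547)
30e8932c Enable the fused graph index (#561)
d8848fc6 Start development on 4.0.0-rc.7-SNAPSHOT (#573)
c57f3a62 (tag: 4.0.0-rc.6) chore: update changelog for 4.0.0-rc.6 (#572)
214b7c20 release 4.0.0-rc.6 (#571)
e3686999 fix javadoc error (#570)
88669887 Ignoring testIncrementalInsertionFromOnDiskIndex_withNonIdentityOrdinalMapping and adding a TODO in buildAndMergeNewNodes (#569)
29a943e1 Computation of reconstruction errors for vector compressors (#567)
d8e9cb16 Add NVQ paper in README (#560)
d5cbe658 Add ImmutableGraphIndex.isHierarchical (#563)
b484dae2 Harden tests for heap graph reconstruction (#543)
9471c57d Make the thresholds in TestLowCardinalityFiltering tighter (#559)
21e4a226 Begin development on 4.0.0-rc.6 (#558)
4f661d99 Revert "Start development on 4.0.0-rc.6-SNAPSHOT"
fdee5779 Start development on 4.0.0-rc.6-SNAPSHOT
```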
### SAI Version Bump
Adds a new SAI on-disk version: `fa`
### Fused PQ
With this version, we are adding a new, experimental feature to write PQ vectors fused into the graph. In doing so, we are able to skip writing the PQ vectors to the PQ file, which results in significant memory savings, since the PQ vectors in the `CassandraDiskAnn` graph searcher consume `O(n)` memory based on the number of vectors and their quantized size. The fused PQ vectors mostly fit within the page cache as we read each node and its neighbors from disk, so we see minimal latency reduction from this change, though further testing is required to see the real impact.

In order to enable fused PQ, the runtime needs `cassandra.sai.latest.version=fa` or greater and `cassandra.sai.vector.enable_fused=true`. Note that because this feature is still experimental, `cassandra.sai.vector.enable_fused` defaults to `false`.

Another experimental feature introduced in this commit via the jvector upgrade is parallel graph encoding and writing to disk. Writing the fused graph requires increased CPU time to encode each graph node, and we write more bytes to disk, so this parallelism is likely necessary to keep vector index creation/compaction times down. The key configurations available, with their associated defaults, are shown below:
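```java
// When building a compaction graph, encode layer 0 nodes in parallel and subsequently use async io for writes.
// This feature is experimental, so defaults to false.
SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_ENABLED("cassandra.sai.vector.encode_and_write_graph_in_parallel.enabled", "false"),

// When parallel graph encoding is enabled, the number of threads to use for encoding. Defaults to 0, meaning
// use all available processors as reported by the JVM.
SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_NUM_THREADS("cassandra.sai.vector.encode_and_write_graph_in_parallel.num_threads", "0"),

// When parallel graph encoding is enabled, whether to use direct buffers. Defaults to false, meaning heap
// buffers are used. A buffer will be allocated per encoding thread. The size of each buffer is the size
// of the encoded graph node at layer 0, which varies based on graph feature settings.
SAI_ENCODE_AND_WRITE_VECTOR_GRAPH_IN_PARALLEL_USE_DIRECT_BUFFERS("cassandra.sai.vector.encode_and_write_graph_in_parallel.use_direct_buffers", "false"),
```

For example, a test node could opt in by passing `-Dcassandra.sai.vector.encode_and_write_graph_in_parallel.enabled=true` alongside `-Dcassandra.sai.latest.version=fa` and `-Dcassandra.sai.vector.enable_fused=true`.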
### OnDiskVectorValues and OnDiskVectorValuesWriter
`OnDiskVectorValues` is now in its own file and is now thread safe in order to account for some necessary implementation details within jvector. Added `OnDiskVectorValuesWriter` to improve test coverage and to abstract away the flush issues associated with `BufferedRandomAccessWriter`, as described in datastax/jvector#562.
### Verification
This PR also introduces new benchmarks as well as improved unit testing. The new benchmarks verify the performance of `OnDiskVectorValues` and `OnDiskVectorValuesWriter` to confirm (at least directionally) the time associated with read and write operations.

New tests have been added to verify that when we iterate over an sstable's rows, we can assert that each row's vector has a similarity of ~1 to the one stored in the vector graph. This testing is valuable in that it confirms the row id to ordinal mapping is correct at every node. Previously, we relied on recall results to verify this for us. This new pattern allows us to confirm _every_ node, which is more thorough and removes most edge cases that might have led to partially correct graphs that may have achieved acceptable recall.
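As a rough illustration of that verification pattern (hypothetical stand-in types and maps, not the actual test code): walk every row, resolve its graph ordinal, and assert that the sstable vector and the graph vector agree.

```java
import java.util.Map;

final class GraphMappingCheck
{
    static double cosine(float[] a, float[] b)
    {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++)
        {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // rowVectors: vector read from each sstable row, keyed by row id; ordinalVectors: vector
    // stored in the graph, keyed by ordinal; rowIdToOrdinal: the mapping under test.
    static void verify(Map<Integer, float[]> rowVectors,
                       Map<Integer, float[]> ordinalVectors,
                       Map<Integer, Integer> rowIdToOrdinal)
    {
        for (var e : rowVectors.entrySet())
        {
            int ordinal = rowIdToOrdinal.get(e.getKey());
            double sim = cosine(e.getValue(), ordinalVectors.get(ordinal));
            // every node is checked, not just the aggregate recall
            if (sim < 0.999)
                throw new AssertionError("row " + e.getKey() + " maps to wrong ordinal " + ordinal);
        }
    }
}
```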