Affected Component
ailego compute kernels, especially the single-compute and batch-compute operator paths under the distance computation stack.
Current Behavior
On zvec v0.2.1, our RVV optimization only covers part of the compute layer in ailego, specifically:
- batch compute operators
- single compute operators
The current RVV work is therefore effective but still partial. It accelerates important hot paths, but it is not yet a systematic RVV backend for the whole distance-computation stack used by FLAT and HNSW search flows.
Test environment:
- Hardware: SpacemiT K1 Muse Pi Pro
- Memory/Storage: 16 GB RAM + 128 GB storage
- Dataset: cohere1m
- zvec version: 0.2.1
Based on our benchmark results, the current partial RVV optimization already brings substantial single-thread improvements on cohere1m (each value is shown as before -> after RVV optimization):
FLAT, TopK=100
- FP16:
  - Recall: 0.99760 -> 0.99324
  - QPS: 0.05 -> 1.66
  - P99 latency: 19246.82 ms -> 615.80 ms
- FP32:
  - Recall: 0.99998 -> 0.99999
  - QPS: 0.35 -> 1.02
  - P99 latency: 2860.69 ms -> 988.96 ms
- INT8:
  - Recall: 0.95123 -> 0.95123
  - QPS: 0.21 -> 1.98
  - P99 latency: 4764.74 ms -> 510.42 ms
- Refiner:
  - Recall: 0.95123 -> 0.95123
  - QPS: 0.22 -> 1.94
  - P99 latency: 4652.24 ms -> 520.07 ms
HNSW, TopK=100
- FP16:
  - Recall: 0.93484 -> 0.93478
  - QPS: 15.22 -> 58.52
  - P99 latency: 88.31 ms -> 21.37 ms
- FP32:
  - Recall: 0.93505 -> 0.93493
  - QPS: 45.09 -> 53.79
  - P99 latency: 27.76 ms -> 23.33 ms
- INT8:
  - Recall: 0.90944 -> 0.90961
  - QPS: 38.00 -> 65.56
  - P99 latency: 33.30 ms -> 18.94 ms
- Refiner:
  - Recall: 0.93385 -> 0.93417
  - QPS: 36.04 -> 61.54
  - P99 latency: 34.69 ms -> 19.57 ms
Some representative observations from the current data:
- FLAT FP16 gets the largest speedup (about 31.4x QPS), but with a small recall drop at TopK=100 (0.99760 -> 0.99324).
- FLAT FP32/INT8/Refiner show large speedups while recall stays effectively unchanged in our current tests.
- HNSW FP16 shows a large throughput and latency improvement (QPS 15.22 -> 58.52) and HNSW FP32 a smaller one (QPS 45.09 -> 53.79), both with recall differences below 0.0002.
- HNSW INT8/Refiner improve both throughput and tail latency, while recall is slightly improved in the current benchmark.
These results show that RVV optimization in ailego is already valuable, but the current coverage is still limited to only part of the operator set.
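The ratios behind these observations can be recomputed from the published figures. The sketch below is illustrative only: the `RESULTS` table and the `summarize` helper are names introduced here, not part of zvec or ailego. Its inputs are the rounded values listed above, so a ratio built on a two-decimal QPS such as 0.05 will not exactly match one computed from unrounded measurements (for example the roughly 31.4x quoted for FLAT FP16).

```python
# (before, after) pairs copied from the FLAT and HNSW results in this report.
RESULTS = {
    ("FLAT", "FP16"):    {"recall": (0.99760, 0.99324), "qps": (0.05, 1.66),   "p99_ms": (19246.82, 615.80)},
    ("FLAT", "FP32"):    {"recall": (0.99998, 0.99999), "qps": (0.35, 1.02),   "p99_ms": (2860.69, 988.96)},
    ("FLAT", "INT8"):    {"recall": (0.95123, 0.95123), "qps": (0.21, 1.98),   "p99_ms": (4764.74, 510.42)},
    ("FLAT", "Refiner"): {"recall": (0.95123, 0.95123), "qps": (0.22, 1.94),   "p99_ms": (4652.24, 520.07)},
    ("HNSW", "FP16"):    {"recall": (0.93484, 0.93478), "qps": (15.22, 58.52), "p99_ms": (88.31, 21.37)},
    ("HNSW", "FP32"):    {"recall": (0.93505, 0.93493), "qps": (45.09, 53.79), "p99_ms": (27.76, 23.33)},
    ("HNSW", "INT8"):    {"recall": (0.90944, 0.90961), "qps": (38.00, 65.56), "p99_ms": (33.30, 18.94)},
    ("HNSW", "Refiner"): {"recall": (0.93385, 0.93417), "qps": (36.04, 61.54), "p99_ms": (34.69, 19.57)},
}


def summarize(entry):
    """Return QPS speedup, P99 latency reduction factor, and recall delta."""
    recall_before, recall_after = entry["recall"]
    qps_before, qps_after = entry["qps"]
    p99_before, p99_after = entry["p99_ms"]
    return {
        "qps_speedup": qps_after / qps_before,      # > 1 means higher throughput
        "p99_reduction": p99_before / p99_after,    # > 1 means lower tail latency
        "recall_delta": recall_after - recall_before,  # < 0 means recall dropped
    }


SUMMARY = {key: summarize(entry) for key, entry in RESULTS.items()}

if __name__ == "__main__":
    for (index, dtype), s in SUMMARY.items():
        print(f"{index:4} {dtype:8} QPS x{s['qps_speedup']:.2f}  "
              f"P99 /{s['p99_reduction']:.2f}  recall {s['recall_delta']:+.5f}")
```

Reporting the recall delta next to each speedup keeps the quality cost of a kernel change visible in the same row as its performance gain.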
Desired Improvement
Extend RVV support in ailego from the currently optimized single-compute and batch-compute operators to a broader, more systematic implementation.
Suggested improvements:
- Expand RVV coverage to more hot distance-computation paths beyond the currently optimized operators.
- Document which ailego operators are RVV-accelerated and which still fall back to scalar implementations.
- Add validation for both performance and recall so RVV improvements can be evaluated not only by QPS/latency, but also by result quality.
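To make the validation item concrete, here is a minimal sketch of one common recall@k definition, assuming search results and ground truth are available as ranked lists of vector IDs per query. The function names are hypothetical and are not part of zvec or ailego, and the benchmark used for the numbers above may compute recall differently.

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the true top-k neighbours that appear in the retrieved top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids[:k])
    if not truth:
        return 0.0
    hits = len(truth.intersection(retrieved_ids[:k]))
    return hits / len(truth)


def mean_recall_at_k(all_retrieved, all_ground_truth, k):
    """Average recall@k over a set of queries (one ID list per query)."""
    if len(all_retrieved) != len(all_ground_truth):
        raise ValueError("one ground-truth list is required per query")
    if not all_retrieved:
        raise ValueError("at least one query is required")
    scores = [recall_at_k(r, g, k) for r, g in zip(all_retrieved, all_ground_truth)]
    return sum(scores) / len(scores)
```

Running the same query set through the scalar and RVV paths and comparing the two mean recall values separates numerical effects of the kernels from ordinary index variance.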
From our current results, there is already a strong case for extending RVV coverage: even partial optimization produces major gains in several workloads.
Impact
This enhancement would improve zvec’s usability and performance on RISC-V platforms with RVV support.
Expected benefits include:
- much better throughput on RVV-capable devices
- significantly lower tail latency for FLAT and HNSW workloads
- better out-of-the-box performance on RISC-V environments
- easier extension of RVV support to additional kernels over time
Our current cohere1m benchmark on SpacemiT K1 Muse Pi Pro already shows that even partial RVV optimization can deliver substantial gains, as the FLAT and HNSW results under Current Behavior show.
At the same time, the current data also shows that recall behavior should remain part of the evaluation criteria, since different data formats and index paths may respond differently to RVV optimization. Broadening coverage together with recall/performance regression checks would make the improvement more complete and safer for upstream users.
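A combined recall/performance regression check could look like the following sketch. The `Metrics` type, the `check_regression` function, and the threshold defaults are all assumptions made for illustration; they are not existing zvec tooling, and the thresholds are not proposed project policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    recall: float
    qps: float
    p99_ms: float


def check_regression(baseline, candidate,
                     max_recall_drop=0.001,  # illustrative threshold
                     min_qps_ratio=1.0,      # candidate must not be slower
                     max_p99_ratio=1.0):     # candidate tail latency must not grow
    """Return a list of failure messages; an empty list means the check passes."""
    failures = []
    recall_drop = baseline.recall - candidate.recall
    if recall_drop > max_recall_drop:
        failures.append(f"recall dropped by {recall_drop:.5f} (limit {max_recall_drop})")
    if candidate.qps < baseline.qps * min_qps_ratio:
        failures.append(f"qps {candidate.qps} is below {min_qps_ratio}x baseline {baseline.qps}")
    if candidate.p99_ms > baseline.p99_ms * max_p99_ratio:
        failures.append(f"p99 {candidate.p99_ms} ms exceeds {max_p99_ratio}x baseline {baseline.p99_ms} ms")
    return failures


# Figures from this report: FLAT FP16 and HNSW FP16, before and after RVV.
flat_fp16 = check_regression(Metrics(0.99760, 0.05, 19246.82), Metrics(0.99324, 1.66, 615.80))
hnsw_fp16 = check_regression(Metrics(0.93484, 15.22, 88.31), Metrics(0.93478, 58.52, 21.37))
```

With these illustrative thresholds, the FLAT FP16 result is flagged for its recall drop even though throughput and latency improve, while the HNSW FP16 result passes. A per-format recall tolerance may be more appropriate in practice, since FP16 accumulation can legitimately differ from the scalar path.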