[Enhance]: Extend RVV acceleration coverage in ailego compute kernels

### Affected Component

ailego compute kernels, especially the single-compute and batch-compute operator paths under the distance computation stack.

### Current Behavior

On zvec v0.2.1, our RVV optimization only covers part of the compute layer in `ailego`, specifically:
- batch compute operators
- single compute operators

The current RVV work is therefore effective but partial: it accelerates important hot paths, but it is not yet a systematic RVV backend for the distance-computation stack used by FLAT- and HNSW-related search flows.

Test environment:
- Hardware: SpacemiT K1 Muse Pi Pro
- Memory/Storage: 16 GB + 128 GB
- Dataset: cohere1m
- zvec version: 0.2.1

Based on our benchmark results, the current partial RVV optimization already brings substantial single-thread improvements on cohere1m (values are shown as before -> after the optimization):

FLAT, TopK=100
- FP16:
  - Recall: 0.99760 -> 0.99324
  - QPS: 0.05 -> 1.66
  - P99 latency: 19246.82 ms -> 615.80 ms
- FP32:
  - Recall: 0.99998 -> 0.99999
  - QPS: 0.35 -> 1.02
  - P99 latency: 2860.69 ms -> 988.96 ms
- INT8:
  - Recall: 0.95123 -> 0.95123
  - QPS: 0.21 -> 1.98
  - P99 latency: 4764.74 ms -> 510.42 ms
- Refiner:
  - Recall: 0.95123 -> 0.95123
  - QPS: 0.22 -> 1.94
  - P99 latency: 4652.24 ms -> 520.07 ms

HNSW, TopK=100
- FP16:
  - Recall: 0.93484 -> 0.93478
  - QPS: 15.22 -> 58.52
  - P99 latency: 88.31 ms -> 21.37 ms
- FP32:
  - Recall: 0.93505 -> 0.93493
  - QPS: 45.09 -> 53.79
  - P99 latency: 27.76 ms -> 23.33 ms
- INT8:
  - Recall: 0.90944 -> 0.90961
  - QPS: 38.00 -> 65.56
  - P99 latency: 33.30 ms -> 18.94 ms
- Refiner:
  - Recall: 0.93385 -> 0.93417
  - QPS: 36.04 -> 61.54
  - P99 latency: 34.69 ms -> 19.57 ms

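Some representative observations from the current data:
- FLAT FP16 gets the largest speedup, roughly 30x in both QPS (0.05 -> 1.66) and P99 latency (19246.82 ms -> 615.80 ms), but with a small recall drop at TopK=100 (0.99760 -> 0.99324).
- FLAT FP32/INT8/Refiner show large QPS speedups (about 2.9x, 9.4x, and 8.8x respectively) while recall stays effectively unchanged in our current tests.
- HNSW FP16 shows a clear throughput improvement (about 3.8x QPS) and HNSW FP32 a more modest one (about 1.2x QPS); P99 latency drops in both cases, with only very small recall differences.
- HNSW INT8/Refiner improve both throughput (about 1.7x QPS) and tail latency, while recall is slightly higher in the current benchmark.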

These results show that RVV optimization in `ailego` is already valuable, but the current coverage is still limited to only part of the operator set.
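
The ratios behind the observations above can be recomputed directly from the reported figures. The following is a minimal Python sketch, not part of zvec or `ailego`; the `RESULTS` table and the `summarize` helper are illustrative names, and the numbers are copied from the lists above.

```python
# Benchmark figures copied from the lists above, as (before, after) pairs.
RESULTS = {
    ("FLAT", "FP16"):    {"qps": (0.05, 1.66),   "p99_ms": (19246.82, 615.80), "recall": (0.99760, 0.99324)},
    ("FLAT", "FP32"):    {"qps": (0.35, 1.02),   "p99_ms": (2860.69, 988.96),  "recall": (0.99998, 0.99999)},
    ("FLAT", "INT8"):    {"qps": (0.21, 1.98),   "p99_ms": (4764.74, 510.42),  "recall": (0.95123, 0.95123)},
    ("FLAT", "Refiner"): {"qps": (0.22, 1.94),   "p99_ms": (4652.24, 520.07),  "recall": (0.95123, 0.95123)},
    ("HNSW", "FP16"):    {"qps": (15.22, 58.52), "p99_ms": (88.31, 21.37),     "recall": (0.93484, 0.93478)},
    ("HNSW", "FP32"):    {"qps": (45.09, 53.79), "p99_ms": (27.76, 23.33),     "recall": (0.93505, 0.93493)},
    ("HNSW", "INT8"):    {"qps": (38.00, 65.56), "p99_ms": (33.30, 18.94),     "recall": (0.90944, 0.90961)},
    ("HNSW", "Refiner"): {"qps": (36.04, 61.54), "p99_ms": (34.69, 19.57),     "recall": (0.93385, 0.93417)},
}


def summarize(results):
    """Return QPS speedup, P99 latency reduction factor, and recall delta per workload."""
    rows = {}
    for key, metrics in results.items():
        qps_before, qps_after = metrics["qps"]
        p99_before, p99_after = metrics["p99_ms"]
        recall_before, recall_after = metrics["recall"]
        rows[key] = {
            "qps_speedup": qps_after / qps_before,      # > 1 means higher throughput
            "p99_reduction": p99_before / p99_after,    # > 1 means lower tail latency
            "recall_delta": recall_after - recall_before,  # negative means recall dropped
        }
    return rows


if __name__ == "__main__":
    for (index, fmt), row in summarize(RESULTS).items():
        print(
            f"{index:5s} {fmt:8s} "
            f"QPS x{row['qps_speedup']:.2f}  "
            f"P99 /{row['p99_reduction']:.2f}  "
            f"recall {row['recall_delta']:+.5f}"
        )
```

Note that the FLAT FP16 baseline QPS (0.05) is rounded to two decimals, so its QPS ratio is sensitive to rounding; the P99 latency ratio for the same workload is likely the more stable indicator.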

### Desired Improvement

Extend RVV support in `ailego` from the currently optimized single-compute and batch-compute operators to a broader, more systematic implementation.

Suggested improvements:
1. Expand RVV coverage to more hot distance-computation paths beyond the currently optimized operators.
2. Document which `ailego` operators are RVV-accelerated and which still fall back to scalar implementations.
3. Add validation for both performance and recall so RVV improvements can be evaluated not only by QPS/latency, but also by result quality.
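
As an illustration of item 3, a combined recall/performance gate could look like the sketch below. This is only an assumption about how such a check might be structured, not an existing zvec or `ailego` API: `recall_at_k`, `RunStats`, `check_regression`, and the default thresholds are all hypothetical names and values.

```python
from dataclasses import dataclass


def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of the true top-k ids that appear in the retrieved top-k."""
    if not retrieved or len(retrieved) != len(ground_truth):
        raise ValueError("need the same, non-zero number of queries in both lists")
    total = 0.0
    for got, truth in zip(retrieved, ground_truth):
        truth_k = set(truth[:k])
        if not truth_k:
            raise ValueError("ground truth for a query is empty")
        total += len(truth_k.intersection(got[:k])) / len(truth_k)
    return total / len(retrieved)


@dataclass
class RunStats:
    recall: float
    qps: float
    p99_ms: float


def check_regression(baseline, candidate, max_recall_drop=0.001, min_qps_ratio=1.0):
    """Return a list of failure messages; an empty list means the candidate passes.

    The thresholds are hypothetical defaults and would need tuning per workload.
    """
    failures = []
    if baseline.recall - candidate.recall > max_recall_drop:
        failures.append(
            f"recall dropped {baseline.recall:.5f} -> {candidate.recall:.5f}"
        )
    if candidate.qps < baseline.qps * min_qps_ratio:
        failures.append(f"QPS regressed {baseline.qps:.2f} -> {candidate.qps:.2f}")
    if candidate.p99_ms > baseline.p99_ms:
        failures.append(
            f"P99 latency regressed {baseline.p99_ms:.2f} ms -> {candidate.p99_ms:.2f} ms"
        )
    return failures


if __name__ == "__main__":
    # Figures taken from the FLAT FP16 and HNSW FP32 results in this issue.
    flat_fp16 = check_regression(
        RunStats(0.99760, 0.05, 19246.82), RunStats(0.99324, 1.66, 615.80)
    )
    hnsw_fp32 = check_regression(
        RunStats(0.93505, 45.09, 27.76), RunStats(0.93493, 53.79, 23.33)
    )
    print("FLAT FP16:", flat_fp16 or "ok")
    print("HNSW FP32:", hnsw_fp32 or "ok")
```

With these illustrative thresholds, the FLAT FP16 result would be flagged for its recall drop despite the large speedup, while HNSW FP32 would pass. That is the kind of case where evaluating QPS/latency alone would hide a quality change.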

From our current results, there is already a strong case for extending RVV coverage: even partial optimization produces major gains in several workloads.

### Impact

This enhancement would improve zvec’s usability and performance on RISC-V platforms with RVV support.

Expected benefits include:
- much better throughput on RVV-capable devices
- significantly lower tail latency for FLAT and HNSW workloads
- better out-of-the-box performance on RISC-V environments
- easier extension of RVV support to additional kernels over time

Our current cohere1m benchmark on SpacemiT K1 Muse Pi Pro already shows that even partial RVV optimization can deliver substantial gains, as summarized under Current Behavior.

At the same time, the current data also shows that recall behavior should remain part of the evaluation criteria, since different data formats and index paths may respond differently to RVV optimization. Broadening coverage together with recall/performance regression checks would make the improvement more complete and safer for upstream users.
