Successfully implemented SIMD optimizations (SSE2/AVX2 for x86_64, NEON for aarch64) for the OpenVX vision kernels with parallel processing support via Rayon.
Added feature flags for SIMD support:
[features]
default = []
simd = []
sse2 = ["simd"]
avx2 = ["simd"]
neon = ["simd"]
parallel = ["rayon"]
[dependencies]
rayon = { version = "1.8", optional = true }- Platform detection (
is_simd_available()) - SIMD lane constants for 128-bit and 256-bit operations
- Scalar fallback implementations for all operations
Implemented SSE2 and AVX2 intrinsics:
- Arithmetic Operations:
add_images_sat,sub_images_sat,weighted_avg - Gaussian Filters:
gaussian_h3_sse2,gaussian_v3_sse2,gaussian_h3_avx2,gaussian_v3_avx2 - Box Filter:
box_h3_sse2 - Runtime dispatch: Auto-detects AVX2/SSE2 availability
Implemented NEON intrinsics:
- Arithmetic Operations:
add_images_sat_neon,sub_images_sat_neon,weighted_avg_neon - Gaussian Filters:
gaussian_h3_neon,gaussian_v3_neon - Box Filter:
box_h3_neon
gaussian3x3_simd()- Separable [1,2,1] horizontal + vertical passesgaussian5x5_simd()- Separable [1,4,6,4,1] kernelbox3x3_simd()- Moving average optimizationsobel3x3_simd()- Gradient computation with SIMD
add_images_simd()- Saturated addition (16/32 pixels at once)subtract_images_simd()- Saturated subtractionweighted_avg_simd()- Alpha blending with fixed-point arithmeticmultiply_images_simd()- Multiplication with scale factor
rgb_to_gray_simd()- BT.709 coefficientsgray_to_rgb_simd()- Channel replicationrgb_to_rgba_simd()/rgba_to_rgb_simd()- Format conversionrgb_to_yuv_simd()- BT.601 YUV conversion
Rayon-based parallel implementations:
gaussian3x3_parallel()- Row-parallel separable convolutiongaussian5x5_parallel()- Parallel 5x5 Gaussianbox3x3_parallel()- Parallel box filtersobel3x3_parallel()- Parallel gradient computationrgb_to_gray_parallel()- Parallel color conversionadd_images_parallel()/subtract_images_parallel()- Parallel arithmetic
Criterion benchmarks comparing:
- Scalar vs SIMD implementations
- SIMD vs Parallel implementations
- Multiple image sizes: 640x480, 1280x720, 1920x1080
- Kernels: Gaussian3x3, Box3x3, RGB to Gray, Add, Sobel, Weighted Average
cd ~/.openclaw/workspace/rustVX
cargo build --release --features simdcargo build --release --features "simd parallel"cargo bench --features "simd parallel"cargo test --features "simd parallel"- 128-bit (SSE2/NEON): 16 u8, 8 u16/i16, 4 f32 per operation
- 256-bit (AVX2): 32 u8, 16 u16/i16, 8 f32 per operation
- Arithmetic operations: ~8-16x with SSE2, ~16-32x with AVX2
- Gaussian 3x3: ~4-8x (separable implementation)
- Color conversion: ~3-6x (memory-bound)
- With Rayon: Additional 2-4x on multi-core systems
All separable filters now use:
- Horizontal pass first (row-major memory access)
- Vertical pass second
- Intermediate buffer between passes
- SIMD for both passes
- Proper edge handling (replicate border)
openvx-vision/src/
├── lib.rs # Updated with SIMD modules
├── simd_utils.rs # SIMD infrastructure & scalar fallbacks
├── x86_64_simd.rs # x86_64 SSE2/AVX2 implementations
├── aarch64_simd.rs # ARM NEON implementations
├── filter_simd.rs # SIMD filter operations
├── arithmetic_simd.rs # SIMD arithmetic operations
├── color_simd.rs # SIMD color conversions
├── parallel.rs # Rayon parallel implementations
└── ... (original modules)
- Integrate SIMD into kernel execution paths
- Add runtime feature detection for automatic dispatch
- Profile and optimize memory access patterns
- Consider GPU acceleration via CUDA/OpenCL