A portable SIMD abstraction library for Go, inspired by Google's Highway C++ library.
Write SIMD code once, then run it on AVX2, AVX-512, ARM NEON, or a pure Go fallback.
- Go 1.26+
- GOEXPERIMENT=simd for AMD64 hardware acceleration (uses the native simd/archsimd package)
- ARM64 uses hwy/asm with GoAT-generated assembly (no GOEXPERIMENT needed)
go get github.com/ajroetker/go-highway

package main

import (
	"fmt"

	"github.com/ajroetker/go-highway/hwy"
	"github.com/ajroetker/go-highway/hwy/contrib/algo"
)

func main() {
	// Load data into SIMD vectors
	data := []float32{1, 2, 3, 4, 5, 6, 7, 8}
	v := hwy.Load(data)

	// Vectorized operations
	doubled := hwy.Mul(v, hwy.Set[float32](2.0))
	sum := hwy.ReduceSum(doubled)
	fmt.Printf("Sum of doubled: %v\n", sum)

	// Transcendental functions using transforms
	output := make([]float32, len(data))
	algo.ExpTransform(data, output)
	fmt.Printf("Exp: %v\n", output)
}

Build and run:

GOEXPERIMENT=simd go run main.go

These are fundamental SIMD operations that map directly to hardware instructions:
| Category | Operations |
|---|---|
| Load/Store | Load, LoadSlice, Store, StoreSlice, Set, Zero, MaskLoad, MaskStore, Load4 |
| Arithmetic | Add, Sub, Mul, Div, Neg, Abs, Min, Max, FMA, MulAdd |
| Math | Sqrt, RSqrt, RSqrtNewtonRaphson, RSqrtPrecise, Pow |
| Reduction | ReduceSum, ReduceMin, ReduceMax |
| Comparison | Equal, NotEqual, LessThan, LessEqual, GreaterThan, GreaterEqual |
| Conditional | IfThenElse, IfThenElseZero, IfThenZeroElse, ZeroIfNegative |
| Bitwise | And, Or, Xor, Not, AndNot, ShiftLeft, ShiftRight, PopCount, ReverseBits |
| Shuffle | GetLane, Reverse, Reverse2, Reverse4, Reverse8, Broadcast, Iota |
| Type Cast | AsInt32, AsFloat32, AsInt64, AsFloat64 |
| Float Check | IsNaN, IsInf, IsFinite, RoundToEven |
| Utilities | NumLanes, SignBit, Const, ConstValue |
Load/Store skip bounds checking for performance-critical inner loops (caller must guarantee sufficient length). LoadSlice/StoreSlice are safe variants that handle short slices.
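
The length guarantee mentioned above is usually met with a main loop over full blocks plus a scalar tail. The sketch below shows only that loop shape in plain Go. It does not call the library, and `lanes` and `scaleInPlace` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// lanes is a hypothetical vector width in elements (8 float32 lanes would
// correspond to a 256-bit register). It is an assumption for illustration.
const lanes = 8

// scaleInPlace multiplies every element by k. The main loop only runs while
// a full block of `lanes` elements remains, which is the guarantee an
// unchecked load needs; the tail loop handles the leftover elements.
func scaleInPlace(data []float32, k float32) {
	i := 0
	for ; i+lanes <= len(data); i += lanes {
		block := data[i : i+lanes] // full block: safe for an unchecked load
		for j := range block {
			block[j] *= k
		}
	}
	for ; i < len(data); i++ { // tail: fewer than `lanes` elements remain
		data[i] *= k
	}
}

func main() {
	data := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	scaleInPlace(data, 2)
	fmt.Println(data)
}
```

In real code the inner block loop would be replaced by vector operations, and the tail could instead go through the safe slice variants.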
Low-level SIMD functions for direct archsimd usage:
- Sqrt_AVX2_F32x8, Sqrt_AVX2_F64x4 - Hardware sqrt (VSQRTPS/VSQRTPD)
- Sqrt_AVX512_F32x16, Sqrt_AVX512_F64x8 - AVX-512 variants
- PopCount_AVX2_*, PopCount_AVX512_* - Population count for bit manipulation
The contrib package is organized into two subpackages:
Algorithm Transforms (hwy/contrib/algo):
| Function | Description |
|---|---|
| ExpTransform[T] | Apply exp(x) to slices |
| LogTransform[T] | Apply ln(x) to slices |
| SinTransform[T] | Apply sin(x) to slices |
| CosTransform[T] | Apply cos(x) to slices |
| TanhTransform[T] | Apply tanh(x) to slices |
| SigmoidTransform[T] | Apply 1/(1+e^-x) to slices |
| ErfTransform[T] | Apply erf(x) to slices |
All transforms are generic over hwy.Floats (float32, float64, Float16, BFloat16).
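
When validating a transform's output, a scalar reference written against the standard library can be handy. The sketch below is one such reference for the sigmoid formula in the table. `sigmoidRef` is a hypothetical helper, not part of the library, and it covers only float32 and float64 because plain Go has no Float16 or BFloat16 types.

```go
package main

import (
	"fmt"
	"math"
)

// sigmoidRef is a scalar reference for 1/(1+e^-x), generic over the two
// native float types. It is a hypothetical test helper, not a library function.
func sigmoidRef[T float32 | float64](in, out []T) {
	for i, x := range in {
		out[i] = T(1 / (1 + math.Exp(-float64(x))))
	}
}

func main() {
	in := []float32{-2, 0, 2}
	out := make([]float32, len(in))
	sigmoidRef(in, out)
	fmt.Println(out) // the middle element is exactly 0.5
}
```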
Low-Level Math (hwy/contrib/math):
| Function | Description |
|---|---|
| BaseExpVec[T] | Exponential on SIMD vectors |
| BaseLogVec[T], BaseLog2Vec[T], BaseLog10Vec[T] | Logarithm on SIMD vectors |
| BaseSinVec[T], BaseCosVec[T] | Trigonometric functions on SIMD vectors |
| BaseTanhVec[T], BaseSinhVec[T], BaseCoshVec[T] | Hyperbolic functions on SIMD vectors |
| BaseAsinhVec[T], BaseAcoshVec[T], BaseAtanhVec[T] | Inverse hyperbolic functions |
| BaseSigmoidVec[T] | Logistic function on SIMD vectors |
| BaseErfVec[T] | Error function on SIMD vectors |
| BaseExp2Vec[T], BasePowVec[T] | Power functions on SIMD vectors |
These are vector-register-level building blocks that hwygen generates into target-specific implementations (_avx2, _avx512, _neon, _fallback). All functions support float32, float64, Float16, and BFloat16 with ~4 ULP accuracy.
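
To make the ULP figure concrete, the sketch below measures the distance between two float32 values in units in the last place, which is one common way to check an accuracy bound like this. `orderedBits` and `ulpDiff` are hypothetical helpers and say nothing about how the library tests itself.

```go
package main

import (
	"fmt"
	"math"
)

// orderedBits maps a float32 onto an integer scale where adjacent
// representable values differ by exactly 1, including across zero.
func orderedBits(f float32) int64 {
	b := int64(math.Float32bits(f))
	if b&0x80000000 != 0 { // negative: mirror below zero
		return -(b & 0x7fffffff)
	}
	return b
}

// ulpDiff returns the distance between a and b in float32 ULPs.
// NaN and infinity are not handled in this sketch.
func ulpDiff(a, b float32) int64 {
	d := orderedBits(a) - orderedBits(b)
	if d < 0 {
		d = -d
	}
	return d
}

func main() {
	want := float32(math.Exp(1))
	got := math.Nextafter32(want, 10) // simulate a result that is off by 1 ULP
	fmt.Println(ulpDiff(got, want), ulpDiff(got, want) <= 4)
}
```

A test could compute `ulpDiff` between each SIMD result and a float64 reference rounded to float32, then assert that the maximum stays within the stated bound.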
| Package | Description |
|---|---|
| hwy/contrib/matmul | Matrix multiplication with SME/NEON acceleration |
| hwy/contrib/matvec | Matrix-vector multiplication |
| hwy/contrib/rabitq | RaBitQ SIMD operations for vector quantization (ANN search) |
| hwy/contrib/activation | Neural network activation functions |
| hwy/contrib/nn | Neural network primitives |
| hwy/contrib/loss | Loss functions |
| hwy/contrib/sort | SIMD-accelerated sorting algorithms |
| hwy/contrib/vec | Vector distance and similarity functions |
| hwy/contrib/bitpack | Bit packing/unpacking operations |
| hwy/contrib/quantize | Quantization operations |
| hwy/contrib/varint | Variable-length integer encoding |
| hwy/contrib/image | Image processing operations |
| hwy/contrib/wavelet | Wavelet transforms |
| hwy/contrib/gguf | GGUF file format support |
| hwy/contrib/workerpool | Worker pool utilities |
Generate optimized target-specific code from generic implementations:
go build -o bin/hwygen ./cmd/hwygen
./bin/hwygen -input mycode.go -output . -targets avx2,avx512,neon,fallback

Each target supports a generation mode suffix:
- neon (default GoSimd) - generates Go code calling hwy/asm package methods
- neon:asm - compiles the function to C, transpiles to Go assembly via GoAT, and generates //go:noescape wrappers. Use this for compute-heavy kernels where per-vector call overhead matters.
- neon:c - generates C source only (for inspection)
# GoSimd mode (default) — portable Go with asm package calls
./bin/hwygen -input dense.go -output . -targets avx2,avx512,neon,fallback
# Assembly mode — C → GoAT → bulk NEON assembly
./bin/hwygen -input matmul.go -output . -targets avx2,avx512,neon:asm,fallback

AVX2 and AVX-512 targets use Go 1.26's native simd/archsimd package directly. ARM64 targets (NEON, SVE) use the hwy/asm package because simd/archsimd does not yet support these architectures. The :asm mode is available for any target but is primarily useful for ARM64, where bulk assembly avoids per-vector function call overhead.
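
To illustrate how a `-targets` value with mode suffixes breaks down, here is a minimal sketch that splits such a list into architecture and mode pairs. It is not hwygen's actual parser: `targetSpec`, `parseTargets`, and the mode label "gosimd" for the default are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// targetSpec is a hypothetical parsed form of one -targets entry.
type targetSpec struct {
	Arch string // e.g. "neon"
	Mode string // "gosimd" (assumed label for the default), "asm", or "c"
}

// parseTargets splits a comma-separated list such as
// "avx2,avx512,neon:asm,fallback" into architecture/mode pairs.
func parseTargets(list string) []targetSpec {
	var specs []targetSpec
	for _, entry := range strings.Split(list, ",") {
		arch, mode, found := strings.Cut(strings.TrimSpace(entry), ":")
		if !found {
			mode = "gosimd" // no suffix: default mode
		}
		specs = append(specs, targetSpec{Arch: arch, Mode: mode})
	}
	return specs
}

func main() {
	for _, s := range parseTargets("avx2,avx512,neon:asm,fallback") {
		fmt.Printf("%-8s %s\n", s.Arch, s.Mode)
	}
}
```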
hwygen generates type-safe generic functions that automatically dispatch to the best implementation:
// Write once with generics
func BaseSoftmax[T hwy.Floats](input, output []T) {
// ... implementation using hwy.Load, hwy.Store, etc.
}
// hwygen generates:
// - BaseSoftmax_avx2, BaseSoftmax_avx2_Float64
// - BaseSoftmax_avx512, BaseSoftmax_avx512_Float64
// - BaseSoftmax_neon, BaseSoftmax_neon_Float64
// - BaseSoftmax_fallback, BaseSoftmax_fallback_Float64
// Plus a generic dispatcher:
func Softmax[T hwy.Floats](input, output []T) // dispatches by type
// And type-specific function variables:
var SoftmaxFloat32 func(input, output []float32)
var SoftmaxFloat64 func(input, output []float64)
// Tail handling is automatic - remaining elements that don't
// fit a full SIMD width are processed via the fallback path.

Usage:
// Generic API - works with any float type
data32 := []float32{1, 2, 3, 4}
out32 := make([]float32, 4)
softmax.Softmax(data32, out32)
data64 := []float64{1, 2, 3, 4}
out64 := make([]float64, 4)
softmax.Softmax(data64, out64)

See examples/gelu and examples/softmax for complete examples.
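
The function-variable dispatch described above can be pictured with a small stand-alone sketch. Everything here is hypothetical: the names are invented, the "accelerated" variant is just an unrolled scalar loop, and reading HWY_NO_SIMD in `init` is an assumption used only to show how a fallback might be forced. The generated code will differ, for example by checking CPU features.

```go
package main

import (
	"fmt"
	"os"
)

// sumFallback is the plain scalar implementation.
func sumFallback(xs []float32) float32 {
	var s float32
	for _, x := range xs {
		s += x
	}
	return s
}

// sumUnrolled stands in for an accelerated variant: it keeps four
// independent accumulators, then handles the tail one element at a time.
func sumUnrolled(xs []float32) float32 {
	var a, b, c, d float32
	i := 0
	for ; i+4 <= len(xs); i += 4 {
		a += xs[i]
		b += xs[i+1]
		c += xs[i+2]
		d += xs[i+3]
	}
	s := (a + b) + (c + d)
	for ; i < len(xs); i++ {
		s += xs[i]
	}
	return s
}

// SumFloat32 is the function variable that callers use.
var SumFloat32 = sumFallback

func init() {
	// Assumption: an unset HWY_NO_SIMD selects the faster variant.
	// Real dispatch code would also check CPU features here.
	if os.Getenv("HWY_NO_SIMD") == "" {
		SumFloat32 = sumUnrolled
	}
}

func main() {
	fmt.Println(SumFloat32([]float32{1, 2, 3, 4, 5}))
}
```

Selecting the implementation once at startup keeps the per-call cost to a single indirect call.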
For maximum performance on ARM64, hwygen can generate bulk assembly via the neon:asm target that processes entire arrays in a single call, eliminating per-vector function call overhead.
Requirements:
- GoAT - C to Go assembly transpiler (included at hwy/goat/)
# Install tool dependencies (GoAT is declared in go.mod)
go install tool
# Build hwygen
go build -o bin/hwygen ./cmd/hwygen

Generate assembly for ARM64 NEON:
# Using the :asm target suffix
./bin/hwygen -input matmul.go -output . -targets avx2,avx512,neon:asm,fallback -dispatch matmul
# Or for bulk element-wise operations
./bin/hwygen -bulk -input examples/gelu/gelu.go -output examples/gelu -targets neon -pkg gelu

The neon:asm target generates:
- C source files (intermediate, kept with -keepc)
- Go assembly (.s) via GoAT transpilation
- //go:noescape wrapper functions for slice-to-pointer conversion
- Dispatch override files (z_c_slices_*_neon_arm64.gen.go) that wire assembly into the dispatch table
Performance comparison (1024 elements on Apple M4 Max):
| Function | Per-Vector (GoSimd) | Bulk Assembly (neon:asm) | Speedup |
|---|---|---|---|
| GELU F32 | 67,581 ns | 577 ns | 117x |
| GELU F64 | 122,690 ns | 1,793 ns | 68x |
Assembly mode works best for compute-heavy kernels (matmul, cross-entropy loss, fused quantized ops) and pure element-wise operations. Functions with complex control flow or reduction operations (like softmax) are better suited to the default GoSimd mode.
# With SIMD acceleration
GOEXPERIMENT=simd go build ./...
# Fallback only (pure Go)
go build ./...
# Run tests
GOEXPERIMENT=simd go test ./...
# Force fallback path (for testing)
HWY_NO_SIMD=1 GOEXPERIMENT=simd go test ./...
# Disable SME dispatch on ARM64 (falls back to NEON)
HWY_NO_SME=1 go test ./...
# Disable SVE dispatch on ARM64 Linux (falls back to NEON)
HWY_NO_SVE=1 go test ./...
# Benchmarks
GOEXPERIMENT=simd go test -bench=. -benchmem ./hwy/contrib/algo/...
GOEXPERIMENT=simd go test -bench=. -benchmem ./hwy/contrib/math/...

| Architecture | SIMD Width | Backend | Status |
|---|---|---|---|
| AMD64 AVX2 | 256-bit | Go 1.26 simd/archsimd | Supported |
| AMD64 AVX-512 | 512-bit | Go 1.26 simd/archsimd | Supported |
| ARM64 NEON | 128-bit | hwy/asm (GoAT assembly) | Supported |
| ARM64 SVE (Darwin) | 512-bit fixed | hwy/asm (GoAT assembly) | Supported |
| ARM64 SVE (Linux) | Scalable | hwy/asm (GoAT assembly) | Supported |
| ARM64 SME | Scalable | hwy/asm (GoAT assembly) | Supported (matrix ops) |
| Pure Go | Scalar | - | Supported (fallback) |
AMD64 targets use Go 1.26's native simd/archsimd package. ARM64 targets use the hwy/asm package with GoAT-generated assembly because simd/archsimd does not yet support NEON or SVE. SME (Scalable Matrix Extension) provides dedicated matrix multiplication hardware on Apple Silicon M4 and newer ARM processors.
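
The fixed widths in the table translate directly into lane counts: a vector holds its width in bits divided by the element size in bits. The short sketch below works that out for the fixed-width rows. `lanesFor` is a hypothetical helper, and scalable targets are left out because their width is only known at run time.

```go
package main

import "fmt"

// lanesFor returns how many elements of elemBits fit in a vector of vectorBits.
func lanesFor(vectorBits, elemBits int) int {
	return vectorBits / elemBits
}

func main() {
	for _, w := range []int{128, 256, 512} {
		fmt.Printf("%d-bit: %d x float32, %d x float64\n",
			w, lanesFor(w, 32), lanesFor(w, 64))
	}
}
```

This matches the suffixes in names such as Sqrt_AVX2_F32x8 and Sqrt_AVX512_F32x16: eight float32 lanes at 256 bits and sixteen at 512 bits.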
Apache 2.0