go-highway

A portable SIMD abstraction library for Go, inspired by Google's Highway C++ library.

Write SIMD code once and run it on AVX2, AVX-512, ARM NEON, or a pure Go fallback.

Requirements

Go 1.26+
GOEXPERIMENT=simd for AMD64 hardware acceleration (uses native simd/archsimd package)
ARM64 uses hwy/asm with GoAT-generated assembly (no GOEXPERIMENT needed)

Installation

go get github.com/ajroetker/go-highway

Quick Start

package main

import (
    "fmt"
    "github.com/ajroetker/go-highway/hwy"
    "github.com/ajroetker/go-highway/hwy/contrib/algo"
)

func main() {
    // Load data into SIMD vectors
    data := []float32{1, 2, 3, 4, 5, 6, 7, 8}
    v := hwy.Load(data)

    // Vectorized operations
    doubled := hwy.Mul(v, hwy.Set[float32](2.0))
    sum := hwy.ReduceSum(doubled)

    fmt.Printf("Sum of doubled: %v\n", sum)

    // Transcendental functions using transforms
    output := make([]float32, len(data))
    algo.ExpTransform(data, output)
    fmt.Printf("Exp: %v\n", output)
}

Build and run:

GOEXPERIMENT=simd go run main.go

Features

Core Operations (`hwy` package)

These are fundamental SIMD operations that map directly to hardware instructions:

Category	Operations
Load/Store	`Load`, `LoadSlice`, `Store`, `StoreSlice`, `Set`, `Zero`, `MaskLoad`, `MaskStore`, `Load4`
Arithmetic	`Add`, `Sub`, `Mul`, `Div`, `Neg`, `Abs`, `Min`, `Max`, `FMA`, `MulAdd`
Math	`Sqrt`, `RSqrt`, `RSqrtNewtonRaphson`, `RSqrtPrecise`, `Pow`
Reduction	`ReduceSum`, `ReduceMin`, `ReduceMax`
Comparison	`Equal`, `NotEqual`, `LessThan`, `LessEqual`, `GreaterThan`, `GreaterEqual`
Conditional	`IfThenElse`, `IfThenElseZero`, `IfThenZeroElse`, `ZeroIfNegative`
Bitwise	`And`, `Or`, `Xor`, `Not`, `AndNot`, `ShiftLeft`, `ShiftRight`, `PopCount`, `ReverseBits`
Shuffle	`GetLane`, `Reverse`, `Reverse2`, `Reverse4`, `Reverse8`, `Broadcast`, `Iota`
Type Cast	`AsInt32`, `AsFloat32`, `AsInt64`, `AsFloat64`
Float Check	`IsNaN`, `IsInf`, `IsFinite`, `RoundToEven`
Utilities	`NumLanes`, `SignBit`, `Const`, `ConstValue`

Load and Store skip bounds checking so they stay cheap in performance-critical inner loops; the caller must guarantee that the slice is long enough. LoadSlice and StoreSlice are the safe variants and handle short slices.
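The split between unchecked and safe variants usually shows up as a "full blocks, then tail" loop. The sketch below models that contract in plain Go without calling the library: `lanes`, `scaleBlock`, and `scale` are hypothetical names, and the fixed width of 8 is an assumption chosen to match eight float32 lanes.

```go
package main

import "fmt"

// lanes is a hypothetical vector width (8 float32 lanes, as in a 256-bit register).
const lanes = 8

// scaleBlock models an unchecked full-width kernel: it assumes both slices
// hold at least `lanes` elements, which is the caller's responsibility.
func scaleBlock(dst, src []float32, k float32) {
	for i := 0; i < lanes; i++ {
		dst[i] = src[i] * k
	}
}

// scale runs scaleBlock over every full block and finishes the remaining
// elements with a scalar loop, so short or ragged slices are still handled.
func scale(dst, src []float32, k float32) {
	n := min(len(dst), len(src))
	i := 0
	for ; i+lanes <= n; i += lanes {
		scaleBlock(dst[i:i+lanes], src[i:i+lanes], k)
	}
	for ; i < n; i++ {
		dst[i] = src[i] * k
	}
}

func main() {
	src := []float32{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
	dst := make([]float32, len(src))
	scale(dst, src, 2)
	fmt.Println(dst)
}
```

The loop condition `i+lanes <= n` is what makes the unchecked block call safe; the scalar loop plays the role that the safe slice variants play for short inputs.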

Low-level SIMD functions for direct archsimd usage:

Sqrt_AVX2_F32x8, Sqrt_AVX2_F64x4 - Hardware sqrt (VSQRTPS/VSQRTPD)
Sqrt_AVX512_F32x16, Sqrt_AVX512_F64x8 - AVX-512 variants
PopCount_AVX2_*, PopCount_AVX512_* - Population count for bit manipulation

Extended Math (`hwy/contrib/algo` and `hwy/contrib/math` packages)

The extended math functions are split across two contrib subpackages:

Algorithm Transforms (hwy/contrib/algo):

Function	Description
`ExpTransform[T]`	Apply exp(x) to slices
`LogTransform[T]`	Apply ln(x) to slices
`SinTransform[T]`	Apply sin(x) to slices
`CosTransform[T]`	Apply cos(x) to slices
`TanhTransform[T]`	Apply tanh(x) to slices
`SigmoidTransform[T]`	Apply 1/(1+e^-x) to slices
`ErfTransform[T]`	Apply erf(x) to slices

All transforms are generic over hwy.Floats (float32, float64, Float16, BFloat16).
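When testing code that uses these transforms, a scalar reference is handy to compare against. The sketch below is such a reference for the sigmoid formula from the table; it is not the library's implementation. `Float` and `sigmoidRef` are hypothetical names, and the constraint covers only float32 and float64 rather than the full hwy.Floats set.

```go
package main

import (
	"fmt"
	"math"
)

// Float is a hypothetical constraint covering only the built-in float types;
// the library's hwy.Floats also includes Float16 and BFloat16.
type Float interface {
	float32 | float64
}

// sigmoidRef writes 1/(1+e^-x) for each input element into output.
// It processes min(len(input), len(output)) elements.
func sigmoidRef[T Float](input, output []T) {
	n := min(len(input), len(output))
	for i := 0; i < n; i++ {
		output[i] = T(1 / (1 + math.Exp(-float64(input[i]))))
	}
}

func main() {
	in := []float32{-2, 0, 2}
	out := make([]float32, len(in))
	sigmoidRef(in, out)
	fmt.Println(out)
}
```

Computing in float64 and converting at the end keeps the reference more accurate than the type under test, which is what a comparison baseline needs.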

Low-Level Math (hwy/contrib/math):

Function	Description
`BaseExpVec[T]`	Exponential on SIMD vectors
`BaseLogVec[T]`, `BaseLog2Vec[T]`, `BaseLog10Vec[T]`	Logarithm on SIMD vectors
`BaseSinVec[T]`, `BaseCosVec[T]`	Trigonometric functions on SIMD vectors
`BaseTanhVec[T]`, `BaseSinhVec[T]`, `BaseCoshVec[T]`	Hyperbolic functions on SIMD vectors
`BaseAsinhVec[T]`, `BaseAcoshVec[T]`, `BaseAtanhVec[T]`	Inverse hyperbolic functions
`BaseSigmoidVec[T]`	Logistic function on SIMD vectors
`BaseErfVec[T]`	Error function on SIMD vectors
`BaseExp2Vec[T]`, `BasePowVec[T]`	Power functions on SIMD vectors

These are vector-register-level building blocks that hwygen generates into target-specific implementations (_avx2, _avx512, _neon, _fallback). All functions support float32, float64, Float16, and BFloat16 with ~4 ULP accuracy.
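An accuracy figure quoted in ULPs can be checked by counting how many representable float32 values lie between a result and a reference. The sketch below shows one common way to do that; `orderedBits`, `ulpDiff`, and `maxULP` are hypothetical helpers, not part of the library, and the `candidate` function is a deliberately lossy stand-in rather than a real SIMD kernel. NaN inputs are not handled.

```go
package main

import (
	"fmt"
	"math"
)

// orderedBits maps a float32 to an integer whose ordering matches the
// ordering of the floats, so adjacent floats map to adjacent integers.
// Both +0 and -0 map to 0.
func orderedBits(f float32) int64 {
	b := int64(math.Float32bits(f))
	if b&0x80000000 != 0 { // sign bit set: negative value
		return -(b & 0x7fffffff)
	}
	return b
}

// ulpDiff returns the distance between a and b in units in the last place.
func ulpDiff(a, b float32) int64 {
	d := orderedBits(a) - orderedBits(b)
	if d < 0 {
		return -d
	}
	return d
}

// maxULP returns the worst ULP distance between candidate and reference
// over the sample points xs.
func maxULP(candidate, reference func(float32) float32, xs []float32) int64 {
	var worst int64
	for _, x := range xs {
		if d := ulpDiff(candidate(x), reference(x)); d > worst {
			worst = d
		}
	}
	return worst
}

func main() {
	reference := func(x float32) float32 { return float32(math.Exp(float64(x))) }
	// Lossy stand-in for a kernel under test: squares a rounded half-result.
	candidate := func(x float32) float32 {
		h := float32(math.Exp(float64(x) / 2))
		return h * h
	}
	xs := []float32{-4, -1, -0.5, 0, 0.5, 1, 4}
	fmt.Println("max ULP error:", maxULP(candidate, reference, xs))
}
```

In practice the sample points would cover the whole input range of interest, including values near zero and near overflow, since the worst error often sits at the edges.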

Additional Contrib Packages

Package	Description
`hwy/contrib/matmul`	Matrix multiplication with SME/NEON acceleration
`hwy/contrib/matvec`	Matrix-vector multiplication
`hwy/contrib/rabitq`	RaBitQ SIMD operations for vector quantization (ANN search)
`hwy/contrib/activation`	Neural network activation functions
`hwy/contrib/nn`	Neural network primitives
`hwy/contrib/loss`	Loss functions
`hwy/contrib/sort`	SIMD-accelerated sorting algorithms
`hwy/contrib/vec`	Vector distance and similarity functions
`hwy/contrib/bitpack`	Bit packing/unpacking operations
`hwy/contrib/quantize`	Quantization operations
`hwy/contrib/varint`	Variable-length integer encoding
`hwy/contrib/image`	Image processing operations
`hwy/contrib/wavelet`	Wavelet transforms
`hwy/contrib/gguf`	GGUF file format support
`hwy/contrib/workerpool`	Worker pool utilities

Code Generator (hwygen)

Generate optimized target-specific code from generic implementations:

go build -o bin/hwygen ./cmd/hwygen
./bin/hwygen -input mycode.go -output . -targets avx2,avx512,neon,fallback

Target Modes

Each target supports a generation mode suffix:

neon (default GoSimd) — generates Go code calling hwy/asm package methods
neon:asm — compiles the function to C, transpiles to Go assembly via GoAT, and generates //go:noescape wrappers. Use this for compute-heavy kernels where per-vector call overhead matters.
neon:c — generates C source only (for inspection)

# GoSimd mode (default) — portable Go with asm package calls
./bin/hwygen -input dense.go -output . -targets avx2,avx512,neon,fallback

# Assembly mode — C → GoAT → bulk NEON assembly
./bin/hwygen -input matmul.go -output . -targets avx2,avx512,neon:asm,fallback

AVX2 and AVX-512 targets use Go 1.26's native simd/archsimd package directly. ARM64 targets (NEON, SVE) use the hwy/asm package because simd/archsimd does not yet support these architectures. The :asm mode is available for any target but is primarily useful for ARM64 where bulk assembly avoids per-vector function call overhead.

Generic Dispatch

hwygen generates type-safe generic functions that automatically dispatch to the best implementation:

// Write once with generics
func BaseSoftmax[T hwy.Floats](input, output []T) {
    // ... implementation using hwy.Load, hwy.Store, etc.
}

// hwygen generates:
// - BaseSoftmax_avx2, BaseSoftmax_avx2_Float64
// - BaseSoftmax_avx512, BaseSoftmax_avx512_Float64
// - BaseSoftmax_neon, BaseSoftmax_neon_Float64
// - BaseSoftmax_fallback, BaseSoftmax_fallback_Float64

// Plus a generic dispatcher:
func Softmax[T hwy.Floats](input, output []T)  // dispatches by type

// And type-specific function variables:
var SoftmaxFloat32 func(input, output []float32)
var SoftmaxFloat64 func(input, output []float64)

// Tail handling is automatic - remaining elements that don't
// fit a full SIMD width are processed via the fallback path.

Usage:

// Generic API - works with any float type
data32 := []float32{1, 2, 3, 4}
out32 := make([]float32, 4)
softmax.Softmax(data32, out32)

data64 := []float64{1, 2, 3, 4}
out64 := make([]float64, 4)
softmax.Softmax(data64, out64)

See examples/gelu and examples/softmax for complete examples.
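The combination of a generic entry point and per-type function variables can be modelled in plain Go. The following is a minimal sketch of that pattern and not the code hwygen emits: `Floats`, `Double`, `doubleWide`, `doubleFallback`, and the `hasWide` flag are all hypothetical, and the constraint is limited to float32 and float64.

```go
package main

import "fmt"

// Floats is a hypothetical stand-in for hwy.Floats, limited to built-in types.
type Floats interface {
	float32 | float64
}

// doubleFallback is the portable scalar implementation.
func doubleFallback[T Floats](input, output []T) {
	n := min(len(input), len(output))
	for i := 0; i < n; i++ {
		output[i] = input[i] * 2
	}
}

// doubleWide stands in for a target-specific kernel: it handles full
// 4-element blocks and hands the tail to the scalar fallback.
func doubleWide[T Floats](input, output []T) {
	n := min(len(input), len(output))
	i := 0
	for ; i+4 <= n; i += 4 {
		output[i] = input[i] * 2
		output[i+1] = input[i+1] * 2
		output[i+2] = input[i+2] * 2
		output[i+3] = input[i+3] * 2
	}
	doubleFallback(input[i:n], output[i:n])
}

// Type-specific function variables, bound to the fallback by default.
var (
	DoubleFloat32 = doubleFallback[float32]
	DoubleFloat64 = doubleFallback[float64]
)

// selectImpl rebinds the function variables. hasWide is a hypothetical
// feature flag; a real dispatcher would derive it from CPU detection.
func selectImpl(hasWide bool) {
	if hasWide {
		DoubleFloat32 = doubleWide[float32]
		DoubleFloat64 = doubleWide[float64]
		return
	}
	DoubleFloat32 = doubleFallback[float32]
	DoubleFloat64 = doubleFallback[float64]
}

// Double is the generic entry point; it dispatches on the element type.
func Double[T Floats](input, output []T) {
	switch in := any(input).(type) {
	case []float32:
		DoubleFloat32(in, any(output).([]float32))
	case []float64:
		DoubleFloat64(in, any(output).([]float64))
	}
}

func main() {
	selectImpl(true)
	in := []float64{1, 2, 3, 4, 5}
	out := make([]float64, len(in))
	Double(in, out)
	fmt.Println(out)
}
```

Binding implementations to package-level variables means the selection cost is paid once, and callers that already know their element type can call the typed variable directly and skip the type switch.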

Assembly Mode (`neon:asm`)

For maximum performance on ARM64, hwygen's neon:asm target generates bulk assembly that processes an entire array in a single call, eliminating per-vector function call overhead.

Requirements:

GoAT - C to Go assembly transpiler (included at hwy/goat/)

# Install tool dependencies (GoAT is declared in go.mod)
go install tool

# Build hwygen
go build -o bin/hwygen ./cmd/hwygen

Generate assembly for ARM64 NEON:

# Using the :asm target suffix
./bin/hwygen -input matmul.go -output . -targets avx2,avx512,neon:asm,fallback -dispatch matmul

# Or for bulk element-wise operations
./bin/hwygen -bulk -input examples/gelu/gelu.go -output examples/gelu -targets neon -pkg gelu

The neon:asm target generates:

C source files (intermediate, kept with -keepc)
Go assembly (.s) via GoAT transpilation
//go:noescape wrapper functions for slice-to-pointer conversion
Dispatch override files (z_c_slices_*_neon_arm64.gen.go) that wire assembly into the dispatch table

Performance comparison (1024 elements on Apple M4 Max):

Function	Per-Vector (GoSimd)	Bulk Assembly (`neon:asm`)	Speedup
GELU F32	67,581 ns	577 ns	117x
GELU F64	122,690 ns	1,793 ns	68x

Assembly mode works best for compute-heavy kernels (matmul, cross-entropy loss, fused quantized ops) and pure element-wise operations. Functions with complex control flow or reduction operations (like softmax) are better suited to the default GoSimd mode.

Building

# With SIMD acceleration
GOEXPERIMENT=simd go build ./...

# Fallback only (pure Go)
go build ./...

# Run tests
GOEXPERIMENT=simd go test ./...

# Force fallback path (for testing)
HWY_NO_SIMD=1 GOEXPERIMENT=simd go test ./...

# Disable SME dispatch on ARM64 (falls back to NEON)
HWY_NO_SME=1 go test ./...

# Disable SVE dispatch on ARM64 Linux (falls back to NEON)
HWY_NO_SVE=1 go test ./...

# Benchmarks
GOEXPERIMENT=simd go test -bench=. -benchmem ./hwy/contrib/algo/...
GOEXPERIMENT=simd go test -bench=. -benchmem ./hwy/contrib/math/...
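The override variables above can be thought of as filters applied to the detected target before dispatch. The sketch below illustrates that idea only; how the library actually reads these variables is not shown in this document. `pickTarget` and the target strings are hypothetical, and treating any non-empty value as "set" is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// pickTarget applies the override variables to a detected target name.
// getenv is injected so the logic can be tested without touching the
// process environment. Any non-empty value counts as set (an assumption).
func pickTarget(detected string, getenv func(string) string) string {
	if getenv("HWY_NO_SIMD") != "" {
		return "fallback" // force the pure Go path
	}
	switch {
	case detected == "sme" && getenv("HWY_NO_SME") != "":
		return "neon" // SME disabled: fall back to NEON
	case detected == "sve" && getenv("HWY_NO_SVE") != "":
		return "neon" // SVE disabled: fall back to NEON
	}
	return detected
}

func main() {
	// "avx2" stands in for the result of real CPU feature detection.
	fmt.Println(pickTarget("avx2", os.Getenv))
}
```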

Supported Architectures

Architecture	SIMD Width	Backend	Status
AMD64 AVX2	256-bit	Go 1.26 `simd/archsimd`	Supported
AMD64 AVX-512	512-bit	Go 1.26 `simd/archsimd`	Supported
ARM64 NEON	128-bit	`hwy/asm` (GoAT assembly)	Supported
ARM64 SVE (Darwin)	512-bit fixed	`hwy/asm` (GoAT assembly)	Supported
ARM64 SVE (Linux)	Scalable	`hwy/asm` (GoAT assembly)	Supported
ARM64 SME	Scalable	`hwy/asm` (GoAT assembly)	Supported (matrix ops)
Pure Go	Scalar	—	Supported (fallback)

AMD64 targets use Go 1.26's native simd/archsimd package. ARM64 targets use the hwy/asm package with GoAT-generated assembly because simd/archsimd does not yet support NEON or SVE. SME (Scalable Matrix Extension) provides dedicated matrix multiplication hardware on Apple Silicon M4 and newer ARM processors.
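For the fixed-width rows in the table, the number of lanes per vector follows directly from the register width and the element size. The short sketch below works that out; `lanes` is a hypothetical helper, not a library function, and it does not apply to the scalable targets, whose width is only known at run time.

```go
package main

import "fmt"

// lanes returns how many elements of elemBytes bytes fit in a vector
// register that is vectorBits wide.
func lanes(vectorBits, elemBytes int) int {
	return vectorBits / (8 * elemBytes)
}

func main() {
	for _, w := range []int{128, 256, 512} {
		fmt.Printf("%d-bit: %d x float32, %d x float64\n", w, lanes(w, 4), lanes(w, 8))
	}
}
```

The results line up with the suffixes of the low-level function names listed earlier, such as F32x8 and F64x4 for AVX2 and F32x16 and F64x8 for AVX-512.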

License

Apache 2.0
