A small grep-like CLI that counts lines matching a literal pattern, with an AVX-512 fast path written via avo.
sgrep PATTERN FILE # mmap the file (fast path; default --advise=hugepage)
sgrep PATTERN < FILE # read from stdin
sgrep --info # report whether AVX-512 is being used
sgrep --scalar ... # force the portable byte-loop, even on AVX hardware
sgrep --advise=H PATTERN F # H ∈ none|sequential|hugepage|willneed|populate
sgrep --bench-bw FILE # memory-bandwidth probe (no scan, no pattern)
The AVX-512 path activates automatically when both AVX512F and AVX512BW are
present (detected via golang.org/x/sys/cpu); on any other CPU sgrep falls
back to a bytes.Index loop with no behavioral change.
Intel Xeon Platinum 8168 (Skylake-SP, AVX-512BW), 1 GB corpus,
~665k matching lines, pattern needleneedle (12 bytes), warm page cache,
hyperfine 10 runs, --advise=none for sgrep so all four tools use the same
mmap strategy:
| Tool | Time | Throughput | Speedup vs scalar |
|---|---|---|---|
| sgrep AVX-512 | 231 ms | ~4.6 GB/s | 3.46× |
ripgrep (-c -F) |
262 ms | ~4.1 GB/s | 3.05× |
GNU grep (-c) |
707 ms | ~1.5 GB/s | 1.13× |
| sgrep scalar | 799 ms | ~1.3 GB/s | 1.00× |
Two clean tiers: SIMD (sgrep AVX, ripgrep) at ~4 GB/s, scalar (grep, sgrep scalar) at ~1.5 GB/s. sgrep edges ripgrep by ~13% — both use a two-anchor SIMD substring scan; the gap is at the level of register width (ZMM vs YMM) and per-chunk instruction count, not algorithm.
The scan loop measured in isolation (Go benchmarks, no I/O, no process startup):
BenchmarkScalar_16MB 1538 MB/s
BenchmarkAVX512_16MB 9661 MB/s (~6.3× scalar)
BenchmarkAVX512_64KB 13841 MB/s (cache-resident)
End-to-end numbers above are lower than the inner-loop numbers because every CLI invocation pays a fixed ~40 ms of process startup, mmap setup, and first-touch page fault overhead that doesn't scale with corpus size.
When the file is not in the page cache (first read after boot, or larger
than RAM), passing --advise=hugepage (the default in file mode) tells the
kernel to back the mapping with 2 MB pages. On Linux ≥ 5.x with THP enabled,
this batches reads into 2 MB chunks and cuts page faults by 512×, giving
~3.7× lower wall time vs the default mmap on a 256 MB cold-cache scan.
The hint is a no-op on warm cache. ripgrep and grep do not issue this hint,
so on cold-cache workloads larger than the page cache, sgrep can pull
significantly further ahead than the warm-cache numbers above suggest.
The single-core memory bandwidth ceiling on this VM (warm-cache, sequential
read) measured by sgrep --bench-bw is ~11 GB/s; the AVX kernel runs at
~88% of that in the in-process benchmark, so further wins live in the I/O
path (mmap setup, prefetch, multi-file batching), not the scan kernel.
For a pattern of length L (where 2 ≤ L ≤ 64), the kernel picks two anchor
bytes — pattern[0] and pattern[L-1] — and broadcasts each into a 64-byte
ZMM register. It then walks the haystack in 64-byte windows:
firstVec = broadcast(pattern[0]) // ZMM0
lastVec = broadcast(pattern[L-1]) // ZMM1
loop over i in [0, len(hay) - L - 64]:
chunkA = hay[i : i+64] // ZMM2
chunkB = hay[i+L-1 : i+L-1+64] // ZMM3
k1 = vpcmpeqb(chunkA, firstVec) // 64-bit mask
k2 = vpcmpeqb(chunkB, lastVec) // 64-bit mask
mask = k1 AND k2
if mask != 0:
return i + tzcnt(mask) // candidate
i += 64
Each vpcmpeqb produces one bit per byte of the input, so two compares + a
KAND give us a 64-bit map of positions where both anchors match
simultaneously. Verification (full memcmp of the candidate against the
pattern) and the trailing <64-byte window happen in Go. False positives are
rare with two anchor bytes, so the verification step is cheap.
This is a stripped-down version of the Hyperscan/sse4-strstr "two-byte
anchor" idea — see Wojciech Muła's writeup on
SIMD-friendly substring search
for the original technique.
The asm (scan_amd64.s) is generated from the avo program in gen/main.go.
Regenerate with:
go run ./gen/main.go -out scan_amd64.s -stubs scan_amd64.go -pkg sgrep
sgrep.go scalar CountLinesScalar
dispatch_amd64.go AVX dispatch + Go-side verification + tail handling
dispatch_other.go non-amd64 stub (everything routes to scalar)
scan_amd64.{s,go} avo-generated AVX-512 candidate scanner
gen/main.go avo program that generates the above
cmd/sgrep/main.go CLI
cmd/gencorpus/main.go deterministic corpus generator (for benchmarks)
bench/compare.sh hyperfine driver: sgrep × {AVX, scalar} vs grep vs rg
sgrep counts the number of lines containing at least one occurrence of the
pattern, identical to grep -c PATTERN FILE on a corpus without binary data.
A line that contains the pattern multiple times still counts once. A pattern
that itself contains a \n byte returns 0 (cannot match within one line).
- Pattern must be 2–64 bytes for the AVX path; outside that range it falls back to scalar.
- Literal bytes only; no regex.
- AVX-512 only — no AVX2 fallback. (AVX2 would need 32-byte windows and
pmovmskbinstead ofKMOVQ.)
go build ./cmd/sgrep # native
GOOS=linux GOARCH=amd64 go build -o sgrep ./cmd/sgrep # cross
go test ./... # tests
go test -bench=. -benchtime=2s -run='^$' . # in-process benchmarks
bench/compare.sh # CLI shootout