Skip to content

timothyandrew/sgrep

Repository files navigation

sgrep

A small grep-like CLI that counts lines matching a literal pattern, with an AVX-512 fast path written via avo.

sgrep PATTERN FILE          # mmap the file (fast path; default --advise=hugepage)
sgrep PATTERN < FILE        # read from stdin
sgrep --info                # report whether AVX-512 is being used
sgrep --scalar ...          # force the portable byte-loop, even on AVX hardware
sgrep --advise=H PATTERN F  # H ∈ none|sequential|hugepage|willneed|populate
sgrep --bench-bw FILE       # memory-bandwidth probe (no scan, no pattern)

The AVX-512 path activates automatically when both AVX512F and AVX512BW are present (detected via golang.org/x/sys/cpu); on any other CPU sgrep falls back to a bytes.Index loop with no behavioral change.

Benchmarks

Intel Xeon Platinum 8168 (Skylake-SP, AVX-512BW), 1 GB corpus, ~665k matching lines, pattern needleneedle (12 bytes), warm page cache, hyperfine 10 runs, --advise=none for sgrep so all four tools use the same mmap strategy:

Tool Time Throughput Speedup vs scalar
sgrep AVX-512 231 ms ~4.6 GB/s 3.46×
ripgrep (-c -F) 262 ms ~4.1 GB/s 3.05×
GNU grep (-c) 707 ms ~1.5 GB/s 1.13×
sgrep scalar 799 ms ~1.3 GB/s 1.00×

Two clean tiers: SIMD (sgrep AVX, ripgrep) at ~4 GB/s, scalar (grep, sgrep scalar) at ~1.5 GB/s. sgrep edges ripgrep by ~13% — both use a two-anchor SIMD substring scan; the gap is at the level of register width (ZMM vs YMM) and per-chunk instruction count, not algorithm.

The scan loop measured in isolation (Go benchmarks, no I/O, no process startup):

BenchmarkScalar_16MB    1538 MB/s
BenchmarkAVX512_16MB    9661 MB/s    (~6.3× scalar)
BenchmarkAVX512_64KB   13841 MB/s    (cache-resident)

End-to-end numbers above are lower than the inner-loop numbers because every CLI invocation pays a fixed ~40 ms of process startup, mmap setup, and first-touch page fault overhead that doesn't scale with corpus size.

Cold-cache bonus: --advise=hugepage

When the file is not in the page cache (first read after boot, or larger than RAM), passing --advise=hugepage (the default in file mode) tells the kernel to back the mapping with 2 MB pages. On Linux ≥ 5.x with THP enabled, this batches reads into 2 MB chunks and cuts page faults by 512×, giving ~3.7× lower wall time vs the default mmap on a 256 MB cold-cache scan. The hint is a no-op on warm cache. ripgrep and grep do not issue this hint, so on cold-cache workloads larger than the page cache, sgrep can pull significantly further ahead than the warm-cache numbers above suggest.

The single-core memory bandwidth ceiling on this VM (warm-cache, sequential read) measured by sgrep --bench-bw is ~11 GB/s; the AVX kernel runs at ~88% of that in the in-process benchmark, so further wins live in the I/O path (mmap setup, prefetch, multi-file batching), not the scan kernel.

How the AVX-512 kernel works

For a pattern of length L (where 2 ≤ L ≤ 64), the kernel picks two anchor bytes — pattern[0] and pattern[L-1] — and broadcasts each into a 64-byte ZMM register. It then walks the haystack in 64-byte windows:

   firstVec  = broadcast(pattern[0])           // ZMM0
   lastVec   = broadcast(pattern[L-1])         // ZMM1

   loop over i in [0, len(hay) - L - 64]:
       chunkA = hay[i      : i+64]             // ZMM2
       chunkB = hay[i+L-1  : i+L-1+64]         // ZMM3
       k1     = vpcmpeqb(chunkA, firstVec)     // 64-bit mask
       k2     = vpcmpeqb(chunkB, lastVec)      // 64-bit mask
       mask   = k1 AND k2
       if mask != 0:
           return i + tzcnt(mask)              // candidate
       i += 64

Each vpcmpeqb produces one bit per byte of the input, so two compares + a KAND give us a 64-bit map of positions where both anchors match simultaneously. Verification (full memcmp of the candidate against the pattern) and the trailing <64-byte window happen in Go. False positives are rare with two anchor bytes, so the verification step is cheap.

This is a stripped-down version of the Hyperscan/sse4-strstr "two-byte anchor" idea — see Wojciech Muła's writeup on SIMD-friendly substring search for the original technique.

The asm (scan_amd64.s) is generated from the avo program in gen/main.go. Regenerate with:

go run ./gen/main.go -out scan_amd64.s -stubs scan_amd64.go -pkg sgrep

Layout

sgrep.go            scalar CountLinesScalar
dispatch_amd64.go   AVX dispatch + Go-side verification + tail handling
dispatch_other.go   non-amd64 stub (everything routes to scalar)
scan_amd64.{s,go}   avo-generated AVX-512 candidate scanner
gen/main.go         avo program that generates the above
cmd/sgrep/main.go   CLI
cmd/gencorpus/main.go   deterministic corpus generator (for benchmarks)
bench/compare.sh    hyperfine driver: sgrep × {AVX, scalar} vs grep vs rg

Behavior matched against grep -c

sgrep counts the number of lines containing at least one occurrence of the pattern, identical to grep -c PATTERN FILE on a corpus without binary data. A line that contains the pattern multiple times still counts once. A pattern that itself contains a \n byte returns 0 (cannot match within one line).

Limitations

  • Pattern must be 2–64 bytes for the AVX path; outside that range it falls back to scalar.
  • Literal bytes only; no regex.
  • AVX-512 only — no AVX2 fallback. (AVX2 would need 32-byte windows and pmovmskb instead of KMOVQ.)

Building

go build ./cmd/sgrep                                      # native
GOOS=linux GOARCH=amd64 go build -o sgrep ./cmd/sgrep     # cross
go test ./...                                             # tests
go test -bench=. -benchtime=2s -run='^$' .                # in-process benchmarks
bench/compare.sh                                          # CLI shootout

About

Line-counting grep with an AVX-512 kernel via avo

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors