peepopt 🐣

peepopt recompiles x86-64 binaries using peephole optimization to take advantage of instructions available in newer processors. This improves performance and reduces power consumption in some situations.

Background

When compiling a program one must decide which processor family to target, e.g., x86-64, ARMv8. They may further specialize to a subset of processors, e.g., Intel Alder Lake or newer. Most Linux distributions compile binaries for a least-common denominator profile, e.g., x86-64 v1 in Fedora, x86-64 v3 in RHEL 10. Some distributions like Gentoo can compile from source to target a more specific processor and unlock additional performance. peepopt applies inexpensive peephole optimizations that reclaim some of this performance without expensive full-program compilation.

Shift left example

Consider a C function:

uint32_t shift(uint32_t x, uint32_t y)
{
    return x << y;
}

Shifting left with x86-64-v1

The sall instruction only takes two operands which requires movl instructions to set up the input registers:

89F8           movl %edi,%eax
89F1           movl %esi,%ecx
D3E0           sall %cl,%eax
C3             ret

Shifting left with x86-64-v3 (BMI2)

The shlx instruction takes three operands which allows more flexibility and does not require movls:

C4E249F7C7     shlx %esi,%edi,%eax
C3             ret

Note that this is not equivalent to the former example since sall explicitly writes to %cl and implicitly writes to EFLAGS. When rewriting instructions peepopt examines subsequent instructions to ensure that they would not observe the replacement.

Transformations

peepopt currently implements three families of rewrites, all within the x86-64-v3 feature level (BMI1, BMI2, AVX). Every replacement fits within the bytes of the original instructions — shorter replacements are padded with no-ops — and is refused when a branch targets the interior of the rewritten range.

Variable shifts: SHL/SHR/SAR → SHLX/SHRX/SARX (BMI2)

The legacy shifts take a variable count only in %cl, so compilers emit a mov to stage it:

4889F1         mov %rsi,%rcx
48D3E0         shl %cl,%rax

The BMI2 shifts take the count from any register and do not write EFLAGS:

C4E2C9F7C0     shlx %rsi,%rax,%rax

After the rewrite %rcx no longer holds the count and EFLAGS no longer holds the shift's result flags, so peepopt scans forward to prove both are overwritten before any read. The pass can also absorb a second mov supplying the shifted value — producing the fully three-operand form shown in the example above — and when the count-setting mov is not adjacent to the shift it repacks the instructions between them downward so the pair becomes contiguous.

NOT + AND → ANDN (BMI1)

An and-not computation dst = ~a & b requires two instructions:

48F7D7         not %rdi
4821F7         and %rsi,%rdi

andn computes it in one:

C4E2C0F2FE     andn %rsi,%rdi,%rdi

Both orderings are handled. Inverting the AND destination (shown) leaves no register holding a stale value. Inverting the AND source (not %rsi; and %rsi,%rdi → andn %rdi,%rsi,%rdi) leaves %rsi holding its original rather than inverted value, so it is only rewritten when %rsi is overwritten before any later read. In both cases andn leaves PF undefined where and defines it, so PF must also be dead.

SSE copy + op → AVX three-operand VEX form

Two-operand SSE instructions destroy their first source, so compilers copy values that are still needed:

660F28D0       movapd %xmm0,%xmm2
660F58D1       addpd %xmm1,%xmm2

The VEX-encoded forms of the same operations take a separate destination, folding the copy away:

C5F958D1       vaddpd %xmm1,%xmm0,%xmm2

93 scalar and packed opcodes are mapped: floating-point arithmetic and min/max, logic, integer arithmetic with and without saturation, multiplies, averages, compares, packs and unpacks, and shifts. A memory source folds into the VEX form the same way (movapd %xmm4,%xmm5; addsd 8(%rax),%xmm5 → vaddsd 8(%rax),%xmm4,%xmm5), recomputing RIP-relative displacements for the new instruction location. Unlike the integer rewrites this needs no forward analysis: neither instruction touches EFLAGS and every register holds the same value afterward.

Optimizing existing binaries

Currently peepopt only does the simple replacements described above that can be done without increasing or decreasing the number of instruction bytes. Unused bytes are padded with no-ops which may seem wasteful but processors discard them early during execution. Further the instructions represent fewer and simpler micro-operations which increase instruction cache hit rates and reduce execution overhead.

Benchmarks

Anecdotally using the x86-64-v3 profile improves performance by a few percent:

Arch x86-64v3 proposal
- Claims 9.9% improvement for Firefox
CentOS investigation
Ubuntu x86-64-v3 packages
- Claims 1% improvement for most packages and more for numerical programs
- Mixed results for desktops
- More positive results for servers

TODO: run benchmarks for Firefox and GCC

Compilation

First install the Intel x86 encoder decoder:

git clone https://github.com/intelxed/xed.git xed
git clone https://github.com/intelxed/mbuild.git mbuild
cd xed
./mfile.py install --install-dir=kits/xed-install

Next build peepopt:

git clone https://github.com/gaul/peepopt.git peepopt
cd peepopt
XED_PATH=/path/to/xed make all

Usage

peepopt --dry-run program_file
- Show which replacements peepopt would do
peepopt [--verbose] program_file
- Optimize the input binary with replacement instructions
peepopt --dry-run --stats program_file
- Also print a histogram of why each candidate site was or was not rewritten

Future directions

10-15 byte no-ops - optimal on Sandy Bridge and newer only but Atom and Zen perform poorly
APX - expanded three-operand and no flag instructions, supported by Panther Lake and newer processors
BMI - more flexible bit manipulations
FSRM - improve memory copies on Ice Lake and newer processors
- difficult replacement due more complicated register usage
inline compiler builtins, e.g., popcount
inline indirect functions

Distributions

peepopt could automatically run during distribution package installs. This will require plugins for package managers like apt and dnf.

License

Licensed under the Apache License, Version 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
main.c		main.c
peepopt.c		peepopt.c
peepopt.h		peepopt.h
peepopt_test.c		peepopt_test.c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

peepopt 🐣

Background

Shift left example

Shifting left with x86-64-v1

Shifting left with x86-64-v3 (BMI2)

Transformations

Variable shifts: SHL/SHR/SAR → SHLX/SHRX/SARX (BMI2)

NOT + AND → ANDN (BMI1)

SSE copy + op → AVX three-operand VEX form

Optimizing existing binaries

Benchmarks

Compilation

Usage

Future directions

Distributions

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

peepopt 🐣

Background

Shift left example

Shifting left with x86-64-v1

Shifting left with x86-64-v3 (BMI2)

Transformations

Variable shifts: SHL/SHR/SAR → SHLX/SHRX/SARX (BMI2)

NOT + AND → ANDN (BMI1)

SSE copy + op → AVX three-operand VEX form

Optimizing existing binaries

Benchmarks

Compilation

Usage

Future directions

Distributions

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages