
Faster c mem{cpy,set} #1473

Open
ludfjig wants to merge 1 commit into hyperlight-dev:main from ludfjig:fix_mem_nt

@ludfjig ludfjig commented May 26, 2026

Switching from musl to picolibc regressed C guest performance due to slower memcpy and memset. The picolibc x86 machine asm uses non-temporal stores when N>=256, which regresses our particular workloads. This PR removes machine/x86/memcpy.S and machine/x86/memset.S from the build and uses the generic C versions in string/memcpy.c and string/memset.c instead.

The currently checked-in asm files work like this:

| Size N | memcpy.S | memset.S |
| --- | --- | --- |
| N < 16 | rep movsb | rep stosb |
| 16 <= N < 256 | aligned rep movsq, tail rep movsb | aligned rep stosq, tail rep stosb |
| N >= 256 | prefetchnta, 128 B unrolled movntiq, sfence | 128 B unrolled movntiq, sfence |
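The size thresholds in the table can be restated as a small C sketch. This is only an illustration of the dispatch described above, not picolibc's source; the names `copy_path` and `classify_copy` are hypothetical.

```c
#include <stddef.h>

/* Hypothetical labels for the three paths listed in the table. */
enum copy_path {
    PATH_BYTE_REP,     /* N < 16: rep movsb / rep stosb */
    PATH_QWORD_REP,    /* 16 <= N < 256: aligned rep movsq / rep stosq plus byte tail */
    PATH_NON_TEMPORAL  /* N >= 256: unrolled movntiq stores followed by sfence */
};

/* Return the path the asm versions would take for a request of n bytes. */
enum copy_path classify_copy(size_t n) {
    if (n < 16) {
        return PATH_BYTE_REP;
    }
    if (n < 256) {
        return PATH_QWORD_REP;
    }
    return PATH_NON_TEMPORAL;
}
```

Under this classification, every call of 256 bytes or more lands on the non-temporal path.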

After switching to string/memcpy.c and string/memset.c, the compiler lowers both to unrolled cached stores: memcpy.c becomes unrolled 64-bit mov loads paired with 64-bit mov stores, and memset.c becomes unrolled 16-byte movdqu stores via xmm0. Both use regular write-back cached stores, with no size-dependent dispatch and no NT path.
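For readers unfamiliar with the generic style, here is a minimal sketch of a word-at-a-time copy and a plain fill written in portable C. It is not the code in string/memcpy.c or string/memset.c; `sketch_memcpy` and `sketch_memset` are hypothetical names, and the fixed-size `memcpy` calls are used only as a portable way to express unaligned 64-bit loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes using ordinary (cached) 64-bit loads and stores, then a byte tail. */
void *sketch_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= sizeof(uint64_t)) {
        uint64_t w;
        memcpy(&w, s, sizeof w);  /* fixed size: typically a single 64-bit load */
        memcpy(d, &w, sizeof w);  /* fixed size: typically a single 64-bit store */
        d += sizeof w;
        s += sizeof w;
        n -= sizeof w;
    }
    while (n--) {
        *d++ = *s++;              /* remaining 0..7 bytes */
    }
    return dst;
}

/* Fill n bytes with ordinary stores; the compiler is free to widen and unroll. */
void *sketch_memset(void *dst, int c, size_t n) {
    unsigned char *d = dst;
    while (n--) {
        *d++ = (unsigned char)c;
    }
    return dst;
}
```

Neither function branches on size or issues non-temporal stores, so written lines stay in the cache hierarchy under the normal write-back policy.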

Hyperlight C guest workloads often read the destination of a memcpy or memset again shortly after the write. This is detrimental to performance with NT stores, because NT stores bypass the cache and the following read misses. For example, while building the flatbuffer result, the guest reads the source buffer that was just written. The guest then memcpys that flatbuffer into the shared output buffer, which is another read of recently written data. The host then reads the output buffer right after the call returns.
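The write-then-reread pattern can be sketched as follows. This is an illustrative example, not code from the PR: `fill_cached`, `fill_nt`, and `sum_words` are hypothetical helpers, and the non-temporal variant assumes an x86-64 target where the SSE2 intrinsic `_mm_stream_si64` is available (it falls back to ordinary stores elsewhere). Both fills produce identical memory contents; only the caching behavior of the stores differs.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_NT_STORES 1
#else
#define HAVE_NT_STORES 0
#endif

/* Fill n 64-bit words with ordinary write-back (cached) stores. */
void fill_cached(long long *dst, long long value, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = value;
    }
}

/* Fill n 64-bit words with non-temporal stores where available.
 * NT stores bypass the cache, so a read that follows soon after
 * has to fetch the data from memory again. */
void fill_nt(long long *dst, long long value, size_t n) {
#if HAVE_NT_STORES
    for (size_t i = 0; i < n; i++) {
        _mm_stream_si64(&dst[i], value);
    }
    _mm_sfence();  /* order the NT stores before later loads and stores */
#else
    fill_cached(dst, value, n);  /* assumption: no NT path on this target */
#endif
}

/* The "reread": consume the buffer right after it was written. */
unsigned long long sum_words(const long long *src, size_t n) {
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += (unsigned long long)src[i];
    }
    return total;
}
```

Timing `sum_words` immediately after each fill, on a buffer that would otherwise fit in cache, is one way to observe the difference; no measurements are claimed here.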

Here is some data on the number and sizes of memcpy and memset calls for three workloads:

Echo("hello"):

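| len range | memcpy_calls | memcpy_bytes | memset_calls | memset_bytes |
| --- | ---: | ---: | ---: | ---: |
| 1..1 | 1 | 1 | 2 | 2 |
| 2..3 | 1 | 2 | 2 | 5 |
| 4..7 | 6 | 27 | 2 | 11 |
| 8..15 | 2 | 16 | 2 | 23 |
| 16..31 | 3 | 56 | 2 | 47 |
| 32..63 | 2 | 64 | 2 | 95 |
| 64..127 | 5 | 400 | 2 | 176 |
| TOTAL | 20 | 566 | 14 | 359 |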

24K_in_8K_out:

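| len range | memcpy_calls | memcpy_bytes | memset_calls | memset_bytes |
| --- | ---: | ---: | ---: | ---: |
| 1..1 | 1 | 1 | 2 | 2 |
| 2..3 | 1 | 2 | 2 | 5 |
| 4..7 | 1 | 4 | 2 | 11 |
| 8..15 | 4 | 42 | 3 | 31 |
| 16..31 | 3 | 56 | 2 | 47 |
| 32..63 | 2 | 64 | 2 | 95 |
| 64..127 | 3 | 248 | 2 | 191 |
| 128..255 | 2 | 256 | 2 | 383 |
| 256..511 | 2 | 512 | 2 | 767 |
| 512..1023 | 2 | 1024 | 2 | 1535 |
| 1024..2047 | 2 | 2048 | 2 | 3071 |
| 2048..4095 | 2 | 4096 | 2 | 6143 |
| 4096..8191 | 2 | 8192 | 2 | 12287 |
| 8192..16383 | 5 | 41088 | 1 | 8192 |
| 16384..32767 | 1 | 24576 | 1 | 24688 |
| TOTAL | 33 | 82209 | 29 | 57448 |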

SetByteArrayToZero(8K):

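| len range | memcpy_calls | memcpy_bytes | memset_calls | memset_bytes |
| --- | ---: | ---: | ---: | ---: |
| 1..1 | 1 | 1 | 2 | 2 |
| 2..3 | 1 | 2 | 2 | 5 |
| 4..7 | 1 | 4 | 2 | 11 |
| 8..15 | 2 | 16 | 3 | 31 |
| 16..31 | 5 | 92 | 2 | 47 |
| 32..63 | 2 | 64 | 2 | 95 |
| 64..127 | 3 | 248 | 2 | 191 |
| 128..255 | 2 | 256 | 2 | 383 |
| 256..511 | 2 | 512 | 2 | 767 |
| 512..1023 | 2 | 1024 | 2 | 1535 |
| 1024..2047 | 2 | 2048 | 2 | 3071 |
| 2048..4095 | 2 | 4096 | 2 | 6143 |
| 4096..8191 | 2 | 8192 | 2 | 12287 |
| 8192..16383 | 6 | 49280 | 3 | 24692 |
| TOTAL | 33 | 65835 | 30 | 49260 |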
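
The length ranges in these tables are power-of-two buckets. How the numbers were actually collected is not stated; the following is a minimal sketch of one way such a histogram could be gathered, assuming a hook that is called with each request length. `size_hist`, `bucket_index`, and `hist_record` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_BUCKETS 64  /* assumption: size_t is at most 64 bits wide */

/* Per-bucket call and byte counters; bucket i covers 2^i .. 2^(i+1)-1. */
struct size_hist {
    uint64_t calls[NUM_BUCKETS];
    uint64_t bytes[NUM_BUCKETS];
};

/* floor(log2(len)) for len >= 1, so 1 -> 0, 2..3 -> 1, 4..7 -> 2, ... */
unsigned bucket_index(size_t len) {
    unsigned i = 0;
    while (len >>= 1) {
        i++;
    }
    return i;
}

/* Record one call of the given length; zero-length calls are ignored. */
void hist_record(struct size_hist *h, size_t len) {
    if (len == 0) {
        return;
    }
    unsigned i = bucket_index(len);
    h->calls[i] += 1;
    h->bytes[i] += len;
}
```

With this bucketing, every bucket from index 8 upward (256 bytes and larger) corresponds to calls that would have taken the non-temporal path in the asm versions.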

@ludfjig ludfjig added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label May 27, 2026
@ludfjig ludfjig changed the title [test] Faster c mem{cpy,set} Faster c mem{cpy,set} May 27, 2026
Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
@ludfjig ludfjig marked this pull request as ready for review June 4, 2026 00:07
Copilot AI review requested due to automatic review settings June 4, 2026 00:07

Copilot AI left a comment

Pull request overview

This PR addresses a C-guest performance regression after switching to picolibc by avoiding picolibc’s x86 assembly memcpy/memset implementations (which use non-temporal stores for larger sizes) and instead building the generic C implementations that use cached stores—better matching Hyperlight’s “write then quickly reread” workloads.

Changes:

  • Add libc/string/memcpy.c and libc/string/memset.c to the compiled libc file set.
  • Remove libc/machine/x86/memcpy.S and libc/machine/x86/memset.S from the x86-specific compiled file set.
