Faster c mem{cpy,set} by ludfjig · Pull Request #1473 · hyperlight-dev/hyperlight

ludfjig · 2026-05-26T22:09:00Z

Switching from musl to picolibc regressed C guest performance due to slower memcpy and memset. The picolibc x86 machine asm uses non-temporal stores when N>=256, which regresses our particular workloads. This PR removes machine/x86/memcpy.S and machine/x86/memset.S from the build and uses the generic C versions in string/memcpy.c and string/memset.c instead.

The currently checked-in asm files work like this:

size N	memcpy.S	memset.S
N < 16	rep movsb	rep stosb
16 <= N < 256	aligned rep movsq, tail rep movsb	aligned rep stosq, tail rep stosb
N >= 256	prefetchnta, 128 B unrolled movntiq, sfence	128 B unrolled movntiq, sfence
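The size thresholds in the table above can be restated as a small C classifier. This is only a sketch of the dispatch rule as described in this PR, not a translation of the assembly; the enum and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical labels for the three rows of the table above. */
enum copy_path {
    PATH_BYTE_REP,     /* N < 16: rep movsb / rep stosb */
    PATH_QWORD_REP,    /* 16 <= N < 256: aligned rep movsq / rep stosq, byte tail */
    PATH_NON_TEMPORAL  /* N >= 256: unrolled movntiq followed by sfence */
};

/* Returns which path a call of length n would take under the table's rules. */
static enum copy_path classify_len(size_t n)
{
    if (n < 16)
        return PATH_BYTE_REP;
    if (n < 256)
        return PATH_QWORD_REP;
    return PATH_NON_TEMPORAL;
}
```

Under this rule, any call of 256 bytes or more takes the non-temporal path regardless of whether the destination is about to be read again.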

After switching to string/memcpy.c and string/memset.c, the compiler lowers both to unrolled cached stores. memcpy.c becomes unrolled 64-bit mov loads paired with 64-bit mov stores, and memset.c becomes unrolled 16-byte movdqu stores via xmm0. Both use regular write-back cached stores with no size-dependent dispatch and no NT path.
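For illustration, a single-path copy and fill in plain C might look like the sketch below. These are hypothetical stand-ins, not the picolibc sources in string/memcpy.c and string/memset.c, and the exact instructions the compiler emits depend on the compiler and flags. The point is only that every length goes through the same ordinary loads and stores.

```c
#include <stddef.h>

/* Hypothetical stand-in for a generic C memcpy: one path for every length. */
static void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Ordinary loads and stores only; the compiler may unroll or widen them. */
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}

/* Hypothetical stand-in for a generic C memset: one path for every length. */
static void *sketch_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;

    for (size_t i = 0; i < n; i++)
        d[i] = (unsigned char)c;
    return dst;
}
```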

Hyperlight C guest workloads often read the destination of a memcpy or memset again shortly after the write. This is detrimental to performance with NT stores, because NT stores bypass the cache and the following read misses. For example, while building the flatbuffer result, the guest reads the source buffer that was just written. The guest then memcpys that flatbuffer into the shared output buffer, which is another read of recently written data. The host then reads the output buffer right after the call returns.

Here is some data on the number and sizes of memcpy and memset calls for three workloads:

Echo("hello"):

len range	memcpy_calls	memcpy_bytes	memset_calls	memset_bytes
1..1	1	1	2	2
2..3	1	2	2	5
4..7	6	27	2	11
8..15	2	16	2	23
16..31	3	56	2	47
32..63	2	64	2	95
64..127	5	400	2	176
TOTAL	20	566	14	359

24K_in_8K_out:

len range	memcpy_calls	memcpy_bytes	memset_calls	memset_bytes
1..1	1	1	2	2
2..3	1	2	2	5
4..7	1	4	2	11
8..15	4	42	3	31
16..31	3	56	2	47
32..63	2	64	2	95
64..127	3	248	2	191
128..255	2	256	2	383
256..511	2	512	2	767
512..1023	2	1024	2	1535
1024..2047	2	2048	2	3071
2048..4095	2	4096	2	6143
4096..8191	2	8192	2	12287
8192..16383	5	41088	1	8192
16384..32767	1	24576	1	24688
TOTAL	33	82209	29	57448

SetByteArrayToZero(8K):

len range	memcpy_calls	memcpy_bytes	memset_calls	memset_bytes
1..1	1	1	2	2
2..3	1	2	2	5
4..7	1	4	2	11
8..15	2	16	3	31
16..31	5	92	2	47
32..63	2	64	2	95
64..127	3	248	2	191
128..255	2	256	2	383
256..511	2	512	2	767
512..1023	2	1024	2	1535
1024..2047	2	2048	2	3071
2048..4095	2	4096	2	6143
4096..8191	2	8192	2	12287
8192..16383	6	49280	3	24692
TOTAL	33	65835	30	49260
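Per-bucket counts like those in the tables above could be collected with a power-of-two histogram along the following lines. This is a sketch under assumptions: the PR does not say how the numbers were gathered, and the struct and function names here are hypothetical.

```c
#include <stddef.h>

#define NUM_BUCKETS 32

/* Hypothetical per-function histogram: calls and bytes per length bucket. */
struct len_hist {
    unsigned long calls[NUM_BUCKETS];
    unsigned long bytes[NUM_BUCKETS];
};

/* Bucket k covers lengths 2^k .. 2^(k+1)-1, so 1..1 is bucket 0 and 4..7 is bucket 2. */
static unsigned bucket_of(size_t len)
{
    unsigned k = 0;

    while (len > 1) {
        len >>= 1;
        k++;
    }
    return k;
}

/* Records one call of the given length. */
static void record(struct len_hist *h, size_t len)
{
    unsigned k;

    if (len == 0)
        return; /* skipped here because the tables start at length 1 */
    k = bucket_of(len);
    if (k >= NUM_BUCKETS)
        k = NUM_BUCKETS - 1; /* clamp very large lengths into the last bucket */
    h->calls[k]++;
    h->bytes[k] += len;
}
```

A wrapper around memcpy or memset would call record with the length argument before forwarding to the real implementation, and the buckets would be printed once the workload finishes.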

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

Copilot

Pull request overview

This PR addresses a C-guest performance regression after switching to picolibc by avoiding picolibc’s x86 assembly memcpy/memset implementations (which use non-temporal stores for larger sizes) and instead building the generic C implementations that use cached stores—better matching Hyperlight’s “write then quickly reread” workloads.

Changes:

Add libc/string/memcpy.c and libc/string/memset.c to the compiled libc file set.
Remove libc/machine/x86/memcpy.S and libc/machine/x86/memset.S from the x86-specific compiled file set.

ludfjig added the kind/enhancement For PRs adding features, improving functionality, docs, tests, etc. label May 27, 2026

ludfjig changed the title ~~[test] Faster c mem{cpy,set}~~ Faster c mem{cpy,set} May 27, 2026

ludfjig force-pushed the fix_mem_nt branch from 4819be0 to 5725097 Compare May 28, 2026 18:41

Faster mem{cpy,set}

d450899

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

ludfjig force-pushed the fix_mem_nt branch from 5725097 to d450899 Compare June 3, 2026 17:22

ludfjig marked this pull request as ready for review June 4, 2026 00:07

Copilot AI review requested due to automatic review settings June 4, 2026 00:07

ludfjig requested review from andreiltd, danbugs, dblnz, devigned, jprendes, jsturtevant, simongdavies, squillace and syntactically as code owners June 4, 2026 00:07

Copilot started reviewing on behalf of ludfjig June 4, 2026 00:07 View session

Copilot AI reviewed Jun 4, 2026

Faster c mem{cpy,set} #1473
ludfjig wants to merge 1 commit into hyperlight-dev:main from ludfjig:fix_mem_nt
