Quick start

This approach is not recommended for production, but it is a very fast way to benchmark whether your application benefits from remapping its text and data sections to huge pages.

$ mkdir build && cd build && cmake .. -DMAKE_LD_PRELOAD_LIBRARY=1 && make
$ sudo bash -c "echo 100 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"
$ numactl -N0 env LD_PRELOAD=./libelfremapper.so ./your-application

History

Applications usually benefit from remapping the .text and .data ELF sections to huge pages. The speedup comes from a significant reduction in iTLB and dTLB misses. The approach isn't new; well-known implementations at the moment are:

libhugetlbfs: 'remap_segments' function
Google: 'RemapHugetlbText*' functions
Facebook: 'HugifyText' function
Intel: 'MoveRegionToLargePages' function

libhugetlbfs uses explicit huge pages (hugetlbfs), whereas Google/Facebook/Intel rely on transparent huge pages. The libhugetlbfs approach looks better: it depends less on the particular kernel allocation/defragmentation algorithm, so it provides more consistent results.

However, libhugetlbfs has several major drawbacks:

A bug with position independent executables (linked with the '--pie' parameter): libhugetlbfs/libhugetlbfs#49
It might unmap the heap segment, which immediately follows the data segment on popular operating systems (e.g. Linux).
It supports remapping at most 3 ELF segments.
No integration with the target application: it works silently during startup.
It requires a properly managed hugetlbfs mount point (for backward compatibility with older kernels).
It requires LOAD segments aligned to the huge page size (e.g. linked with common-page-size=2M max-page-size=2M).

Performance

Performance improves significantly for CPU-bound applications with large text/data sections (much larger than 2 MB). The technique was tested on MySQL server (https://github.com/mysql/mysql-server) in a cloud environment. The server consumes about 40-50 huge pages (~100 MB). CPU-bound scenarios become up to 10% faster in sysbench OLTP PS/RO (especially on very small x86_64 cloud instances with 1 vCPU and 2 GB RAM). The speedup on AArch64 CPUs is usually much greater; however, it should be measured in each particular case.

The analysis of iTLB and dTLB misses is covered in detail in the article [reference:1]; you can run this command before and after the remapping (Intel x86_64 processors):

$ perf stat -e cycles -e cpu/event=0x08,umask=0x10,name=dwalkcycles/ -e cpu/event=0x85,umask=0x10,name=iwalkcycles/ -e cpu/event=0x08,umask=0x01,name=dwalkmiss/ -e cpu/event=0x85,umask=0x01,name=iwalkmiss/ -e cpu/event=0xbc,umask=0x18,name=dloads/ -e cpu/event=0xbc,umask=0x28,name=iloads/ -p $app_pid sleep 30
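
One simple way to compare the before/after runs is the share of all cycles spent in page walks. The sketch below is not part of ElfRemapper: the struct and function names are hypothetical, and it assumes the `cycles`, `dwalkcycles`, and `iwalkcycles` values are copied by hand from the perf output.

```cpp
#include <cassert>
#include <cstdint>

// Counter values as printed by `perf stat` for the event names used in the
// command above (hypothetical container, filled in manually).
struct WalkCounters {
  std::uint64_t cycles;
  std::uint64_t dwalkcycles;
  std::uint64_t iwalkcycles;
};

// Percentage of all cycles spent in data and instruction page walks.
double WalkOverheadPercent(const WalkCounters& c) {
  assert(c.cycles > 0);
  return 100.0 * static_cast<double>(c.dwalkcycles + c.iwalkcycles) /
         static_cast<double>(c.cycles);
}
```

A lower percentage after the remapping suggests that fewer cycles are lost to TLB misses; the absolute numbers depend on the workload and the CPU.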

Implementation

ElfRemapper does the following steps:

Read the /proc/self/exe symbolic link to determine the application name
Read /proc/self/maps into memory and select the LOAD segments that belong to the application
mmap private anonymous memory with the size of the LOAD segment
mremap the LOAD segment to the previously mapped memory region
mmap private anonymous memory backed by huge pages at a fixed address: the region where the original LOAD segment was before the mremap
Copy the entire content of the LOAD segment to the huge pages
Unmap the old LOAD segment
Shift the break of the heap segment if it overlaps with the new huge page allocation
In case of errors, mremap the old LOAD segment back
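
The move/copy steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from elfremapper.cc: `RemapInPlace` and its parameters are hypothetical names, the heap-break adjustment and segment selection are left out, and `extra_flags` is a parameter so that the same code can be tried with regular pages (0) or with `MAP_HUGETLB`.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for mremap() and MREMAP_FIXED
#endif
#include <sys/mman.h>

#include <cassert>
#include <cstddef>
#include <cstring>

// Replaces the pages backing [addr, addr + len) with a fresh anonymous mapping
// at the same address and copies the old contents into it. The old mapping
// must be readable. With MAP_HUGETLB in `extra_flags`, addr and len must be
// multiples of the huge page size. Returns true on success.
bool RemapInPlace(void* addr, std::size_t len, int prot, int extra_flags) {
  assert(addr != nullptr && len > 0);

  // Reserve a scratch address range of the same size.
  void* scratch =
      mmap(nullptr, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (scratch == MAP_FAILED) return false;

  // Move the old pages onto the scratch range; this leaves a hole at `addr`.
  void* moved = mremap(addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, scratch);
  if (moved == MAP_FAILED) {
    munmap(scratch, len);
    return false;
  }

  // Map new memory exactly where the old pages used to be.
  void* fresh = mmap(addr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | extra_flags,
                     -1, 0);
  if (fresh == MAP_FAILED) {
    // Roll back: move the old pages to their original address.
    mremap(moved, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, addr);
    return false;
  }

  // Copy the contents, drop the old pages, then apply the wanted protection.
  std::memcpy(fresh, moved, len);
  munmap(moved, len);
  return mprotect(fresh, len, prot) == 0;
}
```

For a text segment the call would look like `RemapInPlace(start, len, PROT_READ | PROT_EXEC, MAP_HUGETLB)`. The calling code must not itself live in the range being moved, because that range is briefly unmapped between the mremap and the fixed mmap.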

Advantages:

Support for position independent code (--pie)
Heap segment preserved
Any number of LOAD segments can be remapped
Can be easily integrated into the application code (e.g. using an application configuration file to turn the feature on/off)
No hugetlbfs is needed (which implies no support for kernels < 2.6.32)
Application LOAD segments can have the default alignment (e.g. 4K); the algorithm merges the segments in that case. However, for security reasons it's better to link your application with 2 MB alignment for LOAD segments (see below)

Limitations:

Currently works with 2MB huge pages only
/proc filesystem is needed
Only Linux systems are supported (tested with kernels >= 5.4, GCC 10.3.0)
perf's default symbol resolution for unstripped ELF files stops working. The workaround is to use the perf JIT API (see below)
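
The /proc dependency comes from the first two implementation steps. A minimal sketch of that lookup is shown here; the type and function names are hypothetical and the real parser in elfremapper.cc may differ. It assumes the usual `start-end perms offset dev inode path` layout of a /proc/self/maps line.

```cpp
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct MapEntry {
  std::uint64_t start = 0;
  std::uint64_t end = 0;
  std::string perms;
  std::string path;  // empty for anonymous mappings
};

// Resolves the /proc/self/exe symbolic link; returns "" on failure.
std::string ReadExePath() {
  char buf[4096];
  const ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
  return n > 0 ? std::string(buf, static_cast<std::size_t>(n)) : std::string();
}

// Parses one line of /proc/self/maps; returns false if it is malformed.
bool ParseMapsLine(const std::string& line, MapEntry* out) {
  assert(out != nullptr);
  std::istringstream in(line);
  unsigned long long start = 0, end = 0;
  char dash = 0;
  std::string perms, offset, dev, inode;
  in >> std::hex >> start >> dash >> end >> perms >> offset >> dev >> inode;
  if (!in || dash != '-') return false;
  out->start = start;
  out->end = end;
  out->perms = perms;
  out->path.clear();
  std::getline(in >> std::ws, out->path);  // the rest of the line, if any
  return true;
}

// Keeps only the mappings whose path equals the executable path.
std::vector<MapEntry> MappingsOf(std::istream& maps, const std::string& exe) {
  std::vector<MapEntry> result;
  std::string line;
  MapEntry entry;
  while (std::getline(maps, line)) {
    if (ParseMapsLine(line, &entry) && entry.path == exe) {
      result.push_back(entry);
    }
  }
  return result;
}
```

In a real process the input would be `std::ifstream maps("/proc/self/maps")` (from `<fstream>`) together with `ReadExePath()`; taking a generic `std::istream` keeps the parser easy to exercise with sample text.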

Build options

The library can be built in two ways:

With the published API (default): the user must call the API explicitly from the application.
Without the API (CMake option MAKE_LD_PRELOAD_LIBRARY): the remapping runs automatically when the library is loaded. The main usage is via LD_PRELOAD, or the application can simply link against the library and do nothing else.

The second option is convenient for testing/benchmarking, e.g. when you want to try the library with your application without recompiling it.

The WITH_LTO option turns on the compiler's link-time optimization. LTO usually gives some performance boost, but the DSO code here is cold, so turning LTO on is justified mainly if:
- the DSO is bundled into a bigger project which uses LTO,
- it is used to expose rare and subtle errors

Maintenance

Linking with 2 MB alignment for LOAD segments:

GNU ld.bfd/ld.gold linker

-zcommon-page-size=0x200000 -zmax-page-size=0x200000

LLVM ld.lld linker
```
-zcommon-page-size=0x200000
```

Using perf JIT API:

Compile your application with debug symbols (-g)

Create a symbol map suitable for perf ($app is the application binary, $pid is the application PID):

nm --numeric-sort --print-size --demangle $app | awk '$4{print $1" "$2" "$4}' | grep -Ee"^0" > /tmp/perf-$pid.map

Run perf; it loads the symbols automatically from the /tmp/perf-$pid.map file

Huge page allocation:

The easiest way is to preallocate the necessary number of huge pages on each NUMA node, e.g. (node 0):
```
# echo 64 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
```
Alternatively, allocate huge pages on the fly using 'overcommit' (if memory is too fragmented, the huge page allocation might fail):
```
# echo 64 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages
```
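
To pick the number to write into `nr_hugepages`, it helps to estimate how many 2 MB pages a segment occupies once its range is widened to huge page boundaries. The helper below is a hypothetical sketch, not part of the library; the widening rule is an assumption about how an unaligned range would be covered.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kHugePageSize = 2ULL * 1024 * 1024;  // 2 MB

// Rounds an address down to the nearest 2 MB boundary.
constexpr std::uint64_t AlignDown(std::uint64_t addr) {
  return addr & ~(kHugePageSize - 1);
}

// Rounds an address up to the nearest 2 MB boundary.
constexpr std::uint64_t AlignUp(std::uint64_t addr) {
  return AlignDown(addr + kHugePageSize - 1);
}

// Number of 2 MB pages needed to cover [start, end) after widening the range
// to huge page boundaries.
std::uint64_t HugePagesFor(std::uint64_t start, std::uint64_t end) {
  assert(start <= end);
  if (start == end) return 0;
  return (AlignUp(end) - AlignDown(start)) / kHugePageSize;
}
```

For example, an aligned 100 MB range needs 50 pages, while a small range that straddles a 2 MB boundary needs 2. Summing the result over the LOAD segments gives a lower bound for the reservation on the node where the application runs.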

Code formatting:

$ clang-format-12 -i elfremapper.cc

Debugging:

Turn on verbose mode inside elfremapper.cc: VERBOSE_LEVEL = 1
Recompile the shared object
All debug messages are written to the logger hook as usual

Want the library to be silent? Pass nullptr as the logger hook.

Acknowledgements

Many thanks to:

Alexey Kopytov
Alexey Stroganov
Sergey Glushchenko
Sergey Vojtovich
Georgy Kirichenko
