Improve performance of FFI calls with struct parameters by rutenkolk · Pull Request #23 · IGJoshua/coffi

rutenkolk · 2025-07-02T18:29:50Z

Hi, in this pull request I propose some performance improvements that mainly target functions defined with ffi/defcfn that take struct arguments.

While developing a test application using the latest version of coffi, I noticed that performance took a big hit when repeatedly calling a function whose arguments are serialized via serdes defined with defstruct. Profiling showed that the majority of the time was spent in mem/size-of and mem/align-of.

The reason is that the serde wrapper for FFI functions called mem/alloc-instance and mem/serialize-into for non-primitive arguments, which results in calls to mem/size-of and mem/align-of, which in turn go through mem/c-layout. mem/c-layout can be a fairly expensive function, on top of being a multimethod, and here it is called multiple times for every argument.

One optimization proposed here is to memoize calls to mem/size-of and mem/align-of whose argument is not a MemoryLayout.

This improved performance, but unfortunately not by as much as I had hoped.

Therefore, another improvement proposed here is to generate a call to mem/alloc with the size and alignment baked in, instead of computing them through mem/alloc-instance every time the FFI call is made.

The next bottleneck was the call to mem/serialize-into, which had a similar issue to mem/alloc-instance: it needs to dispatch on the multimethod mem/type-dispatch with the serde descriptor.

Leveraging the serde registry introduced with the defstruct macro, we can allow for an inline solution, should one exist in the registry. Together with the addition of some type hints for defstruct serdes, this eliminated all calls to mem/size-of, mem/align-of and mem/type-dispatch altogether and is my last proposed optimization.

With these changes in place, the actual allocation of the segments in the confined arena becomes the dominant cost of the whole FFI call, suggesting that little further performance gain is available.

Unfortunately I don't have a rigorous benchmark for the impact of the improvements, but in my private raylib example I started with an fps of around 100 and ended somewhere over 4000.

rutenkolk · 2025-07-22T12:45:37Z

There were a few concerns with the proposals.

One was a scaling concern with memoizing size-of and align-of: even something like [::mem/array ::mem/int n] would generate a new cache entry for every n. I have reverted the memoization.
Another was that size-of and align-of were possibly not guaranteed to be deducible at macroexpansion time for the given type in inline-serde-wrapper. I added safeguards that dynamically check whether this is possible and only then use the precomputed values.
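The scaling concern with memoization can be illustrated with a minimal Java sketch (Java because coffi sits on top of the JDK's foreign-memory API). Everything in it is hypothetical: `MemoizedSizeOf`, its cache, and the list-shaped type descriptors only stand in for mem/size-of and descriptors such as [::mem/array ::mem/int n]. It is not the code that was reverted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a memoized size-of; not coffi's implementation. */
final class MemoizedSizeOf {
    /** Unbounded cache keyed by the full type descriptor. */
    static final Map<Object, Long> CACHE = new ConcurrentHashMap<>();

    /** Uncached size computation for "int" and ["array", elementType, n] descriptors. */
    static long computeSize(Object type) {
        if ("int".equals(type)) {
            return 4L;
        }
        if (type instanceof List) {
            List<?> parts = (List<?>) type;
            if (parts.size() == 3 && "array".equals(parts.get(0))) {
                long count = ((Number) parts.get(2)).longValue();
                return computeSize(parts.get(1)) * count;
            }
        }
        throw new IllegalArgumentException("unknown type descriptor: " + type);
    }

    /** Memoized lookup: every distinct descriptor gets its own cache entry. */
    static long sizeOf(Object type) {
        Long cached = CACHE.get(type);
        if (cached != null) {
            return cached;
        }
        long size = computeSize(type);
        CACHE.put(type, size);
        return size;
    }

    /** Hypothetical helper that builds an array descriptor. */
    static Object arrayOf(Object elementType, int n) {
        return Arrays.asList("array", elementType, n);
    }
}
```

Calling `sizeOf(arrayOf("int", n))` for n from 1 to 1000 leaves 1000 entries in `CACHE`, one per array length, even though each individual computation is cheap. That growth is unbounded as long as the cache is keyed by the whole descriptor.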

Further optimizations that are proposed now:

Inlining for alloc. It turns out that calls to alloc came with significant overhead due to calling long (RT.longCast) on the size argument indiscriminately. Inlining alloc, with the requirement that the inlined version produce a long, eliminates this overhead.
defcfn now inlines serde multimethods if they are available. This eliminates virtually all multimethod lookup when calling a function defined with defcfn, which is a big cost saving for functions taking struct arguments.

One point of contention here: I'm personally not perfectly happy with how this is implemented. The function objects themselves are inlined into the forms, which is a bit unclean, since functions unfortunately aren't really data in Clojure. It works, but it's not the perfect solution. Ideally we would wrap the whole expression in a let, define the functions there via get-method, and refer to them via the symbols introduced in the let. I'm not certain whether this is possible, or whether it should be done in inline-serde-wrapper itself or outside of it. It would also require a more major rework of this code, since we would have to know a priori which multimethods we want to call.

At this point, allocating memory, creating confined arenas, and doing bounds checks heavily dominate the cost of calling a function. This is already nice, but we can do a bit better. After looking into how exactly confined arenas work, it turns out they basically don't do any stack allocation, because you can allocate as much as you want, so they really only do a malloc behind the scenes. After a confined arena is used, though, the memory is released immediately, only for subsequent FFI calls to malloc again. It would be good to reuse memory, for FFI calls but also in general, but it can't be reused globally since different threads might call the same function.

Inspired by https://bugs.openjdk.org/browse/JDK-8348189, I propose the introduction of the thread-local macro and the thread-local-arena function, which returns a thread-local confined arena that reuses memory. The linked issue mentions a potential scaling issue if a lot of threads make FFI calls, especially in the context of virtual threads or short-lived ones. For that purpose I also introduce terminating-thread-local, which is the same thing but runs a cleanup action from a different thread on thread exit, mimicking the behavior of the JDK-internal class jdk.internal.misc.TerminatingThreadLocal. I consider them two separate things, since terminating-thread-local is slightly more expensive than a plain thread-local because it needs to do a bit of bookkeeping. With this, however, every thread-local-arena should be closed when the thread that initialized it exits, preventing a memory leak.

The actual Arena class, ThreadLocalConfinedArena, is initialized only once per thread and returned via thread-local-arena; it basically works via a SegmentAllocator.slicingAllocator on a buffer segment. It works with nested uses of thread-local-arena, and it allocates new buffers when it runs out of memory. Once it has been closed as many times as it was opened, it frees the memory and allocates a new buffer with an increased size based on how much total allocation was done during the last allocation cycle.
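The reuse scheme described above can be sketched in Java roughly as follows. This is a simplified illustration under loud assumptions: `ReusableBumpArena` and all of its members are hypothetical names, it hands out slices of a direct `ByteBuffer` rather than `MemorySegment` slices from `SegmentAllocator.slicingAllocator` (a deliberate swap so the sketch also runs on older JDKs), and its growth policy is only a guess at the behavior described, not the PR's ThreadLocalConfinedArena.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a reusable, nestable, thread-local bump allocator. */
final class ReusableBumpArena implements AutoCloseable {
    private static final ThreadLocal<ReusableBumpArena> PER_THREAD =
            ThreadLocal.withInitial(() -> new ReusableBumpArena(64));

    private ByteBuffer buffer; // backing memory, reused across cycles
    private int offset;        // bump pointer into the current buffer
    private int depth;         // number of open() calls not yet closed
    private int cycleBytes;    // bytes requested since the outermost open()

    private ReusableBumpArena(int capacity) {
        this.buffer = ByteBuffer.allocateDirect(capacity);
    }

    /** Returns this thread's arena and records one more nested use. */
    static ReusableBumpArena open() {
        ReusableBumpArena arena = PER_THREAD.get();
        arena.depth++;
        return arena;
    }

    /** Hands out a slice; alignment must be a power of two, relative to the buffer start. */
    ByteBuffer allocate(int size, int alignment) {
        int start = (offset + alignment - 1) & -alignment;
        if (start + size > buffer.capacity()) {
            // Out of space: switch to a fresh buffer. Earlier slices stay valid
            // because they still reference the old buffer.
            buffer = ByteBuffer.allocateDirect(Math.max(size, buffer.capacity() * 2));
            start = 0;
        }
        ByteBuffer view = buffer.duplicate();
        view.position(start);
        view.limit(start + size);
        offset = start + size;
        cycleBytes += size;
        return view.slice();
    }

    /** Capacity of the current backing buffer. */
    int capacity() {
        return buffer.capacity();
    }

    /** Only the outermost close resets the bump pointer, growing the buffer if the cycle needed more. */
    @Override
    public void close() {
        if (depth == 0) {
            throw new IllegalStateException("close() without matching open()");
        }
        if (--depth == 0) {
            if (cycleBytes > buffer.capacity()) {
                buffer = ByteBuffer.allocateDirect(cycleBytes);
            }
            // Unlike a real arena, this sketch does not invalidate old slices;
            // callers must simply stop using them after the outermost close.
            offset = 0;
            cycleBytes = 0;
        }
    }
}
```

In this sketch a nested `open()` returns the same per-thread instance, so an inner `close()` does not reset anything; only the call that brings the depth back to zero does. A cycle that requests more than the buffer holds ends with a larger buffer, so a later cycle of the same size can be served without another allocation.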

Before the thread-local-arena, there was a lot of actual allocation when calling a native function.

With the thread-local-arena, the allocation cost disappears. Now it is the calls to set and get on segments that dominate the cost.

Looking at memory profiling, NativeMemorySegmentImpl objects were previously created mostly through SegmentFactories.allocateSegment (which is an actual malloc call).

With thread-local-arena, NativeMemorySegmentImpl objects were created mostly through NativeMemorySegmentImpl.dup.

NativeMemorySegmentImpl.dup is probably the most efficient way to create a segment, as it is simply increasing the pointer of an existing address as per https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/foreign/NativeMemorySegmentImpl.java#L72

ertugrulcetin · 2026-02-03T10:06:34Z

@rutenkolk do you have a fork that merges all your improvement PRs? I'd like to test them locally.

ertugrulcetin · 2026-02-03T12:54:22Z

> my private raylib example I started with an fps of around 100 and ended somewhere over 4000.

@IGJoshua this seems like quite an improvement, what do you think?

IGJoshua · 2026-02-03T15:27:24Z

> @IGJoshua this seems like quite an improvement, what do you think?

Hey, this is on my radar and I intend to merge it! My main issue has been that my work has been very demanding and left me with little time for the in-depth review that I feel coffi deserves on the PRs that will be added to it. It's on my shortlist of tasks to do, I'm hoping to get to it soon :)

rutenkolk added 15 commits June 30, 2025 18:44

8197e4b memoize align-of and size-of
13ef74f inline serde wrapper allocation
2b51c44 add inline serialization for serdes
6d202ab add type hint to generated serialization
366b238 create fast path for calling align-of and size-of with memorylayouts
9ec94a6 revert align-of and size-of memoization
936929e precompute defstruct c-layout implementation
71a54c6 add safety macroexpansion check to size and align check
b3fe59e inline serde multimethod implementations
8ff6992 add inlining for alloc
599d537 improve typehinting for generating c-string deserialization
7f29e91 introduce thread local arena
beffc4e switch to thread local arena allocation for function calls
752e7e9 fix thread local arena to be thread local itself
9fbbff1 add typehint to thread-local-arena

rutenkolk mentioned this pull request Aug 15, 2025

Reference publicly available dependencies rutenkolk/coffimaker#2

Open

rutenkolk wants to merge 15 commits into IGJoshua:develop from rutenkolk:develop


3 participants
