Improve performance of FFI calls with struct parameters #23
rutenkolk wants to merge 15 commits into IGJoshua:develop
There were a few concerns with the proposals.
Further optimizations that are proposed now:
One point of contention here: I'm personally not perfectly happy with how this is implemented. The function objects themselves are inlined into the forms, which is a bit unclean, since functions unfortunately aren't really data in Clojure. It works, but it's not the perfect solution. Ideally we would wrap the whole expression with a
Before the thread-local arena there was a lot of actual allocation when calling a native function: But with the Looking at memory profiling, the creation of and with
@rutenkolk do you have a fork that merges all your improvement PRs? I'd like to test them locally.
@IGJoshua this seems like quite an improvement, what do you think?
Hey, this is on my radar and I intend to merge it! My main issue has been that my work has been very demanding and left me with little time for the in-depth review that I feel coffi deserves on the PRs that will be added to it. It's on my shortlist of tasks to do, I'm hoping to get to it soon :)




Hi,

In this pull request I propose some performance improvements that mainly target `ffi/defcfn`-defined functions taking struct arguments.

While developing a test application using the latest version of coffi, I noticed that performance took a big hit when repeatedly calling a function whose arguments are serialized via `defstruct`-defined serdes. Profiling showed that the majority of the time is spent in `mem/size-of` and `mem/align-of`.

The reason for this is that the serde wrapper for FFI functions called `mem/alloc-instance` and `mem/serialize-into` for non-primitive arguments, which results in calls to `mem/size-of` and `mem/align-of`, which in turn go through `mem/c-layout`. `mem/c-layout` can be a pretty expensive function on top of being a multimethod, but here it is called multiple times for every argument.

One optimization proposed here is to memoize calls to `mem/size-of` and `mem/align-of` whose argument is not a `MemoryLayout`.

This improved performance, but unfortunately not by as much as I had hoped.
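To make the memoization idea concrete, here is a minimal sketch in plain Clojure. It does not use coffi: `slow-size-of`, `fast-size-of`, `primitive-sizes`, `layout-calls` and the `[:struct ...]` descriptor shape are hypothetical stand-ins, padding is ignored, and the sketch skips the PR's restriction to arguments that are not a `MemoryLayout`.

```clojure
;; Hypothetical stand-ins; none of these names come from coffi.
(def layout-calls (atom 0))

(def primitive-sizes {:int 4, :float 4, :double 8, :long 8})

(defn- compute-size
  "Sums field sizes of a descriptor such as
  [:struct [[:x :float] [:y :float]]]. Padding is ignored."
  [type]
  (if (keyword? type)
    (primitive-sizes type)
    (let [[_ fields] type]
      (reduce + (map (fn [[_ field-type]] (compute-size field-type))
                     fields)))))

(defn slow-size-of
  "Stands in for an expensive layout computation and counts how
  often it actually runs."
  [type]
  (swap! layout-calls inc)
  (compute-size type))

;; memoize caches results keyed by argument value, so a repeated
;; descriptor only pays for the computation once.
(def fast-size-of (memoize slow-size-of))

(def vec3 [:struct [[:x :float] [:y :float] [:z :float]]])

(dotimes [_ 1000]
  (fast-size-of vec3))

(println (fast-size-of vec3) @layout-calls)
;; prints: 12 1
```

The cache lookup still hashes the descriptor on every call, which may be part of why memoization alone helps less than removing the call entirely.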
Therefore, another improvement proposed here is generating a call to `mem/alloc` with the size and alignment baked in, instead of computing them via `mem/alloc-instance` every time the FFI call is made.

The next bottleneck was actually the call to `mem/serialize-into`, which had a similar issue to `mem/alloc-instance`, needing to dispatch on the multimethod `mem/type-dispatch` with the serde descriptor.

Leveraging the serde registry introduced with the `defstruct` macro, we can allow for an inline solution, should one exist in the registry. Together with the addition of some type hints for `defstruct` serdes, this eliminated all calls to `mem/size-of`, `mem/align-of` and `mem/type-dispatch` altogether and is my last proposed optimization.

With these changes in place, the actual allocation of the segments in the confined arena becomes the dominant cost of the whole FFI call, suggesting there is little room for further performance gains.

I unfortunately don't have a rigorous benchmark for the impact of the improvements, but in my private raylib example I started with an fps of around 100 and ended somewhere over 4000.
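The "size and alignment baked in" optimization described above can be illustrated with a small macro sketch. This is plain Clojure, not coffi's implementation: `size-of`, `align-of`, `alloc` and `alloc-baked` are hypothetical stand-ins, the type descriptor is assumed to be a literal known at macro-expansion time, and padding is ignored.

```clojure
;; Hypothetical stand-ins; none of these are coffi's functions.
(def primitive-sizes {:int 4, :float 4, :double 8})

(defn size-of
  "Size of a literal descriptor such as
  [:struct [[:x :float] [:y :float]]]. Padding is ignored."
  [type]
  (if (keyword? type)
    (primitive-sizes type)
    (reduce + (map (fn [[_ t]] (size-of t)) (second type)))))

(defn align-of
  "Alignment of a descriptor: the largest field alignment."
  [type]
  (if (keyword? type)
    (primitive-sizes type)
    (reduce max (map (fn [[_ t]] (align-of t)) (second type)))))

(defn alloc
  "Stand-in allocator. Returns a zeroed byte array of the given
  size; the alignment argument is accepted but ignored here."
  [size alignment]
  (byte-array size))

(defmacro alloc-baked
  "Computes size and alignment once, while the macro expands, and
  emits a call that carries them as literal numbers."
  [type]
  (let [size  (size-of type)
        align (align-of type)]
    `(alloc ~size ~align)))

;; The expansion contains only constants, so no size or alignment
;; lookup happens when the generated code runs.
(println (rest (macroexpand-1
                '(alloc-baked [:struct [[:x :float] [:y :float] [:z :float]]]))))
;; prints: (12 4)
```

The trade-off in this sketch is that the descriptor must be fully known when the macro expands; a type only available at runtime would still need the dynamic path.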